All of lore.kernel.org
 help / color / mirror / Atom feed
* URL decoding changed semantics of + in URLs
@ 2010-07-23 13:18 Thomas Rast
  2010-07-23 13:21 ` Thomas Rast
                   ` (2 more replies)
  0 siblings, 3 replies; 18+ messages in thread
From: Thomas Rast @ 2010-07-23 13:18 UTC (permalink / raw)
  To: Jeff King; +Cc: git, jstpierre

Hi Jeff

As pointed out by Jasper St. Pierre on #git, it is no longer possible
to clone

  git://git.gnome.org/gtk+

because your 9d2e942 (decode file:// and ssh:// URLs, 2010-05-23)
decodes + characters in URLs to spaces in the http style.  It was
later fixed by ce83eda (url.c: "<scheme>://" part at the beginning
should not be URL decoded, 2010-06-23) but the later part of the url
still decodes + as space.

The tests that go along with the commit make it clear that it was an
intended change.  But the interesting thing is, I cannot find any
reference in any RFC that + must have this meaning.  In particular,

  http://www.ietf.org/rfc/rfc2396.txt

doesn't say much about + and the only escaping defined is the usual
%xx style.  So is there a standard that mandates this, or was it just
a well-meaning but unnecessary backwards incompatible change?

-- 
Thomas Rast
trast@{inf,student}.ethz.ch

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: URL decoding changed semantics of + in URLs
  2010-07-23 13:18 URL decoding changed semantics of + in URLs Thomas Rast
@ 2010-07-23 13:21 ` Thomas Rast
  2010-07-23 14:10 ` Ævar Arnfjörð Bjarmason
  2010-07-26 15:40 ` URL decoding changed semantics of + " Jeff King
  2 siblings, 0 replies; 18+ messages in thread
From: Thomas Rast @ 2010-07-23 13:21 UTC (permalink / raw)
  To: Jeff King; +Cc: git, jstpierre

Thomas Rast wrote:
> Hi Jeff
> 
> As pointed out by Jasper St. Pierre on #git, it is no longer possible
> to clone
> 
>   git://git.gnome.org/gtk+

Well, that reads wrong now that I'm seeing it again.

I of course meant to say that the above URL no longer works and must
be spelled

    git://git.gnome.org/gtk%2B

instead.

-- 
Thomas Rast
trast@{inf,student}.ethz.ch

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: URL decoding changed semantics of + in URLs
  2010-07-23 13:18 URL decoding changed semantics of + in URLs Thomas Rast
  2010-07-23 13:21 ` Thomas Rast
@ 2010-07-23 14:10 ` Ævar Arnfjörð Bjarmason
  2010-07-23 14:25   ` Jasper St. Pierre
  2010-07-23 21:23   ` [PATCH] Do not unquote + into ' ' " Thomas Rast
  2010-07-26 15:40 ` URL decoding changed semantics of + " Jeff King
  2 siblings, 2 replies; 18+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2010-07-23 14:10 UTC (permalink / raw)
  To: Thomas Rast; +Cc: Jeff King, git, jstpierre

On Fri, Jul 23, 2010 at 13:18, Thomas Rast <trast@student.ethz.ch> wrote:

> doesn't say much about + and the only escaping defined is the usual
> %xx style.  So is there a standard that mandates this, or was it just
> a well-meaning but unnecessary backwards incompatible change?

+ and %20 are as far as I know only interchangable in *query strings*,
so having to clone 'git://git.gnome.org/gtk%2B' where you could
previously clone 'git://git.gnome.org/gtk+' is a bug. Git shouldn't be
changing that + to a %20.

I haven't followed why we need to escape + to %20 at all, even in the
query string. E.g. curl(1) doesn't do that before sending requests to
Apache, which can handle either one. The + v.s. %20 duality is always
handled at the server AFAIK.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: URL decoding changed semantics of + in URLs
  2010-07-23 14:10 ` Ævar Arnfjörð Bjarmason
@ 2010-07-23 14:25   ` Jasper St. Pierre
  2010-07-23 21:23   ` [PATCH] Do not unquote + into ' ' " Thomas Rast
  1 sibling, 0 replies; 18+ messages in thread
From: Jasper St. Pierre @ 2010-07-23 14:25 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason; +Cc: Thomas Rast, Jeff King, git

Yep. http://www.ietf.org/rfc/rfc2396.txt defines '+' as a reserved character,
but doesn't give a purpose for it. www-form-encoded replaces space with '+'
but in a URL it can mean anything it wants.

On Fri, Jul 23, 2010 at 10:10 AM, Ævar Arnfjörð Bjarmason
<avarab@gmail.com> wrote:
> On Fri, Jul 23, 2010 at 13:18, Thomas Rast <trast@student.ethz.ch> wrote:
>
>> doesn't say much about + and the only escaping defined is the usual
>> %xx style.  So is there a standard that mandates this, or was it just
>> a well-meaning but unnecessary backwards incompatible change?
>
> + and %20 are as far as I know only interchangable in *query strings*,
> so having to clone 'git://git.gnome.org/gtk%2B' where you could
> previously clone 'git://git.gnome.org/gtk+' is a bug. Git shouldn't be
> changing that + to a %20.
>
> I haven't followed why we need to escape + to %20 at all, even in the
> query string. E.g. curl(1) doesn't do that before sending requests to
> Apache, which can handle either one. The + v.s. %20 duality is always
> handled at the server AFAIK.
>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH] Do not unquote + into ' ' in URLs
  2010-07-23 14:10 ` Ævar Arnfjörð Bjarmason
  2010-07-23 14:25   ` Jasper St. Pierre
@ 2010-07-23 21:23   ` Thomas Rast
  2010-07-23 22:20     ` Ævar Arnfjörð Bjarmason
  2010-07-23 22:26     ` Junio C Hamano
  1 sibling, 2 replies; 18+ messages in thread
From: Thomas Rast @ 2010-07-23 21:23 UTC (permalink / raw)
  To: Jeff King; +Cc: git, avarab, jstpierre

Since 9d2e942 (decode file:// and ssh:// URLs, 2010-05-23) the URL
logic unquotes escaped URLs.  For the %2B type of escape, this is
conformant with RFC 2396.  However, it also unquotes + into a space
character, which is only appropriate for the query strings in HTTP.
This notably broke fetching from the gtk+ repository.

Remove the corresponding bit of code.

Reported-by: Jasper St. Pierre <jstpierre@mecheye.net>
Signed-off-by: Thomas Rast <trast@student.ethz.ch>
---

Jasper St. Pierre wrote:
> Yep. http://www.ietf.org/rfc/rfc2396.txt defines '+' as a reserved character,
> but doesn't give a purpose for it. www-form-encoded replaces space with '+'
> but in a URL it can mean anything it wants.

So let's do this then, instead?

Based on the discussion, I would consider this a bugfix that should go
in 1.7.2.1.


 t/t5601-clone.sh |   10 ++++++++--
 url.c            |    5 +----
 2 files changed, 9 insertions(+), 6 deletions(-)

diff --git a/t/t5601-clone.sh b/t/t5601-clone.sh
index 8abb71a..4431dfd 100755
--- a/t/t5601-clone.sh
+++ b/t/t5601-clone.sh
@@ -178,8 +178,14 @@ test_expect_success 'clone respects global branch.autosetuprebase' '
 
 test_expect_success 'respect url-encoding of file://' '
 	git init x+y &&
-	test_must_fail git clone "file://$PWD/x+y" xy-url &&
-	git clone "file://$PWD/x%2By" xy-url
+	git clone "file://$PWD/x+y" xy-url-1 &&
+	git clone "file://$PWD/x%2By" xy-url-2
+'
+
+test_expect_success 'do not query-string-decode + in URLs' '
+	rm -rf x+y &&
+	git init "x y" &&
+	test_must_fail git clone "file://$PWD/x+y" xy-no-plus
 '
 
 test_expect_success 'do not respect url-encoding of non-url path' '
diff --git a/url.c b/url.c
index 2306236..fa4b8d4 100644
--- a/url.c
+++ b/url.c
@@ -90,10 +90,7 @@ static char *url_decode_internal(const char **query, const char *stop_at, struct
 			}
 		}
 
-		if (c == '+')
-			strbuf_addch(out, ' ');
-		else
-			strbuf_addch(out, c);
+		strbuf_addch(out, c);
 		q++;
 	} while (1);
 	*query = q;
-- 
1.7.2.rc3.335.g26d7d

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH] Do not unquote + into ' ' in URLs
  2010-07-23 21:23   ` [PATCH] Do not unquote + into ' ' " Thomas Rast
@ 2010-07-23 22:20     ` Ævar Arnfjörð Bjarmason
  2010-07-23 22:26     ` Junio C Hamano
  1 sibling, 0 replies; 18+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2010-07-23 22:20 UTC (permalink / raw)
  To: Thomas Rast; +Cc: Jeff King, git, jstpierre

On Fri, Jul 23, 2010 at 21:23, Thomas Rast <trast@student.ethz.ch> wrote:
> Since 9d2e942 (decode file:// and ssh:// URLs, 2010-05-23) the URL
> logic unquotes escaped URLs.  For the %2B type of escape, this is
> conformant with RFC 2396.  However, it also unquotes + into a space
> character, which is only appropriate for the query strings in HTTP.
> This notably broke fetching from the gtk+ repository.
>
> Remove the corresponding bit of code.
>
> Reported-by: Jasper St. Pierre <jstpierre@mecheye.net>
> Signed-off-by: Thomas Rast <trast@student.ethz.ch>
> ---
>
> Jasper St. Pierre wrote:
>> Yep. http://www.ietf.org/rfc/rfc2396.txt defines '+' as a reserved character,
>> but doesn't give a purpose for it. www-form-encoded replaces space with '+'
>> but in a URL it can mean anything it wants.
>
> So let's do this then, instead?
>
> Based on the discussion, I would consider this a bugfix that should go
> in 1.7.2.1.
>
>
>  t/t5601-clone.sh |   10 ++++++++--
>  url.c            |    5 +----
>  2 files changed, 9 insertions(+), 6 deletions(-)
>
> diff --git a/t/t5601-clone.sh b/t/t5601-clone.sh
> index 8abb71a..4431dfd 100755
> --- a/t/t5601-clone.sh
> +++ b/t/t5601-clone.sh
> @@ -178,8 +178,14 @@ test_expect_success 'clone respects global branch.autosetuprebase' '
>
>  test_expect_success 'respect url-encoding of file://' '
>        git init x+y &&
> -       test_must_fail git clone "file://$PWD/x+y" xy-url &&
> -       git clone "file://$PWD/x%2By" xy-url
> +       git clone "file://$PWD/x+y" xy-url-1 &&
> +       git clone "file://$PWD/x%2By" xy-url-2
> +'
> +
> +test_expect_success 'do not query-string-decode + in URLs' '
> +       rm -rf x+y &&
> +       git init "x y" &&
> +       test_must_fail git clone "file://$PWD/x+y" xy-no-plus
>  '
>
>  test_expect_success 'do not respect url-encoding of non-url path' '
> diff --git a/url.c b/url.c
> index 2306236..fa4b8d4 100644
> --- a/url.c
> +++ b/url.c
> @@ -90,10 +90,7 @@ static char *url_decode_internal(const char **query, const char *stop_at, struct
>                        }
>                }
>
> -               if (c == '+')
> -                       strbuf_addch(out, ' ');
> -               else
> -                       strbuf_addch(out, c);
> +               strbuf_addch(out, c);
>                q++;
>        } while (1);
>        *query = q;
> --
> 1.7.2.rc3.335.g26d7d
>

This looks good, Ack.

But as icing, it'd be nice to extend these tests to create files /
clone repositories with the rest of the reserved characters:
http://en.wikipedia.org/wiki/Percent-encoding#Percent-encoding_reserved_characters

Of course that would have to be wrapped in something that skips the
tests if those paths can't be created. E.g. "/" is a no-no on Unix,
and some of the others will probably cause troubles on other systems.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] Do not unquote + into ' ' in URLs
  2010-07-23 21:23   ` [PATCH] Do not unquote + into ' ' " Thomas Rast
  2010-07-23 22:20     ` Ævar Arnfjörð Bjarmason
@ 2010-07-23 22:26     ` Junio C Hamano
  2010-07-23 23:04       ` Thomas Rast
  1 sibling, 1 reply; 18+ messages in thread
From: Junio C Hamano @ 2010-07-23 22:26 UTC (permalink / raw)
  To: Thomas Rast; +Cc: Jeff King, git, avarab, jstpierre

Thomas Rast <trast@student.ethz.ch> writes:

> Since 9d2e942 (decode file:// and ssh:// URLs, 2010-05-23) the URL
> logic unquotes escaped URLs.  For the %2B type of escape, this is
> conformant with RFC 2396.  However, it also unquotes + into a space
> character, which is only appropriate for the query strings in HTTP.
> This notably broke fetching from the gtk+ repository.

Wait a minute.

> Based on the discussion, I would consider this a bugfix that should go
> in 1.7.2.1.

Some form of this may need to be applied to help the client side, but what
will happen to

  http-backend.c::get_info_refs()
   -> http-backend.c::get_parameter()
     -> http-backend.c::get_parameters()
       -> url.c::url_decode_parameter_value()
         -> url.c::url_decode_internal()

codepath, which is the server-side handing of query strings?

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] Do not unquote + into ' ' in URLs
  2010-07-23 22:26     ` Junio C Hamano
@ 2010-07-23 23:04       ` Thomas Rast
  2010-07-24 14:49         ` [PATCH v2] " Thomas Rast
  0 siblings, 1 reply; 18+ messages in thread
From: Thomas Rast @ 2010-07-23 23:04 UTC (permalink / raw)
  To: Junio C Hamano, Shawn O. Pearce; +Cc: Jeff King, git, avarab, jstpierre

Junio C Hamano wrote:
> Thomas Rast <trast@student.ethz.ch> writes:
> 
> > Since 9d2e942 (decode file:// and ssh:// URLs, 2010-05-23) the URL
> > logic unquotes escaped URLs.  For the %2B type of escape, this is
> > conformant with RFC 2396.  However, it also unquotes + into a space
> > character, which is only appropriate for the query strings in HTTP.
> > This notably broke fetching from the gtk+ repository.
> 
> Wait a minute.
> 
> > Based on the discussion, I would consider this a bugfix that should go
> > in 1.7.2.1.
> 
> Some form of this may need to be applied to help the client side, but what
> will happen to
> 
>   http-backend.c::get_info_refs()
>    -> http-backend.c::get_parameter()
>      -> http-backend.c::get_parameters()
>        -> url.c::url_decode_parameter_value()
>          -> url.c::url_decode_internal()
> 
> codepath, which is the server-side handing of query strings?

You're right, I forgot about those.  I imagine it would be one of two
cases:

* It should never have dequoted + before the ? query-delimiter, so we
  need to fix that too;

* It should never have dequoted + before the ? query-delimiter, but
  it's too late to change that now.

Shawn, can you help with this?

-- 
Thomas Rast
trast@{inf,student}.ethz.ch

^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH v2] Do not unquote + into ' ' in URLs
  2010-07-23 23:04       ` Thomas Rast
@ 2010-07-24 14:49         ` Thomas Rast
  2010-07-31 21:18           ` Jasper St. Pierre
  0 siblings, 1 reply; 18+ messages in thread
From: Thomas Rast @ 2010-07-24 14:49 UTC (permalink / raw)
  To: git; +Cc: Jeff King, avarab, jstpierre, Shawn O. Pearce, Junio C Hamano

Since 9d2e942 (decode file:// and ssh:// URLs, 2010-05-23) the URL
logic unquotes escaped URLs.  For the %2B type of escape, this is
conformant with RFC 2396.  However, it also unquotes + into a space
character, which is only appropriate for the query strings in HTTP.
This notably broke fetching from the gtk+ repository.

We cannot just remove the corresponding code since the same
url_decode_internal() is also used by the HTTP backend to decode query
parameters.  Introduce a new argument that controls whether the +
decoding happens, and use it only in the (client-side) url_decode().

Reported-by: Jasper St. Pierre <jstpierre@mecheye.net>
Signed-off-by: Thomas Rast <trast@student.ethz.ch>
---

I wrote:
> Junio C Hamano wrote:
> > 
> >   http-backend.c::get_info_refs()
> >    -> http-backend.c::get_parameter()
> >      -> http-backend.c::get_parameters()
> >        -> url.c::url_decode_parameter_value()
> >          -> url.c::url_decode_internal()
> 
> You're right, I forgot about those.  I imagine it would be one of two
> cases:
[...]
> Shawn, can you help with this?

The third case, of course, is:

* It only uses these functions for parameter decoding, which of course
  was correct to begin with.

So after hopefully drinking enough coffee, I made this one.  The catch
is that I'm not entirely clear whether *not* decoding the +
client-side anywhere in the URL is correct for http:// URLs?  If the
client decodes and re-encodes the URL, then the + would be turned into
a %2B on the re-encoding.  Then again maybe UI-facing URLs should
never have a query part at all?


 t/t5601-clone.sh |   10 ++++++++--
 url.c            |   11 ++++++-----
 2 files changed, 14 insertions(+), 7 deletions(-)

diff --git a/t/t5601-clone.sh b/t/t5601-clone.sh
index 8abb71a..4431dfd 100755
--- a/t/t5601-clone.sh
+++ b/t/t5601-clone.sh
@@ -178,8 +178,14 @@ test_expect_success 'clone respects global branch.autosetuprebase' '
 
 test_expect_success 'respect url-encoding of file://' '
 	git init x+y &&
-	test_must_fail git clone "file://$PWD/x+y" xy-url &&
-	git clone "file://$PWD/x%2By" xy-url
+	git clone "file://$PWD/x+y" xy-url-1 &&
+	git clone "file://$PWD/x%2By" xy-url-2
+'
+
+test_expect_success 'do not query-string-decode + in URLs' '
+	rm -rf x+y &&
+	git init "x y" &&
+	test_must_fail git clone "file://$PWD/x+y" xy-no-plus
 '
 
 test_expect_success 'do not respect url-encoding of non-url path' '
diff --git a/url.c b/url.c
index 2306236..cd8f74f 100644
--- a/url.c
+++ b/url.c
@@ -67,7 +67,8 @@ static int url_decode_char(const char *q)
 	return val;
 }
 
-static char *url_decode_internal(const char **query, const char *stop_at, struct strbuf *out)
+static char *url_decode_internal(const char **query, const char *stop_at,
+				 struct strbuf *out, int decode_plus)
 {
 	const char *q = *query;
 
@@ -90,7 +91,7 @@ static char *url_decode_internal(const char **query, const char *stop_at, struct
 			}
 		}
 
-		if (c == '+')
+		if (decode_plus && c == '+')
 			strbuf_addch(out, ' ');
 		else
 			strbuf_addch(out, c);
@@ -110,17 +111,17 @@ char *url_decode(const char *url)
 		strbuf_add(&out, url, colon - url);
 		url = colon;
 	}
-	return url_decode_internal(&url, NULL, &out);
+	return url_decode_internal(&url, NULL, &out, 0);
 }
 
 char *url_decode_parameter_name(const char **query)
 {
 	struct strbuf out = STRBUF_INIT;
-	return url_decode_internal(query, "&=", &out);
+	return url_decode_internal(query, "&=", &out, 1);
 }
 
 char *url_decode_parameter_value(const char **query)
 {
 	struct strbuf out = STRBUF_INIT;
-	return url_decode_internal(query, "&", &out);
+	return url_decode_internal(query, "&", &out, 1);
 }
-- 
1.7.2.278.g76edd.dirty

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: URL decoding changed semantics of + in URLs
  2010-07-23 13:18 URL decoding changed semantics of + in URLs Thomas Rast
  2010-07-23 13:21 ` Thomas Rast
  2010-07-23 14:10 ` Ævar Arnfjörð Bjarmason
@ 2010-07-26 15:40 ` Jeff King
  2010-07-26 17:57   ` Ævar Arnfjörð Bjarmason
  2 siblings, 1 reply; 18+ messages in thread
From: Jeff King @ 2010-07-26 15:40 UTC (permalink / raw)
  To: Thomas Rast; +Cc: git, jstpierre

On Fri, Jul 23, 2010 at 03:18:30PM +0200, Thomas Rast wrote:

> As pointed out by Jasper St. Pierre on #git, it is no longer possible
> to clone
> 
>   git://git.gnome.org/gtk+
> 
> because your 9d2e942 (decode file:// and ssh:// URLs, 2010-05-23)
> decodes + characters in URLs to spaces in the http style.  It was
> later fixed by ce83eda (url.c: "<scheme>://" part at the beginning
> should not be URL decoded, 2010-06-23) but the later part of the url
> still decodes + as space.
> 
> The tests that go along with the commit make it clear that it was an
> intended change.  But the interesting thing is, I cannot find any
> reference in any RFC that + must have this meaning.  In particular,
> 
>   http://www.ietf.org/rfc/rfc2396.txt
> 
> doesn't say much about + and the only escaping defined is the usual
> %xx style.  So is there a standard that mandates this, or was it just
> a well-meaning but unnecessary backwards incompatible change?

Sorry for the slow reply. I am in the middle of moving and just got a
usable machine running.

I think it is well-meaning but unnecessary. That is, I think %-decoding
is probably still a good idea, and I just dragged '+'-decoding along out
of cluelessness. I thought I cross-checked with what curl was doing to
http URLs, but now I can't even get it to do anything but pass the path
component literally to the webserver. So I'm really not sure what I
tested before.

As Jasper noted, "+" is a reserved character, which means "gtk+"
probably _should_ be escaped. But clearly it doesn't happen in practice,
and I am more interested in not breaking current use than in nitpicking
with a standard.

So I agree with the spirit of your patches, and reading over your v2, it
looks fine to me, but I didn't do any extensive testing.

-Peff

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: URL decoding changed semantics of + in URLs
  2010-07-26 15:40 ` URL decoding changed semantics of + " Jeff King
@ 2010-07-26 17:57   ` Ævar Arnfjörð Bjarmason
  2010-07-26 18:22     ` Jasper St. Pierre
  0 siblings, 1 reply; 18+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2010-07-26 17:57 UTC (permalink / raw)
  To: Jeff King; +Cc: Thomas Rast, git, jstpierre

On Mon, Jul 26, 2010 at 15:40, Jeff King <peff@peff.net> wrote:

> As Jasper noted, "+" is a reserved character, which means "gtk+"
> probably _should_ be escaped. But clearly it doesn't happen in practice,
> and I am more interested in not breaking current use than in nitpicking
> with a standard.

Reserved characters only have to be escaped in certain contexts, from
RFC 2396:

   Many URI include components consisting of or delimited by, certain
   special characters.  These characters are called "reserved", since
   their usage within the URI component is limited to their reserved
   purpose.  If the data for a URI component would conflict with the
   reserved purpose, then the conflicting data must be escaped before
   forming the URI.

      reserved    = ";" | "/" | "?" | ":" | "@" | "&" | "=" | "+" |
                    "$" | ","

E.g. @ is special in the hostname part (http://user@example.org), but
writing http://example.org/Git@Big%20companies:%20A%20Study is just
fine.

Which is why curl passes it along literally, it *can* escape them, and
real webservers like Apache handle reserved characters equivalently
(in their unreserved contexts) whether they're escaped or not, but the
git-daemon at git.gnome.org evidently doesn't implement RFC 2396
carefully enough.

So we shouldn't escape + for backwards compatibility and because it's
not necessary, but we should probably also fix git-daemon to accept
both forms if that hasn't been done already.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: URL decoding changed semantics of + in URLs
  2010-07-26 17:57   ` Ævar Arnfjörð Bjarmason
@ 2010-07-26 18:22     ` Jasper St. Pierre
  2010-07-26 18:30       ` Matthieu Moy
  2010-07-26 18:35       ` Ævar Arnfjörð Bjarmason
  0 siblings, 2 replies; 18+ messages in thread
From: Jasper St. Pierre @ 2010-07-26 18:22 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason; +Cc: Jeff King, Thomas Rast, git

Where is '+' used in the URL context? I don't see that it's a
replacement for '+'
aside from formencoded, which is from CGI, not HTTP or the URI spec.

I also can't access something called "test 2.txt" from Apache with
"http://localhost/test+2.txt", so I don't think it's unescaping the '+'.

I don't think we should do anything about the '+' case, except where used in
formencoded parameters (aka the "query string"), where it is used.

On Mon, Jul 26, 2010 at 1:57 PM, Ævar Arnfjörð Bjarmason
<avarab@gmail.com> wrote:
> On Mon, Jul 26, 2010 at 15:40, Jeff King <peff@peff.net> wrote:
>
>> As Jasper noted, "+" is a reserved character, which means "gtk+"
>> probably _should_ be escaped. But clearly it doesn't happen in practice,
>> and I am more interested in not breaking current use than in nitpicking
>> with a standard.
>
> Reserved characters only have to be escaped in certain contexts, from
> RFC 2396:
>
>   Many URI include components consisting of or delimited by, certain
>   special characters.  These characters are called "reserved", since
>   their usage within the URI component is limited to their reserved
>   purpose.  If the data for a URI component would conflict with the
>   reserved purpose, then the conflicting data must be escaped before
>   forming the URI.
>
>      reserved    = ";" | "/" | "?" | ":" | "@" | "&" | "=" | "+" |
>                    "$" | ","
>
> E.g. @ is special in the hostname part (http://user@example.org), but
> writing http://example.org/Git@Big%20companies:%20A%20Study is just
> fine.
>
> Which is why curl passes it along literally, it *can* escape them, and
> real webservers like Apache handle reserved characters equivalently
> (in their unreserved contexts) whether they're escaped or not, but the
> git-daemon at git.gnome.org evidently doesn't implement RFC 2396
> carefully enough.
>
> So we shouldn't escape + for backwards compatibility and because it's
> not necessary, but we should probably also fix git-daemon to accept
> both forms if that hasn't been done already.
>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: URL decoding changed semantics of + in URLs
  2010-07-26 18:22     ` Jasper St. Pierre
@ 2010-07-26 18:30       ` Matthieu Moy
  2010-07-26 18:35       ` Ævar Arnfjörð Bjarmason
  1 sibling, 0 replies; 18+ messages in thread
From: Matthieu Moy @ 2010-07-26 18:30 UTC (permalink / raw)
  To: Jasper St. Pierre
  Cc: Ævar Arnfjörð Bjarmason, Jeff King, Thomas Rast, git

"Jasper St. Pierre" <jstpierre@mecheye.net> writes:

> I don't think we should do anything about the '+' case, except where used in
> formencoded parameters (aka the "query string"), where it is used.

This is what the latest version of the patch does, AIUI, yes.

-- 
Matthieu Moy
http://www-verimag.imag.fr/~moy/

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: URL decoding changed semantics of + in URLs
  2010-07-26 18:22     ` Jasper St. Pierre
  2010-07-26 18:30       ` Matthieu Moy
@ 2010-07-26 18:35       ` Ævar Arnfjörð Bjarmason
  2010-07-26 18:44         ` Jasper St. Pierre
  1 sibling, 1 reply; 18+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2010-07-26 18:35 UTC (permalink / raw)
  To: Jasper St. Pierre; +Cc: Jeff King, Thomas Rast, git

On Mon, Jul 26, 2010 at 18:22, Jasper St. Pierre <jstpierre@mecheye.net> wrote:
> Where is '+' used in the URL context? I don't see that it's a
> replacement for '+'
> aside from formencoded, which is from CGI, not HTTP or the URI spec.

In my example? Nowhere, but I used : and @, which are also reserved
characters.

> I also can't access something called "test 2.txt" from Apache with
> "http://localhost/test+2.txt", so I don't think it's unescaping the '+'.

Yes, that's not supposed to work. But you should be able to access
"test+2.txt" using /test+2.txt and /test%2B.txt and
/%74%65%73%74%2B2%2E%74%78%74 for that matter.

git-daemon only seems to handle the first form. Which is probably a
bug, maybe it doesn't *have to* URI unescape its arguments, but it's
probably a good idea anyway. E.g. some systems that handle URIs will
convert : to %3A automatically when passing them through. That would
break a git URL.

> I don't think we should do anything about the '+' case, except where used in
> formencoded parameters (aka the "query string"), where it is used.

I don't really have an opinion on what we should do. It's not a
problem for me, I'm just noting how it could break, and that maybe we
should try harder and support URI escaping where we handle URLs.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: URL decoding changed semantics of + in URLs
  2010-07-26 18:35       ` Ævar Arnfjörð Bjarmason
@ 2010-07-26 18:44         ` Jasper St. Pierre
  0 siblings, 0 replies; 18+ messages in thread
From: Jasper St. Pierre @ 2010-07-26 18:44 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason; +Cc: Jeff King, Thomas Rast, git

Hm. I've always wondered about the percent double-encoding problem,
where the "%"
in "%2B" double-encodes to "%252B", which might require the mandatory server
implementation then.

I do think that because "git://git.gnome.org/git+" was working,
something on the server
was going right.

On Mon, Jul 26, 2010 at 2:35 PM, Ævar Arnfjörð Bjarmason
<avarab@gmail.com> wrote:
> On Mon, Jul 26, 2010 at 18:22, Jasper St. Pierre <jstpierre@mecheye.net> wrote:
>> Where is '+' used in the URL context? I don't see that it's a
>> replacement for '+'
>> aside from formencoded, which is from CGI, not HTTP or the URI spec.
>
> In my example? Nowhere, but I used : and @, which are also reserved
> characters.
>
>> I also can't access something called "test 2.txt" from Apache with
>> "http://localhost/test+2.txt", so I don't think it's unescaping the '+'.
>
> Yes, that's not supposed to work. But you should be able to access
> "test+2.txt" using /test+2.txt and /test%2B.txt and
> /%74%65%73%74%2B2%2E%74%78%74 for that matter.
>
> git-daemon only seems to handle the first form. Which is probably a
> bug, maybe it doesn't *have to* URI unescape its arguments, but it's
> probably a good idea anyway. E.g. some systems that handle URIs will
> convert : to %3A automatically when passing them through. That would
> break a git URL.
>
>> I don't think we should do anything about the '+' case, except where used in
>> formencoded parameters (aka the "query string"), where it is used.
>
> I don't really have an opinion on what we should do. It's not a
> problem for me, I'm just noting how it could break, and that maybe we
> should try harder and support URI escaping where we handle URLs.
>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2] Do not unquote + into ' ' in URLs
  2010-07-24 14:49         ` [PATCH v2] " Thomas Rast
@ 2010-07-31 21:18           ` Jasper St. Pierre
  2010-07-31 21:33             ` Thomas Rast
  0 siblings, 1 reply; 18+ messages in thread
From: Jasper St. Pierre @ 2010-07-31 21:18 UTC (permalink / raw)
  To: Thomas Rast; +Cc: git, Jeff King, avarab, Shawn O. Pearce, Junio C Hamano

Hmm.. could we get a follow-up on this patch? It's annoying because it's both
backwards and forwards incompatible: the %2b workaround doesn't work in
older versions.

On Sat, Jul 24, 2010 at 10:49 AM, Thomas Rast <trast@student.ethz.ch> wrote:
> Since 9d2e942 (decode file:// and ssh:// URLs, 2010-05-23) the URL
> logic unquotes escaped URLs.  For the %2B type of escape, this is
> conformant with RFC 2396.  However, it also unquotes + into a space
> character, which is only appropriate for the query strings in HTTP.
> This notably broke fetching from the gtk+ repository.
>
> We cannot just remove the corresponding code since the same
> url_decode_internal() is also used by the HTTP backend to decode query
> parameters.  Introduce a new argument that controls whether the +
> decoding happens, and use it only in the (client-side) url_decode().
>
> Reported-by: Jasper St. Pierre <jstpierre@mecheye.net>
> Signed-off-by: Thomas Rast <trast@student.ethz.ch>
> ---
>
> I wrote:
>> Junio C Hamano wrote:
>> >
>> >   http-backend.c::get_info_refs()
>> >    -> http-backend.c::get_parameter()
>> >      -> http-backend.c::get_parameters()
>> >        -> url.c::url_decode_parameter_value()
>> >          -> url.c::url_decode_internal()
>>
>> You're right, I forgot about those.  I imagine it would be one of two
>> cases:
> [...]
>> Shawn, can you help with this?
>
> The third case, of course, is:
>
> * It only uses these functions for parameter decoding, which of course
>  was correct to begin with.
>
> So after hopefully drinking enough coffee, I made this one.  The catch
> is that I'm not entirely clear whether *not* decoding the +
> client-side anywhere in the URL is correct for http:// URLs?  If the
> client decodes and re-encodes the URL, then the + would be turned into
> a %2B on the re-encoding.  Then again maybe UI-facing URLs should
> never have a query part at all?
>
>
>  t/t5601-clone.sh |   10 ++++++++--
>  url.c            |   11 ++++++-----
>  2 files changed, 14 insertions(+), 7 deletions(-)
>
> diff --git a/t/t5601-clone.sh b/t/t5601-clone.sh
> index 8abb71a..4431dfd 100755
> --- a/t/t5601-clone.sh
> +++ b/t/t5601-clone.sh
> @@ -178,8 +178,14 @@ test_expect_success 'clone respects global branch.autosetuprebase' '
>
>  test_expect_success 'respect url-encoding of file://' '
>        git init x+y &&
> -       test_must_fail git clone "file://$PWD/x+y" xy-url &&
> -       git clone "file://$PWD/x%2By" xy-url
> +       git clone "file://$PWD/x+y" xy-url-1 &&
> +       git clone "file://$PWD/x%2By" xy-url-2
> +'
> +
> +test_expect_success 'do not query-string-decode + in URLs' '
> +       rm -rf x+y &&
> +       git init "x y" &&
> +       test_must_fail git clone "file://$PWD/x+y" xy-no-plus
>  '
>
>  test_expect_success 'do not respect url-encoding of non-url path' '
> diff --git a/url.c b/url.c
> index 2306236..cd8f74f 100644
> --- a/url.c
> +++ b/url.c
> @@ -67,7 +67,8 @@ static int url_decode_char(const char *q)
>        return val;
>  }
>
> -static char *url_decode_internal(const char **query, const char *stop_at, struct strbuf *out)
> +static char *url_decode_internal(const char **query, const char *stop_at,
> +                                struct strbuf *out, int decode_plus)
>  {
>        const char *q = *query;
>
> @@ -90,7 +91,7 @@ static char *url_decode_internal(const char **query, const char *stop_at, struct
>                        }
>                }
>
> -               if (c == '+')
> +               if (decode_plus && c == '+')
>                        strbuf_addch(out, ' ');
>                else
>                        strbuf_addch(out, c);
> @@ -110,17 +111,17 @@ char *url_decode(const char *url)
>                strbuf_add(&out, url, colon - url);
>                url = colon;
>        }
> -       return url_decode_internal(&url, NULL, &out);
> +       return url_decode_internal(&url, NULL, &out, 0);
>  }
>
>  char *url_decode_parameter_name(const char **query)
>  {
>        struct strbuf out = STRBUF_INIT;
> -       return url_decode_internal(query, "&=", &out);
> +       return url_decode_internal(query, "&=", &out, 1);
>  }
>
>  char *url_decode_parameter_value(const char **query)
>  {
>        struct strbuf out = STRBUF_INIT;
> -       return url_decode_internal(query, "&", &out);
> +       return url_decode_internal(query, "&", &out, 1);
>  }
> --
> 1.7.2.278.g76edd.dirty
>
>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2] Do not unquote + into ' ' in URLs
  2010-07-31 21:18           ` Jasper St. Pierre
@ 2010-07-31 21:33             ` Thomas Rast
  2010-08-06 10:46               ` Ralf Ebert
  0 siblings, 1 reply; 18+ messages in thread
From: Thomas Rast @ 2010-07-31 21:33 UTC (permalink / raw)
  To: Jasper St. Pierre; +Cc: git, Jeff King, avarab, Shawn O. Pearce, Junio C Hamano

Jasper St. Pierre wrote:
> Hmm.. could we get a follow-up on this patch? It's annoying because it's both
> backwards and forwards incompatible: the %2b workaround doesn't work in
> older versions.

I was hoping for an Ack from someone who knows the whole HTTP
business, because I don't feel confident enough to say it won't break
anything there.

To be precise, if the client ever attempts to decode URLs with query
strings (for whatever reasons) then it would break.  Probably there
are no such URLs, but I don't know for sure.

-- 
Thomas Rast
trast@{inf,student}.ethz.ch

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2] Do not unquote + into ' ' in URLs
  2010-07-31 21:33             ` Thomas Rast
@ 2010-08-06 10:46               ` Ralf Ebert
  0 siblings, 0 replies; 18+ messages in thread
From: Ralf Ebert @ 2010-08-06 10:46 UTC (permalink / raw)
  To: git

Hi,

> To be precise, if the client ever attempts to decode URLs with query
> strings (for whatever reasons) then it would break.  Probably there
> are no such URLs, but I don't know for sure.

url_decode seems to be called only from connect.c to support 
percent-encoding for git/ssh-URLs, so at least as of now, there should 
be no problem here. Since applying the patch, git clone 
'git://git.gnome.org/gtk+' works ok again for me.

Greetings,

Ralf

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2010-08-06 10:50 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-07-23 13:18 URL decoding changed semantics of + in URLs Thomas Rast
2010-07-23 13:21 ` Thomas Rast
2010-07-23 14:10 ` Ævar Arnfjörð Bjarmason
2010-07-23 14:25   ` Jasper St. Pierre
2010-07-23 21:23   ` [PATCH] Do not unquote + into ' ' " Thomas Rast
2010-07-23 22:20     ` Ævar Arnfjörð Bjarmason
2010-07-23 22:26     ` Junio C Hamano
2010-07-23 23:04       ` Thomas Rast
2010-07-24 14:49         ` [PATCH v2] " Thomas Rast
2010-07-31 21:18           ` Jasper St. Pierre
2010-07-31 21:33             ` Thomas Rast
2010-08-06 10:46               ` Ralf Ebert
2010-07-26 15:40 ` URL decoding changed semantics of + " Jeff King
2010-07-26 17:57   ` Ævar Arnfjörð Bjarmason
2010-07-26 18:22     ` Jasper St. Pierre
2010-07-26 18:30       ` Matthieu Moy
2010-07-26 18:35       ` Ævar Arnfjörð Bjarmason
2010-07-26 18:44         ` Jasper St. Pierre

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.