All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 1/2] git-am: fixed patch_format detection according to RFC2822
@ 2009-09-25 15:14 Christian Himpel
  2009-09-25 15:17 ` [PATCH 2/2] git-am: force egrep to use correct characters set Christian Himpel
  0 siblings, 1 reply; 11+ messages in thread
From: Christian Himpel @ 2009-09-25 15:14 UTC (permalink / raw)
  To: git

RFC2822 specifies in paragraph 3.6.8, that optional header fields are
made up of any printable US-ASCII character except ' ' (space) and ':'
(colon).

The pattern for the egrep command is changed to match all of these
characters.

Signed-off-by: Christian Himpel <chressie@gmail.com>
---

Hi,

I had a problem with applying a patch with 'git am', because the one
header fields in the patch-e-mail contained numbers.  So I read the
RFC2822 what they say about header fields.  Unbelievable but any
printable character except space and colon are allowed in header fields.
So I changed the egrep expression according to this rule.

Regards,
chressie


 git-am.sh |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/git-am.sh b/git-am.sh
index 26ffe70..0ddd80f 100755
--- a/git-am.sh
+++ b/git-am.sh
@@ -205,7 +205,7 @@ check_patch_format () {
 			# and see if it looks like that they all begin with the
 			# header field names...
 			sed -n -e '/^$/q' -e '/^[ 	]/d' -e p "$1" |
-			egrep -v '^[A-Za-z]+(-[A-Za-z]+)*:' >/dev/null ||
+			egrep -v '^[!-9;-~]+:' >/dev/null ||
 			patch_format=mbox
 		fi
 	} < "$1" || clean_abort
-- 
1.6.4.4

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [PATCH 2/2] git-am: force egrep to use correct characters set
  2009-09-25 15:14 [PATCH 1/2] git-am: fixed patch_format detection according to RFC2822 Christian Himpel
@ 2009-09-25 15:17 ` Christian Himpel
  2009-09-25 15:45   ` [PATCH 2/2 v2] " Christian Himpel
  0 siblings, 1 reply; 11+ messages in thread
From: Christian Himpel @ 2009-09-25 15:17 UTC (permalink / raw)
  To: git

According to egrep(1) the US-ASCII table is used when LC_ALL=C is set.
We do not rely here on the LC_ALL value we get from the environment.

Signed-off-by: Christian Himpel <chressie@gmail.com>
---

I don't know if this kind of patch is desired, but according to egrep(1)
it's not reliable to use the range expression with different character
sets than US-ASCII.

So this patch forces the usage of US-ASCII.

Regards,
chressie


 git-am.sh |    4 ++++
 1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/git-am.sh b/git-am.sh
index 0ddd80f..e4dd49a 100755
--- a/git-am.sh
+++ b/git-am.sh
@@ -204,9 +204,13 @@ check_patch_format () {
 			# discarding the indented remainder of folded lines,
 			# and see if it looks like that they all begin with the
 			# header field names...
+			_tmp_locale=$LC_ALL
+			export LC_ALL=C
 			sed -n -e '/^$/q' -e '/^[ 	]/d' -e p "$1" |
 			egrep -v '^[!-9;-~]+:' >/dev/null ||
 			patch_format=mbox
+			export LC_ALL=$_tmp_locale
+			unset $_tmp_locale
 		fi
 	} < "$1" || clean_abort
 }
-- 
1.6.4.4

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [PATCH 2/2 v2] git-am: force egrep to use correct characters set
  2009-09-25 15:17 ` [PATCH 2/2] git-am: force egrep to use correct characters set Christian Himpel
@ 2009-09-25 15:45   ` Christian Himpel
  2009-09-25 16:43     ` [PATCH] " Christian Himpel
  0 siblings, 1 reply; 11+ messages in thread
From: Christian Himpel @ 2009-09-25 15:45 UTC (permalink / raw)
  To: git

According to egrep(1) the US-ASCII table is used when LC_ALL=C is set.
We do not rely here on the LC_ALL value we get from the environment.
---

This is probably a saner approach :)

 git-am.sh |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/git-am.sh b/git-am.sh
index 0ddd80f..c132f50 100755
--- a/git-am.sh
+++ b/git-am.sh
@@ -205,7 +205,7 @@ check_patch_format () {
 			# and see if it looks like that they all begin with the
 			# header field names...
 			sed -n -e '/^$/q' -e '/^[ 	]/d' -e p "$1" |
-			egrep -v '^[!-9;-~]+:' >/dev/null ||
+			LC_ALL=C egrep -v '^[!-9;-~]+:' >/dev/null ||
 			patch_format=mbox
 		fi
 	} < "$1" || clean_abort
-- 
1.6.4.4

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [PATCH] git-am: force egrep to use correct characters set
  2009-09-25 15:45   ` [PATCH 2/2 v2] " Christian Himpel
@ 2009-09-25 16:43     ` Christian Himpel
  2009-09-27  7:40       ` Jeff King
  0 siblings, 1 reply; 11+ messages in thread
From: Christian Himpel @ 2009-09-25 16:43 UTC (permalink / raw)
  To: git

According to egrep(1) the US-ASCII table is used when LC_ALL=C is set.
We do not rely here on the LC_ALL value we get from the environment.

Signed-off-by: Christian Himpel <chressie@gmail.com>
---

sorry for being so noisy, but forgot to signoff last time

 git-am.sh |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/git-am.sh b/git-am.sh
index 0ddd80f..c132f50 100755
--- a/git-am.sh
+++ b/git-am.sh
@@ -205,7 +205,7 @@ check_patch_format () {
 			# and see if it looks like that they all begin with the
 			# header field names...
 			sed -n -e '/^$/q' -e '/^[ 	]/d' -e p "$1" |
-			egrep -v '^[!-9;-~]+:' >/dev/null ||
+			LC_ALL=C egrep -v '^[!-9;-~]+:' >/dev/null ||
 			patch_format=mbox
 		fi
 	} < "$1" || clean_abort
-- 
1.6.4.4

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [PATCH] git-am: force egrep to use correct characters set
  2009-09-25 16:43     ` [PATCH] " Christian Himpel
@ 2009-09-27  7:40       ` Jeff King
  2009-09-28  6:55         ` Christian Himpel
  2009-09-28  8:12         ` Johannes Sixt
  0 siblings, 2 replies; 11+ messages in thread
From: Jeff King @ 2009-09-27  7:40 UTC (permalink / raw)
  To: Christian Himpel; +Cc: git

On Fri, Sep 25, 2009 at 06:43:20PM +0200, Christian Himpel wrote:

> According to egrep(1) the US-ASCII table is used when LC_ALL=C is set.
> We do not rely here on the LC_ALL value we get from the environment.

Hmm. Probably makes sense here, as it is a wide enough range that it may
pick up other stray non-ascii characters in other charsets (though as
the manpage notes, the likely thing is to pick up A-Z along with a-z,
which is OK here as we encompass both in our range).

There are two other calls to egrep with brackets (both in
git-submodule.sh), but they are just [0-7], which is presumably OK in
just about any charset.

Do you happen to know a charset in which this is a problem, just for
reference?

-Peff

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] git-am: force egrep to use correct characters set
  2009-09-27  7:40       ` Jeff King
@ 2009-09-28  6:55         ` Christian Himpel
  2009-09-28  7:16           ` Jeff King
  2009-09-28  8:12         ` Johannes Sixt
  1 sibling, 1 reply; 11+ messages in thread
From: Christian Himpel @ 2009-09-28  6:55 UTC (permalink / raw)
  To: Jeff King; +Cc: Christian Himpel, git

On Sun, Sep 27, 2009 at 03:40:15AM -0400, Jeff King wrote:
> On Fri, Sep 25, 2009 at 06:43:20PM +0200, Christian Himpel wrote:
> 
> > According to egrep(1) the US-ASCII table is used when LC_ALL=C is set.
> > We do not rely here on the LC_ALL value we get from the environment.
> 
> Hmm. Probably makes sense here, as it is a wide enough range that it may
> pick up other stray non-ascii characters in other charsets (though as
> the manpage notes, the likely thing is to pick up A-Z along with a-z,
> which is OK here as we encompass both in our range).
> 
> There are two other calls to egrep with brackets (both in
> git-submodule.sh), but they are just [0-7], which is presumably OK in
> just about any charset.
> 
> Do you happen to know a charset in which this is a problem, just for
> reference?

No, I don't know any charset with stray ascii-chars.  I just listened
attentively, when I read the part about the mixed alphabet characters in
the grep(1) manpage.

I did some quick checks just now.  It seems the characters (' ' to '~')
are in any locale, offered by glibc, at the same place.

Maybe, we can just leave the charset as it is and ignore this patch,
until someone complains.


Regards,
chressie

> 
> -Peff

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] git-am: force egrep to use correct characters set
  2009-09-28  6:55         ` Christian Himpel
@ 2009-09-28  7:16           ` Jeff King
  0 siblings, 0 replies; 11+ messages in thread
From: Jeff King @ 2009-09-28  7:16 UTC (permalink / raw)
  To: Christian Himpel; +Cc: git

On Mon, Sep 28, 2009 at 08:55:19AM +0200, Christian Himpel wrote:

> > Do you happen to know a charset in which this is a problem, just for
> > reference?
> 
> No, I don't know any charset with stray ascii-chars.  I just listened
> attentively, when I read the part about the mixed alphabet characters in
> the grep(1) manpage.
> 
> I did some quick checks just now.  It seems the characters (' ' to '~')
> are in any locale, offered by glibc, at the same place.
> 
> Maybe, we can just leave the charset as it is and ignore this patch,
> until someone complains.

Thanks for looking into it. My question was more of a "how bad is this,
and should we be fixing these other callsites, too". But I think 0-7 is
probably a pretty safe range in any charset.

Usually with portability issues, I am inclined to say "wait for a
problem", as we try to code to match actual reality instead of
documentation or standards. But in this case, we are unlikely to test
with strange charsets (or even know which strange charsets exist;
checking what's in glibc is reasonable, but I have no idea what else is
out there in other countries), and that the resulting bug would be
subtle and hard to find, it probably makes sense to be a little
defensive.

-Peff

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] git-am: force egrep to use correct characters set
  2009-09-27  7:40       ` Jeff King
  2009-09-28  6:55         ` Christian Himpel
@ 2009-09-28  8:12         ` Johannes Sixt
  2009-09-28  9:32           ` Christian Himpel
  1 sibling, 1 reply; 11+ messages in thread
From: Johannes Sixt @ 2009-09-28  8:12 UTC (permalink / raw)
  To: Jeff King; +Cc: Christian Himpel, git

Jeff King schrieb:
> On Fri, Sep 25, 2009 at 06:43:20PM +0200, Christian Himpel wrote:
> 
>> According to egrep(1) the US-ASCII table is used when LC_ALL=C is set.
>> We do not rely here on the LC_ALL value we get from the environment.
> 
> Hmm. Probably makes sense here, as it is a wide enough range that it may
> pick up other stray non-ascii characters in other charsets (though as
> the manpage notes, the likely thing is to pick up A-Z along with a-z,
> which is OK here as we encompass both in our range).
> 
> There are two other calls to egrep with brackets (both in
> git-submodule.sh), but they are just [0-7], which is presumably OK in
> just about any charset.
> 
> Do you happen to know a charset in which this is a problem, just for
> reference?

It's not so much about charsets than about languages:

       Within a bracket expression, a range expression consists
       of two characters separated by a hyphen.  It matches any
       single character that sorts between the two  characters,
       inclusive,  using  the  locale's  collating sequence and
       character set.  For example, in the  default  C  locale,
       [a-d]  is equivalent to [abcd].  Many locales sort char-
       acters in dictionary order, and in these  locales  [a-d]
       is  typically  not  equivalent  to  [abcd];  it might be
       equivalent to [aBbCcDd], for  example.   To  obtain  the
       traditional  interpretation  of bracket expressions, you
       can use the C locale by setting the  LC_ALL  environment
       variable to the value C.

For example, in locale de_DE.UTF-8, GNU grep '[a-z]' matches lowercase
letters, uppercase letters (!), and umlauts (!!) because in dictionary
order, 'A' and 'a' are equivalent and 'Ä' sorts after 'A'. (The input must
be UTF-8, of course.)

Given that this applies not only to egrep, but to grep in general (and
perhaps even to other tools that support ranges, like sed), it may be
necessary to audit all range expressions.

The case identified by Christian is certainly important because it is
applied to a file whose contents can be anything, and the purpose of the
check is to identify the text as an mbox file, whose header section can be
only US-ASCII by definition. So, I think it has merit to apply the patch.

-- Hannes

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] git-am: force egrep to use correct characters set
  2009-09-28  8:12         ` Johannes Sixt
@ 2009-09-28  9:32           ` Christian Himpel
  2009-09-28  9:53             ` Johannes Sixt
  0 siblings, 1 reply; 11+ messages in thread
From: Christian Himpel @ 2009-09-28  9:32 UTC (permalink / raw)
  To: Johannes Sixt; +Cc: Jeff King, Christian Himpel, git

On Mon, Sep 28, 2009 at 10:12:47AM +0200, Johannes Sixt wrote:
> Jeff King schrieb:
> > On Fri, Sep 25, 2009 at 06:43:20PM +0200, Christian Himpel wrote:
> > 
> >> According to egrep(1) the US-ASCII table is used when LC_ALL=C is set.
> >> We do not rely here on the LC_ALL value we get from the environment.
> > 
> > Hmm. Probably makes sense here, as it is a wide enough range that it may
> > pick up other stray non-ascii characters in other charsets (though as
> > the manpage notes, the likely thing is to pick up A-Z along with a-z,
> > which is OK here as we encompass both in our range).
> > 
> > There are two other calls to egrep with brackets (both in
> > git-submodule.sh), but they are just [0-7], which is presumably OK in
> > just about any charset.
> > 
> > Do you happen to know a charset in which this is a problem, just for
> > reference?
> 
> It's not so much about charsets than about languages:
> 
>        Within a bracket expression, a range expression consists
>        of two characters separated by a hyphen.  It matches any
>        single character that sorts between the two  characters,
>        inclusive,  using  the  locale's  collating sequence and
>        character set.  For example, in the  default  C  locale,
>        [a-d]  is equivalent to [abcd].  Many locales sort char-
>        acters in dictionary order, and in these  locales  [a-d]
>        is  typically  not  equivalent  to  [abcd];  it might be
>        equivalent to [aBbCcDd], for  example.   To  obtain  the
>        traditional  interpretation  of bracket expressions, you
>        can use the C locale by setting the  LC_ALL  environment
>        variable to the value C.
> 
> For example, in locale de_DE.UTF-8, GNU grep '[a-z]' matches lowercase
> letters, uppercase letters (!), and umlauts (!!) because in dictionary
> order, 'A' and 'a' are equivalent and 'Ä' sorts after 'A'. (The input must
> be UTF-8, of course.)

Thanks for pointing this out.  You are right.  I must have read the
"dictonary order" part over.

> Given that this applies not only to egrep, but to grep in general (and
> perhaps even to other tools that support ranges, like sed), it may be
> necessary to audit all range expressions.

After doing a quick:

LC_ALL=C find . -name '*.sh' -exec \
         egrep -Hne '(grep|awk|sed).*\[.*-.*\]' {} \;

As far as I can see, range expressions are used:

1. to replace or grep hexadecimal numbers (SHA1 sums).  This shouldn't
be a problem, if we can assume that these numbers are never malformed.

2. to replace or grep numbers (with digits).  This shouldn't be a
problem, since digits should be in dictionary order in every language
(?!).

3. in git-rebase--interactive.sh:742 to grep for a previously generated
string.  So it should be safe here.

> The case identified by Christian is certainly important because it is
> applied to a file whose contents can be anything, and the purpose of the
> check is to identify the text as an mbox file, whose header section can be
> only US-ASCII by definition. So, I think it has merit to apply the patch.

Yes.  It seems that this is the only place where it is important to match
just the ASCII printable characters.


Regards,
chressie

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] git-am: force egrep to use correct characters set
  2009-09-28  9:32           ` Christian Himpel
@ 2009-09-28  9:53             ` Johannes Sixt
  2009-09-28 12:09               ` Christian Himpel
  0 siblings, 1 reply; 11+ messages in thread
From: Johannes Sixt @ 2009-09-28  9:53 UTC (permalink / raw)
  To: Christian Himpel; +Cc: Jeff King, git

Christian Himpel schrieb:
> After doing a quick:
> 
> LC_ALL=C find . -name '*.sh' -exec \
>          egrep -Hne '(grep|awk|sed).*\[.*-.*\]' {} \;
> 
> As far as I can see, range expressions are used:

0. The test suite scripts are not critical because test-lib.sh sets LANG=C
LC_ALL=C

> 1. to replace or grep hexadecimal numbers (SHA1 sums).  This shouldn't
> be a problem, if we can assume that these numbers are never malformed.

The assumption is valid if the input is piped from a git command, and such
uses aren't critical.

> 2. to replace or grep numbers (with digits).  This shouldn't be a
> problem, since digits should be in dictionary order in every language
> (?!).

I agree.

> 3. in git-rebase--interactive.sh:742 to grep for a previously generated
> string.  So it should be safe here.

I agree.

>> The case identified by Christian is certainly important because it is
>> applied to a file whose contents can be anything, and the purpose of the
>> check is to identify the text as an mbox file, whose header section can be
>> only US-ASCII by definition. So, I think it has merit to apply the patch.
> 
> Yes.  It seems that this is the only place where it is important to match
> just the ASCII printable characters.

There is another place in git-am.sh where a sed expression with a range
looks at the input file, doesn't it? Isn't it critical, too?

		if test -f "$dotest/rebasing" &&
			commit=$(sed -e 's/^From \([0-9a-f]*\) .*/\1/' \
				-e q "$dotest/$msgnum") &&
			test "$(git cat-file -t "$commit")" = commit
		then ...

-- Hannes

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] git-am: force egrep to use correct characters set
  2009-09-28  9:53             ` Johannes Sixt
@ 2009-09-28 12:09               ` Christian Himpel
  0 siblings, 0 replies; 11+ messages in thread
From: Christian Himpel @ 2009-09-28 12:09 UTC (permalink / raw)
  To: Johannes Sixt; +Cc: Christian Himpel, Jeff King, git

On Mon, Sep 28, 2009 at 11:53:02AM +0200, Johannes Sixt wrote:
> Christian Himpel schrieb:
> >> The case identified by Christian is certainly important because it is
> >> applied to a file whose contents can be anything, and the purpose of the
> >> check is to identify the text as an mbox file, whose header section can be
> >> only US-ASCII by definition. So, I think it has merit to apply the patch.
> > 
> > Yes.  It seems that this is the only place where it is important to match
> > just the ASCII printable characters.
> 
> There is another place in git-am.sh where a sed expression with a range
> looks at the input file, doesn't it? Isn't it critical, too?
> 
> 		if test -f "$dotest/rebasing" &&
> 			commit=$(sed -e 's/^From \([0-9a-f]*\) .*/\1/' \
> 				-e q "$dotest/$msgnum") &&
> 			test "$(git cat-file -t "$commit")" = commit
> 		then ...

It seems to be the line generated from 'git format-patch' that is tested
here.  It specifies the SHA1 of the commit that is converted into a mbox
patch by 'format-patch'.  Hence, I don't see it critical here (until
someone edits this line by hand).


Regards,
chressie

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2009-09-28 12:10 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-09-25 15:14 [PATCH 1/2] git-am: fixed patch_format detection according to RFC2822 Christian Himpel
2009-09-25 15:17 ` [PATCH 2/2] git-am: force egrep to use correct characters set Christian Himpel
2009-09-25 15:45   ` [PATCH 2/2 v2] " Christian Himpel
2009-09-25 16:43     ` [PATCH] " Christian Himpel
2009-09-27  7:40       ` Jeff King
2009-09-28  6:55         ` Christian Himpel
2009-09-28  7:16           ` Jeff King
2009-09-28  8:12         ` Johannes Sixt
2009-09-28  9:32           ` Christian Himpel
2009-09-28  9:53             ` Johannes Sixt
2009-09-28 12:09               ` Christian Himpel

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.