git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* Usage of isspace and friends
@ 2005-10-12  1:40 Morten Welinder
  2005-10-13  6:49 ` Junio C Hamano
  0 siblings, 1 reply; 10+ messages in thread
From: Morten Welinder @ 2005-10-12  1:40 UTC (permalink / raw)
  To: GIT Mailing List

Someone needs to audit the usage of isspace, tolower, and friends.  There are
things like this in the code:

static int is_dev_null(const char *str)
{
	return !memcmp("/dev/null", str, 9) && isspace(str[9]);
}

Since str[9] is of type char it should not be used as a argument to
isspace directly,
but rather be cast to unsigned char:

    ... isspace((unsigned char)str[9]);

Admittedly that is ugly.  Blame K&R.  (Glibc has a partial workaround for this
kind of coding bug.  On the up side you won't get a crash, but on the down
side you can get the wrong result.)

Morten

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Usage of isspace and friends
  2005-10-12  1:40 Usage of isspace and friends Morten Welinder
@ 2005-10-13  6:49 ` Junio C Hamano
  2005-10-13  8:29   ` Antti-Juhani Kaijanaho
                     ` (2 more replies)
  0 siblings, 3 replies; 10+ messages in thread
From: Junio C Hamano @ 2005-10-13  6:49 UTC (permalink / raw)
  To: Morten Welinder; +Cc: git

Morten Welinder <mwelinder@gmail.com> writes:

> Someone needs to audit the usage of isspace, tolower, and
> friends.  There are things like this in the code:
>
> static int is_dev_null(const char *str)
> {
> 	return !memcmp("/dev/null", str, 9) && isspace(str[9]);
> }
>
> Since str[9] is of type char it should not be used as a argument to
> isspace directly,
> but rather be cast to unsigned char:
>
>     ... isspace((unsigned char)str[9]);

Huh?  isspace is "int isspace(int)".  Presumably standard
integral promotion rules applies here whether char is signed or
unsigned, doesn't it?

The snippet you quoted is from apply.c, and I would say what is
more problematic is that we do not force C locale while parsing
the diff (see another thread -- we would want to process diffs
as byte streams).

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Usage of isspace and friends
  2005-10-13  6:49 ` Junio C Hamano
@ 2005-10-13  8:29   ` Antti-Juhani Kaijanaho
  2005-10-13 13:27   ` H. Peter Anvin
  2005-10-13 15:04   ` Linus Torvalds
  2 siblings, 0 replies; 10+ messages in thread
From: Antti-Juhani Kaijanaho @ 2005-10-13  8:29 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Morten Welinder, git

Junio C Hamano wrote:
> Morten Welinder <mwelinder@gmail.com> writes:
>>Since str[9] is of type char it should not be used as a argument to
>>isspace directly,
>>but rather be cast to unsigned char:
>>
>>    ... isspace((unsigned char)str[9]);
> 
> 
> Huh?  isspace is "int isspace(int)".  Presumably standard
> integral promotion rules applies here whether char is signed or
> unsigned, doesn't it?

Of course, but that's not the issue.  isspace treats its parameter as if
it had been converted from unsigned char to int.  If char is signed,
ïsspace will mistreat those characters that have a negative value.
Then again, I don't think a space character, one that the C locale
regards as such, anyway, wiill ever have a negative value, so the issue
is rather academic.

> The snippet you quoted is from apply.c, and I would say what is
> more problematic is that we do not force C locale while parsing
> the diff

Quite true.  One reason I tend to avoid the standard is* functions in my
own code.

-- 
Antti-Juhani

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Usage of isspace and friends
  2005-10-13  6:49 ` Junio C Hamano
  2005-10-13  8:29   ` Antti-Juhani Kaijanaho
@ 2005-10-13 13:27   ` H. Peter Anvin
  2005-10-13 15:44     ` Junio C Hamano
  2005-10-13 15:04   ` Linus Torvalds
  2 siblings, 1 reply; 10+ messages in thread
From: H. Peter Anvin @ 2005-10-13 13:27 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Morten Welinder, git

Junio C Hamano wrote:
> 
> Huh?  isspace is "int isspace(int)".  Presumably standard
> integral promotion rules applies here whether char is signed or
> unsigned, doesn't it?
> 
> The snippet you quoted is from apply.c, and I would say what is
> more problematic is that we do not force C locale while parsing
> the diff (see another thread -- we would want to process diffs
> as byte streams).
> 

The problem is that isspace() is defined to operate on an integer which 
can be an unsigned char value promoted to int or EOF (-1).

	-hpa

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Usage of isspace and friends
  2005-10-13  6:49 ` Junio C Hamano
  2005-10-13  8:29   ` Antti-Juhani Kaijanaho
  2005-10-13 13:27   ` H. Peter Anvin
@ 2005-10-13 15:04   ` Linus Torvalds
  2005-10-13 15:45     ` H. Peter Anvin
  2005-10-13 15:46     ` Linus Torvalds
  2 siblings, 2 replies; 10+ messages in thread
From: Linus Torvalds @ 2005-10-13 15:04 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Morten Welinder, git



On Wed, 12 Oct 2005, Junio C Hamano wrote:
> 
> Huh?  isspace is "int isspace(int)".  Presumably standard
> integral promotion rules applies here whether char is signed or
> unsigned, doesn't it?

No.

The input range for the "isxxxxx()" macros is the same as the range for 
the "[f]getc[h]()" family: unsigned char + EOF (the latter usually being 
-1).

So Morten is right - if you have a "char *", it should not be dereferenced 
and used directly, although I think glibc does the right thing (and, in 
fact, I can't understand why the standards haven't been updated to do the 
right thing: it's _not_ that hard. In fact, it should be trivial apart 
from the special case of "255" that looks undistinguishable from EOF in 
signed char representation).

I'm almost goign to suggest that we do our own ctype.h, just to get the 
sane semantics: we want locale-independence, _and_ we want the right 
signed behaviour. Plus we only use a very small subset of ctype.h anyway 
(isspace, isalpha, isdigit and isalnum).

			Linus

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Usage of isspace and friends
  2005-10-13 13:27   ` H. Peter Anvin
@ 2005-10-13 15:44     ` Junio C Hamano
  0 siblings, 0 replies; 10+ messages in thread
From: Junio C Hamano @ 2005-10-13 15:44 UTC (permalink / raw)
  To: H. Peter Anvin; +Cc: git

"H. Peter Anvin" <hpa@zytor.com> writes:

> The problem is that isspace() is defined to operate on an integer which 
> can be an unsigned char value promoted to int or EOF (-1).

Ah, thanks.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Usage of isspace and friends
  2005-10-13 15:04   ` Linus Torvalds
@ 2005-10-13 15:45     ` H. Peter Anvin
  2005-10-13 15:56       ` Linus Torvalds
  2005-10-13 15:46     ` Linus Torvalds
  1 sibling, 1 reply; 10+ messages in thread
From: H. Peter Anvin @ 2005-10-13 15:45 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Junio C Hamano, Morten Welinder, git

Linus Torvalds wrote:
> 
> So Morten is right - if you have a "char *", it should not be dereferenced 
> and used directly, although I think glibc does the right thing (and, in 
> fact, I can't understand why the standards haven't been updated to do the 
> right thing: it's _not_ that hard. In fact, it should be trivial apart 
> from the special case of "255" that looks undistinguishable from EOF in 
> signed char representation).
> 

Because of the special case of 255 which looks indistinguishable from 
EOF, therefore making it required?

The original mistake, of course, was allowing EOF to be passed to the 
various isxxx() macros.

	-hpa

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Usage of isspace and friends
  2005-10-13 15:04   ` Linus Torvalds
  2005-10-13 15:45     ` H. Peter Anvin
@ 2005-10-13 15:46     ` Linus Torvalds
  1 sibling, 0 replies; 10+ messages in thread
From: Linus Torvalds @ 2005-10-13 15:46 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Morten Welinder, Git Mailing List



On Thu, 13 Oct 2005, Linus Torvalds wrote:
> 
> I'm almost goign to suggest that we do our own ctype.h, just to get the 
> sane semantics: we want locale-independence, _and_ we want the right 
> signed behaviour. Plus we only use a very small subset of ctype.h anyway 
> (isspace, isalpha, isdigit and isalnum).

Maybe something like this.

No, I'm not 100% sure we need it. But hey, it's probably less complicated 
than trying to de-localize various different targets.

Is there anything else that is locale-dependent that we use in the C 
toolchain?

		Linus

---
diff-tree f6fea67a590196d81bc4c6a6be1a16dd8bf2815d (from d06b689a933f6d2130f8afdf1ac0ddb83eeb59ab)
Author: Linus Torvalds <torvalds@osdl.org>
Date:   Thu Oct 13 08:36:35 2005 -0700

    Add locale-independent (and stupid) ctype.
    
    It's also safe for signed chars.

diff --git a/Makefile b/Makefile
index 5e7d055..31e62d4 100644
--- a/Makefile
+++ b/Makefile
@@ -158,7 +158,8 @@ LIB_OBJS = \
 	object.o pack-check.o patch-delta.o path.o pkt-line.o \
 	quote.o read-cache.o refs.o run-command.o \
 	server-info.o setup.o sha1_file.o sha1_name.o strbuf.o \
-	tag.o tree.o usage.o config.o environment.o $(DIFF_OBJS)
+	tag.o tree.o usage.o config.o environment.o ctype.o \
+	$(DIFF_OBJS)
 
 LIBS = $(LIB_FILE)
 LIBS += -lz
diff --git a/apply.c b/apply.c
index 155fbe8..f4d00f2 100644
--- a/apply.c
+++ b/apply.c
@@ -6,7 +6,6 @@
  * This applies patches on top of some (arbitrary) version of the SCM.
  *
  */
-#include <ctype.h>
 #include <fnmatch.h>
 #include "cache.h"
 
diff --git a/cache.h b/cache.h
index 1a7e047..a465952 100644
--- a/cache.h
+++ b/cache.h
@@ -386,4 +386,30 @@ extern int git_config_bool(const char *,
 extern char git_default_email[MAX_GITNAME];
 extern char git_default_name[MAX_GITNAME];
 
+/* Sane ctype - no locale, and works with signed chars */
+#undef isspace
+#undef isdigit
+#undef isalpha
+#undef isalnum
+#undef tolower
+#undef toupper
+extern unsigned char sane_ctype[256];
+#define GIT_SPACE 0x01
+#define GIT_DIGIT 0x02
+#define GIT_ALPHA 0x04
+#define sane_istest(x,mask) ((sane_ctype[(unsigned char)(x)] & (mask)) != 0)
+#define isspace(x) sane_istest(x,GIT_SPACE)
+#define isdigit(x) sane_istest(x,GIT_DIGIT)
+#define isalpha(x) sane_istest(x,GIT_ALPHA)
+#define isalnum(x) sane_istest(x,GIT_ALPHA | GIT_DIGIT)
+#define tolower(x) sane_case((unsigned char)(x), 0x20)
+#define toupper(x) sane_case((unsigned char)(x), 0)
+
+static inline int sane_case(int x, int high)
+{
+	if (sane_istest(x, GIT_ALPHA))
+		x = (x & ~0x20) | high;
+	return x;
+}
+
 #endif /* CACHE_H */
diff --git a/commit-tree.c b/commit-tree.c
index 030fb70..ea0fdd4 100644
--- a/commit-tree.c
+++ b/commit-tree.c
@@ -7,7 +7,6 @@
 
 #include <pwd.h>
 #include <time.h>
-#include <ctype.h>
 
 #define BLOCKING (1ul << 14)
 
diff --git a/commit.c b/commit.c
index f735f98..8f40318 100644
--- a/commit.c
+++ b/commit.c
@@ -1,4 +1,3 @@
-#include <ctype.h>
 #include "tag.h"
 #include "commit.h"
 #include "cache.h"
diff --git a/config.c b/config.c
index 9b7c6f2..519fecf 100644
--- a/config.c
+++ b/config.c
@@ -1,4 +1,3 @@
-#include <ctype.h>
 
 #include "cache.h"
 
diff --git a/convert-objects.c b/convert-objects.c
index 9ad0c77..a892013 100644
--- a/convert-objects.c
+++ b/convert-objects.c
@@ -1,6 +1,5 @@
 #define _XOPEN_SOURCE /* glibc2 needs this */
 #include <time.h>
-#include <ctype.h>
 #include "cache.h"
 
 struct entry {
diff --git a/ctype.c b/ctype.c
new file mode 100644
index 0000000..56bdffa
--- /dev/null
+++ b/ctype.c
@@ -0,0 +1,23 @@
+/*
+ * Sane locale-independent, ASCII ctype.
+ *
+ * No surprises, and works with signed and unsigned chars.
+ */
+#include "cache.h"
+
+#define SS GIT_SPACE
+#define AA GIT_ALPHA
+#define DD GIT_DIGIT
+
+unsigned char sane_ctype[256] = {
+	 0,  0,  0,  0,  0,  0,  0,  0,  0, SS, SS,  0,  0, SS,  0,  0,		/* 0-15 */
+	 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,		/* 16-15 */
+	SS,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,		/* 32-15 */
+	DD, DD, DD, DD, DD, DD, DD, DD, DD, DD,  0,  0,  0,  0,  0,  0,		/* 48-15 */
+	 0, AA, AA, AA, AA, AA, AA, AA, AA, AA, AA, AA, AA, AA, AA, AA,		/* 64-15 */
+	AA, AA, AA, AA, AA, AA, AA, AA, AA, AA, AA,  0,  0,  0,  0,  0,		/* 80-15 */
+	 0, AA, AA, AA, AA, AA, AA, AA, AA, AA, AA, AA, AA, AA, AA, AA,		/* 96-15 */
+	AA, AA, AA, AA, AA, AA, AA, AA, AA, AA, AA,  0,  0,  0,  0,  0,		/* 112-15 */
+	/* Nothing in the 128.. range */
+};
+
diff --git a/date.c b/date.c
index b21cadc..63f5a09 100644
--- a/date.c
+++ b/date.c
@@ -4,7 +4,6 @@
  * Copyright (C) Linus Torvalds, 2005
  */
 
-#include <ctype.h>
 #include <time.h>
 
 #include "cache.h"
diff --git a/diff-tree.c b/diff-tree.c
index 2203fa5..8517220 100644
--- a/diff-tree.c
+++ b/diff-tree.c
@@ -1,4 +1,3 @@
-#include <ctype.h>
 #include "cache.h"
 #include "diff.h"
 #include "commit.h"
diff --git a/ident.c b/ident.c
index 7a9f567..1bfbc6f 100644
--- a/ident.c
+++ b/ident.c
@@ -9,7 +9,6 @@
 
 #include <pwd.h>
 #include <time.h>
-#include <ctype.h>
 
 static char git_default_date[50];
 
diff --git a/mailsplit.c b/mailsplit.c
index 0f8100d..189f4ed 100644
--- a/mailsplit.c
+++ b/mailsplit.c
@@ -11,7 +11,6 @@
 #include <sys/stat.h>
 #include <string.h>
 #include <stdio.h>
-#include <ctype.h>
 #include <assert.h>
 #include "cache.h"
 
diff --git a/pack-objects.c b/pack-objects.c
index 3d62278..83ac22b 100644
--- a/pack-objects.c
+++ b/pack-objects.c
@@ -1,4 +1,3 @@
-#include <ctype.h>
 #include "cache.h"
 #include "object.h"
 #include "delta.h"
diff --git a/patch-id.c b/patch-id.c
index 960e7ce..edbc4aa 100644
--- a/patch-id.c
+++ b/patch-id.c
@@ -1,4 +1,3 @@
-#include <ctype.h>
 #include "cache.h"
 
 static void flush_current_id(int patchlen, unsigned char *id, SHA_CTX *c)
diff --git a/refs.c b/refs.c
index 5a8cbd4..42240d2 100644
--- a/refs.c
+++ b/refs.c
@@ -2,7 +2,6 @@
 #include "cache.h"
 
 #include <errno.h>
-#include <ctype.h>
 
 /* We allow "recursive" symbolic refs. Only within reason, though */
 #define MAXDEPTH 5
diff --git a/update-ref.c b/update-ref.c
index 4a1704c..65dc3d6 100644
--- a/update-ref.c
+++ b/update-ref.c
@@ -1,6 +1,5 @@
 #include "cache.h"
 #include "refs.h"
-#include <ctype.h>
 
 static const char git_update_ref_usage[] = "git-update-ref <refname> <value> [<oldval>]";
 

^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: Usage of isspace and friends
  2005-10-13 15:45     ` H. Peter Anvin
@ 2005-10-13 15:56       ` Linus Torvalds
  2005-10-13 16:07         ` H. Peter Anvin
  0 siblings, 1 reply; 10+ messages in thread
From: Linus Torvalds @ 2005-10-13 15:56 UTC (permalink / raw)
  To: H. Peter Anvin; +Cc: Junio C Hamano, Morten Welinder, git



On Thu, 13 Oct 2005, H. Peter Anvin wrote:
> 
> Because of the special case of 255 which looks indistinguishable from EOF,
> therefore making it required?

Yeah, and I agree, that was a mistake. It could have been fixed by making 
EOF be MIN_INT (or any other value outside the range of either "unsigned 
char" or "signed char" - preferably still negative), but there are 
probably programs that depend on it being -1. 

The stupid thing I just posted doesn't care. It happens to return 0 for 
EOF for all cases, but that's a side effect of (a) not doing locales (so 
255 is never printable or alpha) and (b) not even implementing iscntrl().

In general, the rule for ctype and EOF _should_ have been that it's part 
of an acceptable input range, but that the return value is undefined ;)

(Which would allow you to test EOF later, and not worry about any faults).

		Linus

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Usage of isspace and friends
  2005-10-13 15:56       ` Linus Torvalds
@ 2005-10-13 16:07         ` H. Peter Anvin
  0 siblings, 0 replies; 10+ messages in thread
From: H. Peter Anvin @ 2005-10-13 16:07 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Junio C Hamano, Morten Welinder, git

Linus Torvalds wrote:
> 
> Yeah, and I agree, that was a mistake. It could have been fixed by making 
> EOF be MIN_INT (or any other value outside the range of either "unsigned 
> char" or "signed char" - preferably still negative), but there are 
> probably programs that depend on it being -1. 
> 
> The stupid thing I just posted doesn't care. It happens to return 0 for 
> EOF for all cases, but that's a side effect of (a) not doing locales (so 
> 255 is never printable or alpha) and (b) not even implementing iscntrl().
> 

Given that for isprint() et al are useless for non-ASCII UTF-8 anyway, 
might as well (they are not defined to be able to take wide character 
values.)

	-hpa

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2005-10-13 16:08 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2005-10-12  1:40 Usage of isspace and friends Morten Welinder
2005-10-13  6:49 ` Junio C Hamano
2005-10-13  8:29   ` Antti-Juhani Kaijanaho
2005-10-13 13:27   ` H. Peter Anvin
2005-10-13 15:44     ` Junio C Hamano
2005-10-13 15:04   ` Linus Torvalds
2005-10-13 15:45     ` H. Peter Anvin
2005-10-13 15:56       ` Linus Torvalds
2005-10-13 16:07         ` H. Peter Anvin
2005-10-13 15:46     ` Linus Torvalds

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).