linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Rasmus Villemoes <linux@rasmusvillemoes.dk>
To: Linus Torvalds <torvalds@linux-foundation.org>
Cc: "Jason A. Donenfeld" <Jason@zx2c4.com>,
	linux-kernel@vger.kernel.org, linux-kbuild@vger.kernel.org,
	linux-arch@vger.kernel.org, linux-toolchains@vger.kernel.org,
	Masahiro Yamada <masahiroy@kernel.org>,
	Kees Cook <keescook@chromium.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	Andy Shevchenko <andriy.shevchenko@linux.intel.com>,
	Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Subject: Re: make ctype ascii only? (was [PATCH] kbuild: treat char as always signed)
Date: Thu, 27 Oct 2022 09:59:05 +0200	[thread overview]
Message-ID: <915a104b-0e70-dfb8-3c85-54fd1e5e63e5@rasmusvillemoes.dk> (raw)
In-Reply-To: <CAHk-=wg9RNhvDyanUQnxa_xnir70TUiMgjhVhRWUuF5Ojj96Dw@mail.gmail.com>

On 26/10/2022 20.10, Linus Torvalds wrote:
> On Tue, Oct 25, 2022 at 5:10 PM Rasmus Villemoes
> <linux@rasmusvillemoes.dk> wrote:
>>
>> Only very tangentially related (because it has to do with chars...): Can
>> we switch our ctype to be ASCII only, just as it was back in the good'ol
>> mid 90s
> 
> Those US-ASCII days weren't really very "good" old days, but I forget
> why we did this (it's attributed to me, but that's from the
> pre-BK/pre-git days before we actually tracked things all that well,
> so..)
> 
> Anyway, I think anybody using ctype.h on 8-bit chars gets what they
> deserve, and I think Latin1 (or something close to it) is better than
> US-ASCII, in that it's at least the same as Unicode in the low 8
> chars.

My concern is that it's currently somewhat ill specified what our ctype
actually represents, and that would be a lot easier to specify if we
just said ASCII, everything above 0x7f is neither punct or ctrl or alpha
or anything else.

For example, people may do stuff like isprint(c) ? c : '.' in a printk()
call, but most likely the consumer (somebody doing dmesg) would, at
least these days, use utf-8, so that just results in a broken utf-8
sequence. Now I see that a lot of callers actually do "isascii(c) &&
isprint(c)", so they already know about this, but there are also many
instances where isprint() is used by itself.

There's also stuff like fs/afs/cell.c and other places that use
isprint/isalnum/... to make decisions on what is allowed on the wire
and/or in a disk format, where it's then hard to reason about just
exactly what is accepted. And places that use toupper() on their strings
to normalize them; that's broken when toupper() isn't idempotent.

> So no, I'm disinclined to go back in time to what I think is an even
> worse situation. Latin1 isn't great, but it sure beats US-ASCII. And
> if you really want just US-ASII, then don't use the high bit, and make
> your disgusting 7-bit code be *explicitly* 7-bit.
> 
> Now, if there are errors in that table wrt Latin1 / "first 256
> codepoints of Unicode" too, then we can fix those.

AFAICT, the differences are:

- 0xaa (FEMININE ORDINAL INDICATOR), 0xb5 (MICRO SIGN), 0xba (FEMININE
ORDINAL INDICATOR) should be lower (hence alpha and alnum), not punct.

- depending a little on just exactly what one wants latin1 to mean, but
if it does mean "first 256 codepoints of Unicode", 0x80-0x9f should be cntrl

- for some reason at least glibc seems to classify 0xa0 as punctuation
and not space (hence also as isgraph)

- 0xdf and 0xff are correctly classified as lower, but since they don't
have upper-case versions (at least not any that are representable in
latin1), correct toupper() behaviour is to return them unchanged, but we
just subtract 0x20, so 0xff becomes 0xdf which isn't isupper() and 0xdf
becomes something that isn't even isalpha().

Fixing the first would create more instances of the last, and I think
the only sane way to fix that would be a 256 byte lookup table to use by
toupper().

> Not that anybody has apparently cared since 2.0.1 was released back in
> July of 1996 
(btw, it's sad how none of the old linux git archive
> creations seem to have tried to import the dates, so you have to look
> those up separately)

Huh? That commit has 1996 as the author date, while its commit date is
indeed 2007. The very first line says:

author	linus1 <torvalds@linuxfoundation.org>	1996-07-02 11:00:00 -0600

> And if nobody has cared since 1996, I don't really think it matters.

Indeed, I don't think it's a huge problem in practice. But it still
bothers me that such a simple (and usually overlooked) corner of the
kernel's C library is ill-defined and arguably a little buggy.

Rasmus

  reply	other threads:[~2022-10-27  7:59 UTC|newest]

Thread overview: 71+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-10-19 16:26 [PATCH] kbuild: treat char as always signed Jason A. Donenfeld
2022-10-19 16:54 ` Segher Boessenkool
2022-10-19 17:14   ` Linus Torvalds
2022-10-19 17:26     ` Linus Torvalds
2022-10-19 18:10       ` Nick Desaulniers
2022-10-19 18:35         ` Linus Torvalds
2022-10-19 19:23           ` Andy Shevchenko
2022-10-19 19:36             ` Linus Torvalds
2022-10-19 17:43     ` Segher Boessenkool
2022-10-19 18:11       ` Linus Torvalds
2022-10-19 18:20         ` Nick Desaulniers
2022-10-19 18:56           ` Linus Torvalds
2022-10-19 19:11             ` Kees Cook
2022-10-19 19:30               ` Linus Torvalds
2022-10-19 20:35                 ` Jason A. Donenfeld
2022-10-20  0:10                   ` Linus Torvalds
2022-10-20  3:11                     ` Jason A. Donenfeld
2022-10-19 20:15             ` Segher Boessenkool
2022-10-19 21:07         ` David Laight
2022-10-19 21:26           ` Segher Boessenkool
2022-10-20 10:41         ` Gabriel Paubert
2022-10-21 22:46           ` Linus Torvalds
2022-10-22  6:06             ` Gabriel Paubert
2022-10-22 18:16               ` Linus Torvalds
2022-10-23 20:23                 ` Gabriel Paubert
2022-10-25 23:00                   ` Kees Cook
2022-10-26  0:04                     ` Jason A. Donenfeld
2022-10-26 15:41                       ` Kees Cook
2022-10-19 19:54 ` Linus Torvalds
2022-10-19 20:23   ` Jason A. Donenfeld
2022-10-19 20:30     ` [PATCH v2] kbuild: treat char as always unsigned Jason A. Donenfeld
2022-10-19 23:56       ` Linus Torvalds
2022-10-20  0:02         ` Jason A. Donenfeld
2022-10-20  0:38           ` Linus Torvalds
2022-10-20  2:59             ` Jason A. Donenfeld
2022-10-20 18:41             ` Kees Cook
2022-10-21  1:01               ` Jason A. Donenfeld
2022-10-20 20:24         ` Segher Boessenkool
2022-10-24  9:24       ` Dan Carpenter
2022-10-24  9:30         ` Dan Carpenter
2022-10-24 16:33           ` Jason A. Donenfeld
2022-10-24 17:10             ` Linus Torvalds
2022-10-24 17:17               ` Jason A. Donenfeld
2022-10-25 19:22                 ` Kalle Valo
2022-10-25 10:16               ` David Laight
2022-10-24 15:17         ` Jason A. Donenfeld
2022-12-21 14:53       ` Guenter Roeck
2022-12-21 15:05         ` Geert Uytterhoeven
2022-12-21 15:23           ` Guenter Roeck
2022-12-21 15:29           ` Rasmus Villemoes
2022-12-21 15:56             ` Guenter Roeck
2022-12-21 17:06               ` Linus Torvalds
2022-12-21 17:19                 ` Guenter Roeck
2022-12-21 18:46                   ` Linus Torvalds
2022-12-21 19:08                     ` Linus Torvalds
2022-12-21 21:01                     ` Guenter Roeck
2022-12-22 13:05                     ` Geert Uytterhoeven
2022-12-22 10:41                 ` David Laight
     [not found]                   ` <f02e0ac7f2d805020a7ba66803aaff3e31b5eeff.camel@t-online.de>
2022-12-24  9:47                     ` Geert Uytterhoeven
2022-12-30 11:39                     ` David Laight
2022-12-30 13:13                       ` David Laight
2023-01-02  8:29                       ` Geert Uytterhoeven
2022-12-21 17:49               ` Andreas Schwab
2022-12-21 16:57             ` Geert Uytterhoeven
2022-10-19 20:58   ` [PATCH] kbuild: treat char as always signed David Laight
2022-10-26  0:10   ` make ctype ascii only? (was [PATCH] kbuild: treat char as always signed) Rasmus Villemoes
2022-10-26 18:10     ` Linus Torvalds
2022-10-27  7:59       ` Rasmus Villemoes [this message]
2022-10-27 18:28         ` Linus Torvalds
2022-10-20  8:39 ` [PATCH] kbuild: treat char as always signed kernel test robot
2022-10-20 16:33   ` Jason A. Donenfeld

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=915a104b-0e70-dfb8-3c85-54fd1e5e63e5@rasmusvillemoes.dk \
    --to=linux@rasmusvillemoes.dk \
    --cc=Jason@zx2c4.com \
    --cc=akpm@linux-foundation.org \
    --cc=andriy.shevchenko@linux.intel.com \
    --cc=gregkh@linuxfoundation.org \
    --cc=keescook@chromium.org \
    --cc=linux-arch@vger.kernel.org \
    --cc=linux-kbuild@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-toolchains@vger.kernel.org \
    --cc=masahiroy@kernel.org \
    --cc=torvalds@linux-foundation.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).