From mboxrd@z Thu Jan 1 00:00:00 1970
Message-ID: <915a104b-0e70-dfb8-3c85-54fd1e5e63e5@rasmusvillemoes.dk>
Date: Thu, 27 Oct 2022 09:59:05 +0200
MIME-Version: 1.0
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101 Thunderbird/91.11.0
Subject: Re: make ctype ascii only? (was [PATCH] kbuild: treat char as always signed)
Content-Language: en-US
To: Linus Torvalds
Cc: "Jason A.
 Donenfeld", linux-kernel@vger.kernel.org, linux-kbuild@vger.kernel.org,
 linux-arch@vger.kernel.org, linux-toolchains@vger.kernel.org,
 Masahiro Yamada, Kees Cook, Andrew Morton, Andy Shevchenko,
 Greg Kroah-Hartman
References: <20221019162648.3557490-1-Jason@zx2c4.com>
 <3a2fa7c1-2e31-0479-761f-9c189f8ed8c3@rasmusvillemoes.dk>
From: Rasmus Villemoes
In-Reply-To:
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 7bit
Precedence: bulk
List-ID:
X-Mailing-List: linux-toolchains@vger.kernel.org

On 26/10/2022 20.10, Linus Torvalds wrote:
> On Tue, Oct 25, 2022 at 5:10 PM Rasmus Villemoes wrote:
>>
>> Only very tangentially related (because it has to do with chars...): Can
>> we switch our ctype to be ASCII only, just as it was back in the good'ol
>> mid 90s
>
> Those US-ASCII days weren't really very "good" old days, but I forget
> why we did this (it's attributed to me, but that's from the
> pre-BK/pre-git days before we actually tracked things all that well,
> so..)
>
> Anyway, I think anybody using ctype.h on 8-bit chars gets what they
> deserve, and I think Latin1 (or something close to it) is better than
> US-ASCII, in that it's at least the same as Unicode in the low 8
> chars.

My concern is that it's currently somewhat ill-specified what our ctype
actually represents, and that would be a lot easier to specify if we
just said ASCII: everything above 0x7f is neither punct nor cntrl nor
alpha nor anything else. For example, people may do stuff like

  isprint(c) ? c : '.'

in a printk() call, but most likely the consumer (somebody doing dmesg)
would, at least these days, use UTF-8, so a lone byte above 0x7f that
passes isprint() just results in a broken UTF-8 sequence. Now I see that
a lot of callers actually do "isascii(c) && isprint(c)", so they already
know about this, but there are also many instances where isprint() is
used by itself.

There's also stuff like fs/afs/cell.c and other places that use
isprint/isalnum/...
to make decisions on what is allowed on the wire and/or in a disk
format, where it's then hard to reason about just exactly what is
accepted. And there are places that use toupper() on their strings to
normalize them; that's broken when toupper() isn't idempotent.

> So no, I'm disinclined to go back in time to what I think is an even
> worse situation. Latin1 isn't great, but it sure beats US-ASCII. And
> if you really want just US-ASCII, then don't use the high bit, and make
> your disgusting 7-bit code be *explicitly* 7-bit.
>
> Now, if there are errors in that table wrt Latin1 / "first 256
> codepoints of Unicode" too, then we can fix those.

AFAICT, the differences are:

- 0xaa (FEMININE ORDINAL INDICATOR), 0xb5 (MICRO SIGN), 0xba (MASCULINE
  ORDINAL INDICATOR) should be lower (hence alpha and alnum), not punct.

- depending a little on just exactly what one wants latin1 to mean, but
  if it does mean "first 256 codepoints of Unicode", 0x80-0x9f should be
  cntrl

- for some reason at least glibc seems to classify 0xa0 as punctuation
  and not space (hence also as isgraph)

- 0xdf and 0xff are correctly classified as lower, but since they don't
  have upper-case versions (at least not any that are representable in
  latin1), correct toupper() behaviour is to return them unchanged.
  Instead we just subtract 0x20, so 0xff becomes 0xdf, which isn't
  isupper(), and 0xdf becomes 0xbf, which isn't even isalpha().

Fixing the first would create more instances of the last, and I think
the only sane way to fix that would be a 256-byte lookup table for
toupper() to use.

> Not that anybody has apparently cared since 2.0.1 was released back in
> July of 1996 (btw, it's sad how none of the old linux git archive
> creations seem to have tried to import the dates, so you have to look
> those up separately)

Huh? That commit has 1996 as the author date, while its commit date is
indeed 2007.
The very first line says:

  author linus1 1996-07-02 11:00:00 -0600

> And if nobody has cared since 1996, I don't really think it matters.

Indeed, I don't think it's a huge problem in practice. But it still
bothers me that such a simple (and usually overlooked) corner of the
kernel's C library is ill-defined and arguably a little buggy.

Rasmus
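
To make the toupper() point above concrete, here is a minimal userspace
sketch. It is not the kernel's code: latin1_islower(), naive_toupper(),
upper_tbl, init_upper_tbl(), table_toupper() and count_non_idempotent()
are all hypothetical names, and the islower() stand-in only assumes the
Latin-1 classification discussed above (a-z plus 0xdf-0xff except 0xf7).
It compares "subtract 0x20 from anything lower" with a 256-byte lookup
table that leaves 0xdf and 0xff unchanged.

```c
#include <assert.h>

/*
 * Hypothetical stand-in for a Latin-1 islower(): a-z, plus 0xdf-0xff
 * except 0xf7 (DIVISION SIGN). This is an assumption for illustration,
 * not the kernel's actual _ctype table.
 */
int latin1_islower(unsigned char c)
{
	return (c >= 'a' && c <= 'z') || (c >= 0xdf && c != 0xf7);
}

/* Mimics the "subtract 0x20 from anything that is lower" behaviour. */
unsigned char naive_toupper(unsigned char c)
{
	return latin1_islower(c) ? (unsigned char)(c - 0x20) : c;
}

/* 256-byte lookup table; must be filled by init_upper_tbl() before use. */
unsigned char upper_tbl[256];

void init_upper_tbl(void)
{
	for (int i = 0; i < 256; i++) {
		/* Only letters whose upper-case form exists in Latin-1. */
		int has_upper = (i >= 'a' && i <= 'z') ||
				(i >= 0xe0 && i <= 0xfe && i != 0xf7);

		upper_tbl[i] = (unsigned char)(has_upper ? i - 0x20 : i);
	}
}

unsigned char table_toupper(unsigned char c)
{
	return upper_tbl[c];
}

/* Number of byte values c for which f(f(c)) != f(c). */
int count_non_idempotent(unsigned char (*f)(unsigned char))
{
	int n = 0;

	for (int i = 0; i < 256; i++) {
		unsigned char once = f((unsigned char)i);

		if (f(once) != once)
			n++;
	}
	return n;
}

/* Checks the claims made in the text; aborts if one of them is false. */
void toupper_selftest(void)
{
	init_upper_tbl();
	assert(naive_toupper(0xff) == 0xdf);	/* lower, not upper */
	assert(naive_toupper(0xdf) == 0xbf);	/* not even a letter */
	assert(table_toupper(0xff) == 0xff);	/* returned unchanged */
	assert(table_toupper(0xdf) == 0xdf);	/* returned unchanged */
	assert(table_toupper(0xe9) == 0xc9);	/* still maps real pairs */
}
```

With this stand-in, count_non_idempotent(naive_toupper) finds the 0xff
case (0xff -> 0xdf -> 0xbf), while the table variant is idempotent for
all 256 byte values. Adding 0xaa, 0xb5 and 0xba to the lower class would
only need three more "unchanged" entries in the table.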