From: Linus Torvalds
Date: Wed, 19 Oct 2022 11:11:16 -0700
Subject: Re: [PATCH] kbuild: treat char as always signed
To: Segher Boessenkool
Cc: "Jason A. Donenfeld", linux-kernel@vger.kernel.org,
 linux-kbuild@vger.kernel.org, linux-arch@vger.kernel.org,
 linux-toolchains@vger.kernel.org, Masahiro Yamada, Kees Cook,
 Andrew Morton, Andy Shevchenko, Greg Kroah-Hartman
References: <20221019162648.3557490-1-Jason@zx2c4.com>
 <20221019165455.GL25951@gate.crashing.org>
 <20221019174345.GM25951@gate.crashing.org>
In-Reply-To: <20221019174345.GM25951@gate.crashing.org>
X-Mailing-List: linux-toolchains@vger.kernel.org

On Wed, Oct 19, 2022 at 10:45 AM Segher Boessenkool wrote:
>
> When I did this more than a decade ago there indeed was a LOT of noise,
> mostly caused by dubious code.

It really happens with explicitly *not* dubious code.

Using 'unsigned char[]' is very common in code that actually does
anything where you care about the actual byte values. Things like
utf-8 handling, things like compression, lots and lots of cases.

But a number of those cases are still dealing with *strings*. UTF-8 is
still a perfectly valid C string format, and using 'strlen()' on a
buffer that contains UTF-8 is neither unusual nor wrong. It is still
the proper way to get the byte length of the thing. It's how UTF-8 is
literally designed.

And -Wpointer-sign will complain about that, unless you start doing
explicit casting, which is just a worse fix than the disease.

Explicit casts are bad (unless, of course, you are explicitly trying
to violate the type system, when they are both required, and a great
way to say "look, I'm doing something dangerous").

So people who say "just cast it", don't understand that casts *should*
be seen as "this code is doing something special, tread carefully". If
you just randomly add casts to shut up a warning, the casts become
normalized and don't raise the kind of warning signs that they
*should* raise.

And it's really annoying, because the code ends up using 'unsigned
char' exactly _because_ it's trying to be careful and explicit about
signs, and then the warning makes that carefully written code worse.

> Then suggest something better?  Or suggest improvements to the existing
> warning?

As I mentioned in the next email, I tried to come up with something
better in sparse, which wasn't based on the pointer type comparison,
but on the actual 'char' itself.

My (admittedly only ever half-implemented) thing actually worked fine
for the simple cases (where simplification would end up just undoing
all the "expand char to int" because the end use was just assigned to
another char, or it was masked for other reasons).

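To make those "simple cases" concrete, here is a rough sketch, not
taken from sparse or from any patch in this thread. The function names
are made up for illustration, and it assumes 8-bit bytes and two's
complement. The first two helpers expand a 'char' to 'int' in a way
that gets undone again; the third is the kind of use where the sign
really is observable.

```c
#include <limits.h>

/* Copy: c is expanded to int and narrowed straight back to char, so the
 * resulting byte is the same whether plain char is signed or unsigned. */
static char copy_byte(char c)
{
	int widened = c;	/* "expand char to int" */
	return (char)widened;	/* narrowed again: the expansion is undone */
}

/* Mask: the & 0xff discards whatever the expansion put in the upper bits. */
static int low_byte(char c)
{
	return c & 0xff;
}

/* Sign-sensitive: the expansion is visible to the caller. For bytes
 * >= 0x80 this is negative where plain char is signed, positive where
 * it is unsigned. */
static int widen(char c)
{
	return c;
}
```

With a byte like 0xE9, copy_byte() and low_byte() behave the same on
every architecture, while widen() gives a different answer depending on
whether plain 'char' is signed (CHAR_MIN < 0) or not.
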
But while sparse does a lot of basic optimizations, it still left
enough "look, you're doing sign-extensions on a 'char'" on the table
that it warned about perfectly valid stuff.

And maybe that's fundamentally hard.

The "-Wpointer-sign" thing could probably be fairly easily improved,
by just recognizing that things like 'strlen()' and friends do not
care about the sign of 'char', and neither does a 'strcmp()' that only
checks for equality (but if you check the *sign* of strcmp, it does
matter).

It's been some time since I last tried it, but at least from memory,
it really was mostly the standard C string functions that caused
almost all problems. Your *own* functions you can just make sure the
signedness is right, but it's really really annoying when you try to
be careful about the byte signs, and the compiler starts complaining
just because you want to use the bog-standard 'strlen()' function.

And no, something like 'ustrlen()' with a hidden cast is just noise
for a warning that really shouldn't exist.

So some way to say 'this function really doesn't care about the sign
of this pointer' (and having the compiler know that for the string
functions it already knows about anyway) would probably make almost
all problems with -Wpointer-sign go away.

Put another way: 'char *' is so fundamental and inherent in C, that
you can't just warn when people use it in contexts where sign really
doesn't matter.

                 Linus
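
For illustration only, a minimal sketch of the situation described
above. The helper names are hypothetical and nothing here is proposed
kernel code. The first two helpers carry exactly the kind of
warning-silencing cast the email objects to. The last two are
hand-written byte comparisons (not strcmp() itself) that show why an
ordering check, unlike a length or equality check, depends on how the
bytes are interpreted.

```c
#include <stddef.h>
#include <string.h>

/* Byte length of a NUL-terminated UTF-8 buffer. strlen() never looks at
 * the sign of the bytes; the cast is there only to quiet -Wpointer-sign. */
static size_t utf8_byte_len(const unsigned char *s)
{
	return strlen((const char *)s);
}

/* Equality only asks whether the bytes match, so signedness is irrelevant. */
static int utf8_equal(const unsigned char *a, const unsigned char *b)
{
	return strcmp((const char *)a, (const char *)b) == 0;
}

/* Ordering is different: the same two bytes can sort either way
 * depending on whether they are read as unsigned or signed values. */
static int order_as_unsigned(unsigned char a, unsigned char b)
{
	return (a > b) - (a < b);
}

static int order_as_signed(signed char a, signed char b)
{
	return (a > b) - (a < b);
}
```

For a buffer such as 'static const unsigned char word[] =
"h\xc3\xa9llo";', utf8_byte_len(word) counts six bytes regardless of
char signedness. By contrast, the byte 0xE9 compared against 'a' sorts
after it in order_as_unsigned() and before it in order_as_signed().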