From: Huacai Chen
Date: Mon, 28 Aug 2023 19:30:43 +0800
Subject: Re: [PATCH V4 2/2] rcu: Update jiffies in rcu_cpu_stall_reset()
To: paulmck@kernel.org
Cc: Joel Fernandes, Thomas Gleixner, Z qiang, Huacai Chen,
    Frederic Weisbecker, Neeraj Upadhyay, Josh Triplett, Boqun Feng,
    Ingo Molnar, John Stultz, Stephen Boyd, Steven Rostedt,
    Mathieu Desnoyers, Lai Jiangshan, Sergey Senozhatsky,
    rcu@vger.kernel.org, linux-kernel@vger.kernel.org,
    stable@vger.kernel.org, Binbin Zhou
References: <16827b4e-9823-456d-a6be-157fbfae64c3@paulmck-laptop>
    <8792da20-a58e-4cc0-b3d2-231d5ade2242@paulmck-laptop>
    <24e34f50-32d2-4b67-8ec0-1034c984d035@paulmck-laptop>
    <20230825232807.GA97898@google.com>
    <2681134d-cc88-49a0-a1bc-4ec0816288f6@paulmck-laptop>
In-Reply-To: <2681134d-cc88-49a0-a1bc-4ec0816288f6@paulmck-laptop>

Hi, Paul and Joel,

On Mon, Aug 28, 2023 at 6:47 PM Paul E. McKenney wrote:
>
> On Sun, Aug 27, 2023 at 06:11:40PM -0400, Joel Fernandes wrote:
> > On Sun, Aug 27, 2023 at 1:51 AM Huacai Chen wrote:
> > [..]
> > > > > > > The only way I know of to avoid these sorts of false positives is for
> > > > > > > the user to manually suppress all timeouts (perhaps using a kernel-boot
> > > > > > > parameter for your early-boot case), do the gdb work, and then unsuppress
> > > > > > > all stalls.
> > > > > > > Even that won't work for networking, because the other
> > > > > > > system's clock will be running throughout.
> > > > > > >
> > > > > > > In other words, from what I know now, there is no perfect solution.
> > > > > > > Therefore, there are sharp limits to the complexity of any solution that
> > > > > > > I will be willing to accept.
> > > > > > I think the simplest solution is (I hope Joel will not angry):
> > > > >
> > > > > Not angry at all, just want to help. ;-). The problem is the 300*HZ solution
> > > > > will also effect the VM workloads which also do a similar reset. Allow me few
> > > > > days to see if I can take a shot at fixing it slightly differently. I am
> > > > > trying Paul's idea of setting jiffies at a later time. I think it is doable.
> > > > > I think the advantage of doing this is it will make stall detection more
> > > > > robust in this face of these gaps in jiffie update. And that solution does
> > > > > not even need us to rely on ktime (and all the issues that come with that).
> > > > >
> > > >
> > > > I wrote a patch similar to Paul's idea and sent it out for review, the
> > > > advantage being it purely is based on jiffies. Could you try it out
> > > > and let me know?
> > > If you can cc my gmail , that could be better.
> >
> > Sure, will do.
> >
> > > I have read your patch, maybe the counter (nr_fqs_jiffies_stall)
> > > should be atomic_t and we should use atomic operation to decrement its
> > > value. Because rcu_gp_fqs() can be run concurrently, and we may miss
> > > the (nr_fqs == 1) condition.
> >
> > I don't think so. There is only 1 place where RMW operation happens
> > and rcu_gp_fqs() is called only from the GP kthread. So a concurrent
> > RMW (and hence a lost update) is not possible.
>
> Huacai, is your concern that the gdb user might have created a script
> (for example, printing a variable or two, then automatically continuing),
> so that breakpoints could happen in quick succession, such that the
> second breakpoint might run concurrently with rcu_gp_fqs()?
>
> If this can really happen, the point that Joel makes is a good one, namely
> that rcu_gp_fqs() is single-threaded and (absent rcutorture) runs only
> once every few jiffies. And gdb breakpoints, even with scripting, should
> also be rather rare. So if this is an issue, a global lock should do the
> trick, perhaps even one of the existing locks in the rcu_state structure.
> The result should then be just as performant/scalable and a lot simpler
> than use of atomics.
Sorry, I made a mistake. Yes, there is no concurrency issue, and this
approach probably works. But I have another question: how do we ensure
that a jiffies update happens within three calls to rcu_gp_fqs()? In
other words, is three also a magic number here?

I also rechecked the commit message of a80be428fbc1f1f3bc9e ("rcu: Do
not disable GP stall detection in rcu_cpu_stall_reset()"). I don't
know why Sergey said that the original code disables stall detection
forever; in fact, it only disables detection for the current GP.

Huacai

>
> > Could you test the patch for the issue you are seeing and provide your
> > Tested-by tag? Thanks,
>
> Either way, testing would of course be very good! ;-)
>
> Thanx, Paul
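
For reference, the countdown under discussion can be modeled in a few
lines. This is a toy Python simulation under stated assumptions, not the
kernel code: the names nr_fqs_jiffies_stall and jiffies_stall and the
(nr_fqs == 1) condition come from the thread, while StallState,
stall_reset(), gp_fqs(), the ULONG_MAX "detection off" sentinel, and the
STALL_TIMEOUT value are hypothetical stand-ins. It assumes, as Joel
notes, that only one thread ever calls gp_fqs(), so a plain
read-modify-write is used.

```python
ULONG_MAX = 2**64 - 1      # assumed "detection off" sentinel (64-bit unsigned long)
STALL_TIMEOUT = 21000      # hypothetical stall timeout in jiffies, illustration only


class StallState:
    """Toy stand-in for the two fields discussed in the thread."""

    def __init__(self):
        self.jiffies_stall = 0         # deadline after which a stall is reported
        self.nr_fqs_jiffies_stall = 0  # FQS passes left before the deadline is re-armed


def stall_reset(state, nr_fqs=3):
    """Model of a stall reset: switch detection off and start the countdown."""
    state.nr_fqs_jiffies_stall = nr_fqs
    state.jiffies_stall = ULONG_MAX


def gp_fqs(state, jiffies):
    """Model of one FQS pass; assumed to run in a single thread only."""
    nr_fqs = state.nr_fqs_jiffies_stall
    if nr_fqs:
        if nr_fqs == 1:
            # Last pass of the countdown: re-arm from the jiffies value seen now.
            state.jiffies_stall = jiffies + STALL_TIMEOUT
        state.nr_fqs_jiffies_stall = nr_fqs - 1


state = StallState()
stall_reset(state)
# jiffies is stale for the first two passes and has advanced by the third.
for now in (1000, 1000, 1003):
    gp_fqs(state, now)
print(state.jiffies_stall)  # 22003 with the values above
```

In this run the deadline is computed from the jiffies value seen on the
third pass. If jiffies were still stale at that point, the deadline
would be computed from the stale value, which is the concern behind the
"is three also a magic number" question.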