From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 407DFC41513 for ; Thu, 17 Aug 2023 04:05:31 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1347960AbjHQEFA (ORCPT ); Thu, 17 Aug 2023 00:05:00 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:35586 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1347955AbjHQEE7 (ORCPT ); Thu, 17 Aug 2023 00:04:59 -0400 Received: from dfw.source.kernel.org (dfw.source.kernel.org [IPv6:2604:1380:4641:c500::1]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 6FD48271D; Wed, 16 Aug 2023 21:04:58 -0700 (PDT) Received: from smtp.kernel.org (relay.kernel.org [52.25.139.140]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits)) (No client certificate requested) by dfw.source.kernel.org (Postfix) with ESMTPS id EC39D63827; Thu, 17 Aug 2023 04:04:57 +0000 (UTC) Received: by smtp.kernel.org (Postfix) with ESMTPSA id 5C2F9C433B6; Thu, 17 Aug 2023 04:04:57 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1692245097; bh=n93Y+NvoiCdhzO7vgtSvAbVtJcBfReB5Q3BzTXShkwQ=; h=References:In-Reply-To:From:Date:Subject:To:Cc:From; b=BIf+jhuUQeg4/1MhgIg4CpsmvhqewTlGzKp/7gEX/5xk+BQQQ4xdVvLii9VcQTMjt UiqozUyWJhbWbKRJ37rwMGZzFO5oQY7KUT73GX1KEUCxYSteEqO0LEC/NuIz0OgHir iHX5wwf1/0AtJivlFSv9aCcwW4VEsySrslDYcCU+yHMMjwiHGT9T06zXRuYXdDR9nq O8HTG7QBLuYU7E3pfE22GqpCDNNzHaZGbpBzt2aaB8hOp80EgjqgHVXqiyqH8oWRPB +lBosP5l5Cx6uhbob5wtDGXHziWEkFwkwPYmlYAlD/w1WeaYpl/Z+gZyzvzxcOS8h/ PQGjp+OfVSCyw== Received: by mail-ed1-f53.google.com with SMTP id 4fb4d7f45d1cf-5255da974c4so5637717a12.3; Wed, 16 Aug 2023 21:04:57 -0700 (PDT) X-Gm-Message-State: AOJu0YzRBKeg/NuqUpavhyBvC+BJ7uumYAlL9Vw9SPAnSSCyB5NHA6sp /w0yvQVDZiGUNFM+N9xvIVyYAONWOHDcAOOLMN8= X-Google-Smtp-Source: AGHT+IHp+7TMt3Tj5/RPaWSXYWVOe5SbUojxP0adL9e0JxkaBOlBhmT7bkD3la7r7y1S8eA63rcp6wYS+uiFyY1ffsw= X-Received: by 2002:aa7:db45:0:b0:522:29ec:3e0a with SMTP id n5-20020aa7db45000000b0052229ec3e0amr3009174edt.28.1692245095417; Wed, 16 Aug 2023 21:04:55 -0700 (PDT) MIME-Version: 1.0 References: <20230814020045.51950-1-chenhuacai@loongson.cn> <20230814020045.51950-2-chenhuacai@loongson.cn> <18b9119c-cbc8-42a1-a313-9154d73c9841@paulmck-laptop> <2EB2D2A7-7C69-4E75-BF62-92309121B0A5@gmail.com> In-Reply-To: <2EB2D2A7-7C69-4E75-BF62-92309121B0A5@gmail.com> From: Huacai Chen Date: Thu, 17 Aug 2023 12:04:42 +0800 X-Gmail-Original-Message-ID: Message-ID: Subject: Re: [PATCH V4 2/2] rcu: Update jiffies in rcu_cpu_stall_reset() To: Alan Huang Cc: Z qiang , "Paul E . McKenney" , Huacai Chen , Frederic Weisbecker , Neeraj Upadhyay , Joel Fernandes , Josh Triplett , Boqun Feng , Thomas Gleixner , Ingo Molnar , John Stultz , Stephen Boyd , Steven Rostedt , Mathieu Desnoyers , Lai Jiangshan , Sergey Senozhatsky , rcu@vger.kernel.org, linux-kernel@vger.kernel.org, stable@vger.kernel.org, Binbin Zhou Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Hi, Alan, On Thu, Aug 17, 2023 at 12:52=E2=80=AFAM Alan Huang = wrote: > > >> > >>>>>>>>>>> > >>>>>>>>>>> Currently rcu_cpu_stall_reset() set rcu_state.jiffies_stall t= o one check > >>>>>>>>>>> period later, i.e. jiffies + rcu_jiffies_till_stall_check(). = But jiffies > >>>>>>>>>>> is only updated in the timer interrupt, so when kgdb_cpu_ente= r() begins > >>>>>>>>>>> to run there may already be nearly one rcu check period after= jiffies. > >>>>>>>>>>> Since all interrupts are disabled during kgdb_cpu_enter(), ji= ffies will > >>>>>>>>>>> not be updated. When kgdb_cpu_enter() returns, rcu_state.jiff= ies_stall > >>>>>>>>>>> maybe already gets timeout. > >>>>>>>>>>> > >>>>>>>>>>> We can set rcu_state.jiffies_stall to two rcu check periods l= ater, e.g. > >>>>>>>>>>> jiffies + (rcu_jiffies_till_stall_check() * 2) in rcu_cpu_sta= ll_reset() > >>>>>>>>>>> to avoid this problem. But this isn't a complete solution bec= ause kgdb > >>>>>>>>>>> may take a very long time in irq disabled context. > >>>>>>>>>>> > >>>>>>>>>>> Instead, update jiffies at the beginning of rcu_cpu_stall_res= et() can > >>>>>>>>>>> solve all kinds of problems. > >>>>>>>>>> > >>>>>>>>>> Would it make sense for there to be a kgdb_cpu_exit()? In tha= t case, > >>>>>>>>>> the stalls could simply be suppressed at the beginning of the = debug > >>>>>>>>>> session and re-enabled upon exit, as is currently done for sys= rq output > >>>>>>>>>> via rcu_sysrq_start() and rcu_sysrq_end(). > >>>>>>>>> Thank you for your advice, but that doesn't help. Because > >>>>>>>>> rcu_sysrq_start() and rcu_sysrq_end() try to suppress the warni= ngs > >>>>>>>>> during sysrq, but kgdb already has no warnings during kgdb_cpu_= enter() > >>>>>>>>> since it is executed in irq disabled context. Instead, this pat= ch > >>>>>>>>> wants to suppress the warnings *after* kgdb_cpu_enter() due to = a very > >>>>>>>>> old jiffies value. > >>>>>>>>> > >>>>>>>> > >>>>>>>> Hello, Huacai > >>>>>>>> > >>>>>>>> Is it possible to set the rcu_cpu_stall_suppress is true in > >>>>>>>> dbg_touch_watchdogs() > >>>>>>>> and reset the rcu_cpu_stall_suppress at the beginning and end of= the > >>>>>>>> RCU grace period? > >>>>>>> This is possible but not the best: 1, kgdb is not the only caller= of > >>>>>>> rcu_cpu_stall_reset(); 2, it is difficult to find the "end" to re= set > >>>>>>> rcu_cpu_stall_suppress. > >>>>>>> > >>>>>> > >>>>>> You can replace rcu_state.jiffies_stall update by setting rcu_cpu_= stall_suppress > >>>>>> in rcu_cpu_stall_reset(), and reset rcu_cpu_stall_suppress in rcu= _gp_init() and > >>>>>> rcu_gp_cleanup(). > >>>>> What's the advantage compared with updating jiffies? Updating jiffi= es > >>>>> seems more straight forward. > >>>>> > >>>> > >>>> In do_update_jiffies_64(), need to acquire jiffies_lock raw spinlock= , > >>>> like you said, kgdb is not the only caller of rcu_cpu_stall_reset(), > >>>> the rcu_cpu_stall_reset() maybe invoke in NMI (arch/x86/platform/uv= /uv_nmi.c) > >>> Reset rcu_cpu_stall_suppress in rcu_gp_init()/rcu_gp_cleanup() is > >>> still not so good to me, because it does a useless operation in most > >>> cases. Moreover, the rcu core is refactored again and again, somethin= g > >>> may be changed in future. > >>> > >>> If do_update_jiffies_64() cannot be used in NMI context, can we > >> > >> What about updating jiffies in dbg_touch_watchdogs or adding a wrapper= which updates > >> both jiffies and jiffies_stall? > > This can solve the kgdb problem, but I found that most callers of > > rcu_cpu_stall_reset() are in irq disabled context so they may meet > > The duration of other contexts where interrupts are disabled may not be a= s long as in the case of kgdb=EF=BC=9F > > > similar problems. Modifying rcu_cpu_stall_reset() can solve all of > > them. > > > > But due to the NMI issue, from my point of view, setting jiffies_stall > > to jiffies + 300*HZ is the best solution now. :) > > If I understand correctly, the NMI issue is the deadlock issue? If so, pl= us the short duration of other irq disabled > contexts, it=E2=80=99s ok just update jiffies in dbg_touch_watchdogs. > > Please correct me if anything wrong. :) The timeout value can be configured as short as 3 seconds, in this case other callers may also have problems. Huacai > > > > > Huacai > >> > >>> consider my old method [1]? > >>> https://lore.kernel.org/rcu/CAAhV-H7j9Y=3DVvRLm8thLw-EX1PGqBA9YfT4G1A= N7ucYS=3DiP+DQ@mail.gmail.com/T/#t > >>> > >>> Of course we should set rcu_state.jiffies_stall large enough, so we > >>> can do like this: > >>> > >>> void rcu_cpu_stall_reset(void) > >>> { > >>> WRITE_ONCE(rcu_state.jiffies_stall, > >>> - jiffies + rcu_jiffies_till_stall_check()); > >>> + jiffies + 300 * HZ); > >>> } > >>> > >>> 300s is the largest timeout value, and I think 300s is enough here in= practice. > >>> > >>> Huacai > >>> > >>>> > >>>> Thanks > >>>> Zqiang > >>>> > >>>> > >>>>> Huacai > >>>>> > >>>>>> > >>>>>> Thanks > >>>>>> Zqiang > >>>>>> > >>>>>>> > >>>>>>>> or set rcupdate.rcu_cpu_stall_suppress_at_boot=3D1 in bootargs c= an > >>>>>>>> suppress RCU stall > >>>>>>>> in booting. > >>>>>>> This is also possible, but it suppresses all kinds of stall warni= ngs, > >>>>>>> which is not what we want. > >>>>>>> > >>>>>>> Huacai > >>>>>>>> > >>>>>>>> > >>>>>>>> Thanks > >>>>>>>> Zqiang > >>>>>>>> > >>>>>>>> > >>>>>>>>> > >>>>>>>>> Huacai > >>>>>>>>> > >>>>>>>>>> > >>>>>>>>>> Thanx, P= aul > >>>>>>>>>> > >>>>>>>>>>> Cc: stable@vger.kernel.org > >>>>>>>>>>> Fixes: a80be428fbc1f1f3bc9ed924 ("rcu: Do not disable GP stal= l detection in rcu_cpu_stall_reset()") > >>>>>>>>>>> Reported-off-by: Binbin Zhou > >>>>>>>>>>> Signed-off-by: Huacai Chen > >>>>>>>>>>> --- > >>>>>>>>>>> kernel/rcu/tree_stall.h | 1 + > >>>>>>>>>>> 1 file changed, 1 insertion(+) > >>>>>>>>>>> > >>>>>>>>>>> diff --git a/kernel/rcu/tree_stall.h b/kernel/rcu/tree_stall.= h > >>>>>>>>>>> index b10b8349bb2a..1c7b540985bf 100644 > >>>>>>>>>>> --- a/kernel/rcu/tree_stall.h > >>>>>>>>>>> +++ b/kernel/rcu/tree_stall.h > >>>>>>>>>>> @@ -153,6 +153,7 @@ static void panic_on_rcu_stall(void) > >>>>>>>>>>> */ > >>>>>>>>>>> void rcu_cpu_stall_reset(void) > >>>>>>>>>>> { > >>>>>>>>>>> + do_update_jiffies_64(ktime_get()); > >>>>>>>>>>> WRITE_ONCE(rcu_state.jiffies_stall, > >>>>>>>>>>> jiffies + rcu_jiffies_till_stall_check()); > >>>>>>>>>>> } > >>>>>>>>>>> -- > >>>>>>>>>>> 2.39.3 > >