From: "Paul E. McKenney" <paulmck@kernel.org>
To: Chao Zhou <chaozhou1018@gmail.com>
Cc: rcu@vger.kernel.org
Subject: Re: Allow multiple GP misses before Panic
Date: Sun, 16 Aug 2020 11:01:37 -0700 [thread overview]
Message-ID: <20200816180137.GA23602@paulmck-ThinkPad-P72> (raw)
In-Reply-To: <CAJdzfEdfCFqF4t8kNL4UriBW8xTMtsgx7FNjQZwPHzTFQaATiA@mail.gmail.com>
On Thu, Aug 13, 2020 at 12:00:07PM -0700, Chao Zhou wrote:
> Hi Paul,
>
> Because sysctl panic_on_rcu_stall is public interface, it might have
> already been used by adopters, will the change break them? Will a new
> sysctl max_rcu_stall_to_panic be more un-interruptive? Appreciate your
> insights about this .
It is a public interface, but using the previously forbidden values
other than zero and one is fine.
Thanx, Paul
> On Thu, Aug 13, 2020 at 11:50 AM Chao Zhou <chaozhou1018@gmail.com> wrote:
> >
> > Thanks Paul for the insights!
> >
> > I studied the 3 options and think that #1+#3 offers both flexibility
> > to users and coverage of boundary user cases.
> >
> > For example, as an user of RCU, we want the warnings to be spilled at
> > the default 21 seconds so that we know such events are happening. At
> > the same time, we want Panic to happen if the stall is long enough to
> > significantly affect available system memory on our system.
> >
> > Here is the plan based on our discussion, please advise if not inline
> > with the idea:
> > 1. modify panic_on_rcu_stall to be the maximum number of consecutive
> > warnings to trigger Panic.
> > 1) change its name to max_rcu_stall_to_panic,
> > 2) default value to 1, which is the same behavior as today's.
> > 2. use ((struct rcu_state *)->gpnum - (struct rcu_data *)->gpnum) >=
> > max_rcu_stall_to_panic as condition to trigger Panic;
> > 3. reset (struct rcu_data *)->gpnum to (struct rcu_state *)->gpnum
> > every time a new grace period starts;
> > 4. add a new member (struct rcu_data *)->gpmiss that is incremented at
> > each miss to track how many misses so far for statistics/debug
> > purpose.
> >
> > Your insights and advice are highly appreciated.
> >
> > Thanks!
> >
> > Chao
> >
> > On Thu, Aug 13, 2020 at 11:19 AM Paul E. McKenney <paulmck@kernel.org> wrote:
> > >
> > > On Thu, Aug 13, 2020 at 10:22:09AM -0700, Chao Zhou wrote:
> > > > Hi,
> > > >
> > > > Some RCU stalls are transient and a system is fully capable to recover
> > > > after that, but we do want Panic after certain amount of GP misses.
> > > >
> > > > Current module parameter rcu_cpu_stall_panic only turn on/off Panic,
> > > > and 1 GP miss will trigger Panic when it is enabled.
> > > >
> > > > Plan to add a module parameter for users to fine-tune how many GP
> > > > misses are allowed before Panic.
> > > >
> > > > To save our precious time, a diff has been tested on our systems and
> > > > it works and solves our problem in transient RCU stall events.
> > > >
> > > > Your insights and guidance is highly appreciated.
> > >
> > > Please feel free to post a patch. I could imagine a number of things
> > > you might be doing from your description above:
> > >
> > > 1. Having a different time for panic, so that (for example) an
> > > RCU CPU stall warning appears at 21 seconds (in mainline), and
> > > if the grace period still has not ended at some time specified
> > > by some kernel parameter. For example, one approach would be
> > > to make the existing panic_on_rcu_stall sysctl take an integer
> > > instead of a boolean, and to make that integer specify how old
> > > the stall-warned grace period must be before panic() is invoked.
> > >
> > > 2. Instead use the number of RCU CPU stall warning messages to
> > > trigger the panic, so that (for example), the panic would happen
> > > on the tenth message. Again, the panic_on_rcu_stall sysctl
> > > might be used for this.
> > >
> > > 3. Like #2, but reset the count every time a new grace period
> > > starts. So if the panic_on_rcu_stall sysctl was set to
> > > ten, there would need to be ten RCU CPU stall warnings for
> > > the same grace period before panic() was invoked.
> > >
> > > Of the above three, #1 and #3 seem the most attractive, with a slight
> > > preference for #1.
> > >
> > > Or did you have something else in mind?
> > >
> > > Thanx, Paul
next prev parent reply other threads:[~2020-08-16 18:01 UTC|newest]
Thread overview: 8+ messages / expand[flat|nested] mbox.gz Atom feed top
2020-08-13 17:22 Allow multiple GP misses before Panic Chao Zhou
2020-08-13 18:19 ` Paul E. McKenney
2020-08-13 18:50 ` Chao Zhou
2020-08-13 19:00 ` Chao Zhou
2020-08-16 18:01 ` Paul E. McKenney [this message]
2020-08-16 23:56 ` Chao Zhou
2020-08-13 21:21 ` Paul E. McKenney
2020-08-13 21:57 ` Chao Zhou
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20200816180137.GA23602@paulmck-ThinkPad-P72 \
--to=paulmck@kernel.org \
--cc=chaozhou1018@gmail.com \
--cc=rcu@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).