All of lore.kernel.org
 help / color / mirror / Atom feed
From: "Elliott, Robert (Servers)" <elliott@hpe.com>
To: "paulmck@kernel.org" <paulmck@kernel.org>,
	Zhen Lei <thunder.leizhen@huawei.com>
Cc: Frederic Weisbecker <frederic@kernel.org>,
	Neeraj Upadhyay <quic_neeraju@quicinc.com>,
	Josh Triplett <josh@joshtriplett.org>,
	"Steven Rostedt" <rostedt@goodmis.org>,
	Mathieu Desnoyers <mathieu.desnoyers@efficios.com>,
	Lai Jiangshan <jiangshanlai@gmail.com>,
	Joel Fernandes <joel@joelfernandes.org>,
	"rcu@vger.kernel.org" <rcu@vger.kernel.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>
Subject: RE: [PATCH v4 4/4] rcu: Add RCU stall diagnosis information
Date: Mon, 7 Nov 2022 02:10:35 +0000	[thread overview]
Message-ID: <MW5PR84MB1842933AB81EECFAF2ECCE7DAB3C9@MW5PR84MB1842.NAMPRD84.PROD.OUTLOOK.COM> (raw)
In-Reply-To: <20221105200637.GC28461@paulmck-ThinkPad-P17-Gen-1>


> > > +config RCU_CPU_STALL_CPUTIME
> >
> > Since the information might change and grow over time, consider
> > calling it "ADDITIONAL_INFO" rather than just "CPUTIME".
> 
> Except that I already redirected Zhen Lei from a generic description
> to a more specific one.  The reason for this is that I already get
> complaints about the large volume of output from RCU CPU stall warnings,
> which suggests that unless the additional information is quite small,
> it should get its own Kconfig option and kernel boot parameter.
> 
> So we well be keeping RCU_CPU_STALL_CPUTIME.

No problem - CPUTIME is easier to search for, and will likely
always remain as one of the values reported.

> But please tie any such patch to an actual use case.  After all, if no
> one actually uses that additional information, we have irritates untold
> numbers of electrons for no purpose.

I've been working on fixing intermittent RCU stalls that turned out to be
caused by x86-optimized crypto modules holding kernel_fpu_begin()/end()
too long and crypto tests not calling cond_resched() often enough, so am
interested in features that will help identify and prevent such
problems.

> Also, some of those functions are on fastpaths, so adding unconditional
> instrumentation overhead might result in an objection or three.
> 
> As always, choose wisely!  ;-)
> 
> If I don't see anything from you by this coming Friday, I will fold my
> usual wordsmithing into the patch.

I applied the series to a tree not including any crypto module
fixes, but an overnight run didn't trigger any RCU stalls. So, I
modified the tcrypt test module to simulate a problem by
running a lengthy loop after kernel_fpu_begin().  
Here's an example of how the output looks:


[ 1816.698380] rcu: INFO: rcu_preempt self-detected stall on CPU
[ 1816.704368] rcu:     0-....: (2999 ticks this GP) idle=1dbc/1/0x4000000000000000 softirq=39683/39683 fqs=751
[ 1816.714200] rcu:          hardirqs   softirqs   csw/system
[ 1816.719922] rcu:  number:        5         10            0
[ 1816.725643] rcu: cputime:        3          0         1492   ==> 1500(ms)
[ 1816.732669]  (t=3030 jiffies g=89765 q=308 ncpus=56)
[ 1816.737857] CPU: 0 PID: 46826 Comm: modprobe Tainted: G        W          6.0.0+ #5
[ 1816.745754] Hardware name: HPE ProLiant DL360 Gen10/ProLiant DL360 Gen10, BIOS U32 03/08/2022
[ 1816.754523] RIP: 0010:rude_sleep_cycles.constprop.0+0x1c/0x30 [tcrypt]
[ 1816.761290] Code: 5d 41 5c 41 5d 41 5e 41 5f c3 cc cc cc cc 0f 1f 44 00 00 0f 31 48 c1 e2 20 be ab 90 41 00 48 89 d1 48 c1 e6 0b 48 09 c1 0f 31 <48> c1 e2 20 48 09 c2 48 29 ca 48 39 d6 73 ef c3 cc cc cc cc 48 8b
...

That makes me realize what "csw/switch" means:
- "csw" context switches applies to the number line
- "system" applies the cputime line
 
Maybe this shouldn't be a table? Make it grep-friendly:
[ 1816.719922] rcu: half-timeout counts: hardirq =5 softirq=10 csw=0
[ 1816.725643] rcu: half_timeout cputimes (ms): time=1500 hardirq=3 softirq=0 system=1492

The new prints interfere with existing prints, pushing the "t=3030" line 
further away from its context. That's from a pr_cont. Existing prints after
the "self-detect stall" message all start with \t, so they are roughly
related to the first line. The new prints should probably do the same.

Since the first line ends with \n, the pr_cont will never make it on the
same line, so it might be cleaner to use pr_err instead. That way it'll have
the "rcu:" prefix of the other lines.

That's from:
        pr_err("INFO: %s self-detected stall on CPU\n", rcu_state.name);
        raw_spin_lock_irqsave_rcu_node(rdp->mynode, flags);
        print_cpu_stall_info(smp_processor_id());
[that's where this patch adds more prints]
        raw_spin_unlock_irqrestore_rcu_node(rdp->mynode, flags);
        for_each_possible_cpu(cpu)
                totqlen += rcu_get_n_cbs_cpu(cpu);
        pr_cont("\t(t=%lu jiffies g=%ld q=%lu ncpus=%d)\n",
                jiffies - gps,
                (long)rcu_seq_current(&rcu_state.gp_seq), totqlen, rcu_state.n_online_cpus);


One other print has similar construction:
        pr_err("INFO: %s detected stalls on CPUs/tasks:\n", rcu_state.name);
        ...
                                        print_cpu_stall_info(cpu);
        ...
        pr_cont("\t(detected by %d, t=%ld jiffies, g=%ld, q=%lu ncpus=%d)\n",
               smp_processor_id(), (long)(jiffies - gps),
               (long)rcu_seq_current(&rcu_state.gp_seq), totqlen, rcu_state.n_online_cpus);



  reply	other threads:[~2022-11-07  2:11 UTC|newest]

Thread overview: 21+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-11-04  2:12 [PATCH v4 0/4] rcu: Add RCU stall diagnosis information Zhen Lei
2022-11-04  2:12 ` [PATCH v4 1/4] genirq: Fix the return type of kstat_cpu_irqs_sum() Zhen Lei
2022-11-04  2:12 ` [PATCH v4 2/4] sched: Add helper kstat_cpu_softirqs_sum() Zhen Lei
2022-11-04  2:12 ` [PATCH v4 3/4] sched: Add helper nr_context_switches_cpu() Zhen Lei
2022-11-04  2:12 ` [PATCH v4 4/4] rcu: Add RCU stall diagnosis information Zhen Lei
2022-11-04  3:12   ` Leizhen (ThunderTown)
2022-11-04 13:43     ` Paul E. McKenney
2022-11-05  1:58   ` Elliott, Robert (Servers)
2022-11-05  7:03     ` Leizhen (ThunderTown)
2022-11-05 20:32       ` Paul E. McKenney
2022-11-07  3:20         ` Leizhen (ThunderTown)
2022-11-05 20:06     ` Paul E. McKenney
2022-11-07  2:10       ` Elliott, Robert (Servers) [this message]
2022-11-07 11:49         ` Leizhen (ThunderTown)
2022-11-07 20:38           ` Elliott, Robert (Servers)
2022-11-07 21:57             ` Elliott, Robert (Servers)
2022-11-08  3:06               ` Leizhen (ThunderTown)
2022-11-08  5:53                 ` Elliott, Robert (Servers)
2022-11-08  6:41                   ` Leizhen (ThunderTown)
2022-11-08 19:29                     ` Elliott, Robert (Servers)
2022-11-08  3:18             ` Leizhen (ThunderTown)

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=MW5PR84MB1842933AB81EECFAF2ECCE7DAB3C9@MW5PR84MB1842.NAMPRD84.PROD.OUTLOOK.COM \
    --to=elliott@hpe.com \
    --cc=frederic@kernel.org \
    --cc=jiangshanlai@gmail.com \
    --cc=joel@joelfernandes.org \
    --cc=josh@joshtriplett.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mathieu.desnoyers@efficios.com \
    --cc=paulmck@kernel.org \
    --cc=quic_neeraju@quicinc.com \
    --cc=rcu@vger.kernel.org \
    --cc=rostedt@goodmis.org \
    --cc=thunder.leizhen@huawei.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.