All of lore.kernel.org
 help / color / mirror / Atom feed
From: "Leizhen (ThunderTown)" <thunder.leizhen@huawei.com>
To: "Elliott, Robert (Servers)" <elliott@hpe.com>,
	"paulmck@kernel.org" <paulmck@kernel.org>
Cc: Frederic Weisbecker <frederic@kernel.org>,
	Neeraj Upadhyay <quic_neeraju@quicinc.com>,
	Josh Triplett <josh@joshtriplett.org>,
	"Steven Rostedt" <rostedt@goodmis.org>,
	Mathieu Desnoyers <mathieu.desnoyers@efficios.com>,
	Lai Jiangshan <jiangshanlai@gmail.com>,
	Joel Fernandes <joel@joelfernandes.org>,
	"rcu@vger.kernel.org" <rcu@vger.kernel.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH v4 4/4] rcu: Add RCU stall diagnosis information
Date: Mon, 7 Nov 2022 19:49:04 +0800	[thread overview]
Message-ID: <5133bd4b-c17f-f5ed-eae2-fe2d199dc5db@huawei.com> (raw)
In-Reply-To: <MW5PR84MB1842933AB81EECFAF2ECCE7DAB3C9@MW5PR84MB1842.NAMPRD84.PROD.OUTLOOK.COM>



On 2022/11/7 10:10, Elliott, Robert (Servers) wrote:
> 
>>>> +config RCU_CPU_STALL_CPUTIME
>>>
>>> Since the information might change and grow over time, consider
>>> calling it "ADDITIONAL_INFO" rather than just "CPUTIME".
>>
>> Except that I already redirected Zhen Lei from a generic description
>> to a more specific one.  The reason for this is that I already get
>> complaints about the large volume of output from RCU CPU stall warnings,
>> which suggests that unless the additional information is quite small,
>> it should get its own Kconfig option and kernel boot parameter.
>>
>> So we well be keeping RCU_CPU_STALL_CPUTIME.
> 
> No problem - CPUTIME is easier to search for, and will likely
> always remain as one of the values reported.
> 
>> But please tie any such patch to an actual use case.  After all, if no
>> one actually uses that additional information, we have irritates untold
>> numbers of electrons for no purpose.
> 
> I've been working on fixing intermittent RCU stalls that turned out to be
> caused by x86-optimized crypto modules holding kernel_fpu_begin()/end()
> too long and crypto tests not calling cond_resched() often enough, so am
> interested in features that will help identify and prevent such
> problems.
> 
>> Also, some of those functions are on fastpaths, so adding unconditional
>> instrumentation overhead might result in an objection or three.
>>
>> As always, choose wisely!  ;-)
>>
>> If I don't see anything from you by this coming Friday, I will fold my
>> usual wordsmithing into the patch.
> 
> I applied the series to a tree not including any crypto module
> fixes, but an overnight run didn't trigger any RCU stalls. So, I
> modified the tcrypt test module to simulate a problem by
> running a lengthy loop after kernel_fpu_begin().  
> Here's an example of how the output looks:
> 
> 
> [ 1816.698380] rcu: INFO: rcu_preempt self-detected stall on CPU
> [ 1816.704368] rcu:     0-....: (2999 ticks this GP) idle=1dbc/1/0x4000000000000000 softirq=39683/39683 fqs=751
> [ 1816.714200] rcu:          hardirqs   softirqs   csw/system
> [ 1816.719922] rcu:  number:        5         10            0
> [ 1816.725643] rcu: cputime:        3          0         1492   ==> 1500(ms)
> [ 1816.732669]  (t=3030 jiffies g=89765 q=308 ncpus=56)
> [ 1816.737857] CPU: 0 PID: 46826 Comm: modprobe Tainted: G        W          6.0.0+ #5
> [ 1816.745754] Hardware name: HPE ProLiant DL360 Gen10/ProLiant DL360 Gen10, BIOS U32 03/08/2022
> [ 1816.754523] RIP: 0010:rude_sleep_cycles.constprop.0+0x1c/0x30 [tcrypt]
> [ 1816.761290] Code: 5d 41 5c 41 5d 41 5e 41 5f c3 cc cc cc cc 0f 1f 44 00 00 0f 31 48 c1 e2 20 be ab 90 41 00 48 89 d1 48 c1 e6 0b 48 09 c1 0f 31 <48> c1 e2 20 48 09 c2 48 29 ca 48 39 d6 73 ef c3 cc cc cc cc 48 8b
> ...
> 
> That makes me realize what "csw/switch" means:
> - "csw" context switches applies to the number line
> - "system" applies the cputime line
>  
> Maybe this shouldn't be a table? Make it grep-friendly:
> [ 1816.719922] rcu: half-timeout counts: hardirq =5 softirq=10 csw=0
> [ 1816.725643] rcu: half_timeout cputimes (ms): time=1500 hardirq=3 softirq=0 system=1492

I prefer the table. Table look clearer and easier to compare.

> 
> The new prints interfere with existing prints, pushing the "t=3030" line 
> further away from its context. That's from a pr_cont. Existing prints after
> the "self-detect stall" message all start with \t, so they are roughly
> related to the first line. The new prints should probably do the same.
> 
> Since the first line ends with \n, the pr_cont will never make it on the
> same line, so it might be cleaner to use pr_err instead. That way it'll have
> the "rcu:" prefix of the other lines.
> 
> That's from:
>         pr_err("INFO: %s self-detected stall on CPU\n", rcu_state.name);
>         raw_spin_lock_irqsave_rcu_node(rdp->mynode, flags);
>         print_cpu_stall_info(smp_processor_id());
> [that's where this patch adds more prints]
>         raw_spin_unlock_irqrestore_rcu_node(rdp->mynode, flags);
>         for_each_possible_cpu(cpu)
>                 totqlen += rcu_get_n_cbs_cpu(cpu);
>         pr_cont("\t(t=%lu jiffies g=%ld q=%lu ncpus=%d)\n",
>                 jiffies - gps,
>                 (long)rcu_seq_current(&rcu_state.gp_seq), totqlen, rcu_state.n_online_cpus);
> 
> 
> One other print has similar construction:
>         pr_err("INFO: %s detected stalls on CPUs/tasks:\n", rcu_state.name);
>         ...
>                                         print_cpu_stall_info(cpu);
>         ...
>         pr_cont("\t(detected by %d, t=%ld jiffies, g=%ld, q=%lu ncpus=%d)\n",
>                smp_processor_id(), (long)(jiffies - gps),
>                (long)rcu_seq_current(&rcu_state.gp_seq), totqlen, rcu_state.n_online_cpus);
> 
> 
> 
> .
> 

-- 
Regards,
  Zhen Lei

  reply	other threads:[~2022-11-07 11:49 UTC|newest]

Thread overview: 21+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-11-04  2:12 [PATCH v4 0/4] rcu: Add RCU stall diagnosis information Zhen Lei
2022-11-04  2:12 ` [PATCH v4 1/4] genirq: Fix the return type of kstat_cpu_irqs_sum() Zhen Lei
2022-11-04  2:12 ` [PATCH v4 2/4] sched: Add helper kstat_cpu_softirqs_sum() Zhen Lei
2022-11-04  2:12 ` [PATCH v4 3/4] sched: Add helper nr_context_switches_cpu() Zhen Lei
2022-11-04  2:12 ` [PATCH v4 4/4] rcu: Add RCU stall diagnosis information Zhen Lei
2022-11-04  3:12   ` Leizhen (ThunderTown)
2022-11-04 13:43     ` Paul E. McKenney
2022-11-05  1:58   ` Elliott, Robert (Servers)
2022-11-05  7:03     ` Leizhen (ThunderTown)
2022-11-05 20:32       ` Paul E. McKenney
2022-11-07  3:20         ` Leizhen (ThunderTown)
2022-11-05 20:06     ` Paul E. McKenney
2022-11-07  2:10       ` Elliott, Robert (Servers)
2022-11-07 11:49         ` Leizhen (ThunderTown) [this message]
2022-11-07 20:38           ` Elliott, Robert (Servers)
2022-11-07 21:57             ` Elliott, Robert (Servers)
2022-11-08  3:06               ` Leizhen (ThunderTown)
2022-11-08  5:53                 ` Elliott, Robert (Servers)
2022-11-08  6:41                   ` Leizhen (ThunderTown)
2022-11-08 19:29                     ` Elliott, Robert (Servers)
2022-11-08  3:18             ` Leizhen (ThunderTown)

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=5133bd4b-c17f-f5ed-eae2-fe2d199dc5db@huawei.com \
    --to=thunder.leizhen@huawei.com \
    --cc=elliott@hpe.com \
    --cc=frederic@kernel.org \
    --cc=jiangshanlai@gmail.com \
    --cc=joel@joelfernandes.org \
    --cc=josh@joshtriplett.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mathieu.desnoyers@efficios.com \
    --cc=paulmck@kernel.org \
    --cc=quic_neeraju@quicinc.com \
    --cc=rcu@vger.kernel.org \
    --cc=rostedt@goodmis.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.