From: "Paul E. McKenney" <paulmck@linux.ibm.com>
To: Peter Zijlstra <peterz@infradead.org>
Cc: rcu@vger.kernel.org, linux-kernel@vger.kernel.org,
mingo@kernel.org, jiangshanlai@gmail.com, dipankar@in.ibm.com,
akpm@linux-foundation.org, mathieu.desnoyers@efficios.com,
josh@joshtriplett.org, tglx@linutronix.de, rostedt@goodmis.org,
dhowells@redhat.com, edumazet@google.com, fweisbec@gmail.com,
oleg@redhat.com, joel@joelfernandes.org
Subject: Re: [PATCH RFC tip/core/rcu 14/14] rcu/nohz: Make multi_cpu_stop() enable tick on all online CPUs
Date: Thu, 8 Aug 2019 14:30:12 -0700 [thread overview]
Message-ID: <20190808213012.GA28773@linux.ibm.com> (raw)
In-Reply-To: <20190808203541.GA8160@linux.ibm.com>
On Thu, Aug 08, 2019 at 01:35:41PM -0700, Paul E. McKenney wrote:
> On Wed, Aug 07, 2019 at 02:41:31PM -0700, Paul E. McKenney wrote:
> > On Tue, Aug 06, 2019 at 11:08:24AM -0700, Paul E. McKenney wrote:
> > > On Mon, Aug 05, 2019 at 10:48:00AM -0700, Paul E. McKenney wrote:
> > > > On Mon, Aug 05, 2019 at 05:50:24PM +0200, Peter Zijlstra wrote:
> > > > > On Mon, Aug 05, 2019 at 07:54:48AM -0700, Paul E. McKenney wrote:
> > > > >
> > > > > > > Right; so clearly we're not understanding what's happening. That seems
> > > > > > > like a requirement for actually doing a patch.
> > > > > >
> > > > > > Almost but not quite. It is a requirement for a patch *that* *is*
> > > > > > *supposed* *to* *be* *a* *fix*. If you are trying to prohibit me from
> > > > > > writing experimental patches, please feel free to take a long walk on
> > > > > > a short pier.
> > > > > >
> > > > > > Understood???
> > > > >
> > > > > Ah, my bad, I thought you were actually proposing this as an actual
> > > > > patch. I now see that is my bad, I'd overlooked the RFC part.
> > > >
> > > > No problem!
> > > >
> > > > And of course adding tracing decreases the frequency and duration of
> > > > the multi_cpu_stop(). Re-running with shorter-duration triggering. ;-)
> > >
> > > And I did eventually get a good trace. If I am interpreting this trace
> > > correctly, the torture_-135 task didn't get around to attempting to wake
> > > up all of the CPUs. I will try again, but this time with the sched_switch
> > > trace event enabled.
> > >
> > > As a side note, enabling ftrace from the command line seems to interact
> > > badly with turning tracing off and on in the kernel, so I eventually
> > > resorted to trace_printk() in the functions of interest. The trace
> > > output is below, followed by the current diagnostic patch. Please note
> > > that I am -not- using the desperation hammer-the-scheduler patches.
> > >
> > > More as I learn more!
> >
> > And of course I forgot to dump out the online CPUs, so I really had no
> > idea whether or not all the CPUs were accounted for. I added tracing
> > to dump out the online CPUs at the beginning of __stop_cpus() and then
> > reworked it a few times to get the problem to happen in reasonable time.
> > Please see below for the resulting annotated trace.
> >
> > I was primed to expect a lost IPI, perhaps due to yet another qemu bug,
> > but all the migration threads are running within about 2 milliseconds.
> > It is then almost two minutes(!) until the next trace message.
> >
> > Looks like time to (very carefully!) instrument multi_cpu_stop().
> >
> > Of course, if you have any ideas, please do not keep them a secret!
>
> Functionally, multi_cpu_stop() is working fine, according to the trace
> below (search for a line beginning with TAB). But somehow CPU 2 took
> almost three -minutes- to do one iteration of the loop. The prime suspect
> in that loop is cpu_relax() due to the hypervisor having an opportunity
> to do something at that point. The commentary below (again, search for
> a line beginning with TAB) gives my analysis.
>
> Of course, if I am correct, it should be possible to catch cpu_relax()
> in the act. That is the next step, give or take the Heisenbuggy nature
> of this beast.
>
> Another thing for me to try is to run longer with !NO_HZ_FULL, just in
> case the earlier runs just got lucky.
>
> Thoughts?
And it really can happen:
[ 1881.467922] migratio-33 4...1 1879530317us : stop_machine_yield: cpu_relax() took 756140 ms
The previous timestamp was 1123391100us, so the cpu_relax() is almost
exactly the full delay.
But another instance stalled for many minutes without a ten-second
cpu_relax(). So it is not just cpu_relax() causing trouble. I could
rationalize that vCPU preemption being at fault...
And my diagnostic patch is below, just in case I am doing something
stupid with that.
Thanx, Paul
------------------------------------------------------------------------
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index ce00b442ced0..1a50ed258ef0 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -3569,6 +3569,7 @@ void __init rcu_init(void)
rcu_par_gp_wq = alloc_workqueue("rcu_par_gp", WQ_MEM_RECLAIM, 0);
WARN_ON(!rcu_par_gp_wq);
srcu_init();
+ tracing_off();
}
#include "tree_stall.h"
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 0b22e55cebe8..a5a879a49051 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -479,6 +479,7 @@ void wake_up_q(struct wake_q_head *head)
{
struct wake_q_node *node = head->first;
+ trace_printk("entered\n");
while (node != WAKE_Q_TAIL) {
struct task_struct *task;
@@ -509,6 +510,7 @@ void resched_curr(struct rq *rq)
struct task_struct *curr = rq->curr;
int cpu;
+ trace_printk("entered\n");
lockdep_assert_held(&rq->lock);
if (test_tsk_need_resched(curr))
@@ -1197,6 +1199,7 @@ static inline void dequeue_task(struct rq *rq, struct task_struct *p, int flags)
void activate_task(struct rq *rq, struct task_struct *p, int flags)
{
+ trace_printk("entered\n");
if (task_contributes_to_load(p))
rq->nr_uninterruptible--;
@@ -1298,6 +1301,7 @@ void check_preempt_curr(struct rq *rq, struct task_struct *p, int flags)
{
const struct sched_class *class;
+ trace_printk("entered\n");
if (p->sched_class == rq->curr->sched_class) {
rq->curr->sched_class->check_preempt_curr(rq, p, flags);
} else {
@@ -2097,6 +2101,7 @@ ttwu_stat(struct task_struct *p, int cpu, int wake_flags)
static void ttwu_do_wakeup(struct rq *rq, struct task_struct *p, int wake_flags,
struct rq_flags *rf)
{
+ trace_printk("entered\n");
check_preempt_curr(rq, p, wake_flags);
p->state = TASK_RUNNING;
trace_sched_wakeup(p);
@@ -2132,6 +2137,7 @@ ttwu_do_activate(struct rq *rq, struct task_struct *p, int wake_flags,
{
int en_flags = ENQUEUE_WAKEUP | ENQUEUE_NOCLOCK;
+ trace_printk("entered\n");
lockdep_assert_held(&rq->lock);
#ifdef CONFIG_SMP
@@ -2178,9 +2184,11 @@ void sched_ttwu_pending(void)
struct task_struct *p, *t;
struct rq_flags rf;
+ trace_printk("entered\n");
if (!llist)
return;
+ trace_printk("non-NULL llist\n");
rq_lock_irqsave(rq, &rf);
update_rq_clock(rq);
@@ -2192,6 +2200,7 @@ void sched_ttwu_pending(void)
void scheduler_ipi(void)
{
+ trace_printk("entered\n");
/*
* Fold TIF_NEED_RESCHED into the preempt_count; anybody setting
* TIF_NEED_RESCHED remotely (for the first time) will also send
@@ -2232,6 +2241,7 @@ static void ttwu_queue_remote(struct task_struct *p, int cpu, int wake_flags)
{
struct rq *rq = cpu_rq(cpu);
+ trace_printk("%s entered, CPU %d\n", __func__, cpu);
p->sched_remote_wakeup = !!(wake_flags & WF_MIGRATED);
if (llist_add(&p->wake_entry, &cpu_rq(cpu)->wake_list)) {
@@ -2277,6 +2287,7 @@ static void ttwu_queue(struct task_struct *p, int cpu, int wake_flags)
struct rq *rq = cpu_rq(cpu);
struct rq_flags rf;
+ trace_printk("%s entered\n", __func__);
#if defined(CONFIG_SMP)
if (sched_feat(TTWU_QUEUE) && !cpus_share_cache(smp_processor_id(), cpu)) {
sched_clock_cpu(cpu); /* Sync clocks across CPUs */
@@ -2399,6 +2410,7 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
unsigned long flags;
int cpu, success = 0;
+ trace_printk("entered\n");
preempt_disable();
if (p == current) {
/*
@@ -2545,6 +2557,7 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
*/
int wake_up_process(struct task_struct *p)
{
+ trace_printk("entered\n");
return try_to_wake_up(p, TASK_NORMAL, 0);
}
EXPORT_SYMBOL(wake_up_process);
diff --git a/kernel/stop_machine.c b/kernel/stop_machine.c
index 5c2b2f90fae1..a07f77b9c1f2 100644
--- a/kernel/stop_machine.c
+++ b/kernel/stop_machine.c
@@ -21,6 +21,7 @@
#include <linux/atomic.h>
#include <linux/nmi.h>
#include <linux/sched/wake_q.h>
+#include <linux/sched/clock.h>
/*
* Structure to determine completion condition and record errors. May
@@ -80,6 +81,7 @@ static bool cpu_stop_queue_work(unsigned int cpu, struct cpu_stop_work *work)
unsigned long flags;
bool enabled;
+ trace_printk("entered for CPU %u\n", cpu);
preempt_disable();
raw_spin_lock_irqsave(&stopper->lock, flags);
enabled = stopper->enabled;
@@ -167,7 +169,7 @@ static void set_state(struct multi_stop_data *msdata,
/* Reset ack counter. */
atomic_set(&msdata->thread_ack, msdata->num_threads);
smp_wmb();
- msdata->state = newstate;
+ WRITE_ONCE(msdata->state, newstate);
}
/* Last one to ack a state moves to the next state. */
@@ -179,7 +181,15 @@ static void ack_state(struct multi_stop_data *msdata)
void __weak stop_machine_yield(const struct cpumask *cpumask)
{
+ u64 starttime = local_clock();
+ u64 endtime;
+ const u64 delta = 100ULL * 1000ULL * 1000ULL * 1000ULL;
+
cpu_relax();
+ endtime = local_clock();
+ if (time_after64(endtime, starttime + delta))
+ trace_printk("cpu_relax() took %llu ms\n",
+ (endtime - starttime) / (1000ULL * 1000ULL));
}
/* This is the cpu_stop function which stops the CPU. */
@@ -210,8 +220,9 @@ static int multi_cpu_stop(void *data)
do {
/* Chill out and ensure we re-read multi_stop_state. */
stop_machine_yield(cpumask);
- if (msdata->state != curstate) {
- curstate = msdata->state;
+ if (READ_ONCE(msdata->state) != curstate) {
+ curstate = READ_ONCE(msdata->state);
+ trace_printk("curstate = %d, ack = %d\n", curstate, atomic_read(&msdata->thread_ack));
switch (curstate) {
case MULTI_STOP_DISABLE_IRQ:
local_irq_disable();
@@ -382,6 +393,7 @@ static bool queue_stop_cpus_work(const struct cpumask *cpumask,
* preempted by a stopper which might wait for other stoppers
* to enter @fn which can lead to deadlock.
*/
+ trace_printk("entered\n");
preempt_disable();
stop_cpus_in_progress = true;
for_each_cpu(cpu, cpumask) {
@@ -402,11 +414,18 @@ static int __stop_cpus(const struct cpumask *cpumask,
cpu_stop_fn_t fn, void *arg)
{
struct cpu_stop_done done;
+ unsigned long j = jiffies;
+ tracing_on();
+ trace_printk("entered\n");
+ trace_printk("CPUs %*pbl online\n", cpumask_pr_args(cpu_online_mask));
cpu_stop_init_done(&done, cpumask_weight(cpumask));
if (!queue_stop_cpus_work(cpumask, fn, arg, &done))
return -ENOENT;
wait_for_completion(&done.completion);
+ tracing_off();
+ if (time_after(jiffies, j + HZ * 20))
+ ftrace_dump(DUMP_ALL);
return done.ret;
}
@@ -442,6 +461,7 @@ int stop_cpus(const struct cpumask *cpumask, cpu_stop_fn_t fn, void *arg)
{
int ret;
+ trace_printk("entered\n");
/* static works are used, process one request at a time */
mutex_lock(&stop_cpus_mutex);
ret = __stop_cpus(cpumask, fn, arg);
@@ -599,6 +619,7 @@ int stop_machine_cpuslocked(cpu_stop_fn_t fn, void *data,
.active_cpus = cpus,
};
+ trace_printk("entered\n");
lockdep_assert_cpus_held();
if (!stop_machine_initialized) {
next prev parent reply other threads:[~2019-08-08 21:30 UTC|newest]
Thread overview: 57+ messages / expand[flat|nested] mbox.gz Atom feed top
2019-08-02 15:14 [PATCH tip/core/rcu 0/14] No-CBs bypass addition for v5.4 Paul E. McKenney
2019-08-02 15:14 ` [PATCH RFC tip/core/rcu 01/14] rcu/nocb: Atomic ->len field in rcu_segcblist structure Paul E. McKenney
2019-08-04 14:50 ` Peter Zijlstra
2019-08-04 14:52 ` Peter Zijlstra
2019-08-04 18:45 ` Paul E. McKenney
2019-08-04 18:42 ` Paul E. McKenney
2019-08-02 15:14 ` [PATCH RFC tip/core/rcu 02/14] rcu/nocb: Add bypass callback queueing Paul E. McKenney
2019-08-07 0:03 ` Joel Fernandes
2019-08-07 0:16 ` Joel Fernandes
2019-08-07 0:35 ` Paul E. McKenney
2019-08-07 0:40 ` Steven Rostedt
2019-08-07 1:17 ` Paul E. McKenney
2019-08-07 1:24 ` Steven Rostedt
2019-08-07 3:47 ` Paul E. McKenney
2019-08-02 15:14 ` [PATCH RFC tip/core/rcu 03/14] rcu/nocb: EXP Check use and usefulness of ->nocb_lock_contended Paul E. McKenney
2019-08-02 15:14 ` [PATCH RFC tip/core/rcu 04/14] rcu/nocb: Print no-CBs diagnostics when rcutorture writer unduly delayed Paul E. McKenney
2019-08-02 15:14 ` [PATCH RFC tip/core/rcu 05/14] rcu/nocb: Avoid synchronous wakeup in __call_rcu_nocb_wake() Paul E. McKenney
2019-08-02 15:14 ` [PATCH RFC tip/core/rcu 06/14] rcu/nocb: Advance CBs after merge in rcutree_migrate_callbacks() Paul E. McKenney
2019-08-02 15:14 ` [PATCH RFC tip/core/rcu 07/14] rcu/nocb: Reduce nocb_cb_wait() leaf rcu_node ->lock contention Paul E. McKenney
2019-08-02 15:14 ` [PATCH RFC tip/core/rcu 08/14] rcu/nocb: Reduce __call_rcu_nocb_wake() " Paul E. McKenney
2019-08-02 15:14 ` [PATCH RFC tip/core/rcu 09/14] rcu/nocb: Don't wake no-CBs GP kthread if timer posted under overload Paul E. McKenney
2019-08-02 15:14 ` [PATCH RFC tip/core/rcu 10/14] rcu: Allow rcu_do_batch() to dynamically adjust batch sizes Paul E. McKenney
2019-08-02 15:14 ` [PATCH RFC tip/core/rcu 11/14] EXP nohz: Add TICK_DEP_BIT_RCU Paul E. McKenney
2019-08-02 15:14 ` [PATCH RFC tip/core/rcu 12/14] rcu/nohz: Force on tick when invoking lots of callbacks Paul E. McKenney
2019-08-02 15:15 ` [PATCH RFC tip/core/rcu 13/14] rcutorture: Force on tick for readers and callback flooders Paul E. McKenney
2019-08-02 15:15 ` [PATCH RFC tip/core/rcu 14/14] rcu/nohz: Make multi_cpu_stop() enable tick on all online CPUs Paul E. McKenney
2019-08-04 14:43 ` Peter Zijlstra
2019-08-04 14:48 ` Peter Zijlstra
2019-08-04 18:41 ` Paul E. McKenney
2019-08-04 20:24 ` Paul E. McKenney
2019-08-05 4:19 ` Paul E. McKenney
2019-08-05 8:07 ` Peter Zijlstra
2019-08-05 14:47 ` Paul E. McKenney
2019-08-05 8:05 ` Peter Zijlstra
2019-08-05 14:54 ` Paul E. McKenney
2019-08-05 15:50 ` Peter Zijlstra
2019-08-05 17:48 ` Paul E. McKenney
2019-08-06 18:08 ` Paul E. McKenney
2019-08-07 21:41 ` Paul E. McKenney
2019-08-08 20:35 ` Paul E. McKenney
2019-08-08 21:30 ` Paul E. McKenney [this message]
2019-08-09 16:51 ` Paul E. McKenney
2019-08-09 18:07 ` Joel Fernandes
2019-08-09 18:39 ` Paul E. McKenney
2019-08-12 21:02 ` Frederic Weisbecker
2019-08-12 23:23 ` Paul E. McKenney
2019-08-13 1:33 ` Joel Fernandes
2019-08-13 12:30 ` Frederic Weisbecker
2019-08-13 14:48 ` Paul E. McKenney
2019-08-14 17:55 ` Joel Fernandes
2019-08-14 22:05 ` Paul E. McKenney
2019-08-15 15:07 ` Joel Fernandes
2019-08-15 17:23 ` Paul E. McKenney
2019-08-15 18:15 ` Joel Fernandes
2019-08-15 18:39 ` Paul E. McKenney
2019-08-15 19:42 ` Joel Fernandes
2019-08-13 21:06 ` Paul E. McKenney
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20190808213012.GA28773@linux.ibm.com \
--to=paulmck@linux.ibm.com \
--cc=akpm@linux-foundation.org \
--cc=dhowells@redhat.com \
--cc=dipankar@in.ibm.com \
--cc=edumazet@google.com \
--cc=fweisbec@gmail.com \
--cc=jiangshanlai@gmail.com \
--cc=joel@joelfernandes.org \
--cc=josh@joshtriplett.org \
--cc=linux-kernel@vger.kernel.org \
--cc=mathieu.desnoyers@efficios.com \
--cc=mingo@kernel.org \
--cc=oleg@redhat.com \
--cc=peterz@infradead.org \
--cc=rcu@vger.kernel.org \
--cc=rostedt@goodmis.org \
--cc=tglx@linutronix.de \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).