From: Tejun Heo <tj@kernel.org>
To: "Paul E. McKenney" <paulmck@kernel.org>
Cc: jiangshanlai@gmail.com, linux-kernel@vger.kernel.org,
	Peter Zijlstra <peterz@infradead.org>,
	Ingo Molnar <mingo@redhat.com>,
	Thomas Gleixner <tglx@linutronix.de>
Subject: Re: Workqueues splat due to ending up on wrong CPU
Date: Mon, 2 Dec 2019 12:13:38 -0800
Message-ID: <20191202201338.GH16681@devbig004.ftw2.facebook.com>
In-Reply-To: <20191202015548.GA13391@paulmck-ThinkPad-P72>

Hello, Paul.

(cc'ing scheduler folks - under the rcu cpu hotplug torture test, the
workqueue rescuer is very occasionally triggering a warning which says
that it isn't on the cpu it should be on.  The warning checks that
smp_processor_id() is the expected one after a successful
set_cpus_allowed_ptr() call.)

On Sun, Dec 01, 2019 at 05:55:48PM -0800, Paul E. McKenney wrote:
> > And hyperthreading seems to have done the trick!  One splat thus far,
> > shown below.  The run should complete this evening, Pacific Time.
> 
> That was the only one for that run, but another 24*56-hour run got three
> more.  All of them expected to be on CPU 0 (which never goes offline, so
> why?) and the "XXX" diagnostic never did print.

Heh, I didn't expect that, so maybe set_cpus_allowed_ptr() is
returning 0 while not migrating the rescuer task to the target cpu for
some reason?

The rescuer always makes the call to migrate itself, so it must be
running at that point.  set_cpus_allowed_ptr() migrates running tasks
by calling stop_one_cpu(), which schedules a migration function that
runs from a highpri task on the target cpu.  Please take a look at the
following.

  static bool cpu_stop_queue_work(unsigned int cpu, struct cpu_stop_work *work)
  {
          ...
          enabled = stopper->enabled;
          if (enabled)
                  __cpu_stop_queue_work(stopper, work, &wakeq);
          else if (work->done)
                  cpu_stop_signal_done(work->done);
          ...
  }

So, if stopper->enabled is clear, it'll signal completion without
running the work.  stopper->enabled is cleared during cpu hotunplug
and restored from bringup_cpu() while cpu is being brought back up.

  static int bringup_wait_for_ap(unsigned int cpu)
  {
          ...
          stop_machine_unpark(cpu);
          ...
  }

  static int bringup_cpu(unsigned int cpu)
  {
          ...
          ret = __cpu_up(cpu, idle);
          ...
          return bringup_wait_for_ap(cpu);
  }

__cpu_up() is what marks the cpu online.  Once the cpu is online,
kthreads are free to migrate into it, so it looks like there's a brief
window where the cpu is marked online but its stopper thread is still
disabled.  During that window a kthread may schedule into the cpu but
not out of it, which would explain the symptom that you were seeing.
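
To make the suspected window concrete, here is a rough userspace
sketch.  Nothing in it is kernel code: struct model_cpu and the
model_*() helpers are hypothetical names that only mimic the ordering
of __cpu_up() and stop_machine_unpark() and the "signal done without
running the work" branch quoted above.

```c
#include <stdbool.h>

/* Hypothetical userspace model; these are not kernel APIs. */
struct model_cpu {
	bool online;		/* what __cpu_up() sets in this model */
	bool stopper_enabled;	/* what stop_machine_unpark() sets in this model */
};

/* First half of bringup: the cpu becomes visible as online. */
static void model_cpu_up(struct model_cpu *c)
{
	c->online = true;
}

/* Second half of bringup: the stopper starts accepting work. */
static void model_stopper_unpark(struct model_cpu *c)
{
	c->stopper_enabled = true;
}

/*
 * Ask the stopper on the source cpu to push a task to dst_id.  As in the
 * quoted cpu_stop_queue_work(), a disabled stopper signals completion
 * without running the work, so the caller cannot tell the difference.
 * Returns the cpu the task ends up on.
 */
static int model_migrate(const struct model_cpu *src, int src_id, int dst_id)
{
	if (src->stopper_enabled)
		return dst_id;	/* migration work actually ran */
	return src_id;		/* completion signalled, task did not move */
}
```

In this model, calling model_migrate() between model_cpu_up() and
model_stopper_unpark() leaves the task on the source cpu, while the
same call after model_stopper_unpark() moves it.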

This makes the cpumask and the cpu the task is actually on disagree,
and retries would become no-ops.  I can work around it by excluding
rescuer attachments against hotplugs, but this looks like a genuine cpu
hotplug bug.
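
A small sketch of why a retry wouldn't help, under the assumption that
a call which leaves the mask unchanged returns early.  struct
model_task and model_set_allowed() are made-up names for illustration
and are not the scheduler's implementation.

```c
#include <stdbool.h>

/* Hypothetical model of a task pinned to a single cpu. */
struct model_task {
	int allowed_cpu;	/* stands in for the task's cpumask */
	int cur_cpu;		/* cpu the task is actually running on */
};

/*
 * Model of a set_cpus_allowed_ptr()-like call.  stopper_enabled is the
 * state of the stopper that would carry out the migration.  Returns 0
 * in every case, which is the point of the model.
 */
static int model_set_allowed(struct model_task *t, int dst, bool stopper_enabled)
{
	if (t->allowed_cpu == dst)
		return 0;	/* assumption: unchanged mask means nothing to do */

	t->allowed_cpu = dst;	/* the mask is updated first */
	if (stopper_enabled)
		t->cur_cpu = dst;	/* migration work ran */
	return 0;		/* a lost migration still reports success */
}
```

Here a first call made while the stopper is disabled updates
allowed_cpu but not cur_cpu, and a second call with the same
destination returns 0 without touching cur_cpu, even if the stopper
has been enabled in the meantime.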

It could be that I'm misreading the code.  What do you guys think?

Thanks.

-- 
tejun
