From: Sasha Levin <sashal@kernel.org>
To: linux-kernel@vger.kernel.org, stable@vger.kernel.org
Cc: Joel Savitz <jsavitz@redhat.com>,
	Waiman Long <longman@redhat.com>, Phil Auld <pauld@redhat.com>,
	Peter Zijlstra <peterz@infradead.org>, Tejun Heo <tj@kernel.org>,
	Sasha Levin <sashal@kernel.org>,
	cgroups@vger.kernel.org
Subject: [PATCH AUTOSEL 4.19 29/34] cpuset: restore sanity to cpuset_cpus_allowed_fallback()
Date: Tue, 25 Jun 2019 23:43:30 -0400
Message-ID: <20190626034335.23767-29-sashal@kernel.org>
In-Reply-To: <20190626034335.23767-1-sashal@kernel.org>

From: Joel Savitz <jsavitz@redhat.com>

[ Upstream commit d477f8c202d1f0d4791ab1263ca7657bbe5cf79e ]

In the case that a process is constrained by taskset(1) (i.e.
sched_setaffinity(2)) to a subset of available cpus, and all of those are
subsequently offlined, the scheduler will set tsk->cpus_allowed to
the current value of task_cs(tsk)->effective_cpus.

This is done via a call to do_set_cpus_allowed() in the context of
cpuset_cpus_allowed_fallback() made by the scheduler when this case is
detected. This is the only call made to cpuset_cpus_allowed_fallback()
in the latest mainline kernel.

However, this is not sane behavior.

I will demonstrate this on a system running the latest upstream kernel
with the following initial configuration:

	# grep -i cpu /proc/$$/status
	Cpus_allowed:	ffffffff,ffffffff
	Cpus_allowed_list:	0-63

(Where cpus 32-63 are provided via smt.)

If we limit our current shell process to cpu2 only and then offline it
and reonline it:

	# taskset -p 4 $$
	pid 2272's current affinity mask: ffffffffffffffff
	pid 2272's new affinity mask: 4

	# echo off > /sys/devices/system/cpu/cpu2/online
	# dmesg | tail -3
	[ 2195.866089] process 2272 (bash) no longer affine to cpu2
	[ 2195.872700] IRQ 114: no longer affine to CPU2
	[ 2195.879128] smpboot: CPU 2 is now offline

	# echo on > /sys/devices/system/cpu/cpu2/online
	# dmesg | tail -1
	[ 2617.043572] smpboot: Booting Node 0 Processor 2 APIC 0x4

We see that our current process now has an affinity mask containing
every cpu available on the system _except_ the one we originally
constrained it to:

	# grep -i cpu /proc/$$/status
	Cpus_allowed:   ffffffff,fffffffb
	Cpus_allowed_list:      0-1,3-63

This is not sane behavior: not only can the scheduler now place the
process on previously forbidden cpus, it cannot even schedule it on
the cpu it was originally constrained to!
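
As a reading aid, the hex Cpus_allowed values in these transcripts can be
decoded with a small userspace program. The following is a minimal sketch
and not part of the patch: parse_cpus_allowed() and format_cpu_list() are
hypothetical helper names, and the code assumes at most 64 cpus (two
comma-separated 32-bit words, most significant word first).

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Hypothetical helper: turn a Cpus_allowed value such as
 * "ffffffff,fffffffb" into a 64-bit mask. Commas are skipped and each
 * hex digit shifts the accumulated mask left by four bits.
 */
uint64_t parse_cpus_allowed(const char *s)
{
	uint64_t mask = 0;

	assert(s != NULL);
	for (; *s; s++) {
		char digit[2] = { *s, '\0' };

		if (*s == ',')
			continue;
		mask = (mask << 4) | strtoul(digit, NULL, 16);
	}
	return mask;
}

/*
 * Hypothetical helper: render a mask in the style of Cpus_allowed_list,
 * i.e. comma-separated cpu ids with runs collapsed into "first-last".
 */
void format_cpu_list(uint64_t mask, char *buf, size_t len)
{
	size_t off = 0;
	int cpu = 0;

	assert(buf != NULL && len > 0);
	buf[0] = '\0';
	while (cpu < 64) {
		int first, last, n;

		if (!(mask & (UINT64_C(1) << cpu))) {
			cpu++;
			continue;
		}
		/* Found the start of a run; advance to its end. */
		first = cpu;
		while (cpu < 64 && (mask & (UINT64_C(1) << cpu)))
			cpu++;
		last = cpu - 1;
		if (first == last)
			n = snprintf(buf + off, len - off, "%s%d",
				     off ? "," : "", first);
		else
			n = snprintf(buf + off, len - off, "%s%d-%d",
				     off ? "," : "", first, last);
		if (n < 0 || (size_t)n >= len - off)
			break;	/* buffer too small, stop early */
		off += (size_t)n;
	}
}
```

Passing "ffffffff,fffffffb" through both helpers yields the list 0-1,3-63
reported by /proc above, which makes the missing cpu2 bit easy to spot.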

Other cases result in even more exotic affinity masks. Take for instance
a process with an affinity mask containing only cpus provided by smt at
the moment that smt is toggled, in a configuration such as the following:

	# taskset -p f000000000 $$
	# grep -i cpu /proc/$$/status
	Cpus_allowed:	000000f0,00000000
	Cpus_allowed_list:	36-39

A double toggle of smt results in the following behavior:

	# echo off > /sys/devices/system/cpu/smt/control
	# echo on > /sys/devices/system/cpu/smt/control
	# grep -i cpus /proc/$$/status
	Cpus_allowed:	ffffff00,ffffffff
	Cpus_allowed_list:	0-31,40-63

This is even less sane than the previous case, as the new affinity mask
excludes all smt-provided cpus with ids less than those that were
previously in the affinity mask, as well as those that were actually in
the mask.

With this patch applied, both of these cases end in the following state:

	# grep -i cpu /proc/$$/status
	Cpus_allowed:	ffffffff,ffffffff
	Cpus_allowed_list:	0-63

The original policy is discarded. Though not ideal, it is the simplest way
to restore sanity to this fallback case without reinventing the cpuset
wheel that rolls down the kernel just fine in cgroup v2. A user who wishes
for the previous affinity mask to be restored in this fallback case can use
that mechanism instead.

This patch modifies scheduler behavior by instead resetting the mask to
task_cs(tsk)->cpus_allowed by default, and to cpu_possible_mask in legacy
mode. I tested the cases above in both modes.

Note that the scheduler uses this fallback mechanism if and only if
_every_ other valid avenue has been traveled, and it is the last resort
before calling BUG().

Suggested-by: Waiman Long <longman@redhat.com>
Suggested-by: Phil Auld <pauld@redhat.com>
Signed-off-by: Joel Savitz <jsavitz@redhat.com>
Acked-by: Phil Auld <pauld@redhat.com>
Acked-by: Waiman Long <longman@redhat.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 kernel/cgroup/cpuset.c | 15 ++++++++++++++-
 1 file changed, 14 insertions(+), 1 deletion(-)

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 266f10cb7222..ff956ccbb6df 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -2432,10 +2432,23 @@ void cpuset_cpus_allowed(struct task_struct *tsk, struct cpumask *pmask)
 	spin_unlock_irqrestore(&callback_lock, flags);
 }
 
+/**
+ * cpuset_cpus_allowed_fallback - final fallback before complete catastrophe.
+ * @tsk: pointer to task_struct with which the scheduler is struggling
+ *
+ * Description: In the case that the scheduler cannot find an allowed cpu in
+ * tsk->cpus_allowed, we fall back to task_cs(tsk)->cpus_allowed. In legacy
+ * mode however, this value is the same as task_cs(tsk)->effective_cpus,
+ * which will not contain a sane cpumask during cases such as cpu hotplugging.
+ * This is the absolute last resort for the scheduler and it is only used if
+ * _every_ other avenue has been traveled.
+ **/
+
 void cpuset_cpus_allowed_fallback(struct task_struct *tsk)
 {
 	rcu_read_lock();
-	do_set_cpus_allowed(tsk, task_cs(tsk)->effective_cpus);
+	do_set_cpus_allowed(tsk, is_in_v2_mode() ?
+		task_cs(tsk)->cpus_allowed : cpu_possible_mask);
 	rcu_read_unlock();
 
 	/*
-- 
2.20.1

