linux-kernel.vger.kernel.org archive mirror
From: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
To: linux-kernel@vger.kernel.org
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
	stable@vger.kernel.org, Tejun Heo <tj@kernel.org>,
	Oleg Nesterov <oleg@redhat.com>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	"Peter Zijlstra (Intel)" <peterz@infradead.org>,
	Thomas Gleixner <tglx@linutronix.de>
Subject: [PATCH 4.10 01/69] cgroup, kthread: close race window where new kthreads can be migrated to non-root cgroups
Date: Wed, 19 Apr 2017 16:36:30 +0200
Message-ID: <20170419141555.177408801@linuxfoundation.org>
In-Reply-To: <20170419141555.114738231@linuxfoundation.org>

4.10-stable review patch.  If anyone has any objections, please let me know.

------------------

From: Tejun Heo <tj@kernel.org>

commit 77f88796cee819b9c4562b0b6b44691b3b7755b1 upstream.

Creation of a kthread goes through a couple of interlocked stages
between the kthread itself and its creator.  Once the new kthread
starts running, it initializes itself and wakes up the creator.  The
creator can then further configure the kthread and let it start doing
its job by waking it up.

In this configuration-by-creator stage, the creator is the only one
that can wake it up but the kthread is visible to userland.  When
altering the kthread's attributes from userland is allowed, this is
fine; however, for cases where CPU affinity is critical,
kthread_bind() is used to first disable affinity changes from userland
and then set the affinity.  This also prevents the kthread from being
migrated into non-root cgroups as that can affect the CPU affinity and
many other things.

Unfortunately, the cgroup side of protection is racy.  While the
PF_NO_SETAFFINITY flag prevents further migrations, userland can win
the race before the creator sets the flag with kthread_bind() and put
the kthread in a non-root cgroup, which can lead to all sorts of
problems including incorrect CPU affinity and starvation.
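
For illustration only (not part of the patch), the window can be
replayed in a small userspace model.  The struct, the flag value and
all model_* / old_* names below are made up; only the shape of the
condition follows the pre-patch check in __cgroup_procs_write().

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Illustrative flag value only; the kernel's PF_NO_SETAFFINITY differs. */
#define MODEL_PF_NO_SETAFFINITY 0x1u

struct model_task {
	unsigned int flags;
	bool is_kthreadd;	/* stands in for tsk == kthreadd_task */
	bool in_root_cgroup;
};

/* Pre-patch rule: only kthreadd and already-bound kthreads are refused. */
static int old_procs_write(struct model_task *tsk)
{
	if (tsk->is_kthreadd || (tsk->flags & MODEL_PF_NO_SETAFFINITY))
		return -EINVAL;
	tsk->in_root_cgroup = false;
	return 0;
}

/* Stand-in for kthread_bind(): from here on the affinity is locked. */
static void model_kthread_bind(struct model_task *tsk)
{
	tsk->flags |= MODEL_PF_NO_SETAFFINITY;
}

/*
 * Replay the losing interleaving: userland writes the new kthread's pid
 * to a non-root cgroup before the creator reaches kthread_bind().
 * Returns true if a bound kthread ended up outside the root cgroup.
 */
static bool old_check_loses_race(void)
{
	struct model_task worker = { .in_root_cgroup = true };

	if (old_procs_write(&worker) != 0)	/* userland goes first */
		return false;
	model_kthread_bind(&worker);		/* creator arrives too late */

	return !worker.in_root_cgroup &&
	       (worker.flags & MODEL_PF_NO_SETAFFINITY) != 0;
}
```

In this model the write is only refused once the bind has already
happened, which is exactly the ordering userland does not guarantee.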

This bug got triggered by userland which periodically tries to migrate
all processes in the root cpuset cgroup to a non-root one.  Per-cpu
workqueue workers got caught while being created and ended up with
incorrect CPU affinity, breaking concurrency management and sometimes
stalling workqueue execution.

This patch adds task->no_cgroup_migration, which prevents the task
from being migrated by userland.  kthreadd starts with the flag set,
making every child kthread start in the root cgroup with migration
disallowed.  The flag is cleared after the kthread finishes
initialization, by which time PF_NO_SETAFFINITY is set if the kthread
should stay in the root cgroup.
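
As a rough sketch of the resulting lifecycle, here is a userspace
model (not kernel code).  The model_* and new_procs_write() names and
the flag value are hypothetical; the check itself has the same shape
as the one the patch puts into __cgroup_procs_write(), and copying the
parent struct stands in for the child inheriting the bit from
kthreadd.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

#define MODEL_PF_NO_SETAFFINITY 0x1u	/* illustrative value only */

struct model_task {
	unsigned int flags;
	unsigned int no_cgroup_migration:1;
	bool in_root_cgroup;
};

/* kthreadd marks itself once, like cgroup_init_kthreadd() in the patch. */
static void model_init_kthreadd(struct model_task *kthreadd)
{
	kthreadd->flags = 0;
	kthreadd->no_cgroup_migration = 1;
	kthreadd->in_root_cgroup = true;
}

/* A new kthread starts as a copy of kthreadd, so the bit comes along. */
static struct model_task model_fork(const struct model_task *parent)
{
	assert(parent);
	return *parent;
}

/* Stand-in for kthread_bind() called by the creator. */
static void model_kthread_bind(struct model_task *tsk)
{
	tsk->flags |= MODEL_PF_NO_SETAFFINITY;
}

/* Like cgroup_kthread_ready(): the kthread lifts its own gate. */
static void model_kthread_ready(struct model_task *tsk)
{
	tsk->no_cgroup_migration = 0;
}

/* Post-patch rule: refuse while either protection is in place. */
static int new_procs_write(struct model_task *tsk)
{
	if (tsk->no_cgroup_migration ||
	    (tsk->flags & MODEL_PF_NO_SETAFFINITY))
		return -EINVAL;
	tsk->in_root_cgroup = false;
	return 0;
}
```

A kthread that was bound before becoming ready is refused at every
step, while an unbound one is refused only until it calls
model_kthread_ready().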

It'd be better to wait for the initialization instead of failing but I
couldn't think of a way of implementing that without adding either a
new PF flag, or sleeping and retrying from the waiting side.  Even if
userland depends on changing cgroup membership of a kthread, it either
has to be synchronized with kthread_create() or periodically repeat,
so it's unlikely that this would break anything.
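
On the userland side, the "periodically repeat" option might look
roughly like the sketch below.  This is an assumption about how a tool
could cope, not something the patch ships: the helper names, the
attempt count and the delay are invented, and the caller has to supply
the path of the target cgroup's procs file.  Note that -EINVAL can
also be returned for other reasons (for example a kthread that is
permanently bound), so the retry has to be bounded.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
#include <time.h>
#include <unistd.h>

/* Write one pid to the given procs file; returns 0 or a negative errno. */
static int write_pid_once(const char *procs_path, pid_t pid)
{
	char buf[32];
	int len = snprintf(buf, sizeof(buf), "%ld\n", (long)pid);
	int fd = open(procs_path, O_WRONLY);
	ssize_t n;
	int err = 0;

	if (fd < 0)
		return -errno;
	n = write(fd, buf, (size_t)len);
	if (n < 0)
		err = -errno;
	else if (n != (ssize_t)len)
		err = -EIO;
	close(fd);
	return err;
}

/*
 * Retry only on -EINVAL, which is what the kernel returns while the
 * target is still protected.  Any other result ends the loop.
 */
static int migrate_with_retry(const char *procs_path, pid_t pid,
			      int attempts, long delay_ms)
{
	struct timespec delay = {
		.tv_sec = delay_ms / 1000,
		.tv_nsec = (delay_ms % 1000) * 1000000L,
	};
	int err = -EINVAL;

	while (attempts-- > 0) {
		err = write_pid_once(procs_path, pid);
		if (err != -EINVAL)
			break;
		if (attempts > 0)
			nanosleep(&delay, NULL);
	}
	return err;
}
```

A caller would pass something like the non-root cgroup's cgroup.procs
path and treat a final -EINVAL as "leave this task where it is".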

v2: Switch to a simpler implementation using a new task_struct bit
    field suggested by Oleg.

Signed-off-by: Tejun Heo <tj@kernel.org>
Suggested-by: Oleg Nesterov <oleg@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Reported-and-debugged-by: Chris Mason <clm@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

---
 include/linux/cgroup.h |   21 +++++++++++++++++++++
 include/linux/sched.h  |    4 ++++
 kernel/cgroup.c        |    9 +++++----
 kernel/kthread.c       |    3 +++
 4 files changed, 33 insertions(+), 4 deletions(-)

--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -570,6 +570,25 @@ static inline void pr_cont_cgroup_path(s
 	pr_cont_kernfs_path(cgrp->kn);
 }
 
+static inline void cgroup_init_kthreadd(void)
+{
+	/*
+	 * kthreadd is inherited by all kthreads, keep it in the root so
+	 * that the new kthreads are guaranteed to stay in the root until
+	 * initialization is finished.
+	 */
+	current->no_cgroup_migration = 1;
+}
+
+static inline void cgroup_kthread_ready(void)
+{
+	/*
+	 * This kthread finished initialization.  The creator should have
+	 * set PF_NO_SETAFFINITY if this kthread should stay in the root.
+	 */
+	current->no_cgroup_migration = 0;
+}
+
 #else /* !CONFIG_CGROUPS */
 
 struct cgroup_subsys_state;
@@ -590,6 +609,8 @@ static inline void cgroup_free(struct ta
 
 static inline int cgroup_init_early(void) { return 0; }
 static inline int cgroup_init(void) { return 0; }
+static inline void cgroup_init_kthreadd(void) {}
+static inline void cgroup_kthread_ready(void) {}
 
 static inline bool task_under_cgroup_hierarchy(struct task_struct *task,
 					       struct cgroup *ancestor)
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1620,6 +1620,10 @@ struct task_struct {
 #ifdef CONFIG_COMPAT_BRK
 	unsigned brk_randomized:1;
 #endif
+#ifdef CONFIG_CGROUPS
+	/* disallow userland-initiated cgroup migration */
+	unsigned no_cgroup_migration:1;
+#endif
 
 	unsigned long atomic_flags; /* Flags needing atomic access. */
 
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -2920,11 +2920,12 @@ static ssize_t __cgroup_procs_write(stru
 		tsk = tsk->group_leader;
 
 	/*
-	 * Workqueue threads may acquire PF_NO_SETAFFINITY and become
-	 * trapped in a cpuset, or RT worker may be born in a cgroup
-	 * with no rt_runtime allocated.  Just say no.
+	 * kthreads may acquire PF_NO_SETAFFINITY during initialization.
+	 * If userland migrates such a kthread to a non-root cgroup, it can
+	 * become trapped in a cpuset, or RT kthread may be born in a
+	 * cgroup with no rt_runtime allocated.  Just say no.
 	 */
-	if (tsk == kthreadd_task || (tsk->flags & PF_NO_SETAFFINITY)) {
+	if (tsk->no_cgroup_migration || (tsk->flags & PF_NO_SETAFFINITY)) {
 		ret = -EINVAL;
 		goto out_unlock_rcu;
 	}
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -18,6 +18,7 @@
 #include <linux/freezer.h>
 #include <linux/ptrace.h>
 #include <linux/uaccess.h>
+#include <linux/cgroup.h>
 #include <trace/events/sched.h>
 
 static DEFINE_SPINLOCK(kthread_create_lock);
@@ -223,6 +224,7 @@ static int kthread(void *_create)
 
 	ret = -EINTR;
 	if (!test_bit(KTHREAD_SHOULD_STOP, &self->flags)) {
+		cgroup_kthread_ready();
 		__kthread_parkme(self);
 		ret = threadfn(data);
 	}
@@ -536,6 +538,7 @@ int kthreadd(void *unused)
 	set_mems_allowed(node_states[N_MEMORY]);
 
 	current->flags |= PF_NOFREEZE;
+	cgroup_init_kthreadd();
 
 	for (;;) {
 		set_current_state(TASK_INTERRUPTIBLE);
