* [PATCH 1/2] kthread: add barriers to set_kthread_struct() and to_kthread()
@ 2017-03-15 23:18 Tejun Heo
  2017-03-15 23:19 ` [PATCH 2/2] kthread, cgroup: close race window where new kthreads can be migrated to non-root cgroups Tejun Heo
  2017-03-16 14:54 ` [PATCH 1/2] kthread: add barriers to set_kthread_struct() and to_kthread() Oleg Nesterov
  0 siblings, 2 replies; 25+ messages in thread
From: Tejun Heo @ 2017-03-15 23:18 UTC (permalink / raw)
  To: Oleg Nesterov, Linus Torvalds, Andrew Morton
  Cc: Peter Zijlstra, Thomas Gleixner, Chris Mason, linux-kernel, kernel-team

Until now, all to_kthread() users are interlocked with kthread
creation and there's no need to have explicit barriers when setting
the kthread pointer or dereferencing it.

However, there is a race condition where userland can interfere with a
kthread while it's being initialized.  Closing it requires using
to_kthread() from an unsynchronized context.

This patch moves struct kthread initialization before
set_kthread_struct() and adds matching barriers in
set_kthread_struct() and to_kthread(), so that dereferencing
to_kthread() always returns initialized fields.
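
In other words, the pairing is the usual pointer-publication pattern,
roughly as follows.  This is only an illustration of the ordering, not
part of the diff below:

	/* new kthread: initialize first, then publish */
	self->flags = 0;
	self->data = data;
	init_completion(&self->exited);
	init_completion(&self->parked);
	smp_wmb();				/* A */
	current->set_child_tid = (__force void __user *)self;

	/* unsynchronized reader, via to_kthread() */
	struct kthread *kthread = (__force void *)k->set_child_tid;

	smp_read_barrier_depends();		/* B, pairs with A */
	if (kthread) {
		/* everything written before A is visible here */
	}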

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Chris Mason <clm@fb.com>
Cc: stable@vger.kernel.org # v4.3+ (we can't close the race < v4.3)
---
 kernel/kthread.c |   30 +++++++++++++++++++++++-------
 1 file changed, 23 insertions(+), 7 deletions(-)

--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -57,6 +57,9 @@ enum KTHREAD_BITS {
 
 static inline void set_kthread_struct(void *kthread)
 {
+	/* paired with smp_read_data_barrier_depends() in to_kthread() */
+	smp_wmb();
+
 	/*
 	 * We abuse ->set_child_tid to avoid the new member and because it
 	 * can't be wrongly copied by copy_process(). We also rely on fact
@@ -67,8 +70,19 @@ static inline void set_kthread_struct(vo
 
 static inline struct kthread *to_kthread(struct task_struct *k)
 {
+	void *ptr;
+
 	WARN_ON(!(k->flags & PF_KTHREAD));
-	return (__force void *)k->set_child_tid;
+
+	ptr = (__force void *)k->set_child_tid;
+
+	/*
+	 * Paired with smp_wmb() in set_kthread_struct() and ensures that
+	 * the caller sees initialized content of the returned kthread.
+	 */
+	smp_read_barrier_depends();
+
+	return ptr;
 }
 
 void free_kthread_struct(struct task_struct *k)
@@ -196,6 +210,14 @@ static int kthread(void *_create)
 	int ret;
 
 	self = kmalloc(sizeof(*self), GFP_KERNEL);
+	if (self) {
+		self->flags = 0;
+		self->data = data;
+		init_completion(&self->exited);
+		init_completion(&self->parked);
+		current->vfork_done = &self->exited;
+	}
+
 	set_kthread_struct(self);
 
 	/* If user was SIGKILLed, I release the structure. */
@@ -211,12 +233,6 @@ static int kthread(void *_create)
 		do_exit(-ENOMEM);
 	}
 
-	self->flags = 0;
-	self->data = data;
-	init_completion(&self->exited);
-	init_completion(&self->parked);
-	current->vfork_done = &self->exited;
-
 	/* OK, tell user we're spawned, wait for stop or wakeup */
 	__set_current_state(TASK_UNINTERRUPTIBLE);
 	create->result = current;

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH 2/2] kthread, cgroup: close race window where new kthreads can be migrated to non-root cgroups
  2017-03-15 23:18 [PATCH 1/2] kthread: add barriers to set_kthread_struct() and to_kthread() Tejun Heo
@ 2017-03-15 23:19 ` Tejun Heo
  2017-03-16 15:02   ` Oleg Nesterov
  2017-03-16 20:54     ` Tejun Heo
  2017-03-16 14:54 ` [PATCH 1/2] kthread: add barriers to set_kthread_struct() and to_kthread() Oleg Nesterov
  1 sibling, 2 replies; 25+ messages in thread
From: Tejun Heo @ 2017-03-15 23:19 UTC (permalink / raw)
  To: Oleg Nesterov, Linus Torvalds, Andrew Morton
  Cc: Peter Zijlstra, Thomas Gleixner, Chris Mason, linux-kernel,
	kernel-team, Li Zefan, Johannes Weiner, cgroups

Creation of a kthread goes through a couple interlocked stages between
the kthread itself and its creator.  Once the new kthread starts
running, it initializes itself and wakes up the creator.  The creator
then can further configure the kthread and then let it start doing its
job by waking it up.

In this configuration-by-creator stage, the creator is the only one
that can wake it up but the kthread is visible to userland.  When
altering the kthread's attributes from userland is allowed, this is
fine; however, for cases where CPU affinity is critical,
kthread_bind() is used to first disable affinity changes from userland
and then set the affinity.  This also prevents the kthread from being
migrated into non-root cgroups as that can affect the CPU affinity and
many other things.

Unfortunately, the cgroup side of protection is racy.  While the
PF_NO_SETAFFINITY flag prevents further migrations, userland can win
the race before the creator sets the flag with kthread_bind() and put
the kthread in a non-root cgroup, which can lead to all sorts of
problems including incorrect CPU affinity and starvation.
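
The window is in the creator's sequence, roughly (a simplified sketch;
threadfn, data and the thread name are placeholders, and the per-cpu
workqueue code goes through its own helpers):

	p = kthread_create(threadfn, data, "example");
	/*
	 * p is already visible to userland here: a write to cgroup.procs
	 * can move it into a non-root cgroup before the next line runs.
	 */
	kthread_bind(p, cpu);	/* sets PF_NO_SETAFFINITY, but too late */
	wake_up_process(p);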

This bug got triggered by userland which periodically tries to migrate
all processes in the root cpuset cgroup to a non-root one.  Per-cpu
workqueue workers got caught while being created and ended up with
incorrect CPU affinity, breaking concurrency management and sometimes
stalling workqueue execution.

This patch introduces KTHREAD_INITIALIZED which is set after the
kthread finishes initialization.  cgroup core closes the race window
by testing kthread_initialized() and rejecting migration accordingly.

It'd be better to wait for the initialization instead of failing but I
couldn't think of a way of implementing that without adding either a
new PF flag, or sleeping and retrying from the waiting side.  Even if
userland depends on changing cgroup membership of a kthread, it either
has to be synchronized with kthread_create() or periodically repeat,
so it's unlikely that this would break anything.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Reported-and-debugged-by: Chris Mason <clm@fb.com>
Cc: stable@vger.kernel.org # v4.3+ (we can't close the race < v4.3)
---
 include/linux/kthread.h |    1 +
 kernel/cgroup/cgroup.c  |   10 ++++++----
 kernel/kthread.c        |   21 ++++++++++++++++++++-
 3 files changed, 27 insertions(+), 5 deletions(-)

--- a/include/linux/kthread.h
+++ b/include/linux/kthread.h
@@ -49,6 +49,7 @@ struct task_struct *kthread_create_on_cp
 })
 
 void free_kthread_struct(struct task_struct *k);
+bool kthread_initialized(struct task_struct *k);
 void kthread_bind(struct task_struct *k, unsigned int cpu);
 void kthread_bind_mask(struct task_struct *k, const struct cpumask *mask);
 int kthread_stop(struct task_struct *k);
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -2425,11 +2425,13 @@ ssize_t __cgroup_procs_write(struct kern
 		tsk = tsk->group_leader;
 
 	/*
-	 * Workqueue threads may acquire PF_NO_SETAFFINITY and become
-	 * trapped in a cpuset, or RT worker may be born in a cgroup
-	 * with no rt_runtime allocated.  Just say no.
+	 * kthreads may acquire PF_NO_SETAFFINITY during initialization.
+	 * If userland migrates such kthread to a non-root cgroup, it can
+	 * become trapped in a cpuset, or RT kthread may be born in a
+	 * cgroup with no rt_runtime allocated.  Just say no.
 	 */
-	if (tsk == kthreadd_task || (tsk->flags & PF_NO_SETAFFINITY)) {
+	if (tsk == kthreadd_task || (tsk->flags & PF_NO_SETAFFINITY) ||
+	    ((tsk->flags & PF_KTHREAD) && !kthread_initialized(tsk))) {
 		ret = -EINVAL;
 		goto out_unlock_rcu;
 	}
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -50,6 +50,7 @@ struct kthread {
 
 enum KTHREAD_BITS {
 	KTHREAD_IS_PER_CPU = 0,
+	KTHREAD_INITIALIZED,
 	KTHREAD_SHOULD_STOP,
 	KTHREAD_SHOULD_PARK,
 	KTHREAD_IS_PARKED,
@@ -57,7 +58,7 @@ enum KTHREAD_BITS {
 
 static inline void set_kthread_struct(void *kthread)
 {
-	/* paired with smp_read_data_barrier_depends() in to_kthread() */
+	/* paired with smp_read_barrier_depends() in to_kthread() */
 	smp_wmb();
 
 	/*
@@ -95,6 +96,23 @@ void free_kthread_struct(struct task_str
 }
 
 /**
+ * kthread_initialized - has the kthread finished initialization?
+ * @k: thread created by kthread_create().
+ *
+ * Test whether @k, which must be a kthread, finished initialization and is
+ * ready to execute the threadfn.  The kthread owner finishes
+ * initialization by waking up the new kthread for the first time.  If this
+ * function returns %false, the kthread owner could still be configuring
+ * the kthread.
+ */
+bool kthread_initialized(struct task_struct *k)
+{
+	struct kthread *kthread = to_kthread(k);
+
+	return kthread && test_bit(KTHREAD_INITIALIZED, &kthread->flags);
+}
+
+/**
  * kthread_should_stop - should this kthread return now?
  *
  * When someone calls kthread_stop() on your kthread, it will be woken
@@ -238,6 +256,7 @@ static int kthread(void *_create)
 	create->result = current;
 	complete(done);
 	schedule();
+	set_bit(KTHREAD_INITIALIZED, &self->flags);
 
 	ret = -EINTR;
 	if (!test_bit(KTHREAD_SHOULD_STOP, &self->flags)) {

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/2] kthread: add barriers to set_kthread_struct() and to_kthread()
  2017-03-15 23:18 [PATCH 1/2] kthread: add barriers to set_kthread_struct() and to_kthread() Tejun Heo
  2017-03-15 23:19 ` [PATCH 2/2] kthread, cgroup: close race window where new kthreads can be migrated to non-root cgroups Tejun Heo
@ 2017-03-16 14:54 ` Oleg Nesterov
  2017-03-16 15:33   ` Tejun Heo
  1 sibling, 1 reply; 25+ messages in thread
From: Oleg Nesterov @ 2017-03-16 14:54 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Linus Torvalds, Andrew Morton, Peter Zijlstra, Thomas Gleixner,
	Chris Mason, linux-kernel, kernel-team

Hi Tejun,

On 03/15, Tejun Heo wrote:
>
> Until now, all to_kthread() users are interlocked with kthread
> creation and there's no need to have explicit barriers when setting
> the kthread pointer or dereferencing it.
>
> However, there is a race condition where userland can interfere with a
> kthread while it's being initialized.  Closing it requires using
> to_kthread() from an unsynchronized context.

So this is preparation for 2/2... IIUC, the current code is not buggy,
just you need to add kthread_initialized() which can't work without
this change.

>  static inline void set_kthread_struct(void *kthread)
>  {
> +	/* paired with smp_read_data_barrier_depends() in to_kthread() */
> +	smp_wmb();
> +
>  	/*
>  	 * We abuse ->set_child_tid to avoid the new member and because it
>  	 * can't be wrongly copied by copy_process(). We also rely on fact
> @@ -67,8 +70,19 @@ static inline void set_kthread_struct(vo
>
>  static inline struct kthread *to_kthread(struct task_struct *k)
>  {
> +	void *ptr;
> +
>  	WARN_ON(!(k->flags & PF_KTHREAD));
> -	return (__force void *)k->set_child_tid;
> +
> +	ptr = (__force void *)k->set_child_tid;
> +
> +	/*
> +	 * Paired with smp_wmb() in set_kthread_struct() and ensures that
> +	 * the caller sees initialized content of the returned kthread.
> +	 */
> +	smp_read_barrier_depends();
> +
> +	return ptr;

This is almost off-topic, but I think lockless_dereference() will look
better in to_kthread().

And perhaps we should add another helper, say,

	#define lockless_assign_pointer(ptr, val)	\
		smp_store_release(&ptr, val)

for set_kthread_struct() ? it can have more users.
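
IOW, something like this on top of your patch, just to show the idea
(completely untested):

	static inline void set_kthread_struct(void *kthread)
	{
		/* we still abuse ->set_child_tid, see the comment there */
		lockless_assign_pointer(current->set_child_tid,
					(__force void __user *)kthread);
	}

	static inline struct kthread *to_kthread(struct task_struct *k)
	{
		WARN_ON(!(k->flags & PF_KTHREAD));
		return (__force void *)lockless_dereference(k->set_child_tid);
	}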

Not that I think you should change your patch, I am just asking.

Oleg.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 2/2] kthread, cgroup: close race window where new kthreads can be migrated to non-root cgroups
  2017-03-15 23:19 ` [PATCH 2/2] kthread, cgroup: close race window where new kthreads can be migrated to non-root cgroups Tejun Heo
@ 2017-03-16 15:02   ` Oleg Nesterov
  2017-03-16 15:39       ` Oleg Nesterov
  2017-03-16 16:05     ` Tejun Heo
  2017-03-16 20:54     ` Tejun Heo
  1 sibling, 2 replies; 25+ messages in thread
From: Oleg Nesterov @ 2017-03-16 15:02 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Linus Torvalds, Andrew Morton, Peter Zijlstra, Thomas Gleixner,
	Chris Mason, linux-kernel, kernel-team, Li Zefan,
	Johannes Weiner, cgroups

On 03/15, Tejun Heo wrote:
>
> --- a/kernel/cgroup/cgroup.c
> +++ b/kernel/cgroup/cgroup.c
> @@ -2425,11 +2425,13 @@ ssize_t __cgroup_procs_write(struct kern
>  		tsk = tsk->group_leader;
>  
>  	/*
> -	 * Workqueue threads may acquire PF_NO_SETAFFINITY and become
> -	 * trapped in a cpuset, or RT worker may be born in a cgroup
> -	 * with no rt_runtime allocated.  Just say no.
> +	 * kthreads may acquire PF_NO_SETAFFINITY during initialization.
> +	 * If userland migrates such kthread to a non-root cgroup, it can
> +	 * become trapped in a cpuset, or RT kthread may be born in a
> +	 * cgroup with no rt_runtime allocated.  Just say no.
>  	 */
> -	if (tsk == kthreadd_task || (tsk->flags & PF_NO_SETAFFINITY)) {
> +	if (tsk == kthreadd_task || (tsk->flags & PF_NO_SETAFFINITY) ||
> +	    ((tsk->flags & PF_KTHREAD) && !kthread_initialized(tsk))) {
>  		ret = -EINVAL;

...

> +bool kthread_initialized(struct task_struct *k)
> +{
> +	struct kthread *kthread = to_kthread(k);
> +
> +	return kthread && test_bit(KTHREAD_INITIALIZED, &kthread->flags);
> +}

Not sure I understand...

With this patch you can no longer migrate a kernel thread created by
kernel_thread() ? Note that to_kthread() is NULL unless it was created
by kthread_create().

Oleg.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/2] kthread: add barriers to set_kthread_struct() and to_kthread()
  2017-03-16 14:54 ` [PATCH 1/2] kthread: add barriers to set_kthread_struct() and to_kthread() Oleg Nesterov
@ 2017-03-16 15:33   ` Tejun Heo
  2017-03-16 15:38     ` Tejun Heo
  0 siblings, 1 reply; 25+ messages in thread
From: Tejun Heo @ 2017-03-16 15:33 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Linus Torvalds, Andrew Morton, Peter Zijlstra, Thomas Gleixner,
	Chris Mason, linux-kernel, kernel-team

Hello, Oleg.

On Thu, Mar 16, 2017 at 03:54:36PM +0100, Oleg Nesterov wrote:
> On 03/15, Tejun Heo wrote:
> >
> > Until now, all to_kthread() users are interlocked with kthread
> > creation and there's no need to have explicit barriers when setting
> > the kthread pointer or dereferencing it.
> >
> > However, there is a race condition where userland can interfere with a
> > kthread while it's being initialized.  Closing it requires using
> > to_kthread() from an unsynchronized context.
> 
> So this is preparation for 2/2... IIUC, the current code is not buggy,
> just you need to add kthread_initialized() which can't work without
> this change.

Yeah, I could have been clearer.

> > +	/*
> > +	 * Paired with smp_wmb() in set_kthread_struct() and ensures that
> > +	 * the caller sees initialized content of the returned kthread.
> > +	 */
> > +	smp_read_barrier_depends();
> > +
> > +	return ptr;
> 
> This is almost off-topic, but I think lockless_dereference() will look
> better in to_kthread().
> 
> And perhaps we should add another helper, say,
> 
> 	#define lockless_assign_pointer(ptr, val)	\
> 		smp_store_release(&ptr, val)
> 
> for set_kthread_struct() ? it can have more users.
> 
> Not that I think you should change your patch, I am just asking.

Ah yeah, that would look better.  I vaguely remembered the new macro
but couldn't quite remember it fully. :)  Will update the patch.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/2] kthread: add barriers to set_kthread_struct() and to_kthread()
  2017-03-16 15:33   ` Tejun Heo
@ 2017-03-16 15:38     ` Tejun Heo
  2017-03-16 15:46       ` Oleg Nesterov
  2017-03-16 15:55       ` Peter Zijlstra
  0 siblings, 2 replies; 25+ messages in thread
From: Tejun Heo @ 2017-03-16 15:38 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Linus Torvalds, Andrew Morton, Peter Zijlstra, Thomas Gleixner,
	Chris Mason, linux-kernel, kernel-team

On Thu, Mar 16, 2017 at 11:33:01AM -0400, Tejun Heo wrote:
> > And perhaps we should add another helper, say,
> > 
> > 	#define lockless_assign_pointer(ptr, val)	\
> > 		smp_store_release(&ptr, val)
> > 
> > for set_kthread_struct() ? it can have more users.
> > 
> > Not that I think you should change your patch, I am just asking.
> 
> Ah yeah, that would look better.  I vaguely remembered the new macro
> but couldn't quite remember it fully. :)  Will update the patch.

Oops, as for adding lockless_assign_pointer(), wouldn't smp_wmb() be a
better match for smp_read_barrier_depends()?  ISTR acquire/release
pairs being more expensive on some archs.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 2/2] kthread, cgroup: close race window where new kthreads can be migrated to non-root cgroups
@ 2017-03-16 15:39       ` Oleg Nesterov
  0 siblings, 0 replies; 25+ messages in thread
From: Oleg Nesterov @ 2017-03-16 15:39 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Linus Torvalds, Andrew Morton, Peter Zijlstra, Thomas Gleixner,
	Chris Mason, linux-kernel, kernel-team, Li Zefan,
	Johannes Weiner, cgroups

On 03/16, Oleg Nesterov wrote:
>
> On 03/15, Tejun Heo wrote:
> >
> > --- a/kernel/cgroup/cgroup.c
> > +++ b/kernel/cgroup/cgroup.c
> > @@ -2425,11 +2425,13 @@ ssize_t __cgroup_procs_write(struct kern
> >  		tsk = tsk->group_leader;
> >
> >  	/*
> > -	 * Workqueue threads may acquire PF_NO_SETAFFINITY and become
> > -	 * trapped in a cpuset, or RT worker may be born in a cgroup
> > -	 * with no rt_runtime allocated.  Just say no.
> > +	 * kthreads may acquire PF_NO_SETAFFINITY during initialization.
> > +	 * If userland migrates such kthread to a non-root cgroup, it can
> > +	 * become trapped in a cpuset, or RT kthread may be born in a
> > +	 * cgroup with no rt_runtime allocated.  Just say no.
> >  	 */
> > -	if (tsk == kthreadd_task || (tsk->flags & PF_NO_SETAFFINITY)) {
> > +	if (tsk == kthreadd_task || (tsk->flags & PF_NO_SETAFFINITY) ||
> > +	    ((tsk->flags & PF_KTHREAD) && !kthread_initialized(tsk))) {
> >  		ret = -EINVAL;
>
> ...
>
> > +bool kthread_initialized(struct task_struct *k)
> > +{
> > +	struct kthread *kthread = to_kthread(k);
> > +
> > +	return kthread && test_bit(KTHREAD_INITIALIZED, &kthread->flags);
> > +}
>
> Not sure I understand...
>
> With this patch you can no longer migrate a kernel thread created by
> kernel_thread() ? Note that to_kthread() is NULL unless it was created
> by kthread_create().

Either way, I am wondering if we can do something really trivial like
the patch below. This way we can also remove the "tsk == kthreadd_task"
check, and we do not need the barriers.

Oleg.

--- x/kernel/kthread.c
+++ x/kernel/kthread.c
@@ -226,6 +226,7 @@
 	ret = -EINTR;
 	if (!test_bit(KTHREAD_SHOULD_STOP, &self->flags)) {
 		__kthread_parkme(self);
+		current->flags &= ~PF_IDONTLIKECGROUPS;
 		ret = threadfn(data);
 	}
 	do_exit(ret);
@@ -537,7 +538,7 @@
 	set_cpus_allowed_ptr(tsk, cpu_all_mask);
 	set_mems_allowed(node_states[N_MEMORY]);
 
-	current->flags |= PF_NOFREEZE;
+	current->flags |= (PF_NOFREEZE | PF_IDONTLIKECGROUPS);
 
 	for (;;) {
 		set_current_state(TASK_INTERRUPTIBLE);
--- x/kernel/cgroup/cgroup.c
+++ x/kernel/cgroup/cgroup.c
@@ -2429,7 +2429,7 @@
 	 * trapped in a cpuset, or RT worker may be born in a cgroup
 	 * with no rt_runtime allocated.  Just say no.
 	 */
-	if (tsk == kthreadd_task || (tsk->flags & PF_NO_SETAFFINITY)) {
+	if (tsk->flags & (PF_NO_SETAFFINITY | PF_IDONTLIKECGROUPS)) {
 		ret = -EINVAL;
 		goto out_unlock_rcu;
 	}

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/2] kthread: add barriers to set_kthread_struct() and to_kthread()
  2017-03-16 15:38     ` Tejun Heo
@ 2017-03-16 15:46       ` Oleg Nesterov
  2017-03-16 15:55       ` Peter Zijlstra
  1 sibling, 0 replies; 25+ messages in thread
From: Oleg Nesterov @ 2017-03-16 15:46 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Linus Torvalds, Andrew Morton, Peter Zijlstra, Thomas Gleixner,
	Chris Mason, linux-kernel, kernel-team

On 03/16, Tejun Heo wrote:
>
> On Thu, Mar 16, 2017 at 11:33:01AM -0400, Tejun Heo wrote:
> > > And perhaps we should add another helper, say,
> > >
> > > 	#define lockless_assign_pointer(ptr, val)	\
> > > 		smp_store_release(&ptr, val)
> > >
> > > for set_kthread_struct() ? it can have more users.
> > >
> > > Not that I think you should change your patch, I am just asking.
> >
> > Ah yeah, that would look better.  I vaguely remembered the new macro
> > but couldn't quite remember it fully. :)  Will update the patch.
>
> Oops, as for adding lockless_assign_pointer(), wouldn't smp_wmb() be a
> better match for smp_read_barrier_depends()?  ISTR acquire/release
> pairs being more expensive on some archs.

No, no, don't ask me, I can't know ;)

But. Note that rcu_assign_pointer() (which should pair with
smp_read_barrier_depends/lockless_dereference too) uses
smp_store_release(), and the changelog says "potentially less overhead".
See 88c1863066ccfa456 "rcu: Define rcu_assign_pointer() in terms of
smp_store_release()".

And this discussion is another argument to add the new helper: we can
always change it to use wmb or store_release, or whatever else.  Plus
arch/ can override it.
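
Say, following the usual pattern for the generic barrier helpers (just
a sketch):

	/* generic fallback, an arch can define its own version first */
	#ifndef lockless_assign_pointer
	#define lockless_assign_pointer(ptr, val)	\
		smp_store_release(&(ptr), (val))
	#endif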

Oleg.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/2] kthread: add barriers to set_kthread_struct() and to_kthread()
  2017-03-16 15:38     ` Tejun Heo
  2017-03-16 15:46       ` Oleg Nesterov
@ 2017-03-16 15:55       ` Peter Zijlstra
  2017-03-16 16:09         ` Tejun Heo
  1 sibling, 1 reply; 25+ messages in thread
From: Peter Zijlstra @ 2017-03-16 15:55 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Oleg Nesterov, Linus Torvalds, Andrew Morton, Thomas Gleixner,
	Chris Mason, linux-kernel, kernel-team

On Thu, Mar 16, 2017 at 11:38:43AM -0400, Tejun Heo wrote:
> On Thu, Mar 16, 2017 at 11:33:01AM -0400, Tejun Heo wrote:
> > > And perhaps we should add another helper, say,
> > > 
> > > 	#define lockless_assign_pointer(ptr, val)	\
> > > 		smp_store_release(&ptr, val)
> > > 
> > > for set_kthread_struct() ? it can have more users.
> > > 
> > > Not that I think you should change your patch, I am just asking.
> > 
> > Ah yeah, that would look better.  I vaguely remembered the new macro
> > but couldn't quite remember it fully. :)  Will update the patch.
> 
> Oops, as for adding lockless_assign_pointer(), wouldn't smp_wmb() be a
> better match for smp_read_barrier_depends()?  ISTR acquire/release
> pairs being more expensive on some archs.

88c1863066cc ("rcu: Define rcu_assign_pointer() in terms of smp_store_release()")

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 2/2] kthread, cgroup: close race window where new kthreads can be migrated to non-root cgroups
  2017-03-16 15:02   ` Oleg Nesterov
  2017-03-16 15:39       ` Oleg Nesterov
@ 2017-03-16 16:05     ` Tejun Heo
  2017-03-16 16:17         ` Oleg Nesterov
  1 sibling, 1 reply; 25+ messages in thread
From: Tejun Heo @ 2017-03-16 16:05 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Linus Torvalds, Andrew Morton, Peter Zijlstra, Thomas Gleixner,
	Chris Mason, linux-kernel, kernel-team, Li Zefan,
	Johannes Weiner, cgroups

Hello,

On Thu, Mar 16, 2017 at 04:02:34PM +0100, Oleg Nesterov wrote:
> > +bool kthread_initialized(struct task_struct *k)
> > +{
> > +	struct kthread *kthread = to_kthread(k);
> > +
> > +	return kthread && test_bit(KTHREAD_INITIALIZED, &kthread->flags);
> > +}
> 
> Not sure I understand...
> 
> With this patch you can no longer migrate a kernel thread created by
> kernel_thread() ? Note that to_kthread() is NULL unless it was created
> by kthread_create().

Yeah, what it does is prevent migration of a kthread until its owner
wakes it up for the first time.  The problem is that kthread_bind()
seals off future cgroup migrations from userland but doesn't move the
kthread back to the root cgroup, so userland has a window in between
where it can meddle with cgroup membership and break things.

The NULL test is there because the test may be performed before the
kthread itself sets up its struct kthread.

An alternative approach could be making kthread_bind() migrate the
kthread back to root cgroup, which btw is why affinity is fine as the
function overwrites it after setting NO_SETAFFINITY; however, the
problem there is that userland can put the kthread into !root cgroup
and starve it before it reaches create->done.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 2/2] kthread, cgroup: close race window where new kthreads can be migrated to non-root cgroups
@ 2017-03-16 16:07         ` Tejun Heo
  0 siblings, 0 replies; 25+ messages in thread
From: Tejun Heo @ 2017-03-16 16:07 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Linus Torvalds, Andrew Morton, Peter Zijlstra, Thomas Gleixner,
	Chris Mason, linux-kernel, kernel-team, Li Zefan,
	Johannes Weiner, cgroups

Hello,

On Thu, Mar 16, 2017 at 04:39:26PM +0100, Oleg Nesterov wrote:
> Either way, I am wondering if we can do something really trivial like
> the patch below. This way we can also remove the "tsk == kthreadd_task"
> check, and we do not need the barriers.
> 
> Oleg.
> 
> --- x/kernel/kthread.c
> +++ x/kernel/kthread.c
> @@ -226,6 +226,7 @@
>  	ret = -EINTR;
>  	if (!test_bit(KTHREAD_SHOULD_STOP, &self->flags)) {
>  		__kthread_parkme(self);
> +		current->flags &= ~PF_IDONTLIKECGROUPS;
>  		ret = threadfn(data);
>  	}
>  	do_exit(ret);
> @@ -537,7 +538,7 @@
>  	set_cpus_allowed_ptr(tsk, cpu_all_mask);
>  	set_mems_allowed(node_states[N_MEMORY]);
>  
> -	current->flags |= PF_NOFREEZE;
> +	current->flags |= (PF_NOFREEZE | PF_IDONTLIKECGROUPS);
>  
>  	for (;;) {
>  		set_current_state(TASK_INTERRUPTIBLE);
> --- x/kernel/cgroup/cgroup.c
> +++ x/kernel/cgroup/cgroup.c
> @@ -2429,7 +2429,7 @@
>  	 * trapped in a cpuset, or RT worker may be born in a cgroup
>  	 * with no rt_runtime allocated.  Just say no.
>  	 */
> -	if (tsk == kthreadd_task || (tsk->flags & PF_NO_SETAFFINITY)) {
> +	if (tsk->flags & (PF_NO_SETAFFINITY | PF_IDONTLIKECGROUPS)) {
>  		ret = -EINVAL;
>  		goto out_unlock_rcu;
>  	}

Absolutely.  If we're willing to spend a PF flag on it, we can
properly wait for it too instead of failing it.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/2] kthread: add barriers to set_kthread_struct() and to_kthread()
  2017-03-16 15:55       ` Peter Zijlstra
@ 2017-03-16 16:09         ` Tejun Heo
  2017-03-16 16:14           ` Peter Zijlstra
  0 siblings, 1 reply; 25+ messages in thread
From: Tejun Heo @ 2017-03-16 16:09 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Oleg Nesterov, Linus Torvalds, Andrew Morton, Thomas Gleixner,
	Chris Mason, linux-kernel, kernel-team

On Thu, Mar 16, 2017 at 04:55:34PM +0100, Peter Zijlstra wrote:
> > Oops, as for adding lockless_assign_pointer(), wouldn't smp_wmb() be a
> > better match for smp_read_barrier_depends()?  ISTR acquire/release
> > pairs being more expensive on some archs.
> 
> 88c1863066cc ("rcu: Define rcu_assign_pointer() in terms of smp_store_release()")

Hmmm, nice, can we always prefer store_release over wmb from now on?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/2] kthread: add barriers to set_kthread_struct() and to_kthread()
  2017-03-16 16:09         ` Tejun Heo
@ 2017-03-16 16:14           ` Peter Zijlstra
  0 siblings, 0 replies; 25+ messages in thread
From: Peter Zijlstra @ 2017-03-16 16:14 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Oleg Nesterov, Linus Torvalds, Andrew Morton, Thomas Gleixner,
	Chris Mason, linux-kernel, kernel-team

On Thu, Mar 16, 2017 at 12:09:44PM -0400, Tejun Heo wrote:
> On Thu, Mar 16, 2017 at 04:55:34PM +0100, Peter Zijlstra wrote:
> > > Oops, as for adding lockless_assign_pointer(), wouldn't smp_wmb() be a
> > > better match for smp_read_barrier_depends()?  ISTR acquire/release
> > > pairs being more expensive on some archs.
> > 
> > 88c1863066cc ("rcu: Define rcu_assign_pointer() in terms of smp_store_release()")
> 
> Hmmm, nice, can we always prefer store_release over wmb from now on?

I would advocate using whichever barrier is most natural for the
occasion.

Only if there really is a very very compelling performance argument
should we look further.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 2/2] kthread, cgroup: close race window where new kthreads can be migrated to non-root cgroups
@ 2017-03-16 16:17         ` Oleg Nesterov
  0 siblings, 0 replies; 25+ messages in thread
From: Oleg Nesterov @ 2017-03-16 16:17 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Linus Torvalds, Andrew Morton, Peter Zijlstra, Thomas Gleixner,
	Chris Mason, linux-kernel, kernel-team, Li Zefan,
	Johannes Weiner, cgroups

On 03/16, Tejun Heo wrote:
>
> Hello,
>
> On Thu, Mar 16, 2017 at 04:02:34PM +0100, Oleg Nesterov wrote:
> > > +bool kthread_initialized(struct task_struct *k)
> > > +{
> > > +	struct kthread *kthread = to_kthread(k);
> > > +
> > > +	return kthread && test_bit(KTHREAD_INITIALIZED, &kthread->flags);
> > > +}
> >
> > Not sure I understand...
> >
> > With this patch you can no longer migrate a kernel thread created by
> > kernel_thread() ? Note that to_kthread() is NULL unless it was created
> > by kthread_create().
>
> Yeah, what it does is prevent migration of a kthread until its owner
> wakes it up for the first time.  The problem is that kthread_bind()
> seals off future cgroup migrations from userland but doesn't move the
> kthread back to the root cgroup, so userland has a window in between
> where it can meddle with cgroup membership and break things.

This is clear,

> The NULL test is there because the test may be performed before the
> kthread itself sets up its struct kthread.

This too.

But this also means that __cgroup_procs_write() will always fail if
this task is a kernel thread which was not created by kthread_create().

Currently you can use kernel_thread() (although you shouldn't) and it
can be migrated, this won't work after your patch.

Oleg.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 2/2] kthread, cgroup: close race window where new kthreads can be migrated to non-root cgroups
@ 2017-03-16 16:31           ` Oleg Nesterov
  0 siblings, 0 replies; 25+ messages in thread
From: Oleg Nesterov @ 2017-03-16 16:31 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Linus Torvalds, Andrew Morton, Peter Zijlstra, Thomas Gleixner,
	Chris Mason, linux-kernel, kernel-team, Li Zefan,
	Johannes Weiner, cgroups

On 03/16, Tejun Heo wrote:
>
> > --- x/kernel/kthread.c
> > +++ x/kernel/kthread.c
> > @@ -226,6 +226,7 @@
> >  	ret = -EINTR;
> >  	if (!test_bit(KTHREAD_SHOULD_STOP, &self->flags)) {
> >  		__kthread_parkme(self);
> > +		current->flags &= ~PF_IDONTLIKECGROUPS;
> >  		ret = threadfn(data);
> >  	}
> >  	do_exit(ret);
> > @@ -537,7 +538,7 @@
> >  	set_cpus_allowed_ptr(tsk, cpu_all_mask);
> >  	set_mems_allowed(node_states[N_MEMORY]);
> >
> > -	current->flags |= PF_NOFREEZE;
> > +	current->flags |= (PF_NOFREEZE | PF_IDONTLIKECGROUPS);
> >
> >  	for (;;) {
> >  		set_current_state(TASK_INTERRUPTIBLE);
> > --- x/kernel/cgroup/cgroup.c
> > +++ x/kernel/cgroup/cgroup.c
> > @@ -2429,7 +2429,7 @@
> >  	 * trapped in a cpuset, or RT worker may be born in a cgroup
> >  	 * with no rt_runtime allocated.  Just say no.
> >  	 */
> > -	if (tsk == kthreadd_task || (tsk->flags & PF_NO_SETAFFINITY)) {
> > +	if (tsk->flags & (PF_NO_SETAFFINITY | PF_IDONTLIKECGROUPS)) {
> >  		ret = -EINVAL;
> >  		goto out_unlock_rcu;
> >  	}
>
> Absolutely.  If we're willing to spend a PF flag on it, we can
> properly wait for it too instead of failing it.

Or we can add another "unsigned no_cgroups:1" bit into task_struct,
not sure.

Anyway, I do not understand the PF_NO_SETAFFINITY check in
__cgroup_procs_write(). task_can_attach() checks it too, so cgroups
can't change the affinity. Imo something explicit like no_cgroups
makes more sense.

Oleg.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 2/2] kthread, cgroup: close race window where new kthreads can be migrated to non-root cgroups
  2017-03-16 16:17         ` Oleg Nesterov
  (?)
@ 2017-03-16 17:03         ` Tejun Heo
  -1 siblings, 0 replies; 25+ messages in thread
From: Tejun Heo @ 2017-03-16 17:03 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Linus Torvalds, Andrew Morton, Peter Zijlstra, Thomas Gleixner,
	Chris Mason, linux-kernel, kernel-team, Li Zefan,
	Johannes Weiner, cgroups

Hello,

On Thu, Mar 16, 2017 at 05:17:57PM +0100, Oleg Nesterov wrote:
> But this also means that __cgroup_procs_write() will always fail if
> this task is a kernel thread which was not created by kthread_create().
> 
> Currently you can use kernel_thread() (although you shouldn't) and it
> can be migrated, this won't work after your patch.

I see what you mean now.  The only users seem to be init/main.c and
kernel/kmod.c.  I'll see if there's a way we can do this only for
kthread_create() users.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 2/2] kthread, cgroup: close race window where new kthreads can be migrated to non-root cgroups
  2017-03-16 16:31           ` Oleg Nesterov
  (?)
@ 2017-03-16 17:41           ` Tejun Heo
  -1 siblings, 0 replies; 25+ messages in thread
From: Tejun Heo @ 2017-03-16 17:41 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Linus Torvalds, Andrew Morton, Peter Zijlstra, Thomas Gleixner,
	Chris Mason, linux-kernel, kernel-team, Li Zefan,
	Johannes Weiner, cgroups

Hello,

On Thu, Mar 16, 2017 at 05:31:59PM +0100, Oleg Nesterov wrote:
> Or we can add another "unsigned no_cgroups:1" bit into task_struct,
> not sure.

To synchronize around initialization, a PF flag would be easier as we
can use wait_on_bit().

> Anyway, I do not understand the PF_NO_SETAFFINITY check in
> __cgroup_procs_write(). task_can_attach() checks it too, so cgroups
> can't change the affinity. Imo something explicit like no_cgroups
> makes more sense.

task_can_attach() predates the __cgroup_procs_write() and currently
doesn't do anything.  We can split the flag or rename it so that it's
more generic.  The reasons for disallowing cgroup migration have a lot
of crosssection with affinity, so it's not a complete misnomer.
Either way is fine by me.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH v2] cgroup, kthread: close race window where new kthreads can be migrated to non-root cgroups
@ 2017-03-16 20:54     ` Tejun Heo
  0 siblings, 0 replies; 25+ messages in thread
From: Tejun Heo @ 2017-03-16 20:54 UTC (permalink / raw)
  To: Oleg Nesterov, Linus Torvalds, Andrew Morton
  Cc: Peter Zijlstra, Thomas Gleixner, Chris Mason, linux-kernel,
	kernel-team, Li Zefan, Johannes Weiner, cgroups

Hello,

I tried a couple variants but Oleg's suggestion turns out to be the
simplest.  This patch doesn't require the first barrier patch.  Oleg,
if you're okay with the patch, I can route this through
cgroup/for-4.11-fixes.

Thanks!

------ 8< ------
Creation of a kthread goes through a couple interlocked stages between
the kthread itself and its creator.  Once the new kthread starts
running, it initializes itself and wakes up the creator.  The creator
then can further configure the kthread and then let it start doing its
job by waking it up.

In this configuration-by-creator stage, the creator is the only one
that can wake it up but the kthread is visible to userland.  When
altering the kthread's attributes from userland is allowed, this is
fine; however, for cases where CPU affinity is critical,
kthread_bind() is used to first disable affinity changes from userland
and then set the affinity.  This also prevents the kthread from being
migrated into non-root cgroups as that can affect the CPU affinity and
many other things.

Unfortunately, the cgroup side of protection is racy.  While the
PF_NO_SETAFFINITY flag prevents further migrations, userland can win
the race before the creator sets the flag with kthread_bind() and put
the kthread in a non-root cgroup, which can lead to all sorts of
problems including incorrect CPU affinity and starvation.

This bug got triggered by userland which periodically tries to migrate
all processes in the root cpuset cgroup to a non-root one.  Per-cpu
workqueue workers got caught while being created and ended up with
incorrect CPU affinity, breaking concurrency management and sometimes
stalling workqueue execution.

This patch adds task->no_cgroup_migration, which prevents the task
from being migrated by userland.  kthreadd starts with the flag set,
making every child kthread start in the root cgroup with migration
disallowed.  The flag is cleared after the kthread finishes
initialization, by which time PF_NO_SETAFFINITY is set if the kthread
should stay in the root cgroup.

It'd be better to wait for the initialization instead of failing but I
couldn't think of a way of implementing that without adding either a
new PF flag, or sleeping and retrying from the waiting side.  Even if
userland depends on changing cgroup membership of a kthread, it either
has to be synchronized with kthread_create() or periodically repeat,
so it's unlikely that this would break anything.

v2: Switch to a simpler implementation using a new task_struct bit
    field suggested by Oleg.

Signed-off-by: Tejun Heo <tj@kernel.org>
Suggested-by: Oleg Nesterov <oleg@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Reported-and-debugged-by: Chris Mason <clm@fb.com>
Cc: stable@vger.kernel.org # v4.3+ (we can't close the race on < v4.3)
---
 include/linux/cgroup.h |   21 +++++++++++++++++++++
 include/linux/sched.h  |    4 ++++
 kernel/cgroup/cgroup.c |    9 +++++----
 kernel/kthread.c       |    3 +++
 4 files changed, 33 insertions(+), 4 deletions(-)

--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -570,6 +570,25 @@ static inline void pr_cont_cgroup_path(s
 	pr_cont_kernfs_path(cgrp->kn);
 }
 
+static inline void cgroup_init_kthreadd(void)
+{
+	/*
+	 * kthreadd is inherited by all kthreads, keep it in the root so
+	 * that the new kthreads are guaranteed to stay in the root until
+	 * initialization is finished.
+	 */
+	current->no_cgroup_migration = 1;
+}
+
+static inline void cgroup_kthread_ready(void)
+{
+	/*
+	 * This kthread finished initialization.  The creator should have
+	 * set PF_NO_SETAFFINITY if this kthread should stay in the root.
+	 */
+	current->no_cgroup_migration = 0;
+}
+
 #else /* !CONFIG_CGROUPS */
 
 struct cgroup_subsys_state;
@@ -590,6 +609,8 @@ static inline void cgroup_free(struct ta
 
 static inline int cgroup_init_early(void) { return 0; }
 static inline int cgroup_init(void) { return 0; }
+static inline void cgroup_init_kthreadd(void) {}
+static inline void cgroup_kthread_ready(void) {}
 
 static inline bool task_under_cgroup_hierarchy(struct task_struct *task,
 					       struct cgroup *ancestor)
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -604,6 +604,10 @@ struct task_struct {
 #ifdef CONFIG_COMPAT_BRK
 	unsigned			brk_randomized:1;
 #endif
+#ifdef CONFIG_CGROUPS
+	/* disallow userland-initiated cgroup migration */
+	unsigned			no_cgroup_migration:1;
+#endif
 
 	unsigned long			atomic_flags; /* Flags requiring atomic access. */
 
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -2425,11 +2425,12 @@ ssize_t __cgroup_procs_write(struct kern
 		tsk = tsk->group_leader;
 
 	/*
-	 * Workqueue threads may acquire PF_NO_SETAFFINITY and become
-	 * trapped in a cpuset, or RT worker may be born in a cgroup
-	 * with no rt_runtime allocated.  Just say no.
+	 * kthreads may acquire PF_NO_SETAFFINITY during initialization.
+	 * If userland migrates such a kthread to a non-root cgroup, it can
+	 * become trapped in a cpuset, or RT kthread may be born in a
+	 * cgroup with no rt_runtime allocated.  Just say no.
 	 */
-	if (tsk == kthreadd_task || (tsk->flags & PF_NO_SETAFFINITY)) {
+	if (tsk->no_cgroup_migration || (tsk->flags & PF_NO_SETAFFINITY)) {
 		ret = -EINVAL;
 		goto out_unlock_rcu;
 	}
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -20,6 +20,7 @@
 #include <linux/freezer.h>
 #include <linux/ptrace.h>
 #include <linux/uaccess.h>
+#include <linux/cgroup.h>
 #include <trace/events/sched.h>
 
 static DEFINE_SPINLOCK(kthread_create_lock);
@@ -225,6 +226,7 @@ static int kthread(void *_create)
 
 	ret = -EINTR;
 	if (!test_bit(KTHREAD_SHOULD_STOP, &self->flags)) {
+		cgroup_kthread_ready();
 		__kthread_parkme(self);
 		ret = threadfn(data);
 	}
@@ -538,6 +540,7 @@ int kthreadd(void *unused)
 	set_mems_allowed(node_states[N_MEMORY]);
 
 	current->flags |= PF_NOFREEZE;
+	cgroup_init_kthreadd();
 
 	for (;;) {
 		set_current_state(TASK_INTERRUPTIBLE);

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2] cgroup, kthread: close race window where new kthreads can be migrated to non-root cgroups
  2017-03-16 20:54     ` Tejun Heo
  (?)
@ 2017-03-17 13:50     ` Oleg Nesterov
  2017-03-17 14:44       ` Tejun Heo
  -1 siblings, 1 reply; 25+ messages in thread
From: Oleg Nesterov @ 2017-03-17 13:50 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Linus Torvalds, Andrew Morton, Peter Zijlstra, Thomas Gleixner,
	Chris Mason, linux-kernel, kernel-team, Li Zefan,
	Johannes Weiner, cgroups

On 03/16, Tejun Heo wrote:
>
> Oleg,
> if you're okay with the patch, I can route this through
> cgroup/for-4.11-fixes.

Thanks, looks good to me.

Oleg.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2] cgroup, kthread: close race window where new kthreads can be migrated to non-root cgroups
  2017-03-17 13:50     ` Oleg Nesterov
@ 2017-03-17 14:44       ` Tejun Heo
  0 siblings, 0 replies; 25+ messages in thread
From: Tejun Heo @ 2017-03-17 14:44 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Linus Torvalds, Andrew Morton, Peter Zijlstra, Thomas Gleixner,
	Chris Mason, linux-kernel, kernel-team, Li Zefan,
	Johannes Weiner, cgroups

On Fri, Mar 17, 2017 at 02:50:21PM +0100, Oleg Nesterov wrote:
> On 03/16, Tejun Heo wrote:
> >
> > Oleg,
> > if you're okay with the patch, I can route this through
> > cgroup/for-4.11-fixes.
> 
> Thanks, looks good to me.

Applied to cgroup/for-4.11-fixes.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 25+ messages in thread

end of thread, other threads:[~2017-03-17 14:44 UTC | newest]

Thread overview: 25+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-03-15 23:18 [PATCH 1/2] kthread: add barriers to set_kthread_struct() and to_kthread() Tejun Heo
2017-03-15 23:19 ` [PATCH 2/2] kthread, cgroup: close race window where new kthreads can be migrated to non-root cgroups Tejun Heo
2017-03-16 15:02   ` Oleg Nesterov
2017-03-16 15:39     ` Oleg Nesterov
2017-03-16 15:39       ` Oleg Nesterov
2017-03-16 16:07       ` Tejun Heo
2017-03-16 16:07         ` Tejun Heo
2017-03-16 16:31         ` Oleg Nesterov
2017-03-16 16:31           ` Oleg Nesterov
2017-03-16 17:41           ` Tejun Heo
2017-03-16 16:05     ` Tejun Heo
2017-03-16 16:17       ` Oleg Nesterov
2017-03-16 16:17         ` Oleg Nesterov
2017-03-16 17:03         ` Tejun Heo
2017-03-16 20:54   ` [PATCH v2] cgroup, kthread: " Tejun Heo
2017-03-16 20:54     ` Tejun Heo
2017-03-17 13:50     ` Oleg Nesterov
2017-03-17 14:44       ` Tejun Heo
2017-03-16 14:54 ` [PATCH 1/2] kthread: add barriers to set_kthread_struct() and to_kthread() Oleg Nesterov
2017-03-16 15:33   ` Tejun Heo
2017-03-16 15:38     ` Tejun Heo
2017-03-16 15:46       ` Oleg Nesterov
2017-03-16 15:55       ` Peter Zijlstra
2017-03-16 16:09         ` Tejun Heo
2017-03-16 16:14           ` Peter Zijlstra
