linux-kernel.vger.kernel.org archive mirror
* [PATCH v2] sched/core: Don't use dying mm as active_mm of kthreads
@ 2019-07-27 17:10 Waiman Long
  2019-07-29  8:18 ` Qais Yousef
                   ` (2 more replies)
  0 siblings, 3 replies; 26+ messages in thread
From: Waiman Long @ 2019-07-27 17:10 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar
  Cc: linux-kernel, linux-mm, Andrew Morton, Phil Auld, Waiman Long

It was found that a dying mm_struct where the owning task has exited
can stay on as active_mm of kernel threads as long as no other user
tasks run on those CPUs that use it as active_mm. This prolongs the
lifetime of the dying mm, holding up memory and other resources like swap
space that cannot be freed.

Fix that by forcing the kernel threads to use init_mm as the active_mm
if the previous active_mm is dying.

The determination of a dying mm is based on the absence of an owning
task. The selection of the owning task only happens with the CONFIG_MEMCG
option. Without that, there is no simple way to determine the life span
of a given mm. So it falls back to the old behavior.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 include/linux/mm_types.h | 15 +++++++++++++++
 kernel/sched/core.c      | 13 +++++++++++--
 mm/init-mm.c             |  4 ++++
 3 files changed, 30 insertions(+), 2 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 3a37a89eb7a7..32712e78763c 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -623,6 +623,21 @@ static inline bool mm_tlb_flush_nested(struct mm_struct *mm)
 	return atomic_read(&mm->tlb_flush_pending) > 1;
 }
 
+#ifdef CONFIG_MEMCG
+/*
+ * A mm is considered dying if there is no owning task.
+ */
+static inline bool mm_dying(struct mm_struct *mm)
+{
+	return !mm->owner;
+}
+#else
+static inline bool mm_dying(struct mm_struct *mm)
+{
+	return false;
+}
+#endif
+
 struct vm_fault;
 
 /**
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 2b037f195473..923a63262dfd 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3233,13 +3233,22 @@ context_switch(struct rq *rq, struct task_struct *prev,
 	 * Both of these contain the full memory barrier required by
 	 * membarrier after storing to rq->curr, before returning to
 	 * user-space.
+	 *
+	 * If mm is NULL and oldmm is dying (!owner), we switch to
+	 * init_mm instead to make sure that oldmm can be freed ASAP.
 	 */
-	if (!mm) {
+	if (!mm && !mm_dying(oldmm)) {
 		next->active_mm = oldmm;
 		mmgrab(oldmm);
 		enter_lazy_tlb(oldmm, next);
-	} else
+	} else {
+		if (!mm) {
+			mm = &init_mm;
+			next->active_mm = mm;
+			mmgrab(mm);
+		}
 		switch_mm_irqs_off(oldmm, mm, next);
+	}
 
 	if (!prev->mm) {
 		prev->active_mm = NULL;
diff --git a/mm/init-mm.c b/mm/init-mm.c
index a787a319211e..69090a11249c 100644
--- a/mm/init-mm.c
+++ b/mm/init-mm.c
@@ -5,6 +5,7 @@
 #include <linux/spinlock.h>
 #include <linux/list.h>
 #include <linux/cpumask.h>
+#include <linux/sched/task.h>
 
 #include <linux/atomic.h>
 #include <linux/user_namespace.h>
@@ -36,5 +37,8 @@ struct mm_struct init_mm = {
 	.mmlist		= LIST_HEAD_INIT(init_mm.mmlist),
 	.user_ns	= &init_user_ns,
 	.cpu_bitmap	= { [BITS_TO_LONGS(NR_CPUS)] = 0},
+#ifdef CONFIG_MEMCG
+	.owner		= &init_task,
+#endif
 	INIT_MM_CONTEXT(init_mm)
 };
-- 
2.18.1



* Re: [PATCH v2] sched/core: Don't use dying mm as active_mm of kthreads
  2019-07-27 17:10 [PATCH v2] sched/core: Don't use dying mm as active_mm of kthreads Waiman Long
@ 2019-07-29  8:18 ` Qais Yousef
  2019-07-29 21:06   ` Waiman Long
  2019-07-29  8:52 ` Peter Zijlstra
  2019-07-29  9:12 ` Michal Hocko
  2 siblings, 1 reply; 26+ messages in thread
From: Qais Yousef @ 2019-07-29  8:18 UTC (permalink / raw)
  To: Waiman Long
  Cc: Peter Zijlstra, Ingo Molnar, linux-kernel, linux-mm,
	Andrew Morton, Phil Auld

On 07/27/19 13:10, Waiman Long wrote:
> It was found that a dying mm_struct where the owning task has exited
> can stay on as active_mm of kernel threads as long as no other user
> tasks run on those CPUs that use it as active_mm. This prolongs the
> lifetime of the dying mm, holding up memory and other resources like swap
> space that cannot be freed.
> 
> Fix that by forcing the kernel threads to use init_mm as the active_mm
> if the previous active_mm is dying.
> 
> The determination of a dying mm is based on the absence of an owning
> task. The selection of the owning task only happens with the CONFIG_MEMCG
> option. Without that, there is no simple way to determine the life span
> of a given mm. So it falls back to the old behavior.

I don't really know a lot about this code, but does the owner field have
to depend on CONFIG_MEMCG? I.e., can't the owner always be set?

Cheers

--
Qais Yousef


* Re: [PATCH v2] sched/core: Don't use dying mm as active_mm of kthreads
  2019-07-27 17:10 [PATCH v2] sched/core: Don't use dying mm as active_mm of kthreads Waiman Long
  2019-07-29  8:18 ` Qais Yousef
@ 2019-07-29  8:52 ` Peter Zijlstra
  2019-07-29 14:24   ` [PATCH] sched: Clean up active_mm reference counting Peter Zijlstra
                     ` (2 more replies)
  2019-07-29  9:12 ` Michal Hocko
  2 siblings, 3 replies; 26+ messages in thread
From: Peter Zijlstra @ 2019-07-29  8:52 UTC (permalink / raw)
  To: Waiman Long; +Cc: Ingo Molnar, linux-kernel, linux-mm, Andrew Morton, Phil Auld

On Sat, Jul 27, 2019 at 01:10:47PM -0400, Waiman Long wrote:
> It was found that a dying mm_struct where the owning task has exited
> can stay on as active_mm of kernel threads as long as no other user
> tasks run on those CPUs that use it as active_mm. This prolongs the
> lifetime of the dying mm, holding up memory and other resources like swap
> space that cannot be freed.

Sure, but this has been so 'forever', why is it a problem now?

> Fix that by forcing the kernel threads to use init_mm as the active_mm
> if the previous active_mm is dying.
> 
> The determination of a dying mm is based on the absence of an owning
> task. The selection of the owning task only happens with the CONFIG_MEMCG
> option. Without that, there is no simple way to determine the life span
> of a given mm. So it falls back to the old behavior.
> 
> Signed-off-by: Waiman Long <longman@redhat.com>
> ---
>  include/linux/mm_types.h | 15 +++++++++++++++
>  kernel/sched/core.c      | 13 +++++++++++--
>  mm/init-mm.c             |  4 ++++
>  3 files changed, 30 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 3a37a89eb7a7..32712e78763c 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -623,6 +623,21 @@ static inline bool mm_tlb_flush_nested(struct mm_struct *mm)
>  	return atomic_read(&mm->tlb_flush_pending) > 1;
>  }
>  
> +#ifdef CONFIG_MEMCG
> +/*
> + * A mm is considered dying if there is no owning task.
> + */
> +static inline bool mm_dying(struct mm_struct *mm)
> +{
> +	return !mm->owner;
> +}
> +#else
> +static inline bool mm_dying(struct mm_struct *mm)
> +{
> +	return false;
> +}
> +#endif
> +
>  struct vm_fault;

Yuck. So people without memcg will still suffer the terrible 'whatever
it is this patch fixes'.

>  /**
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 2b037f195473..923a63262dfd 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3233,13 +3233,22 @@ context_switch(struct rq *rq, struct task_struct *prev,
>  	 * Both of these contain the full memory barrier required by
>  	 * membarrier after storing to rq->curr, before returning to
>  	 * user-space.
> +	 *
> +	 * If mm is NULL and oldmm is dying (!owner), we switch to
> +	 * init_mm instead to make sure that oldmm can be freed ASAP.
>  	 */
> -	if (!mm) {
> +	if (!mm && !mm_dying(oldmm)) {
>  		next->active_mm = oldmm;
>  		mmgrab(oldmm);
>  		enter_lazy_tlb(oldmm, next);
> -	} else
> +	} else {
> +		if (!mm) {
> +			mm = &init_mm;
> +			next->active_mm = mm;
> +			mmgrab(mm);
> +		}
>  		switch_mm_irqs_off(oldmm, mm, next);
> +	}
>  
>  	if (!prev->mm) {
>  		prev->active_mm = NULL;

Bah, I see we _still_ haven't 'fixed' that code. And you're making an
even bigger mess of it.

Let me go find where that cleanup went.


* Re: [PATCH v2] sched/core: Don't use dying mm as active_mm of kthreads
  2019-07-27 17:10 [PATCH v2] sched/core: Don't use dying mm as active_mm of kthreads Waiman Long
  2019-07-29  8:18 ` Qais Yousef
  2019-07-29  8:52 ` Peter Zijlstra
@ 2019-07-29  9:12 ` Michal Hocko
  2019-07-29 15:27   ` Waiman Long
  2 siblings, 1 reply; 26+ messages in thread
From: Michal Hocko @ 2019-07-29  9:12 UTC (permalink / raw)
  To: Waiman Long
  Cc: Peter Zijlstra, Ingo Molnar, linux-kernel, linux-mm,
	Andrew Morton, Phil Auld

On Sat 27-07-19 13:10:47, Waiman Long wrote:
> It was found that a dying mm_struct where the owning task has exited
> can stay on as active_mm of kernel threads as long as no other user
> tasks run on those CPUs that use it as active_mm. This prolongs the
> lifetime of the dying mm, holding up memory and other resources like swap
> space that cannot be freed.

IIRC use_mm doesn't pin the address space. It only pins the mm_struct
itself. So what exactly is the problem here?
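
For reference, a simplified sketch of what use_mm() did at the time
(condensed from mm/mmu_context.c): it only takes an mm_count reference
via mmgrab(), never an mm_users one, which is why it pins the mm_struct
but not the address space:

void use_mm(struct mm_struct *mm)
{
	struct mm_struct *active_mm;
	struct task_struct *tsk = current;

	task_lock(tsk);
	active_mm = tsk->active_mm;
	if (active_mm != mm) {
		mmgrab(mm);		/* mm_count only; no mmget() */
		tsk->active_mm = mm;
	}
	tsk->mm = mm;
	switch_mm(active_mm, mm, tsk);
	task_unlock(tsk);

	if (active_mm != mm)
		mmdrop(active_mm);
}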

> 
> Fix that by forcing the kernel threads to use init_mm as the active_mm
> if the previous active_mm is dying.
> 
> The determination of a dying mm is based on the absence of an owning
> task. The selection of the owning task only happens with the CONFIG_MEMCG
> option. Without that, there is no simple way to determine the life span
> of a given mm. So it falls back to the old behavior.

Please don't. We really want to remove mm->owner long term.
-- 
Michal Hocko
SUSE Labs


* [PATCH] sched: Clean up active_mm reference counting
  2019-07-29  8:52 ` Peter Zijlstra
@ 2019-07-29 14:24   ` Peter Zijlstra
  2019-07-29 15:01     ` Mathieu Desnoyers
                       ` (2 more replies)
  2019-07-29 14:27   ` [PATCH v2] sched/core: Don't use dying mm as active_mm of kthreads Peter Zijlstra
  2019-07-29 14:51   ` Waiman Long
  2 siblings, 3 replies; 26+ messages in thread
From: Peter Zijlstra @ 2019-07-29 14:24 UTC (permalink / raw)
  To: Waiman Long
  Cc: Ingo Molnar, linux-kernel, linux-mm, Andrew Morton, Phil Auld,
	riel, luto, mathieu.desnoyers

On Mon, Jul 29, 2019 at 10:52:35AM +0200, Peter Zijlstra wrote:
> On Sat, Jul 27, 2019 at 01:10:47PM -0400, Waiman Long wrote:

> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index 2b037f195473..923a63262dfd 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -3233,13 +3233,22 @@ context_switch(struct rq *rq, struct task_struct *prev,
> >  	 * Both of these contain the full memory barrier required by
> >  	 * membarrier after storing to rq->curr, before returning to
> >  	 * user-space.
> > +	 *
> > +	 * If mm is NULL and oldmm is dying (!owner), we switch to
> > +	 * init_mm instead to make sure that oldmm can be freed ASAP.
> >  	 */
> > -	if (!mm) {
> > +	if (!mm && !mm_dying(oldmm)) {
> >  		next->active_mm = oldmm;
> >  		mmgrab(oldmm);
> >  		enter_lazy_tlb(oldmm, next);
> > -	} else
> > +	} else {
> > +		if (!mm) {
> > +			mm = &init_mm;
> > +			next->active_mm = mm;
> > +			mmgrab(mm);
> > +		}
> >  		switch_mm_irqs_off(oldmm, mm, next);
> > +	}
> >  
> >  	if (!prev->mm) {
> >  		prev->active_mm = NULL;
> 
> Bah, I see we _still_ haven't 'fixed' that code. And you're making an
> even bigger mess of it.
> 
> Let me go find where that cleanup went.

---
Subject: sched: Clean up active_mm reference counting
From: Peter Zijlstra <peterz@infradead.org>
Date: Mon Jul 29 16:05:15 CEST 2019

The current active_mm reference counting is confusing and sub-optimal.

Rewrite the code to explicitly consider the 4 separate cases:

    user -> user

	When switching between two user tasks, all we need to consider
	is switch_mm().

    user -> kernel

	When switching from a user task to a kernel task (which
	doesn't have an associated mm) we retain the last mm in our
	active_mm. Increment a reference count on active_mm.

  kernel -> kernel

	When switching between kernel threads, all we need to do is
	pass along the active_mm reference.

  kernel -> user

	When switching between a kernel and user task, we must switch
	from the last active_mm to the next mm, hoping of course that
	these are the same. Decrement a reference on the active_mm.

The code keeps a different order, because as you'll note, both 'to
user' cases require switch_mm().

And where the old code would increment/decrement for the 'kernel ->
kernel' case, the new code observes this is a neutral operation and
avoids touching the reference count.

Cc: riel@surriel.com
Cc: luto@kernel.org
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/core.c |   49 ++++++++++++++++++++++++++++++-------------------
 1 file changed, 30 insertions(+), 19 deletions(-)

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3214,12 +3214,8 @@ static __always_inline struct rq *
 context_switch(struct rq *rq, struct task_struct *prev,
 	       struct task_struct *next, struct rq_flags *rf)
 {
-	struct mm_struct *mm, *oldmm;
-
 	prepare_task_switch(rq, prev, next);
 
-	mm = next->mm;
-	oldmm = prev->active_mm;
 	/*
 	 * For paravirt, this is coupled with an exit in switch_to to
 	 * combine the page table reload and the switch backend into
@@ -3228,22 +3224,37 @@ context_switch(struct rq *rq, struct tas
 	arch_start_context_switch(prev);
 
 	/*
-	 * If mm is non-NULL, we pass through switch_mm(). If mm is
-	 * NULL, we will pass through mmdrop() in finish_task_switch().
-	 * Both of these contain the full memory barrier required by
-	 * membarrier after storing to rq->curr, before returning to
-	 * user-space.
+	 * kernel -> kernel   lazy + transfer active
+	 *   user -> kernel   lazy + mmgrab() active
+	 *
+	 * kernel ->   user   switch + mmdrop() active
+	 *   user ->   user   switch
 	 */
-	if (!mm) {
-		next->active_mm = oldmm;
-		mmgrab(oldmm);
-		enter_lazy_tlb(oldmm, next);
-	} else
-		switch_mm_irqs_off(oldmm, mm, next);
-
-	if (!prev->mm) {
-		prev->active_mm = NULL;
-		rq->prev_mm = oldmm;
+	if (!next->mm) {                                // to kernel
+		enter_lazy_tlb(prev->active_mm, next);
+
+		next->active_mm = prev->active_mm;
+		if (prev->mm)                           // from user
+			mmgrab(prev->active_mm);
+		else
+			prev->active_mm = NULL;
+	} else {                                        // to user
+		/*
+		 * sys_membarrier() requires an smp_mb() between setting
+		 * rq->curr and returning to userspace.
+		 *
+		 * The below provides this either through switch_mm(), or in
+		 * case 'prev->active_mm == next->mm' through
+		 * finish_task_switch()'s mmdrop().
+		 */
+
+		switch_mm_irqs_off(prev->active_mm, next->mm, next);
+
+		if (!prev->mm) {                        // from kernel
+			/* will mmdrop() in finish_task_switch(). */
+			rq->prev_mm = prev->active_mm;
+			prev->active_mm = NULL;
+		}
 	}
 
 	rq->clock_update_flags &= ~(RQCF_ACT_SKIP|RQCF_REQ_SKIP);


* Re: [PATCH v2] sched/core: Don't use dying mm as active_mm of kthreads
  2019-07-29  8:52 ` Peter Zijlstra
  2019-07-29 14:24   ` [PATCH] sched: Clean up active_mm reference counting Peter Zijlstra
@ 2019-07-29 14:27   ` Peter Zijlstra
  2019-07-29 15:22     ` Waiman Long
  2019-07-29 14:51   ` Waiman Long
  2 siblings, 1 reply; 26+ messages in thread
From: Peter Zijlstra @ 2019-07-29 14:27 UTC (permalink / raw)
  To: Waiman Long; +Cc: Ingo Molnar, linux-kernel, linux-mm, Andrew Morton, Phil Auld

On Mon, Jul 29, 2019 at 10:52:35AM +0200, Peter Zijlstra wrote:
> On Sat, Jul 27, 2019 at 01:10:47PM -0400, Waiman Long wrote:
> > It was found that a dying mm_struct where the owning task has exited
> > can stay on as active_mm of kernel threads as long as no other user
> > tasks run on those CPUs that use it as active_mm. This prolongs the
> > lifetime of the dying mm, holding up memory and other resources like swap
> > space that cannot be freed.
> 
> Sure, but this has been so 'forever', why is it a problem now?
> 
> > Fix that by forcing the kernel threads to use init_mm as the active_mm
> > if the previous active_mm is dying.
> > 
> > The determination of a dying mm is based on the absence of an owning
> > task. The selection of the owning task only happens with the CONFIG_MEMCG
> > option. Without that, there is no simple way to determine the life span
> > of a given mm. So it falls back to the old behavior.
> > 
> > Signed-off-by: Waiman Long <longman@redhat.com>
> > ---
> >  include/linux/mm_types.h | 15 +++++++++++++++
> >  kernel/sched/core.c      | 13 +++++++++++--
> >  mm/init-mm.c             |  4 ++++
> >  3 files changed, 30 insertions(+), 2 deletions(-)
> > 
> > diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> > index 3a37a89eb7a7..32712e78763c 100644
> > --- a/include/linux/mm_types.h
> > +++ b/include/linux/mm_types.h
> > @@ -623,6 +623,21 @@ static inline bool mm_tlb_flush_nested(struct mm_struct *mm)
> >  	return atomic_read(&mm->tlb_flush_pending) > 1;
> >  }
> >  
> > +#ifdef CONFIG_MEMCG
> > +/*
> > + * A mm is considered dying if there is no owning task.
> > + */
> > +static inline bool mm_dying(struct mm_struct *mm)
> > +{
> > +	return !mm->owner;
> > +}
> > +#else
> > +static inline bool mm_dying(struct mm_struct *mm)
> > +{
> > +	return false;
> > +}
> > +#endif
> > +
> >  struct vm_fault;
> 
> Yuck. So people without memcg will still suffer the terrible 'whatever
> it is this patch fixes'.

Also; why then not key off that owner tracking to free the resources
(and leave the struct mm around) and avoid touching this scheduling
hot-path ?


* Re: [PATCH v2] sched/core: Don't use dying mm as active_mm of kthreads
  2019-07-29  8:52 ` Peter Zijlstra
  2019-07-29 14:24   ` [PATCH] sched: Clean up active_mm reference counting Peter Zijlstra
  2019-07-29 14:27   ` [PATCH v2] sched/core: Don't use dying mm as active_mm of kthreads Peter Zijlstra
@ 2019-07-29 14:51   ` Waiman Long
  2019-07-29 15:03     ` Peter Zijlstra
  2 siblings, 1 reply; 26+ messages in thread
From: Waiman Long @ 2019-07-29 14:51 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, linux-mm, Andrew Morton, Phil Auld

On 7/29/19 4:52 AM, Peter Zijlstra wrote:
> On Sat, Jul 27, 2019 at 01:10:47PM -0400, Waiman Long wrote:
>> It was found that a dying mm_struct where the owning task has exited
>> can stay on as active_mm of kernel threads as long as no other user
>> tasks run on those CPUs that use it as active_mm. This prolongs the
>> lifetime of the dying mm, holding up memory and other resources like swap
>> space that cannot be freed.
> Sure, but this has been so 'forever', why is it a problem now?

I ran into this problem when running a test program that keeps on
allocating and touching memory, and it eventually fails as the swap
space is full. After the failure, I could not rerun the test program
again because the swap space remained full. I finally tracked it down
to the fact that the mm stayed on as the active_mm of kernel threads. I
had to make sure that all the idle CPUs got a user task to run to bump
the dying mm off the active_mm of those CPUs, but this is just a
workaround, not a solution to this problem.

>
>> Fix that by forcing the kernel threads to use init_mm as the active_mm
>> if the previous active_mm is dying.
>>
>> The determination of a dying mm is based on the absence of an owning
>> task. The selection of the owning task only happens with the CONFIG_MEMCG
>> option. Without that, there is no simple way to determine the life span
>> of a given mm. So it falls back to the old behavior.
>>
>> Signed-off-by: Waiman Long <longman@redhat.com>
>> ---
>>  include/linux/mm_types.h | 15 +++++++++++++++
>>  kernel/sched/core.c      | 13 +++++++++++--
>>  mm/init-mm.c             |  4 ++++
>>  3 files changed, 30 insertions(+), 2 deletions(-)
>>
>> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
>> index 3a37a89eb7a7..32712e78763c 100644
>> --- a/include/linux/mm_types.h
>> +++ b/include/linux/mm_types.h
>> @@ -623,6 +623,21 @@ static inline bool mm_tlb_flush_nested(struct mm_struct *mm)
>>  	return atomic_read(&mm->tlb_flush_pending) > 1;
>>  }
>>  
>> +#ifdef CONFIG_MEMCG
>> +/*
>> + * A mm is considered dying if there is no owning task.
>> + */
>> +static inline bool mm_dying(struct mm_struct *mm)
>> +{
>> +	return !mm->owner;
>> +}
>> +#else
>> +static inline bool mm_dying(struct mm_struct *mm)
>> +{
>> +	return false;
>> +}
>> +#endif
>> +
>>  struct vm_fault;
> Yuck. So people without memcg will still suffer the terrible 'whatever
> it is this patch fixes'.
>
That is true.
>>  /**
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index 2b037f195473..923a63262dfd 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -3233,13 +3233,22 @@ context_switch(struct rq *rq, struct task_struct *prev,
>>  	 * Both of these contain the full memory barrier required by
>>  	 * membarrier after storing to rq->curr, before returning to
>>  	 * user-space.
>> +	 *
>> +	 * If mm is NULL and oldmm is dying (!owner), we switch to
>> +	 * init_mm instead to make sure that oldmm can be freed ASAP.
>>  	 */
>> -	if (!mm) {
>> +	if (!mm && !mm_dying(oldmm)) {
>>  		next->active_mm = oldmm;
>>  		mmgrab(oldmm);
>>  		enter_lazy_tlb(oldmm, next);
>> -	} else
>> +	} else {
>> +		if (!mm) {
>> +			mm = &init_mm;
>> +			next->active_mm = mm;
>> +			mmgrab(mm);
>> +		}
>>  		switch_mm_irqs_off(oldmm, mm, next);
>> +	}
>>  
>>  	if (!prev->mm) {
>>  		prev->active_mm = NULL;
> Bah, I see we _still_ haven't 'fixed' that code. And you're making an
> even bigger mess of it.
>
> Let me go find where that cleanup went.

It would be nice if there were a better solution.

Cheers,
Longman



* Re: [PATCH] sched: Clean up active_mm reference counting
  2019-07-29 14:24   ` [PATCH] sched: Clean up active_mm reference counting Peter Zijlstra
@ 2019-07-29 15:01     ` Mathieu Desnoyers
  2019-07-29 15:16     ` Waiman Long
  2019-07-29 15:29     ` Rik van Riel
  2 siblings, 0 replies; 26+ messages in thread
From: Mathieu Desnoyers @ 2019-07-29 15:01 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Waiman Long, Ingo Molnar, linux-kernel, linux-mm, Andrew Morton,
	Phil Auld, riel, Andy Lutomirski

----- On Jul 29, 2019, at 10:24 AM, Peter Zijlstra peterz@infradead.org wrote:
[...]
> ---
> Subject: sched: Clean up active_mm reference counting
> From: Peter Zijlstra <peterz@infradead.org>
> Date: Mon Jul 29 16:05:15 CEST 2019
> 
> The current active_mm reference counting is confusing and sub-optimal.
> 
> Rewrite the code to explicitly consider the 4 separate cases:
> 
>    user -> user
> 
>	When switching between two user tasks, all we need to consider
>	is switch_mm().
> 
>    user -> kernel
> 
>	When switching from a user task to a kernel task (which
>	doesn't have an associated mm) we retain the last mm in our
>	active_mm. Increment a reference count on active_mm.
> 
>  kernel -> kernel
> 
>	When switching between kernel threads, all we need to do is
>	pass along the active_mm reference.
> 
>  kernel -> user
> 
>	When switching between a kernel and user task, we must switch
>	from the last active_mm to the next mm, hoping of course that
>	these are the same. Decrement a reference on the active_mm.
> 
> The code keeps a different order, because as you'll note, both 'to
> user' cases require switch_mm().
> 
> And where the old code would increment/decrement for the 'kernel ->
> kernel' case, the new code observes this is a neutral operation and
> avoids touching the reference count.

Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>

> 
> Cc: riel@surriel.com
> Cc: luto@kernel.org
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
> kernel/sched/core.c |   49 ++++++++++++++++++++++++++++++-------------------
> 1 file changed, 30 insertions(+), 19 deletions(-)
> 
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3214,12 +3214,8 @@ static __always_inline struct rq *
> context_switch(struct rq *rq, struct task_struct *prev,
> 	       struct task_struct *next, struct rq_flags *rf)
> {
> -	struct mm_struct *mm, *oldmm;
> -
> 	prepare_task_switch(rq, prev, next);
> 
> -	mm = next->mm;
> -	oldmm = prev->active_mm;
> 	/*
> 	 * For paravirt, this is coupled with an exit in switch_to to
> 	 * combine the page table reload and the switch backend into
> @@ -3228,22 +3224,37 @@ context_switch(struct rq *rq, struct tas
> 	arch_start_context_switch(prev);
> 
> 	/*
> -	 * If mm is non-NULL, we pass through switch_mm(). If mm is
> -	 * NULL, we will pass through mmdrop() in finish_task_switch().
> -	 * Both of these contain the full memory barrier required by
> -	 * membarrier after storing to rq->curr, before returning to
> -	 * user-space.
> +	 * kernel -> kernel   lazy + transfer active
> +	 *   user -> kernel   lazy + mmgrab() active
> +	 *
> +	 * kernel ->   user   switch + mmdrop() active
> +	 *   user ->   user   switch
> 	 */
> -	if (!mm) {
> -		next->active_mm = oldmm;
> -		mmgrab(oldmm);
> -		enter_lazy_tlb(oldmm, next);
> -	} else
> -		switch_mm_irqs_off(oldmm, mm, next);
> -
> -	if (!prev->mm) {
> -		prev->active_mm = NULL;
> -		rq->prev_mm = oldmm;
> +	if (!next->mm) {                                // to kernel
> +		enter_lazy_tlb(prev->active_mm, next);
> +
> +		next->active_mm = prev->active_mm;
> +		if (prev->mm)                           // from user
> +			mmgrab(prev->active_mm);
> +		else
> +			prev->active_mm = NULL;
> +	} else {                                        // to user
> +		/*
> +		 * sys_membarrier() requires an smp_mb() between setting
> +		 * rq->curr and returning to userspace.
> +		 *
> +		 * The below provides this either through switch_mm(), or in
> +		 * case 'prev->active_mm == next->mm' through
> +		 * finish_task_switch()'s mmdrop().
> +		 */
> +
> +		switch_mm_irqs_off(prev->active_mm, next->mm, next);
> +
> +		if (!prev->mm) {                        // from kernel
> +			/* will mmdrop() in finish_task_switch(). */
> +			rq->prev_mm = prev->active_mm;
> +			prev->active_mm = NULL;
> +		}
> 	}
> 
>  	rq->clock_update_flags &= ~(RQCF_ACT_SKIP|RQCF_REQ_SKIP);

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


* Re: [PATCH v2] sched/core: Don't use dying mm as active_mm of kthreads
  2019-07-29 14:51   ` Waiman Long
@ 2019-07-29 15:03     ` Peter Zijlstra
  2019-07-29 15:28       ` Rik van Riel
                         ` (2 more replies)
  0 siblings, 3 replies; 26+ messages in thread
From: Peter Zijlstra @ 2019-07-29 15:03 UTC (permalink / raw)
  To: Waiman Long
  Cc: Ingo Molnar, linux-kernel, linux-mm, Andrew Morton, Phil Auld,
	Rik van Riel, Andy Lutomirski

On Mon, Jul 29, 2019 at 10:51:51AM -0400, Waiman Long wrote:
> On 7/29/19 4:52 AM, Peter Zijlstra wrote:
> > On Sat, Jul 27, 2019 at 01:10:47PM -0400, Waiman Long wrote:
> >> It was found that a dying mm_struct where the owning task has exited
> >> can stay on as active_mm of kernel threads as long as no other user
> >> tasks run on those CPUs that use it as active_mm. This prolongs the
> >> lifetime of the dying mm, holding up memory and other resources like swap
> >> space that cannot be freed.
> > Sure, but this has been so 'forever', why is it a problem now?
> 
> I ran into this problem when running a test program that keeps on
> allocating and touching memory, and it eventually fails as the swap
> space is full. After the failure, I could not rerun the test program
> again because the swap space remained full. I finally tracked it down
> to the fact that the mm stayed on as the active_mm of kernel threads. I
> had to make sure that all the idle CPUs got a user task to run to bump
> the dying mm off the active_mm of those CPUs, but this is just a
> workaround, not a solution to this problem.

The 'sad' part is that x86 already switches to init_mm on idle and we
only keep the active_mm around for 'stupid'.

Rik and Andy were working on getting that 'fixed' a while ago, not sure
where that went.


* Re: [PATCH] sched: Clean up active_mm reference counting
  2019-07-29 14:24   ` [PATCH] sched: Clean up active_mm reference counting Peter Zijlstra
  2019-07-29 15:01     ` Mathieu Desnoyers
@ 2019-07-29 15:16     ` Waiman Long
  2019-07-29 15:22       ` Peter Zijlstra
  2019-07-29 15:29     ` Rik van Riel
  2 siblings, 1 reply; 26+ messages in thread
From: Waiman Long @ 2019-07-29 15:16 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, linux-mm, Andrew Morton, Phil Auld,
	riel, luto, mathieu.desnoyers

On 7/29/19 10:24 AM, Peter Zijlstra wrote:
> On Mon, Jul 29, 2019 at 10:52:35AM +0200, Peter Zijlstra wrote:
>
> ---
> Subject: sched: Clean up active_mm reference counting
> From: Peter Zijlstra <peterz@infradead.org>
> Date: Mon Jul 29 16:05:15 CEST 2019
>
> The current active_mm reference counting is confusing and sub-optimal.
>
> Rewrite the code to explicitly consider the 4 separate cases:
>
>     user -> user
>
> 	When switching between two user tasks, all we need to consider
> 	is switch_mm().
>
>     user -> kernel
>
> 	When switching from a user task to a kernel task (which
> 	doesn't have an associated mm) we retain the last mm in our
> 	active_mm. Increment a reference count on active_mm.
>
>   kernel -> kernel
>
> 	When switching between kernel threads, all we need to do is
> 	pass along the active_mm reference.
>
>   kernel -> user
>
> 	When switching between a kernel and user task, we must switch
> 	from the last active_mm to the next mm, hoping of course that
> 	these are the same. Decrement a reference on the active_mm.
>
> The code keeps a different order, because as you'll note, both 'to
> user' cases require switch_mm().
>
> And where the old code would increment/decrement for the 'kernel ->
> kernel' case, the new code observes this is a neutral operation and
> avoids touching the reference count.

I am aware of that behavior, which is indeed redundant, but it is not
what I am trying to fix, so I left it alone in my patch.


>
> Cc: riel@surriel.com
> Cc: luto@kernel.org
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
>  kernel/sched/core.c |   49 ++++++++++++++++++++++++++++++-------------------
>  1 file changed, 30 insertions(+), 19 deletions(-)
>
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3214,12 +3214,8 @@ static __always_inline struct rq *
>  context_switch(struct rq *rq, struct task_struct *prev,
>  	       struct task_struct *next, struct rq_flags *rf)
>  {
> -	struct mm_struct *mm, *oldmm;
> -
>  	prepare_task_switch(rq, prev, next);
>  
> -	mm = next->mm;
> -	oldmm = prev->active_mm;
>  	/*
>  	 * For paravirt, this is coupled with an exit in switch_to to
>  	 * combine the page table reload and the switch backend into
> @@ -3228,22 +3224,37 @@ context_switch(struct rq *rq, struct tas
>  	arch_start_context_switch(prev);
>  
>  	/*
> -	 * If mm is non-NULL, we pass through switch_mm(). If mm is
> -	 * NULL, we will pass through mmdrop() in finish_task_switch().
> -	 * Both of these contain the full memory barrier required by
> -	 * membarrier after storing to rq->curr, before returning to
> -	 * user-space.
> +	 * kernel -> kernel   lazy + transfer active
> +	 *   user -> kernel   lazy + mmgrab() active
> +	 *
> +	 * kernel ->   user   switch + mmdrop() active
> +	 *   user ->   user   switch
>  	 */
> -	if (!mm) {
> -		next->active_mm = oldmm;
> -		mmgrab(oldmm);
> -		enter_lazy_tlb(oldmm, next);
> -	} else
> -		switch_mm_irqs_off(oldmm, mm, next);
> -
> -	if (!prev->mm) {
> -		prev->active_mm = NULL;
> -		rq->prev_mm = oldmm;
> +	if (!next->mm) {                                // to kernel
> +		enter_lazy_tlb(prev->active_mm, next);
> +
> +		next->active_mm = prev->active_mm;
> +		if (prev->mm)                           // from user
> +			mmgrab(prev->active_mm);
> +		else
> +			prev->active_mm = NULL;
> +	} else {                                        // to user
> +		/*
> +		 * sys_membarrier() requires an smp_mb() between setting
> +		 * rq->curr and returning to userspace.
> +		 *
> +		 * The below provides this either through switch_mm(), or in
> +		 * case 'prev->active_mm == next->mm' through
> +		 * finish_task_switch()'s mmdrop().
> +		 */
> +
> +		switch_mm_irqs_off(prev->active_mm, next->mm, next);
> +
> +		if (!prev->mm) {                        // from kernel
> +			/* will mmdrop() in finish_task_switch(). */
> +			rq->prev_mm = prev->active_mm;
> +			prev->active_mm = NULL;
> +		}
>  	}
>  
>  	rq->clock_update_flags &= ~(RQCF_ACT_SKIP|RQCF_REQ_SKIP);

This patch looks fine to me; I don't see any problem in its logic.

Acked-by: Waiman Long <longman@redhat.com>

The problem that I am trying to fix is in the kernel->kernel case where
the active_mm just gets passed along. I would like to just bump the
active_mm off if it is dying. I will see what I can do to make it work
even with !CONFIG_MEMCG.

Cheers,
Longman



* Re: [PATCH v2] sched/core: Don't use dying mm as active_mm of kthreads
  2019-07-29 14:27   ` [PATCH v2] sched/core: Don't use dying mm as active_mm of kthreads Peter Zijlstra
@ 2019-07-29 15:22     ` Waiman Long
  2019-07-29 15:38       ` Peter Zijlstra
  0 siblings, 1 reply; 26+ messages in thread
From: Waiman Long @ 2019-07-29 15:22 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, linux-mm, Andrew Morton, Phil Auld

On 7/29/19 10:27 AM, Peter Zijlstra wrote:
> On Mon, Jul 29, 2019 at 10:52:35AM +0200, Peter Zijlstra wrote:
>> On Sat, Jul 27, 2019 at 01:10:47PM -0400, Waiman Long wrote:
>>> It was found that a dying mm_struct where the owning task has exited
>>> can stay on as active_mm of kernel threads as long as no other user
>>> tasks run on those CPUs that use it as active_mm. This prolongs the
>>> lifetime of the dying mm, holding up memory and other resources like swap
>>> space that cannot be freed.
>> Sure, but this has been so 'forever', why is it a problem now?
>>
>>> Fix that by forcing the kernel threads to use init_mm as the active_mm
>>> if the previous active_mm is dying.
>>>
>>> The determination of a dying mm is based on the absence of an owning
>>> task. The selection of the owning task only happens with the CONFIG_MEMCG
>>> option. Without that, there is no simple way to determine the life span
>>> of a given mm. So it falls back to the old behavior.
>>>
>>> Signed-off-by: Waiman Long <longman@redhat.com>
>>> ---
>>>  include/linux/mm_types.h | 15 +++++++++++++++
>>>  kernel/sched/core.c      | 13 +++++++++++--
>>>  mm/init-mm.c             |  4 ++++
>>>  3 files changed, 30 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
>>> index 3a37a89eb7a7..32712e78763c 100644
>>> --- a/include/linux/mm_types.h
>>> +++ b/include/linux/mm_types.h
>>> @@ -623,6 +623,21 @@ static inline bool mm_tlb_flush_nested(struct mm_struct *mm)
>>>  	return atomic_read(&mm->tlb_flush_pending) > 1;
>>>  }
>>>  
>>> +#ifdef CONFIG_MEMCG
>>> +/*
>>> + * A mm is considered dying if there is no owning task.
>>> + */
>>> +static inline bool mm_dying(struct mm_struct *mm)
>>> +{
>>> +	return !mm->owner;
>>> +}
>>> +#else
>>> +static inline bool mm_dying(struct mm_struct *mm)
>>> +{
>>> +	return false;
>>> +}
>>> +#endif
>>> +
>>>  struct vm_fault;
>> Yuck. So people without memcg will still suffer the terrible 'whatever
>> it is this patch fixes'.
> Also; why then not key off that owner tracking to free the resources
> (and leave the struct mm around) and avoid touching this scheduling
> hot-path ?

The resources are pinned by the reference count. Making a special case
will certainly mess up the existing code.

It is actually a problem for systems that are mostly idle. Only the
kernel->kernel case needs to be updated. If the CPUs aren't busy running
user tasks, a little bit more overhead shouldn't really hurt IMHO.

Cheers,
Longman



* Re: [PATCH] sched: Clean up active_mm reference counting
  2019-07-29 15:16     ` Waiman Long
@ 2019-07-29 15:22       ` Peter Zijlstra
  0 siblings, 0 replies; 26+ messages in thread
From: Peter Zijlstra @ 2019-07-29 15:22 UTC (permalink / raw)
  To: Waiman Long
  Cc: Ingo Molnar, linux-kernel, linux-mm, Andrew Morton, Phil Auld,
	riel, luto, mathieu.desnoyers

On Mon, Jul 29, 2019 at 11:16:55AM -0400, Waiman Long wrote:
> On 7/29/19 10:24 AM, Peter Zijlstra wrote:
> > On Mon, Jul 29, 2019 at 10:52:35AM +0200, Peter Zijlstra wrote:
> >
> > ---
> > Subject: sched: Clean up active_mm reference counting
> > From: Peter Zijlstra <peterz@infradead.org>
> > Date: Mon Jul 29 16:05:15 CEST 2019
> >
> > The current active_mm reference counting is confusing and sub-optimal.
> >
> > Rewrite the code to explicitly consider the 4 separate cases:
> >
> >     user -> user
> >
> > 	When switching between two user tasks, all we need to consider
> > 	is switch_mm().
> >
> >     user -> kernel
> >
> > 	When switching from a user task to a kernel task (which
> > 	doesn't have an associated mm) we retain the last mm in our
> > 	active_mm. Increment a reference count on active_mm.
> >
> >   kernel -> kernel
> >
> > 	When switching between kernel threads, all we need to do is
> > 	pass along the active_mm reference.
> >
> >   kernel -> user
> >
> > 	When switching between a kernel and user task, we must switch
> > 	from the last active_mm to the next mm, hoping of course that
> > 	these are the same. Decrement a reference on the active_mm.
> >
> > The code keeps a different order, because as you'll note, both 'to
> > user' cases require switch_mm().
> >
> > And where the old code would increment/decrement for the 'kernel ->
> > kernel' case, the new code observes this is a neutral operation and
> > avoids touching the reference count.
> 
> I am aware of that behavior which is indeed redundant, but it is not
> what I am trying to fix and so I kind of leave it alone in my patch.

Oh sure; and it's not all that important either. It is just that every
time I look at that code I get confused.

On top of that, the new code is easier to rip the active_mm stuff out of,
which is where it came from.




* Re: [PATCH v2] sched/core: Don't use dying mm as active_mm of kthreads
  2019-07-29  9:12 ` Michal Hocko
@ 2019-07-29 15:27   ` Waiman Long
  2019-07-29 18:58     ` Michal Hocko
  0 siblings, 1 reply; 26+ messages in thread
From: Waiman Long @ 2019-07-29 15:27 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Peter Zijlstra, Ingo Molnar, linux-kernel, linux-mm,
	Andrew Morton, Phil Auld

On 7/29/19 5:12 AM, Michal Hocko wrote:
> On Sat 27-07-19 13:10:47, Waiman Long wrote:
>> It was found that a dying mm_struct where the owning task has exited
>> can stay on as active_mm of kernel threads as long as no other user
>> tasks run on those CPUs that use it as active_mm. This prolongs the
>> lifetime of the dying mm, holding up memory and other resources like swap
>> space that cannot be freed.
> IIRC use_mm doesn't pin the address space. It only pins the mm_struct
> itself. So what exactly is the problem here?

As explained in my response to Peter, I found that resources like swap
space were depleted even after the exit of the offending program in a
mostly idle system. This patch is to make sure that those resources get
freed after program exit ASAP.

>> Fix that by forcing the kernel threads to use init_mm as the active_mm
>> if the previous active_mm is dying.
>>
>> The determination of a dying mm is based on the absence of an owning
>> task. The selection of the owning task only happens with the CONFIG_MEMCG
>> option. Without that, there is no simple way to determine the life span
>> of a given mm. So it falls back to the old behavior.
> Please don't. We really want to remove mm->owner long term.

OK, if that is the case, I will need to find an alternative way to
determine if an mm is to be freed soon and perhaps set a flag to
indicate that.
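
For instance, a sketch of one such alternative (mm_dying_alt() is a
hypothetical helper, not an existing kernel API): key off mm_users
instead of mm->owner, since the address space is torn down once the
last user reference is dropped, independent of CONFIG_MEMCG:

static inline bool mm_dying_alt(struct mm_struct *mm)
{
	/* No user references left: the address space is being (or has
	 * already been) torn down. */
	return !atomic_read(&mm->mm_users);
}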

Thanks,
Longman



* Re: [PATCH v2] sched/core: Don't use dying mm as active_mm of kthreads
  2019-07-29 15:03     ` Peter Zijlstra
@ 2019-07-29 15:28       ` Rik van Riel
  2019-07-29 15:44         ` Peter Zijlstra
  2019-07-29 15:37       ` Waiman Long
  2019-07-29 16:22       ` Andy Lutomirski
  2 siblings, 1 reply; 26+ messages in thread
From: Rik van Riel @ 2019-07-29 15:28 UTC (permalink / raw)
  To: Peter Zijlstra, Waiman Long
  Cc: Ingo Molnar, linux-kernel, linux-mm, Andrew Morton, Phil Auld,
	Andy Lutomirski


On Mon, 2019-07-29 at 17:03 +0200, Peter Zijlstra wrote:

> The 'sad' part is that x86 already switches to init_mm on idle and we
> only keep the active_mm around for 'stupid'.

Wait, where do we do that?

> Rik and Andy were working on getting that 'fixed' a while ago, not
> sure
> where that went.

My lazy TLB stuff got merged last year. 

Did we miss a spot somewhere, where the code still
quietly switches to init_mm for no good reason?

-- 
All Rights Reversed.



* Re: [PATCH] sched: Clean up active_mm reference counting
  2019-07-29 14:24   ` [PATCH] sched: Clean up active_mm reference counting Peter Zijlstra
  2019-07-29 15:01     ` Mathieu Desnoyers
  2019-07-29 15:16     ` Waiman Long
@ 2019-07-29 15:29     ` Rik van Riel
  2 siblings, 0 replies; 26+ messages in thread
From: Rik van Riel @ 2019-07-29 15:29 UTC (permalink / raw)
  To: Peter Zijlstra, Waiman Long
  Cc: Ingo Molnar, linux-kernel, linux-mm, Andrew Morton, Phil Auld,
	luto, mathieu.desnoyers


On Mon, 2019-07-29 at 16:24 +0200, Peter Zijlstra wrote:

> Subject: sched: Clean up active_mm reference counting
> From: Peter Zijlstra <peterz@infradead.org>
> Date: Mon Jul 29 16:05:15 CEST 2019
> 
> The current active_mm reference counting is confusing and sub-
> optimal.
> 
> Rewrite the code to explicitly consider the 4 separate cases:
> 

Reviewed-by: Rik van Riel <riel@surriel.com>

-- 
All Rights Reversed.



* Re: [PATCH v2] sched/core: Don't use dying mm as active_mm of kthreads
  2019-07-29 15:03     ` Peter Zijlstra
  2019-07-29 15:28       ` Rik van Riel
@ 2019-07-29 15:37       ` Waiman Long
  2019-07-29 16:12         ` Rik van Riel
  2019-07-29 16:22       ` Andy Lutomirski
  2 siblings, 1 reply; 26+ messages in thread
From: Waiman Long @ 2019-07-29 15:37 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, linux-mm, Andrew Morton, Phil Auld,
	Rik van Riel, Andy Lutomirski

On 7/29/19 11:03 AM, Peter Zijlstra wrote:
> On Mon, Jul 29, 2019 at 10:51:51AM -0400, Waiman Long wrote:
>> On 7/29/19 4:52 AM, Peter Zijlstra wrote:
>>> On Sat, Jul 27, 2019 at 01:10:47PM -0400, Waiman Long wrote:
>>>> It was found that a dying mm_struct where the owning task has exited
>>>> can stay on as active_mm of kernel threads as long as no other user
>>>> tasks run on those CPUs that use it as active_mm. This prolongs the
>>>> lifetime of the dying mm, holding up memory and other resources like swap
>>>> space that cannot be freed.
>>> Sure, but this has been so 'forever', why is it a problem now?
>> I ran into this problem when running a test program that keeps on
>> allocating and touching memory, and it eventually fails as the swap
>> space is full. After the failure, I could not rerun the test program
>> again because the swap space remained full. I finally tracked it down
>> to the fact that the mm stayed on as the active_mm of kernel threads. I
>> had to make sure that all the idle CPUs got a user task to run to bump
>> the dying mm off the active_mm of those CPUs, but this is just a
>> workaround, not a solution to this problem.
> The 'sad' part is that x86 already switches to init_mm on idle and we
> only keep the active_mm around for 'stupid'.
>
> Rik and Andy were working on getting that 'fixed' a while ago, not sure
> where that went.

Good, perhaps the right thing to do is for the idle->kernel case to keep
init_mm as the active_mm instead of reusing whatever was left behind the
last time around.

Cheers,
Longman



* Re: [PATCH v2] sched/core: Don't use dying mm as active_mm of kthreads
  2019-07-29 15:22     ` Waiman Long
@ 2019-07-29 15:38       ` Peter Zijlstra
  0 siblings, 0 replies; 26+ messages in thread
From: Peter Zijlstra @ 2019-07-29 15:38 UTC (permalink / raw)
  To: Waiman Long
  Cc: Ingo Molnar, linux-kernel, linux-mm, Andrew Morton, Phil Auld,
	Will Deacon, Rik van Riel, Andy Lutomirski

On Mon, Jul 29, 2019 at 11:22:16AM -0400, Waiman Long wrote:
> On 7/29/19 10:27 AM, Peter Zijlstra wrote:

> > Also; why then not key off that owner tracking to free the resources
> > (and leave the struct mm around) and avoid touching this scheduling
> > hot-path ?
> 
> The resources are pinned by the reference count. Making a special case
> will certainly mess up the existing code.
> 
> It is actually a problem for systems that are mostly idle. Only the
> kernel->kernel case needs to be updated. If the CPUs isn't busy running
> user tasks, a little bit more overhead shouldn't really hurt IMHO.

But when you cannot find a new owner, you can start to strip the
mm_struct. That is, what's stopping you from freeing swap reservations
when that happens?

That is, I think the moment mm_users drops to 0, you can destroy the
actual address space. But you have to keep the mm_struct around until
mm_count goes to 0.

This is going by the comments on mmget() and mmgrab(); they forever
confuse me.
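
A minimal sketch of the convention those comments describe (an
illustration only; the real helpers live in include/linux/sched/mm.h
and kernel/fork.c):

static void mm_refcount_sketch(struct mm_struct *mm)
{
	/*
	 * mm_users pins the address space itself: VMAs, page tables,
	 * swap reservations. The last mmput() tears all of that down.
	 */
	if (mmget_not_zero(mm)) {
		/* safe to touch the address space here */
		mmput(mm);
	}

	/*
	 * mm_count pins only struct mm_struct. Lazy active_mm
	 * references are of this kind: they keep the struct, but not
	 * the address space, alive.
	 */
	mmgrab(mm);
	/* the address space may already be gone here (mm_users == 0) */
	mmdrop(mm);
}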


* Re: [PATCH v2] sched/core: Don't use dying mm as active_mm of kthreads
  2019-07-29 15:28       ` Rik van Riel
@ 2019-07-29 15:44         ` Peter Zijlstra
  2019-07-29 16:10           ` Rik van Riel
  0 siblings, 1 reply; 26+ messages in thread
From: Peter Zijlstra @ 2019-07-29 15:44 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Waiman Long, Ingo Molnar, linux-kernel, linux-mm, Andrew Morton,
	Phil Auld, Andy Lutomirski

On Mon, Jul 29, 2019 at 11:28:04AM -0400, Rik van Riel wrote:
> On Mon, 2019-07-29 at 17:03 +0200, Peter Zijlstra wrote:
> 
> > The 'sad' part is that x86 already switches to init_mm on idle and we
> > only keep the active_mm around for 'stupid'.
> 
> Wait, where do we do that?

drivers/idle/intel_idle.c:              leave_mm(cpu);
drivers/acpi/processor_idle.c:  acpi_unlazy_tlb(smp_processor_id());
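
Roughly, that path looks like this (a simplified sketch of leave_mm()
from the arch/x86/mm/tlb.c of this era):

void leave_mm(int cpu)
{
	struct mm_struct *loaded_mm = this_cpu_read(cpu_tlbstate.loaded_mm);

	/* Nothing to do if this CPU is already on init_mm. */
	if (loaded_mm == &init_mm)
		return;

	/* Only lazy-TLB CPUs are expected here. */
	WARN_ON(!this_cpu_read(cpu_tlbstate.is_lazy));

	switch_mm(NULL, &init_mm, NULL);
}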

> > Rik and Andy were working on getting that 'fixed' a while ago, not
> > sure
> > where that went.
> 
> My lazy TLB stuff got merged last year. 

Yes, but we never got around to getting rid of active_mm for x86, right?


* Re: [PATCH v2] sched/core: Don't use dying mm as active_mm of kthreads
  2019-07-29 15:44         ` Peter Zijlstra
@ 2019-07-29 16:10           ` Rik van Riel
  2019-07-29 16:26             ` Peter Zijlstra
  0 siblings, 1 reply; 26+ messages in thread
From: Rik van Riel @ 2019-07-29 16:10 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Waiman Long, Ingo Molnar, linux-kernel, linux-mm, Andrew Morton,
	Phil Auld, Andy Lutomirski


On Mon, 2019-07-29 at 17:44 +0200, Peter Zijlstra wrote:
> On Mon, Jul 29, 2019 at 11:28:04AM -0400, Rik van Riel wrote:
> > On Mon, 2019-07-29 at 17:03 +0200, Peter Zijlstra wrote:
> > 
> > > The 'sad' part is that x86 already switches to init_mm on idle
> > > and we
> > > only keep the active_mm around for 'stupid'.
> > 
> > Wait, where do we do that?
> 
> drivers/idle/intel_idle.c:              leave_mm(cpu);
> drivers/acpi/processor_idle.c:  acpi_unlazy_tlb(smp_processor_id());

This is only done for deeper c-states, isn't it?

> > > Rik and Andy were working on getting that 'fixed' a while ago,
> > > not
> > > sure
> > > where that went.
> > 
> > My lazy TLB stuff got merged last year. 
> 
> Yes, but we never got around to getting rid of active_mm for x86,
> right?

True, we still use active_mm. Getting rid of the
active_mm refcounting altogether did not look
entirely worth the hassle.

-- 
All Rights Reversed.



* Re: [PATCH v2] sched/core: Don't use dying mm as active_mm of kthreads
  2019-07-29 15:37       ` Waiman Long
@ 2019-07-29 16:12         ` Rik van Riel
  0 siblings, 0 replies; 26+ messages in thread
From: Rik van Riel @ 2019-07-29 16:12 UTC (permalink / raw)
  To: Waiman Long, Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, linux-mm, Andrew Morton, Phil Auld,
	Andy Lutomirski


On Mon, 2019-07-29 at 11:37 -0400, Waiman Long wrote:
> On 7/29/19 11:03 AM, Peter Zijlstra wrote:
> > On Mon, Jul 29, 2019 at 10:51:51AM -0400, Waiman Long wrote:
> > > On 7/29/19 4:52 AM, Peter Zijlstra wrote:
> > > > On Sat, Jul 27, 2019 at 01:10:47PM -0400, Waiman Long wrote:
> > > > > It was found that a dying mm_struct where the owning task has
> > > > > exited
> > > > > can stay on as active_mm of kernel threads as long as no
> > > > > other user
> > > > > tasks run on those CPUs that use it as active_mm. This
> > > > > prolongs the
> > > > > lifetime of the dying mm, holding up memory and other resources
> > > > > like swap
> > > > > space that cannot be freed.
> > > > Sure, but this has been so 'forever', why is it a problem now?
> > > I ran into this problem when running a test program that keeps on
> > > allocating and touching memory, and it eventually fails as the
> > > swap space is full. After the failure, I could not rerun the test
> > > program again because the swap space remained full. I finally
> > > tracked it down to the fact that the mm stayed on as the active_mm
> > > of kernel threads. I had to make sure that all the idle CPUs got a
> > > user task to run to bump the dying mm off the active_mm of those
> > > CPUs, but this is just a workaround, not a solution to this
> > > problem.
> > The 'sad' part is that x86 already switches to init_mm on idle and
> > we
> > only keep the active_mm around for 'stupid'.
> > 
> > Rik and Andy were working on getting that 'fixed' a while ago, not
> > sure
> > where that went.
> 
> Good, perhaps the right thing to do is for the idle->kernel case to
> keep init_mm as the active_mm instead of reusing whatever was left
> behind the last time around.

Absolutely not. That creates heavy cache line
contention on the mm_cpumask as we switch the
mm out and back in after an idle period.

The cache line contention on the mm_cpumask
alone can take up as much as a percent of
CPU time on a very busy system with a large
multi-threaded application, multiple sockets,
and lots of context switches.

-- 
All Rights Reversed.



* Re: [PATCH v2] sched/core: Don't use dying mm as active_mm of kthreads
  2019-07-29 15:03     ` Peter Zijlstra
  2019-07-29 15:28       ` Rik van Riel
  2019-07-29 15:37       ` Waiman Long
@ 2019-07-29 16:22       ` Andy Lutomirski
  2 siblings, 0 replies; 26+ messages in thread
From: Andy Lutomirski @ 2019-07-29 16:22 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Waiman Long, Ingo Molnar, linux-kernel, linux-mm, Andrew Morton,
	Phil Auld, Rik van Riel, Andy Lutomirski



> On Jul 29, 2019, at 8:03 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> 
>> On Mon, Jul 29, 2019 at 10:51:51AM -0400, Waiman Long wrote:
>>> On 7/29/19 4:52 AM, Peter Zijlstra wrote:
>>>> On Sat, Jul 27, 2019 at 01:10:47PM -0400, Waiman Long wrote:
>>>> It was found that a dying mm_struct where the owning task has exited
>>>> can stay on as active_mm of kernel threads as long as no other user
>>>> tasks run on those CPUs that use it as active_mm. This prolongs the
>>>> lifetime of the dying mm, holding up memory and other resources like swap
>>>> space that cannot be freed.
>>> Sure, but this has been so 'forever', why is it a problem now?
>> 
>> I ran into this problem when running a test program that keeps on
>> allocating and touching memory, and it eventually fails as the swap
>> space is full. After the failure, I could not rerun the test program
>> again because the swap space remained full. I finally tracked it down
>> to the fact that the mm stayed on as the active_mm of kernel threads. I
>> had to make sure that all the idle CPUs got a user task to run to bump
>> the dying mm off the active_mm of those CPUs, but this is just a
>> workaround, not a solution to this problem.
> 
> The 'sad' part is that x86 already switches to init_mm on idle and we
> only keep the active_mm around for 'stupid'.
> 
> Rik and Andy were working on getting that 'fixed' a while ago, not sure
> where that went.

I thought the current status was that we don't always switch to init_mm
on idle and instead use a fancier and actually correct flushing routine
that only flushes idle CPUs when page tables are freed. I still think we
should be able to kill active_mm in favor of explicit refcounting in the
arch code.


* Re: [PATCH v2] sched/core: Don't use dying mm as active_mm of kthreads
  2019-07-29 16:10           ` Rik van Riel
@ 2019-07-29 16:26             ` Peter Zijlstra
  0 siblings, 0 replies; 26+ messages in thread
From: Peter Zijlstra @ 2019-07-29 16:26 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Waiman Long, Ingo Molnar, linux-kernel, linux-mm, Andrew Morton,
	Phil Auld, Andy Lutomirski

On Mon, Jul 29, 2019 at 12:10:12PM -0400, Rik van Riel wrote:
> On Mon, 2019-07-29 at 17:44 +0200, Peter Zijlstra wrote:
> > On Mon, Jul 29, 2019 at 11:28:04AM -0400, Rik van Riel wrote:
> > > On Mon, 2019-07-29 at 17:03 +0200, Peter Zijlstra wrote:
> > > 
> > > > The 'sad' part is that x86 already switches to init_mm on idle
> > > > and we
> > > > only keep the active_mm around for 'stupid'.
> > > 
> > > Wait, where do we do that?
> > 
> > drivers/idle/intel_idle.c:              leave_mm(cpu);
> > drivers/acpi/processor_idle.c:  acpi_unlazy_tlb(smp_processor_id());
> 
> This is only done for deeper c-states, isn't it?

Not C1, but I forever forget where it starts doing that. IIRC it isn't
too hard to hit it often, and I'm fairly sure we always do it when we
hit NOHZ.

> > > > Rik and Andy were working on getting that 'fixed' a while ago,
> > > > not
> > > > sure
> > > > where that went.
> > > 
> > > My lazy TLB stuff got merged last year. 
> > 
> > Yes, but we never got around to getting rid of active_mm for x86,
> > right?
> 
> True, we still use active_mm. Getting rid of the
> active_mm refcounting altogether did not look
> entirely worth the hassle.

OK, clearly I forgot some of the details ;-)


* Re: [PATCH v2] sched/core: Don't use dying mm as active_mm of kthreads
  2019-07-29 15:27   ` Waiman Long
@ 2019-07-29 18:58     ` Michal Hocko
  2019-07-29 20:41       ` Waiman Long
  0 siblings, 1 reply; 26+ messages in thread
From: Michal Hocko @ 2019-07-29 18:58 UTC (permalink / raw)
  To: Waiman Long
  Cc: Peter Zijlstra, Ingo Molnar, linux-kernel, linux-mm,
	Andrew Morton, Phil Auld

On Mon 29-07-19 11:27:35, Waiman Long wrote:
> On 7/29/19 5:12 AM, Michal Hocko wrote:
> > On Sat 27-07-19 13:10:47, Waiman Long wrote:
> >> It was found that a dying mm_struct where the owning task has exited
> >> can stay on as active_mm of kernel threads as long as no other user
> >> tasks run on those CPUs that use it as active_mm. This prolongs the
> >> lifetime of the dying mm, holding up memory and other resources like swap
> >> space that cannot be freed.
> > IIRC use_mm doesn't pin the address space. It only pins the mm_struct
> > itself. So what exactly is the problem here?
> 
> As explained in my response to Peter, I found that resources like swap
> space were depleted even after the exit of the offending program in a
> mostly idle system. This patch makes sure that those resources get
> freed after the program exits ASAP.

Could you elaborate more? How can the mm counter (not to be confused with
mm_users) prevent the address space from being torn down on exit?


-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v2] sched/core: Don't use dying mm as active_mm of kthreads
  2019-07-29 18:58     ` Michal Hocko
@ 2019-07-29 20:41       ` Waiman Long
  0 siblings, 0 replies; 26+ messages in thread
From: Waiman Long @ 2019-07-29 20:41 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Peter Zijlstra, Ingo Molnar, linux-kernel, linux-mm,
	Andrew Morton, Phil Auld

On 7/29/19 2:58 PM, Michal Hocko wrote:
> On Mon 29-07-19 11:27:35, Waiman Long wrote:
>> On 7/29/19 5:12 AM, Michal Hocko wrote:
>>> On Sat 27-07-19 13:10:47, Waiman Long wrote:
>>>> It was found that a dying mm_struct where the owning task has exited
>>>> can stay on as active_mm of kernel threads as long as no other user
>>>> tasks run on those CPUs that use it as active_mm. This prolongs the
>>>> life time of dying mm holding up memory and other resources like swap
>>>> space that cannot be freed.
>>> IIRC use_mm doesn't pin the address space. It only pins the mm_struct
>>> itself. So what exactly is the problem here?
>> As explained in my response to Peter, I found that resource like swap
>> space were depleted even after the exit of the offending program in a
>> mostly idle system. This patch is to make sure that those resources get
>> freed after program exit ASAP.
> Could you elaborate more? How can the mm counter (not to be confused with
> mm_users) prevent the address space from being torn down on exit?

Many of the resources tied to the mm_struct are indeed freed when
mm_users becomes 0, including swap space reservations, I think. I was
testing a mm patch that had a missing-mmput bug which caused mm_users
not to go to 0. I fixed the bug, and with the sched patch to speed up
the release of the mm_struct, everything was fine. I didn't realize that
fixing the mm bug alone was enough to free the swap space.

Still, there are some resources that are not freed while mm_count is
non-zero. It is certainly less serious than what I had thought. Sorry
for the confusion.
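
To spell out the split between the two counters, the lifetime rules
look roughly like this (a simplified sketch of mmput()/mmdrop() from
kernel/fork.c and include/linux/sched/mm.h, with most of the actual
teardown elided):

/* mm_users: pins the address space (mappings, swap reservations, ...). */
void mmput(struct mm_struct *mm)
{
	if (atomic_dec_and_test(&mm->mm_users)) {
		exit_mmap(mm);	/* unmaps everything, releases swap space */
		/* aio, ksm, khugepaged teardown elided */
		mmdrop(mm);	/* drop the mm_count reference */
	}
}

/* mm_count: pins only the mm_struct itself (and e.g. the pgd). */
static inline void mmdrop(struct mm_struct *mm)
{
	if (unlikely(atomic_dec_and_test(&mm->mm_count)))
		__mmdrop(mm);	/* frees the mm_struct */
}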

Cheers,
Longman


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v2] sched/core: Don't use dying mm as active_mm of kthreads
  2019-07-29  8:18 ` Qais Yousef
@ 2019-07-29 21:06   ` Waiman Long
  2019-07-29 21:33     ` Qais Yousef
  0 siblings, 1 reply; 26+ messages in thread
From: Waiman Long @ 2019-07-29 21:06 UTC (permalink / raw)
  To: Qais Yousef
  Cc: Peter Zijlstra, Ingo Molnar, linux-kernel, linux-mm,
	Andrew Morton, Phil Auld

On 7/29/19 4:18 AM, Qais Yousef wrote:
> On 07/27/19 13:10, Waiman Long wrote:
>> It was found that a dying mm_struct where the owning task has exited
>> can stay on as active_mm of kernel threads as long as no other user
>> tasks run on those CPUs that use it as active_mm. This prolongs the
>> life time of dying mm holding up memory and other resources like swap
>> space that cannot be freed.
>>
>> Fix that by forcing the kernel threads to use init_mm as the active_mm
>> if the previous active_mm is dying.
>>
>> The determination of a dying mm is based on the absence of an owning
>> task. The selection of the owning task only happens with the CONFIG_MEMCG
>> option. Without that, there is no simple way to determine the life span
>> of a given mm. So it falls back to the old behavior.
> I don't really know a lot about this code, but does the owner field have to
> depend on CONFIG_MEMCG? I.e., can't the owner always be set?
>
Yes, the owner field is only used and defined when CONFIG_MEMCG is on.
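
For reference, the declaration in include/linux/mm_types.h looks
roughly like this (the long explanatory comment is paraphrased):

#ifdef CONFIG_MEMCG
	/*
	 * "owner" points to the task regarded as the canonical
	 * user/owner of this mm; memcg uses it to map an mm back
	 * to its memory cgroup.
	 */
	struct task_struct __rcu *owner;
#endif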

Cheers,
Longman


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v2] sched/core: Don't use dying mm as active_mm of kthreads
  2019-07-29 21:06   ` Waiman Long
@ 2019-07-29 21:33     ` Qais Yousef
  0 siblings, 0 replies; 26+ messages in thread
From: Qais Yousef @ 2019-07-29 21:33 UTC (permalink / raw)
  To: Waiman Long
  Cc: Peter Zijlstra, Ingo Molnar, linux-kernel, linux-mm,
	Andrew Morton, Phil Auld

On 07/29/19 17:06, Waiman Long wrote:
> On 7/29/19 4:18 AM, Qais Yousef wrote:
> > On 07/27/19 13:10, Waiman Long wrote:
> >> It was found that a dying mm_struct where the owning task has exited
> >> can stay on as active_mm of kernel threads as long as no other user
> >> tasks run on those CPUs that use it as active_mm. This prolongs the
> >> life time of dying mm holding up memory and other resources like swap
> >> space that cannot be freed.
> >>
> >> Fix that by forcing the kernel threads to use init_mm as the active_mm
> >> if the previous active_mm is dying.
> >>
> >> The determination of a dying mm is based on the absence of an owning
> >> task. The selection of the owning task only happens with the CONFIG_MEMCG
> >> option. Without that, there is no simple way to determine the life span
> >> of a given mm. So it falls back to the old behavior.
> > I don't really know a lot about this code, but does the owner field have to
> > depend on CONFIG_MEMCG? I.e., can't the owner always be set?
> >
> Yes, the owner field is only used and defined when CONFIG_MEMCG is on.

I guess the simpler answer is that it's too much work to take it out of
CONFIG_MEMCG.

Anyway, given the direction of the thread this is moot now :-)

Thanks!

--
Qais Yousef

^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread, other threads:[~2019-07-29 21:33 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-07-27 17:10 [PATCH v2] sched/core: Don't use dying mm as active_mm of kthreads Waiman Long
2019-07-29  8:18 ` Qais Yousef
2019-07-29 21:06   ` Waiman Long
2019-07-29 21:33     ` Qais Yousef
2019-07-29  8:52 ` Peter Zijlstra
2019-07-29 14:24   ` [PATCH] sched: Clean up active_mm reference counting Peter Zijlstra
2019-07-29 15:01     ` Mathieu Desnoyers
2019-07-29 15:16     ` Waiman Long
2019-07-29 15:22       ` Peter Zijlstra
2019-07-29 15:29     ` Rik van Riel
2019-07-29 14:27   ` [PATCH v2] sched/core: Don't use dying mm as active_mm of kthreads Peter Zijlstra
2019-07-29 15:22     ` Waiman Long
2019-07-29 15:38       ` Peter Zijlstra
2019-07-29 14:51   ` Waiman Long
2019-07-29 15:03     ` Peter Zijlstra
2019-07-29 15:28       ` Rik van Riel
2019-07-29 15:44         ` Peter Zijlstra
2019-07-29 16:10           ` Rik van Riel
2019-07-29 16:26             ` Peter Zijlstra
2019-07-29 15:37       ` Waiman Long
2019-07-29 16:12         ` Rik van Riel
2019-07-29 16:22       ` Andy Lutomirski
2019-07-29  9:12 ` Michal Hocko
2019-07-29 15:27   ` Waiman Long
2019-07-29 18:58     ` Michal Hocko
2019-07-29 20:41       ` Waiman Long
