linux-kernel.vger.kernel.org archive mirror
* [PATCH v3] sched/core: Don't use dying mm as active_mm of kthreads
@ 2019-07-29 21:07 Waiman Long
  2019-07-29 21:12 ` Waiman Long
                   ` (2 more replies)
  0 siblings, 3 replies; 14+ messages in thread
From: Waiman Long @ 2019-07-29 21:07 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar
  Cc: linux-kernel, linux-mm, Andrew Morton, Phil Auld, Michal Hocko,
	Rik van Riel, Waiman Long

It was found that a dying mm_struct whose owning task has exited
can stay on as the active_mm of kernel threads as long as no other
user tasks run on the CPUs that use it as active_mm. This prolongs
the lifetime of the dying mm, holding up resources that cannot be
freed on a mostly idle system.

Fix that by forcing kernel threads to use init_mm as the active_mm
during a kernel thread to kernel thread transition if the previous
active_mm is dying (!mm_users). This allows the resources associated
with the dying mm to be freed ASAP.

A kernel-to-kernel thread transition indicates that the CPU is
probably idling with no higher-priority user task to run, so the
overhead of loading the mm_users cacheline should not really matter
in this case.

My testing on an x86 system showed that, with this patch, the mm_struct
was freed within seconds after the task exited, instead of staying alive
for minutes or even longer on a mostly idle system.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/sched/core.c | 21 +++++++++++++++++++--
 1 file changed, 19 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 795077af4f1a..41997e676251 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3214,6 +3214,8 @@ static __always_inline struct rq *
 context_switch(struct rq *rq, struct task_struct *prev,
 	       struct task_struct *next, struct rq_flags *rf)
 {
+	struct mm_struct *next_mm = next->mm;
+
 	prepare_task_switch(rq, prev, next);
 
 	/*
@@ -3229,8 +3231,22 @@ context_switch(struct rq *rq, struct task_struct *prev,
 	 *
 	 * kernel ->   user   switch + mmdrop() active
 	 *   user ->   user   switch
+	 *
+	 * kernel -> kernel and !prev->active_mm->mm_users:
+	 *   switch to init_mm + mmgrab() + mmdrop()
 	 */
-	if (!next->mm) {                                // to kernel
+	if (!next_mm) {					// to kernel
+		/*
+		 * Checking is only done on kernel -> kernel transition
+		 * to avoid any performance overhead while user tasks
+		 * are running.
+		 */
+		if (unlikely(!prev->mm &&
+			     !atomic_read(&prev->active_mm->mm_users))) {
+			next_mm = next->active_mm = &init_mm;
+			mmgrab(next_mm);
+			goto mm_switch;
+		}
 		enter_lazy_tlb(prev->active_mm, next);
 
 		next->active_mm = prev->active_mm;
@@ -3239,6 +3255,7 @@ context_switch(struct rq *rq, struct task_struct *prev,
 		else
 			prev->active_mm = NULL;
 	} else {                                        // to user
+mm_switch:
 		/*
 		 * sys_membarrier() requires an smp_mb() between setting
 		 * rq->curr and returning to userspace.
@@ -3248,7 +3265,7 @@ context_switch(struct rq *rq, struct task_struct *prev,
 		 * finish_task_switch()'s mmdrop().
 		 */
 
-		switch_mm_irqs_off(prev->active_mm, next->mm, next);
+		switch_mm_irqs_off(prev->active_mm, next_mm, next);
 
 		if (!prev->mm) {                        // from kernel
 			/* will mmdrop() in finish_task_switch(). */
-- 
2.18.1



* Re: [PATCH v3] sched/core: Don't use dying mm as active_mm of kthreads
  2019-07-29 21:07 [PATCH v3] sched/core: Don't use dying mm as active_mm of kthreads Waiman Long
@ 2019-07-29 21:12 ` Waiman Long
  2019-07-29 21:21 ` Rik van Riel
  2019-07-30  8:43 ` Peter Zijlstra
  2 siblings, 0 replies; 14+ messages in thread
From: Waiman Long @ 2019-07-29 21:12 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar
  Cc: linux-kernel, linux-mm, Andrew Morton, Phil Auld, Michal Hocko,
	Rik van Riel

On 7/29/19 5:07 PM, Waiman Long wrote:
> It was found that a dying mm_struct where the owning task has exited
> can stay on as active_mm of kernel threads as long as no other user
> tasks run on those CPUs that use it as active_mm. This prolongs the
> life time of dying mm holding up some resources that cannot be freed
> on a mostly idle system.
>
> Fix that by forcing the kernel threads to use init_mm as the active_mm
> during a kernel thread to kernel thread transition if the previous
> active_mm is dying (!mm_users). This will allows the freeing of resources
> associated with the dying mm ASAP.
>
> The presence of a kernel-to-kernel thread transition indicates that
> the cpu is probably idling with no higher priority user task to run.
> So the overhead of loading the mm_users cacheline should not really
> matter in this case.
>
> My testing on an x86 system showed that the mm_struct was freed within
> seconds after the task exited instead of staying alive for minutes or
> even longer on a mostly idle system before this patch.
>
> Signed-off-by: Waiman Long <longman@redhat.com>
> ---
>  kernel/sched/core.c | 21 +++++++++++++++++++--
>  1 file changed, 19 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 795077af4f1a..41997e676251 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3214,6 +3214,8 @@ static __always_inline struct rq *
>  context_switch(struct rq *rq, struct task_struct *prev,
>  	       struct task_struct *next, struct rq_flags *rf)
>  {
> +	struct mm_struct *next_mm = next->mm;
> +
>  	prepare_task_switch(rq, prev, next);
>  
>  	/*
> @@ -3229,8 +3231,22 @@ context_switch(struct rq *rq, struct task_struct *prev,
>  	 *
>  	 * kernel ->   user   switch + mmdrop() active
>  	 *   user ->   user   switch
> +	 *
> +	 * kernel -> kernel and !prev->active_mm->mm_users:
> +	 *   switch to init_mm + mmgrab() + mmdrop()
>  	 */
> -	if (!next->mm) {                                // to kernel
> +	if (!next_mm) {					// to kernel
> +		/*
> +		 * Checking is only done on kernel -> kernel transition
> +		 * to avoid any performance overhead while user tasks
> +		 * are running.
> +		 */
> +		if (unlikely(!prev->mm &&
> +			     !atomic_read(&prev->active_mm->mm_users))) {
> +			next_mm = next->active_mm = &init_mm;
> +			mmgrab(next_mm);
> +			goto mm_switch;
> +		}
>  		enter_lazy_tlb(prev->active_mm, next);
>  
>  		next->active_mm = prev->active_mm;
> @@ -3239,6 +3255,7 @@ context_switch(struct rq *rq, struct task_struct *prev,
>  		else
>  			prev->active_mm = NULL;
>  	} else {                                        // to user
> +mm_switch:
>  		/*
>  		 * sys_membarrier() requires an smp_mb() between setting
>  		 * rq->curr and returning to userspace.
> @@ -3248,7 +3265,7 @@ context_switch(struct rq *rq, struct task_struct *prev,
>  		 * finish_task_switch()'s mmdrop().
>  		 */
>  
> -		switch_mm_irqs_off(prev->active_mm, next->mm, next);
> +		switch_mm_irqs_off(prev->active_mm, next_mm, next);
>  
>  		if (!prev->mm) {                        // from kernel
>  			/* will mmdrop() in finish_task_switch(). */

OK, this is my final push.

My previous statements were not totally correct. Many of the resources
are indeed freed when mm_users reaches 0. However, I still think it is
an issue to let a dying mm_struct stay alive for minutes or even
longer. I am totally fine if you think it is not worth doing.
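
To be clear, the bulk of the teardown happens in __mmput() once
mm_users hits 0; only the final __mmdrop() pieces wait for mm_count.
A simplified sketch from kernel/fork.c (from memory, details may
differ by kernel version):

static void __mmput(struct mm_struct *mm)
{
	exit_aio(mm);
	ksm_exit(mm);
	khugepaged_exit(mm);	/* must run before exit_mmap */
	exit_mmap(mm);		/* unmaps and frees almost all the memory */
	set_mm_exe_file(mm, NULL);
	if (mm->binfmt)
		module_put(mm->binfmt->module);
	mmdrop(mm);		/* the rest waits for mm_count to hit 0 */
}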

Thanks,
Longman



* Re: [PATCH v3] sched/core: Don't use dying mm as active_mm of kthreads
  2019-07-29 21:07 [PATCH v3] sched/core: Don't use dying mm as active_mm of kthreads Waiman Long
  2019-07-29 21:12 ` Waiman Long
@ 2019-07-29 21:21 ` Rik van Riel
  2019-07-29 21:42   ` Waiman Long
  2019-07-30  8:43 ` Peter Zijlstra
  2 siblings, 1 reply; 14+ messages in thread
From: Rik van Riel @ 2019-07-29 21:21 UTC (permalink / raw)
  To: Waiman Long, Peter Zijlstra, Ingo Molnar
  Cc: linux-kernel, linux-mm, Andrew Morton, Phil Auld, Michal Hocko

On Mon, 2019-07-29 at 17:07 -0400, Waiman Long wrote:
> It was found that a dying mm_struct where the owning task has exited
> can stay on as active_mm of kernel threads as long as no other user
> tasks run on those CPUs that use it as active_mm. This prolongs the
> life time of dying mm holding up some resources that cannot be freed
> on a mostly idle system.

On what kernels does this happen?

Don't we explicitly flush all lazy TLB CPUs at exit
time, when we are about to free page tables?

Does this happen only on the CPU where the task in
question is exiting, or also on other CPUs?

If it is only on the CPU where the task is exiting,
would the TASK_DEAD handling in finish_task_switch()
be a better place to handle this?

> Fix that by forcing the kernel threads to use init_mm as the
> active_mm
> during a kernel thread to kernel thread transition if the previous
> active_mm is dying (!mm_users). This will allows the freeing of
> resources
> associated with the dying mm ASAP.
> 
> The presence of a kernel-to-kernel thread transition indicates that
> the cpu is probably idling with no higher priority user task to run.
> So the overhead of loading the mm_users cacheline should not really
> matter in this case.
> 
> My testing on an x86 system showed that the mm_struct was freed
> within
> seconds after the task exited instead of staying alive for minutes or
> even longer on a mostly idle system before this patch.
> 
> Signed-off-by: Waiman Long <longman@redhat.com>
> ---
>  kernel/sched/core.c | 21 +++++++++++++++++++--
>  1 file changed, 19 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 795077af4f1a..41997e676251 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3214,6 +3214,8 @@ static __always_inline struct rq *
>  context_switch(struct rq *rq, struct task_struct *prev,
>  	       struct task_struct *next, struct rq_flags *rf)
>  {
> +	struct mm_struct *next_mm = next->mm;
> +
>  	prepare_task_switch(rq, prev, next);
>  
>  	/*
> @@ -3229,8 +3231,22 @@ context_switch(struct rq *rq, struct
> task_struct *prev,
>  	 *
>  	 * kernel ->   user   switch + mmdrop() active
>  	 *   user ->   user   switch
> +	 *
> +	 * kernel -> kernel and !prev->active_mm->mm_users:
> +	 *   switch to init_mm + mmgrab() + mmdrop()
>  	 */
> -	if (!next->mm) {                                // to kernel
> +	if (!next_mm) {					// to kernel
> +		/*
> +		 * Checking is only done on kernel -> kernel transition
> +		 * to avoid any performance overhead while user tasks
> +		 * are running.
> +		 */
> +		if (unlikely(!prev->mm &&
> +			     !atomic_read(&prev->active_mm->mm_users))) 
> {
> +			next_mm = next->active_mm = &init_mm;
> +			mmgrab(next_mm);
> +			goto mm_switch;
> +		}
>  		enter_lazy_tlb(prev->active_mm, next);
>  
>  		next->active_mm = prev->active_mm;
> @@ -3239,6 +3255,7 @@ context_switch(struct rq *rq, struct
> task_struct *prev,
>  		else
>  			prev->active_mm = NULL;
>  	} else {                                        // to user
> +mm_switch:
>  		/*
>  		 * sys_membarrier() requires an smp_mb() between
> setting
>  		 * rq->curr and returning to userspace.
> @@ -3248,7 +3265,7 @@ context_switch(struct rq *rq, struct
> task_struct *prev,
>  		 * finish_task_switch()'s mmdrop().
>  		 */
>  
> -		switch_mm_irqs_off(prev->active_mm, next->mm, next);
> +		switch_mm_irqs_off(prev->active_mm, next_mm, next);
>  
>  		if (!prev->mm) {                        // from kernel
>  			/* will mmdrop() in finish_task_switch(). */
-- 
All Rights Reversed.



* Re: [PATCH v3] sched/core: Don't use dying mm as active_mm of kthreads
  2019-07-29 21:21 ` Rik van Riel
@ 2019-07-29 21:42   ` Waiman Long
  2019-07-30  0:26     ` Rik van Riel
  2019-07-30  7:24     ` Michal Hocko
  0 siblings, 2 replies; 14+ messages in thread
From: Waiman Long @ 2019-07-29 21:42 UTC (permalink / raw)
  To: Rik van Riel, Peter Zijlstra, Ingo Molnar
  Cc: linux-kernel, linux-mm, Andrew Morton, Phil Auld, Michal Hocko

On 7/29/19 5:21 PM, Rik van Riel wrote:
> On Mon, 2019-07-29 at 17:07 -0400, Waiman Long wrote:
>> It was found that a dying mm_struct where the owning task has exited
>> can stay on as active_mm of kernel threads as long as no other user
>> tasks run on those CPUs that use it as active_mm. This prolongs the
>> life time of dying mm holding up some resources that cannot be freed
>> on a mostly idle system.
> On what kernels does this happen?
>
> Don't we explicitly flush all lazy TLB CPUs at exit
> time, when we are about to free page tables?

There are still a couple of calls that will not be done until mm_count
reaches 0:

- mm_free_pgd(mm);
- destroy_context(mm);
- mmu_notifier_mm_destroy(mm);
- check_mm(mm);
- put_user_ns(mm->user_ns);

These are not big items, but holding them off for a long time is still
not a good thing.
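
For reference, those calls are the tail end of __mmdrop() in
kernel/fork.c, which only runs once mm_count finally drops to 0
(simplified sketch from memory):

void __mmdrop(struct mm_struct *mm)
{
	BUG_ON(mm == &init_mm);
	mm_free_pgd(mm);		/* the PGD page */
	destroy_context(mm);		/* e.g. the LDT on x86 */
	mmu_notifier_mm_destroy(mm);
	check_mm(mm);			/* sanity checks only */
	put_user_ns(mm->user_ns);
	free_mm(mm);			/* the mm_struct itself */
}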

> Does this happen only on the CPU where the task in
> question is exiting, or also on other CPUs?

What I have found is that a long-running process on a mostly idle system
with many CPUs is likely to cycle through a lot of the CPUs during its
lifetime and leave behind its mm in the active_mm of those CPUs.  My
2-socket test system has 96 logical CPUs. After running the test
program for a minute or so, it left behind its mm in about half of the
CPUs, with a mm_count of 45 after exit. So the dying mm will stay until
all those 45 CPUs get new user tasks to run.


>
> If it is only on the CPU where the task is exiting,
> would the TASK_DEAD handling in finish_task_switch()
> be a better place to handle this?

I need to switch the active_mm away from the dying one, and mm switching
is only done in context_switch(). I don't think finish_task_switch() is
the right place.

-Longman



* Re: [PATCH v3] sched/core: Don't use dying mm as active_mm of kthreads
  2019-07-29 21:42   ` Waiman Long
@ 2019-07-30  0:26     ` Rik van Riel
  2019-07-30 21:01       ` Waiman Long
  2019-07-30  7:24     ` Michal Hocko
  1 sibling, 1 reply; 14+ messages in thread
From: Rik van Riel @ 2019-07-30  0:26 UTC (permalink / raw)
  To: Waiman Long, Peter Zijlstra, Ingo Molnar
  Cc: linux-kernel, linux-mm, Andrew Morton, Phil Auld, Michal Hocko

On Mon, 2019-07-29 at 17:42 -0400, Waiman Long wrote:

> What I have found is that a long running process on a mostly idle
> system
> with many CPUs is likely to cycle through a lot of the CPUs during
> its
> lifetime and leave behind its mm in the active_mm of those CPUs.  My
> 2-socket test system have 96 logical CPUs. After running the test
> program for a minute or so, it leaves behind its mm in about half of
> the
> CPUs with a mm_count of 45 after exit. So the dying mm will stay
> until
> all those 45 CPUs get new user tasks to run.

OK. On what kernel are you seeing this?

On current upstream, the code in native_flush_tlb_others()
will send a TLB flush to every CPU in mm_cpumask() if page
table pages have been freed.

That should cause the lazy TLB CPUs to switch to init_mm
when the exit->zap_page_range path gets to the point where
it frees page tables.

> > If it is only on the CPU where the task is exiting,
> > would the TASK_DEAD handling in finish_task_switch()
> > be a better place to handle this?
> 
> I need to switch the mm off the dying one. mm switching is only done
> in
> context_switch(). I don't think finish_task_switch() is the right
> place.

mm switching is also done in flush_tlb_func_common,
if the CPU received a TLB shootdown IPI while in lazy
TLB mode.
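
Roughly this branch in arch/x86/mm/tlb.c (simplified excerpt from
memory, so double check the version you are on):

static void flush_tlb_func_common(const struct flush_tlb_info *f,
				  bool local, enum tlb_flush_reason reason)
{
	struct mm_struct *loaded_mm = this_cpu_read(cpu_tlbstate.loaded_mm);
	...
	if (this_cpu_read(cpu_tlbstate.is_lazy)) {
		/*
		 * Lazy mode: rather than doing a minimal flush, just
		 * switch this CPU over to init_mm.
		 */
		switch_mm_irqs_off(NULL, &init_mm, NULL);
		return;
	}
	...
}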

-- 
All Rights Reversed.



* Re: [PATCH v3] sched/core: Don't use dying mm as active_mm of kthreads
  2019-07-29 21:42   ` Waiman Long
  2019-07-30  0:26     ` Rik van Riel
@ 2019-07-30  7:24     ` Michal Hocko
  2019-07-30 21:05       ` Waiman Long
  1 sibling, 1 reply; 14+ messages in thread
From: Michal Hocko @ 2019-07-30  7:24 UTC (permalink / raw)
  To: Waiman Long
  Cc: Rik van Riel, Peter Zijlstra, Ingo Molnar, linux-kernel,
	linux-mm, Andrew Morton, Phil Auld

On Mon 29-07-19 17:42:20, Waiman Long wrote:
> On 7/29/19 5:21 PM, Rik van Riel wrote:
> > On Mon, 2019-07-29 at 17:07 -0400, Waiman Long wrote:
> >> It was found that a dying mm_struct where the owning task has exited
> >> can stay on as active_mm of kernel threads as long as no other user
> >> tasks run on those CPUs that use it as active_mm. This prolongs the
> >> life time of dying mm holding up some resources that cannot be freed
> >> on a mostly idle system.
> > On what kernels does this happen?
> >
> > Don't we explicitly flush all lazy TLB CPUs at exit
> > time, when we are about to free page tables?
> 
> There are still a couple of calls that will be done until mm_count
> reaches 0:
> 
> - mm_free_pgd(mm);
> - destroy_context(mm);
> - mmu_notifier_mm_destroy(mm);
> - check_mm(mm);
> - put_user_ns(mm->user_ns);
> 
> These are not big items, but holding it off for a long time is still not
> a good thing.

It would be helpful to give a ballpark estimate of how much memory that
actually is. If we are talking about a few pages' worth per idle CPU in
the worst case, then I am not sure we want to find an elaborate way
around that. We quite likely have more sitting in per-cpu caches in
different subsystems already. It is also quite likely that large
machines with many CPUs will have a lot of memory as well.
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH v3] sched/core: Don't use dying mm as active_mm of kthreads
  2019-07-29 21:07 [PATCH v3] sched/core: Don't use dying mm as active_mm of kthreads Waiman Long
  2019-07-29 21:12 ` Waiman Long
  2019-07-29 21:21 ` Rik van Riel
@ 2019-07-30  8:43 ` Peter Zijlstra
  2019-07-30 13:59   ` Waiman Long
  2 siblings, 1 reply; 14+ messages in thread
From: Peter Zijlstra @ 2019-07-30  8:43 UTC (permalink / raw)
  To: Waiman Long
  Cc: Ingo Molnar, linux-kernel, linux-mm, Andrew Morton, Phil Auld,
	Michal Hocko, Rik van Riel

On Mon, Jul 29, 2019 at 05:07:28PM -0400, Waiman Long wrote:
> It was found that a dying mm_struct where the owning task has exited
> can stay on as active_mm of kernel threads as long as no other user
> tasks run on those CPUs that use it as active_mm. This prolongs the
> life time of dying mm holding up some resources that cannot be freed
> on a mostly idle system.
> 
> Fix that by forcing the kernel threads to use init_mm as the active_mm
> during a kernel thread to kernel thread transition if the previous
> active_mm is dying (!mm_users). This will allows the freeing of resources
> associated with the dying mm ASAP.
> 
> The presence of a kernel-to-kernel thread transition indicates that
> the cpu is probably idling with no higher priority user task to run.
> So the overhead of loading the mm_users cacheline should not really
> matter in this case.
> 
> My testing on an x86 system showed that the mm_struct was freed within
> seconds after the task exited instead of staying alive for minutes or
> even longer on a mostly idle system before this patch.
> 
> Signed-off-by: Waiman Long <longman@redhat.com>
> ---
>  kernel/sched/core.c | 21 +++++++++++++++++++--
>  1 file changed, 19 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 795077af4f1a..41997e676251 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3214,6 +3214,8 @@ static __always_inline struct rq *
>  context_switch(struct rq *rq, struct task_struct *prev,
>  	       struct task_struct *next, struct rq_flags *rf)
>  {
> +	struct mm_struct *next_mm = next->mm;
> +
>  	prepare_task_switch(rq, prev, next);
>  
>  	/*
> @@ -3229,8 +3231,22 @@ context_switch(struct rq *rq, struct task_struct *prev,
>  	 *
>  	 * kernel ->   user   switch + mmdrop() active
>  	 *   user ->   user   switch
> +	 *
> +	 * kernel -> kernel and !prev->active_mm->mm_users:
> +	 *   switch to init_mm + mmgrab() + mmdrop()
>  	 */
> -	if (!next->mm) {                                // to kernel
> +	if (!next_mm) {					// to kernel
> +		/*
> +		 * Checking is only done on kernel -> kernel transition
> +		 * to avoid any performance overhead while user tasks
> +		 * are running.
> +		 */
> +		if (unlikely(!prev->mm &&
> +			     !atomic_read(&prev->active_mm->mm_users))) {
> +			next_mm = next->active_mm = &init_mm;
> +			mmgrab(next_mm);
> +			goto mm_switch;
> +		}
>  		enter_lazy_tlb(prev->active_mm, next);
>  
>  		next->active_mm = prev->active_mm;

So I _really_ hate this complication. I'm thinking that if you really
care about this, the time is much better spent getting rid of the
active_mm tracking for x86 entirely.



* Re: [PATCH v3] sched/core: Don't use dying mm as active_mm of kthreads
  2019-07-30  8:43 ` Peter Zijlstra
@ 2019-07-30 13:59   ` Waiman Long
  0 siblings, 0 replies; 14+ messages in thread
From: Waiman Long @ 2019-07-30 13:59 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, linux-mm, Andrew Morton, Phil Auld,
	Michal Hocko, Rik van Riel

On 7/30/19 4:43 AM, Peter Zijlstra wrote:
> On Mon, Jul 29, 2019 at 05:07:28PM -0400, Waiman Long wrote:
>> It was found that a dying mm_struct where the owning task has exited
>> can stay on as active_mm of kernel threads as long as no other user
>> tasks run on those CPUs that use it as active_mm. This prolongs the
>> life time of dying mm holding up some resources that cannot be freed
>> on a mostly idle system.
>>
>> Fix that by forcing the kernel threads to use init_mm as the active_mm
>> during a kernel thread to kernel thread transition if the previous
>> active_mm is dying (!mm_users). This will allows the freeing of resources
>> associated with the dying mm ASAP.
>>
>> The presence of a kernel-to-kernel thread transition indicates that
>> the cpu is probably idling with no higher priority user task to run.
>> So the overhead of loading the mm_users cacheline should not really
>> matter in this case.
>>
>> My testing on an x86 system showed that the mm_struct was freed within
>> seconds after the task exited instead of staying alive for minutes or
>> even longer on a mostly idle system before this patch.
>>
>> Signed-off-by: Waiman Long <longman@redhat.com>
>> ---
>>  kernel/sched/core.c | 21 +++++++++++++++++++--
>>  1 file changed, 19 insertions(+), 2 deletions(-)
>>
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index 795077af4f1a..41997e676251 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -3214,6 +3214,8 @@ static __always_inline struct rq *
>>  context_switch(struct rq *rq, struct task_struct *prev,
>>  	       struct task_struct *next, struct rq_flags *rf)
>>  {
>> +	struct mm_struct *next_mm = next->mm;
>> +
>>  	prepare_task_switch(rq, prev, next);
>>  
>>  	/*
>> @@ -3229,8 +3231,22 @@ context_switch(struct rq *rq, struct task_struct *prev,
>>  	 *
>>  	 * kernel ->   user   switch + mmdrop() active
>>  	 *   user ->   user   switch
>> +	 *
>> +	 * kernel -> kernel and !prev->active_mm->mm_users:
>> +	 *   switch to init_mm + mmgrab() + mmdrop()
>>  	 */
>> -	if (!next->mm) {                                // to kernel
>> +	if (!next_mm) {					// to kernel
>> +		/*
>> +		 * Checking is only done on kernel -> kernel transition
>> +		 * to avoid any performance overhead while user tasks
>> +		 * are running.
>> +		 */
>> +		if (unlikely(!prev->mm &&
>> +			     !atomic_read(&prev->active_mm->mm_users))) {
>> +			next_mm = next->active_mm = &init_mm;
>> +			mmgrab(next_mm);
>> +			goto mm_switch;
>> +		}
>>  		enter_lazy_tlb(prev->active_mm, next);
>>  
>>  		next->active_mm = prev->active_mm;
> So I _really_ hate this complication. I'm thinking if you really care
> about this the time is much better spend getting rid of the active_mm
> tracking for x86 entirely.
>
That is fine. I won't pursue this further. I will take a look at your
suggestion when I have time, but it will probably be a while :-)

Cheers,
Longman



* Re: [PATCH v3] sched/core: Don't use dying mm as active_mm of kthreads
  2019-07-30  0:26     ` Rik van Riel
@ 2019-07-30 21:01       ` Waiman Long
  2019-07-31 13:48         ` Rik van Riel
  0 siblings, 1 reply; 14+ messages in thread
From: Waiman Long @ 2019-07-30 21:01 UTC (permalink / raw)
  To: Rik van Riel, Peter Zijlstra, Ingo Molnar
  Cc: linux-kernel, linux-mm, Andrew Morton, Phil Auld, Michal Hocko

On 7/29/19 8:26 PM, Rik van Riel wrote:
> On Mon, 2019-07-29 at 17:42 -0400, Waiman Long wrote:
>
>> What I have found is that a long running process on a mostly idle
>> system
>> with many CPUs is likely to cycle through a lot of the CPUs during
>> its
>> lifetime and leave behind its mm in the active_mm of those CPUs.  My
>> 2-socket test system have 96 logical CPUs. After running the test
>> program for a minute or so, it leaves behind its mm in about half of
>> the
>> CPUs with a mm_count of 45 after exit. So the dying mm will stay
>> until
>> all those 45 CPUs get new user tasks to run.
> OK. On what kernel are you seeing this?
>
> On current upstream, the code in native_flush_tlb_others()
> will send a TLB flush to every CPU in mm_cpumask() if page
> table pages have been freed.
>
> That should cause the lazy TLB CPUs to switch to init_mm
> when the exit->zap_page_range path gets to the point where
> it frees page tables.
>
I was using the latest upstream 5.3-rc2 kernel. It may be the case that
the mm has been switched, but the mm_count reference held through the
kthread's active_mm is not dropped until a user task runs on that CPU.
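
To be more specific, that reference is only dropped in
finish_task_switch(), and rq->prev_mm is only set up on a kernel ->
user transition -- roughly (simplified excerpt from kernel/sched/core.c,
from memory, so the exact code may differ):

static struct rq *finish_task_switch(struct task_struct *prev)
{
	struct rq *rq = this_rq();
	struct mm_struct *mm = rq->prev_mm;	/* only set on kernel -> user */
	...
	rq->prev_mm = NULL;
	...
	if (mm)
		mmdrop(mm);	/* this is where the lazy mm_count ref goes */
	...
}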


>>> If it is only on the CPU where the task is exiting,
>>> would the TASK_DEAD handling in finish_task_switch()
>>> be a better place to handle this?
>> I need to switch the mm off the dying one. mm switching is only done
>> in
>> context_switch(). I don't think finish_task_switch() is the right
>> place.
> mm switching is also done in flush_tlb_func_common,
> if the CPU received a TLB shootdown IPI while in lazy
> TLB mode.
>
I see.

Cheers,
Longman



* Re: [PATCH v3] sched/core: Don't use dying mm as active_mm of kthreads
  2019-07-30  7:24     ` Michal Hocko
@ 2019-07-30 21:05       ` Waiman Long
  0 siblings, 0 replies; 14+ messages in thread
From: Waiman Long @ 2019-07-30 21:05 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Rik van Riel, Peter Zijlstra, Ingo Molnar, linux-kernel,
	linux-mm, Andrew Morton, Phil Auld

On 7/30/19 3:24 AM, Michal Hocko wrote:
> On Mon 29-07-19 17:42:20, Waiman Long wrote:
>> On 7/29/19 5:21 PM, Rik van Riel wrote:
>>> On Mon, 2019-07-29 at 17:07 -0400, Waiman Long wrote:
>>>> It was found that a dying mm_struct where the owning task has exited
>>>> can stay on as active_mm of kernel threads as long as no other user
>>>> tasks run on those CPUs that use it as active_mm. This prolongs the
>>>> life time of dying mm holding up some resources that cannot be freed
>>>> on a mostly idle system.
>>> On what kernels does this happen?
>>>
>>> Don't we explicitly flush all lazy TLB CPUs at exit
>>> time, when we are about to free page tables?
>> There are still a couple of calls that will be done until mm_count
>> reaches 0:
>>
>> - mm_free_pgd(mm);
>> - destroy_context(mm);
>> - mmu_notifier_mm_destroy(mm);
>> - check_mm(mm);
>> - put_user_ns(mm->user_ns);
>>
>> These are not big items, but holding it off for a long time is still not
>> a good thing.
> It would be helpful to give a ball park estimation of how much that
> actually is. If we are talking about few pages worth of pages per idle
> cpu in the worst case then I am not sure we want to find an elaborate
> way around that. We are quite likely having more in per-cpu caches in
> different subsystems already. It is also quite likely that large
> machines with many CPUs will have a lot of memory as well.

I think they are relatively small. So I am not going to pursue it
further at this point.

Cheers,
Longman



* Re: [PATCH v3] sched/core: Don't use dying mm as active_mm of kthreads
  2019-07-30 21:01       ` Waiman Long
@ 2019-07-31 13:48         ` Rik van Riel
  2019-07-31 14:15           ` Waiman Long
  0 siblings, 1 reply; 14+ messages in thread
From: Rik van Riel @ 2019-07-31 13:48 UTC (permalink / raw)
  To: Waiman Long, Peter Zijlstra, Ingo Molnar
  Cc: linux-kernel, linux-mm, Andrew Morton, Phil Auld, Michal Hocko

On Tue, 2019-07-30 at 17:01 -0400, Waiman Long wrote:
> On 7/29/19 8:26 PM, Rik van Riel wrote:
> > On Mon, 2019-07-29 at 17:42 -0400, Waiman Long wrote:
> > 
> > > What I have found is that a long running process on a mostly idle
> > > system
> > > with many CPUs is likely to cycle through a lot of the CPUs
> > > during
> > > its
> > > lifetime and leave behind its mm in the active_mm of those
> > > CPUs.  My
> > > 2-socket test system have 96 logical CPUs. After running the test
> > > program for a minute or so, it leaves behind its mm in about half
> > > of
> > > the
> > > CPUs with a mm_count of 45 after exit. So the dying mm will stay
> > > until
> > > all those 45 CPUs get new user tasks to run.
> > OK. On what kernel are you seeing this?
> > 
> > On current upstream, the code in native_flush_tlb_others()
> > will send a TLB flush to every CPU in mm_cpumask() if page
> > table pages have been freed.
> > 
> > That should cause the lazy TLB CPUs to switch to init_mm
> > when the exit->zap_page_range path gets to the point where
> > it frees page tables.
> > 
> I was using the latest upstream 5.3-rc2 kernel. It may be the case
> that
> the mm has been switched, but the mm_count field of the active_mm of
> the
> kthread is not being decremented until a user task runs on a CPU.

Is that something we could fix from the TLB flushing
code?

When switching to init_mm, drop the refcount on the
lazy mm?

That way that overhead is not added to the context
switching code.

-- 
All Rights Reversed.



* Re: [PATCH v3] sched/core: Don't use dying mm as active_mm of kthreads
  2019-07-31 13:48         ` Rik van Riel
@ 2019-07-31 14:15           ` Waiman Long
  2019-07-31 15:07             ` Rik van Riel
  0 siblings, 1 reply; 14+ messages in thread
From: Waiman Long @ 2019-07-31 14:15 UTC (permalink / raw)
  To: Rik van Riel, Peter Zijlstra, Ingo Molnar
  Cc: linux-kernel, linux-mm, Andrew Morton, Phil Auld, Michal Hocko

On 7/31/19 9:48 AM, Rik van Riel wrote:
> On Tue, 2019-07-30 at 17:01 -0400, Waiman Long wrote:
>> On 7/29/19 8:26 PM, Rik van Riel wrote:
>>> On Mon, 2019-07-29 at 17:42 -0400, Waiman Long wrote:
>>>
>>>> What I have found is that a long running process on a mostly idle
>>>> system
>>>> with many CPUs is likely to cycle through a lot of the CPUs
>>>> during
>>>> its
>>>> lifetime and leave behind its mm in the active_mm of those
>>>> CPUs.  My
>>>> 2-socket test system have 96 logical CPUs. After running the test
>>>> program for a minute or so, it leaves behind its mm in about half
>>>> of
>>>> the
>>>> CPUs with a mm_count of 45 after exit. So the dying mm will stay
>>>> until
>>>> all those 45 CPUs get new user tasks to run.
>>> OK. On what kernel are you seeing this?
>>>
>>> On current upstream, the code in native_flush_tlb_others()
>>> will send a TLB flush to every CPU in mm_cpumask() if page
>>> table pages have been freed.
>>>
>>> That should cause the lazy TLB CPUs to switch to init_mm
>>> when the exit->zap_page_range path gets to the point where
>>> it frees page tables.
>>>
>> I was using the latest upstream 5.3-rc2 kernel. It may be the case
>> that
>> the mm has been switched, but the mm_count field of the active_mm of
>> the
>> kthread is not being decremented until a user task runs on a CPU.
> Is that something we could fix from the TLB flushing
> code?
>
> When switching to init_mm, drop the refcount on the
> lazy mm?
>
> That way that overhead is not added to the context
> switching code.

I have thought about that. It would require changing the active_mm of
the current task to point to init_mm, for example. Since the TLB flush is
done in interrupt context, proper coordination between interrupt and
process context would require some atomic instructions, which would
defeat the purpose.

Cheers,
Longman



* Re: [PATCH v3] sched/core: Don't use dying mm as active_mm of kthreads
  2019-07-31 14:15           ` Waiman Long
@ 2019-07-31 15:07             ` Rik van Riel
  2019-07-31 15:49               ` Waiman Long
  0 siblings, 1 reply; 14+ messages in thread
From: Rik van Riel @ 2019-07-31 15:07 UTC (permalink / raw)
  To: Waiman Long, Peter Zijlstra, Ingo Molnar
  Cc: linux-kernel, linux-mm, Andrew Morton, Phil Auld, Michal Hocko

On Wed, 2019-07-31 at 10:15 -0400, Waiman Long wrote:
> On 7/31/19 9:48 AM, Rik van Riel wrote:
> > On Tue, 2019-07-30 at 17:01 -0400, Waiman Long wrote:
> > > On 7/29/19 8:26 PM, Rik van Riel wrote:
> > > > On Mon, 2019-07-29 at 17:42 -0400, Waiman Long wrote:
> > > > 
> > > > > What I have found is that a long running process on a mostly
> > > > > idle
> > > > > system
> > > > > with many CPUs is likely to cycle through a lot of the CPUs
> > > > > during
> > > > > its
> > > > > lifetime and leave behind its mm in the active_mm of those
> > > > > CPUs.  My
> > > > > 2-socket test system have 96 logical CPUs. After running the
> > > > > test
> > > > > program for a minute or so, it leaves behind its mm in about
> > > > > half
> > > > > of
> > > > > the
> > > > > CPUs with a mm_count of 45 after exit. So the dying mm will
> > > > > stay
> > > > > until
> > > > > all those 45 CPUs get new user tasks to run.
> > > > OK. On what kernel are you seeing this?
> > > > 
> > > > On current upstream, the code in native_flush_tlb_others()
> > > > will send a TLB flush to every CPU in mm_cpumask() if page
> > > > table pages have been freed.
> > > > 
> > > > That should cause the lazy TLB CPUs to switch to init_mm
> > > > when the exit->zap_page_range path gets to the point where
> > > > it frees page tables.
> > > > 
> > > I was using the latest upstream 5.3-rc2 kernel. It may be the
> > > case
> > > that
> > > the mm has been switched, but the mm_count field of the active_mm
> > > of
> > > the
> > > kthread is not being decremented until a user task runs on a CPU.
> > Is that something we could fix from the TLB flushing
> > code?
> > 
> > When switching to init_mm, drop the refcount on the
> > lazy mm?
> > 
> > That way that overhead is not added to the context
> > switching code.
> 
> I have thought about that. That will require changing the active_mm
> of
> the current task to point to init_mm, for example. Since TLB flush is
> done in interrupt context, proper coordination between interrupt and
> process context will require some atomic instruction which will
> defect
> the purpose.

Would it be possible to work around that by scheduling
a work item that drops the active_mm?

After all, a work item runs in a kernel thread, so by
the time the work item is run, either the kernel will
still be running the mm you want to get rid of as
active_mm, or it will have already gotten rid of it
earlier.

-- 
All Rights Reversed.



* Re: [PATCH v3] sched/core: Don't use dying mm as active_mm of kthreads
  2019-07-31 15:07             ` Rik van Riel
@ 2019-07-31 15:49               ` Waiman Long
  0 siblings, 0 replies; 14+ messages in thread
From: Waiman Long @ 2019-07-31 15:49 UTC (permalink / raw)
  To: Rik van Riel, Peter Zijlstra, Ingo Molnar
  Cc: linux-kernel, linux-mm, Andrew Morton, Phil Auld, Michal Hocko

On 7/31/19 11:07 AM, Rik van Riel wrote:
> On Wed, 2019-07-31 at 10:15 -0400, Waiman Long wrote:
>> On 7/31/19 9:48 AM, Rik van Riel wrote:
>>> On Tue, 2019-07-30 at 17:01 -0400, Waiman Long wrote:
>>>> On 7/29/19 8:26 PM, Rik van Riel wrote:
>>>>> On Mon, 2019-07-29 at 17:42 -0400, Waiman Long wrote:
>>>>>
>>>>>> What I have found is that a long running process on a mostly
>>>>>> idle
>>>>>> system
>>>>>> with many CPUs is likely to cycle through a lot of the CPUs
>>>>>> during
>>>>>> its
>>>>>> lifetime and leave behind its mm in the active_mm of those
>>>>>> CPUs.  My
>>>>>> 2-socket test system have 96 logical CPUs. After running the
>>>>>> test
>>>>>> program for a minute or so, it leaves behind its mm in about
>>>>>> half
>>>>>> of
>>>>>> the
>>>>>> CPUs with a mm_count of 45 after exit. So the dying mm will
>>>>>> stay
>>>>>> until
>>>>>> all those 45 CPUs get new user tasks to run.
>>>>> OK. On what kernel are you seeing this?
>>>>>
>>>>> On current upstream, the code in native_flush_tlb_others()
>>>>> will send a TLB flush to every CPU in mm_cpumask() if page
>>>>> table pages have been freed.
>>>>>
>>>>> That should cause the lazy TLB CPUs to switch to init_mm
>>>>> when the exit->zap_page_range path gets to the point where
>>>>> it frees page tables.
>>>>>
>>>> I was using the latest upstream 5.3-rc2 kernel. It may be the
>>>> case
>>>> that
>>>> the mm has been switched, but the mm_count field of the active_mm
>>>> of
>>>> the
>>>> kthread is not being decremented until a user task runs on a CPU.
>>> Is that something we could fix from the TLB flushing
>>> code?
>>>
>>> When switching to init_mm, drop the refcount on the
>>> lazy mm?
>>>
>>> That way that overhead is not added to the context
>>> switching code.
>> I have thought about that. That will require changing the active_mm
>> of
>> the current task to point to init_mm, for example. Since TLB flush is
>> done in interrupt context, proper coordination between interrupt and
>> process context will require some atomic instruction which will
>> defect
>> the purpose.
> Would it be possible to work around that by scheduling
> a work item that drops the active_mm?
>
> After all, a work item runs in a kernel thread, so by
> the time the work item is run, either the kernel will
> still be running the mm you want to get rid of as
> active_mm, or it will have already gotten rid of it
> earlier.

Yes, that may work.

Thanks,
Longman


