* [PATCH v2] sched/core: Don't use dying mm as active_mm of kthreads
@ 2019-07-27 17:10 Waiman Long
  2019-07-29  8:18 ` Qais Yousef
  ` (2 more replies)
  0 siblings, 3 replies; 26+ messages in thread
From: Waiman Long @ 2019-07-27 17:10 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar
  Cc: linux-kernel, linux-mm, Andrew Morton, Phil Auld, Waiman Long

It was found that a dying mm_struct where the owning task has exited
can stay on as active_mm of kernel threads as long as no other user
tasks run on those CPUs that use it as active_mm. This prolongs the
life time of dying mm holding up memory and other resources like swap
space that cannot be freed.

Fix that by forcing the kernel threads to use init_mm as the active_mm
if the previous active_mm is dying.

The determination of a dying mm is based on the absence of an owning
task. The selection of the owning task only happens with the CONFIG_MEMCG
option. Without that, there is no simple way to determine the life span
of a given mm. So it falls back to the old behavior.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 include/linux/mm_types.h | 15 +++++++++++++++
 kernel/sched/core.c      | 13 +++++++++++--
 mm/init-mm.c             |  4 ++++
 3 files changed, 30 insertions(+), 2 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 3a37a89eb7a7..32712e78763c 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -623,6 +623,21 @@ static inline bool mm_tlb_flush_nested(struct mm_struct *mm)
 	return atomic_read(&mm->tlb_flush_pending) > 1;
 }
 
+#ifdef CONFIG_MEMCG
+/*
+ * A mm is considered dying if there is no owning task.
+ */
+static inline bool mm_dying(struct mm_struct *mm)
+{
+	return !mm->owner;
+}
+#else
+static inline bool mm_dying(struct mm_struct *mm)
+{
+	return false;
+}
+#endif
+
 struct vm_fault;
 
 /**
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 2b037f195473..923a63262dfd 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3233,13 +3233,22 @@ context_switch(struct rq *rq, struct task_struct *prev,
 	 * Both of these contain the full memory barrier required by
 	 * membarrier after storing to rq->curr, before returning to
 	 * user-space.
+	 *
+	 * If mm is NULL and oldmm is dying (!owner), we switch to
+	 * init_mm instead to make sure that oldmm can be freed ASAP.
 	 */
-	if (!mm) {
+	if (!mm && !mm_dying(oldmm)) {
 		next->active_mm = oldmm;
 		mmgrab(oldmm);
 		enter_lazy_tlb(oldmm, next);
-	} else
+	} else {
+		if (!mm) {
+			mm = &init_mm;
+			next->active_mm = mm;
+			mmgrab(mm);
+		}
 		switch_mm_irqs_off(oldmm, mm, next);
+	}
 
 	if (!prev->mm) {
 		prev->active_mm = NULL;
diff --git a/mm/init-mm.c b/mm/init-mm.c
index a787a319211e..69090a11249c 100644
--- a/mm/init-mm.c
+++ b/mm/init-mm.c
@@ -5,6 +5,7 @@
 #include <linux/spinlock.h>
 #include <linux/list.h>
 #include <linux/cpumask.h>
+#include <linux/sched/task.h>
 #include <linux/atomic.h>
 #include <linux/user_namespace.h>
 
@@ -36,5 +37,8 @@ struct mm_struct init_mm = {
 	.mmlist		= LIST_HEAD_INIT(init_mm.mmlist),
 	.user_ns	= &init_user_ns,
 	.cpu_bitmap	= { [BITS_TO_LONGS(NR_CPUS)] = 0},
+#ifdef CONFIG_MEMCG
+	.owner		= &init_task,
+#endif
 	INIT_MM_CONTEXT(init_mm)
};
-- 
2.18.1

^ permalink raw reply related	[flat|nested] 26+ messages in thread
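[Editorial aside: the selection logic of the hunk above can be modeled in ordinary userspace C. This is a toy sketch, not kernel code: `struct mm`, `pick_active_mm()` and the string-valued `owner` field are invented stand-ins, and the plain integer `refcount` only mimics what `mmgrab()` does to `mm_count`.]

```c
#include <assert.h>
#include <stddef.h>

/* Toy model of the patch's decision, not kernel code. */
struct mm {
	int refcount;          /* stands in for mm_count / mmgrab() */
	const char *owner;     /* NULL once the owning task has exited */
};

static struct mm init_mm = { .refcount = 1, .owner = "init_task" };

static int mm_dying(const struct mm *mm)
{
	return mm->owner == NULL;   /* mirrors !mm->owner under CONFIG_MEMCG */
}

/*
 * Pick the active_mm for an incoming task; 'mm' is NULL for a kthread.
 * A live oldmm is kept lazily; a dying oldmm is replaced by init_mm so
 * that oldmm's last reference can be dropped and the mm freed.
 */
static struct mm *pick_active_mm(struct mm *oldmm, struct mm *mm)
{
	if (!mm && !mm_dying(oldmm)) {
		oldmm->refcount++;          /* mmgrab(oldmm): lazy TLB */
		return oldmm;
	}
	if (!mm) {
		mm = &init_mm;
		mm->refcount++;             /* mmgrab(&init_mm) */
	}
	return mm;                          /* switch_mm() path */
}
```

With a live owner the kthread keeps the previous mm lazily; once the owner is gone it falls through to the `init_mm` branch, which is the whole point of the patch.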
* Re: [PATCH v2] sched/core: Don't use dying mm as active_mm of kthreads
  2019-07-27 17:10 [PATCH v2] sched/core: Don't use dying mm as active_mm of kthreads Waiman Long
@ 2019-07-29  8:18 ` Qais Yousef
  2019-07-29 21:06   ` Waiman Long
  2019-07-29  8:52 ` Peter Zijlstra
  2019-07-29  9:12 ` Michal Hocko
  2 siblings, 1 reply; 26+ messages in thread
From: Qais Yousef @ 2019-07-29  8:18 UTC (permalink / raw)
  To: Waiman Long
  Cc: Peter Zijlstra, Ingo Molnar, linux-kernel, linux-mm,
	Andrew Morton, Phil Auld

On 07/27/19 13:10, Waiman Long wrote:
> It was found that a dying mm_struct where the owning task has exited
> can stay on as active_mm of kernel threads as long as no other user
> tasks run on those CPUs that use it as active_mm. This prolongs the
> life time of dying mm holding up memory and other resources like swap
> space that cannot be freed.
>
> Fix that by forcing the kernel threads to use init_mm as the active_mm
> if the previous active_mm is dying.
>
> The determination of a dying mm is based on the absence of an owning
> task. The selection of the owning task only happens with the CONFIG_MEMCG
> option. Without that, there is no simple way to determine the life span
> of a given mm. So it falls back to the old behavior.

I don't really know a lot about this code, but does the owner field have
to depend on CONFIG_MEMCG? ie: can't the owner always be set?

Cheers

--
Qais Yousef

^ permalink raw reply	[flat|nested] 26+ messages in thread
* Re: [PATCH v2] sched/core: Don't use dying mm as active_mm of kthreads
  2019-07-29  8:18 ` Qais Yousef
@ 2019-07-29 21:06   ` Waiman Long
  2019-07-29 21:33     ` Qais Yousef
  0 siblings, 1 reply; 26+ messages in thread
From: Waiman Long @ 2019-07-29 21:06 UTC (permalink / raw)
  To: Qais Yousef
  Cc: Peter Zijlstra, Ingo Molnar, linux-kernel, linux-mm,
	Andrew Morton, Phil Auld

On 7/29/19 4:18 AM, Qais Yousef wrote:
> On 07/27/19 13:10, Waiman Long wrote:
>> It was found that a dying mm_struct where the owning task has exited
>> can stay on as active_mm of kernel threads as long as no other user
>> tasks run on those CPUs that use it as active_mm. This prolongs the
>> life time of dying mm holding up memory and other resources like swap
>> space that cannot be freed.
>>
>> Fix that by forcing the kernel threads to use init_mm as the active_mm
>> if the previous active_mm is dying.
>>
>> The determination of a dying mm is based on the absence of an owning
>> task. The selection of the owning task only happens with the CONFIG_MEMCG
>> option. Without that, there is no simple way to determine the life span
>> of a given mm. So it falls back to the old behavior.
> I don't really know a lot about this code, but does the owner field has to
> depend on CONFIG_MEMCG? ie: can't the owner be always set?
>
Yes, the owner field is only used and defined when CONFIG_MEMCG is on.

Cheers,
Longman

^ permalink raw reply	[flat|nested] 26+ messages in thread
* Re: [PATCH v2] sched/core: Don't use dying mm as active_mm of kthreads
  2019-07-29 21:06     ` Waiman Long
@ 2019-07-29 21:33       ` Qais Yousef
  0 siblings, 0 replies; 26+ messages in thread
From: Qais Yousef @ 2019-07-29 21:33 UTC (permalink / raw)
  To: Waiman Long
  Cc: Peter Zijlstra, Ingo Molnar, linux-kernel, linux-mm,
	Andrew Morton, Phil Auld

On 07/29/19 17:06, Waiman Long wrote:
> On 7/29/19 4:18 AM, Qais Yousef wrote:
> > On 07/27/19 13:10, Waiman Long wrote:
> >> It was found that a dying mm_struct where the owning task has exited
> >> can stay on as active_mm of kernel threads as long as no other user
> >> tasks run on those CPUs that use it as active_mm. This prolongs the
> >> life time of dying mm holding up memory and other resources like swap
> >> space that cannot be freed.
> >>
> >> Fix that by forcing the kernel threads to use init_mm as the active_mm
> >> if the previous active_mm is dying.
> >>
> >> The determination of a dying mm is based on the absence of an owning
> >> task. The selection of the owning task only happens with the CONFIG_MEMCG
> >> option. Without that, there is no simple way to determine the life span
> >> of a given mm. So it falls back to the old behavior.
> > I don't really know a lot about this code, but does the owner field has to
> > depend on CONFIG_MEMCG? ie: can't the owner be always set?
> >
> Yes, the owner field is only used and defined when CONFIG_MEMCG is on.

I guess this is the simpler answer of it's too much work to take it out of
CONFIG_MEMCG. Anyway, given the direction of the thread this is moot now :-)

Thanks!

--
Qais Yousef

^ permalink raw reply	[flat|nested] 26+ messages in thread
* Re: [PATCH v2] sched/core: Don't use dying mm as active_mm of kthreads 2019-07-27 17:10 [PATCH v2] sched/core: Don't use dying mm as active_mm of kthreads Waiman Long 2019-07-29 8:18 ` Qais Yousef @ 2019-07-29 8:52 ` Peter Zijlstra 2019-07-29 14:24 ` [PATCH] sched: Clean up active_mm reference counting Peter Zijlstra ` (2 more replies) 2019-07-29 9:12 ` Michal Hocko 2 siblings, 3 replies; 26+ messages in thread From: Peter Zijlstra @ 2019-07-29 8:52 UTC (permalink / raw) To: Waiman Long; +Cc: Ingo Molnar, linux-kernel, linux-mm, Andrew Morton, Phil Auld On Sat, Jul 27, 2019 at 01:10:47PM -0400, Waiman Long wrote: > It was found that a dying mm_struct where the owning task has exited > can stay on as active_mm of kernel threads as long as no other user > tasks run on those CPUs that use it as active_mm. This prolongs the > life time of dying mm holding up memory and other resources like swap > space that cannot be freed. Sure, but this has been so 'forever', why is it a problem now? > Fix that by forcing the kernel threads to use init_mm as the active_mm > if the previous active_mm is dying. > > The determination of a dying mm is based on the absence of an owning > task. The selection of the owning task only happens with the CONFIG_MEMCG > option. Without that, there is no simple way to determine the life span > of a given mm. So it falls back to the old behavior. 
> > Signed-off-by: Waiman Long <longman@redhat.com> > --- > include/linux/mm_types.h | 15 +++++++++++++++ > kernel/sched/core.c | 13 +++++++++++-- > mm/init-mm.c | 4 ++++ > 3 files changed, 30 insertions(+), 2 deletions(-) > > diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h > index 3a37a89eb7a7..32712e78763c 100644 > --- a/include/linux/mm_types.h > +++ b/include/linux/mm_types.h > @@ -623,6 +623,21 @@ static inline bool mm_tlb_flush_nested(struct mm_struct *mm) > return atomic_read(&mm->tlb_flush_pending) > 1; > } > > +#ifdef CONFIG_MEMCG > +/* > + * A mm is considered dying if there is no owning task. > + */ > +static inline bool mm_dying(struct mm_struct *mm) > +{ > + return !mm->owner; > +} > +#else > +static inline bool mm_dying(struct mm_struct *mm) > +{ > + return false; > +} > +#endif > + > struct vm_fault; Yuck. So people without memcg will still suffer the terrible 'whatever it is this patch fixes'. > /** > diff --git a/kernel/sched/core.c b/kernel/sched/core.c > index 2b037f195473..923a63262dfd 100644 > --- a/kernel/sched/core.c > +++ b/kernel/sched/core.c > @@ -3233,13 +3233,22 @@ context_switch(struct rq *rq, struct task_struct *prev, > * Both of these contain the full memory barrier required by > * membarrier after storing to rq->curr, before returning to > * user-space. > + * > + * If mm is NULL and oldmm is dying (!owner), we switch to > + * init_mm instead to make sure that oldmm can be freed ASAP. > */ > - if (!mm) { > + if (!mm && !mm_dying(oldmm)) { > next->active_mm = oldmm; > mmgrab(oldmm); > enter_lazy_tlb(oldmm, next); > - } else > + } else { > + if (!mm) { > + mm = &init_mm; > + next->active_mm = mm; > + mmgrab(mm); > + } > switch_mm_irqs_off(oldmm, mm, next); > + } > > if (!prev->mm) { > prev->active_mm = NULL; Bah, I see we _still_ haven't 'fixed' that code. And you're making an even bigger mess of it. Let me go find where that cleanup went. ^ permalink raw reply [flat|nested] 26+ messages in thread
* [PATCH] sched: Clean up active_mm reference counting
  2019-07-29  8:52 ` Peter Zijlstra
@ 2019-07-29 14:24 ` Peter Zijlstra
  2019-07-29 15:01   ` Mathieu Desnoyers
  ` (2 more replies)
  0 siblings, 3 replies; 26+ messages in thread
From: Peter Zijlstra @ 2019-07-29 14:24 UTC (permalink / raw)
  To: Waiman Long
  Cc: Ingo Molnar, linux-kernel, linux-mm, Andrew Morton, Phil Auld,
	riel, luto, mathieu.desnoyers

On Mon, Jul 29, 2019 at 10:52:35AM +0200, Peter Zijlstra wrote:
> On Sat, Jul 27, 2019 at 01:10:47PM -0400, Waiman Long wrote:
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index 2b037f195473..923a63262dfd 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -3233,13 +3233,22 @@ context_switch(struct rq *rq, struct task_struct *prev,
> >  	 * Both of these contain the full memory barrier required by
> >  	 * membarrier after storing to rq->curr, before returning to
> >  	 * user-space.
> > +	 *
> > +	 * If mm is NULL and oldmm is dying (!owner), we switch to
> > +	 * init_mm instead to make sure that oldmm can be freed ASAP.
> >  	 */
> > -	if (!mm) {
> > +	if (!mm && !mm_dying(oldmm)) {
> >  		next->active_mm = oldmm;
> >  		mmgrab(oldmm);
> >  		enter_lazy_tlb(oldmm, next);
> > -	} else
> > +	} else {
> > +		if (!mm) {
> > +			mm = &init_mm;
> > +			next->active_mm = mm;
> > +			mmgrab(mm);
> > +		}
> >  		switch_mm_irqs_off(oldmm, mm, next);
> > +	}
> >  
> >  	if (!prev->mm) {
> >  		prev->active_mm = NULL;
> 
> Bah, I see we _still_ haven't 'fixed' that code. And you're making an
> even bigger mess of it.
> 
> Let me go find where that cleanup went.

---
Subject: sched: Clean up active_mm reference counting
From: Peter Zijlstra <peterz@infradead.org>
Date: Mon Jul 29 16:05:15 CEST 2019

The current active_mm reference counting is confusing and sub-optimal.

Rewrite the code to explicitly consider the 4 separate cases:

    user -> user

	When switching between two user tasks, all we need to consider
	is switch_mm().

    user -> kernel

	When switching from a user task to a kernel task (which
	doesn't have an associated mm) we retain the last mm in our
	active_mm. Increment a reference count on active_mm.

    kernel -> kernel

	When switching between kernel threads, all we need to do is
	pass along the active_mm reference.

    kernel -> user

	When switching between a kernel and user task, we must switch
	from the last active_mm to the next mm, hoping of course that
	these are the same. Decrement a reference on the active_mm.

The code keeps a different order, because as you'll note, both 'to
user' cases require switch_mm().

And where the old code would increment/decrement for the 'kernel ->
kernel' case, the new code observes this is a neutral operation and
avoids touching the reference count.

Cc: riel@surriel.com
Cc: luto@kernel.org
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/core.c |   49 ++++++++++++++++++++++++++++++-------------------
 1 file changed, 30 insertions(+), 19 deletions(-)

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3214,12 +3214,8 @@ static __always_inline struct rq *
 context_switch(struct rq *rq, struct task_struct *prev,
 	       struct task_struct *next, struct rq_flags *rf)
 {
-	struct mm_struct *mm, *oldmm;
-
 	prepare_task_switch(rq, prev, next);
 
-	mm = next->mm;
-	oldmm = prev->active_mm;
 	/*
 	 * For paravirt, this is coupled with an exit in switch_to to
 	 * combine the page table reload and the switch backend into
@@ -3228,22 +3224,37 @@ context_switch(struct rq *rq, struct tas
 	arch_start_context_switch(prev);
 
 	/*
-	 * If mm is non-NULL, we pass through switch_mm(). If mm is
-	 * NULL, we will pass through mmdrop() in finish_task_switch().
-	 * Both of these contain the full memory barrier required by
-	 * membarrier after storing to rq->curr, before returning to
-	 * user-space.
+	 * kernel -> kernel   lazy + transfer active
+	 *   user -> kernel   lazy + mmgrab() active
+	 *
+	 * kernel ->   user   switch + mmdrop() active
+	 *   user ->   user   switch
 	 */
-	if (!mm) {
-		next->active_mm = oldmm;
-		mmgrab(oldmm);
-		enter_lazy_tlb(oldmm, next);
-	} else
-		switch_mm_irqs_off(oldmm, mm, next);
-
-	if (!prev->mm) {
-		prev->active_mm = NULL;
-		rq->prev_mm = oldmm;
+	if (!next->mm) {                                // to kernel
+		enter_lazy_tlb(prev->active_mm, next);
+
+		next->active_mm = prev->active_mm;
+		if (prev->mm)                           // from user
+			mmgrab(prev->active_mm);
+		else
+			prev->active_mm = NULL;
+	} else {                                        // to user
+		/*
+		 * sys_membarrier() requires an smp_mb() between setting
+		 * rq->curr and returning to userspace.
+		 *
+		 * The below provides this either through switch_mm(), or in
+		 * case 'prev->active_mm == next->mm' through
+		 * finish_task_switch()'s mmdrop().
+		 */
+
+		switch_mm_irqs_off(prev->active_mm, next->mm, next);
+
+		if (!prev->mm) {                        // from kernel
+			/* will mmdrop() in finish_task_switch(). */
+			rq->prev_mm = prev->active_mm;
+			prev->active_mm = NULL;
+		}
 	}
 
 	rq->clock_update_flags &= ~(RQCF_ACT_SKIP|RQCF_REQ_SKIP);

^ permalink raw reply	[flat|nested] 26+ messages in thread
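[Editorial aside: the four transitions in the changelog above can be exercised with a small userspace model. This is a hedged sketch: `struct task`, `ctx_switch()` and the returned "pending drop" pointer are invented for illustration, and only the reference-count bookkeeping of the rewritten code is mirrored.]

```c
#include <assert.h>
#include <stddef.h>

/* Toy model: only the active_mm bookkeeping, nothing else. */
struct mm  { int count; };                 /* stands in for mm->mm_count */
struct task {
	struct mm *mm;                     /* NULL for kernel threads */
	struct mm *active_mm;
};

/*
 * Mirrors the rewritten context_switch(); returns the mm to mmdrop()
 * later (as finish_task_switch() would), or NULL if no drop is pending.
 */
static struct mm *ctx_switch(struct task *prev, struct task *next)
{
	struct mm *drop = NULL;

	if (!next->mm) {                           /* to kernel */
		next->active_mm = prev->active_mm;
		if (prev->mm)                      /* from user: mmgrab() */
			prev->active_mm->count++;
		else                               /* kernel -> kernel: transfer */
			prev->active_mm = NULL;
	} else {                                   /* to user */
		next->active_mm = next->mm;        /* switch_mm() */
		if (!prev->mm) {                   /* from kernel: mmdrop() */
			drop = prev->active_mm;
			drop->count--;
			prev->active_mm = NULL;
		}
	}
	return drop;
}
```

Chaining user -> kernel -> kernel -> user shows one grab on entry to lazy mode and one drop on exit, with the kernel -> kernel hop count-neutral, which is exactly the asymmetry the patch makes explicit.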
* Re: [PATCH] sched: Clean up active_mm reference counting 2019-07-29 14:24 ` [PATCH] sched: Clean up active_mm reference counting Peter Zijlstra @ 2019-07-29 15:01 ` Mathieu Desnoyers 2019-07-29 15:16 ` Waiman Long 2019-07-29 15:29 ` Rik van Riel 2 siblings, 0 replies; 26+ messages in thread From: Mathieu Desnoyers @ 2019-07-29 15:01 UTC (permalink / raw) To: Peter Zijlstra Cc: Waiman Long, Ingo Molnar, linux-kernel, linux-mm, Andrew Morton, Phil Auld, riel, Andy Lutomirski ----- On Jul 29, 2019, at 10:24 AM, Peter Zijlstra peterz@infradead.org wrote: [...] > --- > Subject: sched: Clean up active_mm reference counting > From: Peter Zijlstra <peterz@infradead.org> > Date: Mon Jul 29 16:05:15 CEST 2019 > > The current active_mm reference counting is confusing and sub-optimal. > > Rewrite the code to explicitly consider the 4 separate cases: > > user -> user > > When switching between two user tasks, all we need to consider > is switch_mm(). > > user -> kernel > > When switching from a user task to a kernel task (which > doesn't have an associated mm) we retain the last mm in our > active_mm. Increment a reference count on active_mm. > > kernel -> kernel > > When switching between kernel threads, all we need to do is > pass along the active_mm reference. > > kernel -> user > > When switching between a kernel and user task, we must switch > from the last active_mm to the next mm, hoping of course that > these are the same. Decrement a reference on the active_mm. > > The code keeps a different order, because as you'll note, both 'to > user' cases require switch_mm(). > > And where the old code would increment/decrement for the 'kernel -> > kernel' case, the new code observes this is a neutral operation and > avoids touching the reference count. 
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> > > Cc: riel@surriel.com > Cc: luto@kernel.org > Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> > --- > kernel/sched/core.c | 49 ++++++++++++++++++++++++++++++------------------- > 1 file changed, 30 insertions(+), 19 deletions(-) > > --- a/kernel/sched/core.c > +++ b/kernel/sched/core.c > @@ -3214,12 +3214,8 @@ static __always_inline struct rq * > context_switch(struct rq *rq, struct task_struct *prev, > struct task_struct *next, struct rq_flags *rf) > { > - struct mm_struct *mm, *oldmm; > - > prepare_task_switch(rq, prev, next); > > - mm = next->mm; > - oldmm = prev->active_mm; > /* > * For paravirt, this is coupled with an exit in switch_to to > * combine the page table reload and the switch backend into > @@ -3228,22 +3224,37 @@ context_switch(struct rq *rq, struct tas > arch_start_context_switch(prev); > > /* > - * If mm is non-NULL, we pass through switch_mm(). If mm is > - * NULL, we will pass through mmdrop() in finish_task_switch(). > - * Both of these contain the full memory barrier required by > - * membarrier after storing to rq->curr, before returning to > - * user-space. > + * kernel -> kernel lazy + transfer active > + * user -> kernel lazy + mmgrab() active > + * > + * kernel -> user switch + mmdrop() active > + * user -> user switch > */ > - if (!mm) { > - next->active_mm = oldmm; > - mmgrab(oldmm); > - enter_lazy_tlb(oldmm, next); > - } else > - switch_mm_irqs_off(oldmm, mm, next); > - > - if (!prev->mm) { > - prev->active_mm = NULL; > - rq->prev_mm = oldmm; > + if (!next->mm) { // to kernel > + enter_lazy_tlb(prev->active_mm, next); > + > + next->active_mm = prev->active_mm; > + if (prev->mm) // from user > + mmgrab(prev->active_mm); > + else > + prev->active_mm = NULL; > + } else { // to user > + /* > + * sys_membarrier() requires an smp_mb() between setting > + * rq->curr and returning to userspace. 
> + * > + * The below provides this either through switch_mm(), or in > + * case 'prev->active_mm == next->mm' through > + * finish_task_switch()'s mmdrop(). > + */ > + > + switch_mm_irqs_off(prev->active_mm, next->mm, next); > + > + if (!prev->mm) { // from kernel > + /* will mmdrop() in finish_task_switch(). */ > + rq->prev_mm = prev->active_mm; > + prev->active_mm = NULL; > + } > } > > rq->clock_update_flags &= ~(RQCF_ACT_SKIP|RQCF_REQ_SKIP); -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH] sched: Clean up active_mm reference counting 2019-07-29 14:24 ` [PATCH] sched: Clean up active_mm reference counting Peter Zijlstra 2019-07-29 15:01 ` Mathieu Desnoyers @ 2019-07-29 15:16 ` Waiman Long 2019-07-29 15:22 ` Peter Zijlstra 2019-07-29 15:29 ` Rik van Riel 2 siblings, 1 reply; 26+ messages in thread From: Waiman Long @ 2019-07-29 15:16 UTC (permalink / raw) To: Peter Zijlstra Cc: Ingo Molnar, linux-kernel, linux-mm, Andrew Morton, Phil Auld, riel, luto, mathieu.desnoyers On 7/29/19 10:24 AM, Peter Zijlstra wrote: > On Mon, Jul 29, 2019 at 10:52:35AM +0200, Peter Zijlstra wrote: > > --- > Subject: sched: Clean up active_mm reference counting > From: Peter Zijlstra <peterz@infradead.org> > Date: Mon Jul 29 16:05:15 CEST 2019 > > The current active_mm reference counting is confusing and sub-optimal. > > Rewrite the code to explicitly consider the 4 separate cases: > > user -> user > > When switching between two user tasks, all we need to consider > is switch_mm(). > > user -> kernel > > When switching from a user task to a kernel task (which > doesn't have an associated mm) we retain the last mm in our > active_mm. Increment a reference count on active_mm. > > kernel -> kernel > > When switching between kernel threads, all we need to do is > pass along the active_mm reference. > > kernel -> user > > When switching between a kernel and user task, we must switch > from the last active_mm to the next mm, hoping of course that > these are the same. Decrement a reference on the active_mm. > > The code keeps a different order, because as you'll note, both 'to > user' cases require switch_mm(). > > And where the old code would increment/decrement for the 'kernel -> > kernel' case, the new code observes this is a neutral operation and > avoids touching the reference count. I am aware of that behavior which is indeed redundant, but it is not what I am trying to fix and so I kind of leave it alone in my patch. 
> > Cc: riel@surriel.com > Cc: luto@kernel.org > Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> > --- > kernel/sched/core.c | 49 ++++++++++++++++++++++++++++++------------------- > 1 file changed, 30 insertions(+), 19 deletions(-) > > --- a/kernel/sched/core.c > +++ b/kernel/sched/core.c > @@ -3214,12 +3214,8 @@ static __always_inline struct rq * > context_switch(struct rq *rq, struct task_struct *prev, > struct task_struct *next, struct rq_flags *rf) > { > - struct mm_struct *mm, *oldmm; > - > prepare_task_switch(rq, prev, next); > > - mm = next->mm; > - oldmm = prev->active_mm; > /* > * For paravirt, this is coupled with an exit in switch_to to > * combine the page table reload and the switch backend into > @@ -3228,22 +3224,37 @@ context_switch(struct rq *rq, struct tas > arch_start_context_switch(prev); > > /* > - * If mm is non-NULL, we pass through switch_mm(). If mm is > - * NULL, we will pass through mmdrop() in finish_task_switch(). > - * Both of these contain the full memory barrier required by > - * membarrier after storing to rq->curr, before returning to > - * user-space. > + * kernel -> kernel lazy + transfer active > + * user -> kernel lazy + mmgrab() active > + * > + * kernel -> user switch + mmdrop() active > + * user -> user switch > */ > - if (!mm) { > - next->active_mm = oldmm; > - mmgrab(oldmm); > - enter_lazy_tlb(oldmm, next); > - } else > - switch_mm_irqs_off(oldmm, mm, next); > - > - if (!prev->mm) { > - prev->active_mm = NULL; > - rq->prev_mm = oldmm; > + if (!next->mm) { // to kernel > + enter_lazy_tlb(prev->active_mm, next); > + > + next->active_mm = prev->active_mm; > + if (prev->mm) // from user > + mmgrab(prev->active_mm); > + else > + prev->active_mm = NULL; > + } else { // to user > + /* > + * sys_membarrier() requires an smp_mb() between setting > + * rq->curr and returning to userspace. 
> + * > + * The below provides this either through switch_mm(), or in > + * case 'prev->active_mm == next->mm' through > + * finish_task_switch()'s mmdrop(). > + */ > + > + switch_mm_irqs_off(prev->active_mm, next->mm, next); > + > + if (!prev->mm) { // from kernel > + /* will mmdrop() in finish_task_switch(). */ > + rq->prev_mm = prev->active_mm; > + prev->active_mm = NULL; > + } > } > > rq->clock_update_flags &= ~(RQCF_ACT_SKIP|RQCF_REQ_SKIP); This patch looks fine to me, I don't see any problem in its logic. Acked-by: Waiman Long <longman@redhat.com> The problem that I am trying to fix is in the kernel->kernel case where the active_mm just get passed along. I would like to just bump the active_mm off if it is dying. I will see what I can do to make it work even with !CONFIG_MEMCG. Cheers, Longman ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH] sched: Clean up active_mm reference counting 2019-07-29 15:16 ` Waiman Long @ 2019-07-29 15:22 ` Peter Zijlstra 0 siblings, 0 replies; 26+ messages in thread From: Peter Zijlstra @ 2019-07-29 15:22 UTC (permalink / raw) To: Waiman Long Cc: Ingo Molnar, linux-kernel, linux-mm, Andrew Morton, Phil Auld, riel, luto, mathieu.desnoyers On Mon, Jul 29, 2019 at 11:16:55AM -0400, Waiman Long wrote: > On 7/29/19 10:24 AM, Peter Zijlstra wrote: > > On Mon, Jul 29, 2019 at 10:52:35AM +0200, Peter Zijlstra wrote: > > > > --- > > Subject: sched: Clean up active_mm reference counting > > From: Peter Zijlstra <peterz@infradead.org> > > Date: Mon Jul 29 16:05:15 CEST 2019 > > > > The current active_mm reference counting is confusing and sub-optimal. > > > > Rewrite the code to explicitly consider the 4 separate cases: > > > > user -> user > > > > When switching between two user tasks, all we need to consider > > is switch_mm(). > > > > user -> kernel > > > > When switching from a user task to a kernel task (which > > doesn't have an associated mm) we retain the last mm in our > > active_mm. Increment a reference count on active_mm. > > > > kernel -> kernel > > > > When switching between kernel threads, all we need to do is > > pass along the active_mm reference. > > > > kernel -> user > > > > When switching between a kernel and user task, we must switch > > from the last active_mm to the next mm, hoping of course that > > these are the same. Decrement a reference on the active_mm. > > > > The code keeps a different order, because as you'll note, both 'to > > user' cases require switch_mm(). > > > > And where the old code would increment/decrement for the 'kernel -> > > kernel' case, the new code observes this is a neutral operation and > > avoids touching the reference count. > > I am aware of that behavior which is indeed redundant, but it is not > what I am trying to fix and so I kind of leave it alone in my patch. Oh sure; and it's not all that important either. 
It is jst that every time I look at that code I get confused. On top of that, the new is easier to rip the active_mm stuff out of, which is where it came from. ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH] sched: Clean up active_mm reference counting
  2019-07-29 14:24 ` [PATCH] sched: Clean up active_mm reference counting Peter Zijlstra
  2019-07-29 15:01   ` Mathieu Desnoyers
  2019-07-29 15:16   ` Waiman Long
@ 2019-07-29 15:29   ` Rik van Riel
  2 siblings, 0 replies; 26+ messages in thread
From: Rik van Riel @ 2019-07-29 15:29 UTC (permalink / raw)
  To: Peter Zijlstra, Waiman Long
  Cc: Ingo Molnar, linux-kernel, linux-mm, Andrew Morton, Phil Auld,
	luto, mathieu.desnoyers

[-- Attachment #1: Type: text/plain, Size: 435 bytes --]

On Mon, 2019-07-29 at 16:24 +0200, Peter Zijlstra wrote:
> Subject: sched: Clean up active_mm reference counting
> From: Peter Zijlstra <peterz@infradead.org>
> Date: Mon Jul 29 16:05:15 CEST 2019
> 
> The current active_mm reference counting is confusing and sub-
> optimal.
> 
> Rewrite the code to explicitly consider the 4 separate cases:
> 
Reviewed-by: Rik van Riel <riel@surriel.com>

-- 
All Rights Reversed.

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 26+ messages in thread
* Re: [PATCH v2] sched/core: Don't use dying mm as active_mm of kthreads 2019-07-29 8:52 ` Peter Zijlstra 2019-07-29 14:24 ` [PATCH] sched: Clean up active_mm reference counting Peter Zijlstra @ 2019-07-29 14:27 ` Peter Zijlstra 2019-07-29 15:22 ` Waiman Long 2019-07-29 14:51 ` Waiman Long 2 siblings, 1 reply; 26+ messages in thread From: Peter Zijlstra @ 2019-07-29 14:27 UTC (permalink / raw) To: Waiman Long; +Cc: Ingo Molnar, linux-kernel, linux-mm, Andrew Morton, Phil Auld On Mon, Jul 29, 2019 at 10:52:35AM +0200, Peter Zijlstra wrote: > On Sat, Jul 27, 2019 at 01:10:47PM -0400, Waiman Long wrote: > > It was found that a dying mm_struct where the owning task has exited > > can stay on as active_mm of kernel threads as long as no other user > > tasks run on those CPUs that use it as active_mm. This prolongs the > > life time of dying mm holding up memory and other resources like swap > > space that cannot be freed. > > Sure, but this has been so 'forever', why is it a problem now? > > > Fix that by forcing the kernel threads to use init_mm as the active_mm > > if the previous active_mm is dying. > > > > The determination of a dying mm is based on the absence of an owning > > task. The selection of the owning task only happens with the CONFIG_MEMCG > > option. Without that, there is no simple way to determine the life span > > of a given mm. So it falls back to the old behavior. 
> > > > Signed-off-by: Waiman Long <longman@redhat.com> > > --- > > include/linux/mm_types.h | 15 +++++++++++++++ > > kernel/sched/core.c | 13 +++++++++++-- > > mm/init-mm.c | 4 ++++ > > 3 files changed, 30 insertions(+), 2 deletions(-) > > > > diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h > > index 3a37a89eb7a7..32712e78763c 100644 > > --- a/include/linux/mm_types.h > > +++ b/include/linux/mm_types.h > > @@ -623,6 +623,21 @@ static inline bool mm_tlb_flush_nested(struct mm_struct *mm) > > return atomic_read(&mm->tlb_flush_pending) > 1; > > } > > > > +#ifdef CONFIG_MEMCG > > +/* > > + * A mm is considered dying if there is no owning task. > > + */ > > +static inline bool mm_dying(struct mm_struct *mm) > > +{ > > + return !mm->owner; > > +} > > +#else > > +static inline bool mm_dying(struct mm_struct *mm) > > +{ > > + return false; > > +} > > +#endif > > + > > struct vm_fault; > > Yuck. So people without memcg will still suffer the terrible 'whatever > it is this patch fixes'. Also; why then not key off that owner tracking to free the resources (and leave the struct mm around) and avoid touching this scheduling hot-path ? ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH v2] sched/core: Don't use dying mm as active_mm of kthreads 2019-07-29 14:27 ` [PATCH v2] sched/core: Don't use dying mm as active_mm of kthreads Peter Zijlstra @ 2019-07-29 15:22 ` Waiman Long 2019-07-29 15:38 ` Peter Zijlstra 0 siblings, 1 reply; 26+ messages in thread From: Waiman Long @ 2019-07-29 15:22 UTC (permalink / raw) To: Peter Zijlstra Cc: Ingo Molnar, linux-kernel, linux-mm, Andrew Morton, Phil Auld On 7/29/19 10:27 AM, Peter Zijlstra wrote: > On Mon, Jul 29, 2019 at 10:52:35AM +0200, Peter Zijlstra wrote: >> On Sat, Jul 27, 2019 at 01:10:47PM -0400, Waiman Long wrote: >>> It was found that a dying mm_struct where the owning task has exited >>> can stay on as active_mm of kernel threads as long as no other user >>> tasks run on those CPUs that use it as active_mm. This prolongs the >>> life time of dying mm holding up memory and other resources like swap >>> space that cannot be freed. >> Sure, but this has been so 'forever', why is it a problem now? >> >>> Fix that by forcing the kernel threads to use init_mm as the active_mm >>> if the previous active_mm is dying. >>> >>> The determination of a dying mm is based on the absence of an owning >>> task. The selection of the owning task only happens with the CONFIG_MEMCG >>> option. Without that, there is no simple way to determine the life span >>> of a given mm. So it falls back to the old behavior. 
>>> >>> Signed-off-by: Waiman Long <longman@redhat.com> >>> --- >>> include/linux/mm_types.h | 15 +++++++++++++++ >>> kernel/sched/core.c | 13 +++++++++++-- >>> mm/init-mm.c | 4 ++++ >>> 3 files changed, 30 insertions(+), 2 deletions(-) >>> >>> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h >>> index 3a37a89eb7a7..32712e78763c 100644 >>> --- a/include/linux/mm_types.h >>> +++ b/include/linux/mm_types.h >>> @@ -623,6 +623,21 @@ static inline bool mm_tlb_flush_nested(struct mm_struct *mm) >>> return atomic_read(&mm->tlb_flush_pending) > 1; >>> } >>> >>> +#ifdef CONFIG_MEMCG >>> +/* >>> + * A mm is considered dying if there is no owning task. >>> + */ >>> +static inline bool mm_dying(struct mm_struct *mm) >>> +{ >>> + return !mm->owner; >>> +} >>> +#else >>> +static inline bool mm_dying(struct mm_struct *mm) >>> +{ >>> + return false; >>> +} >>> +#endif >>> + >>> struct vm_fault; >> Yuck. So people without memcg will still suffer the terrible 'whatever >> it is this patch fixes'. > Also; why then not key off that owner tracking to free the resources > (and leave the struct mm around) and avoid touching this scheduling > hot-path ? The resources are pinned by the reference count. Making a special case will certainly mess up the existing code. It is actually a problem for systems that are mostly idle. Only the kernel->kernel case needs to be updated. If the CPUs aren't busy running user tasks, a little bit more overhead shouldn't really hurt IMHO. Cheers, Longman ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH v2] sched/core: Don't use dying mm as active_mm of kthreads 2019-07-29 15:22 ` Waiman Long @ 2019-07-29 15:38 ` Peter Zijlstra 0 siblings, 0 replies; 26+ messages in thread From: Peter Zijlstra @ 2019-07-29 15:38 UTC (permalink / raw) To: Waiman Long Cc: Ingo Molnar, linux-kernel, linux-mm, Andrew Morton, Phil Auld, Will Deacon, Rik van Riel, Andy Lutomirski On Mon, Jul 29, 2019 at 11:22:16AM -0400, Waiman Long wrote: > On 7/29/19 10:27 AM, Peter Zijlstra wrote: > > Also; why then not key off that owner tracking to free the resources > > (and leave the struct mm around) and avoid touching this scheduling > > hot-path ? > > The resources are pinned by the reference count. Making a special case > will certainly mess up the existing code. > > It is actually a problem for systems that are mostly idle. Only the > kernel->kernel case needs to be updated. If the CPUs isn't busy running > user tasks, a little bit more overhead shouldn't really hurt IMHO. But when you cannot find a new owner; you can start to strip mm_struct. That is, what's stopping you from freeing swap reservations when that happens? That is; I think the moment mm_users drops to 0, you can destroy the actual addres space. But you have to keep mm_struct around until mm_count goes to 0. This is going on the comments with mmget() and mmgrab(); they forever confuse me. ^ permalink raw reply [flat|nested] 26+ messages in thread
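Peter's closing point — mm_users pins the address space while mm_count only pins the struct — can be sketched as a toy userspace model. Everything below is hypothetical (the names are borrowed from the kernel, the semantics radically simplified) and is not the kernel implementation:

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

/* Toy stand-in for struct mm_struct with its two refcounts:
 * mm_users pins the address space, mm_count pins the struct itself. */
struct mm {
	atomic_int mm_users;	/* mmget()/mmput() */
	atomic_int mm_count;	/* mmgrab()/mmdrop() */
	bool addrspace_alive;	/* stands in for page tables, swap, ... */
};

static struct mm *mm_alloc(void)
{
	struct mm *mm = malloc(sizeof(*mm));

	atomic_init(&mm->mm_users, 1);
	atomic_init(&mm->mm_count, 1);	/* mm_users as a whole holds one mm_count */
	mm->addrspace_alive = true;
	return mm;
}

static void mmgrab(struct mm *mm) { atomic_fetch_add(&mm->mm_count, 1); }

static void mmdrop(struct mm *mm)
{
	if (atomic_fetch_sub(&mm->mm_count, 1) == 1)
		free(mm);			/* the struct itself goes away last */
}

static void mmget(struct mm *mm) { atomic_fetch_add(&mm->mm_users, 1); }

static void mmput(struct mm *mm)
{
	if (atomic_fetch_sub(&mm->mm_users, 1) == 1) {
		mm->addrspace_alive = false;	/* tear down the expensive part now */
		mmdrop(mm);			/* drop the mm_count held for mm_users */
	}
}
```

In this model a lazy kthread reference is an mmgrab(): after the owner's final mmput() the address space (and with it any swap reservation) is gone even though the struct lingers until the kthread's mmdrop() — which is Peter's point that an elevated mm_count need not keep swap pinned.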
* Re: [PATCH v2] sched/core: Don't use dying mm as active_mm of kthreads 2019-07-29 8:52 ` Peter Zijlstra 2019-07-29 14:24 ` [PATCH] sched: Clean up active_mm reference counting Peter Zijlstra 2019-07-29 14:27 ` [PATCH v2] sched/core: Don't use dying mm as active_mm of kthreads Peter Zijlstra @ 2019-07-29 14:51 ` Waiman Long 2019-07-29 15:03 ` Peter Zijlstra 2 siblings, 1 reply; 26+ messages in thread From: Waiman Long @ 2019-07-29 14:51 UTC (permalink / raw) To: Peter Zijlstra Cc: Ingo Molnar, linux-kernel, linux-mm, Andrew Morton, Phil Auld On 7/29/19 4:52 AM, Peter Zijlstra wrote: > On Sat, Jul 27, 2019 at 01:10:47PM -0400, Waiman Long wrote: >> It was found that a dying mm_struct where the owning task has exited >> can stay on as active_mm of kernel threads as long as no other user >> tasks run on those CPUs that use it as active_mm. This prolongs the >> life time of dying mm holding up memory and other resources like swap >> space that cannot be freed. > Sure, but this has been so 'forever', why is it a problem now? I ran into this problem when running a test program that keeps on allocating and touching memory, and it eventually fails as the swap space is full. After the failure, I could not rerun the test program again because the swap space remained full. I finally tracked it down to the fact that the mm stayed on as active_mm of kernel threads. I had to make sure that all the idle cpus get a user task to run to bump the dying mm off the active_mm of those cpus, but this is just a workaround, not a solution to this problem. > >> Fix that by forcing the kernel threads to use init_mm as the active_mm >> if the previous active_mm is dying. >> >> The determination of a dying mm is based on the absence of an owning >> task. The selection of the owning task only happens with the CONFIG_MEMCG >> option. Without that, there is no simple way to determine the life span >> of a given mm. So it falls back to the old behavior.
>> >> Signed-off-by: Waiman Long <longman@redhat.com> >> --- >> include/linux/mm_types.h | 15 +++++++++++++++ >> kernel/sched/core.c | 13 +++++++++++-- >> mm/init-mm.c | 4 ++++ >> 3 files changed, 30 insertions(+), 2 deletions(-) >> >> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h >> index 3a37a89eb7a7..32712e78763c 100644 >> --- a/include/linux/mm_types.h >> +++ b/include/linux/mm_types.h >> @@ -623,6 +623,21 @@ static inline bool mm_tlb_flush_nested(struct mm_struct *mm) >> return atomic_read(&mm->tlb_flush_pending) > 1; >> } >> >> +#ifdef CONFIG_MEMCG >> +/* >> + * A mm is considered dying if there is no owning task. >> + */ >> +static inline bool mm_dying(struct mm_struct *mm) >> +{ >> + return !mm->owner; >> +} >> +#else >> +static inline bool mm_dying(struct mm_struct *mm) >> +{ >> + return false; >> +} >> +#endif >> + >> struct vm_fault; > Yuck. So people without memcg will still suffer the terrible 'whatever > it is this patch fixes'. > That is true. >> /** >> diff --git a/kernel/sched/core.c b/kernel/sched/core.c >> index 2b037f195473..923a63262dfd 100644 >> --- a/kernel/sched/core.c >> +++ b/kernel/sched/core.c >> @@ -3233,13 +3233,22 @@ context_switch(struct rq *rq, struct task_struct *prev, >> * Both of these contain the full memory barrier required by >> * membarrier after storing to rq->curr, before returning to >> * user-space. >> + * >> + * If mm is NULL and oldmm is dying (!owner), we switch to >> + * init_mm instead to make sure that oldmm can be freed ASAP. >> */ >> - if (!mm) { >> + if (!mm && !mm_dying(oldmm)) { >> next->active_mm = oldmm; >> mmgrab(oldmm); >> enter_lazy_tlb(oldmm, next); >> - } else >> + } else { >> + if (!mm) { >> + mm = &init_mm; >> + next->active_mm = mm; >> + mmgrab(mm); >> + } >> switch_mm_irqs_off(oldmm, mm, next); >> + } >> >> if (!prev->mm) { >> prev->active_mm = NULL; > Bah, I see we _still_ haven't 'fixed' that code. And you're making an > even bigger mess of it. 
> > Let me go find where that cleanup went. It would be nice if there were a better solution. Cheers, Longman ^ permalink raw reply [flat|nested] 26+ messages in thread
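For readers following the diff quoted above, the branch structure the v2 patch adds to context_switch() boils down to a three-way decision. The sketch below is a hypothetical pure-function reduction for illustration (pick_next_mm(), oldmm_dying and the opaque struct mm are invented names), not the kernel code:

```c
#include <stdbool.h>
#include <stddef.h>

struct mm;	/* opaque here; only pointer identity matters */

/*
 * Hypothetical reduction of the patched context_switch() choice:
 * which mm does the incoming task end up running on?
 */
static struct mm *pick_next_mm(struct mm *next_mm, struct mm *oldmm,
			       bool oldmm_dying, struct mm *init_mm)
{
	if (next_mm)		/* user task: its own mm, as before */
		return next_mm;
	if (!oldmm_dying)	/* kthread, healthy oldmm: stay lazy on it */
		return oldmm;
	return init_mm;		/* kthread, dying oldmm: drop it for init_mm */
}
```

The first two arms reproduce the pre-patch behavior; only the last arm is new, and it is the one that lets a dying mm be released promptly instead of riding along as a kthread's active_mm.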
* Re: [PATCH v2] sched/core: Don't use dying mm as active_mm of kthreads 2019-07-29 14:51 ` Waiman Long @ 2019-07-29 15:03 ` Peter Zijlstra 2019-07-29 15:28 ` Rik van Riel ` (2 more replies) 0 siblings, 3 replies; 26+ messages in thread From: Peter Zijlstra @ 2019-07-29 15:03 UTC (permalink / raw) To: Waiman Long Cc: Ingo Molnar, linux-kernel, linux-mm, Andrew Morton, Phil Auld, Rik van Riel, Andy Lutomirski On Mon, Jul 29, 2019 at 10:51:51AM -0400, Waiman Long wrote: > On 7/29/19 4:52 AM, Peter Zijlstra wrote: > > On Sat, Jul 27, 2019 at 01:10:47PM -0400, Waiman Long wrote: > >> It was found that a dying mm_struct where the owning task has exited > >> can stay on as active_mm of kernel threads as long as no other user > >> tasks run on those CPUs that use it as active_mm. This prolongs the > >> life time of dying mm holding up memory and other resources like swap > >> space that cannot be freed. > > Sure, but this has been so 'forever', why is it a problem now? > > I ran into this probem when running a test program that keeps on > allocating and touch memory and it eventually fails as the swap space is > full. After the failure, I could not rerun the test program again > because the swap space remained full. I finally track it down to the > fact that the mm stayed on as active_mm of kernel threads. I have to > make sure that all the idle cpus get a user task to run to bump the > dying mm off the active_mm of those cpus, but this is just a workaround, > not a solution to this problem. The 'sad' part is that x86 already switches to init_mm on idle and we only keep the active_mm around for 'stupid'. Rik and Andy were working on getting that 'fixed' a while ago, not sure where that went. ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH v2] sched/core: Don't use dying mm as active_mm of kthreads 2019-07-29 15:03 ` Peter Zijlstra @ 2019-07-29 15:28 ` Rik van Riel 2019-07-29 15:44 ` Peter Zijlstra 2019-07-29 15:37 ` Waiman Long 2019-07-29 16:22 ` Andy Lutomirski 2 siblings, 1 reply; 26+ messages in thread From: Rik van Riel @ 2019-07-29 15:28 UTC (permalink / raw) To: Peter Zijlstra, Waiman Long Cc: Ingo Molnar, linux-kernel, linux-mm, Andrew Morton, Phil Auld, Andy Lutomirski [-- Attachment #1: Type: text/plain, Size: 487 bytes --] On Mon, 2019-07-29 at 17:03 +0200, Peter Zijlstra wrote: > The 'sad' part is that x86 already switches to init_mm on idle and we > only keep the active_mm around for 'stupid'. Wait, where do we do that? > Rik and Andy were working on getting that 'fixed' a while ago, not > sure > where that went. My lazy TLB stuff got merged last year. Did we miss a spot somewhere, where the code still quietly switches to init_mm for no good reason? -- All Rights Reversed. [-- Attachment #2: This is a digitally signed message part --] [-- Type: application/pgp-signature, Size: 488 bytes --] ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH v2] sched/core: Don't use dying mm as active_mm of kthreads 2019-07-29 15:28 ` Rik van Riel @ 2019-07-29 15:44 ` Peter Zijlstra 2019-07-29 16:10 ` Rik van Riel 0 siblings, 1 reply; 26+ messages in thread From: Peter Zijlstra @ 2019-07-29 15:44 UTC (permalink / raw) To: Rik van Riel Cc: Waiman Long, Ingo Molnar, linux-kernel, linux-mm, Andrew Morton, Phil Auld, Andy Lutomirski On Mon, Jul 29, 2019 at 11:28:04AM -0400, Rik van Riel wrote: > On Mon, 2019-07-29 at 17:03 +0200, Peter Zijlstra wrote: > > > The 'sad' part is that x86 already switches to init_mm on idle and we > > only keep the active_mm around for 'stupid'. > > Wait, where do we do that? drivers/idle/intel_idle.c: leave_mm(cpu); drivers/acpi/processor_idle.c: acpi_unlazy_tlb(smp_processor_id()); > > Rik and Andy were working on getting that 'fixed' a while ago, not > > sure > > where that went. > > My lazy TLB stuff got merged last year. Yes, but we never got around to getting rid of active_mm for x86, right? ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH v2] sched/core: Don't use dying mm as active_mm of kthreads 2019-07-29 15:44 ` Peter Zijlstra @ 2019-07-29 16:10 ` Rik van Riel 2019-07-29 16:26 ` Peter Zijlstra 0 siblings, 1 reply; 26+ messages in thread From: Rik van Riel @ 2019-07-29 16:10 UTC (permalink / raw) To: Peter Zijlstra Cc: Waiman Long, Ingo Molnar, linux-kernel, linux-mm, Andrew Morton, Phil Auld, Andy Lutomirski [-- Attachment #1: Type: text/plain, Size: 967 bytes --] On Mon, 2019-07-29 at 17:44 +0200, Peter Zijlstra wrote: > On Mon, Jul 29, 2019 at 11:28:04AM -0400, Rik van Riel wrote: > > On Mon, 2019-07-29 at 17:03 +0200, Peter Zijlstra wrote: > > > > > The 'sad' part is that x86 already switches to init_mm on idle > > > and we > > > only keep the active_mm around for 'stupid'. > > > > Wait, where do we do that? > > drivers/idle/intel_idle.c: leave_mm(cpu); > drivers/acpi/processor_idle.c: acpi_unlazy_tlb(smp_processor_id()); This is only done for deeper c-states, isn't it? > > > Rik and Andy were working on getting that 'fixed' a while ago, > > > not > > > sure > > > where that went. > > > > My lazy TLB stuff got merged last year. > > Yes, but we never got around to getting rid of active_mm for x86, > right? True, we still use active_mm. Getting rid of the active_mm refcounting alltogether did not look entirely worthwhile the hassle. -- All Rights Reversed. [-- Attachment #2: This is a digitally signed message part --] [-- Type: application/pgp-signature, Size: 488 bytes --] ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH v2] sched/core: Don't use dying mm as active_mm of kthreads 2019-07-29 16:10 ` Rik van Riel @ 2019-07-29 16:26 ` Peter Zijlstra 0 siblings, 0 replies; 26+ messages in thread From: Peter Zijlstra @ 2019-07-29 16:26 UTC (permalink / raw) To: Rik van Riel Cc: Waiman Long, Ingo Molnar, linux-kernel, linux-mm, Andrew Morton, Phil Auld, Andy Lutomirski On Mon, Jul 29, 2019 at 12:10:12PM -0400, Rik van Riel wrote: > On Mon, 2019-07-29 at 17:44 +0200, Peter Zijlstra wrote: > > On Mon, Jul 29, 2019 at 11:28:04AM -0400, Rik van Riel wrote: > > > On Mon, 2019-07-29 at 17:03 +0200, Peter Zijlstra wrote: > > > > > > > The 'sad' part is that x86 already switches to init_mm on idle > > > > and we > > > > only keep the active_mm around for 'stupid'. > > > > > > Wait, where do we do that? > > > > drivers/idle/intel_idle.c: leave_mm(cpu); > > drivers/acpi/processor_idle.c: acpi_unlazy_tlb(smp_processor_id()); > > This is only done for deeper c-states, isn't it? Not C1 but I forever forget where it starts doing that. IIRC it isn't too hard to hit it often, and I'm fairly sure we always do it when we hit NOHZ. > > > > Rik and Andy were working on getting that 'fixed' a while ago, > > > > not > > > > sure > > > > where that went. > > > > > > My lazy TLB stuff got merged last year. > > > > Yes, but we never got around to getting rid of active_mm for x86, > > right? > > True, we still use active_mm. Getting rid of the > active_mm refcounting alltogether did not look > entirely worthwhile the hassle. OK, clearly I forgot some of the details ;-) ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH v2] sched/core: Don't use dying mm as active_mm of kthreads 2019-07-29 15:03 ` Peter Zijlstra 2019-07-29 15:28 ` Rik van Riel @ 2019-07-29 15:37 ` Waiman Long 2019-07-29 16:12 ` Rik van Riel 2019-07-29 16:22 ` Andy Lutomirski 2 siblings, 1 reply; 26+ messages in thread From: Waiman Long @ 2019-07-29 15:37 UTC (permalink / raw) To: Peter Zijlstra Cc: Ingo Molnar, linux-kernel, linux-mm, Andrew Morton, Phil Auld, Rik van Riel, Andy Lutomirski On 7/29/19 11:03 AM, Peter Zijlstra wrote: > On Mon, Jul 29, 2019 at 10:51:51AM -0400, Waiman Long wrote: >> On 7/29/19 4:52 AM, Peter Zijlstra wrote: >>> On Sat, Jul 27, 2019 at 01:10:47PM -0400, Waiman Long wrote: >>>> It was found that a dying mm_struct where the owning task has exited >>>> can stay on as active_mm of kernel threads as long as no other user >>>> tasks run on those CPUs that use it as active_mm. This prolongs the >>>> life time of dying mm holding up memory and other resources like swap >>>> space that cannot be freed. >>> Sure, but this has been so 'forever', why is it a problem now? >> I ran into this probem when running a test program that keeps on >> allocating and touch memory and it eventually fails as the swap space is >> full. After the failure, I could not rerun the test program again >> because the swap space remained full. I finally track it down to the >> fact that the mm stayed on as active_mm of kernel threads. I have to >> make sure that all the idle cpus get a user task to run to bump the >> dying mm off the active_mm of those cpus, but this is just a workaround, >> not a solution to this problem. > The 'sad' part is that x86 already switches to init_mm on idle and we > only keep the active_mm around for 'stupid'. > > Rik and Andy were working on getting that 'fixed' a while ago, not sure > where that went. Good, perhaps the right thing to do is for the idle->kernel case to keep init_mm as the active_mm instead of reusing whatever was left behind the last time around.
Cheers, Longman ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH v2] sched/core: Don't use dying mm as active_mm of kthreads 2019-07-29 15:37 ` Waiman Long @ 2019-07-29 16:12 ` Rik van Riel 0 siblings, 0 replies; 26+ messages in thread From: Rik van Riel @ 2019-07-29 16:12 UTC (permalink / raw) To: Waiman Long, Peter Zijlstra Cc: Ingo Molnar, linux-kernel, linux-mm, Andrew Morton, Phil Auld, Andy Lutomirski [-- Attachment #1: Type: text/plain, Size: 2163 bytes --] On Mon, 2019-07-29 at 11:37 -0400, Waiman Long wrote: > On 7/29/19 11:03 AM, Peter Zijlstra wrote: > > On Mon, Jul 29, 2019 at 10:51:51AM -0400, Waiman Long wrote: > > > On 7/29/19 4:52 AM, Peter Zijlstra wrote: > > > > On Sat, Jul 27, 2019 at 01:10:47PM -0400, Waiman Long wrote: > > > > > It was found that a dying mm_struct where the owning task has > > > > > exited > > > > > can stay on as active_mm of kernel threads as long as no > > > > > other user > > > > > tasks run on those CPUs that use it as active_mm. This > > > > > prolongs the > > > > > life time of dying mm holding up memory and other resources > > > > > like swap > > > > > space that cannot be freed. > > > > Sure, but this has been so 'forever', why is it a problem now? > > > I ran into this probem when running a test program that keeps on > > > allocating and touch memory and it eventually fails as the swap > > > space is > > > full. After the failure, I could not rerun the test program again > > > because the swap space remained full. I finally track it down to > > > the > > > fact that the mm stayed on as active_mm of kernel threads. I have > > > to > > > make sure that all the idle cpus get a user task to run to bump > > > the > > > dying mm off the active_mm of those cpus, but this is just a > > > workaround, > > > not a solution to this problem. > > The 'sad' part is that x86 already switches to init_mm on idle and > > we > > only keep the active_mm around for 'stupid'. > > > > Rik and Andy were working on getting that 'fixed' a while ago, not > > sure > > where that went. 
> > Good, perhaps the right thing to do is for the idle->kernel case to > keep > init_mm as the active_mm instead of reuse whatever left behind the > last > time around. Absolutely not. That creates heavy cache line contention on the mm_cpumask as we switch the mm out and back in after an idle period. The cache line contention on the mm_cpumask alone can take up as much as a percent of CPU time on a very busy system with a large multi-threaded application, multiple sockets, and lots of context switches. -- All Rights Reversed. [-- Attachment #2: This is a digitally signed message part --] [-- Type: application/pgp-signature, Size: 488 bytes --] ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH v2] sched/core: Don't use dying mm as active_mm of kthreads 2019-07-29 15:03 ` Peter Zijlstra 2019-07-29 15:28 ` Rik van Riel 2019-07-29 15:37 ` Waiman Long @ 2019-07-29 16:22 ` Andy Lutomirski 2 siblings, 0 replies; 26+ messages in thread From: Andy Lutomirski @ 2019-07-29 16:22 UTC (permalink / raw) To: Peter Zijlstra Cc: Waiman Long, Ingo Molnar, linux-kernel, linux-mm, Andrew Morton, Phil Auld, Rik van Riel, Andy Lutomirski > On Jul 29, 2019, at 8:03 AM, Peter Zijlstra <peterz@infradead.org> wrote: > >> On Mon, Jul 29, 2019 at 10:51:51AM -0400, Waiman Long wrote: >>> On 7/29/19 4:52 AM, Peter Zijlstra wrote: >>>> On Sat, Jul 27, 2019 at 01:10:47PM -0400, Waiman Long wrote: >>>> It was found that a dying mm_struct where the owning task has exited >>>> can stay on as active_mm of kernel threads as long as no other user >>>> tasks run on those CPUs that use it as active_mm. This prolongs the >>>> life time of dying mm holding up memory and other resources like swap >>>> space that cannot be freed. >>> Sure, but this has been so 'forever', why is it a problem now? >> >> I ran into this probem when running a test program that keeps on >> allocating and touch memory and it eventually fails as the swap space is >> full. After the failure, I could not rerun the test program again >> because the swap space remained full. I finally track it down to the >> fact that the mm stayed on as active_mm of kernel threads. I have to >> make sure that all the idle cpus get a user task to run to bump the >> dying mm off the active_mm of those cpus, but this is just a workaround, >> not a solution to this problem. > > The 'sad' part is that x86 already switches to init_mm on idle and we > only keep the active_mm around for 'stupid'. > > Rik and Andy were working on getting that 'fixed' a while ago, not sure > where that went. 
I thought the current status was that we don’t always switch to init_mm on idle and instead we use a fancier and actually correct flushing routine that only flushed idle CPUs when pagetables are freed. I still think we should be able to kill active_mm in favor of explicit refcounting in the arch code. ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH v2] sched/core: Don't use dying mm as active_mm of kthreads 2019-07-27 17:10 [PATCH v2] sched/core: Don't use dying mm as active_mm of kthreads Waiman Long 2019-07-29 8:18 ` Qais Yousef 2019-07-29 8:52 ` Peter Zijlstra @ 2019-07-29 9:12 ` Michal Hocko 2019-07-29 15:27 ` Waiman Long 2 siblings, 1 reply; 26+ messages in thread From: Michal Hocko @ 2019-07-29 9:12 UTC (permalink / raw) To: Waiman Long Cc: Peter Zijlstra, Ingo Molnar, linux-kernel, linux-mm, Andrew Morton, Phil Auld On Sat 27-07-19 13:10:47, Waiman Long wrote: > It was found that a dying mm_struct where the owning task has exited > can stay on as active_mm of kernel threads as long as no other user > tasks run on those CPUs that use it as active_mm. This prolongs the > life time of dying mm holding up memory and other resources like swap > space that cannot be freed. IIRC use_mm doesn't pin the address space. It only pins the mm_struct itself. So what exactly is the problem here? > > Fix that by forcing the kernel threads to use init_mm as the active_mm > if the previous active_mm is dying. > > The determination of a dying mm is based on the absence of an owning > task. The selection of the owning task only happens with the CONFIG_MEMCG > option. Without that, there is no simple way to determine the life span > of a given mm. So it falls back to the old behavior. Please don't. We really want to remove mm->owner long term. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH v2] sched/core: Don't use dying mm as active_mm of kthreads 2019-07-29 9:12 ` Michal Hocko @ 2019-07-29 15:27 ` Waiman Long 2019-07-29 18:58 ` Michal Hocko 0 siblings, 1 reply; 26+ messages in thread From: Waiman Long @ 2019-07-29 15:27 UTC (permalink / raw) To: Michal Hocko Cc: Peter Zijlstra, Ingo Molnar, linux-kernel, linux-mm, Andrew Morton, Phil Auld On 7/29/19 5:12 AM, Michal Hocko wrote: > On Sat 27-07-19 13:10:47, Waiman Long wrote: >> It was found that a dying mm_struct where the owning task has exited >> can stay on as active_mm of kernel threads as long as no other user >> tasks run on those CPUs that use it as active_mm. This prolongs the >> life time of dying mm holding up memory and other resources like swap >> space that cannot be freed. > IIRC use_mm doesn't pin the address space. It only pins the mm_struct > itself. So what exactly is the problem here? As explained in my response to Peter, I found that resource like swap space were depleted even after the exit of the offending program in a mostly idle system. This patch is to make sure that those resources get freed after program exit ASAP. >> Fix that by forcing the kernel threads to use init_mm as the active_mm >> if the previous active_mm is dying. >> >> The determination of a dying mm is based on the absence of an owning >> task. The selection of the owning task only happens with the CONFIG_MEMCG >> option. Without that, there is no simple way to determine the life span >> of a given mm. So it falls back to the old behavior. > Please don't. We really wont to remove mm->owner long term. OK, if that is the case, I will need to find an alternative way to determine if an mm is to be freed soon and perhaps set a flag to indicate that. Thanks, Longman ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH v2] sched/core: Don't use dying mm as active_mm of kthreads 2019-07-29 15:27 ` Waiman Long @ 2019-07-29 18:58 ` Michal Hocko 2019-07-29 20:41 ` Waiman Long 0 siblings, 1 reply; 26+ messages in thread From: Michal Hocko @ 2019-07-29 18:58 UTC (permalink / raw) To: Waiman Long Cc: Peter Zijlstra, Ingo Molnar, linux-kernel, linux-mm, Andrew Morton, Phil Auld On Mon 29-07-19 11:27:35, Waiman Long wrote: > On 7/29/19 5:12 AM, Michal Hocko wrote: > > On Sat 27-07-19 13:10:47, Waiman Long wrote: > >> It was found that a dying mm_struct where the owning task has exited > >> can stay on as active_mm of kernel threads as long as no other user > >> tasks run on those CPUs that use it as active_mm. This prolongs the > >> life time of dying mm holding up memory and other resources like swap > >> space that cannot be freed. > > IIRC use_mm doesn't pin the address space. It only pins the mm_struct > > itself. So what exactly is the problem here? > > As explained in my response to Peter, I found that resource like swap > space were depleted even after the exit of the offending program in a > mostly idle system. This patch is to make sure that those resources get > freed after program exit ASAP. Could you elaborate more? How can a mm counter (do not confuse with mm_users) prevent address space to be torn down on exit? -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH v2] sched/core: Don't use dying mm as active_mm of kthreads 2019-07-29 18:58 ` Michal Hocko @ 2019-07-29 20:41 ` Waiman Long 0 siblings, 0 replies; 26+ messages in thread From: Waiman Long @ 2019-07-29 20:41 UTC (permalink / raw) To: Michal Hocko Cc: Peter Zijlstra, Ingo Molnar, linux-kernel, linux-mm, Andrew Morton, Phil Auld On 7/29/19 2:58 PM, Michal Hocko wrote: > On Mon 29-07-19 11:27:35, Waiman Long wrote: >> On 7/29/19 5:12 AM, Michal Hocko wrote: >>> On Sat 27-07-19 13:10:47, Waiman Long wrote: >>>> It was found that a dying mm_struct where the owning task has exited >>>> can stay on as active_mm of kernel threads as long as no other user >>>> tasks run on those CPUs that use it as active_mm. This prolongs the >>>> life time of dying mm holding up memory and other resources like swap >>>> space that cannot be freed. >>> IIRC use_mm doesn't pin the address space. It only pins the mm_struct >>> itself. So what exactly is the problem here? >> As explained in my response to Peter, I found that resource like swap >> space were depleted even after the exit of the offending program in a >> mostly idle system. This patch is to make sure that those resources get >> freed after program exit ASAP. > Could you elaborate more? How can a mm counter (do not confuse with > mm_users) prevent address space to be torn down on exit? Many of the resources tied to mm_struct are indeed freed when mm_users becomes 0, including the swap space reservation, I think. I was testing an mm patch and it had a missing mmput bug that caused mm_users not to go to 0. I fixed the bug, and with the sched patch to speed up the release of the mm_struct, everything was fine. I didn't realize that fixing the mm bug was enough to free the swap space. Still, there are some resources not being freed while mm_count is non-zero. It is certainly less serious than what I had thought. Sorry for the confusion. Cheers, Longman ^ permalink raw reply [flat|nested] 26+ messages in thread
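The failure mode Waiman describes — a missing mmput() leaving mm_users elevated, so the address-space teardown (and the swap-space release with it) never runs — can be reduced to a few lines. This is a hypothetical toy model (model_mmget()/model_mmput() are invented names), not kernel code:

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Minimal model: the address space (swap reservation included) is torn
 * down only when the last mm_users-style reference is put. */
struct mm_model {
	atomic_int users;
	bool swap_freed;
};

static void model_mmget(struct mm_model *m)
{
	atomic_fetch_add(&m->users, 1);
}

static void model_mmput(struct mm_model *m)
{
	if (atomic_fetch_sub(&m->users, 1) == 1)
		m->swap_freed = true;	/* teardown happens here, not at mm_count == 0 */
}
```

A single unmatched model_mmget() keeps swap_freed false forever after the task exits — exactly the observed symptom of swap staying full on an otherwise idle system until the missing put was restored.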
end of thread, other threads:[~2019-07-29 21:33 UTC | newest] Thread overview: 26+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2019-07-27 17:10 [PATCH v2] sched/core: Don't use dying mm as active_mm of kthreads Waiman Long 2019-07-29 8:18 ` Qais Yousef 2019-07-29 21:06 ` Waiman Long 2019-07-29 21:33 ` Qais Yousef 2019-07-29 8:52 ` Peter Zijlstra 2019-07-29 14:24 ` [PATCH] sched: Clean up active_mm reference counting Peter Zijlstra 2019-07-29 15:01 ` Mathieu Desnoyers 2019-07-29 15:16 ` Waiman Long 2019-07-29 15:22 ` Peter Zijlstra 2019-07-29 15:29 ` Rik van Riel 2019-07-29 14:27 ` [PATCH v2] sched/core: Don't use dying mm as active_mm of kthreads Peter Zijlstra 2019-07-29 15:22 ` Waiman Long 2019-07-29 15:38 ` Peter Zijlstra 2019-07-29 14:51 ` Waiman Long 2019-07-29 15:03 ` Peter Zijlstra 2019-07-29 15:28 ` Rik van Riel 2019-07-29 15:44 ` Peter Zijlstra 2019-07-29 16:10 ` Rik van Riel 2019-07-29 16:26 ` Peter Zijlstra 2019-07-29 15:37 ` Waiman Long 2019-07-29 16:12 ` Rik van Riel 2019-07-29 16:22 ` Andy Lutomirski 2019-07-29 9:12 ` Michal Hocko 2019-07-29 15:27 ` Waiman Long 2019-07-29 18:58 ` Michal Hocko 2019-07-29 20:41 ` Waiman Long