* [PATCH] mm: mempolicy: don't select exited threads as OOM victims
@ 2019-06-29 11:24 Tetsuo Handa
2019-07-01 11:17 ` Michal Hocko
0 siblings, 1 reply; 13+ messages in thread
From: Tetsuo Handa @ 2019-06-29 11:24 UTC (permalink / raw)
To: Michal Hocko, Shakeel Butt; +Cc: linux-mm, Tetsuo Handa
Since mpol_put_task_policy() in do_exit() sets mempolicy = NULL,
mempolicy_nodemask_intersects() considers exited threads (e.g. in a process
with a dying leader and live threads) as eligible. But it is possible that
all of the live threads are still ineligible.
Since has_intersects_mems_allowed() returns true as soon as one of threads
is considered eligible, mempolicy_nodemask_intersects() needs to consider
exited threads as ineligible. Since exit_mm() in do_exit() sets mm = NULL
before mpol_put_task_policy() sets mempolicy = NULL, we can exclude exited
threads by checking whether mm is NULL.
While at it, since mempolicy_nodemask_intersects() is called only by
has_intersects_mems_allowed(), it is guaranteed that mask != NULL.
BTW, are there processes where some threads use MPOL_{BIND,INTERLEAVE}
and the rest do not? If not, we can use find_lock_task_mm() instead of
for_each_thread() for the mask != NULL case in
has_intersects_mems_allowed().
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
---
mm/mempolicy.c | 5 ++---
1 file changed, 2 insertions(+), 3 deletions(-)
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 01600d8..938f0a0 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1974,11 +1974,10 @@ bool mempolicy_nodemask_intersects(struct task_struct *tsk,
 					const nodemask_t *mask)
 {
 	struct mempolicy *mempolicy;
-	bool ret = true;
+	bool ret;
 
-	if (!mask)
-		return ret;
 	task_lock(tsk);
+	ret = tsk->mm;
 	mempolicy = tsk->mempolicy;
 	if (!mempolicy)
 		goto out;
--
1.8.3.1
^ permalink raw reply related [flat|nested] 13+ messages in thread
* Re: [PATCH] mm: mempolicy: don't select exited threads as OOM victims
2019-06-29 11:24 [PATCH] mm: mempolicy: don't select exited threads as OOM victims Tetsuo Handa
@ 2019-07-01 11:17 ` Michal Hocko
2019-07-01 13:04 ` Tetsuo Handa
From: Michal Hocko @ 2019-07-01 11:17 UTC (permalink / raw)
To: Tetsuo Handa; +Cc: Shakeel Butt, linux-mm
On Sat 29-06-19 20:24:34, Tetsuo Handa wrote:
> Since mpol_put_task_policy() in do_exit() sets mempolicy = NULL,
> mempolicy_nodemask_intersects() considers exited threads (e.g. a process
> with dying leader and live threads) as eligible. But it is possible that
> all of live threads are still ineligible.
>
> Since has_intersects_mems_allowed() returns true as soon as one of threads
> is considered eligible, mempolicy_nodemask_intersects() needs to consider
> exited threads as ineligible. Since exit_mm() in do_exit() sets mm = NULL
> before mpol_put_task_policy() sets mempolicy = NULL, we can exclude exited
> threads by checking whether mm is NULL.
Ok, this makes sense. For this change
Acked-by: Michal Hocko <mhocko@suse.com>
> While at it, since mempolicy_nodemask_intersects() is called by only
> has_intersects_mems_allowed(), it is guaranteed that mask != NULL.
Well, I am not really sure. It's true that mempolicy_nodemask_intersects()
is currently only used by the OOM path and never with mask == NULL, but
the function seems to be more generic, and handling a NULL mask seems
reasonable to me. This is not a hot path, so an additional check wouldn't
be harmful, right?
> BTW, are there processes where some of threads use MPOL_{BIND,INTERLEAVE}
> and the rest do not use MPOL_{BIND,INTERLEAVE} ? If no, we can use
> find_lock_task_mm() instead of for_each_thread() for mask != NULL case
> in has_intersects_mems_allowed().
I am afraid that mempolicy is allowed to be per-thread, which is quite
ugly, and we cannot change that right now.
> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> ---
> mm/mempolicy.c | 5 ++---
> 1 file changed, 2 insertions(+), 3 deletions(-)
>
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index 01600d8..938f0a0 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -1974,11 +1974,10 @@ bool mempolicy_nodemask_intersects(struct task_struct *tsk,
>  					const nodemask_t *mask)
>  {
>  	struct mempolicy *mempolicy;
> -	bool ret = true;
> +	bool ret;
>  
> -	if (!mask)
> -		return ret;
>  	task_lock(tsk);
> +	ret = tsk->mm;
>  	mempolicy = tsk->mempolicy;
>  	if (!mempolicy)
>  		goto out;
> --
> 1.8.3.1
--
Michal Hocko
SUSE Labs
* Re: [PATCH] mm: mempolicy: don't select exited threads as OOM victims
2019-07-01 11:17 ` Michal Hocko
@ 2019-07-01 13:04 ` Tetsuo Handa
2019-07-01 13:17 ` Michal Hocko
From: Tetsuo Handa @ 2019-07-01 13:04 UTC (permalink / raw)
To: Michal Hocko; +Cc: Shakeel Butt, linux-mm
On 2019/07/01 20:17, Michal Hocko wrote:
> On Sat 29-06-19 20:24:34, Tetsuo Handa wrote:
>> Since mpol_put_task_policy() in do_exit() sets mempolicy = NULL,
>> mempolicy_nodemask_intersects() considers exited threads (e.g. a process
>> with dying leader and live threads) as eligible. But it is possible that
>> all of live threads are still ineligible.
>>
>> Since has_intersects_mems_allowed() returns true as soon as one of threads
>> is considered eligible, mempolicy_nodemask_intersects() needs to consider
>> exited threads as ineligible. Since exit_mm() in do_exit() sets mm = NULL
>> before mpol_put_task_policy() sets mempolicy = NULL, we can exclude exited
>> threads by checking whether mm is NULL.
>
> Ok, this makes sense. For this change
> Acked-by: Michal Hocko <mhocko@suse.com>
>
But I realized that this patch was too optimistic. We need to wait for mm-less
threads until MMF_OOM_SKIP is set if the process was already an OOM victim. If
we fail to let the process reach the MMF_OOM_SKIP test, the process will be
ignored by the OOM killer as soon as all threads pass mm = NULL at exit_mm(),
because has_intersects_mems_allowed() returns false unless
MPOL_{BIND,INTERLEAVE} is used.
Well, the problem is that exited threads prematurely set mempolicy = NULL.
Since the bitmap memory for the cpuset_mems_allowed_intersects() path is freed
when __put_task_struct() is called, shouldn't the mempolicy memory for the
mempolicy_nodemask_intersects() path be freed at __put_task_struct() time as
well?
diff --git a/kernel/exit.c b/kernel/exit.c
index a75b6a7..02a60ea 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -897,7 +897,6 @@ void __noreturn do_exit(long code)
 	exit_tasks_rcu_start();
 	exit_notify(tsk, group_dead);
 	proc_exit_connector(tsk);
-	mpol_put_task_policy(tsk);
 #ifdef CONFIG_FUTEX
 	if (unlikely(current->pi_state_cache))
 		kfree(current->pi_state_cache);
diff --git a/kernel/fork.c b/kernel/fork.c
index 6166790..c17e436 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -726,6 +726,7 @@ void __put_task_struct(struct task_struct *tsk)
 	WARN_ON(refcount_read(&tsk->usage));
 	WARN_ON(tsk == current);
 
+	mpol_put_task_policy(tsk);
 	cgroup_free(tsk);
 	task_numa_free(tsk);
 	security_task_free(tsk);
* Re: [PATCH] mm: mempolicy: don't select exited threads as OOM victims
2019-07-01 13:04 ` Tetsuo Handa
@ 2019-07-01 13:17 ` Michal Hocko
2019-07-01 13:38 ` Tetsuo Handa
From: Michal Hocko @ 2019-07-01 13:17 UTC (permalink / raw)
To: Tetsuo Handa; +Cc: Shakeel Butt, linux-mm
On Mon 01-07-19 22:04:22, Tetsuo Handa wrote:
> On 2019/07/01 20:17, Michal Hocko wrote:
> > On Sat 29-06-19 20:24:34, Tetsuo Handa wrote:
> >> Since mpol_put_task_policy() in do_exit() sets mempolicy = NULL,
> >> mempolicy_nodemask_intersects() considers exited threads (e.g. a process
> >> with dying leader and live threads) as eligible. But it is possible that
> >> all of live threads are still ineligible.
> >>
> >> Since has_intersects_mems_allowed() returns true as soon as one of threads
> >> is considered eligible, mempolicy_nodemask_intersects() needs to consider
> >> exited threads as ineligible. Since exit_mm() in do_exit() sets mm = NULL
> >> before mpol_put_task_policy() sets mempolicy = NULL, we can exclude exited
> >> threads by checking whether mm is NULL.
> >
> > Ok, this makes sense. For this change
> > Acked-by: Michal Hocko <mhocko@suse.com>
> >
>
> But I realized that this patch was too optimistic. We need to wait for mm-less
> threads until MMF_OOM_SKIP is set if the process was already an OOM victim.
If the process is an oom victim then _all_ threads are so as well
because that is the address space property. And we already do check that
before reaching oom_badness IIRC. So what is the actual problem you are
trying to solve here?
> If
> we fail to allow the process to reach MMF_OOM_SKIP test, the process will be
> ignored by the OOM killer as soon as all threads pass mm = NULL at exit_mm(), for
> has_intersects_mems_allowed() returns false unless MPOL_{BIND,INTERLEAVE} is used.
>
> Well, the problem is that exited threads prematurely set mempolicy = NULL.
> Since bitmap memory for cpuset_mems_allowed_intersects() path is freed when
> __put_task_struct() is called, mempolicy memory for mempolicy_nodemask_intersects()
> path should be freed as well when __put_task_struct() is called?
I am sorry, but I have a hard time understanding what the actual
user-visible problem is here.
--
Michal Hocko
SUSE Labs
* Re: [PATCH] mm: mempolicy: don't select exited threads as OOM victims
2019-07-01 13:17 ` Michal Hocko
@ 2019-07-01 13:38 ` Tetsuo Handa
2019-07-01 13:48 ` Michal Hocko
From: Tetsuo Handa @ 2019-07-01 13:38 UTC (permalink / raw)
To: Michal Hocko; +Cc: Shakeel Butt, linux-mm
On 2019/07/01 22:17, Michal Hocko wrote:
> On Mon 01-07-19 22:04:22, Tetsuo Handa wrote:
>> But I realized that this patch was too optimistic. We need to wait for mm-less
>> threads until MMF_OOM_SKIP is set if the process was already an OOM victim.
>
> If the process is an oom victim then _all_ threads are so as well
> because that is the address space property. And we already do check that
> before reaching oom_badness IIRC. So what is the actual problem you are
> trying to solve here?
I'm talking about a behavioral change after tsk became an OOM victim.
If tsk->signal->oom_mm != NULL, we have to wait for MMF_OOM_SKIP even if
tsk->mm == NULL. Otherwise, the OOM killer selects the next OOM victim as soon
as oom_unkillable_task() returns true, because has_intersects_mems_allowed()
returns false, because mempolicy_nodemask_intersects() returns false, because
every thread's mm became NULL (despite tsk->signal->oom_mm != NULL).
static int oom_evaluate_task(struct task_struct *task, void *arg)
{
	struct oom_control *oc = arg;

	if (oom_unkillable_task(task, NULL, oc->nodemask))
		goto next;

	if (!is_sysrq_oom(oc) && tsk_is_oom_victim(task)) {
		if (test_bit(MMF_OOM_SKIP, &task->signal->oom_mm->flags))
			goto next;
		goto abort;
	}
	/* ... */
}
* Re: [PATCH] mm: mempolicy: don't select exited threads as OOM victims
2019-07-01 13:38 ` Tetsuo Handa
@ 2019-07-01 13:48 ` Michal Hocko
2019-07-01 13:56 ` Tetsuo Handa
From: Michal Hocko @ 2019-07-01 13:48 UTC (permalink / raw)
To: Tetsuo Handa; +Cc: Shakeel Butt, linux-mm
On Mon 01-07-19 22:38:58, Tetsuo Handa wrote:
> On 2019/07/01 22:17, Michal Hocko wrote:
> > On Mon 01-07-19 22:04:22, Tetsuo Handa wrote:
> >> But I realized that this patch was too optimistic. We need to wait for mm-less
> >> threads until MMF_OOM_SKIP is set if the process was already an OOM victim.
> >
> > If the process is an oom victim then _all_ threads are so as well
> > because that is the address space property. And we already do check that
> > before reaching oom_badness IIRC. So what is the actual problem you are
> > trying to solve here?
>
> I'm talking about behavioral change after tsk became an OOM victim.
>
> If tsk->signal->oom_mm != NULL, we have to wait for MMF_OOM_SKIP even if
> tsk->mm == NULL. Otherwise, the OOM killer selects next OOM victim as soon as
> oom_unkillable_task() returned true because has_intersects_mems_allowed() returned
> false because mempolicy_nodemask_intersects() returned false because all thread's
> mm became NULL (despite tsk->signal->oom_mm != NULL).
OK, I finally got your point. It was not clear that you were referring to
the code _after_ the patch you are proposing. You are indeed right that
this would have the side effect that an additional victim could be
selected even though the current process hasn't terminated yet. Sigh,
another example of how subtle the whole thing is, so I retract my Ack and
request a real-life example of where this matters before we think about
a proper fix and make the code even more complex.
--
Michal Hocko
SUSE Labs
* Re: [PATCH] mm: mempolicy: don't select exited threads as OOM victims
2019-07-01 13:48 ` Michal Hocko
@ 2019-07-01 13:56 ` Tetsuo Handa
2019-07-01 14:04 ` Michal Hocko
From: Tetsuo Handa @ 2019-07-01 13:56 UTC (permalink / raw)
To: Michal Hocko; +Cc: Shakeel Butt, linux-mm
On 2019/07/01 22:48, Michal Hocko wrote:
> On Mon 01-07-19 22:38:58, Tetsuo Handa wrote:
>> On 2019/07/01 22:17, Michal Hocko wrote:
>>> On Mon 01-07-19 22:04:22, Tetsuo Handa wrote:
>>>> But I realized that this patch was too optimistic. We need to wait for mm-less
>>>> threads until MMF_OOM_SKIP is set if the process was already an OOM victim.
>>>
>>> If the process is an oom victim then _all_ threads are so as well
>>> because that is the address space property. And we already do check that
>>> before reaching oom_badness IIRC. So what is the actual problem you are
>>> trying to solve here?
>>
>> I'm talking about behavioral change after tsk became an OOM victim.
>>
>> If tsk->signal->oom_mm != NULL, we have to wait for MMF_OOM_SKIP even if
>> tsk->mm == NULL. Otherwise, the OOM killer selects next OOM victim as soon as
>> oom_unkillable_task() returned true because has_intersects_mems_allowed() returned
>> false because mempolicy_nodemask_intersects() returned false because all thread's
>> mm became NULL (despite tsk->signal->oom_mm != NULL).
>
> OK, I finally got your point. It was not clear that you are referring to
> the code _after_ the patch you are proposing. You are indeed right that
> this would have a side effect that an additional victim could be
> selected even though the current process hasn't terminated yet. Sigh,
> another example how the whole thing is subtle so I retract my Ack and
> request a real life example of where this matters before we think about
> a proper fix and make the code even more complex.
>
Instead of checking for mm != NULL, can we move mpol_put_task_policy() from
do_exit() to __put_task_struct()? That change would (if it is safe to do)
prevent exited threads from prematurely setting mempolicy = NULL (and thereby
confusing mempolicy_nodemask_intersects() with mempolicy == NULL).
* Re: [PATCH] mm: mempolicy: don't select exited threads as OOM victims
2019-07-01 13:56 ` Tetsuo Handa
@ 2019-07-01 14:04 ` Michal Hocko
2019-07-01 14:16 ` Michal Hocko
From: Michal Hocko @ 2019-07-01 14:04 UTC (permalink / raw)
To: Tetsuo Handa; +Cc: Shakeel Butt, linux-mm
On Mon 01-07-19 22:56:12, Tetsuo Handa wrote:
> On 2019/07/01 22:48, Michal Hocko wrote:
> > On Mon 01-07-19 22:38:58, Tetsuo Handa wrote:
> >> On 2019/07/01 22:17, Michal Hocko wrote:
> >>> On Mon 01-07-19 22:04:22, Tetsuo Handa wrote:
> >>>> But I realized that this patch was too optimistic. We need to wait for mm-less
> >>>> threads until MMF_OOM_SKIP is set if the process was already an OOM victim.
> >>>
> >>> If the process is an oom victim then _all_ threads are so as well
> >>> because that is the address space property. And we already do check that
> >>> before reaching oom_badness IIRC. So what is the actual problem you are
> >>> trying to solve here?
> >>
> >> I'm talking about behavioral change after tsk became an OOM victim.
> >>
> >> If tsk->signal->oom_mm != NULL, we have to wait for MMF_OOM_SKIP even if
> >> tsk->mm == NULL. Otherwise, the OOM killer selects next OOM victim as soon as
> >> oom_unkillable_task() returned true because has_intersects_mems_allowed() returned
> >> false because mempolicy_nodemask_intersects() returned false because all thread's
> >> mm became NULL (despite tsk->signal->oom_mm != NULL).
> >
> > OK, I finally got your point. It was not clear that you are referring to
> > the code _after_ the patch you are proposing. You are indeed right that
> > this would have a side effect that an additional victim could be
> > selected even though the current process hasn't terminated yet. Sigh,
> > another example how the whole thing is subtle so I retract my Ack and
> > request a real life example of where this matters before we think about
> > a proper fix and make the code even more complex.
> >
>
> Instead of checking for mm != NULL, can we move mpol_put_task_policy() from
> do_exit() to __put_task_struct() ? That change will (if it is safe to do)
> prevent exited threads from setting mempolicy = NULL (and confusing
> mempolicy_nodemask_intersects() due to mempolicy == NULL).
I am sorry, but I would have to study it much more, and I am not convinced
it would be time well spent.
--
Michal Hocko
SUSE Labs
* Re: [PATCH] mm: mempolicy: don't select exited threads as OOM victims
2019-07-01 14:04 ` Michal Hocko
@ 2019-07-01 14:16 ` Michal Hocko
2019-07-02 13:19 ` Tetsuo Handa
From: Michal Hocko @ 2019-07-01 14:16 UTC (permalink / raw)
To: Tetsuo Handa; +Cc: Shakeel Butt, linux-mm
On Mon 01-07-19 16:04:34, Michal Hocko wrote:
> On Mon 01-07-19 22:56:12, Tetsuo Handa wrote:
> > On 2019/07/01 22:48, Michal Hocko wrote:
> > > On Mon 01-07-19 22:38:58, Tetsuo Handa wrote:
> > >> On 2019/07/01 22:17, Michal Hocko wrote:
> > >>> On Mon 01-07-19 22:04:22, Tetsuo Handa wrote:
> > >>>> But I realized that this patch was too optimistic. We need to wait for mm-less
> > >>>> threads until MMF_OOM_SKIP is set if the process was already an OOM victim.
> > >>>
> > >>> If the process is an oom victim then _all_ threads are so as well
> > >>> because that is the address space property. And we already do check that
> > >>> before reaching oom_badness IIRC. So what is the actual problem you are
> > >>> trying to solve here?
> > >>
> > >> I'm talking about behavioral change after tsk became an OOM victim.
> > >>
> > >> If tsk->signal->oom_mm != NULL, we have to wait for MMF_OOM_SKIP even if
> > >> tsk->mm == NULL. Otherwise, the OOM killer selects next OOM victim as soon as
> > >> oom_unkillable_task() returned true because has_intersects_mems_allowed() returned
> > >> false because mempolicy_nodemask_intersects() returned false because all thread's
> > >> mm became NULL (despite tsk->signal->oom_mm != NULL).
> > >
> > > OK, I finally got your point. It was not clear that you are referring to
> > > the code _after_ the patch you are proposing. You are indeed right that
> > > this would have a side effect that an additional victim could be
> > > selected even though the current process hasn't terminated yet. Sigh,
> > > another example how the whole thing is subtle so I retract my Ack and
> > > request a real life example of where this matters before we think about
> > > a proper fix and make the code even more complex.
> > >
> >
> > Instead of checking for mm != NULL, can we move mpol_put_task_policy() from
> > do_exit() to __put_task_struct() ? That change will (if it is safe to do)
> > prevent exited threads from setting mempolicy = NULL (and confusing
> > mempolicy_nodemask_intersects() due to mempolicy == NULL).
>
> I am sorry but I would have to study it much more and I am not convinced
> the time spent on it would be well spent.
Thinking about it some more it seems that we can go with your original
fix if we also reorder oom_evaluate_task
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index f719b64741d6..e5feb0f72e3b 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -318,9 +318,6 @@ static int oom_evaluate_task(struct task_struct *task, void *arg)
 	struct oom_control *oc = arg;
 	unsigned long points;
 
-	if (oom_unkillable_task(task, NULL, oc->nodemask))
-		goto next;
-
 	/*
 	 * This task already has access to memory reserves and is being killed.
 	 * Don't allow any other task to have access to the reserves unless
@@ -333,6 +330,9 @@ static int oom_evaluate_task(struct task_struct *task, void *arg)
 		goto abort;
 	}
 
+	if (oom_unkillable_task(task, NULL, oc->nodemask))
+		goto next;
+
 	/*
 	 * If task is allocating a lot of memory and has been marked to be
 	 * killed first if it triggers an oom, then select it.
I do not see any strong reason to keep the current ordering. The OOM victim
check is trivial, so it shouldn't add visible overhead for the few
unkillable tasks that we might encounter.
--
Michal Hocko
SUSE Labs
* Re: [PATCH] mm: mempolicy: don't select exited threads as OOM victims
2019-07-01 14:16 ` Michal Hocko
@ 2019-07-02 13:19 ` Tetsuo Handa
2019-07-02 13:51 ` Michal Hocko
From: Tetsuo Handa @ 2019-07-02 13:19 UTC (permalink / raw)
To: Michal Hocko; +Cc: Shakeel Butt, linux-mm
On 2019/07/01 23:16, Michal Hocko wrote:
> Thinking about it some more it seems that we can go with your original
> fix if we also reorder oom_evaluate_task
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index f719b64741d6..e5feb0f72e3b 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -318,9 +318,6 @@ static int oom_evaluate_task(struct task_struct *task, void *arg)
>  	struct oom_control *oc = arg;
>  	unsigned long points;
>  
> -	if (oom_unkillable_task(task, NULL, oc->nodemask))
> -		goto next;
> -
>  	/*
>  	 * This task already has access to memory reserves and is being killed.
>  	 * Don't allow any other task to have access to the reserves unless
> @@ -333,6 +330,9 @@ static int oom_evaluate_task(struct task_struct *task, void *arg)
>  		goto abort;
>  	}
>  
> +	if (oom_unkillable_task(task, NULL, oc->nodemask))
> +		goto next;
> +
>  	/*
>  	 * If task is allocating a lot of memory and has been marked to be
>  	 * killed first if it triggers an oom, then select it.
>
> I do not see any strong reason to keep the current ordering. OOM victim
> check is trivial so it shouldn't add a visible overhead for few
> unkillable tasks that we might encounter.
>
Yes, if we can tolerate that there can be only one OOM victim for !memcg OOM
events (because an OOM victim in a different OOM context will hit the "goto
abort;" path).
Thinking again, I think that the same problem exists for the mask == NULL path
as long as "a process with a dying leader and live threads" is possible. Then,
fixing things up after has_intersects_mems_allowed()/cpuset_mems_allowed_intersects()
has judged that some thread is eligible is better.
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index d1c9c4e..43e499e 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -109,8 +109,23 @@ static bool oom_cpuset_eligible(struct task_struct *start,
 			 */
 			ret = cpuset_mems_allowed_intersects(current, tsk);
 		}
-		if (ret)
-			break;
+		if (ret) {
+			/*
+			 * Exclude dead threads as ineligible when selecting
+			 * an OOM victim. But include dead threads as eligible
+			 * when waiting for OOM victims to get MMF_OOM_SKIP.
+			 *
+			 * Strictly speaking, tsk->mm should be checked under
+			 * task lock because cpuset_mems_allowed_intersects()
+			 * does not take task lock. But racing with exit_mm()
+			 * is not fatal. Thus, use cheaper barrier rather than
+			 * strict task lock.
+			 */
+			smp_rmb();
+			if (tsk->mm || tsk_is_oom_victim(tsk))
+				break;
+			ret = false;
+		}
 	}
 	rcu_read_unlock();
* Re: [PATCH] mm: mempolicy: don't select exited threads as OOM victims
2019-07-02 13:19 ` Tetsuo Handa
@ 2019-07-02 13:51 ` Michal Hocko
2019-07-02 21:26 ` Tetsuo Handa
From: Michal Hocko @ 2019-07-02 13:51 UTC (permalink / raw)
To: Tetsuo Handa; +Cc: Shakeel Butt, linux-mm
On Tue 02-07-19 22:19:27, Tetsuo Handa wrote:
> On 2019/07/01 23:16, Michal Hocko wrote:
> > Thinking about it some more it seems that we can go with your original
> > fix if we also reorder oom_evaluate_task
> > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > index f719b64741d6..e5feb0f72e3b 100644
> > --- a/mm/oom_kill.c
> > +++ b/mm/oom_kill.c
> > @@ -318,9 +318,6 @@ static int oom_evaluate_task(struct task_struct *task, void *arg)
> > struct oom_control *oc = arg;
> > unsigned long points;
> >
> > - if (oom_unkillable_task(task, NULL, oc->nodemask))
> > - goto next;
> > -
> > /*
> > * This task already has access to memory reserves and is being killed.
> > * Don't allow any other task to have access to the reserves unless
> > @@ -333,6 +330,9 @@ static int oom_evaluate_task(struct task_struct *task, void *arg)
> > goto abort;
> > }
> >
> > + if (oom_unkillable_task(task, NULL, oc->nodemask))
> > + goto next;
> > +
> > /*
> > * If task is allocating a lot of memory and has been marked to be
> > * killed first if it triggers an oom, then select it.
> >
> > I do not see any strong reason to keep the current ordering. OOM victim
> > check is trivial so it shouldn't add a visible overhead for few
> > unkillable tasks that we might encounter.
> >
>
> Yes if we can tolerate that there can be only one OOM victim for !memcg OOM events
> (because an OOM victim in a different OOM context will hit "goto abort;" path).
You are right. Considering that we now have a guarantee of forward
progress, this should be tolerable (a victim in a disjoint
numaset will go away and another one can go ahead and trigger its own
OOM).
> Thinking again, I think that the same problem exists for mask == NULL path
> as long as "a process with dying leader and live threads" is possible. Then,
> fixing up after has_intersects_mems_allowed()/cpuset_mems_allowed_intersects()
> judged that some thread is eligible is better.
This is getting more and more hairy for something that is not really
clear to be an actual problem. Don't you think?
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index d1c9c4e..43e499e 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -109,8 +109,23 @@ static bool oom_cpuset_eligible(struct task_struct *start,
>  			 */
>  			ret = cpuset_mems_allowed_intersects(current, tsk);
>  		}
> -		if (ret)
> -			break;
> +		if (ret) {
> +			/*
> +			 * Exclude dead threads as ineligible when selecting
> +			 * an OOM victim. But include dead threads as eligible
> +			 * when waiting for OOM victims to get MMF_OOM_SKIP.
> +			 *
> +			 * Strictly speaking, tsk->mm should be checked under
> +			 * task lock because cpuset_mems_allowed_intersects()
> +			 * does not take task lock. But racing with exit_mm()
> +			 * is not fatal. Thus, use cheaper barrier rather than
> +			 * strict task lock.
> +			 */
> +			smp_rmb();
> +			if (tsk->mm || tsk_is_oom_victim(tsk))
> +				break;
> +			ret = false;
> +		}
>  	}
>  	rcu_read_unlock();
>
--
Michal Hocko
SUSE Labs
* Re: [PATCH] mm: mempolicy: don't select exited threads as OOM victims
2019-07-02 13:51 ` Michal Hocko
@ 2019-07-02 21:26 ` Tetsuo Handa
2019-07-03 7:06 ` Michal Hocko
From: Tetsuo Handa @ 2019-07-02 21:26 UTC (permalink / raw)
To: Michal Hocko; +Cc: Shakeel Butt, linux-mm
On 2019/07/02 22:51, Michal Hocko wrote:
>>> I do not see any strong reason to keep the current ordering. OOM victim
>>> check is trivial so it shouldn't add a visible overhead for few
>>> unkillable tasks that we might encounter.
>>>
>>
>> Yes if we can tolerate that there can be only one OOM victim for !memcg OOM events
>> (because an OOM victim in a different OOM context will hit "goto abort;" path).
>
> You are right. Considering that we now have a guarantee of forward
> progress, this should be tolerable (a victim in a disjoint
> numaset will go away and another one can go ahead and trigger its own
> OOM).
But it might take a very long time before MMF_OOM_SKIP is set by the OOM reaper
or exit_mmap(). Until MMF_OOM_SKIP is set, OOM events from a disjoint numaset
can't make forward progress.
* Re: [PATCH] mm: mempolicy: don't select exited threads as OOM victims
2019-07-02 21:26 ` Tetsuo Handa
@ 2019-07-03 7:06 ` Michal Hocko
From: Michal Hocko @ 2019-07-03 7:06 UTC (permalink / raw)
To: Tetsuo Handa; +Cc: Shakeel Butt, linux-mm
On Wed 03-07-19 06:26:55, Tetsuo Handa wrote:
> On 2019/07/02 22:51, Michal Hocko wrote:
> >>> I do not see any strong reason to keep the current ordering. OOM victim
> >>> check is trivial so it shouldn't add a visible overhead for few
> >>> unkillable tasks that we might encounter.
> >>>
> >>
> >> Yes if we can tolerate that there can be only one OOM victim for !memcg OOM events
> >> (because an OOM victim in a different OOM context will hit "goto abort;" path).
> >
> > You are right. Considering that we now have a guarantee of forward
> > progress, this should be tolerable (a victim in a disjoint
> > numaset will go away and another one can go ahead and trigger its own
> > OOM).
>
> But it might take very long period before MMF_OOM_SKIP is set by the OOM reaper
> or exit_mmap(). Until MMF_OOM_SKIP is set, OOM events from disjoint numaset can't
> make forward progress.
If that is a concern, then I would stick with the current status quo
until we see the issue reported by real workloads.
--
Michal Hocko
SUSE Labs