linux-mm.kvack.org archive mirror
* [PATCH] mm, oom: don't invoke oom killer if current has been reaped
@ 2020-07-11  3:18 Yafang Shao
  2020-07-11  5:37 ` Yafang Shao
                   ` (2 more replies)
  0 siblings, 3 replies; 20+ messages in thread
From: Yafang Shao @ 2020-07-11  3:18 UTC (permalink / raw)
  To: mhocko, rientjes, akpm; +Cc: linux-mm, Yafang Shao

If current's MMF_OOM_SKIP is set, it means that current is exiting or
dying and is likely to release its address space soon, so there is no need
to invoke the oom killer again. Otherwise unexpected issues may arise; for
example, below is an issue we found in our production environment.

Many threads of a multi-threaded task are running in parallel in a
container on many CPUs, and many of them triggered OOM at the same time:

CPU-1	        CPU-2         ...        CPU-n
thread-1        thread-2      ...        thread-n

wait oom_lock   wait oom_lock ...        hold oom_lock

                                         (sigkill received)

                                         select current as victim
                                         and wakeup oom reaper

                                         release oom_lock

                                         (MMF_OOM_SKIP set by oom reaper)

                                         (lots of pages are freed)
hold oom_lock

because MMF_OOM_SKIP
is set, kill others

The thread running on CPU-n received SIGKILL, so it selects current as
the victim and wakes up the oom reaper. The oom reaper then reaps its RSS
and frees lots of pages, so plenty of memory becomes available. Yet
although the multi-threaded task is exiting, the other threads continue
to kill other processes because of the MMF_OOM_SKIP check in
task_will_free_mem().
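
For context, the fast path in out_of_memory() that this scenario defeats
looks roughly like this (simplified sketch of the unpatched code, details
trimmed):

	/*
	 * If current has a pending SIGKILL or is exiting, then automatically
	 * select it.  The goal is to allow it to allocate so that it may
	 * quickly exit and free its memory.
	 */
	if (task_will_free_mem(current)) {
		mark_oom_victim(current);
		wake_oom_reaper(current);
		return true;
	}

Once the oom reaper has set MMF_OOM_SKIP, task_will_free_mem() returns
false, so the exiting threads fall through to full victim selection and
kill another task even though their memory has just been freed.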

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
 mm/oom_kill.c | 14 ++++++--------
 1 file changed, 6 insertions(+), 8 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 6e94962..a8a155a 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -825,13 +825,6 @@ static bool task_will_free_mem(struct task_struct *task)
 	if (!__task_will_free_mem(task))
 		return false;
 
-	/*
-	 * This task has already been drained by the oom reaper so there are
-	 * only small chances it will free some more
-	 */
-	if (test_bit(MMF_OOM_SKIP, &mm->flags))
-		return false;
-
 	if (atomic_read(&mm->mm_users) <= 1)
 		return true;
 
@@ -963,7 +956,8 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
 	 * so it can die quickly
 	 */
 	task_lock(victim);
-	if (task_will_free_mem(victim)) {
+	if (!test_bit(MMF_OOM_SKIP, &victim->mm->flags) &&
+	    task_will_free_mem(victim)) {
 		mark_oom_victim(victim);
 		wake_oom_reaper(victim);
 		task_unlock(victim);
@@ -1056,6 +1050,10 @@ bool out_of_memory(struct oom_control *oc)
 			return true;
 	}
 
+	/* current has already been reaped */
+	if (test_bit(MMF_OOM_SKIP, &current->mm->flags))
+		return true;
+
 	/*
 	 * If current has a pending SIGKILL or is exiting, then automatically
 	 * select it.  The goal is to allow it to allocate so that it may
-- 
1.8.3.1




* Re: [PATCH] mm, oom: don't invoke oom killer if current has been reaped
  2020-07-11  3:18 [PATCH] mm, oom: don't invoke oom killer if current has been reaped Yafang Shao
@ 2020-07-11  5:37 ` Yafang Shao
  2020-07-13  6:01 ` Michal Hocko
  2020-07-13 23:50 ` Tetsuo Handa
  2 siblings, 0 replies; 20+ messages in thread
From: Yafang Shao @ 2020-07-11  5:37 UTC (permalink / raw)
  To: Michal Hocko, David Rientjes, Andrew Morton; +Cc: Linux MM

On Sat, Jul 11, 2020 at 11:18 AM Yafang Shao <laoar.shao@gmail.com> wrote:
>
> If the current's MMF_OOM_SKIP is set, it means that the current is exiting
> or dying and likely to realease its address space. So we don't need to
> invoke the oom killer again. Otherwise that may cause some unexpected
> issues, for example, bellow is the issue found in our production
> environment.
>
> There're many threads of a multi-threaded task parallel running in a
> container on many cpus. Then many threads triggered OOM at the same time,
>
> CPU-1           CPU-2         ...        CPU-n
> thread-1        thread-2      ...        thread-n
>
> wait oom_lock   wait oom_lock ...        hold oom_lock
>
>                                          (sigkill received)
>
>                                          select current as victim
>                                          and wakeup oom reaper
>
>                                          release oom_lock
>
>                                          (MMF_OOM_SKIP set by oom reaper)
>
>                                          (lots of pages are freed)
> hold oom_lock
>
> because MMF_OOM_SKIP
> is set, kill others
>
> The thread running on CPU-n received sigkill and it will select current as
> the victim and wakeup the oom reaper. Then oom reaper will reap its rss and
> free lots of pages, as a result, there will be many free pages.
> Although the multi-threaded task is exiting, the other threads will
> continue to kill others because of the check of MMF_OOM_SKIP in
> task_will_free_mem().
>
> Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> ---
>  mm/oom_kill.c | 14 ++++++--------
>  1 file changed, 6 insertions(+), 8 deletions(-)
>
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 6e94962..a8a155a 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -825,13 +825,6 @@ static bool task_will_free_mem(struct task_struct *task)
>         if (!__task_will_free_mem(task))
>                 return false;
>
> -       /*
> -        * This task has already been drained by the oom reaper so there are
> -        * only small chances it will free some more
> -        */
> -       if (test_bit(MMF_OOM_SKIP, &mm->flags))
> -               return false;
> -
>         if (atomic_read(&mm->mm_users) <= 1)
>                 return true;
>
> @@ -963,7 +956,8 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
>          * so it can die quickly
>          */
>         task_lock(victim);
> -       if (task_will_free_mem(victim)) {
> +       if (!test_bit(MMF_OOM_SKIP, &victim->mm->flags) &&
> +           task_will_free_mem(victim)) {
>                 mark_oom_victim(victim);
>                 wake_oom_reaper(victim);
>                 task_unlock(victim);
> @@ -1056,6 +1050,10 @@ bool out_of_memory(struct oom_control *oc)
>                         return true;
>         }
>
> +       /* current has been already reapered */
> +       if (test_bit(MMF_OOM_SKIP, &current->mm->flags))
> +               return true;
> +

Oops.

We should check whether mm is NULL first:
if (mm && test_bit(MMF_OOM_SKIP, &mm->flags))
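
i.e. something like the following (sketch only; current->mm can be NULL,
e.g. for a task that has already passed exit_mm()):

+	/* current has already been reaped by the oom reaper */
+	if (current->mm &&
+	    test_bit(MMF_OOM_SKIP, &current->mm->flags))
+		return true;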


>         /*
>          * If current has a pending SIGKILL or is exiting, then automatically
>          * select it.  The goal is to allow it to allocate so that it may
> --
> 1.8.3.1
>


-- 
Thanks
Yafang



* Re: [PATCH] mm, oom: don't invoke oom killer if current has been reaped
  2020-07-11  3:18 [PATCH] mm, oom: don't invoke oom killer if current has been reaped Yafang Shao
  2020-07-11  5:37 ` Yafang Shao
@ 2020-07-13  6:01 ` Michal Hocko
  2020-07-13  6:21   ` Michal Hocko
  2020-07-13 23:50 ` Tetsuo Handa
  2 siblings, 1 reply; 20+ messages in thread
From: Michal Hocko @ 2020-07-13  6:01 UTC (permalink / raw)
  To: Yafang Shao; +Cc: rientjes, akpm, linux-mm

On Fri 10-07-20 23:18:01, Yafang Shao wrote:
> If the current's MMF_OOM_SKIP is set, it means that the current is exiting
> or dying and likely to realease its address space.

That is not actually true. The primary reason for this flag is to tell
that the task is no longer relevant for oom victim selection because
most of its memory has already been released. But the task might be stuck
in many places, and waiting for it to terminate could easily lock up the
system.

The design of the oom reaper is to guarantee forward progress when
the victim cannot make forward progress on its own. For that to work
the oom killer cannot rely on the victim's state or on it ever
finishing.

If you remove this fundamental assumption then the oom killer can lock up
again.

> So we don't need to
> invoke the oom killer again. Otherwise that may cause some unexpected
> issues, for example, bellow is the issue found in our production
> environment.

Please see the above.

> There're many threads of a multi-threaded task parallel running in a
> container on many cpus. Then many threads triggered OOM at the same time,
> 
> CPU-1	        CPU-2         ...        CPU-n
> thread-1        thread-2      ...        thread-n
> 
> wait oom_lock   wait oom_lock ...        hold oom_lock
> 
>                                          (sigkill received)
> 
>                                          select current as victim
>                                          and wakeup oom reaper
> 
>                                          release oom_lock
> 
>                                          (MMF_OOM_SKIP set by oom reaper)
> 
>                                          (lots of pages are freed)
> hold oom_lock

Could you be more specific please? The page allocator never waits for
the oom_lock and keeps retrying instead. Also __alloc_pages_may_oom
tries to allocate with the lock held.
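
For reference, the relevant part of __alloc_pages_may_oom() looks roughly
like this (simplified sketch, oom_control setup and some details elided;
the exact code varies between kernel versions):

	if (!mutex_trylock(&oom_lock)) {
		/* Somebody else is handling the OOM; back off and let the caller retry. */
		*did_some_progress = 1;
		schedule_timeout_uninterruptible(1);
		return NULL;
	}

	/* One more allocation attempt with the lock held, at a high watermark. */
	page = get_page_from_freelist((gfp_mask | __GFP_HARDWALL) &
				      ~__GFP_DIRECT_RECLAIM, order,
				      ALLOC_WMARK_HIGH | ALLOC_CPUSET, ac);
	if (page)
		goto out;

	if (out_of_memory(&oc) || WARN_ON_ONCE(gfp_mask & __GFP_NOFAIL))
		*did_some_progress = 1;
out:
	mutex_unlock(&oom_lock);
	return page;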

Could you provide oom reports please?
-- 
Michal Hocko
SUSE Labs



* Re: [PATCH] mm, oom: don't invoke oom killer if current has been reaped
  2020-07-13  6:01 ` Michal Hocko
@ 2020-07-13  6:21   ` Michal Hocko
  2020-07-13 12:24     ` Yafang Shao
  0 siblings, 1 reply; 20+ messages in thread
From: Michal Hocko @ 2020-07-13  6:21 UTC (permalink / raw)
  To: Yafang Shao; +Cc: rientjes, akpm, linux-mm

On Mon 13-07-20 08:01:57, Michal Hocko wrote:
> On Fri 10-07-20 23:18:01, Yafang Shao wrote:
[...]
> > There're many threads of a multi-threaded task parallel running in a
> > container on many cpus. Then many threads triggered OOM at the same time,
> > 
> > CPU-1	        CPU-2         ...        CPU-n
> > thread-1        thread-2      ...        thread-n
> > 
> > wait oom_lock   wait oom_lock ...        hold oom_lock
> > 
> >                                          (sigkill received)
> > 
> >                                          select current as victim
> >                                          and wakeup oom reaper
> > 
> >                                          release oom_lock
> > 
> >                                          (MMF_OOM_SKIP set by oom reaper)
> > 
> >                                          (lots of pages are freed)
> > hold oom_lock
> 
> Could you be more specific please? The page allocator never waits for
> the oom_lock and keeps retrying instead. Also __alloc_pages_may_oom
> tries to allocate with the lock held.

I suspect that you are looking at the memcg oom killer, because we do not
do a trylock there for some reason I do not immediately remember off the
top of my head. If this is really the case then I would recommend looking
into how the page allocator implements this and following the same pattern
for memcg as well.

-- 
Michal Hocko
SUSE Labs



* Re: [PATCH] mm, oom: don't invoke oom killer if current has been reaped
  2020-07-13  6:21   ` Michal Hocko
@ 2020-07-13 12:24     ` Yafang Shao
  2020-07-13 12:45       ` Michal Hocko
  0 siblings, 1 reply; 20+ messages in thread
From: Yafang Shao @ 2020-07-13 12:24 UTC (permalink / raw)
  To: Michal Hocko; +Cc: David Rientjes, Andrew Morton, Linux MM

On Mon, Jul 13, 2020 at 2:21 PM Michal Hocko <mhocko@kernel.org> wrote:
>
> On Mon 13-07-20 08:01:57, Michal Hocko wrote:
> > On Fri 10-07-20 23:18:01, Yafang Shao wrote:
> [...]
> > > There're many threads of a multi-threaded task parallel running in a
> > > container on many cpus. Then many threads triggered OOM at the same time,
> > >
> > > CPU-1               CPU-2         ...        CPU-n
> > > thread-1        thread-2      ...        thread-n
> > >
> > > wait oom_lock   wait oom_lock ...        hold oom_lock
> > >
> > >                                          (sigkill received)
> > >
> > >                                          select current as victim
> > >                                          and wakeup oom reaper
> > >
> > >                                          release oom_lock
> > >
> > >                                          (MMF_OOM_SKIP set by oom reaper)
> > >
> > >                                          (lots of pages are freed)
> > > hold oom_lock
> >
> > Could you be more specific please? The page allocator never waits for
> > the oom_lock and keeps retrying instead. Also __alloc_pages_may_oom
> > tries to allocate with the lock held.
>
> I suspect that you are looking at memcg oom killer.

Right, these threads were waiting for the oom_lock in mem_cgroup_out_of_memory().

> Because we do not do
> trylock there for some reason I do not immediatelly remember from top of
> my head. If this is really the case then I would recommend looking into
> how the page allocator implements this and follow the same pattern for
> memcg as well.
>

That is a good suggestion.
But trylocking the global oom_lock here doesn't really help, because a
task OOMing in memcg foo can't help the tasks in memcg bar.
IOW, we need to introduce a per-memcg oom_lock, like below:

mem_cgroup_out_of_memory
+    if (!mutex_trylock(&memcg->lock))
+        return true;

    if (mutex_lock_killable(&oom_lock))
        return true;

And the memcg tree should also be considered.
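
For illustration, a rough sketch of that idea inside
mem_cgroup_out_of_memory() (the per-memcg mutex is hypothetical and would
have to be added to struct mem_cgroup; the hierarchy walk mentioned above
is not handled here):

	/* Hypothetical field: only serialize OOMs within the same memcg. */
	if (!mutex_trylock(&memcg->oom_kill_lock))
		return true;	/* someone in this memcg is already handling it */

	ret = out_of_memory(&oc);

	mutex_unlock(&memcg->oom_kill_lock);
	return ret;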

-- 
Thanks
Yafang



* Re: [PATCH] mm, oom: don't invoke oom killer if current has been reaped
  2020-07-13 12:24     ` Yafang Shao
@ 2020-07-13 12:45       ` Michal Hocko
  2020-07-13 13:11         ` Yafang Shao
  0 siblings, 1 reply; 20+ messages in thread
From: Michal Hocko @ 2020-07-13 12:45 UTC (permalink / raw)
  To: Yafang Shao; +Cc: David Rientjes, Andrew Morton, Linux MM

On Mon 13-07-20 20:24:07, Yafang Shao wrote:
> On Mon, Jul 13, 2020 at 2:21 PM Michal Hocko <mhocko@kernel.org> wrote:
> >
> > On Mon 13-07-20 08:01:57, Michal Hocko wrote:
> > > On Fri 10-07-20 23:18:01, Yafang Shao wrote:
> > [...]
> > > > There're many threads of a multi-threaded task parallel running in a
> > > > container on many cpus. Then many threads triggered OOM at the same time,
> > > >
> > > > CPU-1               CPU-2         ...        CPU-n
> > > > thread-1        thread-2      ...        thread-n
> > > >
> > > > wait oom_lock   wait oom_lock ...        hold oom_lock
> > > >
> > > >                                          (sigkill received)
> > > >
> > > >                                          select current as victim
> > > >                                          and wakeup oom reaper
> > > >
> > > >                                          release oom_lock
> > > >
> > > >                                          (MMF_OOM_SKIP set by oom reaper)
> > > >
> > > >                                          (lots of pages are freed)
> > > > hold oom_lock
> > >
> > > Could you be more specific please? The page allocator never waits for
> > > the oom_lock and keeps retrying instead. Also __alloc_pages_may_oom
> > > tries to allocate with the lock held.
> >
> > I suspect that you are looking at memcg oom killer.
> 
> Right, these threads were waiting the oom_lock in mem_cgroup_out_of_memory().
> 
> > Because we do not do
> > trylock there for some reason I do not immediatelly remember from top of
> > my head. If this is really the case then I would recommend looking into
> > how the page allocator implements this and follow the same pattern for
> > memcg as well.
> >
> 
> That is a good suggestion.
> But we can't try locking the global oom_lock here, because task ooming
> in memcg foo may can't help the tasks in memcg bar.

I do not follow. oom_lock is not about fwd progress. It is a big lock to
synchronize against oom_disable logic.

I have this in mind

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 248e6cad0095..29d1f8c2d968 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1563,8 +1563,10 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	};
 	bool ret;
 
-	if (mutex_lock_killable(&oom_lock))
+	if (!mutex_trylock(&oom_lock))
 		return true;
+
+
 	/*
 	 * A few threads which were not waiting at mutex_lock_killable() can
 	 * fail to bail out. Therefore, check again after holding oom_lock.

But as I've said, I would need to double check the history of why we
differ here. Btw, I suspect that the mem_cgroup_out_of_memory call in
mem_cgroup_oom_synchronize is bogus and can no longer trigger after
29ef680ae7c21, but this needs double checking as well.

> IOW, we need to introduce the per memcg oom_lock, like bellow,

I do not see why. Besides, we already do have a per-oom-memcg hierarchy
lock.

-- 
Michal Hocko
SUSE Labs



* Re: [PATCH] mm, oom: don't invoke oom killer if current has been reaped
  2020-07-13 12:45       ` Michal Hocko
@ 2020-07-13 13:11         ` Yafang Shao
  2020-07-13 19:05           ` Michal Hocko
  0 siblings, 1 reply; 20+ messages in thread
From: Yafang Shao @ 2020-07-13 13:11 UTC (permalink / raw)
  To: Michal Hocko; +Cc: David Rientjes, Andrew Morton, Linux MM

On Mon, Jul 13, 2020 at 8:45 PM Michal Hocko <mhocko@kernel.org> wrote:
>
> On Mon 13-07-20 20:24:07, Yafang Shao wrote:
> > On Mon, Jul 13, 2020 at 2:21 PM Michal Hocko <mhocko@kernel.org> wrote:
> > >
> > > On Mon 13-07-20 08:01:57, Michal Hocko wrote:
> > > > On Fri 10-07-20 23:18:01, Yafang Shao wrote:
> > > [...]
> > > > > There're many threads of a multi-threaded task parallel running in a
> > > > > container on many cpus. Then many threads triggered OOM at the same time,
> > > > >
> > > > > CPU-1               CPU-2         ...        CPU-n
> > > > > thread-1        thread-2      ...        thread-n
> > > > >
> > > > > wait oom_lock   wait oom_lock ...        hold oom_lock
> > > > >
> > > > >                                          (sigkill received)
> > > > >
> > > > >                                          select current as victim
> > > > >                                          and wakeup oom reaper
> > > > >
> > > > >                                          release oom_lock
> > > > >
> > > > >                                          (MMF_OOM_SKIP set by oom reaper)
> > > > >
> > > > >                                          (lots of pages are freed)
> > > > > hold oom_lock
> > > >
> > > > Could you be more specific please? The page allocator never waits for
> > > > the oom_lock and keeps retrying instead. Also __alloc_pages_may_oom
> > > > tries to allocate with the lock held.
> > >
> > > I suspect that you are looking at memcg oom killer.
> >
> > Right, these threads were waiting the oom_lock in mem_cgroup_out_of_memory().
> >
> > > Because we do not do
> > > trylock there for some reason I do not immediatelly remember from top of
> > > my head. If this is really the case then I would recommend looking into
> > > how the page allocator implements this and follow the same pattern for
> > > memcg as well.
> > >
> >
> > That is a good suggestion.
> > But we can't try locking the global oom_lock here, because task ooming
> > in memcg foo may can't help the tasks in memcg bar.
>
> I do not follow. oom_lock is not about fwd progress. It is a big lock to
> synchronize against oom_disable logic.
>
> I have this in mind
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 248e6cad0095..29d1f8c2d968 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1563,8 +1563,10 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
>         };
>         bool ret;
>
> -       if (mutex_lock_killable(&oom_lock))
> +       if (!mutex_trylock(&oom_lock))
>                 return true;

                  root_mem_cgroup
                 /               \
    memcg_a (16G)                 memcg_b (32G)
         |                             |
    process a_1                   process b_1
    (reaches memcg_a's limit)     (reaches memcg_b's limit)
    holds oom_lock                waits for oom_lock

So process a_1 will try to kill a process in memcg_a, while process b_1
needs to kill a process in memcg_b. IOW, the process killed in memcg_a
can't help the processes in memcg_b, so process b_1 should not trylock
the oom_lock here.

Whereas if the memcg tree is:

             target mem_cgroup (16G)
            /                       \
    process a_1                   process a_2
    (reaches the limit)           (reaches the limit)
    holds oom_lock                waits for oom_lock

then process a_2 can trylock the oom_lock here. IOW, these processes
are in the same memcg.

That's why I said that we should introduce a per-memcg oom_lock.

> +
> +
>         /*
>          * A few threads which were not waiting at mutex_lock_killable() can
>          * fail to bail out. Therefore, check again after holding oom_lock.
>
> But as I've said I would need to double check the history on why we
> differ here. Btw. I suspect that mem_cgroup_out_of_memory call in
> mem_cgroup_oom_synchronize is bogus and can no longer trigger after
> 29ef680ae7c21 but this needs double checking as well.
>
> > IOW, we need to introduce the per memcg oom_lock, like bellow,
>
> I do not see why. Besides that we already do have per oom memcg
> hierarchy lock.
>




-- 
Thanks
Yafang



* Re: [PATCH] mm, oom: don't invoke oom killer if current has been reaped
  2020-07-13 13:11         ` Yafang Shao
@ 2020-07-13 19:05           ` Michal Hocko
  2020-07-14  0:15             ` Tetsuo Handa
  2020-07-14  2:09             ` Yafang Shao
  0 siblings, 2 replies; 20+ messages in thread
From: Michal Hocko @ 2020-07-13 19:05 UTC (permalink / raw)
  To: Yafang Shao; +Cc: David Rientjes, Andrew Morton, Linux MM

On Mon 13-07-20 21:11:50, Yafang Shao wrote:
> On Mon, Jul 13, 2020 at 8:45 PM Michal Hocko <mhocko@kernel.org> wrote:
> >
> > On Mon 13-07-20 20:24:07, Yafang Shao wrote:
[...]
> > > But we can't try locking the global oom_lock here, because task ooming
> > > in memcg foo may can't help the tasks in memcg bar.
> >
> > I do not follow. oom_lock is not about fwd progress. It is a big lock to
> > synchronize against oom_disable logic.
> >
> > I have this in mind
> >
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index 248e6cad0095..29d1f8c2d968 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -1563,8 +1563,10 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
> >         };
> >         bool ret;
> >
> > -       if (mutex_lock_killable(&oom_lock))
> > +       if (!mutex_trylock(&oom_lock))
> >                 return true;
> 
>                   root_mem_cgroup
>                  /               \
>     memcg_a (16G)                 memcg_b (32G)
>          |                             |
>     process a_1                   process b_1
>     (reaches memcg_a's limit)     (reaches memcg_b's limit)
>     holds oom_lock                waits for oom_lock
> 
> So process a_1 will try to kill a process in memcg_a, while process b_1
> needs to kill a process in memcg_b. IOW, the process killed in memcg_a
> can't help the processes in memcg_b, so process b_1 should not trylock
> the oom_lock here.
> 
> Whereas if the memcg tree is:
> 
>              target mem_cgroup (16G)
>             /                       \
>     process a_1                   process a_2
>     (reaches the limit)           (reaches the limit)
>     holds oom_lock                waits for oom_lock
> 
> then process a_2 can trylock the oom_lock here. IOW, these processes
> are in the same memcg.
> 
> That's why I said that we should introduce a per-memcg oom_lock.

I still fail to understand your reasoning. Sure, the oom lock is global,
so it doesn't have a per-oom-hierarchy resolution pretty much by definition.
But that is not really important. The whole point of the trylock is to
remove the ordering between the oom selection, the oom reaper and
potential charge consumers which trigger the oom in parallel. With the
blocking lock they would pile up in the order they have hit the OOM
situation. With the trylock they would simply keep retrying until the
oom is done. That would reduce the race window considerably. This is
what the global oom is doing.

Another alternative would be to check mem_cgroup_margin after the lock
is taken, but it would be better to keep in sync with the global case as
much as possible unless there is a good reason to differ.
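
For completeness, a minimal sketch of that alternative on top of the
current code (assuming mem_cgroup_margin() reports how many pages can
still be charged; whether 1 << order is the right threshold would need
checking):

 	if (mutex_lock_killable(&oom_lock))
 		return true;
+
+	/*
+	 * Recheck under oom_lock: the previous victim may have been reaped
+	 * while we were waiting, in which case the charge can succeed now
+	 * and there is nothing left to kill for.
+	 */
+	if (mem_cgroup_margin(memcg) >= (1 << order)) {
+		mutex_unlock(&oom_lock);
+		return true;
+	}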

-- 
Michal Hocko
SUSE Labs



* Re: [PATCH] mm, oom: don't invoke oom killer if current has been reaped
  2020-07-11  3:18 [PATCH] mm, oom: don't invoke oom killer if current has been reaped Yafang Shao
  2020-07-11  5:37 ` Yafang Shao
  2020-07-13  6:01 ` Michal Hocko
@ 2020-07-13 23:50 ` Tetsuo Handa
  2020-07-14  2:13   ` Yafang Shao
  2020-07-14  6:43   ` Michal Hocko
  2 siblings, 2 replies; 20+ messages in thread
From: Tetsuo Handa @ 2020-07-13 23:50 UTC (permalink / raw)
  To: Yafang Shao; +Cc: mhocko, rientjes, akpm, linux-mm

On 2020/07/11 12:18, Yafang Shao wrote:
> If the current's MMF_OOM_SKIP is set, it means that the current is exiting
> or dying and likely to realease its address space. So we don't need to
> invoke the oom killer again. Otherwise that may cause some unexpected
> issues, for example, bellow is the issue found in our production
> environment.

What kernel version are you using?

Commit 7775face207922ea ("memcg: killed threads should not invoke memcg OOM killer")
should solve this problem.



* Re: [PATCH] mm, oom: don't invoke oom killer if current has been reaped
  2020-07-13 19:05           ` Michal Hocko
@ 2020-07-14  0:15             ` Tetsuo Handa
  2020-07-14  0:18               ` Tetsuo Handa
  2020-07-14  2:09             ` Yafang Shao
  1 sibling, 1 reply; 20+ messages in thread
From: Tetsuo Handa @ 2020-07-14  0:15 UTC (permalink / raw)
  To: Michal Hocko; +Cc: Yafang Shao, David Rientjes, Andrew Morton, Linux MM

On 2020/07/14 4:05, Michal Hocko wrote:
> Another alternative would be to check mem_cgroup_margin after the lock
> is taken but it would be better to keep in sync with the global case as
> much as possible unless there is a good reason to differ.
> 

If mem_cgroup_margin() is something like what zone_watermark_ok() from
get_page_from_freelist() does, then checking mem_cgroup_margin() with
oom_lock held will keep us more in sync with the global case
(i.e. calling __alloc_pages_may_oom() with oom_lock held).



* Re: [PATCH] mm, oom: don't invoke oom killer if current has been reaped
  2020-07-14  0:15             ` Tetsuo Handa
@ 2020-07-14  0:18               ` Tetsuo Handa
  0 siblings, 0 replies; 20+ messages in thread
From: Tetsuo Handa @ 2020-07-14  0:18 UTC (permalink / raw)
  To: Michal Hocko; +Cc: Yafang Shao, David Rientjes, Andrew Morton, Linux MM

On 2020/07/14 9:15, Tetsuo Handa wrote:
> On 2020/07/14 4:05, Michal Hocko wrote:
>> Another alternative would be to check mem_cgroup_margin after the lock
>> is taken but it would be better to keep in sync with the global case as
>> much as possible unless there is a good reason to differ.
>>
> 
> If mem_cgroup_margin() is something what zone_watermark_ok() from
> get_page_from_freelist() will do, checking mem_cgroup_margin() with
> oom_lock held will make more keep in sync with the global case
> (i.e. calling __alloc_pages_may_oom() with oom_lock held).
> 
I meant:

(i.e. calling get_page_from_freelist() from __alloc_pages_may_oom() with oom_lock held).



* Re: [PATCH] mm, oom: don't invoke oom killer if current has been reaped
  2020-07-13 19:05           ` Michal Hocko
  2020-07-14  0:15             ` Tetsuo Handa
@ 2020-07-14  2:09             ` Yafang Shao
  1 sibling, 0 replies; 20+ messages in thread
From: Yafang Shao @ 2020-07-14  2:09 UTC (permalink / raw)
  To: Michal Hocko; +Cc: David Rientjes, Andrew Morton, Linux MM

On Tue, Jul 14, 2020 at 3:05 AM Michal Hocko <mhocko@kernel.org> wrote:
>
> On Mon 13-07-20 21:11:50, Yafang Shao wrote:
> > On Mon, Jul 13, 2020 at 8:45 PM Michal Hocko <mhocko@kernel.org> wrote:
> > >
> > > On Mon 13-07-20 20:24:07, Yafang Shao wrote:
> [...]
> > > > But we can't try locking the global oom_lock here, because task ooming
> > > > in memcg foo may can't help the tasks in memcg bar.
> > >
> > > I do not follow. oom_lock is not about fwd progress. It is a big lock to
> > > synchronize against oom_disable logic.
> > >
> > > I have this in mind
> > >
> > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > > index 248e6cad0095..29d1f8c2d968 100644
> > > --- a/mm/memcontrol.c
> > > +++ b/mm/memcontrol.c
> > > @@ -1563,8 +1563,10 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
> > >         };
> > >         bool ret;
> > >
> > > -       if (mutex_lock_killable(&oom_lock))
> > > +       if (!mutex_trylock(&oom_lock))
> > >                 return true;
> >
> >                   root_mem_cgroup
> >                  /               \
> >     memcg_a (16G)                 memcg_b (32G)
> >          |                             |
> >     process a_1                   process b_1
> >     (reaches memcg_a's limit)     (reaches memcg_b's limit)
> >     holds oom_lock                waits for oom_lock
> >
> > So process a_1 will try to kill a process in memcg_a, while process b_1
> > needs to kill a process in memcg_b. IOW, the process killed in memcg_a
> > can't help the processes in memcg_b, so process b_1 should not trylock
> > the oom_lock here.
> >
> > Whereas if the memcg tree is:
> >
> >              target mem_cgroup (16G)
> >             /                       \
> >     process a_1                   process a_2
> >     (reaches the limit)           (reaches the limit)
> >     holds oom_lock                waits for oom_lock
> >
> > then process a_2 can trylock the oom_lock here. IOW, these processes
> > are in the same memcg.
> >
> > That's why I said that we should introduce a per-memcg oom_lock.
>
> I still fail to understand your reaasoning. Sure, the oom lock is global
> so it doesn't have a per oom hierarchy resolution pretty much by definition.
> But that is not really important. The whole point of the trylock is to
> remove the ordering between the oom selection, the oom reaper and
> potential charge consumers which trigger the oom in parallel. With the
> blocking lock they would pile up in the order they have hit the OOM
> situation. With the trylock they would simply keep retrying until the
> oom is done. That would reduce the race window considerably. This is
> what the global oom is doing.
>

Thanks for the explanation.
Seems trylock can work.

> Another alternative would be to check mem_cgroup_margin after the lock
> is taken but it would be better to keep in sync with the global case as
> much as possible unless there is a good reason to differ.
>
> --
> Michal Hocko
> SUSE Labs



-- 
Thanks
Yafang



* Re: [PATCH] mm, oom: don't invoke oom killer if current has been reaped
  2020-07-13 23:50 ` Tetsuo Handa
@ 2020-07-14  2:13   ` Yafang Shao
  2020-07-14  2:42     ` Tetsuo Handa
  2020-07-14  6:43   ` Michal Hocko
  1 sibling, 1 reply; 20+ messages in thread
From: Yafang Shao @ 2020-07-14  2:13 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: Michal Hocko, David Rientjes, Andrew Morton, Linux MM

On Tue, Jul 14, 2020 at 7:50 AM Tetsuo Handa
<penguin-kernel@i-love.sakura.ne.jp> wrote:
>
> On 2020/07/11 12:18, Yafang Shao wrote:
> > If the current's MMF_OOM_SKIP is set, it means that the current is exiting
> > or dying and likely to realease its address space. So we don't need to
> > invoke the oom killer again. Otherwise that may cause some unexpected
> > issues, for example, bellow is the issue found in our production
> > environment.
>
> What kernel version are you using?
>

The kernel version is 4.18.

> Commit 7775face207922ea ("memcg: killed threads should not invoke memcg OOM killer")
> should solve this problem.

Yes, this commit should fix this issue. Thanks for the information.

But it seems the proposal to use a trylock in
mem_cgroup_out_of_memory() would be better?
The trylock could also solve the problem of different processes
doing oom at the same time.


-- 
Thanks
Yafang



* Re: [PATCH] mm, oom: don't invoke oom killer if current has been reaped
  2020-07-14  2:13   ` Yafang Shao
@ 2020-07-14  2:42     ` Tetsuo Handa
  2020-07-14  2:58       ` Yafang Shao
  0 siblings, 1 reply; 20+ messages in thread
From: Tetsuo Handa @ 2020-07-14  2:42 UTC (permalink / raw)
  To: Yafang Shao; +Cc: Michal Hocko, David Rientjes, Andrew Morton, Linux MM

On 2020/07/14 11:13, Yafang Shao wrote:
> But it seems the proposal that using trylock in
> mem_cgroup_out_of_memory() should  be better?
> The trylock could also solve the problem that different processes are
> doing oom at the same time.

I think the trylock is worse. The trylock needlessly wastes CPU time which
could have been utilized by the OOM killer/reaper for reclaiming memory.
If concurrent OOM on multiple non-overlapping memcg domains is a real problem,
a killable lock on a per-memcg oom_lock will be better than a trylock on the
global oom_lock.
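
A sketch of that variant, for contrast with the trylock approach (again
with a hypothetical per-memcg mutex, not an existing field):

	/* Sleep killably instead of burning CPU in a retry loop. */
	if (mutex_lock_killable(&memcg->oom_kill_lock))
		return true;	/* killed while waiting; no point in OOM-killing */

	ret = out_of_memory(&oc);
	mutex_unlock(&memcg->oom_kill_lock);

The trade-off is the one above: sleeping waiters do not consume CPU, but
they queue up behind whoever holds the lock instead of retrying the charge.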



* Re: [PATCH] mm, oom: don't invoke oom killer if current has been reaped
  2020-07-14  2:42     ` Tetsuo Handa
@ 2020-07-14  2:58       ` Yafang Shao
  2020-07-14  4:06         ` Tetsuo Handa
  0 siblings, 1 reply; 20+ messages in thread
From: Yafang Shao @ 2020-07-14  2:58 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: Michal Hocko, David Rientjes, Andrew Morton, Linux MM

On Tue, Jul 14, 2020 at 10:42 AM Tetsuo Handa
<penguin-kernel@i-love.sakura.ne.jp> wrote:
>
> On 2020/07/14 11:13, Yafang Shao wrote:
> > But it seems the proposal that using trylock in
> > mem_cgroup_out_of_memory() should  be better?
> > The trylock could also solve the problem that different processes are
> > doing oom at the same time.
>
> I think trylock is worse. The trylock needlessly wastes CPU time which could
> have been utilized by the OOM killer/reaper for reclaiming memory.

If it wastes CPU time, we can sched it out for 1 second like
__alloc_pages_may_oom() does:

__alloc_pages_may_oom
    if (!mutex_trylock(&oom_lock)) {
        schedule_timeout_uninterruptible(1);  // to avoid wasting CPU time
        return;
    }

But I find that we don't sched it out in the pagefault path:

pagefault_out_of_memory
    if (!mutex_trylock(&oom_lock))
        return;

I haven't thought deeply about what the difference is ...


> If concurrent OOM on multiple non-overwrapping memcg domains is a real problem,
> killable lock on per-memcg oom_lock will be better than trylock on global oom_lock.

I am wondering why we select the process per memcg, for example,

select_bad_process
    if (is_memcg_oom(oc))
        mem_cgroup_scan_tasks(oc->memcg, oom_evaluate_task, oc);
    else
        mem_cgroup_scan_tasks(root_mem_cgroup, oom_evaluate_task, oc);

Then we can put the lock here to lock each memcg, and avoid the oom
contention between different memcgs.

-- 
Thanks
Yafang



* Re: [PATCH] mm, oom: don't invoke oom killer if current has been reaped
  2020-07-14  2:58       ` Yafang Shao
@ 2020-07-14  4:06         ` Tetsuo Handa
  2020-07-14  5:03           ` Yafang Shao
  2020-07-14  6:51           ` Michal Hocko
  0 siblings, 2 replies; 20+ messages in thread
From: Tetsuo Handa @ 2020-07-14  4:06 UTC (permalink / raw)
  To: Yafang Shao; +Cc: Michal Hocko, David Rientjes, Andrew Morton, Linux MM

On 2020/07/14 11:58, Yafang Shao wrote:
> On Tue, Jul 14, 2020 at 10:42 AM Tetsuo Handa
> <penguin-kernel@i-love.sakura.ne.jp> wrote:
>>
>> On 2020/07/14 11:13, Yafang Shao wrote:
>>> But it seems the proposal that using trylock in
>>> mem_cgroup_out_of_memory() should  be better?
>>> The trylock could also solve the problem that different processes are
>>> doing oom at the same time.
>>
>> I think trylock is worse. The trylock needlessly wastes CPU time which could
>> have been utilized by the OOM killer/reaper for reclaiming memory.
> 
> If it may wastes the CPU time, we can shed it out for 1 second like
> what it does in __alloc_pages_may_oom():
> 
> __alloc_pages_may_oom
>     if (!mutex_trylock(&oom_lock)) {
>         schedule_timeout_uninterruptible(1);  // to avoid wasting CPU time

1 second would be HZ jiffies; 1 means 1 jiffy, i.e. 1 millisecond if CONFIG_HZ=1000. :-)

>         return;
>     }
> 
> But I find that we doesn't sched it out in pagefault path,
> 
> pagefault_out_of_memory
>     if (!mutex_trylock(&oom_lock))
>         return;
> 
> I haven't thought deeply what the difference is ...

David Rientjes has been proposing it to avoid soft lockups, and Michal Hocko has been rejecting it.
How to give the OOM killer/reaper enough CPU time for reclaiming memory is a dogfight. :-(

https://lkml.kernel.org/r/alpine.DEB.2.21.2003181458100.70237@chino.kir.corp.google.com



* Re: [PATCH] mm, oom: don't invoke oom killer if current has been reaped
  2020-07-14  4:06         ` Tetsuo Handa
@ 2020-07-14  5:03           ` Yafang Shao
  2020-07-14  6:51           ` Michal Hocko
  1 sibling, 0 replies; 20+ messages in thread
From: Yafang Shao @ 2020-07-14  5:03 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: Michal Hocko, David Rientjes, Andrew Morton, Linux MM

On Tue, Jul 14, 2020 at 12:06 PM Tetsuo Handa
<penguin-kernel@i-love.sakura.ne.jp> wrote:
>
> On 2020/07/14 11:58, Yafang Shao wrote:
> > On Tue, Jul 14, 2020 at 10:42 AM Tetsuo Handa
> > <penguin-kernel@i-love.sakura.ne.jp> wrote:
> >>
> >> On 2020/07/14 11:13, Yafang Shao wrote:
> >>> But it seems the proposal that using trylock in
> >>> mem_cgroup_out_of_memory() should  be better?
> >>> The trylock could also solve the problem that different processes are
> >>> doing oom at the same time.
> >>
> >> I think trylock is worse. The trylock needlessly wastes CPU time which could
> >> have been utilized by the OOM killer/reaper for reclaiming memory.
> >
> > If it may wastes the CPU time, we can shed it out for 1 second like
> > what it does in __alloc_pages_may_oom():
> >
> > __alloc_pages_may_oom
> >     if (!mutex_trylock(&oom_lock)) {
> >         schedule_timeout_uninterruptible(1);  // to avoid wasting CPU time
>
> 1 second is HZ. 1 means 1 millisecond if CONFIG_HZ=1000. :-)
>

Right. Thanks for pointing it out :)

> >         return;
> >     }
> >
> > But I find that we doesn't sched it out in pagefault path,
> >
> > pagefault_out_of_memory
> >     if (!mutex_trylock(&oom_lock))
> >         return;
> >
> > I haven't thought deeply what the difference is ...
>
> David Rientjes is proposing it for avoiding soft lockup, and Michal Hocko is refusing it.
> How to give the OOM killer/reaper enough CPU time for reclaiming memory is a dogfight. :-(
>
> https://lkml.kernel.org/r/alpine.DEB.2.21.2003181458100.70237@chino.kir.corp.google.com

OK. I will take a look at the discussion in that thread.

-- 
Thanks
Yafang



* Re: [PATCH] mm, oom: don't invoke oom killer if current has been reaped
  2020-07-13 23:50 ` Tetsuo Handa
  2020-07-14  2:13   ` Yafang Shao
@ 2020-07-14  6:43   ` Michal Hocko
  2020-07-14  9:30     ` Yafang Shao
  1 sibling, 1 reply; 20+ messages in thread
From: Michal Hocko @ 2020-07-14  6:43 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: Yafang Shao, rientjes, akpm, linux-mm

On Tue 14-07-20 08:50:57, Tetsuo Handa wrote:
> On 2020/07/11 12:18, Yafang Shao wrote:
> > If the current's MMF_OOM_SKIP is set, it means that the current is exiting
> > or dying and likely to realease its address space. So we don't need to
> > invoke the oom killer again. Otherwise that may cause some unexpected
> > issues, for example, bellow is the issue found in our production
> > environment.
> 
> What kernel version are you using?
> 
> Commit 7775face207922ea ("memcg: killed threads should not invoke memcg OOM killer")
> should solve this problem.

I haven't really seen the oom report, but this would solve only half of
the problem. It is still possible that other tasks from the oom memcg
hierarchy are blocked on the oom lock and then, once they acquire it,
observe the already reaped victim and go on to kill further tasks. The
trylock or the margin check is needed to address the other half of the
problem.
-- 
Michal Hocko
SUSE Labs



* Re: [PATCH] mm, oom: don't invoke oom killer if current has been reaped
  2020-07-14  4:06         ` Tetsuo Handa
  2020-07-14  5:03           ` Yafang Shao
@ 2020-07-14  6:51           ` Michal Hocko
  1 sibling, 0 replies; 20+ messages in thread
From: Michal Hocko @ 2020-07-14  6:51 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: Yafang Shao, David Rientjes, Andrew Morton, Linux MM

On Tue 14-07-20 13:06:48, Tetsuo Handa wrote:
> On 2020/07/14 11:58, Yafang Shao wrote:
[...]
> > But I find that we doesn't sched it out in pagefault path,
> > 
> > pagefault_out_of_memory
> >     if (!mutex_trylock(&oom_lock))
> >         return;
> > 
> > I haven't thought deeply what the difference is ...
> 
> David Rientjes is proposing it for avoiding soft lockup, and Michal Hocko is refusing it.
> How to give the OOM killer/reaper enough CPU time for reclaiming memory is a dogfight. :-(
> 
> https://lkml.kernel.org/r/alpine.DEB.2.21.2003181458100.70237@chino.kir.corp.google.com

I believe there was review feedback to be addressed and it was never
followed up on.

-- 
Michal Hocko
SUSE Labs



* Re: [PATCH] mm, oom: don't invoke oom killer if current has been reaped
  2020-07-14  6:43   ` Michal Hocko
@ 2020-07-14  9:30     ` Yafang Shao
  0 siblings, 0 replies; 20+ messages in thread
From: Yafang Shao @ 2020-07-14  9:30 UTC (permalink / raw)
  To: Michal Hocko; +Cc: Tetsuo Handa, David Rientjes, Andrew Morton, Linux MM

On Tue, Jul 14, 2020 at 2:43 PM Michal Hocko <mhocko@kernel.org> wrote:
>
> On Tue 14-07-20 08:50:57, Tetsuo Handa wrote:
> > On 2020/07/11 12:18, Yafang Shao wrote:
> > > If the current's MMF_OOM_SKIP is set, it means that the current is exiting
> > > or dying and likely to realease its address space. So we don't need to
> > > invoke the oom killer again. Otherwise that may cause some unexpected
> > > issues, for example, bellow is the issue found in our production
> > > environment.
> >
> > What kernel version are you using?
> >
> > Commit 7775face207922ea ("memcg: killed threads should not invoke memcg OOM killer")
> > should solve this problem.
>
> I haven't really seen the oom report but this would solve only half of
> the problem. It is still possible that other tasks from the oom memcg
> hierarchy can be blocked on the oom lock and then observe the oom reaped
> tasks. The trylock or the check for the margin is needed to address the
> other half of the problem.

Agreed. The case of different tasks doing memcg oom in the same memcg
should also be resolved.
I will send a patch for it.


-- 
Thanks
Yafang


