From: Michal Hocko <mhocko@kernel.org>
To: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: linux-mm@kvack.org, rientjes@google.com, akpm@linux-foundation.org
Subject: Re: [PATCH 3/3] mm, oom_reaper: clear TIF_MEMDIE for all tasks queued for oom_reaper
Date: Tue, 19 Apr 2016 10:17:22 -0400
Message-ID: <20160419141722.GB4126@dhcp22.suse.cz>
In-Reply-To: <201604182059.JFB76917.OFJMHFLSOtQVFO@I-love.SAKURA.ne.jp>

On Mon 18-04-16 20:59:51, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Sat 16-04-16 11:51:11, Tetsuo Handa wrote:
> > > Michal Hocko wrote:
> > > > On Thu 07-04-16 20:55:34, Tetsuo Handa wrote:
> > > > > Michal Hocko wrote:
> > > > > > The first obvious one is when the oom victim clears its mm and gets
> > > > > > stuck later on. oom_reaper would back off on find_lock_task_mm returning
> > > > > > NULL. We can safely try to clear TIF_MEMDIE in this case because such a
> > > > > > task would be ignored by the oom killer anyway. In most cases the flag
> > > > > > will have been cleared by that point anyway.
> > > > > 
> > > > > I didn't understand what this is trying to say. The OOM victim will clear
> > > > > TIF_MEMDIE as soon as it sets current->mm = NULL.
> > > > 
> > > > No, it clears the flag _after_ it returns from mmput. There is no
> > > > guarantee it won't get stuck somewhere on the way there - e.g. exit_aio
> > > > waits for completion, and who knows what else might get stuck.
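
[ For context, the ordering at the tail of exit_mm() at that time is
  roughly the following - a simplified sketch, details elided:

	mmput(mm);			/* may block in exit_aio() etc. */
	if (test_thread_flag(TIF_MEMDIE))
		exit_oom_victim(tsk);	/* this is where TIF_MEMDIE is cleared */

  so the victim can get stuck inside mmput() while still carrying
  TIF_MEMDIE. ]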
> > > 
> > > OK. Then, I think the OOM livelock scenario shown below is possible.
> > > 
> > >  (1) The first OOM victim (where mm->mm_users == 1) is selected by the
> > >      first round of out_of_memory() calls.
> > > 
> > >  (2) The OOM reaper calls atomic_inc_not_zero(&mm->mm_users).
> > > 
> > >  (3) The OOM victim calls mmput() from exit_mm() from do_exit().
> > >      mmput() returns immediately because atomic_dec_and_test(&mm->mm_users)
> > >      returns false because of (2).
> > > 
> > >  (4) The OOM reaper reaps memory and then calls mmput().
> > >      mmput() calls exit_aio() etc. and waits for completion because
> > >      atomic_dec_and_test(&mm->mm_users) is now true.
> > > 
> > >  (5) A second OOM victim (the parent of the first OOM victim)
> > >      is selected by the next round of out_of_memory() calls.
> > > 
> > >  (6) The OOM reaper is blocked waiting for the first OOM victim's mm
> > >      teardown to complete, while the second OOM victim is waiting for
> > >      the OOM reaper to reap memory.
> > > 
> > > Where is the guarantee that exit_aio() etc., called from mmput() by the
> > > OOM reaper, do not depend on memory allocation (i.e. that the OOM reaper
> > > is not blocked forever inside __oom_reap_task())?
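
[ To make the scenario easier to follow, here is a heavily simplified
  sketch of the refcounting in steps (2)-(4); the function names are real
  but the bodies are illustrative, not the actual mm/oom_kill.c code: ]

	/* OOM reaper side, simplified __oom_reap_task() */
	static bool __oom_reap_task(struct task_struct *tsk)
	{
		struct mm_struct *mm;
		...
		/* (2) pin the mm so it cannot go away under us */
		if (!atomic_inc_not_zero(&mm->mm_users))
			return true;
		...	/* reap the address space */
		/*
		 * (4) if the victim's exit_mm() has already dropped its
		 * reference, this is the final mmput() and it performs the
		 * whole teardown, including exit_aio() which waits for
		 * completion and may block indefinitely.
		 */
		mmput(mm);
		return ret;
	}

	/* victim side, do_exit() -> exit_mm() */
	mmput(mm);	/* (3) not the final drop because of (2); returns
			 *     immediately */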
> > 
> > You should realize that the mmput is called _after_ we have reclaimed
> > the victim's address space. So there should be some memory freed by that
> > time, which reduces the likelihood of a lockup due to a memory allocation
> > request, should exit_aio really need one.
> 
> Not always true. mmput() is also called when down_read_trylock(&mm->mmap_sem)
> fails. It is possible that the OOM victim was about to call
> up_write(&mm->mmap_sem) when down_read_trylock(&mm->mmap_sem) failed, and that
> the OOM victim then ran all the way to returning from mmput() from exit_mm()
> from do_exit() while the OOM reaper was preempted between
> down_read_trylock(&mm->mmap_sem) and mmput(). Under such a race, the OOM
> reaper will call mmput() without having reclaimed the OOM victim's address
> space.

You are right! For some reason I missed that.
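
To spell the race out (again a simplified sketch; the label and the
error handling are illustrative, not the exact code):

	static bool __oom_reap_task(struct task_struct *tsk)
	{
		...
		if (!down_read_trylock(&mm->mmap_sem)) {
			/* the victim can run to completion from here on */
			ret = false;
			goto out;	/* nothing has been reaped */
		}
		...
	out:
		/*
		 * If the victim has meanwhile returned from its mmput() in
		 * exit_mm(), we hold the last mm_users reference, so this
		 * mmput() does the full teardown (exit_aio() etc.) without
		 * any of the victim's memory having been reclaimed first.
		 */
		mmput(mm);
		return ret;
	}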

> > But you have a good point here. We want the oom_reaper to be as robust
> > as possible. We dropped the munlock patch for robustness reasons, so I
> > guess we want this fixed as well. The reason for blocking might be
> > something other than memory pressure.
> 
> The reality of races and dependencies is more complicated than we can imagine.
> 
> > 
> > Here is what should work - I have only compile-tested it. I will prepare
> > the proper patch later this week together with the other oom reaper
> > patches, or after I come back from LSF/MM.
> 
> Excuse me, but is system_wq suitable for queueing work items which may
> take an unpredictably long time to complete?
> 
>   system_wq is the one used by schedule[_delayed]_work[_on]().
>   Multi-CPU multi-threaded.  There are users which expect relatively
>   short queue flush time.  Don't queue works which can run for too
>   long.

An alternative would be a dedicated WQ with WQ_MEM_RECLAIM, but I am not
really sure that would be justified considering we are talking about a
highly unlikely event. You do not want to consume resources permanently
for a rare and non-fatal event.
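
Just for illustration, the dedicated WQ variant would look something like
the following (untested sketch; the queue name, the fallback and the
reap_work item are made up):

	static struct workqueue_struct *oom_reaper_wq;

	/* one-time setup, e.g. from the oom_reaper initialization */
	oom_reaper_wq = alloc_workqueue("oom_reaper",
					WQ_MEM_RECLAIM | WQ_UNBOUND, 1);
	if (!oom_reaper_wq)
		oom_reaper_wq = system_wq;	/* best-effort fallback */

	/* and then, instead of queueing on system_wq */
	queue_work(oom_reaper_wq, &reap_work);

WQ_MEM_RECLAIM keeps a standby rescuer kthread around for the whole
lifetime of the queue to guarantee forward progress under memory
pressure, and that standby kthread is exactly the kind of permanent
resource consumption I mean above.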

> Many users, including SysRq-f, depend on system_wq being flushed promptly.

Critical work shouldn't really rely on system_wq, full stop. There is
just too much going on on that WQ, and it is simply impossible to
guarantee anything.

> We
> haven't guaranteed that SysRq-f can always fire and select a different OOM
> victim, but you proposed always clearing TIF_MEMDIE without considering the
> possibility of the OOM victim being stuck in an unkillable wait while
> holding mmap_sem for write.
> 
> I wonder about your definition of "robustness". You almost always miss the
> worst-case scenario. You are trying to manage OOM without defining the
> default: label in a switch statement. I don't think your approach is robust.

I am trying to be as robust as is viable. You have to realize that we are
in a catastrophic path already and there is simply no deterministic way
out.
-- 
Michal Hocko
SUSE Labs
