From: penguin-kernel@i-love.sakura.ne.jp
To: Michal Hocko <mhocko@kernel.org>
Cc: linux-mm@kvack.org, akpm@linux-foundation.org,
	torvalds@linux-foundation.org
Subject: Re: [PATCH 0/8] OOM killer/reaper changes for avoiding OOM lockup problem.
Date: Wed, 04 Jul 2018 11:22:55 +0900	[thread overview]
Message-ID: <201807040222.w642Mtlr099513@www262.sakura.ne.jp> (raw)
In-Reply-To: <20180703152922.GR16767@dhcp22.suse.cz>

Michal Hocko wrote:
> > On Tue 03-07-18 23:25:01, Tetsuo Handa wrote:
> > > This series provides
> > > 
> > >   (1) Mitigation and a fix for CVE-2016-10723.
> > > 
> > >   (2) A mitigation for needlessly selecting next OOM victim reported
> > >       by David Rientjes and rejected by Michal Hocko.
> > > 
> > >   (3) A preparation for handling many concurrent OOM victims which
> > >       could become real by introducing memcg-aware OOM killer.
> > 
> > It would have been great to describe the overall design in the cover
> > letter. So let me summarize just to be sure I understand the proposal.

You understood the proposal correctly.

> > You are removing the oom_reaper and moving the oom victim tear down to
> > the oom path.

Yes. This is for getting rid of the lie in __alloc_pages_may_oom():

	/*
	 * Acquire the oom lock.  If that fails, somebody else is
	 * making progress for us.
	 */
	if (!mutex_trylock(&oom_lock)) {
		*did_some_progress = 1;
		schedule_timeout_uninterruptible(1);
		return NULL;
	}

which leads to CVE-2016-10723. By reclaiming from the OOM killer path
itself, we can eliminate this heuristic.

Of course, we don't have to remove the OOM reaper kernel thread. But the OOM
killer path knows better than the OOM reaper which OOM domains need to be
reclaimed sooner than others.
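
For illustration only, here is a minimal sketch (not the actual patch) of
what reclaiming from the OOM killer path could look like inside
__alloc_pages_may_oom(): instead of pretending that somebody else is making
progress when the trylock fails, the allocating thread waits for the lock
and helps tearing down pending victims itself. oom_reap_pending_victims()
is a hypothetical name standing in for the direct OOM reaping added by this
series.

	/* Sketch only, not the actual patch. */
	if (mutex_lock_killable(&oom_lock))
		return NULL;	/* we were killed while waiting */
	/* Help tearing down already selected victims (hypothetical helper). */
	oom_reap_pending_victims(oc);
	if (out_of_memory(oc))
		*did_some_progress = 1;
	mutex_unlock(&oom_lock);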

> >               To handle cases where we cannot get mmap_sem to do that
> > work you simply decay oom_badness over time if there are no changes in
> > the victim's oom score.

Yes. It is a feedback-based approach which handles not only the case where
we are unable to get mmap_sem but also the case where
free_pgtables()/remove_vma() has already started.

However, you are not understanding [PATCH 3/8] correctly. It IS a
feedback-based approach which corresponds to

	while (attempts++ < MAX_OOM_REAP_RETRIES && !oom_reap_task_mm(tsk, mm))
		schedule_timeout_idle(HZ/10);

in the current code (though the timeout differs from the current one).

> 
> Correction. You do not decay oom_badness. You simply increase a stall
> counter anytime oom_badness hasn't changed since the last check (if that
> check happend at least HZ/10 ago) and get the victim out of sight if the
> counter is larger than 30. This is where 3s are coming from. So in fact
> this is the low boundary while it might be considerably larger depending
> on how often we examine the victim.

Yes, it can become larger. But if we don't need to examine the OOM victim,
that means we are not OOM; we need to examine the OOM victim only while we
are still OOM. (If we are no longer OOM, the OOM victim will disappear via
exit_oom_mm() before the counter reaches 30.)
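
As an illustration of the feedback loop being described here (a simplified
sketch, not the code of [PATCH 3/8]): a victim is given up on only after its
oom_badness() score has stayed unchanged for 30 checks spaced at least HZ/10
apart, which is where the 3 second lower bound comes from. The oom_victim
structure and its task/last_check/last_score/stall_count fields are
hypothetical names used only for this illustration.

	static bool victim_looks_stuck(struct oom_victim *v)
	{
		unsigned long score = oom_badness(v->task, NULL, NULL,
						  totalram_pages);

		if (time_before(jiffies, v->last_check + HZ / 10))
			return false;		/* checked too recently */
		v->last_check = jiffies;
		if (score != v->last_score) {	/* still shrinking */
			v->last_score = score;
			v->stall_count = 0;
			return false;
		}
		return ++v->stall_count >= 30;	/* ~3 seconds without change */
	}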

Note that even the current

#define MAX_OOM_REAP_RETRIES 10

	/* Retry the down_read_trylock(mmap_sem) a few times */
	while (attempts++ < MAX_OOM_REAP_RETRIES && !oom_reap_task_mm(tsk, mm))
		schedule_timeout_idle(HZ/10);

code does not guarantee that the OOM reaper gives up within 1 second. If
oom_reap_task_mm() for one OOM domain takes e.g. one minute while there are
OOM victims in multiple OOM domains, we needlessly block the other OOM
domains, where allocating threads keep wasting CPU resources due to the lie
mentioned above.

By allowing allocating threads to do direct OOM reaping, we no longer
needlessly block them, and we can eliminate the lie mentioned above.
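
For illustration, a sketch of what per-domain direct reaping could look like
(again not the actual patch): each allocating thread walks the list of
pending victims and reaps only those which belong to its own OOM domain
(memcg/cpuset/nodemask), so one slow domain no longer serializes the others.
The oom_victim_list and the oom_victim_entry linkage are hypothetical names
used only here.

	/* Sketch only: reap victims belonging to this allocation's OOM domain. */
	static void direct_oom_reap(struct oom_control *oc)
	{
		struct task_struct *victim;

		list_for_each_entry(victim, &oom_victim_list, oom_victim_entry) {
			/* Skip victims from unrelated OOM domains. */
			if (oom_unkillable_task(victim, oc->memcg, oc->nodemask))
				continue;
			/* Quietly retry later if mmap_sem is unavailable now. */
			if (!oom_reap_task_mm(victim, victim->mm))
				continue;
			set_bit(MMF_OOM_SKIP, &victim->mm->flags);
		}
	}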

> 
> >                        In order to not block in the oom context for too
> > long because the address space might be quite large, you allow
> > direct oom reaping from multiple contexts.

Yes. Since the thread which acquired the oom_lock can be running with
SCHED_IDLE priority, this patch tries to speed up direct OOM reaping by
also allowing other threads with !SCHED_IDLE priority to reap. Thus,
[PATCH 7/8] reduces the duration for which the oom_lock is held, by
releasing the oom_lock before doing direct OOM reaping.

We could make oom_lock a spinlock (or use "mutex_lock(&oom_lock);
preempt_disable();" and "preempt_enable(); mutex_unlock(&oom_lock);"), but
there is no clear-cut answer as to whether we should do so.
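
To make the intent of [PATCH 7/8] concrete, here is a sketch of the locking
order (an illustration rather than the patch itself, reusing the
hypothetical direct_oom_reap() helper from the sketch above):

	/* Select and queue victims under oom_lock ... */
	mutex_lock(&oom_lock);
	out_of_memory(oc);
	mutex_unlock(&oom_lock);
	/*
	 * ... but do the potentially long direct OOM reaping outside of it,
	 * so that a SCHED_IDLE lock holder cannot stall everybody else.
	 */
	direct_oom_reap(oc);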

> > 
> > You fail to explain why is this approach more appropriate and how you
> > have settled with your current tuning with 3s timeout etc...
> > 
> > Considering how subtle this whole area is I am not overly happy about
> > another rewrite without really strong reasoning behind it. There is none
> > here, unfortunately. Well, except for statements how I reject something
> > without telling the whole story etc...

However, you are not understanding [PATCH 1/8] correctly. You are simply
rejecting my patch instead of proving that removing the short sleep _is_
safe. Since you have not proven it, I don't remove the short sleep until
[PATCH 8/8].

In kernel space, not yielding enough CPU time for other threads to make
forward progress is a BUG. You learned that with warn_alloc() for allocation
stall reporting, didn't you?

Thread overview: 25+ messages
2018-07-03 14:25 [PATCH 0/8] OOM killer/reaper changes for avoiding OOM lockup problem Tetsuo Handa
2018-07-03 14:25 ` [PATCH 1/8] mm,oom: Don't call schedule_timeout_killable() with oom_lock held Tetsuo Handa
2018-07-03 14:38   ` Michal Hocko
2018-07-03 14:25 ` [PATCH 2/8] mm,oom: Check pending victims earlier in out_of_memory() Tetsuo Handa
2018-07-03 14:25 ` [PATCH 3/8] mm,oom: Fix unnecessary killing of additional processes Tetsuo Handa
2018-07-03 14:58   ` Michal Hocko
2018-07-03 14:25 ` [PATCH 4/8] mm,page_alloc: Make oom_reserves_allowed() even Tetsuo Handa
2018-07-03 14:25 ` [PATCH 5/8] mm,oom: Bring OOM notifier to outside of oom_lock Tetsuo Handa
2018-07-03 14:59   ` Michal Hocko
2018-07-03 14:25 ` [PATCH 6/8] mm,oom: Make oom_lock static variable Tetsuo Handa
2018-07-03 14:25 ` [PATCH 7/8] mm,oom: Do not sleep with oom_lock held Tetsuo Handa
2018-07-03 14:25 ` [PATCH 8/8] mm,page_alloc: Move the short sleep to should_reclaim_retry() Tetsuo Handa
2018-07-03 15:12 ` [PATCH 0/8] OOM killer/reaper changes for avoiding OOM lockup problem Michal Hocko
2018-07-03 15:29   ` Michal Hocko
2018-07-04  2:22     ` penguin-kernel [this message]
2018-07-04  7:16       ` Michal Hocko
2018-07-04  7:22         ` Michal Hocko
2018-07-05  3:05           ` Tetsuo Handa
2018-07-05  7:24             ` Michal Hocko
2018-07-06  2:40               ` Tetsuo Handa
2018-07-06  2:49                 ` Linus Torvalds
2018-07-07  1:12                   ` Tetsuo Handa
2018-07-09  7:45                     ` Michal Hocko
2018-07-06  5:56                 ` Michal Hocko
2018-07-10  3:57                   ` Tetsuo Handa
