From: Thomas Gleixner <tglx@linutronix.de>
To: Michal Hocko <mhocko@suse.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>,
	Nico Pache <npache@redhat.com>,
	linux-mm@kvack.org, Andrea Arcangeli <aarcange@redhat.com>,
	Joel Savitz <jsavitz@redhat.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	linux-kernel@vger.kernel.org, Rafael Aquini <aquini@redhat.com>,
	Waiman Long <longman@redhat.com>, Baoquan He <bhe@redhat.com>,
	Christoph von Recklinghausen <crecklin@redhat.com>,
	Don Dutile <ddutile@redhat.com>,
	"Herton R . Krzesinski" <herton@redhat.com>,
	Ingo Molnar <mingo@redhat.com>,
	Peter Zijlstra <peterz@infradead.org>,
	Darren Hart <dvhart@infradead.org>,
	Andre Almeida <andrealmeid@collabora.com>,
	David Rientjes <rientjes@google.com>
Subject: Re: [PATCH v5] mm/oom_kill.c: futex: Close a race between do_exit and the oom_reaper
Date: Wed, 23 Mar 2022 11:30:49 +0100	[thread overview]
Message-ID: <87ils59d0m.ffs@tglx> (raw)
In-Reply-To: <YjrlqAMyJg3GKZVs@dhcp22.suse.cz>

Michal!

On Wed, Mar 23 2022 at 10:17, Michal Hocko wrote:
> Let me skip over the futex part, which I need to digest, and only focus
> on the oom side of things for clarification.

The most important thing to know about futexes is: They are cursed.

> On Tue 22-03-22 23:43:18, Thomas Gleixner wrote:
> [...]
>> > While some places can be handled by changing uninterruptible waiting
>> > to killable, there are places which are not really fixable, e.g. a
>> > lock chain dependency which leads to memory allocation.
>> 
>> I'm not following. Which lock chain dependency causes memory allocation?
>
> Consider an oom victim that is blocked on a lock or waiting for an event
> to happen, but the lock holder is stuck allocating or the wake-up depends
> on an allocation. Many sleeping locks are doing GFP_KERNEL allocations.

Fair enough.
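
A hedged illustration of that dependency; the names (cache_lock,
holder_path, victim_path) are made up and this is not code from the
kernel tree, just a sketch of the circular wait via memory:

#include <linux/mutex.h>
#include <linux/slab.h>

static DEFINE_MUTEX(cache_lock);

/* Lock holder: allocates with GFP_KERNEL under the mutex and may end up
 * in direct reclaim, waiting for the OOM situation to resolve. */
static void *holder_path(void)
{
	void *p;

	mutex_lock(&cache_lock);
	p = kmalloc(4096, GFP_KERNEL);	/* may sleep in reclaim */
	mutex_unlock(&cache_lock);
	return p;
}

/* OOM victim: picked to die so memory gets freed, but it cannot make
 * progress towards exit before the holder drops the lock, while the
 * holder waits for memory that only the victim's exit would free. */
static void victim_path(void)
{
	mutex_lock(&cache_lock);
	mutex_unlock(&cache_lock);
}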
  
>> That will prevent the enforced race in most cases and allow the exiting
>> and/or killed processes to clean up themselves. Not pretty, but it should
>> significantly reduce the chance of the reaper winning the race with the
>> exiting and/or killed process.
>> 
>> It's not going to work when the problem is combined with a heavy VM
>> overload situation which keeps a guest (or one/some of its vCPUs) from
>> being scheduled. See below for a discussion of guarantees.
>> 
>> If it has failed to do so by the time the sleep returns, then you can
>> still reap it.
>
> Yes, this is certainly an option. Please note that the oom_reaper is not
> the only way to trigger this. The process_mrelease syscall performs the
> same operation from userspace. Arguably process_mrelease could be used
> sanely/correctly because the userspace oom killer can do pre-cleanup
> steps before going to the final SIGKILL & process_mrelease. One way would
> be to send SIGTERM in the first step and allow the victim to perform its
> cleanup.

A potential staged approach would be:

  Send SIGTERM
  wait some time
  Send SIGKILL
  wait some time
  sys_process_mrelease()

Needs proper documentation though.
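
A minimal userspace sketch of that staged approach. The grace periods,
the error handling and the helper name staged_kill() are policy choices
made up here, and the syscall numbers are the x86_64 ones:

#include <signal.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

#ifndef __NR_pidfd_open
#define __NR_pidfd_open 434		/* x86_64 */
#endif
#ifndef __NR_process_mrelease
#define __NR_process_mrelease 448	/* x86_64 */
#endif

static int staged_kill(pid_t pid)
{
	/* Grab a pidfd up front so the final reap cannot hit a recycled PID. */
	int pidfd = syscall(__NR_pidfd_open, pid, 0);

	if (pidfd < 0)
		return -1;

	kill(pid, SIGTERM);		/* step 1: let the victim clean up */
	sleep(2);			/* grace period: policy choice */

	kill(pid, SIGKILL);		/* step 2: force the exit */
	sleep(1);

	/* step 3: reap whatever address space is still left */
	if (syscall(__NR_process_mrelease, pidfd, 0) < 0)
		perror("process_mrelease");	/* ESRCH: nothing left to reap */

	close(pidfd);
	return 0;
}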

>> That said, the robust list is no guarantee. It's a best-effort approach
>> which works well most of the time, at least for the "normal" issues where
>> a task holding a futex dies unexpectedly. But there is no guarantee that
>> it works under all circumstances, e.g. OOM.
>
> OK, so this is an important note. I am all fine with documenting this
> restriction. It is not like oom victims couldn't cause other disruptions
> by leaving inconsistent/stale state behind.

Correct. Futexes are a small part of the overall damage.

>> Wrong. The whole point of robust lists is to handle the "normal" case
>> gracefully. A process being OOM killed is _NOT_ in the "normal"
>> category.
>> 
>> Neither is it "normal" that a VM is scheduled out long enough to miss a
>> 1 second deadline. That might be considered normal by cloud folks, but
>> that's absolutely not normal from an OS POV. Again, that's not an OS
>> problem, that's an operator/admin problem.
>
> Thanks for this clarification. I would tend to agree. Following the
> previous example, oom victims can leave inconsistent state behind
> which can influence other processes. I am wondering what kind of
> expectations we can make about the lock-protected state when the holder
> of the lock has been interrupted at a random place in the critical
> section.

Right. If a futex is released by the exit cleanup, it is marked with
FUTEX_OWNER_DIED. User space is supposed to handle this.

pthread_mutex_lock() returns EOWNERDEAD to the caller if the owner-died
bit is set. It's the caller's responsibility to deal with potentially
corrupted or inconsistent state.
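
A minimal sketch of that userspace obligation, using the POSIX robust
mutex API. recover_shared_state() is a hypothetical application hook;
everything else is standard pthreads:

#include <errno.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical hook: repair whatever data the lock protects. */
extern bool recover_shared_state(void);

static pthread_mutex_t lock;

static void init_robust_lock(void)
{
	pthread_mutexattr_t attr;

	pthread_mutexattr_init(&attr);
	/* The kernel hands this lock over with FUTEX_OWNER_DIED semantics. */
	pthread_mutexattr_setrobust(&attr, PTHREAD_MUTEX_ROBUST);
	pthread_mutex_init(&lock, &attr);
	pthread_mutexattr_destroy(&attr);
}

static int take_lock(void)
{
	int ret = pthread_mutex_lock(&lock);

	if (ret != EOWNERDEAD)
		return ret;

	/* The previous owner died in the critical section; the protected
	 * state may be half-updated and must be repaired by the caller. */
	if (recover_shared_state()) {
		pthread_mutex_consistent(&lock);
		return 0;		/* lock held, state repaired */
	}

	/* Unlocking without pthread_mutex_consistent() marks the mutex
	 * permanently unusable (ENOTRECOVERABLE) for everybody else. */
	pthread_mutex_unlock(&lock);
	return ENOTRECOVERABLE;
}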

>> So let me summarize what I think needs to be done in priority order:
>> 
>>  #1 Delay the oom reaper so the normal case of a process being able to
>>     exit is not subject to a pointless race to death.
>> 
>>  #2 If #1 does not result in the desired outcome, reap the mm (which is
>>     what we have now).
>> 
>>  #3 If it's expected that #2 will allow the stuck process to make
>>     progress on the way towards cleanup, then do not reap any VMA
>>     containing a robust list head of a thread which belongs to the
>>     exiting and/or killed process.
>> 
>> The remaining cases, i.e. the lock chain example I pointed out above or
>> the stuck-forever task, are going to be rare and fall under the
>> collateral damage and no-guarantee rule.
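
For illustration, a hedged sketch of what #1 could look like mechanically,
assuming a workqueue-driven reaper. The real oom_reaper is a dedicated
kthread; the names, the delay value and the lack of refcounting/locking
below are all assumptions made for the sketch:

#include <linux/jiffies.h>
#include <linux/printk.h>
#include <linux/sched.h>
#include <linux/workqueue.h>

/* Grace period before reaping: a policy choice, not a kernel constant. */
#define OOM_REAP_DELAY	(2 * HZ)

static struct task_struct *oom_victim;	/* sketch only: no refcounting */

static void reap_victim_fn(struct work_struct *work);
static DECLARE_DELAYED_WORK(oom_reap_work, reap_victim_fn);

/* #1: when a victim is selected, do not reap immediately; give it time
 * to exit on its own and release its futexes via the robust list. */
static void queue_victim_for_reaping(struct task_struct *tsk)
{
	oom_victim = tsk;
	schedule_delayed_work(&oom_reap_work, OOM_REAP_DELAY);
}

/* #2: only if the victim is still around after the grace period does
 * the reaper tear down its address space. */
static void reap_victim_fn(struct work_struct *work)
{
	if (oom_victim && oom_victim->mm)
		pr_info("oom sketch: reaping mm of pid %d\n", oom_victim->pid);
}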
>
> I do agree that delaying the oom_reaper wake-up is the simplest thing to
> do at this stage and it could catch most of the failures. We still have
> the process_mrelease syscall case, but I guess we can document this as a
> caveat in the man page.

Yes. The user space oom killer had better know what it is doing :)

Thanks,

        tglx
