From: Vlastimil Babka <vbabka@suse.cz>
To: Christoph Lameter <cl@linux.com>
Cc: Michal Hocko <mhocko@kernel.org>,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	cgroups@vger.kernel.org, Li Zefan <lizefan@huawei.com>,
	Mel Gorman <mgorman@techsingularity.net>,
	David Rientjes <rientjes@google.com>,
	Hugh Dickins <hughd@google.com>,
	Andrea Arcangeli <aarcange@redhat.com>,
	Anshuman Khandual <khandual@linux.vnet.ibm.com>,
	"Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>,
	linux-api@vger.kernel.org
Subject: Re: [RFC 1/6] mm, page_alloc: fix more premature OOM due to race with cpuset update
Date: Fri, 19 May 2017 13:27:56 +0200
Message-ID: <4bdfa99a-d241-131e-40a3-67b030803b0e@suse.cz>
In-Reply-To: <alpine.DEB.2.20.1705181158250.27641@east.gentwo.org>

On 05/18/2017 07:07 PM, Christoph Lameter wrote:
> On Thu, 18 May 2017, Vlastimil Babka wrote:
> 
>>> The race is where? If you expand the node set during the move of the
>>> application then you are safe in terms of the legacy apps that did not
>>> include static bindings.
>>
>> No, that expand/shrink by itself doesn't work against parallel
> 
> Parallel? I think we are clear that this is inherently racy against the
> app changing policies etc etc? There is a huge issue there already. The
> app needs to be well behaved in some heretofore undefined way in order to
> make moves clean.

The code is safe against mbind() changing a vma's mempolicy in parallel
with another thread page faulting within that vma, because mbind() takes
mmap_sem for write and page faults take it for read. The per-task
mempolicy is changed by the task's own set_mempolicy() call, which means
the task itself cannot be allocating in parallel.
So the application never needed to be "well behaved" with respect to
changing its own mempolicies.

Now with mempolicy rebinding due to cpuset migrations, the application
cannot be "well behaved", as it has no way to learn that it is in a
cpuset or that the cpuset has changed. Any application can be put in a
cpuset, and we can't really expect all of them to be adapted, even if the
necessary interfaces existed. Thus, the rebinding implementation in the
kernel itself has to be robust against parallel allocations.

>> get_page_from_freelist going through a zonelist. Moving from node 0 to
>> 1, with zonelist containing nodes 1 and 0 in that order:
>>
>> - mempolicy mask is 0
>> - zonelist iteration checks node 1, it's not allowed, skip
> 
> There is an allocation from node 1?

Sorry, I forgot to mention the full scenario. Let's say the allocation
is on a cpu local to node 1, so it gets the zonelist of node 1, which
contains nodes 1 and 0 in that order.

> This is not allowed before the move.
> So it should fail. Not skipping to another node.
> 
>> - mempolicy mask is 0,1 (expand)
>> - mempolicy mask is 1 (shrink)
>> - zonelist iteration checks node 0, it's not allowed, skip
>> - OOM
> 
> Are you talking about a race here between zonelist scanning and the
> moving? That has been there forever.

As far as I can tell from my git archeology in [1] there was always some
kind of protection against the race (generation counters, two-step
protocol, seqlock...), which however had some corner cases. This patch
is merely plugging the last known one.
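
To make the scenario above concrete, here is a minimal userspace sketch
of the race and of a sequence-counter style retry. It is only a
single-threaded model under stated assumptions, not the code from the
patch: task_model, set_mask(), scan_zonelist(), alloc_with_retry() and
migrate_0_to_1() are all hypothetical names, and the "concurrent" cpuset
update is replayed by a callback invoked between zonelist steps.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model; none of these names are kernel APIs. */
enum { MAX_NODES = 2 };

struct task_model {
	unsigned int mems_allowed;	/* bitmask of allowed nodes */
	unsigned int seq;		/* bumped on every nodemask update */
};

/* Stands in for a concurrent cpuset update between zonelist steps. */
typedef void (*interleave_fn)(struct task_model *t, size_t step);

/*
 * Writer side. A real seqcount also needs memory barriers and an
 * "update in progress" state; this single-threaded model does not.
 */
static void set_mask(struct task_model *t, unsigned int mask)
{
	t->mems_allowed = mask;
	t->seq++;
}

/* One pass over the zonelist; returns the node used, or -1 on failure. */
static int scan_zonelist(struct task_model *t, const int *zonelist,
			 size_t n, interleave_fn racer)
{
	for (size_t i = 0; i < n; i++) {
		int node = zonelist[i];

		assert(node >= 0 && node < MAX_NODES);
		if (t->mems_allowed & (1u << node))
			return node;
		if (racer)
			racer(t, i);	/* update lands after this check */
	}
	return -1;
}

/* Reader side: a failed scan only counts if the mask did not change. */
static int alloc_with_retry(struct task_model *t, const int *zonelist,
			    size_t n, interleave_fn racer)
{
	unsigned int seq;
	int node;

	do {
		seq = t->seq;
		node = scan_zonelist(t, zonelist, n, racer);
	} while (node < 0 && seq != t->seq);

	return node;
}

/* Replays "move from node 0 to node 1" right after the first check. */
static void migrate_0_to_1(struct task_model *t, size_t step)
{
	if (step == 0 && t->mems_allowed == 0x1) {
		set_mask(t, 0x3);	/* expand to nodes 0,1 */
		set_mask(t, 0x2);	/* shrink to node 1 */
	}
}
```

With mems_allowed starting at node 0 only and the zonelist { 1, 0 }, a
plain scan_zonelist() call with migrate_0_to_1 as the callback skips both
nodes and fails, which is the premature OOM. alloc_with_retry() on the
same input notices that the counter changed during the failed scan,
rescans, and allocates from node 1.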

> And frankly there are gazillions of these races.

I'm not aware of any other existing race that we don't handle after
this patch.

> The best thing to do is
> to get the cpuset moving logic out of the kernel and into user space.
> 
> Understand that this is a heuristic and maybe come up with a list of
> restrictions that make an app safe. A safe app that can be moved must, e.g.:
> 
> 1. Not allocate new memory while it's being moved
> 2. Not change memory policies after its initialization and while it's being
> moved.

As I explained earlier in this mail, changing the mempolicy by the app
itself is safe; the problem was always due to cpuset-triggered rebinding.

> 3. Not save memory policy state in some variable (because the logic to
> translate the memory policies for the new context cannot find it).
> 
> ...
> 
> Again cpuset process migration  is a huge mess that you do not want to
> have in the kernel and AFAICT this is a corner case with difficult
> semantics. Better have that in user space...

Moving this out of the kernel would change the current semantics and
break existing userspace. This patch is a fix within the existing
semantics.

[1] https://marc.info/?l=linux-mm&m=148611344511408&w=2
