linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Anshuman Khandual <khandual@linux.vnet.ibm.com>
To: Vlastimil Babka <vbabka@suse.cz>, linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org, cgroups@vger.kernel.org,
	Li Zefan <lizefan@huawei.com>, Michal Hocko <mhocko@kernel.org>,
	Mel Gorman <mgorman@techsingularity.net>,
	David Rientjes <rientjes@google.com>,
	Christoph Lameter <cl@linux.com>, Hugh Dickins <hughd@google.com>,
	Andrea Arcangeli <aarcange@redhat.com>,
	Anshuman Khandual <khandual@linux.vnet.ibm.com>,
	"Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Subject: Re: [RFC 1/6] mm, page_alloc: fix more premature OOM due to race with cpuset update
Date: Thu, 13 Apr 2017 11:12:16 +0530	[thread overview]
Message-ID: <95469f35-56e9-7dc4-b7fd-a3e8c25bdff3@linux.vnet.ibm.com> (raw)
In-Reply-To: <20170411140609.3787-2-vbabka@suse.cz>

On 04/11/2017 07:36 PM, Vlastimil Babka wrote:
> Commit e47483bca2cc ("mm, page_alloc: fix premature OOM when racing with cpuset
> mems update") has fixed known recent regressions found by LTP's cpuset01
> testcase. I have however found that by modifying the testcase to use per-vma
> mempolicies via bind(2) instead of per-task mempolicies via set_mempolicy(2),
> the premature OOM still happens and the issue is much older.

Meanwhile while we are discussing this RFC, will it be better to WARN
out these situations where we dont have node in the intersection,
hence no usable zone during allocation. That might actually give
a hint to the user before a premature OOM/allocation failure comes.

> 
> The root of the problem is that the cpuset's mems_allowed and mempolicy's
> nodemask can temporarily have no intersection, thus get_page_from_freelist()
> cannot find any usable zone. The current semantic for empty intersection is to
> ignore mempolicy's nodemask and honour cpuset restrictions. This is checked in
> node_zonelist(), but the racy update can happen after we already passed the
> check. Such races should be protected by the seqlock task->mems_allowed_seq,
> but it doesn't work here, because 1) mpol_rebind_mm() does not happen under
> seqlock for write, and doing so would lead to deadlock, as it takes mmap_sem
> for write, while the allocation can have mmap_sem for read when it's taking the
> seqlock for read. And 2) the seqlock cookie of callers of node_zonelist()
> (alloc_pages_vma() and alloc_pages_current()) is different than the one of
> __alloc_pages_slowpath(), so there's still a potential race window.
> 
> This patch fixes the issue by having __alloc_pages_slowpath() check for empty
> intersection of cpuset and ac->nodemask before OOM or allocation failure. If
> it's indeed empty, the nodemask is ignored and allocation retried, which mimics
> node_zonelist(). This works fine, because almost all callers of
> __alloc_pages_nodemask are obtaining the nodemask via node_zonelist(). The only
> exception is new_node_page() from hotplug, where the potential violation of
> nodemask isn't an issue, as there's already a fallback allocation attempt
> without any nodemask. If there's a future caller that needs to have its specific
> nodemask honoured over task's cpuset restrictions, we'll have to e.g. add a gfp
> flag for that.

Did you really mean node_zonelist() in both the instances above. Because
that function just picks up either FALLBACK_ZONELIST or NOFALLBACK_ZONELIST
depending upon the passed GFP flags in the allocation request and does not
deal with ignoring the passed nodemask.

  parent reply	other threads:[~2017-04-13  5:43 UTC|newest]

Thread overview: 39+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2017-04-11 14:06 [RFC 0/6] cpuset/mempolicies related fixes and cleanups Vlastimil Babka
2017-04-11 14:06 ` [RFC 1/6] mm, page_alloc: fix more premature OOM due to race with cpuset update Vlastimil Babka
2017-04-11 17:24   ` Christoph Lameter
2017-04-11 19:00     ` Vlastimil Babka
2017-04-12 21:25       ` Christoph Lameter
2017-04-13  6:24         ` Vlastimil Babka
2017-04-14 20:37           ` Christoph Lameter
2017-04-26  8:07             ` Vlastimil Babka
2017-04-30 21:33               ` Christoph Lameter
2017-05-17  9:20                 ` Michal Hocko
2017-05-17 13:56                   ` Christoph Lameter
2017-05-17 14:05                     ` Michal Hocko
2017-05-17 14:48                       ` Christoph Lameter
2017-05-17 14:56                         ` Michal Hocko
2017-05-17 15:25                           ` Christoph Lameter
2017-05-18  9:08                             ` Michal Hocko
2017-05-18 16:57                               ` Christoph Lameter
2017-05-18 17:24                                 ` Michal Hocko
2017-05-18 19:07                                   ` Christoph Lameter
2017-05-19  7:37                                     ` Michal Hocko
2017-05-17 15:27                           ` Christoph Lameter
2017-05-18 10:03                         ` Vlastimil Babka
2017-05-18 17:07                           ` Christoph Lameter
2017-05-19 11:27                             ` Vlastimil Babka
2017-04-13  5:42   ` Anshuman Khandual [this message]
2017-04-13  6:06     ` Vlastimil Babka
2017-04-13  6:07       ` Vlastimil Babka
2017-04-11 14:06 ` [RFC 2/6] mm, mempolicy: stop adjusting current->il_next in mpol_rebind_nodemask() Vlastimil Babka
2017-04-11 17:32   ` Christoph Lameter
2017-04-11 19:03     ` Vlastimil Babka
2017-04-12  8:49       ` Vlastimil Babka
2017-04-12 21:16         ` Christoph Lameter
2017-04-12 21:18           ` Vlastimil Babka
2017-04-11 14:06 ` [RFC 3/6] mm, page_alloc: pass preferred nid instead of zonelist to allocator Vlastimil Babka
2017-04-11 14:06 ` [RFC 4/6] mm, mempolicy: simplify rebinding mempolicies when updating cpusets Vlastimil Babka
2017-04-11 14:06 ` [RFC 5/6] mm, cpuset: always use seqlock when changing task's nodemask Vlastimil Babka
2017-04-12  8:10   ` Hillf Danton
2017-04-12  8:18     ` Vlastimil Babka
2017-04-11 14:06 ` [RFC 6/6] mm, mempolicy: don't check cpuset seqlock where it doesn't matter Vlastimil Babka

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=95469f35-56e9-7dc4-b7fd-a3e8c25bdff3@linux.vnet.ibm.com \
    --to=khandual@linux.vnet.ibm.com \
    --cc=aarcange@redhat.com \
    --cc=cgroups@vger.kernel.org \
    --cc=cl@linux.com \
    --cc=hughd@google.com \
    --cc=kirill.shutemov@linux.intel.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=lizefan@huawei.com \
    --cc=mgorman@techsingularity.net \
    --cc=mhocko@kernel.org \
    --cc=rientjes@google.com \
    --cc=vbabka@suse.cz \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).