linux-api.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Michal Hocko <mhocko-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
To: Vlastimil Babka <vbabka-AlSwsSmVLrQ@public.gmane.org>
Cc: Andrew Morton
	<akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org,
	linux-api-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	cgroups-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	Li Zefan <lizefan-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>,
	Mel Gorman
	<mgorman-3eNAlZScCAx27rWaFMvyedHuzzzSOjJt@public.gmane.org>,
	David Rientjes <rientjes-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>,
	Christoph Lameter <cl-vYTEC60ixJUAvxtiuMwx3w@public.gmane.org>,
	Hugh Dickins <hughd-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>,
	Andrea Arcangeli
	<aarcange-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>,
	Anshuman Khandual
	<khandual-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>,
	"Kirill A. Shutemov"
	<kirill.shutemov-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>
Subject: Re: [PATCH v2 0/6] cpuset/mempolicies related fixes and cleanups
Date: Wed, 17 May 2017 18:24:10 +0200	[thread overview]
Message-ID: <20170517162409.GC20660@dhcp22.suse.cz> (raw)
In-Reply-To: <20170517081140.30654-1-vbabka-AlSwsSmVLrQ@public.gmane.org>

On Wed 17-05-17 10:11:34, Vlastimil Babka wrote:
> Changes since RFC v1 [3]:
> 
> - Reworked patch 2 after discussion with Christoph Lameter.
> - Fix bug in patch 5 spotted by Hillf Danton.
> - Rebased to mmotm-2017-05-12-15-53
> 
> I would like to stress that this patchset aims to fix issues and cleanup the
> code *within the existing documented semantics*, i.e. patch 1 ignores mempolicy
> restrictions if the set of allowed nodes has no intersection with set of nodes
> allowed by cpuset. I believe discussing potential changes of the semantics can
> be better done once we have a baseline with no known bugs of the current
> semantics.
> 
> ===
> 
> I've recently summarized the cpuset/mempolicy issues in a LSF/MM proposal [1]
> and the discussion itself [2]. I've been trying to rewrite the handling as
> proposed, with the idea that changing semantics to make all mempolicies static
> wrt cpuset updates (and discarding the relative and default modes) can be tried
> on top, as there's a high risk of being rejected/reverted because somebody
> might still care about the removed modes.
> 
> However I haven't yet figured out how to properly:
> 
> 1) make mempolicies swappable instead of rebinding in place. I thought mbind()
> already works that way and uses refcounting to avoid use-after-free of the old
> policy by a parallel allocation, but turns out true refcounting is only done
> for shared (shmem) mempolicies, and the actual protection for mbind() comes
> from mmap_sem. Extending the refcounting means more overhead in allocator hot
> path.

But that overhead would be there only if any policy is in place, right?
Do you think it would be possible to use per cpu ref counting and thus
reduce the overhead? We use this trick in the css ref. counting in the
memcg and the overhead is not measurable. It would be certainly better
to use the same mechanism for shared and anon mappings whenever
possible.

> Also swapping whole mempolicies means that we have to allocate the new
> ones, which can fail, and reverting of the partially done work also means
> allocating (note that mbind() doesn't care and will just leave part of the
> range updated and part not updated when returning -ENOMEM...).

This is nasty but we already have to deal with half-the-way
initialization already so why cannot we use the same approach? I am
sorry if I am asking something trivial but I have already flushed all
details from memory so have to re-read it again.

> 2) make cpuset's task->mems_allowed also swappable (after converting it from
> nodemask to zonelist, which is the easy part) for mostly the same reasons.
> 
> The good news is that while trying to do the above, I've at least figured out
> how to hopefully close the remaining premature OOM's, and do a buch of cleanups
> on top, removing quite some of the code that was also supposed to prevent the
> cpuset update races, but doesn't work anymore nowadays. This should fix the
> most pressing concerns with this topic and give us a better baseline before
> either proceeding with the original proposal, or pushing a change of semantics
> that removes the problem 1) above. I'd be then fine with trying to change the
> semantic first and rewrite later.

the diffstat definitely looks promissing. I will have a look tomorrow
(unless something unexpected jums in again).

> Patchset is based on next-20170411 and has been tested with the LTP cpuset01
> stress test.
> 
> [1] https://lkml.kernel.org/r/4c44a589-5fd8-08d0-892c-e893bb525b71-AlSwsSmVLrQ@public.gmane.org
> [2] https://lwn.net/Articles/717797/
> [3] https://marc.info/?l=linux-mm&m=149191957922828&w=2
> 
> Vlastimil Babka (6):
>   mm, page_alloc: fix more premature OOM due to race with cpuset update
>   mm, mempolicy: stop adjusting current->il_next in
>     mpol_rebind_nodemask()
>   mm, page_alloc: pass preferred nid instead of zonelist to allocator
>   mm, mempolicy: simplify rebinding mempolicies when updating cpusets
>   mm, cpuset: always use seqlock when changing task's nodemask
>   mm, mempolicy: don't check cpuset seqlock where it doesn't matter
> 
>  include/linux/gfp.h            |  11 ++-
>  include/linux/mempolicy.h      |  12 ++-
>  include/linux/sched.h          |   2 +-
>  include/uapi/linux/mempolicy.h |   8 --
>  kernel/cgroup/cpuset.c         |  33 ++------
>  mm/hugetlb.c                   |  15 ++--
>  mm/memory_hotplug.c            |   6 +-
>  mm/mempolicy.c                 | 181 ++++++++++-------------------------------
>  mm/page_alloc.c                |  61 ++++++++++----
>  9 files changed, 118 insertions(+), 211 deletions(-)
> 
> -- 
> 2.12.2

-- 
Michal Hocko
SUSE Labs

      parent reply	other threads:[~2017-05-17 16:24 UTC|newest]

Thread overview: 17+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2017-05-17  8:11 [PATCH v2 0/6] cpuset/mempolicies related fixes and cleanups Vlastimil Babka
2017-05-17  8:11 ` [PATCH v2 1/6] mm, page_alloc: fix more premature OOM due to race with cpuset update Vlastimil Babka
     [not found]   ` <20170517081140.30654-2-vbabka-AlSwsSmVLrQ@public.gmane.org>
2017-05-19 11:51     ` Michal Hocko
     [not found]       ` <20170519115110.GB29839-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
2017-05-23 11:32         ` Vlastimil Babka
2017-05-17  8:11 ` [PATCH v2 2/6] mm, mempolicy: stop adjusting current->il_next in mpol_rebind_nodemask() Vlastimil Babka
     [not found]   ` <20170517081140.30654-3-vbabka-AlSwsSmVLrQ@public.gmane.org>
2017-05-17 15:07     ` Christoph Lameter
2017-05-17  8:11 ` [PATCH v2 3/6] mm, page_alloc: pass preferred nid instead of zonelist to allocator Vlastimil Babka
2017-05-17 15:19   ` Christoph Lameter
     [not found]     ` <alpine.DEB.2.20.1705171009340.8714-wcBtFHqTun5QOdAKl3ChDw@public.gmane.org>
2017-05-18 10:25       ` Vlastimil Babka
     [not found]   ` <20170517081140.30654-4-vbabka-AlSwsSmVLrQ@public.gmane.org>
2017-05-19 11:59     ` Michal Hocko
2017-05-17  8:11 ` [PATCH v2 4/6] mm, mempolicy: simplify rebinding mempolicies when updating cpusets Vlastimil Babka
2017-05-23  7:11   ` Michal Hocko
2017-05-17  8:11 ` [PATCH v2 5/6] mm, cpuset: always use seqlock when changing task's nodemask Vlastimil Babka
     [not found]   ` <20170517081140.30654-6-vbabka-AlSwsSmVLrQ@public.gmane.org>
2017-05-23  7:15     ` Michal Hocko
2017-05-17  8:11 ` [PATCH v2 6/6] mm, mempolicy: don't check cpuset seqlock where it doesn't matter Vlastimil Babka
2017-05-23  7:16   ` Michal Hocko
     [not found] ` <20170517081140.30654-1-vbabka-AlSwsSmVLrQ@public.gmane.org>
2017-05-17 16:24   ` Michal Hocko [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20170517162409.GC20660@dhcp22.suse.cz \
    --to=mhocko-dgejt+ai2ygdnm+yrofe0a@public.gmane.org \
    --cc=aarcange-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org \
    --cc=akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org \
    --cc=cgroups-u79uwXL29TY76Z2rM5mHXA@public.gmane.org \
    --cc=cl-vYTEC60ixJUAvxtiuMwx3w@public.gmane.org \
    --cc=hughd-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org \
    --cc=khandual-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org \
    --cc=kirill.shutemov-VuQAYsv1563Yd54FQh9/CA@public.gmane.org \
    --cc=linux-api-u79uwXL29TY76Z2rM5mHXA@public.gmane.org \
    --cc=linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org \
    --cc=linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org \
    --cc=lizefan-hv44wF8Li93QT0dZR+AlfA@public.gmane.org \
    --cc=mgorman-3eNAlZScCAx27rWaFMvyedHuzzzSOjJt@public.gmane.org \
    --cc=rientjes-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org \
    --cc=vbabka-AlSwsSmVLrQ@public.gmane.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).