From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1752955AbdDKOGY (ORCPT );
	Tue, 11 Apr 2017 10:06:24 -0400
Received: from mx2.suse.de ([195.135.220.15]:49189 "EHLO mx2.suse.de"
	rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
	id S1751995AbdDKOGR (ORCPT );
	Tue, 11 Apr 2017 10:06:17 -0400
From: Vlastimil Babka
To: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org, cgroups@vger.kernel.org, Li Zefan,
	Michal Hocko, Mel Gorman, David Rientjes, Christoph Lameter,
	Hugh Dickins, Andrea Arcangeli, Anshuman Khandual,
	"Kirill A. Shutemov", Vlastimil Babka
Subject: [RFC 0/6] cpuset/mempolicies related fixes and cleanups
Date: Tue, 11 Apr 2017 16:06:03 +0200
Message-Id: <20170411140609.3787-1-vbabka@suse.cz>
X-Mailer: git-send-email 2.12.2
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

I've recently summarized the cpuset/mempolicy issues in an LSF/MM proposal
[1] and the discussion itself [2]. I've been trying to rewrite the handling
as proposed, with the idea that changing the semantics to make all
mempolicies static wrt cpuset updates (and discarding the relative and
default modes) can be tried on top, as there's a high risk of the change
being rejected/reverted because somebody might still care about the removed
modes.

However, I haven't yet figured out how to properly:

1) make mempolicies swappable instead of rebinding them in place. I thought
mbind() already works that way and uses refcounting to avoid use-after-free
of the old policy by a parallel allocation, but it turns out true
refcounting is only done for shared (shmem) mempolicies, and the actual
protection for mbind() comes from mmap_sem. Extending the refcounting means
more overhead in the allocator hot path.
Also, swapping whole mempolicies means that we have to allocate the new
ones, which can fail, and reverting the partially done work also means
allocating (note that mbind() doesn't care and will just leave part of the
range updated and part not updated when returning -ENOMEM...).

2) make cpuset's task->mems_allowed also swappable (after converting it
from a nodemask to a zonelist, which is the easy part), for mostly the same
reasons.

The good news is that while trying to do the above, I've at least figured
out how to hopefully close the remaining premature OOMs, and do a bunch of
cleanups on top, removing quite some of the code that was also supposed to
prevent the cpuset update races, but doesn't work anymore nowadays. This
should fix the most pressing concerns with this topic and give us a better
baseline before either proceeding with the original proposal, or pushing a
change of semantics that removes problem 1) above. I'd then be fine with
trying to change the semantics first and rewrite later.

The patchset is based on next-20170411 and has been tested with the LTP
cpuset01 stress test.

[1] https://lkml.kernel.org/r/4c44a589-5fd8-08d0-892c-e893bb525b71@suse.cz
[2] https://lwn.net/Articles/717797/

Vlastimil Babka (6):
  mm, page_alloc: fix more premature OOM due to race with cpuset update
  mm, mempolicy: stop adjusting current->il_next in mpol_rebind_nodemask()
  mm, page_alloc: pass preferred nid instead of zonelist to allocator
  mm, mempolicy: simplify rebinding mempolicies when updating cpusets
  mm, cpuset: always use seqlock when changing task's nodemask
  mm, mempolicy: don't check cpuset seqlock where it doesn't matter

 include/linux/gfp.h            |  11 ++-
 include/linux/mempolicy.h      |  12 ++-
 include/uapi/linux/mempolicy.h |   8 --
 kernel/cgroup/cpuset.c         |  33 ++-------
 mm/hugetlb.c                   |  15 ++--
 mm/memory_hotplug.c            |   6 +-
 mm/mempolicy.c                 | 165 +++++++++--------------------------------
 mm/page_alloc.c                |  61 ++++++++++-----
 8 files changed, 109 insertions(+), 202 deletions(-)

-- 
2.12.2