* [RFC v4 0/3] mm: zap pages with read mmap_sem in munmap for large mapping
@ 2018-07-10 23:34 Yang Shi
  2018-07-10 23:34 ` [RFC v4 PATCH 1/3] mm: introduce VM_DEAD flag and extend check_stable_address_space to check it Yang Shi
                   ` (4 more replies)
  0 siblings, 5 replies; 16+ messages in thread
From: Yang Shi @ 2018-07-10 23:34 UTC (permalink / raw)
  To: mhocko, willy, ldufour, kirill, akpm; +Cc: yang.shi, linux-mm, linux-kernel


Background:
Recently, when we ran some vm scalability tests on machines with large memory,
we ran into a couple of mmap_sem scalability issues when unmapping a large memory
space; please refer to https://lkml.org/lkml/2017/12/14/733 and
https://lkml.org/lkml/2018/2/20/576.


History:
Then akpm suggested unmapping the large mapping section by section, dropping and
re-acquiring mmap_sem between sections, to mitigate it (see https://lkml.org/lkml/2018/3/6/784).

The v1 patch series was submitted to the mailing list per Andrew's suggestion
(see https://lkml.org/lkml/2018/3/20/786), and I received a lot of great feedback
and suggestions.

This topic was then discussed at the LSF/MM summit 2018, where Michal Hocko
suggested (as he also did in the v1 review) trying a "two phases" approach:
zap pages with read mmap_sem held, then do the cleanup with write mmap_sem
(for the discussion details, see https://lwn.net/Articles/753269/).


Approach:
Zapping pages is the most time-consuming part. According to the suggestion from
Michal Hocko [1], zapping pages can be done while holding read mmap_sem, like
what MADV_DONTNEED does. Then re-acquire write mmap_sem to clean up vmas.

But we can't just call MADV_DONTNEED directly, since it has two major drawbacks:
  * The unexpected state seen by a page fault (PF) that wins the race in the
    middle of munmap: it may return a zero page instead of the original content
    or SIGSEGV.
  * It can't handle VM_LOCKED | VM_HUGETLB | VM_PFNMAP and uprobe mappings,
    which akpm considers a showstopper.
A userspace sketch of where that race window sits is shown below.
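For reference, a minimal userspace sketch of the "zap first, then unmap" idea
(illustrative only, not part of the patches; the helper name is made up) and
where the race window sits:

	#include <sys/mman.h>

	/* Hypothetical library helper: zap page contents before unmapping. */
	static int munmap_large(void *addr, size_t len)
	{
		/* madvise(MADV_DONTNEED) only takes mmap_sem for read */
		if (madvise(addr, len, MADV_DONTNEED))
			return -1;
		/*
		 * Race window: another thread faulting on this range here gets
		 * zero pages instead of the old content or SIGSEGV.
		 */
		return munmap(addr, len);	/* takes mmap_sem for write */
	}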

And some parts may need write mmap_sem, for example vma splitting. So the
design is as follows:
        acquire write mmap_sem
        lookup vmas (find and split vmas)
        set VM_DEAD flags
        deal with special mappings
        downgrade_write

        zap pages
        release mmap_sem

        retake mmap_sem exclusively
        cleanup vmas
        release mmap_sem

Define the large-mapping size threshold as PUD size: pages are zapped with read
mmap_sem only for mappings which are >= PUD_SIZE. Unmapping an area smaller than
PUD_SIZE still goes through the regular path.
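For scale (assuming x86_64 with 4KB pages, where PUD_SIZE is 1GB), the gate in
patch 3 below is simply:

	/* illustrative: only sufficiently large unmaps take the early-zap path */
	if (len >= PUD_SIZE) {
		/* split vmas, set VM_DEAD, downgrade_write(), zap, up_read() */
	}
	/* then fall through to the regular do_munmap() cleanup */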

All vmas which will be zapped soon have the VM_DEAD flag set. A page fault may
race with munmap: before the optimization it would return either the right
content or SIGSEGV, but with the optimization it could return a zero page. This
flag marks page faults to the area as unstable and makes them trigger SIGSEGV,
in order to prevent that unexpected third state.
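For illustration, the fault-path check added in patch 1 boils down to:

	/* simplified from check_stable_address_space() in patch 1 */
	if (vma->vm_flags & VM_DEAD)
		return VM_FAULT_SIGSEGV;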

Vmas with VM_LOCKED | VM_HUGETLB | VM_PFNMAP or uprobes are considered special
mappings. They are dealt with before zapping pages, while write mmap_sem is
still held; basically this just updates vm_flags. The actual unmapping is still
done with read mmap_sem.

And since those flags are also manipulated by unmap_single_vma(), which in this
case is called by zap_page_range() with read mmap_sem held, updating vm_flags
inside the read critical section must be avoided. To keep the code simple,
unmap_single_vma() just checks whether VM_DEAD is set and skips that handling
for any VM_DEAD area, since it has already been done beforehand.

When cleaning up vmas, just call do_munmap() without carrying the vmas over
from the zap phase, to avoid races: the address space might already have changed
under our feet after retaking the exclusive lock.
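Simplified from patch 3 below, the tail of the early-zap path is just:

	/* retake the lock and redo the lookup; do_munmap() copes with any changes */
	if (down_write_killable(&mm->mmap_sem))
		return -EINTR;
	ret = do_munmap(mm, start, len, uf);
	up_write(&mm->mmap_sem);
	return ret;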

For the time being, just do this in the munmap syscall path. Other vm_munmap()
or do_munmap() call sites (i.e. mmap, mremap, etc.) remain intact for stability
reasons. And make this 64-bit only explicitly, per akpm's suggestion.

Changelog:
v3 -> v4:
* Extend check_stable_address_space to check VM_DEAD as Michal suggested
* Deal with vm_flags update of VM_LOCKED | VM_HUGETLB | VM_PFNMAP and uprobe
  mappings with exclusive lock held. The actual unmapping is still done with read
  mmap_sem to solve akpm's concern
* Clean up vmas by calling do_munmap() so no vmas are carried over, preventing
  the race condition, as Kirill suggested
* Extracted more common code
* Addressed some code cleanup comments from akpm
* Dropped uprobe and arch specific code, now all the changes are mm only
* Still keep the PUD_SIZE threshold; if everyone thinks it is better to extend
  it to all sizes or a smaller size, I will remove it
* Make this optimization 64 bit only explicitly per akpm's suggestion

v2 -> v3:
* Refactor do_munmap code to extract the common part per Peter's suggestion
* Introduced VM_DEAD flag per Michal's suggestion. Just handled VM_DEAD in 
  x86's page fault handler for the time being. Other architectures will be covered
  once the patch series is reviewed
* Now lookup vma (find and split) and set VM_DEAD flag with write mmap_sem, then
  zap mapping with read mmap_sem, then clean up pgtables and vmas with write
  mmap_sem per Peter's suggestion

v1 -> v2:
* Re-implemented the code per the discussion on LSFMM summit

 
Regression and performance data:
Ran the regression tests below with the threshold manually set to 4K in the code:
  * Full LTP
  * Trinity (munmap/all vm syscalls)
  * Stress-ng mmap
  * mm-tests: kernbench, phpbench, sysbench-mariadb, will-it-scale
  * vm-scalability

With the patches, the exclusive mmap_sem hold time when munmapping an 80GB
address space on a machine with 32 cores of E5-2680 @ 2.70GHz dropped from
seconds to the microsecond level.

                w/o             w/
do_munmap    2165433 us      35148.923 us
SyS_munmap   2165369 us      2166535 us


Yang Shi (3):
      mm: introduce VM_DEAD flag and extend check_stable_address_space to check it
      mm: refactor do_munmap() to extract the common part
      mm: mmap: zap pages with read mmap_sem for large mapping

 include/linux/mm.h  |   8 +++
 include/linux/oom.h |  20 -------
 mm/huge_memory.c    |   4 +-
 mm/hugetlb.c        |   5 ++
 mm/memory.c         |  57 ++++++++++++++++---
 mm/mmap.c           | 221 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-------------
 mm/shmem.c          |   9 ++-
 7 files changed, 255 insertions(+), 69 deletions(-)


* [RFC v4 PATCH 1/3] mm: introduce VM_DEAD flag and extend check_stable_address_space to check it
  2018-07-10 23:34 [RFC v4 0/3] mm: zap pages with read mmap_sem in munmap for large mapping Yang Shi
@ 2018-07-10 23:34 ` Yang Shi
  2018-07-10 23:34 ` [RFC v4 PATCH 2/3] mm: refactor do_munmap() to extract the common part Yang Shi
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 16+ messages in thread
From: Yang Shi @ 2018-07-10 23:34 UTC (permalink / raw)
  To: mhocko, willy, ldufour, kirill, akpm; +Cc: yang.shi, linux-mm, linux-kernel

The VM_DEAD flag is used to mark a vma that is being unmapped, for the
munmap large-address-space optimization added later in this series. Before the
optimization, a page fault racing with munmap would return either the right
content or SIGSEGV; with the optimization, it may return a zero page.

Use this flag to mark page faults to the area as unstable and make them trigger
SIGSEGV, in order to prevent that third state.

This flag will be set by the optimization for unmapping large address
spaces (>= 1GB) in the later patch. It is 64-bit only at the moment, since:
  * we have used up the vm_flags bits on 32 bit
  * 32-bit machines typically will not have such large mappings

Extend check_stable_address_space() to check this flag, and call it from the
page fault paths of shmem and hugetlb as well.

Since the oom reaper doesn't tear down shmem and hugetlb mappings, skip those
two cases for the MMF_UNSTABLE check.

Suggested-by: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
---
 include/linux/mm.h  |  8 ++++++++
 include/linux/oom.h | 20 --------------------
 mm/huge_memory.c    |  4 ++--
 mm/hugetlb.c        |  5 +++++
 mm/memory.c         | 39 +++++++++++++++++++++++++++++++++++----
 mm/shmem.c          |  9 ++++++++-
 6 files changed, 58 insertions(+), 27 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index a0fbb9f..ce7b112 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -242,6 +242,12 @@ extern int overcommit_kbytes_handler(struct ctl_table *, int, void __user *,
 #endif
 #endif /* CONFIG_ARCH_HAS_PKEYS */
 
+#ifdef CONFIG_64BIT
+#define VM_DEAD			BIT(37)	/* bit only usable on 64 bit kernel */
+#else
+#define VM_DEAD			0
+#endif
+
 #if defined(CONFIG_X86)
 # define VM_PAT		VM_ARCH_1	/* PAT reserves whole VMA at once (x86) */
 #elif defined(CONFIG_PPC)
@@ -2782,5 +2788,7 @@ static inline bool page_is_guard(struct page *page)
 static inline void setup_nr_node_ids(void) {}
 #endif
 
+extern int check_stable_address_space(struct vm_area_struct *vma);
+
 #endif /* __KERNEL__ */
 #endif /* _LINUX_MM_H */
diff --git a/include/linux/oom.h b/include/linux/oom.h
index 6adac11..0265ed5 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -75,26 +75,6 @@ static inline bool mm_is_oom_victim(struct mm_struct *mm)
 	return test_bit(MMF_OOM_VICTIM, &mm->flags);
 }
 
-/*
- * Checks whether a page fault on the given mm is still reliable.
- * This is no longer true if the oom reaper started to reap the
- * address space which is reflected by MMF_UNSTABLE flag set in
- * the mm. At that moment any !shared mapping would lose the content
- * and could cause a memory corruption (zero pages instead of the
- * original content).
- *
- * User should call this before establishing a page table entry for
- * a !shared mapping and under the proper page table lock.
- *
- * Return 0 when the PF is safe VM_FAULT_SIGBUS otherwise.
- */
-static inline int check_stable_address_space(struct mm_struct *mm)
-{
-	if (unlikely(test_bit(MMF_UNSTABLE, &mm->flags)))
-		return VM_FAULT_SIGBUS;
-	return 0;
-}
-
 void __oom_reap_task_mm(struct mm_struct *mm);
 
 extern unsigned long oom_badness(struct task_struct *p,
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 1cd7c1a..997bac9 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -578,7 +578,7 @@ static int __do_huge_pmd_anonymous_page(struct vm_fault *vmf, struct page *page,
 	} else {
 		pmd_t entry;
 
-		ret = check_stable_address_space(vma->vm_mm);
+		ret = check_stable_address_space(vma);
 		if (ret)
 			goto unlock_release;
 
@@ -696,7 +696,7 @@ int do_huge_pmd_anonymous_page(struct vm_fault *vmf)
 		ret = 0;
 		set = false;
 		if (pmd_none(*vmf->pmd)) {
-			ret = check_stable_address_space(vma->vm_mm);
+			ret = check_stable_address_space(vma);
 			if (ret) {
 				spin_unlock(vmf->ptl);
 			} else if (userfaultfd_missing(vma)) {
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 3612fbb..8965d02 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3887,6 +3887,10 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	int need_wait_lock = 0;
 	unsigned long haddr = address & huge_page_mask(h);
 
+	ret = check_stable_address_space(vma);
+	if (ret)
+		goto out;
+
 	ptep = huge_pte_offset(mm, haddr, huge_page_size(h));
 	if (ptep) {
 		entry = huge_ptep_get(ptep);
@@ -4006,6 +4010,7 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	 */
 	if (need_wait_lock)
 		wait_on_page_locked(page);
+out:
 	return ret;
 }
 
diff --git a/mm/memory.c b/mm/memory.c
index 7206a63..250547f 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -68,7 +68,7 @@
 #include <linux/debugfs.h>
 #include <linux/userfaultfd_k.h>
 #include <linux/dax.h>
-#include <linux/oom.h>
+#include <linux/shmem_fs.h>
 
 #include <asm/io.h>
 #include <asm/mmu_context.h>
@@ -776,6 +776,37 @@ static void print_bad_pte(struct vm_area_struct *vma, unsigned long addr,
 }
 
 /*
+ * Checks whether a page fault on the given mm is still reliable.
+ * This is no longer true if the oom reaper started to reap the
+ * address space which is reflected by MMF_UNSTABLE flag set in
+ * the mm. At that moment any !shared mapping would lose the content
+ * and could cause a memory corruption (zero pages instead of the
+ * original content).
+ * oom reaper doesn't reap hugetlb and shmem, so skip the check for
+ * such vmas.
+ *
+ * And, check if the given vma has VM_DEAD flag set, which means
+ * the vma will be unmapped soon, PF is not safe for such vma.
+ *
+ * User should call this before establishing a page table entry for
+ * a !shared mapping (disk file based), or shmem mapping, or hugetlb
+ * mapping, and under the proper page table lock.
+ *
+ * Return 0 when the PF is safe, VM_FAULT_SIGBUS or VM_FAULT_SIGSEGV
+ * otherwise.
+ */
+int check_stable_address_space(struct vm_area_struct *vma)
+{
+	if (vma->vm_flags & VM_DEAD)
+		return VM_FAULT_SIGSEGV;
+	if (!is_vm_hugetlb_page(vma) && !shmem_file(vma->vm_file)) {
+		if (unlikely(test_bit(MMF_UNSTABLE, &vma->vm_mm->flags)))
+			return VM_FAULT_SIGBUS;
+	}
+	return 0;
+}
+
+/*
  * vm_normal_page -- This function gets the "struct page" associated with a pte.
  *
  * "Special" mappings do not wish to be associated with a "struct page" (either
@@ -3147,7 +3178,7 @@ static int do_anonymous_page(struct vm_fault *vmf)
 				vmf->address, &vmf->ptl);
 		if (!pte_none(*vmf->pte))
 			goto unlock;
-		ret = check_stable_address_space(vma->vm_mm);
+		ret = check_stable_address_space(vma);
 		if (ret)
 			goto unlock;
 		/* Deliver the page fault to userland, check inside PT lock */
@@ -3184,7 +3215,7 @@ static int do_anonymous_page(struct vm_fault *vmf)
 	if (!pte_none(*vmf->pte))
 		goto release;
 
-	ret = check_stable_address_space(vma->vm_mm);
+	ret = check_stable_address_space(vma);
 	if (ret)
 		goto release;
 
@@ -3495,7 +3526,7 @@ int finish_fault(struct vm_fault *vmf)
 	 * page
 	 */
 	if (!(vmf->vma->vm_flags & VM_SHARED))
-		ret = check_stable_address_space(vmf->vma->vm_mm);
+		ret = check_stable_address_space(vmf->vma);
 	if (!ret)
 		ret = alloc_set_pte(vmf, vmf->memcg, page);
 	if (vmf->pte)
diff --git a/mm/shmem.c b/mm/shmem.c
index 2cab844..9f9ac7c 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1953,7 +1953,13 @@ static vm_fault_t shmem_fault(struct vm_fault *vmf)
 	gfp_t gfp = mapping_gfp_mask(inode->i_mapping);
 	enum sgp_type sgp;
 	int err;
-	vm_fault_t ret = VM_FAULT_LOCKED;
+	vm_fault_t ret = 0;
+
+	ret = check_stable_address_space(vma);
+	if (ret)
+		goto out;
+
+	ret = VM_FAULT_LOCKED;
 
 	/*
 	 * Trinity finds that probing a hole which tmpfs is punching can
@@ -2025,6 +2031,7 @@ static vm_fault_t shmem_fault(struct vm_fault *vmf)
 				  gfp, vma, vmf, &ret);
 	if (err)
 		return vmf_error(err);
+out:
 	return ret;
 }
 
-- 
1.8.3.1



* [RFC v4 PATCH 2/3] mm: refactor do_munmap() to extract the common part
  2018-07-10 23:34 [RFC v4 0/3] mm: zap pages with read mmap_sem in munmap for large mapping Yang Shi
  2018-07-10 23:34 ` [RFC v4 PATCH 1/3] mm: introduce VM_DEAD flag and extend check_stable_address_space to check it Yang Shi
@ 2018-07-10 23:34 ` Yang Shi
  2018-07-10 23:34 ` [RFC v4 PATCH 3/3] mm: mmap: zap pages with read mmap_sem for large mapping Yang Shi
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 16+ messages in thread
From: Yang Shi @ 2018-07-10 23:34 UTC (permalink / raw)
  To: mhocko, willy, ldufour, kirill, akpm; +Cc: yang.shi, linux-mm, linux-kernel

Introduce three new helper functions:
  * munmap_addr_sanity()
  * munmap_lookup_vma()
  * munmap_mlock_vma()

They will be used by do_munmap() and by the new early-zap munmap path
added in the later patch.

There is no functional change, just code refactor.

Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
---
 mm/mmap.c | 120 ++++++++++++++++++++++++++++++++++++++++++--------------------
 1 file changed, 82 insertions(+), 38 deletions(-)

diff --git a/mm/mmap.c b/mm/mmap.c
index d1eb87e..2504094 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2686,34 +2686,44 @@ int split_vma(struct mm_struct *mm, struct vm_area_struct *vma,
 	return __split_vma(mm, vma, addr, new_below);
 }
 
-/* Munmap is split into 2 main parts -- this part which finds
- * what needs doing, and the areas themselves, which do the
- * work.  This now handles partial unmappings.
- * Jeremy Fitzhardinge <jeremy@goop.org>
- */
-int do_munmap(struct mm_struct *mm, unsigned long start, size_t len,
-	      struct list_head *uf)
+static inline bool munmap_addr_sanity(unsigned long start, size_t len)
 {
-	unsigned long end;
-	struct vm_area_struct *vma, *prev, *last;
-
 	if ((offset_in_page(start)) || start > TASK_SIZE || len > TASK_SIZE-start)
-		return -EINVAL;
+		return false;
 
-	len = PAGE_ALIGN(len);
-	if (len == 0)
-		return -EINVAL;
+	if (PAGE_ALIGN(len) == 0)
+		return false;
+
+	return true;
+}
+
+/*
+ * munmap_lookup_vma: find the first overlap vma and split overlap vmas.
+ * @mm: mm_struct
+ * @vma: the first overlapping vma
+ * @prev: vma's prev
+ * @start: start address
+ * @end: end address
+ *
+ * returns 1 if successful, 0 or errno otherwise
+ */
+static int munmap_lookup_vma(struct mm_struct *mm, struct vm_area_struct **vma,
+			     struct vm_area_struct **prev, unsigned long start,
+			     unsigned long end)
+{
+	struct vm_area_struct *tmp, *last;
 
 	/* Find the first overlapping VMA */
-	vma = find_vma(mm, start);
-	if (!vma)
+	tmp = find_vma(mm, start);
+	if (!tmp)
 		return 0;
-	prev = vma->vm_prev;
-	/* we have  start < vma->vm_end  */
+
+	*prev = tmp->vm_prev;
+
+	/* we have start < vma->vm_end  */
 
 	/* if it doesn't overlap, we have nothing.. */
-	end = start + len;
-	if (vma->vm_start >= end)
+	if (tmp->vm_start >= end)
 		return 0;
 
 	/*
@@ -2723,7 +2733,7 @@ int do_munmap(struct mm_struct *mm, unsigned long start, size_t len,
 	 * unmapped vm_area_struct will remain in use: so lower split_vma
 	 * places tmp vma above, and higher split_vma places tmp vma below.
 	 */
-	if (start > vma->vm_start) {
+	if (start > tmp->vm_start) {
 		int error;
 
 		/*
@@ -2731,13 +2741,14 @@ int do_munmap(struct mm_struct *mm, unsigned long start, size_t len,
 		 * not exceed its limit; but let map_count go just above
 		 * its limit temporarily, to help free resources as expected.
 		 */
-		if (end < vma->vm_end && mm->map_count >= sysctl_max_map_count)
+		if (end < tmp->vm_end &&
+		    mm->map_count > sysctl_max_map_count)
 			return -ENOMEM;
 
-		error = __split_vma(mm, vma, start, 0);
+		error = __split_vma(mm, tmp, start, 0);
 		if (error)
 			return error;
-		prev = vma;
+		*prev = tmp;
 	}
 
 	/* Does it split the last one? */
@@ -2747,7 +2758,48 @@ int do_munmap(struct mm_struct *mm, unsigned long start, size_t len,
 		if (error)
 			return error;
 	}
-	vma = prev ? prev->vm_next : mm->mmap;
+
+	*vma = *prev ? (*prev)->vm_next : mm->mmap;
+
+	return 1;
+}
+
+static inline void munmap_mlock_vma(struct vm_area_struct *vma,
+				    unsigned long end)
+{
+	struct vm_area_struct *tmp = vma;
+
+	while (tmp && tmp->vm_start < end) {
+		if (tmp->vm_flags & VM_LOCKED) {
+			vma->vm_mm->locked_vm -= vma_pages(tmp);
+			munlock_vma_pages_all(tmp);
+		}
+		tmp = tmp->vm_next;
+	}
+}
+
+/* Munmap is split into 2 main parts -- this part which finds
+ * what needs doing, and the areas themselves, which do the
+ * work.  This now handles partial unmappings.
+ * Jeremy Fitzhardinge <jeremy@goop.org>
+ */
+int do_munmap(struct mm_struct *mm, unsigned long start, size_t len,
+	      struct list_head *uf)
+{
+	unsigned long end;
+	struct vm_area_struct *vma = NULL, *prev;
+	int ret = 0;
+
+	if (!munmap_addr_sanity(start, len))
+		return -EINVAL;
+
+	len = PAGE_ALIGN(len);
+
+	end = start + len;
+
+	ret = munmap_lookup_vma(mm, &vma, &prev, start, end);
+	if (ret != 1)
+		return ret;
 
 	if (unlikely(uf)) {
 		/*
@@ -2759,24 +2811,16 @@ int do_munmap(struct mm_struct *mm, unsigned long start, size_t len,
 		 * split, despite we could. This is unlikely enough
 		 * failure that it's not worth optimizing it for.
 		 */
-		int error = userfaultfd_unmap_prep(vma, start, end, uf);
-		if (error)
-			return error;
+		ret = userfaultfd_unmap_prep(vma, start, end, uf);
+		if (ret)
+			return ret;
 	}
 
 	/*
 	 * unlock any mlock()ed ranges before detaching vmas
 	 */
-	if (mm->locked_vm) {
-		struct vm_area_struct *tmp = vma;
-		while (tmp && tmp->vm_start < end) {
-			if (tmp->vm_flags & VM_LOCKED) {
-				mm->locked_vm -= vma_pages(tmp);
-				munlock_vma_pages_all(tmp);
-			}
-			tmp = tmp->vm_next;
-		}
-	}
+	if (mm->locked_vm)
+		munmap_mlock_vma(vma, end);
 
 	/*
 	 * Remove the vma's, and unmap the actual pages
-- 
1.8.3.1



* [RFC v4 PATCH 3/3] mm: mmap: zap pages with read mmap_sem for large mapping
  2018-07-10 23:34 [RFC v4 0/3] mm: zap pages with read mmap_sem in munmap for large mapping Yang Shi
  2018-07-10 23:34 ` [RFC v4 PATCH 1/3] mm: introduce VM_DEAD flag and extend check_stable_address_space to check it Yang Shi
  2018-07-10 23:34 ` [RFC v4 PATCH 2/3] mm: refactor do_munmap() to extract the common part Yang Shi
@ 2018-07-10 23:34 ` Yang Shi
  2018-07-11 10:33 ` [RFC v4 0/3] mm: zap pages with read mmap_sem in munmap " Michal Hocko
  2018-07-11 11:10 ` Kirill A. Shutemov
  4 siblings, 0 replies; 16+ messages in thread
From: Yang Shi @ 2018-07-10 23:34 UTC (permalink / raw)
  To: mhocko, willy, ldufour, kirill, akpm; +Cc: yang.shi, linux-mm, linux-kernel

When running some mmap/munmap scalability tests with large memory (i.e.
> 300GB), the below hung task issue may happen occasionally.

INFO: task ps:14018 blocked for more than 120 seconds.
       Tainted: G            E 4.9.79-009.ali3000.alios7.x86_64 #1
 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
message.
 ps              D    0 14018      1 0x00000004
  ffff885582f84000 ffff885e8682f000 ffff880972943000 ffff885ebf499bc0
  ffff8828ee120000 ffffc900349bfca8 ffffffff817154d0 0000000000000040
  00ffffff812f872a ffff885ebf499bc0 024000d000948300 ffff880972943000
 Call Trace:
  [<ffffffff817154d0>] ? __schedule+0x250/0x730
  [<ffffffff817159e6>] schedule+0x36/0x80
  [<ffffffff81718560>] rwsem_down_read_failed+0xf0/0x150
  [<ffffffff81390a28>] call_rwsem_down_read_failed+0x18/0x30
  [<ffffffff81717db0>] down_read+0x20/0x40
  [<ffffffff812b9439>] proc_pid_cmdline_read+0xd9/0x4e0
  [<ffffffff81253c95>] ? do_filp_open+0xa5/0x100
  [<ffffffff81241d87>] __vfs_read+0x37/0x150
  [<ffffffff812f824b>] ? security_file_permission+0x9b/0xc0
  [<ffffffff81242266>] vfs_read+0x96/0x130
  [<ffffffff812437b5>] SyS_read+0x55/0xc0
  [<ffffffff8171a6da>] entry_SYSCALL_64_fastpath+0x1a/0xc5

This is because munmap holds mmap_sem exclusively from the very beginning all
the way to the end, and doesn't release it in the middle. Unmapping a large
mapping may take a long time (~18 seconds to unmap a 320GB mapping with every
single page mapped, on an idle machine).

Zapping pages is the most time-consuming part. According to the suggestion
from Michal Hocko [1], zapping pages can be done while holding read mmap_sem,
like what MADV_DONTNEED does. Then re-acquire write mmap_sem to clean up vmas.

But some parts may need write mmap_sem, for example vma splitting. So
the design is as follows:
	acquire write mmap_sem
	lookup vmas (find and split vmas)
	set VM_DEAD flags
	deal with special mappings
	downgrade_write

	zap pages
	release mmap_sem

	retake mmap_sem exclusively
	cleanup vmas
	release mmap_sem

Define the large-mapping size threshold as PUD size: pages are zapped with
read mmap_sem only for mappings which are >= PUD_SIZE. Unmapping an area
smaller than PUD_SIZE still goes through the regular path.

All vmas which will be zapped soon have the VM_DEAD flag set. A page fault may
race with munmap: before the optimization it would return either the right
content or SIGSEGV, but with the optimization it could return a zero page. This
flag marks page faults to the area as unstable and makes them trigger SIGSEGV,
in order to prevent that unexpected third state.

Vmas with VM_LOCKED | VM_HUGETLB | VM_PFNMAP or uprobes are considered special
mappings. They are dealt with before zapping pages, while write mmap_sem is
still held; basically this just updates vm_flags.

And since those flags are also manipulated by unmap_single_vma(), which in
this case is called by zap_page_range() with read mmap_sem held, updating
vm_flags inside the read critical section must be avoided. To keep the code
simple, unmap_single_vma() just checks whether VM_DEAD is set and skips that
handling for any VM_DEAD area, since it has already been done beforehand.

When cleaning up vmas, just call do_munmap() without carrying the vmas over
from the zap phase, to avoid races: the address space might already have
changed under our feet after retaking the exclusive lock.

For the time being, just do this in the munmap syscall path. Other
vm_munmap() or do_munmap() call sites (i.e. mmap, mremap, etc.) remain intact
for stability reasons. And this optimization is 64-bit only.

With the patches, the exclusive mmap_sem hold time when munmapping an 80GB
address space on a machine with 32 cores of E5-2680 @ 2.70GHz dropped from
seconds to the microsecond level.

		w/o		w/
do_munmap    2165433 us      35148.923 us
SyS_munmap   2165369 us      2166535 us

Here the execution time of do_munmap() is used to evaluate the time spent
holding the exclusive lock.

[1] https://lwn.net/Articles/753269/

Cc: Michal Hocko <mhocko@kernel.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Laurent Dufour <ldufour@linux.vnet.ibm.com>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
---
 mm/memory.c |  18 +++++++++--
 mm/mmap.c   | 101 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 115 insertions(+), 4 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 250547f..d343130 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1556,10 +1556,10 @@ static void unmap_single_vma(struct mmu_gather *tlb,
 	if (end <= vma->vm_start)
 		return;
 
-	if (vma->vm_file)
+	if (vma->vm_file && !(vma->vm_flags & VM_DEAD))
 		uprobe_munmap(vma, start, end);
 
-	if (unlikely(vma->vm_flags & VM_PFNMAP))
+	if (unlikely(vma->vm_flags & VM_PFNMAP) && !(vma->vm_flags & VM_DEAD))
 		untrack_pfn(vma, 0, 0);
 
 	if (start != end) {
@@ -1577,7 +1577,19 @@ static void unmap_single_vma(struct mmu_gather *tlb,
 			 */
 			if (vma->vm_file) {
 				i_mmap_lock_write(vma->vm_file->f_mapping);
-				__unmap_hugepage_range_final(tlb, vma, start, end, NULL);
+				if (vma->vm_flags & VM_DEAD)
+					/*
+					 * The vma is being unmapped with read
+					 * mmap_sem.
+					 * Can't update vm_flags, it has been
+					 * updated before with exclusive lock
+					 * held.
+					 */
+					__unmap_hugepage_range(tlb, vma, start,
+							       end, NULL);
+				else
+					__unmap_hugepage_range_final(tlb, vma,
+							start, end, NULL);
 				i_mmap_unlock_write(vma->vm_file->f_mapping);
 			}
 		} else
diff --git a/mm/mmap.c b/mm/mmap.c
index 2504094..169b143 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2778,6 +2778,91 @@ static inline void munmap_mlock_vma(struct vm_area_struct *vma,
 	}
 }
 
+/*
+ * Unmap large mapping early with acquiring read mmap_sem
+ *
+ * uf is the list for userfaultfd
+ */
+static int do_munmap_zap_early(struct mm_struct *mm, unsigned long start,
+			       size_t len, struct list_head *uf)
+{
+	unsigned long end = 0;
+	struct vm_area_struct *start_vma = NULL, *prev, *vma;
+	int ret = 0;
+
+	if (!munmap_addr_sanity(start, len))
+		return -EINVAL;
+
+	len = PAGE_ALIGN(len);
+
+	end = start + len;
+
+	if (len >= PUD_SIZE) {
+		/*
+		 * need write mmap_sem to split vma and set VM_DEAD flag
+		 * splitting vma up-front to save PITA to clean if it is failed
+		 */
+		if (down_write_killable(&mm->mmap_sem))
+			return -EINTR;
+
+		ret = munmap_lookup_vma(mm, &start_vma, &prev, start, end);
+		if (ret != 1)
+			goto out;
+
+		/* This ret value might be returned, so reset it */
+		ret = 0;
+
+		if (unlikely(uf)) {
+			ret = userfaultfd_unmap_prep(start_vma, start, end, uf);
+			if (ret)
+				goto out;
+		}
+
+		/* Handle mlocked vmas */
+		if (mm->locked_vm)
+			munmap_mlock_vma(start_vma, end);
+
+		/*
+		 * set VM_DEAD flag before tear down them.
+		 * page fault on VM_DEAD vma will trigger SIGSEGV.
+		 *
+		 * And, clear uprobe, VM_PFNMAP and hugetlb mapping in advance.
+		 */
+		vma = start_vma;
+		for ( ; vma && vma->vm_start < end; vma = vma->vm_next) {
+			vma->vm_flags |= VM_DEAD;
+
+			if (vma->vm_file)
+				uprobe_munmap(vma, vma->vm_start, vma->vm_end);
+			if (unlikely(vma->vm_flags & VM_PFNMAP))
+				untrack_pfn(vma, 0, 0);
+			if (is_vm_hugetlb_page(vma))
+				vma->vm_flags &= ~VM_MAYSHARE;
+		}
+
+		downgrade_write(&mm->mmap_sem);
+
+		/* zap mappings with read mmap_sem */
+		zap_page_range(start_vma, start, len);
+		/* indicates early zap is success */
+		up_read(&mm->mmap_sem);
+	}
+
+	/* hold write mmap_sem for vma cleanup or regular path */
+	if (down_write_killable(&mm->mmap_sem))
+		return -EINTR;
+	/*
+	 * call do_munmap() for vma cleanup too in order to not carry vma
+	 * to here since the address space might be changed under our
+	 * feet before we retake the exclusive lock.
+	 */
+	ret = do_munmap(mm, start, len, uf);
+
+out:
+	up_write(&mm->mmap_sem);
+	return ret;
+}
+
 /* Munmap is split into 2 main parts -- this part which finds
  * what needs doing, and the areas themselves, which do the
  * work.  This now handles partial unmappings.
@@ -2836,6 +2921,17 @@ int do_munmap(struct mm_struct *mm, unsigned long start, size_t len,
 	return 0;
 }
 
+static int vm_munmap_zap_early(unsigned long start, size_t len)
+{
+	int ret;
+	struct mm_struct *mm = current->mm;
+	LIST_HEAD(uf);
+
+	ret = do_munmap_zap_early(mm, start, len, &uf);
+	userfaultfd_unmap_complete(mm, &uf);
+	return ret;
+}
+
 int vm_munmap(unsigned long start, size_t len)
 {
 	int ret;
@@ -2855,10 +2951,13 @@ int vm_munmap(unsigned long start, size_t len)
 SYSCALL_DEFINE2(munmap, unsigned long, addr, size_t, len)
 {
 	profile_munmap(addr);
+#ifdef CONFIG_64BIT
+	return vm_munmap_zap_early(addr, len);
+#else
 	return vm_munmap(addr, len);
+#endif
 }
 
-
 /*
  * Emulation of deprecated remap_file_pages() syscall.
  */
-- 
1.8.3.1



* Re: [RFC v4 0/3] mm: zap pages with read mmap_sem in munmap for large mapping
  2018-07-10 23:34 [RFC v4 0/3] mm: zap pages with read mmap_sem in munmap for large mapping Yang Shi
                   ` (2 preceding siblings ...)
  2018-07-10 23:34 ` [RFC v4 PATCH 3/3] mm: mmap: zap pages with read mmap_sem for large mapping Yang Shi
@ 2018-07-11 10:33 ` Michal Hocko
  2018-07-11 11:13   ` Kirill A. Shutemov
                     ` (2 more replies)
  2018-07-11 11:10 ` Kirill A. Shutemov
  4 siblings, 3 replies; 16+ messages in thread
From: Michal Hocko @ 2018-07-11 10:33 UTC (permalink / raw)
  To: Yang Shi; +Cc: willy, ldufour, kirill, akpm, linux-mm, linux-kernel

On Wed 11-07-18 07:34:06, Yang Shi wrote:
> 
> Background:
> Recently, when we ran some vm scalability tests on machines with large memory,
> we ran into a couple of mmap_sem scalability issues when unmapping large memory
> space, please refer to https://lkml.org/lkml/2017/12/14/733 and
> https://lkml.org/lkml/2018/2/20/576.
> 
> 
> History:
> Then akpm suggested to unmap large mapping section by section and drop mmap_sem
> at a time to mitigate it (see https://lkml.org/lkml/2018/3/6/784).
> 
> V1 patch series was submitted to the mailing list per Andrew's suggestion
> (see https://lkml.org/lkml/2018/3/20/786). Then I received a lot great feedback
> and suggestions.
> 
> Then this topic was discussed on LSFMM summit 2018. In the summit, Michal Hocko
> suggested (also in the v1 patches review) to try "two phases" approach. Zapping
> pages with read mmap_sem, then doing via cleanup with write mmap_sem (for
> discussion detail, see https://lwn.net/Articles/753269/)
> 
> 
> Approach:
> Zapping pages is the most time consuming part, according to the suggestion from
> Michal Hocko [1], zapping pages can be done with holding read mmap_sem, like
> what MADV_DONTNEED does. Then re-acquire write mmap_sem to cleanup vmas.
> 
> But, we can't call MADV_DONTNEED directly, since there are two major drawbacks:
>   * The unexpected state from PF if it wins the race in the middle of munmap.
>     It may return zero page, instead of the content or SIGSEGV.
>   * Can’t handle VM_LOCKED | VM_HUGETLB | VM_PFNMAP and uprobe mappings, which
>     is a showstopper from akpm

I do not really understand why this is a showstopper. This is a mere
optimization. VM_LOCKED ranges are usually not that large. VM_HUGETLB
can be quite large alright but this should be doable on top. Is there
any reason to block any "cover most mappings first" patch?

> And, some part may need write mmap_sem, for example, vma splitting. So, the
> design is as follows:
>         acquire write mmap_sem
>         lookup vmas (find and split vmas)
>         set VM_DEAD flags
>         deal with special mappings
>         downgrade_write
> 
>         zap pages
>         release mmap_sem
> 
>         retake mmap_sem exclusively
>         cleanup vmas
>         release mmap_sem

Please explain why dropping the lock and then retaking it to clean up vmas
is OK. This is really important because a parallel thread could have
changed the underlying address space range.

Moreover

>  include/linux/mm.h  |   8 +++
>  include/linux/oom.h |  20 -------
>  mm/huge_memory.c    |   4 +-
>  mm/hugetlb.c        |   5 ++
>  mm/memory.c         |  57 ++++++++++++++++---
>  mm/mmap.c           | 221 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-------------
>  mm/shmem.c          |   9 ++-
>  7 files changed, 255 insertions(+), 69 deletions(-)

this is not a small change for something that could be achieved
from the userspace trivially (just call madvise before munmap - library
can hide this). Most workloads will not even care about races because
they simply do not play tricks with mmaps and userspace MM. So why do we
want to put the additional complexity into the kernel?

Note that I am _not_ saying this is a wrong idea, we just need some
pretty sound arguments to justify the additional complexity which is
mostly based on our fear that somebody might be doing something
(half)insane or dubious at best.

-- 
Michal Hocko
SUSE Labs


* Re: [RFC v4 0/3] mm: zap pages with read mmap_sem in munmap for large mapping
  2018-07-10 23:34 [RFC v4 0/3] mm: zap pages with read mmap_sem in munmap for large mapping Yang Shi
                   ` (3 preceding siblings ...)
  2018-07-11 10:33 ` [RFC v4 0/3] mm: zap pages with read mmap_sem in munmap " Michal Hocko
@ 2018-07-11 11:10 ` Kirill A. Shutemov
  2018-07-11 11:58   ` Michal Hocko
  2018-07-11 17:04   ` Yang Shi
  4 siblings, 2 replies; 16+ messages in thread
From: Kirill A. Shutemov @ 2018-07-11 11:10 UTC (permalink / raw)
  To: Yang Shi; +Cc: mhocko, willy, ldufour, akpm, linux-mm, linux-kernel

On Wed, Jul 11, 2018 at 07:34:06AM +0800, Yang Shi wrote:
> 
> Background:
> Recently, when we ran some vm scalability tests on machines with large memory,
> we ran into a couple of mmap_sem scalability issues when unmapping large memory
> space, please refer to https://lkml.org/lkml/2017/12/14/733 and
> https://lkml.org/lkml/2018/2/20/576.
> 
> 
> History:
> Then akpm suggested to unmap large mapping section by section and drop mmap_sem
> at a time to mitigate it (see https://lkml.org/lkml/2018/3/6/784).
> 
> V1 patch series was submitted to the mailing list per Andrew's suggestion
> (see https://lkml.org/lkml/2018/3/20/786). Then I received a lot great feedback
> and suggestions.
> 
> Then this topic was discussed on LSFMM summit 2018. In the summit, Michal Hocko
> suggested (also in the v1 patches review) to try "two phases" approach. Zapping
> pages with read mmap_sem, then doing via cleanup with write mmap_sem (for
> discussion detail, see https://lwn.net/Articles/753269/)
> 
> 
> Approach:
> Zapping pages is the most time consuming part, according to the suggestion from
> Michal Hocko [1], zapping pages can be done with holding read mmap_sem, like
> what MADV_DONTNEED does. Then re-acquire write mmap_sem to cleanup vmas.
> 
> But, we can't call MADV_DONTNEED directly, since there are two major drawbacks:
>   * The unexpected state from PF if it wins the race in the middle of munmap.
>     It may return zero page, instead of the content or SIGSEGV.
>   * Can’t handle VM_LOCKED | VM_HUGETLB | VM_PFNMAP and uprobe mappings, which
>     is a showstopper from akpm
> 
> And, some part may need write mmap_sem, for example, vma splitting. So, the
> design is as follows:
>         acquire write mmap_sem
>         lookup vmas (find and split vmas)
>         set VM_DEAD flags
>         deal with special mappings
>         downgrade_write
> 
>         zap pages
>         release mmap_sem
> 
>         retake mmap_sem exclusively
>         cleanup vmas
>         release mmap_sem
> 
> Define large mapping size thresh as PUD size, just zap pages with read mmap_sem
> for mappings which are >= PUD_SIZE. So, unmapping less than PUD_SIZE area still
> goes with the regular path.
> 
> All vmas which will be zapped soon will have VM_DEAD flag set. Since PF may race
> with munmap, may just return the right content or SIGSEGV before the optimization,
> but with the optimization, it may return a zero page. Here use this flag to mark
> PF to this area is unstable, will trigger SIGSEGV, in order to prevent from the
> unexpected 3rd state.
> 
> If the vma has VM_LOCKED | VM_HUGETLB | VM_PFNMAP or uprobe, they are considered
> as special mappings. They will be dealt with before zapping pages with write
> mmap_sem held. Basically, just update vm_flags. The actual unmapping is still
> done with read mmap_sem.
> 
> And, since they are also manipulated by unmap_single_vma() which is called by
> zap_page_range() with read mmap_sem held in this case, to prevent from updating
> vm_flags in read critical section and considering the complexity of coding, just
> check if VM_DEAD is set, then skip any VM_DEAD area since they should be handled
> before.
> 
> When cleaning up vmas, just call do_munmap() without carrying vmas from the above
> to avoid race condition, since the address space might be already changed under
> our feet after retaking exclusive lock.
> 
> For the time being, just do this in munmap syscall path. Other vm_munmap() or
> do_munmap() call sites (i.e mmap, mremap, etc) remain intact for stability reason.
> And, make this 64 bit only explicitly per akpm's suggestion.

I still see VM_DEAD as an unnecessary complication. We should be fine without
it. But it looks like I'm in the minority :/

It's okay. I have another suggestion that doesn't require the VM_DEAD
trick either :)

1. Take mmap_sem for write;
2. Adjust VMA layout (split/remove). After the step all memory we try to
   unmap is outside any VMA.
3. Downgrade mmap_sem to read.
4. Zap the page range.
5. Drop mmap_sem.

I believe it should be safe.

The pages in the range cannot be re-faulted after step 3 as find_vma()
will not see the corresponding VMA and deliver SIGSEGV.

New VMAs cannot be created in the range before step 5 since we hold the
semaphore at least for read the whole time.

Do you see a problem with this approach?

-- 
 Kirill A. Shutemov


* Re: [RFC v4 0/3] mm: zap pages with read mmap_sem in munmap for large mapping
  2018-07-11 10:33 ` [RFC v4 0/3] mm: zap pages with read mmap_sem in munmap " Michal Hocko
@ 2018-07-11 11:13   ` Kirill A. Shutemov
  2018-07-11 11:53     ` Michal Hocko
  2018-07-11 16:57   ` Yang Shi
  2018-07-11 22:49   ` Andrew Morton
  2 siblings, 1 reply; 16+ messages in thread
From: Kirill A. Shutemov @ 2018-07-11 11:13 UTC (permalink / raw)
  To: Michal Hocko; +Cc: Yang Shi, willy, ldufour, akpm, linux-mm, linux-kernel

On Wed, Jul 11, 2018 at 12:33:12PM +0200, Michal Hocko wrote:
> this is not a small change for something that could be achieved
> from the userspace trivially (just call madvise before munmap - library
> can hide this). Most workloads will even not care about races because
> they simply do not play tricks with mmaps and userspace MM. So why do we
> want to put the additional complexity into the kernel?

As I said before, kernel latency issues have to be addressed in the kernel.
We cannot rely on userspace being kind here.

-- 
 Kirill A. Shutemov


* Re: [RFC v4 0/3] mm: zap pages with read mmap_sem in munmap for large mapping
  2018-07-11 11:13   ` Kirill A. Shutemov
@ 2018-07-11 11:53     ` Michal Hocko
  2018-07-11 17:08       ` Yang Shi
  0 siblings, 1 reply; 16+ messages in thread
From: Michal Hocko @ 2018-07-11 11:53 UTC (permalink / raw)
  To: Kirill A. Shutemov; +Cc: Yang Shi, willy, ldufour, akpm, linux-mm, linux-kernel

On Wed 11-07-18 14:13:12, Kirill A. Shutemov wrote:
> On Wed, Jul 11, 2018 at 12:33:12PM +0200, Michal Hocko wrote:
> > this is not a small change for something that could be achieved
> > from the userspace trivially (just call madvise before munmap - library
> > can hide this). Most workloads will even not care about races because
> > they simply do not play tricks with mmaps and userspace MM. So why do we
> > want to put the additional complexity into the kernel?
> 
> As I said before, kernel latency issues have to be addressed in kernel.
> We cannot rely on userspace being kind here.

Those who really care and create really large mappings will know how to
do this properly. Most others just do not care enough. So I am not
really sure this alone is a sufficient argument.

I personally like the in-kernel auto-tuning, but as I've said the
changelog should be really clear why all the complications are
justified. This would be a lot easier to argue about if it was a simple
	if (len > THRESHOLD)
		do_madvise(DONTNEED)
	munmap().
approach. But if we really have to care about parallel faults and munmap
consistency this will always be tricky.
-- 
Michal Hocko
SUSE Labs


* Re: [RFC v4 0/3] mm: zap pages with read mmap_sem in munmap for large mapping
  2018-07-11 11:10 ` Kirill A. Shutemov
@ 2018-07-11 11:58   ` Michal Hocko
  2018-07-11 17:04   ` Yang Shi
  1 sibling, 0 replies; 16+ messages in thread
From: Michal Hocko @ 2018-07-11 11:58 UTC (permalink / raw)
  To: Kirill A. Shutemov; +Cc: Yang Shi, willy, ldufour, akpm, linux-mm, linux-kernel

On Wed 11-07-18 14:10:52, Kirill A. Shutemov wrote:
[...]
> It's okay. I have another suggestion that also doesn't require VM_DEAD
> trick too :)
> 
> 1. Take mmap_sem for write;
> 2. Adjust VMA layout (split/remove). After the step all memory we try to
>    unmap is outside any VMA.
> 3. Downgrade mmap_sem to read.
> 4. Zap the page range.
> 5. Drop mmap_sem.
> 
> I believe it should be safe.
> 
> The pages in the range cannot be re-faulted after step 3 as find_vma()
> will not see the corresponding VMA and deliver SIGSEGV.
> 
> New VMAs cannot be created in the range before step 5 since we hold the
> semaphore at least for read the whole time.
> 
> Do you see problem in this approach?

Yes, this seems to be safe, at least at first glance.
-- 
Michal Hocko
SUSE Labs


* Re: [RFC v4 0/3] mm: zap pages with read mmap_sem in munmap for large mapping
  2018-07-11 10:33 ` [RFC v4 0/3] mm: zap pages with read mmap_sem in munmap " Michal Hocko
  2018-07-11 11:13   ` Kirill A. Shutemov
@ 2018-07-11 16:57   ` Yang Shi
  2018-07-11 22:49   ` Andrew Morton
  2 siblings, 0 replies; 16+ messages in thread
From: Yang Shi @ 2018-07-11 16:57 UTC (permalink / raw)
  To: Michal Hocko; +Cc: willy, ldufour, kirill, akpm, linux-mm, linux-kernel



On 7/11/18 3:33 AM, Michal Hocko wrote:
> On Wed 11-07-18 07:34:06, Yang Shi wrote:
>> Background:
>> Recently, when we ran some vm scalability tests on machines with large memory,
>> we ran into a couple of mmap_sem scalability issues when unmapping large memory
>> space, please refer to https://lkml.org/lkml/2017/12/14/733 and
>> https://lkml.org/lkml/2018/2/20/576.
>>
>>
>> History:
>> Then akpm suggested to unmap large mapping section by section and drop mmap_sem
>> at a time to mitigate it (see https://lkml.org/lkml/2018/3/6/784).
>>
>> V1 patch series was submitted to the mailing list per Andrew's suggestion
>> (see https://lkml.org/lkml/2018/3/20/786). Then I received a lot great feedback
>> and suggestions.
>>
>> Then this topic was discussed on LSFMM summit 2018. In the summit, Michal Hocko
>> suggested (also in the v1 patches review) to try "two phases" approach. Zapping
>> pages with read mmap_sem, then doing via cleanup with write mmap_sem (for
>> discussion detail, see https://lwn.net/Articles/753269/)
>>
>>
>> Approach:
>> Zapping pages is the most time consuming part, according to the suggestion from
>> Michal Hocko [1], zapping pages can be done with holding read mmap_sem, like
>> what MADV_DONTNEED does. Then re-acquire write mmap_sem to cleanup vmas.
>>
>> But, we can't call MADV_DONTNEED directly, since there are two major drawbacks:
>>    * The unexpected state from PF if it wins the race in the middle of munmap.
>>      It may return zero page, instead of the content or SIGSEGV.
>>    * Can’t handle VM_LOCKED | VM_HUGETLB | VM_PFNMAP and uprobe mappings, which
>>      is a showstopper from akpm
> I do not really understand why this is a showstopper. This is a mere
> optimization. VM_LOCKED ranges are usually not that large. VM_HUGETLB
> can be quite large alright but this should be doable on top. Is there
> any reason to block any "cover most mappings first" patch?
>
>> And, some part may need write mmap_sem, for example, vma splitting. So, the
>> design is as follows:
>>          acquire write mmap_sem
>>          lookup vmas (find and split vmas)
>>          set VM_DEAD flags
>>          deal with special mappings
>>          downgrade_write
>>
>>          zap pages
>>          release mmap_sem
>>
>>          retake mmap_sem exclusively
>>          cleanup vmas
>>          release mmap_sem
> Please explain why dropping the lock and then ratake it to cleanup vmas
> is OK. This is really important because parallel thread could have
> changed the underlying address space range.

Yes, the address space could be changed after retaking the lock. Actually,
here do_munmap() is called in the new patch to do the cleanup work, as Kirill
suggested, which will re-look up the vmas and deal with any address space
change.

If there is no address space change, it actually just cleans up the vmas.

>
> Moreover
>
>>   include/linux/mm.h  |   8 +++
>>   include/linux/oom.h |  20 -------
>>   mm/huge_memory.c    |   4 +-
>>   mm/hugetlb.c        |   5 ++
>>   mm/memory.c         |  57 ++++++++++++++++---
>>   mm/mmap.c           | 221 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-------------
>>   mm/shmem.c          |   9 ++-
>>   7 files changed, 255 insertions(+), 69 deletions(-)
> this is not a small change for something that could be achieved
> from the userspace trivially (just call madvise before munmap - library
> can hide this). Most workloads will even not care about races because
> they simply do not play tricks with mmaps and userspace MM. So why do we
> want to put the additional complexity into the kernel?
>
> Note that I am _not_ saying this is a wrong idea, we just need some
> pretty sounds arguments to justify the additional complexity which is
> mostly based on our fear that somebody might be doing something
> (half)insane or dubious at best.

I agree with Kirill that we can't rely on sane userspace to handle a kernel
latency issue. Moreover, we don't even know whether userspace is sane enough
at all.

Yang

>



* Re: [RFC v4 0/3] mm: zap pages with read mmap_sem in munmap for large mapping
  2018-07-11 11:10 ` Kirill A. Shutemov
  2018-07-11 11:58   ` Michal Hocko
@ 2018-07-11 17:04   ` Yang Shi
  2018-07-12  8:04     ` Michal Hocko
  1 sibling, 1 reply; 16+ messages in thread
From: Yang Shi @ 2018-07-11 17:04 UTC (permalink / raw)
  To: Kirill A. Shutemov; +Cc: mhocko, willy, ldufour, akpm, linux-mm, linux-kernel



On 7/11/18 4:10 AM, Kirill A. Shutemov wrote:
> On Wed, Jul 11, 2018 at 07:34:06AM +0800, Yang Shi wrote:
>> Background:
>> Recently, when we ran some vm scalability tests on machines with large memory,
>> we ran into a couple of mmap_sem scalability issues when unmapping large memory
>> space, please refer to https://lkml.org/lkml/2017/12/14/733 and
>> https://lkml.org/lkml/2018/2/20/576.
>>
>>
>> History:
>> Then akpm suggested to unmap large mapping section by section and drop mmap_sem
>> at a time to mitigate it (see https://lkml.org/lkml/2018/3/6/784).
>>
>> V1 patch series was submitted to the mailing list per Andrew's suggestion
>> (see https://lkml.org/lkml/2018/3/20/786). Then I received a lot great feedback
>> and suggestions.
>>
>> Then this topic was discussed on LSFMM summit 2018. In the summit, Michal Hocko
>> suggested (also in the v1 patches review) to try "two phases" approach. Zapping
>> pages with read mmap_sem, then doing via cleanup with write mmap_sem (for
>> discussion detail, see https://lwn.net/Articles/753269/)
>>
>>
>> Approach:
>> Zapping pages is the most time consuming part, according to the suggestion from
>> Michal Hocko [1], zapping pages can be done with holding read mmap_sem, like
>> what MADV_DONTNEED does. Then re-acquire write mmap_sem to cleanup vmas.
>>
>> But, we can't call MADV_DONTNEED directly, since there are two major drawbacks:
>>    * The unexpected state from PF if it wins the race in the middle of munmap.
>>      It may return zero page, instead of the content or SIGSEGV.
>>    * Can’t handle VM_LOCKED | VM_HUGETLB | VM_PFNMAP and uprobe mappings, which
>>      is a showstopper from akpm
>>
>> And, some part may need write mmap_sem, for example, vma splitting. So, the
>> design is as follows:
>>          acquire write mmap_sem
>>          lookup vmas (find and split vmas)
>>          set VM_DEAD flags
>>          deal with special mappings
>>          downgrade_write
>>
>>          zap pages
>>          release mmap_sem
>>
>>          retake mmap_sem exclusively
>>          cleanup vmas
>>          release mmap_sem
>>
>> Define large mapping size thresh as PUD size, just zap pages with read mmap_sem
>> for mappings which are >= PUD_SIZE. So, unmapping less than PUD_SIZE area still
>> goes with the regular path.
>>
>> All vmas which will be zapped soon will have VM_DEAD flag set. Since PF may race
>> with munmap, may just return the right content or SIGSEGV before the optimization,
>> but with the optimization, it may return a zero page. Here use this flag to mark
>> PF to this area is unstable, will trigger SIGSEGV, in order to prevent from the
>> unexpected 3rd state.
>>
>> If the vma has VM_LOCKED | VM_HUGETLB | VM_PFNMAP or uprobe, they are considered
>> as special mappings. They will be dealt with before zapping pages with write
>> mmap_sem held. Basically, just update vm_flags. The actual unmapping is still
>> done with read mmap_sem.
>>
>> And, since they are also manipulated by unmap_single_vma() which is called by
>> zap_page_range() with read mmap_sem held in this case, to prevent from updating
>> vm_flags in read critical section and considering the complexity of coding, just
>> check if VM_DEAD is set, then skip any VM_DEAD area since they should be handled
>> before.
>>
>> When cleaning up vmas, just call do_munmap() without carrying vmas from the above
>> to avoid race condition, since the address space might be already changed under
>> our feet after retaking exclusive lock.
>>
>> For the time being, just do this in munmap syscall path. Other vm_munmap() or
>> do_munmap() call sites (i.e mmap, mremap, etc) remain intact for stability reason.
>> And, make this 64 bit only explicitly per akpm's suggestion.
> I still see VM_DEAD as unnecessary complication. We should be fine without it.
> But looks like I'm in the minority :/
>
> It's okay. I have another suggestion that also doesn't require VM_DEAD
> trick too :)
>
> 1. Take mmap_sem for write;
> 2. Adjust VMA layout (split/remove). After the step all memory we try to
>     unmap is outside any VMA.
> 3. Downgrade mmap_sem to read.
> 4. Zap the page range.
> 5. Drop mmap_sem.
>
> I believe it should be safe.


Yes, it looks so. But a further question: once all the vmas have been removed,
how could zap_page_range() do its job? It depends on the vmas.

One approach is to save all the vmas on a separate list, then have
zap_page_range() do the unmap using that list.

Yang

>
> The pages in the range cannot be re-faulted after step 3 as find_vma()
> will not see the corresponding VMA and deliver SIGSEGV.
>
> New VMAs cannot be created in the range before step 5 since we hold the
> semaphore at least for read the whole time.
>
> Do you see problem in this approach?
>



* Re: [RFC v4 0/3] mm: zap pages with read mmap_sem in munmap for large mapping
  2018-07-11 11:53     ` Michal Hocko
@ 2018-07-11 17:08       ` Yang Shi
  0 siblings, 0 replies; 16+ messages in thread
From: Yang Shi @ 2018-07-11 17:08 UTC (permalink / raw)
  To: Michal Hocko, Kirill A. Shutemov
  Cc: willy, ldufour, akpm, linux-mm, linux-kernel



On 7/11/18 4:53 AM, Michal Hocko wrote:
> On Wed 11-07-18 14:13:12, Kirill A. Shutemov wrote:
>> On Wed, Jul 11, 2018 at 12:33:12PM +0200, Michal Hocko wrote:
>>> this is not a small change for something that could be achieved
>>> from the userspace trivially (just call madvise before munmap - library
>>> can hide this). Most workloads will even not care about races because
>>> they simply do not play tricks with mmaps and userspace MM. So why do we
>>> want to put the additional complexity into the kernel?
>> As I said before, kernel latency issues have to be addressed in kernel.
>> We cannot rely on userspace being kind here.
> Those who really care and create really large mappings will know how to
> do this properly. Most others just do not care enough. So I am not
> really sure this alone is a sufficient argument.
>
> I personally like the in kernel auto tuning but as I've said the
> changelog should be really clear why all the complications are
> justified. This would be a lot easier to argue about if it was a simple
> 	if (len > THARSHOLD)
> 		do_madvise(DONTNEED)
> 	munmap().

The main difference, AFAICS, is that it can't deal with parallel faults
and those special mappings. Some users may not care about that, but some may.

Yang

> approach. But if we really have to care about parallel faults and munmap
> consitency this will always be tricky



* Re: [RFC v4 0/3] mm: zap pages with read mmap_sem in munmap for large mapping
  2018-07-11 10:33 ` [RFC v4 0/3] mm: zap pages with read mmap_sem in munmap " Michal Hocko
  2018-07-11 11:13   ` Kirill A. Shutemov
  2018-07-11 16:57   ` Yang Shi
@ 2018-07-11 22:49   ` Andrew Morton
  2018-07-12  8:15     ` Michal Hocko
  2 siblings, 1 reply; 16+ messages in thread
From: Andrew Morton @ 2018-07-11 22:49 UTC (permalink / raw)
  To: Michal Hocko; +Cc: Yang Shi, willy, ldufour, kirill, linux-mm, linux-kernel

On Wed, 11 Jul 2018 12:33:12 +0200 Michal Hocko <mhocko@kernel.org> wrote:

> > Approach:
> > Zapping pages is the most time consuming part, according to the suggestion from
> > Michal Hocko [1], zapping pages can be done with holding read mmap_sem, like
> > what MADV_DONTNEED does. Then re-acquire write mmap_sem to cleanup vmas.
> > 
> > But, we can't call MADV_DONTNEED directly, since there are two major drawbacks:
> >   * The unexpected state from PF if it wins the race in the middle of munmap.
> >     It may return zero page, instead of the content or SIGSEGV.
> >   * Can’t handle VM_LOCKED | VM_HUGETLB | VM_PFNMAP and uprobe mappings, which
> >     is a showstopper from akpm
> 
> I do not really understand why this is a showstopper. This is a mere
> optimization. VM_LOCKED ranges are usually not that large. VM_HUGETLB
> can be quite large alright but this should be doable on top. Is there
> any reason to block any "cover most mappings first" patch?

Somebody somewhere is going to want to unmap vast mlocked regions and
they're going to report softlockup warnings.  So we shouldn't implement
something which can't address these cases.  Maybe it doesn't do so in
the first version, but we should at least have a plan to handle all
cases.




* Re: [RFC v4 0/3] mm: zap pages with read mmap_sem in munmap for large mapping
  2018-07-11 17:04   ` Yang Shi
@ 2018-07-12  8:04     ` Michal Hocko
  2018-07-12 23:45       ` Yang Shi
  0 siblings, 1 reply; 16+ messages in thread
From: Michal Hocko @ 2018-07-12  8:04 UTC (permalink / raw)
  To: Yang Shi; +Cc: Kirill A. Shutemov, willy, ldufour, akpm, linux-mm, linux-kernel

On Wed 11-07-18 10:04:48, Yang Shi wrote:
[...]
> One approach is to save all the vmas on a separate list, then zap_page_range
> does unmap with this list.

Just detach the unmapped vma chain from the mm. You can keep the existing
vm_next chain and reuse it, roughly like the sketch below.
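A rough sketch of that (illustrative only, untested; it assumes the usual
mm/mmap.c locals mm/vma/prev/end plus a tmp cursor, and reuses the existing
detach_vmas_to_be_unmapped() and zap_page_range() helpers):

	/* with write mmap_sem held, after splitting at start/end */
	detach_vmas_to_be_unmapped(mm, vma, prev, end);
	downgrade_write(&mm->mmap_sem);

	/* the vmas are gone from the tree but still chained via vm_next */
	for (tmp = vma; tmp; tmp = tmp->vm_next)
		zap_page_range(tmp, tmp->vm_start, tmp->vm_end - tmp->vm_start);

	up_read(&mm->mmap_sem);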

-- 
Michal Hocko
SUSE Labs


* Re: [RFC v4 0/3] mm: zap pages with read mmap_sem in munmap for large mapping
  2018-07-11 22:49   ` Andrew Morton
@ 2018-07-12  8:15     ` Michal Hocko
  0 siblings, 0 replies; 16+ messages in thread
From: Michal Hocko @ 2018-07-12  8:15 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Yang Shi, willy, ldufour, kirill, linux-mm, linux-kernel

On Wed 11-07-18 15:49:54, Andrew Morton wrote:
> On Wed, 11 Jul 2018 12:33:12 +0200 Michal Hocko <mhocko@kernel.org> wrote:
> 
> > > Approach:
> > > Zapping pages is the most time consuming part, according to the suggestion from
> > > Michal Hocko [1], zapping pages can be done with holding read mmap_sem, like
> > > what MADV_DONTNEED does. Then re-acquire write mmap_sem to cleanup vmas.
> > > 
> > > But, we can't call MADV_DONTNEED directly, since there are two major drawbacks:
> > >   * The unexpected state from PF if it wins the race in the middle of munmap.
> > >     It may return zero page, instead of the content or SIGSEGV.
> > >   * Can’t handle VM_LOCKED | VM_HUGETLB | VM_PFNMAP and uprobe mappings, which
> > >     is a showstopper from akpm
> > 
> > I do not really understand why this is a showstopper. This is a mere
> > optimization. VM_LOCKED ranges are usually not that large. VM_HUGETLB
> > can be quite large alright but this should be doable on top. Is there
> > any reason to block any "cover most mappings first" patch?
> 
> Somebody somewhere is going to want to unmap vast mlocked regions and
> they're going to report softlockup warnings. So we shouldn't implement
> something which can't address these cases.  Maybe it doesn't do so in
> the first version, but we should at least have a plan to handle all
> cases.

Absolutely. I was just responding to the "showstopper" part. This is
improving some cases but it shouldn't make others worse so going
incremental should be perfectly reasonable.
-- 
Michal Hocko
SUSE Labs


* Re: [RFC v4 0/3] mm: zap pages with read mmap_sem in munmap for large mapping
  2018-07-12  8:04     ` Michal Hocko
@ 2018-07-12 23:45       ` Yang Shi
  0 siblings, 0 replies; 16+ messages in thread
From: Yang Shi @ 2018-07-12 23:45 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Kirill A. Shutemov, willy, ldufour, akpm, linux-mm, linux-kernel



On 7/12/18 1:04 AM, Michal Hocko wrote:
> On Wed 11-07-18 10:04:48, Yang Shi wrote:
> [...]
>> One approach is to save all the vmas on a separate list, then zap_page_range
>> does unmap with this list.
> Just detached unmapped vma chain from mm. You can keep the existing
> vm_next chain and reuse it.

Yes. Other than this, we still need to:

   * Tell zap_page_range() not to update vm_flags, as I did in v4, of course
without the VM_DEAD flag this time

   * Extract the page-table freeing code and do it after zap_page_range(). I
think I can just call free_pgd_range() directly.

>


