linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [RFC v3 PATCH 0/5] mm: zap pages with read mmap_sem in munmap for large mapping
@ 2018-06-29 22:39 Yang Shi
  2018-06-29 22:39 ` [RFC v3 PATCH 1/5] uprobes: make vma_has_uprobes non-static Yang Shi
                   ` (5 more replies)
  0 siblings, 6 replies; 44+ messages in thread
From: Yang Shi @ 2018-06-29 22:39 UTC (permalink / raw)
  To: mhocko, willy, ldufour, akpm, peterz, mingo, acme,
	alexander.shishkin, jolsa, namhyung, tglx, hpa
  Cc: yang.shi, linux-mm, x86, linux-kernel


Background:
Recently, when we ran some vm scalability tests on machines with large memory,
we ran into a couple of mmap_sem scalability issues when unmapping a large memory
space; please refer to https://lkml.org/lkml/2017/12/14/733 and
https://lkml.org/lkml/2018/2/20/576.


History:
akpm then suggested unmapping large mappings section by section, dropping and
re-acquiring mmap_sem in between, to mitigate the issue (see
https://lkml.org/lkml/2018/3/6/784).

The v1 patch series was submitted to the mailing list per Andrew's suggestion
(see https://lkml.org/lkml/2018/3/20/786), and I received a lot of great feedback
and suggestions.

This topic was then discussed at the LSFMM summit 2018, where Michal Hocko
suggested (as he also did in the v1 review) trying a "two phases" approach:
zap pages with read mmap_sem, then do the cleanup with write mmap_sem (for
discussion details, see https://lwn.net/Articles/753269/).
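
A rough sketch of the resulting munmap flow (simplified; patch 4/5 has the
actual code):

	/* phase 1: find/split the vmas and mark them, with write mmap_sem */
	down_write(&mm->mmap_sem);
	/* split vmas covering [start, end) and set VM_DEAD on them */
	up_write(&mm->mmap_sem);

	/* phase 2: zap the pages (the expensive part) with read mmap_sem */
	down_read(&mm->mmap_sem);
	zap_page_range(vma, start, len);
	up_read(&mm->mmap_sem);

	/* phase 3: re-take write mmap_sem to free page tables and vmas */
	down_write(&mm->mmap_sem);
	/* free_pgtables(), detach_vmas_to_be_unmapped(), remove_vma_list() */
	up_write(&mm->mmap_sem);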


Changelog:
v2 --> v3:
* Refactored the do_munmap code to extract the common part, per Peter's suggestion
* Introduced the VM_DEAD flag, per Michal's suggestion. Only x86's page fault
  handler handles VM_DEAD for the time being; other architectures will be
  covered once the patch series is reviewed
* Now look up the vma (find and split) and set the VM_DEAD flag with write
  mmap_sem, then zap the mapping with read mmap_sem, then clean up pgtables
  and vmas with write mmap_sem, per Peter's suggestion

v1 --> v2:
* Re-implemented the code per the discussion at the LSFMM summit

 
Regression and performance data:
Tests were run on a machine with 32 cores of E5-2680 @ 2.70GHz and 384GB memory.

Regression testing was done with full LTP and with trinity (munmap), with the
thresh set to 4K in the code (for regression testing only) so that the new code
gets better coverage, since the trinity (munmap) test manipulates 4K mappings.

No regression issue was found, and the system survived the trinity (munmap)
test for 4 hours until I aborted it.

Throughput of page faults (#/s) with the below stress-ng test:
stress-ng --mmap 0 --mmap-bytes 80G --mmap-file --metrics --perf
--timeout 600s
        pristine         patched          delta
       89.41K/sec       97.29K/sec        +8.8%

The result is not very stable and depends on timing, so it is just for reference.


Yang Shi (5):
      uprobes: make vma_has_uprobes non-static
      mm: introduce VM_DEAD flag
      mm: refactor do_munmap() to extract the common part
      mm: mmap: zap pages with read mmap_sem for large mapping
      x86: check VM_DEAD flag in page fault

 arch/x86/mm/fault.c     |   4 ++
 include/linux/mm.h      |   6 +++
 include/linux/uprobes.h |   7 +++
 kernel/events/uprobes.c |   2 +-
 mm/mmap.c               | 243 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++----------------
 5 files changed, 224 insertions(+), 38 deletions(-)


^ permalink raw reply	[flat|nested] 44+ messages in thread

* [RFC v3 PATCH 1/5] uprobes: make vma_has_uprobes non-static
  2018-06-29 22:39 [RFC v3 PATCH 0/5] mm: zap pages with read mmap_sem in munmap for large mapping Yang Shi
@ 2018-06-29 22:39 ` Yang Shi
  2018-06-29 22:39 ` [RFC v3 PATCH 2/5] mm: introduce VM_DEAD flag Yang Shi
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 44+ messages in thread
From: Yang Shi @ 2018-06-29 22:39 UTC (permalink / raw)
  To: mhocko, willy, ldufour, akpm, peterz, mingo, acme,
	alexander.shishkin, jolsa, namhyung, tglx, hpa
  Cc: yang.shi, linux-mm, x86, linux-kernel

vma_has_uprobes() will be used in a following patch to check whether a vma
can be unmapped while holding read mmap_sem, but it is currently static. So,
make it non-static so it can be used outside of uprobes.

Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
---
 include/linux/uprobes.h | 7 +++++++
 kernel/events/uprobes.c | 2 +-
 2 files changed, 8 insertions(+), 1 deletion(-)

diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
index 0a294e9..7f1fb8c 100644
--- a/include/linux/uprobes.h
+++ b/include/linux/uprobes.h
@@ -149,6 +149,8 @@ struct uprobes_state {
 extern bool arch_uprobe_ignore(struct arch_uprobe *aup, struct pt_regs *regs);
 extern void arch_uprobe_copy_ixol(struct page *page, unsigned long vaddr,
 					 void *src, unsigned long len);
+extern bool vma_has_uprobes(struct vm_area_struct *vma, unsigned long start,
+			    unsigned long end);
 #else /* !CONFIG_UPROBES */
 struct uprobes_state {
 };
@@ -203,5 +205,10 @@ static inline void uprobe_copy_process(struct task_struct *t, unsigned long flag
 static inline void uprobe_clear_state(struct mm_struct *mm)
 {
 }
+static inline bool vma_has_uprobes(struct vm_area_struct *vma, unsigned long start,
+				   unsigned long end)
+{
+	return false;
+}
 #endif /* !CONFIG_UPROBES */
 #endif	/* _LINUX_UPROBES_H */
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index ccc579a..4880c46 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -1095,7 +1095,7 @@ int uprobe_mmap(struct vm_area_struct *vma)
 	return 0;
 }
 
-static bool
+bool
 vma_has_uprobes(struct vm_area_struct *vma, unsigned long start, unsigned long end)
 {
 	loff_t min, max;
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [RFC v3 PATCH 2/5] mm: introduce VM_DEAD flag
  2018-06-29 22:39 [RFC v3 PATCH 0/5] mm: zap pages with read mmap_sem in munmap for large mapping Yang Shi
  2018-06-29 22:39 ` [RFC v3 PATCH 1/5] uprobes: make vma_has_uprobes non-static Yang Shi
@ 2018-06-29 22:39 ` Yang Shi
  2018-07-02 13:40   ` Michal Hocko
  2018-06-29 22:39 ` [RFC v3 PATCH 3/5] mm: refactor do_munmap() to extract the common part Yang Shi
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 44+ messages in thread
From: Yang Shi @ 2018-06-29 22:39 UTC (permalink / raw)
  To: mhocko, willy, ldufour, akpm, peterz, mingo, acme,
	alexander.shishkin, jolsa, namhyung, tglx, hpa
  Cc: yang.shi, linux-mm, x86, linux-kernel

The VM_DEAD flag is used to mark a vma that is being unmapped; access to such
an area will trigger SIGSEGV.

This flag will be used by the optimization for unmapping large address
spaces (>= 1GB) in a later patch. It is 64 bit only at the moment, since:
  * we have used up the vm_flags bits on 32 bit
  * 32 bit machines typically will not have such large mappings

All architectures that support 64 bit need to check this flag in their
page fault handlers. This is implemented in later patches.
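
Because VM_DEAD is defined as 0 on 32 bit, checks against it compile away
there. A minimal sketch of such a check (vma_is_dead() is a hypothetical
name, not part of this series):

	static inline bool vma_is_dead(struct vm_area_struct *vma)
	{
		/* always false on 32 bit, where VM_DEAD is 0 */
		return !!(vma->vm_flags & VM_DEAD);
	}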

Suggested-by: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
---
 include/linux/mm.h | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index a0fbb9f..28a3906 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -242,6 +242,12 @@ extern int overcommit_kbytes_handler(struct ctl_table *, int, void __user *,
 #endif
 #endif /* CONFIG_ARCH_HAS_PKEYS */
 
+#ifdef CONFIG_64BIT
+#define VM_DEAD			BIT(37)	/* bit only usable on 64 bit kernel */
+#else
+#define VM_DEAD			0
+#endif
+
 #if defined(CONFIG_X86)
 # define VM_PAT		VM_ARCH_1	/* PAT reserves whole VMA at once (x86) */
 #elif defined(CONFIG_PPC)
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [RFC v3 PATCH 3/5] mm: refactor do_munmap() to extract the common part
  2018-06-29 22:39 [RFC v3 PATCH 0/5] mm: zap pages with read mmap_sem in munmap for large mapping Yang Shi
  2018-06-29 22:39 ` [RFC v3 PATCH 1/5] uprobes: make vma_has_uprobes non-static Yang Shi
  2018-06-29 22:39 ` [RFC v3 PATCH 2/5] mm: introduce VM_DEAD flag Yang Shi
@ 2018-06-29 22:39 ` Yang Shi
  2018-07-02 13:42   ` Michal Hocko
  2018-06-29 22:39 ` [RFC v3 PATCH 4/5] mm: mmap: zap pages with read mmap_sem for large mapping Yang Shi
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 44+ messages in thread
From: Yang Shi @ 2018-06-29 22:39 UTC (permalink / raw)
  To: mhocko, willy, ldufour, akpm, peterz, mingo, acme,
	alexander.shishkin, jolsa, namhyung, tglx, hpa
  Cc: yang.shi, linux-mm, x86, linux-kernel

Introduce two new helper functions:
  * munmap_addr_sanity()
  * munmap_lookup_vma()

They will be used by do_munmap() and, in a later patch, by the new munmap
path that zaps large mappings early.

There is no functional change, just code refactoring.
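
The intended calling pattern, mirroring the hunk below, is roughly:

	if (!munmap_addr_sanity(start, len))
		return -EINVAL;

	len = PAGE_ALIGN(len);
	end = start + len;

	ret = munmap_lookup_vma(mm, &vma, &prev, start, end);
	if (ret != 1)
		return ret;	/* 0: no overlapping vma, <0: split error */
	/* vma is now the first vma to unmap, prev its predecessor */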

Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
---
 mm/mmap.c | 107 ++++++++++++++++++++++++++++++++++++++++++--------------------
 1 file changed, 72 insertions(+), 35 deletions(-)

diff --git a/mm/mmap.c b/mm/mmap.c
index d1eb87e..87dcf83 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2686,34 +2686,45 @@ int split_vma(struct mm_struct *mm, struct vm_area_struct *vma,
 	return __split_vma(mm, vma, addr, new_below);
 }
 
-/* Munmap is split into 2 main parts -- this part which finds
- * what needs doing, and the areas themselves, which do the
- * work.  This now handles partial unmappings.
- * Jeremy Fitzhardinge <jeremy@goop.org>
- */
-int do_munmap(struct mm_struct *mm, unsigned long start, size_t len,
-	      struct list_head *uf)
+static inline bool munmap_addr_sanity(unsigned long start, size_t len)
 {
-	unsigned long end;
-	struct vm_area_struct *vma, *prev, *last;
+	if ((offset_in_page(start)) || start > TASK_SIZE || len > TASK_SIZE - start)
+		return false;
 
-	if ((offset_in_page(start)) || start > TASK_SIZE || len > TASK_SIZE-start)
-		return -EINVAL;
+	if (PAGE_ALIGN(len) == 0)
+		return false;
 
-	len = PAGE_ALIGN(len);
-	if (len == 0)
-		return -EINVAL;
+	return true;
+}
+
+/*
+ * munmap_lookup_vma: find the first overlapping vma and split overlapping vmas.
+ * @mm: mm_struct
+ * @vma: the first overlapping vma
+ * @prev: vma's prev
+ * @start: start address
+ * @end: end address
+ *
+ * returns 1 if successful, 0 if no overlapping vma is found, or a negative errno
+ */
+static int munmap_lookup_vma(struct mm_struct *mm, struct vm_area_struct **vma,
+			     struct vm_area_struct **prev, unsigned long start,
+			     unsigned long end)
+{
+	struct vm_area_struct *tmp, *last;
+	int ret;
 
 	/* Find the first overlapping VMA */
-	vma = find_vma(mm, start);
-	if (!vma)
+	tmp = find_vma(mm, start);
+	if (!tmp)
 		return 0;
-	prev = vma->vm_prev;
-	/* we have  start < vma->vm_end  */
+
+	*prev = tmp->vm_prev;
+
+	/* we have start < vma->vm_end  */
 
 	/* if it doesn't overlap, we have nothing.. */
-	end = start + len;
-	if (vma->vm_start >= end)
+	if (tmp->vm_start >= end)
 		return 0;
 
 	/*
@@ -2723,31 +2734,57 @@ int do_munmap(struct mm_struct *mm, unsigned long start, size_t len,
 	 * unmapped vm_area_struct will remain in use: so lower split_vma
 	 * places tmp vma above, and higher split_vma places tmp vma below.
 	 */
-	if (start > vma->vm_start) {
-		int error;
-
+	if (start > tmp->vm_start) {
 		/*
 		 * Make sure that map_count on return from munmap() will
 		 * not exceed its limit; but let map_count go just above
 		 * its limit temporarily, to help free resources as expected.
 		 */
-		if (end < vma->vm_end && mm->map_count >= sysctl_max_map_count)
+		if (end < tmp->vm_end &&
+		    mm->map_count >= sysctl_max_map_count)
 			return -ENOMEM;
 
-		error = __split_vma(mm, vma, start, 0);
-		if (error)
-			return error;
-		prev = vma;
+		ret = __split_vma(mm, tmp, start, 0);
+		if (ret)
+			return ret;
+		*prev = tmp;
 	}
 
 	/* Does it split the last one? */
 	last = find_vma(mm, end);
 	if (last && end > last->vm_start) {
-		int error = __split_vma(mm, last, end, 1);
-		if (error)
-			return error;
+		ret = __split_vma(mm, last, end, 1);
+		if (ret)
+			return ret;
 	}
-	vma = prev ? prev->vm_next : mm->mmap;
+
+	*vma = *prev ? (*prev)->vm_next : mm->mmap;
+
+	return 1;
+}
+
+/* Munmap is split into 2 main parts -- this part which finds
+ * what needs doing, and the areas themselves, which do the
+ * work.  This now handles partial unmappings.
+ * Jeremy Fitzhardinge <jeremy@goop.org>
+ */
+int do_munmap(struct mm_struct *mm, unsigned long start, size_t len,
+	      struct list_head *uf)
+{
+	unsigned long end;
+	struct vm_area_struct *vma = NULL, *prev;
+	int ret = 0;
+
+	if (!munmap_addr_sanity(start, len))
+		return -EINVAL;
+
+	len = PAGE_ALIGN(len);
+
+	end = start + len;
+
+	ret = munmap_lookup_vma(mm, &vma, &prev, start, end);
+	if (ret != 1)
+		return ret;
 
 	if (unlikely(uf)) {
 		/*
@@ -2759,9 +2796,9 @@ int do_munmap(struct mm_struct *mm, unsigned long start, size_t len,
 		 * split, despite we could. This is unlikely enough
 		 * failure that it's not worth optimizing it for.
 		 */
-		int error = userfaultfd_unmap_prep(vma, start, end, uf);
-		if (error)
-			return error;
+		ret = userfaultfd_unmap_prep(vma, start, end, uf);
+		if (ret)
+			return ret;
 	}
 
 	/*
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [RFC v3 PATCH 4/5] mm: mmap: zap pages with read mmap_sem for large mapping
  2018-06-29 22:39 [RFC v3 PATCH 0/5] mm: zap pages with read mmap_sem in munmap for large mapping Yang Shi
                   ` (2 preceding siblings ...)
  2018-06-29 22:39 ` [RFC v3 PATCH 3/5] mm: refactor do_munmap() to extract the common part Yang Shi
@ 2018-06-29 22:39 ` Yang Shi
  2018-06-30  1:28   ` Andrew Morton
                     ` (3 more replies)
  2018-06-29 22:39 ` [RFC v3 PATCH 5/5] x86: check VM_DEAD flag in page fault Yang Shi
  2018-07-02 13:39 ` [RFC v3 PATCH 0/5] mm: zap pages with read mmap_sem in munmap for large mapping Michal Hocko
  5 siblings, 4 replies; 44+ messages in thread
From: Yang Shi @ 2018-06-29 22:39 UTC (permalink / raw)
  To: mhocko, willy, ldufour, akpm, peterz, mingo, acme,
	alexander.shishkin, jolsa, namhyung, tglx, hpa
  Cc: yang.shi, linux-mm, x86, linux-kernel

When running some mmap/munmap scalability tests with large memory (i.e.
> 300GB), the below hung task issue may happen occasionally.

INFO: task ps:14018 blocked for more than 120 seconds.
       Tainted: G            E 4.9.79-009.ali3000.alios7.x86_64 #1
 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
message.
 ps              D    0 14018      1 0x00000004
  ffff885582f84000 ffff885e8682f000 ffff880972943000 ffff885ebf499bc0
  ffff8828ee120000 ffffc900349bfca8 ffffffff817154d0 0000000000000040
  00ffffff812f872a ffff885ebf499bc0 024000d000948300 ffff880972943000
 Call Trace:
  [<ffffffff817154d0>] ? __schedule+0x250/0x730
  [<ffffffff817159e6>] schedule+0x36/0x80
  [<ffffffff81718560>] rwsem_down_read_failed+0xf0/0x150
  [<ffffffff81390a28>] call_rwsem_down_read_failed+0x18/0x30
  [<ffffffff81717db0>] down_read+0x20/0x40
  [<ffffffff812b9439>] proc_pid_cmdline_read+0xd9/0x4e0
  [<ffffffff81253c95>] ? do_filp_open+0xa5/0x100
  [<ffffffff81241d87>] __vfs_read+0x37/0x150
  [<ffffffff812f824b>] ? security_file_permission+0x9b/0xc0
  [<ffffffff81242266>] vfs_read+0x96/0x130
  [<ffffffff812437b5>] SyS_read+0x55/0xc0
  [<ffffffff8171a6da>] entry_SYSCALL_64_fastpath+0x1a/0xc5

It is because munmap holds mmap_sem from the very beginning all the way down
to the end, and doesn't release it in the middle. When unmapping a large
mapping, it may take a long time (~18 seconds to unmap a 320GB mapping with
every single page mapped, on an idle machine).

Zapping pages is the most time consuming part. According to the suggestion
from Michal Hocko [1], zapping pages can be done while holding read mmap_sem,
like what MADV_DONTNEED does. Then re-acquire write mmap_sem to clean up vmas.
All zapped vmas will have the VM_DEAD flag set; a page fault on a VM_DEAD vma
will trigger SIGSEGV.

Define the large mapping size thresh as PUD size or 1GB; just zap pages with
read mmap_sem for mappings which are >= the thresh value.

If the vma has VM_LOCKED | VM_HUGETLB | VM_PFNMAP or has uprobes, then just
fall back to the regular path, since unmapping those mappings needs to acquire
write mmap_sem.
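
In other words, a vma qualifies for the early zap only if a check like the
following returns true (hypothetical helper name; the patch open-codes this
via can_madv_dontneed_vma() and vma_has_uprobes()):

	static bool can_zap_with_read_sem(struct vm_area_struct *vma,
					  unsigned long start, unsigned long end)
	{
		/* these need write mmap_sem to unmap, so take the regular path */
		if (vma->vm_flags & (VM_LOCKED | VM_HUGETLB | VM_PFNMAP))
			return false;
		return !vma_has_uprobes(vma, start, end);
	}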

For the time being, just do this in the munmap syscall path. Other
vm_munmap() and do_munmap() call sites remain intact for stability
reasons.

The below is some regression and performance data collected on a machine
with 32 cores of E5-2680 @ 2.70GHz and 384GB memory.

With the patched kernel, the write mmap_sem hold time drops from seconds
to the microsecond level.

[1] https://lwn.net/Articles/753269/

Cc: Michal Hocko <mhocko@kernel.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Laurent Dufour <ldufour@linux.vnet.ibm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
---
 mm/mmap.c | 136 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 134 insertions(+), 2 deletions(-)

diff --git a/mm/mmap.c b/mm/mmap.c
index 87dcf83..d61e08b 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2763,6 +2763,128 @@ static int munmap_lookup_vma(struct mm_struct *mm, struct vm_area_struct **vma,
 	return 1;
 }
 
+/* Consider PUD size or 1GB mapping as large mapping */
+#ifdef HPAGE_PUD_SIZE
+#define LARGE_MAP_THRESH	HPAGE_PUD_SIZE
+#else
+#define LARGE_MAP_THRESH	(1 * 1024 * 1024 * 1024)
+#endif
+
+/* Unmap large mapping early with acquiring read mmap_sem */
+static int do_munmap_zap_early(struct mm_struct *mm, unsigned long start,
+			       size_t len, struct list_head *uf)
+{
+	unsigned long end = 0;
+	struct vm_area_struct *vma = NULL, *prev, *tmp;
+	bool success = false;
+	int ret = 0;
+
+	if (!munmap_addr_sanity(start, len))
+		return -EINVAL;
+
+	len = PAGE_ALIGN(len);
+
+	end = start + len;
+
+	/* Just deal with uf in regular path */
+	if (unlikely(uf))
+		goto regular_path;
+
+	if (len >= LARGE_MAP_THRESH) {
+		/*
+		 * need write mmap_sem to split vma and set VM_DEAD flag
+		 * splitting vma up-front to save PITA to clean if it is failed
+		 */
+		down_write(&mm->mmap_sem);
+		ret = munmap_lookup_vma(mm, &vma, &prev, start, end);
+		if (ret != 1) {
+			up_write(&mm->mmap_sem);
+			return ret;
+		}
+		/* This ret value might be returned, so reset it */
+		ret = 0;
+
+		/*
+		 * Unmapping vmas, which has VM_LOCKED|VM_HUGETLB|VM_PFNMAP
+		 * flag set or has uprobes set, need acquire write map_sem,
+		 * so skip them in early zap. Just deal with such mapping in
+		 * regular path.
+		 * Borrow can_madv_dontneed_vma() to check the conditions.
+		 */
+		tmp = vma;
+		while (tmp && tmp->vm_start < end) {
+			if (!can_madv_dontneed_vma(tmp) ||
+			    vma_has_uprobes(tmp, start, end)) {
+				up_write(&mm->mmap_sem);
+				goto regular_path;
+			}
+			tmp = tmp->vm_next;
+		}
+		/*
+		 * set VM_DEAD flag before tear down them.
+		 * page fault on VM_DEAD vma will trigger SIGSEGV.
+		 */
+		tmp = vma;
+		for ( ; tmp && tmp->vm_start < end; tmp = tmp->vm_next)
+			tmp->vm_flags |= VM_DEAD;
+		up_write(&mm->mmap_sem);
+
+		/* zap mappings with read mmap_sem */
+		down_read(&mm->mmap_sem);
+		zap_page_range(vma, start, len);
+		/* indicates early zap is success */
+		success = true;
+		up_read(&mm->mmap_sem);
+	}
+
+regular_path:
+	/* hold write mmap_sem for vma manipulation or regular path */
+	if (down_write_killable(&mm->mmap_sem))
+		return -EINTR;
+	if (success) {
+		/* vmas have been zapped, here clean up pgtable and vmas */
+		struct vm_area_struct *next = prev ? prev->vm_next : mm->mmap;
+		struct mmu_gather tlb;
+		tlb_gather_mmu(&tlb, mm, start, end);
+		free_pgtables(&tlb, vma, prev ? prev->vm_end : FIRST_USER_ADDRESS,
+			      next ? next->vm_start : USER_PGTABLES_CEILING);
+		tlb_finish_mmu(&tlb, start, end);
+
+		detach_vmas_to_be_unmapped(mm, vma, prev, end);
+		arch_unmap(mm, vma, start, end);
+		remove_vma_list(mm, vma);
+	} else {
+		/* vma is VM_LOCKED|VM_HUGETLB|VM_PFNMAP or has uprobe */
+		if (vma) {
+			if (unlikely(uf)) {
+				int ret = userfaultfd_unmap_prep(vma, start,
+								 end, uf);
+				if (ret)
+					goto out;
+			}
+			if (mm->locked_vm) {
+				tmp = vma;
+				while (tmp && tmp->vm_start < end) {
+					if (tmp->vm_flags & VM_LOCKED) {
+						mm->locked_vm -= vma_pages(tmp);
+						munlock_vma_pages_all(tmp);
+					}
+					tmp = tmp->vm_next;
+				}
+			}
+			detach_vmas_to_be_unmapped(mm, vma, prev, end);
+			unmap_region(mm, vma, prev, start, end);
+			remove_vma_list(mm, vma);
+		} else
+			/* When mapping size < LARGE_MAP_THRESH */
+			ret = do_munmap(mm, start, len, uf);
+	}
+
+out:
+	up_write(&mm->mmap_sem);
+	return ret;
+}
+
 /* Munmap is split into 2 main parts -- this part which finds
  * what needs doing, and the areas themselves, which do the
  * work.  This now handles partial unmappings.
@@ -2829,6 +2951,17 @@ int do_munmap(struct mm_struct *mm, unsigned long start, size_t len,
 	return 0;
 }
 
+static int vm_munmap_zap_early(unsigned long start, size_t len)
+{
+	int ret;
+	struct mm_struct *mm = current->mm;
+	LIST_HEAD(uf);
+
+	ret = do_munmap_zap_early(mm, start, len, &uf);
+	userfaultfd_unmap_complete(mm, &uf);
+	return ret;
+}
+
 int vm_munmap(unsigned long start, size_t len)
 {
 	int ret;
@@ -2848,10 +2981,9 @@ int vm_munmap(unsigned long start, size_t len)
 SYSCALL_DEFINE2(munmap, unsigned long, addr, size_t, len)
 {
 	profile_munmap(addr);
-	return vm_munmap(addr, len);
+	return vm_munmap_zap_early(addr, len);
 }
 
-
 /*
  * Emulation of deprecated remap_file_pages() syscall.
  */
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [RFC v3 PATCH 5/5] x86: check VM_DEAD flag in page fault
  2018-06-29 22:39 [RFC v3 PATCH 0/5] mm: zap pages with read mmap_sem in munmap for large mapping Yang Shi
                   ` (3 preceding siblings ...)
  2018-06-29 22:39 ` [RFC v3 PATCH 4/5] mm: mmap: zap pages with read mmap_sem for large mapping Yang Shi
@ 2018-06-29 22:39 ` Yang Shi
  2018-07-02  8:45   ` Laurent Dufour
  2018-07-02 13:39 ` [RFC v3 PATCH 0/5] mm: zap pages with read mmap_sem in munmap for large mapping Michal Hocko
  5 siblings, 1 reply; 44+ messages in thread
From: Yang Shi @ 2018-06-29 22:39 UTC (permalink / raw)
  To: mhocko, willy, ldufour, akpm, peterz, mingo, acme,
	alexander.shishkin, jolsa, namhyung, tglx, hpa
  Cc: yang.shi, linux-mm, x86, linux-kernel

Check VM_DEAD flag of vma in page fault handler, if it is set, trigger
SIGSEGV.

Cc: Michal Hocko <mhocko@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
---
 arch/x86/mm/fault.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 9a84a0d..3fd2da5 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -1357,6 +1357,10 @@ static inline bool smap_violation(int error_code, struct pt_regs *regs)
 		bad_area(regs, error_code, address);
 		return;
 	}
+	if (unlikely(vma->vm_flags & VM_DEAD)) {
+		bad_area(regs, error_code, address);
+		return;
+	}
 	if (error_code & X86_PF_USER) {
 		/*
 		 * Accessing the stack below %sp is always a bug.
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* Re: [RFC v3 PATCH 4/5] mm: mmap: zap pages with read mmap_sem for large mapping
  2018-06-29 22:39 ` [RFC v3 PATCH 4/5] mm: mmap: zap pages with read mmap_sem for large mapping Yang Shi
@ 2018-06-30  1:28   ` Andrew Morton
  2018-06-30  2:10     ` Yang Shi
  2018-06-30  1:35   ` Andrew Morton
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 44+ messages in thread
From: Andrew Morton @ 2018-06-30  1:28 UTC (permalink / raw)
  To: Yang Shi
  Cc: mhocko, willy, ldufour, peterz, mingo, acme, alexander.shishkin,
	jolsa, namhyung, tglx, hpa, linux-mm, x86, linux-kernel

On Sat, 30 Jun 2018 06:39:44 +0800 Yang Shi <yang.shi@linux.alibaba.com> wrote:

> When running some mmap/munmap scalability tests with large memory (i.e.
> > 300GB), the below hung task issue may happen occasionally.
> 
> INFO: task ps:14018 blocked for more than 120 seconds.
>        Tainted: G            E 4.9.79-009.ali3000.alios7.x86_64 #1
>  "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
> message.
>  ps              D    0 14018      1 0x00000004
>   ffff885582f84000 ffff885e8682f000 ffff880972943000 ffff885ebf499bc0
>   ffff8828ee120000 ffffc900349bfca8 ffffffff817154d0 0000000000000040
>   00ffffff812f872a ffff885ebf499bc0 024000d000948300 ffff880972943000
>  Call Trace:
>   [<ffffffff817154d0>] ? __schedule+0x250/0x730
>   [<ffffffff817159e6>] schedule+0x36/0x80
>   [<ffffffff81718560>] rwsem_down_read_failed+0xf0/0x150
>   [<ffffffff81390a28>] call_rwsem_down_read_failed+0x18/0x30
>   [<ffffffff81717db0>] down_read+0x20/0x40
>   [<ffffffff812b9439>] proc_pid_cmdline_read+0xd9/0x4e0
>   [<ffffffff81253c95>] ? do_filp_open+0xa5/0x100
>   [<ffffffff81241d87>] __vfs_read+0x37/0x150
>   [<ffffffff812f824b>] ? security_file_permission+0x9b/0xc0
>   [<ffffffff81242266>] vfs_read+0x96/0x130
>   [<ffffffff812437b5>] SyS_read+0x55/0xc0
>   [<ffffffff8171a6da>] entry_SYSCALL_64_fastpath+0x1a/0xc5
> 
> It is because munmap holds mmap_sem from very beginning to all the way
> down to the end, and doesn't release it in the middle. When unmapping
> large mapping, it may take long time (take ~18 seconds to unmap 320GB
> mapping with every single page mapped on an idle machine).
> 
> Zapping pages is the most time consuming part, according to the
> suggestion from Michal Hocko [1], zapping pages can be done with holding
> read mmap_sem, like what MADV_DONTNEED does. Then re-acquire write
> mmap_sem to cleanup vmas. All zapped vmas will have VM_DEAD flag set,
> the page fault to VM_DEAD vma will trigger SIGSEGV.
> 
> Define large mapping size thresh as PUD size or 1GB, just zap pages with
> read mmap_sem for mappings which are >= thresh value.

Perhaps it would be better to treat all mappings in this fashion,
regardless of size.  Simpler code, shorter mmap_sem hold times, much
better testing coverage.  Is there any particular downside to doing
this?

> If the vma has VM_LOCKED | VM_HUGETLB | VM_PFNMAP or uprobe, then just
> fallback to regular path since unmapping those mappings need acquire
> write mmap_sem.

So we'll still get huge latencies and softlockup warnings for some
usecases.  This is a problem!

> For the time being, just do this in munmap syscall path. Other
> vm_munmap() or do_munmap() call sites remain intact for stability
> reason.
> 
> The below is some regression and performance data collected on a machine
> with 32 cores of E5-2680 @ 2.70GHz and 384GB memory.

Where is this "regression and performance data"?  Something missing from
the changelog?

> With the patched kernel, write mmap_sem hold time is dropped to us level
> from second.
> 
> [1] https://lwn.net/Articles/753269/
> 
> ...
>
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -2763,6 +2763,128 @@ static int munmap_lookup_vma(struct mm_struct *mm, struct vm_area_struct **vma,
>  	return 1;
>  }
>  
> +/* Consider PUD size or 1GB mapping as large mapping */
> +#ifdef HPAGE_PUD_SIZE
> +#define LARGE_MAP_THRESH	HPAGE_PUD_SIZE
> +#else
> +#define LARGE_MAP_THRESH	(1 * 1024 * 1024 * 1024)
> +#endif
> +
> +/* Unmap large mapping early with acquiring read mmap_sem */
> +static int do_munmap_zap_early(struct mm_struct *mm, unsigned long start,
> +			       size_t len, struct list_head *uf)

Can we have a comment describing what `uf' is and what it does? (at least)

> +{
> +	unsigned long end = 0;
> +	struct vm_area_struct *vma = NULL, *prev, *tmp;

`tmp' is a poor choice of identifier - it communicates neither
the variable's type nor its purpose.

Perhaps rename vma to start_vma(?) and rename tmp to vma?

And declaring start_vma to be const would be a nice readability addition.

> +	bool success = false;
> +	int ret = 0;
> +
> +	if (!munmap_addr_sanity(start, len))
> +		return -EINVAL;
> +
> +	len = PAGE_ALIGN(len);
> +
> +	end = start + len;
> +
> +	/* Just deal with uf in regular path */
> +	if (unlikely(uf))
> +		goto regular_path;
> +
> +	if (len >= LARGE_MAP_THRESH) {
> +		/*
> +		 * need write mmap_sem to split vma and set VM_DEAD flag
> +		 * splitting vma up-front to save PITA to clean if it is failed
> +		 */
> +		down_write(&mm->mmap_sem);
> +		ret = munmap_lookup_vma(mm, &vma, &prev, start, end);
> +		if (ret != 1) {
> +			up_write(&mm->mmap_sem);
> +			return ret;

Can just use `goto out' here, and that would avoid the unpleasing use
of a deeply embedded `return'.

> +		}
> +		/* This ret value might be returned, so reset it */
> +		ret = 0;
> +
> +		/*
> +		 * Unmapping vmas, which has VM_LOCKED|VM_HUGETLB|VM_PFNMAP
> +		 * flag set or has uprobes set, need acquire write map_sem,
> +		 * so skip them in early zap. Just deal with such mapping in
> +		 * regular path.

For each case, please describe *why* mmap_sem must be held for writing.

> +		 * Borrow can_madv_dontneed_vma() to check the conditions.
> +		 */
> +		tmp = vma;
> +		while (tmp && tmp->vm_start < end) {
> +			if (!can_madv_dontneed_vma(tmp) ||
> +			    vma_has_uprobes(tmp, start, end)) {
> +				up_write(&mm->mmap_sem);
> +				goto regular_path;
> +			}
> +			tmp = tmp->vm_next;
> +		}
> +		/*
> +		 * set VM_DEAD flag before tear down them.
> +		 * page fault on VM_DEAD vma will trigger SIGSEGV.
> +		 */
> +		tmp = vma;
> +		for ( ; tmp && tmp->vm_start < end; tmp = tmp->vm_next)
> +			tmp->vm_flags |= VM_DEAD;
> +		up_write(&mm->mmap_sem);
> +
> +		/* zap mappings with read mmap_sem */
> +		down_read(&mm->mmap_sem);

Use downgrade_write()?

> +		zap_page_range(vma, start, len);
> +		/* indicates early zap is success */
> +		success = true;
> +		up_read(&mm->mmap_sem);
> +	}
> +
> +regular_path:
> +	/* hold write mmap_sem for vma manipulation or regular path */
> +	if (down_write_killable(&mm->mmap_sem))
> +		return -EINTR;

Why is this _killable() while the preceding down_write() was not?

> +	if (success) {
> +		/* vmas have been zapped, here clean up pgtable and vmas */
> +		struct vm_area_struct *next = prev ? prev->vm_next : mm->mmap;
> +		struct mmu_gather tlb;
> +		tlb_gather_mmu(&tlb, mm, start, end);
> +		free_pgtables(&tlb, vma, prev ? prev->vm_end : FIRST_USER_ADDRESS,
> +			      next ? next->vm_start : USER_PGTABLES_CEILING);
> +		tlb_finish_mmu(&tlb, start, end);
> +
> +		detach_vmas_to_be_unmapped(mm, vma, prev, end);
> +		arch_unmap(mm, vma, start, end);
> +		remove_vma_list(mm, vma);
> +	} else {
> +		/* vma is VM_LOCKED|VM_HUGETLB|VM_PFNMAP or has uprobe */
> +		if (vma) {
> +			if (unlikely(uf)) {
> +				int ret = userfaultfd_unmap_prep(vma, start,
> +								 end, uf);
> +				if (ret)
> +					goto out;

Bug?  This `ret' shadows the other `ret' in this function.

> +			}
> +			if (mm->locked_vm) {
> +				tmp = vma;
> +				while (tmp && tmp->vm_start < end) {
> +					if (tmp->vm_flags & VM_LOCKED) {
> +						mm->locked_vm -= vma_pages(tmp);
> +						munlock_vma_pages_all(tmp);
> +					}
> +					tmp = tmp->vm_next;
> +				}
> +			}
> +			detach_vmas_to_be_unmapped(mm, vma, prev, end);
> +			unmap_region(mm, vma, prev, start, end);
> +			remove_vma_list(mm, vma);
> +		} else
> +			/* When mapping size < LARGE_MAP_THRESH */
> +			ret = do_munmap(mm, start, len, uf);
> +	}
> +
> +out:
> +	up_write(&mm->mmap_sem);
> +	return ret;
> +}
> +
>  /* Munmap is split into 2 main parts -- this part which finds
>   * what needs doing, and the areas themselves, which do the
>   * work.  This now handles partial unmappings.
> @@ -2829,6 +2951,17 @@ int do_munmap(struct mm_struct *mm, unsigned long start, size_t len,
>  	return 0;
>  }
>  
> +static int vm_munmap_zap_early(unsigned long start, size_t len)
> +{
> +	int ret;
> +	struct mm_struct *mm = current->mm;
> +	LIST_HEAD(uf);
> +
> +	ret = do_munmap_zap_early(mm, start, len, &uf);
> +	userfaultfd_unmap_complete(mm, &uf);
> +	return ret;
> +}
> +
>  int vm_munmap(unsigned long start, size_t len)
>  {
>  	int ret;
> @@ -2848,10 +2981,9 @@ int vm_munmap(unsigned long start, size_t len)
>  SYSCALL_DEFINE2(munmap, unsigned long, addr, size_t, len)
>  {
>  	profile_munmap(addr);
> -	return vm_munmap(addr, len);
> +	return vm_munmap_zap_early(addr, len);
>  }
>  
> -
>  /*
>   * Emulation of deprecated remap_file_pages() syscall.
>   */
> -- 
> 1.8.3.1

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC v3 PATCH 4/5] mm: mmap: zap pages with read mmap_sem for large mapping
  2018-06-29 22:39 ` [RFC v3 PATCH 4/5] mm: mmap: zap pages with read mmap_sem for large mapping Yang Shi
  2018-06-30  1:28   ` Andrew Morton
@ 2018-06-30  1:35   ` Andrew Morton
  2018-06-30  2:28     ` Yang Shi
  2018-07-02 12:33   ` Kirill A. Shutemov
  2018-07-02 13:53   ` Michal Hocko
  3 siblings, 1 reply; 44+ messages in thread
From: Andrew Morton @ 2018-06-30  1:35 UTC (permalink / raw)
  To: Yang Shi
  Cc: mhocko, willy, ldufour, peterz, mingo, acme, alexander.shishkin,
	jolsa, namhyung, tglx, hpa, linux-mm, x86, linux-kernel

On Sat, 30 Jun 2018 06:39:44 +0800 Yang Shi <yang.shi@linux.alibaba.com> wrote:


And...

> diff --git a/mm/mmap.c b/mm/mmap.c
> index 87dcf83..d61e08b 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -2763,6 +2763,128 @@ static int munmap_lookup_vma(struct mm_struct *mm, struct vm_area_struct **vma,
>  	return 1;
>  }
>  
> +/* Consider PUD size or 1GB mapping as large mapping */
> +#ifdef HPAGE_PUD_SIZE
> +#define LARGE_MAP_THRESH	HPAGE_PUD_SIZE
> +#else
> +#define LARGE_MAP_THRESH	(1 * 1024 * 1024 * 1024)
> +#endif

So this assumes that 32-bit machines cannot have 1GB mappings (fair
enough) and this is the sole means by which we avoid falling into the
"len >= LARGE_MAP_THRESH" codepath, which will behave very badly, at
least because for such machines, VM_DEAD is zero.

This is rather ugly and fragile.  And, I guess, explains why we can't
give all mappings this treatment: 32-bit machines can't do it.  And
we're adding a bunch of code to 32-bit kernels which will never be
executed.

I'm thinking it would be better to be much more explicit with "#ifdef
CONFIG_64BIT" in this code, rather than relying upon the above magic.
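
Something like this, perhaps (untested sketch of that idea):

	#ifdef CONFIG_64BIT
	#define LARGE_MAP_THRESH	PUD_SIZE
	#else
	/* the early-zap path is never taken on 32 bit */
	#define LARGE_MAP_THRESH	ULONG_MAX
	#endif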

But I tend to think that the fact that we haven't solved anything on
locked vmas or on uprobed mappings is a showstopper for the whole
approach :(



^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC v3 PATCH 4/5] mm: mmap: zap pages with read mmap_sem for large mapping
  2018-06-30  1:28   ` Andrew Morton
@ 2018-06-30  2:10     ` Yang Shi
  0 siblings, 0 replies; 44+ messages in thread
From: Yang Shi @ 2018-06-30  2:10 UTC (permalink / raw)
  To: Andrew Morton
  Cc: mhocko, willy, ldufour, peterz, mingo, acme, alexander.shishkin,
	jolsa, namhyung, tglx, hpa, linux-mm, x86, linux-kernel



On 6/29/18 6:28 PM, Andrew Morton wrote:
> On Sat, 30 Jun 2018 06:39:44 +0800 Yang Shi <yang.shi@linux.alibaba.com> wrote:
>
>> When running some mmap/munmap scalability tests with large memory (i.e.
>>> 300GB), the below hung task issue may happen occasionally.
>> INFO: task ps:14018 blocked for more than 120 seconds.
>>         Tainted: G            E 4.9.79-009.ali3000.alios7.x86_64 #1
>>   "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
>> message.
>>   ps              D    0 14018      1 0x00000004
>>    ffff885582f84000 ffff885e8682f000 ffff880972943000 ffff885ebf499bc0
>>    ffff8828ee120000 ffffc900349bfca8 ffffffff817154d0 0000000000000040
>>    00ffffff812f872a ffff885ebf499bc0 024000d000948300 ffff880972943000
>>   Call Trace:
>>    [<ffffffff817154d0>] ? __schedule+0x250/0x730
>>    [<ffffffff817159e6>] schedule+0x36/0x80
>>    [<ffffffff81718560>] rwsem_down_read_failed+0xf0/0x150
>>    [<ffffffff81390a28>] call_rwsem_down_read_failed+0x18/0x30
>>    [<ffffffff81717db0>] down_read+0x20/0x40
>>    [<ffffffff812b9439>] proc_pid_cmdline_read+0xd9/0x4e0
>>    [<ffffffff81253c95>] ? do_filp_open+0xa5/0x100
>>    [<ffffffff81241d87>] __vfs_read+0x37/0x150
>>    [<ffffffff812f824b>] ? security_file_permission+0x9b/0xc0
>>    [<ffffffff81242266>] vfs_read+0x96/0x130
>>    [<ffffffff812437b5>] SyS_read+0x55/0xc0
>>    [<ffffffff8171a6da>] entry_SYSCALL_64_fastpath+0x1a/0xc5
>>
>> It is because munmap holds mmap_sem from very beginning to all the way
>> down to the end, and doesn't release it in the middle. When unmapping
>> large mapping, it may take long time (take ~18 seconds to unmap 320GB
>> mapping with every single page mapped on an idle machine).
>>
>> Zapping pages is the most time consuming part, according to the
>> suggestion from Michal Hocko [1], zapping pages can be done with holding
>> read mmap_sem, like what MADV_DONTNEED does. Then re-acquire write
>> mmap_sem to cleanup vmas. All zapped vmas will have VM_DEAD flag set,
>> the page fault to VM_DEAD vma will trigger SIGSEGV.
>>
>> Define large mapping size thresh as PUD size or 1GB, just zap pages with
>> read mmap_sem for mappings which are >= thresh value.
> Perhaps it would be better to treat all mappings in this fashion,
> regardless of size.  Simpler code, shorter mmap_sem hold times, much
> better testing coverage.  Is there any particular downside to doing
> this?

Actually, my testing uses a 4K size to improve the coverage. The only
downside AFAICS is that the cost of repeatedly acquiring/releasing the lock
may outweigh the benefits.

>
>> If the vma has VM_LOCKED | VM_HUGETLB | VM_PFNMAP or uprobe, then just
>> fallback to regular path since unmapping those mappings need acquire
>> write mmap_sem.
> So we'll still get huge latencies and softlockup warnings for some
> usecases.  This is a problem!

Because unmapping such areas needs to modify vm_flags, which needs write
mmap_sem in the current code base.

>
>> For the time being, just do this in munmap syscall path. Other
>> vm_munmap() or do_munmap() call sites remain intact for stability
>> reason.
>>
>> The below is some regression and performance data collected on a machine
>> with 32 cores of E5-2680 @ 2.70GHz and 384GB memory.
> Where is this "regression and performance data"?  Something missing from
> the changelog?

oops, it might have been removed inadvertently.

>> With the patched kernel, write mmap_sem hold time is dropped to us level
>> from second.
>>
>> [1] https://lwn.net/Articles/753269/
>>
>> ...
>>
>> --- a/mm/mmap.c
>> +++ b/mm/mmap.c
>> @@ -2763,6 +2763,128 @@ static int munmap_lookup_vma(struct mm_struct *mm, struct vm_area_struct **vma,
>>   	return 1;
>>   }
>>   
>> +/* Consider PUD size or 1GB mapping as large mapping */
>> +#ifdef HPAGE_PUD_SIZE
>> +#define LARGE_MAP_THRESH	HPAGE_PUD_SIZE
>> +#else
>> +#define LARGE_MAP_THRESH	(1 * 1024 * 1024 * 1024)
>> +#endif
>> +
>> +/* Unmap large mapping early with acquiring read mmap_sem */
>> +static int do_munmap_zap_early(struct mm_struct *mm, unsigned long start,
>> +			       size_t len, struct list_head *uf)
> Can we have a comment describing what `uf' is and what it does? (at least)

Sure.

>
>> +{
>> +	unsigned long end = 0;
>> +	struct vm_area_struct *vma = NULL, *prev, *tmp;
> `tmp' is a poor choice of identifier - it communicates neither
> the variable's type nor its purpose.
>
> Perhaps rename vma to start_vma(?) and rename tmp to vma?
>
> And declaring start_vma to be const would be a nice readability addition.

Sounds ok.

>
>> +	bool success = false;
>> +	int ret = 0;
>> +
>> +	if (!munmap_addr_sanity(start, len))
>> +		return -EINVAL;
>> +
>> +	len = PAGE_ALIGN(len);
>> +
>> +	end = start + len;
>> +
>> +	/* Just deal with uf in regular path */
>> +	if (unlikely(uf))
>> +		goto regular_path;
>> +
>> +	if (len >= LARGE_MAP_THRESH) {
>> +		/*
>> +		 * need write mmap_sem to split vma and set VM_DEAD flag
>> +		 * splitting vma up-front to save PITA to clean if it is failed
>> +		 */
>> +		down_write(&mm->mmap_sem);
>> +		ret = munmap_lookup_vma(mm, &vma, &prev, start, end);
>> +		if (ret != 1) {
>> +			up_write(&mm->mmap_sem);
>> +			return ret;
> Can just use `goto out' here, and that would avoid the unpleasing use
> of a deeply embedded `return'.

Yes.

>
>> +		}
>> +		/* This ret value might be returned, so reset it */
>> +		ret = 0;
>> +
>> +		/*
>> +		 * Unmapping vmas, which has VM_LOCKED|VM_HUGETLB|VM_PFNMAP
>> +		 * flag set or has uprobes set, need acquire write map_sem,
>> +		 * so skip them in early zap. Just deal with such mapping in
>> +		 * regular path.
> For each case, please describe *why* mmap_sem must be held for writing.

Sure.

>
>> +		 * Borrow can_madv_dontneed_vma() to check the conditions.
>> +		 */
>> +		tmp = vma;
>> +		while (tmp && tmp->vm_start < end) {
>> +			if (!can_madv_dontneed_vma(tmp) ||
>> +			    vma_has_uprobes(tmp, start, end)) {
>> +				up_write(&mm->mmap_sem);
>> +				goto regular_path;
>> +			}
>> +			tmp = tmp->vm_next;
>> +		}
>> +		/*
>> +		 * set VM_DEAD flag before tear down them.
>> +		 * page fault on VM_DEAD vma will trigger SIGSEGV.
>> +		 */
>> +		tmp = vma;
>> +		for ( ; tmp && tmp->vm_start < end; tmp = tmp->vm_next)
>> +			tmp->vm_flags |= VM_DEAD;
>> +		up_write(&mm->mmap_sem);
>> +
>> +		/* zap mappings with read mmap_sem */
>> +		down_read(&mm->mmap_sem);
> Use downgrade_write()?

Aha, thanks for the reminder. I forgot this; I just focused on the
"upgrade_read" part suggested by Michal.
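
Something like this, I think (untested sketch, not what v3 does):

	down_write(&mm->mmap_sem);
	/* ... split vmas and set VM_DEAD ... */
	downgrade_write(&mm->mmap_sem);	/* atomically convert to a read lock */
	zap_page_range(vma, start, len);
	up_read(&mm->mmap_sem);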

>
>> +		zap_page_range(vma, start, len);
>> +		/* indicates early zap is success */
>> +		success = true;
>> +		up_read(&mm->mmap_sem);
>> +	}
>> +
>> +regular_path:
>> +	/* hold write mmap_sem for vma manipulation or regular path */
>> +	if (down_write_killable(&mm->mmap_sem))
>> +		return -EINTR;
> Why is this _killable() while the preceding down_write() was not?

This is copied from the original do_munmap(). The preceding one could be 
_killable() too.

>
>> +	if (success) {
>> +		/* vmas have been zapped, here clean up pgtable and vmas */
>> +		struct vm_area_struct *next = prev ? prev->vm_next : mm->mmap;
>> +		struct mmu_gather tlb;
>> +		tlb_gather_mmu(&tlb, mm, start, end);
>> +		free_pgtables(&tlb, vma, prev ? prev->vm_end : FIRST_USER_ADDRESS,
>> +			      next ? next->vm_start : USER_PGTABLES_CEILING);
>> +		tlb_finish_mmu(&tlb, start, end);
>> +
>> +		detach_vmas_to_be_unmapped(mm, vma, prev, end);
>> +		arch_unmap(mm, vma, start, end);
>> +		remove_vma_list(mm, vma);
>> +	} else {
>> +		/* vma is VM_LOCKED|VM_HUGETLB|VM_PFNMAP or has uprobe */
>> +		if (vma) {
>> +			if (unlikely(uf)) {
>> +				int ret = userfaultfd_unmap_prep(vma, start,
>> +								 end, uf);
>> +				if (ret)
>> +					goto out;
> Bug?  This `ret' shadows the other `ret' in this function.

oops, it should just use the same "ret".

Thanks,
Yang

>
>> +			}
>> +			if (mm->locked_vm) {
>> +				tmp = vma;
>> +				while (tmp && tmp->vm_start < end) {
>> +					if (tmp->vm_flags & VM_LOCKED) {
>> +						mm->locked_vm -= vma_pages(tmp);
>> +						munlock_vma_pages_all(tmp);
>> +					}
>> +					tmp = tmp->vm_next;
>> +				}
>> +			}
>> +			detach_vmas_to_be_unmapped(mm, vma, prev, end);
>> +			unmap_region(mm, vma, prev, start, end);
>> +			remove_vma_list(mm, vma);
>> +		} else
>> +			/* When mapping size < LARGE_MAP_THRESH */
>> +			ret = do_munmap(mm, start, len, uf);
>> +	}
>> +
>> +out:
>> +	up_write(&mm->mmap_sem);
>> +	return ret;
>> +}
>> +
>>   /* Munmap is split into 2 main parts -- this part which finds
>>    * what needs doing, and the areas themselves, which do the
>>    * work.  This now handles partial unmappings.
>> @@ -2829,6 +2951,17 @@ int do_munmap(struct mm_struct *mm, unsigned long start, size_t len,
>>   	return 0;
>>   }
>>   
>> +static int vm_munmap_zap_early(unsigned long start, size_t len)
>> +{
>> +	int ret;
>> +	struct mm_struct *mm = current->mm;
>> +	LIST_HEAD(uf);
>> +
>> +	ret = do_munmap_zap_early(mm, start, len, &uf);
>> +	userfaultfd_unmap_complete(mm, &uf);
>> +	return ret;
>> +}
>> +
>>   int vm_munmap(unsigned long start, size_t len)
>>   {
>>   	int ret;
>> @@ -2848,10 +2981,9 @@ int vm_munmap(unsigned long start, size_t len)
>>   SYSCALL_DEFINE2(munmap, unsigned long, addr, size_t, len)
>>   {
>>   	profile_munmap(addr);
>> -	return vm_munmap(addr, len);
>> +	return vm_munmap_zap_early(addr, len);
>>   }
>>   
>> -
>>   /*
>>    * Emulation of deprecated remap_file_pages() syscall.
>>    */
>> -- 
>> 1.8.3.1


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC v3 PATCH 4/5] mm: mmap: zap pages with read mmap_sem for large mapping
  2018-06-30  1:35   ` Andrew Morton
@ 2018-06-30  2:28     ` Yang Shi
  2018-06-30  3:15       ` Andrew Morton
  0 siblings, 1 reply; 44+ messages in thread
From: Yang Shi @ 2018-06-30  2:28 UTC (permalink / raw)
  To: Andrew Morton
  Cc: mhocko, willy, ldufour, peterz, mingo, acme, alexander.shishkin,
	jolsa, namhyung, tglx, hpa, linux-mm, x86, linux-kernel



On 6/29/18 6:35 PM, Andrew Morton wrote:
> On Sat, 30 Jun 2018 06:39:44 +0800 Yang Shi <yang.shi@linux.alibaba.com> wrote:
>
>
> And...
>
>> diff --git a/mm/mmap.c b/mm/mmap.c
>> index 87dcf83..d61e08b 100644
>> --- a/mm/mmap.c
>> +++ b/mm/mmap.c
>> @@ -2763,6 +2763,128 @@ static int munmap_lookup_vma(struct mm_struct *mm, struct vm_area_struct **vma,
>>   	return 1;
>>   }
>>   
>> +/* Consider PUD size or 1GB mapping as large mapping */
>> +#ifdef HPAGE_PUD_SIZE
>> +#define LARGE_MAP_THRESH	HPAGE_PUD_SIZE
>> +#else
>> +#define LARGE_MAP_THRESH	(1 * 1024 * 1024 * 1024)
>> +#endif
> So this assumes that 32-bit machines cannot have 1GB mappings (fair
> enough) and this is the sole means by which we avoid falling into the
> "len >= LARGE_MAP_THRESH" codepath, which will behave very badly, at
> least because for such machines, VM_DEAD is zero.
>
> This is rather ugly and fragile.  And, I guess, explains why we can't
> give all mappings this treatment: 32-bit machines can't do it.  And
> we're adding a bunch of code to 32-bit kernels which will never be
> executed.
>
> I'm thinking it would be better to be much more explicit with "#ifdef
> CONFIG_64BIT" in this code, rather than relying upon the above magic.
>
> But I tend to think that the fact that we haven't solved anything on
> locked vmas or on uprobed mappings is a showstopper for the whole
> approach :(

I agree it is not perfect. But it could still improve most use cases.

For the locked vmas and hugetlb vmas, the unmapping operations need to modify
vm_flags. But I'm wondering whether we might be able to separate the unmap
from the vm_flags update. Because we know they will be unmapped right away,
vm_flags could be updated in the write mmap_sem critical section before or
after the actual unmap is called. This is just off the top of my head.
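
Something like this, maybe (very rough, untested sketch of that idea):

	/* still under write mmap_sem, before dropping to read for the zap */
	if (tmp->vm_flags & VM_LOCKED) {
		mm->locked_vm -= vma_pages(tmp);
		munlock_vma_pages_all(tmp);
		/* vm_flags update done up front */
		tmp->vm_flags &= ~VM_LOCKED;
	}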

For uprobed mappings, I'm not sure how vital it is to this case.

Thanks,
Yang

>


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC v3 PATCH 4/5] mm: mmap: zap pages with read mmap_sem for large mapping
  2018-06-30  2:28     ` Yang Shi
@ 2018-06-30  3:15       ` Andrew Morton
  2018-06-30  4:26         ` Yang Shi
  2018-07-02 14:05         ` Michal Hocko
  0 siblings, 2 replies; 44+ messages in thread
From: Andrew Morton @ 2018-06-30  3:15 UTC (permalink / raw)
  To: Yang Shi
  Cc: mhocko, willy, ldufour, peterz, mingo, acme, alexander.shishkin,
	jolsa, namhyung, tglx, hpa, linux-mm, x86, linux-kernel

On Fri, 29 Jun 2018 19:28:15 -0700 Yang Shi <yang.shi@linux.alibaba.com> wrote:

> 
> 
> > we're adding a bunch of code to 32-bit kernels which will never be
> > executed.
> >
> > I'm thinking it would be better to be much more explicit with "#ifdef
> > CONFIG_64BIT" in this code, rather than relying upon the above magic.
> >
> > But I tend to think that the fact that we haven't solved anything on
> > locked vmas or on uprobed mappings is a showstopper for the whole
> > approach :(
> 
> I agree it is not that perfect. But, it still could improve the most use 
> cases.

Well, those unaddressed usecases will need to be fixed at some point. 
What's our plan for that?

Would one of your earlier designs have addressed all usecases?  I
expect the dumb unmap-a-little-bit-at-a-time approach would have?

> For the locked vmas and hugetlb vmas, unmapping operations need modify 
> vm_flags. But, I'm wondering we might be able to separate unmap and 
> vm_flags update. Because we know they will be unmapped right away, the 
> vm_flags might be able to be updated in write mmap_sem critical section 
> before the actual unmap is called or after it. This is just off the top 
> of my head.
> 
> For uprobed mappings, I'm not sure how vital it is to this case.
> 
> Thanks,
> Yang
> 
> >

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC v3 PATCH 4/5] mm: mmap: zap pages with read mmap_sem for large mapping
  2018-06-30  3:15       ` Andrew Morton
@ 2018-06-30  4:26         ` Yang Shi
  2018-07-03  0:01           ` Yang Shi
  2018-07-02 14:05         ` Michal Hocko
  1 sibling, 1 reply; 44+ messages in thread
From: Yang Shi @ 2018-06-30  4:26 UTC (permalink / raw)
  To: Andrew Morton
  Cc: mhocko, willy, ldufour, peterz, mingo, acme, alexander.shishkin,
	jolsa, namhyung, tglx, hpa, linux-mm, x86, linux-kernel



On 6/29/18 8:15 PM, Andrew Morton wrote:
> On Fri, 29 Jun 2018 19:28:15 -0700 Yang Shi <yang.shi@linux.alibaba.com> wrote:
>
>>
>>> we're adding a bunch of code to 32-bit kernels which will never be
>>> executed.
>>>
>>> I'm thinking it would be better to be much more explicit with "#ifdef
>>> CONFIG_64BIT" in this code, rather than relying upon the above magic.
>>>
>>> But I tend to think that the fact that we haven't solved anything on
>>> locked vmas or on uprobed mappings is a showstopper for the whole
>>> approach :(
>> I agree it is not that perfect. But, it still could improve the most use
>> cases.
> Well, those unaddressed usecases will need to be fixed at some point.

Yes, definitely.

> What's our plan for that?

As I mentioned in the earlier email, the locked and hugetlb cases might be
solvable by separating the vm_flags update from the actual unmap. I will
look into it further later.

From my point of view, the uprobe mapping case does not sound that vital.

>
> Would one of your earlier designs have addressed all usecases?  I
> expect the dumb unmap-a-little-bit-at-a-time approach would have?

Yes. The v1 design does the unmap while holding write mmap_sem, so the
vm_flags update is not a problem.

Thanks,
Yang

>
>> For the locked vmas and hugetlb vmas, unmapping operations need modify
>> vm_flags. But, I'm wondering we might be able to separate unmap and
>> vm_flags update. Because we know they will be unmapped right away, the
>> vm_flags might be able to be updated in write mmap_sem critical section
>> before the actual unmap is called or after it. This is just off the top
>> of my head.
>>
>> For uprobed mappings, I'm not sure how vital it is to this case.
>>
>> Thanks,
>> Yang
>>


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC v3 PATCH 5/5] x86: check VM_DEAD flag in page fault
  2018-06-29 22:39 ` [RFC v3 PATCH 5/5] x86: check VM_DEAD flag in page fault Yang Shi
@ 2018-07-02  8:45   ` Laurent Dufour
  2018-07-02 12:15     ` Michal Hocko
  0 siblings, 1 reply; 44+ messages in thread
From: Laurent Dufour @ 2018-07-02  8:45 UTC (permalink / raw)
  To: Yang Shi, mhocko, willy, akpm, peterz, mingo, acme,
	alexander.shishkin, jolsa, namhyung, tglx, hpa
  Cc: linux-mm, x86, linux-kernel

On 30/06/2018 00:39, Yang Shi wrote:
> Check VM_DEAD flag of vma in page fault handler, if it is set, trigger
> SIGSEGV.
> 
> Cc: Michal Hocko <mhocko@kernel.org>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: "H. Peter Anvin" <hpa@zytor.com>
> Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
> ---
>  arch/x86/mm/fault.c | 4 ++++
>  1 file changed, 4 insertions(+)
> 
> diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
> index 9a84a0d..3fd2da5 100644
> --- a/arch/x86/mm/fault.c
> +++ b/arch/x86/mm/fault.c
> @@ -1357,6 +1357,10 @@ static inline bool smap_violation(int error_code, struct pt_regs *regs)
>  		bad_area(regs, error_code, address);
>  		return;
>  	}
> +	if (unlikely(vma->vm_flags & VM_DEAD)) {
> +		bad_area(regs, error_code, address);
> +		return;
> +	}

This will have to be done for all the supported architectures; what about doing
this check in handle_mm_fault() and returning VM_FAULT_SIGSEGV?
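
Roughly (untested sketch), early in handle_mm_fault() before the fault is
actually handled:

	if (unlikely(vma->vm_flags & VM_DEAD))
		return VM_FAULT_SIGSEGV;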

>  	if (error_code & X86_PF_USER) {
>  		/*
>  		 * Accessing the stack below %sp is always a bug.
> 


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC v3 PATCH 5/5] x86: check VM_DEAD flag in page fault
  2018-07-02  8:45   ` Laurent Dufour
@ 2018-07-02 12:15     ` Michal Hocko
  2018-07-02 12:26       ` Laurent Dufour
  0 siblings, 1 reply; 44+ messages in thread
From: Michal Hocko @ 2018-07-02 12:15 UTC (permalink / raw)
  To: Laurent Dufour
  Cc: Yang Shi, willy, akpm, peterz, mingo, acme, alexander.shishkin,
	jolsa, namhyung, tglx, hpa, linux-mm, x86, linux-kernel

On Mon 02-07-18 10:45:03, Laurent Dufour wrote:
> On 30/06/2018 00:39, Yang Shi wrote:
> > Check VM_DEAD flag of vma in page fault handler, if it is set, trigger
> > SIGSEGV.
> > 
> > Cc: Michal Hocko <mhocko@kernel.org>
> > Cc: Thomas Gleixner <tglx@linutronix.de>
> > Cc: Ingo Molnar <mingo@redhat.com>
> > Cc: "H. Peter Anvin" <hpa@zytor.com>
> > Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
> > ---
> >  arch/x86/mm/fault.c | 4 ++++
> >  1 file changed, 4 insertions(+)
> > 
> > diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
> > index 9a84a0d..3fd2da5 100644
> > --- a/arch/x86/mm/fault.c
> > +++ b/arch/x86/mm/fault.c
> > @@ -1357,6 +1357,10 @@ static inline bool smap_violation(int error_code, struct pt_regs *regs)
> >  		bad_area(regs, error_code, address);
> >  		return;
> >  	}
> > +	if (unlikely(vma->vm_flags & VM_DEAD)) {
> > +		bad_area(regs, error_code, address);
> > +		return;
> > +	}
> 
> This will have to be done for all the supported architectures, what about doing
> this check in handle_mm_fault() and return VM_FAULT_SIGSEGV ?

We already do have a model for that. Have a look at MMF_UNSTABLE.
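
(For reference: MMF_UNSTABLE is an mm-level flag set by the oom reaper, and
the generic fault path in mm/memory.c checks it and can force an error
result, roughly:

	if (unlikely(test_bit(MMF_UNSTABLE, &mm->flags)))
		ret = VM_FAULT_SIGBUS;

rather than every architecture checking it separately. The real check has
additional conditions.)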

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC v3 PATCH 5/5] x86: check VM_DEAD flag in page fault
  2018-07-02 12:15     ` Michal Hocko
@ 2018-07-02 12:26       ` Laurent Dufour
  2018-07-02 12:45         ` Michal Hocko
  0 siblings, 1 reply; 44+ messages in thread
From: Laurent Dufour @ 2018-07-02 12:26 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Yang Shi, willy, akpm, peterz, mingo, acme, alexander.shishkin,
	jolsa, namhyung, tglx, hpa, linux-mm, x86, linux-kernel

On 02/07/2018 14:15, Michal Hocko wrote:
> On Mon 02-07-18 10:45:03, Laurent Dufour wrote:
>> On 30/06/2018 00:39, Yang Shi wrote:
>>> Check VM_DEAD flag of vma in page fault handler, if it is set, trigger
>>> SIGSEGV.
>>>
>>> Cc: Michal Hocko <mhocko@kernel.org>
>>> Cc: Thomas Gleixner <tglx@linutronix.de>
>>> Cc: Ingo Molnar <mingo@redhat.com>
>>> Cc: "H. Peter Anvin" <hpa@zytor.com>
>>> Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
>>> ---
>>>  arch/x86/mm/fault.c | 4 ++++
>>>  1 file changed, 4 insertions(+)
>>>
>>> diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
>>> index 9a84a0d..3fd2da5 100644
>>> --- a/arch/x86/mm/fault.c
>>> +++ b/arch/x86/mm/fault.c
>>> @@ -1357,6 +1357,10 @@ static inline bool smap_violation(int error_code, struct pt_regs *regs)
>>>  		bad_area(regs, error_code, address);
>>>  		return;
>>>  	}
>>> +	if (unlikely(vma->vm_flags & VM_DEAD)) {
>>> +		bad_area(regs, error_code, address);
>>> +		return;
>>> +	}
>>
>> This will have to be done for all the supported architectures, what about doing
>> this check in handle_mm_fault() and return VM_FAULT_SIGSEGV ?
> 
> We already do have a model for that. Have a look at MMF_UNSTABLE.

MMF_UNSTABLE is an mm flag; here it is a VMA flag that is being checked.


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC v3 PATCH 4/5] mm: mmap: zap pages with read mmap_sem for large mapping
  2018-06-29 22:39 ` [RFC v3 PATCH 4/5] mm: mmap: zap pages with read mmap_sem for large mapping Yang Shi
  2018-06-30  1:28   ` Andrew Morton
  2018-06-30  1:35   ` Andrew Morton
@ 2018-07-02 12:33   ` Kirill A. Shutemov
  2018-07-02 12:49     ` Michal Hocko
  2018-07-02 17:19     ` Yang Shi
  2018-07-02 13:53   ` Michal Hocko
  3 siblings, 2 replies; 44+ messages in thread
From: Kirill A. Shutemov @ 2018-07-02 12:33 UTC (permalink / raw)
  To: Yang Shi
  Cc: mhocko, willy, ldufour, akpm, peterz, mingo, acme,
	alexander.shishkin, jolsa, namhyung, tglx, hpa, linux-mm, x86,
	linux-kernel

On Sat, Jun 30, 2018 at 06:39:44AM +0800, Yang Shi wrote:
> When running some mmap/munmap scalability tests with large memory (i.e.
> > 300GB), the below hung task issue may happen occasionally.
> 
> INFO: task ps:14018 blocked for more than 120 seconds.
>        Tainted: G            E 4.9.79-009.ali3000.alios7.x86_64 #1
>  "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
> message.
>  ps              D    0 14018      1 0x00000004
>   ffff885582f84000 ffff885e8682f000 ffff880972943000 ffff885ebf499bc0
>   ffff8828ee120000 ffffc900349bfca8 ffffffff817154d0 0000000000000040
>   00ffffff812f872a ffff885ebf499bc0 024000d000948300 ffff880972943000
>  Call Trace:
>   [<ffffffff817154d0>] ? __schedule+0x250/0x730
>   [<ffffffff817159e6>] schedule+0x36/0x80
>   [<ffffffff81718560>] rwsem_down_read_failed+0xf0/0x150
>   [<ffffffff81390a28>] call_rwsem_down_read_failed+0x18/0x30
>   [<ffffffff81717db0>] down_read+0x20/0x40
>   [<ffffffff812b9439>] proc_pid_cmdline_read+0xd9/0x4e0
>   [<ffffffff81253c95>] ? do_filp_open+0xa5/0x100
>   [<ffffffff81241d87>] __vfs_read+0x37/0x150
>   [<ffffffff812f824b>] ? security_file_permission+0x9b/0xc0
>   [<ffffffff81242266>] vfs_read+0x96/0x130
>   [<ffffffff812437b5>] SyS_read+0x55/0xc0
>   [<ffffffff8171a6da>] entry_SYSCALL_64_fastpath+0x1a/0xc5
> 
> It is because munmap holds mmap_sem from very beginning to all the way
> down to the end, and doesn't release it in the middle. When unmapping
> large mapping, it may take long time (take ~18 seconds to unmap 320GB
> mapping with every single page mapped on an idle machine).
> 
> Zapping pages is the most time consuming part, according to the
> suggestion from Michal Hock [1], zapping pages can be done with holding
> read mmap_sem, like what MADV_DONTNEED does. Then re-acquire write
> mmap_sem to cleanup vmas. All zapped vmas will have VM_DEAD flag set,
> the page fault to VM_DEAD vma will trigger SIGSEGV.
> 
> Define large mapping size thresh as PUD size or 1GB, just zap pages with
> read mmap_sem for mappings which are >= thresh value.
> 
> If the vma has VM_LOCKED | VM_HUGETLB | VM_PFNMAP or uprobe, then just
> fallback to regular path since unmapping those mappings need acquire
> write mmap_sem.
> 
> For the time being, just do this in munmap syscall path. Other
> vm_munmap() or do_munmap() call sites remain intact for stability
> reason.
> 
> The below is some regression and performance data collected on a machine
> with 32 cores of E5-2680 @ 2.70GHz and 384GB memory.
> 
> With the patched kernel, write mmap_sem hold time is dropped to us level
> from second.
> 
> [1] https://lwn.net/Articles/753269/
> 
> Cc: Michal Hocko <mhocko@kernel.org>
> Cc: Matthew Wilcox <willy@infradead.org>
> Cc: Laurent Dufour <ldufour@linux.vnet.ibm.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
> ---
>  mm/mmap.c | 136 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
>  1 file changed, 134 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 87dcf83..d61e08b 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -2763,6 +2763,128 @@ static int munmap_lookup_vma(struct mm_struct *mm, struct vm_area_struct **vma,
>  	return 1;
>  }
>  
> +/* Consider PUD size or 1GB mapping as large mapping */
> +#ifdef HPAGE_PUD_SIZE
> +#define LARGE_MAP_THRESH	HPAGE_PUD_SIZE
> +#else
> +#define LARGE_MAP_THRESH	(1 * 1024 * 1024 * 1024)
> +#endif

PUD_SIZE is defined everywhere.

> +
> +/* Unmap large mapping early with acquiring read mmap_sem */
> +static int do_munmap_zap_early(struct mm_struct *mm, unsigned long start,
> +			       size_t len, struct list_head *uf)
> +{
> +	unsigned long end = 0;
> +	struct vm_area_struct *vma = NULL, *prev, *tmp;
> +	bool success = false;
> +	int ret = 0;
> +
> +	if (!munmap_addr_sanity(start, len))
> +		return -EINVAL;
> +
> +	len = PAGE_ALIGN(len);
> +
> +	end = start + len;
> +
> +	/* Just deal with uf in regular path */
> +	if (unlikely(uf))
> +		goto regular_path;
> +
> +	if (len >= LARGE_MAP_THRESH) {
> +		/*
> +		 * need write mmap_sem to split vma and set VM_DEAD flag
> +		 * splitting vma up-front to save PITA to clean if it is failed

What errors are you talking about? ENOMEM on VMA split? Anything else?

> +		 */
> +		down_write(&mm->mmap_sem);
> +		ret = munmap_lookup_vma(mm, &vma, &prev, start, end);
> +		if (ret != 1) {
> +			up_write(&mm->mmap_sem);
> +			return ret;
> +		}
> +		/* This ret value might be returned, so reset it */
> +		ret = 0;
> +
> +		/*
> +		 * Unmapping vmas, which has VM_LOCKED|VM_HUGETLB|VM_PFNMAP
> +		 * flag set or has uprobes set, need acquire write map_sem,
> +		 * so skip them in early zap. Just deal with such mapping in
> +		 * regular path.
> +		 * Borrow can_madv_dontneed_vma() to check the conditions.
> +		 */
> +		tmp = vma;
> +		while (tmp && tmp->vm_start < end) {
> +			if (!can_madv_dontneed_vma(tmp) ||
> +			    vma_has_uprobes(tmp, start, end)) {
> +				up_write(&mm->mmap_sem);
> +				goto regular_path;
> +			}
> +			tmp = tmp->vm_next;
> +		}
> +		/*
> +		 * set VM_DEAD flag before tear down them.
> +		 * page fault on VM_DEAD vma will trigger SIGSEGV.
> +		 */
> +		tmp = vma;
> +		for ( ; tmp && tmp->vm_start < end; tmp = tmp->vm_next)
> +			tmp->vm_flags |= VM_DEAD;

I probably missed the explanation somewhere, but what's wrong with allowing
another thread to re-populate the VMA?

I would rather allow the VMA to be re-populated by another thread while we
are zapping the range, and later zap the range again under down_write.

It should also lead to a consolidated regular path: take mmap_sem for write
and call do_munmap().

On the first path we just skip VMAs we cannot deal with under
down_read(mmap_sem); the regular path will take care of them.


> +		up_write(&mm->mmap_sem);
> +
> +		/* zap mappings with read mmap_sem */
> +		down_read(&mm->mmap_sem);

Yeah. There's a race between up_write() and down_read().
Use downgrade, as Andrew suggested.
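Roughly something like this, to sketch it (untested, reusing the names
from the hunk above):

	for ( ; tmp && tmp->vm_start < end; tmp = tmp->vm_next)
		tmp->vm_flags |= VM_DEAD;

	/*
	 * Atomically convert the exclusive hold into a shared one so
	 * nobody can sneak in between setting VM_DEAD and the zap.
	 */
	downgrade_write(&mm->mmap_sem);

	zap_page_range(vma, start, len);
	success = true;
	up_read(&mm->mmap_sem);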

> +		zap_page_range(vma, start, len);
> +		/* indicates early zap is success */
> +		success = true;
> +		up_read(&mm->mmap_sem);

And here again.

This race can be avoided if we don't carry the vma to regular_path, but
just go directly to do_munmap().

> +	}
> +
> +regular_path:
> +	/* hold write mmap_sem for vma manipulation or regular path */
> +	if (down_write_killable(&mm->mmap_sem))
> +		return -EINTR;
> +	if (success) {
> +		/* vmas have been zapped, here clean up pgtable and vmas */
> +		struct vm_area_struct *next = prev ? prev->vm_next : mm->mmap;
> +		struct mmu_gather tlb;
> +		tlb_gather_mmu(&tlb, mm, start, end);
> +		free_pgtables(&tlb, vma, prev ? prev->vm_end : FIRST_USER_ADDRESS,
> +			      next ? next->vm_start : USER_PGTABLES_CEILING);
> +		tlb_finish_mmu(&tlb, start, end);
> +
> +		detach_vmas_to_be_unmapped(mm, vma, prev, end);
> +		arch_unmap(mm, vma, start, end);
> +		remove_vma_list(mm, vma);
> +	} else {
> +		/* vma is VM_LOCKED|VM_HUGETLB|VM_PFNMAP or has uprobe */
> +		if (vma) {
> +			if (unlikely(uf)) {
> +				int ret = userfaultfd_unmap_prep(vma, start,
> +								 end, uf);
> +				if (ret)
> +					goto out;
> +			}
> +			if (mm->locked_vm) {
> +				tmp = vma;
> +				while (tmp && tmp->vm_start < end) {
> +					if (tmp->vm_flags & VM_LOCKED) {
> +						mm->locked_vm -= vma_pages(tmp);
> +						munlock_vma_pages_all(tmp);
> +					}
> +					tmp = tmp->vm_next;
> +				}
> +			}
> +			detach_vmas_to_be_unmapped(mm, vma, prev, end);
> +			unmap_region(mm, vma, prev, start, end);
> +			remove_vma_list(mm, vma);
> +		} else
> +			/* When mapping size < LARGE_MAP_THRESH */
> +			ret = do_munmap(mm, start, len, uf);
> +	}
> +
> +out:
> +	up_write(&mm->mmap_sem);
> +	return ret;
> +}
> +
>  /* Munmap is split into 2 main parts -- this part which finds
>   * what needs doing, and the areas themselves, which do the
>   * work.  This now handles partial unmappings.
> @@ -2829,6 +2951,17 @@ int do_munmap(struct mm_struct *mm, unsigned long start, size_t len,
>  	return 0;
>  }
>  
> +static int vm_munmap_zap_early(unsigned long start, size_t len)
> +{
> +	int ret;
> +	struct mm_struct *mm = current->mm;
> +	LIST_HEAD(uf);
> +
> +	ret = do_munmap_zap_early(mm, start, len, &uf);
> +	userfaultfd_unmap_complete(mm, &uf);
> +	return ret;
> +}
> +
>  int vm_munmap(unsigned long start, size_t len)
>  {
>  	int ret;
> @@ -2848,10 +2981,9 @@ int vm_munmap(unsigned long start, size_t len)
>  SYSCALL_DEFINE2(munmap, unsigned long, addr, size_t, len)
>  {
>  	profile_munmap(addr);
> -	return vm_munmap(addr, len);
> +	return vm_munmap_zap_early(addr, len);
>  }
>  
> -
>  /*
>   * Emulation of deprecated remap_file_pages() syscall.
>   */
> -- 
> 1.8.3.1
> 

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC v3 PATCH 5/5] x86: check VM_DEAD flag in page fault
  2018-07-02 12:26       ` Laurent Dufour
@ 2018-07-02 12:45         ` Michal Hocko
  2018-07-02 13:33           ` Laurent Dufour
  0 siblings, 1 reply; 44+ messages in thread
From: Michal Hocko @ 2018-07-02 12:45 UTC (permalink / raw)
  To: Laurent Dufour
  Cc: Yang Shi, willy, akpm, peterz, mingo, acme, alexander.shishkin,
	jolsa, namhyung, tglx, hpa, linux-mm, x86, linux-kernel

On Mon 02-07-18 14:26:09, Laurent Dufour wrote:
> On 02/07/2018 14:15, Michal Hocko wrote:
> > On Mon 02-07-18 10:45:03, Laurent Dufour wrote:
> >> On 30/06/2018 00:39, Yang Shi wrote:
> >>> Check VM_DEAD flag of vma in page fault handler, if it is set, trigger
> >>> SIGSEGV.
> >>>
> >>> Cc: Michal Hocko <mhocko@kernel.org>
> >>> Cc: Thomas Gleixner <tglx@linutronix.de>
> >>> Cc: Ingo Molnar <mingo@redhat.com>
> >>> Cc: "H. Peter Anvin" <hpa@zytor.com>
> >>> Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
> >>> ---
> >>>  arch/x86/mm/fault.c | 4 ++++
> >>>  1 file changed, 4 insertions(+)
> >>>
> >>> diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
> >>> index 9a84a0d..3fd2da5 100644
> >>> --- a/arch/x86/mm/fault.c
> >>> +++ b/arch/x86/mm/fault.c
> >>> @@ -1357,6 +1357,10 @@ static inline bool smap_violation(int error_code, struct pt_regs *regs)
> >>>  		bad_area(regs, error_code, address);
> >>>  		return;
> >>>  	}
> >>> +	if (unlikely(vma->vm_flags & VM_DEAD)) {
> >>> +		bad_area(regs, error_code, address);
> >>> +		return;
> >>> +	}
> >>
> >> This will have to be done for all the supported architectures, what about doing
> >> this check in handle_mm_fault() and return VM_FAULT_SIGSEGV ?
> > 
> > We already do have a model for that. Have a look at MMF_UNSTABLE.
> 
> MMF_UNSTABLE is a mm's flag, here this is a VMA's flag which is checked.

Yeah, and we have the VMA ready for all places where we do check the
flag. check_stable_address_space can be made to get vma rather than mm.
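For illustration only, a rough sketch of that direction (the VM_DEAD check
and the VM_FAULT_SIGSEGV return are assumptions about where this series
could go, not existing code):

/* hypothetical vma-based variant of check_stable_address_space() */
static inline int check_stable_address_space(struct vm_area_struct *vma)
{
	/* the existing MMF_UNSTABLE model, reached via vma->vm_mm */
	if (unlikely(test_bit(MMF_UNSTABLE, &vma->vm_mm->flags)))
		return VM_FAULT_SIGBUS;

	/* proposed: refuse faults on a vma munmap has already zapped */
	if (unlikely(vma->vm_flags & VM_DEAD))
		return VM_FAULT_SIGSEGV;

	return 0;
}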

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC v3 PATCH 4/5] mm: mmap: zap pages with read mmap_sem for large mapping
  2018-07-02 12:33   ` Kirill A. Shutemov
@ 2018-07-02 12:49     ` Michal Hocko
  2018-07-03  8:12       ` Kirill A. Shutemov
  2018-07-02 17:19     ` Yang Shi
  1 sibling, 1 reply; 44+ messages in thread
From: Michal Hocko @ 2018-07-02 12:49 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Yang Shi, willy, ldufour, akpm, peterz, mingo, acme,
	alexander.shishkin, jolsa, namhyung, tglx, hpa, linux-mm, x86,
	linux-kernel

On Mon 02-07-18 15:33:50, Kirill A. Shutemov wrote:
[...]
> I probably missed the explanation somewhere, but what's wrong with allowing
> another thread to re-populate the VMA?

We have discussed that earlier, and it boils down to how an access that
races with munmap is supposed to behave. Right now we get either the
original content or SEGV. If we simply allow a madvise_dontneed-style
zap before the real unmap, we could get a new page as well. There might
be (quite broken, I would say) user space code that would silently
corrupt data that way.
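To make the guarantee concrete, a minimal user-space sketch (error
handling omitted): the reader below must only ever observe the original
bytes or take SIGSEGV, never a freshly repopulated page.

/* build: gcc -pthread munmap-race.c -o munmap-race */
#define _GNU_SOURCE
#include <pthread.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#define SZ	(4UL << 20)

static char *buf;

static void *reader(void *arg)
{
	unsigned long off;

	/*
	 * Races with the munmap() in main(): each load either returns
	 * 'x' (the original content) or the process takes SIGSEGV.  If
	 * the range could be repopulated in the middle of munmap, a
	 * load could silently return 0 instead.
	 */
	for (off = 0; off < SZ; off += 4096)
		if (buf[off] != 'x')
			fprintf(stderr, "unexpected byte at %lu\n", off);
	return NULL;
}

int main(void)
{
	pthread_t t;

	buf = mmap(NULL, SZ, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	memset(buf, 'x', SZ);

	pthread_create(&t, NULL, reader, NULL);
	munmap(buf, SZ);		/* races with the reader */
	pthread_join(t, NULL);
	return 0;
}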
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC v3 PATCH 5/5] x86: check VM_DEAD flag in page fault
  2018-07-02 12:45         ` Michal Hocko
@ 2018-07-02 13:33           ` Laurent Dufour
  2018-07-02 13:37             ` Michal Hocko
  0 siblings, 1 reply; 44+ messages in thread
From: Laurent Dufour @ 2018-07-02 13:33 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Yang Shi, willy, akpm, peterz, mingo, acme, alexander.shishkin,
	jolsa, namhyung, tglx, hpa, linux-mm, x86, linux-kernel



On 02/07/2018 14:45, Michal Hocko wrote:
> On Mon 02-07-18 14:26:09, Laurent Dufour wrote:
>> On 02/07/2018 14:15, Michal Hocko wrote:
>>> On Mon 02-07-18 10:45:03, Laurent Dufour wrote:
>>>> On 30/06/2018 00:39, Yang Shi wrote:
>>>>> Check VM_DEAD flag of vma in page fault handler, if it is set, trigger
>>>>> SIGSEGV.
>>>>>
>>>>> Cc: Michal Hocko <mhocko@kernel.org>
>>>>> Cc: Thomas Gleixner <tglx@linutronix.de>
>>>>> Cc: Ingo Molnar <mingo@redhat.com>
>>>>> Cc: "H. Peter Anvin" <hpa@zytor.com>
>>>>> Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
>>>>> ---
>>>>>  arch/x86/mm/fault.c | 4 ++++
>>>>>  1 file changed, 4 insertions(+)
>>>>>
>>>>> diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
>>>>> index 9a84a0d..3fd2da5 100644
>>>>> --- a/arch/x86/mm/fault.c
>>>>> +++ b/arch/x86/mm/fault.c
>>>>> @@ -1357,6 +1357,10 @@ static inline bool smap_violation(int error_code, struct pt_regs *regs)
>>>>>  		bad_area(regs, error_code, address);
>>>>>  		return;
>>>>>  	}
>>>>> +	if (unlikely(vma->vm_flags & VM_DEAD)) {
>>>>> +		bad_area(regs, error_code, address);
>>>>> +		return;
>>>>> +	}
>>>>
>>>> This will have to be done for all the supported architectures, what about doing
>>>> this check in handle_mm_fault() and return VM_FAULT_SIGSEGV ?
>>>
>>> We already do have a model for that. Have a look at MMF_UNSTABLE.
>>
>> MMF_UNSTABLE is a mm's flag, here this is a VMA's flag which is checked.
> 
> Yeah, and we have the VMA ready for all places where we do check the
> flag. check_stable_address_space can be made to get vma rather than mm.

Yeah, this would have been more efficient to check that flag at the beginning
of the page fault handler rather than the end, but this way it will be easier
to handle the speculative page fault too ;)


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC v3 PATCH 5/5] x86: check VM_DEAD flag in page fault
  2018-07-02 13:33           ` Laurent Dufour
@ 2018-07-02 13:37             ` Michal Hocko
  2018-07-02 17:24               ` Yang Shi
  0 siblings, 1 reply; 44+ messages in thread
From: Michal Hocko @ 2018-07-02 13:37 UTC (permalink / raw)
  To: Laurent Dufour
  Cc: Yang Shi, willy, akpm, peterz, mingo, acme, alexander.shishkin,
	jolsa, namhyung, tglx, hpa, linux-mm, x86, linux-kernel

On Mon 02-07-18 15:33:11, Laurent Dufour wrote:
> 
> 
> On 02/07/2018 14:45, Michal Hocko wrote:
> > On Mon 02-07-18 14:26:09, Laurent Dufour wrote:
> >> On 02/07/2018 14:15, Michal Hocko wrote:
[...]
> >>> We already do have a model for that. Have a look at MMF_UNSTABLE.
> >>
> >> MMF_UNSTABLE is a mm's flag, here this is a VMA's flag which is checked.
> > 
> > Yeah, and we have the VMA ready for all places where we do check the
> > flag. check_stable_address_space can be made to get vma rather than mm.
> 
> Yeah, this would have been more efficient to check that flag at the beginning
> of the page fault handler rather than the end, but this way it will be easier
> to handle the speculative page fault too ;)

The thing is that it doesn't really need to be called earlier. You are
not risking data corruption on file backed mappings.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC v3 PATCH 0/5] mm: zap pages with read mmap_sem in munmap for large mapping
  2018-06-29 22:39 [RFC v3 PATCH 0/5] mm: zap pages with read mmap_sem in munmap for large mapping Yang Shi
                   ` (4 preceding siblings ...)
  2018-06-29 22:39 ` [RFC v3 PATCH 5/5] x86: check VM_DEAD flag in page fault Yang Shi
@ 2018-07-02 13:39 ` Michal Hocko
  5 siblings, 0 replies; 44+ messages in thread
From: Michal Hocko @ 2018-07-02 13:39 UTC (permalink / raw)
  To: Yang Shi
  Cc: willy, ldufour, akpm, peterz, mingo, acme, alexander.shishkin,
	jolsa, namhyung, tglx, hpa, linux-mm, x86, linux-kernel

On Sat 30-06-18 06:39:40, Yang Shi wrote:
> 
> Background:
> Recently, when we ran some vm scalability tests on machines with large memory,
> we ran into a couple of mmap_sem scalability issues when unmapping large memory
> space, please refer to https://lkml.org/lkml/2017/12/14/733 and
> https://lkml.org/lkml/2018/2/20/576.
> 
> 
> History:
> Then akpm suggested to unmap large mapping section by section and drop mmap_sem
> at a time to mitigate it (see https://lkml.org/lkml/2018/3/6/784).
> 
> V1 patch series was submitted to the mailing list per Andrew’s suggestion
> (see https://lkml.org/lkml/2018/3/20/786). Then I received a lot great feedback
> and suggestions.
> 
> Then this topic was discussed on LSFMM summit 2018. In the summit, Michal Hock
> suggested (also in the v1 patches review) to try "two phases" approach. Zapping
> pages with read mmap_sem, then doing via cleanup with write mmap_sem (for
> discussion detail, see https://lwn.net/Articles/753269/)

The cover letter should really describe your approach to the problem.
But there is none here.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC v3 PATCH 2/5] mm: introduce VM_DEAD flag
  2018-06-29 22:39 ` [RFC v3 PATCH 2/5] mm: introduce VM_DEAD flag Yang Shi
@ 2018-07-02 13:40   ` Michal Hocko
  0 siblings, 0 replies; 44+ messages in thread
From: Michal Hocko @ 2018-07-02 13:40 UTC (permalink / raw)
  To: Yang Shi
  Cc: willy, ldufour, akpm, peterz, mingo, acme, alexander.shishkin,
	jolsa, namhyung, tglx, hpa, linux-mm, x86, linux-kernel

On Sat 30-06-18 06:39:42, Yang Shi wrote:
> VM_DEAD flag is used to mark a vma is being unmapped, access to this
> area will trigger SIGSEGV.
> 
> This flag will be used by the optimization for unmapping large address
> space (>= 1GB) in the later patch. It is 64 bit only at the moment,
> since:
>   * we used up vm_flags bit for 32 bit
>   * 32 bit machine typically will not have such large mapping
> 
> All architectures, which support 64 bit, need check this flag in their
> page fault handler. This is implemented in later patches.

Please add a new flag together with its users. There is simply no way to
tell the semantics from the above description and the patch alone.
 
> Suggested-by: Michal Hocko <mhocko@kernel.org>
> Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
> ---
>  include/linux/mm.h | 6 ++++++
>  1 file changed, 6 insertions(+)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index a0fbb9f..28a3906 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -242,6 +242,12 @@ extern int overcommit_kbytes_handler(struct ctl_table *, int, void __user *,
>  #endif
>  #endif /* CONFIG_ARCH_HAS_PKEYS */
>  
> +#ifdef CONFIG_64BIT
> +#define VM_DEAD			BIT(37)	/* bit only usable on 64 bit kernel */
> +#else
> +#define VM_DEAD			0
> +#endif
> +
>  #if defined(CONFIG_X86)
>  # define VM_PAT		VM_ARCH_1	/* PAT reserves whole VMA at once (x86) */
>  #elif defined(CONFIG_PPC)
> -- 
> 1.8.3.1

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC v3 PATCH 3/5] mm: refactor do_munmap() to extract the common part
  2018-06-29 22:39 ` [RFC v3 PATCH 3/5] mm: refactor do_munmap() to extract the common part Yang Shi
@ 2018-07-02 13:42   ` Michal Hocko
  2018-07-02 16:59     ` Yang Shi
  0 siblings, 1 reply; 44+ messages in thread
From: Michal Hocko @ 2018-07-02 13:42 UTC (permalink / raw)
  To: Yang Shi
  Cc: willy, ldufour, akpm, peterz, mingo, acme, alexander.shishkin,
	jolsa, namhyung, tglx, hpa, linux-mm, x86, linux-kernel

On Sat 30-06-18 06:39:43, Yang Shi wrote:
> Introduces two new helper functions:
>   * munmap_addr_sanity()
>   * munmap_lookup_vma()
> 
> They will be used by do_munmap() and the new do_munmap with zapping
> large mapping early in the later patch.
> 
> There is no functional change, just code refactor.

There are whitespace changes which make the code much harder to review
than necessary.
> +static inline bool munmap_addr_sanity(unsigned long start, size_t len)
>  {
> -	unsigned long end;
> -	struct vm_area_struct *vma, *prev, *last;
> +	if ((offset_in_page(start)) || start > TASK_SIZE || len > TASK_SIZE - start)
> +		return false;
>  
> -	if ((offset_in_page(start)) || start > TASK_SIZE || len > TASK_SIZE-start)
> -		return -EINVAL;

e.g. here.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC v3 PATCH 4/5] mm: mmap: zap pages with read mmap_sem for large mapping
  2018-06-29 22:39 ` [RFC v3 PATCH 4/5] mm: mmap: zap pages with read mmap_sem for large mapping Yang Shi
                     ` (2 preceding siblings ...)
  2018-07-02 12:33   ` Kirill A. Shutemov
@ 2018-07-02 13:53   ` Michal Hocko
  2018-07-02 17:07     ` Yang Shi
  3 siblings, 1 reply; 44+ messages in thread
From: Michal Hocko @ 2018-07-02 13:53 UTC (permalink / raw)
  To: Yang Shi
  Cc: willy, ldufour, akpm, peterz, mingo, acme, alexander.shishkin,
	jolsa, namhyung, tglx, hpa, linux-mm, x86, linux-kernel

On Sat 30-06-18 06:39:44, Yang Shi wrote:
> When running some mmap/munmap scalability tests with large memory (i.e.
> > 300GB), the below hung task issue may happen occasionally.
> 
> INFO: task ps:14018 blocked for more than 120 seconds.
>        Tainted: G            E 4.9.79-009.ali3000.alios7.x86_64 #1
>  "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
> message.
>  ps              D    0 14018      1 0x00000004
>   ffff885582f84000 ffff885e8682f000 ffff880972943000 ffff885ebf499bc0
>   ffff8828ee120000 ffffc900349bfca8 ffffffff817154d0 0000000000000040
>   00ffffff812f872a ffff885ebf499bc0 024000d000948300 ffff880972943000
>  Call Trace:
>   [<ffffffff817154d0>] ? __schedule+0x250/0x730
>   [<ffffffff817159e6>] schedule+0x36/0x80
>   [<ffffffff81718560>] rwsem_down_read_failed+0xf0/0x150
>   [<ffffffff81390a28>] call_rwsem_down_read_failed+0x18/0x30
>   [<ffffffff81717db0>] down_read+0x20/0x40
>   [<ffffffff812b9439>] proc_pid_cmdline_read+0xd9/0x4e0
>   [<ffffffff81253c95>] ? do_filp_open+0xa5/0x100
>   [<ffffffff81241d87>] __vfs_read+0x37/0x150
>   [<ffffffff812f824b>] ? security_file_permission+0x9b/0xc0
>   [<ffffffff81242266>] vfs_read+0x96/0x130
>   [<ffffffff812437b5>] SyS_read+0x55/0xc0
>   [<ffffffff8171a6da>] entry_SYSCALL_64_fastpath+0x1a/0xc5
> 
> It is because munmap holds mmap_sem from very beginning to all the way
> down to the end, and doesn't release it in the middle. When unmapping
> large mapping, it may take long time (take ~18 seconds to unmap 320GB
> mapping with every single page mapped on an idle machine).
> 
> Zapping pages is the most time consuming part, according to the
> suggestion from Michal Hock [1], zapping pages can be done with holding

s@Hock@Hocko@

> read mmap_sem, like what MADV_DONTNEED does. Then re-acquire write
> mmap_sem to cleanup vmas. All zapped vmas will have VM_DEAD flag set,
> the page fault to VM_DEAD vma will trigger SIGSEGV.

This really deserves an explanation of why this whole dance is really needed.

It would also be good to mention how you achieve the overall
consistency. E.g. you are dropping mmap_sem and then re-taking it for
write. What if a pending write lock succeeds and modifies the address
space? Does it matter, and if not, why not?

> Define large mapping size thresh as PUD size or 1GB, just zap pages with
> read mmap_sem for mappings which are >= thresh value.
> 
> If the vma has VM_LOCKED | VM_HUGETLB | VM_PFNMAP or uprobe, then just
> fallback to regular path since unmapping those mappings need acquire
> write mmap_sem.
> 
> For the time being, just do this in munmap syscall path. Other
> vm_munmap() or do_munmap() call sites remain intact for stability
> reason.

What are those stability reasons?

> The below is some regression and performance data collected on a machine
> with 32 cores of E5-2680 @ 2.70GHz and 384GB memory.
> 
> With the patched kernel, write mmap_sem hold time is dropped to us level
> from second.

I haven't read through the implementation carefully TBH, but the changelog
needs quite some work to explain the solution and the resulting semantics
of munmap after the change.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC v3 PATCH 4/5] mm: mmap: zap pages with read mmap_sem for large mapping
  2018-06-30  3:15       ` Andrew Morton
  2018-06-30  4:26         ` Yang Shi
@ 2018-07-02 14:05         ` Michal Hocko
  2018-07-02 20:48           ` Andrew Morton
  1 sibling, 1 reply; 44+ messages in thread
From: Michal Hocko @ 2018-07-02 14:05 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Yang Shi, willy, ldufour, peterz, mingo, acme,
	alexander.shishkin, jolsa, namhyung, tglx, hpa, linux-mm, x86,
	linux-kernel

On Fri 29-06-18 20:15:47, Andrew Morton wrote:
[...]
> Would one of your earlier designs have addressed all usecases?  I
> expect the dumb unmap-a-little-bit-at-a-time approach would have?

It has been already pointed out that this will not work. You simply
cannot drop the mmap_sem during unmap because another thread could
change the address space under your feet. So you need some form of
VM_DEAD and handle concurrent and conflicting address space operations.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC v3 PATCH 3/5] mm: refactor do_munmap() to extract the common part
  2018-07-02 13:42   ` Michal Hocko
@ 2018-07-02 16:59     ` Yang Shi
  2018-07-02 17:58       ` Michal Hocko
  0 siblings, 1 reply; 44+ messages in thread
From: Yang Shi @ 2018-07-02 16:59 UTC (permalink / raw)
  To: Michal Hocko
  Cc: willy, ldufour, akpm, peterz, mingo, acme, alexander.shishkin,
	jolsa, namhyung, tglx, hpa, linux-mm, x86, linux-kernel



On 7/2/18 6:42 AM, Michal Hocko wrote:
> On Sat 30-06-18 06:39:43, Yang Shi wrote:
>> Introduces two new helper functions:
>>    * munmap_addr_sanity()
>>    * munmap_lookup_vma()
>>
>> They will be used by do_munmap() and the new do_munmap with zapping
>> large mapping early in the later patch.
>>
>> There is no functional change, just code refactor.
> There are whitespace changes which make the code much harder to review
> than necessary.
>> +static inline bool munmap_addr_sanity(unsigned long start, size_t len)
>>   {
>> -	unsigned long end;
>> -	struct vm_area_struct *vma, *prev, *last;
>> +	if ((offset_in_page(start)) || start > TASK_SIZE || len > TASK_SIZE - start)
>> +		return false;
>>   
>> -	if ((offset_in_page(start)) || start > TASK_SIZE || len > TASK_SIZE-start)
>> -		return -EINVAL;
> e.g. here.

Oh, yes. I did some coding style cleanup too.



^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC v3 PATCH 4/5] mm: mmap: zap pages with read mmap_sem for large mapping
  2018-07-02 13:53   ` Michal Hocko
@ 2018-07-02 17:07     ` Yang Shi
  0 siblings, 0 replies; 44+ messages in thread
From: Yang Shi @ 2018-07-02 17:07 UTC (permalink / raw)
  To: Michal Hocko
  Cc: willy, ldufour, akpm, peterz, mingo, acme, alexander.shishkin,
	jolsa, namhyung, tglx, hpa, linux-mm, x86, linux-kernel



On 7/2/18 6:53 AM, Michal Hocko wrote:
> On Sat 30-06-18 06:39:44, Yang Shi wrote:
>> When running some mmap/munmap scalability tests with large memory (i.e.
>>> 300GB), the below hung task issue may happen occasionally.
>> INFO: task ps:14018 blocked for more than 120 seconds.
>>         Tainted: G            E 4.9.79-009.ali3000.alios7.x86_64 #1
>>   "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
>> message.
>>   ps              D    0 14018      1 0x00000004
>>    ffff885582f84000 ffff885e8682f000 ffff880972943000 ffff885ebf499bc0
>>    ffff8828ee120000 ffffc900349bfca8 ffffffff817154d0 0000000000000040
>>    00ffffff812f872a ffff885ebf499bc0 024000d000948300 ffff880972943000
>>   Call Trace:
>>    [<ffffffff817154d0>] ? __schedule+0x250/0x730
>>    [<ffffffff817159e6>] schedule+0x36/0x80
>>    [<ffffffff81718560>] rwsem_down_read_failed+0xf0/0x150
>>    [<ffffffff81390a28>] call_rwsem_down_read_failed+0x18/0x30
>>    [<ffffffff81717db0>] down_read+0x20/0x40
>>    [<ffffffff812b9439>] proc_pid_cmdline_read+0xd9/0x4e0
>>    [<ffffffff81253c95>] ? do_filp_open+0xa5/0x100
>>    [<ffffffff81241d87>] __vfs_read+0x37/0x150
>>    [<ffffffff812f824b>] ? security_file_permission+0x9b/0xc0
>>    [<ffffffff81242266>] vfs_read+0x96/0x130
>>    [<ffffffff812437b5>] SyS_read+0x55/0xc0
>>    [<ffffffff8171a6da>] entry_SYSCALL_64_fastpath+0x1a/0xc5
>>
>> It is because munmap holds mmap_sem from very beginning to all the way
>> down to the end, and doesn't release it in the middle. When unmapping
>> large mapping, it may take long time (take ~18 seconds to unmap 320GB
>> mapping with every single page mapped on an idle machine).
>>
>> Zapping pages is the most time consuming part, according to the
>> suggestion from Michal Hock [1], zapping pages can be done with holding
> s@Hock@Hocko@

Sorry for the wrong spelling.

>
>> read mmap_sem, like what MADV_DONTNEED does. Then re-acquire write
>> mmap_sem to cleanup vmas. All zapped vmas will have VM_DEAD flag set,
>> the page fault to VM_DEAD vma will trigger SIGSEGV.
> This really deserves an explanation of why this whole dance is really needed.
>
> It would also be good to mention how you achieve the overall
> consistency. E.g. you are dropping mmap_sem and then re-taking it for
> write. What if a pending write lock succeeds and modifies the address
> space? Does it matter, and if not, why not?

Sure.

>
>> Define large mapping size thresh as PUD size or 1GB, just zap pages with
>> read mmap_sem for mappings which are >= thresh value.
>>
>> If the vma has VM_LOCKED | VM_HUGETLB | VM_PFNMAP or uprobe, then just
>> fallback to regular path since unmapping those mappings need acquire
>> write mmap_sem.
>>
>> For the time being, just do this in munmap syscall path. Other
>> vm_munmap() or do_munmap() call sites remain intact for stability
>> reason.
> What are those stability reasons?

mmap() and mremap() may call do_munmap() as well, so it may introduce
more race conditions if they use the zap-early version of do_munmap too.
They would have many more chances to take mmap_sem to change the address
space and cause conflicts.

Also, they do not look like the main source of long write mmap_sem hold
times, so it does not seem worth making things more complicated for the
time being.

>
>> The below is some regression and performance data collected on a machine
>> with 32 cores of E5-2680 @ 2.70GHz and 384GB memory.
>>
>> With the patched kernel, write mmap_sem hold time is dropped to us level
>> from second.
> I haven't read through the implementation carefully TBH, but the changelog
> needs quite some work to explain the solution and the resulting semantics
> of munmap after the change.

Thanks for the suggestion. Will polish the changelog.

Yang



^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC v3 PATCH 4/5] mm: mmap: zap pages with read mmap_sem for large mapping
  2018-07-02 12:33   ` Kirill A. Shutemov
  2018-07-02 12:49     ` Michal Hocko
@ 2018-07-02 17:19     ` Yang Shi
  2018-07-03  8:07       ` Kirill A. Shutemov
  1 sibling, 1 reply; 44+ messages in thread
From: Yang Shi @ 2018-07-02 17:19 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: mhocko, willy, ldufour, akpm, peterz, mingo, acme,
	alexander.shishkin, jolsa, namhyung, tglx, hpa, linux-mm, x86,
	linux-kernel



On 7/2/18 5:33 AM, Kirill A. Shutemov wrote:
> On Sat, Jun 30, 2018 at 06:39:44AM +0800, Yang Shi wrote:
>> When running some mmap/munmap scalability tests with large memory (i.e.
>>> 300GB), the below hung task issue may happen occasionally.
>> INFO: task ps:14018 blocked for more than 120 seconds.
>>         Tainted: G            E 4.9.79-009.ali3000.alios7.x86_64 #1
>>   "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
>> message.
>>   ps              D    0 14018      1 0x00000004
>>    ffff885582f84000 ffff885e8682f000 ffff880972943000 ffff885ebf499bc0
>>    ffff8828ee120000 ffffc900349bfca8 ffffffff817154d0 0000000000000040
>>    00ffffff812f872a ffff885ebf499bc0 024000d000948300 ffff880972943000
>>   Call Trace:
>>    [<ffffffff817154d0>] ? __schedule+0x250/0x730
>>    [<ffffffff817159e6>] schedule+0x36/0x80
>>    [<ffffffff81718560>] rwsem_down_read_failed+0xf0/0x150
>>    [<ffffffff81390a28>] call_rwsem_down_read_failed+0x18/0x30
>>    [<ffffffff81717db0>] down_read+0x20/0x40
>>    [<ffffffff812b9439>] proc_pid_cmdline_read+0xd9/0x4e0
>>    [<ffffffff81253c95>] ? do_filp_open+0xa5/0x100
>>    [<ffffffff81241d87>] __vfs_read+0x37/0x150
>>    [<ffffffff812f824b>] ? security_file_permission+0x9b/0xc0
>>    [<ffffffff81242266>] vfs_read+0x96/0x130
>>    [<ffffffff812437b5>] SyS_read+0x55/0xc0
>>    [<ffffffff8171a6da>] entry_SYSCALL_64_fastpath+0x1a/0xc5
>>
>> It is because munmap holds mmap_sem from very beginning to all the way
>> down to the end, and doesn't release it in the middle. When unmapping
>> large mapping, it may take long time (take ~18 seconds to unmap 320GB
>> mapping with every single page mapped on an idle machine).
>>
>> Zapping pages is the most time consuming part, according to the
>> suggestion from Michal Hock [1], zapping pages can be done with holding
>> read mmap_sem, like what MADV_DONTNEED does. Then re-acquire write
>> mmap_sem to cleanup vmas. All zapped vmas will have VM_DEAD flag set,
>> the page fault to VM_DEAD vma will trigger SIGSEGV.
>>
>> Define large mapping size thresh as PUD size or 1GB, just zap pages with
>> read mmap_sem for mappings which are >= thresh value.
>>
>> If the vma has VM_LOCKED | VM_HUGETLB | VM_PFNMAP or uprobe, then just
>> fallback to regular path since unmapping those mappings need acquire
>> write mmap_sem.
>>
>> For the time being, just do this in munmap syscall path. Other
>> vm_munmap() or do_munmap() call sites remain intact for stability
>> reason.
>>
>> The below is some regression and performance data collected on a machine
>> with 32 cores of E5-2680 @ 2.70GHz and 384GB memory.
>>
>> With the patched kernel, write mmap_sem hold time is dropped to us level
>> from second.
>>
>> [1] https://lwn.net/Articles/753269/
>>
>> Cc: Michal Hocko <mhocko@kernel.org>
>> Cc: Matthew Wilcox <willy@infradead.org>
>> Cc: Laurent Dufour <ldufour@linux.vnet.ibm.com>
>> Cc: Andrew Morton <akpm@linux-foundation.org>
>> Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
>> ---
>>   mm/mmap.c | 136 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
>>   1 file changed, 134 insertions(+), 2 deletions(-)
>>
>> diff --git a/mm/mmap.c b/mm/mmap.c
>> index 87dcf83..d61e08b 100644
>> --- a/mm/mmap.c
>> +++ b/mm/mmap.c
>> @@ -2763,6 +2763,128 @@ static int munmap_lookup_vma(struct mm_struct *mm, struct vm_area_struct **vma,
>>   	return 1;
>>   }
>>   
>> +/* Consider PUD size or 1GB mapping as large mapping */
>> +#ifdef HPAGE_PUD_SIZE
>> +#define LARGE_MAP_THRESH	HPAGE_PUD_SIZE
>> +#else
>> +#define LARGE_MAP_THRESH	(1 * 1024 * 1024 * 1024)
>> +#endif
> PUD_SIZE is defined everywhere.

That is true only if THP is enabled; otherwise it is:

#define HPAGE_PUD_SIZE ({ BUILD_BUG(); 0; })
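
If plain PUD_SIZE really is usable on every arch, the threshold could
presumably be reduced to (sketch):

/*
 * PUD_SIZE comes from the generic pgtable headers on every
 * architecture (it collapses to the next level's size when the pud
 * level is folded), so no THP-conditional fallback would be needed.
 */
#define LARGE_MAP_THRESH	PUD_SIZE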

>
>> +
>> +/* Unmap large mapping early with acquiring read mmap_sem */
>> +static int do_munmap_zap_early(struct mm_struct *mm, unsigned long start,
>> +			       size_t len, struct list_head *uf)
>> +{
>> +	unsigned long end = 0;
>> +	struct vm_area_struct *vma = NULL, *prev, *tmp;
>> +	bool success = false;
>> +	int ret = 0;
>> +
>> +	if (!munmap_addr_sanity(start, len))
>> +		return -EINVAL;
>> +
>> +	len = PAGE_ALIGN(len);
>> +
>> +	end = start + len;
>> +
>> +	/* Just deal with uf in regular path */
>> +	if (unlikely(uf))
>> +		goto regular_path;
>> +
>> +	if (len >= LARGE_MAP_THRESH) {
>> +		/*
>> +		 * need write mmap_sem to split vma and set VM_DEAD flag
>> +		 * splitting vma up-front to save PITA to clean if it is failed
> What errors are you talking about? ENOMEM on VMA split? Anything else?

Yes, ENOMEM on vma split.

>
>> +		 */
>> +		down_write(&mm->mmap_sem);
>> +		ret = munmap_lookup_vma(mm, &vma, &prev, start, end);
>> +		if (ret != 1) {
>> +			up_write(&mm->mmap_sem);
>> +			return ret;
>> +		}
>> +		/* This ret value might be returned, so reset it */
>> +		ret = 0;
>> +
>> +		/*
>> +		 * Unmapping vmas, which has VM_LOCKED|VM_HUGETLB|VM_PFNMAP
>> +		 * flag set or has uprobes set, need acquire write map_sem,
>> +		 * so skip them in early zap. Just deal with such mapping in
>> +		 * regular path.
>> +		 * Borrow can_madv_dontneed_vma() to check the conditions.
>> +		 */
>> +		tmp = vma;
>> +		while (tmp && tmp->vm_start < end) {
>> +			if (!can_madv_dontneed_vma(tmp) ||
>> +			    vma_has_uprobes(tmp, start, end)) {
>> +				up_write(&mm->mmap_sem);
>> +				goto regular_path;
>> +			}
>> +			tmp = tmp->vm_next;
>> +		}
>> +		/*
>> +		 * set VM_DEAD flag before tear down them.
>> +		 * page fault on VM_DEAD vma will trigger SIGSEGV.
>> +		 */
>> +		tmp = vma;
>> +		for ( ; tmp && tmp->vm_start < end; tmp = tmp->vm_next)
>> +			tmp->vm_flags |= VM_DEAD;
> I probably missed the explanation somewhere, but what's wrong with allowing
> another thread to re-populate the VMA?
>
> I would rather allow the VMA to be re-populated by another thread while we
> are zapping the range, and later zap the range again under down_write.
>
> It should also lead to a consolidated regular path: take mmap_sem for write
> and call do_munmap().
>
> On the first path we just skip VMAs we cannot deal with under
> down_read(mmap_sem); the regular path will take care of them.
>
>
>> +		up_write(&mm->mmap_sem);
>> +
>> +		/* zap mappings with read mmap_sem */
>> +		down_read(&mm->mmap_sem);
> Yeah. There's a race between up_write() and down_read().
> Use downgrade, as Andrew suggested.
>
>> +		zap_page_range(vma, start, len);
>> +		/* indicates early zap is success */
>> +		success = true;
>> +		up_read(&mm->mmap_sem);
> And here again.
>
> This race can be avoided if we don't carry the vma to regular_path, but
> just go directly to do_munmap().

Thanks, Kirill. Yes, I did think about re-validating vmas before. This
sounds like a reasonable way to avoid the race. Although we spend more
time re-looking up the vmas, that extra time should be very short, and
the duplicate zap should be very short too.
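
Something along these lines, perhaps (a rough sketch only, reusing the
helpers from this series, untested):

static int do_munmap_zap_early(struct mm_struct *mm, unsigned long start,
			       size_t len, struct list_head *uf)
{
	unsigned long end = start + PAGE_ALIGN(len);
	struct vm_area_struct *vma;
	int ret;

	/* First pass: best-effort zap under the shared lock. */
	down_read(&mm->mmap_sem);
	for (vma = find_vma(mm, start); vma && vma->vm_start < end;
	     vma = vma->vm_next) {
		unsigned long s = max(start, vma->vm_start);
		unsigned long e = min(end, vma->vm_end);

		/* Skip vmas needing the write lock; pass two handles them. */
		if (can_madv_dontneed_vma(vma) && !vma_has_uprobes(vma, s, e))
			zap_page_range(vma, s, e - s);
	}
	up_read(&mm->mmap_sem);

	/* Second pass: regular munmap revalidates vmas and zaps again. */
	if (down_write_killable(&mm->mmap_sem))
		return -EINTR;
	ret = do_munmap(mm, start, len, uf);
	up_write(&mm->mmap_sem);
	return ret;
}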

Yang

>
>> +	}
>> +
>> +regular_path:
>> +	/* hold write mmap_sem for vma manipulation or regular path */
>> +	if (down_write_killable(&mm->mmap_sem))
>> +		return -EINTR;
>> +	if (success) {
>> +		/* vmas have been zapped, here clean up pgtable and vmas */
>> +		struct vm_area_struct *next = prev ? prev->vm_next : mm->mmap;
>> +		struct mmu_gather tlb;
>> +		tlb_gather_mmu(&tlb, mm, start, end);
>> +		free_pgtables(&tlb, vma, prev ? prev->vm_end : FIRST_USER_ADDRESS,
>> +			      next ? next->vm_start : USER_PGTABLES_CEILING);
>> +		tlb_finish_mmu(&tlb, start, end);
>> +
>> +		detach_vmas_to_be_unmapped(mm, vma, prev, end);
>> +		arch_unmap(mm, vma, start, end);
>> +		remove_vma_list(mm, vma);
>> +	} else {
>> +		/* vma is VM_LOCKED|VM_HUGETLB|VM_PFNMAP or has uprobe */
>> +		if (vma) {
>> +			if (unlikely(uf)) {
>> +				int ret = userfaultfd_unmap_prep(vma, start,
>> +								 end, uf);
>> +				if (ret)
>> +					goto out;
>> +			}
>> +			if (mm->locked_vm) {
>> +				tmp = vma;
>> +				while (tmp && tmp->vm_start < end) {
>> +					if (tmp->vm_flags & VM_LOCKED) {
>> +						mm->locked_vm -= vma_pages(tmp);
>> +						munlock_vma_pages_all(tmp);
>> +					}
>> +					tmp = tmp->vm_next;
>> +				}
>> +			}
>> +			detach_vmas_to_be_unmapped(mm, vma, prev, end);
>> +			unmap_region(mm, vma, prev, start, end);
>> +			remove_vma_list(mm, vma);
>> +		} else
>> +			/* When mapping size < LARGE_MAP_THRESH */
>> +			ret = do_munmap(mm, start, len, uf);
>> +	}
>> +
>> +out:
>> +	up_write(&mm->mmap_sem);
>> +	return ret;
>> +}
>> +
>>   /* Munmap is split into 2 main parts -- this part which finds
>>    * what needs doing, and the areas themselves, which do the
>>    * work.  This now handles partial unmappings.
>> @@ -2829,6 +2951,17 @@ int do_munmap(struct mm_struct *mm, unsigned long start, size_t len,
>>   	return 0;
>>   }
>>   
>> +static int vm_munmap_zap_early(unsigned long start, size_t len)
>> +{
>> +	int ret;
>> +	struct mm_struct *mm = current->mm;
>> +	LIST_HEAD(uf);
>> +
>> +	ret = do_munmap_zap_early(mm, start, len, &uf);
>> +	userfaultfd_unmap_complete(mm, &uf);
>> +	return ret;
>> +}
>> +
>>   int vm_munmap(unsigned long start, size_t len)
>>   {
>>   	int ret;
>> @@ -2848,10 +2981,9 @@ int vm_munmap(unsigned long start, size_t len)
>>   SYSCALL_DEFINE2(munmap, unsigned long, addr, size_t, len)
>>   {
>>   	profile_munmap(addr);
>> -	return vm_munmap(addr, len);
>> +	return vm_munmap_zap_early(addr, len);
>>   }
>>   
>> -
>>   /*
>>    * Emulation of deprecated remap_file_pages() syscall.
>>    */
>> -- 
>> 1.8.3.1
>>


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC v3 PATCH 5/5] x86: check VM_DEAD flag in page fault
  2018-07-02 13:37             ` Michal Hocko
@ 2018-07-02 17:24               ` Yang Shi
  2018-07-02 17:57                 ` Michal Hocko
  0 siblings, 1 reply; 44+ messages in thread
From: Yang Shi @ 2018-07-02 17:24 UTC (permalink / raw)
  To: Michal Hocko, Laurent Dufour
  Cc: willy, akpm, peterz, mingo, acme, alexander.shishkin, jolsa,
	namhyung, tglx, hpa, linux-mm, x86, linux-kernel



On 7/2/18 6:37 AM, Michal Hocko wrote:
> On Mon 02-07-18 15:33:11, Laurent Dufour wrote:
>>
>> On 02/07/2018 14:45, Michal Hocko wrote:
>>> On Mon 02-07-18 14:26:09, Laurent Dufour wrote:
>>>> On 02/07/2018 14:15, Michal Hocko wrote:
> [...]
>>>>> We already do have a model for that. Have a look at MMF_UNSTABLE.
>>>> MMF_UNSTABLE is a mm's flag, here this is a VMA's flag which is checked.
>>> Yeah, and we have the VMA ready for all places where we do check the
>>> flag. check_stable_address_space can be made to get vma rather than mm.
>> Yeah, this would have been more efficient to check that flag at the beginning
>> of the page fault handler rather than the end, but this way it will be easier
>> to handle the speculative page fault too ;)
> The thing is that it doesn't really need to be called earlier. You are
> not risking data corruption on file backed mappings.

OK, I just think it could save a few cycles to check the flag earlier.

If nobody thinks it is necessary, we definitely could re-use
check_stable_address_space(), just return VM_FAULT_SIGSEGV for VM_DEAD
vma, and check for both shared and non-shared.

Thanks,
Yang

>


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC v3 PATCH 5/5] x86: check VM_DEAD flag in page fault
  2018-07-02 17:24               ` Yang Shi
@ 2018-07-02 17:57                 ` Michal Hocko
  2018-07-02 18:10                   ` Yang Shi
  0 siblings, 1 reply; 44+ messages in thread
From: Michal Hocko @ 2018-07-02 17:57 UTC (permalink / raw)
  To: Yang Shi
  Cc: Laurent Dufour, willy, akpm, peterz, mingo, acme,
	alexander.shishkin, jolsa, namhyung, tglx, hpa, linux-mm, x86,
	linux-kernel

On Mon 02-07-18 10:24:27, Yang Shi wrote:
> 
> 
> On 7/2/18 6:37 AM, Michal Hocko wrote:
> > On Mon 02-07-18 15:33:11, Laurent Dufour wrote:
> > > 
> > > On 02/07/2018 14:45, Michal Hocko wrote:
> > > > On Mon 02-07-18 14:26:09, Laurent Dufour wrote:
> > > > > On 02/07/2018 14:15, Michal Hocko wrote:
> > [...]
> > > > > > We already do have a model for that. Have a look at MMF_UNSTABLE.
> > > > > MMF_UNSTABLE is a mm's flag, here this is a VMA's flag which is checked.
> > > > Yeah, and we have the VMA ready for all places where we do check the
> > > > flag. check_stable_address_space can be made to get vma rather than mm.
> > > Yeah, this would have been more efficient to check that flag at the beginning
> > > of the page fault handler rather than the end, but this way it will be easier
> > > to handle the speculative page fault too ;)
> > The thing is that it doesn't really need to be called earlier. You are
> > not risking data corruption on file backed mappings.
> 
> OK, I just think it could save a few cycles to check the flag earlier.

This should be an extremely rare case. Just think about it. It should
only ever happen when an access races with munmap which itself is
questionable if not an outright bug.

> If nobody thinks it is necessary, we definitely could re-use
> check_stable_address_space(),

If we really need this whole VM_DEAD thing then it should be better
handled at the same place rather than some ad-hoc places.

> just return VM_FAULT_SIGSEGV for VM_DEAD vma,
> and check for both shared and non-shared.

Why would you even care about shared mappings?
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC v3 PATCH 3/5] mm: refactor do_munmap() to extract the common part
  2018-07-02 16:59     ` Yang Shi
@ 2018-07-02 17:58       ` Michal Hocko
  2018-07-02 18:02         ` Yang Shi
  0 siblings, 1 reply; 44+ messages in thread
From: Michal Hocko @ 2018-07-02 17:58 UTC (permalink / raw)
  To: Yang Shi
  Cc: willy, ldufour, akpm, peterz, mingo, acme, alexander.shishkin,
	jolsa, namhyung, tglx, hpa, linux-mm, x86, linux-kernel

On Mon 02-07-18 09:59:06, Yang Shi wrote:
> 
> 
> On 7/2/18 6:42 AM, Michal Hocko wrote:
> > On Sat 30-06-18 06:39:43, Yang Shi wrote:
> > > Introduces two new helper functions:
> > >    * munmap_addr_sanity()
> > >    * munmap_lookup_vma()
> > > 
> > > They will be used by do_munmap() and the new do_munmap with zapping
> > > large mapping early in the later patch.
> > > 
> > > There is no functional change, just code refactor.
> > There are whitespace changes which make the code much harder to review
> > than necessary.
> > > +static inline bool munmap_addr_sanity(unsigned long start, size_t len)
> > >   {
> > > -	unsigned long end;
> > > -	struct vm_area_struct *vma, *prev, *last;
> > > +	if ((offset_in_page(start)) || start > TASK_SIZE || len > TASK_SIZE - start)
> > > +		return false;
> > > -	if ((offset_in_page(start)) || start > TASK_SIZE || len > TASK_SIZE-start)
> > > -		return -EINVAL;
> > e.g. here.
> 
> Oh, yes. I did some coding style cleanup too.

If you want to do some coding style cleanups make them a separate patch.
The resulting diff would be much easier to review.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC v3 PATCH 3/5] mm: refactor do_munmap() to extract the common part
  2018-07-02 17:58       ` Michal Hocko
@ 2018-07-02 18:02         ` Yang Shi
  0 siblings, 0 replies; 44+ messages in thread
From: Yang Shi @ 2018-07-02 18:02 UTC (permalink / raw)
  To: Michal Hocko
  Cc: willy, ldufour, akpm, peterz, mingo, acme, alexander.shishkin,
	jolsa, namhyung, tglx, hpa, linux-mm, x86, linux-kernel



On 7/2/18 10:58 AM, Michal Hocko wrote:
> On Mon 02-07-18 09:59:06, Yang Shi wrote:
>>
>> On 7/2/18 6:42 AM, Michal Hocko wrote:
>>> On Sat 30-06-18 06:39:43, Yang Shi wrote:
>>>> Introduces two new helper functions:
>>>>     * munmap_addr_sanity()
>>>>     * munmap_lookup_vma()
>>>>
>>>> They will be used by do_munmap() and the new do_munmap with zapping
>>>> large mapping early in the later patch.
>>>>
>>>> There is no functional change, just code refactor.
>>> There are whitespace changes which make the code much harder to review
>>> than necessary.
>>>> +static inline bool munmap_addr_sanity(unsigned long start, size_t len)
>>>>    {
>>>> -	unsigned long end;
>>>> -	struct vm_area_struct *vma, *prev, *last;
>>>> +	if ((offset_in_page(start)) || start > TASK_SIZE || len > TASK_SIZE - start)
>>>> +		return false;
>>>> -	if ((offset_in_page(start)) || start > TASK_SIZE || len > TASK_SIZE-start)
>>>> -		return -EINVAL;
>>> e.g. here.
>> Oh, yes. I did some coding style cleanup too.
> If you want to do some coding style cleanups make them a separate patch.
> The resulting diff would be much easier to review.

Sure, thanks

>


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC v3 PATCH 5/5] x86: check VM_DEAD flag in page fault
  2018-07-02 17:57                 ` Michal Hocko
@ 2018-07-02 18:10                   ` Yang Shi
  2018-07-03  6:17                     ` Michal Hocko
  0 siblings, 1 reply; 44+ messages in thread
From: Yang Shi @ 2018-07-02 18:10 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Laurent Dufour, willy, akpm, peterz, mingo, acme,
	alexander.shishkin, jolsa, namhyung, tglx, hpa, linux-mm, x86,
	linux-kernel



On 7/2/18 10:57 AM, Michal Hocko wrote:
> On Mon 02-07-18 10:24:27, Yang Shi wrote:
>>
>> On 7/2/18 6:37 AM, Michal Hocko wrote:
>>> On Mon 02-07-18 15:33:11, Laurent Dufour wrote:
>>>> On 02/07/2018 14:45, Michal Hocko wrote:
>>>>> On Mon 02-07-18 14:26:09, Laurent Dufour wrote:
>>>>>> On 02/07/2018 14:15, Michal Hocko wrote:
>>> [...]
>>>>>>> We already do have a model for that. Have a look at MMF_UNSTABLE.
>>>>>> MMF_UNSTABLE is a mm's flag, here this is a VMA's flag which is checked.
>>>>> Yeah, and we have the VMA ready for all places where we do check the
>>>>> flag. check_stable_address_space can be made to get vma rather than mm.
>>>> Yeah, this would have been more efficient to check that flag at the beginning
>>>> of the page fault handler rather than the end, but this way it will be easier
>>>> to handle the speculative page fault too ;)
>>> The thing is that it doesn't really need to be called earlier. You are
>>> not risking data corruption on file backed mappings.
>> OK, I just think it could save a few cycles to check the flag earlier.
> This should be an extremely rare case. Just think about it. It should
> only ever happen when an access races with munmap which itself is
> questionable if not an outright bug.
>
>> If nobody thinks it is necessary, we definitely could re-use
>> check_stable_address_space(),
> If we really need this whole VM_DEAD thing then it should be better
> handled at the same place rather than some ad-hoc places.
>
>> just return VM_FAULT_SIGSEGV for VM_DEAD vma,
>> and check for both shared and non-shared.
> Why would you even care about shared mappings?

I was just thinking that we are dealing with VM_DEAD, which means the vma
will be torn down soon regardless of whether it is shared or non-shared.

MMF_UNSTABLE doesn't care about the !shared case.



^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC v3 PATCH 4/5] mm: mmap: zap pages with read mmap_sem for large mapping
  2018-07-02 14:05         ` Michal Hocko
@ 2018-07-02 20:48           ` Andrew Morton
  2018-07-03  6:09             ` Michal Hocko
  0 siblings, 1 reply; 44+ messages in thread
From: Andrew Morton @ 2018-07-02 20:48 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Yang Shi, willy, ldufour, peterz, mingo, acme,
	alexander.shishkin, jolsa, namhyung, tglx, hpa, linux-mm, x86,
	linux-kernel

On Mon, 2 Jul 2018 16:05:02 +0200 Michal Hocko <mhocko@kernel.org> wrote:

> On Fri 29-06-18 20:15:47, Andrew Morton wrote:
> [...]
> > Would one of your earlier designs have addressed all usecases?  I
> > expect the dumb unmap-a-little-bit-at-a-time approach would have?
> 
> It has been already pointed out that this will not work.

I said "one of".  There were others.

> You simply
> cannot drop the mmap_sem during unmap because another thread could
> change the address space under your feet. So you need some form of
> VM_DEAD and handle concurrent and conflicting address space operations.

Unclear that this is a problem.  If a thread does an unmap of a range
of virtual address space, there's no guarantee that upon return some
other thread has not already mapped new stuff into that address range. 
So what's changed?



^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC v3 PATCH 4/5] mm: mmap: zap pages with read mmap_sem for large mapping
  2018-06-30  4:26         ` Yang Shi
@ 2018-07-03  0:01           ` Yang Shi
  0 siblings, 0 replies; 44+ messages in thread
From: Yang Shi @ 2018-07-03  0:01 UTC (permalink / raw)
  To: Andrew Morton
  Cc: mhocko, willy, ldufour, peterz, mingo, acme, alexander.shishkin,
	jolsa, namhyung, tglx, hpa, linux-mm, x86, linux-kernel



On 6/29/18 9:26 PM, Yang Shi wrote:
>
>
> On 6/29/18 8:15 PM, Andrew Morton wrote:
>> On Fri, 29 Jun 2018 19:28:15 -0700 Yang Shi 
>> <yang.shi@linux.alibaba.com> wrote:
>>
>>>
>>>> we're adding a bunch of code to 32-bit kernels which will never be
>>>> executed.
>>>>
>>>> I'm thinking it would be better to be much more explicit with "#ifdef
>>>> CONFIG_64BIT" in this code, rather than relying upon the above magic.
>>>>
>>>> But I tend to think that the fact that we haven't solved anything on
>>>> locked vmas or on uprobed mappings is a shostopper for the whole
>>>> approach :(
>>> I agree it is not that perfect. But, it still could improve the most 
>>> use
>>> cases.
>> Well, those unaddressed usecases will need to be fixed at some point.
>
> Yes, definitely.
>
>> What's our plan for that?
>
> As I mentioned in the earlier email, locked and hugetlb cases might be 
> able to be solved by separating vm_flags update and actual unmap. I 
> will look into it further later.

By looking into this further, I think both mlocked and hugetlb vmas can
be handled.

For mlocked vmas, it is easy since we acquire write mmap_sem before
unmapping, so the VM_LOCKED flag can be cleared there before the unmap,
just like what the regular path does.

For hugetlb vmas, the VM_MAYSHARE flag is checked by huge_pmd_share()
in hugetlb_fault()->huge_pte_alloc(); the other call site is
dup_mm()->copy_page_range()->copy_hugetlb_page_range(), and we don't
care about that call chain in this case.

So we may expand VM_DEAD to hugetlb_fault().  Michal suggested checking
VM_DEAD in check_stable_address_space(), so it would be called from
hugetlb_fault() too (it is not in the current code); then the page fault
handler would bail out before huge_pte_alloc() is called.

With this trick, we don't have to care about when vm_flags is updated;
we can unmap hugetlb vmas in the read mmap_sem critical section, then
update vm_flags with write mmap_sem held, or before the unmap.

Yang

>
> From my point of view, uprobe mapping sounds not that vital.
>
>>
>> Would one of your earlier designs have addressed all usecases? I
>> expect the dumb unmap-a-little-bit-at-a-time approach would have?
>
> Yes. The v1 design does unmap with holding write map_sem. So, the 
> vm_flags update is not a problem.
>
> Thanks,
> Yang
>
>>
>>> For the locked vmas and hugetlb vmas, unmapping operations need modify
>>> vm_flags. But, I'm wondering we might be able to separate unmap and
>>> vm_flags update. Because we know they will be unmapped right away, the
>>> vm_flags might be able to be updated in write mmap_sem critical section
>>> before the actual unmap is called or after it. This is just off the top
>>> of my head.
>>>
>>> For uprobed mappings, I'm not sure how vital it is to this case.
>>>
>>> Thanks,
>>> Yang
>>>
>


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC v3 PATCH 4/5] mm: mmap: zap pages with read mmap_sem for large mapping
  2018-07-02 20:48           ` Andrew Morton
@ 2018-07-03  6:09             ` Michal Hocko
  2018-07-03 16:53               ` Yang Shi
  2018-07-03 18:22               ` Yang Shi
  0 siblings, 2 replies; 44+ messages in thread
From: Michal Hocko @ 2018-07-03  6:09 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Yang Shi, willy, ldufour, peterz, mingo, acme,
	alexander.shishkin, jolsa, namhyung, tglx, hpa, linux-mm, x86,
	linux-kernel

On Mon 02-07-18 13:48:45, Andrew Morton wrote:
> On Mon, 2 Jul 2018 16:05:02 +0200 Michal Hocko <mhocko@kernel.org> wrote:
> 
> > On Fri 29-06-18 20:15:47, Andrew Morton wrote:
> > [...]
> > > Would one of your earlier designs have addressed all usecases?  I
> > > expect the dumb unmap-a-little-bit-at-a-time approach would have?
> > 
> > It has been already pointed out that this will not work.
> 
> I said "one of".  There were others.

Well, I was aware of only two potential solutions: either do the heavy
lifting under the shared lock and the rest with the exclusive one, or
this one, dropping the lock piecemeal, part by part. Maybe I have missed
others?

> > You simply
> > cannot drop the mmap_sem during unmap because another thread could
> > change the address space under your feet. So you need some form of
> > VM_DEAD and handle concurrent and conflicting address space operations.
> 
> Unclear that this is a problem.  If a thread does an unmap of a range
> of virtual address space, there's no guarantee that upon return some
> other thread has not already mapped new stuff into that address range. 
> So what's changed?

Well, consider the following scenario:
Thread A = calling mmap(NULL, sizeA)
Thread B = calling munmap(addr, sizeB)

They do not use any external synchronization and rely on the atomic
munmap. Thread B only munmaps a range that it knows belongs to it (e.g.
it called mmap on it in the past). It should be clear that Thread A
should not get an address from the addr, sizeB range, right? In the
simplest case it will not happen. But let's say that the addr, sizeB
range has unmapped holes for whatever reason. Now any time munmap drops
the exclusive lock after handling one VMA, Thread A might find its sizeA
range and use it. Thread B then might remove this new range as soon as
it regains the exclusive lock.
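
In userspace terms the race looks roughly like this (a sketch only; the
sizes and setup are invented to illustrate the point):

/*
 * Thread B munmaps a sparse range it owns while thread A lets the
 * kernel pick an address.  With an atomic munmap, A can never be handed
 * a hole that B is still in the middle of tearing down; if munmap drops
 * the lock per VMA, it can.
 */
#include <pthread.h>
#include <stddef.h>
#include <sys/mman.h>

static void *addr_b;			/* range mapped earlier by thread B */
static const size_t size_b = 4UL << 30;	/* example sizes only */
static const size_t size_a = 2UL << 20;

static void *thread_a(void *arg)
{
	/* may land inside [addr_b, addr_b + size_b) if munmap is not atomic */
	return mmap(NULL, size_a, PROT_READ | PROT_WRITE,
		    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
}

static void *thread_b(void *arg)
{
	munmap(addr_b, size_b);		/* only memory B believes it owns */
	return NULL;
}

int main(void)
{
	pthread_t a, b;

	addr_b = mmap(NULL, size_b, PROT_READ | PROT_WRITE,
		      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	/* punch a hole so the range spans more than one vma */
	munmap((char *)addr_b + size_a, size_a);
	pthread_create(&b, NULL, thread_b, NULL);
	pthread_create(&a, NULL, thread_a, NULL);
	pthread_join(b, NULL);
	pthread_join(a, NULL);
	return 0;
}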

Is such code safe? No, it is not, and I would call it fragile at best,
but people tend to do weird things, and atomic munmap behavior is
something they can easily depend on.

Another example would be atomic address range probing with
MAP_FIXED_NOREPLACE. It would simply break for similar reasons.

I remember my attempt to make MAP_LOCKED consistent with mlock (if the
population fails then return -ENOMEM); it required dropping the shared
mmap_sem and retaking it in exclusive mode (because we do not have
upgrade_read), and Linus was strongly against that [1][2] for very
similar reasons. If you drop the lock, you simply do not know what
happened under your feet.

[1] http://lkml.kernel.org/r/CA+55aFydkG-BgZzry5DrTzueVh9VvEcVJdLV8iOyUphQk=0vpw@mail.gmail.com
[2] http://lkml.kernel.org/r/CA+55aFyajquhGhw59qNWKGK4dBV0TPmDD7-1XqPo7DZWvO_hPg@mail.gmail.com
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC v3 PATCH 5/5] x86: check VM_DEAD flag in page fault
  2018-07-02 18:10                   ` Yang Shi
@ 2018-07-03  6:17                     ` Michal Hocko
  2018-07-03 16:50                       ` Yang Shi
  0 siblings, 1 reply; 44+ messages in thread
From: Michal Hocko @ 2018-07-03  6:17 UTC (permalink / raw)
  To: Yang Shi
  Cc: Laurent Dufour, willy, akpm, peterz, mingo, acme,
	alexander.shishkin, jolsa, namhyung, tglx, hpa, linux-mm, x86,
	linux-kernel

On Mon 02-07-18 11:10:23, Yang Shi wrote:
> On 7/2/18 10:57 AM, Michal Hocko wrote:
[...]
> > Why would you even care about shared mappings?
> 
> Just thought about we are dealing with VM_DEAD, which means the vma will be
> tore down soon regardless it is shared or non-shared.
> 
> MMF_UNSTABLE doesn't care about !shared case.

Let me clarify some more. MMF_UNSTABLE is there to prevent unexpected
page faults while the mm is being torn down by the oom reaper, and the
oom reaper only cares about private mappings because we do not touch
shared ones. Disk-based shared mappings should be a non-issue for
VM_DEAD because even if you race and refault a page back, you know it is
the same one you have seen before. Memory-backed shared mappings are a
different story because you can get a fresh new page. The oom_reaper
doesn't care because it doesn't tear those down. You would have to, but
my primary point was that we already have MMF_UNSTABLE, so all you need
is to extend it to memory-backed shared mappings (shmem and hugetlb).
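
A rough sketch of that direction (the helper name and the exact filter
are assumptions for illustration, not code from this series):

/*
 * Sketch: an MMF_UNSTABLE-style check that also refuses re-faults into
 * memory-backed shared mappings (shmem, hugetlb), where a racing fault
 * could otherwise hand back a fresh page.
 */
static inline int check_unstable_vma(struct vm_area_struct *vma)
{
	if (!test_bit(MMF_UNSTABLE, &vma->vm_mm->flags))
		return 0;
	/* disk-backed shared mappings refault the same data, so let them be */
	if (vma_is_anonymous(vma) || vma_is_shmem(vma) ||
	    is_vm_hugetlb_page(vma))
		return VM_FAULT_SIGBUS;
	return 0;
}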

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC v3 PATCH 4/5] mm: mmap: zap pages with read mmap_sem for large mapping
  2018-07-02 17:19     ` Yang Shi
@ 2018-07-03  8:07       ` Kirill A. Shutemov
  0 siblings, 0 replies; 44+ messages in thread
From: Kirill A. Shutemov @ 2018-07-03  8:07 UTC (permalink / raw)
  To: Yang Shi
  Cc: mhocko, willy, ldufour, akpm, peterz, mingo, acme,
	alexander.shishkin, jolsa, namhyung, tglx, hpa, linux-mm, x86,
	linux-kernel

On Mon, Jul 02, 2018 at 10:19:32AM -0700, Yang Shi wrote:
> 
> 
> On 7/2/18 5:33 AM, Kirill A. Shutemov wrote:
> > On Sat, Jun 30, 2018 at 06:39:44AM +0800, Yang Shi wrote:
> > > When running some mmap/munmap scalability tests with large memory (i.e.
> > > > 300GB), the below hung task issue may happen occasionally.
> > > INFO: task ps:14018 blocked for more than 120 seconds.
> > >         Tainted: G            E 4.9.79-009.ali3000.alios7.x86_64 #1
> > >   "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
> > > message.
> > >   ps              D    0 14018      1 0x00000004
> > >    ffff885582f84000 ffff885e8682f000 ffff880972943000 ffff885ebf499bc0
> > >    ffff8828ee120000 ffffc900349bfca8 ffffffff817154d0 0000000000000040
> > >    00ffffff812f872a ffff885ebf499bc0 024000d000948300 ffff880972943000
> > >   Call Trace:
> > >    [<ffffffff817154d0>] ? __schedule+0x250/0x730
> > >    [<ffffffff817159e6>] schedule+0x36/0x80
> > >    [<ffffffff81718560>] rwsem_down_read_failed+0xf0/0x150
> > >    [<ffffffff81390a28>] call_rwsem_down_read_failed+0x18/0x30
> > >    [<ffffffff81717db0>] down_read+0x20/0x40
> > >    [<ffffffff812b9439>] proc_pid_cmdline_read+0xd9/0x4e0
> > >    [<ffffffff81253c95>] ? do_filp_open+0xa5/0x100
> > >    [<ffffffff81241d87>] __vfs_read+0x37/0x150
> > >    [<ffffffff812f824b>] ? security_file_permission+0x9b/0xc0
> > >    [<ffffffff81242266>] vfs_read+0x96/0x130
> > >    [<ffffffff812437b5>] SyS_read+0x55/0xc0
> > >    [<ffffffff8171a6da>] entry_SYSCALL_64_fastpath+0x1a/0xc5
> > > 
> > > It is because munmap holds mmap_sem from very beginning to all the way
> > > down to the end, and doesn't release it in the middle. When unmapping
> > > large mapping, it may take long time (take ~18 seconds to unmap 320GB
> > > mapping with every single page mapped on an idle machine).
> > > 
> > > Zapping pages is the most time consuming part, according to the
> > > suggestion from Michal Hock [1], zapping pages can be done with holding
> > > read mmap_sem, like what MADV_DONTNEED does. Then re-acquire write
> > > mmap_sem to cleanup vmas. All zapped vmas will have VM_DEAD flag set,
> > > the page fault to VM_DEAD vma will trigger SIGSEGV.
> > > 
> > > Define large mapping size thresh as PUD size or 1GB, just zap pages with
> > > read mmap_sem for mappings which are >= thresh value.
> > > 
> > > If the vma has VM_LOCKED | VM_HUGETLB | VM_PFNMAP or uprobe, then just
> > > fallback to regular path since unmapping those mappings need acquire
> > > write mmap_sem.
> > > 
> > > For the time being, just do this in munmap syscall path. Other
> > > vm_munmap() or do_munmap() call sites remain intact for stability
> > > reason.
> > > 
> > > The below is some regression and performance data collected on a machine
> > > with 32 cores of E5-2680 @ 2.70GHz and 384GB memory.
> > > 
> > > With the patched kernel, write mmap_sem hold time is dropped to us level
> > > from second.
> > > 
> > > [1] https://lwn.net/Articles/753269/
> > > 
> > > Cc: Michal Hocko <mhocko@kernel.org>
> > > Cc: Matthew Wilcox <willy@infradead.org>
> > > Cc: Laurent Dufour <ldufour@linux.vnet.ibm.com>
> > > Cc: Andrew Morton <akpm@linux-foundation.org>
> > > Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
> > > ---
> > >   mm/mmap.c | 136 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
> > >   1 file changed, 134 insertions(+), 2 deletions(-)
> > > 
> > > diff --git a/mm/mmap.c b/mm/mmap.c
> > > index 87dcf83..d61e08b 100644
> > > --- a/mm/mmap.c
> > > +++ b/mm/mmap.c
> > > @@ -2763,6 +2763,128 @@ static int munmap_lookup_vma(struct mm_struct *mm, struct vm_area_struct **vma,
> > >   	return 1;
> > >   }
> > > +/* Consider PUD size or 1GB mapping as large mapping */
> > > +#ifdef HPAGE_PUD_SIZE
> > > +#define LARGE_MAP_THRESH	HPAGE_PUD_SIZE
> > > +#else
> > > +#define LARGE_MAP_THRESH	(1 * 1024 * 1024 * 1024)
> > > +#endif
> > PUD_SIZE is defined everywhere.
> 
> If THP is defined, otherwise it is:
> 
> #define HPAGE_PUD_SIZE ({ BUILD_BUG(); 0; })

I'm talking about PUD_SIZE, not HPAGE_PUD_SIZE.
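
In other words, the threshold definition could presumably just be
(sketch of what that suggestion amounts to, not a patch from this
series):

/*
 * PUD_SIZE comes from the generic pgtable (folding) headers on every
 * architecture, so no HPAGE_PUD_SIZE fallback is needed.
 */
#define LARGE_MAP_THRESH	PUD_SIZE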

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC v3 PATCH 4/5] mm: mmap: zap pages with read mmap_sem for large mapping
  2018-07-02 12:49     ` Michal Hocko
@ 2018-07-03  8:12       ` Kirill A. Shutemov
  2018-07-03  8:27         ` Michal Hocko
  0 siblings, 1 reply; 44+ messages in thread
From: Kirill A. Shutemov @ 2018-07-03  8:12 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Yang Shi, willy, ldufour, akpm, peterz, mingo, acme,
	alexander.shishkin, jolsa, namhyung, tglx, hpa, linux-mm, x86,
	linux-kernel

On Mon, Jul 02, 2018 at 02:49:28PM +0200, Michal Hocko wrote:
> On Mon 02-07-18 15:33:50, Kirill A. Shutemov wrote:
> [...]
> > I probably miss the explanation somewhere, but what's wrong with allowing
> > other thread to re-populate the VMA?
> 
> We have discussed that earlier and it boils down to how is racy access
> to munmap supposed to behave. Right now we have either the original
> content or SEGV. If we allow to simply madvise_dontneed before real
> unmap we could get a new page as well. There might be (quite broken I
> would say) user space code that would simply corrupt data silently that
> way.

Okay, so we add a lot of complexity to accommodate broken userspace that
may or may not exist. Is it right? :)

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC v3 PATCH 4/5] mm: mmap: zap pages with read mmap_sem for large mapping
  2018-07-03  8:12       ` Kirill A. Shutemov
@ 2018-07-03  8:27         ` Michal Hocko
  0 siblings, 0 replies; 44+ messages in thread
From: Michal Hocko @ 2018-07-03  8:27 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Yang Shi, willy, ldufour, akpm, peterz, mingo, acme,
	alexander.shishkin, jolsa, namhyung, tglx, hpa, linux-mm, x86,
	linux-kernel

On Tue 03-07-18 11:12:05, Kirill A. Shutemov wrote:
> On Mon, Jul 02, 2018 at 02:49:28PM +0200, Michal Hocko wrote:
> > On Mon 02-07-18 15:33:50, Kirill A. Shutemov wrote:
> > [...]
> > > I probably miss the explanation somewhere, but what's wrong with allowing
> > > other thread to re-populate the VMA?
> > 
> > We have discussed that earlier and it boils down to how is racy access
> > to munmap supposed to behave. Right now we have either the original
> > content or SEGV. If we allow to simply madvise_dontneed before real
> > unmap we could get a new page as well. There might be (quite broken I
> > would say) user space code that would simply corrupt data silently that
> > way.
> 
> Okay, so we add a lot of complexity to accommodate broken userspace that
> may or may not exist. Is it right? :)

I would really love to do the simplest and most obvious thing

diff --git a/mm/mmap.c b/mm/mmap.c
index 336bee8c4e25..86ffb179c3b5 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2811,6 +2811,8 @@ EXPORT_SYMBOL(vm_munmap);
 SYSCALL_DEFINE2(munmap, unsigned long, addr, size_t, len)
 {
 	profile_munmap(addr);
+	if (len > LARGE_NUMBER)
+		do_madvise(addr, len, MADV_DONTNEED);
 	return vm_munmap(addr, len);
 }
 
but the argument that the current semantics of good data or SEGV on
racing threads are no longer preserved sounds valid to me. Remember,
optimizations shouldn't eat your data. How do we ensure that we won't
corrupt data silently?

Besides that, if it were this simple then we would not even need any
kernel code. You could do it from glibc or any munmap wrapper. So maybe
the proper answer is: if you do care, then just help the system and
MADV_DONTNEED your data before you munmap, as an optimization for large
mappings.
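
For example, a userspace-only wrapper along those lines might look like
this (a sketch; the threshold value is arbitrary, not anything libc
defines):

#include <stddef.h>
#include <sys/mman.h>

/* arbitrary example threshold for what counts as a "large mapping" */
#define MUNMAP_DONTNEED_THRESH	(1UL << 30)

static int munmap_large(void *addr, size_t len)
{
	/* zap the pages via the MADV_DONTNEED path first (read mmap_sem),
	 * then do the real, still atomic, munmap */
	if (len >= MUNMAP_DONTNEED_THRESH)
		(void)madvise(addr, len, MADV_DONTNEED);
	return munmap(addr, len);
}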
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 44+ messages in thread

* Re: [RFC v3 PATCH 5/5] x86: check VM_DEAD flag in page fault
  2018-07-03  6:17                     ` Michal Hocko
@ 2018-07-03 16:50                       ` Yang Shi
  0 siblings, 0 replies; 44+ messages in thread
From: Yang Shi @ 2018-07-03 16:50 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Laurent Dufour, willy, akpm, peterz, mingo, acme,
	alexander.shishkin, jolsa, namhyung, tglx, hpa, linux-mm, x86,
	linux-kernel



On 7/2/18 11:17 PM, Michal Hocko wrote:
> On Mon 02-07-18 11:10:23, Yang Shi wrote:
>> On 7/2/18 10:57 AM, Michal Hocko wrote:
> [...]
>>> Why would you even care about shared mappings?
>> Just thought about we are dealing with VM_DEAD, which means the vma will be
>> tore down soon regardless it is shared or non-shared.
>>
>> MMF_UNSTABLE doesn't care about !shared case.

Sorry, this is a typo, it should be "shared".

> Let me clarify some more. MMF_UNSTABLE is there to prevent from
> unexpected page faults when the mm is torn down by the oom reaper. And
> oom reaper only cares about private mappings because we do not touch
> shared ones. Disk based shared mappings should be a non-issue for
> VM_DEAD because even if you race and refault a page back then you know
> it is the same one you have seen before. Memory backed shared mappings
> are a different story because you can get a fresh new page. oom_reaper
> doesn't care because it doesn't tear those down. You would have to but
> my primary point was that we already have MMF_UNSTABLE so all you need
> is to extend it to memory backed shared mappings (shmem and hugetlb).

Yes, sure. I think I got your point. Thanks for the elaboration.

Yang

>


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC v3 PATCH 4/5] mm: mmap: zap pages with read mmap_sem for large mapping
  2018-07-03  6:09             ` Michal Hocko
@ 2018-07-03 16:53               ` Yang Shi
  2018-07-03 18:22               ` Yang Shi
  1 sibling, 0 replies; 44+ messages in thread
From: Yang Shi @ 2018-07-03 16:53 UTC (permalink / raw)
  To: Michal Hocko, Andrew Morton
  Cc: willy, ldufour, peterz, mingo, acme, alexander.shishkin, jolsa,
	namhyung, tglx, hpa, linux-mm, x86, linux-kernel



On 7/2/18 11:09 PM, Michal Hocko wrote:
> On Mon 02-07-18 13:48:45, Andrew Morton wrote:
>> On Mon, 2 Jul 2018 16:05:02 +0200 Michal Hocko <mhocko@kernel.org> wrote:
>>
>>> On Fri 29-06-18 20:15:47, Andrew Morton wrote:
>>> [...]
>>>> Would one of your earlier designs have addressed all usecases?  I
>>>> expect the dumb unmap-a-little-bit-at-a-time approach would have?
>>> It has been already pointed out that this will not work.
>> I said "one of".  There were others.
> Well, I was aware only about two potential solutions. Either do the
> heavy lifting under the shared lock and do the rest with the exlusive
> one and this, drop the lock per parts. Maybe I have missed others?

There is the other one which I presented at the LSFMM summit. But it
actually turns out that it looks very similar to the one currently under
review.

Yang

>
>>> You simply
>>> cannot drop the mmap_sem during unmap because another thread could
>>> change the address space under your feet. So you need some form of
>>> VM_DEAD and handle concurrent and conflicting address space operations.
>> Unclear that this is a problem.  If a thread does an unmap of a range
>> of virtual address space, there's no guarantee that upon return some
>> other thread has not already mapped new stuff into that address range.
>> So what's changed?
> Well, consider the following scenario:
> Thread A = calling mmap(NULL, sizeA)
> Thread B = calling munmap(addr, sizeB)
>
> They do not use any external synchronization and rely on the atomic
> munmap. Thread B only munmaps range that it knows belongs to it (e.g.
> called mmap in the past). It should be clear that ThreadA should not
> get an address from the addr, sizeB range, right? In the most simple case
> it will not happen. But let's say that the addr, sizeB range has
> unmapped holes for what ever reasons. Now anytime munmap drops the
> exclusive lock after handling one VMA, Thread A might find its sizeA
> range and use it. ThreadB then might remove this new range as soon as it
> gets its exclusive lock again.
>
> Is such a code safe? No it is not and I would call it fragile at best
> but people tend to do weird things and atomic munmap behavior is
> something they can easily depend on.
>
> Another example would be an atomic address range probing by
> MAP_FIXED_NOREPLACE. It would simply break for similar reasons.
>
> I remember my attempt to make MAP_LOCKED consistent with mlock (if the
> population fails then return -ENOMEM) and that required to drop the
> shared mmap_sem and take it in exclusive mode (because we do not
> have upgrade_read) and Linus was strongly against [1][2] for very
> similar reasons. If you drop the lock you simply do not know what
> happened under your feet.
>
> [1] http://lkml.kernel.org/r/CA+55aFydkG-BgZzry5DrTzueVh9VvEcVJdLV8iOyUphQk=0vpw@mail.gmail.com
> [2] http://lkml.kernel.org/r/CA+55aFyajquhGhw59qNWKGK4dBV0TPmDD7-1XqPo7DZWvO_hPg@mail.gmail.com


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC v3 PATCH 4/5] mm: mmap: zap pages with read mmap_sem for large mapping
  2018-07-03  6:09             ` Michal Hocko
  2018-07-03 16:53               ` Yang Shi
@ 2018-07-03 18:22               ` Yang Shi
  2018-07-04  8:13                 ` Michal Hocko
  1 sibling, 1 reply; 44+ messages in thread
From: Yang Shi @ 2018-07-03 18:22 UTC (permalink / raw)
  To: Michal Hocko, Andrew Morton
  Cc: willy, ldufour, peterz, mingo, acme, alexander.shishkin, jolsa,
	namhyung, tglx, hpa, linux-mm, x86, linux-kernel



On 7/2/18 11:09 PM, Michal Hocko wrote:
> On Mon 02-07-18 13:48:45, Andrew Morton wrote:
>> On Mon, 2 Jul 2018 16:05:02 +0200 Michal Hocko <mhocko@kernel.org> wrote:
>>
>>> On Fri 29-06-18 20:15:47, Andrew Morton wrote:
>>> [...]
>>>> Would one of your earlier designs have addressed all usecases?  I
>>>> expect the dumb unmap-a-little-bit-at-a-time approach would have?
>>> It has been already pointed out that this will not work.
>> I said "one of".  There were others.
> Well, I was aware only about two potential solutions. Either do the
> heavy lifting under the shared lock and do the rest with the exlusive
> one and this, drop the lock per parts. Maybe I have missed others?
>
>>> You simply
>>> cannot drop the mmap_sem during unmap because another thread could
>>> change the address space under your feet. So you need some form of
>>> VM_DEAD and handle concurrent and conflicting address space operations.
>> Unclear that this is a problem.  If a thread does an unmap of a range
>> of virtual address space, there's no guarantee that upon return some
>> other thread has not already mapped new stuff into that address range.
>> So what's changed?
> Well, consider the following scenario:
> Thread A = calling mmap(NULL, sizeA)
> Thread B = calling munmap(addr, sizeB)
>
> They do not use any external synchronization and rely on the atomic
> munmap. Thread B only munmaps range that it knows belongs to it (e.g.
> called mmap in the past). It should be clear that ThreadA should not
> get an address from the addr, sizeB range, right? In the most simple case
> it will not happen. But let's say that the addr, sizeB range has
> unmapped holes for what ever reasons. Now anytime munmap drops the
> exclusive lock after handling one VMA, Thread A might find its sizeA
> range and use it. ThreadB then might remove this new range as soon as it
> gets its exclusive lock again.

I'm a little bit confused here. If Thread B has already unmapped that
range and Thread A then uses it, that doesn't sound like a problem,
since Thread B should just go ahead and handle the next range when it
gets its exclusive lock again, right? I don't see why Thread B would
re-visit that range to remove it.

But if Thread A uses MAP_FIXED, it might remap some ranges, and Thread B
would then remove them. That might trigger SIGSEGV or SIGBUS, but this
is not guaranteed on a vanilla kernel either if the application doesn't
do any synchronization. It all depends on timing.

>
> Is such a code safe? No it is not and I would call it fragile at best
> but people tend to do weird things and atomic munmap behavior is
> something they can easily depend on.
>
> Another example would be an atomic address range probing by
> MAP_FIXED_NOREPLACE. It would simply break for similar reasons.

Yes, I agree, it could simply break MAP_FIXED_NOREPLACE.

Yang

>
> I remember my attempt to make MAP_LOCKED consistent with mlock (if the
> population fails then return -ENOMEM) and that required to drop the
> shared mmap_sem and take it in exclusive mode (because we do not
> have upgrade_read) and Linus was strongly against [1][2] for very
> similar reasons. If you drop the lock you simply do not know what
> happened under your feet.
>
> [1] http://lkml.kernel.org/r/CA+55aFydkG-BgZzry5DrTzueVh9VvEcVJdLV8iOyUphQk=0vpw@mail.gmail.com
> [2] http://lkml.kernel.org/r/CA+55aFyajquhGhw59qNWKGK4dBV0TPmDD7-1XqPo7DZWvO_hPg@mail.gmail.com


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC v3 PATCH 4/5] mm: mmap: zap pages with read mmap_sem for large mapping
  2018-07-03 18:22               ` Yang Shi
@ 2018-07-04  8:13                 ` Michal Hocko
  0 siblings, 0 replies; 44+ messages in thread
From: Michal Hocko @ 2018-07-04  8:13 UTC (permalink / raw)
  To: Yang Shi
  Cc: Andrew Morton, willy, ldufour, peterz, mingo, acme,
	alexander.shishkin, jolsa, namhyung, tglx, hpa, linux-mm, x86,
	linux-kernel

On Tue 03-07-18 11:22:17, Yang Shi wrote:
> 
> 
> On 7/2/18 11:09 PM, Michal Hocko wrote:
> > On Mon 02-07-18 13:48:45, Andrew Morton wrote:
> > > On Mon, 2 Jul 2018 16:05:02 +0200 Michal Hocko <mhocko@kernel.org> wrote:
> > > 
> > > > On Fri 29-06-18 20:15:47, Andrew Morton wrote:
> > > > [...]
> > > > > Would one of your earlier designs have addressed all usecases?  I
> > > > > expect the dumb unmap-a-little-bit-at-a-time approach would have?
> > > > It has been already pointed out that this will not work.
> > > I said "one of".  There were others.
> > Well, I was aware only about two potential solutions. Either do the
> > heavy lifting under the shared lock and do the rest with the exlusive
> > one and this, drop the lock per parts. Maybe I have missed others?
> > 
> > > > You simply
> > > > cannot drop the mmap_sem during unmap because another thread could
> > > > change the address space under your feet. So you need some form of
> > > > VM_DEAD and handle concurrent and conflicting address space operations.
> > > Unclear that this is a problem.  If a thread does an unmap of a range
> > > of virtual address space, there's no guarantee that upon return some
> > > other thread has not already mapped new stuff into that address range.
> > > So what's changed?
> > Well, consider the following scenario:
> > Thread A = calling mmap(NULL, sizeA)
> > Thread B = calling munmap(addr, sizeB)
> > 
> > They do not use any external synchronization and rely on the atomic
> > munmap. Thread B only munmaps range that it knows belongs to it (e.g.
> > called mmap in the past). It should be clear that ThreadA should not
> > get an address from the addr, sizeB range, right? In the most simple case
> > it will not happen. But let's say that the addr, sizeB range has
> > unmapped holes for what ever reasons. Now anytime munmap drops the
> > exclusive lock after handling one VMA, Thread A might find its sizeA
> > range and use it. ThreadB then might remove this new range as soon as it
> > gets its exclusive lock again.
> 
> I'm a little bit confused here. If ThreadB already has unmapped that range,
> then ThreadA uses it. It sounds not like a problem since ThreadB should just
> go ahead to handle the next range when it gets its exclusive lock again,
> right? I don't think of why ThreadB would re-visit that range to remove it.

Not if the new range overlaps with the follow-up ranges that Thread B
still has to process. Example:

B: munmap [XXXXX]    [XXXXXX] [XXXXXXXXXX]
B: breaks the lock after processing the first vma.
A: mmap [XXXXXXXXXXXX]
B: munmap retakes the lock and revalidates from the last vm_end because
   the old vma->vm_next might be gone
B:               [XXX][XXXXX] [XXXXXXXXXX]

so you munmap part of the new range. Sure, you can play some tricks and
skip over vmas that do not start above your last vma->vm_end or
something like that, but I expect there are other cans of worms hidden
there.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 44+ messages in thread

end of thread, other threads:[~2018-07-04  8:14 UTC | newest]

Thread overview: 44+ messages (download: mbox.gz / follow: Atom feed)
2018-06-29 22:39 [RFC v3 PATCH 0/5] mm: zap pages with read mmap_sem in munmap for large mapping Yang Shi
2018-06-29 22:39 ` [RFC v3 PATCH 1/5] uprobes: make vma_has_uprobes non-static Yang Shi
2018-06-29 22:39 ` [RFC v3 PATCH 2/5] mm: introduce VM_DEAD flag Yang Shi
2018-07-02 13:40   ` Michal Hocko
2018-06-29 22:39 ` [RFC v3 PATCH 3/5] mm: refactor do_munmap() to extract the common part Yang Shi
2018-07-02 13:42   ` Michal Hocko
2018-07-02 16:59     ` Yang Shi
2018-07-02 17:58       ` Michal Hocko
2018-07-02 18:02         ` Yang Shi
2018-06-29 22:39 ` [RFC v3 PATCH 4/5] mm: mmap: zap pages with read mmap_sem for large mapping Yang Shi
2018-06-30  1:28   ` Andrew Morton
2018-06-30  2:10     ` Yang Shi
2018-06-30  1:35   ` Andrew Morton
2018-06-30  2:28     ` Yang Shi
2018-06-30  3:15       ` Andrew Morton
2018-06-30  4:26         ` Yang Shi
2018-07-03  0:01           ` Yang Shi
2018-07-02 14:05         ` Michal Hocko
2018-07-02 20:48           ` Andrew Morton
2018-07-03  6:09             ` Michal Hocko
2018-07-03 16:53               ` Yang Shi
2018-07-03 18:22               ` Yang Shi
2018-07-04  8:13                 ` Michal Hocko
2018-07-02 12:33   ` Kirill A. Shutemov
2018-07-02 12:49     ` Michal Hocko
2018-07-03  8:12       ` Kirill A. Shutemov
2018-07-03  8:27         ` Michal Hocko
2018-07-02 17:19     ` Yang Shi
2018-07-03  8:07       ` Kirill A. Shutemov
2018-07-02 13:53   ` Michal Hocko
2018-07-02 17:07     ` Yang Shi
2018-06-29 22:39 ` [RFC v3 PATCH 5/5] x86: check VM_DEAD flag in page fault Yang Shi
2018-07-02  8:45   ` Laurent Dufour
2018-07-02 12:15     ` Michal Hocko
2018-07-02 12:26       ` Laurent Dufour
2018-07-02 12:45         ` Michal Hocko
2018-07-02 13:33           ` Laurent Dufour
2018-07-02 13:37             ` Michal Hocko
2018-07-02 17:24               ` Yang Shi
2018-07-02 17:57                 ` Michal Hocko
2018-07-02 18:10                   ` Yang Shi
2018-07-03  6:17                     ` Michal Hocko
2018-07-03 16:50                       ` Yang Shi
2018-07-02 13:39 ` [RFC v3 PATCH 0/5] mm: zap pages with read mmap_sem in munmap for large mapping Michal Hocko
