* [RFC v2 0/2] mm: zap pages with read mmap_sem in munmap for large mapping @ 2018-06-18 23:34 ` Yang Shi 0 siblings, 0 replies; 39+ messages in thread
From: Yang Shi @ 2018-06-18 23:34 UTC (permalink / raw)
To: mhocko, willy, ldufour, akpm, peterz, mingo, acme, alexander.shishkin, jolsa, namhyung
Cc: yang.shi, linux-mm, linux-kernel

Background:
Recently, when we ran some vm scalability tests on machines with large memory, we ran into a couple of mmap_sem scalability issues when unmapping large memory spaces; please refer to https://lkml.org/lkml/2017/12/14/733 and https://lkml.org/lkml/2018/2/20/576.

History:
akpm then suggested unmapping large mappings section by section, dropping mmap_sem between sections, to mitigate it (see https://lkml.org/lkml/2018/3/6/784). The v1 patch series was submitted to the mailing list per Andrew's suggestion (see https://lkml.org/lkml/2018/3/20/786), and I received a lot of great feedback and suggestions. The topic was then discussed at the LSF/MM summit 2018, where Michal Hocko suggested (as he also had in the v1 review) trying a "two phases" approach: zap pages with read mmap_sem held, then do the cleanup with write mmap_sem (for discussion detail, see https://lwn.net/Articles/753269/).

So I came up with this v2 patch series per that suggestion. I don't call madvise(MADV_DONTNEED) directly here, since it behaves a little differently from what munmap does; instead I use unmap_region(), as do_munmap() does. The patches may need more cleanup and refactoring, but it seems better to let the community start reviewing them early, to make sure I'm on the right track.

Regression and performance data:
Tests were run on a machine with 32 cores of E5-2680 @ 2.70GHz and 384GB memory.

Regression testing used the full LTP suite and trinity (munmap), with the threshold set to 4K in the code (for regression testing only) so that the new code is covered better, since the trinity (munmap) test manipulates 4K mappings. No regression was reported, and the system survived the trinity (munmap) test for 4 hours until I aborted it.

Throughput of page faults (#/s) with the below stress-ng test:
stress-ng --mmap 0 --mmap-bytes 80G --mmap-file --metrics --perf --timeout 600s

pristine patched delta
89.41K/sec 97.29K/sec +8.8%

The numbers look a little better than v1.

Yang Shi (2):
  uprobes: make vma_has_uprobes non-static
  mm: mmap: zap pages with read mmap_sem for large mapping

 include/linux/uprobes.h | 7 ++++
 kernel/events/uprobes.c | 2 +-
 mm/mmap.c | 148 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
 3 files changed, 155 insertions(+), 2 deletions(-)

^ permalink raw reply [flat|nested] 39+ messages in thread
* [RFC v2 PATCH 1/2] uprobes: make vma_has_uprobes non-static 2018-06-18 23:34 ` Yang Shi (?) @ 2018-06-18 23:34 ` Yang Shi -1 siblings, 0 replies; 39+ messages in thread
From: Yang Shi @ 2018-06-18 23:34 UTC (permalink / raw)
To: mhocko, willy, ldufour, akpm, peterz, mingo, acme, alexander.shishkin, jolsa, namhyung
Cc: yang.shi, linux-mm, linux-kernel

vma_has_uprobes() will be used in the following patch to check whether a vma can be unmapped with read mmap_sem held, but it is static. So make it non-static so it can be used outside uprobes.

Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
---
 include/linux/uprobes.h | 7 +++++++
 kernel/events/uprobes.c | 2 +-
 2 files changed, 8 insertions(+), 1 deletion(-)

diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
index 0a294e9..7f1fb8c 100644
--- a/include/linux/uprobes.h
+++ b/include/linux/uprobes.h
@@ -149,6 +149,8 @@ struct uprobes_state {
 extern bool arch_uprobe_ignore(struct arch_uprobe *aup, struct pt_regs *regs);
 extern void arch_uprobe_copy_ixol(struct page *page, unsigned long vaddr,
 					 void *src, unsigned long len);
+extern bool vma_has_uprobes(struct vm_area_struct *vma, unsigned long start,
+			    unsigned long end);
 #else /* !CONFIG_UPROBES */
 struct uprobes_state {
 };
@@ -203,5 +205,10 @@ static inline void uprobe_copy_process(struct task_struct *t, unsigned long flag
 static inline void uprobe_clear_state(struct mm_struct *mm)
 {
 }
+static inline bool vma_has_uprobes(struct vm_area_struct *vma, unsigned long start,
+				   unsigned long end)
+{
+	return false;
+}
 #endif /* !CONFIG_UPROBES */
 #endif /* _LINUX_UPROBES_H */
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 1725b90..bd46b7a 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -1095,7 +1095,7 @@ int uprobe_mmap(struct vm_area_struct *vma)
 	return 0;
 }

-static bool
+bool
 vma_has_uprobes(struct vm_area_struct *vma, unsigned long start, unsigned long end)
 {
 	loff_t min, max;
--
1.8.3.1

^ permalink raw reply related [flat|nested] 39+ messages in thread
* [RFC v2 PATCH 2/2] mm: mmap: zap pages with read mmap_sem for large mapping 2018-06-18 23:34 ` Yang Shi (?) (?) @ 2018-06-18 23:34 ` Yang Shi 2018-06-19 10:02 ` Peter Zijlstra 2018-06-19 22:17 ` Nadav Amit -1 siblings, 2 replies; 39+ messages in thread
From: Yang Shi @ 2018-06-18 23:34 UTC (permalink / raw)
To: mhocko, willy, ldufour, akpm, peterz, mingo, acme, alexander.shishkin, jolsa, namhyung
Cc: yang.shi, linux-mm, linux-kernel

When running some mmap/munmap scalability tests with large memory (i.e. > 300GB), the below hung task issue may happen occasionally.

INFO: task ps:14018 blocked for more than 120 seconds.
 Tainted: G E 4.9.79-009.ali3000.alios7.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
ps D 0 14018 1 0x00000004
 ffff885582f84000 ffff885e8682f000 ffff880972943000 ffff885ebf499bc0
 ffff8828ee120000 ffffc900349bfca8 ffffffff817154d0 0000000000000040
 00ffffff812f872a ffff885ebf499bc0 024000d000948300 ffff880972943000
Call Trace:
 [<ffffffff817154d0>] ? __schedule+0x250/0x730
 [<ffffffff817159e6>] schedule+0x36/0x80
 [<ffffffff81718560>] rwsem_down_read_failed+0xf0/0x150
 [<ffffffff81390a28>] call_rwsem_down_read_failed+0x18/0x30
 [<ffffffff81717db0>] down_read+0x20/0x40
 [<ffffffff812b9439>] proc_pid_cmdline_read+0xd9/0x4e0
 [<ffffffff81253c95>] ? do_filp_open+0xa5/0x100
 [<ffffffff81241d87>] __vfs_read+0x37/0x150
 [<ffffffff812f824b>] ? security_file_permission+0x9b/0xc0
 [<ffffffff81242266>] vfs_read+0x96/0x130
 [<ffffffff812437b5>] SyS_read+0x55/0xc0
 [<ffffffff8171a6da>] entry_SYSCALL_64_fastpath+0x1a/0xc5

This is because munmap holds mmap_sem from the very beginning all the way to the end, without releasing it in the middle. When unmapping a large mapping, it may take a long time (~18 seconds to unmap a 320GB mapping with every single page mapped, on an idle machine).
Zapping pages is the most time-consuming part. According to the suggestion from Michal Hocko [1], zapping pages can be done with read mmap_sem held, like what MADV_DONTNEED does; write mmap_sem is then re-acquired to manipulate the vmas.

Define the large mapping size threshold as PUD size or 1GB, and just zap pages with read mmap_sem for mappings which are >= that threshold. If the vma has VM_LOCKED | VM_HUGETLB | VM_PFNMAP or uprobes set, just fall back to the regular path, since unmapping those mappings needs to acquire write mmap_sem.

For the time being, just do this in the munmap syscall path. Other vm_munmap() or do_munmap() call sites remain intact, since it seems the complexity of handling the race conditions there would outweigh the benefits.

The below is some regression and performance data collected on a machine with 32 cores of E5-2680 @ 2.70GHz and 384GB memory. With the patched kernel, write mmap_sem hold time drops from seconds to the microsecond level.

Throughput of page faults (#/s) with the below stress-ng test:
stress-ng --mmap 0 --mmap-bytes 80G --mmap-file --metrics --perf --timeout 600s

pristine patched delta
89.41K/sec 97.29K/sec +8.8%

[1] https://lwn.net/Articles/753269/

Cc: Michal Hocko <mhocko@kernel.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Laurent Dufour <ldufour@linux.vnet.ibm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
---
 mm/mmap.c | 148 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 147 insertions(+), 1 deletion(-)

diff --git a/mm/mmap.c b/mm/mmap.c
index fc41c05..e84f80c 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2686,6 +2686,141 @@ int split_vma(struct mm_struct *mm, struct vm_area_struct *vma,
 	return __split_vma(mm, vma, addr, new_below);
 }

+/* Consider PUD size or 1GB mapping as large mapping */
+#ifdef HPAGE_PUD_SIZE
+#define LARGE_MAP_THRESH HPAGE_PUD_SIZE
+#else
+#define LARGE_MAP_THRESH (1 * 1024 * 1024 * 1024)
+#endif
+
+/* Unmap large mapping early with acquiring read mmap_sem */
+static int do_munmap_zap_early(struct mm_struct *mm, unsigned long start,
+			       size_t len, struct list_head *uf)
+{
+	unsigned long end = 0;
+	struct vm_area_struct *vma = NULL, *prev, *last, *tmp;
+	bool success = false;
+	int ret = 0;
+
+	if ((offset_in_page(start)) || start > TASK_SIZE || len > TASK_SIZE - start)
+		return -EINVAL;
+
+	len = (PAGE_ALIGN(len));
+	if (len == 0)
+		return -EINVAL;
+
+	/* Just deal with uf in regular path */
+	if (unlikely(uf))
+		goto regular_path;
+
+	if (len >= LARGE_MAP_THRESH) {
+		down_read(&mm->mmap_sem);
+		vma = find_vma(mm, start);
+		if (!vma) {
+			up_read(&mm->mmap_sem);
+			return 0;
+		}
+
+		prev = vma->vm_prev;
+
+		end = start + len;
+		if (vma->vm_start > end) {
+			up_read(&mm->mmap_sem);
+			return 0;
+		}
+
+		if (start > vma->vm_start) {
+			int error;
+
+			if (end < vma->vm_end &&
+			    mm->map_count > sysctl_max_map_count) {
+				up_read(&mm->mmap_sem);
+				return -ENOMEM;
+			}
+
+			error = __split_vma(mm, vma, start, 0);
+			if (error) {
+				up_read(&mm->mmap_sem);
+				return error;
+			}
+			prev = vma;
+		}
+
+		last = find_vma(mm, end);
+		if (last && end > last->vm_start) {
+			int error = __split_vma(mm, last, end, 1);
+
+			if (error) {
+				up_read(&mm->mmap_sem);
+				return error;
+			}
+		}
+		vma = prev ? prev->vm_next : mm->mmap;
+
+		/*
+		 * Unmapping vmas which have the VM_LOCKED|VM_HUGETLB|VM_PFNMAP
+		 * flag set, or which have uprobes set, needs to acquire write
+		 * mmap_sem, so skip them in the early zap. Just deal with such
+		 * mappings in the regular path.
+		 * Borrow can_madv_dontneed_vma() to check the conditions.
+		 */
+		tmp = vma;
+		while (tmp && tmp->vm_start < end) {
+			if (!can_madv_dontneed_vma(tmp) ||
+			    vma_has_uprobes(tmp, start, end))
+				goto sem_drop;
+			tmp = tmp->vm_next;
+		}
+
+		unmap_region(mm, vma, prev, start, end);
+		/* indicates early zap is success */
+		success = true;
+
+sem_drop:
+		up_read(&mm->mmap_sem);
+	}
+
+regular_path:
+	/* hold write mmap_sem for vma manipulation or regular path */
+	if (down_write_killable(&mm->mmap_sem))
+		return -EINTR;
+	if (success) {
+		/* vmas have been zapped, here just deal with loose end */
+		detach_vmas_to_be_unmapped(mm, vma, prev, end);
+		arch_unmap(mm, vma, start, end);
+		remove_vma_list(mm, vma);
+	} else {
+		/* vma is VM_LOCKED|VM_HUGETLB|VM_PFNMAP or has uprobe */
+		if (vma) {
+			if (unlikely(uf)) {
+				int ret = userfaultfd_unmap_prep(vma, start,
+								 end, uf);
+				if (ret)
+					goto out;
+			}
+			if (mm->locked_vm) {
+				tmp = vma;
+				while (tmp && tmp->vm_start < end) {
+					if (tmp->vm_flags & VM_LOCKED) {
+						mm->locked_vm -= vma_pages(tmp);
+						munlock_vma_pages_all(tmp);
+					}
+					tmp = tmp->vm_next;
+				}
+			}
+			detach_vmas_to_be_unmapped(mm, vma, prev, end);
+			unmap_region(mm, vma, prev, start, end);
+			remove_vma_list(mm, vma);
+		} else
+			/* When mapping size < LARGE_MAP_THRESH */
+			ret = do_munmap(mm, start, len, uf);
+	}
+
+out:
+	up_write(&mm->mmap_sem);
+	return ret;
+}
+
 /* Munmap is split into 2 main parts -- this part which finds
  * what needs doing, and the areas themselves, which do the
  * work.  This now handles partial unmappings.
@@ -2792,6 +2927,17 @@ int do_munmap(struct mm_struct *mm, unsigned long start, size_t len,
 	return 0;
 }

+static int vm_munmap_zap_early(unsigned long start, size_t len)
+{
+	int ret;
+	struct mm_struct *mm = current->mm;
+	LIST_HEAD(uf);
+
+	ret = do_munmap_zap_early(mm, start, len, &uf);
+	userfaultfd_unmap_complete(mm, &uf);
+	return ret;
+}
+
 int vm_munmap(unsigned long start, size_t len)
 {
 	int ret;
@@ -2811,7 +2957,7 @@ int vm_munmap(unsigned long start, size_t len)
 SYSCALL_DEFINE2(munmap, unsigned long, addr, size_t, len)
 {
 	profile_munmap(addr);
-	return vm_munmap(addr, len);
+	return vm_munmap_zap_early(addr, len);
 }
--
1.8.3.1

^ permalink raw reply related [flat|nested] 39+ messages in thread
* Re: [RFC v2 PATCH 2/2] mm: mmap: zap pages with read mmap_sem for large mapping 2018-06-18 23:34 ` [RFC v2 PATCH 2/2] mm: mmap: zap pages with read mmap_sem for large mapping Yang Shi @ 2018-06-19 10:02 ` Peter Zijlstra 2018-06-19 21:13 ` Yang Shi 2018-06-19 22:17 ` Nadav Amit 1 sibling, 1 reply; 39+ messages in thread From: Peter Zijlstra @ 2018-06-19 10:02 UTC (permalink / raw) To: Yang Shi Cc: mhocko, willy, ldufour, akpm, mingo, acme, alexander.shishkin, jolsa, namhyung, linux-mm, linux-kernel On Tue, Jun 19, 2018 at 07:34:16AM +0800, Yang Shi wrote: > diff --git a/mm/mmap.c b/mm/mmap.c > index fc41c05..e84f80c 100644 > --- a/mm/mmap.c > +++ b/mm/mmap.c > @@ -2686,6 +2686,141 @@ int split_vma(struct mm_struct *mm, struct vm_area_struct *vma, > return __split_vma(mm, vma, addr, new_below); > } > > +/* Consider PUD size or 1GB mapping as large mapping */ > +#ifdef HPAGE_PUD_SIZE > +#define LARGE_MAP_THRESH HPAGE_PUD_SIZE > +#else > +#define LARGE_MAP_THRESH (1 * 1024 * 1024 * 1024) > +#endif > + > +/* Unmap large mapping early with acquiring read mmap_sem */ > +static int do_munmap_zap_early(struct mm_struct *mm, unsigned long start, > + size_t len, struct list_head *uf) > +{ > + unsigned long end = 0; > + struct vm_area_struct *vma = NULL, *prev, *last, *tmp; > + bool success = false; > + int ret = 0; > + > + if ((offset_in_page(start)) || start > TASK_SIZE || len > TASK_SIZE - start) > + return -EINVAL; > + > + len = (PAGE_ALIGN(len)); > + if (len == 0) > + return -EINVAL; > + > + /* Just deal with uf in regular path */ > + if (unlikely(uf)) > + goto regular_path; > + > + if (len >= LARGE_MAP_THRESH) { > + down_read(&mm->mmap_sem); > + vma = find_vma(mm, start); > + if (!vma) { > + up_read(&mm->mmap_sem); > + return 0; > + } > + > + prev = vma->vm_prev; > + > + end = start + len; > + if (vma->vm_start > end) { > + up_read(&mm->mmap_sem); > + return 0; > + } > + > + if (start > vma->vm_start) { > + int error; > + > + if (end < vma->vm_end && > + mm->map_count 
> sysctl_max_map_count) { > + up_read(&mm->mmap_sem); > + return -ENOMEM; > + } > + > + error = __split_vma(mm, vma, start, 0); > + if (error) { > + up_read(&mm->mmap_sem); > + return error; > + } > + prev = vma; > + } > + > + last = find_vma(mm, end); > + if (last && end > last->vm_start) { > + int error = __split_vma(mm, last, end, 1); > + > + if (error) { > + up_read(&mm->mmap_sem); > + return error; > + } > + } > + vma = prev ? prev->vm_next : mm->mmap; Hold up, two things: you having to copy most of do_munmap() didn't seem to suggest a helper function? And second, since when are we allowed to split VMAs under a read lock? ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [RFC v2 PATCH 2/2] mm: mmap: zap pages with read mmap_sem for large mapping 2018-06-19 10:02 ` Peter Zijlstra @ 2018-06-19 21:13 ` Yang Shi 2018-06-20 7:17 ` Michal Hocko 0 siblings, 1 reply; 39+ messages in thread From: Yang Shi @ 2018-06-19 21:13 UTC (permalink / raw) To: Peter Zijlstra Cc: mhocko, willy, ldufour, akpm, mingo, acme, alexander.shishkin, jolsa, namhyung, linux-mm, linux-kernel On 6/19/18 3:02 AM, Peter Zijlstra wrote: > On Tue, Jun 19, 2018 at 07:34:16AM +0800, Yang Shi wrote: > >> diff --git a/mm/mmap.c b/mm/mmap.c >> index fc41c05..e84f80c 100644 >> --- a/mm/mmap.c >> +++ b/mm/mmap.c >> @@ -2686,6 +2686,141 @@ int split_vma(struct mm_struct *mm, struct vm_area_struct *vma, >> return __split_vma(mm, vma, addr, new_below); >> } >> >> +/* Consider PUD size or 1GB mapping as large mapping */ >> +#ifdef HPAGE_PUD_SIZE >> +#define LARGE_MAP_THRESH HPAGE_PUD_SIZE >> +#else >> +#define LARGE_MAP_THRESH (1 * 1024 * 1024 * 1024) >> +#endif >> + >> +/* Unmap large mapping early with acquiring read mmap_sem */ >> +static int do_munmap_zap_early(struct mm_struct *mm, unsigned long start, >> + size_t len, struct list_head *uf) >> +{ >> + unsigned long end = 0; >> + struct vm_area_struct *vma = NULL, *prev, *last, *tmp; >> + bool success = false; >> + int ret = 0; >> + >> + if ((offset_in_page(start)) || start > TASK_SIZE || len > TASK_SIZE - start) >> + return -EINVAL; >> + >> + len = (PAGE_ALIGN(len)); >> + if (len == 0) >> + return -EINVAL; >> + >> + /* Just deal with uf in regular path */ >> + if (unlikely(uf)) >> + goto regular_path; >> + >> + if (len >= LARGE_MAP_THRESH) { >> + down_read(&mm->mmap_sem); >> + vma = find_vma(mm, start); >> + if (!vma) { >> + up_read(&mm->mmap_sem); >> + return 0; >> + } >> + >> + prev = vma->vm_prev; >> + >> + end = start + len; >> + if (vma->vm_start > end) { >> + up_read(&mm->mmap_sem); >> + return 0; >> + } >> + >> + if (start > vma->vm_start) { >> + int error; >> + >> + if (end < vma->vm_end && >> + mm->map_count 
> sysctl_max_map_count) { >> + up_read(&mm->mmap_sem); >> + return -ENOMEM; >> + } >> + >> + error = __split_vma(mm, vma, start, 0); >> + if (error) { >> + up_read(&mm->mmap_sem); >> + return error; >> + } >> + prev = vma; >> + } >> + >> + last = find_vma(mm, end); >> + if (last && end > last->vm_start) { >> + int error = __split_vma(mm, last, end, 1); >> + >> + if (error) { >> + up_read(&mm->mmap_sem); >> + return error; >> + } >> + } >> + vma = prev ? prev->vm_next : mm->mmap; > Hold up, two things: you having to copy most of do_munmap() didn't seem > to suggest a helper function? And second, since when are we allowed to

Yes, they will be extracted into a helper function in the next version.

My bad, I don't think it is allowed. We could reform this to:

acquire write mmap_sem
vma lookup (split vmas)
release write mmap_sem

acquire read mmap_sem
zap pages
release read mmap_sem

I suppose this is safe, as Michal said before.

Thanks, Yang

> split VMAs under a read lock?

^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [RFC v2 PATCH 2/2] mm: mmap: zap pages with read mmap_sem for large mapping 2018-06-19 21:13 ` Yang Shi @ 2018-06-20 7:17 ` Michal Hocko 2018-06-20 16:23 ` Yang Shi 0 siblings, 1 reply; 39+ messages in thread
From: Michal Hocko @ 2018-06-20 7:17 UTC (permalink / raw)
To: Yang Shi
Cc: Peter Zijlstra, willy, ldufour, akpm, mingo, acme, alexander.shishkin, jolsa, namhyung, linux-mm, linux-kernel

On Tue 19-06-18 14:13:05, Yang Shi wrote: > > > On 6/19/18 3:02 AM, Peter Zijlstra wrote: [...] > > Hold up, two things: you having to copy most of do_munmap() didn't seem > > to suggest a helper function? And second, since when are we allowed to > > Yes, they will be extracted into a helper function in the next version. > > My bad, I don't think it is allowed. We could reform this to: > > acquire write mmap_sem > vma lookup (split vmas) > release write mmap_sem > > acquire read mmap_sem > zap pages > release read mmap_sem > > I suppose this is safe, as Michal said before.

I didn't get to read your patches carefully yet, but I am wondering why you need to split in the first place. Why can't you simply unmap the range (madvise(DONTNEED)) under the read lock and then take the lock for write to finish the rest?

-- Michal Hocko SUSE Labs

^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [RFC v2 PATCH 2/2] mm: mmap: zap pages with read mmap_sem for large mapping 2018-06-20 7:17 ` Michal Hocko @ 2018-06-20 16:23 ` Yang Shi 0 siblings, 0 replies; 39+ messages in thread
From: Yang Shi @ 2018-06-20 16:23 UTC (permalink / raw)
To: Michal Hocko
Cc: Peter Zijlstra, willy, ldufour, akpm, mingo, acme, alexander.shishkin, jolsa, namhyung, linux-mm, linux-kernel

On 6/20/18 12:17 AM, Michal Hocko wrote: > On Tue 19-06-18 14:13:05, Yang Shi wrote: >> >> On 6/19/18 3:02 AM, Peter Zijlstra wrote: > [...] >>> Hold up, two things: you having to copy most of do_munmap() didn't seem >>> to suggest a helper function? And second, since when are we allowed to >> Yes, they will be extracted into a helper function in the next version. >> >> My bad, I don't think it is allowed. We could reform this to: >> >> acquire write mmap_sem >> vma lookup (split vmas) >> release write mmap_sem >> >> acquire read mmap_sem >> zap pages >> release read mmap_sem >> >> I suppose this is safe, as Michal said before. > I didn't get to read your patches carefully yet but I am wondering why > do you need to split in the first place. Why cannot you simply unmap the > range (madvise(DONTNEED)) under the read lock and then take the lock for > write to finish the rest?

Yes, we can. I just thought splitting the vma up-front sounded more straightforward, but I neglected the write mmap_sem issue. Will move the vma split under the later write mmap_sem in the next version.

Thanks, Yang

^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [RFC v2 PATCH 2/2] mm: mmap: zap pages with read mmap_sem for large mapping 2018-06-18 23:34 ` [RFC v2 PATCH 2/2] mm: mmap: zap pages with read mmap_sem for large mapping Yang Shi 2018-06-19 10:02 ` Peter Zijlstra @ 2018-06-19 22:17 ` Nadav Amit 2018-06-19 23:08 ` Yang Shi 1 sibling, 1 reply; 39+ messages in thread From: Nadav Amit @ 2018-06-19 22:17 UTC (permalink / raw) To: Yang Shi Cc: Michal Hocko, Matthew Wilcox, ldufour, Andrew Morton, Peter Zijlstra, Ingo Molnar, acme, alexander.shishkin, jolsa, namhyung, open list:MEMORY MANAGEMENT, linux-kernel at 4:34 PM, Yang Shi <yang.shi@linux.alibaba.com> wrote: > When running some mmap/munmap scalability tests with large memory (i.e. >> 300GB), the below hung task issue may happen occasionally. > > INFO: task ps:14018 blocked for more than 120 seconds. > Tainted: G E 4.9.79-009.ali3000.alios7.x86_64 #1 > "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this > message. > ps D 0 14018 1 0x00000004 > (snip) > > Zapping pages is the most time consuming part, according to the > suggestion from Michal Hock [1], zapping pages can be done with holding > read mmap_sem, like what MADV_DONTNEED does. Then re-acquire write > mmap_sem to manipulate vmas. Does munmap() == MADV_DONTNEED + munmap() ? For example, what happens with userfaultfd in this case? Can you get an extra #PF, which would be visible to userspace, before the munmap is finished? In addition, would it be ok for the user to potentially get a zeroed page in the time window after the MADV_DONTNEED finished removing a PTE and before the munmap() is done? Regards, Nadav ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [RFC v2 PATCH 2/2] mm: mmap: zap pages with read mmap_sem for large mapping 2018-06-19 22:17 ` Nadav Amit @ 2018-06-19 23:08 ` Yang Shi 2018-06-20 0:31 ` Nadav Amit 0 siblings, 1 reply; 39+ messages in thread
From: Yang Shi @ 2018-06-19 23:08 UTC (permalink / raw)
To: Nadav Amit
Cc: Michal Hocko, Matthew Wilcox, ldufour, Andrew Morton, Peter Zijlstra, Ingo Molnar, acme, alexander.shishkin, jolsa, namhyung, open list:MEMORY MANAGEMENT, linux-kernel

On 6/19/18 3:17 PM, Nadav Amit wrote: > at 4:34 PM, Yang Shi <yang.shi@linux.alibaba.com> wrote: > >> When running some mmap/munmap scalability tests with large memory (i.e. >>> 300GB), the below hung task issue may happen occasionally. >> INFO: task ps:14018 blocked for more than 120 seconds. >> Tainted: G E 4.9.79-009.ali3000.alios7.x86_64 #1 >> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this >> message. >> ps D 0 14018 1 0x00000004 >> > (snip) > >> Zapping pages is the most time consuming part, according to the >> suggestion from Michal Hocko [1], zapping pages can be done with holding >> read mmap_sem, like what MADV_DONTNEED does. Then re-acquire write >> mmap_sem to manipulate vmas. > Does munmap() == MADV_DONTNEED + munmap() ?

Not exactly the same. I basically copied the page zapping used by munmap instead of calling MADV_DONTNEED.

> > For example, what happens with userfaultfd in this case? Can you get an > extra #PF, which would be visible to userspace, before the munmap is > finished?

userfaultfd is handled by the regular munmap path, so no change to the userfaultfd part.

> > In addition, would it be ok for the user to potentially get a zeroed page in > the time window after the MADV_DONTNEED finished removing a PTE and before > the munmap() is done?

This should be undefined behavior according to Michal. This has been discussed in https://lwn.net/Articles/753269/.
Thanks, Yang > > Regards, > Nadav ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [RFC v2 PATCH 2/2] mm: mmap: zap pages with read mmap_sem for large mapping 2018-06-19 23:08 ` Yang Shi @ 2018-06-20 0:31 ` Nadav Amit 2018-06-20 7:18 ` Michal Hocko 0 siblings, 1 reply; 39+ messages in thread From: Nadav Amit @ 2018-06-20 0:31 UTC (permalink / raw) To: Yang Shi Cc: Michal Hocko, Matthew Wilcox, ldufour, Andrew Morton, Peter Zijlstra, Ingo Molnar, acme, alexander.shishkin, jolsa, namhyung, open list:MEMORY MANAGEMENT, linux-kernel at 4:08 PM, Yang Shi <yang.shi@linux.alibaba.com> wrote: > > > On 6/19/18 3:17 PM, Nadav Amit wrote: >> at 4:34 PM, Yang Shi <yang.shi@linux.alibaba.com> >> wrote: >> >> >>> When running some mmap/munmap scalability tests with large memory (i.e. >>> >>>> 300GB), the below hung task issue may happen occasionally. >>>> >>> INFO: task ps:14018 blocked for more than 120 seconds. >>> Tainted: G E 4.9.79-009.ali3000.alios7.x86_64 #1 >>> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this >>> message. >>> ps D 0 14018 1 0x00000004 >>> >>> >> (snip) >> >> >>> Zapping pages is the most time consuming part, according to the >>> suggestion from Michal Hock [1], zapping pages can be done with holding >>> read mmap_sem, like what MADV_DONTNEED does. Then re-acquire write >>> mmap_sem to manipulate vmas. >>> >> Does munmap() == MADV_DONTNEED + munmap() ? > > Not exactly the same. So, I basically copied the page zapping used by munmap instead of calling MADV_DONTNEED. > >> >> For example, what happens with userfaultfd in this case? Can you get an >> extra #PF, which would be visible to userspace, before the munmap is >> finished? >> > > userfaultfd is handled by regular munmap path. So, no change to userfaultfd part. Right. I see it now. > >> >> In addition, would it be ok for the user to potentially get a zeroed page in >> the time window after the MADV_DONTNEED finished removing a PTE and before >> the munmap() is done? >> > > This should be undefined behavior according to Michal. 
This has been discussed in https://lwn.net/Articles/753269/. Thanks for the reference. Reading the man page I see: "All pages containing a part of the indicated range are unmapped, and subsequent references to these pages will generate SIGSEGV.” To me it sounds pretty well-defined, and this implementation does not follow this definition. I would expect the man page to be updated and indicate that the behavior has changed. Regards, Nadav ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [RFC v2 PATCH 2/2] mm: mmap: zap pages with read mmap_sem for large mapping 2018-06-20 0:31 ` Nadav Amit @ 2018-06-20 7:18 ` Michal Hocko 0 siblings, 0 replies; 39+ messages in thread From: Michal Hocko @ 2018-06-20 7:18 UTC (permalink / raw) To: Nadav Amit Cc: Yang Shi, Matthew Wilcox, ldufour, Andrew Morton, Peter Zijlstra, Ingo Molnar, acme, alexander.shishkin, jolsa, namhyung, open list:MEMORY MANAGEMENT, linux-kernel On Tue 19-06-18 17:31:27, Nadav Amit wrote: > at 4:08 PM, Yang Shi <yang.shi@linux.alibaba.com> wrote: > > > > > > > On 6/19/18 3:17 PM, Nadav Amit wrote: > >> at 4:34 PM, Yang Shi <yang.shi@linux.alibaba.com> > >> wrote: > >> > >> > >>> When running some mmap/munmap scalability tests with large memory (i.e. > >>> > >>>> 300GB), the below hung task issue may happen occasionally. > >>>> > >>> INFO: task ps:14018 blocked for more than 120 seconds. > >>> Tainted: G E 4.9.79-009.ali3000.alios7.x86_64 #1 > >>> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this > >>> message. > >>> ps D 0 14018 1 0x00000004 > >>> > >>> > >> (snip) > >> > >> > >>> Zapping pages is the most time consuming part, according to the > >>> suggestion from Michal Hock [1], zapping pages can be done with holding > >>> read mmap_sem, like what MADV_DONTNEED does. Then re-acquire write > >>> mmap_sem to manipulate vmas. > >>> > >> Does munmap() == MADV_DONTNEED + munmap() ? > > > > Not exactly the same. So, I basically copied the page zapping used by munmap instead of calling MADV_DONTNEED. > > > >> > >> For example, what happens with userfaultfd in this case? Can you get an > >> extra #PF, which would be visible to userspace, before the munmap is > >> finished? > >> > > > > userfaultfd is handled by regular munmap path. So, no change to userfaultfd part. > > Right. I see it now. 
> > > >> In addition, would it be ok for the user to potentially get a zeroed page in > >> the time window after the MADV_DONTNEED finished removing a PTE and before > >> the munmap() is done? > >> > > > > This should be undefined behavior according to Michal. This has been discussed in https://lwn.net/Articles/753269/. > > Thanks for the reference. > > Reading the man page I see: "All pages containing a part of the indicated > range are unmapped, and subsequent references to these pages will generate > SIGSEGV."

Yes, this is true, but I guess what Yang Shi meant was that a userspace access racing with munmap is not well defined. You never know whether you get your data, a #PF, or a SEGV, because it depends on timing. The user-visible change might be that you lose content and get a zero page instead if you hit the race window while we are unmapping, which was not possible before. But wouldn't such an access pattern be buggy anyway? You need some form of external synchronization AFAICS.

But maybe some userspace depends on the "get the right data or get SEGV" semantic. If we have to preserve that, then we can come up with a VM_DEAD flag set before we tear the vma down, and force the SEGV on the #PF path. Something similar we already do for MMF_UNSTABLE.

-- Michal Hocko SUSE Labs

^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [RFC v2 PATCH 2/2] mm: mmap: zap pages with read mmap_sem for large mapping 2018-06-20 7:18 ` Michal Hocko (?) @ 2018-06-20 17:12 ` Nadav Amit -1 siblings, 0 replies; 39+ messages in thread From: Nadav Amit @ 2018-06-20 17:12 UTC (permalink / raw) To: Michal Hocko Cc: Yang Shi, Matthew Wilcox, ldufour, Andrew Morton, Peter Zijlstra, Ingo Molnar, acme, alexander.shishkin, jolsa, namhyung, open list:MEMORY MANAGEMENT, linux-kernel [-- Attachment #1: Type: text/plain, Size: 3039 bytes --] at 12:18 AM, Michal Hocko <mhocko@kernel.org> wrote: > On Tue 19-06-18 17:31:27, Nadav Amit wrote: >> at 4:08 PM, Yang Shi <yang.shi@linux.alibaba.com> wrote: >> >>> On 6/19/18 3:17 PM, Nadav Amit wrote: >>>> at 4:34 PM, Yang Shi <yang.shi@linux.alibaba.com> >>>> wrote: >>>> >>>> >>>>> When running some mmap/munmap scalability tests with large memory (i.e. >>>>> >>>>>> 300GB), the below hung task issue may happen occasionally. >>>>> INFO: task ps:14018 blocked for more than 120 seconds. >>>>> Tainted: G E 4.9.79-009.ali3000.alios7.x86_64 #1 >>>>> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this >>>>> message. >>>>> ps D 0 14018 1 0x00000004 >>>> (snip) >>>> >>>> >>>>> Zapping pages is the most time consuming part, according to the >>>>> suggestion from Michal Hock [1], zapping pages can be done with holding >>>>> read mmap_sem, like what MADV_DONTNEED does. Then re-acquire write >>>>> mmap_sem to manipulate vmas. >>>> Does munmap() == MADV_DONTNEED + munmap() ? >>> >>> Not exactly the same. So, I basically copied the page zapping used by munmap instead of calling MADV_DONTNEED. >>> >>>> For example, what happens with userfaultfd in this case? Can you get an >>>> extra #PF, which would be visible to userspace, before the munmap is >>>> finished? >>> >>> userfaultfd is handled by regular munmap path. So, no change to userfaultfd part. >> >> Right. I see it now. 
>> >>>> In addition, would it be ok for the user to potentially get a zeroed page in >>>> the time window after the MADV_DONTNEED finished removing a PTE and before >>>> the munmap() is done? >>> >>> This should be undefined behavior according to Michal. This has been discussed in https://lwn.net/Articles/753269/. >> >> Thanks for the reference. >> >> Reading the man page I see: "All pages containing a part of the indicated >> range are unmapped, and subsequent references to these pages will generate >> SIGSEGV.” > > Yes, this is true but I guess what Yang Shi meant was that an userspace > access racing with munmap is not well defined. You never know whether > you get your data, #PTF or SEGV because it depends on timing. The user > visible change might be that you lose content and get zero page instead > if you hit the race window while we are unmapping which was not possible > before. But whouldn't such an access pattern be buggy anyway? You need > some form of external synchronization AFAICS. It seems to follow the specifications, so it is not clearly buggy IMHO. I don’t know of such a use-case, but if somebody does so - the proposed change might even cause a security vulnerability. > But maybe some userspace depends on "getting right data or get SEGV" > semantic. If we have to preserve that then we can come up with a VM_DEAD > flag set before we tear it down and force the SEGV on the #PF path. > Something similar we already do for MMF_UNSTABLE. That seems reasonable. Regards, Nadav [-- Attachment #2: Message signed with OpenPGP --] [-- Type: application/pgp-signature, Size: 833 bytes --] ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [RFC v2 PATCH 2/2] mm: mmap: zap pages with read mmap_sem for large mapping 2018-06-20 7:18 ` Michal Hocko @ 2018-06-20 18:42 ` Yang Shi -1 siblings, 0 replies; 39+ messages in thread From: Yang Shi @ 2018-06-20 18:42 UTC (permalink / raw) To: Michal Hocko, Nadav Amit Cc: Matthew Wilcox, ldufour, Andrew Morton, Peter Zijlstra, Ingo Molnar, acme, alexander.shishkin, jolsa, namhyung, open list:MEMORY MANAGEMENT, linux-kernel On 6/20/18 12:18 AM, Michal Hocko wrote: > On Tue 19-06-18 17:31:27, Nadav Amit wrote: >> at 4:08 PM, Yang Shi <yang.shi@linux.alibaba.com> wrote: >> >>> >>> On 6/19/18 3:17 PM, Nadav Amit wrote: >>>> at 4:34 PM, Yang Shi <yang.shi@linux.alibaba.com> >>>> wrote: >>>> >>>> >>>>> When running some mmap/munmap scalability tests with large memory (i.e. >>>>> >>>>>> 300GB), the below hung task issue may happen occasionally. >>>>>> >>>>> INFO: task ps:14018 blocked for more than 120 seconds. >>>>> Tainted: G E 4.9.79-009.ali3000.alios7.x86_64 #1 >>>>> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this >>>>> message. >>>>> ps D 0 14018 1 0x00000004 >>>>> >>>>> >>>> (snip) >>>> >>>> >>>>> Zapping pages is the most time consuming part, according to the >>>>> suggestion from Michal Hock [1], zapping pages can be done with holding >>>>> read mmap_sem, like what MADV_DONTNEED does. Then re-acquire write >>>>> mmap_sem to manipulate vmas. >>>>> >>>> Does munmap() == MADV_DONTNEED + munmap() ? >>> Not exactly the same. So, I basically copied the page zapping used by munmap instead of calling MADV_DONTNEED. >>> >>>> For example, what happens with userfaultfd in this case? Can you get an >>>> extra #PF, which would be visible to userspace, before the munmap is >>>> finished? >>>> >>> userfaultfd is handled by regular munmap path. So, no change to userfaultfd part. >> Right. I see it now. 
>> >>>> In addition, would it be ok for the user to potentially get a zeroed page in >>>> the time window after the MADV_DONTNEED finished removing a PTE and before >>>> the munmap() is done? >>>> >>> This should be undefined behavior according to Michal. This has been discussed in https://lwn.net/Articles/753269/. >> Thanks for the reference. >> >> Reading the man page I see: "All pages containing a part of the indicated >> range are unmapped, and subsequent references to these pages will generate >> SIGSEGV.” > Yes, this is true but I guess what Yang Shi meant was that an userspace > access racing with munmap is not well defined. You never know whether > you get your data, #PTF or SEGV because it depends on timing. The user > visible change might be that you lose content and get zero page instead > if you hit the race window while we are unmapping which was not possible > before. But whouldn't such an access pattern be buggy anyway? You need > some form of external synchronization AFAICS. > > But maybe some userspace depends on "getting right data or get SEGV" > semantic. If we have to preserve that then we can come up with a VM_DEAD > flag set before we tear it down and force the SEGV on the #PF path. > Something similar we already do for MMF_UNSTABLE. Set VM_DEAD with read mmap_sem held? It should be ok since this is the only place to set this flag for this unique special case. Yang ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [RFC v2 PATCH 2/2] mm: mmap: zap pages with read mmap_sem for large mapping 2018-06-20 18:42 ` Yang Shi (?) @ 2018-06-23 1:01 ` Yang Shi 2018-06-25 9:14 ` Michal Hocko -1 siblings, 1 reply; 39+ messages in thread From: Yang Shi @ 2018-06-23 1:01 UTC (permalink / raw) To: Michal Hocko, Nadav Amit Cc: Matthew Wilcox, ldufour, Andrew Morton, Peter Zijlstra, Ingo Molnar, acme, alexander.shishkin, jolsa, namhyung, open list:MEMORY MANAGEMENT, linux-kernel >> Yes, this is true but I guess what Yang Shi meant was that an userspace >> access racing with munmap is not well defined. You never know whether >> you get your data, #PTF or SEGV because it depends on timing. The user >> visible change might be that you lose content and get zero page instead >> if you hit the race window while we are unmapping which was not possible >> before. But whouldn't such an access pattern be buggy anyway? You need >> some form of external synchronization AFAICS. >> >> But maybe some userspace depends on "getting right data or get SEGV" >> semantic. If we have to preserve that then we can come up with a VM_DEAD >> flag set before we tear it down and force the SEGV on the #PF path. >> Something similar we already do for MMF_UNSTABLE. > > Set VM_DEAD with read mmap_sem held? It should be ok since this is the > only place to set this flag for this unique special case. BTW, it looks like the vm flags have all been used up on 32-bit. If we really need VM_DEAD, it should work on both 32-bit and 64-bit. Any suggestion? Thanks, Yang ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [RFC v2 PATCH 2/2] mm: mmap: zap pages with read mmap_sem for large mapping 2018-06-23 1:01 ` Yang Shi @ 2018-06-25 9:14 ` Michal Hocko 0 siblings, 0 replies; 39+ messages in thread From: Michal Hocko @ 2018-06-25 9:14 UTC (permalink / raw) To: Yang Shi Cc: Nadav Amit, Matthew Wilcox, ldufour, Andrew Morton, Peter Zijlstra, Ingo Molnar, acme, alexander.shishkin, jolsa, namhyung, open list:MEMORY MANAGEMENT, linux-kernel On Fri 22-06-18 18:01:08, Yang Shi wrote: > Yes, this is true but I guess what Yang Shi meant was that an userspace > > > access racing with munmap is not well defined. You never know whether > > > you get your data, #PTF or SEGV because it depends on timing. The user > > > visible change might be that you lose content and get zero page instead > > > if you hit the race window while we are unmapping which was not possible > > > before. But whouldn't such an access pattern be buggy anyway? You need > > > some form of external synchronization AFAICS. > > > > > > But maybe some userspace depends on "getting right data or get SEGV" > > > semantic. If we have to preserve that then we can come up with a VM_DEAD > > > flag set before we tear it down and force the SEGV on the #PF path. > > > Something similar we already do for MMF_UNSTABLE. > > > > Set VM_DEAD with read mmap_sem held? It should be ok since this is the > > only place to set this flag for this unique special case. > > BTW, it looks the vm flags have used up in 32 bit. If we really need > VM_DEAD, it should be for both 32-bit and 64-bit. Do we really need any special handling for 32b? Who is going to create GB mappings for all this to be worth doing? -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [RFC v2 PATCH 2/2] mm: mmap: zap pages with read mmap_sem for large mapping 2018-06-20 7:18 ` Michal Hocko @ 2018-06-26 0:06 ` Yang Shi -1 siblings, 0 replies; 39+ messages in thread From: Yang Shi @ 2018-06-26 0:06 UTC (permalink / raw) To: Michal Hocko, Nadav Amit Cc: Matthew Wilcox, ldufour, Andrew Morton, Peter Zijlstra, Ingo Molnar, acme, alexander.shishkin, jolsa, namhyung, open list:MEMORY MANAGEMENT, linux-kernel On 6/20/18 12:18 AM, Michal Hocko wrote: > On Tue 19-06-18 17:31:27, Nadav Amit wrote: >> at 4:08 PM, Yang Shi <yang.shi@linux.alibaba.com> wrote: >> >>> >>> On 6/19/18 3:17 PM, Nadav Amit wrote: >>>> at 4:34 PM, Yang Shi <yang.shi@linux.alibaba.com> >>>> wrote: >>>> >>>> >>>>> When running some mmap/munmap scalability tests with large memory (i.e. >>>>> >>>>>> 300GB), the below hung task issue may happen occasionally. >>>>>> >>>>> INFO: task ps:14018 blocked for more than 120 seconds. >>>>> Tainted: G E 4.9.79-009.ali3000.alios7.x86_64 #1 >>>>> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this >>>>> message. >>>>> ps D 0 14018 1 0x00000004 >>>>> >>>>> >>>> (snip) >>>> >>>> >>>>> Zapping pages is the most time consuming part, according to the >>>>> suggestion from Michal Hock [1], zapping pages can be done with holding >>>>> read mmap_sem, like what MADV_DONTNEED does. Then re-acquire write >>>>> mmap_sem to manipulate vmas. >>>>> >>>> Does munmap() == MADV_DONTNEED + munmap() ? >>> Not exactly the same. So, I basically copied the page zapping used by munmap instead of calling MADV_DONTNEED. >>> >>>> For example, what happens with userfaultfd in this case? Can you get an >>>> extra #PF, which would be visible to userspace, before the munmap is >>>> finished? >>>> >>> userfaultfd is handled by regular munmap path. So, no change to userfaultfd part. >> Right. I see it now. 
>> >>>> In addition, would it be ok for the user to potentially get a zeroed page in >>>> the time window after the MADV_DONTNEED finished removing a PTE and before >>>> the munmap() is done? >>>> >>> This should be undefined behavior according to Michal. This has been discussed in https://lwn.net/Articles/753269/. >> Thanks for the reference. >> >> Reading the man page I see: "All pages containing a part of the indicated >> range are unmapped, and subsequent references to these pages will generate >> SIGSEGV." > Yes, this is true but I guess what Yang Shi meant was that an userspace > access racing with munmap is not well defined. You never know whether > you get your data, #PTF or SEGV because it depends on timing. The user > visible change might be that you lose content and get zero page instead > if you hit the race window while we are unmapping which was not possible > before. But whouldn't such an access pattern be buggy anyway? You need > some form of external synchronization AFAICS. > > But maybe some userspace depends on "getting right data or get SEGV" > semantic. If we have to preserve that then we can come up with a VM_DEAD > flag set before we tear it down and force the SEGV on the #PF path. > Something similar we already do for MMF_UNSTABLE. By looking at this more deeply, we may not be able to cover the whole unmapping range with VM_DEAD; for example, if the start addr is in the middle of a vma, we can't set VM_DEAD on that vma since that would trigger SIGSEGV for the still-mapped area. Splitting can't be done with read mmap_sem held, so maybe just set VM_DEAD on the non-overlapped vmas. Access to the overlapped vmas (first and last) will still have undefined behavior. Thanks, Yang ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [RFC v2 PATCH 2/2] mm: mmap: zap pages with read mmap_sem for large mapping 2018-06-26 0:06 ` Yang Shi (?) @ 2018-06-26 7:43 ` Peter Zijlstra 2018-06-27 1:03 ` Yang Shi -1 siblings, 1 reply; 39+ messages in thread From: Peter Zijlstra @ 2018-06-26 7:43 UTC (permalink / raw) To: Yang Shi Cc: Michal Hocko, Nadav Amit, Matthew Wilcox, ldufour, Andrew Morton, Ingo Molnar, acme, alexander.shishkin, jolsa, namhyung, open list:MEMORY MANAGEMENT, linux-kernel On Mon, Jun 25, 2018 at 05:06:23PM -0700, Yang Shi wrote: > By looking this deeper, we may not be able to cover all the unmapping range > for VM_DEAD, for example, if the start addr is in the middle of a vma. We > can't set VM_DEAD to that vma since that would trigger SIGSEGV for still > mapped area. > > splitting can't be done with read mmap_sem held, so maybe just set VM_DEAD > to non-overlapped vmas. Access to overlapped vmas (first and last) will > still have undefined behavior. Acquire mmap_sem for writing, split, mark VM_DEAD, drop mmap_sem. Acquire mmap_sem for reading, madv_free drop mmap_sem. Acquire mmap_sem for writing, free everything left, drop mmap_sem. ? Sure, you acquire the lock 3 times, but both write instances should be 'short', and I suppose you can do a demote between 1 and 2 if you care. ^ permalink raw reply [flat|nested] 39+ messages in thread
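Peter's three-phase sequence maps onto the existing mmap_sem primitives roughly as follows. This is pseudocode: only down_write/downgrade_write/up_read/down_write/up_write are real kernel APIs; the helper names are made up for illustration.

```c
/* Phase 1 (short write section): split the edge vmas and mark
 * every vma in [start, start+len) VM_DEAD. */
down_write(&mm->mmap_sem);
split_and_mark_dead(mm, start, len);
downgrade_write(&mm->mmap_sem);        /* the demote between 1 and 2 */

/* Phase 2 (read section): the expensive part -- zap the page
 * tables MADV_DONTNEED-style while other readers can still run. */
zap_pages_read_locked(mm, start, len);
up_read(&mm->mmap_sem);

/* Phase 3 (short write section): detach and free what is left. */
down_write(&mm->mmap_sem);
free_dead_vmas(mm, start, len);
up_write(&mm->mmap_sem);
```

With the demote, the lock is acquired twice rather than three times, but either way both write-locked sections stay short, which is the point of the exercise.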
* Re: [RFC v2 PATCH 2/2] mm: mmap: zap pages with read mmap_sem for large mapping 2018-06-26 7:43 ` Peter Zijlstra @ 2018-06-27 1:03 ` Yang Shi 2018-06-27 7:24 ` Michal Hocko 0 siblings, 1 reply; 39+ messages in thread From: Yang Shi @ 2018-06-27 1:03 UTC (permalink / raw) To: Peter Zijlstra Cc: Michal Hocko, Nadav Amit, Matthew Wilcox, ldufour, Andrew Morton, Ingo Molnar, acme, alexander.shishkin, jolsa, namhyung, open list:MEMORY MANAGEMENT, linux-kernel On 6/26/18 12:43 AM, Peter Zijlstra wrote: > On Mon, Jun 25, 2018 at 05:06:23PM -0700, Yang Shi wrote: >> By looking this deeper, we may not be able to cover all the unmapping range >> for VM_DEAD, for example, if the start addr is in the middle of a vma. We >> can't set VM_DEAD to that vma since that would trigger SIGSEGV for still >> mapped area. >> >> splitting can't be done with read mmap_sem held, so maybe just set VM_DEAD >> to non-overlapped vmas. Access to overlapped vmas (first and last) will >> still have undefined behavior. > Acquire mmap_sem for writing, split, mark VM_DEAD, drop mmap_sem. Acquire > mmap_sem for reading, madv_free drop mmap_sem. Acquire mmap_sem for > writing, free everything left, drop mmap_sem. > > ? > > Sure, you acquire the lock 3 times, but both write instances should be > 'short', and I suppose you can do a demote between 1 and 2 if you care. Thanks, Peter. Yes, by looking at the code and trying two different approaches, it looks like this approach is the most straightforward one. Splitting the vma up-front can save a lot of pain later. Holding write mmap_sem for this job before zapping mappings sounds worth the cost (very short write critical section). And VM_DEAD can be set exclusively with write mmap_sem without racing with page faults; this will give us consistent behavior for the race between PF and munmap. And we don't need to care about overlapped vmas since they have been split beforehand. Yang ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [RFC v2 PATCH 2/2] mm: mmap: zap pages with read mmap_sem for large mapping 2018-06-27 1:03 ` Yang Shi @ 2018-06-27 7:24 ` Michal Hocko 2018-06-27 17:23 ` Yang Shi 0 siblings, 1 reply; 39+ messages in thread From: Michal Hocko @ 2018-06-27 7:24 UTC (permalink / raw) To: Yang Shi Cc: Peter Zijlstra, Nadav Amit, Matthew Wilcox, ldufour, Andrew Morton, Ingo Molnar, acme, alexander.shishkin, jolsa, namhyung, open list:MEMORY MANAGEMENT, linux-kernel On Tue 26-06-18 18:03:34, Yang Shi wrote: > > > On 6/26/18 12:43 AM, Peter Zijlstra wrote: > > On Mon, Jun 25, 2018 at 05:06:23PM -0700, Yang Shi wrote: > > > By looking this deeper, we may not be able to cover all the unmapping range > > > for VM_DEAD, for example, if the start addr is in the middle of a vma. We > > > can't set VM_DEAD to that vma since that would trigger SIGSEGV for still > > > mapped area. > > > > > > splitting can't be done with read mmap_sem held, so maybe just set VM_DEAD > > > to non-overlapped vmas. Access to overlapped vmas (first and last) will > > > still have undefined behavior. > > Acquire mmap_sem for writing, split, mark VM_DEAD, drop mmap_sem. Acquire > > mmap_sem for reading, madv_free drop mmap_sem. Acquire mmap_sem for > > writing, free everything left, drop mmap_sem. > > > > ? > > > > Sure, you acquire the lock 3 times, but both write instances should be > > 'short', and I suppose you can do a demote between 1 and 2 if you care. > > Thanks, Peter. Yes, by looking the code and trying two different approaches, > it looks this approach is the most straight-forward one. Yes, you just have to be careful about the max vma count limit. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [RFC v2 PATCH 2/2] mm: mmap: zap pages with read mmap_sem for large mapping 2018-06-27 7:24 ` Michal Hocko @ 2018-06-27 17:23 ` Yang Shi 0 siblings, 0 replies; 39+ messages in thread From: Yang Shi @ 2018-06-27 17:23 UTC (permalink / raw) To: Michal Hocko Cc: Peter Zijlstra, Nadav Amit, Matthew Wilcox, ldufour, Andrew Morton, Ingo Molnar, acme, alexander.shishkin, jolsa, namhyung, open list:MEMORY MANAGEMENT, linux-kernel On 6/27/18 12:24 AM, Michal Hocko wrote: > On Tue 26-06-18 18:03:34, Yang Shi wrote: >> >> On 6/26/18 12:43 AM, Peter Zijlstra wrote: >>> On Mon, Jun 25, 2018 at 05:06:23PM -0700, Yang Shi wrote: >>>> By looking this deeper, we may not be able to cover all the unmapping range >>>> for VM_DEAD, for example, if the start addr is in the middle of a vma. We >>>> can't set VM_DEAD to that vma since that would trigger SIGSEGV for still >>>> mapped area. >>>> >>>> splitting can't be done with read mmap_sem held, so maybe just set VM_DEAD >>>> to non-overlapped vmas. Access to overlapped vmas (first and last) will >>>> still have undefined behavior. >>> Acquire mmap_sem for writing, split, mark VM_DEAD, drop mmap_sem. Acquire >>> mmap_sem for reading, madv_free drop mmap_sem. Acquire mmap_sem for >>> writing, free everything left, drop mmap_sem. >>> >>> ? >>> >>> Sure, you acquire the lock 3 times, but both write instances should be >>> 'short', and I suppose you can do a demote between 1 and 2 if you care. >> Thanks, Peter. Yes, by looking the code and trying two different approaches, >> it looks this approach is the most straight-forward one. > Yes, you just have to be careful about the max vma count limit. Yes, we should just need to copy what do_munmap does, as below: if (end < vma->vm_end && mm->map_count >= sysctl_max_map_count) return -ENOMEM; If the max map count limit has been reached, it will return failure before zapping mappings. Thanks, Yang ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [RFC v2 PATCH 2/2] mm: mmap: zap pages with read mmap_sem for large mapping
  2018-06-27 17:23 ` Yang Shi
@ 2018-06-28 11:51 ` Michal Hocko
  0 siblings, 0 replies; 39+ messages in thread
From: Michal Hocko @ 2018-06-28 11:51 UTC (permalink / raw)
To: Yang Shi
Cc: Peter Zijlstra, Nadav Amit, Matthew Wilcox, ldufour, Andrew Morton,
    Ingo Molnar, acme, alexander.shishkin, jolsa, namhyung,
    open list:MEMORY MANAGEMENT, linux-kernel

On Wed 27-06-18 10:23:39, Yang Shi wrote:
>
> On 6/27/18 12:24 AM, Michal Hocko wrote:
> > On Tue 26-06-18 18:03:34, Yang Shi wrote:
> > >
> > > On 6/26/18 12:43 AM, Peter Zijlstra wrote:
> > > > On Mon, Jun 25, 2018 at 05:06:23PM -0700, Yang Shi wrote:
> > > > > By looking this deeper, we may not be able to cover all the unmapping range
> > > > > for VM_DEAD, for example, if the start addr is in the middle of a vma. We
> > > > > can't set VM_DEAD to that vma since that would trigger SIGSEGV for still
> > > > > mapped area.
> > > > >
> > > > > splitting can't be done with read mmap_sem held, so maybe just set VM_DEAD
> > > > > to non-overlapped vmas. Access to overlapped vmas (first and last) will
> > > > > still have undefined behavior.
> > > > Acquire mmap_sem for writing, split, mark VM_DEAD, drop mmap_sem. Acquire
> > > > mmap_sem for reading, madv_free, drop mmap_sem. Acquire mmap_sem for
> > > > writing, free everything left, drop mmap_sem.
> > > >
> > > > ?
> > > >
> > > > Sure, you acquire the lock 3 times, but both write instances should be
> > > > 'short', and I suppose you can do a demote between 1 and 2 if you care.
> > > Thanks, Peter. Yes, by looking the code and trying two different approaches,
> > > it looks this approach is the most straight-forward one.
> > Yes, you just have to be careful about the max vma count limit.
>
> Yes, we should just need to copy what do_munmap does, as below:
>
> 	if (end < vma->vm_end && mm->map_count >= sysctl_max_map_count)
> 		return -ENOMEM;
>
> If the max map count limit has been reached, it will return failure before
> zapping mappings.

Yeah, but as soon as you drop the lock and retake it, somebody might
have changed the address space and we might get inconsistency.

So I am wondering whether we really need upgrade_read (to promote read
to write lock) and do the
	down_write
	split & set up VM_DEAD
	downgrade_write
	unmap
	upgrade_read
	zap ptes
	up_write

looks terrible, no question about that, but we won't drop the mmap sem
at any time.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 39+ messages in thread
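The lock ladder above is easier to follow laid out as code. A toy C sketch with stub lock primitives that only record the call order: down_write, downgrade_write, and up_write correspond to real mmap_sem operations, while upgrade_read (read-to-write promotion) does not exist in the mainline kernel and is exactly the open question in this thread; all bodies here are illustrative stand-ins, not kernel code.

```c
#include <string.h>

static char trace[128];                 /* records the lock sequence */
static void op(const char *name) { strcat(strcat(trace, name), ";"); }

/* Stub stand-ins for the rwsem primitives discussed above. */
static void down_write_stub(void)      { op("down_write"); }
static void downgrade_write_stub(void) { op("downgrade_write"); }
static void upgrade_read_stub(void)    { op("upgrade_read"); }   /* hypothetical */
static void up_write_stub(void)        { op("up_write"); }

/* The proposed munmap sequence: mmap_sem is never dropped, but the
 * final cleanup phase needs the hypothetical read-to-write promotion. */
static const char *munmap_large_sketch(void)
{
	trace[0] = '\0';
	down_write_stub();      /* phase 1: split vmas, set up VM_DEAD */
	downgrade_write_stub(); /* demote; phase 2: unmap / zap ptes under read lock */
	upgrade_read_stub();    /* promote back; phase 3: detach vmas, free page tables */
	up_write_stub();
	return trace;
}
```

The point of the ladder is that no other thread can observe a window with mmap_sem fully released, at the cost of inventing the promotion primitive.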
* Re: [RFC v2 PATCH 2/2] mm: mmap: zap pages with read mmap_sem for large mapping
  2018-06-28 11:51 ` Michal Hocko
@ 2018-06-28 19:10 ` Yang Shi
  0 siblings, 0 replies; 39+ messages in thread
From: Yang Shi @ 2018-06-28 19:10 UTC (permalink / raw)
To: Michal Hocko
Cc: Peter Zijlstra, Nadav Amit, Matthew Wilcox, ldufour, Andrew Morton,
    Ingo Molnar, acme, alexander.shishkin, jolsa, namhyung,
    open list:MEMORY MANAGEMENT, linux-kernel

On 6/28/18 4:51 AM, Michal Hocko wrote:
> On Wed 27-06-18 10:23:39, Yang Shi wrote:
>>
>> On 6/27/18 12:24 AM, Michal Hocko wrote:
>>> On Tue 26-06-18 18:03:34, Yang Shi wrote:
>>>> On 6/26/18 12:43 AM, Peter Zijlstra wrote:
>>>>> On Mon, Jun 25, 2018 at 05:06:23PM -0700, Yang Shi wrote:
>>>>>> By looking this deeper, we may not be able to cover all the unmapping range
>>>>>> for VM_DEAD, for example, if the start addr is in the middle of a vma. We
>>>>>> can't set VM_DEAD to that vma since that would trigger SIGSEGV for still
>>>>>> mapped area.
>>>>>>
>>>>>> splitting can't be done with read mmap_sem held, so maybe just set VM_DEAD
>>>>>> to non-overlapped vmas. Access to overlapped vmas (first and last) will
>>>>>> still have undefined behavior.
>>>>> Acquire mmap_sem for writing, split, mark VM_DEAD, drop mmap_sem. Acquire
>>>>> mmap_sem for reading, madv_free, drop mmap_sem. Acquire mmap_sem for
>>>>> writing, free everything left, drop mmap_sem.
>>>>>
>>>>> ?
>>>>>
>>>>> Sure, you acquire the lock 3 times, but both write instances should be
>>>>> 'short', and I suppose you can do a demote between 1 and 2 if you care.
>>>> Thanks, Peter. Yes, by looking the code and trying two different approaches,
>>>> it looks this approach is the most straight-forward one.
>>> Yes, you just have to be careful about the max vma count limit.
>> Yes, we should just need to copy what do_munmap does, as below:
>>
>> 	if (end < vma->vm_end && mm->map_count >= sysctl_max_map_count)
>> 		return -ENOMEM;
>>
>> If the max map count limit has been reached, it will return failure before
>> zapping mappings.
> Yeah, but as soon as you drop the lock and retake it, somebody might
> have changed the address space and we might get inconsistency.
>
> So I am wondering whether we really need upgrade_read (to promote read
> to write lock) and do the
> 	down_write
> 	split & set up VM_DEAD
> 	downgrade_write
> 	unmap
> 	upgrade_read
> 	zap ptes
> 	up_write

I suppose address space changes can only be done by mmap, mremap and
mprotect. If so, we may utilize the new VM_DEAD flag: if the VM_DEAD
flag is set on the vma, just return failure since it is being unmapped.

Does it sound reasonable?

Thanks,
Yang

>
> looks terrible, no question about that, but we won't drop the mmap sem
> at any time.

^ permalink raw reply	[flat|nested] 39+ messages in thread
* Re: [RFC v2 PATCH 2/2] mm: mmap: zap pages with read mmap_sem for large mapping
  2018-06-28 19:10 ` Yang Shi
@ 2018-06-29  0:59 ` Yang Shi
  0 siblings, 0 replies; 39+ messages in thread
From: Yang Shi @ 2018-06-29 0:59 UTC (permalink / raw)
To: Michal Hocko
Cc: Peter Zijlstra, Nadav Amit, Matthew Wilcox, ldufour, Andrew Morton,
    Ingo Molnar, acme, alexander.shishkin, jolsa, namhyung,
    open list:MEMORY MANAGEMENT, linux-kernel

On 6/28/18 12:10 PM, Yang Shi wrote:
>
> On 6/28/18 4:51 AM, Michal Hocko wrote:
>> On Wed 27-06-18 10:23:39, Yang Shi wrote:
>>>
>>> On 6/27/18 12:24 AM, Michal Hocko wrote:
>>>> On Tue 26-06-18 18:03:34, Yang Shi wrote:
>>>>> On 6/26/18 12:43 AM, Peter Zijlstra wrote:
>>>>>> On Mon, Jun 25, 2018 at 05:06:23PM -0700, Yang Shi wrote:
>>>>>>> By looking this deeper, we may not be able to cover all the
>>>>>>> unmapping range for VM_DEAD, for example, if the start addr is
>>>>>>> in the middle of a vma. We can't set VM_DEAD to that vma since
>>>>>>> that would trigger SIGSEGV for still mapped area.
>>>>>>>
>>>>>>> splitting can't be done with read mmap_sem held, so maybe just
>>>>>>> set VM_DEAD to non-overlapped vmas. Access to overlapped vmas
>>>>>>> (first and last) will still have undefined behavior.
>>>>>> Acquire mmap_sem for writing, split, mark VM_DEAD, drop mmap_sem.
>>>>>> Acquire mmap_sem for reading, madv_free, drop mmap_sem. Acquire
>>>>>> mmap_sem for writing, free everything left, drop mmap_sem.
>>>>>>
>>>>>> ?
>>>>>>
>>>>>> Sure, you acquire the lock 3 times, but both write instances
>>>>>> should be 'short', and I suppose you can do a demote between 1
>>>>>> and 2 if you care.
>>>>> Thanks, Peter. Yes, by looking the code and trying two different
>>>>> approaches, it looks this approach is the most straight-forward one.
>>>> Yes, you just have to be careful about the max vma count limit.
>>> Yes, we should just need to copy what do_munmap does, as below:
>>>
>>> 	if (end < vma->vm_end && mm->map_count >= sysctl_max_map_count)
>>> 		return -ENOMEM;
>>>
>>> If the max map count limit has been reached, it will return failure
>>> before zapping mappings.
>> Yeah, but as soon as you drop the lock and retake it, somebody might
>> have changed the address space and we might get inconsistency.
>>
>> So I am wondering whether we really need upgrade_read (to promote read
>> to write lock) and do the
>> 	down_write
>> 	split & set up VM_DEAD
>> 	downgrade_write
>> 	unmap
>> 	upgrade_read
>> 	zap ptes
>> 	up_write

Promoting to the write lock may be trouble. There might be other users
in the critical section with the read lock; we have to wait for them to
finish.

> I suppose address space changes can only be done by mmap, mremap and
> mprotect. If so, we may utilize the new VM_DEAD flag: if the VM_DEAD
> flag is set on the vma, just return failure since it is being unmapped.
>
> Does it sound reasonable?

It looks like we just need to care about MAP_FIXED (mmap) and
MREMAP_FIXED (mremap), right?

How about letting them return -EBUSY or -EAGAIN to notify the
application?

This changes the behavior a little bit: MAP_FIXED and mremap may fail if
they lose the race with munmap (if the mapping is larger than 1GB). I'm
not sure if any multi-threaded application uses MAP_FIXED and
MREMAP_FIXED heavily enough to run into the race condition. I guess it
should be rare to meet all the conditions needed to trigger the race.

The programmer should be very cautious about MAP_FIXED/MREMAP_FIXED
since they may corrupt their own address space, as the man page notes.

Thanks,
Yang

>
> Thanks,
> Yang
>
>>
>> looks terrible, no question about that, but we won't drop the mmap sem
>> at any time.

^ permalink raw reply	[flat|nested] 39+ messages in thread
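The proposal above — failing MAP_FIXED/MREMAP_FIXED placement while the target range is being torn down — amounts to a one-flag check at the start of the fixed-address paths. A hypothetical sketch (the VM_DEAD bit value and the helper are made up for illustration; no such flag exists upstream, and the real check would live in the mmap/mremap paths):

```c
#include <errno.h>

#define VM_DEAD 0x01000000UL	/* hypothetical "being unmapped" vma flag */

/*
 * Sketch of the idea in the message above: a fixed-address mapping
 * request that lands on a vma already marked VM_DEAD by the first
 * munmap phase fails with -EBUSY instead of racing with the teardown.
 */
static int check_fixed_target(unsigned long vm_flags)
{
	if (vm_flags & VM_DEAD)
		return -EBUSY;	/* range is being unmapped concurrently */
	return 0;
}
```

As the follow-up notes, neither mmap nor mremap documents EBUSY, so surfacing a new errno to userspace is itself a behavior change.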
* Re: [RFC v2 PATCH 2/2] mm: mmap: zap pages with read mmap_sem for large mapping
  2018-06-29  0:59 ` Yang Shi
@ 2018-06-29 11:39 ` Michal Hocko
  0 siblings, 0 replies; 39+ messages in thread
From: Michal Hocko @ 2018-06-29 11:39 UTC (permalink / raw)
To: Yang Shi
Cc: Peter Zijlstra, Nadav Amit, Matthew Wilcox, ldufour, Andrew Morton,
    Ingo Molnar, acme, alexander.shishkin, jolsa, namhyung,
    open list:MEMORY MANAGEMENT, linux-kernel

On Thu 28-06-18 17:59:25, Yang Shi wrote:
>
> On 6/28/18 12:10 PM, Yang Shi wrote:
> >
> > On 6/28/18 4:51 AM, Michal Hocko wrote:
> > > On Wed 27-06-18 10:23:39, Yang Shi wrote:
> > > >
> > > > On 6/27/18 12:24 AM, Michal Hocko wrote:
> > > > > On Tue 26-06-18 18:03:34, Yang Shi wrote:
> > > > > > On 6/26/18 12:43 AM, Peter Zijlstra wrote:
> > > > > > > On Mon, Jun 25, 2018 at 05:06:23PM -0700, Yang Shi wrote:
> > > > > > > > By looking this deeper, we may not be able to cover all the
> > > > > > > > unmapping range for VM_DEAD, for example, if the start addr
> > > > > > > > is in the middle of a vma. We can't set VM_DEAD to that vma
> > > > > > > > since that would trigger SIGSEGV for still mapped area.
> > > > > > > >
> > > > > > > > splitting can't be done with read mmap_sem held, so maybe
> > > > > > > > just set VM_DEAD to non-overlapped vmas. Access to
> > > > > > > > overlapped vmas (first and last) will still have undefined
> > > > > > > > behavior.
> > > > > > > Acquire mmap_sem for writing, split, mark VM_DEAD, drop
> > > > > > > mmap_sem. Acquire mmap_sem for reading, madv_free, drop
> > > > > > > mmap_sem. Acquire mmap_sem for writing, free everything left,
> > > > > > > drop mmap_sem.
> > > > > > >
> > > > > > > ?
> > > > > > >
> > > > > > > Sure, you acquire the lock 3 times, but both write instances
> > > > > > > should be 'short', and I suppose you can do a demote between 1
> > > > > > > and 2 if you care.
> > > > > > Thanks, Peter. Yes, by looking the code and trying two
> > > > > > different approaches, it looks this approach is the most
> > > > > > straight-forward one.
> > > > > Yes, you just have to be careful about the max vma count limit.
> > > > Yes, we should just need to copy what do_munmap does, as below:
> > > >
> > > > 	if (end < vma->vm_end && mm->map_count >= sysctl_max_map_count)
> > > > 		return -ENOMEM;
> > > >
> > > > If the max map count limit has been reached, it will return failure
> > > > before zapping mappings.
> > > Yeah, but as soon as you drop the lock and retake it, somebody might
> > > have changed the address space and we might get inconsistency.
> > >
> > > So I am wondering whether we really need upgrade_read (to promote read
> > > to write lock) and do the
> > > 	down_write
> > > 	split & set up VM_DEAD
> > > 	downgrade_write
> > > 	unmap
> > > 	upgrade_read
> > > 	zap ptes
> > > 	up_write
>
> Promoting to the write lock may be trouble. There might be other users in
> the critical section with the read lock; we have to wait for them to
> finish.

Yes. Is that a problem though?

> > I suppose address space changes can only be done by mmap, mremap and
> > mprotect. If so, we may utilize the new VM_DEAD flag: if the VM_DEAD
> > flag is set on the vma, just return failure since it is being unmapped.
> >
> > Does it sound reasonable?
>
> It looks like we just need to care about MAP_FIXED (mmap) and MREMAP_FIXED
> (mremap), right?
>
> How about letting them return -EBUSY or -EAGAIN to notify the application?

Well, none of those is documented to return EBUSY, and EAGAIN already
has a meaning for locked memory.

> This changes the behavior a little bit: MAP_FIXED and mremap may fail if
> they lose the race with munmap (if the mapping is larger than 1GB). I'm
> not sure if any multi-threaded application uses MAP_FIXED and
> MREMAP_FIXED heavily enough to run into the race condition. I guess it
> should be rare to meet all the conditions needed to trigger the race.
>
> The programmer should be very cautious about MAP_FIXED/MREMAP_FIXED since
> they may corrupt their own address space, as the man page notes.

Well, I suspect you are overcomplicating this a bit. This should be a
really straightforward thing - well, except for VM_DEAD, which is quite
tricky already. We should rather not spread this trickiness outside of
the #PF path. And I would even try hard to start that part simple to see
whether it actually matters. Relying on races between threads without
any locking is quite questionable already. Nobody has pointed to a sane
use case so far.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 39+ messages in thread
* Re: [RFC v2 PATCH 2/2] mm: mmap: zap pages with read mmap_sem for large mapping
  2018-06-29 11:39 ` Michal Hocko
@ 2018-06-29 16:50 ` Yang Shi
  0 siblings, 0 replies; 39+ messages in thread
From: Yang Shi @ 2018-06-29 16:50 UTC (permalink / raw)
To: Michal Hocko
Cc: Peter Zijlstra, Nadav Amit, Matthew Wilcox, ldufour, Andrew Morton,
    Ingo Molnar, acme, alexander.shishkin, jolsa, namhyung,
    open list:MEMORY MANAGEMENT, linux-kernel

On 6/29/18 4:39 AM, Michal Hocko wrote:
> On Thu 28-06-18 17:59:25, Yang Shi wrote:
>>
>> On 6/28/18 12:10 PM, Yang Shi wrote:
>>>
>>> On 6/28/18 4:51 AM, Michal Hocko wrote:
>>>> On Wed 27-06-18 10:23:39, Yang Shi wrote:
>>>>> On 6/27/18 12:24 AM, Michal Hocko wrote:
>>>>>> On Tue 26-06-18 18:03:34, Yang Shi wrote:
>>>>>>> On 6/26/18 12:43 AM, Peter Zijlstra wrote:
>>>>>>>> On Mon, Jun 25, 2018 at 05:06:23PM -0700, Yang Shi wrote:
>>>>>>>>> By looking this deeper, we may not be able to cover all the
>>>>>>>>> unmapping range for VM_DEAD, for example, if the start addr is
>>>>>>>>> in the middle of a vma. We can't set VM_DEAD to that vma since
>>>>>>>>> that would trigger SIGSEGV for still mapped area.
>>>>>>>>>
>>>>>>>>> splitting can't be done with read mmap_sem held, so maybe just
>>>>>>>>> set VM_DEAD to non-overlapped vmas. Access to overlapped vmas
>>>>>>>>> (first and last) will still have undefined behavior.
>>>>>>>> Acquire mmap_sem for writing, split, mark VM_DEAD, drop mmap_sem.
>>>>>>>> Acquire mmap_sem for reading, madv_free, drop mmap_sem. Acquire
>>>>>>>> mmap_sem for writing, free everything left, drop mmap_sem.
>>>>>>>>
>>>>>>>> ?
>>>>>>>>
>>>>>>>> Sure, you acquire the lock 3 times, but both write instances
>>>>>>>> should be 'short', and I suppose you can do a demote between 1
>>>>>>>> and 2 if you care.
>>>>>>> Thanks, Peter. Yes, by looking the code and trying two different
>>>>>>> approaches, it looks this approach is the most straight-forward one.
>>>>>> Yes, you just have to be careful about the max vma count limit.
>>>>> Yes, we should just need to copy what do_munmap does, as below:
>>>>>
>>>>> 	if (end < vma->vm_end && mm->map_count >= sysctl_max_map_count)
>>>>> 		return -ENOMEM;
>>>>>
>>>>> If the max map count limit has been reached, it will return failure
>>>>> before zapping mappings.
>>>> Yeah, but as soon as you drop the lock and retake it, somebody might
>>>> have changed the address space and we might get inconsistency.
>>>>
>>>> So I am wondering whether we really need upgrade_read (to promote read
>>>> to write lock) and do the
>>>> 	down_write
>>>> 	split & set up VM_DEAD
>>>> 	downgrade_write
>>>> 	unmap
>>>> 	upgrade_read
>>>> 	zap ptes
>>>> 	up_write
>> Promoting to the write lock may be trouble. There might be other users
>> in the critical section with the read lock; we have to wait for them to
>> finish.
> Yes. Is that a problem though?

Not a problem, but I'm just not sure how complicated it would be,
considering all the lock debug/lockdep stuff. And the behavior smells
like RCU.

>
>>> I suppose address space changes can only be done by mmap, mremap and
>>> mprotect. If so, we may utilize the new VM_DEAD flag: if the VM_DEAD
>>> flag is set on the vma, just return failure since it is being unmapped.
>>>
>>> Does it sound reasonable?
>> It looks like we just need to care about MAP_FIXED (mmap) and
>> MREMAP_FIXED (mremap), right?
>>
>> How about letting them return -EBUSY or -EAGAIN to notify the
>> application?
> Well, none of those is documented to return EBUSY, and EAGAIN already
> has a meaning for locked memory.
>
>> This changes the behavior a little bit: MAP_FIXED and mremap may fail if
>> they lose the race with munmap (if the mapping is larger than 1GB). I'm
>> not sure if any multi-threaded application uses MAP_FIXED and
>> MREMAP_FIXED heavily enough to run into the race condition. I guess it
>> should be rare to meet all the conditions needed to trigger the race.
>>
>> The programmer should be very cautious about MAP_FIXED/MREMAP_FIXED
>> since they may corrupt their own address space, as the man page notes.
> Well, I suspect you are overcomplicating this a bit. This should be a
> really straightforward thing - well, except for VM_DEAD, which is quite
> tricky already. We should rather not spread this trickiness outside of
> the #PF path. And I would even try hard to start that part simple to see
> whether it actually matters. Relying on races between threads without
> any locking is quite questionable already. Nobody has pointed to a sane
> use case so far.

I agree to keep it as simple as possible, then see whether it matters or
not. So, in v3 I will just touch the page fault path.

^ permalink raw reply	[flat|nested] 39+ messages in thread
* Re: [RFC v2 PATCH 2/2] mm: mmap: zap pages with read mmap_sem for large mapping @ 2018-06-29 16:50 ` Yang Shi 0 siblings, 0 replies; 39+ messages in thread From: Yang Shi @ 2018-06-29 16:50 UTC (permalink / raw) To: Michal Hocko Cc: Peter Zijlstra, Nadav Amit, Matthew Wilcox, ldufour, Andrew Morton, Ingo Molnar, acme, alexander.shishkin, jolsa, namhyung, open list:MEMORY MANAGEMENT, linux-kernel On 6/29/18 4:39 AM, Michal Hocko wrote: > On Thu 28-06-18 17:59:25, Yang Shi wrote: >> >> On 6/28/18 12:10 PM, Yang Shi wrote: >>> >>> On 6/28/18 4:51 AM, Michal Hocko wrote: >>>> On Wed 27-06-18 10:23:39, Yang Shi wrote: >>>>> On 6/27/18 12:24 AM, Michal Hocko wrote: >>>>>> On Tue 26-06-18 18:03:34, Yang Shi wrote: >>>>>>> On 6/26/18 12:43 AM, Peter Zijlstra wrote: >>>>>>>> On Mon, Jun 25, 2018 at 05:06:23PM -0700, Yang Shi wrote: >>>>>>>>> By looking this deeper, we may not be able to >>>>>>>>> cover all the unmapping range >>>>>>>>> for VM_DEAD, for example, if the start addr is >>>>>>>>> in the middle of a vma. We >>>>>>>>> can't set VM_DEAD to that vma since that would >>>>>>>>> trigger SIGSEGV for still >>>>>>>>> mapped area. >>>>>>>>> >>>>>>>>> splitting can't be done with read mmap_sem held, >>>>>>>>> so maybe just set VM_DEAD >>>>>>>>> to non-overlapped vmas. Access to overlapped >>>>>>>>> vmas (first and last) will >>>>>>>>> still have undefined behavior. >>>>>>>> Acquire mmap_sem for writing, split, mark VM_DEAD, >>>>>>>> drop mmap_sem. Acquire >>>>>>>> mmap_sem for reading, madv_free drop mmap_sem. Acquire mmap_sem for >>>>>>>> writing, free everything left, drop mmap_sem. >>>>>>>> >>>>>>>> ? >>>>>>>> >>>>>>>> Sure, you acquire the lock 3 times, but both write >>>>>>>> instances should be >>>>>>>> 'short', and I suppose you can do a demote between 1 >>>>>>>> and 2 if you care. >>>>>>> Thanks, Peter. Yes, by looking the code and trying two >>>>>>> different approaches, >>>>>>> it looks this approach is the most straight-forward one. 
>>>>>> Yes, you just have to be careful about the max vma count limit. >>>>> Yes, we should just need to copy what do_munmap does as below: >>>>> >>>>> if (end < vma->vm_end && mm->map_count >= sysctl_max_map_count) >>>>>         return -ENOMEM; >>>>> >>>>> If the max map count limit has been reached, it will return >>>>> failure before >>>>> zapping mappings. >>>> Yeah, but as soon as you drop the lock and retake it, somebody might >>>> have changed the address space and we might get inconsistency. >>>> >>>> So I am wondering whether we really need upgrade_read (to promote read >>>> to write lock) and do the >>>>     down_write >>>>     split & set up VM_DEAD >>>>     downgrade_write >>>>     unmap >>>>     upgrade_read >>>>     zap ptes >>>>     up_write >> Promoting to write lock may be a trouble. There might be other users in the >> critical section with the read lock, and we have to wait for them to finish. > Yes. Is that a problem though? Not a problem, but I'm just not sure how complicated it would be, considering all the lock debug/lockdep stuff. And the behavior smells like RCU. > >>> I suppose address space changes can just be done by mmap, mremap, >>> mprotect. If so, we may utilize the new VM_DEAD flag. If the VM_DEAD >>> flag is set for the vma, just return failure since it is being unmapped. >>> >>> Does it sound reasonable? >> It looks like we just need to care about MAP_FIXED (mmap) and MREMAP_FIXED (mremap), >> right? >> >> How about letting them return -EBUSY or -EAGAIN to notify the application? > Well, none of those is documented to return EBUSY, and EAGAIN already has > a meaning for locked memory. > >> This changes the behavior a little bit: MAP_FIXED and mremap may fail if >> they lose the race with munmap (if the mapping is larger than 1GB). I'm not >> sure if any multi-threaded application uses MAP_FIXED and MREMAP_FIXED so >> heavily that it may run into the race condition. 
I guess it should be rare to >> meet all the conditions to trigger the race. >> >> The programmer should be very cautious about MAP_FIXED/MREMAP_FIXED since >> they may corrupt their own address space, as the man page notes. > Well, I suspect you are overcomplicating this a bit. This should be a > really straightforward thing - well, except for VM_DEAD, which is quite > tricky already. We should rather not spread this trickiness outside of > the #PF path. And I would even try hard to start that part simple to see > whether it actually matters. Relying on races between threads without > any locking is quite questionable already. Nobody has pointed to a sane > usecase so far. I agree to keep it as simple as possible and then see if it matters or not. So, in v3 I will just touch the page fault path. ^ permalink raw reply [flat|nested] 39+ messages in thread
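For reference, the three-phase scheme discussed above -- a short write-locked phase to split the vmas and mark them VM_DEAD, a long read-locked phase to zap the ptes, and a final short write-locked cleanup -- could be sketched roughly as follows. This is an illustrative sketch only, not the actual patch: the function name is hypothetical, upgrade_read() does not exist upstream (which is exactly the concern raised here, so the final phase re-acquires the write lock as in Peter's variant), and error handling plus the real splitting/zapping calls are elided.

```c
/*
 * Hypothetical sketch of the discussed three-phase munmap; the name
 * and structure are illustrative, not the submitted patch.
 */
static int do_munmap_zap_rlock(struct mm_struct *mm,
			       unsigned long start, size_t len)
{
	unsigned long end = start + len;
	struct vm_area_struct *vma;

	/* Phase 1 (write lock, short): split vmas, mark them VM_DEAD. */
	down_write(&mm->mmap_sem);
	vma = find_vma(mm, start);
	if (vma && end < vma->vm_end &&
	    mm->map_count >= sysctl_max_map_count) {
		up_write(&mm->mmap_sem);
		return -ENOMEM;		/* same check do_munmap does */
	}
	/* ... __split_vma() at start/end, set VM_DEAD on covered vmas ... */

	/* Phase 2 (read lock, long): zap ptes without blocking readers. */
	downgrade_write(&mm->mmap_sem);	/* write -> read, atomically */
	/* ... unmap_region()-style pte zapping ... */
	up_read(&mm->mmap_sem);

	/*
	 * Phase 3 (write lock, short): detach and free the dead vmas.
	 * There is no upgrade_read(), so the lock is dropped and
	 * re-acquired; the address space must be revalidated here.
	 */
	down_write(&mm->mmap_sem);
	/* ... detach and free the VM_DEAD vmas ... */
	up_write(&mm->mmap_sem);
	return 0;
}
```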
* Re: [RFC v2 PATCH 2/2] mm: mmap: zap pages with read mmap_sem for large mapping 2018-06-28 19:10 ` Yang Shi @ 2018-06-29 11:34 ` Michal Hocko -1 siblings, 0 replies; 39+ messages in thread From: Michal Hocko @ 2018-06-29 11:34 UTC (permalink / raw) To: Yang Shi Cc: Peter Zijlstra, Nadav Amit, Matthew Wilcox, ldufour, Andrew Morton, Ingo Molnar, acme, alexander.shishkin, jolsa, namhyung, open list:MEMORY MANAGEMENT, linux-kernel On Thu 28-06-18 12:10:10, Yang Shi wrote: > > > On 6/28/18 4:51 AM, Michal Hocko wrote: > > On Wed 27-06-18 10:23:39, Yang Shi wrote: > > > > > > On 6/27/18 12:24 AM, Michal Hocko wrote: > > > > On Tue 26-06-18 18:03:34, Yang Shi wrote: > > > > > On 6/26/18 12:43 AM, Peter Zijlstra wrote: > > > > > > On Mon, Jun 25, 2018 at 05:06:23PM -0700, Yang Shi wrote: > > > > > > > By looking this deeper, we may not be able to cover all the unmapping range > > > > > > > for VM_DEAD, for example, if the start addr is in the middle of a vma. We > > > > > > > can't set VM_DEAD to that vma since that would trigger SIGSEGV for still > > > > > > > mapped area. > > > > > > > > > > > > > > splitting can't be done with read mmap_sem held, so maybe just set VM_DEAD > > > > > > > to non-overlapped vmas. Access to overlapped vmas (first and last) will > > > > > > > still have undefined behavior. > > > > > > Acquire mmap_sem for writing, split, mark VM_DEAD, drop mmap_sem. Acquire > > > > > > mmap_sem for reading, madv_free drop mmap_sem. Acquire mmap_sem for > > > > > > writing, free everything left, drop mmap_sem. > > > > > > > > > > > > ? > > > > > > > > > > > > Sure, you acquire the lock 3 times, but both write instances should be > > > > > > 'short', and I suppose you can do a demote between 1 and 2 if you care. > > > > > Thanks, Peter. Yes, by looking the code and trying two different approaches, > > > > > it looks this approach is the most straight-forward one. > > > > Yes, you just have to be careful about the max vma count limit. 
> > > Yes, we should just need to copy what do_munmap does as below: > > > > > > if (end < vma->vm_end && mm->map_count >= sysctl_max_map_count) > > > return -ENOMEM; > > > > > > If the max map count limit has been reached, it will return failure before > > > zapping mappings. > > Yeah, but as soon as you drop the lock and retake it, somebody might > > have changed the address space and we might get inconsistency. > > > > So I am wondering whether we really need upgrade_read (to promote read > > to write lock) and do the > > down_write > > split & set up VM_DEAD > > downgrade_write > > unmap > > upgrade_read > > zap ptes > > up_write > > I suppose address space changes can just be done by mmap, mremap, > mprotect. If so, we may utilize the new VM_DEAD flag. If the VM_DEAD flag is > set for the vma, just return failure since it is being unmapped. I am sorry, I do not follow. How does the VM_DEAD flag help for completely unrelated vmas? Or maybe it would be better to post the code to see what you mean exactly. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [RFC v2 PATCH 2/2] mm: mmap: zap pages with read mmap_sem for large mapping 2018-06-29 11:34 ` Michal Hocko @ 2018-06-29 16:45 ` Yang Shi -1 siblings, 0 replies; 39+ messages in thread From: Yang Shi @ 2018-06-29 16:45 UTC (permalink / raw) To: Michal Hocko Cc: Peter Zijlstra, Nadav Amit, Matthew Wilcox, ldufour, Andrew Morton, Ingo Molnar, acme, alexander.shishkin, jolsa, namhyung, open list:MEMORY MANAGEMENT, linux-kernel On 6/29/18 4:34 AM, Michal Hocko wrote: > On Thu 28-06-18 12:10:10, Yang Shi wrote: >> >> On 6/28/18 4:51 AM, Michal Hocko wrote: >>> On Wed 27-06-18 10:23:39, Yang Shi wrote: >>>> On 6/27/18 12:24 AM, Michal Hocko wrote: >>>>> On Tue 26-06-18 18:03:34, Yang Shi wrote: >>>>>> On 6/26/18 12:43 AM, Peter Zijlstra wrote: >>>>>>> On Mon, Jun 25, 2018 at 05:06:23PM -0700, Yang Shi wrote: >>>>>>>> By looking this deeper, we may not be able to cover all the unmapping range >>>>>>>> for VM_DEAD, for example, if the start addr is in the middle of a vma. We >>>>>>>> can't set VM_DEAD to that vma since that would trigger SIGSEGV for still >>>>>>>> mapped area. >>>>>>>> >>>>>>>> splitting can't be done with read mmap_sem held, so maybe just set VM_DEAD >>>>>>>> to non-overlapped vmas. Access to overlapped vmas (first and last) will >>>>>>>> still have undefined behavior. >>>>>>> Acquire mmap_sem for writing, split, mark VM_DEAD, drop mmap_sem. Acquire >>>>>>> mmap_sem for reading, madv_free drop mmap_sem. Acquire mmap_sem for >>>>>>> writing, free everything left, drop mmap_sem. >>>>>>> >>>>>>> ? >>>>>>> >>>>>>> Sure, you acquire the lock 3 times, but both write instances should be >>>>>>> 'short', and I suppose you can do a demote between 1 and 2 if you care. >>>>>> Thanks, Peter. Yes, by looking the code and trying two different approaches, >>>>>> it looks this approach is the most straight-forward one. >>>>> Yes, you just have to be careful about the max vma count limit. 
>>>> Yes, we should just need to copy what do_munmap does as below: >>>> >>>> if (end < vma->vm_end && mm->map_count >= sysctl_max_map_count) >>>> return -ENOMEM; >>>> >>>> If the max map count limit has been reached, it will return failure before >>>> zapping mappings. >>> Yeah, but as soon as you drop the lock and retake it, somebody might >>> have changed the address space and we might get inconsistency. >>> >>> So I am wondering whether we really need upgrade_read (to promote read >>> to write lock) and do the >>> down_write >>> split & set up VM_DEAD >>> downgrade_write >>> unmap >>> upgrade_read >>> zap ptes >>> up_write >> I suppose address space changes can just be done by mmap, mremap, >> mprotect. If so, we may utilize the new VM_DEAD flag. If the VM_DEAD flag is >> set for the vma, just return failure since it is being unmapped. > I am sorry, I do not follow. How does the VM_DEAD flag help for completely > unrelated vmas? Or maybe it would be better to post the code to see what > you mean exactly. I mean we just care about the vmas which have been found/split by munmap, right? We already set VM_DEAD on them. Even if those other vmas are changed by somebody else, it would not cause any inconsistency for this munmap call. ^ permalink raw reply [flat|nested] 39+ messages in thread
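If VM_DEAD were confined to the page fault path, as the discussion converges on, the check might look something like the sketch below. VM_DEAD is the proposed flag and not an upstream one, and its exact placement in the fault handler is an assumption for illustration.

```c
/*
 * Hypothetical early check in the fault path; VM_DEAD is the
 * proposed (not upstream) flag set on vmas being unmapped.
 */
if (unlikely(vma->vm_flags & VM_DEAD))
	return VM_FAULT_SIGSEGV;	/* the area is being unmapped */
```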
end of thread, other threads:[~2018-06-29 16:50 UTC | newest] Thread overview: 39+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2018-06-18 23:34 [RFC v2 0/2] mm: zap pages with read mmap_sem in munmap for large mapping Yang Shi 2018-06-18 23:34 ` Yang Shi 2018-06-18 23:34 ` [RFC v2 PATCH 1/2] uprobes: make vma_has_uprobes non-static Yang Shi 2018-06-18 23:34 ` [RFC v2 PATCH 2/2] mm: mmap: zap pages with read mmap_sem for large mapping Yang Shi 2018-06-19 10:02 ` Peter Zijlstra 2018-06-19 21:13 ` Yang Shi 2018-06-20 7:17 ` Michal Hocko 2018-06-20 16:23 ` Yang Shi 2018-06-19 22:17 ` Nadav Amit 2018-06-19 23:08 ` Yang Shi 2018-06-20 0:31 ` Nadav Amit 2018-06-20 7:18 ` Michal Hocko 2018-06-20 7:18 ` Michal Hocko 2018-06-20 17:12 ` Nadav Amit 2018-06-20 18:42 ` Yang Shi 2018-06-20 18:42 ` Yang Shi 2018-06-23 1:01 ` Yang Shi 2018-06-25 9:14 ` Michal Hocko 2018-06-26 0:06 ` Yang Shi 2018-06-26 0:06 ` Yang Shi 2018-06-26 7:43 ` Peter Zijlstra 2018-06-27 1:03 ` Yang Shi 2018-06-27 7:24 ` Michal Hocko 2018-06-27 17:23 ` Yang Shi 2018-06-27 17:23 ` Yang Shi 2018-06-28 11:51 ` Michal Hocko 2018-06-28 11:51 ` Michal Hocko 2018-06-28 19:10 ` Yang Shi 2018-06-28 19:10 ` Yang Shi 2018-06-29 0:59 ` Yang Shi 2018-06-29 0:59 ` Yang Shi 2018-06-29 11:39 ` Michal Hocko 2018-06-29 11:39 ` Michal Hocko 2018-06-29 16:50 ` Yang Shi 2018-06-29 16:50 ` Yang Shi 2018-06-29 11:34 ` Michal Hocko 2018-06-29 11:34 ` Michal Hocko 2018-06-29 16:45 ` Yang Shi 2018-06-29 16:45 ` Yang Shi