From: Yang Shi <yang.shi@linux.alibaba.com>
To: akpm@linux-foundation.org
Cc: yang.shi@linux.alibaba.com, linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: [RFC PATCH 0/8] Drop mmap_sem during unmapping large map
Date: Wed, 21 Mar 2018 05:31:18 +0800
Message-Id: <1521581486-99134-1-git-send-email-yang.shi@linux.alibaba.com>

Background:

Recently, when we ran some vm scalability tests on machines with large
memory, we ran into a couple of mmap_sem scalability issues when
unmapping a large memory space; please refer to
https://lkml.org/lkml/2017/12/14/733 and
https://lkml.org/lkml/2018/2/20/576.

akpm then suggested unmapping a large mapping section by section,
dropping mmap_sem between sections, to mitigate the issue (see
https://lkml.org/lkml/2018/3/6/784). This patch series aims to solve
the mmap_sem issue by adopting that suggestion.

Approach:

A couple of approaches were explored.

#1. Unmap a large map section by section in vm_munmap(). It works, but
only sys_munmap() can benefit from the change.

#2. Do the unmapping deeper in the call chain, i.e. in zap_pmd_range().
This way I don't have to define a magic size for unmapping, but there
are two major issues:

    * mmap_sem may have been acquired by either down_write() or
      down_read() on the possible call paths, so every call path has
      to be checked to determine which variant to use. This increases
      the complexity significantly.

    * The below race condition might be introduced:

             CPU A                     CPU B
           ----------                ----------
           do_munmap
           zap_pmd_range
           up_write                  do_munmap
                                     down_write
                                     ......
                                     remove_vma_list
                                     up_write
           down_write
           access vmas  <-- use-after-free bug

    In addition, unmapping by section requires splitting the vma, so
    the code has to deal with a partially unmapped vma, which also
    increases the complexity significantly.

#3. Do it in do_munmap(). The split vma / unmap region / free page
tables / free vmas sequence can be kept atomic for every section. And
not only sys_munmap() can benefit, but also mremap and SysV shm. The
only problem is that some call paths may not want mmap_sem to be
dropped, so an extra parameter, "atomic", is introduced to
do_munmap(). The caller passes it to tell do_munmap() whether dropping
mmap_sem is acceptable: "true" means do not drop it, "false" means it
may be dropped. Since all do_munmap() callers acquire mmap_sem with
down_write(), only that one variant needs to be dealt with. And when
re-acquiring mmap_sem, plain down_write() is used for now, since
dealing with the return value of down_write_killable() sounds
unnecessary. A sketch of the resulting shape is shown below.
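(The following is an illustrative sketch only, not the actual patch:
do_munmap_range() is the per-section helper introduced in patch 1/8,
but its exact signature and the loop structure below are my
assumptions.)

    int do_munmap(struct mm_struct *mm, unsigned long start, size_t len,
                  struct list_head *uf, bool atomic)
    {
            unsigned long end = start + PAGE_ALIGN(len);

            /*
             * Callers that cannot tolerate losing mmap_sem, and small
             * maps, are handled in one shot with the lock held
             * throughout.
             */
            if (atomic || end - start <= HPAGE_PUD_SIZE)
                    return do_munmap_range(mm, start, end - start, uf);

            while (start < end) {
                    unsigned long size = min(end - start,
                                             (unsigned long)HPAGE_PUD_SIZE);
                    int ret;

                    /*
                     * The split vma / unmap region / free page tables /
                     * free vmas sequence stays atomic for this section.
                     */
                    ret = do_munmap_range(mm, start, size, uf);
                    if (ret)
                            return ret;
                    start += size;

                    if (start < end) {
                            /*
                             * Drop mmap_sem between sections so other
                             * threads can make progress, then re-take
                             * it. Plain down_write() is used here;
                             * handling the error return of
                             * down_write_killable() seems unnecessary.
                             */
                            up_write(&mm->mmap_sem);
                            down_write(&mm->mmap_sem);
                    }
            }
            return 0;
    }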
Besides the above, a magic section size has to be defined explicitly;
HPAGE_PUD_SIZE is used for now. According to my tests, HPAGE_PUD_SIZE
is good enough, while smaller sizes look to have too much overhead.
This is also why down_write() rather than down_write_killable() is
used when re-acquiring mmap_sem.

Regression and performance data:

The tests were run on a machine with 32 cores of E5-2680 @ 2.70GHz and
384GB memory. The full LTP suite was run with no regressions.

Measurement of SyS_munmap() execution time:

    size      pristine        patched         delta
    80GB      5008377 us      4905841 us      -2%
    160GB     9129243 us      9145306 us      +0.18%
    320GB     17915310 us     17990174 us     +0.42%

Throughput of page faults (#/s) with vm-scalability:

                         pristine    patched    delta
    mmap-pread-seq       554894      563517     +1.6%
    mmap-pread-seq-mt    581232      580772     -0.079%
    mmap-xread-seq-mt    99182       105400     +6.3%

Throughput of page faults (#/s) with the below stress-ng test:
stress-ng --mmap 0 --mmap-bytes 80G --mmap-file --metrics --perf --timeout 600s

    pristine    patched    delta
    100165      108396     +8.2%

There are 8 patches in this series:

1/8:       introduce the "atomic" parameter, define do_munmap_range(),
           and modify do_munmap() to call do_munmap_range() to unmap
           memory section by section
2/8 - 6/8: modify the do_munmap() call sites in mm/mmap.c, mm/mremap.c,
           fs/proc/vmcore.c, ipc/shm.c and mm/nommu.c to adopt the
           "atomic" parameter (see the call-site sketch after the
           diffstat)
7/8 - 8/8: modify the do_munmap() call sites in arch/x86 to adopt the
           "atomic" parameter

Yang Shi (8):
      mm: mmap: unmap large mapping by section
      mm: mmap: pass atomic parameter to do_munmap() call sites
      mm: mremap: pass atomic parameter to do_munmap()
      mm: nommu: add atomic parameter to do_munmap()
      ipc: shm: pass atomic parameter to do_munmap()
      fs: proc/vmcore: pass atomic parameter to do_munmap()
      x86: mpx: pass atomic parameter to do_munmap()
      x86: vma: pass atomic parameter to do_munmap()

 arch/x86/entry/vdso/vma.c |  2 +-
 arch/x86/mm/mpx.c         |  2 +-
 fs/proc/vmcore.c          |  4 ++--
 include/linux/mm.h        |  2 +-
 ipc/shm.c                 |  9 ++++++---
 mm/mmap.c                 | 48 ++++++++++++++++++++++++++++++++++++++++++------
 mm/mremap.c               | 10 ++++++----
 mm/nommu.c                |  5 +++--
 8 files changed, 62 insertions(+), 20 deletions(-)
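For illustration, below is a sketch of what the call-site conversion
in patches 2/8 - 8/8 could look like, using vm_munmap() as the
example. This is not the actual diff: the body follows the current
upstream vm_munmap(), and only the trailing "atomic" argument is new.
Call sites that touch vmas after do_munmap() returns, or that
otherwise rely on mmap_sem being held continuously, would pass "true"
instead.

    int vm_munmap(unsigned long start, size_t len)
    {
            int ret;
            struct mm_struct *mm = current->mm;
            LIST_HEAD(uf);

            if (down_write_killable(&mm->mmap_sem))
                    return -EINTR;

            /*
             * Not atomic: do_munmap() may drop and re-take mmap_sem
             * while working through a large range section by section.
             */
            ret = do_munmap(mm, start, len, &uf, false);
            up_write(&mm->mmap_sem);
            userfaultfd_unmap_complete(mm, &uf);
            return ret;
    }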