From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-2.1 required=3.0 tests=DKIM_SIGNED, HEADER_FROM_DIFFERENT_DOMAINS,MAILING_LIST_MULTI,SPF_PASS,T_DKIM_INVALID, URIBL_BLOCKED,USER_AGENT_MUTT autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 0F3E0C04ABB for ; Wed, 12 Sep 2018 02:29:29 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.kernel.org (Postfix) with ESMTP id A300B20839 for ; Wed, 12 Sep 2018 02:29:28 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=fail reason="signature verification failed" (2048-bit key) header.d=infradead.org header.i=@infradead.org header.b="gjtStW5B" DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org A300B20839 Authentication-Results: mail.kernel.org; dmarc=none (p=none dis=none) header.from=infradead.org Authentication-Results: mail.kernel.org; spf=none smtp.mailfrom=linux-kernel-owner@vger.kernel.org Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1728013AbeILHbh (ORCPT ); Wed, 12 Sep 2018 03:31:37 -0400 Received: from bombadil.infradead.org ([198.137.202.133]:45926 "EHLO bombadil.infradead.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1727675AbeILHbg (ORCPT ); Wed, 12 Sep 2018 03:31:36 -0400 DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=infradead.org; s=bombadil.20170209; h=In-Reply-To:Content-Type:MIME-Version :References:Message-ID:Subject:Cc:To:From:Date:Sender:Reply-To: Content-Transfer-Encoding:Content-ID:Content-Description:Resent-Date: Resent-From:Resent-Sender:Resent-To:Resent-Cc:Resent-Message-ID:List-Id: List-Help:List-Unsubscribe:List-Subscribe:List-Post:List-Owner:List-Archive; bh=/PXWAlpPkrm0cmrq+uAPYgjYOXRN5d+7iymz7QCJe2A=; b=gjtStW5BjqwwUInXQViq1+mNR 9PU0xY67qjUuxEDPhYyLXyvFVi0R9VmtzbgSF9fqknz7RmhHmKENMlmPUwCp9BAYC4KThE7TGb5d2 fXhvbzvVF/ljq6LAyfcJ3MoeTa0fdlInWpPJx2zjp0gH2+fCAFP5dRm+FxdZrdyetW2jrIahf8o/G 0Kddto/kKLcz+aDJiFz+0FiL2qXE6K7T9890yROTVVKhrihLVaUBUkeCfaO4/qSB+SSlQT1idxM8a IoPF38BtWf7m/xRCCoZQPCcY2M055qkxHDv0itrhl+iO0y5qL/WOI35bmn9EFJCUPZSFhVU7ktLrC PNWqNmTRw==; Received: from willy by bombadil.infradead.org with local (Exim 4.90_1 #2 (Red Hat Linux)) id 1fzuuA-0001v9-10; Wed, 12 Sep 2018 02:29:22 +0000 Date: Tue, 11 Sep 2018 19:29:21 -0700 From: Matthew Wilcox To: Yang Shi Cc: mhocko@kernel.org, ldufour@linux.vnet.ibm.com, vbabka@suse.cz, akpm@linux-foundation.org, dave.hansen@intel.com, oleg@redhat.com, srikar@linux.vnet.ibm.com, linux-mm@kvack.org, linux-kernel@vger.kernel.org Subject: Re: [RFC v9 PATCH 2/4] mm: mmap: zap pages with read mmap_sem in munmap Message-ID: <20180912022921.GA20056@bombadil.infradead.org> References: <1536699493-69195-1-git-send-email-yang.shi@linux.alibaba.com> <1536699493-69195-3-git-send-email-yang.shi@linux.alibaba.com> <20180911211645.GA12159@bombadil.infradead.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.9.2 (2017-12-15) Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue, Sep 11, 2018 at 04:35:03PM -0700, Yang Shi wrote: > On 9/11/18 2:16 PM, Matthew Wilcox wrote: > > On Wed, Sep 12, 2018 at 04:58:11AM +0800, Yang Shi wrote: > > > mm/mmap.c | 97 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-- > > I really think you're going about this the wrong way by duplicating > > vm_munmap(). > > If we don't duplicate vm_munmap() or do_munmap(), we need pass an extra > parameter to them to tell when it is fine to downgrade write lock or if the > lock has been acquired outside it (i.e. in mmap()/mremap()), right? But, > vm_munmap() or do_munmap() is called not only by mmap-related, but also some > other places, like arch-specific places, which don't need downgrade write > lock or are not safe to do so. > > Actually, I did this way in the v1 patches, but it got pushed back by tglx > who suggested duplicate the code so that the change could be done in mm only > without touching other files, i.e. arch-specific stuff. I didn't have strong > argument to convince him. With my patch, there is nothing to change in arch-specific code. Here it is again ... diff --git a/mm/mmap.c b/mm/mmap.c index de699523c0b7..06dc31d1da8c 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -2798,11 +2798,11 @@ int split_vma(struct mm_struct *mm, struct vm_area_struct *vma, * work. This now handles partial unmappings. * Jeremy Fitzhardinge */ -int do_munmap(struct mm_struct *mm, unsigned long start, size_t len, - struct list_head *uf) +static int __do_munmap(struct mm_struct *mm, unsigned long start, size_t len, + struct list_head *uf, bool downgrade) { unsigned long end; - struct vm_area_struct *vma, *prev, *last; + struct vm_area_struct *vma, *prev, *last, *tmp; if ((offset_in_page(start)) || start > TASK_SIZE || len > TASK_SIZE-start) return -EINVAL; @@ -2816,7 +2816,7 @@ int do_munmap(struct mm_struct *mm, unsigned long start, size_t len, if (!vma) return 0; prev = vma->vm_prev; - /* we have start < vma->vm_end */ + /* we have start < vma->vm_end */ /* if it doesn't overlap, we have nothing.. */ end = start + len; @@ -2873,18 +2873,22 @@ int do_munmap(struct mm_struct *mm, unsigned long start, size_t len, /* * unlock any mlock()ed ranges before detaching vmas + * and check to see if there's any reason we might have to hold + * the mmap_sem write-locked while unmapping regions. */ - if (mm->locked_vm) { - struct vm_area_struct *tmp = vma; - while (tmp && tmp->vm_start < end) { - if (tmp->vm_flags & VM_LOCKED) { - mm->locked_vm -= vma_pages(tmp); - munlock_vma_pages_all(tmp); - } - tmp = tmp->vm_next; + for (tmp = vma; tmp && tmp->vm_start < end; tmp = tmp->vm_next) { + if (tmp->vm_flags & VM_LOCKED) { + mm->locked_vm -= vma_pages(tmp); + munlock_vma_pages_all(tmp); } + if (tmp->vm_file && + has_uprobes(tmp, tmp->vm_start, tmp->vm_end)) + downgrade = false; } + if (downgrade) + downgrade_write(&mm->mmap_sem); + /* * Remove the vma's, and unmap the actual pages */ @@ -2896,7 +2900,13 @@ int do_munmap(struct mm_struct *mm, unsigned long start, size_t len, /* Fix up all other VM information */ remove_vma_list(mm, vma); - return 0; + return downgrade ? 1 : 0; +} + +int do_unmap(struct mm_struct *mm, unsigned long start, size_t len, + struct list_head *uf) +{ + return __do_munmap(mm, start, len, uf, false); } int vm_munmap(unsigned long start, size_t len) @@ -2905,11 +2915,12 @@ int vm_munmap(unsigned long start, size_t len) struct mm_struct *mm = current->mm; LIST_HEAD(uf); - if (down_write_killable(&mm->mmap_sem)) - return -EINTR; - - ret = do_munmap(mm, start, len, &uf); - up_write(&mm->mmap_sem); + down_write(&mm->mmap_sem); + ret = __do_munmap(mm, start, len, &uf, true); + if (ret == 1) + up_read(&mm->mmap_sem); + else + up_write(&mm->mmap_sem); userfaultfd_unmap_complete(mm, &uf); return ret; } Anybody calling do_munmap() will not get the lock dropped. > And, Michal prefers have VM_HUGETLB and VM_PFNMAP handled separately for > safe and bisectable sake, which needs call the regular do_munmap(). That can be introduced and then taken out ... indeed, you can split this into many patches, starting with this: + if (tmp->vm_file) + downgrade = false; to only allow this optimisation for anonymous mappings at first. > In addition to this, I just found mpx code may call do_munmap() recursively > when I was looking into the mpx code. > > We might be able to handle these by the extra parameter, but it sounds it > make the code hard to understand and error prone. Only if you make the extra parameter mandatory.