From: Naoya Horiguchi
To: Mike Kravetz
Cc: "linux-mm@kvack.org", "linux-kernel@vger.kernel.org", Andrew Morton, Michal Hocko, Hugh Dickins, "Aneesh Kumar K . V", Andrea Arcangeli, "Kirill A . Shutemov", Davidlohr Bueso, Prakash Sangappa
Subject: Re: [PATCH RFC v2 1/1] hugetlbfs: use i_mmap_rwsem for pmd sharing and truncate/fault sync
Date: Fri, 26 Oct 2018 00:42:20 +0000
Message-ID: <20181026004220.GA8637@hori1.linux.bs1.fc.nec.co.jp>
References: <20181024045053.1467-1-mike.kravetz@oracle.com> <20181024045053.1467-2-mike.kravetz@oracle.com>
In-Reply-To: <20181024045053.1467-2-mike.kravetz@oracle.com>

Hi Mike,

On Tue, Oct 23, 2018 at 09:50:53PM -0700, Mike Kravetz wrote:
> hugetlbfs does not correctly handle page faults racing with truncation.
> In addition, shared pmds can cause additional issues.
>
> Without pmd sharing, issues can occur as follows:
> A huegtlbfs file is mmap(MAP_SHARED) with a size of 4 pages. At
> mmap time, 4 huge pages are reserved for the file/mapping. So,
> the global reserve count is 4. In addition, since this is a shared
> mapping an entry for 4 pages is added to the file's reserve map.
> The first 3 of the 4 pages are faulted into the file. As a result,
> the global reserve count is now 1.
>
> Task A starts to fault in the last page (routines hugetlb_fault,
> hugetlb_no_page). It allocates a huge page (alloc_huge_page).
> The reserve map indicates there is a reserved page, so this is
> used and the global reserve count goes to 0.
>
> Now, task B truncates the file to size 0. It starts by setting
> inode size to 0(hugetlb_vmtruncate). It then unmaps all mapping
> of the file (hugetlb_vmdelete_list). Since task A's page table
> lock is not held at the time, truncation is not blocked. Truncation
> removes the 3 pages from the file (remove_inode_hugepages). When
> cleaning up the reserved pages (hugetlb_unreserve_pages), it notices
> the reserve map was for 4 pages. However, it has only freed 3 pages.
> So it assumes there is still (4 - 3) 1 reserved pages. It then
> decrements the global reserve count by 1 and it goes negative.
>
> Task A then continues the page fault process and adds it's newly
> acquired page to the page cache. Note that the index of this page
> is beyond the size of the truncated file (0). The page fault process
> then notices the file has been truncated and exits. However, the
> page is left in the cache associated with the file.
>
> Now, if the file is immediately deleted the truncate code runs again.
> It will find and free the one page associated with the file. When
> cleaning up reserves, it notices the reserve map is empty. Yet, one
> page freed. So, the global reserve count is decremented by (0 - 1) -1.
> This returns the global count to 0 as it should be. But, it is
> possible for someone else to mmap this file/range before it is deleted.
> If this happens, a reserve map entry for the allocated page is created
> and the reserved page is forever leaked.
>
> With pmd sharing, the situation is even worse. Consider the following:
> A task processes a page fault on a shared hugetlbfs file and calls
> huge_pte_alloc to get a ptep. Suppose the returned ptep points to a
> shared pmd.
>
> Now, anopther task truncates the hugetlbfs file. As part of truncation,
> it unmaps everyone who has the file mapped. If a task has a shared pmd
> in this range, huge_pmd_unshhare will be called. If this is not the last

(sorry, nitpicking ..) a couple of typos here ("anopther" and "unshhare").

> user sharing the pmd, huge_pmd_unshare will clear pud pointing to the
> pmd. For the task in the middle of the page fault, the ptep returned by
> huge_pte_alloc points to another task's page table or worse. This leads
> to bad things such as incorrect page map/reference counts or invalid
> memory references.
>
> i_mmap_rwsem is currently used for pmd sharing synchronization. It is also
> held during unmap and whenever a call to huge_pmd_unshare is possible. It
> is only acquired in write mode. Expand and modify the use of i_mmap_rwsem
> as follows:
> - i_mmap_rwsem is held in write mode for the duration of truncate
>   processing.
> - i_mmap_rwsem is held in write mode whenever huge_pmd_share is called.

I guess you mean huge_pmd_unshare here, right?
> Today that is only via huge_pte_alloc. > - i_mmap_rwsem is held in read mode after huge_pte_alloc, until the caller > is finished with the returned ptep. > > Signed-off-by: Mike Kravetz > --- > fs/hugetlbfs/inode.c | 21 ++++++++++---- > mm/hugetlb.c | 65 +++++++++++++++++++++++++++++++++----------- > mm/rmap.c | 10 +++++++ > mm/userfaultfd.c | 11 ++++++-- > 4 files changed, 84 insertions(+), 23 deletions(-) > > diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c > index 32920a10100e..6ee97622a231 100644 > --- a/fs/hugetlbfs/inode.c > +++ b/fs/hugetlbfs/inode.c > @@ -426,10 +426,16 @@ static void remove_inode_hugepages(struct inode *inode, loff_t lstart, > u32 hash; > > index = page->index; > - hash = hugetlb_fault_mutex_hash(h, current->mm, > + /* > + * No need to take fault mutex for truncation as we > + * are synchronized via i_mmap_rwsem. > + */ > + if (!truncate_op) { > + hash = hugetlb_fault_mutex_hash(h, current->mm, > &pseudo_vma, > mapping, index, 0); > - mutex_lock(&hugetlb_fault_mutex_table[hash]); > + mutex_lock(&hugetlb_fault_mutex_table[hash]); > + } > > /* > * If page is mapped, it was faulted in after being > @@ -470,7 +476,8 @@ static void remove_inode_hugepages(struct inode *inode, loff_t lstart, > } > > unlock_page(page); > - mutex_unlock(&hugetlb_fault_mutex_table[hash]); > + if (!truncate_op) > + mutex_unlock(&hugetlb_fault_mutex_table[hash]); > } > huge_pagevec_release(&pvec); > cond_resched(); > @@ -505,8 +512,8 @@ static int hugetlb_vmtruncate(struct inode *inode, loff_t offset) > i_mmap_lock_write(mapping); > if (!RB_EMPTY_ROOT(&mapping->i_mmap.rb_root)) > hugetlb_vmdelete_list(&mapping->i_mmap, pgoff, 0); > - i_mmap_unlock_write(mapping); > remove_inode_hugepages(inode, offset, LLONG_MAX); > + i_mmap_unlock_write(mapping); I just have an impression that hugetlbfs_punch_hole() could have the similar race and extending lock range there could be an improvement, although I might miss something as always. > return 0; > } > > @@ -624,7 +631,11 @@ static long hugetlbfs_fallocate(struct file *file, int mode, loff_t offset, > /* addr is the offset within the file (zero based) */ > addr = index * hpage_size; > > - /* mutex taken here, fault path and hole punch */ > + /* > + * fault mutex taken here, protects against fault path > + * and hole punch. inode_lock previously taken protects > + * against truncation. > + */ > hash = hugetlb_fault_mutex_hash(h, mm, &pseudo_vma, mapping, > index, addr); > mutex_lock(&hugetlb_fault_mutex_table[hash]); > diff --git a/mm/hugetlb.c b/mm/hugetlb.c > index 7b5c0ad9a6bd..e9da3eee262f 100644 > --- a/mm/hugetlb.c > +++ b/mm/hugetlb.c > @@ -3252,18 +3252,33 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src, > > for (addr = vma->vm_start; addr < vma->vm_end; addr += sz) { > spinlock_t *src_ptl, *dst_ptl; > + struct vm_area_struct *dst_vma; > + struct address_space *mapping; > + > src_pte = huge_pte_offset(src, addr, sz); > if (!src_pte) > continue; > + > + /* > + * i_mmap_rwsem must be held to call huge_pte_alloc. > + * Continue to hold until finished with dst_pte, otherwise > + * it could go away if part of a shared pmd. > + */ > + dst_vma = find_vma(dst, addr); > + mapping = dst_vma->vm_file->f_mapping; If vma->vm_file->f_mapping gives the same mapping, you may omit the find_vma()? 
> +               i_mmap_lock_read(mapping);
>                 dst_pte = huge_pte_alloc(dst, addr, sz);
>                 if (!dst_pte) {
> +                       i_mmap_unlock_read(mapping);
>                         ret = -ENOMEM;
>                         break;
>                 }
>
>                 /* If the pagetables are shared don't copy or take references */
> -               if (dst_pte == src_pte)
> +               if (dst_pte == src_pte) {
> +                       i_mmap_unlock_read(mapping);
>                         continue;
> +               }
>
>                 dst_ptl = huge_pte_lock(h, dst, dst_pte);
>                 src_ptl = huge_pte_lockptr(h, src, src_pte);

[...]

> diff --git a/mm/rmap.c b/mm/rmap.c
> index 1e79fac3186b..db49e734dda8 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1347,6 +1347,7 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>         bool ret = true;
>         unsigned long start = address, end;
>         enum ttu_flags flags = (enum ttu_flags)arg;
> +       bool pmd_sharing_possible = false;
>
>         /* munlock has nothing to gain from examining un-locked vmas */
>         if ((flags & TTU_MUNLOCK) && !(vma->vm_flags & VM_LOCKED))
> @@ -1376,8 +1377,15 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>                  * accordingly.
>                  */
>                 adjust_range_if_pmd_sharing_possible(vma, &start, &end);
> +               if ((end - start) > (PAGE_SIZE << compound_order(page)))
> +                       pmd_sharing_possible = true;

A similar check is presumably already done inside
adjust_range_if_pmd_sharing_possible(), as the function name suggests, so
would it make more sense to get this bool value via the return value?
(A rough sketch of what I mean is in the P.S. below.)

Thanks,
Naoya Horiguchi
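P.S. To illustrate the return-value idea, here is a rough sketch (untested,
and assuming the other callers of adjust_range_if_pmd_sharing_possible() can
simply ignore the new return value):

        /*
         * Sketch: report whether the range was actually expanded for pmd
         * sharing, so the caller does not have to re-check the range size.
         */
        bool adjust_range_if_pmd_sharing_possible(struct vm_area_struct *vma,
                                                  unsigned long *start,
                                                  unsigned long *end);

and then in try_to_unmap_one():

        pmd_sharing_possible =
                adjust_range_if_pmd_sharing_possible(vma, &start, &end);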