From: Mike Kravetz <mike.kravetz@oracle.com>
To: Miaohe Lin <linmiaohe@huawei.com>
Cc: Muchun Song <songmuchun@bytedance.com>,
	David Hildenbrand <david@redhat.com>,
	Michal Hocko <mhocko@suse.com>, Peter Xu <peterx@redhat.com>,
	Naoya Horiguchi <naoya.horiguchi@linux.dev>,
	"Aneesh Kumar K . V" <aneesh.kumar@linux.vnet.ibm.com>,
	Andrea Arcangeli <aarcange@redhat.com>,
	"Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>,
	Davidlohr Bueso <dave@stgolabs.net>,
	Prakash Sangappa <prakash.sangappa@oracle.com>,
	James Houghton <jthoughton@google.com>,
	Mina Almasry <almasrymina@google.com>,
	Pasha Tatashin <pasha.tatashin@soleen.com>,
	Axel Rasmussen <axelrasmussen@google.com>,
	Ray Fucillo <Ray.Fucillo@intersystems.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH 8/8] hugetlb: use new vma_lock for pmd sharing synchronization
Date: Tue, 13 Sep 2022 17:50:16 -0700	[thread overview]
Message-ID: <YyElSFtbJ32VRLh6@monkey> (raw)
In-Reply-To: <a879ee95-918e-00f7-b04f-e9dba156902c@huawei.com>

On 09/13/22 10:14, Miaohe Lin wrote:
> On 2022/9/13 7:02, Mike Kravetz wrote:
> > On 09/05/22 11:08, Miaohe Lin wrote:
> >> On 2022/9/3 7:07, Mike Kravetz wrote:
> >>> On 08/30/22 10:02, Miaohe Lin wrote:
> >>>> On 2022/8/25 1:57, Mike Kravetz wrote:
> >>>>> The new hugetlb vma lock (rw semaphore) is used to address this race:
> >>>>>
> >>>>> Faulting thread                                 Unsharing thread
> >>>>> ...                                                  ...
> >>>>> ptep = huge_pte_offset()
> >>>>>       or
> >>>>> ptep = huge_pte_alloc()
> >>>>> ...
> >>>>>                                                 i_mmap_lock_write
> >>>>>                                                 lock page table
> >>>>> ptep invalid   <------------------------        huge_pmd_unshare()
> >>>>> Could be in a previously                        unlock_page_table
> >>>>> sharing process or worse                        i_mmap_unlock_write
> >>>>> ...
> >>>>>
> >>>>> The vma_lock is used as follows:
> >>>>> - During fault processing, the lock is acquired in read mode before
> >>>>>   doing a page table lock and allocation (huge_pte_alloc).  The lock is
> >>>>>   held until code is finished with the page table entry (ptep).
> >>>>> - The lock must be held in write mode whenever huge_pmd_unshare is
> >>>>>   called.
> >>>>>
> >>>>> Lock ordering issues come into play when unmapping a page from all
> >>>>> vmas mapping the page.  The i_mmap_rwsem must be held to search for the
> >>>>> vmas, and the vma lock must be held before calling unmap which will
> >>>>> call huge_pmd_unshare.  This is done today in:
> >>>>> - try_to_migrate_one and try_to_unmap_one for page migration and memory
> >>>>>   error handling.  In these routines we 'try' to obtain the vma lock and
> >>>>>   fail to unmap if unsuccessful.  Calling routines already deal with the
> >>>>>   failure of unmapping.
> >>>>> - hugetlb_vmdelete_list for truncation and hole punch.  This routine
> >>>>>   also tries to acquire the vma lock.  If it fails, it skips the
> >>>>>   unmapping.  However, we cannot have file truncation or hole punch
> >>>>>   fail because of contention.  After hugetlb_vmdelete_list, truncation
> >>>>>   and hole punch call remove_inode_hugepages.  remove_inode_hugepages
> >>>>>   checks for mapped pages and calls hugetlb_unmap_file_folio to unmap
> >>>>>   them.  hugetlb_unmap_file_folio is designed to drop locks and
> >>>>>   reacquire them in the correct order to guarantee unmap success.
> >>>>>
> >>>>> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
> >>>>> ---
> >>>>>  fs/hugetlbfs/inode.c |  46 +++++++++++++++++++
> >>>>>  mm/hugetlb.c         | 102 +++++++++++++++++++++++++++++++++++++++----
> >>>>>  mm/memory.c          |   2 +
> >>>>>  mm/rmap.c            | 100 +++++++++++++++++++++++++++---------------
> >>>>>  mm/userfaultfd.c     |   9 +++-
> >>>>>  5 files changed, 214 insertions(+), 45 deletions(-)
> >>>>>
> >>>>> diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
> >>>>> index b93d131b0cb5..52d9b390389b 100644
> >>>>> --- a/fs/hugetlbfs/inode.c
> >>>>> +++ b/fs/hugetlbfs/inode.c
> >>>>> @@ -434,6 +434,8 @@ static void hugetlb_unmap_file_folio(struct hstate *h,
> >>>>>  					struct folio *folio, pgoff_t index)
> >>>>>  {
> >>>>>  	struct rb_root_cached *root = &mapping->i_mmap;
> >>>>> +	unsigned long skipped_vm_start;
> >>>>> +	struct mm_struct *skipped_mm;
> >>>>>  	struct page *page = &folio->page;
> >>>>>  	struct vm_area_struct *vma;
> >>>>>  	unsigned long v_start;
> >>>>> @@ -444,6 +446,8 @@ static void hugetlb_unmap_file_folio(struct hstate *h,
> >>>>>  	end = ((index + 1) * pages_per_huge_page(h));
> >>>>>  
> >>>>>  	i_mmap_lock_write(mapping);
> >>>>> +retry:
> >>>>> +	skipped_mm = NULL;
> >>>>>  
> >>>>>  	vma_interval_tree_foreach(vma, root, start, end - 1) {
> >>>>>  		v_start = vma_offset_start(vma, start);
> >>>>> @@ -452,11 +456,49 @@ static void hugetlb_unmap_file_folio(struct hstate *h,
> >>>>>  		if (!hugetlb_vma_maps_page(vma, vma->vm_start + v_start, page))
> >>>>>  			continue;
> >>>>>  
> >>>>> +		if (!hugetlb_vma_trylock_write(vma)) {
> >>>>> +			/*
> >>>>> +			 * If we cannot get the vma lock, we need to drop
> >>>>> +			 * i_mmap_rwsem and take the locks in the proper order.
> >>>>> +			 */
> >>>>> +			skipped_vm_start = vma->vm_start;
> >>>>> +			skipped_mm = vma->vm_mm;
> >>>>> +			/* grab the mm_struct as we will be dropping i_mmap_rwsem */
> >>>>> +			mmgrab(skipped_mm);
> >>>>> +			break;
> >>>>> +		}
> >>>>> +
> >>>>>  		unmap_hugepage_range(vma, vma->vm_start + v_start, v_end,
> >>>>>  				NULL, ZAP_FLAG_DROP_MARKER);
> >>>>> +		hugetlb_vma_unlock_write(vma);
> >>>>>  	}
> >>>>>  
> >>>>>  	i_mmap_unlock_write(mapping);
> >>>>> +
> >>>>> +	if (skipped_mm) {
> >>>>> +		mmap_read_lock(skipped_mm);
> >>>>> +		vma = find_vma(skipped_mm, skipped_vm_start);
> >>>>> +		if (!vma || !is_vm_hugetlb_page(vma) ||
> >>>>> +					vma->vm_file->f_mapping != mapping ||
> >>>>> +					vma->vm_start != skipped_vm_start) {
> >>>>
> >>>> i_mmap_lock_write(mapping) is missing here? Retry logic will do i_mmap_unlock_write(mapping) anyway.
> >>>>
> >>>
> >>> Yes, that is missing.  I will add here.
> >>>
> >>>>> +			mmap_read_unlock(skipped_mm);
> >>>>> +			mmdrop(skipped_mm);
> >>>>> +			goto retry;
> >>>>> +		}
> >>>>> +
> >>>>
> >>>> IMHO, above check is not enough. Think about the below scene:
> >>>>
> >>>> CPU 1					CPU 2
> >>>> hugetlb_unmap_file_folio		exit_mmap
> >>>>   mmap_read_lock(skipped_mm);		  mmap_read_lock(mm);
> >>>>   check vma is wanted.
> >>>>   					  unmap_vmas
> >>>>   mmap_read_unlock(skipped_mm);		  mmap_read_unlock
> >>>>   					  mmap_write_lock(mm);
> >>>>   					  free_pgtables
> >>>>   					  remove_vma
> >>>> 					    hugetlb_vma_lock_free
> >>>>   vma, hugetlb_vma_lock is still *used after free*
> >>>>   					  mmap_write_unlock(mm);
> >>>> So we should check mm->mm_users == 0 to fix the above issue. Or am I missing something?
> >>>
> >>> In the retry case, we are OK because we go back and look up the vma again.  Right?
> >>>
> >>> After taking mmap_read_lock, the vma cannot go away until we mmap_read_unlock.
> >>> Before that, we do the following:
> >>>
> >>>>> +		hugetlb_vma_lock_write(vma);
> >>>>> +		i_mmap_lock_write(mapping);
> >>>
> >>> IIUC, the vma cannot go away while we hold i_mmap_lock_write.  So, after this we
> >>
> >> I think you're right. free_pgtables() can't complete its work as unlink_file_vma() will be
> >> blocked on the i_mmap_rwsem of the mapping. Sorry for reporting such a nonexistent race.
> >>
> >>> can.
> >>>
> >>>>> +		mmap_read_unlock(skipped_mm);
> >>>>> +		mmdrop(skipped_mm);
> >>>
> >>> We continue to hold i_mmap_lock_write as we goto retry.
> >>>
> >>> I could be missing something as well.  This was how I intended to keep
> >>> vma valid while dropping and acquiring locks.
> >>
> >> Thanks for your clarifying.
> >>
> > 
> > Well, that was all correct 'in theory' but not in practice.  I did not take
> > into account the inode lock that is taken at the beginning of truncate (or
> > hole punch).  In other code paths, we take inode lock after mmap_lock.  So,
> > taking mmap_lock here is not allowed.
> 
> Considering the Lock ordering in mm/filemap.c:
> 
>  *  ->i_rwsem
>  *    ->invalidate_lock		(acquired by fs in truncate path)
>  *      ->i_mmap_rwsem		(truncate->unmap_mapping_range)
> 
>  *  ->i_rwsem			(generic_perform_write)
>  *    ->mmap_lock		(fault_in_readable->do_page_fault)
> 
> It seems inode_lock is taken before the mmap_lock?

Hmmmm?  I can't find a sequence where inode_lock is taken after mmap_lock.
lockdep was complaining about taking mmap_lock after i_rwsem in the above code.
I assumed there was such a sequence somewhere.  Might need to go back and get
another trace/warning.

In any case, I think the scheme below is much cleaner.  Doing another round of
benchmarking before sending.

> > I came up with another way to make this work.  As discussed above, we need to
> > drop the i_mmap lock before acquiring the vma_lock.  However, once we drop
> > i_mmap, the vma could go away.  My solution is to make the 'vma_lock' be a
> > ref counted structure that can live on after the vma is freed.  Therefore,
> > this code can take a reference while under i_mmap then drop i_mmap and wait
> > on the vma_lock.  Of course, once it acquires the vma_lock it needs to check
> > and make sure the vma still exists.  It may sound complicated, but I think
> > it is a bit simpler than the code here.  A new series will be out soon.
> > 

-- 
Mike Kravetz

