Re: [PATCH] hugetlbfs: Take read_lock on i_mmap for PMD sharing

From: Waiman Long <longman@redhat.com>
To: Mike Kravetz <mike.kravetz@oracle.com>,
	Matthew Wilcox <willy@infradead.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	Peter Zijlstra <peterz@infradead.org>,
	Ingo Molnar <mingo@redhat.com>, Will Deacon <will.deacon@arm.com>
Subject: Re: [PATCH] hugetlbfs: Take read_lock on i_mmap for PMD sharing
Date: Tue, 12 Nov 2019 12:27:40 -0500
Message-ID: <fd29a337-c067-ebf6-4be2-3b6e2f703ac4@redhat.com>
In-Reply-To: <5059733e-95aa-2c9e-6f5d-4f45f6a130b3@oracle.com>

On 11/8/19 8:47 PM, Mike Kravetz wrote:
> On 11/8/19 11:10 AM, Mike Kravetz wrote:
>> On 11/7/19 6:04 PM, Davidlohr Bueso wrote:
>>> On Thu, 07 Nov 2019, Mike Kravetz wrote:
>>>
>>>> Note that huge_pmd_share now increments the page count with the semaphore
>>>> held just in read mode.  It is OK to do increments in parallel without
>>>> synchronization.  However, we don't want anyone else changing the count
>>>> while that check in huge_pmd_unshare is happening.  Hence, the need for
>>>> taking the semaphore in write mode.
>>> This would be a nice addition to the changelog methinks.
>> Last night I remembered there is one place where we currently take
>> i_mmap_rwsem in read mode and potentially call huge_pmd_unshare.  That
>> is in try_to_unmap_one.  Yes, there is a potential race here today.
> Actually there is no race there today.  Callers to huge_pmd_unshare
> hold the page table lock.  So, this synchronizes those unshare calls
> from page migration and page poisoning.
>
>> But that race is somewhat contained as you need two threads doing some
>> combination of page migration and page poisoning to race.  This change
>> now allows migration or poisoning to race with page fault.  I would
>> really prefer if we do not open up the race window in this manner.
> But, we do open a race window by changing huge_pmd_share to take the
> i_mmap_rwsem in read mode as in the original patch.  
>
> Here is the additional code needed to take the semaphore in write mode
> for the huge_pmd_unshare calls via try_to_unmap_one.  We would need to
> combine this with Longman's patch.  Please take a look and provide feedback.
> Some of the changes are subtle, especially the exception for MAP_PRIVATE
> mappings, but I tried to add sufficient comments.
>
> From 21735818a520705c8573b8d543b8f91aa187bd5d Mon Sep 17 00:00:00 2001
> From: Mike Kravetz <mike.kravetz@oracle.com>
> Date: Fri, 8 Nov 2019 17:25:37 -0800
> Subject: [PATCH] Changes needed for taking i_mmap_rwsem in write mode before
>  call to huge_pmd_unshare in try_to_unmap_one.
>
> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
> ---
>  mm/hugetlb.c        |  9 ++++++++-
>  mm/memory-failure.c | 28 +++++++++++++++++++++++++++-
>  mm/migrate.c        | 27 +++++++++++++++++++++++++--
>  3 files changed, 60 insertions(+), 4 deletions(-)
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index f78891f92765..73d9136549a5 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -4883,7 +4883,14 @@ pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
>   * indicated by page_count > 1, unmap is achieved by clearing pud and
>   * decrementing the ref count. If count == 1, the pte page is not shared.
>   *
> - * called with page table lock held.
> + * Must be called while holding page table lock.
> + * In general, the caller should also hold the i_mmap_rwsem in write mode.
> + * This is to prevent races with page faults calling huge_pmd_share which
> + * will not be holding the page table lock, but will be holding i_mmap_rwsem
> + * in read mode.  It is possible to call without holding i_mmap_rwsem in
> + * write mode if the caller KNOWS the page table is associated with a private
> + * mapping.  This is because private mappings can not share PMDs and can
> + * not race with huge_pmd_share calls during page faults.

So the page table lock here is the one taken via huge_pte_lock(), right?
In huge_pmd_share(), the pte lock has to be taken before the page table
can be shared. So would you mind explaining where exactly the race is?

Thanks,
Longman

>   *
>   * returns: 1 successfully unmapped a shared pte page
>   *	    0 the underlying pte page is not shared, or it is the last user
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index 3151c87dff73..8f52b22cf71b 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -1030,7 +1030,33 @@ static bool hwpoison_user_mappings(struct page *p, unsigned long pfn,
>  	if (kill)
>  		collect_procs(hpage, &tokill, flags & MF_ACTION_REQUIRED);
>  
> -	unmap_success = try_to_unmap(hpage, ttu);
> +	if (!PageHuge(hpage)) {
> +		unmap_success = try_to_unmap(hpage, ttu);
> +	} else {
> +		mapping = page_mapping(hpage);
> +		if (mapping) {
> +			/*
> +			 * For hugetlb pages, try_to_unmap could potentially
> +			 * call huge_pmd_unshare.  Because of this, take
> +			 * semaphore in write mode here and set TTU_RMAP_LOCKED
> +			 * to indicate we have taken the lock at this higher
> +			 * level.
> +			 */
> +			i_mmap_lock_write(mapping);
> +			unmap_success = try_to_unmap(hpage,
> +							ttu|TTU_RMAP_LOCKED);
> +			i_mmap_unlock_write(mapping);
> +		} else {
> +			/*
> +			 * !mapping implies a MAP_PRIVATE huge page mapping.
> +			 * Since PMDs will never be shared in a private
> +			 * mapping, it is safe to let huge_pmd_unshare be
> +			 * called with the semaphore in read mode.
> +			 */
> +			unmap_success = try_to_unmap(hpage, ttu);
> +		}
> +	}
> +
>  	if (!unmap_success)
>  		pr_err("Memory failure: %#lx: failed to unmap page (mapcount=%d)\n",
>  		       pfn, page_mapcount(hpage));
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 4fe45d1428c8..9cae5a4f1e48 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -1333,8 +1333,31 @@ static int unmap_and_move_huge_page(new_page_t get_new_page,
>  		goto put_anon;
>  
>  	if (page_mapped(hpage)) {
> -		try_to_unmap(hpage,
> -			TTU_MIGRATION|TTU_IGNORE_MLOCK|TTU_IGNORE_ACCESS);
> +		struct address_space *mapping = page_mapping(hpage);
> +
> +		if (mapping) {
> +			/*
> +			 * try_to_unmap could potentially call huge_pmd_unshare.
> +			 * Because of this, take semaphore in write mode here
> +			 * and set TTU_RMAP_LOCKED to indicate we have taken
> +			 * the lock at this higher level.
> +			 */
> +			i_mmap_lock_write(mapping);
> +			try_to_unmap(hpage,
> +				TTU_MIGRATION|TTU_IGNORE_MLOCK|
> +				TTU_IGNORE_ACCESS|TTU_RMAP_LOCKED);
> +			i_mmap_unlock_write(mapping);
> +		} else {
> +			/*
> +			 * !mapping implies a MAP_PRIVATE huge page mapping.
> +			 * Since PMDs will never be shared in a private
> +			 * mapping, it is safe to let huge_pmd_unshare be
> +			 * called with the semaphore in read mode.
> +			 */
> +			try_to_unmap(hpage,
> +				TTU_MIGRATION|TTU_IGNORE_MLOCK|
> +				TTU_IGNORE_ACCESS);
> +		}
>  		page_was_mapped = 1;
>  	}
>