Re: [PATCH] hugetlbfs: Take read_lock on i_mmap for PMD sharing

From: Waiman Long <longman@redhat.com>
To: Mike Kravetz <mike.kravetz@oracle.com>,
	Matthew Wilcox <willy@infradead.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	Peter Zijlstra <peterz@infradead.org>,
	Ingo Molnar <mingo@redhat.com>, Will Deacon <will.deacon@arm.com>
Subject: Re: [PATCH] hugetlbfs: Take read_lock on i_mmap for PMD sharing
Date: Tue, 12 Nov 2019 21:55:00 -0500	[thread overview]
Message-ID: <28b721b1-38f3-f10a-60e3-20e247c4ecb7@redhat.com> (raw)
In-Reply-To: <33d1fcc8-ebec-bd37-2969-daa0125479df@oracle.com>

On 11/12/19 6:11 PM, Mike Kravetz wrote:
> On 11/12/19 9:27 AM, Waiman Long wrote:
>> On 11/8/19 8:47 PM, Mike Kravetz wrote:
>>> On 11/8/19 11:10 AM, Mike Kravetz wrote:
>>>> On 11/7/19 6:04 PM, Davidlohr Bueso wrote:
>>>>> On Thu, 07 Nov 2019, Mike Kravetz wrote:
>>>>>
>>>>>> Note that huge_pmd_share now increments the page count with the semaphore
>>>>>> held just in read mode.  It is OK to do increments in parallel without
>>>>>> synchronization.  However, we don't want anyone else changing the count
>>>>>> while that check in huge_pmd_unshare is happening.  Hence, the need for
>>>>>> taking the semaphore in write mode.
>>>>> This would be a nice addition to the changelog methinks.
>>>> Last night I remembered there is one place where we currently take
>>>> i_mmap_rwsem in read mode and potentially call huge_pmd_unshare.  That
>>>> is in try_to_unmap_one.  Yes, there is a potential race here today.
>>> Actually there is no race there today.  Callers to huge_pmd_unshare
>>> hold the page table lock.  So, this synchronizes those unshare calls
>>> from  page migration and page poisoning.
>>>
>>>> But that race is somewhat contained as you need two threads doing some
>>>> combination of page migration and page poisoning to race.  This change
>>>> now allows migration or poisoning to race with page fault.  I would
>>>> really prefer if we do not open up the race window in this manner.
>>> But, we do open a race window by changing huge_pmd_share to take the
>>> i_mmap_rwsem in read mode as in the original patch.  
>>>
>>> Here is the additional code needed to take the semaphore in write mode
>>> for the huge_pmd_unshare calls via try_to_unmap_one.  We would need to
>>> combine this with Longman's patch.  Please take a look and provide feedback.
>>> Some of the changes are subtle, especially the exception for MAP_PRIVATE
>>> mappings, but I tried to add sufficient comments.
>>>
>>> From 21735818a520705c8573b8d543b8f91aa187bd5d Mon Sep 17 00:00:00 2001
>>> From: Mike Kravetz <mike.kravetz@oracle.com>
>>> Date: Fri, 8 Nov 2019 17:25:37 -0800
>>> Subject: [PATCH] Changes needed for taking i_mmap_rwsem in write mode before
>>>  call to huge_pmd_unshare in try_to_unmap_one.
>>>
>>> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
>>> ---
>>>  mm/hugetlb.c        |  9 ++++++++-
>>>  mm/memory-failure.c | 28 +++++++++++++++++++++++++++-
>>>  mm/migrate.c        | 27 +++++++++++++++++++++++++--
>>>  3 files changed, 60 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>>> index f78891f92765..73d9136549a5 100644
>>> --- a/mm/hugetlb.c
>>> +++ b/mm/hugetlb.c
>>> @@ -4883,7 +4883,14 @@ pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
>>>   * indicated by page_count > 1, unmap is achieved by clearing pud and
>>>   * decrementing the ref count. If count == 1, the pte page is not shared.
>>>   *
>>> - * called with page table lock held.
>>> + * Must be called while holding page table lock.
>>> + * In general, the caller should also hold the i_mmap_rwsem in write mode.
>>> + * This is to prevent races with page faults calling huge_pmd_share which
>>> + * will not be holding the page table lock, but will be holding i_mmap_rwsem
>>> + * in read mode.  It is possible to call without holding i_mmap_rwsem in
>>> + * write mode if the caller KNOWS the page table is associated with a private
>>> + * mapping.  This is because private mappings can not share PMDs and can
>>> + * not race with huge_pmd_share calls during page faults.
>> So the page table lock here is the huge_pte_lock(). Right? In
>> huge_pmd_share(), the pte lock has to be taken before one can share it.
>> So would you mind explaining where exactly is the race?
> My concern was that this code in huge_pmd_share:
>
> 		saddr = page_table_shareable(svma, vma, addr, idx);
> 		if (saddr) {
> 			spte = huge_pte_offset(svma->vm_mm, saddr,
> 					       vma_mmu_pagesize(svma));
> 			if (spte) {
> 				get_page(virt_to_page(spte));
> 				break;
> 			}
> 		}
>
> and this code in huge_pmd_unshare:
>
>         BUG_ON(page_count(virt_to_page(ptep)) == 0);
>         if (page_count(virt_to_page(ptep)) == 1)
>                 return 0;
>
> could run concurrently.  Note that return code for huge_pmd_unshare is
> specified as,
>
>  * returns: 1 successfully unmapped a shared pte page
>  *          0 the underlying pte page is not shared, or it is the last user
>
> And, callers take different action depending on the return value.
>
> Now, since the code blocks above can run concurrently it is possible that:
> - huge_pmd_unshare sees page_count == 1
> - huge_pmd_share increments page_count in anticipation of sharing
> - huge_pmd_unshare returns 0 indicating there is no pmd sharing
> - huge_pmd_share waits for page table lock to insert pmd page before it
>   sharts sharing
>
> My concern was with what might happen if a huge_pmd_unshare caller received
> a 0 return code and thought there were no other users of the pmd.  In the
> specific case mentioned here (try_to_unmap_one) I now do not believe there
> are any issues.  The caller is simply going to clear one entry on the pmd
> page.  I can not think of a way this could impact the other thread waiting
> to share the pmd.  It will simply start sharing the pmd after it gets the
> page table lock and the entry has been removed.
>
> As the result of your question, I went back and took a really close look
> at the code in question.  While that huge_pmd_share/huge_pmd_unshare code
> running concurrently does make me a bit nervous, I can not find any specific
> problem.  In addition, I ran a test causes this race for several hours
> without issue.
>
> Sorry for the churn/second thoughts.  This code is subtle, and sometimes hard
> to understand.  IMO, Longman's patch (V2) can go forward, but please delete
> the following blob in the commit message from me:
>
> "Note that huge_pmd_share now increments the page count with the
>  semaphore held just in read mode.  It is OK to do increments in
>  parallel without synchronization.  However, we don't want anyone else
>  changing the count while that check in huge_pmd_unshare is happening. 
>  Hence, the need for taking the semaphore in write mode."
>
Thanks for the explanation. I understand that it is often time hard to
figure out if a race exists or not. I got confused myself many times.
Anyway, as long as any destructive action is only taken while holding
the page table lock, it should be safe.

Cheers,
Longman