On Fri, Sep 30, 2022 at 10:47:45PM -0400, Peter Xu wrote:
> Hi, Mike,
> 
> On Fri, Sep 30, 2022 at 02:38:01PM -0700, Mike Kravetz wrote:
> > On 09/30/22 12:05, Peter Xu wrote:
> > > On Thu, Sep 29, 2022 at 04:33:53PM -0700, Mike Kravetz wrote:
> > > > I was able to do a little more debugging:
> > > > 
> > > > As you know the hugetlb calling path to handle_userfault is something
> > > > like this,
> > > > 
> > > > hugetlb_fault
> > > > 	mutex_lock(&hugetlb_fault_mutex_table[hash]);
> > > > 	ptep = huge_pte_alloc(mm, vma, haddr, huge_page_size(h));
> > > > 	if (huge_pte_none_mostly())
> > > > 		hugetlb_no_page()
> > > > 			page = find_lock_page(mapping, idx);
> > > > 			if (!page) {
> > > > 				if (userfaultfd_missing(vma))
> > > > 					mutex_unlock(&hugetlb_fault_mutex_table[hash]);
> > > > 					return handle_userfault()
> > > > 			}
> > > > 
> > > > For anon mappings, find_lock_page() will never find a page, so as long
> > > > as huge_pte_none_mostly() is true we will call into handle_userfault().
> > > > 
> > > > Since your analysis shows the testcase should never call handle_userfault() for
> > > > a write fault, I simply added a 'if (flags & FAULT_FLAG_WRITE) printk' before
> > > > the call to handle_userfault().  Sure enough, I saw plenty of printk messages.
> > > > 
> > > > Then, before calling handle_userfault() I added code to take the page table
> > > > lock and test huge_pte_none_mostly() again.  In every FAULT_FLAG_WRITE case,
> > > > this second test of huge_pte_none_mostly() was false.  So, the condition
> > > > changed from the check in hugetlb_fault until the check (with page table
> > > > lock) in hugetlb_no_page.
> > > > 
> > > > IIUC, the only code that should be modifying the pte in this test is
> > > > hugetlb_mcopy_atomic_pte().  It also holds the hugetlb_fault_mutex while
> > > > updating the pte.
> > > > 
> > > > It 'appears' that hugetlb_fault is not seeing the updated pte and I can
> > > > only guess that it is due to some caching issues.
> > > > 
> > > > After writing the pte in hugetlb_mcopy_atomic_pte() there is this comment.
> > > > 
> > > > 	/* No need to invalidate - it was non-present before */
> > > > 	update_mmu_cache(dst_vma, dst_addr, dst_pte);
> > > > 
> > > > I suspect that is true.  However, it seems like this test depends on all
> > > > CPUs seeing the updated pte immediately?
> > > > 
> > > > I added some TLB flushing to hugetlb_mcopy_atomic_pte, but it did not make
> > > > any difference.  Suggestions would be appreciated as cache/tlb/???  flushing
> > > > issues take me a while to figure out.
> > > 
> > > This morning when I went back and rethink the matter, I just found that the
> > > common hugetlb path handles private anonymous mappings with all empty page
> > > cache as you explained above.  In that sense the two patches I posted may
> > > not really make sense even if they can pass the tests.. and maybe that's
> > > also the reason why the reservations got messed up.  This is also something
> > > I found after I read more on the reservation code e.g. no matter private or
> > > shared hugetlb mappings we only reserve that only number of pages when mmap().
> > > 
> > > Indeed if with that in mind the UFFDIO_COPY should also work because
> > > hugetlb fault handler checks pte first before page cache, so uffd missing
> > > should still work as expected.
> > > 
> > > It makes sense especially for hugetlb to do that otherwise there can be
> > > plenty of zero huge pages cached in the page cache.  I'm not sure whether
> > > this is the reason hugetlb does it differently (e.g. comparing to shmem?),
> > > it'll be great if I can get a confirmation.  If it's true please ignore the
> > > two patches I posted.
> > > 
> > > I think what you analyzed is correct in that the pte shouldn't go away
> > > after being armed once.  One more thing I tried (actually yesterday) was
> > > SIGBUS the process when the write missing event was generated, and I can
> > > see the user stack points to the cmpxchg() of the pthread_mutex_lock().  It
> > > means indeed it moved forward and passed the mutex type check, it also
> > > means it should have seen a !none pte already with at least reading
> > > permission, in that sense it matches with "missing TLB" possibility
> > > experiment mentioned above, because for a missing TLB it should keep
> > > stucking at the read not write.  It's still uncertain why the pte can go
> > > away somehow from under us and why it quickly re-appears according to your
> > > experiment.
> > > 
> > 
> > I 'think' it is more of a race with all cpus seeing the pte update.  To be
> > honest, I can not wrap my head around how that can happen.
> > 
> > I did not really like your idea of adding anon (or private) pages to the
> > page cache.
> 
> I don't like that either, especially when I notice it breaks the
> reservations... :)
> 
> Note that my previous patch wasn't adding anon or private pages into page
> cache.  PageAnon() will be false for those pages being added.  I was
> literally doing insertion of page cache for UFFDIO_COPY, rather than
> directly making it anonymous.  Actually that's what shmem does.
> 
> > As you discovered, there is code like reservations which depend
> > on current behavior.
> > 
> > It seems to me that for 'missing' hugetlb faults there are two specific cases:
> > 1) Shared or file backed mappings.  In this case, the page cache is the
> >    'source of truth'.  If there is not a page in the page cache, then we
> >    hand off to userfault with VM_UFFD_MISSING.
> > 2) anon or private mappings.  In this case, pages are not in the page cache.
> >    The page table is the 'source of truth'.
> 
> Sorry, I can't say I fully agree with it.
> 
> IMHO the missing mode is really about page cache.  Let's imagine when we
> create private mappings upon a hugetlbfs file that has some pages written.
> When page fault triggers, we should never generate a missing if cache hit,
> even if the pte is null.  Maybe that's not exactly what you meant, but the
> wording is kind of misleading here.
> 
> I'd say it's really about hugetlbfs is special in treating private pages.
> Note that shmem wasn't treat it like that as I mentioned.  But AFAICT
> hugetlbfs is different in a good way because otherwise we could be wasting
> quite a few useless zero pages dangling in the page cache, and AFAICT
> that's still what we do with shmem - just try to create 20M shmem
> anonymouse file, privately map and write to them and see how many mem we'll
> consume..  It's not optimal, but still correct IMHO.
> 
> > Early in hugetlb fault processing
> >    we check the page table (huge_pte_none_mostly).  However, as my debug code
> >    has shown, checking the page table again with lock held will reveal that
> >    the pte has in fact been updated.
> 
> Right, I found it in the hard way. I should have been reading code more
> carefully: it's caused by CoW where we'll need a clear+flush to make sure
> TLB consistency (in hugetlb_wp):
> 
> 	if (likely(ptep && pte_same(huge_ptep_get(ptep), pte))) {
> 		ClearHPageRestoreReserve(new_page);
> 
> 		/* Break COW or unshare */
> 		huge_ptep_clear_flush(vma, haddr, ptep);
> 		mmu_notifier_invalidate_range(mm, range.start, range.end);
> 		page_remove_rmap(old_page, vma, true);
> 		hugepage_add_new_anon_rmap(new_page, vma, haddr);
> 		set_huge_pte_at(mm, haddr, ptep,
> 				make_huge_pte(vma, new_page, !unshare));
> 		SetHPageMigratable(new_page);
> 		/* Make the old page be freed below */
> 		new_page = old_page;
> 	}
> 
> The early TLB flush was trying to avoid inconsistent TLB entries in
> different cores.
> 
> So far I still don't know why the hugetlb commit changed the timing, but
> this time since I tracked the pgtables I'm more sure of its cause.
> 
> > 
> > My thought was that regular anon pages would have the same issue.  So, I looked
> > at the calling code there.  In do_anonymous_page() there is this block:
> > 
> > 	/* Use the zero-page for reads */
> > 	if (!(vmf->flags & FAULT_FLAG_WRITE) &&
> > 			!mm_forbids_zeropage(vma->vm_mm)) {
> > 		entry = pte_mkspecial(pfn_pte(my_zero_pfn(vmf->address),
> > 						vma->vm_page_prot));
> > 		vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
> > 				vmf->address, &vmf->ptl);
> > 		if (!pte_none(*vmf->pte)) {
> > 			update_mmu_tlb(vma, vmf->address, vmf->pte);
> > 			goto unlock;
> > 		}
> > 		ret = check_stable_address_space(vma->vm_mm);
> > 		if (ret)
> > 			goto unlock;
> > 		/* Deliver the page fault to userland, check inside PT lock */
> > 		if (userfaultfd_missing(vma)) {
> > 			pte_unmap_unlock(vmf->pte, vmf->ptl);
> > 			return handle_userfault(vmf, VM_UFFD_MISSING);
> > 		}
> > 		goto setpte;
> > 	}
> > 
> > Notice that here the pte is checked while holding the page table lock while
> > to make the decision to call handle_userfault().
> 
> Right.  That's a great finding, thanks.  It's just that I found it great
> only after I found the whole story on the CoW in progress.. I could have
> been done better.
> 
> > 
> > In my testing, if we check huge_pte_none_mostly() while holding the page table
> > lock before calling handle_userfault we will not experience the failure.  Can
> > you see if this also resolves the issue in your environment?  I do not love
> > this solution as I still can not explain how this code is missing the pte
> > update.
> 
> Yes this patch will fix it which I tested.  But IMHO there're quite a few
> wordings that are misleading as I tried to explain above on why page cache
> still matters (and matters the most IMHO for MISSING events).
> 
> Meanwhile IMHO there's a better way to address this - we're in
> hugetlb_no_page() but we're relying on a pte that was read _without_
> pgtable lock.  It means it can be either unstable or changed.  We do proper
> check for most of the rest code path for hugetlb_no_page() on pte_same()
> but we just forgot to do that for userfaultfd.
> 
> One example is IMO we shouldn't target this "check pte under locking" for
> private mappings only.  If the pte changed for either private/shared
> mappings, it's probably already illegal to be inside hugetlb_no_page().
> Logically for shared mappings the pte can change in any form too as long as
> under pgtable lock.  So the lock should logically apply to shared mappings
> too, IMHO.
> 
> In summary, I attached another patch that addressed it in the way I
> described above (only compile tested; even without writting the commit
> message because I need to go very soon).  Let me know what do you think the
> best way to approach this.

Obviously I never remember to attach things when I meant to.. Sorry for the
noise.  Attached this time.

> 
> (During debugging this problem, I also found a few other issues; I'll
> not make this email any longer but will verify them next week and follow up
> from there)
> 
> Thanks,
> 
> > 
> > From f910e7155d6831514165af35e0d75574124a4477 Mon Sep 17 00:00:00 2001
> > From: Mike Kravetz <mike.kravetz@oracle.com>
> > Date: Fri, 30 Sep 2022 13:45:08 -0700
> > Subject: [PATCH] hugetlb: check pte with page table lock before handing to
> >  userfault
> > 
> > In hugetlb_no_page we decide a page is missing if not present in the
> > page cache.  This is perfectly valid for shared/file mappings where
> > pages must exist in the page cache.  For anon/private mappings, the page
> > table must be checked.  This is done early in hugetlb_fault processing
> > and is the reason we enter hugetlb_no_page.  However, the early check is
> > made without holding the page table lock.  There could be racing updates
> > to the pte entry, so check again with page table lock held before
> > deciding to call handle_userfault.
> > 
> > Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
> > ---
> >  mm/hugetlb.c | 21 ++++++++++++++++++++-
> >  1 file changed, 20 insertions(+), 1 deletion(-)
> > 
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index 60e077ce6ca7..4cb44a4629b8 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -5560,10 +5560,29 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
> >  		if (idx >= size)
> >  			goto out;
> >  		/* Check for page in userfault range */
> > -		if (userfaultfd_missing(vma))
> > +		if (userfaultfd_missing(vma)) {
> > +			/*
> > +			 * For missing pages, the page cache (checked above) is
> > +			 * the 'source of truth' for shared mappings.  For anon
> > +			 * mappings, the source of truth is the page table.  We
> > +			 * already checked huge_pte_none_mostly() in
> > +			 * hugetlb_fault.  However, there could be racing
> > +			 * updates.  Check again while holding page table lock
> > +			 * before handing off to userfault.
> > +			 */
> > +			if (!(vma->vm_flags & VM_MAYSHARE)) {
> > +				ptl = huge_pte_lock(h, mm, ptep);
> > +				if (!huge_pte_none_mostly(huge_ptep_get(ptep))) {
> > +					spin_unlock(ptl);
> > +					ret = 0;
> > +					goto out;
> > +				}
> > +				spin_unlock(ptl);
> > +			}
> >  			return hugetlb_handle_userfault(vma, mapping, idx,
> >  						       flags, haddr, address,
> >  						       VM_UFFD_MISSING);
> > +		}
> >  
> >  		page = alloc_huge_page(vma, haddr, 0);
> >  		if (IS_ERR(page)) {
> > -- 
> > 2.37.3
> > 
> 
> -- 
> Peter Xu

-- 
Peter Xu