Re: [PATCH v10 07/10] mm: Device exclusive memory access

From: Alistair Popple <apopple@nvidia.com>
To: Peter Xu <peterx@redhat.com>
Cc: <linux-mm@kvack.org>, <akpm@linux-foundation.org>,
	<rcampbell@nvidia.com>, <linux-doc@vger.kernel.org>,
	<nouveau@lists.freedesktop.org>, <hughd@google.com>,
	<linux-kernel@vger.kernel.org>, <dri-devel@lists.freedesktop.org>,
	<hch@infradead.org>, <bskeggs@redhat.com>, <jgg@nvidia.com>,
	<shakeelb@google.com>, <jhubbard@nvidia.com>,
	<willy@infradead.org>, Christoph Hellwig <hch@lst.de>
Subject: Re: [PATCH v10 07/10] mm: Device exclusive memory access
Date: Thu, 10 Jun 2021 10:18:25 +1000	[thread overview]
Message-ID: <2773835.D95cIkl9rl@nvdebian> (raw)
In-Reply-To: <YMDmsha6GDtUf3Vs@t490s>

On Thursday, 10 June 2021 2:05:06 AM AEST Peter Xu wrote:
> On Wed, Jun 09, 2021 at 07:38:04PM +1000, Alistair Popple wrote:
> > On Wednesday, 9 June 2021 4:33:52 AM AEST Peter Xu wrote:
> > > On Mon, Jun 07, 2021 at 05:58:52PM +1000, Alistair Popple wrote:

[...]

> > For thp this means we could end up passing
> > tail pages to rmap_walk(), however it doesn't actually walk them.
> >
> > Based on the results of previous testing I had done I assumed rmap_walk()
> > filtered out tail pages. It does, and I didn't hit the BUG_ON above, but the
> > filtering was not as deliberate as assumed.
> >
> > I've gone back and looked at what was happening in my earlier tests and the
> > tail pages get filtered because the VMA is not getting locked in
> > page_lock_anon_vma_read() due to failing this check:
> >
> >       anon_mapping = (unsigned long)READ_ONCE(page->mapping);
> >       if ((anon_mapping & PAGE_MAPPING_FLAGS) != PAGE_MAPPING_ANON)
> >               goto out;
> >
> > And now I'm not sure it makes sense to read page->mapping of a tail page. So
> > it might be best if we explicitly ignore any tail pages returned from GUP, at
> > least for now (a future series will improve thp support such as adding a pmd
> > version for exclusive entries).
> 
> I feel like it's illegal to access page->mapping of tail pages; I looked at
> what happens if we call page_anon_vma() on a tail page:
> 
> struct anon_vma *page_anon_vma(struct page *page)
> {
>         unsigned long mapping;
> 
>         page = compound_head(page);
>         mapping = (unsigned long)page->mapping;
>         if ((mapping & PAGE_MAPPING_FLAGS) != PAGE_MAPPING_ANON)
>                 return NULL;
>         return __page_rmapping(page);
> }
> 
> It'll just take the head's mapping instead.  It makes sense since the tail page
> shouldn't have a different value against the head page, afaiu.

Right, it makes no sense to look at ->mapping on a tail page because the field
is used for something else. On the 1st tail page it is ->compound_nr and on the
2nd tail page it is ->deferred_list. See the definitions of compound_nr() and
page_deferred_list() respectively. I suppose on the rest of the pages it could
be anything.

I think in practice it is probably ok - iuc bit 0 won't be set for compound_nr
and certainly not for deferred_list->next (a pointer). But none of that seems
intentional, so it would be better to be explicit and not walk the tail pages.

> It would be great if thp experts could chim in.  Before that happens, I agree
> with you that a safer approach is to explicitly not walk a tail page for its
> rmap (and I think the rmap of a tail page will be the same of the head
> anyways.. since they seem to share the anon_vma as quoted).
> >
> > > So... for thp mappings, wondering whether we should do normal GUP (without
> > > SPLIT), pass in always normal or head pages into rmap_walk(), but then
> > > unconditionally split_huge_pmd_address() in page_make_device_exclusive_one()?
> >
> > That could work (although I think GUP will still return tail pages - see
> > follow_trans_huge_pmd() which is called from follow_pmd_mask() in gup).
> 
> Agreed.
> 
> > The main problem is split_huge_pmd_address() unconditionally calls a mmu
> > notifier so I would need to plumb in passing an owner everywhere which could
> > get messy.
> 
> Could I ask why?  split_huge_pmd_address() will notify with CLEAR, so I'm a bit
> confused why we need to pass over the owner.

Sure, it is the same reason we need to pass it for the exclusive notifier.
Any invalidation during the make exclusive operation will break the mmu read
side critical section forcing a retry of the operation. The owner field is what
is used to filter out invalidations (such as the exclusive invalidation) that
don't need to be retried.

> I thought plumb it right before your EXCLUSIVE notifier init would work?

I did try this just to double check and it doesn't work due to the unconditional
notifier.

> ---8<---
> diff --git a/mm/rmap.c b/mm/rmap.c
> index a94d9aed9d95..360ce86f3822 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -2042,6 +2042,12 @@ static bool page_make_device_exclusive_one(struct page *page,
>         swp_entry_t entry;
>         pte_t swp_pte;
> 
> +       /*
> +        * Make sure thps split as device exclusive entries only support pte
> +        * level for now.
> +        */
> +       split_huge_pmd_address(vma, address, false, page);
> +
>         mmu_notifier_range_init_owner(&range, MMU_NOTIFY_EXCLUSIVE, 0, vma,
>                                       vma->vm_mm, address, min(vma->vm_end,
>                                       address + page_size(page)), args->owner);
> ---8<---
> 
> Thanks,
> 
> --
> Peter Xu
>