RE: [RFC v1 1/3] mm/mmu_notifier: Add a new notifier for mapping updates (new pages)

From: "Kasireddy, Vivek" <vivek.kasireddy@intel.com>
To: Alistair Popple <apopple@nvidia.com>
Cc: Gerd Hoffmann <kraxel@redhat.com>,
	"Kim, Dongwon" <dongwon.kim@intel.com>,
	David Hildenbrand <david@redhat.com>,
	"Chang, Junxiao" <junxiao.chang@intel.com>,
	Hugh Dickins <hughd@google.com>, Peter Xu <peterx@redhat.com>,
	"linux-mm@kvack.org" <linux-mm@kvack.org>,
	"dri-devel@lists.freedesktop.org"
	<dri-devel@lists.freedesktop.org>,
	"Jason Gunthorpe" <jgg@nvidia.com>,
	Mike Kravetz <mike.kravetz@oracle.com>
Subject: RE: [RFC v1 1/3] mm/mmu_notifier: Add a new notifier for mapping updates (new pages)
Date: Mon, 24 Jul 2023 20:42:40 +0000	[thread overview]
Message-ID: <IA0PR11MB71855B4A7D0F428C9E8771D9F802A@IA0PR11MB7185.namprd11.prod.outlook.com> (raw)
In-Reply-To: <87zg3lzbd1.fsf@nvdebian.thelocal>

Hi Alistair,

> >>
> >> Yes, although obviously as I think you point out below you wouldn't be
> >> able to take any sleeping locks in mmu_notifier_update_mapping().
> > Yes, I understand that, but I am not sure how we can prevent any potential
> > notifier callback from taking sleeping locks other than adding clear
> comments.
> 
> Oh of course not, but is such a restriction on not taking sleeping locks
> acceptable for your implementation of the notifier callback? I notice in
> patch 2 update_udmabuf() takes a mutex so I assumed not being able to
> sleep in the callback would be an issue.
I plan to drop the mutex in v2 which is not really needed as I described in
my previous reply because we ensure Guest and Host synchronization via
other means. 

> 
> >>
> >> > In which case I'd need to make a similar change in the shmem path as
> well.
> >> > And, also redo (or eliminate) the locking in udmabuf (patch) which
> seems a
> >> > bit excessive on a second look given our use-case (where reads and
> writes
> >> do
> >> > not happen simultaneously due to fence synchronization in the guest
> >> driver).
> >>
> >> I'm not at all familiar with the udmabuf use case but that sounds
> >> brittle and effectively makes this notifier udmabuf specific right?
> > Oh, Qemu uses the udmabuf driver to provide Host Graphics components
> > (such as Spice, Gstreamer, UI, etc) zero-copy access to Guest created
> > buffers. In other words, from a core mm standpoint, udmabuf just
> > collects a bunch of pages (associated with buffers) scattered inside
> > the memfd (Guest ram backed by shmem or hugetlbfs) and wraps
> > them in a dmabuf fd. And, since we provide zero-copy access, we
> > use DMA fences to ensure that the components on the Host and
> > Guest do not access the buffer simultaneously.
> 
> Thanks for the background!
> 
> >> contemplated adding a notifier for PTE updates for drivers using
> >> hmm_range_fault() as it would save some expensive device faults and it
> >> this could be useful for that.
> >>
> >> So if we're adding a notifier for PTE updates I think it would be good
> >> if it covered all cases and was robust enough to allow mirroring of the
> >> correct PTE value (ie. by being called under PTL or via some other
> >> synchronisation like hmm_range_fault()).
> > Ok; in order to make it clear that the notifier is associated with PTE
> updates,
> > I think it needs to have a more suitable name such as
> mmu_notifier_update_pte()
> > or mmu_notifier_new_pte(). But we already have
> mmu_notifier_change_pte,
> > which IIUC is used mainly for PTE updates triggered by KSM. So, I am
> inclining
> > towards dropping this new notifier and instead adding a new flag to
> change_pte
> > to distinguish between KSM triggered notifications and others. Something
> along
> > the lines of:
> > diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
> > index 218ddc3b4bc7..6afce2287143 100644
> > --- a/include/linux/mmu_notifier.h
> > +++ b/include/linux/mmu_notifier.h
> > @@ -129,7 +129,8 @@ struct mmu_notifier_ops {
> >         void (*change_pte)(struct mmu_notifier *subscription,
> >                            struct mm_struct *mm,
> >                            unsigned long address,
> > -                          pte_t pte);
> > +                          pte_t pte,
> > +                          bool ksm_update);
> > @@ -658,7 +659,7 @@ static inline void mmu_notifier_range_init_owner(
> >         unsigned long ___address = __address;                           \
> >         pte_t ___pte = __pte;                                           \
> >                                                                         \
> > -       mmu_notifier_change_pte(___mm, ___address, ___pte);             \
> > +       mmu_notifier_change_pte(___mm, ___address, ___pte, true);       \
> >
> > And replace mmu_notifier_update_mapping(vma->vm_mm, address,
> pte_pfn(*ptep))
> > in the current patch with
> > mmu_notifier_change_pte(vma->vm_mm, address, ptep, false));
> 
> I wonder if we actually need the flag? IIUC it is already used for more
> than just KSM. For example it can be called as part of fault handling by
> set_pte_at_notify() in in wp_page_copy().
Yes, I noticed that but what I really meant is I'd put all these prior instances
of change_pte in one category using the flag. Without the flag, KVM, the only
user that currently has a callback for change_pte would get notified which
may not be appropriate. Note that the change_pte callback for KVM was
added (based on Git log) for KSM updates and it is not clear to me if that
is still the case.

> 
> > Would that work for your HMM use-case -- assuming we call change_pte
> after
> > taking PTL?
> 
> I suspect being called under the PTL could be an issue. For HMM we use a
> combination of sequence numbers and a mutex to synchronise PTEs. To
> avoid calling the notifier while holding PTL we might be able to record
> the sequence number (subscriptions->invalidate_seq) while holding PTL,
> release the PTL and provide that sequence number to the notifier
> callback along with the PTE.
> 
> Some form of mmu_interval_read_retry() could then be used to detect if
> the PTE has changed between dropping the PTL and calling the
> update_pte()/change_pte() notifier.
> 
> Of course if your design can handle being called while holding the PTL
> then the above is probably unnecessarily complex for your use-case.
Yes, I believe we can handle it while holding the PTL. 

> 
> My primary issue with this patch is the notifier is called without the
> PTL while providing a PTE value. Without some form of synchronisation it
> isn't safe to use the result of eg. pte_page(pte) or pte_write(pte) in
> the notifier callback. Based on your comments it seems udmabuf might
> have some other synchronisation that makes it safe, but being external
> to the notifier calls make it's hard to reason about.
I intend to fix the PTL issue in v2 but I am still not sure what is the best
thing to do as far as the notifier is concerned given the following options:
- Keep this patch (and notifier name) but ensure that it is called under PTL
- Drop this patch and expand the use of change_pte but add the flag to
  distinguish between prior usage and new usage
- Keep this patch but don't include the PTE or the pfn of the new page as
  part of the notifier. In other words, just have this:
mmu_notifier_update_mapping(struct mm_struct *mm, unsigned long address)
This way, in udmabuf driver, we could get the new page from the page cache
as soon as we get notified:
	mapoff = linear_page_index(vma, address);
	new_page = find_get_page(vma->vm_file->f_mapping, mapoff);
This last option would probably limit the new notifier to the udmabuf
use-case but I do not intend to pursue it as you suggested that you are
also interested in a new notifier associated with PTE updates.

Thanks,
Vivek

> 
>  - Alistair
> 
> > Thanks,
> > Vivek
> >
> >>
> >> Thanks.
> >>
> >> > Thanks,
> >> > Vivek
> >> >
> >> >>
> >> >> > +		return ret;
> >> >> >
> >> >> >  	ret = 0;
> >> >> >
> >> >> > @@ -6223,6 +6227,9 @@ vm_fault_t hugetlb_fault(struct mm_struct
> >> *mm,
> >> >> struct vm_area_struct *vma,
> >> >> >  	 */
> >> >> >  	if (need_wait_lock)
> >> >> >  		folio_wait_locked(folio);
> >> >> > +	if (!ret)
> >> >> > +		mmu_notifier_update_mapping(vma->vm_mm,
> address,
> >> >> > +					    pte_pfn(*ptep));
> >> >> >  	return ret;
> >> >> >  }
> >> >> >
> >> >> > diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
> >> >> > index 50c0dde1354f..6421405334b9 100644
> >> >> > --- a/mm/mmu_notifier.c
> >> >> > +++ b/mm/mmu_notifier.c
> >> >> > @@ -441,6 +441,23 @@ void __mmu_notifier_change_pte(struct
> >> >> mm_struct *mm, unsigned long address,
> >> >> >  	srcu_read_unlock(&srcu, id);
> >> >> >  }
> >> >> >
> >> >> > +void __mmu_notifier_update_mapping(struct mm_struct *mm,
> >> unsigned
> >> >> long address,
> >> >> > +				   unsigned long pfn)
> >> >> > +{
> >> >> > +	struct mmu_notifier *subscription;
> >> >> > +	int id;
> >> >> > +
> >> >> > +	id = srcu_read_lock(&srcu);
> >> >> > +	hlist_for_each_entry_rcu(subscription,
> >> >> > +				 &mm->notifier_subscriptions->list,
> hlist,
> >> >> > +				 srcu_read_lock_held(&srcu)) {
> >> >> > +		if (subscription->ops->update_mapping)
> >> >> > +			subscription->ops-
> >update_mapping(subscription,
> >> >> mm,
> >> >> > +							  address,
> pfn);
> >> >> > +	}
> >> >> > +	srcu_read_unlock(&srcu, id);
> >> >> > +}
> >> >> > +
> >> >> >  static int mn_itree_invalidate(struct mmu_notifier_subscriptions
> >> >> *subscriptions,
> >> >> >  			       const struct mmu_notifier_range *range)
> >> >> >  {
> >> >> > diff --git a/mm/shmem.c b/mm/shmem.c
> >> >> > index 2f2e0e618072..e59eb5fafadb 100644
> >> >> > --- a/mm/shmem.c
> >> >> > +++ b/mm/shmem.c
> >> >> > @@ -77,6 +77,7 @@ static struct vfsmount *shm_mnt;
> >> >> >  #include <linux/fcntl.h>
> >> >> >  #include <uapi/linux/memfd.h>
> >> >> >  #include <linux/rmap.h>
> >> >> > +#include <linux/mmu_notifier.h>
> >> >> >  #include <linux/uuid.h>
> >> >> >
> >> >> >  #include <linux/uaccess.h>
> >> >> > @@ -2164,8 +2165,12 @@ static vm_fault_t shmem_fault(struct
> >> vm_fault
> >> >> *vmf)
> >> >> >  				  gfp, vma, vmf, &ret);
> >> >> >  	if (err)
> >> >> >  		return vmf_error(err);
> >> >> > -	if (folio)
> >> >> > +	if (folio) {
> >> >> >  		vmf->page = folio_file_page(folio, vmf->pgoff);
> >> >> > +		if (ret == VM_FAULT_LOCKED)
> >> >> > +			mmu_notifier_update_mapping(vma-
> >vm_mm, vmf-
> >> >> >address,
> >> >> > +						    page_to_pfn(vmf-
> >page));
> >> >> > +	}
> >> >> >  	return ret;
> >> >> >  }