From: "Kasireddy, Vivek" <vivek.kasireddy@intel.com>
To: Alistair Popple <apopple@nvidia.com>
Cc: Gerd Hoffmann <kraxel@redhat.com>,
	"Kim, Dongwon" <dongwon.kim@intel.com>,
	David Hildenbrand <david@redhat.com>,
	"Chang, Junxiao" <junxiao.chang@intel.com>,
	Hugh Dickins <hughd@google.com>,
	"dri-devel@lists.freedesktop.org"
	<dri-devel@lists.freedesktop.org>,
	"linux-mm@kvack.org" <linux-mm@kvack.org>,
	Peter Xu <peterx@redhat.com>, Jason Gunthorpe <jgg@nvidia.com>,
	Mike Kravetz <mike.kravetz@oracle.com>
Subject: RE: [RFC v1 1/3] mm/mmu_notifier: Add a new notifier for mapping updates (new pages)
Date: Mon, 28 Aug 2023 04:38:01 +0000
Message-ID: <IA0PR11MB71856D8161600A04427E5A87F8E0A@IA0PR11MB7185.namprd11.prod.outlook.com>
In-Reply-To: <IA0PR11MB71855031E9159C8DCB311702F81DA@IA0PR11MB7185.namprd11.prod.outlook.com>

Hi Alistair,

> 
> > >
> > >> >> > > > No, adding HMM_PFN_REQ_WRITE still doesn't help in fixing the issue.
> > >> >> > > > Although, I do not have THP enabled (or built-in), shmem does not evict
> > >> >> > > > the pages after hole punch as noted in the comment in shmem_fallocate():
> > >> >> > >
> > >> >> > > This is the source of all your problems.
> > >> >> > >
> > >> >> > > Things that are mm-centric are supposed to track the VMAs and changes to
> > >> >> > > the PTEs. If you do something in userspace and it doesn't cause the
> > >> >> > > CPU page tables to change then it certainly shouldn't cause any mmu
> > >> >> > > notifiers or hmm_range_fault changes.
> > >> >> > I am not doing anything out of the blue in the userspace. I think the behavior
> > >> >> > I am seeing with shmem (where an invalidation event (MMU_NOTIFY_CLEAR)
> > >> >> > does occur because of a hole punch but the PTEs don't really get updated)
> > >> >> > can arguably be considered an optimization.
> > >> >>
> > >> >> Your explanations don't make sense.
> > >> >>
> > >> >> If MMU_NOTIFY_CLEAR was sent but the PTEs were left present then:
> > >> >>
> > >> >> > > There should still be an invalidation notifier at some point when the
> > >> >> > > CPU tables do eventually change, whenever that is. Missing that
> > >> >> > > notification would be a bug.
> > >> >> > I clearly do not see any notification getting triggered (from both shmem_fault()
> > >> >> > and hugetlb_fault()) when the PTEs do get updated as the hole is refilled
> > >> >> > due to writes. Are you saying that there needs to be an invalidation event
> > >> >> > (MMU_NOTIFY_CLEAR?) dispatched at this point?
> > >> >>
> > >> >> You don't get to get shmem_fault in the first place.
> > >> > What I am observing is that even after MMU_NOTIFY_CLEAR (hole punch) is sent,
> > >> > hmm_range_fault() finds that the PTEs associated with the hole are still pte_present().
> > >> > I think it remains this way as long as there are reads on the hole. Once there are
> > >> > writes, it triggers shmem_fault() which results in PTEs getting updated but without
> > >> > any notification.
> > >>
> > >> Oh wait, this is shmem. The read from hmm_range_fault() (assuming you
> > >> specified HMM_PFN_REQ_FAULT) will trigger shmem_fault() due to the
> > >> missing PTE.
> > > When running one of the udmabuf subtests (introduced in the third patch of
> > > this series), I see that MMU_NOTIFY_CLEAR is sent when a hole is punched.
> > > As a response, hmm_range_fault() is called from the udmabuf invalidate callback,
> >
> > Actually I'm surprised that works. If you've set up an interval notifier
> > and are updating the notifier sequence numbers correctly I would expect
> > hmm_range_fault() to return -EBUSY until
> > mmu_notifier_invalidate_range_end() is called.
> >
> > It might be helpful to post the code you're testing with somewhere but
> > are you calling mmu_interval_read_begin() to start the critical section
> > and mmu_interval_set_seq() to update the sequence in another notifier?
> > I'm not at all convinced calling hmm_range_fault() from a notifier can
> > be made to work though.
Turns out, calling hmm_range_fault() from the invalidate callback was indeed
a problem and the reason why new pages were not faulted in. In other words,
the invalidate callback is not the right place to invoke hmm_range_fault(),
as the PTEs may not have been cleared yet when it runs.

> That could be part of the problem. I mean the way hmm_range_fault()
> is invoked from the invalidate callback is probably incorrect as you are
> suggesting. Anyway, here is the code I am testing with:
> static bool invalidate_udmabuf(struct mmu_interval_notifier *mn,
>                                const struct mmu_notifier_range *range_mn,
>                                unsigned long cur_seq)
> {
>         struct udmabuf_vma_range *range =
>                         container_of(mn, struct udmabuf_vma_range, range_mn);
>         struct udmabuf *ubuf = range->ubuf;
>         struct hmm_range hrange = {0};
>         unsigned long *pfns, num_pages, timeout;
>         int i, ret;
> 
>         printk("invalidate; start = %lu, end = %lu\n",
>                range->start, range->end);
> 
>         hrange.notifier = mn;
>         hrange.default_flags = HMM_PFN_REQ_FAULT;
>         hrange.start = max(range_mn->start, range->start);
>         hrange.end = min(range_mn->end, range->end);
>         num_pages = (hrange.end - hrange.start) >> PAGE_SHIFT;
> 
>         pfns = kmalloc_array(num_pages, sizeof(*pfns), GFP_KERNEL);
>         if (!pfns)
>                 return true;
> 
>         printk("invalidate; num pages = %lu\n", num_pages);
> 
>         hrange.hmm_pfns = pfns;
>         timeout = jiffies + msecs_to_jiffies(HMM_RANGE_DEFAULT_TIMEOUT);
>         do {
>                 hrange.notifier_seq = mmu_interval_read_begin(mn);
> 
>                 mmap_read_lock(ubuf->vmm_mm);
>                 ret = hmm_range_fault(&hrange);
>                 mmap_read_unlock(ubuf->vmm_mm);
>                 if (ret) {
>                         if (ret == -EBUSY && !time_after(jiffies, timeout))
>                                 continue;
>                         break;
>                 }
> 
>                 if (mmu_interval_read_retry(mn, hrange.notifier_seq))
>                         continue;
>         } while (ret);
> 
>         if (!ret) {
>                 for (i = 0; i < num_pages; i++) {
>                         printk("hmm returned page = %p; pfn = %lu\n",
>                                hmm_pfn_to_page(pfns[i]),
>                                pfns[i] & ~HMM_PFN_FLAGS);
>                 }
>         }
>         return true;
> }
> 
Doing the above from a wq worker func (scheduled after the invalidate event)
instead of the invalidate callback lets hmm_range_fault() fault in new pages.
What this means is that, at least in my use-case, getting MMU_NOTIFY_CLEAR
indicates that the invalidation is still in progress, not that it has completed.
Sorry for the confusion.
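
For anyone who wants to reproduce the trigger outside the selftests, below is
a minimal userspace sketch of the hole-punch-then-touch sequence on a memfd.
It is not the udmabuf subtest from patch 3/3 and does not involve udmabuf or
notifiers at all; the helper name punch_and_refill() and the 4-page size are
made up for illustration. It only shows that, after FALLOC_FL_PUNCH_HOLE, the
punched range reads back as zeros through an existing shared mapping while the
neighbouring pages keep their contents.

```c
#define _GNU_SOURCE
#include <fcntl.h>      /* fallocate(), FALLOC_FL_* */
#include <string.h>     /* memset() */
#include <sys/mman.h>   /* memfd_create(), mmap(), munmap() */
#include <unistd.h>     /* ftruncate(), sysconf(), close() */

/*
 * Hypothetical helper (illustration only): fill a 4-page memfd with a
 * marker byte via a shared mapping, punch a hole over the first page,
 * then touch the hole again through the same mapping.
 *
 * Returns 0 if the hole reads back as zero and the page after it still
 * holds the marker; returns -1 on any error or unexpected value.
 */
int punch_and_refill(void)
{
        long page = sysconf(_SC_PAGESIZE);
        size_t size = 4 * (size_t)page;
        unsigned char *map;
        int ret = -1;
        int fd;

        fd = memfd_create("punch-demo", 0);
        if (fd < 0)
                return -1;
        if (ftruncate(fd, (off_t)size) < 0)
                goto out_close;

        map = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (map == MAP_FAILED)
                goto out_close;

        /* Fault in every page and give it recognisable contents. */
        memset(map, 0xaa, size);

        /* Punch a hole over the first page; the file size is unchanged. */
        if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                      0, (off_t)page) < 0)
                goto out_unmap;

        /* The hole must read as zero; the next page must be untouched. */
        if (map[0] != 0 || map[page] != 0xaa)
                goto out_unmap;

        /* Write into the hole again, as the VMM would do later on. */
        map[0] = 0x55;
        ret = 0;

out_unmap:
        munmap(map, size);
out_close:
        close(fd);
        return ret;
}
```

Calling punch_and_refill() from a small main() and checking for a 0 return is
enough to exercise the sequence; on the kernel side, the fallocate() call is
the step that leads to the MMU_NOTIFY_CLEAR event discussed above.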

Thanks,
Vivek

> static const struct mmu_interval_notifier_ops udmabuf_invalidate_ops = {
>         .invalidate = invalidate_udmabuf,
> };
> 
> >
> > > to walk over the PTEs associated with the hole. When this happens, I noticed that
> > > the below function returns HMM_PFN_VALID | HMM_PFN_WRITE for all the
> > > PTEs associated with the hole.
> > > static inline unsigned long pte_to_hmm_pfn_flags(struct hmm_range *range,
> > >                                                  pte_t pte)
> > > {
> > >         if (pte_none(pte) || !pte_present(pte) || pte_protnone(pte))
> > >                 return 0;
> > >         return pte_write(pte) ? (HMM_PFN_VALID | HMM_PFN_WRITE) : HMM_PFN_VALID;
> > > }
> > >
> > > As a result, hmm_pte_need_fault() always returns 0 and shmem_fault()
> > > never gets triggered despite specifying HMM_PFN_REQ_FAULT | HMM_PFN_REQ_WRITE.
> > > And, the set of PFNs returned by hmm_range_fault() are the same ones
> > > that existed before the hole was punched.
> > >
> > >> Subsequent writes will just upgrade PTE permissions
> > >> assuming the read didn't map them RW to begin with. If you want to
> > >> actually see the hole with hmm_range_fault() don't specify
> > >> HMM_PFN_REQ_FAULT (or _WRITE).
> > >>
> > >> >>
> > >> >> If they were marked non-present during the CLEAR then the shadow side
> > >> >> remains non-present until it gets its own fault.
> > >> >>
> > >> >> If they were made non-present without an invalidation then that is a
> > >> >> bug.
> > >> >>
> > >> >> > > hmm_range_fault() is the correct API to use if you are working with
> > >> >> > > notifiers. Do not hack something together using pin_user_pages.
> > >> >>
> > >> >> > I noticed that hmm_range_fault() does not seem to be working as expected
> > >> >> > given that it gets stuck(hangs) while walking hugetlb pages.
> > >> >>
> > >> >> You are the first to report that, it sounds like a serious bug. Please
> > >> >> try to fix it.
> > >> >>
> > >> >> > Regardless, as I mentioned above, the lack of notification when PTEs
> > >> >> > do get updated due to writes is the crux of the issue
> > >> >> > here. Therefore, AFAIU, triggering an invalidation event or some
> > >> >> > other kind of notification would help in fixing this issue.
> > >> >>
> > >> >> You seem to be facing some kind of bug in the mm, it sounds pretty
> > >> >> serious, and it almost certainly is a missing invalidation.
> > >> >>
> > >> >> Basically, anything that changes a PTE must eventually trigger an
> > >> >> invalidation. It is illegal to change a PTE from one present value to
> > >> >> another present value without invalidation notification.
> > >> >>
> > >> >> It is not surprising something would be missed here.
> > >> > As you suggest, it looks like the root-cause of this issue is the missing
> > >> > invalidation notification when the PTEs are changed from one present
> > >>
> > >> I don't think there's a missing invalidation here. You say you're seeing
> > >> the MMU_NOTIFY_CLEAR when hole punching which is when the PTE is
> > >> cleared. When else do you expect a notification?
> > > Oh, given that we are finding PTEs that are still pte_present() even after
> > > MMU_NOTIFY_CLEAR is sent, the theory is that another MMU_NOTIFY_CLEAR
> > > needs to be sent after the PTEs are updated when new pages are faulted-in.
> > >
> > > However, it just occurred to me that maybe the behavior I am seeing is not
> > > unexpected as it might be a timing issue that has to do with when the PTEs
> > > are walked. Let me explain. Here is what shmem does when a hole is punched:
> > >                 if ((u64)unmap_end > (u64)unmap_start)
> > >                         unmap_mapping_range(mapping, unmap_start,
> > >                                             1 + unmap_end - unmap_start, 0);
> > >                 shmem_truncate_range(inode, offset, offset + len - 1);
> > >
> > > IIUC, the invalidate callback is called from unmap_mapping_range() but
> > > the page removal does not happen until shmem_truncate_range() gets
> > > called. So, if I were to call hmm_range_fault() after shmem_truncate_range(),
> > > I might see different results as the PTEs would probably no longer be present.
> > > In order to test this theory, I would have to schedule a wq thread func from the
> > > invalidate callback (to walk the PTEs after a slight delay). I'll try this out when
> > > I get a chance after addressing some of the locking concerns associated with
> > > pairing static/dynamic dmabuf exporters and importers.
> >
> > That sounds plausible. The PTE will actually be cleared in
> > unmap_mapping_range() after the mmu notifier is called. I'm curious how
> > hmm_range_fault() passes though.
> >
> > > Thanks,
> > > Vivek
> > >
> > >>
> > >> > value to another. I'd like to fix this issue eventually but I first need to
> > >> > focus on addressing udmabuf page migration (out of movable zone)
> > >> > and also look into the locking concerns Daniel mentioned about pairing
> > >> > static and dynamic dmabuf exporters and importers.
> > >> >
> > >> > Thanks,
> > >> > Vivek



