From: Alistair Popple <apopple@nvidia.com>
To: "Kasireddy, Vivek" <vivek.kasireddy@intel.com>
Cc: "Kim, Dongwon" <dongwon.kim@intel.com>,
David Hildenbrand <david@redhat.com>,
"Chang, Junxiao" <junxiao.chang@intel.com>,
Hugh Dickins <hughd@google.com>,
"dri-devel@lists.freedesktop.org"
<dri-devel@lists.freedesktop.org>,
"linux-mm@kvack.org" <linux-mm@kvack.org>,
Peter Xu <peterx@redhat.com>, Jason Gunthorpe <jgg@nvidia.com>,
Mike Kravetz <mike.kravetz@oracle.com>,
Gerd Hoffmann <kraxel@redhat.com>
Subject: Re: [RFC v1 1/3] mm/mmu_notifier: Add a new notifier for mapping updates (new pages)
Date: Tue, 22 Aug 2023 18:15:06 +1000 [thread overview]
Message-ID: <87o7izlcg7.fsf@nvdebian.thelocal> (raw)
In-Reply-To: <IA0PR11MB71853FB79625A4F2399FFD79F81FA@IA0PR11MB7185.namprd11.prod.outlook.com>
"Kasireddy, Vivek" <vivek.kasireddy@intel.com> writes:
> Hi Alistair,
>
>> >> > > > No, adding HMM_PFN_REQ_WRITE still doesn't help in fixing the
>> issue.
>> >> > > > Although, I do not have THP enabled (or built-in), shmem does not
>> evict
>> >> > > > the pages after hole punch as noted in the comment in
>> >> shmem_fallocate():
>> >> > >
>> >> > > This is the source of all your problems.
>> >> > >
>> >> > > Things that are mm-centric are supposed to track the VMAs and
>> changes
>> >> to
>> >> > > the PTEs. If you do something in userspace and it doesn't cause the
>> >> > > CPU page tables to change then it certainly shouldn't cause any mmu
>> >> > > notifiers or hmm_range_fault changes.
>> >> > I am not doing anything out of the blue in the userspace. I think the
>> >> behavior
>> >> > I am seeing with shmem (where an invalidation event
>> >> (MMU_NOTIFY_CLEAR)
>> >> > does occur because of a hole punch but the PTEs don't really get
>> updated)
>> >> > can arguably be considered an optimization.
>> >>
>> >> Your explanations don't make sense.
>> >>
>> >> If MMU_NOTIFER_CLEAR was sent but the PTEs were left present then:
>> >>
>> >> > > There should still be an invalidation notifier at some point when the
>> >> > > CPU tables do eventually change, whenever that is. Missing that
>> >> > > notification would be a bug.
>> >> > I clearly do not see any notification getting triggered (from both
>> >> shmem_fault()
>> >> > and hugetlb_fault()) when the PTEs do get updated as the hole is refilled
>> >> > due to writes. Are you saying that there needs to be an invalidation
>> event
>> >> > (MMU_NOTIFY_CLEAR?) dispatched at this point?
>> >>
>> >> You don't get to get shmem_fault in the first place.
>> > What I am observing is that even after MMU_NOTIFY_CLEAR (hole punch)
>> is sent,
>> > hmm_range_fault() finds that the PTEs associated with the hole are still
>> pte_present().
>> > I think it remains this way as long as there are reads on the hole. Once
>> there are
>> > writes, it triggers shmem_fault() which results in PTEs getting updated but
>> without
>> > any notification.
>>
>> Oh wait, this is shmem. The read from hmm_range_fault() (assuming you
>> specified HMM_PFN_REQ_FAULT) will trigger shmem_fault() due to the
>> missing PTE.
> When running one of the udmabuf subtests (introduced in the third patch of
> this series), I see that MMU_NOTIFY_CLEAR is sent when a hole is punched.
> As a response, hmm_range_fault() is called from the udmabuf invalidate callback,
Actually I'm suprised that works. If you've setup an interval notifier
and are updating the notifier sequence numbers correctly I would expect
hmm_range_fault() to return -EBUSY until
mmu_notifier_invalidate_range_end() is called.
It might be helpful to post the code you're testing with somewhere but
are you calling mmu_interval_read_begin() to start the critical section
and mmu_interval_set_seq() to update the sequence in another notifier?
I'm not at all convinced calling hmm_range_fault() from a notifier can
be made to work though.
> to walk over the PTEs associated with the hole. When this happens, I noticed that
> the below function returns HMM_PFN_VALID | HMM_PFN_WRITE for all the
> PTEs associated with the hole.
> static inline unsigned long pte_to_hmm_pfn_flags(struct hmm_range *range,
> pte_t pte)
> {
> if (pte_none(pte) || !pte_present(pte) || pte_protnone(pte))
> return 0;
> return pte_write(pte) ? (HMM_PFN_VALID | HMM_PFN_WRITE) : HMM_PFN_VALID;
> }
>
> As a result, hmm_pte_need_fault() always returns 0 and shmem_fault()
> never gets triggered despite specifying HMM_PFN_REQ_FAULT | HMM_PFN_REQ_WRITE.
> And, the set of PFNs returned by hmm_range_fault() are the same ones
> that existed before the hole was punched.
>
>> Subsequent writes will just upgrade PTE permissions
>> assuming the read didn't map them RW to begin with. If you want to
>> actually see the hole with hmm_range_fault() don't specify
>> HMM_PFN_REQ_FAULT (or _WRITE).
>>
>> >>
>> >> If they were marked non-prsent during the CLEAR then the shadow side
>> >> remains non-present until it gets its own fault.
>> >>
>> >> If they were made non-present without an invalidation then that is a
>> >> bug.
>> >>
>> >> > > hmm_range_fault() is the correct API to use if you are working with
>> >> > > notifiers. Do not hack something together using pin_user_pages.
>> >>
>> >> > I noticed that hmm_range_fault() does not seem to be working as
>> expected
>> >> > given that it gets stuck(hangs) while walking hugetlb pages.
>> >>
>> >> You are the first to report that, it sounds like a serious bug. Please
>> >> try to fix it.
>> >>
>> >> > Regardless, as I mentioned above, the lack of notification when PTEs
>> >> > do get updated due to writes is the crux of the issue
>> >> > here. Therefore, AFAIU, triggering an invalidation event or some
>> >> > other kind of notification would help in fixing this issue.
>> >>
>> >> You seem to be facing some kind of bug in the mm, it sounds pretty
>> >> serious, and it almost certainly is a missing invalidation.
>> >>
>> >> Basically, anything that changes a PTE must eventually trigger an
>> >> invalidation. It is illegal to change a PTE from one present value to
>> >> another present value without invalidation notification.
>> >>
>> >> It is not surprising something would be missed here.
>> > As you suggest, it looks like the root-cause of this issue is the missing
>> > invalidation notification when the PTEs are changed from one present
>>
>> I don't think there's a missing invalidation here. You say you're seeing
>> the MMU_NOTIFY_CLEAR when hole punching which is when the PTE is
>> cleared. When else do you expect a notification?
> Oh, given that we are finding PTEs that are still pte_present() even after
> MMU_NOTIFY_CLEAR is sent, the theory is that another MMU_NOTIFY_CLEAR
> needs to be sent after the PTEs are updated when new pages are faulted-in.
>
> However, it just occurred to me that maybe the behavior I am seeing is not
> unexpected as it might be a timing issue that has to do with when the PTEs
> are walked. Let me explain. Here is what shmem does when a hole is punched:
> if ((u64)unmap_end > (u64)unmap_start)
> unmap_mapping_range(mapping, unmap_start,
> 1 + unmap_end - unmap_start, 0);
> shmem_truncate_range(inode, offset, offset + len - 1);
>
> IIUC, the invalidate callback is called from unmap_mapping_range() but
> the page removal does not happen until shmem_truncate_range() gets
> called. So, if I were to call hmm_range_fault() after shmem_truncate_range(),
> I might see different results as the PTEs would probably no longer be present.
> In order to test this theory, I would have to schedule a wq thread func from the
> invalidate callback (to walk the PTEs after a slight delay). I'll try this out when
> I get a chance after addressing some of the locking concerns associated with
> pairing static/dynamic dmabuf exporters and importers.
That sounds plausible. The PTE will actually be cleared in
unmap_mapping_range() after the mmu notifier is called. I'm curious how
hmm_range_fault() passes though.
> Thanks,
> Vivek
>
>>
>> > value to another. I'd like to fix this issue eventually but I first need to
>> > focus on addressing udmabuf page migration (out of movable zone)
>> > and also look into the locking concerns Daniel mentioned about pairing
>> > static and dynamic dmabuf exporters and importers.
>> >
>> > Thanks,
>> > Vivek
next prev parent reply other threads:[~2023-08-22 8:16 UTC|newest]
Thread overview: 64+ messages / expand[flat|nested] mbox.gz Atom feed top
2023-07-18 8:28 [RFC v1 0/3] udmabuf: Replace pages when there is FALLOC_FL_PUNCH_HOLE in memfd Vivek Kasireddy
2023-07-18 8:28 ` [RFC v1 1/3] mm/mmu_notifier: Add a new notifier for mapping updates (new pages) Vivek Kasireddy
2023-07-18 15:36 ` Jason Gunthorpe
2023-07-19 0:05 ` Kasireddy, Vivek
2023-07-19 0:24 ` Jason Gunthorpe
2023-07-19 6:19 ` Kasireddy, Vivek
2023-07-19 2:08 ` Alistair Popple
2023-07-20 7:43 ` Kasireddy, Vivek
2023-07-20 9:00 ` Alistair Popple
2023-07-24 7:54 ` Kasireddy, Vivek
2023-07-24 13:35 ` Jason Gunthorpe
2023-07-24 20:32 ` Kasireddy, Vivek
2023-07-25 4:30 ` Hugh Dickins
2023-07-25 22:24 ` Kasireddy, Vivek
2023-07-27 21:43 ` Peter Xu
2023-07-29 0:08 ` Kasireddy, Vivek
2023-07-31 17:05 ` Peter Xu
2023-08-01 7:11 ` Kasireddy, Vivek
2023-08-01 21:57 ` Peter Xu
2023-08-03 8:08 ` Kasireddy, Vivek
2023-08-03 13:02 ` Peter Xu
2023-07-25 12:36 ` Jason Gunthorpe
2023-07-25 22:44 ` Kasireddy, Vivek
2023-07-25 22:53 ` Jason Gunthorpe
2023-07-27 7:34 ` Kasireddy, Vivek
2023-07-27 11:58 ` Jason Gunthorpe
2023-07-29 0:46 ` Kasireddy, Vivek
2023-07-30 23:09 ` Jason Gunthorpe
2023-08-01 5:32 ` Kasireddy, Vivek
2023-08-01 12:19 ` Jason Gunthorpe
2023-08-01 12:22 ` David Hildenbrand
2023-08-01 12:23 ` Jason Gunthorpe
2023-08-01 12:26 ` David Hildenbrand
2023-08-01 12:26 ` Jason Gunthorpe
2023-08-01 12:28 ` David Hildenbrand
2023-08-01 17:53 ` Kasireddy, Vivek
2023-08-01 18:19 ` Jason Gunthorpe
2023-08-03 7:35 ` Kasireddy, Vivek
2023-08-03 12:14 ` Jason Gunthorpe
2023-08-03 12:32 ` David Hildenbrand
2023-08-04 0:14 ` Alistair Popple
2023-08-04 6:39 ` Kasireddy, Vivek
2023-08-04 7:23 ` David Hildenbrand
2023-08-04 21:53 ` Kasireddy, Vivek
2023-08-04 12:49 ` Jason Gunthorpe
2023-08-08 7:37 ` Kasireddy, Vivek
2023-08-08 12:42 ` Jason Gunthorpe
2023-08-16 6:43 ` Kasireddy, Vivek
2023-08-21 9:02 ` Alistair Popple
2023-08-22 6:14 ` Kasireddy, Vivek
2023-08-22 8:15 ` Alistair Popple [this message]
2023-08-24 6:48 ` Kasireddy, Vivek
2023-08-28 4:38 ` Kasireddy, Vivek
2023-08-30 16:02 ` Jason Gunthorpe
2023-07-25 3:38 ` Alistair Popple
2023-07-24 13:36 ` Alistair Popple
2023-07-24 13:37 ` Jason Gunthorpe
2023-07-24 20:42 ` Kasireddy, Vivek
2023-07-25 3:14 ` Alistair Popple
2023-07-18 8:28 ` [RFC v1 2/3] udmabuf: Replace pages when there is FALLOC_FL_PUNCH_HOLE in memfd Vivek Kasireddy
2023-08-02 12:40 ` Daniel Vetter
2023-08-03 8:24 ` Kasireddy, Vivek
2023-08-03 8:32 ` Daniel Vetter
2023-07-18 8:28 ` [RFC v1 3/3] selftests/dma-buf/udmabuf: Add tests for huge pages and FALLOC_FL_PUNCH_HOLE Vivek Kasireddy
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=87o7izlcg7.fsf@nvdebian.thelocal \
--to=apopple@nvidia.com \
--cc=david@redhat.com \
--cc=dongwon.kim@intel.com \
--cc=dri-devel@lists.freedesktop.org \
--cc=hughd@google.com \
--cc=jgg@nvidia.com \
--cc=junxiao.chang@intel.com \
--cc=kraxel@redhat.com \
--cc=linux-mm@kvack.org \
--cc=mike.kravetz@oracle.com \
--cc=peterx@redhat.com \
--cc=vivek.kasireddy@intel.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).