* some likely bugs in IOMMUv2 (in tlb_finish_mmu() nested flush and mremap())
From: Jann Horn @ 2022-09-23 15:38 UTC (permalink / raw)
  To: Will Deacon, Jason Gunthorpe, Sean Christopherson, Joerg Roedel,
	jean-philippe.brucker
  Cc: Linux-MM, kernel list, iommu

Hi!

I looked through some of the code related to IOMMUv2 (the thing where
the IOMMU walks the normal userspace page tables and TLB shootdowns
are replicated to the IOMMU through
mmu_notifier_ops::invalidate_range).
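
For orientation, the hook in question sits next to the start/end pair in
mmu_notifier_ops - roughly like this (paraphrased from
include/linux/mmu_notifier.h, not verbatim):

  struct mmu_notifier_ops {
          /* ... */
          int (*invalidate_range_start)(struct mmu_notifier *subscription,
                                        const struct mmu_notifier_range *range);
          void (*invalidate_range_end)(struct mmu_notifier *subscription,
                                       const struct mmu_notifier_range *range);
          /*
           * Called whenever the primary TLB is flushed for a range; users
           * that share the real page tables (IOMMUv2-style SVA) rely on
           * this hook rather than on the start/end bracketing alone.
           */
          void (*invalidate_range)(struct mmu_notifier *subscription,
                                   struct mm_struct *mm,
                                   unsigned long start, unsigned long end);
  };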

I think there's a bug in the interaction between tlb_finish_mmu() and
mmu_notifier_ops::invalidate_range: In the mm_tlb_flush_nested() case,
__tlb_reset_range() sets tlb->start and tlb->end *both* to ~0.
Afterwards, tlb_finish_mmu() calls
tlb_flush_mmu()->tlb_flush_mmu_tlbonly()->mmu_notifier_invalidate_range(),
which will pass those tlb->start and tlb->end values to
mmu_notifier_ops::invalidate_range callbacks. But those callbacks
don't know about this special case and then basically only flush
virtual address ~0, making the flush useless. (However, pretty much
every place that calls tlb_finish_mmu() first calls
mmu_notifier_invalidate_range_end() even though the appropriate thing
would probably be mmu_notifier_invalidate_range_only_end(); and I
think those two things probably cancel each other out?)
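
For reference, the path I mean looks roughly like this (paraphrased from
mm/mmu_gather.c, heavily trimmed, not verbatim):

  void tlb_finish_mmu(struct mmu_gather *tlb)
  {
          if (mm_tlb_flush_nested(tlb->mm)) {
                  /* concurrent zap: fall back to "flush everything" */
                  tlb->fullmm = 1;
                  __tlb_reset_range(tlb); /* sets tlb->start = tlb->end = ~0 */
                  tlb->freed_tables = 1;
          }
          tlb_flush_mmu(tlb);
          /* ... */
  }

  static void tlb_flush_mmu_tlbonly(struct mmu_gather *tlb)
  {
          /* ... */
          tlb_flush(tlb);
          /*
           * In the nested case this is invalidate_range(mm, ~0, ~0), which
           * the IOMMU callbacks treat as a tiny range at address ~0.
           */
          mmu_notifier_invalidate_range(tlb->mm, tlb->start, tlb->end);
          __tlb_reset_range(tlb);
  }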

Also, from what I can tell, the mremap() code, in move_page_tables(),
only invokes mmu_notifier_ops::invalidate_range via the
mmu_notifier_invalidate_range_end() at the very end, long after TLB
flushes must have happened - sort of like the bug we had years ago
where mremap() was flushing the normal TLBs too late
(https://bugs.chromium.org/p/project-zero/issues/detail?id=1695).
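
Schematically, the ordering in move_page_tables() is (heavily trimmed and
paraphrased from mm/mremap.c, not verbatim):

          struct mmu_notifier_range range;

          mmu_notifier_range_init(&range, MMU_NOTIFY_UNMAP, 0, vma, vma->vm_mm,
                                  old_addr, old_end);
          mmu_notifier_invalidate_range_start(&range);

          for (; old_addr < old_end; old_addr += extent, new_addr += extent) {
                  /*
                   * move_ptes()/move_normal_pmd(): entries are moved and the
                   * primary TLB is flushed (flush_tlb_range()) before the
                   * page table locks are dropped ...
                   */
          }

          /*
           * ... but mmu_notifier_ops::invalidate_range only runs here, via
           * range_end(), long after the primary TLB flush.
           */
          mmu_notifier_invalidate_range_end(&range);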


* Re: some likely bugs in IOMMUv2 (in tlb_finish_mmu() nested flush and mremap())
From: Jason Gunthorpe @ 2022-09-23 17:27 UTC (permalink / raw)
  To: Jann Horn
  Cc: Will Deacon, Sean Christopherson, Joerg Roedel,
	jean-philippe.brucker, Linux-MM, kernel list, iommu

On Fri, Sep 23, 2022 at 05:38:12PM +0200, Jann Horn wrote:
> Hi!
> 
> I looked through some of the code related to IOMMUv2 (the thing where
> the IOMMU walks the normal userspace page tables and TLB shootdowns
> are replicated to the IOMMU through
> mmu_notifier_ops::invalidate_range).
> 
> I think there's a bug in the interaction between tlb_finish_mmu() and
> mmu_notifier_ops::invalidate_range: In the mm_tlb_flush_nested() case,
> __tlb_reset_range() sets tlb->start and tlb->end *both* to ~0.
> Afterwards, tlb_finish_mmu() calls
> tlb_flush_mmu()->tlb_flush_mmu_tlbonly()->mmu_notifier_invalidate_range(),
> which will pass those tlb->start and tlb->end values to
> mmu_notifier_ops::invalidate_range callbacks. But those callbacks
> don't know about this special case and then basically only flush
> virtual address ~0, making the flush useless. 

Yeah, that looks wrong to me, and it extends beyond just the iommu
drivers: kvm_arch_mmu_notifier_invalidate_range() also does not handle
this encoding.

Most likely tlb_flush_mmu_tlbonly() needs to change it back to a 0..~0
range? I wonder why it uses such an odd encoding in the first place?
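
Something along these lines, maybe (an untested sketch against the rough
shape of tlb_flush_mmu_tlbonly(); details elided, not a tested patch):

  static void tlb_flush_mmu_tlbonly(struct mmu_gather *tlb)
  {
          unsigned long start = tlb->start, end = tlb->end;

          /* ... */
          if (tlb->fullmm) {
                  /*
                   * Don't hand the start == end == ~0 "flush all" encoding
                   * to the notifiers; widen it to a real whole-address-space
                   * range instead.
                   */
                  start = 0;
                  end = ~0UL;
          }
          tlb_flush(tlb);
          mmu_notifier_invalidate_range(tlb->mm, start, end);
          __tlb_reset_range(tlb);
  }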

Actually, maybe having mm_tlb_flush_nested() call __tlb_reset_range()
to generate a 'flush all' request is just a bad idea, as we already
had another bug in 7a30df49f63ad92 related to reset_range doing the
wrong thing for a flush all action.

> (However, pretty much every place that calls tlb_finish_mmu() first
> calls mmu_notifier_invalidate_range_end() even though the
> appropriate thing would probably be
> mmu_notifier_invalidate_range_only_end(); and I think those two
> things probably cancel each other out?)

That does sound like double flushing to me, though as you note below,
the invalidate_range() triggered by range_end() after the TLB
flush/page freeing is functionally incorrect, so we cannot rely on it.

> Also, from what I can tell, the mremap() code, in move_page_tables(),
> only invokes mmu_notifier_ops::invalidate_range via the
> mmu_notifier_invalidate_range_end() at the very end, long after TLB
> flushes must have happened - sort of like the bug we had years ago
> where mremap() was flushing the normal TLBs too late
> (https://bugs.chromium.org/p/project-zero/issues/detail?id=1695).

Based on the description of eb66ae03082960 I would say that yes, the
invalidate_range op is missing here for the same reasons the CPU flush
was missing.

AFAIK if we are flushing the CPU tlb then we really must also flush
the CPU tlb that KVM controls, and that is primarily what
invalidate_range() is used for.

Which makes me wonder if the invalidate_range() hidden inside
invalidate_end() is a bad idea in general - when is this needed, and
when would it be correct? Isn't it better to put the invalidates near the TLB
invalidates and leave start/end as purely a bracketing API, which by
definition, cannot have an end that is 'too late'?

Jason


* Re: some likely bugs in IOMMUv2 (in tlb_finish_mmu() nested flush and mremap())
From: Sean Christopherson @ 2022-09-26 20:13 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Jann Horn, Will Deacon, Joerg Roedel, jean-philippe.brucker,
	Linux-MM, kernel list, iommu

On Fri, Sep 23, 2022, Jason Gunthorpe wrote:
> On Fri, Sep 23, 2022 at 05:38:12PM +0200, Jann Horn wrote:
> > Hi!
> > 
> > I looked through some of the code related to IOMMUv2 (the thing where
> > the IOMMU walks the normal userspace page tables and TLB shootdowns
> > are replicated to the IOMMU through
> > mmu_notifier_ops::invalidate_range).
> > 
> > I think there's a bug in the interaction between tlb_finish_mmu() and
> > mmu_notifier_ops::invalidate_range: In the mm_tlb_flush_nested() case,
> > __tlb_reset_range() sets tlb->start and tlb->end *both* to ~0.
> > Afterwards, tlb_finish_mmu() calls
> > tlb_flush_mmu()->tlb_flush_mmu_tlbonly()->mmu_notifier_invalidate_range(),
> > which will pass those tlb->start and tlb->end values to
> > mmu_notifier_ops::invalidate_range callbacks. But those callbacks
> > don't know about this special case and then basically only flush
> > virtual address ~0, making the flush useless. 
> 
> Yeah, that looks wrong to me, and it extends beyond just the iommu
> drivers: kvm_arch_mmu_notifier_invalidate_range() also does not handle
> this encoding.

FWIW, the bug is likely benign for KVM.  KVM does almost all of its TLB flushing
via invalidate_range_{start,end}(); the invalidate_range() hook is used only by
x86/VMX to react to a specific KVM-allocated page being migrated (the page is only
ever unmapped when the VM is dying).
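
For reference, that one hook is roughly this (paraphrased from
arch/x86/kvm/x86.c as of this thread, not verbatim):

  void kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
                                              unsigned long start,
                                              unsigned long end)
  {
          unsigned long apic_address;

          /*
           * The APIC-access page's host address is cached in the VMCS; ask
           * all vCPUs to reload it if the range covers it.  With the bogus
           * start == end == ~0 range this check simply never matches, which
           * is why the tlb_finish_mmu() bug is benign here.
           */
          apic_address = gfn_to_hva(kvm, APIC_DEFAULT_PHYS_BASE >> PAGE_SHIFT);
          if (start <= apic_address && apic_address < end)
                  kvm_make_all_cpus_request(kvm, KVM_REQ_APIC_PAGE_RELOAD);
  }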

> Most likely tlb_flush_mmu_tlbonly() needs to change it back to a 0..~0
> range? I wonder why it uses such an odd encoding in the first place?
> 
> Actually, maybe having mm_tlb_flush_nested() call __tlb_reset_range()
> to generate a 'flush all' request is just a bad idea, as we already
> had another bug in 7a30df49f63ad92 related to reset_range doing the
> wrong thing for a flush all action.
> 
> > (However, pretty much every place that calls tlb_finish_mmu() first
> > calls mmu_notifier_invalidate_range_end() even though the
> > appropriate thing would probably be
> > mmu_notifier_invalidate_range_only_end(); and I think those two
> > things probably cancel each other out?)
> 
> That does sound like double flushing to me, though as you note below,
> the invalidate_range() triggered by range_end() after the TLB
> flush/page freeing is functionally incorrect, so we cannot rely on it.
> 
> > Also, from what I can tell, the mremap() code, in move_page_tables(),
> > only invokes mmu_notifier_ops::invalidate_range via the
> > mmu_notifier_invalidate_range_end() at the very end, long after TLB
> > flushes must have happened - sort of like the bug we had years ago
> > where mremap() was flushing the normal TLBs too late
> > (https://bugs.chromium.org/p/project-zero/issues/detail?id=1695).
> 
> Based on the description of eb66ae03082960 I would say that yes the
> invalidate_range op is missing here for the same reasons the CPU flush
> was missing.
> 
> AFAIK if we are flushing the CPU tlb then we really must also flush
> the CPU tlb that KVM controls, and that is primarily what
> invalidate_range() is used for.

As above, for its actual secondary MMU, KVM invalidates and flushes at
invalidate_range_start(), and then prevents vCPUs from creating new entries for
the range until invalidate_range_end().

The VMX use case is for a physical address that is consumed by hardware without
going through the secondary page tables; using the start/end hooks would be slightly
annoying due to the need to stall the vCPU until end, and so KVM uses invalidate_range()
for that one specific case.

> Which makes me wonder if the invalidate_range() hidden inside
> invalidate_end() is a bad idea in general - when is this needed, and
> when would it be correct? Isn't it better to put the invalidates near the TLB
> invalidates and leave start/end as purely a bracketing API, which by
> definition, cannot have an end that is 'too late'?

Documentation/mm/mmu_notifier.rst explains this, although even that is quite subtle.
The argument is that if the change is purely to downgrade protections, then
deferring invalidate_range() is ok because the only requirement is that secondary
MMUs invalidate before the "end" of the sequence.

  When changing a pte to write protect or to point to a new write protected page  
  with same content (KSM) it is fine to delay the mmu_notifier_invalidate_range   
  call to mmu_notifier_invalidate_range_end() outside the page table lock. This   
  is true even if the thread doing the page table update is preempted right after 
  releasing page table lock but before call mmu_notifier_invalidate_range_end().

That said, I also dislike hiding invalidate_range() inside end(), I constantly
forget about that behavior.  To address that, what about renaming
mmu_notifier_invalidate_range_end() to make it more explicit, e.g.
mmu_notifier_invalidate_range_and_end().


* Re: some likely bugs in IOMMUv2 (in tlb_finish_mmu() nested flush and mremap())
From: Jason Gunthorpe @ 2022-09-26 23:32 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Jann Horn, Will Deacon, Joerg Roedel, jean-philippe.brucker,
	Linux-MM, kernel list, iommu

On Mon, Sep 26, 2022 at 08:13:00PM +0000, Sean Christopherson wrote:

> > AFAIK if we are flushing the CPU tlb then we really must also flush
> > the CPU tlb that KVM controls, and that is primarily what
> > invalidate_range() is used for.
> 
> As above, for its actual secondary MMU, KVM invalidates and flushes at
> invalidate_range_start(), and then prevents vCPUs from creating new entries for
> the range until invalidate_range_end().

Was it always like this? Why did we add this invalidate_range thing if
nothing really needed it?

That means iommu is really the only place using it as a proper
synchronous shadow TLB flush.
 
> > Which makes me wonder if the invalidate_range() hidden inside
> > invalidate_end() is a bad idea in general - when is this needed, and
> > when would it be correct? Isn't it better to put the invalidates near the TLB
> > invalidates and leave start/end as purely a bracketing API, which by
> > definition, cannot have an end that is 'too late'?
> 
> Documentation/mm/mmu_notifier.rst explains this, although even that is quite subtle.
> The argument is that if the change is purely to downgrade protections, then
> deferring invalidate_range() is ok because the only requirement is that secondary
> MMUs invalidate before the "end" of the sequence.
> 
>   When changing a pte to write protect or to point to a new write protected page  
>   with same content (KSM) it is fine to delay the mmu_notifier_invalidate_range   
>   call to mmu_notifier_invalidate_range_end() outside the page table lock. This   

And then if KVM never needed it, why on earth did we micro-optimize it
in such an obscure and opaque way?

>   is true even if the thread doing the page table update is preempted right after 
>   releasing page table lock but before call mmu_notifier_invalidate_range_end().

That feels like it is getting dangerously close to the CVE Jann pointed
at.. We have a write protected page, installed in the PTEs, PTLs
unlocked and other things can sense the PTE and see that it is write
protected - is it really true nothing acts on that - especially now
that DavidH has gone and changed all that logic?

IMHO if we had logic that required the CPU TLB to be flushed under a
certain lock I find it to be a very, very, difficult conceptual leap
that a shadow TLB is OK to flush later.  If the shadow TLB is OK then
lets move the CPU TLB out of the lock as well :)

> That said, I also dislike hiding invalidate_range() inside end(), I constantly
> forget about that behavior.  To address that, what about renaming
> mmu_notifier_invalidate_range_end() to make it more explicit, e.g.
> mmu_notifier_invalidate_range_and_end().

The name for the special case should really capture that hidden point
above 'invalidate_range_delayed_write_protect_end' or something else
long and horrible. Because it really is special, it really is only
allowed in that one special case (assuming the logic still holds) and
every other possible case should catch the invalidate through the tlb
flusher.

Jason


* Re: some likely bugs in IOMMUv2 (in tlb_finish_mmu() nested flush and mremap())
From: Sean Christopherson @ 2022-09-27  0:24 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Jann Horn, Will Deacon, Joerg Roedel, jean-philippe.brucker,
	Linux-MM, kernel list, iommu

On Mon, Sep 26, 2022, Jason Gunthorpe wrote:
> On Mon, Sep 26, 2022 at 08:13:00PM +0000, Sean Christopherson wrote:
> 
> > > AFAIK if we are flushing the CPU tlb then we really must also flush
> > > the CPU tlb that KVM controls, and that is primarily what
> > > invalidate_range() is used for.
> > 
> > As above, for its actual secondary MMU, KVM invalidates and flushes at
> > invalidate_range_start(), and then prevents vCPUs from creating new entries for
> > the range until invalidate_range_end().
> 
> Was it always like this? Why did we add this invalidate_range thing if
> nothing really needed it?

No, the invalidate_range() hook was added by commit 1897bdc4d331 ("mmu_notifier:
add mmu_notifier_invalidate_range()") for IOMMUs.  From that changelog, the key
issue is that stalling hardware while the start+end pair is ongoing runs afoul
of GPUs that don't play nice with re-faulting indefinitely.

    The page-fault handler in the AMD IOMMUv2 driver doesn't handle the fault
    if an invalidate_range_start/end pair is active, it just reports back
    SUCCESS to the device and let it refault the page.  But existing hardware
    (newer Radeon GPUs) that makes use of this feature don't re-fault
    indefinitly, after a certain number of faults for the same address the
    device enters a failure state and needs to be resetted.
    
    To avoid the GPUs entering a failure state we need to get rid of the
    empty-page-table workaround and use the mmu_notifier_invalidate_range()
    function introduced with this patch.

> That means iommu is really the only place using it as a proper
> synchronous shadow TLB flush.

More or less. There's also an "OpenCAPI coherent accelerator support" driver,
drivers/misc/ocxl, that appears to use invalidate_range() the same way the IOMMU does.
No idea how relevant that is these days.

I much prefer KVM's (and the old IOMMU's) approach of re-faulting in hardware until
the entire sequence completes.   It _might_ be less performant, but I find it so
much easier to reason about.  I actually had typed out a "can we just kill off
mmu_notifier_invalidate_range() and force users to refault hardware" question
before seeing the above changelog.

> > > Which makes me wonder if the invalidate_range() hidden inside
> > > invalidate_end() is a bad idea in general - when is this needed, and
> > > when would it be correct? Isn't it better to put the invalidates near the TLB
> > > invalidates and leave start/end as purely a bracketing API, which by
> > > definition, cannot have an end that is 'too late'?
> > 
> > Documentation/mm/mmu_notifier.rst explains this, although even that is quite subtle.
> > The argument is that if the change is purely to downgrade protections, then
> > deferring invalidate_range() is ok because the only requirement is that secondary
> > MMUs invalidate before the "end" of the sequence.
> > 
> >   When changing a pte to write protect or to point to a new write protected page  
> >   with same content (KSM) it is fine to delay the mmu_notifier_invalidate_range   
> >   call to mmu_notifier_invalidate_range_end() outside the page table lock. This   
> 
> And then if KVM never needed it why on earth did we micro-optimize it
> in such an obscure and opaque way?

I don't know.  I found the series that introduced the behavior[*], but there are
no numbers provided and I haven't been able to dredge up why this was even looked
into in the first place.  From the cover letter:

  It should improve performances but i am lacking hardware and benchmarks
  which might show an improvement. Maybe folks in cc can help here.

[*] https://lore.kernel.org/all/20171017031003.7481-1-jglisse@redhat.com

> >   is true even if the thread doing the page table update is preempted right after 
> >   releasing page table lock but before call mmu_notifier_invalidate_range_end().
> 
> That feels like it is getting dangerously close to the CVE Jann pointed
> at.. We have a write protected page, installed in the PTEs, PTLs
> unlocked and other things can sense the PTE and see that it is write
> protected - is it really true nothing acts on that - especially now
> that DavidH has gone and changed all that logic?
> 
> IMHO if we had logic that required the CPU TLB to be flushed under a
> certain lock I find it to be a very, very, difficult conceptual leap
> that a shadow TLB is OK to flush later.  If the shadow TLB is OK then
> lets move the CPU TLB out of the lock as well :)
> 
> > That said, I also dislike hiding invalidate_range() inside end(), I constantly
> > forget about that behavior.  To address that, what about renaming
> > mmu_notifier_invalidate_range_end() to make it more explicit, e.g.
> > mmu_notifier_invalidate_range_and_end().
> 
> The name for the special case should really capture that hidden point
> above 'invalidate_range_delayed_write_protect_end' or something else
> long and horrible. Because it really is special, it really is only
> allowed in that one special case (assuming the logic still holds) and
> every other possible case should catch the invalidate through the tlb
> flusher.

If I had a vote to cast, I would vote to always do invalidate_range() at the same
time the primary TLBs are flushed.  That seems completely logical and much harder
to screw up.  I might be a little biased though since KVM doesn't benefit from the
current shenanigans :-)


* Re: some likely bugs in IOMMUv2 (in tlb_finish_mmu() nested flush and mremap())
From: Jason Gunthorpe @ 2022-09-28 18:12 UTC (permalink / raw)
  To: Sean Christopherson, David Hildenbrand
  Cc: Jann Horn, Will Deacon, Joerg Roedel, jean-philippe.brucker,
	Linux-MM, kernel list, iommu

On Tue, Sep 27, 2022 at 12:24:41AM +0000, Sean Christopherson wrote:
> On Mon, Sep 26, 2022, Jason Gunthorpe wrote:
> > On Mon, Sep 26, 2022 at 08:13:00PM +0000, Sean Christopherson wrote:
> > 
> > > > AFAIK if we are flushing the CPU tlb then we really must also flush
> > > > the CPU tlb that KVM controls, and that is primarily what
> > > > invalidate_range() is used for.
> > > 
> > > As above, for its actual secondary MMU, KVM invalidates and flushes at
> > > invalidate_range_start(), and then prevents vCPUs from creating new entries for
> > > the range until invalidate_range_end().
> > 
> > Was it always like this? Why did we add this invalidate_range thing if
> > nothing really needed it?
> 
> No, the invalidate_range() hook was added by commit 1897bdc4d331 ("mmu_notifier:
> add mmu_notifier_invalidate_range()") for IOMMUs. 

Ah, right OK. This is specifically because the iommu is sharing the
*exact* page table of the CPU so the trick KVM/etc uses where 'start'
makes the shadow PTE non-present and then delays the fault until end
completes cannot work here.

>     The page-fault handler in the AMD IOMMUv2 driver doesn't handle the fault
>     if an invalidate_range_start/end pair is active, it just reports back
>     SUCCESS to the device and let it refault the page.  

Yah, this algorithm just doesn't really work, IMHO.. So it makes sense
we have invalidate_range as Joerg originally created it. Though the
GPU is still busted IMHO: there is no guarantee of forward progress
after some number of iterations; it is just much more likely if the
non-present window is as narrow as possible.

So, then we can see where the end_only thing came from: commit
0f10851ea475 ("mm/mmu_notifier: avoid double notification when it is
useless"). That long-winded message explains why some of the cases
must be ordered in the same place as the CPU flush, but it doesn't
explain very much about why it is OK to push the rest until after,
beyond saying "ksm is OK".

Looking at some of the places where 0f10851ea475 removed the notifies,
they seem pretty pointless.

- fs/dax.c
  This never needed notify in the first place, it is populating a
  non-present PTE because it just faulted.

- __split_huge_zero_page_pmd()
  Sure, maybe, but who cares? The real fix here was changing
  __split_huge_pmd() to use only_end() because all paths already
  call invalidate_range

- copy_hugetlb_page_range()
  Sure, there is no CPU tlb flush.

  The CPU tlb flush on this path is in flush_tlb_mm() called by
  dup_mmap().

  The right thing to do is to ensure flush_tlb_mm() calls
  invalidate_range and skip it here. But the reasoning is not some
  "we are downgrading protections blah blah", the logic is that the
  CPU TLB flush can be delayed/consolidated so we can delay the
  shadow TLB flush too.

  (And why does copy_hugetlb_page_range use MMU_NOTIFY_CLEAR but 
   copy_p4d_range is bounded by MMU_NOTIFY_PROTECTION_PAGE ??)

- hugetlb_change_protection()
  Again I feel like the sensible thing here is to trigger the shadow
  flush in flush_hugetlb_tlb_range() always and use end_only

.. and so on ..

So, IMHO, we need to rewrite what 0f10851ea475 was trying to do in
order to fix the bug Jann noticed :\  That is bigger than I can knock
off while I write this email though ..

> > That means iommu is really the only place using it as a proper
> > synchronous shadow TLB flush.
> 
> More or less. There's also an "OpenCAPI coherent accelerator
> support" driver, drivers/misc/ocxl, that appears use
> invalidate_range() the same way the IOMMU does.  No idea how
> relevant that is these days.

Yeah, OpenCAPI is the same stuff as the IOMMU. Just PPC got away with
building all their IOMMU layer in its own arch specific subsystem :|

> I much prefer KVM's (and the old IOMMU's) approach of re-faulting in hardware until
> the entire sequence completes.   It _might_ be less performant, but I find it so
> much easier to reason about.  I actually had typed out a "can we just kill off
> mmu_notifier_invalidate_range() and force users to refault hardware" question
> before seeing the above changelog.

The key thing this requires is the ability to put the hardware into
fault mode (non-present), for the range under invalidation. If you
can't do that, then you can't use it.

> I don't know.  I found the series that introduced the behavior[*], but there are
> no numbers provided and I haven't been able to dredge up why this was even looked
> into in the first place.  From the cover letter:

It looks like a 'by inspection' project..

> If I had a vote to cast, I would vote to always do invalidate_range() at the same
> time the primary TLBs are flushed.  That seems completely logical and much harder
> to screw up.  I might be a little biased though since KVM doesn't benefit from the
> current shenanigans :-)

Me too.

Jason


* Re: some likely bugs in IOMMUv2 (in tlb_finish_mmu() nested flush and mremap())
From: John Hubbard @ 2022-09-30  2:31 UTC (permalink / raw)
  To: Jason Gunthorpe, Sean Christopherson, David Hildenbrand
  Cc: Jann Horn, Will Deacon, Joerg Roedel, jean-philippe.brucker,
	Linux-MM, kernel list, iommu

On 9/28/22 11:12, Jason Gunthorpe wrote:
> On Tue, Sep 27, 2022 at 12:24:41AM +0000, Sean Christopherson wrote:
>> On Mon, Sep 26, 2022, Jason Gunthorpe wrote:
>>> On Mon, Sep 26, 2022 at 08:13:00PM +0000, Sean Christopherson wrote:
>>>
>>>>> AFAIK if we are flushing the CPU tlb then we really must also flush
>>>>> the CPU tlb that KVM controls, and that is primarily what
>>>>> invalidate_range() is used for.
>>>>
>>>> As above, for its actual secondary MMU, KVM invalidates and flushes at
>>>> invalidate_range_start(), and then prevents vCPUs from creating new entries for
>>>> the range until invalidate_range_end().
>>>
>>> Was it always like this? Why did we add this invalidate_range thing if
>>> nothing really needed it?
>>
>> No, the invalidate_range() hook was added by commit 1897bdc4d331 ("mmu_notifier:
>> add mmu_notifier_invalidate_range()") for IOMMUs.
> 
> Ah, right OK. This is specifically because the iommu is sharing the
> *exact* page table of the CPU so the trick KVM/etc uses where 'start'
> makes the shadow PTE non-present and then delays the fault until end
> completes cannot work here.

ohhh, is this trick something I should read more about, if I'm about to
jump in here?

> 
>>      The page-fault handler in the AMD IOMMUv2 driver doesn't handle the fault
>>      if an invalidate_range_start/end pair is active, it just reports back
>>      SUCCESS to the device and let it refault the page.
> 
> Yah, this algorithm just doesn't really work, IMHO.. So it makes sense
> we have invalidate_range as Joerg originally created it. Though the
> GPU is still busted IMHO, there is no guarantee of forward progress
> after some number of iterations, it is just much more likely if the
> non-present is as narrow as possible.
> 
> So, then we can see where the end_only thing came from, commit
> 0f10851ea475 ("mm/mmu_notifier: avoid double notification when it is
> useless") and that long winded message explains why some of the cases

I seem to recall that there was a performance drop involving GPUs, due
to the double notification. Just to fill in a little bit of history as
to why Jerome was trying to deduplicate the notifier callbacks.

> must be ordered in the same place as the CPU flush, but doesn't
> explain very much why it is OK to push it after beyond saying "ksm is
> OK"
> 
> Looking at some of the places where 0f10851ea475 removed the notifies
> they seem pretty pointless.
> 
> - fs/dax.c
>    This never needed notify in the first place, it is populating a
>    non-present PTE because it just faulted.
> 
> - __split_huge_zero_page_pmd()
>    Sure, maybe, but who cares? The real fix here was changing
>    __split_huge_pmd() to use only_end() because all paths already
>    call invalidate_range
> 
> - copy_hugetlb_page_range()
>    Sure, there is no CPU tlb flush.
> 
>    The CPU tlb flush on this path is in flush_tlb_mm() called by
>    dup_mmap().
> 
>    The right thing to do is to ensure flush_tlb_mm() calls
>    invalidate_range and skip it here. But the reasoning is not some
>    "we are downgrading protections blah blah", the logic is that the
>    CPU TLB flush can be delayed/consolidated so we can delay the
>    shadow TLB flush too.
> 
>    (And why does copy_hugetlb_page_range use MMU_NOTIFY_CLEAR but
>     copy_p4d_range is bounded by MMU_NOTIFY_PROTECTION_PAGE ??)
> 
> - hugetlb_change_protection()
>    Again I feel like the sensible thing here is to trigger the shadow
>    flush in flush_hugetlb_tlb_range() always and use end_only
> 
> .. and so on ..
> 
> So, IMHO, we need to rewrite what 0f10851ea475 was trying to do in
> order to fix the bug Jann noticed :\  That is bigger than I can knock
> off while I write this email though ..

After an initial pass through this, with perhaps 80% understanding
of the story, I'm reading that as:

     Audit all the sites (which you initially did quickly, above)
     that 0f10851ea475 touched, and any other related ones, and
     change things so that invalidate_range() and primary TLB
     flushing happen at the same point(s).

Yes? Anything else?

thanks,
-- 
John Hubbard
NVIDIA

> 
>>> That means iommu is really the only place using it as a proper
>>> synchronous shadow TLB flush.
>>
>> More or less. There's also an "OpenCAPI coherent accelerator
>> support" driver, drivers/misc/ocxl, that appears use
>> invalidate_range() the same way the IOMMU does.  No idea how
>> relevant that is these days.
> 
> Yeah, OpenCAPI is the same stuff as the IOMMU. Just PPC got away with
> building all their IOMMU layer in its own arch specific subsystem :|
> 
>> I much prefer KVM's (and the old IOMMU's) approach of re-faulting in hardware until
>> the entire sequence completes.   It _might_ be less performant, but I find it so
>> much easier to reason about.  I actually had typed out a "can we just kill off
>> mmu_notifier_invalidate_range() and force users to refault hardware" question
>> before seeing the above changelog.
> 
> The key thing this requires is the ability to put the hardware into
> fault mode (non-present), for the range under invalidation. If you
> can't do that, then you can't use it.
> 
>> I don't know.  I found the series that introduced the behavior[*], but there are
>> no numbers provided and I haven't been able to dredge up why this was even looked
>> into in the first place.  From the cover letter:
> 
> It looks like a 'by inspection' project..
> 
>> If I had a vote to cast, I would vote to always do invalidate_range() at the same
>> time the primary TLBs are flushed.  That seems completely logical and much harder
>> to screw up.  I might be a little biased though since KVM doesn't benefit from the
>> current shenanigans :-)
> 
> Me too.
> 
> Jason
> 




* Re: some likely bugs in IOMMUv2 (in tlb_finish_mmu() nested flush and mremap())
From: Jason Gunthorpe @ 2022-09-30 12:01 UTC (permalink / raw)
  To: John Hubbard
  Cc: Sean Christopherson, David Hildenbrand, Jann Horn, Will Deacon,
	Joerg Roedel, jean-philippe.brucker, Linux-MM, kernel list,
	iommu

On Thu, Sep 29, 2022 at 07:31:03PM -0700, John Hubbard wrote:

> > Ah, right OK. This is specifically because the iommu is sharing the
> > *exact* page table of the CPU so the trick KVM/etc uses where 'start'
> > makes the shadow PTE non-present and then delays the fault until end
> > completes cannot work here.
> 
> ohhh, is this trick something I should read more about, if I'm about to
> jump in here?

All the invalidate_start implementations have a parallel page table
structure that they copy into from the main page table, eg using
hmm_range_fault() or whatever kvm does. Thus we can create a situation
where an sPTE is non-present while the PTE is present.

The iommu/etc point the HW at exactly the same page table as the mm,
so we cannot create a situation where an sPTE is non-present while the
PTE is present.

Thus, IMHO, the only correct answer is to flush the shadow TLB at
exactly the same moments we flush the CPU TLB because the two TLBs are
processing exactly the same data structure. If there is logic that the
TLB flush can be delayed to optimize it then that logic applies
equally to both TLBs.

> > So, then we can see where the end_only thing came from, commit
> > 0f10851ea475 ("mm/mmu_notifier: avoid double notification when it is
> > useless") and that long winded message explains why some of the cases
> 
> I seem to recall that there was a performance drop involving GPUs, due
> to the double notification. Just to fill in a little bit of history as
> to why Jerome was trying to deduplicate the notifier callbacks.

Sure, the double notification is clearly undesired, but the right way
to fix that was to remove the call to invalidate_range() from
mn_hlist_invalidate_end(), which means adding any missing calls to
invalidate_range() near the CPU TLB flush.
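
The hidden call is this bit, roughly (paraphrased from mm/mmu_notifier.c,
srcu locking and details elided, not verbatim):

  static void mn_hlist_invalidate_end(struct mmu_notifier_subscriptions *subscriptions,
                                      struct mmu_notifier_range *range, bool only_end)
  {
          struct mmu_notifier *subscription;

          hlist_for_each_entry_rcu(subscription, &subscriptions->list, hlist) {
                  /*
                   * This is the extra flush that range_end() does and that
                   * range_only_end() skips:
                   */
                  if (!only_end && subscription->ops->invalidate_range)
                          subscription->ops->invalidate_range(subscription, range->mm,
                                                              range->start, range->end);
                  if (subscription->ops->invalidate_range_end)
                          subscription->ops->invalidate_range_end(subscription, range);
          }
  }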

However, as Sean suggested, GPU drivers should not use
invalidate_range() because they are not directly sharing the CPU page
table in the first place. They should use start/end. Maybe the docs
need clarification on this point as well.

> After an initial pass through this, with perhaps 80% understanding
> of the story, I'm reading that as:
> 
>     Audit all the sites (which you initially did quickly, above)
>     that 0f10851ea475 touched, and any other related ones, and
>     change things so that invalidate_range() and primary TLB
>     flushing happen at the same point(s).
> 
> Yes? Anything else?

I would structure it as

1) Make a patch fixing the documentation around all of the
   mmu_notifier_invalidate_range_only_end() call sites to explain where
   the invalidate_range() call is (eg where the CPU TLB flush is) or why
   the CPU TLB flush is not actually needed. 0f10851ea475 is a good
   guide for where to touch.

   Remove all the confusing documentation about write protect and
   'promotion'. The new logic is that we call invalidate_range() when we
   flush the CPU TLB, and we might do that outside the start/end
   block.

2) Make patch(es) converting all the places calling
   mmu_notifier_invalidate_range_end() to only_end() by identifying
   where the CPU TLB flush is and ensuring the invalidate_range() is
   present at the CPU TLB flush point.

   eg all the range_end() calls on the fork() path are switched to
   only_end() and an invalidate_range() put outside the start/end
   block in dup_mmap() near the TLB flush (see the rough sketch after
   this list). This is even more optimal because we batch flushing the
   entire shadow TLB in one shot, instead of trying to invalidate
   every VMA range.

3) Fix the TLB flusher to not send the -1/-1 range in the corner case
   Jann noticed in the first mail.
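
As a rough sketch of what 2) means on the fork() path (untested, with the
surrounding code elided; the exact placement would need care):

          /*
           * kernel/fork.c:dup_mmap() - copy_page_range() and friends switch
           * to mmu_notifier_invalidate_range_only_end(), and the one batched
           * shadow-TLB flush sits right next to the CPU TLB flush:
           */
          flush_tlb_mm(oldmm);
          mmu_notifier_invalidate_range(oldmm, 0, ~0UL);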

Jason

