All of lore.kernel.org
 help / color / mirror / Atom feed
From: Jason Gunthorpe <jgg@nvidia.com>
To: Robin Murphy <robin.murphy@arm.com>
Cc: Alistair Popple <apopple@nvidia.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	will@kernel.org, catalin.marinas@arm.com, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org, nicolinc@nvidia.com,
	linux-arm-kernel@lists.infradead.org, kvm@vger.kernel.org,
	John Hubbard <jhubbard@nvidia.com>,
	zhi.wang.linux@gmail.com, Sean Christopherson <seanjc@google.com>
Subject: Re: [PATCH 2/2] arm64: Notify on pte permission upgrades
Date: Tue, 30 May 2023 11:06:02 -0300	[thread overview]
Message-ID: <ZHYCygONW53/Byp3@nvidia.com> (raw)
In-Reply-To: <89dba89c-cb49-f917-31e4-3eafd484f4b2@arm.com>

On Tue, May 30, 2023 at 02:44:11PM +0100, Robin Murphy wrote:
> On 30/05/2023 1:52 pm, Jason Gunthorpe wrote:
> > On Tue, May 30, 2023 at 01:14:41PM +0100, Robin Murphy wrote:
> > > On 2023-05-30 12:54, Jason Gunthorpe wrote:
> > > > On Tue, May 30, 2023 at 06:05:41PM +1000, Alistair Popple wrote:
> > > > > 
> > > > > > > As no notification is sent and the SMMU does not snoop TLB invalidates
> > > > > > > it will continue to return read-only entries to a device even though
> > > > > > > the CPU page table contains a writable entry. This leads to a
> > > > > > > continually faulting device and no way of handling the fault.
> > > > > > 
> > > > > > Doesn't the fault generate a PRI/etc? If we get a PRI maybe we should
> > > > > > just have the iommu driver push an iotlb invalidation command before
> > > > > > it acks it? PRI is already really slow so I'm not sure a pipelined
> > > > > > invalidation is going to be a problem? Does the SMMU architecture
> > > > > > permit negative caching which would suggest we need it anyhow?
> > > > > 
> > > > > Yes, SMMU architecture (which matches the ARM architecture in regards to
> > > > > TLB maintenance requirements) permits negative caching of some mapping
> > > > > attributes including the read-only attribute. Hence without the flushing
> > > > > we fault continuously.
> > > > 
> > > > Sounds like a straight up SMMU bug, invalidate the cache after
> > > > resolving the PRI event.
> > > 
> > > No, if the IOPF handler calls back into the mm layer to resolve the fault,
> > > and the mm layer issues an invalidation in the process of that which isn't
> > > propagated back to the SMMU (as it would be if BTM were in use), logically
> > > that's the mm layer's failing. The SMMU driver shouldn't have to issue extra
> > > mostly-redundant invalidations just because different CPU architectures have
> > > different idiosyncracies around caching of permissions.
> > 
> > The mm has a definition for invalidate_range that does not include all
> > the invalidation points SMMU needs. This is difficult to sort out
> > because this is general purpose cross arch stuff.
> > 
> > You are right that this is worth optimizing, but right now we have a
> > -rc bug that needs fixing and adding and extra SMMU invalidation is a
> > straightforward -rc friendly way to address it.
> 
> Sure; to clarify, I'm not against the overall idea of putting a hack in the
> SMMU driver with a big comment that it is a hack to work around missing
> notifications under SVA, but it would not constitute an "SMMU bug" to not do
> that. SMMU is just another VMSAv8-compatible MMU - if, say, KVM or some
> other arm64 hypervisor driver wanted to do something funky with notifiers to
> shadow stage 1 permissions for some reason, it would presumably be equally
> affected.

Okay, Alistair can you make this?

> FWIW, the VT-d spec seems to suggest that invalidation on RO->RW is only
> optional if the requester supports recoverable page faults, so although
> there's no use-case for non-PRI-based SVA at the moment, there is some
> potential argument that the notifier issue generalises even to x86.

IMHO I think we messed this up at some point..

Joerg added invalidate_range just for the iommu to use, so having it
be arch specific could make some sense.

However, KVM later co-opted it to do this:

commit e649b3f0188f8fd34dd0dde8d43fd3312b902fb2
Author: Eiichi Tsukata <eiichi.tsukata@nutanix.com>
Date:   Sat Jun 6 13:26:27 2020 +0900

    KVM: x86: Fix APIC page invalidation race
    
    Commit b1394e745b94 ("KVM: x86: fix APIC page invalidation") tried
    to fix inappropriate APIC page invalidation by re-introducing arch
    specific kvm_arch_mmu_notifier_invalidate_range() and calling it from
    kvm_mmu_notifier_invalidate_range_start. However, the patch left a
    possible race where the VMCS APIC address cache is updated *before*
    it is unmapped:
    
      (Invalidator) kvm_mmu_notifier_invalidate_range_start()
      (Invalidator) kvm_make_all_cpus_request(kvm, KVM_REQ_APIC_PAGE_RELOAD)
      (KVM VCPU) vcpu_enter_guest()
      (KVM VCPU) kvm_vcpu_reload_apic_access_page()
      (Invalidator) actually unmap page
    
    Because of the above race, there can be a mismatch between the
    host physical address stored in the APIC_ACCESS_PAGE VMCS field and
    the host physical address stored in the EPT entry for the APIC GPA
    (0xfee0000).  When this happens, the processor will not trap APIC
    accesses, and will instead show the raw contents of the APIC-access page.
    Because Windows OS periodically checks for unexpected modifications to
    the LAPIC register, this will show up as a BSOD crash with BugCheck
    CRITICAL_STRUCTURE_CORRUPTION (109) we are currently seeing in
    https://bugzilla.redhat.com/show_bug.cgi?id=1751017.
    
    The root cause of the issue is that kvm_arch_mmu_notifier_invalidate_range()
    cannot guarantee that no additional references are taken to the pages in
    the range before kvm_mmu_notifier_invalidate_range_end().  Fortunately,
    this case is supported by the MMU notifier API, as documented in
    include/linux/mmu_notifier.h:
    
             * If the subsystem
             * can't guarantee that no additional references are taken to
             * the pages in the range, it has to implement the
             * invalidate_range() notifier to remove any references taken
             * after invalidate_range_start().
    
    The fix therefore is to reload the APIC-access page field in the VMCS
    from kvm_mmu_notifier_invalidate_range() instead of ..._range_start().

Which I think is a hacky fix.

KVM already has locking for invalidate_start/end - it has to check
mmu_notifier_retry_cache() with the sequence numbers/etc around when
it does does hva_to_pfn()

The bug is that the kvm_vcpu_reload_apic_access_page() path is
ignoring this locking so it ignores in-progress range
invalidations. It should spin until the invalidation clears like other
places in KVM.

The comment is kind of misleading because drivers shouldn't be abusing
the iommu centric invalidate_range() thing to fix missing locking in
start/end users. :\

So if KVM could be fixed up we could make invalidate_range defined to
be an arch specific callback to synchronize the iommu TLB.

Sean?

Jason

WARNING: multiple messages have this Message-ID (diff)
From: Jason Gunthorpe <jgg@nvidia.com>
To: Robin Murphy <robin.murphy@arm.com>
Cc: Alistair Popple <apopple@nvidia.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	will@kernel.org, catalin.marinas@arm.com, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org, nicolinc@nvidia.com,
	linux-arm-kernel@lists.infradead.org, kvm@vger.kernel.org,
	John Hubbard <jhubbard@nvidia.com>,
	zhi.wang.linux@gmail.com, Sean Christopherson <seanjc@google.com>
Subject: Re: [PATCH 2/2] arm64: Notify on pte permission upgrades
Date: Tue, 30 May 2023 11:06:02 -0300	[thread overview]
Message-ID: <ZHYCygONW53/Byp3@nvidia.com> (raw)
In-Reply-To: <89dba89c-cb49-f917-31e4-3eafd484f4b2@arm.com>

On Tue, May 30, 2023 at 02:44:11PM +0100, Robin Murphy wrote:
> On 30/05/2023 1:52 pm, Jason Gunthorpe wrote:
> > On Tue, May 30, 2023 at 01:14:41PM +0100, Robin Murphy wrote:
> > > On 2023-05-30 12:54, Jason Gunthorpe wrote:
> > > > On Tue, May 30, 2023 at 06:05:41PM +1000, Alistair Popple wrote:
> > > > > 
> > > > > > > As no notification is sent and the SMMU does not snoop TLB invalidates
> > > > > > > it will continue to return read-only entries to a device even though
> > > > > > > the CPU page table contains a writable entry. This leads to a
> > > > > > > continually faulting device and no way of handling the fault.
> > > > > > 
> > > > > > Doesn't the fault generate a PRI/etc? If we get a PRI maybe we should
> > > > > > just have the iommu driver push an iotlb invalidation command before
> > > > > > it acks it? PRI is already really slow so I'm not sure a pipelined
> > > > > > invalidation is going to be a problem? Does the SMMU architecture
> > > > > > permit negative caching which would suggest we need it anyhow?
> > > > > 
> > > > > Yes, SMMU architecture (which matches the ARM architecture in regards to
> > > > > TLB maintenance requirements) permits negative caching of some mapping
> > > > > attributes including the read-only attribute. Hence without the flushing
> > > > > we fault continuously.
> > > > 
> > > > Sounds like a straight up SMMU bug, invalidate the cache after
> > > > resolving the PRI event.
> > > 
> > > No, if the IOPF handler calls back into the mm layer to resolve the fault,
> > > and the mm layer issues an invalidation in the process of that which isn't
> > > propagated back to the SMMU (as it would be if BTM were in use), logically
> > > that's the mm layer's failing. The SMMU driver shouldn't have to issue extra
> > > mostly-redundant invalidations just because different CPU architectures have
> > > different idiosyncracies around caching of permissions.
> > 
> > The mm has a definition for invalidate_range that does not include all
> > the invalidation points SMMU needs. This is difficult to sort out
> > because this is general purpose cross arch stuff.
> > 
> > You are right that this is worth optimizing, but right now we have a
> > -rc bug that needs fixing and adding and extra SMMU invalidation is a
> > straightforward -rc friendly way to address it.
> 
> Sure; to clarify, I'm not against the overall idea of putting a hack in the
> SMMU driver with a big comment that it is a hack to work around missing
> notifications under SVA, but it would not constitute an "SMMU bug" to not do
> that. SMMU is just another VMSAv8-compatible MMU - if, say, KVM or some
> other arm64 hypervisor driver wanted to do something funky with notifiers to
> shadow stage 1 permissions for some reason, it would presumably be equally
> affected.

Okay, Alistair can you make this?

> FWIW, the VT-d spec seems to suggest that invalidation on RO->RW is only
> optional if the requester supports recoverable page faults, so although
> there's no use-case for non-PRI-based SVA at the moment, there is some
> potential argument that the notifier issue generalises even to x86.

IMHO I think we messed this up at some point..

Joerg added invalidate_range just for the iommu to use, so having it
be arch specific could make some sense.

However, KVM later co-opted it to do this:

commit e649b3f0188f8fd34dd0dde8d43fd3312b902fb2
Author: Eiichi Tsukata <eiichi.tsukata@nutanix.com>
Date:   Sat Jun 6 13:26:27 2020 +0900

    KVM: x86: Fix APIC page invalidation race
    
    Commit b1394e745b94 ("KVM: x86: fix APIC page invalidation") tried
    to fix inappropriate APIC page invalidation by re-introducing arch
    specific kvm_arch_mmu_notifier_invalidate_range() and calling it from
    kvm_mmu_notifier_invalidate_range_start. However, the patch left a
    possible race where the VMCS APIC address cache is updated *before*
    it is unmapped:
    
      (Invalidator) kvm_mmu_notifier_invalidate_range_start()
      (Invalidator) kvm_make_all_cpus_request(kvm, KVM_REQ_APIC_PAGE_RELOAD)
      (KVM VCPU) vcpu_enter_guest()
      (KVM VCPU) kvm_vcpu_reload_apic_access_page()
      (Invalidator) actually unmap page
    
    Because of the above race, there can be a mismatch between the
    host physical address stored in the APIC_ACCESS_PAGE VMCS field and
    the host physical address stored in the EPT entry for the APIC GPA
    (0xfee0000).  When this happens, the processor will not trap APIC
    accesses, and will instead show the raw contents of the APIC-access page.
    Because Windows OS periodically checks for unexpected modifications to
    the LAPIC register, this will show up as a BSOD crash with BugCheck
    CRITICAL_STRUCTURE_CORRUPTION (109) we are currently seeing in
    https://bugzilla.redhat.com/show_bug.cgi?id=1751017.
    
    The root cause of the issue is that kvm_arch_mmu_notifier_invalidate_range()
    cannot guarantee that no additional references are taken to the pages in
    the range before kvm_mmu_notifier_invalidate_range_end().  Fortunately,
    this case is supported by the MMU notifier API, as documented in
    include/linux/mmu_notifier.h:
    
             * If the subsystem
             * can't guarantee that no additional references are taken to
             * the pages in the range, it has to implement the
             * invalidate_range() notifier to remove any references taken
             * after invalidate_range_start().
    
    The fix therefore is to reload the APIC-access page field in the VMCS
    from kvm_mmu_notifier_invalidate_range() instead of ..._range_start().

Which I think is a hacky fix.

KVM already has locking for invalidate_start/end - it has to check
mmu_notifier_retry_cache() with the sequence numbers/etc around when
it does does hva_to_pfn()

The bug is that the kvm_vcpu_reload_apic_access_page() path is
ignoring this locking so it ignores in-progress range
invalidations. It should spin until the invalidation clears like other
places in KVM.

The comment is kind of misleading because drivers shouldn't be abusing
the iommu centric invalidate_range() thing to fix missing locking in
start/end users. :\

So if KVM could be fixed up we could make invalidate_range defined to
be an arch specific callback to synchronize the iommu TLB.

Sean?

Jason

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

  reply	other threads:[~2023-05-30 14:06 UTC|newest]

Thread overview: 46+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-05-24  1:47 [PATCH 1/2] mmu_notifiers: Restore documentation for .invalidate_range() Alistair Popple
2023-05-24  1:47 ` Alistair Popple
2023-05-24  1:47 ` [PATCH 2/2] arm64: Notify on pte permission upgrades Alistair Popple
2023-05-24  1:47   ` Alistair Popple
2023-05-28  0:02   ` Jason Gunthorpe
2023-05-28  0:02     ` Jason Gunthorpe
2023-05-30  8:05     ` Alistair Popple
2023-05-30  8:05       ` Alistair Popple
2023-05-30 11:54       ` Jason Gunthorpe
2023-05-30 11:54         ` Jason Gunthorpe
2023-05-30 12:14         ` Robin Murphy
2023-05-30 12:14           ` Robin Murphy
2023-05-30 12:52           ` Jason Gunthorpe
2023-05-30 12:52             ` Jason Gunthorpe
2023-05-30 13:44             ` Robin Murphy
2023-05-30 13:44               ` Robin Murphy
2023-05-30 14:06               ` Jason Gunthorpe [this message]
2023-05-30 14:06                 ` Jason Gunthorpe
2023-05-30 21:44                 ` Sean Christopherson
2023-05-30 21:44                   ` Sean Christopherson
2023-05-30 23:08                   ` Jason Gunthorpe
2023-05-30 23:08                     ` Jason Gunthorpe
2023-05-31  0:30                     ` Alistair Popple
2023-05-31  0:30                       ` Alistair Popple
2023-05-31  0:32                       ` Jason Gunthorpe
2023-05-31  0:32                         ` Jason Gunthorpe
2023-05-31  2:46                         ` Alistair Popple
2023-05-31  2:46                           ` Alistair Popple
2023-05-31 15:30                           ` Jason Gunthorpe
2023-05-31 15:30                             ` Jason Gunthorpe
2023-05-31 23:56                             ` Alistair Popple
2023-05-31 23:56                               ` Alistair Popple
     [not found]                       ` <31cdd164783fefad4c9ef4a6d33c1e0094405d0f03added523a82dd9febdf15f@mu.id>
2023-06-09  2:06                         ` Alistair Popple
2023-06-09  2:06                           ` Alistair Popple
2023-06-09  6:05                           ` Alistair Popple
2023-06-09  6:05                             ` Alistair Popple
2023-05-24  2:20 ` [PATCH 1/2] mmu_notifiers: Restore documentation for .invalidate_range() John Hubbard
2023-05-24  2:20   ` John Hubbard
2023-05-24  4:45   ` Alistair Popple
2023-05-24  4:45     ` Alistair Popple
2023-05-27 23:56   ` Jason Gunthorpe
2023-05-27 23:56     ` Jason Gunthorpe
2023-05-24  3:48 ` Zhi Wang
2023-05-24  3:48   ` Zhi Wang
2023-05-24  4:57   ` Alistair Popple
2023-05-24  4:57     ` Alistair Popple

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=ZHYCygONW53/Byp3@nvidia.com \
    --to=jgg@nvidia.com \
    --cc=akpm@linux-foundation.org \
    --cc=apopple@nvidia.com \
    --cc=catalin.marinas@arm.com \
    --cc=jhubbard@nvidia.com \
    --cc=kvm@vger.kernel.org \
    --cc=linux-arm-kernel@lists.infradead.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=nicolinc@nvidia.com \
    --cc=robin.murphy@arm.com \
    --cc=seanjc@google.com \
    --cc=will@kernel.org \
    --cc=zhi.wang.linux@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.