Re: [PATCH 2/2] arm64: Notify on pte permission upgrades

From: Jason Gunthorpe <jgg@nvidia.com>
To: Robin Murphy <robin.murphy@arm.com>
Cc: Alistair Popple <apopple@nvidia.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	will@kernel.org, catalin.marinas@arm.com, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org, nicolinc@nvidia.com,
	linux-arm-kernel@lists.infradead.org, kvm@vger.kernel.org,
	John Hubbard <jhubbard@nvidia.com>,
	zhi.wang.linux@gmail.com, Sean Christopherson <seanjc@google.com>
Subject: Re: [PATCH 2/2] arm64: Notify on pte permission upgrades
Date: Tue, 30 May 2023 11:06:02 -0300
Message-ID: <ZHYCygONW53/Byp3@nvidia.com>
In-Reply-To: <89dba89c-cb49-f917-31e4-3eafd484f4b2@arm.com>

On Tue, May 30, 2023 at 02:44:11PM +0100, Robin Murphy wrote:
> On 30/05/2023 1:52 pm, Jason Gunthorpe wrote:
> > On Tue, May 30, 2023 at 01:14:41PM +0100, Robin Murphy wrote:
> > > On 2023-05-30 12:54, Jason Gunthorpe wrote:
> > > > On Tue, May 30, 2023 at 06:05:41PM +1000, Alistair Popple wrote:
> > > > > 
> > > > > > > As no notification is sent and the SMMU does not snoop TLB invalidates
> > > > > > > it will continue to return read-only entries to a device even though
> > > > > > > the CPU page table contains a writable entry. This leads to a
> > > > > > > continually faulting device and no way of handling the fault.
> > > > > > 
> > > > > > Doesn't the fault generate a PRI/etc? If we get a PRI maybe we should
> > > > > > just have the iommu driver push an iotlb invalidation command before
> > > > > > it acks it? PRI is already really slow so I'm not sure a pipelined
> > > > > > invalidation is going to be a problem? Does the SMMU architecture
> > > > > > permit negative caching which would suggest we need it anyhow?
> > > > > 
> > > > > Yes, SMMU architecture (which matches the ARM architecture in regards to
> > > > > TLB maintenance requirements) permits negative caching of some mapping
> > > > > attributes including the read-only attribute. Hence without the flushing
> > > > > we fault continuously.
> > > > 
> > > > Sounds like a straight up SMMU bug, invalidate the cache after
> > > > resolving the PRI event.
> > > 
> > > No, if the IOPF handler calls back into the mm layer to resolve the fault,
> > > and the mm layer issues an invalidation in the process of that which isn't
> > > propagated back to the SMMU (as it would be if BTM were in use), logically
> > > that's the mm layer's failing. The SMMU driver shouldn't have to issue extra
> > > mostly-redundant invalidations just because different CPU architectures have
> > > different idiosyncrasies around caching of permissions.
> > 
> > The mm has a definition for invalidate_range that does not include all
> > the invalidation points SMMU needs. This is difficult to sort out
> > because this is general purpose cross arch stuff.
> > 
> > You are right that this is worth optimizing, but right now we have a
> > -rc bug that needs fixing and adding an extra SMMU invalidation is a
> > straightforward -rc friendly way to address it.
> 
> Sure; to clarify, I'm not against the overall idea of putting a hack in the
> SMMU driver with a big comment that it is a hack to work around missing
> notifications under SVA, but it would not constitute an "SMMU bug" to not do
> that. SMMU is just another VMSAv8-compatible MMU - if, say, KVM or some
> other arm64 hypervisor driver wanted to do something funky with notifiers to
> shadow stage 1 permissions for some reason, it would presumably be equally
> affected.

Okay, Alistair can you make this?

> FWIW, the VT-d spec seems to suggest that invalidation on RO->RW is only
> optional if the requester supports recoverable page faults, so although
> there's no use-case for non-PRI-based SVA at the moment, there is some
> potential argument that the notifier issue generalises even to x86.

IMHO we messed this up at some point.

Joerg added invalidate_range just for the iommu to use, so having it
be arch specific could make some sense.

However, KVM later co-opted it to do this:

commit e649b3f0188f8fd34dd0dde8d43fd3312b902fb2
Author: Eiichi Tsukata <eiichi.tsukata@nutanix.com>
Date:   Sat Jun 6 13:26:27 2020 +0900

    KVM: x86: Fix APIC page invalidation race

    Commit b1394e745b94 ("KVM: x86: fix APIC page invalidation") tried
    to fix inappropriate APIC page invalidation by re-introducing arch
    specific kvm_arch_mmu_notifier_invalidate_range() and calling it from
    kvm_mmu_notifier_invalidate_range_start. However, the patch left a
    possible race where the VMCS APIC address cache is updated *before*
    it is unmapped:

      (Invalidator) kvm_mmu_notifier_invalidate_range_start()
      (Invalidator) kvm_make_all_cpus_request(kvm, KVM_REQ_APIC_PAGE_RELOAD)
      (KVM VCPU) vcpu_enter_guest()
      (KVM VCPU) kvm_vcpu_reload_apic_access_page()
      (Invalidator) actually unmap page

    Because of the above race, there can be a mismatch between the
    host physical address stored in the APIC_ACCESS_PAGE VMCS field and
    the host physical address stored in the EPT entry for the APIC GPA
    (0xfee0000).  When this happens, the processor will not trap APIC
    accesses, and will instead show the raw contents of the APIC-access page.
    Because Windows OS periodically checks for unexpected modifications to
    the LAPIC register, this will show up as a BSOD crash with BugCheck
    CRITICAL_STRUCTURE_CORRUPTION (109) we are currently seeing in
    https://bugzilla.redhat.com/show_bug.cgi?id=1751017.

    The root cause of the issue is that kvm_arch_mmu_notifier_invalidate_range()
    cannot guarantee that no additional references are taken to the pages in
    the range before kvm_mmu_notifier_invalidate_range_end().  Fortunately,
    this case is supported by the MMU notifier API, as documented in
    include/linux/mmu_notifier.h:

             * If the subsystem
             * can't guarantee that no additional references are taken to
             * the pages in the range, it has to implement the
             * invalidate_range() notifier to remove any references taken
             * after invalidate_range_start().

    The fix therefore is to reload the APIC-access page field in the VMCS
    from kvm_mmu_notifier_invalidate_range() instead of ..._range_start().

Which I think is a hacky fix.

KVM already has locking for invalidate_start/end - it has to check
mmu_notifier_retry_cache() with the sequence numbers/etc around when
it does hva_to_pfn().

The bug is that the kvm_vcpu_reload_apic_access_page() path is
ignoring this locking so it ignores in-progress range
invalidations. It should spin until the invalidation clears like other
places in KVM.
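
To illustrate the kind of start/end locking meant here, a minimal
single-threaded userspace sketch follows. It is not KVM's code: the
struct and every function name below are hypothetical, and it leaves
out the locks and memory barriers a real implementation needs. It only
shows the idea that a reader samples a sequence number, does its
lookup, and throws the result away if an invalidation is running or
has completed since the sample.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified stand-in for notifier bookkeeping. */
struct notifier_state {
	unsigned long seq;	/* bumped each time an invalidation ends */
	int in_progress;	/* invalidations currently between start/end */
};

/* Invalidation side: mark a range invalidation as running. */
static void invalidate_range_start(struct notifier_state *s)
{
	s->in_progress++;
}

/* Invalidation side: bump the sequence, then mark the invalidation done. */
static void invalidate_range_end(struct notifier_state *s)
{
	assert(s->in_progress > 0);
	s->seq++;
	s->in_progress--;
}

/*
 * Reader side: true if a translation looked up after sampling 'snap'
 * must not be used, either because an invalidation is still running
 * or because one finished after the sample was taken.
 */
static bool must_retry(const struct notifier_state *s, unsigned long snap)
{
	return s->in_progress > 0 || s->seq != snap;
}

/*
 * Reader side: install new_pfn into the cached field only if it is safe.
 * Returns false when the caller has to sample again and retry. The
 * (possibly slow) address lookup would sit between the sample and the
 * check; it is elided here, so in this single-threaded sketch only a
 * still-running invalidation can make the check fail.
 */
static bool try_reload(const struct notifier_state *s,
		       unsigned long new_pfn, unsigned long *cached_pfn)
{
	unsigned long snap = s->seq;

	/* ... lookup of new_pfn would happen here ... */

	if (must_retry(s, snap))
		return false;

	*cached_pfn = new_pfn;
	return true;
}
```

In this sketch a caller would loop on try_reload() until it returns
true, which is the "spin until the invalidation clears" behaviour
described above.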

The comment is kind of misleading because drivers shouldn't be abusing
the iommu centric invalidate_range() thing to fix missing locking in
start/end users. :\

So if KVM could be fixed up we could define invalidate_range to be
an arch specific callback to synchronize the iommu TLB.

Sean?

Jason