Re: [PATCH 2/2] arm64: Notify on pte permission upgrades

From: Alistair Popple <apopple@nvidia.com>
To: Jason Gunthorpe <jgg@nvidia.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	will@kernel.org, catalin.marinas@arm.com, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org, robin.murphy@arm.com,
	nicolinc@nvidia.com, linux-arm-kernel@lists.infradead.org,
	kvm@vger.kernel.org, John Hubbard <jhubbard@nvidia.com>,
	zhi.wang.linux@gmail.com, Sean Christopherson <seanjc@google.com>
Subject: Re: [PATCH 2/2] arm64: Notify on pte permission upgrades
Date: Tue, 30 May 2023 18:05:41 +1000	[thread overview]
Message-ID: <87pm6ii6qi.fsf@nvidia.com> (raw)
In-Reply-To: <ZHKaBQt8623s9+VK@nvidia.com>

Jason Gunthorpe <jgg@nvidia.com> writes:

> On Wed, May 24, 2023 at 11:47:29AM +1000, Alistair Popple wrote:
>> ARM64 requires TLB invalidates when upgrading pte permission from
>> read-only to read-write. However mmu_notifiers assume upgrades do not
>> need notifications and none are sent. This causes problems when a
>> secondary TLB such as implemented by an ARM SMMU doesn't support
>> broadcast TLB maintenance (BTM) and caches a read-only PTE.
>
> I don't really like this design, but I see how you get here..

Not going to argue with that, I don't love it either but it seemed like
the most straight forward approach.

> mmu notifiers behavior should not be tied to the architecture, they
> are supposed to be generic reflections of what the MM is doing so that
> they can be hooked into by general purpose drivers.

Interesting. I've always assumed mmu notifiers were primarly about
keeping cache invalidations in sync with what the MM is doing. This is
based on the fact that you can't use mmu notifiers to establish mappings
and we instead have this rather complex dance with hmm_range_fault() to
establish a mapping.

My initial version [1] effectivly did add a generic event. Admittedly it
was somewhat incomplete, because I didn't hook up the new mmu notifier
event type to every user that could possibliy ignore it (eg. KVM). But
that was part of the problem - it was hard to figure out which mmu
notifier users can safely ignore it versus ones that can't, and that may
depend on what architecture it's running on. Hence why I hooked it up to
ptep_set_access_flags, because you get arch specific filtering as
required.

Perhaps the better option is to introduce a new mmu notifier op and let
drivers opt-in?

> If you want to hardwire invalidate_range to be only for SVA cases that
> actually share the page table itself and rely on some arch-defined
> invalidation, then we should give the op a much better name and
> discourage anyone else from abusing the new ops variable behavior.

Well that's the only use case I currently care about because we have hit
this issue, so for now at least I'd much rather a straight forward fix
we can backport.

The problem is an invalidation isn't well defined. If we are to make
this architecture independent then we need to be sending an invalidation
for any PTE state change
(ie. clean/dirty/writeable/read-only/present/not-present/etc) which we
don't do currently.

>> As no notification is sent and the SMMU does not snoop TLB invalidates
>> it will continue to return read-only entries to a device even though
>> the CPU page table contains a writable entry. This leads to a
>> continually faulting device and no way of handling the fault.
>
> Doesn't the fault generate a PRI/etc? If we get a PRI maybe we should
> just have the iommu driver push an iotlb invalidation command before
> it acks it? PRI is already really slow so I'm not sure a pipelined
> invalidation is going to be a problem? Does the SMMU architecture
> permit negative caching which would suggest we need it anyhow?

Yes, SMMU architecture (which matches the ARM architecture in regards to
TLB maintenance requirements) permits negative caching of some mapping
attributes including the read-only attribute. Hence without the flushing
we fault continuously.

> Jason

[1] - https://lore.kernel.org/linux-mm/ZGxg+I8FWz3YqBMk@infradead.org/T/