Re: some likely bugs in IOMMUv2 (in tlb_finish_mmu() nested flush and mremap())

From: Jason Gunthorpe <jgg@nvidia.com>
To: Sean Christopherson <seanjc@google.com>
Cc: Jann Horn <jannh@google.com>, Will Deacon <will@kernel.org>,
	Joerg Roedel <jroedel@suse.de>,
	jean-philippe.brucker@arm.com, Linux-MM <linux-mm@kvack.org>,
	kernel list <linux-kernel@vger.kernel.org>,
	iommu@lists.linux.dev
Subject: Re: some likely bugs in IOMMUv2 (in tlb_finish_mmu() nested flush and mremap())
Date: Mon, 26 Sep 2022 20:32:31 -0300	[thread overview]
Message-ID: <YzI2jzvc8D9lYU6G@nvidia.com> (raw)
In-Reply-To: <YzIHzIxknGNba6CC@google.com>

On Mon, Sep 26, 2022 at 08:13:00PM +0000, Sean Christopherson wrote:

> > AFAIK if we are flushing the CPU tlb then we really must also flush
> > the CPU tlb that KVM controls, and that is primarily what
> > invalidate_range() is used for.
> 
> As above, for its actual secondary MMU, KVM invalidates and flushes at
> invalidate_range_start(), and then prevents vCPUs from creating new entries for
> the range until invalidate_range_start_end().

Was it always like this? Why did we add this invalidate_range thing if
nothing really needed it?

That means iommu is really the only place using it as a proper
synchronous shadow TLB flush.

> > Which makes me wonder if the invalidate_range() hidden inside
> > invalidate_end() is a bad idea in general - when is this need and
> > would be correct? Isn't it better to put the invalidates near the TLB
> > invalidates and leave start/end as purely a bracketing API, which by
> > definition, cannot have an end that is 'too late'?
> 
> Documentation/mm/mmu_notifier.rst explains this, although even that is quite subtle.
> The argument is that if the change is purely to downgrade protections, then
> deferring invalidate_range() is ok because the only requirement is that secondary
> MMUs invalidate before the "end" of the sequence.
> 
>   When changing a pte to write protect or to point to a new write protected page  
>   with same content (KSM) it is fine to delay the mmu_notifier_invalidate_range   
>   call to mmu_notifier_invalidate_range_end() outside the page table lock. This   

And then if KVM never needed it why on earth did we micro-optimize it
in such an obscure and opaque way?

>   is true even if the thread doing the page table update is preempted right after 
>   releasing page table lock but before call mmu_notifier_invalidate_range_end().

That feels like it is getting dangerously close to the CVE Jan pointed
at.. We have a write protected page, installed in the PTEs, PTLs
unlocked and other things can sense the PTE and see that it is write
protected - is it really true nothing acts on that - especially now
that DavidH has gone and changed all that logic?

IMHO if we had logic that required the CPU TLB to be flushed under a
certain lock I find it to be a very, very, difficult conceptual leap
that a shadow TLB is OK to flush later.  If the shadow TLB is OK then
lets move the CPU TLB out of the lock as well :)

> That said, I also dislike hiding invalidate_range() inside end(), I constantly
> forget about that behavior.  To address that, what about renaming
> mmu_notifier_invalidate_range_end() to make it more explicit, e.g.
> mmu_notifier_invalidate_range_and_end().

The name for the special case should really capture that hidden point
above 'invalidate_range_delayed_write_protect_end' or something else
long and horrible. Because it really is special, it is really is only
allowed in that one special case (assuming the logic still holds) and
every other possible case should catch the invalidate through the tlb
flusher.

Jason