From: Jason Gunthorpe <jgg@nvidia.com>
To: John Hubbard <jhubbard@nvidia.com>
Cc: Sean Christopherson <seanjc@google.com>,
	David Hildenbrand <david@redhat.com>,
	Jann Horn <jannh@google.com>, Will Deacon <will@kernel.org>,
	Joerg Roedel <jroedel@suse.de>,
	jean-philippe.brucker@arm.com, Linux-MM <linux-mm@kvack.org>,
	kernel list <linux-kernel@vger.kernel.org>,
	iommu@lists.linux.dev
Subject: Re: some likely bugs in IOMMUv2 (in tlb_finish_mmu() nested flush and mremap())
Date: Fri, 30 Sep 2022 09:01:03 -0300	[thread overview]
Message-ID: <Yzbaf9HW1/reKqR8@nvidia.com> (raw)
In-Reply-To: <bd235db0-8355-2038-3b71-3b17cde91293@nvidia.com>

On Thu, Sep 29, 2022 at 07:31:03PM -0700, John Hubbard wrote:

> > Ah, right OK. This is specifically because the iommu is sharing the
> > *exact* page table of the CPU so the trick KVM/etc uses where 'start'
> > makes the shadow PTE non-present and then delays the fault until end
> > completes cannot work here.
> 
> ohhh, is this trick something I should read more about, if I'm about to
> jump in here?

All the invalidate_range_start() implementations maintain a parallel
page table structure that they copy into from the main page table,
eg using hmm_range_fault() or whatever KVM does. Thus we can create
a situation where a sPTE is non-present while the PTE is present.
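
To make the trick concrete, here is a rough sketch of what such an
implementation looks like. struct example_shadow, shadow_zap_range()
and the locking are invented for illustration; the sequence count is
loosely modeled on KVM's approach, not copied from it:

  #include <linux/kernel.h>
  #include <linux/mmu_notifier.h>
  #include <linux/spinlock.h>

  struct example_shadow {
  	struct mmu_notifier mn;
  	spinlock_t lock;
  	unsigned long invalidate_seq;	/* odd while an invalidation runs */
  };

  /* Hypothetical helper: make the sPTEs in [start, end) non-present */
  static void shadow_zap_range(struct example_shadow *sh,
  		unsigned long start, unsigned long end);

  static int example_invalidate_range_start(struct mmu_notifier *mn,
  		const struct mmu_notifier_range *range)
  {
  	struct example_shadow *sh =
  		container_of(mn, struct example_shadow, mn);

  	spin_lock(&sh->lock);
  	sh->invalidate_seq++;	/* now odd: hold off shadow faults */
  	/* sPTEs become non-present while the CPU PTEs still exist */
  	shadow_zap_range(sh, range->start, range->end);
  	spin_unlock(&sh->lock);
  	return 0;
  }

  static void example_invalidate_range_end(struct mmu_notifier *mn,
  		const struct mmu_notifier_range *range)
  {
  	struct example_shadow *sh =
  		container_of(mn, struct example_shadow, mn);

  	spin_lock(&sh->lock);
  	sh->invalidate_seq++;	/* even again: faults may refill sPTEs */
  	spin_unlock(&sh->lock);
  }

The fault side samples invalidate_seq, copies the PTEs (eg via
hmm_range_fault()), and retries if the sequence was odd or changed
under it.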

The iommu/etc point the HW at exactly the same page table as the mm,
so we cannot create a situation where a sPTE is non-present while the
PTE is present.

Thus, IMHO, the only correct answer is to flush the shadow TLB at
exactly the same moments we flush the CPU TLB, because the two TLBs
are caching exactly the same data structure. If there is logic that
allows the TLB flush to be delayed as an optimization, then that
logic applies equally to both TLBs.
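
In code terms the pairing rule is just this (a hand-written sketch,
not a real call site; example_zap_one_range() is invented, while
flush_tlb_range() and mmu_notifier_invalidate_range() are the actual
APIs):

  #include <linux/mm.h>
  #include <linux/mmu_notifier.h>
  #include <asm/tlbflush.h>

  static void example_zap_one_range(struct vm_area_struct *vma,
  		unsigned long start, unsigned long end)
  {
  	struct mm_struct *mm = vma->vm_mm;

  	/* ... clear the PTEs covering [start, end) ... */
  	flush_tlb_range(vma, start, end);		/* CPU TLB */
  	mmu_notifier_invalidate_range(mm, start, end);	/* shadow TLB */
  }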

> > So, then we can see where the end_only thing came from, commit
> > 0f10851ea475 ("mm/mmu_notifier: avoid double notification when it is
> > useless") and that long winded message explains why some of the cases
> 
> I seem to recall that there was a performance drop involving GPUs, due
> to the double notification. Just to fill in a little bit of history as
> to why Jerome was trying to deduplicate the notifier callbacks.

Sure, the double notification is clearly undesired, but the right way
to fix that was to remove the call to invalidate_range() from
mn_hlist_invalidate_end(), which means adding any missing
invalidate_range() calls next to the CPU TLB flushes.

However, as Sean suggested, GPU drivers should not use
invalidate_range() because they are not directly sharing the CPU page
table in the first place. They should use start/end. Maybe the docs
need clarification on this point as well.

> After an initial pass through this, with perhaps 80% understanding
> of the story, I'm reading that as:
> 
>     Audit all the sites (which you initially did quickly, above)
>     that 0f10851ea475 touched, and any other related ones, and
>     change things so that invalidate_range() and primary TLB
>     flushing happen at the same point(s).
> 
> Yes? Anything else?

I would structure it as

1) Make a patch fixing the documentation around all of the
   mmu_notifier_invalidate_range_only_end() call sites to explain
   where the invalidate_range() call is (eg where the CPU TLB flush
   is) or why the CPU TLB flush is not actually needed. 0f10851ea475
   is a good guide to where to touch.

   Remove all the confusing documentation about write protect and
   'promotion'. The new logic is that we call invalidate_range() when
   we flush the CPU TLB, and we might do that outside the start/end
   block.

2) Make patch(es) converting all the places calling
   mmu_notifier_invalidate_range_end() to only_end() by identifying
   where the CPU TLB flush is and ensuring an invalidate_range() call
   is present at the CPU TLB flush point.

   eg all the range_end() calls on the fork() path are switched to
   only_end() and an invalidate_range() is put outside the start/end
   block in dup_mmap() near the TLB flush (see the sketch after this
   list). This is even more optimal because we batch flushing the
   entire shadow TLB in one shot, instead of invalidating every VMA
   range separately.

3) Fix the TLB flusher to not send -1 -1 in the corner case Jann
   noticed in the first mail.
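
For example, the fork() path change in (2) might end up looking
roughly like this (purely illustrative: the range setup is elided,
and whether the batched flush lands in dup_mmap() or copy_page_range()
is exactly what the audit has to decide):

  	/* per-VMA in copy_page_range(), as today: */
  	mmu_notifier_invalidate_range_start(&range);
  	/* ... write-protect the parent's PTEs ... */
  	mmu_notifier_invalidate_range_only_end(&range);	/* was _end() */

  	/* once, in dup_mmap(), after all VMAs are copied: */
  	flush_tlb_mm(oldmm);			/* batched CPU TLB flush */
  	/* whole user address space, to match flush_tlb_mm(): */
  	mmu_notifier_invalidate_range(oldmm, 0, TASK_SIZE);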

Jason


Thread overview: 8+ messages
2022-09-23 15:38 some likely bugs in IOMMUv2 (in tlb_finish_mmu() nested flush and mremap()) Jann Horn
2022-09-23 17:27 ` Jason Gunthorpe
2022-09-26 20:13   ` Sean Christopherson
2022-09-26 23:32     ` Jason Gunthorpe
2022-09-27  0:24       ` Sean Christopherson
2022-09-28 18:12         ` Jason Gunthorpe
2022-09-30  2:31           ` John Hubbard
2022-09-30 12:01             ` Jason Gunthorpe [this message]
