Re: [PATCH 06/23] drm/xe/svm: Introduce a helper to build sg table from hmm range

From: Jason Gunthorpe <jgg@nvidia.com>
To: "Thomas Hellström" <thomas.hellstrom@linux.intel.com>
Cc: "Zeng, Oak" <oak.zeng@intel.com>,
	"dri-devel@lists.freedesktop.org"
	<dri-devel@lists.freedesktop.org>,
	"intel-xe@lists.freedesktop.org" <intel-xe@lists.freedesktop.org>,
	"Brost, Matthew" <matthew.brost@intel.com>,
	"Welty, Brian" <brian.welty@intel.com>,
	"Ghimiray, Himal Prasad" <himal.prasad.ghimiray@intel.com>,
	"Bommu, Krishnaiah" <krishnaiah.bommu@intel.com>,
	"Vishwanathapura,
	Niranjana" <niranjana.vishwanathapura@intel.com>,
	Leon Romanovsky <leon@kernel.org>
Subject: Re: [PATCH 06/23] drm/xe/svm: Introduce a helper to build sg table from hmm range
Date: Fri, 26 Apr 2024 13:35:19 -0300	[thread overview]
Message-ID: <20240426163519.GZ941030@nvidia.com> (raw)
In-Reply-To: <ad82f95ee29ada403459416d4c97c2b9083b5a0f.camel@linux.intel.com>

On Fri, Apr 26, 2024 at 04:49:26PM +0200, Thomas Hellström wrote:
> On Fri, 2024-04-26 at 09:00 -0300, Jason Gunthorpe wrote:
> > On Fri, Apr 26, 2024 at 11:55:05AM +0200, Thomas Hellström wrote:
> > > First, the gpu_vma structure is something that partitions the
> > > gpu_vm
> > > that holds gpu-related range metadata, like what to mirror, desired
> > > gpu
> > > caching policies etc. These are managed (created, removed and
> > > split)
> > > mainly from user-space. These are stored and looked up from an rb-
> > > tree.
> > 
> > Except we are talking about SVA here, so all of this should not be
> > exposed to userspace.
> 
> I think you are misreading. this is on the level "Mirror this region of
> the cpu_vm", "prefer this region placed in VRAM", "GPU will do atomic
> accesses on this region", very similar to cpu mmap / munmap and
> madvise. What I'm trying to say here is that this does not directly
> affect the SVA except whether to do SVA or not, and in that case what
> region of the CPU mm will be mirrored, and in addition, any gpu
> attributes for the mirrored region.

SVA is you bind the whole MM and device faults dynamically populate
the mirror page table. There aren't non-SVA regions. Meta data, like
you describe, is meta data for the allocation/migration mechanism, not
for the page table or that has anything to do with the SVA mirror
operation.

Yes there is another common scheme where you bind a window of CPU to a
window on the device and mirror a fixed range, but this is a quite
different thing. It is not SVA, it has a fixed range, and it is
probably bound to a single GPU VMA in a multi-VMA device page table.

SVA is not just a whole bunch of windows being dynamically created by
the OS, that is entirely the wrong mental model. It would be horrible
to expose to userspace something like that as uAPI. Any hidden SVA
granules and other implementation specific artifacts must not be made
visible to userspace!!

> > If you use dma_map_sg you get into the world of wrongness where you
> > have to track ranges and invalidation has to wipe an entire range -
> > because you cannot do a dma unmap of a single page from a dma_map_sg
> > mapping. This is all the wrong way to use hmm_range_fault.
> > 
> > hmm_range_fault() is page table mirroring, it fundamentally must be
> > page-by-page. The target page table structure must have similar
> > properties to the MM page table - especially page by page
> > validate/invalidate. Meaning you cannot use dma_map_sg().
> 
> To me this is purely an optimization to make the driver page-table and
> hence the GPU TLB benefit from iommu coalescing / large pages and large
> driver PTEs.

This is a different topic. Leon is working on improving the DMA API to
get these kinds of benifits for HMM users. dma_map_sg is not the path
to get this. Leon's work should be significantly better in terms of
optimizing IOVA contiguity for a GPU use case. You can get a
guaranteed DMA contiguity at your choosen granual level, even up to
something like 512M.

> It is true that invalidation will sometimes shoot down
> large gpu ptes unnecessarily but it will not put any additional burden
> on the core AFAICT. 

In my experience people doing performance workloads don't enable the
IOMMU due to the high performance cost, so while optimizing iommu
coalescing is sort of interesting, it is not as important as using the
APIs properly and not harming the much more common situation when
there is no iommu and there is no artificial contiguity.

> on invalidation since zapping the gpu PTEs effectively stops any dma
> accesses. The dma mappings are rebuilt on the next gpu pagefault,
> which, as you mention, are considered slow anyway, but will probably
> still reuse the same prefault region, hence needing to rebuild the dma
> mappings anyway.

This is bad too. The DMA should not remain mapped after pages have
been freed, it completely destroys the concept of IOMMU enforced DMA
security and the ACPI notion of untrusted external devices.

> So as long as we are correct and do not adversely affect core mm, If
> the gpu performance (for whatever reason) is severely hampered if
> large gpu page-table-entries are not used, couldn't this be considered
> left to the driver?

Please use the APIs properly. We are trying to improve the DMA API to
better support HMM users, and doing unnecessary things like this in
drivers is only harmful to that kind of consolidation.

There is nothing stopping getting large GPU page table entries for
large CPU page table entries.

> And a related question. What about THP pages? OK to set up a single
> dma-mapping to those?

Yes, THP is still a page and dma_map_page() will map it.

> > > That's why finer-granularity mmu_interval notifiers might be
> > > beneficial
> > > (and then cached for future re-use of the same prefault range).
> > > This
> > > leads me to the next question:
> > 
> > It is not the design, please don't invent crazy special Intel things
> > on top of hmm_range_fault.
> 
> For the record, this is not a "crazy special Intel" invention. It's the
> way all GPU implementations do this so far.

"all GPU implementations" you mean AMD, and AMD predates alot of the
modern versions of this infrastructure IIRC.

> > Why would a prefetch have anything to do with a VMA? Ie your app
> > calls
> > malloc() and gets a little allocation out of a giant mmap() arena -
> > you want to prefault the entire arena? Does that really make any
> > sense?
> 
> Personally, no it doesn't. I'd rather use some sort of fixed-size
> chunk. But to rephrase, the question was more into the strong "drivers
> should not be aware of the cpu mm vma structures" comment. 

But this is essentially why - there is nothing useful the driver can
possibly learn from the CPU VMA to drive
hmm_range_fault(). hmm_range_fault already has to walk the VMA's if
someday something is actually needed it needs to be integrated in a
general way, not by having the driver touch vmas directly.

Jason