* s390-iommu.c default domain conversion
@ 2022-05-09 23:35 Jason Gunthorpe
  2022-05-10 15:25 ` Matthew Rosato
  0 siblings, 1 reply; 10+ messages in thread
From: Jason Gunthorpe @ 2022-05-09 23:35 UTC (permalink / raw)
  To: Matthew Rosato
  Cc: linux-s390, alex.williamson, cohuck, schnelle, farman, pmorel,
	borntraeger, hca, gor, gerald.schaefer, agordeev, svens, frankja,
	david, imbrenda, vneethv, oberpar, freude, thuth, pasic,
	Robin Murphy

Hi s390 folks/Matthew

Since everyone is looking at iommu support for the nested domains,
could we also tackle the default domain conversion please? s390 is one
of the last 4 drivers that need it.

From what I can see it looks like when detach_dev() is called it
expects the platform's dma_ops to work in arch/s390/pci/pci_dma.c ?

Has anyone thought about converting the dma_ops to use the normal DMA
API iommu support and run it through the iommu driver instead of
through the dma_ops?

Alternatively perhaps we can keep the dma_ops with some iommu
side-change.

Thanks,
Jason


* Re: s390-iommu.c default domain conversion
  2022-05-09 23:35 s390-iommu.c default domain conversion Jason Gunthorpe
@ 2022-05-10 15:25 ` Matthew Rosato
  2022-05-10 16:09   ` Jason Gunthorpe
  0 siblings, 1 reply; 10+ messages in thread
From: Matthew Rosato @ 2022-05-10 15:25 UTC (permalink / raw)
  To: Jason Gunthorpe, schnelle
  Cc: linux-s390, alex.williamson, cohuck, farman, pmorel, borntraeger,
	hca, gor, gerald.schaefer, agordeev, svens, frankja, david,
	imbrenda, vneethv, oberpar, freude, thuth, pasic, Robin Murphy

On 5/9/22 7:35 PM, Jason Gunthorpe wrote:
> Hi s390 folks/Matthew
> 
> Since everyone is looking at iommu support for the nested domains,
> could we also tackle the default domain conversion please? s390 is one
> of the last 4 drivers that need it.
> 
>  From what I can see it looks like when detach_dev() is called it
> expects the platform's dma_ops to work in arch/s390/pci/pci_dma.c ?

Yes

> 
> Has anyone thought about converting the dma_ops to use the normal DMA
> API iommu support and run it through the iommu driver instead of
> through the dma_ops?
> 
> Alternatively perhaps we can keep the dma_ops with some iommu
> side-change.

It has come up before.  So ultimately the goal is to be driving the dma 
through the default iommu domain (via dma-iommu) rather than directly in 
the dma_ops?  One of our main concerns is performance loss from s390-ism 
optimizations in the dma_ops like RPCIT avoidance / lazy map + global 
flush (maybe I am wrong, or something can be generalized through the API -
or something is already there that will work).  Of course there are
some obvious benefits too; we have some duplication of work between
s390-iommu.c and arch/s390/pci/pci_dma.c.

I think the reality is that Niklas and I need to have a close look and 
do some testing on our end to see what it will take and if we can get 
acceptable performance from a conversion, then get back to you.

Thanks,
Matt




* Re: s390-iommu.c default domain conversion
  2022-05-10 15:25 ` Matthew Rosato
@ 2022-05-10 16:09   ` Jason Gunthorpe
  2022-05-20 13:05     ` Niklas Schnelle
  0 siblings, 1 reply; 10+ messages in thread
From: Jason Gunthorpe @ 2022-05-10 16:09 UTC (permalink / raw)
  To: Matthew Rosato
  Cc: schnelle, linux-s390, alex.williamson, cohuck, farman, pmorel,
	borntraeger, hca, gor, gerald.schaefer, agordeev, svens, frankja,
	david, imbrenda, vneethv, oberpar, freude, thuth, pasic,
	Robin Murphy

On Tue, May 10, 2022 at 11:25:54AM -0400, Matthew Rosato wrote:
> On 5/9/22 7:35 PM, Jason Gunthorpe wrote:
> > Hi s390 folks/Matthew
> > 
> > Since everyone is looking at iommu support for the nested domains,
> > could we also tackle the default domain conversion please? s390 is one
> > of the last 4 drivers that need it.
> > 
> >  From what I can see it looks like when detach_dev() is called it
> > expects the platform's dma_ops to work in arch/s390/pci/pci_dma.c ?
> 
> Yes
> 
> > 
> > Has anyone thought about converting the dma_ops to use the normal DMA
> > API iommu support and run it through the iommu driver instead of
> > through the dma_ops?
> > 
> > Alternatively perhaps we can keep the dma_ops with some iommu
> > side-change.
> 
> It has come up before.  So ultimately the goal is to be driving the dma
> through the default iommu domain (via dma-iommu) rather than directly in the
> dma_ops?  One of our main concerns is performance loss from s390-ism
> optimizations in the dma_ops like RPCIT avoidance / lazy map +
> global flush

The core version is somewhat different: it triggers the
iotlb_flush_all from a timer, not just on address space wrap around,
but the fast path on unmap can still skip the zpci_refresh_trans().

On the other hand it doesn't have the limit on IOVA space, and the
IOVA allocator is somewhat more sophisticated, which will optimize
large page cases that s390 currently doesn't. Basically it will work
better with things like mlx5 cards in the normal case.

The lazy flush is done via IOMMU_DOMAIN_DMA_FQ and the iommu gather->queued
stuff, which allows skipping the RPCIT during the normal iotlb_sync.
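
Roughly, the dma-iommu side looks like the following when the flush queue
is active (a simplified sketch, not the literal kernel code; details vary
by kernel version):

        /*
         * Simplified sketch of the dma-iommu unmap path with a flush-queue
         * (IOMMU_DOMAIN_DMA_FQ) domain.
         */
        static void __iommu_dma_unmap(struct device *dev, dma_addr_t dma_addr,
                                      size_t size)
        {
                struct iommu_domain *domain = iommu_get_dma_domain(dev);
                struct iommu_dma_cookie *cookie = domain->iova_cookie;
                struct iommu_iotlb_gather gather;

                iommu_iotlb_gather_init(&gather);
                /* marks the unmap as queued when a flush queue is in use */
                gather.queued = READ_ONCE(cookie->fq_domain);

                iommu_unmap_fast(domain, dma_addr, size, &gather);

                /*
                 * With the FQ the sync - and hence the driver's RPCIT - is
                 * skipped here; the IOVA goes onto the flush queue and the
                 * FQ timer later calls the driver's flush_iotlb_all().
                 */
                if (!gather.queued)
                        iommu_iotlb_sync(domain, &gather);

                iommu_dma_free_iova(cookie, dma_addr, size, &gather);
        }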

> I think the reality is that Niklas and I need to have a close look and do
> some testing on our end to see what it will take and if we can get
> acceptable performance from a conversion, then get back to you.

It would be a good long term goal, getting rid of these duplicated
dma_ops is another open task. There is a patch series out there to
convert arm, so this whole area will become even more niche.

But another path is to somehow keep them and just allow default
domains to work - ARM did this.

Jason


* Re: s390-iommu.c default domain conversion
  2022-05-10 16:09   ` Jason Gunthorpe
@ 2022-05-20 13:05     ` Niklas Schnelle
  2022-05-20 13:44       ` Jason Gunthorpe
  0 siblings, 1 reply; 10+ messages in thread
From: Niklas Schnelle @ 2022-05-20 13:05 UTC (permalink / raw)
  To: Jason Gunthorpe, Matthew Rosato
  Cc: linux-s390, alex.williamson, cohuck, farman, pmorel, borntraeger,
	hca, gor, gerald.schaefer, agordeev, svens, frankja, david,
	imbrenda, vneethv, oberpar, freude, thuth, pasic, Robin Murphy

On Tue, 2022-05-10 at 13:09 -0300, Jason Gunthorpe wrote:
> On Tue, May 10, 2022 at 11:25:54AM -0400, Matthew Rosato wrote:
> > On 5/9/22 7:35 PM, Jason Gunthorpe wrote:
> > > Hi s390 folks/Matthew
> > > 
> > > Since everyone is looking at iommu support for the nested domains,
> > > could we also tackle the default domain conversion please? s390 is one
> > > of the last 4 drivers that need it.
> > > 
> > >  From what I can see it looks like when detach_dev() is called it
> > > expects the platform's dma_ops to work in arch/s390/pci/pci_dma.c ?
> > 
> > Yes
> > 
> > > Has anyone thought about converting the dma_ops to use the normal DMA
> > > API iommu support and run it through the iommu driver instead of
> > > through the dma_ops?
> > > 
> > > Alternatively perhaps we can keep the dma_ops with some iommu
> > > side-change.
> > 
> > It has come up before.  So ultimately the goal is to be driving the dma
> > through the default iommu domain (via dma-iommu) rather than directly in the
> > dma_ops?  One of our main concerns is performance loss from s390-ism
> > optimizations in the dma_ops like RPCIT avoidance / lazy map +
> > global flush
> 
> The core version is somewhat different: it triggers the
> iotlb_flush_all from a timer, not just on address space wrap around,
> but the fast path on unmap can still skip the zpci_refresh_trans().
> 
> On the other hand it doesn't have the limit on IOVA space, and the
> IOVA allocator is somewhat more sophisticated, which will optimize
> large page cases that s390 currently doesn't. Basically it will work
> better with things like mlx5 cards in the normal case.
> 
> The lazy flush is done via IOMMU_DOMAIN_DMA_FQ and the iommu gather->queued
> stuff, which allows skipping the RPCIT during the normal iotlb_sync.
> 
> > I think the reality is that Niklas and I need to have a close look and do
> > some testing on our end to see what it will take and if we can get
> > acceptable performance from a conversion, then get back to you.
> 
> It would be a good long term goal, getting rid of these duplicated
> dma_ops is another open task. There is a patch series out there to
> convert arm, so this whole area will become even more niche.
> 
> But another path is to somehow keep them and just allow default
> domains to work - ARM did this.
> 
> Jason

I did some testing and created a prototype that gets rid of
arch/s390/pci_dma.c and works solely via dma-iommu on top of our IOMMU
driver. It looks like the existing dma-iommu code allows us to do this
with relatively simple changes to the IOMMU driver only, mostly just
implementing iotlb_sync(), iotlb_sync_map() and flush_iotlb_all() so
that's great. They also do seem to map quite well to our RPCIT I/O TLB
flush so that's great. For now the prototype still uses 4k pages only.
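
As a rough illustration, the sync callbacks essentially reduce to RPCIT
over the right range. A minimal sketch, with the s390_domain/zdev plumbing
simplified to a single attached device (the real driver has to walk its
device list); iotlb_sync_map() looks much the same, just over the freshly
mapped range:

        static void s390_iommu_flush_iotlb_all(struct iommu_domain *domain)
        {
                struct s390_domain *s390_domain = to_s390_domain(domain);
                struct zpci_dev *zdev = s390_domain->zdev;      /* simplification */

                /* global RPCIT over the whole aperture */
                zpci_refresh_trans((u64) zdev->fh << 32, zdev->start_dma,
                                   zdev->end_dma - zdev->start_dma + 1);
        }

        static void s390_iommu_iotlb_sync(struct iommu_domain *domain,
                                          struct iommu_iotlb_gather *gather)
        {
                struct s390_domain *s390_domain = to_s390_domain(domain);
                struct zpci_dev *zdev = s390_domain->zdev;      /* simplification */
                size_t size = gather->end - gather->start + 1;

                /* nothing was gathered, nothing to flush */
                if (!gather->end)
                        return;

                /* RPCIT over the contiguous range collected during unmap */
                zpci_refresh_trans((u64) zdev->fh << 32, gather->start, size);
        }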

With that the performance on the LPAR machine hypervisor (no paging) is
on par with our existing code. On paging hypervisors (z/VM and KVM)
i.e. with the hypervisor shadowing the I/O translation tables, it's
still slower than our existing code and interestingly strict mode seems
to be better than lazy here. One thing I haven't done yet is implement
the map_pages() operation or add larger page sizes. Maybe you have
some tips on what you'd expect to be most beneficial? Either way we're
optimistic this can be solved and this conversion will be a high
ranking item on my backlog going forward.

I also stumbled over the following patch series which I think would
also help our paging hypervisor cases a lot since it should alleviate
the cost of shadowing short lived mappings:

https://lore.kernel.org/linux-iommu/20210806103423.3341285-1-stevensd@google.com/

Sadly it seems it hasn't gained much traction so far.

Thanks,
Niklas



* Re: s390-iommu.c default domain conversion
  2022-05-20 13:05     ` Niklas Schnelle
@ 2022-05-20 13:44       ` Jason Gunthorpe
  2022-05-20 15:17         ` Niklas Schnelle
  0 siblings, 1 reply; 10+ messages in thread
From: Jason Gunthorpe @ 2022-05-20 13:44 UTC (permalink / raw)
  To: Niklas Schnelle
  Cc: Matthew Rosato, linux-s390, alex.williamson, cohuck, farman,
	pmorel, borntraeger, hca, gor, gerald.schaefer, agordeev, svens,
	frankja, david, imbrenda, vneethv, oberpar, freude, thuth, pasic,
	Robin Murphy

On Fri, May 20, 2022 at 03:05:46PM +0200, Niklas Schnelle wrote:

> I did some testing and created a prototype that gets rid of
> arch/s390/pci_dma.c and works solely via dma-iommu on top of our IOMMU
> driver. It looks like the existing dma-iommu code allows us to do this
> with relatively simple changes to the IOMMU driver only, mostly just
> implementing iotlb_sync(), iotlb_sync_map() and flush_iotlb_all() so
> that's great. They also do seem to map quite well to our RPCIT I/O TLB
> flush so that's great. For now the prototype still uses 4k pages only.

You are going to want to improve the page sizes in the iommu driver
anyhow for VFIO.
 
> With that the performance on the LPAR machine hypervisor (no paging) is
> on par with our existing code. On paging hypervisors (z/VM and KVM)
> i.e. with the hypervisor shadowing the I/O translation tables, it's
> still slower than our existing code and interestingly strict mode seems
> to be better than lazy here. One thing I haven't done yet is implement
> the map_pages() operation or adding larger page sizes. 

map_pages() speeds things up if there is contiguous memory; I'm not
sure what workload you are testing with, so it's hard to guess if that
is interesting or not.
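
For reference, the ops take a run of equally sized, physically contiguous
pages in one call, so the driver can fill the IOPTEs in one pass and report
how much it mapped; the signatures (as in iommu_domain_ops of that time)
are:

        int (*map_pages)(struct iommu_domain *domain, unsigned long iova,
                         phys_addr_t paddr, size_t pgsize, size_t pgcount,
                         int prot, gfp_t gfp, size_t *mapped);
        size_t (*unmap_pages)(struct iommu_domain *domain, unsigned long iova,
                              size_t pgsize, size_t pgcount,
                              struct iommu_iotlb_gather *iotlb_gather);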

> Maybe you have some tips what you'd expect to be most beneficial?
> Either way we're optimistic this can be solved and this conversion
> will be a high ranking item on my backlog going forward.

I'm not really sure I understand the differences; do you have a sense
of what is making it slower? Maybe there is some small feature that can
be added to the core code? It is very strange that strict is faster,
that should not be: strict requires a synchronous flush in the unmap
case, lazy does not. Are you sure you are getting the lazy flushes
enabled?

> I also stumbled over the following patch series which I think would
> also help our paging hypervisor cases a lot since it should alleviate
> the cost of shadowing short lived mappings:

This is quite different than what your current code does though?

Still, it seems encouraging

Jason


* Re: s390-iommu.c default domain conversion
  2022-05-20 13:44       ` Jason Gunthorpe
@ 2022-05-20 15:17         ` Niklas Schnelle
  2022-05-20 15:51           ` Robin Murphy
  2022-05-20 15:56           ` Jason Gunthorpe
  0 siblings, 2 replies; 10+ messages in thread
From: Niklas Schnelle @ 2022-05-20 15:17 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Matthew Rosato, linux-s390, alex.williamson, cohuck, farman,
	pmorel, borntraeger, hca, gor, gerald.schaefer, agordeev, svens,
	frankja, david, imbrenda, vneethv, oberpar, freude, thuth, pasic,
	Robin Murphy

On Fri, 2022-05-20 at 10:44 -0300, Jason Gunthorpe wrote:
> On Fri, May 20, 2022 at 03:05:46PM +0200, Niklas Schnelle wrote:
> 
> > I did some testing and created a prototype that gets rid of
> > arch/s390/pci_dma.c and works solely via dma-iommu on top of our IOMMU
> > driver. It looks like the existing dma-iommu code allows us to do this
> > with relatively simple changes to the IOMMU driver only, mostly just
> > implementing iotlb_sync(), iotlb_sync_map() and flush_iotlb_all() so
> > that's great. They also do seem to map quite well to our RPCIT I/O TLB
> > flush so that's great. For now the prototype still uses 4k pages only.
> 
> You are going to want to improve that page sizes in the iommu driver
> anyhow for VFIO.

Ok, we'll look into this.

>  
> > With that the performance on the LPAR machine hypervisor (no paging) is
> > on par with our existing code. On paging hypervisors (z/VM and KVM)
> > i.e. with the hypervisor shadowing the I/O translation tables, it's
> > still slower than our existing code and interestingly strict mode seems
> > to be better than lazy here. One thing I haven't done yet is implement
> > the map_pages() operation or adding larger page sizes. 
> 
> map_pages() speeds things up if there is contiguous memory; I'm not
> sure what workload you are testing with, so it's hard to guess if that
> is interesting or not.

Our most important driver is mlx5 with both IP and RDMA traffic on
ConnectX-4/5/6 but we also support NVMes.

> 
> > Maybe you have some tips what you'd expect to be most beneficial?
> > Either way we're optimistic this can be solved and this conversion
> > will be a high ranking item on my backlog going forward.
> 
> I'm not really sure I understand the differences, do you have a sense
> what is making it slower? Maybe there is some small feature that can
> be added to the core code? It is very strange that strict is faster,
> that should not be, strict requires synchronous flush in the unmap
> case, lazy does not. Are you sure you are getting the lazy flushes
> enabled?

The lazy flushes are the timer triggered flush_iotlb_all() in
fq_flush_iotlb(), right? I definitely see that when tracing my
flush_iotlb_all() implementation via that path. That flush_iotlb_all()
in my prototype is basically the same as the global RPCIT we did once
we wrapped around our IOVA address space. I suspect that this just
happens much more often with the timer than our wrap around and
flushing the entire aperture is somewhat slow because it causes the
hypervisor to re-examine the entire I/O translation table. On the other
hand in strict mode the iommu_iotlb_sync() call in __iommu_unmap()
always flushes a relatively small contiguous range as I'm using the
following construct to extend gather:

	if (iommu_iotlb_gather_is_disjoint(gather, iova, size))
		iommu_iotlb_sync(domain, gather);

	iommu_iotlb_gather_add_range(gather, iova, size);

Maybe the smaller contiguous ranges just help with locality/caching
because the flushed range in the guest's I/O tables was just updated.

> > I also stumbled over the following patch series which I think would
> > also help our paging hypervisor cases a lot since it should alleviate
> > the cost of shadowing short lived mappings:
> 
> This is quite different than what your current code does though?

Yes

> 
> Still, it seems encouraging
> 
> Jason




* Re: s390-iommu.c default domain conversion
  2022-05-20 15:17         ` Niklas Schnelle
@ 2022-05-20 15:51           ` Robin Murphy
  2022-05-20 15:56           ` Jason Gunthorpe
  1 sibling, 0 replies; 10+ messages in thread
From: Robin Murphy @ 2022-05-20 15:51 UTC (permalink / raw)
  To: Niklas Schnelle, Jason Gunthorpe
  Cc: Matthew Rosato, linux-s390, alex.williamson, cohuck, farman,
	pmorel, borntraeger, hca, gor, gerald.schaefer, agordeev, svens,
	frankja, david, imbrenda, vneethv, oberpar, freude, thuth, pasic

On 2022-05-20 16:17, Niklas Schnelle wrote:
> On Fri, 2022-05-20 at 10:44 -0300, Jason Gunthorpe wrote:
>> On Fri, May 20, 2022 at 03:05:46PM +0200, Niklas Schnelle wrote:
>>
>>> I did some testing and created a prototype that gets rid of
>>> arch/s390/pci_dma.c and works solely via dma-iommu on top of our IOMMU
>>> driver. It looks like the existing dma-iommu code allows us to do this
>>> with relatively simple changes to the IOMMU driver only, mostly just
>>> implementing iotlb_sync(), iotlb_sync_map() and flush_iotlb_all() so
>>> that's great. They also do seem to map quite well to our RPCIT I/O TLB
>>> flush so that's great. For now the prototype still uses 4k pages only.
>>
>> You are going to want to improve that page sizes in the iommu driver
>> anyhow for VFIO.
> 
> Ok, we'll look into this.
> 
>>   
>>> With that the performance on the LPAR machine hypervisor (no paging) is
>>> on par with our existing code. On paging hypervisors (z/VM and KVM)
>>> i.e. with the hypervisor shadowing the I/O translation tables, it's
>>> still slower than our existing code and interestingly strict mode seems
>>> to be better than lazy here. One thing I haven't done yet is implement
>>> the map_pages() operation or adding larger page sizes.
>>
>> map_pages() speeds things up if there is contiguous memory; I'm not
>> sure what workload you are testing with, so it's hard to guess if that
>> is interesting or not.
> 
> Our most important driver is mlx5 with both IP and RDMA traffic on
> ConnectX-4/5/6 but we also support NVMes.

Since you already have the loop in s390_iommu_update_trans(), updating 
to map/unmap_pages should be trivial, and well worth it.

>>> Maybe you have some tips what you'd expect to be most beneficial?
>>> Either way we're optimistic this can be solved and this conversion
>>> will be a high ranking item on my backlog going forward.
>>
>> I'm not really sure I understand the differences, do you have a sense
>> what is making it slower? Maybe there is some small feature that can
>> be added to the core code? It is very strange that strict is faster,
>> that should not be, strict requires synchronous flush in the unmap
>> case, lazy does not. Are you sure you are getting the lazy flushes
>> enabled?
> 
> The lazy flushes are the timer triggered flush_iotlb_all() in
> fq_flush_iotlb(), right? I definitely see that when tracing my
> flush_iotlb_all() implementation via that path. That flush_iotlb_all()
> in my prototype is basically the same as the global RPCIT we did once
> we wrapped around our IOVA address space. I suspect that this just
> happens much more often with the timer than our wrap around and
> flushing the entire aperture is somewhat slow because it causes the
> hypervisor to re-examine the entire I/O translation table. On the other
> hand in strict mode the iommu_iotlb_sync() call in __iommu_unmap()
> always flushes a relatively small contiguous range as I'm using the
> following construct to extend gather:
> 
> 	if (iommu_iotlb_gather_is_disjoint(gather, iova, size))
> 		iommu_iotlb_sync(domain, gather);
> 
> 	iommu_iotlb_gather_add_range(gather, iova, size);
> 
> Maybe the smaller contiguous ranges just help with locality/caching
> because the flushed range in the guest's I/O tables was just updated.

That's entirely believable - both the AMD and Intel drivers force strict 
mode when virtualised for similar reasons, so feel free to do the same.

Robin.


* Re: s390-iommu.c default domain conversion
  2022-05-20 15:17         ` Niklas Schnelle
  2022-05-20 15:51           ` Robin Murphy
@ 2022-05-20 15:56           ` Jason Gunthorpe
  2022-05-20 16:26             ` Niklas Schnelle
  1 sibling, 1 reply; 10+ messages in thread
From: Jason Gunthorpe @ 2022-05-20 15:56 UTC (permalink / raw)
  To: Niklas Schnelle
  Cc: Matthew Rosato, linux-s390, alex.williamson, cohuck, farman,
	pmorel, borntraeger, hca, gor, gerald.schaefer, agordeev, svens,
	frankja, david, imbrenda, vneethv, oberpar, freude, thuth, pasic,
	Robin Murphy

On Fri, May 20, 2022 at 05:17:05PM +0200, Niklas Schnelle wrote:

> > > With that the performance on the LPAR machine hypervisor (no paging) is
> > > on par with our existing code. On paging hypervisors (z/VM and KVM)
> > > i.e. with the hypervisor shadowing the I/O translation tables, it's
> > > still slower than our existing code and interestingly strict mode seems
> > > to be better than lazy here. One thing I haven't done yet is implement
> > > the map_pages() operation or adding larger page sizes. 
> > 
> > map_pages() speeds things up if there is contiguous memory; I'm not
> > sure what workload you are testing with, so it's hard to guess if that
> > is interesting or not.
> 
> Our most important driver is mlx5 with both IP and RDMA traffic on
> ConnectX-4/5/6 but we also support NVMes.

So you probably won't see big gains here from larger page sizes unless
you also have a specific userspace that is triggering huge pages.

qemu userspaces do this, so it is worth doing anyhow though.

> > > Maybe you have some tips what you'd expect to be most beneficial?
> > > Either way we're optimistic this can be solved and this conversion
> > > will be a high ranking item on my backlog going forward.
> > 
> > I'm not really sure I understand the differences, do you have a sense
> > what is making it slower? Maybe there is some small feature that can
> > be added to the core code? It is very strange that strict is faster,
> > that should not be, strict requires synchronous flush in the unmap
> > case, lazy does not. Are you sure you are getting the lazy flushes
> > enabled?
> 
> The lazy flushes are the timer triggered flush_iotlb_all() in
> fq_flush_iotlb(), right? I definitely see that when tracing my
> flush_iotlb_all() implementation via that path. That flush_iotlb_all()
> in my prototype is basically the same as the global RPCIT we did once
> we wrapped around our IOVA address space. I suspect that this just
> happens much more often with the timer than our wrap around and
> flushing the entire aperture is somewhat slow because it causes the
> hypervisor to re-examine the entire I/O translation table. On the other
> hand in strict mode the iommu_iotlb_sync() call in __iommu_unmap()
> always flushes a relatively small contiguous range as I'm using the
> following construct to extend gather:
> 
> 	if (iommu_iotlb_gather_is_disjoint(gather, iova, size))
> 		iommu_iotlb_sync(domain, gather);
> 
> 	iommu_iotlb_gather_add_range(gather, iova, size);
> 
> Maybe the smaller contiguous ranges just help with locality/caching
> because the flushed range in the guest's I/O tables was just updated.

So, from what I can tell, the S390 HW is not really the same as a
normal iommu in that you can do map over IOVA that hasn't been flushed
yet and the map will restore coherency to the new page table
entries. I see the zpci_refresh_trans() call in map which is why I
assume this?

(note that normal HW has a HW IOTLB cache that MUST be flushed or new
maps will not be loaded by the HW, so mapping to areas that previously
had uninvalidated IOVA is a functional problem, which motivates the
design of this scheme)

However, since S390 can restore coherency during map the lazy
invalidation is not for correctness but only for security - to
eventually unmap things that the DMA device should not be
touching?

So, what you want is a slightly different FQ operation in the core
code:

 - Continue to skip iotlb_sync during unmap
 - Do not do fq_ring_add() tracking, immediately recycle the IOVA
 - Call flush all on a timer with a suitable duration - much like today

You can teach the core code to do this with some new flag, or 
rely on a custom driver implementation:

 - Make the alloc for IOMMU_DOMAIN_DMA produce an iommu_domain with
   special ops: iotlb_sync/flush all both NOPs.

 - Have a timer per iommu_domain and internally flush all on that
   timer, quite similar to how the current code works

 - Flush all when detaching a device

 - IOMMU_DOMAIN_UNMANAGED will work the same as today and includes the
   ops.
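
Very roughly, and with every name in the sketch below hypothetical, the
driver-internal variant could look like:

        /* All names hypothetical - just to show the shape of the idea. */
        struct s390_dma_domain {
                struct iommu_domain     domain;
                struct zpci_dev         *zdev;
                struct timer_list       flush_timer;
        };

        /* iotlb_sync() and flush_iotlb_all() in the ops are NOPs; unmapped
         * IOVAs stay translatable until the timer fires. */
        static void s390_dma_iotlb_sync(struct iommu_domain *domain,
                                        struct iommu_iotlb_gather *gather)
        {
        }

        static void s390_dma_flush_timer_fn(struct timer_list *t)
        {
                struct s390_dma_domain *d = from_timer(d, t, flush_timer);
                struct zpci_dev *zdev = d->zdev;

                /* global RPCIT over the whole aperture, like today's
                 * wrap-around flush; the interval below is arbitrary */
                zpci_refresh_trans((u64) zdev->fh << 32, zdev->start_dma,
                                   zdev->end_dma - zdev->start_dma + 1);
                mod_timer(&d->flush_timer, jiffies + msecs_to_jiffies(10));
        }

        /* plus timer_setup() when the IOMMU_DOMAIN_DMA domain is allocated,
         * the same global flush in detach_dev(), and IOMMU_DOMAIN_UNMANAGED
         * keeping the existing synchronous ops. */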

Jason


* Re: s390-iommu.c default domain conversion
  2022-05-20 15:56           ` Jason Gunthorpe
@ 2022-05-20 16:26             ` Niklas Schnelle
  2022-05-20 16:43               ` Jason Gunthorpe
  0 siblings, 1 reply; 10+ messages in thread
From: Niklas Schnelle @ 2022-05-20 16:26 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Matthew Rosato, linux-s390, alex.williamson, cohuck, farman,
	pmorel, borntraeger, hca, gor, gerald.schaefer, agordeev, svens,
	frankja, david, imbrenda, vneethv, oberpar, freude, thuth, pasic,
	Robin Murphy

On Fri, 2022-05-20 at 12:56 -0300, Jason Gunthorpe wrote:
> On Fri, May 20, 2022 at 05:17:05PM +0200, Niklas Schnelle wrote:
> 
> > > > With that the performance on the LPAR machine hypervisor (no paging) is
> > > > on par with our existing code. On paging hypervisors (z/VM and KVM)
> > > > i.e. with the hypervisor shadowing the I/O translation tables, it's
> > > > still slower than our existing code and interestingly strict mode seems
> > > > to be better than lazy here. One thing I haven't done yet is implement
> > > > the map_pages() operation or adding larger page sizes. 
> > > 
> > > map_pages() speeds things up if there is contiguous memory; I'm not
> > > sure what workload you are testing with, so it's hard to guess if that
> > > is interesting or not.
> > 
> > Our most important driver is mlx5 with both IP and RDMA traffic on
> > ConnectX-4/5/6 but we also support NVMes.
> 
> So you probably won't see big gains here from larger page sizes unless
> you also have a specific userspace that is triggering huge pages.
> 
> qemu userspaces do this, so it is worth doing anyhow though.
> 
> > > > Maybe you have some tips what you'd expect to be most beneficial?
> > > > Either way we're optimistic this can be solved and this conversion
> > > > will be a high ranking item on my backlog going forward.
> > > 
> > > I'm not really sure I understand the differences, do you have a sense
> > > what is making it slower? Maybe there is some small feature that can
> > > be added to the core code? It is very strange that strict is faster,
> > > that should not be, strict requires synchronous flush in the unmap
> > > case, lazy does not. Are you sure you are getting the lazy flushes
> > > enabled?
> > 
> > The lazy flushes are the timer triggered flush_iotlb_all() in
> > fq_flush_iotlb(), right? I definitely see that when tracing my
> > flush_iotlb_all() implementation via that path. That flush_iotlb_all()
> > in my prototype is basically the same as the global RPCIT we did once
> > we wrapped around our IOVA address space. I suspect that this just
> > happens much more often with the timer than our wrap around and
> > flushing the entire aperture is somewhat slow because it causes the
> > hypervisor to re-examine the entire I/O translation table. On the other
> > hand in strict mode the iommu_iotlb_sync() call in __iommu_unmap()
> > always flushes a relatively small contiguous range as I'm using the
> > following construct to extend gather:
> > 
> > 	if (iommu_iotlb_gather_is_disjoint(gather, iova, size))
> > 		iommu_iotlb_sync(domain, gather);
> > 
> > 	iommu_iotlb_gather_add_range(gather, iova, size);
> > 
> > Maybe the smaller contiguous ranges just help with locality/caching
> > because the flushed range in the guest's I/O tables was just updated.
> 
> So, from what I can tell, the S390 HW is not really the same as a
> normal iommu in that you can do map over IOVA that hasn't been flushed
> yet and the map will restore coherency to the new page table
> entries. I see the zpci_refresh_trans() call in map which is why I
> assume this?

The zpci_refresh_trans() in map is only there because previously we
didn't implement iotlb_sync_map(). Also, we only need to flush on map
for the paged guest case so the hypervisor can update its shadow table.
It happens unconditionally in the existing s390-iommu.c because that
was not well optimized and uses the same s390_iommu_update_trans() for
map and unmap. We had the skipping of the TLB flush handled properly in
the arch/s390/pci_dma.c mapping code where !zdev->tlb_refresh indicates
that we don't need flushes on map.
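
With iotlb_sync_map() that condition carries over naturally; a minimal
sketch, again with the single-zdev plumbing simplified:

        static void s390_iommu_iotlb_sync_map(struct iommu_domain *domain,
                                              unsigned long iova, size_t size)
        {
                struct s390_domain *s390_domain = to_s390_domain(domain);
                struct zpci_dev *zdev = s390_domain->zdev;      /* simplification */

                /* !tlb_refresh: new valid entries are picked up by a table
                 * walk, so no RPCIT is needed on map; only a shadowing
                 * hypervisor has to be told about fresh mappings. */
                if (!zdev->tlb_refresh)
                        return;
                zpci_refresh_trans((u64) zdev->fh << 32, iova, size);
        }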

> 
> (note that normal HW has a HW IOTLB cache that MUST be flushed or new
> maps will not be loaded by the HW, so mapping to areas that previously
> had uninvalidated IOVA is a functional problem, which motivates the
> design of this scheme)

We do need to flush the TLBs on unmap. The reason is that under LPAR
(non paging hypervisor) the hardware can establish a new mapping on its
own if an I/O PTE is changed from invalid to a valid translation and it
wasn't previously in the TLB. I think that's how most hardware IOMMUs
work and how I understand your explanation too.

> 
> However, since S390 can restore coherency during map the lazy
> invalidation is not for correctness but only for security - to
> eventually unmap things that the DMA device should not be
> touching?

As explained above, it is for correctness, but with the existing code we
handle this slightly differently. As we go through the entire IOVA
space we're never reusing a previously unmapped IOVA until we run out
of IOVAs. Then we do a global flush, which on LPAR just drops the
hardware TLBs, making sure that future re-uses of IOVAs will trigger a
hardware walk of the I/O translation tables. Same constraints, just a
different scheme.




* Re: s390-iommu.c default domain conversion
  2022-05-20 16:26             ` Niklas Schnelle
@ 2022-05-20 16:43               ` Jason Gunthorpe
  0 siblings, 0 replies; 10+ messages in thread
From: Jason Gunthorpe @ 2022-05-20 16:43 UTC (permalink / raw)
  To: Niklas Schnelle
  Cc: Matthew Rosato, linux-s390, alex.williamson, cohuck, farman,
	pmorel, borntraeger, hca, gor, gerald.schaefer, agordeev, svens,
	frankja, david, imbrenda, vneethv, oberpar, freude, thuth, pasic,
	Robin Murphy

On Fri, May 20, 2022 at 06:26:03PM +0200, Niklas Schnelle wrote:
> > So, from what I can tell, the S390 HW is not really the same as a
> > normal iommu in that you can do map over IOVA that hasn't been flushed
> > yet and the map will restore coherency to the new page table
> > entries. I see the zpci_refresh_trans() call in map which is why I
> > assume this?
> 
> The zpci_refresh_trans() in map is only there because previously we
> didn't implement iotlb_sync_map(). Also, we only need to flush on map
> for the paged guest case so the hypervisor can update its shadow table.
> It happens unconditionally in the existing s390_iommu.c because that
> was not well optimized and uses the same s390_iommu_update_trans() for
> map and unmap. We had the skipping of the TLB flush handled properly in
> the arch/s390/pci_dma.c mapping code where !zdev->tlb_refresh indicates
> that we don't need flushes on map.

Even the arch/s390/pci_dma.c code has a zpci_refresh_trans() on map; it is
just conditional on zdev->tlb_refresh.

I had also assumed that the paging case uses this path?

> > (note that normal HW has a HW IOTLB cache that MUST be flushed or new
> > maps will not be loaded by the HW, so mapping to areas that previously
> > had uninvalidated IOVA is a functional problem, which motivates the
> > design of this scheme)
> 
> We do need to flush the TLBs on unmap. The reason is that under LPAR
> (non paging hypervisor) the hardware can establish a new mapping on its
> own if an I/O PTE is changed from invalid to a valid translation and it
> wasn't previously in the TLB. I think that's how most hardware IOMMUs
> work and how I understand your explanation too.

Since you said LPAR was OK performance wise, I was thinking only about
the paging case.

You can have different iommu_domain implementations for these two
cases.

But if the paging case doesn't need the hypercall anyhow it doesn't
work.

You'd have to change the core code to increase the timer duration, I
think.

Jason


end of thread

Thread overview: 10+ messages
2022-05-09 23:35 s390-iommu.c default domain conversion Jason Gunthorpe
2022-05-10 15:25 ` Matthew Rosato
2022-05-10 16:09   ` Jason Gunthorpe
2022-05-20 13:05     ` Niklas Schnelle
2022-05-20 13:44       ` Jason Gunthorpe
2022-05-20 15:17         ` Niklas Schnelle
2022-05-20 15:51           ` Robin Murphy
2022-05-20 15:56           ` Jason Gunthorpe
2022-05-20 16:26             ` Niklas Schnelle
2022-05-20 16:43               ` Jason Gunthorpe
