From: Niklas Schnelle <schnelle@linux.ibm.com>
To: Jason Gunthorpe <jgg@nvidia.com>
Cc: Matthew Rosato <mjrosato@linux.ibm.com>,
	linux-s390@vger.kernel.org, alex.williamson@redhat.com,
	cohuck@redhat.com, farman@linux.ibm.com, pmorel@linux.ibm.com,
	borntraeger@linux.ibm.com, hca@linux.ibm.com, gor@linux.ibm.com,
	gerald.schaefer@linux.ibm.com, agordeev@linux.ibm.com,
	svens@linux.ibm.com, frankja@linux.ibm.com, david@redhat.com,
	imbrenda@linux.ibm.com, vneethv@linux.ibm.com,
	oberpar@linux.ibm.com, freude@linux.ibm.com, thuth@redhat.com,
	pasic@linux.ibm.com, Robin Murphy <robin.murphy@arm.com>
Subject: Re: s390-iommu.c default domain conversion
Date: Fri, 20 May 2022 18:26:03 +0200
Message-ID: <dd2d49bb798effaeac55f4bf3625f376bb1abda0.camel@linux.ibm.com>
In-Reply-To: <20220520155649.GJ1343366@nvidia.com>

On Fri, 2022-05-20 at 12:56 -0300, Jason Gunthorpe wrote:
> On Fri, May 20, 2022 at 05:17:05PM +0200, Niklas Schnelle wrote:
> 
> > > > With that the performance on the LPAR machine hypervisor (no paging) is
> > > > on par with our existing code. On paging hypervisors (z/VM and KVM),
> > > > i.e. with the hypervisor shadowing the I/O translation tables, it's
> > > > still slower than our existing code and interestingly strict mode seems
> > > > to be better than lazy here. One thing I haven't done yet is implement
> > > > the map_pages() operation or add larger page sizes.
> > > 
> > > map_pages() speeds things up if there is contiguous memory, I'm not
> > > sure what workload you are testing with, so it's hard to guess whether
> > > that is interesting or not.
> > 
> > Our most important driver is mlx5 with both IP and RDMA traffic on
> > ConnectX-4/5/6, but we also support NVMe devices.
> 
> So you probably won't see big gains here from larger page sizes unless
> you also have a specific userspace that is triggering huge pages.
> 
> qemu userspaces do this so it is worth doing anyhow though.
> 
> > > > Maybe you have some tips on what you'd expect to be most beneficial?
> > > > Either way we're optimistic this can be solved and this conversion
> > > > will be a high-ranking item on my backlog going forward.
> > > 
> > > I'm not really sure I understand the differences, do you have a sense
> > > of what is making it slower? Maybe there is some small feature that can
> > > be added to the core code? It is very strange that strict is faster,
> > > that should not be: strict requires a synchronous flush in the unmap
> > > case, lazy does not. Are you sure you are getting the lazy flushes
> > > enabled?
> > 
> > The lazy flushes are the timer-triggered flush_iotlb_all() in
> > fq_flush_iotlb(), right? I definitely see that when tracing my
> > flush_iotlb_all() implementation via that path. That flush_iotlb_all()
> > in my prototype is basically the same as the global RPCIT we did once
> > we wrapped around our IOVA address space. I suspect that this just
> > happens much more often with the timer than with our wrap-around, and
> > flushing the entire aperture is somewhat slow because it causes the
> > hypervisor to re-examine the entire I/O translation table. On the other
> > hand, in strict mode the iommu_iotlb_sync() call in __iommu_unmap()
> > always flushes a relatively small contiguous range, as I'm using the
> > following construct to extend the gather:
> > 
> > 	if (iommu_iotlb_gather_is_disjoint(gather, iova, size))
> > 		iommu_iotlb_sync(domain, gather);
> > 
> > 	iommu_iotlb_gather_add_range(gather, iova, size);
> > 
> > Maybe the smaller contiguous ranges just help with locality/caching
> > because the flushed range in the guest's I/O translation tables was
> > just updated.
> 
> So, from what I can tell, the S390 HW is not really the same as a
> normal iommu in that you can map over an IOVA that hasn't been flushed
> yet and the map will restore coherency to the new page table
> entries. I see the zpci_refresh_trans() call in map, which is why I
> assume this?

The zpci_refresh_trans() in map is only there because previously we
didn't implement iotlb_sync_map(). Also, we only need to flush on map
for the paged guest case, so that the hypervisor can update its shadow
tables. It happens unconditionally in the existing s390-iommu.c because
that code was not well optimized and uses the same
s390_iommu_update_trans() for both map and unmap. We had the skipping
of the TLB flush handled properly in the arch/s390/pci/pci_dma.c
mapping code, where !zdev->tlb_refresh indicates that we don't need
flushes on map.
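
For illustration, the iotlb_sync_map() op in my prototype does roughly
the following. This is only a sketch, not the actual prototype code:
the op name, the device-list walk and the locking follow the structures
in the existing s390-iommu.c, and the local names are made up.

	/* Sketch: only flush on map when the hypervisor has to shadow
	 * the I/O translation tables, i.e. zdev->tlb_refresh is set.
	 */
	static void s390_iommu_iotlb_sync_map(struct iommu_domain *domain,
					      unsigned long iova, size_t size)
	{
		struct s390_domain *s390_domain = to_s390_domain(domain);
		struct s390_domain_device *domain_device;
		unsigned long flags;

		spin_lock_irqsave(&s390_domain->list_lock, flags);
		list_for_each_entry(domain_device, &s390_domain->devices, list) {
			struct zpci_dev *zdev = domain_device->zdev;

			/* On LPAR the hardware walks the tables itself on map */
			if (!zdev->tlb_refresh)
				continue;
			zpci_refresh_trans((u64) zdev->fh << 32, iova, size);
		}
		spin_unlock_irqrestore(&s390_domain->list_lock, flags);
	}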

> 
> (note that normal HW has a HW IOTLB cache that MUST be flushed or new
> maps will not be loaded by the HW, so mapping at an IOVA whose previous
> translation was never invalidated is a functional problem, which
> motivates the design of this scheme)

We do need to flush the TLBs on unmap. The reason is that under LPAR
(non-paging hypervisor) the hardware establishes a new mapping on its
own when an I/O PTE is changed from invalid to valid, but only if no
entry for that IOVA is already in the TLB. So without a flush, a stale
TLB entry left over from before the unmap could hide a later mapping at
the same IOVA. I think that's how most hardware IOMMUs work and how I
understand your explanation too.

> 
> However, since S390 can restore coherency during map the lazy
> invalidation is not for correctness but only for security - to
> eventually unmap things that the DMA device should not be
> touching?

As explained above, it is for correctness, but the existing code
handles this slightly differently. Because we allocate our way through
the entire IOVA space, we never reuse a previously unmapped IOVA until
we run out of IOVAs. Only then do we do a global flush, which on LPAR
just drops the hardware TLBs, making sure that future reuses of IOVAs
will trigger a hardware walk of the I/O translation tables. Same
constraints, just a different scheme.
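
For reference, the allocation side of that scheme in
arch/s390/pci/pci_dma.c looks roughly like the following. This is a
simplified paraphrase, not the literal code; zpci_refresh_global() is
the global RPCIT over the whole aperture.

	/* Simplified sketch: IOVAs are handed out sequentially and only
	 * when the allocator wraps around do we drop the whole hardware
	 * TLB with one global RPCIT and release the lazily freed IOVAs.
	 */
	static dma_addr_t dma_alloc_address(struct device *dev, int size)
	{
		struct zpci_dev *zdev = to_zpci(to_pci_dev(dev));
		unsigned long offset, flags;

		spin_lock_irqsave(&zdev->iommu_bitmap_lock, flags);
		offset = __dma_alloc_iommu(dev, zdev->next_bit, size);
		if (offset == -1) {
			if (!s390_iommu_strict) {
				/* global flush before previously unmapped
				 * IOVAs start being reused */
				if (zpci_refresh_global(zdev))
					goto out_error;
				bitmap_andnot(zdev->iommu_bitmap, zdev->iommu_bitmap,
					      zdev->lazy_bitmap, zdev->iommu_pages);
				bitmap_zero(zdev->lazy_bitmap, zdev->iommu_pages);
			}
			/* wrap around and retry from the start */
			offset = __dma_alloc_iommu(dev, 0, size);
			if (offset == -1)
				goto out_error;
		}
		zdev->next_bit = offset + size;
		spin_unlock_irqrestore(&zdev->iommu_bitmap_lock, flags);
		return zdev->start_dma + offset * PAGE_SIZE;

	out_error:
		spin_unlock_irqrestore(&zdev->iommu_bitmap_lock, flags);
		return DMA_MAPPING_ERROR;
	}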


Thread overview: 10+ messages
2022-05-09 23:35 s390-iommu.c default domain conversion Jason Gunthorpe
2022-05-10 15:25 ` Matthew Rosato
2022-05-10 16:09   ` Jason Gunthorpe
2022-05-20 13:05     ` Niklas Schnelle
2022-05-20 13:44       ` Jason Gunthorpe
2022-05-20 15:17         ` Niklas Schnelle
2022-05-20 15:51           ` Robin Murphy
2022-05-20 15:56           ` Jason Gunthorpe
2022-05-20 16:26             ` Niklas Schnelle [this message]
2022-05-20 16:43               ` Jason Gunthorpe
