From: David Woodhouse <dwmw2@infradead.org>
To: Benjamin Serebrin <serebrin@google.com>,
	Shamir Rabinovitch <shamir.rabinovitch@oracle.com>
Cc: David Miller <davem@davemloft.net>,
	Joerg Roedel <jroedel@suse.de>,
	benh@kernel.crashing.org, arnd@arndb.de, corbet@lwn.net,
	linux-doc@vger.kernel.org, linux-arch@vger.kernel.org,
	luto@kernel.org, borntraeger@de.ibm.com,
	cornelia.huck@de.ibm.com, sebott@linux.vnet.ibm.com,
	Paolo Bonzini <pbonzini@redhat.com>,
	hch@lst.de, kvm@vger.kernel.org, schwidefsky@de.ibm.com,
	linux-s390@vger.kernel.org
Subject: Re: [PATCH v1 2/2] dma-mapping-common: add DMA attribute - DMA_ATTR_IOMMU_BYPASS
Date: Mon, 16 Nov 2015 08:42:33 +0000	[thread overview]
Message-ID: <1447663353.145626.46.camel@infradead.org> (raw)
In-Reply-To: <CAN+hb0UvztgwNuAh93XdJEe7vgiZgNMc9mHNziHpEopg8Oi4Mg@mail.gmail.com>

On Sun, 2015-11-15 at 22:54 -0800, Benjamin Serebrin wrote:
> We looked into Intel IOMMU performance a while ago and learned a few
> useful things.  We generally did a parallel 200 thread TCP_RR test,
> as this churns through mappings quickly and uses all available cores.
> 
> First, the main bottleneck was software performance[1].

For the Intel IOMMU, *all* we need to do is put a PTE in place. For
real hardware (i.e. not an IOMMU emulated by qemu for a VM), we don't
need to do an IOTLB flush. It's a single 64-bit write of the PTE.

All else is software overhead.

(I'm deliberately ignoring the stupid chipsets where DMA page tables
aren't cache coherent and we need a clflush too. They make me too sad.)
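To make the point concrete, here is a minimal sketch of that fast path. The PTE layout and names are invented for illustration; this is not the actual intel-iommu code, just the shape of it:

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative leaf-PTE bits; VT-d uses bit 0 for read, bit 1 for write. */
#define PTE_READ  (1ULL << 0)
#define PTE_WRITE (1ULL << 1)

/* Mapping a page is conceptually one 64-bit store of the new PTE.
 * On sane chipsets the page tables are cache coherent, so no clflush
 * is needed, and no IOTLB flush is needed when an entry goes from
 * not-present to present. */
static void map_page(uint64_t *pt, size_t idx, uint64_t phys)
{
	pt[idx] = phys | PTE_READ | PTE_WRITE;
}
```

Everything wrapped around that one store (allocation, locking, tracking) is the software overhead in question.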


>   This study preceded the recent patch to break the locks into pools
> ("Break up monolithic iommu table/lock into finer graularity pools
> and lock").  There were several points of lock contention:
> - the RB tree ...
> - the RB tree ...
> - the RB tree ...
> 
> Omer's paper (https://www.usenix.org/system/files/conference/atc15/at
> c15-paper-peleg.pdf) has some promising approaches.  The magazine
> avoids the RB tree issue.  

I'm thinking of ditching the RB tree altogether and switching to the
allocator in lib/iommu-common.c (and thus getting the benefit of the
finer granularity pools).

> I'm interested in seeing if the dynamic 1:1 with a mostly-lock-free
> page table cleanup algorithm could do well.

When you say 'dynamic 1:1 mapping', is that the same thing that's been
suggested elsewhere — avoiding the IOVA allocator completely by using a
virtual address which *matches* the physical address, if that virtual
address is available? Simply cmpxchg on the PTE itself, and if it was
already set *then* we fall back to the allocator, obviously configured
to allocate from a range *higher* than the available physical memory.
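A sketch of that cmpxchg fast path, with C11 atomics standing in for the kernel's cmpxchg and all names invented:

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Dynamic 1:1: try to claim the IOVA equal to the physical address by
 * cmpxchg'ing the (hypothetical) leaf PTE from empty. If someone else
 * got there first, the caller falls back to a conventional IOVA
 * allocator configured to hand out addresses above the top of RAM. */
static bool try_identity_map(_Atomic uint64_t *pte, uint64_t new_pte)
{
	uint64_t expected = 0;	/* only claim a not-present entry */
	return atomic_compare_exchange_strong(pte, &expected, new_pte);
}
```

The win is that the common case touches only the PTE itself, with no allocator lock and no search.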

Jörg has been looking at this too, and was even trying to find space in
the PTE for a use count so a given page could be in more than one
mapping before we call back to the IOVA allocator.


> There are correctness fixes and optimizations left in the
> invalidation path: I want strict-ish semantics (a page doesn't go
> back into the freelist until the last IOTLB/IOMMU TLB entry is
> invalidated) with good performance, and that seems to imply that an
> additional page reference should be gotten at dma_map time and put
> back at the completion of the IOMMU flush routine.  (This is worthy
> of much discussion.)  

We already do something like this for page table pages which are freed
by an unmap, FWIW.
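The strict scheme described above amounts to a refcount whose final put happens in the flush completion path. A sketch, again with C11 atomics standing in for kernel refcounts and all names invented:

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Strict-ish semantics: a page may not return to the freelist until
 * the last IOTLB entry covering it has been invalidated. Take a
 * reference at dma_map time, drop it when the flush completes. */
struct dma_page_ref {
	_Atomic int ref;
};

static void dma_map_get(struct dma_page_ref *p)
{
	atomic_fetch_add(&p->ref, 1);
}

/* Returns true when the last reference drops, i.e. the page may now
 * go back to the freelist. */
static bool iotlb_flush_put(struct dma_page_ref *p)
{
	return atomic_fetch_sub(&p->ref, 1) == 1;
}
```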

> Additionally, we can find ways to optimize the flush routine by
> realizing that if we have frequent maps and unmaps, it may be because
> the device creates and destroys buffers a lot; these kind of
> workloads use an IOVA for one event and then never come back.  Maybe
> TLBs don't do much good and we could just flush the entire IOMMU TLB
> [and ATS cache] for that BDF.

That would be a very interesting thing to look at. Although it would be
nice if we had a better way to measure the performance impact of IOTLB
misses — currently we don't have a lot of visibility at all.
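The "just flush the whole BDF" idea could be gated on a simple rate heuristic. A sketch, with the threshold and names entirely invented:

```c
#include <stdbool.h>
#include <stdint.h>

/* Invented threshold: above this many unmaps per sampling interval we
 * assume one-shot buffers whose IOTLB entries will never be reused,
 * and issue one domain-wide IOTLB (and ATS) flush instead of
 * per-IOVA invalidations. */
#define ONESHOT_UNMAP_THRESHOLD 1024

static bool prefer_domain_wide_flush(uint64_t unmaps_in_interval)
{
	return unmaps_in_interval > ONESHOT_UNMAP_THRESHOLD;
}
```

Without better visibility into IOTLB miss rates, any such threshold would have to be tuned empirically.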

> 1: We verified that the IOMMU costs are almost entirely software
> overheads by forcing software 1:1 mode, where we create page tables
> for all physical addresses.  We tested using leaf nodes of size 4KB,
> of 2MB, and of 1GB.  In all cases, there is zero runtime maintenance
> of the page tables, and no IOMMU invalidations.  We did piles of DMA
> maximizing x16 PCIe bandwidth on multiple lanes, to random DRAM
> addresses.  At 4KB page size, we could see some bandwidth slowdown,
> but at 2MB and 1GB, there was < 1% performance loss as compared with
> IOMMU off.
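The leaf-table sizes behind those numbers are easy to sanity-check. A back-of-envelope sketch, assuming 8-byte PTEs; the 64 GiB figure is an example, not from the mail:

```c
#include <stdint.h>

/* Leaf PTEs needed to statically identity-map `mem` bytes with
 * `leaf`-byte pages (assumes leaf divides mem evenly). For 64 GiB:
 * 4KB leaves need 16M PTEs (128 MiB of tables), 2MB leaves need
 * 32768, and 1GB leaves need only 64. */
static uint64_t leaf_ptes(uint64_t mem, uint64_t leaf)
{
	return mem / leaf;
}
```

At 2MB and 1GB leaf sizes the tables fit comfortably in cache, which is consistent with the reported <1% loss.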

Was this with ATS on or off? With ATS, the cost of the page walk can be
amortised in some cases — you can look up the physical address *before*
you are ready to actually start the DMA to it, and don't take that
latency at the time you're actually moving the data.

-- 
David Woodhouse                            Open Source Technology Centre
David.Woodhouse@intel.com                              Intel Corporation




Thread overview: 42+ messages
2015-10-25 16:07 [PATCH v1 1/2] dma-mapping-common: add dma_map_page_attrs API Shamir Rabinovitch
2015-10-25 16:07 ` [PATCH v1 2/2] dma-mapping-common: add DMA attribute - DMA_ATTR_IOMMU_BYPASS Shamir Rabinovitch
2015-10-28  6:30   ` David Woodhouse
2015-10-28 11:10     ` Shamir Rabinovitch
2015-10-28 13:31       ` David Woodhouse
2015-10-28 14:07         ` David Miller
2015-10-28 13:57           ` David Woodhouse
2015-10-29  0:23             ` David Miller
2015-10-29  0:32         ` Benjamin Herrenschmidt
2015-10-29  0:42           ` David Woodhouse
2015-10-29  1:10             ` Benjamin Herrenschmidt
2015-10-29 18:31               ` Andy Lutomirski
2015-10-29 22:35                 ` David Woodhouse
2015-11-01  7:45                   ` Shamir Rabinovitch
2015-11-01 21:10                     ` Benjamin Herrenschmidt
2015-11-02  7:23                       ` Shamir Rabinovitch
2015-11-02 10:00                         ` Benjamin Herrenschmidt
2015-11-02 12:07                           ` Shamir Rabinovitch
2015-11-02 20:13                             ` Benjamin Herrenschmidt
2015-11-02 21:45                               ` Arnd Bergmann
2015-11-02 23:08                                 ` Benjamin Herrenschmidt
2015-11-03 13:11                                   ` Christoph Hellwig
2015-11-03 19:35                                     ` Benjamin Herrenschmidt
2015-11-02 21:49                               ` Shamir Rabinovitch
2015-11-02 22:48                       ` David Woodhouse
2015-11-02 23:10                         ` Benjamin Herrenschmidt
2015-11-05 21:08                   ` David Miller
2015-10-30  1:51                 ` Benjamin Herrenschmidt
2015-10-30 10:32               ` Arnd Bergmann
2015-10-30 23:17                 ` Benjamin Herrenschmidt
2015-10-30 23:24                   ` Arnd Bergmann
2015-11-02 14:51                 ` Joerg Roedel
2015-10-29  7:32             ` Shamir Rabinovitch
2015-11-02 14:44               ` Joerg Roedel
2015-11-02 17:32                 ` Shamir Rabinovitch
2015-11-05 13:42                   ` Joerg Roedel
2015-11-05 21:11                     ` David Miller
2015-11-07 15:06                       ` Shamir Rabinovitch
     [not found]                         ` <CAN+hb0UvztgwNuAh93XdJEe7vgiZgNMc9mHNziHpEopg8Oi4Mg@mail.gmail.com>
2015-11-16  8:42                           ` David Woodhouse [this message]
     [not found]                             ` <CAN+hb0UWpfcS5DvgMxNjY-5JOztw2mO1r2FJAW17fn974mhxPA@mail.gmail.com>
2015-11-16 18:42                               ` Benjamin Serebrin
2015-10-25 16:37 [PATCH v1 1/2] dma-mapping-common: add dma_map_page_attrs API Shamir Rabinovitch
2015-10-25 16:37 ` [PATCH v1 2/2] dma-mapping-common: add DMA attribute - DMA_ATTR_IOMMU_BYPASS Shamir Rabinovitch
2015-11-16  6:56 Benjamin Serebrin
