From: Benjamin Serebrin <serebrin@google.com>
To: David Woodhouse <dwmw2@infradead.org>
Cc: Shamir Rabinovitch <shamir.rabinovitch@oracle.com>,
	David Miller <davem@davemloft.net>,
	Joerg Roedel <jroedel@suse.de>,
	benh@kernel.crashing.org, Arnd Bergmann <arnd@arndb.de>,
	Jonathan Corbet <corbet@lwn.net>,
	linux-doc@vger.kernel.org, linux-arch@vger.kernel.org,
	luto@kernel.org, borntraeger@de.ibm.com,
	cornelia.huck@de.ibm.com, sebott@linux.vnet.ibm.com,
	Paolo Bonzini <pbonzini@redhat.com>,
	hch@lst.de, kvm@vger.kernel.org, schwidefsky@de.ibm.com,
	linux-s390@vger.kernel.org
Subject: Re: [PATCH v1 2/2] dma-mapping-common: add DMA attribute - DMA_ATTR_IOMMU_BYPASS
Date: Mon, 16 Nov 2015 10:42:23 -0800
Message-ID: <CAN+hb0XLpQs9UiuWkzU=xBT_Tpfntd+XGCLYQDagwsY+HeN-kw@mail.gmail.com>
In-Reply-To: <CAN+hb0UWpfcS5DvgMxNjY-5JOztw2mO1r2FJAW17fn974mhxPA@mail.gmail.com>

On Mon, Nov 16, 2015 at 12:42 AM, David Woodhouse <dwmw2@infradead.org> wrote:
>
> On Sun, 2015-11-15 at 22:54 -0800, Benjamin Serebrin wrote:
> > We looked into Intel IOMMU performance a while ago and learned a few
> > useful things.  We generally did a parallel 200 thread TCP_RR test,
> > as this churns through mappings quickly and uses all available cores.
> >
> > First, the main bottleneck was software performance[1].
>
> For the Intel IOMMU, *all* we need to do is put a PTE in place. For
> real hardware (i.e. not an IOMMU emulated by qemu for a VM), we don't
> need to do an IOTLB flush. It's a single 64-bit write of the PTE.
>
> All else is software overhead.
>

Agreed!

>
> (I'm deliberately ignoring the stupid chipsets where DMA page tables
> aren't cache coherent and we need a clflush too. They make me too sad.)



How much does Linux need to care about such chipsets?  Can we argue
that they get very sad performance and so be it?

>
>
>
> >   This study preceded the recent patch to break the locks into pools
> > ("Break up monolithic iommu table/lock into finer graularity pools
> > and lock").  There were several points of lock contention:
> > - the RB tree ...
> > - the RB tree ...
> > - the RB tree ...
> >
> > Omer's paper
> > (https://www.usenix.org/system/files/conference/atc15/atc15-paper-peleg.pdf)
> > has some promising approaches.  The magazine avoids the RB tree
> > issue.
>
> I'm thinking of ditching the RB tree altogether and switching to the
> allocator in lib/iommu-common.c (and thus getting the benefit of the
> finer granularity pools).


Sounds promising!  Are 4 parallel arenas enough?  We'll try to play
with that here.
I think lazy_flush leaves dangling references.
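
For what it's worth, the magazine trick in Omer's paper amounts to a
per-CPU cache of recently freed IOVAs sitting in front of the shared
arenas.  A minimal sketch of the shape of it (every name below is
invented for illustration; this is not the lib/iommu-common.c API):

#include <linux/percpu.h>
#include <linux/types.h>

#define MAG_SIZE 128

struct iova_magazine {
        unsigned int count;
        dma_addr_t iovas[MAG_SIZE];
};

static DEFINE_PER_CPU(struct iova_magazine, iova_mag);

/* Fast path: pop from this CPU's magazine; fall back to the shared
 * (locked) arena allocator only when the magazine runs dry. */
static dma_addr_t iova_alloc_fast(void)
{
        struct iova_magazine *mag = get_cpu_ptr(&iova_mag);

        if (mag->count) {
                dma_addr_t iova = mag->iovas[--mag->count];

                put_cpu_ptr(&iova_mag);
                return iova;
        }
        put_cpu_ptr(&iova_mag);
        return iova_alloc_slow();       /* hypothetical shared path */
}

static void iova_free_fast(dma_addr_t iova)
{
        struct iova_magazine *mag = get_cpu_ptr(&iova_mag);

        if (mag->count < MAG_SIZE)
                mag->iovas[mag->count++] = iova;  /* cache for reuse */
        else
                iova_free_slow(iova);   /* magazine full: spill over */
        put_cpu_ptr(&iova_mag);
}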

>
>
> > I'm interested in seeing if the dynamic 1:1 with a mostly-lock-free
> > page table cleanup algorithm could do well.
>
> When you say 'dynamic 1:1 mapping', is that the same thing that's been
> suggested elsewhere — avoiding the IOVA allocator completely by using a
> virtual address which *matches* the physical address, if that virtual
> address is available?


Yes.  We munge two upper address bits into the IOVA to encode read
and write permissions as well.
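
Concretely, something along these lines, with the low IOVA bits equal
to the physical address (the bit positions are purely illustrative,
not what we actually shipped):

#include <linux/dma-direction.h>
#include <linux/types.h>

#define IOVA_DEV_READ   (1ULL << 46)    /* device may read memory */
#define IOVA_DEV_WRITE  (1ULL << 45)    /* device may write memory */

static dma_addr_t phys_to_iova(phys_addr_t paddr,
                               enum dma_data_direction dir)
{
        dma_addr_t iova = paddr;

        /* DMA_TO_DEVICE means the device reads from memory;
         * DMA_FROM_DEVICE means the device writes to memory. */
        if (dir == DMA_TO_DEVICE || dir == DMA_BIDIRECTIONAL)
                iova |= IOVA_DEV_READ;
        if (dir == DMA_FROM_DEVICE || dir == DMA_BIDIRECTIONAL)
                iova |= IOVA_DEV_WRITE;

        return iova;
}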

>
> Simply cmpxchg on the PTE itself, and if it was
> already set *then* we fall back to the allocator, obviously configured
> to allocate from a range *higher* than the available physical memory.
>
> Jörg has been looking at this too, and was even trying to find space in
> the PTE for a use count so a given page could be in more than one
> mapping before we call back to the IOVA allocator.


Aren't bits 63:52 available at all levels?
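
The cmpxchg scheme as I picture it, roughly (pte_for_paddr() and
iova_alloc_above_ram() are hypothetical helpers, not the intel-iommu
code):

/* Try to claim the identity slot with a single atomic op; if another
 * mapping already owns it, fall back to an IOVA allocated from above
 * the top of physical memory. */
static dma_addr_t map_dynamic_identity(phys_addr_t paddr, u64 prot)
{
        u64 *pte = pte_for_paddr(paddr);   /* walk/allocate levels */
        u64 pteval = (paddr & PAGE_MASK) | prot;

        if (cmpxchg64(pte, 0ULL, pteval) == 0ULL)
                return paddr;              /* won the race: IOVA == PA */

        return iova_alloc_above_ram(paddr, prot);  /* slow path */
}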

>
>
>
> > There are correctness fixes and optimizations left in the
> > invalidation path: I want strict-ish semantics (a page doesn't go
> > back into the freelist until the last IOTLB/IOMMU TLB entry is
> > invalidated) with good performance, and that seems to imply that an
> > additional page reference should be gotten at dma_map time and put
> > back at the completion of the IOMMU flush routine.  (This is worthy
> > of much discussion.)
>
> We already do something like this for page table pages which are freed
> by an unmap, FWIW.


As I understood the code when I last looked, this was true only if a
single unmap operation covered an entire table's worth (2MB, 1GB,
etc.) of mappings.  The caffeine hasn't hit yet, though, so I can't
even begin to dive into the new calls into mm.c.
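
To make the refcount idea concrete, the shape I have in mind is
something like this (heavily simplified; install_iommu_mapping() is a
stand-in, and the put would run from the invalidation-completion
path):

#include <linux/mm.h>

/* Pin the page across the unmap -> IOTLB-flush window so it cannot
 * be recycled while a stale translation might still reference it. */
static dma_addr_t my_dma_map_page(struct device *dev, struct page *page,
                                  unsigned long offset, size_t size)
{
        get_page(page);    /* dropped only after the flush completes */
        return install_iommu_mapping(dev, page, offset, size);
}

static void my_flush_done(struct page *page)
{
        /* Runs once the IOTLB (and any ATS) invalidation finished;
         * only now may the page go back to the free lists. */
        put_page(page);
}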


>
>
> > Additionally, we can find ways to optimize the flush routine by
> > realizing that if we have frequent maps and unmaps, it may be because
> > the device creates and destroys buffers a lot; these kinds of
> > workloads use an IOVA for one event and then never come back.  Maybe
> > TLBs don't do much good and we could just flush the entire IOMMU TLB
> > [and ATS cache] for that BDF.
>
> That would be a very interesting thing to look at. Although it would be
> nice if we had a better way to measure the performance impact of IOTLB
> misses — currently we don't have a lot of visibility at all.


All benchmarks are lies.  But we intend to run internal workloads as
well as widely agreed-upon loads and see how things go.
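
A cheap way to prototype that heuristic: count unmaps per device over
a short window and, past a threshold, issue one whole-device flush
instead of per-page invalidations.  The window and threshold below
are invented numbers:

#include <linux/ktime.h>

#define FLUSH_WINDOW_NS         (10 * NSEC_PER_MSEC)
#define FLUSH_THRESHOLD         256

struct bdf_flush_stats {
        u64 window_start;
        u32 unmaps_in_window;
};

/* True when this BDF churns through mappings fast enough that one
 * full IOTLB/ATS flush is likely cheaper than per-page invalidations. */
static bool prefer_full_flush(struct bdf_flush_stats *s)
{
        u64 now = ktime_get_ns();

        if (now - s->window_start > FLUSH_WINDOW_NS) {
                s->window_start = now;
                s->unmaps_in_window = 0;
        }
        return ++s->unmaps_in_window > FLUSH_THRESHOLD;
}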

>
>
> > 1: We verified that the IOMMU costs are almost entirely software
> > overheads by forcing software 1:1 mode, where we create page tables
> > for all physical addresses.  We tested using leaf nodes of size 4KB,
> > of 2MB, and of 1GB.  In all cases, there is zero runtime maintenance
> > of the page tables, and no IOMMU invalidations.  We did piles of DMA
> > maximizing x16 PCIe bandwidth on multiple lanes, to random DRAM
> > addresses.  At 4KB page size, we could see some bandwidth slowdown,
> > but at 2MB and 1GB, there was < 1% performance loss as compared with
> > IOMMU off.
>
> Was this with ATS on or off? With ATS, the cost of the page walk can be
> amortised in some cases — you can look up the physical address *before*
> you are ready to actually start the DMA to it, and don't take that
> latency at the time you're actually moving the data.


This was with ATS intentionally disabled; we wanted to stress the
IOMMU's page walker.  Most devices aren't as smart as you describe,
though in any kind of ring buffer where storage is posted, the device
should be doing that.
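
For completeness, the forced software 1:1 mode is conceptually just a
one-time loop populating leaf entries over all of physical RAM, e.g.
with 1GB superpages (set_iommu_pde_1g() and IOMMU_RW are stand-ins
for the real table-setup code):

#include <linux/sizes.h>

/* Build-once identity map with 1GB leaves: zero runtime page-table
 * maintenance and no invalidations afterwards. */
static void build_identity_map_1g(u64 max_phys)
{
        u64 addr;

        for (addr = 0; addr < max_phys; addr += SZ_1G)
                set_iommu_pde_1g(addr, addr | IOMMU_RW);
}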

