From: David Woodhouse <dwmw2@infradead.org>
To: Benjamin Serebrin <serebrin@google.com>,
	Shamir Rabinovitch <shamir.rabinovitch@oracle.com>
Cc: David Miller <davem@davemloft.net>,
	Joerg Roedel <jroedel@suse.de>,
	benh@kernel.crashing.org, arnd@arndb.de, corbet@lwn.net,
	linux-doc@vger.kernel.org, linux-arch@vger.kernel.org,
	luto@kernel.org, borntraeger@de.ibm.com,
	cornelia.huck@de.ibm.com, sebott@linux.vnet.ibm.com,
	Paolo Bonzini <pbonzini@redhat.com>,
	hch@lst.de, kvm@vger.kernel.org, schwidefsky@de.ibm.com,
	linux-s390@vger.kernel.org
Subject: Re: [PATCH v1 2/2] dma-mapping-common: add DMA attribute - DMA_ATTR_IOMMU_BYPASS
Date: Mon, 16 Nov 2015 08:42:33 +0000	[thread overview]
Message-ID: <1447663353.145626.46.camel@infradead.org> (raw)
In-Reply-To: <CAN+hb0UvztgwNuAh93XdJEe7vgiZgNMc9mHNziHpEopg8Oi4Mg@mail.gmail.com>

On Sun, 2015-11-15 at 22:54 -0800, Benjamin Serebrin wrote:
> We looked into Intel IOMMU performance a while ago and learned a few
> useful things.  We generally did a parallel 200 thread TCP_RR test,
> as this churns through mappings quickly and uses all available cores.
> 
> First, the main bottleneck was software performance[1].

For the Intel IOMMU, *all* we need to do is put a PTE in place. For
real hardware (i.e. not an IOMMU emulated by qemu for a VM), we don't
need to do an IOTLB flush. It's a single 64-bit write of the PTE.

All else is software overhead.

(I'm deliberately ignoring the stupid chipsets where DMA page tables
aren't cache coherent and we need a clflush too. They make me too sad.)
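To make the point concrete, here is a minimal sketch of that fast path. The PTE layout and names are invented for illustration; this is not the actual intel-iommu code, just the shape of it:

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative leaf-PTE bits; VT-d uses bit 0 for read, bit 1 for write. */
#define PTE_READ  (1ULL << 0)
#define PTE_WRITE (1ULL << 1)

/* Mapping a page is conceptually one 64-bit store of the new PTE.
 * On sane chipsets the page tables are cache coherent, so no clflush
 * is needed, and no IOTLB flush is needed when an entry goes from
 * not-present to present. */
static void map_page(uint64_t *pt, size_t idx, uint64_t phys)
{
	pt[idx] = phys | PTE_READ | PTE_WRITE;
}
```

Everything wrapped around that one store (allocation, locking, tracking) is the software overhead in question.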


>   This study preceded the recent patch to break the locks into pools
> ("Break up monolithic iommu table/lock into finer graularity pools
> and lock").  There were several points of lock contention:
> - the RB tree ...
> - the RB tree ...
> - the RB tree ...
> 
> Omer's paper (https://www.usenix.org/system/files/conference/atc15/at
> c15-paper-peleg.pdf) has some promising approaches.  The magazine
> avoids the RB tree issue.  

I'm thinking of ditching the RB tree altogether and switching to the
allocator in lib/iommu-common.c (and thus getting the benefit of the
finer granularity pools).

> I'm interested in seeing if the dynamic 1:1 with a mostly-lock-free
> page table cleanup algorithm could do well.

When you say 'dynamic 1:1 mapping', is that the same thing that's been
suggested elsewhere — avoiding the IOVA allocator completely by using a
virtual address which *matches* the physical address, if that virtual
address is available? Simply cmpxchg on the PTE itself, and if it was
already set *then* we fall back to the allocator, obviously configured
to allocate from a range *higher* than the available physical memory.
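A sketch of that cmpxchg fast path, with C11 atomics standing in for the kernel's cmpxchg and all names invented:

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Dynamic 1:1: try to claim the IOVA equal to the physical address by
 * cmpxchg'ing the (hypothetical) leaf PTE from empty. If someone else
 * got there first, the caller falls back to a conventional IOVA
 * allocator configured to hand out addresses above the top of RAM. */
static bool try_identity_map(_Atomic uint64_t *pte, uint64_t new_pte)
{
	uint64_t expected = 0;	/* only claim a not-present entry */
	return atomic_compare_exchange_strong(pte, &expected, new_pte);
}
```

The win is that the common case touches only the PTE itself, with no allocator lock and no search.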

Jörg has been looking at this too, and was even trying to find space in
the PTE for a use count so a given page could be in more than one
mapping before we call back to the IOVA allocator.


> There are correctness fixes and optimizations left in the
> invalidation path: I want strict-ish semantics (a page doesn't go
> back into the freelist until the last IOTLB/IOMMU TLB entry is
> invalidated) with good performance, and that seems to imply that an
> additional page reference should be gotten at dma_map time and put
> back at the completion of the IOMMU flush routine.  (This is worthy
> of much discussion.)  

We already do something like this for page table pages which are freed
by an unmap, FWIW.
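The strict scheme described above amounts to a refcount whose final put happens in the flush completion path. A sketch, again with C11 atomics standing in for kernel refcounts and all names invented:

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Strict-ish semantics: a page may not return to the freelist until
 * the last IOTLB entry covering it has been invalidated. Take a
 * reference at dma_map time, drop it when the flush completes. */
struct dma_page_ref {
	_Atomic int ref;
};

static void dma_map_get(struct dma_page_ref *p)
{
	atomic_fetch_add(&p->ref, 1);
}

/* Returns true when the last reference drops, i.e. the page may now
 * go back to the freelist. */
static bool iotlb_flush_put(struct dma_page_ref *p)
{
	return atomic_fetch_sub(&p->ref, 1) == 1;
}
```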

> Additionally, we can find ways to optimize the flush routine by
> realizing that if we have frequent maps and unmaps, it may be because
> the device creates and destroys buffers a lot; these kind of
> workloads use an IOVA for one event and then never come back.  Maybe
> TLBs don't do much good and we could just flush the entire IOMMU TLB
> [and ATS cache] for that BDF.

That would be a very interesting thing to look at. Although it would be
nice if we had a better way to measure the performance impact of IOTLB
misses — currently we don't have a lot of visibility at all.
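The "just flush the whole BDF" idea could be gated on a simple rate heuristic. A sketch, with the threshold and names entirely invented:

```c
#include <stdbool.h>
#include <stdint.h>

/* Invented threshold: above this many unmaps per sampling interval we
 * assume one-shot buffers whose IOTLB entries will never be reused,
 * and issue one domain-wide IOTLB (and ATS) flush instead of
 * per-IOVA invalidations. */
#define ONESHOT_UNMAP_THRESHOLD 1024

static bool prefer_domain_wide_flush(uint64_t unmaps_in_interval)
{
	return unmaps_in_interval > ONESHOT_UNMAP_THRESHOLD;
}
```

Without better visibility into IOTLB miss rates, any such threshold would have to be tuned empirically.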

> 1: We verified that the IOMMU costs are almost entirely software
> overheads by forcing software 1:1 mode, where we create page tables
> for all physical addresses.  We tested using leaf nodes of size 4KB,
> of 2MB, and of 1GB.  In all cases, there is zero runtime maintenance
> of the page tables, and no IOMMU invalidations.  We did piles of DMA
> maximizing x16 PCIe bandwidth on multiple lanes, to random DRAM
> addresses.  At 4KB page size, we could see some bandwidth slowdown,
> but at 2MB and 1GB, there was < 1% performance loss as compared with
> IOMMU off.
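The leaf-table sizes behind those numbers are easy to sanity-check. A back-of-envelope sketch, assuming 8-byte PTEs; the 64 GiB figure is an example, not from the mail:

```c
#include <stdint.h>

/* Leaf PTEs needed to statically identity-map `mem` bytes with
 * `leaf`-byte pages (assumes leaf divides mem evenly). For 64 GiB:
 * 4KB leaves need 16M PTEs (128 MiB of tables), 2MB leaves need
 * 32768, and 1GB leaves need only 64. */
static uint64_t leaf_ptes(uint64_t mem, uint64_t leaf)
{
	return mem / leaf;
}
```

At 2MB and 1GB leaf sizes the tables fit comfortably in cache, which is consistent with the reported <1% loss.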

Was this with ATS on or off? With ATS, the cost of the page walk can be
amortised in some cases — you can look up the physical address *before*
you are ready to actually start the DMA to it, and don't take that
latency at the time you're actually moving the data.

-- 
David Woodhouse                            Open Source Technology Centre
David.Woodhouse@intel.com                              Intel Corporation




Thread overview: 42+ messages
2015-10-25 16:07 [PATCH v1 1/2] dma-mapping-common: add dma_map_page_attrs API Shamir Rabinovitch
2015-10-25 16:07 ` [PATCH v1 2/2] dma-mapping-common: add DMA attribute - DMA_ATTR_IOMMU_BYPASS Shamir Rabinovitch
2015-10-28  6:30   ` David Woodhouse
2015-10-28 11:10     ` Shamir Rabinovitch
2015-10-28 13:31       ` David Woodhouse
2015-10-28 14:07         ` David Miller
2015-10-28 13:57           ` David Woodhouse
2015-10-29  0:23             ` David Miller
2015-10-29  0:32         ` Benjamin Herrenschmidt
2015-10-29  0:42           ` David Woodhouse
2015-10-29  1:10             ` Benjamin Herrenschmidt
2015-10-29 18:31               ` Andy Lutomirski
2015-10-29 22:35                 ` David Woodhouse
2015-11-01  7:45                   ` Shamir Rabinovitch
2015-11-01 21:10                     ` Benjamin Herrenschmidt
2015-11-02  7:23                       ` Shamir Rabinovitch
2015-11-02 10:00                         ` Benjamin Herrenschmidt
2015-11-02 12:07                           ` Shamir Rabinovitch
2015-11-02 20:13                             ` Benjamin Herrenschmidt
2015-11-02 21:45                               ` Arnd Bergmann
2015-11-02 23:08                                 ` Benjamin Herrenschmidt
2015-11-03 13:11                                   ` Christoph Hellwig
2015-11-03 19:35                                     ` Benjamin Herrenschmidt
2015-11-02 21:49                               ` Shamir Rabinovitch
2015-11-02 22:48                       ` David Woodhouse
2015-11-02 23:10                         ` Benjamin Herrenschmidt
2015-11-05 21:08                   ` David Miller
2015-10-30  1:51                 ` Benjamin Herrenschmidt
2015-10-30 10:32               ` Arnd Bergmann
2015-10-30 23:17                 ` Benjamin Herrenschmidt
2015-10-30 23:24                   ` Arnd Bergmann
2015-11-02 14:51                 ` Joerg Roedel
2015-10-29  7:32             ` Shamir Rabinovitch
2015-11-02 14:44               ` Joerg Roedel
2015-11-02 17:32                 ` Shamir Rabinovitch
2015-11-05 13:42                   ` Joerg Roedel
2015-11-05 21:11                     ` David Miller
2015-11-07 15:06                       ` Shamir Rabinovitch
     [not found]                         ` <CAN+hb0UvztgwNuAh93XdJEe7vgiZgNMc9mHNziHpEopg8Oi4Mg@mail.gmail.com>
2015-11-16  8:42                           ` David Woodhouse [this message]
     [not found]                             ` <CAN+hb0UWpfcS5DvgMxNjY-5JOztw2mO1r2FJAW17fn974mhxPA@mail.gmail.com>
2015-11-16 18:42                               ` Benjamin Serebrin
2015-10-25 16:37 [PATCH v1 1/2] dma-mapping-common: add dma_map_page_attrs API Shamir Rabinovitch
2015-10-25 16:37 ` [PATCH v1 2/2] dma-mapping-common: add DMA attribute - DMA_ATTR_IOMMU_BYPASS Shamir Rabinovitch
2015-11-16  6:56 Benjamin Serebrin
