* [PATCH v1 1/2] dma-mapping-common: add dma_map_page_attrs API
@ 2015-10-25 16:07 Shamir Rabinovitch
  2015-10-25 16:07 ` [PATCH v1 2/2] dma-mapping-common: add DMA attribute - DMA_ATTR_IOMMU_BYPASS Shamir Rabinovitch
  0 siblings, 1 reply; 42+ messages in thread
From: Shamir Rabinovitch @ 2015-10-25 16:07 UTC (permalink / raw)
  To: arnd, corbet, linux-doc, linux-arch

The SPARC64 architecture needs the ability to pass DMA attributes when
mapping pages. DMA performance is poor when specific attributes such as
DMA_ATTR_WEAK_ORDERING cannot be used.

Signed-off-by: Shamir Rabinovitch <shamir.rabinovitch@oracle.com>
---
 include/asm-generic/dma-mapping-common.h |   32 +++++++++++++++++++++++------
 1 files changed, 25 insertions(+), 7 deletions(-)

diff --git a/include/asm-generic/dma-mapping-common.h b/include/asm-generic/dma-mapping-common.h
index b1bc954..b816ed7 100644
--- a/include/asm-generic/dma-mapping-common.h
+++ b/include/asm-generic/dma-mapping-common.h
@@ -74,32 +74,50 @@ static inline void dma_unmap_sg_attrs(struct device *dev, struct scatterlist *sg
 		ops->unmap_sg(dev, sg, nents, dir, attrs);
 }
 
-static inline dma_addr_t dma_map_page(struct device *dev, struct page *page,
-				      size_t offset, size_t size,
-				      enum dma_data_direction dir)
+static inline dma_addr_t dma_map_page_attrs(struct device *dev,
+					    struct page *page,
+					    size_t offset, size_t size,
+					    enum dma_data_direction dir,
+					    struct dma_attrs *attrs)
 {
 	struct dma_map_ops *ops = get_dma_ops(dev);
 	dma_addr_t addr;
 
 	kmemcheck_mark_initialized(page_address(page) + offset, size);
 	BUG_ON(!valid_dma_direction(dir));
-	addr = ops->map_page(dev, page, offset, size, dir, NULL);
+	addr = ops->map_page(dev, page, offset, size, dir, attrs);
 	debug_dma_map_page(dev, page, offset, size, dir, addr, false);
 
 	return addr;
 }
 
-static inline void dma_unmap_page(struct device *dev, dma_addr_t addr,
-				  size_t size, enum dma_data_direction dir)
+static inline void dma_unmap_page_attrs(struct device *dev, dma_addr_t addr,
+					size_t size,
+					enum dma_data_direction dir,
+					struct dma_attrs *attrs)
 {
 	struct dma_map_ops *ops = get_dma_ops(dev);
 
 	BUG_ON(!valid_dma_direction(dir));
 	if (ops->unmap_page)
-		ops->unmap_page(dev, addr, size, dir, NULL);
+		ops->unmap_page(dev, addr, size, dir, attrs);
 	debug_dma_unmap_page(dev, addr, size, dir, false);
 }
 
+static inline dma_addr_t dma_map_page(struct device *dev,
+				      struct page *page,
+				      size_t offset, size_t size,
+				      enum dma_data_direction dir)
+{
+	return dma_map_page_attrs(dev, page, offset, size, dir, NULL);
+}
+
+static inline void dma_unmap_page(struct device *dev, dma_addr_t addr,
+				  size_t size, enum dma_data_direction dir)
+{
+	dma_unmap_page_attrs(dev, addr, size, dir, NULL);
+}
+
 static inline void dma_sync_single_for_cpu(struct device *dev, dma_addr_t addr,
 					   size_t size,
 					   enum dma_data_direction dir)
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH v1 2/2] dma-mapping-common: add DMA attribute - DMA_ATTR_IOMMU_BYPASS
  2015-10-25 16:07 [PATCH v1 1/2] dma-mapping-common: add dma_map_page_attrs API Shamir Rabinovitch
@ 2015-10-25 16:07 ` Shamir Rabinovitch
  2015-10-28  6:30   ` David Woodhouse
  0 siblings, 1 reply; 42+ messages in thread
From: Shamir Rabinovitch @ 2015-10-25 16:07 UTC (permalink / raw)
  To: arnd, corbet, linux-doc, linux-arch

On systems with an IOMMU it is assumed that all DMA translations use the
IOMMU. There are, however, cases where the IOMMU might map only a limited
range of memory, or where the cost of setting up an IOMMU DMA mapping is
too high. For those cases, this flag can be used to request that the
DMA-mapping subsystem bypass the IOMMU where applicable.

Signed-off-by: Shamir Rabinovitch <shamir.rabinovitch@oracle.com>
---
 Documentation/DMA-attributes.txt |    9 +++++++++
 include/linux/dma-attrs.h        |    1 +
 2 files changed, 10 insertions(+), 0 deletions(-)

diff --git a/Documentation/DMA-attributes.txt b/Documentation/DMA-attributes.txt
index 18dc52c..1174afc 100644
--- a/Documentation/DMA-attributes.txt
+++ b/Documentation/DMA-attributes.txt
@@ -100,3 +100,12 @@ allocated by dma_alloc_attrs() function from individual pages if it can
 be mapped as contiguous chunk into device dma address space. By
 specifying this attribute the allocated buffer is forced to be contiguous
 also in physical memory.
+
+DMA_ATTR_IOMMU_BYPASS
+---------------------
+
+For systems with IOMMU it is assumed all DMA translations use the IOMMU.
+There are, however, cases when IOMMU might map only limited range of the
+memory or when the cost of setting IOMMU dma mapping is too big. For those
+cases, this flag can be used to request the DMA-mapping subsystem to bypass
+the IOMMU if applicable.
diff --git a/include/linux/dma-attrs.h b/include/linux/dma-attrs.h
index c8e1831..7c78691 100644
--- a/include/linux/dma-attrs.h
+++ b/include/linux/dma-attrs.h
@@ -18,6 +18,7 @@ enum dma_attr {
 	DMA_ATTR_NO_KERNEL_MAPPING,
 	DMA_ATTR_SKIP_CPU_SYNC,
 	DMA_ATTR_FORCE_CONTIGUOUS,
+	DMA_ATTR_IOMMU_BYPASS,
 	DMA_ATTR_MAX,
 };
 
-- 
1.7.1


* Re: [PATCH v1 2/2] dma-mapping-common: add DMA attribute - DMA_ATTR_IOMMU_BYPASS
  2015-10-25 16:07 ` [PATCH v1 2/2] dma-mapping-common: add DMA attribute - DMA_ATTR_IOMMU_BYPASS Shamir Rabinovitch
@ 2015-10-28  6:30   ` David Woodhouse
  2015-10-28 11:10     ` Shamir Rabinovitch
  0 siblings, 1 reply; 42+ messages in thread
From: David Woodhouse @ 2015-10-28  6:30 UTC (permalink / raw)
  To: Shamir Rabinovitch, arnd, corbet, linux-doc, linux-arch
  Cc: Andy Lutomirski, Joerg Roedel, Christian Borntraeger,
	Cornelia Huck, Sebastian Ott, Paolo Bonzini, Christoph Hellwig,
	benh, KVM, dwmw2, Martin Schwidefsky, linux-s390


On Sun, 2015-10-25 at 09:07 -0700, Shamir Rabinovitch wrote:
> 
> +
> +DMA_ATTR_IOMMU_BYPASS
> +---------------------
> +
> > +For systems with IOMMU it is assumed all DMA translations use the IOMMU.

Not entirely true. We already have per-device dma_ops on most
architectures, and we were just talking about the need to add them to
POWER/SPARC too, because we need to avoid trying to use the IOMMU to
map virtio devices.

As we look at that (and make the per-device dma_ops a generic thing
rather than per-arch), we should probably look at folding your case in
too.

-- 
dwmw2





* Re: [PATCH v1 2/2] dma-mapping-common: add DMA attribute - DMA_ATTR_IOMMU_BYPASS
  2015-10-28  6:30   ` David Woodhouse
@ 2015-10-28 11:10     ` Shamir Rabinovitch
  2015-10-28 13:31       ` David Woodhouse
  0 siblings, 1 reply; 42+ messages in thread
From: Shamir Rabinovitch @ 2015-10-28 11:10 UTC (permalink / raw)
  To: David Woodhouse
  Cc: arnd, corbet, linux-doc, linux-arch, Andy Lutomirski,
	Joerg Roedel, Christian Borntraeger, Cornelia Huck,
	Sebastian Ott, Paolo Bonzini, Christoph Hellwig, benh, KVM,
	Martin Schwidefsky, linux-s390

On Wed, Oct 28, 2015 at 03:30:01PM +0900, David Woodhouse wrote:
> > > +For systems with IOMMU it is assumed all DMA translations use the IOMMU.
> 
> Not entirely true. We have per-device dma_ops on a most architectures
> already, and we were just talking about the need to add them to
> POWER/SPARC too, because we need to avoid trying to use the IOMMU to
> map virtio devices too.

SPARC has its dma_ops implementation (sun4v_dma_ops) under arch/sparc.

Some drivers on SPARC use the IOMMU, for example ixgbe (Intel 10G
Ethernet). Others, like IB, suffer from the IOMMU map setup/tear-down
cost and the limited address range. On SPARC, IOMMU bypass is not a total
bypass of the IOMMU but rather a much simpler translation that does not
require any complex translation tables.

> 
> As we look at that (and make the per-device dma_ops a generic thing
> rather than per-arch), we should probably look at folding your case in
> too.

Whether to use the IOMMU for DMA is up to the driver. The example above
shows a real situation where one driver can use the IOMMU and another
cannot. The device itself cannot know when to use IOMMU bypass and when
not to. Even within a single driver we may decide to DMA-map some buffers
with IOMMU translation and others with IOMMU bypass.

Do you agree that we need this attribute in the generic DMA API?

> 
> -- 
> dwmw2
> 
> 


* Re: [PATCH v1 2/2] dma-mapping-common: add DMA attribute - DMA_ATTR_IOMMU_BYPASS
  2015-10-28 11:10     ` Shamir Rabinovitch
@ 2015-10-28 13:31       ` David Woodhouse
  2015-10-28 14:07         ` David Miller
  2015-10-29  0:32         ` Benjamin Herrenschmidt
  0 siblings, 2 replies; 42+ messages in thread
From: David Woodhouse @ 2015-10-28 13:31 UTC (permalink / raw)
  To: Shamir Rabinovitch
  Cc: arnd, corbet, linux-doc, linux-arch, Andy Lutomirski,
	Joerg Roedel, Christian Borntraeger, Cornelia Huck,
	Sebastian Ott, Paolo Bonzini, Christoph Hellwig, benh, KVM,
	Martin Schwidefsky, linux-s390


On Wed, 2015-10-28 at 13:10 +0200, Shamir Rabinovitch wrote:
> On Wed, Oct 28, 2015 at 03:30:01PM +0900, David Woodhouse wrote:
> > > > +For systems with IOMMU it is assumed all DMA translations use the IOMMU.
> > 
> > Not entirely true. We have per-device dma_ops on a most architectures
> > already, and we were just talking about the need to add them to
> > POWER/SPARC too, because we need to avoid trying to use the IOMMU to
> > map virtio devices too.
> 
> SPARC has it's implementation under arch/sparc for dma_ops (sun4v_dma_ops).
> 
> Some drivers use IOMMU under SPARC for example ixgbe (Intel 10G ETH).
> Some, like IB, suffer from IOMMU MAP setup/tear-down & limited address range.
> On SPARC IOMMU bypass is not total bypass of the IOMMU but rather much simple
> translation that does not require any complex translations tables.

We have an option in the Intel IOMMU for pass-through mode too, which
basically *is* a total bypass. In practice, what's the difference
between that and a "simple translation that does not require any
[translation]"? We set up a full 1:1 mapping of all memory, and then
the map/unmap methods become no-ops.

Currently we have no way to request that mode on a per-device basis; we
only have 'iommu=pt' on the command line to set it for *all* devices.
But performance-sensitive devices might want it, while we keep doing
proper translation for others.

> > 
> > As we look at that (and make the per-device dma_ops a generic thing
> > rather than per-arch), we should probably look at folding your case in
> > too.
> 
> Whether to use IOMMU or not for DMA is up to the driver. The above example
> show real situation where one driver can use IOMMU and the other can't. It
> seems that the device cannot know when to use IOMMU bypass and when not.
> Even in the driver we can decide to DMA map some buffers using IOMMU
> translation and some as IOMMU bypass.

In practice there are multiple possibilities — there are cases where
you *must* use an IOMMU and do full translation, and there is no option
of a bypass. There are cases where there just isn't an IOMMU (and
sometimes that's a per-device fact, like with virtio). And there are
cases where you *can* use the IOMMU, but if you ask nicely you can get
away without it.

My point in linking up these two threads is that we should contemplate
all of those use cases and come up with something that addresses them
all.

> Do you agree that we need this attribute in the generic DMA API? 

Yeah, I think it can be useful.

-- 
David Woodhouse                            Open Source Technology Centre
David.Woodhouse@intel.com                              Intel Corporation




* Re: [PATCH v1 2/2] dma-mapping-common: add DMA attribute - DMA_ATTR_IOMMU_BYPASS
  2015-10-28 14:07         ` David Miller
@ 2015-10-28 13:57           ` David Woodhouse
  2015-10-29  0:23             ` David Miller
  0 siblings, 1 reply; 42+ messages in thread
From: David Woodhouse @ 2015-10-28 13:57 UTC (permalink / raw)
  To: David Miller
  Cc: shamir.rabinovitch, arnd, corbet, linux-doc, linux-arch, luto,
	jroedel, borntraeger, cornelia.huck, sebott, pbonzini, hch, benh,
	kvm, schwidefsky, linux-s390


On Wed, 2015-10-28 at 07:07 -0700, David Miller wrote:
> In the sparc64 case, the 64-bit DMA address space is divided into
> IOMMU translated and non-IOMMU translated.
> 
> You just set the high bits differently depending upon what you want.

Wait, does that mean a (rogue) device could *always* get full access to
physical memory just by setting the high bits appropriately? That
mapping is *always* available?

> So a device could use both IOMMU translated and bypass accesses at 
> the same time.  While seemingly interesting, I do not recommend we 
> provide this kind of flexibility in our DMA interfaces.

Now I could understand this if the answer to my question above was
'no'. We absolutely want the *security* all the time, and we don't want
the device to be able to do stupid stuff. But if the answer was 'yes'
then we take the map/unmap performance hit for... *what* benefit?

On Intel we have the passthrough as an *option* and I have the same
initial reaction — "Hell no, we want the security". But I concede the
performance motivation for it, and I'm not *dead* set against
permitting it.

If I tolerate a per-device request for passthrough mode, that might
prevent people from disabling the IOMMU or putting it into passthrough
mode *entirely*. So actually, I'm *improving* security...

I think it makes sense to allow performance sensitive device drivers to
*request* a passthrough mode. The platform can reserve the right to
refuse, if either the IOMMU hardware doesn't support that, or we're in
a paranoid mode (with iommu=always or something on the command line).


-- 
dwmw2





* Re: [PATCH v1 2/2] dma-mapping-common: add DMA attribute - DMA_ATTR_IOMMU_BYPASS
  2015-10-28 13:31       ` David Woodhouse
@ 2015-10-28 14:07         ` David Miller
  2015-10-28 13:57           ` David Woodhouse
  2015-10-29  0:32         ` Benjamin Herrenschmidt
  1 sibling, 1 reply; 42+ messages in thread
From: David Miller @ 2015-10-28 14:07 UTC (permalink / raw)
  To: dwmw2
  Cc: shamir.rabinovitch, arnd, corbet, linux-doc, linux-arch, luto,
	jroedel, borntraeger, cornelia.huck, sebott, pbonzini, hch, benh,
	kvm, schwidefsky, linux-s390

From: David Woodhouse <dwmw2@infradead.org>
Date: Wed, 28 Oct 2015 22:31:50 +0900

> On Wed, 2015-10-28 at 13:10 +0200, Shamir Rabinovitch wrote:
>> On Wed, Oct 28, 2015 at 03:30:01PM +0900, David Woodhouse wrote:
>> > > > +For systems with IOMMU it is assumed all DMA translations use the IOMMU.
>> > 
>> > Not entirely true. We have per-device dma_ops on a most architectures
>> > already, and we were just talking about the need to add them to
>> > POWER/SPARC too, because we need to avoid trying to use the IOMMU to
>> > map virtio devices too.
>> 
>> SPARC has it's implementation under arch/sparc for dma_ops (sun4v_dma_ops).
>> 
>> Some drivers use IOMMU under SPARC for example ixgbe (Intel 10G ETH).
>> Some, like IB, suffer from IOMMU MAP setup/tear-down & limited address range.
>> On SPARC IOMMU bypass is not total bypass of the IOMMU but rather much simple
>> translation that does not require any complex translations tables.
> 
> We have an option in the Intel IOMMU for pass-through mode too, which
> basically *is* a total bypass. In practice, what's the difference
> between that and a "simple translation that does not require any
> [translation]"? We set up a full 1:1 mapping of all memory, and then
> the map/unmap methods become no-ops.
> 
> Currently we have no way to request that mode on a per-device basis; we
> only have 'iommu=pt' on the command line to set it for *all* devices.
> But performance-sensitive devices might want it, while we keep doing
> proper translation for others.

In the sparc64 case, the 64-bit DMA address space is divided into
IOMMU translated and non-IOMMU translated.

You just set the high bits differently depending upon what you want.

So a device could use both IOMMU translated and bypass accesses at the
same time.  While seemingly interesting, I do not recommend we provide
this kind of flexibility in our DMA interfaces.


* Re: [PATCH v1 2/2] dma-mapping-common: add DMA attribute - DMA_ATTR_IOMMU_BYPASS
  2015-10-28 13:57           ` David Woodhouse
@ 2015-10-29  0:23             ` David Miller
  0 siblings, 0 replies; 42+ messages in thread
From: David Miller @ 2015-10-29  0:23 UTC (permalink / raw)
  To: dwmw2
  Cc: shamir.rabinovitch, arnd, corbet, linux-doc, linux-arch, luto,
	jroedel, borntraeger, cornelia.huck, sebott, pbonzini, hch, benh,
	kvm, schwidefsky, linux-s390

From: David Woodhouse <dwmw2@infradead.org>
Date: Wed, 28 Oct 2015 22:57:12 +0900

> On Wed, 2015-10-28 at 07:07 -0700, David Miller wrote:
>> In the sparc64 case, the 64-bit DMA address space is divided into
>> IOMMU translated and non-IOMMU translated.
>> 
>> You just set the high bits differently depending upon what you want.
> 
> Wait, does that mean a (rogue) device could *always* get full access to
> physical memory just by setting the high bits appropriately? That
> mapping is *always* available?

The IOMMU supports several modes, one of which disallows passthrough
and this is what you would use in a virtual guest.


* Re: [PATCH v1 2/2] dma-mapping-common: add DMA attribute - DMA_ATTR_IOMMU_BYPASS
  2015-10-28 13:31       ` David Woodhouse
  2015-10-28 14:07         ` David Miller
@ 2015-10-29  0:32         ` Benjamin Herrenschmidt
  2015-10-29  0:42           ` David Woodhouse
  1 sibling, 1 reply; 42+ messages in thread
From: Benjamin Herrenschmidt @ 2015-10-29  0:32 UTC (permalink / raw)
  To: David Woodhouse, Shamir Rabinovitch
  Cc: arnd, corbet, linux-doc, linux-arch, Andy Lutomirski,
	Joerg Roedel, Christian Borntraeger, Cornelia Huck,
	Sebastian Ott, Paolo Bonzini, Christoph Hellwig, KVM,
	Martin Schwidefsky, linux-s390

On Wed, 2015-10-28 at 22:31 +0900, David Woodhouse wrote:
> We have an option in the Intel IOMMU for pass-through mode too, which
> basically *is* a total bypass. In practice, what's the difference
> between that and a "simple translation that does not require any
> [translation]"? We set up a full 1:1 mapping of all memory, and then
> the map/unmap methods become no-ops.
> 
> Currently we have no way to request that mode on a per-device basis;
> we
> only have 'iommu=pt' on the command line to set it for *all* devices.
> But performance-sensitive devices might want it, while we keep doing
> proper translation for others.

On Power, I generally have two IOMMU windows for a device: the one at the
bottom is remapped and is generally used for 32-bit devices, and the one
at the top is set up as a bypass (or, in the case of KVM, as a linear
mapping of guest memory, which looks the same as a bypass to the guest).

The DMA ops will automatically hit the appropriate window based on the
DMA mask.

I don't see how that attribute would work for us.

Cheers,
Ben.


* Re: [PATCH v1 2/2] dma-mapping-common: add DMA attribute - DMA_ATTR_IOMMU_BYPASS
  2015-10-29  0:32         ` Benjamin Herrenschmidt
@ 2015-10-29  0:42           ` David Woodhouse
  2015-10-29  1:10             ` Benjamin Herrenschmidt
  2015-10-29  7:32             ` Shamir Rabinovitch
  0 siblings, 2 replies; 42+ messages in thread
From: David Woodhouse @ 2015-10-29  0:42 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, Shamir Rabinovitch
  Cc: arnd, corbet, linux-doc, linux-arch, Andy Lutomirski,
	Joerg Roedel, Christian Borntraeger, Cornelia Huck,
	Sebastian Ott, Paolo Bonzini, Christoph Hellwig, KVM,
	Martin Schwidefsky, linux-s390


On Thu, 2015-10-29 at 09:32 +0900, Benjamin Herrenschmidt wrote:

> On Power, I generally have 2 IOMMU windows for a device, one at the
> bottom is remapped, and is generally used for 32-bit devices and the
> one at the top us setup as a bypass 

So in the normal case of decent 64-bit devices (and not in a VM),
they'll *already* be using the bypass region and have full access to
all of memory, all of the time? And you have no protection against
driver and firmware bugs causing stray DMA?

> I don't see how that attribute would work for us.

Because you're already doing it anyway without being asked :)

If SPARC and POWER are both doing that, perhaps we should change the
default for Intel too?

Aside from the lack of security, the other disadvantage of that is that
you have to pin *all* pages of a guest in case DMA happens; you don't
get to pin *only* those pages which are referenced by that guest's
IOMMU page tables...

Maybe we should at least coordinate IOMMU 'paranoid/fast' modes across
architectures, and then the DMA_ATTR_IOMMU_BYPASS flag would have a
sane meaning in the paranoid mode (and perhaps we'd want an ultra
-paranoid mode where it's not honoured).

-- 
dwmw2





* Re: [PATCH v1 2/2] dma-mapping-common: add DMA attribute - DMA_ATTR_IOMMU_BYPASS
  2015-10-29  0:42           ` David Woodhouse
@ 2015-10-29  1:10             ` Benjamin Herrenschmidt
  2015-10-29 18:31               ` Andy Lutomirski
  2015-10-30 10:32               ` Arnd Bergmann
  2015-10-29  7:32             ` Shamir Rabinovitch
  1 sibling, 2 replies; 42+ messages in thread
From: Benjamin Herrenschmidt @ 2015-10-29  1:10 UTC (permalink / raw)
  To: David Woodhouse, Shamir Rabinovitch
  Cc: arnd, corbet, linux-doc, linux-arch, Andy Lutomirski,
	Joerg Roedel, Christian Borntraeger, Cornelia Huck,
	Sebastian Ott, Paolo Bonzini, Christoph Hellwig, KVM,
	Martin Schwidefsky, linux-s390

On Thu, 2015-10-29 at 09:42 +0900, David Woodhouse wrote:
> On Thu, 2015-10-29 at 09:32 +0900, Benjamin Herrenschmidt wrote:
> 
> > On Power, I generally have 2 IOMMU windows for a device, one at the
> > bottom is remapped, and is generally used for 32-bit devices and the
> > one at the top us setup as a bypass 
> 
> So in the normal case of decent 64-bit devices (and not in a VM),
> they'll *already* be using the bypass region and have full access to
> all of memory, all of the time? And you have no protection against
> driver and firmware bugs causing stray DMA?

Correct, we chose to do that for performance reasons.

> > I don't see how that attribute would work for us.
> 
> Because you're already doing it anyway without being asked :)
> 
> If SPARC and POWER are both doing that, perhaps we should change the
> default for Intel too?
> 
> Aside from the lack of security, the other disadvantage of that is that
> you have to pin *all* pages of a guest in case DMA happens; you don't
> get to pin *only* those pages which are referenced by that guest's
> IOMMU page tables...

Correct, the problem is that the cost of doing map/unmap from a guest
is really a huge hit on things like network devices.

Another problem is that the failure mode isn't great if you don't pin. 

I.e., you have to pin pages as they get mapped into the IOMMU by the
guest, but you don't know in advance how many, and you may hit the
process ulimit on pinned pages halfway through.

We tried to address that in various ways but it always ended up horrid.

> Maybe we should at least coordinate IOMMU 'paranoid/fast' modes across
> architectures, and then the DMA_ATTR_IOMMU_BYPASS flag would have a
> sane meaning in the paranoid mode (and perhaps we'd want an ultra
> -paranoid mode where it's not honoured).

Possibly, though ideally that would be a user policy but of course by
the time you get to userspace it's generally too late.

Cheers,
Ben.


* Re: [PATCH v1 2/2] dma-mapping-common: add DMA attribute - DMA_ATTR_IOMMU_BYPASS
  2015-10-29  0:42           ` David Woodhouse
  2015-10-29  1:10             ` Benjamin Herrenschmidt
@ 2015-10-29  7:32             ` Shamir Rabinovitch
  2015-11-02 14:44               ` Joerg Roedel
  1 sibling, 1 reply; 42+ messages in thread
From: Shamir Rabinovitch @ 2015-10-29  7:32 UTC (permalink / raw)
  To: David Woodhouse
  Cc: Benjamin Herrenschmidt, arnd, corbet, linux-doc, linux-arch,
	Andy Lutomirski, Joerg Roedel, Christian Borntraeger,
	Cornelia Huck, Sebastian Ott, Paolo Bonzini, Christoph Hellwig,
	KVM, Martin Schwidefsky, linux-s390

On Thu, Oct 29, 2015 at 09:42:12AM +0900, David Woodhouse wrote:

> Aside from the lack of security, the other disadvantage of that is that
> you have to pin *all* pages of a guest in case DMA happens; you don't
> get to pin *only* those pages which are referenced by that guest's
> IOMMU page tables...

We bypass the IOMMU but not the DMA API, and given that we pin the page
before calling the DMA API, we do not need to pin all the pages; just
the ones we actually use for DMA.

To me this flag looks orthogonal to the page-pinning issue you brought
up. It is just a hint to the DMA API that we want a simple & fast
mapping, knowing that we lose IOMMU protection for this memory.

For the IB case, setting up and tearing down DMA mappings for the
driver's data buffers is expensive. But we could, for example, map all
the HW control objects that validate the HW's access to the driver's
data buffers as IOMMU-protected, and so keep IOMMU protection on those
critical objects while having fast set-up/tear-down for the data
buffers. The HW control objects have a stable mapping for the lifetime
of the system, so the case of using both types of DMA mappings is still
valid.
 
> 
> -- 
> dwmw2
> 
> 


* Re: [PATCH v1 2/2] dma-mapping-common: add DMA attribute - DMA_ATTR_IOMMU_BYPASS
  2015-10-29  1:10             ` Benjamin Herrenschmidt
@ 2015-10-29 18:31               ` Andy Lutomirski
  2015-10-29 22:35                 ` David Woodhouse
  2015-10-30  1:51                 ` Benjamin Herrenschmidt
  2015-10-30 10:32               ` Arnd Bergmann
  1 sibling, 2 replies; 42+ messages in thread
From: Andy Lutomirski @ 2015-10-29 18:31 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Christian Borntraeger, linux-arch, Paolo Bonzini,
	Shamir Rabinovitch, David Woodhouse, Martin Schwidefsky,
	linux-doc, Sebastian Ott, linux-s390, Cornelia Huck,
	Joerg Roedel, Jonathan Corbet, KVM, Arnd Bergmann,
	Christoph Hellwig

On Oct 28, 2015 6:11 PM, "Benjamin Herrenschmidt"
<benh@kernel.crashing.org> wrote:
>
> On Thu, 2015-10-29 at 09:42 +0900, David Woodhouse wrote:
> > On Thu, 2015-10-29 at 09:32 +0900, Benjamin Herrenschmidt wrote:
> >
> > > On Power, I generally have 2 IOMMU windows for a device, one at the
> > > bottom is remapped, and is generally used for 32-bit devices and the
> > > one at the top us setup as a bypass
> >
> > So in the normal case of decent 64-bit devices (and not in a VM),
> > they'll *already* be using the bypass region and have full access to
> > all of memory, all of the time? And you have no protection against
> > driver and firmware bugs causing stray DMA?
>
> Correct, we chose to do that for performance reasons.

Could this be mitigated using pools?  I don't know if the net code
would play along easily.

--Andy


* Re: [PATCH v1 2/2] dma-mapping-common: add DMA attribute - DMA_ATTR_IOMMU_BYPASS
  2015-10-29 18:31               ` Andy Lutomirski
@ 2015-10-29 22:35                 ` David Woodhouse
  2015-11-01  7:45                   ` Shamir Rabinovitch
  2015-11-05 21:08                   ` David Miller
  2015-10-30  1:51                 ` Benjamin Herrenschmidt
  1 sibling, 2 replies; 42+ messages in thread
From: David Woodhouse @ 2015-10-29 22:35 UTC (permalink / raw)
  To: Andy Lutomirski, Benjamin Herrenschmidt
  Cc: Christian Borntraeger, linux-arch, Paolo Bonzini,
	Shamir Rabinovitch, Martin Schwidefsky, linux-doc, Sebastian Ott,
	linux-s390, Cornelia Huck, Joerg Roedel, Jonathan Corbet, KVM,
	Arnd Bergmann, Christoph Hellwig


On Thu, 2015-10-29 at 11:31 -0700, Andy Lutomirski wrote:
> On Oct 28, 2015 6:11 PM, "Benjamin Herrenschmidt"
> <benh@kernel.crashing.org> wrote:
> > 
> > On Thu, 2015-10-29 at 09:42 +0900, David Woodhouse wrote:
> > > On Thu, 2015-10-29 at 09:32 +0900, Benjamin Herrenschmidt wrote:
> > > 
> > > > On Power, I generally have 2 IOMMU windows for a device, one at
> > > > the
> > > > bottom is remapped, and is generally used for 32-bit devices
> > > > and the
> > > > one at the top us setup as a bypass
> > > 
> > > So in the normal case of decent 64-bit devices (and not in a VM),
> > > they'll *already* be using the bypass region and have full access
> > > to
> > > all of memory, all of the time? And you have no protection
> > > against
> > > driver and firmware bugs causing stray DMA?
> > 
> > Correct, we chose to do that for performance reasons.
> 
> Could this be mitigated using pools?  I don't know if the net code
> would play along easily.

For the receive side, it shouldn't be beyond the wit of man to
introduce an API which allocates *and* DMA-maps a skb. Pass it to
netif_rx() still mapped, with a destructor that just shoves it back in
a pool for re-use.

Doing it for transmit might be a little more complex, but perhaps still
possible.

-- 
dwmw2




* Re: [PATCH v1 2/2] dma-mapping-common: add DMA attribute - DMA_ATTR_IOMMU_BYPASS
  2015-10-29 18:31               ` Andy Lutomirski
  2015-10-29 22:35                 ` David Woodhouse
@ 2015-10-30  1:51                 ` Benjamin Herrenschmidt
  1 sibling, 0 replies; 42+ messages in thread
From: Benjamin Herrenschmidt @ 2015-10-30  1:51 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Christian Borntraeger, linux-arch, Paolo Bonzini,
	Shamir Rabinovitch, David Woodhouse, Martin Schwidefsky,
	linux-doc, Sebastian Ott, linux-s390, Cornelia Huck,
	Joerg Roedel, Jonathan Corbet, KVM, Arnd Bergmann,
	Christoph Hellwig

On Thu, 2015-10-29 at 11:31 -0700, Andy Lutomirski wrote:
> On Oct 28, 2015 6:11 PM, "Benjamin Herrenschmidt"
> <benh@kernel.crashing.org> wrote:
> > 
> > On Thu, 2015-10-29 at 09:42 +0900, David Woodhouse wrote:
> > > On Thu, 2015-10-29 at 09:32 +0900, Benjamin Herrenschmidt wrote:
> > > 
> > > > On Power, I generally have 2 IOMMU windows for a device, one at
> > > > the
> > > > bottom is remapped, and is generally used for 32-bit devices
> > > > and the
> > > > one at the top us setup as a bypass
> > > 
> > > So in the normal case of decent 64-bit devices (and not in a VM),
> > > they'll *already* be using the bypass region and have full access
> > > to
> > > all of memory, all of the time? And you have no protection
> > > against
> > > driver and firmware bugs causing stray DMA?
> > 
> > Correct, we chose to do that for performance reasons.
> 
> Could this be mitigated using pools?  I don't know if the net code
> would play along easily.

Possibly, the pools we have already limit the lock contention but we
still have the map/unmap overhead which under a hypervisor can be quite
high. I'm not necessarily against changing the way we do things but it
would have to be backed up with numbers.

Cheers,
Ben.


* Re: [PATCH v1 2/2] dma-mapping-common: add DMA attribute - DMA_ATTR_IOMMU_BYPASS
  2015-10-29  1:10             ` Benjamin Herrenschmidt
  2015-10-29 18:31               ` Andy Lutomirski
@ 2015-10-30 10:32               ` Arnd Bergmann
  2015-10-30 23:17                 ` Benjamin Herrenschmidt
  2015-11-02 14:51                 ` Joerg Roedel
  1 sibling, 2 replies; 42+ messages in thread
From: Arnd Bergmann @ 2015-10-30 10:32 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: David Woodhouse, Shamir Rabinovitch, corbet, linux-doc,
	linux-arch, Andy Lutomirski, Joerg Roedel, Christian Borntraeger,
	Cornelia Huck, Sebastian Ott, Paolo Bonzini, Christoph Hellwig,
	KVM, Martin Schwidefsky, linux-s390

On Thursday 29 October 2015 10:10:46 Benjamin Herrenschmidt wrote:
> 
> > Maybe we should at least coordinate IOMMU 'paranoid/fast' modes across
> > architectures, and then the DMA_ATTR_IOMMU_BYPASS flag would have a
> > sane meaning in the paranoid mode (and perhaps we'd want an ultra
> > -paranoid mode where it's not honoured).
> 
> Possibly, though ideally that would be a user policy but of course by
> the time you get to userspace it's generally too late.

IIRC, we have an 'iommu=force' command line switch for this, to ensure
that no device can use a linear mapping and everything goes through
the page tables. This is often useful for both debugging and as a
security measure when dealing with unprivileged DMA access (virtual
machines, vfio, ...).

If we add a DMA_ATTR_IOMMU_BYPASS attribute, we should clearly document
which of the two we expect to take priority in cases where we have a
choice.

I wonder if the 'iommu=force' attribute is too coarse-grained though,
and if we should perhaps allow a per-device setting on architectures
that allow this.

	Arnd


* Re: [PATCH v1 2/2] dma-mapping-common: add DMA attribute - DMA_ATTR_IOMMU_BYPASS
  2015-10-30 10:32               ` Arnd Bergmann
@ 2015-10-30 23:17                 ` Benjamin Herrenschmidt
  2015-10-30 23:24                   ` Arnd Bergmann
  2015-11-02 14:51                 ` Joerg Roedel
  1 sibling, 1 reply; 42+ messages in thread
From: Benjamin Herrenschmidt @ 2015-10-30 23:17 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: David Woodhouse, Shamir Rabinovitch, corbet, linux-doc,
	linux-arch, Andy Lutomirski, Joerg Roedel, Christian Borntraeger,
	Cornelia Huck, Sebastian Ott, Paolo Bonzini, Christoph Hellwig,
	KVM, Martin Schwidefsky, linux-s390

On Fri, 2015-10-30 at 11:32 +0100, Arnd Bergmann wrote:
> On Thursday 29 October 2015 10:10:46 Benjamin Herrenschmidt wrote:
> > 
> > > Maybe we should at least coordinate IOMMU 'paranoid/fast' modes
> > > across
> > > architectures, and then the DMA_ATTR_IOMMU_BYPASS flag would have
> > > a
> > > sane meaning in the paranoid mode (and perhaps we'd want an ultra
> > > -paranoid mode where it's not honoured).
> > 
> > Possibly, though ideally that would be a user policy but of course
> > by
> > the time you get to userspace it's generally too late.
> 
> IIRC, we have an 'iommu=force' command line switch for this, to
> ensure
> that no device can use a linear mapping and everything goes through
> the page tables. This is often useful for both debugging and as a
> security measure when dealing with unprivileged DMA access (virtual
> machines, vfio, ...).

That was used to force-enable the iommu on platforms like G5s where we
would otherwise only do so if the memory was larger than 32-bit but we
never implemented using it to prevent the bypass region.

> If we add a DMA_ATTR_IOMMU_BYPASS attribute, we should clearly
> document which of the two we expect to take priority in cases where
> we have a choice.
>
> I wonder if the 'iommu=force' attribute is too coarse-grained though,
> and if we should perhaps allow a per-device setting on architectures
> that allow this.

The interesting thing, if we can make it work, is the bypass attribute
being per mapping... 

Ben. 


* Re: [PATCH v1 2/2] dma-mapping-common: add DMA attribute - DMA_ATTR_IOMMU_BYPASS
  2015-10-30 23:17                 ` Benjamin Herrenschmidt
@ 2015-10-30 23:24                   ` Arnd Bergmann
  0 siblings, 0 replies; 42+ messages in thread
From: Arnd Bergmann @ 2015-10-30 23:24 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: David Woodhouse, Shamir Rabinovitch, corbet, linux-doc,
	linux-arch, Andy Lutomirski, Joerg Roedel, Christian Borntraeger,
	Cornelia Huck, Sebastian Ott, Paolo Bonzini, Christoph Hellwig,
	KVM, Martin Schwidefsky, linux-s390

On Saturday 31 October 2015 10:17:22 Benjamin Herrenschmidt wrote:
> On Fri, 2015-10-30 at 11:32 +0100, Arnd Bergmann wrote:
> > On Thursday 29 October 2015 10:10:46 Benjamin Herrenschmidt wrote:
> > > 
> > > > Maybe we should at least coordinate IOMMU 'paranoid/fast' modes
> > > > across
> > > > architectures, and then the DMA_ATTR_IOMMU_BYPASS flag would have
> > > > a
> > > > sane meaning in the paranoid mode (and perhaps we'd want an ultra
> > > > -paranoid mode where it's not honoured).
> > > 
> > > Possibly, though ideally that would be a user policy but of course
> > > by
> > > the time you get to userspace it's generally too late.
> > 
> > IIRC, we have an 'iommu=force' command line switch for this, to
> > ensure
> > that no device can use a linear mapping and everything goes through
> > the page tables. This is often useful for both debugging and as a
> > security measure when dealing with unprivileged DMA access (virtual
> > machines, vfio, ...).
> 
> That was used to force-enable the iommu on platforms like G5s where we
> would otherwise only do so if the memory was larger than 32-bit but we
> never implemented using it to prevent the bypass region.

Ah, I see. Thanks for the clarification.

> > If we add a DMA_ATTR_IOMMU_BYPASS attribute, we should clearly
> > document which of the two we expect to take priority in cases
> > where we have a choice.
> >
> > I wonder if the 'iommu=force' attribute is too coarse-grained though,
> > and if we should perhaps allow a per-device setting on architectures
> > that allow this.
> 
> The interesting thing, if we can make it work, is the bypass attribute
> being per mapping... 

I would say we want both: for the device driver it can make sense to
choose per mapping what it can do, but for the iommu driver, it
can also make sense to ensure we never provide a linear mapping,
because otherwise the additional security aspect is moot.

In particular for the unprivileged VM guest or vfio access, the
code that gives access to the device to something else should
have a way to tell the IOMMU that the linear mapping can no longer
be used.

	Arnd


* Re: [PATCH v1 2/2] dma-mapping-common: add DMA attribute - DMA_ATTR_IOMMU_BYPASS
  2015-10-29 22:35                 ` David Woodhouse
@ 2015-11-01  7:45                   ` Shamir Rabinovitch
  2015-11-01 21:10                     ` Benjamin Herrenschmidt
  2015-11-05 21:08                   ` David Miller
  1 sibling, 1 reply; 42+ messages in thread
From: Shamir Rabinovitch @ 2015-11-01  7:45 UTC (permalink / raw)
  To: David Woodhouse
  Cc: Andy Lutomirski, Benjamin Herrenschmidt, Christian Borntraeger,
	linux-arch, Paolo Bonzini, Martin Schwidefsky, linux-doc,
	Sebastian Ott, linux-s390, Cornelia Huck, Joerg Roedel,
	Jonathan Corbet, KVM, Arnd Bergmann, Christoph Hellwig

On Thu, Oct 29, 2015 at 10:35:25PM +0000, David Woodhouse wrote:
> > 
> > Could this be mitigated using pools?  I don't know if the net code
> > would play along easily.
> 
> For the receive side, it shouldn't be beyond the wit of man to
> introduce an API which allocates *and* DMA-maps a skb. Pass it to
> netif_rx() still mapped, with a destructor that just shoves it back in
> a pool for re-use.
> 
> Doing it for transmit might be a little more complex, but perhaps still
> possible.

Not sure this use case is possible for Infiniband, where the application
holds the data buffers and there is no way to force the application to
reuse the buffer as suggested.

This is why I think there will be no easy way to bypass the DMA mapping cost
for all use cases unless we allow applications to request bypass/pass-through
DMA mapping (implicitly or explicitly).

> 
> -- 
> dwmw2
> 


* Re: [PATCH v1 2/2] dma-mapping-common: add DMA attribute - DMA_ATTR_IOMMU_BYPASS
  2015-11-01  7:45                   ` Shamir Rabinovitch
@ 2015-11-01 21:10                     ` Benjamin Herrenschmidt
  2015-11-02  7:23                       ` Shamir Rabinovitch
  2015-11-02 22:48                       ` David Woodhouse
  0 siblings, 2 replies; 42+ messages in thread
From: Benjamin Herrenschmidt @ 2015-11-01 21:10 UTC (permalink / raw)
  To: Shamir Rabinovitch, David Woodhouse
  Cc: Andy Lutomirski, Christian Borntraeger, linux-arch,
	Paolo Bonzini, Martin Schwidefsky, linux-doc, Sebastian Ott,
	linux-s390, Cornelia Huck, Joerg Roedel, Jonathan Corbet, KVM,
	Arnd Bergmann, Christoph Hellwig

On Sun, 2015-11-01 at 09:45 +0200, Shamir Rabinovitch wrote:
> Not sure this use case is possible for Infiniband, where the application
> holds the data buffers and there is no way to force the application to
> reuse the buffer as suggested.
> 
> This is why I think there will be no easy way to bypass the DMA mapping cost
> for all use cases unless we allow applications to request bypass/pass-through
> DMA mapping (implicitly or explicitly).

But but but ...

What I don't understand is how that brings you any safety.

Basically, either your bridge has a bypass window, or it doesn't. (Or
it has one and it's enabled or not, same thing).

If you request the bypass on a per-mapping basis, you basically have to
keep the window always enabled, unless you do some nasty refcounting of
how many people have a bypass mapping in flight, but that doesn't seem
useful.
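
The "nasty refcounting" variant could look roughly like this (a sketch with invented names, not bridge code); note the window, and hence all of memory, stays exposed for as long as any bypass mapping at all is in flight:

```c
#include <assert.h>
#include <stdbool.h>

static int bypass_refcount;   /* in-flight bypass mappings */
static bool window_enabled;   /* models the bridge bypass window */

static void bypass_map(void)
{
	if (bypass_refcount++ == 0)
		window_enabled = true;   /* whole memory now device-visible */
}

static void bypass_unmap(void)
{
	assert(bypass_refcount > 0);
	if (--bypass_refcount == 0)
		window_enabled = false;  /* protection restored only now */
}
```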

Thus you have already lost all protection from the device, since your
entire memory is accessible via the bypass mapping.

Which means what is the point of then having non-bypass mappings for
other things ? Just to make things slower ?

I really don't see what this whole "bypass on a per-mapping basis" buys
you.

Note that we implicitly already do that on powerpc, but not for those
reasons, we do it based on the DMA mask, so that if your coherent mask
is 32-bit but your dma mask is 64-bit (which is not an uncommon
combination), we service the "coherent" requests (basically the long
lifed dma allocs) from the remapped region and the "stream" requests
(ie, map_page/map_sg) from the bypass.
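
A rough userspace model of that powerpc behaviour, with invented window constants: requests limited by a 32-bit coherent mask are served from the remapped region, while 64-bit streaming maps go through the bypass.

```c
#include <assert.h>
#include <stdint.h>

#define REMAP_BASE   0x0ULL                 /* remapped (iommu) window */
#define BYPASS_BASE  0x800000000000000ULL   /* high bypass window */

struct dev_masks {
	uint64_t coherent_mask;  /* e.g. 32-bit for coherent allocs */
	uint64_t dma_mask;       /* e.g. 64-bit for streaming maps */
};

enum req { COHERENT_ALLOC, STREAM_MAP };

/* Pick the window the way the powerpc backend is described to. */
static uint64_t dma_base_for(const struct dev_masks *m, enum req r)
{
	if (r == COHERENT_ALLOC && m->coherent_mask <= 0xffffffffULL)
		return REMAP_BASE;   /* long-lived allocs: remapped region */
	if (m->dma_mask == ~0ULL)
		return BYPASS_BASE;  /* streaming maps: fast bypass */
	return REMAP_BASE;           /* fall back to remapping */
}
```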

Cheers,
Ben.


* Re: [PATCH v1 2/2] dma-mapping-common: add DMA attribute - DMA_ATTR_IOMMU_BYPASS
  2015-11-01 21:10                     ` Benjamin Herrenschmidt
@ 2015-11-02  7:23                       ` Shamir Rabinovitch
  2015-11-02 10:00                         ` Benjamin Herrenschmidt
  2015-11-02 22:48                       ` David Woodhouse
  1 sibling, 1 reply; 42+ messages in thread
From: Shamir Rabinovitch @ 2015-11-02  7:23 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: David Woodhouse, Andy Lutomirski, Christian Borntraeger,
	linux-arch, Paolo Bonzini, Martin Schwidefsky, linux-doc,
	Sebastian Ott, linux-s390, Cornelia Huck, Joerg Roedel,
	Jonathan Corbet, KVM, Arnd Bergmann, Christoph Hellwig

On Mon, Nov 02, 2015 at 08:10:49AM +1100, Benjamin Herrenschmidt wrote:
> But but but ...
> 
> What I don't understand is how that brings you any safety.

Limited safety, maybe? If some device DMA mappings go via the IOMMU,
and they fall in an address range that is far from the bypass /
pass-through range, then small stray accesses might still be caught
for the mappings that do not bypass / pass through the IOMMU - right?

A device can of course use the bypass address and just reach the memory
without IOMMU protection. See some comments about that below.

> 
> Basically, either your bridge has a bypass window, or it doesn't. (Or
> it has one and it's enabled or not, same thing).

Agree.

> 
> If you request the bypass on a per-mapping basis, you basically have to
> keep the window always enabled, unless you do some nasty refcounting of
> how many people have a bypass mapping in flight, but that doesn't seem
> useful.
> 
> Thus you have already lost all protection from the device, since your
> entire memory is accessible via the bypass mapping.

Correct, see my above comment. 

> 
> Which means what is the point of then having non-bypass mappings for
> other things ? Just to make things slower ?
> 
> I really don't see what this whole "bypass on a per-mapping basis" buys
> you.
> 
> Note that we implicitly already do that on powerpc, but not for those
> reasons, we do it based on the DMA mask, so that if your coherent mask
> is 32-bit but your dma mask is 64-bit (which is not an uncommon
> combination), we service the "coherent" requests (basically the long
> lifed dma allocs) from the remapped region and the "stream" requests
> (ie, map_page/map_sg) from the bypass.



To summarize -

1. The whole point of the IOMMU pass-through was to get a bigger address space
	and faster map/unmap operations for performance-critical hardware
2. The SPARC IOMMU in particular supports DVMA, which addresses all the
	protection concerns raised above. Not sure what the performance
	impact will be, though. This still needs a lot of work before we can test it.
3. On x86 we use the IOMMU in pass-through mode, so all the above concerns are valid

The questions are -

1. Does partial use of the IOMMU while the pass-through window is enabled add some
	protection?
2. Do we prefer the x86 way of doing this, which is to enable / disable IOMMU
	translations at the kernel level?

I think that I can live with option (2) till I have DVMA, if there is strong
disagreement on the need for per-allocation IOMMU bypass.


> 
> Cheers,
> Ben.
> 


* Re: [PATCH v1 2/2] dma-mapping-common: add DMA attribute - DMA_ATTR_IOMMU_BYPASS
  2015-11-02  7:23                       ` Shamir Rabinovitch
@ 2015-11-02 10:00                         ` Benjamin Herrenschmidt
  2015-11-02 12:07                           ` Shamir Rabinovitch
  0 siblings, 1 reply; 42+ messages in thread
From: Benjamin Herrenschmidt @ 2015-11-02 10:00 UTC (permalink / raw)
  To: Shamir Rabinovitch
  Cc: David Woodhouse, Andy Lutomirski, Christian Borntraeger,
	linux-arch, Paolo Bonzini, Martin Schwidefsky, linux-doc,
	Sebastian Ott, linux-s390, Cornelia Huck, Joerg Roedel,
	Jonathan Corbet, KVM, Arnd Bergmann, Christoph Hellwig

On Mon, 2015-11-02 at 09:23 +0200, Shamir Rabinovitch wrote:
> To summarize -
> 
> 1. The whole point of the IOMMU pass-through was to get a bigger address space
>         and faster map/unmap operations for performance-critical hardware
> 2. The SPARC IOMMU in particular supports DVMA, which addresses all the
>         protection concerns raised above. Not sure what the performance
>         impact will be, though. This still needs a lot of work before we can test it.
> 3. On x86 we use the IOMMU in pass-through mode, so all the above concerns are valid
> 
> The questions are -
> 
> 1. Does partial use of the IOMMU while the pass-through window is enabled add some
>         protection?
> 2. Do we prefer the x86 way of doing this, which is to enable / disable IOMMU
>         translations at the kernel level?
> 
> I think that I can live with option (2) till I have DVMA, if there is strong
> disagreement on the need for per-allocation IOMMU bypass.

Choosing on a per-mapping basis *in the back end* might still make some
amount of sense. What I don't completely grasp is what it gives
you to expose that choice to the *driver* all the way up the chain. Why
does the driver know better whether something should use the bypass or
not ?

I can imagine some in-between setups, for example, on POWER (and
probably x86), I could setup a window that is TCE-mapped (TCEs are our
iommu PTEs) but used to create a 1:1 mapping. IE. A given TCE always
map to the same physical page. I could then use map/unmap to adjust the
protection, the idea being that only "relaxing" the protection requires
flushing the IO TLB, ie, we could delay most flushes.

But that sort of optimization only makes sense in the back-end.
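
That in-between setup could be modeled like this (toy code, hypothetical names): the window stays 1:1-mapped permanently, map/unmap only toggle per-page permissions, and only permission-relaxing operations pay for an IO TLB flush.

```c
#include <assert.h>
#include <stdbool.h>

#define NPAGES 8

static bool allowed[NPAGES];  /* per-page permission in the 1:1 window */
static int tlb_flushes;       /* count of expensive IO TLB flushes */

/* "Unmap": tighten protection; the flush can be deferred. */
static void tce_revoke(int page)
{
	allowed[page] = false;    /* a stale IO TLB entry is tolerated */
}

/* "Map": relax protection; the device must observe it, so flush now. */
static void tce_grant(int page)
{
	allowed[page] = true;
	tlb_flushes++;            /* only relaxing pays the flush cost */
}
```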

So what was your original idea where you thought the driver was the
right one to decide whether to use the bypass or not for a given map
operation ? That's what I don't grasp... you might have a valid case
that I just fail to see.

Cheers,
Ben.


* Re: [PATCH v1 2/2] dma-mapping-common: add DMA attribute - DMA_ATTR_IOMMU_BYPASS
  2015-11-02 10:00                         ` Benjamin Herrenschmidt
@ 2015-11-02 12:07                           ` Shamir Rabinovitch
  2015-11-02 20:13                             ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 42+ messages in thread
From: Shamir Rabinovitch @ 2015-11-02 12:07 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: David Woodhouse, Andy Lutomirski, Christian Borntraeger,
	linux-arch, Paolo Bonzini, Martin Schwidefsky, linux-doc,
	Sebastian Ott, linux-s390, Cornelia Huck, Joerg Roedel,
	Jonathan Corbet, KVM, Arnd Bergmann, Christoph Hellwig

On Mon, Nov 02, 2015 at 09:00:34PM +1100, Benjamin Herrenschmidt wrote:
> 
> Choosing on a per-mapping basis *in the back end* might still make some

In my case, choosing the mapping based on the hardware that will use these
mappings makes more sense. Most hardware is not as performance sensitive
as Infiniband hardware.

> amount of sense. What I don't completely grasp is what it gives
> you to expose that choice to the *driver* all the way up the chain. Why
> does the driver know better whether something should use the bypass or
> not ?

The driver knows what hardware it is mapping the memory for, so it knows
whether the memory will be used by performance-sensitive hardware or not.

> 
> I can imagine some in-between setups, for example, on POWER (and
> probably x86), I could setup a window that is TCE-mapped (TCEs are our
> iommu PTEs) but used to create a 1:1 mapping. IE. A given TCE always
> map to the same physical page. I could then use map/unmap to adjust the
> protection, the idea being that only "relaxing" the protection requires
> flushing the IO TLB, ie, we could delay most flushes.

In your case, what will give the better performance - 1:1 mapping or IOMMU
mapping? When you say 'relaxing the protection' are you referring to the
1:1 mapping?

Also, how does this 1:1 window address the security concerns raised by
others here?

> 
> But that sort of optimization only makes sense in the back-end.
> 
> So what was your original idea where you thought the driver was the
> right one to decide whether to use the bypass or not for a given map
> operation ? That's what I don't grasp... you might have a valid case
> that I just fail to see.

Please see above.

> 
> Cheers,
> Ben.
> 

Given that IOMMU bypass on a per-allocation basis raises some
concerns, the only path for SPARC is this:

1. Support 'iommu=pt' as on x86, for full IOMMU pass-through, as an intermediate
	step. Systems that use Infiniband will be set to pass-through.

2. Add support for DVMA, which allows less contention on the IOMMU resources
	while doing the map/unmap, a bigger address range and full protection.
	Still, it is not clear what the performance cost of using
	DVMA will be.


* Re: [PATCH v1 2/2] dma-mapping-common: add DMA attribute - DMA_ATTR_IOMMU_BYPASS
  2015-10-29  7:32             ` Shamir Rabinovitch
@ 2015-11-02 14:44               ` Joerg Roedel
  2015-11-02 17:32                 ` Shamir Rabinovitch
  0 siblings, 1 reply; 42+ messages in thread
From: Joerg Roedel @ 2015-11-02 14:44 UTC (permalink / raw)
  To: Shamir Rabinovitch
  Cc: David Woodhouse, Benjamin Herrenschmidt, arnd, corbet, linux-doc,
	linux-arch, Andy Lutomirski, Christian Borntraeger,
	Cornelia Huck, Sebastian Ott, Paolo Bonzini, Christoph Hellwig,
	KVM, Martin Schwidefsky, linux-s390

On Thu, Oct 29, 2015 at 09:32:32AM +0200, Shamir Rabinovitch wrote:
> For the IB case, setting up and tearing down DMA mappings for the driver's data buffers
> is expensive. But we could for example consider mapping all the HW control objects
> that validate the HW access to the driver's data buffers as IOMMU-protected, and so
> have IOMMU protection on those critical objects while having fast set-up/tear-down
> of driver data buffers. The HW control objects have stable mappings for the lifetime
> of the system. So the case of using both types of DMA mappings is still valid.

How do you envision this per-mapping by-pass to work? For the DMA-API
mappings you have only one device address space. For by-pass to work,
you need to map all physical memory of the machine into this space,
which leaves the question where you want to put the window for
remapping.

You surely have to put it after the physical mappings, but any
protection will be gone, as the device can access all physical memory.
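
The layout problem can be sketched with invented sizes: with all physical memory mapped 1:1 at the bottom of the single device address space, the remapping window has to sit above it, and every physical page stays reachable through the low half no matter what the window does.

```c
#include <assert.h>
#include <stdint.h>

#define MEM_SIZE     (1ULL << 34)  /* 16 GiB of RAM, mapped 1:1 */
#define REMAP_BASE   MEM_SIZE      /* remap window starts above it */

/* Translate a device address under this single-address-space layout. */
static uint64_t devaddr_to_phys(uint64_t iova, uint64_t remap_target)
{
	if (iova < REMAP_BASE)
		return iova;          /* bypass half: direct physical access */
	return remap_target + (iova - REMAP_BASE); /* via page tables */
}

/* Any physical address is reachable through the bypass half,
 * so the remap window adds no protection for this device. */
static int phys_reachable(uint64_t phys)
{
	return phys < MEM_SIZE;       /* always true for real memory */
}
```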

So instead of working around the shortcomings of DMA-API
implementations, can you present us some numbers and analysis of how bad
the performance impact with an IOMMU is and what causes it?

I know that we have lock-contention issues in our IOMMU drivers, which
can be fixed. Maybe the performance problems you are seeing can be fixed
too, when you give us more details about them.


	Joerg


* Re: [PATCH v1 2/2] dma-mapping-common: add DMA attribute - DMA_ATTR_IOMMU_BYPASS
  2015-10-30 10:32               ` Arnd Bergmann
  2015-10-30 23:17                 ` Benjamin Herrenschmidt
@ 2015-11-02 14:51                 ` Joerg Roedel
  1 sibling, 0 replies; 42+ messages in thread
From: Joerg Roedel @ 2015-11-02 14:51 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Benjamin Herrenschmidt, David Woodhouse, Shamir Rabinovitch,
	corbet, linux-doc, linux-arch, Andy Lutomirski,
	Christian Borntraeger, Cornelia Huck, Sebastian Ott,
	Paolo Bonzini, Christoph Hellwig, KVM, Martin Schwidefsky,
	linux-s390

On Fri, Oct 30, 2015 at 11:32:06AM +0100, Arnd Bergmann wrote:
> I wonder if the 'iommu=force' attribute is too coarse-grained though,
> and if we should perhaps allow a per-device setting on architectures
> that allow this.

Yeah, definitely. Currently we only have iommu=pt to enable pass-through
mode for _all_ devices. I think it makes sense to introduce a per-device
opt-in for pass-through, but have it configured by the user and not by
the device driver.

If the user enables the IOMMU in his system, he expects to be secure
against DMA attacks. If drivers could opt-out, every protection would be
voided.


	Joerg


* Re: [PATCH v1 2/2] dma-mapping-common: add DMA attribute - DMA_ATTR_IOMMU_BYPASS
  2015-11-02 14:44               ` Joerg Roedel
@ 2015-11-02 17:32                 ` Shamir Rabinovitch
  2015-11-05 13:42                   ` Joerg Roedel
  0 siblings, 1 reply; 42+ messages in thread
From: Shamir Rabinovitch @ 2015-11-02 17:32 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: David Woodhouse, Benjamin Herrenschmidt, arnd, corbet, linux-doc,
	linux-arch, Andy Lutomirski, Christian Borntraeger,
	Cornelia Huck, Sebastian Ott, Paolo Bonzini, Christoph Hellwig,
	KVM, Martin Schwidefsky, linux-s390

On Mon, Nov 02, 2015 at 03:44:27PM +0100, Joerg Roedel wrote:
> 
> How do you envision this per-mapping by-pass to work? For the DMA-API
> mappings you have only one device address space. For by-pass to work,
> you need to map all physical memory of the machine into this space,
> which leaves the question where you want to put the window for
> remapping.
> 
> You surely have to put it after the physical mappings, but any
> protection will be gone, as the device can access all physical memory.

Correct. This issue is one of the concerns here in the previous replies.
I will take different approach which will not require the IOMMU bypass
per mapping. Will try to shift to the x86 'iommu=pt' approach.

> 
> So instead of working around the shortcomings of DMA-API
> implementations, can you present us some numbers and analysis of how bad
> the performance impact with an IOMMU is and what causes it?

We had a bunch of issues around the SPARC IOMMU. Not all of them relate to
performance. The first issue was that on SPARC we currently only have a
limited IOMMU address space, so we had trouble doing large DMA mappings
for Infiniband. The second issue was that we identified high contention on
the IOMMU locks even in the ETH driver.

> 
> I know that we have lock-contention issues in our IOMMU drivers, which
> can be fixed. Maybe the performance problems you are seeing can be fixed
> too, when you give us more details about them.
> 

I do not want to put too much information here but you can see some results:

rds-stress test from sparc t5-2 -> x86:

with iommu bypass:
---------------------
sparc->x86 cmdline = -r XXX -s XXX -q 256 -a 8192 -T 10 -d 10 -t 3 -o XXX
tsks   tx/s   rx/s  tx+rx K/s    mbi K/s    mbo K/s tx us/c   rtt us cpu %
   3 141278      0 1165565.81       0.00       0.00    8.93   376.60 -1.00  (average)

without iommu bypass:
---------------------
sparc->x86 cmdline = -r XXX -s XXX -q 256 -a 8192 -T 10 -d 10 -t 3 -o XXX
tsks   tx/s   rx/s  tx+rx K/s    mbi K/s    mbo K/s tx us/c   rtt us cpu %
   3  78558      0  648101.41       0.00       0.00   15.05   876.72 -1.00  (average)

+ RDMA tests are totally not working (might be due to a failure to DMA-map all the memory).

So IOMMU bypass gives an ~80% performance boost.

> 
> 	Joerg


* Re: [PATCH v1 2/2] dma-mapping-common: add DMA attribute - DMA_ATTR_IOMMU_BYPASS
  2015-11-02 12:07                           ` Shamir Rabinovitch
@ 2015-11-02 20:13                             ` Benjamin Herrenschmidt
  2015-11-02 21:45                               ` Arnd Bergmann
  2015-11-02 21:49                               ` Shamir Rabinovitch
  0 siblings, 2 replies; 42+ messages in thread
From: Benjamin Herrenschmidt @ 2015-11-02 20:13 UTC (permalink / raw)
  To: Shamir Rabinovitch
  Cc: David Woodhouse, Andy Lutomirski, Christian Borntraeger,
	linux-arch, Paolo Bonzini, Martin Schwidefsky, linux-doc,
	Sebastian Ott, linux-s390, Cornelia Huck, Joerg Roedel,
	Jonathan Corbet, KVM, Arnd Bergmann, Christoph Hellwig

On Mon, 2015-11-02 at 14:07 +0200, Shamir Rabinovitch wrote:
> On Mon, Nov 02, 2015 at 09:00:34PM +1100, Benjamin Herrenschmidt
> wrote:
> > 
> > Choosing on a per-mapping basis *in the back end* might still make
> > some
> 
> In my case, choosing the mapping based on the hardware that will use these
> mappings makes more sense. Most hardware is not as performance
> sensitive as Infiniband hardware.

 ...

> The driver knows what hardware it is mapping the memory for, so it knows
> whether the memory will be used by performance-sensitive hardware or not.

Then I would argue for naming this differently. Make it an optional
hint "DMA_ATTR_HIGH_PERF" or something like that. Whether this is
achieved via a bypass or other means in the backend is not the
business of the driver.

> In your case, what will give the better performance - 1:1 mapping or
> IOMMU mapping? When you say 'relaxing the protection' are you referring
> to the 1:1 mapping?

> Also, how does this 1:1 window address the security concerns raised by
> others here?

It will, but only partially - it's just an example of another way the
backend could provide some improved performance without a bypass.

Cheers,
Ben.


* Re: [PATCH v1 2/2] dma-mapping-common: add DMA attribute - DMA_ATTR_IOMMU_BYPASS
  2015-11-02 20:13                             ` Benjamin Herrenschmidt
@ 2015-11-02 21:45                               ` Arnd Bergmann
  2015-11-02 23:08                                 ` Benjamin Herrenschmidt
  2015-11-02 21:49                               ` Shamir Rabinovitch
  1 sibling, 1 reply; 42+ messages in thread
From: Arnd Bergmann @ 2015-11-02 21:45 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Shamir Rabinovitch, David Woodhouse, Andy Lutomirski,
	Christian Borntraeger, linux-arch, Paolo Bonzini,
	Martin Schwidefsky, linux-doc, Sebastian Ott, linux-s390,
	Cornelia Huck, Joerg Roedel, Jonathan Corbet, KVM,
	Christoph Hellwig

On Tuesday 03 November 2015 07:13:28 Benjamin Herrenschmidt wrote:
> On Mon, 2015-11-02 at 14:07 +0200, Shamir Rabinovitch wrote:
> > On Mon, Nov 02, 2015 at 09:00:34PM +1100, Benjamin Herrenschmidt
> > wrote:
> > > 
> > > Choosing on a per-mapping basis *in the back end* might still make
> > > some
> > 
> > In my case, choosing the mapping based on the hardware that will use these
> > mappings makes more sense. Most hardware is not as performance
> > sensitive as Infiniband hardware.
> 
>  ...
> 
> > The driver knows what hardware it is mapping the memory for, so it knows
> > whether the memory will be used by performance-sensitive hardware or not.
> 
> Then I would argue for naming this differently. Make it an optional
> hint "DMA_ATTR_HIGH_PERF" or something like that. Whether this is
> achieved via a bypass or other means in the backend is not the
> business of the driver.
> 

With a name like that, who wouldn't pass that flag? ;-)

	Arnd


* Re: [PATCH v1 2/2] dma-mapping-common: add DMA attribute - DMA_ATTR_IOMMU_BYPASS
  2015-11-02 20:13                             ` Benjamin Herrenschmidt
  2015-11-02 21:45                               ` Arnd Bergmann
@ 2015-11-02 21:49                               ` Shamir Rabinovitch
  1 sibling, 0 replies; 42+ messages in thread
From: Shamir Rabinovitch @ 2015-11-02 21:49 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: David Woodhouse, Andy Lutomirski, Christian Borntraeger,
	linux-arch, Paolo Bonzini, Martin Schwidefsky, linux-doc,
	Sebastian Ott, linux-s390, Cornelia Huck, Joerg Roedel,
	Jonathan Corbet, KVM, Arnd Bergmann, Christoph Hellwig

On Tue, Nov 03, 2015 at 07:13:28AM +1100, Benjamin Herrenschmidt wrote:
> 
> Then I would argue for naming this differently. Make it an optional
> hint "DMA_ATTR_HIGH_PERF" or something like that. Whether this is
> achieved via a bypass or other means in the backend is not the
> business of the driver.
> 

Correct. This comment also came up in internal review :-) Although
currently there is strong opposition to such a new attribute.

> 
> It will, but only partially - it's just an example of another way the
> backend could provide some improved performance without a bypass.

Just curious.. 

In the Infiniband case, where a user application requests the driver to
DMA-map some random pages it allocated - will your suggestion still
work?

I mean, can you use this limited 1:1 mapping window to map any page the
user application chooses? How?

With bypass we just use the physical address of the page (almost as is).
We use the fact that the bypass address space covers the whole memory, so we
have a mapping for any address.

In the IOMMU case we use the IOMMU translation tables and so can map a 64-bit
address to some 32-bit address (or less).

In your case it is not clear to me what we can do with such a limited 1:1
mapping.

> 
> Cheers,
> Ben.
> 


* Re: [PATCH v1 2/2] dma-mapping-common: add DMA attribute - DMA_ATTR_IOMMU_BYPASS
  2015-11-01 21:10                     ` Benjamin Herrenschmidt
  2015-11-02  7:23                       ` Shamir Rabinovitch
@ 2015-11-02 22:48                       ` David Woodhouse
  2015-11-02 23:10                         ` Benjamin Herrenschmidt
  1 sibling, 1 reply; 42+ messages in thread
From: David Woodhouse @ 2015-11-02 22:48 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, Shamir Rabinovitch
  Cc: Andy Lutomirski, Christian Borntraeger, linux-arch,
	Paolo Bonzini, Martin Schwidefsky, linux-doc, Sebastian Ott,
	linux-s390, Cornelia Huck, Joerg Roedel, Jonathan Corbet, KVM,
	Arnd Bergmann, Christoph Hellwig

[-- Attachment #1: Type: text/plain, Size: 2113 bytes --]

On Mon, 2015-11-02 at 08:10 +1100, Benjamin Herrenschmidt wrote:
> On Sun, 2015-11-01 at 09:45 +0200, Shamir Rabinovitch wrote:
> > Not sure this use case is possible for Infiniband, where the application
> > holds the data buffers and there is no way to force the application to
> > reuse a buffer as suggested.
> > 
> > This is why I think there will be no easy way to bypass the DMA mapping
> > cost for all use cases unless we allow applications to request
> > bypass/pass-through DMA mapping (implicitly or explicitly).
> 
> But but but ...
> 
> What I don't understand is how that brings you any safety.
> 
> Basically, either your bridge has a bypass window, or it doesn't. (Or
> it has one and it's enabled or not, same thing).
> 
> If you request the bypass on a per-mapping basis, you basically have to
> keep the window always enabled, unless you do some nasty refcounting of
> how many people have a bypass mapping in flight, but that doesn't seem
> useful.
> 
> Thus you have already lost all protection from the device, since your
> entire memory is accessible via the bypass mapping.
> 
> Which means what is the point of then having non-bypass mappings for
> other things ? Just to make things slower ?
> 
> I really don't see what this whole "bypass on a per-mapping basis" buys
> you.

In the Intel case, the mapping setup is entirely per-device (except for
crap devices and devices behind a PCIe-PCI bridge, etc.).

So you can happily have a passthrough mapping for *one* device, without
making that same mapping available to another device. You can make that
trade-off of speed vs. protection for each device in the system.

Currently we have the 'iommu=pt' option that makes us use passthrough
for *all* devices (except those whose dma_mask can't reach all of
physical memory). But I could live with something like Shamir is
proposing, which lets us do the bypass only for performance-sensitive
devices which actually *ask* for it (and of course we'd have a system-
wide mode which declines that request and does normal mappings anyway).
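The policy sketched here - a driver may *ask* for bypass, but a system-wide mode may decline and fall back to a normal translated mapping - could look roughly like the following. The attribute name, constants and trivial fallback allocator are toy stand-ins, not the real dma-mapping code:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy model: bypass is a request, not a guarantee.  A system-wide
 * switch (e.g. set from a boot option) can decline it. */

#define ATTR_IOMMU_BYPASS  (1u << 0)

#define BYPASS_BASE  0x100000000ULL   /* invented identity window  */
#define IOVA_BASE    0x80000000ULL    /* invented translated window */

static bool allow_bypass = true;      /* system-wide policy knob    */
static uint64_t next_iova = IOVA_BASE;

uint64_t toy_map_page(uint64_t phys, unsigned int attrs)
{
	if ((attrs & ATTR_IOMMU_BYPASS) && allow_bypass)
		return BYPASS_BASE + phys;  /* fast path, no table setup */

	/* Normal path: allocate an IOVA and (conceptually) program a PTE. */
	uint64_t iova = next_iova;
	next_iova += 4096;
	return iova;
}
```

Clearing `allow_bypass` models the proposed "system-wide mode which declines that request": drivers keep passing the attribute, but every mapping silently goes through the IOMMU.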

-- 
dwmw2


[-- Attachment #2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 5691 bytes --]

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v1 2/2] dma-mapping-common: add DMA attribute - DMA_ATTR_IOMMU_BYPASS
  2015-11-02 21:45                               ` Arnd Bergmann
@ 2015-11-02 23:08                                 ` Benjamin Herrenschmidt
  2015-11-03 13:11                                   ` Christoph Hellwig
  0 siblings, 1 reply; 42+ messages in thread
From: Benjamin Herrenschmidt @ 2015-11-02 23:08 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Shamir Rabinovitch, David Woodhouse, Andy Lutomirski,
	Christian Borntraeger, linux-arch, Paolo Bonzini,
	Martin Schwidefsky, linux-doc, Sebastian Ott, linux-s390,
	Cornelia Huck, Joerg Roedel, Jonathan Corbet, KVM,
	Christoph Hellwig

On Mon, 2015-11-02 at 22:45 +0100, Arnd Bergmann wrote:
> > Then I would argue for naming this differently. Make it an optional
> > hint "DMA_ATTR_HIGH_PERF" or something like that. Whether this is
> > achieved via a bypass or by other means in the backend is not the
> > business of the driver.
> > 
> 
> With a name like that, who wouldn't pass that flag? ;-)

xHCI for example, vs. something like 10G ethernet... but yes I agree it
sucks. I don't like that sort of policy anywhere in drivers. On the
other hand the platform doesn't have much information to make that sort
of decision either.

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v1 2/2] dma-mapping-common: add DMA attribute - DMA_ATTR_IOMMU_BYPASS
  2015-11-02 22:48                       ` David Woodhouse
@ 2015-11-02 23:10                         ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 42+ messages in thread
From: Benjamin Herrenschmidt @ 2015-11-02 23:10 UTC (permalink / raw)
  To: David Woodhouse, Shamir Rabinovitch
  Cc: Andy Lutomirski, Christian Borntraeger, linux-arch,
	Paolo Bonzini, Martin Schwidefsky, linux-doc, Sebastian Ott,
	linux-s390, Cornelia Huck, Joerg Roedel, Jonathan Corbet, KVM,
	Arnd Bergmann, Christoph Hellwig

On Mon, 2015-11-02 at 22:48 +0000, David Woodhouse wrote:
> 
> In the Intel case, the mapping setup is entirely per-device (except for
> crap devices and devices behind a PCIe-PCI bridge, etc.).
> 
> So you can happily have a passthrough mapping for *one* device, without
> making that same mapping available to another device. You can make that
> trade-off of speed vs. protection for each device in the system.

Same for me. I should have written "in the context of a device ...";
the problem remains that it doesn't buy you much to do it *per-mapping*,
which is what this API seems to be about.

> Currently we have the 'iommu=pt' option that makes us use passthrough
> for *all* devices (except those whose dma_mask can't reach all of
> physical memory). But I could live with something like Shamir is
> proposing, which lets us do the bypass only for performance-sensitive
> devices which actually *ask* for it (and of course we'd have a system-
> wide mode which declines that request and does normal mappings anyway).
> 

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v1 2/2] dma-mapping-common: add DMA attribute - DMA_ATTR_IOMMU_BYPASS
  2015-11-02 23:08                                 ` Benjamin Herrenschmidt
@ 2015-11-03 13:11                                   ` Christoph Hellwig
  2015-11-03 19:35                                     ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 42+ messages in thread
From: Christoph Hellwig @ 2015-11-03 13:11 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Arnd Bergmann, Shamir Rabinovitch, David Woodhouse,
	Andy Lutomirski, Christian Borntraeger, linux-arch,
	Paolo Bonzini, Martin Schwidefsky, linux-doc, Sebastian Ott,
	linux-s390, Cornelia Huck, Joerg Roedel, Jonathan Corbet, KVM,
	Christoph Hellwig

On Tue, Nov 03, 2015 at 10:08:13AM +1100, Benjamin Herrenschmidt wrote:
> On Mon, 2015-11-02 at 22:45 +0100, Arnd Bergmann wrote:
> > > Then I would argue for naming this differently. Make it an optional
> > > hint "DMA_ATTR_HIGH_PERF" or something like that. Whether this is
> > > achieved via a bypass or by other means in the backend is not the
> > > business of the driver.
> > > 
> > 
> > With a name like that, who wouldn't pass that flag? ;-)
> 
> xHCI for example, vs. something like 10G ethernet... but yes I agree it
> sucks. I don't like that sort of policy anywhere in drivers. On the
> other hand the platform doesn't have much information to make that sort
> of decision either.

Maybe because it should simply use what's optimal?  E.g. passthrough
whenever possible, where the arguments against "possible" are: dma_mask, vfio
requirements, kernel command line option.  This is what a lot of
architectures already do; I remember the SGI Origin / Altix code had the
same behavior as well.  Those IOMMUs already had the 64-bit passthrough
and 32-bit sliding window in addition to the real IOMMU 10 years ago.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v1 2/2] dma-mapping-common: add DMA attribute - DMA_ATTR_IOMMU_BYPASS
  2015-11-03 13:11                                   ` Christoph Hellwig
@ 2015-11-03 19:35                                     ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 42+ messages in thread
From: Benjamin Herrenschmidt @ 2015-11-03 19:35 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Arnd Bergmann, Shamir Rabinovitch, David Woodhouse,
	Andy Lutomirski, Christian Borntraeger, linux-arch,
	Paolo Bonzini, Martin Schwidefsky, linux-doc, Sebastian Ott,
	linux-s390, Cornelia Huck, Joerg Roedel, Jonathan Corbet, KVM

On Tue, 2015-11-03 at 14:11 +0100, Christoph Hellwig wrote:
> > xHCI for example, vs. something like 10G ethernet... but yes I agree it
> > sucks. I don't like that sort of policy anywhere in drivers. On the
> > other hand the platform doesn't have much information to make that sort
> > of decision either.
> 
> Maybe because it should simply use what's optimal?  E.g. passthrough
> whenever possible, where the arguments against "possible" are: dma_mask, vfio
> requirements, kernel command line option. 

Right this is what I do today on powerpc with the exception of
the command line option.

>  This is what a lot of
> architectures already do, I remember the SGI Origin / Altix code has the
> same behavior as well.  Those IOMMUs already had the 64 bit passthrough
> and 32-bit sliding window in addition to the real IOMMU 10 years ago.
> --

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v1 2/2] dma-mapping-common: add DMA attribute - DMA_ATTR_IOMMU_BYPASS
  2015-11-02 17:32                 ` Shamir Rabinovitch
@ 2015-11-05 13:42                   ` Joerg Roedel
  2015-11-05 21:11                     ` David Miller
  0 siblings, 1 reply; 42+ messages in thread
From: Joerg Roedel @ 2015-11-05 13:42 UTC (permalink / raw)
  To: Shamir Rabinovitch
  Cc: David Woodhouse, Benjamin Herrenschmidt, arnd, corbet, linux-doc,
	linux-arch, Andy Lutomirski, Christian Borntraeger,
	Cornelia Huck, Sebastian Ott, Paolo Bonzini, Christoph Hellwig,
	KVM, Martin Schwidefsky, linux-s390

On Mon, Nov 02, 2015 at 07:32:19PM +0200, Shamir Rabinovitch wrote:
> Correct. This issue is one of the concerns raised in the previous replies.
> I will take a different approach which will not require IOMMU bypass
> per mapping, and will try to shift to the x86 'iommu=pt' approach.

Yeah, it doesn't really make sense to have an extra remappable area when
the device can access all physical memory anyway.

> We had a bunch of issues around the SPARC IOMMU, not all of them related to
> performance. The first issue was that on SPARC, currently, we only have a
> limited IOMMU address space, so we had trouble doing the large DMA mappings
> Infiniband needs. The second issue was that we identified high contention on
> the IOMMU locks even in the ETH driver.

Contended IOMMU locks are not only a problem on SPARC, but on x86 and
various other IOMMU drivers too. But I have some ideas on how to improve
the situation there.

> I do not want to put too much information here but you can see some results:
> 
> rds-stress test from sparc t5-2 -> x86:
> 
> with iommu bypass:
> ---------------------
> sparc->x86 cmdline = -r XXX -s XXX -q 256 -a 8192 -T 10 -d 10 -t 3 -o XXX
> tsks   tx/s   rx/s  tx+rx K/s    mbi K/s    mbo K/s tx us/c   rtt us cpu %
>    3 141278      0 1165565.81       0.00       0.00    8.93   376.60 -1.00  (average)
> 
> without iommu bypass:
> ---------------------
> sparc->x86 cmdline = -r XXX -s XXX -q 256 -a 8192 -T 10 -d 10 -t 3 -o XXX
> tsks   tx/s   rx/s  tx+rx K/s    mbi K/s    mbo K/s tx us/c   rtt us cpu %
>    3  78558      0  648101.41       0.00       0.00   15.05   876.72 -1.00  (average)
> 
> + RDMA tests are totally not working (might be due to failure to DMA map all the memory).
> 
> So IOMMU bypass gives an ~80% performance boost.

Interesting. Have you looked more closely on what causes the performance
degradation? Is it the lock contention or something else?


	Joerg

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v1 2/2] dma-mapping-common: add DMA attribute - DMA_ATTR_IOMMU_BYPASS
  2015-10-29 22:35                 ` David Woodhouse
  2015-11-01  7:45                   ` Shamir Rabinovitch
@ 2015-11-05 21:08                   ` David Miller
  1 sibling, 0 replies; 42+ messages in thread
From: David Miller @ 2015-11-05 21:08 UTC (permalink / raw)
  To: dwmw2
  Cc: luto, benh, borntraeger, linux-arch, pbonzini,
	shamir.rabinovitch, schwidefsky, linux-doc, sebott, linux-s390,
	cornelia.huck, jroedel, corbet, kvm, arnd, hch

From: David Woodhouse <dwmw2@infradead.org>
Date: Thu, 29 Oct 2015 22:35:25 +0000

> For the receive side, it shouldn't be beyond the wit of man to
> introduce an API which allocates *and* DMA-maps a skb. Pass it to
> netif_rx() still mapped, with a destructor that just shoves it back in
> a pool for re-use.

For forwarding, the SKB is going to another device to be DMA'd,
perhaps via another IOMMU.

For local connections, it's going to sit for an unpredictable and
unbounded amount of time in the socket queue.

We've been through this thought process before, believe me :)

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v1 2/2] dma-mapping-common: add DMA attribute - DMA_ATTR_IOMMU_BYPASS
  2015-11-05 13:42                   ` Joerg Roedel
@ 2015-11-05 21:11                     ` David Miller
  2015-11-07 15:06                       ` Shamir Rabinovitch
  0 siblings, 1 reply; 42+ messages in thread
From: David Miller @ 2015-11-05 21:11 UTC (permalink / raw)
  To: jroedel
  Cc: shamir.rabinovitch, dwmw2, benh, arnd, corbet, linux-doc,
	linux-arch, luto, borntraeger, cornelia.huck, sebott, pbonzini,
	hch, kvm, schwidefsky, linux-s390

From: Joerg Roedel <jroedel@suse.de>
Date: Thu, 5 Nov 2015 14:42:06 +0100

> Contended IOMMU locks are not only a problem on SPARC, but on x86 and
> various other IOMMU drivers too. But I have some ideas on how to improve
> the situation there.

And for the record Sowmini fixed a lot of the lock contention:

commit ff7d37a502022149655c18035b99a53391be0383
Author: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Date:   Thu Apr 9 15:33:30 2015 -0400

    Break up monolithic iommu table/lock into finer graularity pools and lock
    
    Investigation of multithreaded iperf experiments on an ethernet
    interface show the iommu->lock as the hottest lock identified by
    lockstat, with something of the order of  21M contentions out of
    27M acquisitions, and an average wait time of 26 us for the lock.
    This is not efficient. A more scalable design is to follow the ppc
    model, where the iommu_map_table has multiple pools, each stretching
    over a segment of the map, and with a separate lock for each pool.
    This model allows for better parallelization of the iommu map search.
    
    This patch adds the iommu range alloc/free function infrastructure.
    
    Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
    Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
    Signed-off-by: David S. Miller <davem@davemloft.net>
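The structure that commit introduces - one map table split into several pools, each with its own lock and allocation hint - can be sketched like this. The sizes and the pool-selection policy below are invented for illustration; the real infrastructure lives in lib/iommu-common.c:

```c
#include <assert.h>
#include <pthread.h>

/* Toy model of a pooled iommu map table: parallel mappers hash into
 * different pools and so usually take different locks, instead of
 * all contending on one monolithic iommu->lock. */

#define NPOOLS     4
#define POOL_SIZE  256		/* entries per pool (invented) */

struct iommu_pool {
	pthread_mutex_t lock;
	unsigned long   start;	/* first entry owned by this pool */
	unsigned long   hint;	/* next-fit allocation hint       */
};

static struct iommu_pool pools[NPOOLS];

void pools_init(void)
{
	for (int i = 0; i < NPOOLS; i++) {
		pthread_mutex_init(&pools[i].lock, NULL);
		pools[i].start = (unsigned long)i * POOL_SIZE;
		pools[i].hint  = pools[i].start;
	}
}

/* Each CPU indexes into a pool; only that pool's lock is taken. */
unsigned long iommu_range_alloc(unsigned int cpu, unsigned long npages)
{
	struct iommu_pool *p = &pools[cpu % NPOOLS];
	unsigned long entry;

	pthread_mutex_lock(&p->lock);
	entry = p->hint;
	p->hint += npages;	/* real code wraps and spills to the next pool */
	pthread_mutex_unlock(&p->lock);
	return entry;
}
```

With 21M contentions out of 27M acquisitions on the monolithic lock (per the lockstat data quoted above), spreading acquisitions over independent pool locks is what restores parallelism in the map path.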

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v1 2/2] dma-mapping-common: add DMA attribute - DMA_ATTR_IOMMU_BYPASS
  2015-11-05 21:11                     ` David Miller
@ 2015-11-07 15:06                       ` Shamir Rabinovitch
       [not found]                         ` <CAN+hb0UvztgwNuAh93XdJEe7vgiZgNMc9mHNziHpEopg8Oi4Mg@mail.gmail.com>
  0 siblings, 1 reply; 42+ messages in thread
From: Shamir Rabinovitch @ 2015-11-07 15:06 UTC (permalink / raw)
  To: David Miller
  Cc: jroedel, dwmw2, benh, arnd, corbet, linux-doc, linux-arch, luto,
	borntraeger, cornelia.huck, sebott, pbonzini, hch, kvm,
	schwidefsky, linux-s390

On Thu, Nov 05, 2015 at 04:11:21PM -0500, David Miller wrote:
> 
> And for the record Sowmini fixed a lot of the lock contention:
> 
> commit ff7d37a502022149655c18035b99a53391be0383
> Author: Sowmini Varadhan <sowmini.varadhan@oracle.com>
> Date:   Thu Apr 9 15:33:30 2015 -0400
> 
>     Break up monolithic iommu table/lock into finer graularity pools and lock
>     

The poor rds-stress results without IOMMU bypass that I sent in an earlier post
were taken from a kernel that has the above patch and all the needed changes in
arch/sparc to use this new feature.

It seems that it worked well for the 10G ETH IOMMU lock contention, but it
still does not solve the rds-stress issue.

The difference can come from:

1. Lock contention still left with this enhancement <-- zero in bypass
2. Overhead to set up the IOMMU mapping <-- almost zero in bypass (requires 1 HV call)
3. Overhead to use the IOMMU mapping <-- not sure how to measure this
4. Overhead to tear down the IOMMU mapping <-- zero in bypass

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v1 2/2] dma-mapping-common: add DMA attribute - DMA_ATTR_IOMMU_BYPASS
       [not found]                         ` <CAN+hb0UvztgwNuAh93XdJEe7vgiZgNMc9mHNziHpEopg8Oi4Mg@mail.gmail.com>
@ 2015-11-16  8:42                           ` David Woodhouse
       [not found]                             ` <CAN+hb0UWpfcS5DvgMxNjY-5JOztw2mO1r2FJAW17fn974mhxPA@mail.gmail.com>
  0 siblings, 1 reply; 42+ messages in thread
From: David Woodhouse @ 2015-11-16  8:42 UTC (permalink / raw)
  To: Benjamin Serebrin, Shamir Rabinovitch
  Cc: David Miller, Joerg Roedel, benh, arnd, corbet, linux-doc,
	linux-arch, luto, borntraeger, cornelia.huck, sebott,
	Paolo Bonzini, hch, kvm, schwidefsky, linux-s390

[-- Attachment #1: Type: text/plain, Size: 4242 bytes --]

On Sun, 2015-11-15 at 22:54 -0800, Benjamin Serebrin wrote:
> We looked into Intel IOMMU performance a while ago and learned a few
> useful things.  We generally did a parallel 200 thread TCP_RR test,
> as this churns through mappings quickly and uses all available cores.
> 
> First, the main bottleneck was software performance[1].

For the Intel IOMMU, *all* we need to do is put a PTE in place. For
real hardware (i.e not an IOMMU emulated by qemu for a VM), we don't
need to do an IOTLB flush. It's a single 64-bit write of the PTE.

All else is software overhead.

(I'm deliberately ignoring the stupid chipsets where DMA page tables
aren't cache coherent and we need a clflush too. They make me too sad.)


>   This study preceded the recent patch to break the locks into pools
> ("Break up monolithic iommu table/lock into finer graularity pools
> and lock").  There were several points of lock contention:
> - the RB tree ...
> - the RB tree ...
> - the RB tree ...
> 
> Omer's paper (https://www.usenix.org/system/files/conference/atc15/at
> c15-paper-peleg.pdf) has some promising approaches.  The magazine
> avoids the RB tree issue.  

I'm thinking of ditching the RB tree altogether and switching to the
allocator in lib/iommu-common.c (and thus getting the benefit of the
finer granularity pools).

> I'm interested in seeing if the dynamic 1:1 with a mostly-lock-free
> page table cleanup algorithm could do well.

When you say 'dynamic 1:1 mapping', is that the same thing that's been
suggested elsewhere — avoiding the IOVA allocator completely by using a
virtual address which *matches* the physical address, if that virtual
address is available? Simply cmpxchg on the PTE itself, and if it was
already set *then* we fall back to the allocator, obviously configured
to allocate from a range *higher* than the available physical memory.

Jörg has been looking at this too, and was even trying to find space in
the PTE for a use count so a given page could be in more than one
mapping before we call back to the IOVA allocator.
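A minimal sketch of that cmpxchg-on-the-PTE scheme, using C11 atomics and a toy one-level table. The PTE format, table size and fallback allocator here are invented for illustration:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Dynamic 1:1: try to claim the PTE whose IOVA equals the physical
 * address; only if that slot is already taken fall back to an
 * allocator working above the top of physical memory. */

#define TOY_PAGES   16
#define PAGE_SHIFT  12
#define HIGH_BASE   ((uint64_t)TOY_PAGES << PAGE_SHIFT)

static _Atomic uint64_t pte[TOY_PAGES];	/* 0 == not present        */
static uint64_t high_next = HIGH_BASE;	/* toy fallback allocator  */

uint64_t map_dynamic_identity(uint64_t phys)
{
	uint64_t idx = phys >> PAGE_SHIFT;
	uint64_t expected = 0;

	/* cmpxchg on the PTE itself: claim the identity slot if free. */
	if (idx < TOY_PAGES &&
	    atomic_compare_exchange_strong(&pte[idx], &expected, phys | 1))
		return phys;			/* IOVA == phys */

	/* Slot already in use: allocate an IOVA above physical memory. */
	uint64_t iova = high_next;
	high_next += 1ULL << PAGE_SHIFT;
	return iova;
}
```

The common case is a single atomic write with no allocator lock at all; the allocator (and any locking it needs) is only reached on collision, which is the property that makes the scheme attractive for the contended workloads discussed above.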


> There are correctness fixes and optimizations left in the
> invalidation path: I want strict-ish semantics (a page doesn't go
> back into the freelist until the last IOTLB/IOMMU TLB entry is
> invalidated) with good performance, and that seems to imply that an
> additional page reference should be gotten at dma_map time and put
> back at the completion of the IOMMU flush routine.  (This is worthy
> of much discussion.)  

We already do something like this for page table pages which are freed
by an unmap, FWIW.

> Additionally, we can find ways to optimize the flush routine by
> realizing that if we have frequent maps and unmaps, it may be because
> the device creates and destroys buffers a lot; these kind of
> workloads use an IOVA for one event and then never come back.  Maybe
> TLBs don't do much good and we could just flush the entire IOMMU TLB
> [and ATS cache] for that BDF.

That would be a very interesting thing to look at. Although it would be
nice if we had a better way to measure the performance impact of IOTLB
misses — currently we don't have a lot of visibility at all.

> 1: We verified that the IOMMU costs are almost entirely software
> overheads by forcing software 1:1 mode, where we create page tables
> for all physical addresses.  We tested using leaf nodes of size 4KB,
> of 2MB, and of 1GB.  In all cases, there is zero runtime maintenance
> of the page tables, and no IOMMU invalidations.  We did piles of DMA
> maximizing x16 PCIe bandwidth on multiple lanes, to random DRAM
> addresses.  At 4KB page size, we could see some bandwidth slowdown,
> but at 2MB and 1GB, there was < 1% performance loss as compared with
> IOMMU off.

Was this with ATS on or off? With ATS, the cost of the page walk can be
amortised in some cases — you can look up the physical address *before*
you are ready to actually start the DMA to it, and don't take that
latency at the time you're actually moving the data.

-- 
David Woodhouse                            Open Source Technology Centre
David.Woodhouse@intel.com                              Intel Corporation


[-- Attachment #2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 5691 bytes --]

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v1 2/2] dma-mapping-common: add DMA attribute - DMA_ATTR_IOMMU_BYPASS
       [not found]                             ` <CAN+hb0UWpfcS5DvgMxNjY-5JOztw2mO1r2FJAW17fn974mhxPA@mail.gmail.com>
@ 2015-11-16 18:42                               ` Benjamin Serebrin
  0 siblings, 0 replies; 42+ messages in thread
From: Benjamin Serebrin @ 2015-11-16 18:42 UTC (permalink / raw)
  To: David Woodhouse
  Cc: Shamir Rabinovitch, David Miller, Joerg Roedel, benh,
	Arnd Bergmann, Jonathan Corbet, linux-doc, linux-arch, luto,
	borntraeger, cornelia.huck, sebott, Paolo Bonzini, hch, kvm,
	schwidefsky, linux-s390

On Mon, Nov 16, 2015 at 12:42 AM, David Woodhouse <dwmw2@infradead.org> wrote:
>
> On Sun, 2015-11-15 at 22:54 -0800, Benjamin Serebrin wrote:
> > We looked into Intel IOMMU performance a while ago and learned a few
> > useful things.  We generally did a parallel 200 thread TCP_RR test,
> > as this churns through mappings quickly and uses all available cores.
> >
> > First, the main bottleneck was software performance[1].
>
> For the Intel IOMMU, *all* we need to do is put a PTE in place. For
> real hardware (i.e not an IOMMU emulated by qemu for a VM), we don't
> need to do an IOTLB flush. It's a single 64-bit write of the PTE.
>
> All else is software overhead.
>

Agreed!

>
> (I'm deliberately ignoring the stupid chipsets where DMA page tables
> aren't cache coherent and we need a clflush too. They make me too sad.)



How much does Linux need to care about such chipsets?  Can we argue
that they get very sad performance and so be it?

>
>
>
> >   This study preceded the recent patch to break the locks into pools
> > ("Break up monolithic iommu table/lock into finer graularity pools
> > and lock").  There were several points of lock contention:
> > - the RB tree ...
> > - the RB tree ...
> > - the RB tree ...
> >
> > Omer's paper (https://www.usenix.org/system/files/conference/atc15/at
> > c15-paper-peleg.pdf) has some promising approaches.  The magazine
> > avoids the RB tree issue.
>
> I'm thinking of ditching the RB tree altogether and switching to the
> allocator in lib/iommu-common.c (and thus getting the benefit of the
> finer granularity pools).


Sounds promising!  Is 4 parallel arenas enough?  We'll try to play
with that here.
I think lazy_flush leaves dangling references.

>
>
> > I'm interested in seeing if the dynamic 1:1 with a mostly-lock-free
> > page table cleanup algorithm could do well.
>
> When you say 'dynamic 1:1 mapping', is that the same thing that's been
> suggested elsewhere — avoiding the IOVA allocator completely by using a
> virtual address which *matches* the physical address, if that virtual
> address is available?


Yes.  We munge in 2 upper address bits in the IOVA to encode read and
write permissions as well.

>
> Simply cmpxchg on the PTE itself, and if it was
> already set *then* we fall back to the allocator, obviously configured
> to allocate from a range *higher* than the available physical memory.
>
> Jörg has been looking at this too, and was even trying to find space in
> the PTE for a use count so a given page could be in more than one
> mapping before we call back to the IOVA allocator.


Aren't bits 63:52 available at all levels?

>
>
>
> > There are correctness fixes and optimizations left in the
> > invalidation path: I want strict-ish semantics (a page doesn't go
> > back into the freelist until the last IOTLB/IOMMU TLB entry is
> > invalidated) with good performance, and that seems to imply that an
> > additional page reference should be gotten at dma_map time and put
> > back at the completion of the IOMMU flush routine.  (This is worthy
> > of much discussion.)
>
> We already do something like this for page table pages which are freed
> by an unmap, FWIW.


As I understood the code when I last looked, this was true only if a
single unmap operation covered an entire table's worth (2MByte, or
1GByte, etc) of mappings.  The caffeine hasn't hit yet, though, so I
can't even begin to dive into the new calls into mm.c.


>
>
> > Additionally, we can find ways to optimize the flush routine by
> > realizing that if we have frequent maps and unmaps, it may be because
> > the device creates and destroys buffers a lot; these kind of
> > workloads use an IOVA for one event and then never come back.  Maybe
> > TLBs don't do much good and we could just flush the entire IOMMU TLB
> > [and ATS cache] for that BDF.
>
> That would be a very interesting thing to look at. Although it would be
> nice if we had a better way to measure the performance impact of IOTLB
> misses — currently we don't have a lot of visibility at all.


All benchmarks are lies.  But we intend to run internal workloads as
well as well-agreed loads and see how things go.

>
>
> > 1: We verified that the IOMMU costs are almost entirely software
> > overheads by forcing software 1:1 mode, where we create page tables
> > for all physical addresses.  We tested using leaf nodes of size 4KB,
> > of 2MB, and of 1GB.  In all cases, there is zero runtime maintenance
> > of the page tables, and no IOMMU invalidations.  We did piles of DMA
> > maximizing x16 PCIe bandwidth on multiple lanes, to random DRAM
> > addresses.  At 4KB page size, we could see some bandwidth slowdown,
> > but at 2MB and 1GB, there was < 1% performance loss as compared with
> > IOMMU off.
>
> Was this with ATS on or off? With ATS, the cost of the page walk can be
> amortised in some cases — you can look up the physical address *before*
> you are ready to actually start the DMA to it, and don't take that
> latency at the time you're actually moving the data.


This was with ATS intentionally disabled; we wanted to stress the
IOMMU's page walker.
Most devices aren't as smart as you describe, though in any kind of
ring buffer where storage is posted,
the device should be doing that.

On Mon, Nov 16, 2015 at 10:09 AM, Benjamin Serebrin <serebrin@google.com> wrote:
>
>
> On Mon, Nov 16, 2015 at 12:42 AM, David Woodhouse <dwmw2@infradead.org>
> wrote:
>>
>> On Sun, 2015-11-15 at 22:54 -0800, Benjamin Serebrin wrote:
>> > We looked into Intel IOMMU performance a while ago and learned a few
>> > useful things.  We generally did a parallel 200 thread TCP_RR test,
>> > as this churns through mappings quickly and uses all available cores.
>> >
>> > First, the main bottleneck was software performance[1].
>>
>> For the Intel IOMMU, *all* we need to do is put a PTE in place. For
>> real hardware (i.e not an IOMMU emulated by qemu for a VM), we don't
>> need to do an IOTLB flush. It's a single 64-bit write of the PTE.
>>
>> All else is software overhead.
>>
>
> Agreed!
>
>>
>> (I'm deliberately ignoring the stupid chipsets where DMA page tables
>> aren't cache coherent and we need a clflush too. They make me too sad.)
>
>
>
> How much does Linux need to care about such chipsets?  Can we argue that
> they get very sad performance and so be it?
>
>>
>>
>>
>> >   This study preceded the recent patch to break the locks into pools
>> > ("Break up monolithic iommu table/lock into finer graularity pools
>> > and lock").  There were several points of lock contention:
>> > - the RB tree ...
>> > - the RB tree ...
>> > - the RB tree ...
>> >
>> > Omer's paper (https://www.usenix.org/system/files/conference/atc15/at
>> > c15-paper-peleg.pdf) has some promising approaches.  The magazine
>> > avoids the RB tree issue.
>>
>> I'm thinking of ditching the RB tree altogether and switching to the
>> allocator in lib/iommu-common.c (and thus getting the benefit of the
>> finer granularity pools).
>
>
> Sounds promising!  Is 4 parallel arenas enough?  We'll try to play with that
> here.
> I think lazy_flush leaves dangling references.
>
>>
>>
>> > I'm interested in seeing if the dynamic 1:1 with a mostly-lock-free
>> > page table cleanup algorithm could do well.
>>
>> When you say 'dynamic 1:1 mapping', is that the same thing that's been
>> suggested elsewhere — avoiding the IOVA allocator completely by using a
>> virtual address which *matches* the physical address, if that virtual
>> address is available?
>
>
> Yes.  We munge in 2 upper address bits in the IOVA to encode read and write
> permissions as well.
>
>>
>> Simply cmpxchg on the PTE itself, and if it was
>> already set *then* we fall back to the allocator, obviously configured
>> to allocate from a range *higher* than the available physical memory.
>>
>> Jörg has been looking at this too, and was even trying to find space in
>> the PTE for a use count so a given page could be in more than one
>> mapping before we call back to the IOVA allocator.
>
>
> Aren't bits 63:52 available at all levels?
>
>>
>>
>>
>> > There are correctness fixes and optimizations left in the
>> > invalidation path: I want strict-ish semantics (a page doesn't go
>> > back into the freelist until the last IOTLB/IOMMU TLB entry is
>> > invalidated) with good performance, and that seems to imply that an
>> > additional page reference should be gotten at dma_map time and put
>> > back at the completion of the IOMMU flush routine.  (This is worthy
>> > of much discussion.)
>>
>> We already do something like this for page table pages which are freed
>> by an unmap, FWIW.
>
>
> As I understood the code when I last looked, this was true only if a single
> unmap operation covered an entire table's worth (2MByte, or 1GByte, etc) of
> mappings.  The caffeine hasn't hit yet, though, so I can't even begin to
> dive into the new calls into mm.c.
>
>
>>
>>
>> > Additionally, we can find ways to optimize the flush routine by
>> > realizing that if we have frequent maps and unmaps, it may be because
>> > the device creates and destroys buffers a lot; these kind of
>> > workloads use an IOVA for one event and then never come back.  Maybe
>> > TLBs don't do much good and we could just flush the entire IOMMU TLB
>> > [and ATS cache] for that BDF.
>>
>> That would be a very interesting thing to look at. Although it would be
>> nice if we had a better way to measure the performance impact of IOTLB
>> misses — currently we don't have a lot of visibility at all.
>
>
> All benchmarks are lies.  But we intend to run internal workloads as well as
> well-agreed loads and see how things go.
>
>>
>>
>> > 1: We verified that the IOMMU costs are almost entirely software
>> > overheads by forcing software 1:1 mode, where we create page tables
>> > for all physical addresses.  We tested using leaf nodes of size 4KB,
>> > of 2MB, and of 1GB.  In all cases, there is zero runtime maintenance
>> > of the page tables, and no IOMMU invalidations.  We did piles of DMA
>> > maximizing x16 PCIe bandwidth on multiple lanes, to random DRAM
>> > addresses.  At 4KB page size, we could see some bandwidth slowdown,
>> > but at 2MB and 1GB, there was < 1% performance loss as compared with
>> > IOMMU off.
>>
>> Was this with ATS on or off? With ATS, the cost of the page walk can be
>> amortised in some cases — you can look up the physical address *before*
>> you are ready to actually start the DMA to it, and don't take that
>> latency at the time you're actually moving the data.
>
>
> This was with ATS intentionally disabled; we wanted to stress the IOMMU's
> page walker.  Most devices aren't as smart as you describe, though with
> any kind of ring buffer where storage is posted, the device should be
> doing that.
>
>>
>>
>> --
>> David Woodhouse                            Open Source Technology Centre
>> David.Woodhouse@intel.com                              Intel Corporation
>>
>

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v1 2/2] dma-mapping-common: add DMA attribute - DMA_ATTR_IOMMU_BYPASS
@ 2015-11-16  6:56 Benjamin Serebrin
  0 siblings, 0 replies; 42+ messages in thread
From: Benjamin Serebrin @ 2015-11-16  6:56 UTC (permalink / raw)
  To: Shamir Rabinovitch
  Cc: David Miller, Joerg Roedel, dwmw2, benh, arnd, corbet, linux-doc,
	linux-arch, luto, borntraeger, cornelia.huck, sebott,
	Paolo Bonzini, hch, kvm, schwidefsky, linux-s390

We looked into Intel IOMMU performance a while ago and learned a few
useful things.  We generally ran a parallel 200-thread TCP_RR test, as
this churns through mappings quickly and uses all available cores.

First, the main bottleneck was software performance[1].  This study
preceded the recent patch to break the locks into pools ("Break up
monolithic iommu table/lock into finer graularity pools and lock").
There were several points of lock contention:
- the RB tree is per device (and in the network test, there's one
device).  Every dma_map and unmap holds the lock.
- the RB tree lock is held during invalidations as well.  There's a
250-entry queue for invalidations that doesn't apply any batching
intelligence (for example, promoting to larger-range invalidations, or
flushing the entire device).  RB tree locks may be held while waiting
for invalidation drains.  Invalidations have even worse behavior with
ATS enabled for a given device.
- the RB tree has one entry per dma_map call (that entry is deleted by
the corresponding dma_unmap).  If we had merged all adjacent entries
when we could, we would not have lost any information that's actually
used by the code today.  (There could be a check that a dma_unmap
actually covers the entire region that was mapped, but there isn't.)
At boot (without network traffic), two common NICs' drivers show tens
of thousands of static dma_maps that never go away; this means the RB
tree is ~14-16 levels deep.  A rbtree walk (holding that big lock) has
a 14-16 level pointer chase through mostly cache-cold entries.  I
wrote a modification to the RB tree handling that merges nodes that
represent abutting IOVA ranges (and unmerges them on dma_unmap), and
the same drivers created around 7 unique entries.  Steady state grew
to a few hundreds and maybe a thousand, but the fragmentation didn't
get worse than that.  This optimization got about a third of the
performance back.

Omer's paper (https://www.usenix.org/system/files/conference/atc15/atc15-paper-peleg.pdf)
has some promising approaches.  The magazine avoids the RB tree issue.

I'm interested in seeing if the dynamic 1:1 with a mostly-lock-free
page table cleanup algorithm could do well.
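
For reference, the "magazine" trick from the paper can be modeled in a
few lines (hypothetical single-threaded sketch; names and the bump-pointer
"global allocator" are stand-ins, not kernel code): each CPU keeps a small
stack of recently freed IOVAs and only takes the shared lock when that
stack is empty or full.

```c
#include <assert.h>
#include <stddef.h>

#define MAG_SIZE 8

/* Per-CPU magazine: a small stack of free IOVA page frames. */
struct magazine {
	unsigned long slot[MAG_SIZE];
	size_t count;
};

static unsigned long global_next = 0x100000; /* shared allocator cursor */
static unsigned long global_locked_ops;      /* # times the big lock was taken */

static unsigned long global_alloc(void)      /* stands in for the RB tree */
{
	global_locked_ops++;
	global_next += 0x1000;
	return global_next;
}

static unsigned long mag_alloc(struct magazine *m)
{
	if (m->count)				/* lock-free fast path */
		return m->slot[--m->count];
	return global_alloc();			/* slow path under the big lock */
}

static void mag_free(struct magazine *m, unsigned long iova)
{
	if (m->count < MAG_SIZE)		/* lock-free fast path */
		m->slot[m->count++] = iova;
	/* else: return the whole magazine to the global pool (elided) */
}
```

In a steady map/unmap workload the magazine absorbs almost all traffic,
so the contended lock is taken only on refill or overflow.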

There are correctness fixes and optimizations left in the invalidation
path: I want strict-ish semantics (a page doesn't go back into the
freelist until the last IOTLB/IOMMU TLB entry is invalidated) with
good performance, and that seems to imply that an additional page
reference should be gotten at dma_map time and put back at the
completion of the IOMMU flush routine.  (This is worthy of much
discussion.)
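
The refcounting scheme being proposed can be sketched as follows
(hypothetical userspace model, not kernel code; all names are
illustrative): the mapping itself holds a page reference from dma_map
time until the IOTLB flush for its unmap completes, so the page cannot
re-enter the free list while a stale IOTLB entry could still reach it.

```c
#include <assert.h>
#include <stdbool.h>

struct dma_page {
	int refcount;
	bool on_freelist;
};

static void page_put(struct dma_page *p)
{
	if (--p->refcount == 0)
		p->on_freelist = true;	/* only now is reuse safe */
}

static void model_dma_map(struct dma_page *p)
{
	p->refcount++;			/* extra ref taken at map time */
}

static void model_dma_unmap(struct dma_page *p)
{
	/* Invalidation is queued; the mapping ref is NOT dropped yet. */
	(void)p;
}

static void model_iotlb_flush_done(struct dma_page *p)
{
	page_put(p);			/* ref dropped only after the flush */
}
```

Even if the caller drops its own reference right after unmapping, the
page stays off the freelist until the flush completion runs.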

Additionally, we can find ways to optimize the flush routine by
realizing that if we have frequent maps and unmaps, it may be because
the device creates and destroys buffers a lot; these kinds of workloads
use an IOVA for one event and then never come back.  Maybe TLBs don't
do much good and we could just flush the entire IOMMU TLB [and ATS
cache] for that BDF.
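
A minimal sketch of that heuristic (hypothetical; the threshold value
and all names are assumptions, and counters stand in for real
invalidation descriptors): if more per-page invalidations pile up than
some threshold before the next drain, issue one domain-wide flush for
the device instead of replaying each entry.

```c
#include <assert.h>

#define FULL_FLUSH_THRESHOLD 32

struct inval_batch {
	unsigned pending;	/* per-page invalidations queued */
	unsigned page_flushes;	/* stats: per-page flushes issued */
	unsigned full_flushes;	/* stats: whole-domain flushes issued */
};

static void queue_page_inval(struct inval_batch *b)
{
	b->pending++;
}

static void drain(struct inval_batch *b)
{
	if (b->pending > FULL_FLUSH_THRESHOLD)
		b->full_flushes++;	/* one domain-wide (BDF) flush */
	else
		b->page_flushes += b->pending;
	b->pending = 0;
}
```

The trade-off is that a full flush also evicts any still-useful IOTLB
entries for that device, which is why this only pays off for workloads
whose IOVAs are used once and never revisited.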

We'll try to get free time to do some of these things soon.

Ben


1: We verified that the IOMMU costs are almost entirely software
overheads by forcing software 1:1 mode, where we create page tables
for all physical addresses.  We tested using leaf nodes of size 4KB,
of 2MB, and of 1GB.  In all cases, there is zero runtime maintenance
of the page tables, and no IOMMU invalidations.  We did piles of DMA
maximizing x16 PCIe bandwidth on multiple lanes, to random DRAM
addresses.  At 4KB page size, we could see some bandwidth slowdown,
but at 2MB and 1GB, there was < 1% performance loss as compared with
IOMMU off.
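
The arithmetic behind the leaf-size comparison is simple (the 256 GiB
memory size below is an assumed example, not a figure from this test):
larger leaves mean orders of magnitude fewer leaf entries to build and a
far smaller IOTLB footprint, which is consistent with the slowdown being
visible only at 4 KB.

```c
#include <assert.h>

/* Number of leaf page-table entries needed to identity-map `mem` bytes
 * of physical memory at a given leaf size. */
static unsigned long long leaves(unsigned long long mem,
				 unsigned long long leaf)
{
	return mem / leaf;
}
```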

^ permalink raw reply	[flat|nested] 42+ messages in thread

* [PATCH v1 2/2] dma-mapping-common: add DMA attribute - DMA_ATTR_IOMMU_BYPASS
  2015-10-25 16:37 [PATCH v1 1/2] dma-mapping-common: add dma_map_page_attrs API Shamir Rabinovitch
@ 2015-10-25 16:37 ` Shamir Rabinovitch
  0 siblings, 0 replies; 42+ messages in thread
From: Shamir Rabinovitch @ 2015-10-25 16:37 UTC (permalink / raw)
  To: linux-arch; +Cc: arnd, corbet, linux-doc

For systems with an IOMMU it is assumed that all DMA translations use
the IOMMU. There are, however, cases where the IOMMU maps only a
limited range of memory, or where the cost of setting up an IOMMU DMA
mapping is too high. For those cases, this flag can be used to request
that the DMA-mapping subsystem bypass the IOMMU where applicable.
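
As an illustration, a hypothetical caller might combine this attribute
with the dma_map_page_attrs() helper added in patch 1/2 roughly like
this (a sketch only, not code from the series; driver context and error
handling are elided):

```c
#include <linux/dma-attrs.h>
#include <linux/dma-mapping.h>

static dma_addr_t map_bypassing_iommu(struct device *dev, struct page *page,
				      size_t offset, size_t size)
{
	DEFINE_DMA_ATTRS(attrs);

	/* Hint that this mapping may skip IOMMU translation if supported. */
	dma_set_attr(DMA_ATTR_IOMMU_BYPASS, &attrs);
	return dma_map_page_attrs(dev, page, offset, size,
				  DMA_TO_DEVICE, &attrs);
}
```

Architectures without bypass support would simply ignore the attribute
and map through the IOMMU as before.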

Signed-off-by: Shamir Rabinovitch <shamir.rabinovitch@oracle.com>
---
 Documentation/DMA-attributes.txt |    9 +++++++++
 include/linux/dma-attrs.h        |    1 +
 2 files changed, 10 insertions(+), 0 deletions(-)

diff --git a/Documentation/DMA-attributes.txt b/Documentation/DMA-attributes.txt
index 18dc52c..1174afc 100644
--- a/Documentation/DMA-attributes.txt
+++ b/Documentation/DMA-attributes.txt
@@ -100,3 +100,12 @@ allocated by dma_alloc_attrs() function from individual pages if it can
 be mapped as contiguous chunk into device dma address space. By
 specifying this attribute the allocated buffer is forced to be contiguous
 also in physical memory.
+
+DMA_ATTR_IOMMU_BYPASS
+---------------------
+
+For systems with an IOMMU it is assumed that all DMA translations use
+the IOMMU. There are, however, cases where the IOMMU maps only a
+limited range of memory, or where the cost of setting up an IOMMU DMA
+mapping is too high. For those cases, this flag can be used to request
+that the DMA-mapping subsystem bypass the IOMMU where applicable.
diff --git a/include/linux/dma-attrs.h b/include/linux/dma-attrs.h
index c8e1831..7c78691 100644
--- a/include/linux/dma-attrs.h
+++ b/include/linux/dma-attrs.h
@@ -18,6 +18,7 @@ enum dma_attr {
 	DMA_ATTR_NO_KERNEL_MAPPING,
 	DMA_ATTR_SKIP_CPU_SYNC,
 	DMA_ATTR_FORCE_CONTIGUOUS,
+	DMA_ATTR_IOMMU_BYPASS,
 	DMA_ATTR_MAX,
 };
 
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 42+ messages in thread

end of thread, other threads:[~2015-11-16 18:42 UTC | newest]

Thread overview: 42+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-10-25 16:07 [PATCH v1 1/2] dma-mapping-common: add dma_map_page_attrs API Shamir Rabinovitch
2015-10-25 16:07 ` [PATCH v1 2/2] dma-mapping-common: add DMA attribute - DMA_ATTR_IOMMU_BYPASS Shamir Rabinovitch
2015-10-28  6:30   ` David Woodhouse
2015-10-28 11:10     ` Shamir Rabinovitch
2015-10-28 13:31       ` David Woodhouse
2015-10-28 14:07         ` David Miller
2015-10-28 13:57           ` David Woodhouse
2015-10-29  0:23             ` David Miller
2015-10-29  0:32         ` Benjamin Herrenschmidt
2015-10-29  0:42           ` David Woodhouse
2015-10-29  1:10             ` Benjamin Herrenschmidt
2015-10-29 18:31               ` Andy Lutomirski
2015-10-29 22:35                 ` David Woodhouse
2015-11-01  7:45                   ` Shamir Rabinovitch
2015-11-01 21:10                     ` Benjamin Herrenschmidt
2015-11-02  7:23                       ` Shamir Rabinovitch
2015-11-02 10:00                         ` Benjamin Herrenschmidt
2015-11-02 12:07                           ` Shamir Rabinovitch
2015-11-02 20:13                             ` Benjamin Herrenschmidt
2015-11-02 21:45                               ` Arnd Bergmann
2015-11-02 23:08                                 ` Benjamin Herrenschmidt
2015-11-03 13:11                                   ` Christoph Hellwig
2015-11-03 19:35                                     ` Benjamin Herrenschmidt
2015-11-02 21:49                               ` Shamir Rabinovitch
2015-11-02 22:48                       ` David Woodhouse
2015-11-02 23:10                         ` Benjamin Herrenschmidt
2015-11-05 21:08                   ` David Miller
2015-10-30  1:51                 ` Benjamin Herrenschmidt
2015-10-30 10:32               ` Arnd Bergmann
2015-10-30 23:17                 ` Benjamin Herrenschmidt
2015-10-30 23:24                   ` Arnd Bergmann
2015-11-02 14:51                 ` Joerg Roedel
2015-10-29  7:32             ` Shamir Rabinovitch
2015-11-02 14:44               ` Joerg Roedel
2015-11-02 17:32                 ` Shamir Rabinovitch
2015-11-05 13:42                   ` Joerg Roedel
2015-11-05 21:11                     ` David Miller
2015-11-07 15:06                       ` Shamir Rabinovitch
     [not found]                         ` <CAN+hb0UvztgwNuAh93XdJEe7vgiZgNMc9mHNziHpEopg8Oi4Mg@mail.gmail.com>
2015-11-16  8:42                           ` David Woodhouse
     [not found]                             ` <CAN+hb0UWpfcS5DvgMxNjY-5JOztw2mO1r2FJAW17fn974mhxPA@mail.gmail.com>
2015-11-16 18:42                               ` Benjamin Serebrin
2015-10-25 16:37 [PATCH v1 1/2] dma-mapping-common: add dma_map_page_attrs API Shamir Rabinovitch
2015-10-25 16:37 ` [PATCH v1 2/2] dma-mapping-common: add DMA attribute - DMA_ATTR_IOMMU_BYPASS Shamir Rabinovitch
2015-11-16  6:56 Benjamin Serebrin
