iommu.lists.linux-foundation.org archive mirror
* the XSK buffer pool needs be to reverted
@ 2020-06-26  7:47 Christoph Hellwig
  2020-06-26 12:22 ` Björn Töpel
  2020-06-26 20:54 ` Jonathan Lemon
  0 siblings, 2 replies; 11+ messages in thread
From: Christoph Hellwig @ 2020-06-26  7:47 UTC (permalink / raw)
  To: Björn Töpel, David S. Miller, netdev,
	Konrad Rzeszutek Wilk, iommu, linux-kernel

Hi Björn,

your addition of the xsk_buff_pool.c APIs in commit 2b43470add8c
("xsk: Introduce AF_XDP buffer allocation API") is unfortunately rather
broken: it makes lots of assumptions and pokes into dma-direct and
swiotlb internals that are of no business to outside users and are
clearly marked as such.  I'd be glad to work with you on doing something
proper for pools, but that needs proper APIs and should probably live in
the dma mapping core, and for that you'd actually need to contact the
relevant maintainers before poking into internals.

The commit seems to have a long dovetail of commits depending on it
despite only being a month old, so maybe you can do the revert for now?

Note that this is somewhat urgent, as various of the APIs that the code
is abusing are slated to go away for Linux 5.9, so this addition comes
at a really bad time.
_______________________________________________
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: the XSK buffer pool needs be to reverted
  2020-06-26  7:47 the XSK buffer pool needs be to reverted Christoph Hellwig
@ 2020-06-26 12:22 ` Björn Töpel
  2020-06-26 12:41   ` Christoph Hellwig
  2020-06-26 20:54 ` Jonathan Lemon
  1 sibling, 1 reply; 11+ messages in thread
From: Björn Töpel @ 2020-06-26 12:22 UTC (permalink / raw)
  To: Christoph Hellwig, David S. Miller, netdev,
	Konrad Rzeszutek Wilk, iommu, linux-kernel, bpf
  Cc: Maxim Mikityanskiy, Karlsson, Magnus

On 2020-06-26 09:47, Christoph Hellwig wrote:
> Hi Björn,
> 
> your addition of the xsk_buff_pool.c APIs in commit 2b43470add8c
> ("xsk: Introduce AF_XDP buffer allocation API") is unfortunately rather
> broken: it makes lots of assumptions and pokes into dma-direct and
> swiotlb internals that are of no business to outside users and are
> clearly marked as such.  I'd be glad to work with you on doing something
> proper for pools, but that needs proper APIs and should probably live in
> the dma mapping core, and for that you'd actually need to contact the
> relevant maintainers before poking into internals.
>

Christoph,

Thanks for clarifying that. Let's work on a solution that can reside in
the dma mapping core.

> The commit seems to have a long dovetail of commits depending on it
> despite only being a month old, so maybe you can do the revert for now?
>

Reverting the whole series sounds a bit too much. Let's focus on the
part that breaks the dma api abstraction. I'm assuming that you're
referring to the

   static bool xp_check_cheap_dma(struct xsk_buff_pool *pool)

function (and related functions called from that)?

> Note that this is somewhat urgent, as various of the APIs that the code
> is abusing are slated to go away for Linux 5.9, so this addition comes
> at a really bad time.
> 

Understood. Wdyt about something along the lines of the diff below? It's
build tested only, but removes all non-dma API usage ("poking
internals"). Would that be a way forward, and then as a next step work
on a solution that would give similar benefits, but something that would
live in the dma mapping core?

diff --git a/include/net/xsk_buff_pool.h b/include/net/xsk_buff_pool.h
index a4ff226505c9..003b172ce0d2 100644
--- a/include/net/xsk_buff_pool.h
+++ b/include/net/xsk_buff_pool.h
@@ -40,7 +40,6 @@ struct xsk_buff_pool {
  	u32 headroom;
  	u32 chunk_size;
  	u32 frame_len;
-	bool cheap_dma;
  	bool unaligned;
  	void *addrs;
  	struct device *dev;
@@ -80,9 +79,6 @@ static inline dma_addr_t xp_get_frame_dma(struct xdp_buff_xsk *xskb)
  void xp_dma_sync_for_cpu_slow(struct xdp_buff_xsk *xskb);
  static inline void xp_dma_sync_for_cpu(struct xdp_buff_xsk *xskb)
  {
-	if (xskb->pool->cheap_dma)
-		return;
-
  	xp_dma_sync_for_cpu_slow(xskb);
  }

@@ -91,9 +87,6 @@ void xp_dma_sync_for_device_slow(struct xsk_buff_pool *pool, dma_addr_t dma,
  static inline void xp_dma_sync_for_device(struct xsk_buff_pool *pool,
  					  dma_addr_t dma, size_t size)
  {
-	if (pool->cheap_dma)
-		return;
-
  	xp_dma_sync_for_device_slow(pool, dma, size);
  }

diff --git a/net/xdp/xsk_buff_pool.c b/net/xdp/xsk_buff_pool.c
index 540ed75e4482..5714f3711381 100644
--- a/net/xdp/xsk_buff_pool.c
+++ b/net/xdp/xsk_buff_pool.c
@@ -2,9 +2,6 @@

  #include <net/xsk_buff_pool.h>
  #include <net/xdp_sock.h>
-#include <linux/dma-direct.h>
-#include <linux/dma-noncoherent.h>
-#include <linux/swiotlb.h>

  #include "xsk_queue.h"

@@ -55,7 +52,6 @@ struct xsk_buff_pool *xp_create(struct page **pages, u32 nr_pages, u32 chunks,
  	pool->free_heads_cnt = chunks;
  	pool->headroom = headroom;
  	pool->chunk_size = chunk_size;
-	pool->cheap_dma = true;
  	pool->unaligned = unaligned;
  	pool->frame_len = chunk_size - headroom - XDP_PACKET_HEADROOM;
  	INIT_LIST_HEAD(&pool->free_list);
@@ -125,48 +121,6 @@ static void xp_check_dma_contiguity(struct xsk_buff_pool *pool)
  	}
  }

-static bool __maybe_unused xp_check_swiotlb_dma(struct xsk_buff_pool *pool)
-{
-#if defined(CONFIG_SWIOTLB)
-	phys_addr_t paddr;
-	u32 i;
-
-	for (i = 0; i < pool->dma_pages_cnt; i++) {
-		paddr = dma_to_phys(pool->dev, pool->dma_pages[i]);
-		if (is_swiotlb_buffer(paddr))
-			return false;
-	}
-#endif
-	return true;
-}
-
-static bool xp_check_cheap_dma(struct xsk_buff_pool *pool)
-{
-#if defined(CONFIG_HAS_DMA)
-	const struct dma_map_ops *ops = get_dma_ops(pool->dev);
-
-	if (ops) {
-		return !ops->sync_single_for_cpu &&
-			!ops->sync_single_for_device;
-	}
-
-	if (!dma_is_direct(ops))
-		return false;
-
-	if (!xp_check_swiotlb_dma(pool))
-		return false;
-
-	if (!dev_is_dma_coherent(pool->dev)) {
-#if defined(CONFIG_ARCH_HAS_SYNC_DMA_FOR_CPU) ||		\
-	defined(CONFIG_ARCH_HAS_SYNC_DMA_FOR_CPU_ALL) ||	\
-	defined(CONFIG_ARCH_HAS_SYNC_DMA_FOR_DEVICE)
-		return false;
-#endif
-	}
-#endif
-	return true;
-}
-
  int xp_dma_map(struct xsk_buff_pool *pool, struct device *dev,
  	       unsigned long attrs, struct page **pages, u32 nr_pages)
  {
@@ -195,7 +149,6 @@ int xp_dma_map(struct xsk_buff_pool *pool, struct device *dev,
  		xp_check_dma_contiguity(pool);

  	pool->dev = dev;
-	pool->cheap_dma = xp_check_cheap_dma(pool);
  	return 0;
  }
  EXPORT_SYMBOL(xp_dma_map);
@@ -280,11 +233,9 @@ struct xdp_buff *xp_alloc(struct xsk_buff_pool *pool)
  	xskb->xdp.data = xskb->xdp.data_hard_start + XDP_PACKET_HEADROOM;
  	xskb->xdp.data_meta = xskb->xdp.data;

-	if (!pool->cheap_dma) {
-		dma_sync_single_range_for_device(pool->dev, xskb->dma, 0,
-						 pool->frame_len,
-						 DMA_BIDIRECTIONAL);
-	}
+	dma_sync_single_range_for_device(pool->dev, xskb->dma, 0,
+					 pool->frame_len,
+					 DMA_BIDIRECTIONAL);
  	return &xskb->xdp;
  }
  EXPORT_SYMBOL(xp_alloc);

* Re: the XSK buffer pool needs be to reverted
  2020-06-26 12:22 ` Björn Töpel
@ 2020-06-26 12:41   ` Christoph Hellwig
  2020-06-26 12:45     ` Björn Töpel
  0 siblings, 1 reply; 11+ messages in thread
From: Christoph Hellwig @ 2020-06-26 12:41 UTC (permalink / raw)
  To: Björn Töpel
  Cc: Maxim Mikityanskiy, Konrad Rzeszutek Wilk, netdev, linux-kernel,
	David S. Miller, iommu, bpf, Christoph Hellwig, Karlsson, Magnus

On Fri, Jun 26, 2020 at 02:22:41PM +0200, Björn Töpel wrote:
> Thanks for clarifying that. Let's work on a solution that can reside in
> the dma mapping core.
>
>> The commit seems to have a long dovetail of commits depending on it
>> despite only being a month old, so maybe you can do the revert for now?
>>
>
> Reverting the whole series sounds a bit too much. Let's focus on the
> part that breaks the dma api abstraction. I'm assuming that you're
> referring to the
>
>   static bool xp_check_cheap_dma(struct xsk_buff_pool *pool)
>
> function (and related functions called from that)?

Yes.

>
>> Note that this is somewhat urgent, as various of the APIs that the code
>> is abusing are slated to go away for Linux 5.9, so this addition comes
>> at a really bad time.
>>
>
> Understood. Wdyt about something along the lines of the diff below? It's
> build tested only, but removes all non-dma API usage ("poking
> internals"). Would that be a way forward, and then as a next step work
> on a solution that would give similar benefits, but something that would
> live in the dma mapping core?

Yes, that would solve the immediate issues.

* Re: the XSK buffer pool needs be to reverted
  2020-06-26 12:41   ` Christoph Hellwig
@ 2020-06-26 12:45     ` Björn Töpel
  0 siblings, 0 replies; 11+ messages in thread
From: Björn Töpel @ 2020-06-26 12:45 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Maxim Mikityanskiy, Konrad Rzeszutek Wilk, netdev, linux-kernel,
	iommu, bpf, David S. Miller, Karlsson, Magnus

On 2020-06-26 14:41, Christoph Hellwig wrote:
> On Fri, Jun 26, 2020 at 02:22:41PM +0200, Björn Töpel wrote:
[...]
>>
>> Understood. Wdyt about something along the lines of the diff below? It's
>> build tested only, but removes all non-dma API usage ("poking
>> internals"). Would that be a way forward, and then as a next step work
>> on a solution that would give similar benefits, but something that would
>> live in the dma mapping core?
> 
> Yes, that would solve the immediate issues.
> 

Good. I'll cook a proper patch and post it.


Björn

* Re: the XSK buffer pool needs be to reverted
  2020-06-26  7:47 the XSK buffer pool needs be to reverted Christoph Hellwig
  2020-06-26 12:22 ` Björn Töpel
@ 2020-06-26 20:54 ` Jonathan Lemon
  2020-06-27  7:02   ` Christoph Hellwig
  1 sibling, 1 reply; 11+ messages in thread
From: Jonathan Lemon @ 2020-06-26 20:54 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: netdev, iommu, Björn Töpel

On Fri, Jun 26, 2020 at 09:47:25AM +0200, Christoph Hellwig wrote:
>
> Note that this is somewhat urgent, as various of the APIs that the code
> is abusing are slated to go away for Linux 5.9, so this addition comes
> at a really bad time.

Could you elaborate on what is upcoming here?


Also, on a semi-related note, are there limitations on how many pages
can be left mapped by the iommu?  Some of the page pool work involves
leaving the pages mapped instead of constantly mapping/unmapping them.

On a heavily loaded box with iommu enabled, it seems that quite often
there is contention on the iova_lock.  Are there known issues in this
area?
-- 
Jonathan

* Re: the XSK buffer pool needs be to reverted
  2020-06-26 20:54 ` Jonathan Lemon
@ 2020-06-27  7:02   ` Christoph Hellwig
  2020-06-29 13:15     ` Robin Murphy
  0 siblings, 1 reply; 11+ messages in thread
From: Christoph Hellwig @ 2020-06-27  7:02 UTC (permalink / raw)
  To: Jonathan Lemon; +Cc: netdev, iommu, Björn Töpel, Christoph Hellwig

On Fri, Jun 26, 2020 at 01:54:12PM -0700, Jonathan Lemon wrote:
> On Fri, Jun 26, 2020 at 09:47:25AM +0200, Christoph Hellwig wrote:
> >
> > Note that this is somewhat urgent, as various of the APIs that the code
> > is abusing are slated to go away for Linux 5.9, so this addition comes
> > at a really bad time.
> 
> Could you elaborate on what is upcoming here?

Moving all these calls out of line, and adding a bypass flag to avoid
the indirect function call for IOMMUs in direct mapped mode.

> Also, on a semi-related note, are there limitations on how many pages
> can be left mapped by the iommu?  Some of the page pool work involves
> leaving the pages mapped instead of constantly mapping/unmapping them.

There are, but I think for all modern IOMMUs they are so big that they
don't matter.  Maintainers of the individual IOMMU drivers might know
more.

> On a heavily loaded box with iommu enabled, it seems that quite often
> there is contention on the iova_lock.  Are there known issues in this
> area?

I'll have to defer to the IOMMU maintainers, and for that you'll need
to say what code you are using.  Current mainline doesn't even have
an iova_lock anywhere.

* Re: the XSK buffer pool needs be to reverted
  2020-06-27  7:02   ` Christoph Hellwig
@ 2020-06-29 13:15     ` Robin Murphy
  2020-06-30 19:08       ` Jonathan Lemon
  0 siblings, 1 reply; 11+ messages in thread
From: Robin Murphy @ 2020-06-29 13:15 UTC (permalink / raw)
  To: Christoph Hellwig, Jonathan Lemon; +Cc: netdev, iommu, Björn Töpel

On 2020-06-27 08:02, Christoph Hellwig wrote:
> On Fri, Jun 26, 2020 at 01:54:12PM -0700, Jonathan Lemon wrote:
>> On Fri, Jun 26, 2020 at 09:47:25AM +0200, Christoph Hellwig wrote:
>>>
>>> Note that this is somewhat urgent, as various of the APIs that the code
>>> is abusing are slated to go away for Linux 5.9, so this addition comes
>>> at a really bad time.
>>
>> Could you elaborate on what is upcoming here?
> 
> Moving all these calls out of line, and adding a bypass flag to avoid
> the indirect function call for IOMMUs in direct mapped mode.
> 
>> Also, on a semi-related note, are there limitations on how many pages
>> can be left mapped by the iommu?  Some of the page pool work involves
>> leaving the pages mapped instead of constantly mapping/unmapping them.
> 
> There are, but I think for all modern IOMMUs they are so big that they
> don't matter.  Maintainers of the individual IOMMU drivers might know
> more.

Right - I don't know too much about older and more esoteric stuff like 
POWER TCE, but for modern pagetable-based stuff like Intel VT-d, AMD-Vi, 
and Arm SMMU, the only "limits" are such that legitimate DMA API use 
should never get anywhere near them (you'd run out of RAM for actual 
buffers long beforehand). The most vaguely-realistic concern might be a 
pathological system topology where some old 32-bit PCI device doesn't 
have ACS isolation from your high-performance NIC such that they have to 
share an address space, where the NIC might happen to steal all the low 
addresses and prevent the soundcard or whatever from being able to map a 
usable buffer.

With an IOMMU, you typically really *want* to keep a full working set's 
worth of pages mapped, since dma_map/unmap are expensive while dma_sync 
is somewhere between relatively cheap and free. With no IOMMU it makes 
no real difference from the DMA API perspective since map/unmap are 
effectively no more than the equivalent sync operations anyway (I'm 
assuming we're not talking about the kind of constrained hardware that 
might need SWIOTLB).

>> On a heavily loaded box with iommu enabled, it seems that quite often
>> there is contention on the iova_lock.  Are there known issues in this
>> area?
> 
> I'll have to defer to the IOMMU maintainers, and for that you'll need
> to say what code you are using.  Current mainline doesn't even have
> an iova_lock anywhere.

Again I can't speak for non-mainstream stuff outside drivers/iommu, but 
it's been over 4 years now since merging the initial scalability work 
for the generic IOVA allocator there that focused on minimising lock 
contention, and it's had considerable evaluation and tweaking since. But 
if we can achieve the goal of efficiently recycling mapped buffers then 
we shouldn't need to go anywhere near IOVA allocation either way except 
when expanding the pool.

Robin.

* Re: the XSK buffer pool needs be to reverted
  2020-06-29 13:15     ` Robin Murphy
@ 2020-06-30 19:08       ` Jonathan Lemon
  2020-07-01  9:46         ` Robin Murphy
  0 siblings, 1 reply; 11+ messages in thread
From: Jonathan Lemon @ 2020-06-30 19:08 UTC (permalink / raw)
  To: Robin Murphy; +Cc: netdev, iommu, Björn Töpel, Christoph Hellwig

On Mon, Jun 29, 2020 at 02:15:16PM +0100, Robin Murphy wrote:
> On 2020-06-27 08:02, Christoph Hellwig wrote:
> > On Fri, Jun 26, 2020 at 01:54:12PM -0700, Jonathan Lemon wrote:
> > > On Fri, Jun 26, 2020 at 09:47:25AM +0200, Christoph Hellwig wrote:
> > > > 
> > > > Note that this is somewhat urgent, as various of the APIs that the code
> > > > is abusing are slated to go away for Linux 5.9, so this addition comes
> > > > at a really bad time.
> > > 
> > > Could you elaborate on what is upcoming here?
> > 
> > Moving all these calls out of line, and adding a bypass flag to avoid
> > the indirect function call for IOMMUs in direct mapped mode.
> > 
> > > Also, on a semi-related note, are there limitations on how many pages
> > > can be left mapped by the iommu?  Some of the page pool work involves
> > > leaving the pages mapped instead of constantly mapping/unmapping them.
> > 
> > There are, but I think for all modern IOMMUs they are so big that they
> > don't matter.  Maintainers of the individual IOMMU drivers might know
> > more.
> 
> Right - I don't know too much about older and more esoteric stuff like POWER
> TCE, but for modern pagetable-based stuff like Intel VT-d, AMD-Vi, and Arm
> SMMU, the only "limits" are such that legitimate DMA API use should never
> get anywhere near them (you'd run out of RAM for actual buffers long
> beforehand). The most vaguely-realistic concern might be a pathological
> system topology where some old 32-bit PCI device doesn't have ACS isolation
> from your high-performance NIC such that they have to share an address
> space, where the NIC might happen to steal all the low addresses and prevent
> the soundcard or whatever from being able to map a usable buffer.
> 
> With an IOMMU, you typically really *want* to keep a full working set's
> worth of pages mapped, since dma_map/unmap are expensive while dma_sync is
> somewhere between relatively cheap and free. With no IOMMU it makes no real
> difference from the DMA API perspective since map/unmap are effectively no
> more than the equivalent sync operations anyway (I'm assuming we're not
> talking about the kind of constrained hardware that might need SWIOTLB).
> 
> > > On a heavily loaded box with iommu enabled, it seems that quite often
> > > there is contention on the iova_lock.  Are there known issues in this
> > > area?
> > 
> > I'll have to defer to the IOMMU maintainers, and for that you'll need
> > to say what code you are using.  Current mainline doesn't even have
> > an iova_lock anywhere.
> 
> Again I can't speak for non-mainstream stuff outside drivers/iommu, but it's
> been over 4 years now since merging the initial scalability work for the
> generic IOVA allocator there that focused on minimising lock contention, and
> it's had considerable evaluation and tweaking since. But if we can achieve
> the goal of efficiently recycling mapped buffers then we shouldn't need to
> go anywhere near IOVA allocation either way except when expanding the pool.


I'm running a set of patches which uses the page pool to try and keep
all the RX buffers mapped as the skb goes up the stack, returning the
pages to the pool when the skb is freed.

On a dual-socket 12-core Intel machine (48 processors), and 256G of
memory, when iommu is enabled, I see the following from 'perf top -U',
as the hottest function being run:

-   43.42%  worker      [k] queued_spin_lock_slowpath
   - 43.42% queued_spin_lock_slowpath
      - 41.69% _raw_spin_lock_irqsave
         + 41.39% alloc_iova 
         + 0.28% iova_magazine_free_pfns
      + 1.07% lock_sock_nested

Which likely is heavy contention on the iovad->iova_rbtree_lock.
(This is on a 5.6 based system, BTW).  More scripts and data are below.
Is there a way to reduce the contention here?



The following quick and dirty [and possibly wrong] .bpf script was used
to try and find the time spent in __alloc_and_insert_iova_range():

kprobe:alloc_iova_fast
{
        @fast = count();
}

kprobe:alloc_iova
{
        @iova_start[tid] = nsecs;
        @iova = count();
}

kretprobe:alloc_iova / @iova_start[tid] /
{
        @alloc_h = hist(nsecs - @iova_start[tid] - @mem_delta[tid]);
        delete(@iova_start[tid]);
        delete(@mem_delta[tid]);
}

kprobe:alloc_iova_mem / @iova_start[tid] /
{
        @mem_start[tid] = nsecs;
}

kretprobe:alloc_iova_mem / @mem_start[tid] /
{
        @mem_delta[tid] = nsecs - @mem_start[tid];
        delete(@mem_start[tid]);
}

kprobe:iova_insert_rbtree / @iova_start[tid] /
{
        @rb_start[tid] = nsecs;
        @rbtree = count();
}

kretprobe:iova_insert_rbtree / @rb_start[tid] /
{
        @insert_h = hist(nsecs - @rb_start[tid]);
        delete(@rb_start[tid]);
}

interval:s:2
{
        print(@fast);
        print(@iova);
        print(@rbtree);
        print(@alloc_h);
        print(@insert_h);
        printf("--------\n");
}

I see the following results.

@fast: 1989223
@iova: 725269
@rbtree: 689306

@alloc_h:
[64, 128)              2 |                                                    |
[128, 256)           118 |                                                    |
[256, 512)           983 |                                                    |
[512, 1K)           3816 |@@                                                  |
[1K, 2K)           10557 |@@@@@@                                              |
[2K, 4K)           19540 |@@@@@@@@@@@@                                        |
[4K, 8K)           31294 |@@@@@@@@@@@@@@@@@@@                                 |
[8K, 16K)          38112 |@@@@@@@@@@@@@@@@@@@@@@@                             |
[16K, 32K)         46948 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@                        |
[32K, 64K)         69728 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@         |
[64K, 128K)        83797 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
[128K, 256K)       84317 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[256K, 512K)       82962 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
[512K, 1M)         72751 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@        |
[1M, 2M)           49191 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                      |
[2M, 4M)           26591 |@@@@@@@@@@@@@@@@                                    |
[4M, 8M)           15559 |@@@@@@@@@                                           |
[8M, 16M)          12283 |@@@@@@@                                             |
[16M, 32M)         18266 |@@@@@@@@@@@                                         |
[32M, 64M)         22539 |@@@@@@@@@@@@@                                       |
[64M, 128M)         3005 |@                                                   |
[128M, 256M)          41 |                                                    |
[256M, 512M)           0 |                                                    |
[512M, 1G)             0 |                                                    |
[1G, 2G)               0 |                                                    |
[2G, 4G)             101 |                                                    |

@insert_h:
[128, 256)          2380 |                                                    |
[256, 512)         70043 |@@@@@@@@                                            |
[512, 1K)         431263 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[1K, 2K)          182804 |@@@@@@@@@@@@@@@@@@@@@@                              |
[2K, 4K)            2742 |                                                    |
[4K, 8K)              43 |                                                    |
[8K, 16K)             25 |                                                    |
[16K, 32K)             0 |                                                    |
[32K, 64K)             0 |                                                    |
[64K, 128K)            0 |                                                    |
[128K, 256K)           6 |                                                    |


-- 
Jonathan

* Re: the XSK buffer pool needs be to reverted
  2020-06-30 19:08       ` Jonathan Lemon
@ 2020-07-01  9:46         ` Robin Murphy
  2020-07-06 19:59           ` Jonathan Lemon
  0 siblings, 1 reply; 11+ messages in thread
From: Robin Murphy @ 2020-07-01  9:46 UTC (permalink / raw)
  To: Jonathan Lemon; +Cc: netdev, iommu, Björn Töpel, Christoph Hellwig

On 2020-06-30 20:08, Jonathan Lemon wrote:
> On Mon, Jun 29, 2020 at 02:15:16PM +0100, Robin Murphy wrote:
>> On 2020-06-27 08:02, Christoph Hellwig wrote:
>>> On Fri, Jun 26, 2020 at 01:54:12PM -0700, Jonathan Lemon wrote:
>>>> On Fri, Jun 26, 2020 at 09:47:25AM +0200, Christoph Hellwig wrote:
>>>>>
>>>>> Note that this is somewhat urgent, as various of the APIs that the code
>>>>> is abusing are slated to go away for Linux 5.9, so this addition comes
>>>>> at a really bad time.
>>>>
>>>> Could you elaborate on what is upcoming here?
>>>
>>> Moving all these calls out of line, and adding a bypass flag to avoid
>>> the indirect function call for IOMMUs in direct mapped mode.
>>>
>>>> Also, on a semi-related note, are there limitations on how many pages
>>>> can be left mapped by the iommu?  Some of the page pool work involves
>>>> leaving the pages mapped instead of constantly mapping/unmapping them.
>>>
>>> There are, but I think for all modern IOMMUs they are so big that they
>>> don't matter.  Maintainers of the individual IOMMU drivers might know
>>> more.
>>
>> Right - I don't know too much about older and more esoteric stuff like POWER
>> TCE, but for modern pagetable-based stuff like Intel VT-d, AMD-Vi, and Arm
>> SMMU, the only "limits" are such that legitimate DMA API use should never
>> get anywhere near them (you'd run out of RAM for actual buffers long
>> beforehand). The most vaguely-realistic concern might be a pathological
>> system topology where some old 32-bit PCI device doesn't have ACS isolation
>> from your high-performance NIC such that they have to share an address
>> space, where the NIC might happen to steal all the low addresses and prevent
>> the soundcard or whatever from being able to map a usable buffer.
>>
>> With an IOMMU, you typically really *want* to keep a full working set's
>> worth of pages mapped, since dma_map/unmap are expensive while dma_sync is
>> somewhere between relatively cheap and free. With no IOMMU it makes no real
>> difference from the DMA API perspective since map/unmap are effectively no
>> more than the equivalent sync operations anyway (I'm assuming we're not
>> talking about the kind of constrained hardware that might need SWIOTLB).
>>
>>>> On a heavily loaded box with iommu enabled, it seems that quite often
>>>> there is contention on the iova_lock.  Are there known issues in this
>>>> area?
>>>
>>> I'll have to defer to the IOMMU maintainers, and for that you'll need
>>> to say what code you are using.  Current mainline doesn't even have
>>> an iova_lock anywhere.
>>
>> Again I can't speak for non-mainstream stuff outside drivers/iommu, but it's
>> been over 4 years now since merging the initial scalability work for the
>> generic IOVA allocator there that focused on minimising lock contention, and
>> it's had considerable evaluation and tweaking since. But if we can achieve
>> the goal of efficiently recycling mapped buffers then we shouldn't need to
>> go anywhere near IOVA allocation either way except when expanding the pool.
> 
> 
> I'm running a set of patches which uses the page pool to try and keep
> all the RX buffers mapped as the skb goes up the stack, returning the
> pages to the pool when the skb is freed.
> 
> On a dual-socket 12-core Intel machine (48 processors), and 256G of
> memory, when iommu is enabled, I see the following from 'perf top -U',
> as the hottest function being run:
> 
> -   43.42%  worker      [k] queued_spin_lock_slowpath
>     - 43.42% queued_spin_lock_slowpath
>        - 41.69% _raw_spin_lock_irqsave
>           + 41.39% alloc_iova
>           + 0.28% iova_magazine_free_pfns
>        + 1.07% lock_sock_nested
> 
> Which likely is heavy contention on the iovad->iova_rbtree_lock.
> (This is on a 5.6 based system, BTW).  More scripts and data are below.
> Is there a way to reduce the contention here?

Hmm, how big are your DMA mappings? If you're still hitting the rbtree a 
lot, that most likely implies that either you're making giant IOVA 
allocations that are too big to be cached, or you're allocating/freeing 
IOVAs in a pathological pattern that defeats the whole magazine cache 
mechanism (It's optimised for relatively-balanced allocation and freeing 
of sizes up to order 6). On a further hunch, does the 
"intel_iommu=forcedac" option make any difference at all?

Either way if this persists after some initial warm-up period, it 
further implies that the page pool is not doing its job properly (or at 
least in the way I would have expected). The alloc_iova() call is part 
of the dma_map_*() overhead, and if the aim is to keep pages mapped then 
that should only be called relatively infrequently. The optimal 
behaviour would be to dma_map() new clean pages as they are added to the 
pool, use dma_sync() when they are claimed and returned by the driver, 
and only dma_unmap() if they're actually freed back to the page 
allocator. And if you're still seeing a lot of dma_map/unmap time after 
that, then the pool itself is churning pages and clearly needs its 
size/thresholds tuning.

Robin.

> 
> @insert_h:
> [128, 256)          2380 |                                                    |
> [256, 512)         70043 |@@@@@@@@                                            |
> [512, 1K)         431263 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> [1K, 2K)          182804 |@@@@@@@@@@@@@@@@@@@@@@                              |
> [2K, 4K)            2742 |                                                    |
> [4K, 8K)              43 |                                                    |
> [8K, 16K)             25 |                                                    |
> [16K, 32K)             0 |                                                    |
> [32K, 64K)             0 |                                                    |
> [64K, 128K)            0 |                                                    |
> [128K, 256K)           6 |                                                    |
> 
> 
_______________________________________________
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: the XSK buffer pool needs be to reverted
  2020-07-01  9:46         ` Robin Murphy
@ 2020-07-06 19:59           ` Jonathan Lemon
  2020-07-07 17:35             ` Robin Murphy
  0 siblings, 1 reply; 11+ messages in thread
From: Jonathan Lemon @ 2020-07-06 19:59 UTC (permalink / raw)
  To: Robin Murphy; +Cc: netdev, iommu, Björn Töpel, Christoph Hellwig

On Wed, Jul 01, 2020 at 10:46:40AM +0100, Robin Murphy wrote:
> On 2020-06-30 20:08, Jonathan Lemon wrote:
> > On Mon, Jun 29, 2020 at 02:15:16PM +0100, Robin Murphy wrote:
> > > On 2020-06-27 08:02, Christoph Hellwig wrote:
> > > > On Fri, Jun 26, 2020 at 01:54:12PM -0700, Jonathan Lemon wrote:
> > > > > On Fri, Jun 26, 2020 at 09:47:25AM +0200, Christoph Hellwig wrote:
> > > > > > 
> > > > > > Note that this is somewhat urgent, as various of the APIs that the code
> > > > > > is abusing are slated to go away for Linux 5.9, so this addition comes
> > > > > > at a really bad time.
> > > > > 
> > > > > Could you elaborate on what is upcoming here?
> > > > 
> > > > Moving all these calls out of line, and adding a bypass flag to avoid
> > > > the indirect function call for IOMMUs in direct mapped mode.
> > > > 
> > > > > Also, on a semi-related note, are there limitations on how many pages
> > > > > can be left mapped by the iommu?  Some of the page pool work involves
> > > > > leaving the pages mapped instead of constantly mapping/unmapping them.
> > > > 
> > > > There are, but I think for all modern IOMMUs they are so big that they
> > > > don't matter.  Maintainers of the individual IOMMU drivers might know
> > > > more.
> > > 
> > > Right - I don't know too much about older and more esoteric stuff like POWER
> > > TCE, but for modern pagetable-based stuff like Intel VT-d, AMD-Vi, and Arm
> > > SMMU, the only "limits" are such that legitimate DMA API use should never
> > > get anywhere near them (you'd run out of RAM for actual buffers long
> > > beforehand). The most vaguely-realistic concern might be a pathological
> > > system topology where some old 32-bit PCI device doesn't have ACS isolation
> > > from your high-performance NIC such that they have to share an address
> > > space, where the NIC might happen to steal all the low addresses and prevent
> > > the soundcard or whatever from being able to map a usable buffer.
> > > 
> > > With an IOMMU, you typically really *want* to keep a full working set's
> > > worth of pages mapped, since dma_map/unmap are expensive while dma_sync is
> > > somewhere between relatively cheap and free. With no IOMMU it makes no real
> > > difference from the DMA API perspective since map/unmap are effectively no
> > > more than the equivalent sync operations anyway (I'm assuming we're not
> > > talking about the kind of constrained hardware that might need SWIOTLB).
> > > 
> > > > > On a heavily loaded box with iommu enabled, it seems that quite often
> > > > > there is contention on the iova_lock.  Are there known issues in this
> > > > > area?
> > > > 
> > > > I'll have to defer to the IOMMU maintainers, and for that you'll need
> > > > to say what code you are using.  Current mainline doesn't even have
> > > > an iova_lock anywhere.
> > > 
> > > Again I can't speak for non-mainstream stuff outside drivers/iommu, but it's
> > > been over 4 years now since merging the initial scalability work for the
> > > generic IOVA allocator there that focused on minimising lock contention, and
> > > it's had considerable evaluation and tweaking since. But if we can achieve
> > > the goal of efficiently recycling mapped buffers then we shouldn't need to
> > > go anywhere near IOVA allocation either way except when expanding the pool.
> > 
> > 
> > I'm running a set of patches which uses the page pool to try and keep
> > all the RX buffers mapped as the skb goes up the stack, returning the
> > pages to the pool when the skb is freed.
> > 
> > On a dual-socket 12-core Intel machine (48 processors), and 256G of
> > memory, when iommu is enabled, I see the following from 'perf top -U',
> > as the hottest function being run:
> > 
> > -   43.42%  worker      [k] queued_spin_lock_slowpath
> >     - 43.42% queued_spin_lock_slowpath
> >        - 41.69% _raw_spin_lock_irqsave
> >           + 41.39% alloc_iova
> >           + 0.28% iova_magazine_free_pfns
> >        + 1.07% lock_sock_nested
> > 
> > Which likely indicates heavy contention on the iovad->iova_rbtree_lock.
> > (This is on a 5.6 based system, BTW).  More scripts and data are below.
> > Is there a way to reduce the contention here?
> 
> Hmm, how big are your DMA mappings? If you're still hitting the rbtree a
> lot, that most likely implies that either you're making giant IOVA
> allocations that are too big to be cached, or you're allocating/freeing
> IOVAs in a pathological pattern that defeats the whole magazine cache
> mechanism (It's optimised for relatively-balanced allocation and freeing of
> sizes up to order 6). On a further hunch, does the "intel_iommu=forcedac"
> option make any difference at all?

The allocations are only 4K in size (packet memory) but there are a lot
of them.  I tried running with "forcedac" over the weekend, and that
seems to have made a huge difference.  Why the name 'forcedac'?  It
would seem this should be the default for a system with a sane 64-bit PCI
device.

 
> Either way if this persists after some initial warm-up period, it further
> implies that the page pool is not doing its job properly (or at least in the
> way I would have expected). The alloc_iova() call is part of the dma_map_*()
> overhead, and if the aim is to keep pages mapped then that should only be
> called relatively infrequently. The optimal behaviour would be to dma_map()
> new clean pages as they are added to the pool, use dma_sync() when they are
> claimed and returned by the driver, and only dma_unmap() if they're actually
> freed back to the page allocator. And if you're still seeing a lot of
> dma_map/unmap time after that, then the pool itself is churning pages and
> clearly needs its size/thresholds tuning.

Most of the dma_map() overhead is coming from the TX path, which isn't
using the page_pool.

With the RX datapath I'm using, pages are returned to the pool in a
non-napi context (so they enter the ring), then the driver refills the
cache from the ring.  I'm seeing a 1% overflow (ring_full), so the
default size seems appropriate.  The single ring lock is likely a
problem, but can be addressed now that things aren't blowing up in
alloc_iova().

From a box with the mlx4 driver + persistent page pool:

     rx_alloc_pages: 35239524767
     cache_hit: 34381377143
     cache_full: 19935
     cache_empty: 859266494
     ring_consume: 31665993135
     ring_produce: 31666119491
     ring_full: 353805844

-- 
Jonathan


* Re: the XSK buffer pool needs be to reverted
  2020-07-06 19:59           ` Jonathan Lemon
@ 2020-07-07 17:35             ` Robin Murphy
  0 siblings, 0 replies; 11+ messages in thread
From: Robin Murphy @ 2020-07-07 17:35 UTC (permalink / raw)
  To: Jonathan Lemon; +Cc: netdev, iommu, Björn Töpel, Christoph Hellwig

On 2020-07-06 20:59, Jonathan Lemon wrote:
> On Wed, Jul 01, 2020 at 10:46:40AM +0100, Robin Murphy wrote:
>> On 2020-06-30 20:08, Jonathan Lemon wrote:
>>> On Mon, Jun 29, 2020 at 02:15:16PM +0100, Robin Murphy wrote:
>>>> On 2020-06-27 08:02, Christoph Hellwig wrote:
>>>>> On Fri, Jun 26, 2020 at 01:54:12PM -0700, Jonathan Lemon wrote:
>>>>>> On Fri, Jun 26, 2020 at 09:47:25AM +0200, Christoph Hellwig wrote:
>>>>>>>
>>>>>>> Note that this is somewhat urgent, as various of the APIs that the code
>>>>>>> is abusing are slated to go away for Linux 5.9, so this addition comes
>>>>>>> at a really bad time.
>>>>>>
>>>>>> Could you elaborate on what is upcoming here?
>>>>>
>>>>> Moving all these calls out of line, and adding a bypass flag to avoid
>>>>> the indirect function call for IOMMUs in direct mapped mode.
>>>>>
>>>>>> Also, on a semi-related note, are there limitations on how many pages
>>>>>> can be left mapped by the iommu?  Some of the page pool work involves
>>>>>> leaving the pages mapped instead of constantly mapping/unmapping them.
>>>>>
>>>>> There are, but I think for all modern IOMMUs they are so big that they
>>>>> don't matter.  Maintainers of the individual IOMMU drivers might know
>>>>> more.
>>>>
>>>> Right - I don't know too much about older and more esoteric stuff like POWER
>>>> TCE, but for modern pagetable-based stuff like Intel VT-d, AMD-Vi, and Arm
>>>> SMMU, the only "limits" are such that legitimate DMA API use should never
>>>> get anywhere near them (you'd run out of RAM for actual buffers long
>>>> beforehand). The most vaguely-realistic concern might be a pathological
>>>> system topology where some old 32-bit PCI device doesn't have ACS isolation
>>>> from your high-performance NIC such that they have to share an address
>>>> space, where the NIC might happen to steal all the low addresses and prevent
>>>> the soundcard or whatever from being able to map a usable buffer.
>>>>
>>>> With an IOMMU, you typically really *want* to keep a full working set's
>>>> worth of pages mapped, since dma_map/unmap are expensive while dma_sync is
>>>> somewhere between relatively cheap and free. With no IOMMU it makes no real
>>>> difference from the DMA API perspective since map/unmap are effectively no
>>>> more than the equivalent sync operations anyway (I'm assuming we're not
>>>> talking about the kind of constrained hardware that might need SWIOTLB).
>>>>
>>>>>> On a heavily loaded box with iommu enabled, it seems that quite often
>>>>>> there is contention on the iova_lock.  Are there known issues in this
>>>>>> area?
>>>>>
>>>>> I'll have to defer to the IOMMU maintainers, and for that you'll need
>>>>> to say what code you are using.  Current mainline doesn't even have
>>>>> an iova_lock anywhere.
>>>>
>>>> Again I can't speak for non-mainstream stuff outside drivers/iommu, but it's
>>>> been over 4 years now since merging the initial scalability work for the
>>>> generic IOVA allocator there that focused on minimising lock contention, and
>>>> it's had considerable evaluation and tweaking since. But if we can achieve
>>>> the goal of efficiently recycling mapped buffers then we shouldn't need to
>>>> go anywhere near IOVA allocation either way except when expanding the pool.
>>>
>>>
>>> I'm running a set of patches which uses the page pool to try and keep
>>> all the RX buffers mapped as the skb goes up the stack, returning the
>>> pages to the pool when the skb is freed.
>>>
>>> On a dual-socket 12-core Intel machine (48 processors), and 256G of
>>> memory, when iommu is enabled, I see the following from 'perf top -U',
>>> as the hottest function being run:
>>>
>>> -   43.42%  worker      [k] queued_spin_lock_slowpath
>>>      - 43.42% queued_spin_lock_slowpath
>>>         - 41.69% _raw_spin_lock_irqsave
>>>            + 41.39% alloc_iova
>>>            + 0.28% iova_magazine_free_pfns
>>>         + 1.07% lock_sock_nested
>>>
>>> Which likely indicates heavy contention on the iovad->iova_rbtree_lock.
>>> (This is on a 5.6 based system, BTW).  More scripts and data are below.
>>> Is there a way to reduce the contention here?
>>
>> Hmm, how big are your DMA mappings? If you're still hitting the rbtree a
>> lot, that most likely implies that either you're making giant IOVA
>> allocations that are too big to be cached, or you're allocating/freeing
>> IOVAs in a pathological pattern that defeats the whole magazine cache
>> mechanism (It's optimised for relatively-balanced allocation and freeing of
>> sizes up to order 6). On a further hunch, does the "intel_iommu=forcedac"
>> option make any difference at all?
> 
> The allocations are only 4K in size (packet memory) but there are a lot
> of them.  I tried running with "forcedac" over the weekend, and that
> seems to have made a huge difference.  Why the name 'forcedac'?  It
> would seem this should be the default for a system with a sane 64-bit PCI
> device.

OK, I think I can imagine what's happening there - I could elaborate if 
you're particularly interested, but suffice to say that despite all the 
tricks we've put in to avoid spending unnecessary time down that path 
wherever possible, there are still ways the allocator could behave 
pathologically once the low end below 32 bits gets (nearly) full. As far 
as that command-line option goes, I just happened to notice that it 
already existed and has the same effect as the hack I was going to 
suggest ;)

Either way, though, I'm inclined to agree that it might be time to stop 
applying that particular "optimisation" indiscriminately - I'll cook up 
some patches.

>> Either way if this persists after some initial warm-up period, it further
>> implies that the page pool is not doing its job properly (or at least in the
>> way I would have expected). The alloc_iova() call is part of the dma_map_*()
>> overhead, and if the aim is to keep pages mapped then that should only be
>> called relatively infrequently. The optimal behaviour would be to dma_map()
>> new clean pages as they are added to the pool, use dma_sync() when they are
>> claimed and returned by the driver, and only dma_unmap() if they're actually
>> freed back to the page allocator. And if you're still seeing a lot of
>> dma_map/unmap time after that, then the pool itself is churning pages and
>> clearly needs its size/thresholds tuning.
> 
> Most of the dma_map() overhead is coming from the TX path, which isn't
> using the page_pool.

Ah, OK, I guess I assumed the TX path might copy into preallocated DMA 
buffers such that it would use a pool too. Fair enough if it doesn't.
(Although perhaps it's still worth noting that IOMMUs are liable to skew 
the threshold for zero-copy benefits in general.)

Robin.

> With the RX datapath I'm using, pages are returned to the pool in a
> non-napi context (so they enter the ring), then the driver refills the
> cache from the ring.  I'm seeing a 1% overflow (ring_full), so the
> default size seems appropriate.  The single ring lock is likely a
> problem, but can be addressed now that things aren't blowing up in
> alloc_iova().
> 
>  From a box with the mlx4 driver + persistent page pool:
> 
>       rx_alloc_pages: 35239524767
>       cache_hit: 34381377143
>       cache_full: 19935
>       cache_empty: 859266494
>       ring_consume: 31665993135
>       ring_produce: 31666119491
>       ring_full: 353805844
> 


end of thread, other threads:[~2020-07-07 17:36 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-06-26  7:47 the XSK buffer pool needs be to reverted Christoph Hellwig
2020-06-26 12:22 ` Björn Töpel
2020-06-26 12:41   ` Christoph Hellwig
2020-06-26 12:45     ` Björn Töpel
2020-06-26 20:54 ` Jonathan Lemon
2020-06-27  7:02   ` Christoph Hellwig
2020-06-29 13:15     ` Robin Murphy
2020-06-30 19:08       ` Jonathan Lemon
2020-07-01  9:46         ` Robin Murphy
2020-07-06 19:59           ` Jonathan Lemon
2020-07-07 17:35             ` Robin Murphy
