All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH] net: allow configuration of the size of page in __netdev_alloc_frag
@ 2012-10-24 11:42 Ian Campbell
  2012-10-24 12:28 ` Eric Dumazet
                   ` (3 more replies)
  0 siblings, 4 replies; 25+ messages in thread
From: Ian Campbell @ 2012-10-24 11:42 UTC (permalink / raw)
  To: netdev; +Cc: Ian Campbell, Eric Dumazet, Konrad Rzeszutek Wilk, xen-devel

Commit 69b08f62e174 ("net: use bigger pages in __netdev_alloc_frag")
led to 70%+ packet loss under Xen when transmitting from physical (as
opposed to virtual) network devices.

This is because under Xen pages which are contiguous in the physical
address space may not be contiguous in the DMA space; in fact it is
very likely that they are not. I think there are other architectures
where this is true, although perhaps none quite so aggressive as to
have this property at per-order-0-page granularity.

The real underlying bug here most likely lies in the swiotlb not
correctly handling compound pages, and Konrad is investigating it.
However, even with the swiotlb issue fixed, the current arrangement
is likely to result in a lot of bounce buffering, which would more
than offset any benefit from the use of larger pages.

Therefore, make NETDEV_FRAG_PAGE_MAX_ORDER configurable at runtime and
use this to request order-0 frags under Xen. Also expose the setting
via sysctl.

Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: netdev@vger.kernel.org
Cc: xen-devel@lists.xen.org
---
 arch/x86/xen/setup.c       |    7 +++++++
 include/linux/skbuff.h     |    2 ++
 net/core/skbuff.c          |    7 ++++---
 net/core/sysctl_net_core.c |    7 +++++++
 4 files changed, 20 insertions(+), 3 deletions(-)

diff --git a/arch/x86/xen/setup.c b/arch/x86/xen/setup.c
index 8971a26..ad14d46 100644
--- a/arch/x86/xen/setup.c
+++ b/arch/x86/xen/setup.c
@@ -11,6 +11,7 @@
 #include <linux/memblock.h>
 #include <linux/cpuidle.h>
 #include <linux/cpufreq.h>
+#include <linux/skbuff.h>
 
 #include <asm/elf.h>
 #include <asm/vdso.h>
@@ -555,6 +556,12 @@ void __init xen_arch_setup(void)
 	       MAX_GUEST_CMDLINE > COMMAND_LINE_SIZE ?
 	       COMMAND_LINE_SIZE : MAX_GUEST_CMDLINE);
 
+	/*
+	 * Xen cannot handle DMA to/from compound pages so avoid
+	 * bounce buffering by not allocating large network frags.
+	 */
+	netdev_frag_page_max_order = 0;
+
 	/* Set up idle, making sure it calls safe_halt() pvop */
 #ifdef CONFIG_X86_32
 	boot_cpu_data.hlt_works_ok = 1;
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 6a2c34e..a3a748f 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -1719,6 +1719,8 @@ static inline void __skb_queue_purge(struct sk_buff_head *list)
 		kfree_skb(skb);
 }
 
+extern int netdev_frag_page_max_order;
+
 extern void *netdev_alloc_frag(unsigned int fragsz);
 
 extern struct sk_buff *__netdev_alloc_skb(struct net_device *dev,
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 6e04b1f..88cbe5f 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -348,8 +348,9 @@ struct netdev_alloc_cache {
 };
 static DEFINE_PER_CPU(struct netdev_alloc_cache, netdev_alloc_cache);
 
-#define NETDEV_FRAG_PAGE_MAX_ORDER get_order(32768)
-#define NETDEV_FRAG_PAGE_MAX_SIZE  (PAGE_SIZE << NETDEV_FRAG_PAGE_MAX_ORDER)
+int netdev_frag_page_max_order __read_mostly = get_order(32768);
+
+#define NETDEV_FRAG_PAGE_MAX_SIZE  (PAGE_SIZE << netdev_frag_page_max_order)
 #define NETDEV_PAGECNT_MAX_BIAS	   NETDEV_FRAG_PAGE_MAX_SIZE
 
 static void *__netdev_alloc_frag(unsigned int fragsz, gfp_t gfp_mask)
@@ -363,7 +364,7 @@ static void *__netdev_alloc_frag(unsigned int fragsz, gfp_t gfp_mask)
 	nc = &__get_cpu_var(netdev_alloc_cache);
 	if (unlikely(!nc->frag.page)) {
 refill:
-		for (order = NETDEV_FRAG_PAGE_MAX_ORDER; ;) {
+		for (order = netdev_frag_page_max_order; ;) {
 			gfp_t gfp = gfp_mask;
 
 			if (order)
diff --git a/net/core/sysctl_net_core.c b/net/core/sysctl_net_core.c
index a7c3684..e5ab6df 100644
--- a/net/core/sysctl_net_core.c
+++ b/net/core/sysctl_net_core.c
@@ -129,6 +129,13 @@ static struct ctl_table net_core_table[] = {
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec
 	},
+	{
+		.procname	= "netdev_frag_page_max_order",
+		.data		= &netdev_frag_page_max_order,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec
+	},
 #ifdef CONFIG_BPF_JIT
 	{
 		.procname	= "bpf_jit_enable",
-- 
1.7.2.5

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [PATCH] net: allow configuration of the size of page in __netdev_alloc_frag
  2012-10-24 11:42 [PATCH] net: allow configuration of the size of page in __netdev_alloc_frag Ian Campbell
  2012-10-24 12:28 ` Eric Dumazet
@ 2012-10-24 12:28 ` Eric Dumazet
  2012-10-24 13:16   ` Ian Campbell
  2012-10-24 13:16   ` Ian Campbell
  2012-10-24 18:19 ` David Miller
  2012-10-24 18:19 ` David Miller
  3 siblings, 2 replies; 25+ messages in thread
From: Eric Dumazet @ 2012-10-24 12:28 UTC (permalink / raw)
  To: Ian Campbell; +Cc: netdev, Eric Dumazet, Konrad Rzeszutek Wilk, xen-devel

On Wed, 2012-10-24 at 12:42 +0100, Ian Campbell wrote:
> The commit 69b08f62e174 "net: use bigger pages in __netdev_alloc_frag"
> lead to 70%+ packet loss under Xen when transmitting from physical (as
> opposed to virtual) network devices.
> 
> This is because under Xen pages which are contiguous in the physical
> address space may not be contiguous in the DMA space, in fact it is
> very likely that they are not. I think there are other architectures
> where this is true, although perhaps non quite so aggressive as to
> have this property at a per-order-0-page granularity.
> 
> The real underlying bug here most likely lies in the swiotlb not
> correctly handling compound pages, and Konrad is investigating this.
> However even with the swiotlb issue fixed the current arrangement
> seems likely to result in a lot of bounce buffering which seems likely
> to more than offset any benefit from the use of larger pages.
> 
> Therefore make NETDEV_FRAG_PAGE_MAX_ORDER configurable at runtime and
> use this to request order-0 frags under Xen. Also expose this setting
> via sysctl.
> 
> Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
> Cc: Eric Dumazet <edumazet@google.com>
> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
> Cc: netdev@vger.kernel.org
> Cc: xen-devel@lists.xen.org
> ---

I understand your concern, but this seems like a quick/dirty hack at
the moment. After setting the sysctl to 0, some tasks may still have
some order-3 pages in their cache.

Your driver must already cope with skb->head being split across
several pages.

So what fundamental difference exists with frags?

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH] net: allow configuration of the size of page in __netdev_alloc_frag
  2012-10-24 12:28 ` Eric Dumazet
  2012-10-24 13:16   ` Ian Campbell
@ 2012-10-24 13:16   ` Ian Campbell
  2012-10-24 13:30     ` Eric Dumazet
  2012-10-24 13:30     ` Eric Dumazet
  1 sibling, 2 replies; 25+ messages in thread
From: Ian Campbell @ 2012-10-24 13:16 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev, Eric Dumazet, Konrad Rzeszutek Wilk, xen-devel

On Wed, 2012-10-24 at 13:28 +0100, Eric Dumazet wrote:
> On Wed, 2012-10-24 at 12:42 +0100, Ian Campbell wrote:
> > The commit 69b08f62e174 "net: use bigger pages in __netdev_alloc_frag"
> > lead to 70%+ packet loss under Xen when transmitting from physical (as
> > opposed to virtual) network devices.
> > 
> > This is because under Xen pages which are contiguous in the physical
> > address space may not be contiguous in the DMA space, in fact it is
> > very likely that they are not. I think there are other architectures
> > where this is true, although perhaps non quite so aggressive as to
> > have this property at a per-order-0-page granularity.
> > 
> > The real underlying bug here most likely lies in the swiotlb not
> > correctly handling compound pages, and Konrad is investigating this.
> > However even with the swiotlb issue fixed the current arrangement
> > seems likely to result in a lot of bounce buffering which seems likely
> > to more than offset any benefit from the use of larger pages.
> > 
> > Therefore make NETDEV_FRAG_PAGE_MAX_ORDER configurable at runtime and
> > use this to request order-0 frags under Xen. Also expose this setting
> > via sysctl.
> > 
> > Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
> > Cc: Eric Dumazet <edumazet@google.com>
> > Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
> > Cc: netdev@vger.kernel.org
> > Cc: xen-devel@lists.xen.org
> > ---
> 
> I understand your concern, but this seems a quick/dirty hack at this
> moment. After setting the sysctl to 0, some tasks may still have some
> order-3 pages in their cache.

Right, the sysctl thing might be overkill; I just figured it was useful
for debugging. When booting in a Xen VM the patch sets the value to
zero very early on, during setup_arch(), which is before any tasks
even exist.

> Your driver must already cope with skb->head being split on several
> pages.
> 
> So what fundamental difference exists with frags ?

The issue here is with drivers for physical network devices when
running under Xen, not with the Xen paravirtualised network drivers
(a.k.a. netback/netfront).

The problem is that pages which are contiguous in the physical address
space may not be contiguous in the DMA address space. With order>0
pages this becomes a problem when you write the DMA address and length
of a compound page into the hardware's registers: the DMA address will
be right for the head of the page, but once the hardware steps off the
end of that page it will read from the wrong one.

I don't think this non-contiguity between physical and DMA addresses
is specific to Xen, although it is more frequent under Xen than on any
real hardware platform. (Xen has often been a good canary for these
sorts of issues, which turn out later to impact other arches too.)

In theory this could be fixed in all the drivers for physical network
devices, but that would be a lot of effort (and probably a fair bit of
ugliness in the drivers) for a gain which was only relevant to Xen. 

Ian.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH] net: allow configuration of the size of page in __netdev_alloc_frag
  2012-10-24 13:16   ` Ian Campbell
  2012-10-24 13:30     ` Eric Dumazet
@ 2012-10-24 13:30     ` Eric Dumazet
  2012-10-24 14:02       ` Ian Campbell
  2012-10-24 14:02       ` Ian Campbell
  1 sibling, 2 replies; 25+ messages in thread
From: Eric Dumazet @ 2012-10-24 13:30 UTC (permalink / raw)
  To: Ian Campbell; +Cc: netdev, Eric Dumazet, Konrad Rzeszutek Wilk, xen-devel

On Wed, 2012-10-24 at 14:16 +0100, Ian Campbell wrote:
> On Wed, 2012-10-24 at 13:28 +0100, Eric Dumazet wrote:
> > On Wed, 2012-10-24 at 12:42 +0100, Ian Campbell wrote:
> > > The commit 69b08f62e174 "net: use bigger pages in __netdev_alloc_frag"
> > > lead to 70%+ packet loss under Xen when transmitting from physical (as
> > > opposed to virtual) network devices.
> > > 
> > > This is because under Xen pages which are contiguous in the physical
> > > address space may not be contiguous in the DMA space, in fact it is
> > > very likely that they are not. I think there are other architectures
> > > where this is true, although perhaps non quite so aggressive as to
> > > have this property at a per-order-0-page granularity.
> > > 
> > > The real underlying bug here most likely lies in the swiotlb not
> > > correctly handling compound pages, and Konrad is investigating this.
> > > However even with the swiotlb issue fixed the current arrangement
> > > seems likely to result in a lot of bounce buffering which seems likely
> > > to more than offset any benefit from the use of larger pages.
> > > 
> > > Therefore make NETDEV_FRAG_PAGE_MAX_ORDER configurable at runtime and
> > > use this to request order-0 frags under Xen. Also expose this setting
> > > via sysctl.
> > > 
> > > Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
> > > Cc: Eric Dumazet <edumazet@google.com>
> > > Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
> > > Cc: netdev@vger.kernel.org
> > > Cc: xen-devel@lists.xen.org
> > > ---
> > 
> > I understand your concern, but this seems a quick/dirty hack at this
> > moment. After setting the sysctl to 0, some tasks may still have some
> > order-3 pages in their cache.
> 
> Right, the sysctl thing might be overkill, I just figured it was useful
> for debugging. When booting in a Xen VM the patch sets it to zero very
> early on, during setup_arch(), which is before any tasks even exist.
> 
> > Your driver must already cope with skb->head being split on several
> > pages.
> > 
> > So what fundamental difference exists with frags ?
> 
> The issue here is with drivers for physical network devices when running
> under Xen not with the Xen paravirtualised network drivers (AKA
> netback/netfront).
> 
> The problem is that pages which are contiguous in the physical address
> space may not be contiguous in the DMA address space. With order>0 pages
> this becomes a problem when you poke down the DMA address and length of
> a compound page into the hardware registers. The DMA address will be
> right for the head of the page but once the hardware steps off the end
> of that it'll get the wrong page.
> 
> I don't think this non-contiguousness between physical and DMA addresses
> is specific to Xen, although it is more frequent under Xen than any real
> hardware platform. (Xen has often been a good canary for these sorts of
> issues which turn out later on to impact other arches too.)
> 
> In theory this could be fixed in all the drivers for physical network
> devices, but that would be a lot of effort (and probably a fair bit of
> ugliness in the drivers) for a gain which was only relevant to Xen. 

I still have concerns about skb->head that you didn't really answer.

Why can skb->head be on order-1 or order-2 pages and still work?

It seems to me it's a driver issue; for example,
drivers/net/xen-netfront.c has assumptions that can easily be fixed.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH] net: allow configuration of the size of page in __netdev_alloc_frag
  2012-10-24 13:30     ` Eric Dumazet
  2012-10-24 14:02       ` Ian Campbell
@ 2012-10-24 14:02       ` Ian Campbell
  2012-10-24 15:21         ` Eric Dumazet
  2012-10-24 15:21         ` Eric Dumazet
  1 sibling, 2 replies; 25+ messages in thread
From: Ian Campbell @ 2012-10-24 14:02 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev, Eric Dumazet, Konrad Rzeszutek Wilk, xen-devel

On Wed, 2012-10-24 at 14:30 +0100, Eric Dumazet wrote:
> It seems to me its a driver issue, for example
> drivers/net/xen-netfront.c has assumptions that can be easily fixed.

The netfront ->head thing is a separate (although perhaps related)
issue, which I intended to fix along the same lines as the earlier
netback fix -- except that, for some unfathomable reason, I haven't
been able to reproduce the problem with netfront. I've no idea why,
since it seems like it should be a no-brainer!

> Why skb->head can be on order-1 or order-2 pages and this is working ?

skb->head being order 1 or 2 isn't working for me. The driver I'm
having issues with, and which caused me to create this particular
patch, is the tg3 driver (although I don't think the problem is by any
means specific to tg3).

For the ->head the tg3 driver does:
        mapping = pci_map_single(tp->pdev, skb->data, len, PCI_DMA_TODEVICE);
while for the frags it does:
        mapping = skb_frag_dma_map(&tp->pdev->dev, frag, 0, len, DMA_TO_DEVICE);

This ought to do the Right Thing, but doesn't seem to be working.
Konrad suspected an issue with the swiotlb's handling of order>0 pages
in some cases; as I said in the commit message, he is looking into this.

My concern, however, was that even once the swiotlb is fixed to work
correctly, the effect of pci_map_single() on an order>0 page is going
to be that the data gets bounced into contiguous memory -- that is, a
memcpy which would undo the benefit of having allocated large pages in
the first place. So I figured that in such cases we'd be better off
just using order-0 allocations from the start.

Ian.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH] net: allow configuration of the size of page in __netdev_alloc_frag
  2012-10-24 14:02       ` Ian Campbell
  2012-10-24 15:21         ` Eric Dumazet
@ 2012-10-24 15:21         ` Eric Dumazet
  2012-10-24 16:22           ` Ian Campbell
  2012-10-24 16:22           ` Ian Campbell
  1 sibling, 2 replies; 25+ messages in thread
From: Eric Dumazet @ 2012-10-24 15:21 UTC (permalink / raw)
  To: Ian Campbell; +Cc: netdev, Eric Dumazet, Konrad Rzeszutek Wilk, xen-devel

On Wed, 2012-10-24 at 15:02 +0100, Ian Campbell wrote:
> On Wed, 2012-10-24 at 14:30 +0100, Eric Dumazet wrote:
> > It seems to me its a driver issue, for example
> > drivers/net/xen-netfront.c has assumptions that can be easily fixed.
> 
> The netfront ->head thing is a separate (although perhaps related)
> issue, I intended to fix along the same lines as the previous netback
> except for some unfathomable reason I haven't been able to reproduce the
> problem with netfront -- I've no idea why though since it seems like it
> should be a no brainer!
> 
> > Why skb->head can be on order-1 or order-2 pages and this is working ?
> 
> skb->head being order 1 or 2 isn't working for me. The driver I'm having
> issues with which caused me to create this particular patch is the tg3
> driver (although I don't think this is by any means specific to tg3).
> 
> For the ->head the tg3 driver does:
>         mapping = pci_map_single(tp->pdev, skb->data, len, PCI_DMA_TODEVICE);
> while for the frags it does:
>         mapping = skb_frag_dma_map(&tp->pdev->dev, frag, 0, len, DMA_TO_DEVICE);
> 
> This ought to do the Right Thing but doesn't seem to be working. Konrad
> suspected an issue with the swiotlb's handling of order>0 pages in some
> cases. As I said in the commit message he is looking into this issue.
> 
> My concern however was that even once the swiotlb is fixed to work right
> the effect of pci_map_single on a order>0 page is going to be that the
> data gets bounced into contiguous memory -- that is a memcpy which would
> undo the benefit of having allocating large pages to start with. So I
> figured that in such cases we'd be better off just using order 0
> allocations to start with.

I am really confused.

If you really have such problems, why doesn't locally generated TCP
traffic also have them?

Your patch doesn't touch sk_page_frag_refill(), does it?

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH] net: allow configuration of the size of page in __netdev_alloc_frag
  2012-10-24 15:21         ` Eric Dumazet
  2012-10-24 16:22           ` Ian Campbell
@ 2012-10-24 16:22           ` Ian Campbell
  2012-10-24 16:43             ` Eric Dumazet
  2012-10-24 16:43             ` Eric Dumazet
  1 sibling, 2 replies; 25+ messages in thread
From: Ian Campbell @ 2012-10-24 16:22 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev, Eric Dumazet, Konrad Rzeszutek Wilk, xen-devel

On Wed, 2012-10-24 at 16:21 +0100, Eric Dumazet wrote:
> On Wed, 2012-10-24 at 15:02 +0100, Ian Campbell wrote:
> > On Wed, 2012-10-24 at 14:30 +0100, Eric Dumazet wrote:
> > > It seems to me its a driver issue, for example
> > > drivers/net/xen-netfront.c has assumptions that can be easily fixed.
> > 
> > The netfront ->head thing is a separate (although perhaps related)
> > issue, I intended to fix along the same lines as the previous netback
> > except for some unfathomable reason I haven't been able to reproduce the
> > problem with netfront -- I've no idea why though since it seems like it
> > should be a no brainer!
> > 
> > > Why skb->head can be on order-1 or order-2 pages and this is working ?
> > 
> > skb->head being order 1 or 2 isn't working for me. The driver I'm having
> > issues with which caused me to create this particular patch is the tg3
> > driver (although I don't think this is by any means specific to tg3).
> > 
> > For the ->head the tg3 driver does:
> >         mapping = pci_map_single(tp->pdev, skb->data, len, PCI_DMA_TODEVICE);
> > while for the frags it does:
> >         mapping = skb_frag_dma_map(&tp->pdev->dev, frag, 0, len, DMA_TO_DEVICE);
> > 
> > This ought to do the Right Thing but doesn't seem to be working. Konrad
> > suspected an issue with the swiotlb's handling of order>0 pages in some
> > cases. As I said in the commit message he is looking into this issue.
> > 
> > My concern however was that even once the swiotlb is fixed to work right
> > the effect of pci_map_single on a order>0 page is going to be that the
> > data gets bounced into contiguous memory -- that is a memcpy which would
> > undo the benefit of having allocating large pages to start with. So I
> > figured that in such cases we'd be better off just using order 0
> > allocations to start with.
> 
> I am really confused.
> 
> If you really have such problems, why locally generated TCP traffic
> doesnt also have it ?

I think it does. The reason I noticed the original problem was that ssh
to the machine was virtually (no pun intended) unusable.

> Your patch doesnt touch sk_page_frag_refill(), does it ?

That's right. It doesn't. When is (sk->sk_allocation & __GFP_WAIT) true?
Is it possible I'm just not hitting that case?

Is it possible that this only affects certain traffic patterns (I only
really tried ssh/scp and ping)? Or perhaps it's just that the swiotlb is
only broken in one corner case and not the other.

Ian.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH] net: allow configuration of the size of page in __netdev_alloc_frag
  2012-10-24 16:22           ` Ian Campbell
  2012-10-24 16:43             ` Eric Dumazet
@ 2012-10-24 16:43             ` Eric Dumazet
  2012-10-30 16:53               ` Konrad Rzeszutek Wilk
  2012-10-30 16:53               ` Konrad Rzeszutek Wilk
  1 sibling, 2 replies; 25+ messages in thread
From: Eric Dumazet @ 2012-10-24 16:43 UTC (permalink / raw)
  To: Ian Campbell; +Cc: netdev, Eric Dumazet, Konrad Rzeszutek Wilk, xen-devel

On Wed, 2012-10-24 at 17:22 +0100, Ian Campbell wrote:
> On Wed, 2012-10-24 at 16:21 +0100, Eric Dumazet wrote:

> > If you really have such problems, why locally generated TCP traffic
> > doesnt also have it ?
> 
> I think it does. The reason I noticed the original problem was that ssh
> to the machine was virtually (no pun intended) unusable.
> 
> > Your patch doesnt touch sk_page_frag_refill(), does it ?
> 
> That's right. It doesn't. When is (sk->sk_allocation & __GFP_WAIT) true?
> Is it possible I'm just not hitting that case?
> 

I hope not. GFP_KERNEL has __GFP_WAIT.

> Is it possible that this only affects certain traffic patterns (I only
> really tried ssh/scp and ping)? Or perhaps its just that the swiotlb is
> only broken in one corner case and not the other.

Could you try netperf -t TCP_STREAM?

Because ssh uses small packets, and small TCP packets don't use frags but
skb->head.

You mentioned a 70% drop in performance, but what test did you use
exactly?

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH] net: allow configuration of the size of page in __netdev_alloc_frag
  2012-10-24 11:42 [PATCH] net: allow configuration of the size of page in __netdev_alloc_frag Ian Campbell
  2012-10-24 12:28 ` Eric Dumazet
  2012-10-24 12:28 ` Eric Dumazet
@ 2012-10-24 18:19 ` David Miller
  2012-10-24 18:19 ` David Miller
  3 siblings, 0 replies; 25+ messages in thread
From: David Miller @ 2012-10-24 18:19 UTC (permalink / raw)
  To: ian.campbell; +Cc: netdev, edumazet, konrad.wilk, xen-devel


I'm not applying this.

Fix your drivers and the infrastructure they use, don't paper
around it.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH] net: allow configuration of the size of page in __netdev_alloc_frag
  2012-10-24 16:43             ` Eric Dumazet
  2012-10-30 16:53               ` Konrad Rzeszutek Wilk
@ 2012-10-30 16:53               ` Konrad Rzeszutek Wilk
  2012-10-30 17:23                 ` Konrad Rzeszutek Wilk
  1 sibling, 1 reply; 25+ messages in thread
From: Konrad Rzeszutek Wilk @ 2012-10-30 16:53 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Ian Campbell, netdev, Eric Dumazet, xen-devel

[-- Attachment #1: Type: text/plain, Size: 2625 bytes --]

On Wed, Oct 24, 2012 at 06:43:20PM +0200, Eric Dumazet wrote:
> On Wed, 2012-10-24 at 17:22 +0100, Ian Campbell wrote:
> > On Wed, 2012-10-24 at 16:21 +0100, Eric Dumazet wrote:
> 
> > > If you really have such problems, why locally generated TCP traffic
> > > doesnt also have it ?
> > 
> > I think it does. The reason I noticed the original problem was that ssh
> > to the machine was virtually (no pun intended) unusable.
> > 
> > > Your patch doesnt touch sk_page_frag_refill(), does it ?
> > 
> > That's right. It doesn't. When is (sk->sk_allocation & __GFP_WAIT) true?
> > Is it possible I'm just not hitting that case?
> > 
> 
> I hope not. GFP_KERNEL has __GFP_WAIT.
> 
> > Is it possible that this only affects certain traffic patterns (I only
> > really tried ssh/scp and ping)? Or perhaps its just that the swiotlb is
> > only broken in one corner case and not the other.
> 
> Could you try a netperf -t TCP_STREAM ?

For fun I did a couple of tests - I set up two machines (one r8169, the
other e1000e) and ran netperf/netserver. Both of them are running a
baremetal kernel and one of them has 'iommu=soft swiotlb=force' to
simulate the worst case. This is using v3.7-rc3.

The r8169 is booted without any arguments, the e1000e is using 'iommu=soft
swiotlb=force'.

So r8169 -> e1000e, I get ~940 (this is odd, I expected that the e1000e
on the recv side would be using the bounce buffer, but then I realized it
sets up a 'dma' pool using pci_alloc_coherent).

The other way - e1000e -> r8169 got me around ~128. So it is the sending
side that ends up using the bounce buffer and it slows down considerably.

I also swapped the machine that had e1000e with a tg3 - and got around
the same numbers.

So all of this points to the swiotlb, and just to make sure nothing was
amiss I wrote a little driver that allocates a compound page, sets up the
DMA mapping, does some writes, then syncs and unmaps the DMA page. And it
works correctly - so the swiotlb (and the Xen variant) works just right.
Attached for your fun.

Then I decided to try v3.6.3, with the exact same parameters... and
the problem went away.

The e1000e -> r8169 which got me around ~128, now gets ~940! Still
using the swiotlb bounce buffer.


> 
> Because ssh use small packets, and small TCP packets dont use frags but
> skb->head.
> 
> You mentioned a 70% drop of performance, but what test have you used
> exactly ?

Note, I did not provide any arguments to netperf, but it did pick the
test you wanted:

> netperf -H tst019
TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to tst019.dumpdata.com (192.168.101.39) port 0 AF_INET

> 
> 

[-- Attachment #2: dma_test.c --]
[-- Type: text/plain, Size: 5698 bytes --]

#include <linux/module.h>
#include <linux/kthread.h>
#include <linux/pagemap.h>
#include <linux/init.h>
#include <linux/dma-mapping.h>
#include <linux/slab.h>
#include <xen/page.h>

#define DMA_TEST  "0.1"

MODULE_AUTHOR("Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>");
MODULE_DESCRIPTION("dma_test");
MODULE_LICENSE("GPL");
MODULE_VERSION(DMA_TEST);

static struct bus_type fallback_bus_type = {
	.name = "fallback_bus:",
};

static void fake_release(struct device *dev)
{
	/* No kfree as the device was allocated on stack. */
}

struct args {
	int len;
	enum dma_data_direction dir;
};

#define MAGIC_DEVICE 0xffffffdd
#define MAGIC_CPU 0xffffffcc

static int dma_test_thread(void *arg)
{
	struct page *page;
	dma_addr_t dma_addr = 0;
	struct device fake = {
		.coherent_dma_mask = DMA_BIT_MASK(32),
		.bus = &fallback_bus_type,
		.release = fake_release,
	};
	gfp_t gfp = __GFP_COMP | __GFP_NOWARN | GFP_ATOMIC;
	int ret;
	int i;
	void *addr;
	struct page *p;

	struct args *args = (struct args *)arg;
	int dir = args->dir;
	int len = args->len;

	dev_set_name(&fake, "%s", dir == DMA_TO_DEVICE ? "to_dev" : "to_cpu");
	fake.dma_mask = &fake.coherent_dma_mask;
	ret = device_register(&fake);
	if (ret)
		goto out;

	do {
		unsigned long prev_mfn = 0;
		bool bus_and_dma_same;

		page = alloc_pages(gfp, get_order(len));
		p = page;
		/* Check that the bus addresses are contiguous. */
		for (i = 0; i < len / PAGE_SIZE; i++, p++) {
			unsigned long pfn, mfn;

			addr = page_address(p);
			pfn = PFN_DOWN(virt_to_phys(addr));
			if (xen_domain())
				mfn = pfn_to_mfn(pfn);
			else
				mfn = pfn;
			if (i != 0) {
				if (prev_mfn + 1 != mfn)
					dev_warn(&fake, "va: %lx (pfn:%lx, mfn:%lx) w.r.t prev mfn: %lx!\n",
						 (unsigned long)addr, pfn, mfn, prev_mfn);
			}
			prev_mfn = mfn;
		}
		dma_addr = dma_map_page(&fake, page, 0 /* no offset */, len, dir);
		/* Note, dma_addr is the physical address ! */
		if (dma_mapping_error(&fake, dma_addr)) {
			dev_warn(&fake, "DMA %lx for %lx is not right\n", (unsigned long)dma_addr,
				(unsigned long)page_address(page));
			__free_pages(page, get_order(len));
			page = NULL;
		}
		bus_and_dma_same = false;
		if (page) {
			unsigned long phys;
			unsigned long pfn, mfn, bus_addr_mfn;
			unsigned long bus_addr = 0;

			p = page;
			for (i = 0; i < len / PAGE_SIZE; i++, p++) {
				void *bus_va;

				addr = page_address(p);
				phys = virt_to_phys(addr);
				pfn = PFN_DOWN(phys);

				bus_va = (void *)(dma_addr + (i * PAGE_SIZE));

				if (xen_domain()) {
					void * tmp;

					/* Find the bus frame for the physical frame*/
					mfn = pfn_to_mfn(pfn);
					/* and .. voodoo time! */
					bus_addr_mfn = PFN_DOWN(dma_addr + (i * PAGE_SIZE));
					bus_addr = PFN_PHYS(mfn_to_pfn(bus_addr_mfn));
					tmp = __va(bus_addr);
					bus_va = mfn_to_virt(bus_addr_mfn);
					WARN(bus_va != tmp, "Expected %lx (%lx+%d*PAGE_SIZE), got: %lx (pfn: %lx, mfn: %lx)!\n",
						(unsigned long)bus_va, (unsigned long)dma_addr, i, (unsigned long)tmp, PFN_DOWN(bus_addr), bus_addr_mfn);
				} else {
					mfn = pfn;
					bus_addr = (unsigned long)bus_va;
					/* Assume DMA addr == physical addr */
					bus_addr_mfn = PFN_DOWN(bus_addr);
					bus_va = __va(PFN_PHYS(bus_addr_mfn));
				}

				dev_info(&fake, "%lx (pfn:%lx, bus frame: %lx) %s %lx (addr: %lx, frame: %lx)\n",
					(unsigned long)addr, pfn, mfn,
					dir == DMA_TO_DEVICE ? "=>" : "<=",
					(unsigned long)bus_va, bus_addr, bus_addr_mfn);

				if (!virt_addr_valid(bus_va))
					break;
				if (!virt_addr_valid(addr))
					break;

				/* CPU */
				memset(addr, 0xCC, PAGE_SIZE);

				/* Device */
				memset(bus_va, 0xDD, PAGE_SIZE);
				if (addr == bus_va)
					bus_and_dma_same = true;
			}
		}
		set_current_state(TASK_INTERRUPTIBLE);
		schedule_timeout_interruptible(5*HZ);

		if (!page)
			continue;
		p = page;

		for (i = 0; i < len / PAGE_SIZE; i++, p++) {
			if (bus_and_dma_same)
				continue;
			addr = page_address(p);
			if (((char *)addr)[0] != MAGIC_CPU)
				dev_warn(&fake, "%lx with DMA (%lx) has %x (expected %lx)\n",
					(unsigned long)addr, (unsigned long)(dma_addr + (i * PAGE_SIZE)),
					((char *)addr)[0], (unsigned long)MAGIC_CPU);
		}
		/* sync the page */
		dma_sync_single_for_cpu(&fake, dma_addr, len, dir);
		p = page;
		for (i = 0; i < len / PAGE_SIZE; i++, p++) {
			unsigned long check_val = MAGIC_DEVICE;

			addr = page_address(p);
			if (dir == DMA_TO_DEVICE)
				check_val = MAGIC_CPU;
			if (dir == DMA_FROM_DEVICE)
				check_val = MAGIC_DEVICE;

			dev_info(&fake, "%lx with DMA (%lx) has %x (expected %lx)\n",
				(unsigned long)addr, (unsigned long)(dma_addr + (i * PAGE_SIZE)),
				((char *)addr)[0], check_val);
		}
		dma_unmap_page(&fake, dma_addr, len, dir);
		dma_addr = 0;
		__free_pages(page, get_order(len));
		page = NULL;
	} while (!kthread_should_stop());

	if (dma_addr)
		dma_unmap_page(&fake, dma_addr, len, dir);
	if (page)
		__free_pages(page, get_order(len));
	/* device_unregister() drops the reference device_register() took. */
	device_unregister(&fake);
out:
	return 0;
}
static struct task_struct *t[2];
static struct args a[2];

static int __init dma_test_init(void)
{
	int ret;

	/* No point doing this without SWIOTLB */
	if (!swiotlb_nr_tbl())
		return -ENODEV;

	ret = bus_register(&fallback_bus_type);
	if (ret)
		return ret;

	a[0].dir = DMA_TO_DEVICE;
	a[0].len = 32768;
	t[0] = kthread_run(dma_test_thread, &a[0], "dma_test_dev");

	a[1].len = 16384;
	a[1].dir = DMA_FROM_DEVICE;
	t[1] = kthread_run(dma_test_thread, &a[1], "dma_test_cpu");
	return 0;
}
static void __exit dma_test_exit(void)
{
	if (t[0])
		kthread_stop(t[0]);
	if (t[1])
		kthread_stop(t[1]);

 	bus_unregister(&fallback_bus_type);
}
module_init(dma_test_init);
module_exit(dma_test_exit);

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH] net: allow configuration of the size of page in __netdev_alloc_frag
  2012-10-30 16:53               ` Konrad Rzeszutek Wilk
@ 2012-10-30 17:23                 ` Konrad Rzeszutek Wilk
  2012-10-31 11:01                   ` Konrad Rzeszutek Wilk
  2012-10-31 11:01                   ` [Xen-devel] " Konrad Rzeszutek Wilk
  0 siblings, 2 replies; 25+ messages in thread
From: Konrad Rzeszutek Wilk @ 2012-10-30 17:23 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev, Eric Dumazet, Ian Campbell, xen-devel

On Tue, Oct 30, 2012 at 12:53:09PM -0400, Konrad Rzeszutek Wilk wrote:
> On Wed, Oct 24, 2012 at 06:43:20PM +0200, Eric Dumazet wrote:
> > On Wed, 2012-10-24 at 17:22 +0100, Ian Campbell wrote:
> > > On Wed, 2012-10-24 at 16:21 +0100, Eric Dumazet wrote:
> > 
> > > > If you really have such problems, why locally generated TCP traffic
> > > > doesnt also have it ?
> > > 
> > > I think it does. The reason I noticed the original problem was that ssh
> > > to the machine was virtually (no pun intended) unusable.
> > > 
> > > > Your patch doesnt touch sk_page_frag_refill(), does it ?
> > > 
> > > That's right. It doesn't. When is (sk->sk_allocation & __GFP_WAIT) true?
> > > Is it possible I'm just not hitting that case?
> > > 
> > 
> > I hope not. GFP_KERNEL has __GFP_WAIT.
> > 
> > > Is it possible that this only affects certain traffic patterns (I only
> > > really tried ssh/scp and ping)? Or perhaps its just that the swiotlb is
> > > only broken in one corner case and not the other.
> > 
> > Could you try a netperf -t TCP_STREAM ?
> 
> For fun I did a couple of tests - I setup two machines (one r8168, the other
> e1000e) and tried to do netperf/netserver. Both of them are running a baremetal
> kernel and one of them has 'iommu=soft swiotlb=force' to simulate the worst
> case. This is using v3.7-rc3.

I also did a test with the patch at the top, with the same setup, and ...
it does look like it fixes some issues, but not the underlying one.

With the same test and net.core.netdev_frag_page_max_order=0, the
e1000e->r8169 direction gets ~124, but then on subsequent runs it picks up
to ~933. If I let the machine sit idle for a bit and then run it again, it
does around ~124 again.

Thoughts?

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [Xen-devel] [PATCH] net: allow configuration of the size of page in __netdev_alloc_frag
  2012-10-30 17:23                 ` Konrad Rzeszutek Wilk
  2012-10-31 11:01                   ` Konrad Rzeszutek Wilk
@ 2012-10-31 11:01                   ` Konrad Rzeszutek Wilk
  2012-10-31 11:19                     ` Eric Dumazet
  2012-10-31 11:19                     ` [Xen-devel] " Eric Dumazet
  1 sibling, 2 replies; 25+ messages in thread
From: Konrad Rzeszutek Wilk @ 2012-10-31 11:01 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: Eric Dumazet, netdev, Eric Dumazet, Ian Campbell, xen-devel

On Tue, Oct 30, 2012 at 01:23:52PM -0400, Konrad Rzeszutek Wilk wrote:
> On Tue, Oct 30, 2012 at 12:53:09PM -0400, Konrad Rzeszutek Wilk wrote:
> > On Wed, Oct 24, 2012 at 06:43:20PM +0200, Eric Dumazet wrote:
> > > On Wed, 2012-10-24 at 17:22 +0100, Ian Campbell wrote:
> > > > On Wed, 2012-10-24 at 16:21 +0100, Eric Dumazet wrote:
> > > 
> > > > > If you really have such problems, why locally generated TCP traffic
> > > > > doesn't also have it?
> > > > 
> > > > I think it does. The reason I noticed the original problem was that ssh
> > > > to the machine was virtually (no pun intended) unusable.
> > > > 
> > > > > Your patch doesn't touch sk_page_frag_refill(), does it?
> > > > 
> > > > That's right. It doesn't. When is (sk->sk_allocation & __GFP_WAIT) true?
> > > > Is it possible I'm just not hitting that case?
> > > > 
> > > 
> > > I hope not. GFP_KERNEL has __GFP_WAIT.
> > > 
> > > > Is it possible that this only affects certain traffic patterns (I only
> > > > really tried ssh/scp and ping)? Or perhaps it's just that the swiotlb is
> > > > only broken in one corner case and not the other.
> > > 
> > > Could you try a netperf -t TCP_STREAM ?
> > 
> > For fun I did a couple of tests - I set up two machines (one r8168, the other
> > e1000e) and tried to do netperf/netserver. Both of them are running a baremetal
> > kernel and one of them has 'iommu=soft swiotlb=force' to simulate the worst
> > case. This is using v3.7-rc3.
> 
> I also did a test with the patch at the top, with the same setup and ... it
> does look like it fixes some issues, but not the underlying one.
> 
> The same test, with net.core.netdev_frag_page_max_order=0, the e1000e->r8169
> gets ~124, but then on subsequent runs it picks up to ~933. If I let the
> machine stay a bit idle and then do this again, it does around ~124 again.
> 
> Thoughts?

Argh. Please disregard this test. I added an extra patch in the kernel
tree to track the SWIOTLB bounce usage and print it... and the printk was
going out to the FB, which was not too fast - so the whole test was being
slowed down by FB drivers :-)

Will re-run this test without the offending patch.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [Xen-devel] [PATCH] net: allow configuration of the size of page in __netdev_alloc_frag
  2012-10-31 11:01                   ` [Xen-devel] " Konrad Rzeszutek Wilk
  2012-10-31 11:19                     ` Eric Dumazet
@ 2012-10-31 11:19                     ` Eric Dumazet
  1 sibling, 0 replies; 25+ messages in thread
From: Eric Dumazet @ 2012-10-31 11:19 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: Konrad Rzeszutek Wilk, netdev, Eric Dumazet, Ian Campbell, xen-devel

On Wed, 2012-10-31 at 07:01 -0400, Konrad Rzeszutek Wilk wrote:

> Argh. Please disregard this test. I added an extra patch in the kernel
> tree to track the SWIOTLB bounce usage and print it.. and the printk was
> going out to the FB, which was not too fast - so the whole test was being
> slowed down by FB drivers :-)
> 
> Will re-run this test without the offending patch.

Anyway, I must confess I didn't understand what you did ;)

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH] net: allow configuration of the size of page in __netdev_alloc_frag
@ 2012-10-24 11:42 Ian Campbell
  0 siblings, 0 replies; 25+ messages in thread
From: Ian Campbell @ 2012-10-24 11:42 UTC (permalink / raw)
  To: netdev; +Cc: Eric Dumazet, xen-devel, Ian Campbell, Konrad Rzeszutek Wilk

Commit 69b08f62e174 ("net: use bigger pages in __netdev_alloc_frag")
led to 70%+ packet loss under Xen when transmitting from physical (as
opposed to virtual) network devices.

This is because under Xen pages which are contiguous in the physical
address space may not be contiguous in the DMA space; in fact it is
very likely that they are not. I think there are other architectures
where this is true, although perhaps none quite so aggressive as to
have this property at a per-order-0-page granularity.

The real underlying bug here most likely lies in the swiotlb not
correctly handling compound pages, and Konrad is investigating this.
However, even with the swiotlb issue fixed, the current arrangement
seems likely to result in a lot of bounce buffering, which would more
than offset any benefit from the use of larger pages.

Therefore make NETDEV_FRAG_PAGE_MAX_ORDER configurable at runtime and
use this to request order-0 frags under Xen. Also expose this setting
via sysctl.

Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: netdev@vger.kernel.org
Cc: xen-devel@lists.xen.org
---
 arch/x86/xen/setup.c       |    7 +++++++
 include/linux/skbuff.h     |    2 ++
 net/core/skbuff.c          |    7 ++++---
 net/core/sysctl_net_core.c |    7 +++++++
 4 files changed, 20 insertions(+), 3 deletions(-)

diff --git a/arch/x86/xen/setup.c b/arch/x86/xen/setup.c
index 8971a26..ad14d46 100644
--- a/arch/x86/xen/setup.c
+++ b/arch/x86/xen/setup.c
@@ -11,6 +11,7 @@
 #include <linux/memblock.h>
 #include <linux/cpuidle.h>
 #include <linux/cpufreq.h>
+#include <linux/skbuff.h>
 
 #include <asm/elf.h>
 #include <asm/vdso.h>
@@ -555,6 +556,12 @@ void __init xen_arch_setup(void)
 	       MAX_GUEST_CMDLINE > COMMAND_LINE_SIZE ?
 	       COMMAND_LINE_SIZE : MAX_GUEST_CMDLINE);
 
+	/*
+	 * Xen cannot handle DMA to/from compound pages so avoid
+	 * bounce buffering by not allocating large network frags.
+	 */
+	netdev_frag_page_max_order = 0;
+
 	/* Set up idle, making sure it calls safe_halt() pvop */
 #ifdef CONFIG_X86_32
 	boot_cpu_data.hlt_works_ok = 1;
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 6a2c34e..a3a748f 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -1719,6 +1719,8 @@ static inline void __skb_queue_purge(struct sk_buff_head *list)
 		kfree_skb(skb);
 }
 
+extern int netdev_frag_page_max_order;
+
 extern void *netdev_alloc_frag(unsigned int fragsz);
 
 extern struct sk_buff *__netdev_alloc_skb(struct net_device *dev,
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 6e04b1f..88cbe5f 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -348,8 +348,9 @@ struct netdev_alloc_cache {
 };
 static DEFINE_PER_CPU(struct netdev_alloc_cache, netdev_alloc_cache);
 
-#define NETDEV_FRAG_PAGE_MAX_ORDER get_order(32768)
-#define NETDEV_FRAG_PAGE_MAX_SIZE  (PAGE_SIZE << NETDEV_FRAG_PAGE_MAX_ORDER)
+int netdev_frag_page_max_order __read_mostly = get_order(32768);
+
+#define NETDEV_FRAG_PAGE_MAX_SIZE  (PAGE_SIZE << netdev_frag_page_max_order)
 #define NETDEV_PAGECNT_MAX_BIAS	   NETDEV_FRAG_PAGE_MAX_SIZE
 
 static void *__netdev_alloc_frag(unsigned int fragsz, gfp_t gfp_mask)
@@ -363,7 +364,7 @@ static void *__netdev_alloc_frag(unsigned int fragsz, gfp_t gfp_mask)
 	nc = &__get_cpu_var(netdev_alloc_cache);
 	if (unlikely(!nc->frag.page)) {
 refill:
-		for (order = NETDEV_FRAG_PAGE_MAX_ORDER; ;) {
+		for (order = netdev_frag_page_max_order; ;) {
 			gfp_t gfp = gfp_mask;
 
 			if (order)
diff --git a/net/core/sysctl_net_core.c b/net/core/sysctl_net_core.c
index a7c3684..e5ab6df 100644
--- a/net/core/sysctl_net_core.c
+++ b/net/core/sysctl_net_core.c
@@ -129,6 +129,13 @@ static struct ctl_table net_core_table[] = {
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec
 	},
+	{
+		.procname	= "netdev_frag_page_max_order",
+		.data		= &netdev_frag_page_max_order,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec
+	},
 #ifdef CONFIG_BPF_JIT
 	{
 		.procname	= "bpf_jit_enable",
-- 
1.7.2.5

^ permalink raw reply related	[flat|nested] 25+ messages in thread

end of thread, other threads:[~2012-10-31 11:19 UTC | newest]

Thread overview: 25+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-10-24 11:42 [PATCH] net: allow configuration of the size of page in __netdev_alloc_frag Ian Campbell
2012-10-24 12:28 ` Eric Dumazet
2012-10-24 12:28 ` Eric Dumazet
2012-10-24 13:16   ` Ian Campbell
2012-10-24 13:16   ` Ian Campbell
2012-10-24 13:30     ` Eric Dumazet
2012-10-24 13:30     ` Eric Dumazet
2012-10-24 14:02       ` Ian Campbell
2012-10-24 14:02       ` Ian Campbell
2012-10-24 15:21         ` Eric Dumazet
2012-10-24 15:21         ` Eric Dumazet
2012-10-24 16:22           ` Ian Campbell
2012-10-24 16:22           ` Ian Campbell
2012-10-24 16:43             ` Eric Dumazet
2012-10-24 16:43             ` Eric Dumazet
2012-10-30 16:53               ` Konrad Rzeszutek Wilk
2012-10-30 16:53               ` Konrad Rzeszutek Wilk
2012-10-30 17:23                 ` Konrad Rzeszutek Wilk
2012-10-31 11:01                   ` Konrad Rzeszutek Wilk
2012-10-31 11:01                   ` [Xen-devel] " Konrad Rzeszutek Wilk
2012-10-31 11:19                     ` Eric Dumazet
2012-10-31 11:19                     ` [Xen-devel] " Eric Dumazet
2012-10-24 18:19 ` David Miller
2012-10-24 18:19 ` David Miller
