All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH net-next] bnx2x: dont use netdev_alloc_skb()
@ 2010-10-11 23:03 Eric Dumazet
  2010-10-11 23:22 ` Eric Dumazet
  2010-10-12 16:07 ` [BUG net-next] bnx2x: all traffic comes to RX queue 0 Eric Dumazet
  0 siblings, 2 replies; 40+ messages in thread
From: Eric Dumazet @ 2010-10-11 23:03 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, Michael Chan, Eilon Greenstein

netdev_alloc_skb() is a very wrong interface, really.

We should remove/deprecate it.

For multi queue devices, it makes more sense to allocate skb on local
node of the cpu handling RX interrupts. This allow each cpu to
manipulate its own slub/slab queues/structures without doing expensive
cross-node business.

For non multi queue devices, IRQ affinity should be set so that a cpu
close to the device services interrupts. Even if not set, using
dev_alloc_skb() is faster.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
 drivers/net/bnx2x/bnx2x_cmn.c     |   11 +++++------
 drivers/net/bnx2x/bnx2x_cmn.h     |    2 +-
 drivers/net/bnx2x/bnx2x_ethtool.c |    2 +-
 3 files changed, 7 insertions(+), 8 deletions(-)

diff --git a/drivers/net/bnx2x/bnx2x_cmn.c b/drivers/net/bnx2x/bnx2x_cmn.c
index 97ef674..0561bd9 100644
--- a/drivers/net/bnx2x/bnx2x_cmn.c
+++ b/drivers/net/bnx2x/bnx2x_cmn.c
@@ -335,7 +335,7 @@ static void bnx2x_tpa_stop(struct bnx2x *bp, struct bnx2x_fastpath *fp,
 	struct sw_rx_bd *rx_buf = &fp->tpa_pool[queue];
 	struct sk_buff *skb = rx_buf->skb;
 	/* alloc new skb */
-	struct sk_buff *new_skb = netdev_alloc_skb(bp->dev, bp->rx_buf_size);
+	struct sk_buff *new_skb = dev_alloc_skb(bp->rx_buf_size);
 
 	/* Unmap skb in the pool anyway, as we are going to change
 	   pool entry status to BNX2X_TPA_STOP even if new skb allocation
@@ -576,8 +576,7 @@ int bnx2x_rx_int(struct bnx2x_fastpath *fp, int budget)
 			    (len <= RX_COPY_THRESH)) {
 				struct sk_buff *new_skb;
 
-				new_skb = netdev_alloc_skb(bp->dev,
-							   len + pad);
+				new_skb = dev_alloc_skb(len + pad);
 				if (new_skb == NULL) {
 					DP(NETIF_MSG_RX_ERR,
 					   "ERROR  packet dropped "
@@ -587,9 +586,9 @@ int bnx2x_rx_int(struct bnx2x_fastpath *fp, int budget)
 				}
 
 				/* aligned copy */
-				skb_copy_from_linear_data_offset(skb, pad,
-						    new_skb->data + pad, len);
 				skb_reserve(new_skb, pad);
+				skb_copy_from_linear_data_offset(skb, pad,
+						    new_skb->data, len);
 				skb_put(new_skb, len);
 
 				bnx2x_reuse_rx_skb(fp, bd_cons, bd_prod);
@@ -841,7 +840,7 @@ void bnx2x_init_rx_rings(struct bnx2x *bp)
 		if (!fp->disable_tpa) {
 			for (i = 0; i < max_agg_queues; i++) {
 				fp->tpa_pool[i].skb =
-				   netdev_alloc_skb(bp->dev, bp->rx_buf_size);
+				   dev_alloc_skb(bp->rx_buf_size);
 				if (!fp->tpa_pool[i].skb) {
 					BNX2X_ERR("Failed to allocate TPA "
 						  "skb pool for queue[%d] - "
diff --git a/drivers/net/bnx2x/bnx2x_cmn.h b/drivers/net/bnx2x/bnx2x_cmn.h
index 7f52cec..f422cfc 100644
--- a/drivers/net/bnx2x/bnx2x_cmn.h
+++ b/drivers/net/bnx2x/bnx2x_cmn.h
@@ -833,7 +833,7 @@ static inline int bnx2x_alloc_rx_skb(struct bnx2x *bp,
 	struct eth_rx_bd *rx_bd = &fp->rx_desc_ring[index];
 	dma_addr_t mapping;
 
-	skb = netdev_alloc_skb(bp->dev, bp->rx_buf_size);
+	skb = dev_alloc_skb(bp->rx_buf_size);
 	if (unlikely(skb == NULL))
 		return -ENOMEM;
 
diff --git a/drivers/net/bnx2x/bnx2x_ethtool.c b/drivers/net/bnx2x/bnx2x_ethtool.c
index 54fe061..5224daa 100644
--- a/drivers/net/bnx2x/bnx2x_ethtool.c
+++ b/drivers/net/bnx2x/bnx2x_ethtool.c
@@ -1430,7 +1430,7 @@ static int bnx2x_run_loopback(struct bnx2x *bp, int loopback_mode, u8 link_up)
 	/* prepare the loopback packet */
 	pkt_size = (((bp->dev->mtu < ETH_MAX_PACKET_SIZE) ?
 		     bp->dev->mtu : ETH_MAX_PACKET_SIZE) + ETH_HLEN);
-	skb = netdev_alloc_skb(bp->dev, bp->rx_buf_size);
+	skb = dev_alloc_skb(bp->rx_buf_size);
 	if (!skb) {
 		rc = -ENOMEM;
 		goto test_loopback_exit;



^ permalink raw reply related	[flat|nested] 40+ messages in thread

* Re: [PATCH net-next] bnx2x: dont use netdev_alloc_skb()
  2010-10-11 23:03 [PATCH net-next] bnx2x: dont use netdev_alloc_skb() Eric Dumazet
@ 2010-10-11 23:22 ` Eric Dumazet
  2010-10-12  5:03   ` Tom Herbert
  2010-10-12  5:05   ` [PATCH net-next] net: allocate skbs on local node Eric Dumazet
  2010-10-12 16:07 ` [BUG net-next] bnx2x: all traffic comes to RX queue 0 Eric Dumazet
  1 sibling, 2 replies; 40+ messages in thread
From: Eric Dumazet @ 2010-10-11 23:22 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, Michael Chan, Eilon Greenstein

Le mardi 12 octobre 2010 à 01:03 +0200, Eric Dumazet a écrit :
> netdev_alloc_skb() is a very wrong interface, really.
> 
> We should remove/deprecate it.
> 
> For multi queue devices, it makes more sense to allocate skb on local
> node of the cpu handling RX interrupts. This allow each cpu to
> manipulate its own slub/slab queues/structures without doing expensive
> cross-node business.
> 
> For non multi queue devices, IRQ affinity should be set so that a cpu
> close to the device services interrupts. Even if not set, using
> dev_alloc_skb() is faster.
> 
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>

Or maybe revert :

commit b30973f877fea1a3fb84e05599890fcc082a88e5
Author: Christoph Hellwig <hch@lst.de>
Date:   Wed Dec 6 20:32:36 2006 -0800

    [PATCH] node-aware skb allocation
    
    Node-aware allocation of skbs for the receive path.
    
    Details:
    
      - __alloc_skb gets a new node argument and cals the node-aware
        slab functions with it.
      - netdev_alloc_skb passed the node number it gets from dev_to_node
        to it, everyone else passes -1 (any node)
    
    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Cc: Christoph Lameter <clameter@engr.sgi.com>
    Cc: "David S. Miller" <davem@davemloft.net>
    Signed-off-by: Andrew Morton <akpm@osdl.org>


Apparently, only Christoph and Andrew signed it.




^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH net-next] bnx2x: dont use netdev_alloc_skb()
  2010-10-11 23:22 ` Eric Dumazet
@ 2010-10-12  5:03   ` Tom Herbert
  2010-10-12  5:16     ` Eric Dumazet
  2010-10-12  5:05   ` [PATCH net-next] net: allocate skbs on local node Eric Dumazet
  1 sibling, 1 reply; 40+ messages in thread
From: Tom Herbert @ 2010-10-12  5:03 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, netdev, Michael Chan, Eilon Greenstein

On Mon, Oct 11, 2010 at 4:22 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> Le mardi 12 octobre 2010 à 01:03 +0200, Eric Dumazet a écrit :
>> netdev_alloc_skb() is a very wrong interface, really.
>>
>> We should remove/deprecate it.
>>
>> For multi queue devices, it makes more sense to allocate skb on local
>> node of the cpu handling RX interrupts. This allow each cpu to
>> manipulate its own slub/slab queues/structures without doing expensive
>> cross-node business.
>>
>> For non multi queue devices, IRQ affinity should be set so that a cpu
>> close to the device services interrupts. Even if not set, using
>> dev_alloc_skb() is faster.
>>
>> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
>
> Or maybe revert :
>
> commit b30973f877fea1a3fb84e05599890fcc082a88e5
> Author: Christoph Hellwig <hch@lst.de>
> Date:   Wed Dec 6 20:32:36 2006 -0800
>

I second this revert.  Node aware allocation by device's node makes
little sense on a multi-queue device and leads to mediocre
performance.

>    [PATCH] node-aware skb allocation
>
>    Node-aware allocation of skbs for the receive path.
>
>    Details:
>
>      - __alloc_skb gets a new node argument and cals the node-aware
>        slab functions with it.
>      - netdev_alloc_skb passed the node number it gets from dev_to_node
>        to it, everyone else passes -1 (any node)
>
>    Signed-off-by: Christoph Hellwig <hch@lst.de>
>    Cc: Christoph Lameter <clameter@engr.sgi.com>
>    Cc: "David S. Miller" <davem@davemloft.net>
>    Signed-off-by: Andrew Morton <akpm@osdl.org>
>
>
> Apparently, only Christoph and Andrew signed it.
>
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>

^ permalink raw reply	[flat|nested] 40+ messages in thread

* [PATCH net-next] net:  allocate skbs on local node
  2010-10-11 23:22 ` Eric Dumazet
  2010-10-12  5:03   ` Tom Herbert
@ 2010-10-12  5:05   ` Eric Dumazet
  2010-10-12  5:35     ` Tom Herbert
                       ` (2 more replies)
  1 sibling, 3 replies; 40+ messages in thread
From: Eric Dumazet @ 2010-10-12  5:05 UTC (permalink / raw)
  To: David Miller
  Cc: netdev, Michael Chan, Eilon Greenstein, Andrew Morton,
	Christoph Hellwig, Christoph Lameter

Le mardi 12 octobre 2010 à 01:22 +0200, Eric Dumazet a écrit :
> Le mardi 12 octobre 2010 à 01:03 +0200, Eric Dumazet a écrit :
> > 
> > For multi queue devices, it makes more sense to allocate skb on local
> > node of the cpu handling RX interrupts. This allow each cpu to
> > manipulate its own slub/slab queues/structures without doing expensive
> > cross-node business.
> > 
> > For non multi queue devices, IRQ affinity should be set so that a cpu
> > close to the device services interrupts. Even if not set, using
> > dev_alloc_skb() is faster.
> > 
> > Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
> 
> Or maybe revert :
> 
> commit b30973f877fea1a3fb84e05599890fcc082a88e5
> Author: Christoph Hellwig <hch@lst.de>
> Date:   Wed Dec 6 20:32:36 2006 -0800
> 
>     [PATCH] node-aware skb allocation
>     
>     Node-aware allocation of skbs for the receive path.
>     
>     Details:
>     
>       - __alloc_skb gets a new node argument and cals the node-aware
>         slab functions with it.
>       - netdev_alloc_skb passed the node number it gets from dev_to_node
>         to it, everyone else passes -1 (any node)
>     
>     Signed-off-by: Christoph Hellwig <hch@lst.de>
>     Cc: Christoph Lameter <clameter@engr.sgi.com>
>     Cc: "David S. Miller" <davem@davemloft.net>
>     Signed-off-by: Andrew Morton <akpm@osdl.org>
> 
> 
> Apparently, only Christoph and Andrew signed it.
> 
> 

[PATCH net-next] net: allocate skbs on local node

commit b30973f877 (node-aware skb allocation) spread a wrong habit of
allocating net drivers skbs on a given memory node : The one closest to
the NIC hardware. This is wrong because as soon as we try to scale
network stack, we need to use many cpus to handle traffic and hit
slub/slab management on cross-node allocations/frees when these cpus
have to alloc/free skbs bound to a central node.

skb allocated in RX path are ephemeral, they have a very short
lifetime : Extra cost to maintain NUMA affinity is too expensive. What
appeared as a nice idea four years ago is in fact a bad one.

In 2010, NIC hardwares are multiqueue, or we use RPS to spread the load,
and two 10Gb NIC might deliver more than 28 million packets per second,
needing all the available cpus.

Cost of cross-node handling in network and vm stacks outperforms the
small benefit hardware had when doing its DMA transfert in its 'local'
memory node at RX time. Even trying to differentiate the two allocations
done for one skb (the sk_buff on local node, the data part on NIC
hardware node) is not enough to bring good performance.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
 include/linux/skbuff.h |   20 ++++++++++++++++----
 net/core/skbuff.c      |   13 +------------
 2 files changed, 17 insertions(+), 16 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 0b53c43..05a358f 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -496,13 +496,13 @@ extern struct sk_buff *__alloc_skb(unsigned int size,
 static inline struct sk_buff *alloc_skb(unsigned int size,
 					gfp_t priority)
 {
-	return __alloc_skb(size, priority, 0, -1);
+	return __alloc_skb(size, priority, 0, NUMA_NO_NODE);
 }
 
 static inline struct sk_buff *alloc_skb_fclone(unsigned int size,
 					       gfp_t priority)
 {
-	return __alloc_skb(size, priority, 1, -1);
+	return __alloc_skb(size, priority, 1, NUMA_NO_NODE);
 }
 
 extern bool skb_recycle_check(struct sk_buff *skb, int skb_size);
@@ -1563,13 +1563,25 @@ static inline struct sk_buff *netdev_alloc_skb_ip_align(struct net_device *dev,
 	return skb;
 }
 
-extern struct page *__netdev_alloc_page(struct net_device *dev, gfp_t gfp_mask);
+/**
+ *	__netdev_alloc_page - allocate a page for ps-rx on a specific device
+ *	@dev: network device to receive on
+ *	@gfp_mask: alloc_pages_node mask
+ *
+ * 	Allocate a new page. dev currently unused.
+ *
+ * 	%NULL is returned if there is no free memory.
+ */
+static inline struct page *__netdev_alloc_page(struct net_device *dev, gfp_t gfp_mask)
+{
+	return alloc_pages_node(NUMA_NO_NODE, gfp_mask, 0);
+}
 
 /**
  *	netdev_alloc_page - allocate a page for ps-rx on a specific device
  *	@dev: network device to receive on
  *
- * 	Allocate a new page node local to the specified device.
+ * 	Allocate a new page. dev currently unused.
  *
  * 	%NULL is returned if there is no free memory.
  */
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 752c197..4e8b82e 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -247,10 +247,9 @@ EXPORT_SYMBOL(__alloc_skb);
 struct sk_buff *__netdev_alloc_skb(struct net_device *dev,
 		unsigned int length, gfp_t gfp_mask)
 {
-	int node = dev->dev.parent ? dev_to_node(dev->dev.parent) : -1;
 	struct sk_buff *skb;
 
-	skb = __alloc_skb(length + NET_SKB_PAD, gfp_mask, 0, node);
+	skb = __alloc_skb(length + NET_SKB_PAD, gfp_mask, 0, NUMA_NO_NODE);
 	if (likely(skb)) {
 		skb_reserve(skb, NET_SKB_PAD);
 		skb->dev = dev;
@@ -259,16 +258,6 @@ struct sk_buff *__netdev_alloc_skb(struct net_device *dev,
 }
 EXPORT_SYMBOL(__netdev_alloc_skb);
 
-struct page *__netdev_alloc_page(struct net_device *dev, gfp_t gfp_mask)
-{
-	int node = dev->dev.parent ? dev_to_node(dev->dev.parent) : -1;
-	struct page *page;
-
-	page = alloc_pages_node(node, gfp_mask, 0);
-	return page;
-}
-EXPORT_SYMBOL(__netdev_alloc_page);
-
 void skb_add_rx_frag(struct sk_buff *skb, int i, struct page *page, int off,
 		int size)
 {



^ permalink raw reply related	[flat|nested] 40+ messages in thread

* Re: [PATCH net-next] bnx2x: dont use netdev_alloc_skb()
  2010-10-12  5:03   ` Tom Herbert
@ 2010-10-12  5:16     ` Eric Dumazet
  2010-10-12  9:12       ` Vladislav Zolotarov
  0 siblings, 1 reply; 40+ messages in thread
From: Eric Dumazet @ 2010-10-12  5:16 UTC (permalink / raw)
  To: Tom Herbert; +Cc: David Miller, netdev, Michael Chan, Eilon Greenstein

Le lundi 11 octobre 2010 à 22:03 -0700, Tom Herbert a écrit :
> On Mon, Oct 11, 2010 at 4:22 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > Le mardi 12 octobre 2010 à 01:03 +0200, Eric Dumazet a écrit :
> >> netdev_alloc_skb() is a very wrong interface, really.
> >>
> >> We should remove/deprecate it.
> >>
> >> For multi queue devices, it makes more sense to allocate skb on local
> >> node of the cpu handling RX interrupts. This allow each cpu to
> >> manipulate its own slub/slab queues/structures without doing expensive
> >> cross-node business.
> >>
> >> For non multi queue devices, IRQ affinity should be set so that a cpu
> >> close to the device services interrupts. Even if not set, using
> >> dev_alloc_skb() is faster.
> >>
> >> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
> >
> > Or maybe revert :
> >
> > commit b30973f877fea1a3fb84e05599890fcc082a88e5
> > Author: Christoph Hellwig <hch@lst.de>
> > Date:   Wed Dec 6 20:32:36 2006 -0800
> >
> 
> I second this revert.  Node aware allocation by device's node makes
> little sense on a multi-queue device and leads to mediocre
> performance.

Yes, I said this several time in the past, I believe time has come to
get rid of it.

I posted a patch some minutes ago, so you can review it and ack it ;)

Thanks !



^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH net-next] net: allocate skbs on local node
  2010-10-12  5:05   ` [PATCH net-next] net: allocate skbs on local node Eric Dumazet
@ 2010-10-12  5:35     ` Tom Herbert
  2010-10-12  6:03     ` Andrew Morton
  2010-10-16 18:54     ` David Miller
  2 siblings, 0 replies; 40+ messages in thread
From: Tom Herbert @ 2010-10-12  5:35 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, netdev, Michael Chan, Eilon Greenstein,
	Andrew Morton, Christoph Hellwig, Christoph Lameter

Acked-by: Tom Herbert <therbert@google.com>

On Mon, Oct 11, 2010 at 10:05 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> Le mardi 12 octobre 2010 à 01:22 +0200, Eric Dumazet a écrit :
>> Le mardi 12 octobre 2010 à 01:03 +0200, Eric Dumazet a écrit :
>> >
>> > For multi queue devices, it makes more sense to allocate skb on local
>> > node of the cpu handling RX interrupts. This allow each cpu to
>> > manipulate its own slub/slab queues/structures without doing expensive
>> > cross-node business.
>> >
>> > For non multi queue devices, IRQ affinity should be set so that a cpu
>> > close to the device services interrupts. Even if not set, using
>> > dev_alloc_skb() is faster.
>> >
>> > Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
>>
>> Or maybe revert :
>>
>> commit b30973f877fea1a3fb84e05599890fcc082a88e5
>> Author: Christoph Hellwig <hch@lst.de>
>> Date:   Wed Dec 6 20:32:36 2006 -0800
>>
>>     [PATCH] node-aware skb allocation
>>
>>     Node-aware allocation of skbs for the receive path.
>>
>>     Details:
>>
>>       - __alloc_skb gets a new node argument and cals the node-aware
>>         slab functions with it.
>>       - netdev_alloc_skb passed the node number it gets from dev_to_node
>>         to it, everyone else passes -1 (any node)
>>
>>     Signed-off-by: Christoph Hellwig <hch@lst.de>
>>     Cc: Christoph Lameter <clameter@engr.sgi.com>
>>     Cc: "David S. Miller" <davem@davemloft.net>
>>     Signed-off-by: Andrew Morton <akpm@osdl.org>
>>
>>
>> Apparently, only Christoph and Andrew signed it.
>>
>>
>
> [PATCH net-next] net: allocate skbs on local node
>
> commit b30973f877 (node-aware skb allocation) spread a wrong habit of
> allocating net drivers skbs on a given memory node : The one closest to
> the NIC hardware. This is wrong because as soon as we try to scale
> network stack, we need to use many cpus to handle traffic and hit
> slub/slab management on cross-node allocations/frees when these cpus
> have to alloc/free skbs bound to a central node.
>
> skb allocated in RX path are ephemeral, they have a very short
> lifetime : Extra cost to maintain NUMA affinity is too expensive. What
> appeared as a nice idea four years ago is in fact a bad one.
>
> In 2010, NIC hardwares are multiqueue, or we use RPS to spread the load,
> and two 10Gb NIC might deliver more than 28 million packets per second,
> needing all the available cpus.
>
> Cost of cross-node handling in network and vm stacks outperforms the
> small benefit hardware had when doing its DMA transfert in its 'local'
> memory node at RX time. Even trying to differentiate the two allocations
> done for one skb (the sk_buff on local node, the data part on NIC
> hardware node) is not enough to bring good performance.
>
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
> ---
>  include/linux/skbuff.h |   20 ++++++++++++++++----
>  net/core/skbuff.c      |   13 +------------
>  2 files changed, 17 insertions(+), 16 deletions(-)
>
> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> index 0b53c43..05a358f 100644
> --- a/include/linux/skbuff.h
> +++ b/include/linux/skbuff.h
> @@ -496,13 +496,13 @@ extern struct sk_buff *__alloc_skb(unsigned int size,
>  static inline struct sk_buff *alloc_skb(unsigned int size,
>                                        gfp_t priority)
>  {
> -       return __alloc_skb(size, priority, 0, -1);
> +       return __alloc_skb(size, priority, 0, NUMA_NO_NODE);
>  }
>
>  static inline struct sk_buff *alloc_skb_fclone(unsigned int size,
>                                               gfp_t priority)
>  {
> -       return __alloc_skb(size, priority, 1, -1);
> +       return __alloc_skb(size, priority, 1, NUMA_NO_NODE);
>  }
>
>  extern bool skb_recycle_check(struct sk_buff *skb, int skb_size);
> @@ -1563,13 +1563,25 @@ static inline struct sk_buff *netdev_alloc_skb_ip_align(struct net_device *dev,
>        return skb;
>  }
>
> -extern struct page *__netdev_alloc_page(struct net_device *dev, gfp_t gfp_mask);
> +/**
> + *     __netdev_alloc_page - allocate a page for ps-rx on a specific device
> + *     @dev: network device to receive on
> + *     @gfp_mask: alloc_pages_node mask
> + *
> + *     Allocate a new page. dev currently unused.
> + *
> + *     %NULL is returned if there is no free memory.
> + */
> +static inline struct page *__netdev_alloc_page(struct net_device *dev, gfp_t gfp_mask)
> +{
> +       return alloc_pages_node(NUMA_NO_NODE, gfp_mask, 0);
> +}
>
>  /**
>  *     netdev_alloc_page - allocate a page for ps-rx on a specific device
>  *     @dev: network device to receive on
>  *
> - *     Allocate a new page node local to the specified device.
> + *     Allocate a new page. dev currently unused.
>  *
>  *     %NULL is returned if there is no free memory.
>  */
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index 752c197..4e8b82e 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -247,10 +247,9 @@ EXPORT_SYMBOL(__alloc_skb);
>  struct sk_buff *__netdev_alloc_skb(struct net_device *dev,
>                unsigned int length, gfp_t gfp_mask)
>  {
> -       int node = dev->dev.parent ? dev_to_node(dev->dev.parent) : -1;
>        struct sk_buff *skb;
>
> -       skb = __alloc_skb(length + NET_SKB_PAD, gfp_mask, 0, node);
> +       skb = __alloc_skb(length + NET_SKB_PAD, gfp_mask, 0, NUMA_NO_NODE);
>        if (likely(skb)) {
>                skb_reserve(skb, NET_SKB_PAD);
>                skb->dev = dev;
> @@ -259,16 +258,6 @@ struct sk_buff *__netdev_alloc_skb(struct net_device *dev,
>  }
>  EXPORT_SYMBOL(__netdev_alloc_skb);
>
> -struct page *__netdev_alloc_page(struct net_device *dev, gfp_t gfp_mask)
> -{
> -       int node = dev->dev.parent ? dev_to_node(dev->dev.parent) : -1;
> -       struct page *page;
> -
> -       page = alloc_pages_node(node, gfp_mask, 0);
> -       return page;
> -}
> -EXPORT_SYMBOL(__netdev_alloc_page);
> -
>  void skb_add_rx_frag(struct sk_buff *skb, int i, struct page *page, int off,
>                int size)
>  {
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH net-next] net:  allocate skbs on local node
  2010-10-12  5:05   ` [PATCH net-next] net: allocate skbs on local node Eric Dumazet
  2010-10-12  5:35     ` Tom Herbert
@ 2010-10-12  6:03     ` Andrew Morton
  2010-10-12  6:58       ` Eric Dumazet
  2010-10-14 15:31       ` Tom Herbert
  2010-10-16 18:54     ` David Miller
  2 siblings, 2 replies; 40+ messages in thread
From: Andrew Morton @ 2010-10-12  6:03 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, netdev, Michael Chan, Eilon Greenstein,
	Christoph Hellwig, Christoph Lameter

On Tue, 12 Oct 2010 07:05:25 +0200 Eric Dumazet <eric.dumazet@gmail.com> wrote:

> Le mardi 12 octobre 2010 __ 01:22 +0200, Eric Dumazet a __crit :
> > Le mardi 12 octobre 2010 __ 01:03 +0200, Eric Dumazet a __crit :
> > > 
> > > For multi queue devices, it makes more sense to allocate skb on local
> > > node of the cpu handling RX interrupts. This allow each cpu to
> > > manipulate its own slub/slab queues/structures without doing expensive
> > > cross-node business.
> > > 
> > > For non multi queue devices, IRQ affinity should be set so that a cpu
> > > close to the device services interrupts. Even if not set, using
> > > dev_alloc_skb() is faster.
> > > 
> > > Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
> > 
> > Or maybe revert :
> > 
> > commit b30973f877fea1a3fb84e05599890fcc082a88e5
> > Author: Christoph Hellwig <hch@lst.de>
> > Date:   Wed Dec 6 20:32:36 2006 -0800
> > 
> >     [PATCH] node-aware skb allocation
> >     
> >     Node-aware allocation of skbs for the receive path.
> >     
> >     Details:
> >     
> >       - __alloc_skb gets a new node argument and cals the node-aware
> >         slab functions with it.
> >       - netdev_alloc_skb passed the node number it gets from dev_to_node
> >         to it, everyone else passes -1 (any node)
> >     
> >     Signed-off-by: Christoph Hellwig <hch@lst.de>
> >     Cc: Christoph Lameter <clameter@engr.sgi.com>
> >     Cc: "David S. Miller" <davem@davemloft.net>
> >     Signed-off-by: Andrew Morton <akpm@osdl.org>
> > 
> > 
> > Apparently, only Christoph and Andrew signed it.
> > 
> > 
> 
> [PATCH net-next] net: allocate skbs on local node
> 
> commit b30973f877 (node-aware skb allocation) spread a wrong habit of
> allocating net drivers skbs on a given memory node : The one closest to
> the NIC hardware. This is wrong because as soon as we try to scale
> network stack, we need to use many cpus to handle traffic and hit
> slub/slab management on cross-node allocations/frees when these cpus
> have to alloc/free skbs bound to a central node.
> 
> skb allocated in RX path are ephemeral, they have a very short
> lifetime : Extra cost to maintain NUMA affinity is too expensive. What
> appeared as a nice idea four years ago is in fact a bad one.
> 
> In 2010, NIC hardwares are multiqueue, or we use RPS to spread the load,
> and two 10Gb NIC might deliver more than 28 million packets per second,
> needing all the available cpus.
> 
> Cost of cross-node handling in network and vm stacks outperforms the
> small benefit hardware had when doing its DMA transfert in its 'local'
> memory node at RX time. Even trying to differentiate the two allocations
> done for one skb (the sk_buff on local node, the data part on NIC
> hardware node) is not enough to bring good performance.
> 

This is all conspicuously hand-wavy and unquantified.  (IOW: prove it!)

The mooted effects should be tested for on both slab and slub, I
suggest.  They're pretty different beasts.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH net-next] net:  allocate skbs on local node
  2010-10-12  6:03     ` Andrew Morton
@ 2010-10-12  6:58       ` Eric Dumazet
  2010-10-12  7:24         ` Andrew Morton
  2010-10-14 15:31       ` Tom Herbert
  1 sibling, 1 reply; 40+ messages in thread
From: Eric Dumazet @ 2010-10-12  6:58 UTC (permalink / raw)
  To: Andrew Morton
  Cc: David Miller, netdev, Michael Chan, Eilon Greenstein,
	Christoph Hellwig, Christoph Lameter

Le lundi 11 octobre 2010 à 23:03 -0700, Andrew Morton a écrit :
> On Tue, 12 Oct 2010 07:05:25 +0200 Eric Dumazet <eric.dumazet@gmail.com> wrote:

> > [PATCH net-next] net: allocate skbs on local node
> > 
> > commit b30973f877 (node-aware skb allocation) spread a wrong habit of
> > allocating net drivers skbs on a given memory node : The one closest to
> > the NIC hardware. This is wrong because as soon as we try to scale
> > network stack, we need to use many cpus to handle traffic and hit
> > slub/slab management on cross-node allocations/frees when these cpus
> > have to alloc/free skbs bound to a central node.
> > 
> > skb allocated in RX path are ephemeral, they have a very short
> > lifetime : Extra cost to maintain NUMA affinity is too expensive. What
> > appeared as a nice idea four years ago is in fact a bad one.
> > 
> > In 2010, NIC hardwares are multiqueue, or we use RPS to spread the load,
> > and two 10Gb NIC might deliver more than 28 million packets per second,
> > needing all the available cpus.
> > 
> > Cost of cross-node handling in network and vm stacks outperforms the
> > small benefit hardware had when doing its DMA transfert in its 'local'
> > memory node at RX time. Even trying to differentiate the two allocations
> > done for one skb (the sk_buff on local node, the data part on NIC
> > hardware node) is not enough to bring good performance.
> > 
> 
> This is all conspicuously hand-wavy and unquantified.  (IOW: prove it!)
> 

I would say, _you_ should prove that original patch was good. It seems
no network guy was really in the discussion ?

Just run a test on a bnx2x or ixgbe multiqueue 10Gb adapter, and see the
difference. Thats about a 40% slowdown on high packet rates, on a dual
socket machine (dual X5570  @2.93GHz). You can expect higher values on
four nodes (I dont have such hardware to do the test)


> The mooted effects should be tested for on both slab and slub, I
> suggest.  They're pretty different beasts.

SLAB is so slow on NUMA these days, you can forget it for good.

Its about 40% slower on some tests I did this week on net-next, to
speedup output (and routing) performance, so it was with normal (local)
allocations, not even cross-nodes ones.

Once you remove network bottlenecks, you badly hit contention on SLAB
and are forced to switch to SLUB ;)

Sending 160.000.000 udp frames on same neighbour/destination,
IP route cache disabled (to mimic DDOS on a router)
16 threads, 16 logical cpus. 32bit kernel (dual E5540  @ 2.53GHz)


(It takes more than 2 minutes with linux-2.6, so use net-next-2.6 if you
really want to get these numbers)

SLUB :

real	0m50.661s
user	0m15.973s
sys	11m42.548s


18348.00 21.4% dst_destroy             vmlinux
 5674.00  6.6% fib_table_lookup        vmlinux
 5563.00  6.5% dst_alloc               vmlinux
 5226.00  6.1% neigh_lookup            vmlinux
 3590.00  4.2% __ip_route_output_key   vmlinux
 2712.00  3.2% neigh_resolve_output    vmlinux
 2511.00  2.9% fib_semantic_match      vmlinux
 2488.00  2.9% ipv4_dst_destroy        vmlinux
 2206.00  2.6% __xfrm_lookup           vmlinux
 2119.00  2.5% memset                  vmlinux
 2015.00  2.4% __copy_from_user_ll     vmlinux
 1722.00  2.0% udp_sendmsg             vmlinux
 1679.00  2.0% __slab_free             vmlinux
 1152.00  1.3% ip_append_data          vmlinux
 1044.00  1.2% __alloc_skb             vmlinux
  952.00  1.1% kmem_cache_free         vmlinux
  942.00  1.1% udp_push_pending_frames vmlinux
  877.00  1.0% kfree                   vmlinux
  870.00  1.0% __call_rcu              vmlinux
  829.00  1.0% ip_push_pending_frames  vmlinux
  799.00  0.9% _raw_spin_lock_bh       vmlinux

SLAB:

real	1m10.771s
user	0m13.941s
sys	12m42.188s


22734.00 26.0% _raw_spin_lock          vmlinux
 8238.00  9.4% dst_destroy             vmlinux
 4393.00  5.0% fib_table_lookup        vmlinux
 3652.00  4.2% dst_alloc               vmlinux
 3335.00  3.8% neigh_lookup            vmlinux
 2444.00  2.8% memset                  vmlinux
 2443.00  2.8% __ip_route_output_key   vmlinux
 1916.00  2.2% fib_semantic_match      vmlinux
 1708.00  2.0% __copy_from_user_ll     vmlinux
 1669.00  1.9% __xfrm_lookup           vmlinux
 1642.00  1.9% free_block              vmlinux
 1554.00  1.8% neigh_resolve_output    vmlinux
 1388.00  1.6% ipv4_dst_destroy        vmlinux
 1335.00  1.5% udp_sendmsg             vmlinux
 1109.00  1.3% kmem_cache_free         vmlinux
 1007.00  1.2% __alloc_skb             vmlinux
 1004.00  1.1% kfree                   vmlinux
 1002.00  1.1% ip_append_data          vmlinux
  975.00  1.1% cache_grow              vmlinux
  936.00  1.1% ____cache_alloc_node    vmlinux
  925.00  1.1% udp_push_pending_frames vmlinux


All this raw_spin_lock overhead comes from SLAB.



^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH net-next] net:  allocate skbs on local node
  2010-10-12  6:58       ` Eric Dumazet
@ 2010-10-12  7:24         ` Andrew Morton
  2010-10-12  7:49           ` Eric Dumazet
  0 siblings, 1 reply; 40+ messages in thread
From: Andrew Morton @ 2010-10-12  7:24 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, netdev, Michael Chan, Eilon Greenstein,
	Christoph Hellwig, Christoph Lameter

On Tue, 12 Oct 2010 08:58:19 +0200 Eric Dumazet <eric.dumazet@gmail.com> wrote:

> Le lundi 11 octobre 2010 __ 23:03 -0700, Andrew Morton a __crit :
> > On Tue, 12 Oct 2010 07:05:25 +0200 Eric Dumazet <eric.dumazet@gmail.com> wrote:
> 
> > > [PATCH net-next] net: allocate skbs on local node
> > > 
> > > commit b30973f877 (node-aware skb allocation) spread a wrong habit of
> > > allocating net drivers skbs on a given memory node : The one closest to
> > > the NIC hardware. This is wrong because as soon as we try to scale
> > > network stack, we need to use many cpus to handle traffic and hit
> > > slub/slab management on cross-node allocations/frees when these cpus
> > > have to alloc/free skbs bound to a central node.
> > > 
> > > skb allocated in RX path are ephemeral, they have a very short
> > > lifetime : Extra cost to maintain NUMA affinity is too expensive. What
> > > appeared as a nice idea four years ago is in fact a bad one.
> > > 
> > > In 2010, NIC hardwares are multiqueue, or we use RPS to spread the load,
> > > and two 10Gb NIC might deliver more than 28 million packets per second,
> > > needing all the available cpus.
> > > 
> > > Cost of cross-node handling in network and vm stacks outperforms the
> > > small benefit hardware had when doing its DMA transfert in its 'local'
> > > memory node at RX time. Even trying to differentiate the two allocations
> > > done for one skb (the sk_buff on local node, the data part on NIC
> > > hardware node) is not enough to bring good performance.
> > > 
> > 
> > This is all conspicuously hand-wavy and unquantified.  (IOW: prove it!)
> > 
> 
> I would say, _you_ should prove that original patch was good. It seems
> no network guy was really in the discussion ?

Two wrongs and all that.  The 2006 patch has nothing to do with it,
apart from demonstrating the importance of including performance
measurements in a performance patch.

> Just run a test on a bnx2x or ixgbe multiqueue 10Gb adapter, and see the
> difference. Thats about a 40% slowdown on high packet rates, on a dual
> socket machine (dual X5570  @2.93GHz). You can expect higher values on
> four nodes (I dont have such hardware to do the test)

Like that.  Please flesh it out and stick it in the changelog.

> 
> > The mooted effects should be tested for on both slab and slub, I
> > suggest.  They're pretty different beasts.
> 
> SLAB is so slow on NUMA these days, you can forget it for good.

I'd love to forget it, but it's faster for some things (I forget
which).  Which is why it's still around.

And the ghastly thing about this is that you're forced to care about it
too because some people are, apparently, still using it.


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH net-next] net:  allocate skbs on local node
  2010-10-12  7:24         ` Andrew Morton
@ 2010-10-12  7:49           ` Eric Dumazet
  2010-10-12  7:58             ` Andrew Morton
  0 siblings, 1 reply; 40+ messages in thread
From: Eric Dumazet @ 2010-10-12  7:49 UTC (permalink / raw)
  To: Andrew Morton
  Cc: David Miller, netdev, Michael Chan, Eilon Greenstein,
	Christoph Hellwig, Christoph Lameter

Le mardi 12 octobre 2010 à 00:24 -0700, Andrew Morton a écrit :

> I'd love to forget it, but it's faster for some things (I forget
> which).  Which is why it's still around.

Yes, two years ago it was true on pathological/obscure cases.
Every time I did the comparison, SLUB won.
You asked me, I did yet another test this morning, and 40% is pretty
serious, I believe.

> 
> And the ghastly thing about this is that you're forced to care about it
> too because some people are, apparently, still using it.
> 

Yes, some people (in my company) still use linux 2.6.9 32bit on HP G6/G7
machines, I know...

I am not saying we should not care, but for any serious network workload
on NUMA arches, SLUB is the best, and seeing Christoph recent work, it
might even get better.

BTW, I believe all modern distros ship SLUB, dont they ?



^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH net-next] net:  allocate skbs on local node
  2010-10-12  7:49           ` Eric Dumazet
@ 2010-10-12  7:58             ` Andrew Morton
  2010-10-12 11:08               ` Pekka Enberg
  0 siblings, 1 reply; 40+ messages in thread
From: Andrew Morton @ 2010-10-12  7:58 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, netdev, Michael Chan, Eilon Greenstein,
	Christoph Hellwig, Christoph Lameter, Pekka Enberg

On Tue, 12 Oct 2010 09:49:53 +0200 Eric Dumazet <eric.dumazet@gmail.com> wrote:

> Le mardi 12 octobre 2010 à 00:24 -0700, Andrew Morton a écrit :
> 
> > I'd love to forget it, but it's faster for some things (I forget
> > which).  Which is why it's still around.
> 
> Yes, two years ago it was true on pathological/obscure cases.
> Every time I did the comparison, SLUB won.
> You asked me, I did yet another test this morning, and 40% is pretty
> serious, I believe.
> 
> > 
> > And the ghastly thing about this is that you're forced to care about it
> > too because some people are, apparently, still using it.
> > 
> 
> Yes, some people (in my company) still use linux 2.6.9 32bit on HP G6/G7
> machines, I know...
> 
> I am not saying we should not care, but for any serious network workload
> on NUMA arches, SLUB is the best, and seeing Christoph recent work, it
> might even get better.
> 
> BTW, I believe all modern distros ship SLUB, dont they ?
> 

Dunno.

Pekka, why haven't we deleted slab yet??

^ permalink raw reply	[flat|nested] 40+ messages in thread

* RE: [PATCH net-next] bnx2x: dont use netdev_alloc_skb()
  2010-10-12  5:16     ` Eric Dumazet
@ 2010-10-12  9:12       ` Vladislav Zolotarov
  2010-10-14 17:39         ` David Miller
  0 siblings, 1 reply; 40+ messages in thread
From: Vladislav Zolotarov @ 2010-10-12  9:12 UTC (permalink / raw)
  To: Eric Dumazet, Tom Herbert
  Cc: David Miller, netdev, Michael Chan, Eilon Greenstein



> -----Original Message-----
> From: netdev-owner@vger.kernel.org [mailto:netdev-
> owner@vger.kernel.org] On Behalf Of Eric Dumazet
> Sent: Tuesday, October 12, 2010 7:16 AM
> To: Tom Herbert
> Cc: David Miller; netdev; Michael Chan; Eilon Greenstein
> Subject: Re: [PATCH net-next] bnx2x: dont use netdev_alloc_skb()
> 
> Le lundi 11 octobre 2010 à 22:03 -0700, Tom Herbert a écrit :
> > On Mon, Oct 11, 2010 at 4:22 PM, Eric Dumazet
> <eric.dumazet@gmail.com> wrote:
> > > Le mardi 12 octobre 2010 à 01:03 +0200, Eric Dumazet a écrit :
> > >> netdev_alloc_skb() is a very wrong interface, really.
> > >>
> > >> We should remove/deprecate it.
> > >>
> > >> For multi queue devices, it makes more sense to allocate skb on
> local
> > >> node of the cpu handling RX interrupts. This allow each cpu to
> > >> manipulate its own slub/slab queues/structures without doing
> expensive
> > >> cross-node business.
> > >>
> > >> For non multi queue devices, IRQ affinity should be set so that a
> cpu
> > >> close to the device services interrupts. Even if not set, using
> > >> dev_alloc_skb() is faster.
> > >>
> > >> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
> > >
> > > Or maybe revert :
> > >
> > > commit b30973f877fea1a3fb84e05599890fcc082a88e5
> > > Author: Christoph Hellwig <hch@lst.de>
> > > Date:   Wed Dec 6 20:32:36 2006 -0800
> > >
> >
> > I second this revert.  Node aware allocation by device's node makes
> > little sense on a multi-queue device and leads to mediocre
> > performance.
> 
> Yes, I said this several time in the past, I believe time has come to
> get rid of it.
> 
> I posted a patch some minutes ago, so you can review it and ack it ;)
> 
> Thanks !

Eric, very nice patch, thanks. However Eilon is on vacation till tomorrow
and he is a responsible maintainer of the bnx2x, thus the final ACK should
be his.

Dave, if it can wait till tomorrow, I'd like to ask u to wait for Eilon's
final ACK.

Thanks,
vlad

> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH net-next] net:  allocate skbs on local node
  2010-10-12  7:58             ` Andrew Morton
@ 2010-10-12 11:08               ` Pekka Enberg
  2010-10-12 12:50                 ` Christoph Lameter
  0 siblings, 1 reply; 40+ messages in thread
From: Pekka Enberg @ 2010-10-12 11:08 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Eric Dumazet, David Miller, netdev, Michael Chan,
	Eilon Greenstein, Christoph Hellwig, Christoph Lameter,
	David Rientjes, LKML, Nick Piggin

On 10/12/10 10:58 AM, Andrew Morton wrote:
> On Tue, 12 Oct 2010 09:49:53 +0200 Eric Dumazet<eric.dumazet@gmail.com>  wrote:
>
>> Le mardi 12 octobre 2010 à 00:24 -0700, Andrew Morton a écrit :
>>
>>> I'd love to forget it, but it's faster for some things (I forget
>>> which).  Which is why it's still around.
>>
>> Yes, two years ago it was true on pathological/obscure cases.
>> Every time I did the comparison, SLUB won.
>> You asked me, I did yet another test this morning, and 40% is pretty
>> serious, I believe.
>>
>>> And the ghastly thing about this is that you're forced to care about it
>>> too because some people are, apparently, still using it.
>>
>> Yes, some people (in my company) still use linux 2.6.9 32bit on HP G6/G7
>> machines, I know...
>>
>> I am not saying we should not care, but for any serious network workload
>> on NUMA arches, SLUB is the best, and seeing Christoph recent work, it
>> might even get better.
>>
>> BTW, I believe all modern distros ship SLUB, dont they ?
>
> Dunno.
>
> Pekka, why haven't we deleted slab yet??

To make a long story short, we still have relevant performance 
regressions that need to be taken care of. The most interesting one is a 
regression in netperf TCP_RR that's been reported by David Rientjes a 
while back. There's bunch of SLUB cleanups queued for 2.6.37 that pave 
the way for Christoph's SLUB queueing work that should hopefully fix 
that particular issue for 2.6.38.

There's little point in discussing the removal of SLAB as long as there 
are performance regressions for real workloads from people who are 
willing to share results and test patches. I'm optimistic that we'll be 
able to try removing SLAB some time next year unless something 
interesting pops up...

			Pekka

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH net-next] net:  allocate skbs on local node
  2010-10-12 11:08               ` Pekka Enberg
@ 2010-10-12 12:50                 ` Christoph Lameter
  2010-10-12 19:43                   ` David Rientjes
  0 siblings, 1 reply; 40+ messages in thread
From: Christoph Lameter @ 2010-10-12 12:50 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: Andrew Morton, Eric Dumazet, David Miller, netdev, Michael Chan,
	Eilon Greenstein, Christoph Hellwig, David Rientjes, LKML,
	Nick Piggin

On Tue, 12 Oct 2010, Pekka Enberg wrote:

> There's little point in discussing the removal of SLAB as long as there are
> performance regressions for real workloads from people who are willing to
> share results and test patches. I'm optimistic that we'll be able to try
> removing SLAB some time next year unless something interesting pops up...

Hmmm. Given these effects I think we should be more cautious regarding the
unification work. May be the "unified allocator" should replace SLAB
instead and SLUB can stay unchanged? The unification patches go back to
the one lock per node SLAB thing because the queue maintenance overhead is
otherwise causing large regressions in hackbench because of lots of atomic
ops. The per node lock seem to be causing problems here in the network
stack,. Take the unified as a SLAB cleanup instead? Then at least we have
a large common code base and just differentiate through the locking
mechanism?


^ permalink raw reply	[flat|nested] 40+ messages in thread

* [BUG net-next] bnx2x: all traffic comes to RX queue 0
  2010-10-11 23:03 [PATCH net-next] bnx2x: dont use netdev_alloc_skb() Eric Dumazet
  2010-10-11 23:22 ` Eric Dumazet
@ 2010-10-12 16:07 ` Eric Dumazet
  2010-10-12 16:20   ` Dmitry Kravkov
  1 sibling, 1 reply; 40+ messages in thread
From: Eric Dumazet @ 2010-10-12 16:07 UTC (permalink / raw)
  To: David Miller, Dmitry Kravkov, Vladislav Zolotarov, Yaniv Rosner
  Cc: netdev, Michael Chan, Eilon Greenstein

Hmm, while doing tests for the netdev_alloc_skb() problem, I found
current net-next tree is not really multi queue enabled...

ethtool -S eth1|grep _ucast
     [0]: rx_ucast_packets: 3507786
     [0]: tx_ucast_packets: 416925
     [1]: rx_ucast_packets: 0
     [1]: tx_ucast_packets: 4
     [2]: rx_ucast_packets: 0
     [2]: tx_ucast_packets: 397467
     [3]: rx_ucast_packets: 0
     [3]: tx_ucast_packets: 75832
     [4]: rx_ucast_packets: 0
     [4]: tx_ucast_packets: 171025
     [5]: rx_ucast_packets: 0
     [5]: tx_ucast_packets: 233025
     [6]: rx_ucast_packets: 0
     [6]: tx_ucast_packets: 250358
     [7]: rx_ucast_packets: 0
     [7]: tx_ucast_packets: 240792
     [8]: rx_ucast_packets: 0
     [8]: tx_ucast_packets: 216366
     [9]: rx_ucast_packets: 0
     [9]: tx_ucast_packets: 1
     [10]: rx_ucast_packets: 0
     [10]: tx_ucast_packets: 350324
     [11]: rx_ucast_packets: 0
     [11]: tx_ucast_packets: 92403
     [12]: rx_ucast_packets: 0
     [12]: tx_ucast_packets: 307678
     [13]: rx_ucast_packets: 0
     [13]: tx_ucast_packets: 314315
     [14]: rx_ucast_packets: 0
     [14]: tx_ucast_packets: 256767
     [15]: rx_ucast_packets: 0
     [15]: tx_ucast_packets: 185105
     rx_ucast_packets: 3507786
     tx_ucast_packets: 3508387

# ethtool -i eth1
driver: bnx2x
version: 1.60.00-1
firmware-version: bc 4.8.0 phy baa0.105
bus-info: 0000:02:00.1

02:00.1 Ethernet controller: Broadcom Corporation NetXtreme II BCM57711E
10Gigabit PCIe
	Subsystem: Hewlett-Packard Company NC532i Dual Port 10GbE Multifunction
BL-C Adapter
	Flags: bus master, fast devsel, latency 0, IRQ 47
	Memory at fa000000 (64-bit, non-prefetchable) [size=8M]
	Memory at f9800000 (64-bit, non-prefetchable) [size=8M]
	[virtual] Expansion ROM at e7010000 [disabled] [size=64K]
	Capabilities: [48] Power Management version 3
	Capabilities: [50] Vital Product Data
	Capabilities: [58] MSI: Enable- Count=1/8 Maskable- 64bit+
	Capabilities: [a0] MSI-X: Enable+ Count=17 Masked-
	Capabilities: [ac] Express Endpoint, MSI 00
	Capabilities: [100] Device Serial Number f4-ce-46-ff-fe-bb-32-d4
	Capabilities: [110] Advanced Error Reporting
	Capabilities: [150] Power Budgeting <?>
	Capabilities: [160] Virtual Channel <?>
	Kernel driver in use: bnx2x
	Kernel modules: bnx2x


Any idea before a biscection ?

Thanks !



^ permalink raw reply	[flat|nested] 40+ messages in thread

* RE: [BUG net-next] bnx2x: all traffic comes to RX queue 0
  2010-10-12 16:07 ` [BUG net-next] bnx2x: all traffic comes to RX queue 0 Eric Dumazet
@ 2010-10-12 16:20   ` Dmitry Kravkov
  2010-10-12 18:11     ` Eric Dumazet
  0 siblings, 1 reply; 40+ messages in thread
From: Dmitry Kravkov @ 2010-10-12 16:20 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: netdev, Michael Chan, Eilon Greenstein, David Miller,
	Vladislav Zolotarov, Yaniv Rosner

Eric,

I will take a look

Thanks

-----Original Message-----
From: Eric Dumazet [mailto:eric.dumazet@gmail.com] 
Sent: Tuesday, October 12, 2010 6:08 PM
To: David Miller; Dmitry Kravkov; Vladislav Zolotarov; Yaniv Rosner
Cc: netdev; Michael Chan; Eilon Greenstein
Subject: [BUG net-next] bnx2x: all traffic comes to RX queue 0

Hmm, while doing tests for the netdev_alloc_skb() problem, I found
current net-next tree is not really multi queue enabled...

ethtool -S eth1|grep _ucast
     [0]: rx_ucast_packets: 3507786
     [0]: tx_ucast_packets: 416925
     [1]: rx_ucast_packets: 0
     [1]: tx_ucast_packets: 4
     [2]: rx_ucast_packets: 0
     [2]: tx_ucast_packets: 397467
     [3]: rx_ucast_packets: 0
     [3]: tx_ucast_packets: 75832
     [4]: rx_ucast_packets: 0
     [4]: tx_ucast_packets: 171025
     [5]: rx_ucast_packets: 0
     [5]: tx_ucast_packets: 233025
     [6]: rx_ucast_packets: 0
     [6]: tx_ucast_packets: 250358
     [7]: rx_ucast_packets: 0
     [7]: tx_ucast_packets: 240792
     [8]: rx_ucast_packets: 0
     [8]: tx_ucast_packets: 216366
     [9]: rx_ucast_packets: 0
     [9]: tx_ucast_packets: 1
     [10]: rx_ucast_packets: 0
     [10]: tx_ucast_packets: 350324
     [11]: rx_ucast_packets: 0
     [11]: tx_ucast_packets: 92403
     [12]: rx_ucast_packets: 0
     [12]: tx_ucast_packets: 307678
     [13]: rx_ucast_packets: 0
     [13]: tx_ucast_packets: 314315
     [14]: rx_ucast_packets: 0
     [14]: tx_ucast_packets: 256767
     [15]: rx_ucast_packets: 0
     [15]: tx_ucast_packets: 185105
     rx_ucast_packets: 3507786
     tx_ucast_packets: 3508387

# ethtool -i eth1
driver: bnx2x
version: 1.60.00-1
firmware-version: bc 4.8.0 phy baa0.105
bus-info: 0000:02:00.1

02:00.1 Ethernet controller: Broadcom Corporation NetXtreme II BCM57711E
10Gigabit PCIe
	Subsystem: Hewlett-Packard Company NC532i Dual Port 10GbE Multifunction
BL-C Adapter
	Flags: bus master, fast devsel, latency 0, IRQ 47
	Memory at fa000000 (64-bit, non-prefetchable) [size=8M]
	Memory at f9800000 (64-bit, non-prefetchable) [size=8M]
	[virtual] Expansion ROM at e7010000 [disabled] [size=64K]
	Capabilities: [48] Power Management version 3
	Capabilities: [50] Vital Product Data
	Capabilities: [58] MSI: Enable- Count=1/8 Maskable- 64bit+
	Capabilities: [a0] MSI-X: Enable+ Count=17 Masked-
	Capabilities: [ac] Express Endpoint, MSI 00
	Capabilities: [100] Device Serial Number f4-ce-46-ff-fe-bb-32-d4
	Capabilities: [110] Advanced Error Reporting
	Capabilities: [150] Power Budgeting <?>
	Capabilities: [160] Virtual Channel <?>
	Kernel driver in use: bnx2x
	Kernel modules: bnx2x


Any idea before a biscection ?

Thanks !




^ permalink raw reply	[flat|nested] 40+ messages in thread

* RE: [BUG net-next] bnx2x: all traffic comes to RX queue 0
  2010-10-12 16:20   ` Dmitry Kravkov
@ 2010-10-12 18:11     ` Eric Dumazet
  2010-10-12 18:18       ` Vladislav Zolotarov
  0 siblings, 1 reply; 40+ messages in thread
From: Eric Dumazet @ 2010-10-12 18:11 UTC (permalink / raw)
  To: Dmitry Kravkov
  Cc: netdev, Michael Chan, Eilon Greenstein, David Miller,
	Vladislav Zolotarov, Yaniv Rosner

Le mardi 12 octobre 2010 à 09:20 -0700, Dmitry Kravkov a écrit :
> Eric,
> 
> I will take a look
> 

Thanks ! Here is the bisection result

# git bisect bad
523224a3b3cd407ce4e6731a087194e13a90db18 is the first bad commit
commit 523224a3b3cd407ce4e6731a087194e13a90db18
Author: Dmitry Kravkov <dmitry@broadcom.com>
Date:   Wed Oct 6 03:23:26 2010 +0000

    bnx2x, cnic, bnx2i: use new FW/HSI
    
    This is the new FW HSI blob and the relevant definitions without logic changes.
    It also included code adaptation for new HSI. New features are not enabled.
    
    New FW/HSI includes:
    - Support for 57712 HW
    - Future support for VF (not used)
    - Improvements in FW interrupts scheme
    - FW FCoE hooks (stubs for future usage)
    
    Signed-off-by: Dmitry Kravkov <dmitry@broadcom.com>
    Signed-off-by: Michael Chan <mchan@broadcom.com>
    Signed-off-by: Eilon Greenstein <eilong@broadcom.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

:040000 040000 8931a748ae89487466d99e6121a3965131e3a201 fed2be49565ca291203a865554a0e071a2a9c3fb M	drivers
:040000 040000 2ff6a5c290ca7e9dc339b7213f49c341f4fc5822 82c0a7117b58f538a8b49891c5524e3537c5ae66 M	firmware


# git bisect log
git bisect start '--' 'drivers/net/bnx2x'
# bad: [b9e9b15966e35f3b4a8d3820cac460505552f72c] Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6
git bisect bad b9e9b15966e35f3b4a8d3820cac460505552f72c
# good: [9fe6206f400646a2322096b56c59891d530e8d51] Linux 2.6.35
git bisect good 9fe6206f400646a2322096b56c59891d530e8d51
# good: [b7737c9be9d3e894d1a4375c52f5f47789475f26] bnx2x: Split PHY functions
git bisect good b7737c9be9d3e894d1a4375c52f5f47789475f26
# good: [8681dc3abd54e845a2effab441921b4c4457c241] bnx2x: Moved enabling of MSI to the bnx2x_set_num_queues()
git bisect good 8681dc3abd54e845a2effab441921b4c4457c241
# bad: [c518cd1c7aca2e2a65edcf4fa5ce4160d59878c2] bnx2x: remove unused fields in main driver structure
git bisect bad c518cd1c7aca2e2a65edcf4fa5ce4160d59878c2
# bad: [fb3bff178e722fe88b5ab02319c9636da0980e25] bnx2x: rename MF related fields
git bisect bad fb3bff178e722fe88b5ab02319c9636da0980e25
# good: [560131f313ea9b9439742138289b6f68d61531ec] bnx2x: create folder for bnx2x firmware files
git bisect good 560131f313ea9b9439742138289b6f68d61531ec
# bad: [523224a3b3cd407ce4e6731a087194e13a90db18] bnx2x, cnic, bnx2i: use new FW/HSI
git bisect bad 523224a3b3cd407ce4e6731a087194e13a90db18



^ permalink raw reply	[flat|nested] 40+ messages in thread

* RE: [BUG net-next] bnx2x: all traffic comes to RX queue 0
  2010-10-12 18:11     ` Eric Dumazet
@ 2010-10-12 18:18       ` Vladislav Zolotarov
  0 siblings, 0 replies; 40+ messages in thread
From: Vladislav Zolotarov @ 2010-10-12 18:18 UTC (permalink / raw)
  To: Eric Dumazet, Dmitry Kravkov
  Cc: netdev, Michael Chan, Eilon Greenstein, David Miller, Yaniv Rosner

Thanks, Eric. Dima has already found a problem and we are running 
a regression to verify nothing else is broken after the RSS is 
brought back to life.

If nothing goes wrong Dima will send a patch in an hour or so...

Thanks again for the catch.
vlad

> -----Original Message-----
> From: netdev-owner@vger.kernel.org [mailto:netdev-
> owner@vger.kernel.org] On Behalf Of Eric Dumazet
> Sent: Tuesday, October 12, 2010 8:12 PM
> To: Dmitry Kravkov
> Cc: netdev; Michael Chan; Eilon Greenstein; David Miller; Vladislav
> Zolotarov; Yaniv Rosner
> Subject: RE: [BUG net-next] bnx2x: all traffic comes to RX queue 0
> 
> Le mardi 12 octobre 2010 à 09:20 -0700, Dmitry Kravkov a écrit :
> > Eric,
> >
> > I will take a look
> >
> 
> Thanks ! Here is the bisection result
> 
> # git bisect bad
> 523224a3b3cd407ce4e6731a087194e13a90db18 is the first bad commit
> commit 523224a3b3cd407ce4e6731a087194e13a90db18
> Author: Dmitry Kravkov <dmitry@broadcom.com>
> Date:   Wed Oct 6 03:23:26 2010 +0000
> 
>     bnx2x, cnic, bnx2i: use new FW/HSI
> 
>     This is the new FW HSI blob and the relevant definitions without
> logic changes.
>     It also included code adaptation for new HSI. New features are not
> enabled.
> 
>     New FW/HSI includes:
>     - Support for 57712 HW
>     - Future support for VF (not used)
>     - Improvements in FW interrupts scheme
>     - FW FCoE hooks (stubs for future usage)
> 
>     Signed-off-by: Dmitry Kravkov <dmitry@broadcom.com>
>     Signed-off-by: Michael Chan <mchan@broadcom.com>
>     Signed-off-by: Eilon Greenstein <eilong@broadcom.com>
>     Signed-off-by: David S. Miller <davem@davemloft.net>
> 
> :040000 040000 8931a748ae89487466d99e6121a3965131e3a201
> fed2be49565ca291203a865554a0e071a2a9c3fb M	drivers
> :040000 040000 2ff6a5c290ca7e9dc339b7213f49c341f4fc5822
> 82c0a7117b58f538a8b49891c5524e3537c5ae66 M	firmware
> 
> 
> # git bisect log
> git bisect start '--' 'drivers/net/bnx2x'
> # bad: [b9e9b15966e35f3b4a8d3820cac460505552f72c] Merge branch 'master'
> of git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6
> git bisect bad b9e9b15966e35f3b4a8d3820cac460505552f72c
> # good: [9fe6206f400646a2322096b56c59891d530e8d51] Linux 2.6.35
> git bisect good 9fe6206f400646a2322096b56c59891d530e8d51
> # good: [b7737c9be9d3e894d1a4375c52f5f47789475f26] bnx2x: Split PHY
> functions
> git bisect good b7737c9be9d3e894d1a4375c52f5f47789475f26
> # good: [8681dc3abd54e845a2effab441921b4c4457c241] bnx2x: Moved
> enabling of MSI to the bnx2x_set_num_queues()
> git bisect good 8681dc3abd54e845a2effab441921b4c4457c241
> # bad: [c518cd1c7aca2e2a65edcf4fa5ce4160d59878c2] bnx2x: remove unused
> fields in main driver structure
> git bisect bad c518cd1c7aca2e2a65edcf4fa5ce4160d59878c2
> # bad: [fb3bff178e722fe88b5ab02319c9636da0980e25] bnx2x: rename MF
> related fields
> git bisect bad fb3bff178e722fe88b5ab02319c9636da0980e25
> # good: [560131f313ea9b9439742138289b6f68d61531ec] bnx2x: create folder
> for bnx2x firmware files
> git bisect good 560131f313ea9b9439742138289b6f68d61531ec
> # bad: [523224a3b3cd407ce4e6731a087194e13a90db18] bnx2x, cnic, bnx2i:
> use new FW/HSI
> git bisect bad 523224a3b3cd407ce4e6731a087194e13a90db18
> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH net-next] net:  allocate skbs on local node
  2010-10-12 12:50                 ` Christoph Lameter
@ 2010-10-12 19:43                   ` David Rientjes
  2010-10-13  6:17                     ` Pekka Enberg
  2010-10-13 16:00                     ` Christoph Lameter
  0 siblings, 2 replies; 40+ messages in thread
From: David Rientjes @ 2010-10-12 19:43 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Pekka Enberg, Andrew Morton, Eric Dumazet, David Miller, netdev,
	Michael Chan, Eilon Greenstein, Christoph Hellwig, LKML,
	Nick Piggin

On Tue, 12 Oct 2010, Christoph Lameter wrote:

> Hmmm. Given these effects I think we should be more cautious regarding the
> unification work. May be the "unified allocator" should replace SLAB
> instead and SLUB can stay unchanged?

Linus has said that he refuses to merge another allocator until one is 
removed or replaced, so that would force the unificiation patches to go 
into slab instead if you want to leave slub untouched.

> The unification patches go back to
> the one lock per node SLAB thing because the queue maintenance overhead is
> otherwise causing large regressions in hackbench because of lots of atomic
> ops. The per node lock seem to be causing problems here in the network
> stack,.

The TCP_RR regression on slub is because of what I described a couple 
years ago as "slab thrashing" where cpu slabs would be filled with 
allocations, then frees would occur to move those slabs from the full to 
partial list with only a few free objects, those partial slabs would 
quickly become full, etc.  Performance gets better if you change the 
per-node lock to a trylock when iterating the partial list and preallocate 
and have a substantially longer partial list than normal (and it still 
didn't rival slab's performance), so I don't think it's only a per-node 
lock that's the issue , it's all the slowpath overhead of swapping the cpu 
slab out for another slab.  The TCP_RR load would show slub stats that 
indicate certain caches, kmalloc-256 and kmalloc-2048, would have ~98% of 
allocations coming from the slowpath.

This gets better if you allocate higher order slabs (and kmalloc-2048 is 
already order-3 by default) but then allocating new slabs gets really slow 
if not impossible on smaller machines.  The overhead of even compaction 
will kill us.

> Take the unified as a SLAB cleanup instead? Then at least we have
> a large common code base and just differentiate through the locking
> mechanism?
> 

Will you be adding the extensive slub debugging to slab then?  It would be 
a shame to lose it because one allocator is chosen over another for 
performance reasons and then we need to recompile to debug issues as they 
arise.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH net-next] net:  allocate skbs on local node
  2010-10-12 19:43                   ` David Rientjes
@ 2010-10-13  6:17                     ` Pekka Enberg
  2010-10-13  6:31                       ` David Rientjes
  2010-10-13 16:00                     ` Christoph Lameter
  1 sibling, 1 reply; 40+ messages in thread
From: Pekka Enberg @ 2010-10-13  6:17 UTC (permalink / raw)
  To: David Rientjes
  Cc: Christoph Lameter, Andrew Morton, Eric Dumazet, David Miller,
	netdev, Michael Chan, Eilon Greenstein, Christoph Hellwig, LKML,
	Nick Piggin

On Tue, 12 Oct 2010, Christoph Lameter wrote:
>> Hmmm. Given these effects I think we should be more cautious regarding the
>> unification work. May be the "unified allocator" should replace SLAB
>> instead and SLUB can stay unchanged?

On 10/12/10 10:43 PM, David Rientjes wrote:
> Linus has said that he refuses to merge another allocator until one is
> removed or replaced, so that would force the unificiation patches to go
> into slab instead if you want to leave slub untouched.

Yes, and quite frankly, I'm not interested in introducing a new one either.

On Tue, 12 Oct 2010, Christoph Lameter wrote:
>> The unification patches go back to
>> the one lock per node SLAB thing because the queue maintenance overhead is
>> otherwise causing large regressions in hackbench because of lots of atomic
>> ops. The per node lock seem to be causing problems here in the network
>> stack,.

On Tue, 12 Oct 2010, Christoph Lameter wrote:
>> Take the unified as a SLAB cleanup instead? Then at least we have
>> a large common code base and just differentiate through the locking
>> mechanism?

On 10/12/10 10:43 PM, David Rientjes wrote:
> Will you be adding the extensive slub debugging to slab then?  It would be
> a shame to lose it because one allocator is chosen over another for
> performance reasons and then we need to recompile to debug issues as they
> arise.

I think Christoph is saying that we'd remove SLAB and make the unified 
allocator the new SLAB while keeping SLUB in place. In any case, yes, 
the debugging support in SLUB is something that we want to keep.

			Pekka

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH net-next] net:  allocate skbs on local node
  2010-10-13  6:17                     ` Pekka Enberg
@ 2010-10-13  6:31                       ` David Rientjes
  2010-10-13  6:36                         ` Pekka Enberg
  0 siblings, 1 reply; 40+ messages in thread
From: David Rientjes @ 2010-10-13  6:31 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: Christoph Lameter, Andrew Morton, Eric Dumazet, David Miller,
	netdev, Michael Chan, Eilon Greenstein, Christoph Hellwig, LKML,
	Nick Piggin

On Wed, 13 Oct 2010, Pekka Enberg wrote:

> I think Christoph is saying that we'd remove SLAB and make the unified
> allocator the new SLAB while keeping SLUB in place.

Right, so his unification patches would be against slab instead of slub 
which is a pretty major change from the current state of the patchset, 
although it might actually be smaller?

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH net-next] net:  allocate skbs on local node
  2010-10-13  6:31                       ` David Rientjes
@ 2010-10-13  6:36                         ` Pekka Enberg
  0 siblings, 0 replies; 40+ messages in thread
From: Pekka Enberg @ 2010-10-13  6:36 UTC (permalink / raw)
  To: David Rientjes
  Cc: Christoph Lameter, Andrew Morton, Eric Dumazet, David Miller,
	netdev, Michael Chan, Eilon Greenstein, Christoph Hellwig, LKML,
	Nick Piggin

  On Wed, 13 Oct 2010, Pekka Enberg wrote:
>> I think Christoph is saying that we'd remove SLAB and make the unified
>> allocator the new SLAB while keeping SLUB in place.

On 10/13/10 9:31 AM, David Rientjes wrote:
>> Right, so his unification patches would be against slab instead of slub
>> which is a pretty major change from the current state of the patchset,
>> although it might actually be smaller?

No, the way I understood it was:

   rm mm/slab.c
   cp mm/slub.c mm/slab.c
<apply patches on top of slab.c>

             Pekka

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH net-next] net:  allocate skbs on local node
  2010-10-12 19:43                   ` David Rientjes
  2010-10-13  6:17                     ` Pekka Enberg
@ 2010-10-13 16:00                     ` Christoph Lameter
  2010-10-13 20:48                       ` David Rientjes
  1 sibling, 1 reply; 40+ messages in thread
From: Christoph Lameter @ 2010-10-13 16:00 UTC (permalink / raw)
  To: David Rientjes
  Cc: Pekka Enberg, Andrew Morton, Eric Dumazet, David Miller, netdev,
	Michael Chan, Eilon Greenstein, Christoph Hellwig, LKML,
	Nick Piggin

On Tue, 12 Oct 2010, David Rientjes wrote:

> > Take the unified as a SLAB cleanup instead? Then at least we have
> > a large common code base and just differentiate through the locking
> > mechanism?
> >
>
> Will you be adding the extensive slub debugging to slab then?  It would be
> a shame to lose it because one allocator is chosen over another for
> performance reasons and then we need to recompile to debug issues as they
> arise.

Well basically we would copy SLUB to SLAB apply unification patches to
SLAB instead of SLUBB. We first have to make sure that the unified patches
have the same performance as SLAB.

It maybe much better to isolate the debug features and general bootstrap
from the particulars of the allocation strategy of either SLUB or SLAB.
That way a common code base exists and it would be easier to add different
allocation strategies.

Basically have slab.c with the basic functions and then slab_queueing.c
and slab_noqueue.c for SLAB/SLUB with the particulars of the allocation
strategy?



^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH net-next] net:  allocate skbs on local node
  2010-10-13 16:00                     ` Christoph Lameter
@ 2010-10-13 20:48                       ` David Rientjes
  2010-10-13 21:43                         ` Christoph Lameter
  0 siblings, 1 reply; 40+ messages in thread
From: David Rientjes @ 2010-10-13 20:48 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Pekka Enberg, Andrew Morton, Eric Dumazet, David Miller, netdev,
	Michael Chan, Eilon Greenstein, Christoph Hellwig, LKML,
	Nick Piggin

On Wed, 13 Oct 2010, Christoph Lameter wrote:

> > Will you be adding the extensive slub debugging to slab then?  It would be
> > a shame to lose it because one allocator is chosen over another for
> > performance reasons and then we need to recompile to debug issues as they
> > arise.
> 
> Well basically we would copy SLUB to SLAB apply unification patches to
> SLAB instead of SLUBB. We first have to make sure that the unified patches
> have the same performance as SLAB.
> 

I see, so all of the development will be done in Pekka's tree on mm/slub.c 
and then when we can see no performance regression compared to the slab 
baseline, merge it into Linus' tree as mm/slab.c.  I'm not exactly sure 
how that set of diffs being sent to Linus would look.

Are the changes to slub in the unification patchset so intrusive that it 
wouldn't be possible to isolate many of the features under #ifdef or 
boot-time options in a single, truly unified, allocator?  It seems like a 
shame that we'll have two allocators where the base is the same and much 
of the debugging code is the same.

> It maybe much better to isolate the debug features and general bootstrap
> from the particulars of the allocation strategy of either SLUB or SLAB.
> That way a common code base exists and it would be easier to add different
> allocation strategies.
> 
> Basically have slab.c with the basic functions and then slab_queueing.c
> and slab_noqueue.c for SLAB/SLUB with the particulars of the allocation
> strategy?
> 

I was going to mention that as an idea, but I thought storing the metadata 
for certain debugging features might differ from the two allocators so 
substantially that it would be even more convoluted and difficult to 
maintain?

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH net-next] net:  allocate skbs on local node
  2010-10-13 20:48                       ` David Rientjes
@ 2010-10-13 21:43                         ` Christoph Lameter
  2010-10-13 22:41                           ` David Rientjes
  0 siblings, 1 reply; 40+ messages in thread
From: Christoph Lameter @ 2010-10-13 21:43 UTC (permalink / raw)
  To: David Rientjes
  Cc: Pekka Enberg, Andrew Morton, Eric Dumazet, David Miller, netdev,
	Michael Chan, Eilon Greenstein, Christoph Hellwig, LKML,
	Nick Piggin

On Wed, 13 Oct 2010, David Rientjes wrote:

> > Basically have slab.c with the basic functions and then slab_queueing.c
> > and slab_noqueue.c for SLAB/SLUB with the particulars of the allocation
> > strategy?
> >
>
> I was going to mention that as an idea, but I thought storing the metadata
> for certain debugging features might differ from the two allocators so
> substantially that it would be even more convoluted and difficult to
> maintain?

We could have some callbacks to store allocator specific metadata?


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH net-next] net:  allocate skbs on local node
  2010-10-13 21:43                         ` Christoph Lameter
@ 2010-10-13 22:41                           ` David Rientjes
  2010-10-14  6:22                             ` Pekka Enberg
  2010-10-15 14:23                             ` Christoph Lameter
  0 siblings, 2 replies; 40+ messages in thread
From: David Rientjes @ 2010-10-13 22:41 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Pekka Enberg, Andrew Morton, Eric Dumazet, David Miller, netdev,
	Michael Chan, Eilon Greenstein, Christoph Hellwig, LKML,
	Nick Piggin

On Wed, 13 Oct 2010, Christoph Lameter wrote:

> > I was going to mention that as an idea, but I thought storing the metadata
> > for certain debugging features might differ from the two allocators so
> > substantially that it would be even more convoluted and difficult to
> > maintain?
> 
> We could have some callbacks to store allocator specific metadata?
> 

It depends on whether we could share the same base for both slab (unified 
allocator) and slub, which you snipped from your reply, that would make 
this cleaner.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH net-next] net:  allocate skbs on local node
  2010-10-13 22:41                           ` David Rientjes
@ 2010-10-14  6:22                             ` Pekka Enberg
  2010-10-14  7:23                               ` David Rientjes
  2010-10-15 14:23                             ` Christoph Lameter
  1 sibling, 1 reply; 40+ messages in thread
From: Pekka Enberg @ 2010-10-14  6:22 UTC (permalink / raw)
  To: David Rientjes
  Cc: Christoph Lameter, Andrew Morton, Eric Dumazet, David Miller,
	netdev, Michael Chan, Eilon Greenstein, Christoph Hellwig, LKML,
	Nick Piggin

On Wed, 13 Oct 2010, Christoph Lameter wrote:
>>> I was going to mention that as an idea, but I thought storing the metadata
>>> for certain debugging features might differ from the two allocators so
>>> substantially that it would be even more convoluted and difficult to
>>> maintain?
>>
>> We could have some callbacks to store allocator specific metadata?

On 10/14/10 1:41 AM, David Rientjes wrote:
> It depends on whether we could share the same base for both slab (unified
> allocator) and slub, which you snipped from your reply, that would make
> this cleaner.

Argh. Why would we want to introduce something that's effectively a new 
allocator based on SLUB? If there's something controversial in the 
current patch series, lets just keep it out of mainline. A "rewrite" is 
the reason we're in this mess so lets not repeat the same mistake again!

			Pekka

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH net-next] net:  allocate skbs on local node
  2010-10-14  6:22                             ` Pekka Enberg
@ 2010-10-14  7:23                               ` David Rientjes
  0 siblings, 0 replies; 40+ messages in thread
From: David Rientjes @ 2010-10-14  7:23 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: Christoph Lameter, Andrew Morton, Eric Dumazet, David Miller,
	netdev, Michael Chan, Eilon Greenstein, Christoph Hellwig, LKML,
	Nick Piggin

On Thu, 14 Oct 2010, Pekka Enberg wrote:

> Argh. Why would we want to introduce something that's effectively a new
> allocator based on SLUB? If there's something controversial in the current
> patch series, lets just keep it out of mainline. A "rewrite" is the reason
> we're in this mess so lets not repeat the same mistake again!
> 

SLUB is a good base framework for developing just about any slab allocator 
you can imagine, in part because of its enhanced debugging facilities.  
Nick originally developed SLQB with much of the same SLUB framework and 
the queueing changes that Christoph is proposing in his new unified 
allocator builds upon SLUB.

Instead of the slab.c, slab_queue.c, and slab_nonqueue.c trifecta, I 
suggested building as much of the core allocator into a single file as 
possible and then extending that with a config option such as 
CONFIG_SLAB_QUEUEING, if possible.  Christoph knows his allocator better 
than anybody so he'd be the person to ask if this was indeed feasible and, 
if so, I think it's in the best interest of a long-term maintainable 
kernel.

I care about how this is organized because I think the current config 
option demanding users select between SLAB and SLUB without really 
understanding the differences (especially for users who run a very wide 
range of applications and the pros and cons of better microbenchmark 
results for one allocator over another isn't at all convincing) is 
detrimental.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH net-next] net: allocate skbs on local node
  2010-10-12  6:03     ` Andrew Morton
  2010-10-12  6:58       ` Eric Dumazet
@ 2010-10-14 15:31       ` Tom Herbert
  2010-10-14 16:05         ` Eric Dumazet
  2010-10-14 19:27         ` Andrew Morton
  1 sibling, 2 replies; 40+ messages in thread
From: Tom Herbert @ 2010-10-14 15:31 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Eric Dumazet, David Miller, netdev, Michael Chan,
	Eilon Greenstein, Christoph Hellwig, Christoph Lameter

> This is all conspicuously hand-wavy and unquantified.  (IOW: prove it!)
>
> The mooted effects should be tested for on both slab and slub, I
> suggest.  They're pretty different beasts.
> --

Some results running netper TCP_RR test with 200 instances, 1 byte
request and response on 16 core AMD using bnx2x with one 16 queues,
one for each CPU.

SLAB

Without patch 553570 tps at 86% CPU
With patch 791883 tps at 93% CPU

SLUB

Without patch 704879 tps at 95% CPU
With patch 775278 tps at 92% CPU

I believe both show good benfits with patch, and it actually looks
like the impact is more pronounced for SLAB.  I would also note, that
we have actually already internally patched __netdev_alloc_skb to do
local node allocation which we have been running in production for
quite some time.

Tom

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH net-next] net: allocate skbs on local node
  2010-10-14 15:31       ` Tom Herbert
@ 2010-10-14 16:05         ` Eric Dumazet
  2010-10-15 16:57           ` Christoph Lameter
  2010-10-14 19:27         ` Andrew Morton
  1 sibling, 1 reply; 40+ messages in thread
From: Eric Dumazet @ 2010-10-14 16:05 UTC (permalink / raw)
  To: Tom Herbert
  Cc: Andrew Morton, David Miller, netdev, Michael Chan,
	Eilon Greenstein, Christoph Hellwig, Christoph Lameter

Le jeudi 14 octobre 2010 à 08:31 -0700, Tom Herbert a écrit :
> > This is all conspicuously hand-wavy and unquantified.  (IOW: prove it!)
> >
> > The mooted effects should be tested for on both slab and slub, I
> > suggest.  They're pretty different beasts.
> > --
> 
> Some results running netper TCP_RR test with 200 instances, 1 byte
> request and response on 16 core AMD using bnx2x with one 16 queues,
> one for each CPU.
> 
> SLAB
> 
> Without patch 553570 tps at 86% CPU
> With patch 791883 tps at 93% CPU
> 
> SLUB
> 
> Without patch 704879 tps at 95% CPU
> With patch 775278 tps at 92% CPU
> 
> I believe both show good benfits with patch, and it actually looks
> like the impact is more pronounced for SLAB.  I would also note, that
> we have actually already internally patched __netdev_alloc_skb to do
> local node allocation which we have been running in production for
> quite some time.

Excellent ! I was not sure I could do this before NFWS...

Thanks Tom !

Small note : last user of 'allocate skb on a given node, not local one'
is pktgen.

[PATCH net-next] net: allocate skbs on local node

commit b30973f877 (node-aware skb allocation) spread a wrong habit of
allocating net drivers skbs on a given memory node : The one closest to
the NIC hardware. This is wrong because as soon as we try to scale
network stack, we need to use many cpus to handle traffic and hit
slub/slab management on cross-node allocations/frees when these cpus
have to alloc/free skbs bound to a central node.

skb allocated in RX path are ephemeral, they have a very short
lifetime : Extra cost to maintain NUMA affinity is too expensive. What
appeared as a nice idea four years ago is in fact a bad one.

In 2010, NIC hardwares are multiqueue, or we use RPS to spread the load,
and two 10Gb NIC might deliver more than 28 million packets per second,
needing all the available cpus.

Cost of cross-node handling in network and vm stacks outperforms the
small benefit hardware had when doing its DMA transfert in its 'local'
memory node at RX time. Even trying to differentiate the two allocations
done for one skb (the sk_buff on local node, the data part on NIC
hardware node) is not enough to bring good performance.

Some numbers, courtesy from Tom Herbert :

Some results running netper TCP_RR test with 200 instances, 1 byte
request and response on 16 core AMD using bnx2x with one 16 queues,
one for each CPU.

SLAB

Without patch 553570 tps at 86% CPU
With patch 791883 tps at 93% CPU

SLUB

Without patch 704879 tps at 95% CPU
With patch 775278 tps at 92% CPU

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: Tom Herbert <therbert@google.com>
---
 include/linux/skbuff.h |   20 ++++++++++++++++----
 net/core/skbuff.c      |   13 +------------
 2 files changed, 17 insertions(+), 16 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 0b53c43..05a358f 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -496,13 +496,13 @@ extern struct sk_buff *__alloc_skb(unsigned int size,
 static inline struct sk_buff *alloc_skb(unsigned int size,
 					gfp_t priority)
 {
-	return __alloc_skb(size, priority, 0, -1);
+	return __alloc_skb(size, priority, 0, NUMA_NO_NODE);
 }
 
 static inline struct sk_buff *alloc_skb_fclone(unsigned int size,
 					       gfp_t priority)
 {
-	return __alloc_skb(size, priority, 1, -1);
+	return __alloc_skb(size, priority, 1, NUMA_NO_NODE);
 }
 
 extern bool skb_recycle_check(struct sk_buff *skb, int skb_size);
@@ -1563,13 +1563,25 @@ static inline struct sk_buff *netdev_alloc_skb_ip_align(struct net_device *dev,
 	return skb;
 }
 
-extern struct page *__netdev_alloc_page(struct net_device *dev, gfp_t gfp_mask);
+/**
+ *	__netdev_alloc_page - allocate a page for ps-rx on a specific device
+ *	@dev: network device to receive on
+ *	@gfp_mask: alloc_pages_node mask
+ *
+ * 	Allocate a new page. dev currently unused.
+ *
+ * 	%NULL is returned if there is no free memory.
+ */
+static inline struct page *__netdev_alloc_page(struct net_device *dev, gfp_t gfp_mask)
+{
+	return alloc_pages_node(NUMA_NO_NODE, gfp_mask, 0);
+}
 
 /**
  *	netdev_alloc_page - allocate a page for ps-rx on a specific device
  *	@dev: network device to receive on
  *
- * 	Allocate a new page node local to the specified device.
+ * 	Allocate a new page. dev currently unused.
  *
  * 	%NULL is returned if there is no free memory.
  */
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 752c197..4e8b82e 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -247,10 +247,9 @@ EXPORT_SYMBOL(__alloc_skb);
 struct sk_buff *__netdev_alloc_skb(struct net_device *dev,
 		unsigned int length, gfp_t gfp_mask)
 {
-	int node = dev->dev.parent ? dev_to_node(dev->dev.parent) : -1;
 	struct sk_buff *skb;
 
-	skb = __alloc_skb(length + NET_SKB_PAD, gfp_mask, 0, node);
+	skb = __alloc_skb(length + NET_SKB_PAD, gfp_mask, 0, NUMA_NO_NODE);
 	if (likely(skb)) {
 		skb_reserve(skb, NET_SKB_PAD);
 		skb->dev = dev;
@@ -259,16 +258,6 @@ struct sk_buff *__netdev_alloc_skb(struct net_device *dev,
 }
 EXPORT_SYMBOL(__netdev_alloc_skb);
 
-struct page *__netdev_alloc_page(struct net_device *dev, gfp_t gfp_mask)
-{
-	int node = dev->dev.parent ? dev_to_node(dev->dev.parent) : -1;
-	struct page *page;
-
-	page = alloc_pages_node(node, gfp_mask, 0);
-	return page;
-}
-EXPORT_SYMBOL(__netdev_alloc_page);
-
 void skb_add_rx_frag(struct sk_buff *skb, int i, struct page *page, int off,
 		int size)
 {



^ permalink raw reply related	[flat|nested] 40+ messages in thread

* Re: [PATCH net-next] bnx2x: dont use netdev_alloc_skb()
  2010-10-12  9:12       ` Vladislav Zolotarov
@ 2010-10-14 17:39         ` David Miller
  2010-10-14 18:17           ` Eilon Greenstein
  2010-10-14 18:17           ` Tom Herbert
  0 siblings, 2 replies; 40+ messages in thread
From: David Miller @ 2010-10-14 17:39 UTC (permalink / raw)
  To: vladz; +Cc: eric.dumazet, therbert, netdev, mchan, eilong

From: "Vladislav Zolotarov" <vladz@broadcom.com>
Date: Tue, 12 Oct 2010 02:12:41 -0700

> Dave, if it can wait till tomorrow, I'd like to ask u to wait for Eilon's
> final ACK.

Ok, but it is two days later now :-)

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH net-next] bnx2x: dont use netdev_alloc_skb()
  2010-10-14 17:39         ` David Miller
@ 2010-10-14 18:17           ` Eilon Greenstein
  2010-10-14 18:20             ` Eric Dumazet
  2010-10-14 18:17           ` Tom Herbert
  1 sibling, 1 reply; 40+ messages in thread
From: Eilon Greenstein @ 2010-10-14 18:17 UTC (permalink / raw)
  To: David Miller, eric.dumazet
  Cc: Vladislav Zolotarov, therbert, netdev, Michael Chan

On Thu, 2010-10-14 at 10:39 -0700, David Miller wrote:
> From: "Vladislav Zolotarov" <vladz@broadcom.com>
> Date: Tue, 12 Oct 2010 02:12:41 -0700
> 
> > Dave, if it can wait till tomorrow, I'd like to ask u to wait for Eilon's
> > final ACK.
> 
> Ok, but it is two days later now :-)
> 

I was under the impression that the current path is to revert
b30973f877fea1a3fb84e05599890fcc082a88e5 instead of the original thread.
Eric - do you still want to apply the original patch from this thread?

Thanks,
Eilon



^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH net-next] bnx2x: dont use netdev_alloc_skb()
  2010-10-14 17:39         ` David Miller
  2010-10-14 18:17           ` Eilon Greenstein
@ 2010-10-14 18:17           ` Tom Herbert
  1 sibling, 0 replies; 40+ messages in thread
From: Tom Herbert @ 2010-10-14 18:17 UTC (permalink / raw)
  To: David Miller; +Cc: vladz, eric.dumazet, netdev, mchan, eilong

On Thu, Oct 14, 2010 at 10:39 AM, David Miller <davem@davemloft.net> wrote:
> From: "Vladislav Zolotarov" <vladz@broadcom.com>
> Date: Tue, 12 Oct 2010 02:12:41 -0700
>
>> Dave, if it can wait till tomorrow, I'd like to ask u to wait for Eilon's
>> final ACK.
>
> Ok, but it is two days later now :-)

If the netdev_alloc_skb is reverted to do local node allocation as
Eric suggested, then this patch against bnx2x (and presumably similar
changes to other drivers) is no longer needed.

> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH net-next] bnx2x: dont use netdev_alloc_skb()
  2010-10-14 18:17           ` Eilon Greenstein
@ 2010-10-14 18:20             ` Eric Dumazet
  2010-10-14 18:25               ` David Miller
  0 siblings, 1 reply; 40+ messages in thread
From: Eric Dumazet @ 2010-10-14 18:20 UTC (permalink / raw)
  To: eilong; +Cc: David Miller, Vladislav Zolotarov, therbert, netdev, Michael Chan

Le jeudi 14 octobre 2010 à 20:17 +0200, Eilon Greenstein a écrit :
> On Thu, 2010-10-14 at 10:39 -0700, David Miller wrote:
> > From: "Vladislav Zolotarov" <vladz@broadcom.com>
> > Date: Tue, 12 Oct 2010 02:12:41 -0700
> > 
> > > Dave, if it can wait till tomorrow, I'd like to ask u to wait for Eilon's
> > > final ACK.
> > 
> > Ok, but it is two days later now :-)
> > 
> 
> I was under the impression that the current path is to revert
> b30973f877fea1a3fb84e05599890fcc082a88e5 instead of the original thread.
> Eric - do you still want to apply the original patch from this thread?

No, this patch is not needed, if the generic one is considered.

We might change copybreak stuff, but not all skb allocations in bnx2x

Thanks



^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH net-next] bnx2x: dont use netdev_alloc_skb()
  2010-10-14 18:20             ` Eric Dumazet
@ 2010-10-14 18:25               ` David Miller
  0 siblings, 0 replies; 40+ messages in thread
From: David Miller @ 2010-10-14 18:25 UTC (permalink / raw)
  To: eric.dumazet; +Cc: eilong, vladz, therbert, netdev, mchan

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Thu, 14 Oct 2010 20:20:20 +0200

> Le jeudi 14 octobre 2010 à 20:17 +0200, Eilon Greenstein a écrit :
>> On Thu, 2010-10-14 at 10:39 -0700, David Miller wrote:
>> > From: "Vladislav Zolotarov" <vladz@broadcom.com>
>> > Date: Tue, 12 Oct 2010 02:12:41 -0700
>> > 
>> > > Dave, if it can wait till tomorrow, I'd like to ask u to wait for Eilon's
>> > > final ACK.
>> > 
>> > Ok, but it is two days later now :-)
>> > 
>> 
>> I was under the impression that the current path is to revert
>> b30973f877fea1a3fb84e05599890fcc082a88e5 instead of the original thread.
>> Eric - do you still want to apply the original patch from this thread?
> 
> No, this patch is not needed, if the generic one is considered.

Ok.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH net-next] net: allocate skbs on local node
  2010-10-14 15:31       ` Tom Herbert
  2010-10-14 16:05         ` Eric Dumazet
@ 2010-10-14 19:27         ` Andrew Morton
  2010-10-14 19:59           ` Eric Dumazet
  1 sibling, 1 reply; 40+ messages in thread
From: Andrew Morton @ 2010-10-14 19:27 UTC (permalink / raw)
  To: Tom Herbert
  Cc: Eric Dumazet, David Miller, netdev, Michael Chan,
	Eilon Greenstein, Christoph Hellwig, Christoph Lameter

On Thu, 14 Oct 2010 08:31:01 -0700
Tom Herbert <therbert@google.com> wrote:

> > This is all conspicuously hand-wavy and unquantified. __(IOW: prove it!)
> >
> > The mooted effects should be tested for on both slab and slub, I
> > suggest. __They're pretty different beasts.
> > --
> 
> Some results running netper TCP_RR test with 200 instances, 1 byte
> request and response on 16 core AMD using bnx2x with one 16 queues,
> one for each CPU.
> 
> SLAB
> 
> Without patch 553570 tps at 86% CPU
> With patch 791883 tps at 93% CPU
> 
> SLUB
> 
> Without patch 704879 tps at 95% CPU
> With patch 775278 tps at 92% CPU
> 
> I believe both show good benfits with patch, and it actually looks
> like the impact is more pronounced for SLAB.  I would also note, that
> we have actually already internally patched __netdev_alloc_skb to do
> local node allocation which we have been running in production for
> quite some time.
> 

Yes, that's a solid gain.

Can we think of any hardware configuration for which the change would
be harmful?  Something with really expensive cross-node DMA maybe?


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH net-next] net: allocate skbs on local node
  2010-10-14 19:27         ` Andrew Morton
@ 2010-10-14 19:59           ` Eric Dumazet
  0 siblings, 0 replies; 40+ messages in thread
From: Eric Dumazet @ 2010-10-14 19:59 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Tom Herbert, David Miller, netdev, Michael Chan,
	Eilon Greenstein, Christoph Hellwig, Christoph Lameter

Le jeudi 14 octobre 2010 à 12:27 -0700, Andrew Morton a écrit :

> Can we think of any hardware configuration for which the change would
> be harmful?  Something with really expensive cross-node DMA maybe?
> 

If such hardware exists (yes it does, but not close my hands...), then
NIC IRQS probably should be handled by cpus close to the device, or it
might be very slow. This has nothing to do with skb allocation layer.

I believe we should not try to correct wrong IRQ affinities with NUMA
games. Network stack will wakeup threads and scheduler will run them on
same cpu, if possible.

If skb stay long enough on socket receive queue, application will need
to fetch again remote numa node when reading socket a few milliseconds
later, because cache will be cold : Total : two cross node transferts
per incoming frame.

The more node distances are big, the more speedup we can expect from
this patch.

Thanks



^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH net-next] net:  allocate skbs on local node
  2010-10-13 22:41                           ` David Rientjes
  2010-10-14  6:22                             ` Pekka Enberg
@ 2010-10-15 14:23                             ` Christoph Lameter
  1 sibling, 0 replies; 40+ messages in thread
From: Christoph Lameter @ 2010-10-15 14:23 UTC (permalink / raw)
  To: David Rientjes
  Cc: Pekka Enberg, Andrew Morton, Eric Dumazet, David Miller, netdev,
	Michael Chan, Eilon Greenstein, Christoph Hellwig, LKML,
	Nick Piggin


On Wed, 13 Oct 2010, David Rientjes wrote:

> On Wed, 13 Oct 2010, Christoph Lameter wrote:
>
> > > I was going to mention that as an idea, but I thought storing the metadata
> > > for certain debugging features might differ from the two allocators so
> > > substantially that it would be even more convoluted and difficult to
> > > maintain?
> >
> > We could have some callbacks to store allocator specific metadata?
> >
>
> It depends on whether we could share the same base for both slab (unified
> allocator) and slub, which you snipped from your reply, that would make
> this cleaner.

I already said before that we should consider having a common base so I
thought that was not necessary.


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH net-next] net: allocate skbs on local node
  2010-10-14 16:05         ` Eric Dumazet
@ 2010-10-15 16:57           ` Christoph Lameter
  0 siblings, 0 replies; 40+ messages in thread
From: Christoph Lameter @ 2010-10-15 16:57 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Tom Herbert, Andrew Morton, David Miller, netdev, Michael Chan,
	Eilon Greenstein, Christoph Hellwig, Pekka Enberg,
	David Rientjes

On Thu, 14 Oct 2010, Eric Dumazet wrote:

> > Some results running netper TCP_RR test with 200 instances, 1 byte
> > request and response on 16 core AMD using bnx2x with one 16 queues,
> > one for each CPU.
> >
> > SLAB
> >
> > Without patch 553570 tps at 86% CPU
> > With patch 791883 tps at 93% CPU
> >
> > SLUB
> >
> > Without patch 704879 tps at 95% CPU
> > With patch 775278 tps at 92% CPU

Shows one basic difference between SLAB and SLUB: SLAB always needs to go
the slow path when allocating from other nodes. SLUB allows other nodes to
use the fast path. First alloc will be slow but it will switch the "local
queue" to the new node ("queue" switching fast because the queue is
simply a linked list from the page struct of objects within the page).
Second and following access will be fast and coming from that node.

But the requirement that objects on the per page queue are only within the
page also limits the size of the queue and often requires more slow path
use on free....



^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH net-next] net: allocate skbs on local node
  2010-10-12  5:05   ` [PATCH net-next] net: allocate skbs on local node Eric Dumazet
  2010-10-12  5:35     ` Tom Herbert
  2010-10-12  6:03     ` Andrew Morton
@ 2010-10-16 18:54     ` David Miller
  2 siblings, 0 replies; 40+ messages in thread
From: David Miller @ 2010-10-16 18:54 UTC (permalink / raw)
  To: eric.dumazet; +Cc: netdev, mchan, eilong, akpm, hch, cl

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Tue, 12 Oct 2010 07:05:25 +0200

> [PATCH net-next] net: allocate skbs on local node
> 
> commit b30973f877 (node-aware skb allocation) spread a wrong habit of
> allocating net drivers skbs on a given memory node : The one closest to
> the NIC hardware. This is wrong because as soon as we try to scale
> network stack, we need to use many cpus to handle traffic and hit
> slub/slab management on cross-node allocations/frees when these cpus
> have to alloc/free skbs bound to a central node.
> 
> skb allocated in RX path are ephemeral, they have a very short
> lifetime : Extra cost to maintain NUMA affinity is too expensive. What
> appeared as a nice idea four years ago is in fact a bad one.
> 
> In 2010, NIC hardwares are multiqueue, or we use RPS to spread the load,
> and two 10Gb NIC might deliver more than 28 million packets per second,
> needing all the available cpus.
> 
> Cost of cross-node handling in network and vm stacks outperforms the
> small benefit hardware had when doing its DMA transfert in its 'local'
> memory node at RX time. Even trying to differentiate the two allocations
> done for one skb (the sk_buff on local node, the data part on NIC
> hardware node) is not enough to bring good performance.
> 
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>

Applied.

^ permalink raw reply	[flat|nested] 40+ messages in thread

end of thread, other threads:[~2010-10-16 18:53 UTC | newest]

Thread overview: 40+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-10-11 23:03 [PATCH net-next] bnx2x: dont use netdev_alloc_skb() Eric Dumazet
2010-10-11 23:22 ` Eric Dumazet
2010-10-12  5:03   ` Tom Herbert
2010-10-12  5:16     ` Eric Dumazet
2010-10-12  9:12       ` Vladislav Zolotarov
2010-10-14 17:39         ` David Miller
2010-10-14 18:17           ` Eilon Greenstein
2010-10-14 18:20             ` Eric Dumazet
2010-10-14 18:25               ` David Miller
2010-10-14 18:17           ` Tom Herbert
2010-10-12  5:05   ` [PATCH net-next] net: allocate skbs on local node Eric Dumazet
2010-10-12  5:35     ` Tom Herbert
2010-10-12  6:03     ` Andrew Morton
2010-10-12  6:58       ` Eric Dumazet
2010-10-12  7:24         ` Andrew Morton
2010-10-12  7:49           ` Eric Dumazet
2010-10-12  7:58             ` Andrew Morton
2010-10-12 11:08               ` Pekka Enberg
2010-10-12 12:50                 ` Christoph Lameter
2010-10-12 19:43                   ` David Rientjes
2010-10-13  6:17                     ` Pekka Enberg
2010-10-13  6:31                       ` David Rientjes
2010-10-13  6:36                         ` Pekka Enberg
2010-10-13 16:00                     ` Christoph Lameter
2010-10-13 20:48                       ` David Rientjes
2010-10-13 21:43                         ` Christoph Lameter
2010-10-13 22:41                           ` David Rientjes
2010-10-14  6:22                             ` Pekka Enberg
2010-10-14  7:23                               ` David Rientjes
2010-10-15 14:23                             ` Christoph Lameter
2010-10-14 15:31       ` Tom Herbert
2010-10-14 16:05         ` Eric Dumazet
2010-10-15 16:57           ` Christoph Lameter
2010-10-14 19:27         ` Andrew Morton
2010-10-14 19:59           ` Eric Dumazet
2010-10-16 18:54     ` David Miller
2010-10-12 16:07 ` [BUG net-next] bnx2x: all traffic comes to RX queue 0 Eric Dumazet
2010-10-12 16:20   ` Dmitry Kravkov
2010-10-12 18:11     ` Eric Dumazet
2010-10-12 18:18       ` Vladislav Zolotarov

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.