* [PATCH net-next-2.6] net: __alloc_skb() speedup
From: Eric Dumazet @ 2010-05-04 17:10 UTC
  To: David Miller; +Cc: netdev, jamal, Tom Herbert

With the following patch I can reach the maximum rate of my
pktgen+udpsink simulator:
- 'old' machine: dual quad-core E5450 @ 3.00GHz
- 64 UDP rx flows (differing only by destination port)
- RPS enabled, NIC interrupts serviced on cpu0
- RPS dispatched to 7 other cores (~130,000 IPIs per second)
- SLAB allocator (faster than SLUB in this workload)
- tg3 NIC
- 1,080,000 pps without a single drop at the NIC level.

The idea is to add two prefetchw() calls in __alloc_skb(): one to
prefetch the first sk_buff cache line, the second to prefetch the
shinfo part.

Also, use a single memset() to initialize all skb_shared_info fields
instead of clearing them one by one; this reduces the instruction count
by using long-word moves.

All skb_shared_info fields before 'dataref' are cleared in
__alloc_skb().

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
 include/linux/skbuff.h |    7 ++++++-
 net/core/skbuff.c      |   21 +++++----------------
 2 files changed, 11 insertions(+), 17 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 746a652..f32ccc9 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -187,7 +187,6 @@ union skb_shared_tx {
  * the end of the header data, ie. at skb->end.
  */
 struct skb_shared_info {
-	atomic_t	dataref;
 	unsigned short	nr_frags;
 	unsigned short	gso_size;
 	/* Warning: this field is not always filled in (UFO)! */
@@ -197,6 +196,12 @@ struct skb_shared_info {
 	union skb_shared_tx tx_flags;
 	struct sk_buff	*frag_list;
 	struct skb_shared_hwtstamps hwtstamps;
+
+	/*
+	 * Warning : all fields before dataref are cleared in __alloc_skb()
+	 */
+	atomic_t	dataref;
+	
 	skb_frag_t	frags[MAX_SKB_FRAGS];
 	/* Intermediate layers must ensure that destructor_arg
 	 * remains valid until skb destructor */
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 8b9c109..a9b0e1f 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -181,12 +181,14 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
 	skb = kmem_cache_alloc_node(cache, gfp_mask & ~__GFP_DMA, node);
 	if (!skb)
 		goto out;
+	prefetchw(skb);
 
 	size = SKB_DATA_ALIGN(size);
 	data = kmalloc_node_track_caller(size + sizeof(struct skb_shared_info),
 			gfp_mask, node);
 	if (!data)
 		goto nodata;
+	prefetchw(data + size);
 
 	/*
 	 * Only clear those fields we need to clear, not those that we will
@@ -208,15 +210,8 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
 
 	/* make sure we initialize shinfo sequentially */
 	shinfo = skb_shinfo(skb);
+	memset(shinfo, 0, offsetof(struct skb_shared_info, dataref));
 	atomic_set(&shinfo->dataref, 1);
-	shinfo->nr_frags  = 0;
-	shinfo->gso_size = 0;
-	shinfo->gso_segs = 0;
-	shinfo->gso_type = 0;
-	shinfo->ip6_frag_id = 0;
-	shinfo->tx_flags.flags = 0;
-	skb_frag_list_init(skb);
-	memset(&shinfo->hwtstamps, 0, sizeof(shinfo->hwtstamps));
 
 	if (fclone) {
 		struct sk_buff *child = skb + 1;
@@ -505,16 +500,10 @@ int skb_recycle_check(struct sk_buff *skb, int skb_size)
 		return 0;
 
 	skb_release_head_state(skb);
+
 	shinfo = skb_shinfo(skb);
+	memset(shinfo, 0, offsetof(struct skb_shared_info, dataref));
 	atomic_set(&shinfo->dataref, 1);
-	shinfo->nr_frags = 0;
-	shinfo->gso_size = 0;
-	shinfo->gso_segs = 0;
-	shinfo->gso_type = 0;
-	shinfo->ip6_frag_id = 0;
-	shinfo->tx_flags.flags = 0;
-	skb_frag_list_init(skb);
-	memset(&shinfo->hwtstamps, 0, sizeof(shinfo->hwtstamps));
 
 	memset(skb, 0, offsetof(struct sk_buff, tail));
 	skb->data = skb->head + NET_SKB_PAD;
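
The trick the patch depends on is worth seeing in isolation: group every
field that must start out zero ahead of a marker field, then clear them
all with a single memset() of offsetof() bytes.  A minimal standalone
sketch (struct and field names here are illustrative, not the kernel's):

#include <stddef.h>
#include <string.h>

/*
 * All fields that must be zeroed at init time sit before 'dataref';
 * one memset() of offsetof(struct shinfo_demo, dataref) bytes clears
 * them with long-word moves instead of one store per field.
 */
struct shinfo_demo {
	unsigned short	nr_frags;
	unsigned short	gso_size;
	unsigned int	gso_type;
	void		*frag_list;
	/* everything above is cleared by shinfo_demo_init() */
	int		dataref;
	/* fields below are initialized separately */
};

static void shinfo_demo_init(struct shinfo_demo *s)
{
	memset(s, 0, offsetof(struct shinfo_demo, dataref));
	s->dataref = 1;
}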




* Re: [PATCH net-next-2.6] net: __alloc_skb() speedup
From: David Miller @ 2010-05-05  8:06 UTC
  To: eric.dumazet; +Cc: netdev, hadi, therbert

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Tue, 04 May 2010 19:10:54 +0200

> With the following patch I can reach the maximum rate of my
> pktgen+udpsink simulator:
> - 'old' machine: dual quad-core E5450 @ 3.00GHz
> - 64 UDP rx flows (differing only by destination port)
> - RPS enabled, NIC interrupts serviced on cpu0
> - RPS dispatched to 7 other cores (~130,000 IPIs per second)
> - SLAB allocator (faster than SLUB in this workload)
> - tg3 NIC
> - 1,080,000 pps without a single drop at the NIC level.
> 
> The idea is to add two prefetchw() calls in __alloc_skb(): one to
> prefetch the first sk_buff cache line, the second to prefetch the
> shinfo part.
> 
> Also, use a single memset() to initialize all skb_shared_info fields
> instead of clearing them one by one; this reduces the instruction count
> by using long-word moves.
> 
> All skb_shared_info fields before 'dataref' are cleared in
> __alloc_skb().
> 
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>

I'll apply this, nice work Eric.

But some caveats...

On several cpu types it is possible to "prefetch invalidate"
cache lines.  PowerPC and sparc64 can both do it.  I'm pretty
sure current-gen x86 has SSE bits that can do this too.

In fact, the memset() on sparc64 is going to do these cache-line
invalidates, making the prefetches on 'skb' wasteful.  They will
just create spurious bus traffic.

The memset() for skb_shared_info is going to help universally,
I think.
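
For reference, when an architecture does not provide its own write
prefetch, the kernel's generic prefetchw() boils down to gcc's prefetch
builtin with the "write" hint.  A minimal sketch of the pattern (user
space, simplified; the kernel's real definition lives in
include/linux/prefetch.h):

/* 1 = prefetch for write; on CPUs with no write prefetch this degrades
 * to a read prefetch, or to nothing at all */
#define prefetchw(x) __builtin_prefetch((x), 1)

struct foo { int refcount; };

void init_foo(struct foo *f)
{
	prefetchw(f);		/* start pulling the line in for ownership */
	/* ... other work the cpu can overlap with the prefetch ... */
	f->refcount = 1;	/* store hits an already-fetched line */
}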




* Re: [PATCH net-next-2.6] net: __alloc_skb() speedup
From: Eric Dumazet @ 2010-05-05  8:22 UTC
  To: David Miller; +Cc: netdev, hadi, therbert

On Wednesday, 5 May 2010 at 01:06 -0700, David Miller wrote:
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Tue, 04 May 2010 19:10:54 +0200
> 
> > With the following patch I can reach the maximum rate of my
> > pktgen+udpsink simulator:
> > - 'old' machine: dual quad-core E5450 @ 3.00GHz
> > - 64 UDP rx flows (differing only by destination port)
> > - RPS enabled, NIC interrupts serviced on cpu0
> > - RPS dispatched to 7 other cores (~130,000 IPIs per second)
> > - SLAB allocator (faster than SLUB in this workload)
> > - tg3 NIC
> > - 1,080,000 pps without a single drop at the NIC level.
> > 
> > The idea is to add two prefetchw() calls in __alloc_skb(): one to
> > prefetch the first sk_buff cache line, the second to prefetch the
> > shinfo part.
> > 
> > Also, use a single memset() to initialize all skb_shared_info fields
> > instead of clearing them one by one; this reduces the instruction
> > count by using long-word moves.
> > 
> > All skb_shared_info fields before 'dataref' are cleared in
> > __alloc_skb().
> > 
> > Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
> 
> I'll apply this, nice work Eric.
> 
> But some caveats...
> 
> On several cpu types it is possible to "prefetch invalidate"
> cache lines.  PowerPC and sparc64 can both do it.  I'm pretty
> sure current-gen x86 has SSE bits that can do this too.
> 
> In fact, the memset() on sparc64 is going to do these cache-line
> invalidates, making the prefetches on 'skb' wasteful.  They will
> just create spurious bus traffic.
> 

You mean memset() won't be inlined by the compiler to plain memory
writes, but will use the custom kernel memset()?
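
One way to see what the compiler actually does (a hypothetical
user-space test, not from the thread): compile fixed-size clears at -O2
and check whether a call to the memset symbol is emitted.

/* memset_inline_test.c — build with: gcc -O2 -S memset_inline_test.c
 * and grep the resulting .s file for "memset".  Small constant sizes
 * are typically expanded to a few stores; larger ones call out to
 * memset().  The cutoff is compiler- and architecture-dependent. */
#include <string.h>

struct small { long v[2];  };	/* 16 bytes  */
struct large { long v[25]; };	/* 200 bytes */

void clear_small(struct small *p) { memset(p, 0, sizeof(*p)); }
void clear_large(struct large *p) { memset(p, 0, sizeof(*p)); }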




* Re: [PATCH net-next-2.6] net: __alloc_skb() speedup
From: David Miller @ 2010-05-05  8:26 UTC
  To: eric.dumazet; +Cc: netdev, hadi, therbert

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Wed, 05 May 2010 10:22:14 +0200

> You mean memset() won't be inlined by the compiler to plain memory
> writes, but will use the custom kernel memset()?

I hope memset() is never inlined for a 202-byte piece of memory on
sparc64 or powerpc.  What happens and makes sense on x86 is x86's
business :-)

Especially since that elides the cache invalidate optimizations, and
for anything >= 64 bytes those are absolutely critical on Niagara.


* Re: [PATCH net-next-2.6] net: __alloc_skb() speedup
From: Eric Dumazet @ 2010-05-05 12:00 UTC
  To: David Miller; +Cc: netdev, hadi, therbert

On Wednesday, 5 May 2010 at 01:26 -0700, David Miller wrote:
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Wed, 05 May 2010 10:22:14 +0200
> 
> > You mean memset() won't be inlined by the compiler to plain memory
> > writes, but will use the custom kernel memset()?
> 
> I hope memset() is never inlined for a 202-byte piece of memory on
> sparc64 or powerpc.  What happens and makes sense on x86 is x86's
> business :-)
> 
> Especially since that elides the cache invalidate optimizations, and
> for anything >= 64 bytes those are absolutely critical on Niagara.

Sorry, I was thinking about the shinfo part:

memset(shinfo, 0, offsetof(struct skb_shared_info, dataref));

offsetof(struct skb_shared_info, dataref) is small enough that we don't
dirty a full cache line, so maybe I can keep prefetchw(data + size)?

If not, in which cases can we use prefetchw() in the kernel, if some
arches don't handle it well?

Note 1: Without prefetchw(skb) (I removed it in this v2 patch), some
packets are dropped again...

Note 2: If NET_SKB_PAD is changed to 64, cpu0 has about 2% free cpu
cycles (as observed by a user-space cycle-burner application); see the
sketch after the profile below.

-----------------------------------------------------------------------------------------------------------------------------------------
   PerfTop:    1001 irqs/sec  kernel:99.8% [1000Hz cycles],  (all, cpu: 0)
-----------------------------------------------------------------------------------------------------------------------------------------

             samples  pcnt function                      DSO
             _______ _____ _____________________________ ___________
             1018.00 16.8% eth_type_trans                
              960.00 15.9% __alloc_skb                   
              757.00 12.5% __netdev_alloc_skb            
              681.00 11.3% _raw_spin_lock                
              479.00  7.9% nommu_map_page                
              424.00  7.0% tg3_poll_work                 
              209.00  3.5% get_rps_cpu                   
              205.00  3.4% _raw_spin_lock_irqsave        
              188.00  3.1% __kmalloc                    
              164.00  2.7% enqueue_to_backlog            
              119.00  2.0% tg3_alloc_rx_skb              
              112.00  1.9% kmem_cache_alloc              
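
On Note 2: NET_SKB_PAD is the headroom __netdev_alloc_skb() reserves in
front of the packet data, so raising it to 64 keeps skb->data on its own
cache line on 64-byte-line CPUs.  A simplified sketch of the reservation
pattern (error handling and device bookkeeping omitted):

struct sk_buff *netdev_alloc_skb_sketch(unsigned int length, gfp_t gfp_mask)
{
	/* allocate NET_SKB_PAD extra bytes, then hide them as headroom
	 * so skb->data starts NET_SKB_PAD bytes into the buffer */
	struct sk_buff *skb = alloc_skb(length + NET_SKB_PAD, gfp_mask);

	if (skb)
		skb_reserve(skb, NET_SKB_PAD);
	return skb;
}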

Thanks!

[PATCH v2 net-next-2.6] net: __alloc_skb() speedup

With the following patch I can reach the maximum rate of my
pktgen+udpsink simulator:
- 'old' machine: dual quad-core E5450 @ 3.00GHz
- 64 UDP rx flows (differing only by destination port)
- RPS enabled, NIC interrupts serviced on cpu0
- RPS dispatched to 7 other cores (~130,000 IPIs per second)
- SLAB allocator (faster than SLUB in this workload)
- tg3 NIC [BCM5715S Gigabit Ethernet (rev a3)]
- 1,080,000 pps with few drops (~150 packets per second) at the NIC level.
- 32-bit kernel

The idea is to add one prefetchw() call in __alloc_skb() to hint to the
cpu that we are about to clear part of skb_shared_info.

Also, use a single memset() to initialize all skb_shared_info fields
instead of clearing them one by one; this reduces the instruction count
by using long-word moves.

All skb_shared_info fields before 'dataref' are cleared in
__alloc_skb().

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
 include/linux/skbuff.h |    7 ++++++-
 net/core/skbuff.c      |   21 +++++----------------
 2 files changed, 11 insertions(+), 17 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 746a652..f32ccc9 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -187,7 +187,6 @@ union skb_shared_tx {
  * the end of the header data, ie. at skb->end.
  */
 struct skb_shared_info {
-	atomic_t	dataref;
 	unsigned short	nr_frags;
 	unsigned short	gso_size;
 	/* Warning: this field is not always filled in (UFO)! */
@@ -197,6 +196,12 @@ struct skb_shared_info {
 	union skb_shared_tx tx_flags;
 	struct sk_buff	*frag_list;
 	struct skb_shared_hwtstamps hwtstamps;
+
+	/*
+	 * Warning : all fields before dataref are cleared in __alloc_skb()
+	 */
+	atomic_t	dataref;
+	
 	skb_frag_t	frags[MAX_SKB_FRAGS];
 	/* Intermediate layers must ensure that destructor_arg
 	 * remains valid until skb destructor */
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 8b9c109..7cafe50 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -187,6 +187,8 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
 			gfp_mask, node);
 	if (!data)
 		goto nodata;
+	/* prepare shinfo initialization */
+	prefetchw(data + size);
 
 	/*
 	 * Only clear those fields we need to clear, not those that we will
@@ -208,15 +210,8 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
 
 	/* make sure we initialize shinfo sequentially */
 	shinfo = skb_shinfo(skb);
+	memset(shinfo, 0, offsetof(struct skb_shared_info, dataref));
 	atomic_set(&shinfo->dataref, 1);
-	shinfo->nr_frags  = 0;
-	shinfo->gso_size = 0;
-	shinfo->gso_segs = 0;
-	shinfo->gso_type = 0;
-	shinfo->ip6_frag_id = 0;
-	shinfo->tx_flags.flags = 0;
-	skb_frag_list_init(skb);
-	memset(&shinfo->hwtstamps, 0, sizeof(shinfo->hwtstamps));
 
 	if (fclone) {
 		struct sk_buff *child = skb + 1;
@@ -505,16 +500,10 @@ int skb_recycle_check(struct sk_buff *skb, int skb_size)
 		return 0;
 
 	skb_release_head_state(skb);
+
 	shinfo = skb_shinfo(skb);
+	memset(shinfo, 0, offsetof(struct skb_shared_info, dataref));
 	atomic_set(&shinfo->dataref, 1);
-	shinfo->nr_frags = 0;
-	shinfo->gso_size = 0;
-	shinfo->gso_segs = 0;
-	shinfo->gso_type = 0;
-	shinfo->ip6_frag_id = 0;
-	shinfo->tx_flags.flags = 0;
-	skb_frag_list_init(skb);
-	memset(&shinfo->hwtstamps, 0, sizeof(shinfo->hwtstamps));
 
 	memset(skb, 0, offsetof(struct sk_buff, tail));
 	skb->data = skb->head + NET_SKB_PAD;
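
Note the ordering the v2 hunk produces: prefetchw(data + size) is issued
right after the kmalloc(), while the skb_shared_info line is not written
until the sk_buff head has been memset() and its fields initialized, so
the write prefetch has that whole window to complete before the shinfo
memset() touches the line.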




* Re: [PATCH net-next-2.6] net: __alloc_skb() speedup
From: David Miller @ 2010-05-05 21:52 UTC
  To: eric.dumazet; +Cc: netdev, hadi, therbert

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Wed, 05 May 2010 14:00:09 +0200

> Sorry, I was thinking about the shinfo part:
> 
> memset(shinfo, 0, offsetof(struct skb_shared_info, dataref));
> 
> offsetof(struct skb_shared_info, dataref) is small enough that we don't
> dirty a full cache line, so maybe I can keep prefetchw(data + size)?

You do dirty a full line on sparc64; the prefetch invalidate goes an L1
cache line at a time, so 32 bytes, and this memset() is 40 bytes.  The
call to the memset symbol is still generated by gcc in this case.

I think the cutoff for doing it inline is something like 16 bytes on
sparc64, four 64-bit loads and stores.

Unlike x86, these RISC chips don't have string-op instructions, and on
sparc64 and powerpc instructions are a fixed 4 bytes in size, so the
inline cost is "(memset_size / word_size) * 4" bytes of code.  On x86,
by contrast, the inlining cost is more-or-less fixed.
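For this 40-byte clear, that works out to (40 / 8) * 4 = 20 bytes of
store instructions alone on sparc64, before counting address setup.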

> If not, in which cases can we use prefetchw() in the kernel, if some
> arches don't handle it well?

It has to be looked at on a case-by-case basis.  There is no simple
answer here.

