* Question about __alloc_skb() speedup
From: Junchang Wang @ 2010-12-03 10:14 UTC
  To: eric.dumazet; +Cc: netdev

Hi Eric,

I'm reading your patch (ec7d2f2cf3a1, "__alloc_skb() speedup"),
in which you prefetch the skb and the shinfo part. I'm very
curious why we don't prefetch skb->data as well. It seems that
would help the TX path a lot.

I added the following code:

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 104f844..c60a808 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -222,6 +222,8 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
 
 		child->fclone = SKB_FCLONE_UNAVAILABLE;
 	}
+	prefetchw(data);
+
 out:
 	return skb;
 nodata:

and pktgen on my server (an Intel SR1625 with two quad-core E5530
processors and a single ixgbe-based NIC) goes from 7.6 Mpps to
8.4 Mpps (64-byte packets), a 10% performance gain.

For the RX path, I ran experiments on both ixgbe and igb with
pktgen+kute, and there was no change in system performance.

Any suggestions and corrections are welcome.

Thanks.

--Junchang


* Re: Question about __alloc_skb() speedup
From: Eric Dumazet @ 2010-12-03 10:50 UTC
  To: Junchang Wang; +Cc: netdev

On Friday 03 December 2010 at 18:14 +0800, Junchang Wang wrote:
> Hi Eric,
> 
> I'm reading your patch (ec7d2f2cf3a1, "__alloc_skb() speedup"),
> in which you prefetch the skb and the shinfo part. I'm very
> curious why we don't prefetch skb->data as well. It seems that
> would help the TX path a lot.
> 
> I added the following code:
> 
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index 104f844..c60a808 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -222,6 +222,8 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
>  
>  		child->fclone = SKB_FCLONE_UNAVAILABLE;
>  	}
> +	prefetchw(data);
> +
>  out:
>  	return skb;
>  nodata:
> 
> and pktgen on my server (an Intel SR1625 with two quad-core E5530
> processors and a single ixgbe-based NIC) goes from 7.6 Mpps to
> 8.4 Mpps (64-byte packets), a 10% performance gain.
> 
> For the RX path, I ran experiments on both ixgbe and igb with
> pktgen+kute, and there was no change in system performance.
> 
> Any suggestions and corrections are welcome.
> 
> Thanks.

This is because __alloc_skb() is generic:

We don't know whether skb->data is going to be used right away, or not
at all.

For example, NIC drivers call __alloc_skb() to refill their RX ring
buffer. There is no gain from prefetching the data in this case, since
the data is going to be written by the NIC hardware. The reverse would
actually be needed: ask the local CPU to evict the data from its cache,
so that the device can DMA it faster (fewer bus transactions).

By the way, adding prefetchw() right before the "return skb;" is
probably not very useful. You can certainly try to add the prefetchw()
in pktgen itself, since you know for sure you are going to write the
data.
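
As a rough sketch of that idea (the helper below is made up for
illustration; only netdev_alloc_skb() and prefetchw() are existing
kernel APIs), the prefetch belongs at a call site that knows it is
about to write the buffer:

#include <linux/netdevice.h>
#include <linux/prefetch.h>
#include <linux/skbuff.h>

/* Illustrative only: a TX-style path that fills the buffer right away. */
static struct sk_buff *example_build_tx_skb(struct net_device *dev,
        unsigned int len)
{
        struct sk_buff *skb = netdev_alloc_skb(dev, len);

        if (!skb)
                return NULL;
        /* Headers and payload are written next, so start pulling the
         * first data cache line in for write now. */
        prefetchw(skb->data);
        /* ... build headers and payload ... */
        return skb;
}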

I don't understand your 10% speedup because pktgen actually uses
__netdev_alloc_skb(), which calls skb_reserve(skb, NET_SKB_PAD): your
prefetchw() brings in a cache line that won't be used at all by pktgen.

I would say 10% sounds highly suspect to me...





* Re: Question about __alloc_skb() speedup
From: Junchang Wang @ 2010-12-04 14:18 UTC
  To: Eric Dumazet; +Cc: netdev

On Fri, Dec 03, 2010 at 11:50:29AM +0100, Eric Dumazet wrote:
>
>This is because __alloc_skb() is generic:
>
>We don't know whether skb->data is going to be used right away, or not
>at all.
>
>For example, NIC drivers call __alloc_skb() to refill their RX ring
>buffer. There is no gain from prefetching the data in this case, since
>the data is going to be written by the NIC hardware. The reverse would
>actually be needed: ask the local CPU to evict the data from its cache,
>so that the device can DMA it faster (fewer bus transactions).

Thanks for your explanation. Now I understand your patch. :)

>
>By the way, adding prefetchw() right before the "return skb;" is
>probably not very useful. 

Did you mean that adding prefetchw() right before the "return" statement
doesn't have any beneficial effect?

My understanding is that a prefetch instruction merely hints the CPU to
warm some cache lines, so this kind of prefetch could also benefit the
functions that follow.
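
For reference, the generic fallback in <linux/prefetch.h> is roughly the
following (architectures such as x86 override it with a dedicated
instruction); it is only a hint to the CPU and never faults:

#ifndef ARCH_HAS_PREFETCHW
/* second argument 1 = prefetch with intent to write */
#define prefetchw(x) __builtin_prefetch(x, 1)
#endif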

>You can certainly try to add the prefetchw()
>in pktgen itself, since you know for sure you are going to write the
>data.
>
>I don't understand your 10% speedup because pktgen actually uses
>__netdev_alloc_skb(), which calls skb_reserve(skb, NET_SKB_PAD): your
>prefetchw() brings in a cache line that won't be used at all by pktgen.

Thanks for the corrections. Please check the following code.
>
>I would say 10% sounds highly suspect to me...
>
I repeated the experiment a number of times (>10). The 10% figure was
careless, but I confirmed there is a speedup, which falls into the
(3%, 8%) range. I found it hard to give an exact speedup number because
I had to reboot the system each time I added the prefetch code.

I noticed that hardware prefetching (Hardware Prefetcher and Adjacent
Cache Line Prefetch) was turned on by default, and I suspect my faulty
code gained its benefit from it. With those two options disabled, the
performance of both pktgen runs dropped by 5%-8%, and I could hardly see
any beneficial effect from my previous code.

I added the prefetchw() in pktgen as follows:

diff --git a/net/core/pktgen.c b/net/core/pktgen.c
index 2953b2a..512f1ae 100644
--- a/net/core/pktgen.c
+++ b/net/core/pktgen.c
@@ -2660,6 +2660,7 @@ static struct sk_buff *fill_packet_ipv4(struct net_device *odev,
 		sprintf(pkt_dev->result, "No memory");
 		return NULL;
 	}
+	prefetchw(skb->data);
 
 	skb_reserve(skb, datalen);
 
This time, I could check it without rebooting the system. The performance
gain is a stable 4%-5%. Is a 4% gain worth submitting to the kernel?

Thanks.

--Junchang


* Re: Question about __alloc_skb() speedup
From: Eric Dumazet @ 2010-12-04 14:47 UTC
  To: Junchang Wang; +Cc: netdev

On Saturday 04 December 2010 at 22:18 +0800, Junchang Wang wrote:

> I added the prefetchw() in pktgen as follows:
> 
> diff --git a/net/core/pktgen.c b/net/core/pktgen.c
> index 2953b2a..512f1ae 100644
> --- a/net/core/pktgen.c
> +++ b/net/core/pktgen.c
> @@ -2660,6 +2660,7 @@ static struct sk_buff *fill_packet_ipv4(struct net_device *odev,
>  		sprintf(pkt_dev->result, "No memory");
>  		return NULL;
>  	}
> +	prefetchw(skb->data);
>  
>  	skb_reserve(skb, datalen);
>  
> This time, I could check it without rebooting the system. The performance
> gain is a stable 4%-5%. Is a 4% gain worth submitting to the kernel?

Yes, I believe so, pktgen being very specific, but I have a few questions:

Is it with SLUB or SLAB ?

How many buffers are in the TX ring on your NIC (ethtool -g eth0)?

What is the datalen value here ? (you prefetch, then advance skb->data)

32 or 64bit kernel ?

How many pps do you get before and after patch ?

Thanks




* Re: Question about __alloc_skb() speedup
From: Eric Dumazet @ 2010-12-04 14:49 UTC
  To: Junchang Wang; +Cc: netdev

On Saturday 04 December 2010 at 15:47 +0100, Eric Dumazet wrote:
> On Saturday 04 December 2010 at 22:18 +0800, Junchang Wang wrote:
> 
> > I added the prefetchw() in pktgen as follows:
> > 
> > diff --git a/net/core/pktgen.c b/net/core/pktgen.c
> > index 2953b2a..512f1ae 100644
> > --- a/net/core/pktgen.c
> > +++ b/net/core/pktgen.c
> > @@ -2660,6 +2660,7 @@ static struct sk_buff *fill_packet_ipv4(struct net_device *odev,
> >  		sprintf(pkt_dev->result, "No memory");
> >  		return NULL;
> >  	}
> > +	prefetchw(skb->data);
> >  
> >  	skb_reserve(skb, datalen);
> >  
> > This time, I could check it without rebooting the system. The performance
> > gain is a stable 4%-5%. Is a 4% gain worth submitting to the kernel?
> 
> Yes, I believe so, pktgen being very specific, but I have a few questions:
> 
> Is it with SLUB or SLAB ?
> 
> How many buffers are in the TX ring on your NIC (ethtool -g eth0)?
> 
> What is the datalen value here ? (you prefetch, then advance skb->data)
> 
> 32 or 64bit kernel ?
> 
> How many pps do you get before and after patch ?
> 
> Thanks
> 

Also, don't forget to include the prefetchw() in fill_packet_ipv6() as
well when submitting your patch.
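
Something along these lines, that is, the same one-liner in the IPv6
builder (a sketch only: line numbers are omitted and the context lines
are assumed to mirror the IPv4 hunk):

--- a/net/core/pktgen.c
+++ b/net/core/pktgen.c
@@ static struct sk_buff *fill_packet_ipv6(struct net_device *odev,
 		sprintf(pkt_dev->result, "No memory");
 		return NULL;
 	}
+	prefetchw(skb->data);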




* Re: Question about __alloc_skb() speedup
From: Junchang Wang @ 2010-12-05 10:56 UTC
  To: Eric Dumazet; +Cc: netdev

On Sat, Dec 04, 2010 at 03:47:38PM +0100, Eric Dumazet wrote:
>
>Yes, I believe so, pktgen being very specific, but I have a few questions:
>
>Is it with SLUB or SLAB ?
I had read your discussion of "net: allocate skbs on local node" on
the list, so SLUB was used.

BTW, what I observed is that the network subsystem scales well on NUMA
systems equipped with a single processor (up to six cores), but
performance didn't scale very well when there are two processors.

I have noticed there are a number of discussions about this on the
list. Are there any suggestions? I'm happy to run tests.

>
>How many buffers are in the TX ring on your NIC (ethtool -g eth0)?
>
Pre-set maximums:
RX:             4096
RX Mini:        0
RX Jumbo:       0
TX:             4096
Current hardware settings:
RX:             512
RX Mini:        0
RX Jumbo:       0
TX:             512

>What is the datalen value here ? (you prefetch, then advance skb->data)
>
16. But the subsequent skb_push() moves skb->data back by 14 bytes.
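
Spelling that out (assuming the usual 14-byte Ethernet header and
64-byte cache lines; offsets are relative to the prefetched pointer):

	p = skb->data;          /* prefetchw(p) pulls in the line at p  */
	skb_reserve(skb, 16);   /* skb->data = p + 16                   */
	skb_push(skb, 14);      /* skb->data = p + 2; writes start here */

So the header writes begin just 2 bytes past the prefetched address,
i.e. normally still inside the cache line that the prefetchw() touched.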

>32 or 64bit kernel ?
>
This is a 64-bit CentOS 5.5 distribution running the latest net-next.

>How many pps do you get before and after patch ?
>
An Intel SR1625 server with two quad-core E5530 processors and a single
ixgbe-based NIC:
Without prefetch: 8.63 Mpps
With prefetch:    9.03 Mpps
Improvement:      4.6%


Thanks.
--Junchang


* Re: Question about __alloc_skb() speedup
From: Eric Dumazet @ 2010-12-05 16:49 UTC
  To: Junchang Wang; +Cc: netdev

On Sunday 05 December 2010 at 18:56 +0800, Junchang Wang wrote:
> On Sat, Dec 04, 2010 at 03:47:38PM +0100, Eric Dumazet wrote:
> >
> >Yes, I believe so, pktgen being very specific, but I have a few questions:
> >
> >Is it with SLUB or SLAB ?
> I had read your discussion of "net: allocate skbs on local node" on
> the list, so SLUB was used.


> 
> BTW, what I observed is that the network subsystem scales well on NUMA
> systems equipped with a single processor (up to six cores), but
> performance didn't scale very well when there are two processors.
> 
> I have noticed there are a number of discussions about this on the
> list. Are there any suggestions? I'm happy to run tests.
> 
> >
> >How many buffers are in the TX ring on your NIC (ethtool -g eth0)?
> >
> Pre-set maximums:
> RX:             4096
> RX Mini:        0
> RX Jumbo:       0
> TX:             4096
> Current hardware settings:
> RX:             512
> RX Mini:        0
> RX Jumbo:       0
> TX:             512
> 
> >What is the datalen value here ? (you prefetch, then advance skb->data)
> >
> 16. But the subsequent skb_push() moves skb->data back by 14 bytes.
> 
> >32 or 64bit kernel ?
> >
> This is a 64-bit CentOS 5.5 distribution running the latest net-next.
> 
> >How many pps do you get before and after patch ?
> >
> An Intel SR1625 server with two quad-core E5530 processors and a single
> ixgbe-based NIC:
> Without prefetch: 8.63 Mpps
> With prefetch:    9.03 Mpps
> Improvement:      4.6%
> 
> 

Thanks Junchang, please submit your pktgen patch with the two added
prefetchw() calls, and I'll ack it :)




