bpf.vger.kernel.org archive mirror
* Re: [PATCH v2 bpf-next] net: veth: alloc skb in bulk for ndo_xdp_xmit
       [not found] ` <20210129170216.6a879619@carbon>
@ 2021-01-29 19:17   ` Jesper Dangaard Brouer
  2021-01-29 21:46     ` Lorenzo Bianconi
  0 siblings, 1 reply; 5+ messages in thread
From: Jesper Dangaard Brouer @ 2021-01-29 19:17 UTC (permalink / raw)
  To: Lorenzo Bianconi
  Cc: bpf, netdev, davem, kuba, ast, daniel, toshiaki.makita1,
	lorenzo.bianconi, toke, Stefano Brivio, brouer

On Fri, 29 Jan 2021 17:02:16 +0100
Jesper Dangaard Brouer <brouer@redhat.com> wrote:

> > +	for (i = 0; i < n_skb; i++) {
> > +		struct sk_buff *skb = skbs[i];
> > +
> > +		memset(skb, 0, offsetof(struct sk_buff, tail));  
> 
> It is very subtle, but the memset operation on Intel CPUs translates
> into a "rep stos" (repeated store) operation.  This operation needs to
> save CPU flags (to support being interrupted), so it is actually
> expensive (and in my experience causes side effects on pipeline
> efficiency).  I have a kernel module for testing memset here[1].
> 
> In CPUMAP I have moved the clearing outside this loop, by instead asking
> the MM system to clear the memory via the gfp_t flag __GFP_ZERO.  This
> causes us to clear more memory, 256 bytes, but it is aligned.  The above
> offsetof(struct sk_buff, tail) is 188 bytes, which is unaligned, making
> the rep-stos more expensive in setup time.  It is below 3 cachelines,
> which is actually interesting and an improvement since I last checked.
> I have to re-test with time_bench_memset[1] to know whether it is
> better now.

After much testing (with [1]), yes please use gfp_t flag __GFP_ZERO.

 SKB: offsetof-tail:184 bytes
  - memset_skb_tail Per elem: 37 cycles(tsc) 10.463 ns

 SKB: ROUNDUP(offsetof-tail: 192)
  - memset_skb_tail_roundup Per elem: 37 cycles(tsc) 10.468 ns

I thought it would be better/faster to round up to full cachelines, but
measurements show that the cost is the same for 184 vs 192 bytes.  This
validates the theory that the cacheline boundary is what matters.

When using the gfp_t flag __GFP_ZERO, the kernel cannot know the
constant size, and instead ends up calling memset_erms().

The cost of memset_erms(256) is:
 - memset_variable_step(256) Per elem: 31 cycles(tsc) 8.803 ns

The const version with 256 that uses rep-stos costs more:
 - memset_256 Per elem: 41 cycles(tsc) 11.552 ns


Below is not relevant for your patch, but an interesting data point is
that memset_erms(512) only costs 4 cycles more:
 - memset_variable_step(512) Per elem: 35 cycles(tsc) 9.893 ns

(But don't use rep-stos for const 512; it costs 72 cycles(tsc), 20.069 ns.)

[1] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/time_bench_memset.c
CPU: Intel(R) Xeon(R) CPU E5-1650 v4 @ 3.60GHz
-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH v2 bpf-next] net: veth: alloc skb in bulk for ndo_xdp_xmit
       [not found] ` <36298336-72f9-75a5-781e-7cb01dac1702@gmail.com>
@ 2021-01-29 21:41   ` Lorenzo Bianconi
  0 siblings, 0 replies; 5+ messages in thread
From: Lorenzo Bianconi @ 2021-01-29 21:41 UTC (permalink / raw)
  To: Toshiaki Makita
  Cc: bpf, netdev, davem, kuba, ast, daniel, lorenzo.bianconi, brouer, toke


> On 2021/01/29 4:41, Lorenzo Bianconi wrote:
> > Split ndo_xdp_xmit and ndo_start_xmit use cases in veth_xdp_rcv routine
> > in order to alloc skbs in bulk for XDP_PASS verdict.
> > Introduce xdp_alloc_skb_bulk utility routine to alloc skb bulk list.
> > The proposed approach has been tested in the following scenario:
> > 
> > eth (ixgbe) --> XDP_REDIRECT --> veth0 --> (remote-ns) veth1 --> XDP_PASS
> > 
> > XDP_REDIRECT: xdp_redirect_map bpf sample
> > XDP_PASS: xdp_rxq_info bpf sample
> > 
> > traffic generator: pktgen sending UDP traffic on a remote device
> > 
> > bpf-next master: ~3.64Mpps
> > bpf-next + skb bulking allocation: ~3.75Mpps
> > 
> > Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
> > ---
> ...
> > +/* frames array contains VETH_XDP_BATCH at most */
> > +static void veth_xdp_rcv_batch(struct veth_rq *rq, void **frames,
> > +			       int n_xdpf, struct veth_xdp_tx_bq *bq,
> > +			       struct veth_stats *stats)
> > +{
> > +	void *skbs[VETH_XDP_BATCH];
> > +	int i, n_skb = 0;
> > +
> > +	for (i = 0; i < n_xdpf; i++) {
> > +		struct xdp_frame *frame = frames[i];
> > +
> > +		stats->xdp_bytes += frame->len;
> > +		frame = veth_xdp_rcv_one(rq, frame, bq, stats);
> > +		if (frame)
> > +			frames[n_skb++] = frame;
> > +	}
> 
> Maybe we can move this block to veth_xdp_rcv() and make the logic even simpler?
> 
> Something like this:
> 
> static int veth_xdp_rcv(struct veth_rq *rq, int budget,
> 	...
> 		if (veth_is_xdp_frame(ptr)) {
> 			struct xdp_frame *frame = veth_ptr_to_xdp(ptr);
> 
> 			stats->xdp_bytes += frame->len;
> 			frame = veth_xdp_rcv_one(rq, frame, bq, stats);
> 			if (frame) {
> 				xdpf[n_xdpf++] = frame;
> 				if (n_xdpf == VETH_XDP_BATCH) {
> 					veth_xdp_rcv_batch(rq, xdpf, n_xdpf, bq,
> 							   stats);
> 					n_xdpf = 0;
> 				}
> 			}
> 
> Then we can fully make use of skb bulk allocation as xdpf[] does not include
> frames which may be XDP_TX'ed or XDP_REDIRECT'ed.
> 
> WDYT?

I guess the code is more readable; I will fix it in v3. Thanks.

Regards,
Lorenzo

> 
> Toshiaki Makita


* Re: [PATCH v2 bpf-next] net: veth: alloc skb in bulk for ndo_xdp_xmit
  2021-01-29 19:17   ` [PATCH v2 bpf-next] net: veth: alloc skb in bulk for ndo_xdp_xmit Jesper Dangaard Brouer
@ 2021-01-29 21:46     ` Lorenzo Bianconi
  2021-01-29 21:49       ` Lorenzo Bianconi
  0 siblings, 1 reply; 5+ messages in thread
From: Lorenzo Bianconi @ 2021-01-29 21:46 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: bpf, netdev, davem, kuba, ast, daniel, toshiaki.makita1,
	lorenzo.bianconi, toke, Stefano Brivio


On Jan 29, Jesper Dangaard Brouer wrote:
> On Fri, 29 Jan 2021 17:02:16 +0100
> Jesper Dangaard Brouer <brouer@redhat.com> wrote:
> 
> > > +	for (i = 0; i < n_skb; i++) {
> > > +		struct sk_buff *skb = skbs[i];
> > > +
> > > +		memset(skb, 0, offsetof(struct sk_buff, tail));  
> > 
> > It is very subtle, but the memset operation on Intel CPUs translates
> > into a "rep stos" (repeated store) operation.  This operation needs to
> > save CPU flags (to support being interrupted), so it is actually
> > expensive (and in my experience causes side effects on pipeline
> > efficiency).  I have a kernel module for testing memset here[1].
> > 
> > In CPUMAP I have moved the clearing outside this loop, by instead asking
> > the MM system to clear the memory via the gfp_t flag __GFP_ZERO.  This
> > causes us to clear more memory, 256 bytes, but it is aligned.  The above
> > offsetof(struct sk_buff, tail) is 188 bytes, which is unaligned, making
> > the rep-stos more expensive in setup time.  It is below 3 cachelines,
> > which is actually interesting and an improvement since I last checked.
> > I have to re-test with time_bench_memset[1] to know whether it is
> > better now.
> 
> After much testing (with [1]), yes please use gfp_t flag __GFP_ZERO.

I ran some comparison tests using memset and __GFP_ZERO, with VETH_XDP_BATCH
set to 8 and 16. The results are pretty close, so I am not completely sure
the delta is more than noise:

- VETH_XDP_BATCH= 8 + __GFP_ZERO: ~3.737Mpps
- VETH_XDP_BATCH= 16 + __GFP_ZERO: ~3.79Mpps
- VETH_XDP_BATCH= 8 + memset: ~3.766Mpps
- VETH_XDP_BATCH= 16 + __GFP_ZERO: ~3.765Mpps

Regards,
Lorenzo



* Re: [PATCH v2 bpf-next] net: veth: alloc skb in bulk for ndo_xdp_xmit
  2021-01-29 21:46     ` Lorenzo Bianconi
@ 2021-01-29 21:49       ` Lorenzo Bianconi
  2021-02-01 10:06         ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 5+ messages in thread
From: Lorenzo Bianconi @ 2021-01-29 21:49 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: bpf, netdev, davem, kuba, ast, daniel, toshiaki.makita1,
	lorenzo.bianconi, toke, Stefano Brivio


On Jan 29, Lorenzo Bianconi wrote:
> On Jan 29, Jesper Dangaard Brouer wrote:
> > On Fri, 29 Jan 2021 17:02:16 +0100
> > Jesper Dangaard Brouer <brouer@redhat.com> wrote:
> > 
> > > > +	for (i = 0; i < n_skb; i++) {
> > > > +		struct sk_buff *skb = skbs[i];
> > > > +
> > > > +		memset(skb, 0, offsetof(struct sk_buff, tail));  
> > > 
> > > It is very subtle, but the memset operation on Intel CPUs translates
> > > into a "rep stos" (repeated store) operation.  This operation needs to
> > > save CPU flags (to support being interrupted), so it is actually
> > > expensive (and in my experience causes side effects on pipeline
> > > efficiency).  I have a kernel module for testing memset here[1].
> > > 
> > > In CPUMAP I have moved the clearing outside this loop, by instead asking
> > > the MM system to clear the memory via the gfp_t flag __GFP_ZERO.  This
> > > causes us to clear more memory, 256 bytes, but it is aligned.  The above
> > > offsetof(struct sk_buff, tail) is 188 bytes, which is unaligned, making
> > > the rep-stos more expensive in setup time.  It is below 3 cachelines,
> > > which is actually interesting and an improvement since I last checked.
> > > I have to re-test with time_bench_memset[1] to know whether it is
> > > better now.
> > 
> > After much testing (with [1]), yes please use gfp_t flag __GFP_ZERO.
> 
> I ran some comparison tests using memset and __GFP_ZERO, with VETH_XDP_BATCH
> set to 8 and 16. The results are pretty close, so I am not completely sure
> the delta is more than noise:
> 
> - VETH_XDP_BATCH= 8 + __GFP_ZERO: ~3.737Mpps
> - VETH_XDP_BATCH= 16 + __GFP_ZERO: ~3.79Mpps
> - VETH_XDP_BATCH= 8 + memset: ~3.766Mpps
> - VETH_XDP_BATCH= 16 + __GFP_ZERO: ~3.765Mpps

Sorry last line is:
  - VETH_XDP_BATCH= 16 + memset: ~3.765Mpps

Regards,
Lorenzo




* Re: [PATCH v2 bpf-next] net: veth: alloc skb in bulk for ndo_xdp_xmit
  2021-01-29 21:49       ` Lorenzo Bianconi
@ 2021-02-01 10:06         ` Jesper Dangaard Brouer
  0 siblings, 0 replies; 5+ messages in thread
From: Jesper Dangaard Brouer @ 2021-02-01 10:06 UTC (permalink / raw)
  To: Lorenzo Bianconi
  Cc: bpf, netdev, davem, kuba, ast, daniel, toshiaki.makita1,
	lorenzo.bianconi, toke, Stefano Brivio, brouer

On Fri, 29 Jan 2021 22:49:27 +0100
Lorenzo Bianconi <lorenzo@kernel.org> wrote:

> On Jan 29, Lorenzo Bianconi wrote:
> > On Jan 29, Jesper Dangaard Brouer wrote:  
> > > On Fri, 29 Jan 2021 17:02:16 +0100
> > > Jesper Dangaard Brouer <brouer@redhat.com> wrote:
> > >   
> > > > > +	for (i = 0; i < n_skb; i++) {
> > > > > +		struct sk_buff *skb = skbs[i];
> > > > > +
> > > > > +		memset(skb, 0, offsetof(struct sk_buff, tail));    
> > > > 
> > > > It is very subtle, but the memset operation on Intel CPUs translates
> > > > into a "rep stos" (repeated store) operation.  This operation needs to
> > > > save CPU flags (to support being interrupted), so it is actually
> > > > expensive (and in my experience causes side effects on pipeline
> > > > efficiency).  I have a kernel module for testing memset here[1].
> > > > 
> > > > In CPUMAP I have moved the clearing outside this loop, by instead asking
> > > > the MM system to clear the memory via the gfp_t flag __GFP_ZERO.  This
> > > > causes us to clear more memory, 256 bytes, but it is aligned.  The above
> > > > offsetof(struct sk_buff, tail) is 188 bytes, which is unaligned, making
> > > > the rep-stos more expensive in setup time.  It is below 3 cachelines,
> > > > which is actually interesting and an improvement since I last checked.
> > > > I have to re-test with time_bench_memset[1] to know whether it is
> > > > better now.
> > > 
> > > After much testing (with [1]), yes please use gfp_t flag __GFP_ZERO.  
> > 
> > I ran some comparison tests using memset and __GFP_ZERO, with VETH_XDP_BATCH
> > set to 8 and 16. The results are pretty close, so I am not completely sure
> > the delta is more than noise:
> > 
> > - VETH_XDP_BATCH= 8 + __GFP_ZERO: ~3.737Mpps
> > - VETH_XDP_BATCH= 16 + __GFP_ZERO: ~3.79Mpps
> > - VETH_XDP_BATCH= 8 + memset: ~3.766Mpps
> > - VETH_XDP_BATCH= 16 + __GFP_ZERO: ~3.765Mpps  
> 
> Sorry last line is:
>   - VETH_XDP_BATCH= 16 + memset: ~3.765Mpps

Thanks for doing these benchmarks.

From my memset benchmarks we are looking for a 1.66 ns difference
(10.463 - 8.803 ns), which is VERY hard to measure accurately (anything
below 2 ns is extremely hard to resolve due to OS noise).

VETH_XDP_BATCH=8 __GFP_ZERO (3.737Mpps) -> memset (3.766Mpps)
 - __GFP_ZERO losing 0.029Mpps, i.e. 2.06 ns slower

VETH_XDP_BATCH=16 __GFP_ZERO (3.79Mpps) -> memset (3.765Mpps)
 - __GFP_ZERO gaining 0.025Mpps, i.e. 1.75 ns faster

I would say this is noise in the measurements.  Even though batch=16
matches the expected improvement, batch=8 goes in the other direction.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer



Thread overview: 5+ messages
     [not found] <415937741661ac331be09c0e59b4ff1eacfee782.1611861943.git.lorenzo@kernel.org>
     [not found] ` <20210129170216.6a879619@carbon>
2021-01-29 19:17   ` [PATCH v2 bpf-next] net: veth: alloc skb in bulk for ndo_xdp_xmit Jesper Dangaard Brouer
2021-01-29 21:46     ` Lorenzo Bianconi
2021-01-29 21:49       ` Lorenzo Bianconi
2021-02-01 10:06         ` Jesper Dangaard Brouer
     [not found] ` <36298336-72f9-75a5-781e-7cb01dac1702@gmail.com>
2021-01-29 21:41   ` Lorenzo Bianconi
