* Profiling XDP programs for performance issues
@ 2021-04-08  2:19 Neal Shukla
  2021-04-08 10:32 ` Toke Høiland-Jørgensen
  0 siblings, 1 reply; 8+ messages in thread
From: Neal Shukla @ 2021-04-08  2:19 UTC (permalink / raw)
  To: xdp-newbies

We’ve been introducing bpf_tail_call’s into our XDP programs and have run into
packet loss and latency increases when performing load tests. After profiling
our code we’ve come to the conclusion that this is the problem area in our code:
`int layer3_protocol = bpf_ntohs(ethernet_header->h_proto);`

This is the first time we read from the packet in the first XDP program. We have
yet to make a tail call at this point. However, we do write into the metadata
section prior to this line.
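
For reference, the first program is structured roughly as follows. This is a
simplified sketch with a made-up metadata layout and names, not our actual
code:
```
#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

/* Hypothetical metadata layout, for illustration only */
struct meta {
	__u32 mark;
};

SEC("xdp")
int first_prog(struct xdp_md *ctx)
{
	/* Reserve and write metadata before touching packet data */
	if (bpf_xdp_adjust_meta(ctx, -(int)sizeof(struct meta)))
		return XDP_ABORTED;

	void *data = (void *)(long)ctx->data;
	void *data_end = (void *)(long)ctx->data_end;
	struct meta *meta = (void *)(long)ctx->data_meta;

	if ((void *)(meta + 1) > data)
		return XDP_ABORTED;
	meta->mark = 1;

	struct ethhdr *ethernet_header = data;

	if ((void *)(ethernet_header + 1) > data_end)
		return XDP_PASS;

	/* First read from packet memory (the hot spot in the profile) */
	int layer3_protocol = bpf_ntohs(ethernet_header->h_proto);

	if (layer3_protocol == ETH_P_IP) {
		/* ... eventually bpf_tail_call() into the next program ... */
	}
	return XDP_PASS;
}

char LICENSE[] SEC("license") = "GPL";
```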

How We Profiled Our Code:
To profile our code, we used https://github.com/iovisor/bpftrace. We ran this
command while sending traffic to our machine:
`sudo bpftrace -e 'profile:hz:99 { @[kstack] = count(); }' >
/tmp/stack_samples.out`

From there we got kernel stack traces with sample counts, sorted so the most
frequent appear at the bottom of the output file. The most commonly hit stack,
aside from CPU idle, looks like:
```
@[
    bpf_prog_986b0b3beb6f0873_some_program+290
    i40e_napi_poll+1897
    net_rx_action+309
    __softirqentry_text_start+202
    run_ksoftirqd+38
    smpboot_thread_fn+197
    kthread+283
    ret_from_fork+34
]: 8748
```

We then took the program tag from the stack trace and ran this command to
retrieve the JITed code:
`sudo bpftool prog dump jited tag 986b0b3beb6f0873`

By converting the decimal offset from the stack trace to hex (290 = 0x122), we
found the instructions it refers to in the JITed code:
```
11d:   movzbq 0xc(%r15),%rsi
122:   movzbq 0xd(%r15),%rdi
127:   shl    $0x8,%rdi
12b:   or     %rsi,%rdi
12e:   ror    $0x8,%di
132:   movzwl %di,%edi
```
We've mapped this sequence back to the line mentioned earlier:
`int layer3_protocol = bpf_ntohs(ethernet_header->h_proto);`

1) Are we correctly profiling our XDP programs?

2) Is there a reason why our first read from the packet would cause this issue,
and what would be the best way to solve it?
We suspect it may have to do with cache or TLB misses, as we've added a lot
more instructions to our programs.

Thanks for your time!

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Profiling XDP programs for performance issues
  2021-04-08  2:19 Profiling XDP programs for performance issues Neal Shukla
@ 2021-04-08 10:32 ` Toke Høiland-Jørgensen
  2021-04-08 18:20   ` Neal Shukla
  0 siblings, 1 reply; 8+ messages in thread
From: Toke Høiland-Jørgensen @ 2021-04-08 10:32 UTC (permalink / raw)
  To: Neal Shukla, xdp-newbies
  Cc: Björn Töpel, Karlsson, Magnus, Jesper Dangaard Brouer

Neal Shukla <nshukla@riotgames.com> writes:

> We’ve been introducing bpf_tail_call’s into our XDP programs and have run into
> packet loss and latency increases when performing load tests. After profiling
> our code we’ve come to the conclusion that this is the problem area in our code:
> `int layer3_protocol = bpf_ntohs(ethernet_header->h_proto);`
>
> This is the first time we read from the packet in the first XDP program. We have
> yet to make a tail call at this point. However, we do write into the metadata
> section prior to this line.
>
> How We Profiled Our Code:
> To profile our code, we used https://github.com/iovisor/bpftrace. We ran this
> command while sending traffic to our machine:
> `sudo bpftrace -e 'profile:hz:99 { @[kstack] = count(); }' >
> /tmp/stack_samples.out`
>
> From there we got a kernel stack trace with the most frequently counted spots at
> the bottom of the output file. The most commonly hit spot asides from the CPU
> idle look like:
> ```
> @[
>     bpf_prog_986b0b3beb6f0873_some_program+290
>     i40e_napi_poll+1897
>     net_rx_action+309
>     __softirqentry_text_start+202
>     run_ksoftirqd+38
>     smpboot_thread_fn+197
>     kthread+283
>     ret_from_fork+34
> ]: 8748
> ```
>
> We then took the program id and ran this command to retrieve the jited code:
> `sudo bpftool prog dump jited tag 986b0b3beb6f0873`
>
> By converting the decimal offset (290) from the stack trace to hex format (122)
> we found the line which it’s referring to in the jited code:
> ```
> 11d:   movzbq 0xc(%r15),%rsi
> 122:   movzbq 0xd(%r15),%rdi
> 127:   shl         $0x8,%rdi
> 12b:   or          %rsi,%rdi
> 12e:   ror         $0x8,%di
> 132:   movzwl %di,%edi
> ```
> We've mapped this portion to refer to the line mentioned earlier:
> `int layer3_protocol = bpf_ntohs(ethernet_header->h_proto);`
>
> 1) Are we correctly profiling our XDP programs?
>
> 2) Is there a reason why our first read into the packet would cause this issue?
> And what would be the best way to solve the issue?
> We've theorized it may have to do with cache or TLB misses as we've added a lot
> more instructions to our programs.

Yeah, this sounds like a caching issue. What system are you running this
on? Intel's DDIO feature that DMAs packets directly to L3 cache tends to
help with these sorts of things, but maybe your system doesn't have
that, or it's not being used for some reason?

Adding a few other people who have a better grasp of these details than
me, in the hope that they can be more helpful :)

-Toke


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Profiling XDP programs for performance issues
  2021-04-08 10:32 ` Toke Høiland-Jørgensen
@ 2021-04-08 18:20   ` Neal Shukla
  2021-04-08 21:37     ` Zvi Effron
  0 siblings, 1 reply; 8+ messages in thread
From: Neal Shukla @ 2021-04-08 18:20 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: xdp-newbies, Björn Töpel, Karlsson, Magnus,
	Jesper Dangaard Brouer

System Info:
CPU: Intel(R) Xeon(R) Gold 6150 CPU @ 2.70GHz
Network Adapter/NIC: Intel X710
Driver: i40e
Kernel version: 5.8.15
OS: Fedora 33


It’s worth noting that we tried expanding the DDIO way mask to full (0x7ff) and
to a little more than half (0x7f0), with no material effect.

- Neal

On Thu, Apr 8, 2021 at 3:32 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>
> Neal Shukla <nshukla@riotgames.com> writes:
>
> > We’ve been introducing bpf_tail_call’s into our XDP programs and have run into
> > packet loss and latency increases when performing load tests. After profiling
> > our code we’ve come to the conclusion that this is the problem area in our code:
> > `int layer3_protocol = bpf_ntohs(ethernet_header->h_proto);`
> >
> > This is the first time we read from the packet in the first XDP program. We have
> > yet to make a tail call at this point. However, we do write into the metadata
> > section prior to this line.
> >
> > How We Profiled Our Code:
> > To profile our code, we used https://github.com/iovisor/bpftrace. We ran this
> > command while sending traffic to our machine:
> > `sudo bpftrace -e 'profile:hz:99 { @[kstack] = count(); }' >
> > /tmp/stack_samples.out`
> >
> > From there we got a kernel stack trace with the most frequently counted spots at
> > the bottom of the output file. The most commonly hit spot asides from the CPU
> > idle look like:
> > ```
> > @[
> >     bpf_prog_986b0b3beb6f0873_some_program+290
> >     i40e_napi_poll+1897
> >     net_rx_action+309
> >     __softirqentry_text_start+202
> >     run_ksoftirqd+38
> >     smpboot_thread_fn+197
> >     kthread+283
> >     ret_from_fork+34
> > ]: 8748
> > ```
> >
> > We then took the program id and ran this command to retrieve the jited code:
> > `sudo bpftool prog dump jited tag 986b0b3beb6f0873`
> >
> > By converting the decimal offset (290) from the stack trace to hex format (122)
> > we found the line which it’s referring to in the jited code:
> > ```
> > 11d:   movzbq 0xc(%r15),%rsi
> > 122:   movzbq 0xd(%r15),%rdi
> > 127:   shl         $0x8,%rdi
> > 12b:   or          %rsi,%rdi
> > 12e:   ror         $0x8,%di
> > 132:   movzwl %di,%edi
> > ```
> > We've mapped this portion to refer to the line mentioned earlier:
> > `int layer3_protocol = bpf_ntohs(ethernet_header->h_proto);`
> >
> > 1) Are we correctly profiling our XDP programs?
> >
> > 2) Is there a reason why our first read into the packet would cause this issue?
> > And what would be the best way to solve the issue?
> > We've theorized it may have to do with cache or TLB misses as we've added a lot
> > more instructions to our programs.
>
> Yeah, this sounds like a caching issue. What system are you running this
> on? Intel's DDIO feature that DMAs packets directly to L3 cache tends to
> help with these sorts of things, but maybe your system doesn't have
> that, or it's not being used for some reason?
>
> Adding a few other people who have a better grasp of these details than
> me, in the hope that they can be more helpful :)
>
> -Toke
>

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Profiling XDP programs for performance issues
  2021-04-08 18:20   ` Neal Shukla
@ 2021-04-08 21:37     ` Zvi Effron
  2021-04-08 23:05       ` Neal Shukla
  0 siblings, 1 reply; 8+ messages in thread
From: Zvi Effron @ 2021-04-08 21:37 UTC (permalink / raw)
  To: Neal Shukla
  Cc: Toke Høiland-Jørgensen, Xdp, Björn Töpel,
	Karlsson, Magnus, Jesper Dangaard Brouer

Apologies for the spam to anyone who received my first response, but
it was accidentally sent as HTML and rejected by the mailing list.

On Thu, Apr 8, 2021 at 11:20 AM Neal Shukla <nshukla@riotgames.com> wrote:
>
> System Info:
> CPU: Intel(R) Xeon(R) Gold 6150 CPU @ 2.70GHz
> Network Adapter/NIC: Intel X710
> Driver: i40e
> Kernel version: 5.8.15
> OS: Fedora 33
>

Slight correction, we're actually on the 5.10.10 kernel.

--Zvi

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Profiling XDP programs for performance issues
  2021-04-08 21:37     ` Zvi Effron
@ 2021-04-08 23:05       ` Neal Shukla
  2021-04-09  6:40         ` Magnus Karlsson
  0 siblings, 1 reply; 8+ messages in thread
From: Neal Shukla @ 2021-04-08 23:05 UTC (permalink / raw)
  To: Zvi Effron
  Cc: Toke Høiland-Jørgensen, Xdp, Björn Töpel,
	Karlsson, Magnus, Jesper Dangaard Brouer

Using perf, we've confirmed that the line mentioned has a 25.58% cache miss
rate.

On Thu, Apr 8, 2021 at 2:38 PM Zvi Effron <zeffron@riotgames.com> wrote:
>
> Apologies for the spam to anyone who received my first response, but
> it was accidentally sent as HTML and rejected by the mailing list.
>
> On Thu, Apr 8, 2021 at 11:20 AM Neal Shukla <nshukla@riotgames.com> wrote:
> >
> > System Info:
> > CPU: Intel(R) Xeon(R) Gold 6150 CPU @ 2.70GHz
> > Network Adapter/NIC: Intel X710
> > Driver: i40e
> > Kernel version: 5.8.15
> > OS: Fedora 33
> >
>
> Slight correction, we're actually on the 5.10.10 kernel.
>
> --Zvi

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Profiling XDP programs for performance issues
  2021-04-08 23:05       ` Neal Shukla
@ 2021-04-09  6:40         ` Magnus Karlsson
  2021-04-09 10:10           ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 8+ messages in thread
From: Magnus Karlsson @ 2021-04-09  6:40 UTC (permalink / raw)
  To: Neal Shukla
  Cc: Zvi Effron, Toke Høiland-Jørgensen, Xdp,
	Björn Töpel, Karlsson, Magnus, Jesper Dangaard Brouer

On Fri, Apr 9, 2021 at 1:06 AM Neal Shukla <nshukla@riotgames.com> wrote:
>
> Using perf, we've confirmed that the line mentioned has a 25.58% cache miss
> rate.

Do these hit in the LLC or in DRAM? In any case, your best bet is
likely to prefetch this into your L1/L2. In my experience, the best
way to do this is not to use an explicit prefetch instruction but to
touch/fetch the cache lines you need in the beginning of your
computation and let the fetch latency and the usage of the first cache
line hide the latencies of fetching the others. In your case, touch
both metadata and packet at the same time. Work with the metadata and
other things then come back to the packet data and hopefully the
relevant part will reside in the cache or registers by now. If that
does not work, touch packet number N+1 just before starting with
packet N.
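
As a rough sketch of what I mean (assuming a libbpf-style program with a
hypothetical metadata layout, so purely illustrative):
```
#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

/* Hypothetical metadata layout, for illustration only */
struct meta {
	__u32 mark;
};

SEC("xdp")
int first_prog(struct xdp_md *ctx)
{
	if (bpf_xdp_adjust_meta(ctx, -(int)sizeof(struct meta)))
		return XDP_ABORTED;

	void *data = (void *)(long)ctx->data;
	void *data_end = (void *)(long)ctx->data_end;
	struct meta *meta = (void *)(long)ctx->data_meta;
	struct ethhdr *eth = data;

	if ((void *)(meta + 1) > data || (void *)(eth + 1) > data_end)
		return XDP_PASS;

	/* Issue the packet-data load early; keep only the scalar value */
	__be16 h_proto = eth->h_proto;

	/* Do the metadata and other non-packet work while that cache line
	 * is being fetched.
	 */
	meta->mark = 1;

	/* By the time we consume it, the packet line has hopefully arrived */
	if (bpf_ntohs(h_proto) == ETH_P_IP) {
		/* ... continue parsing / tail call into the next program ... */
	}
	return XDP_PASS;
}

char LICENSE[] SEC("license") = "GPL";
```
The point is simply that the load of h_proto is issued before the metadata
work, so its latency can overlap with that work instead of stalling the first
use of the packet.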

Very general recommendations but hope it helps anyway. How exactly to
do this efficiently is very application dependent.

/Magnus

> On Thu, Apr 8, 2021 at 2:38 PM Zvi Effron <zeffron@riotgames.com> wrote:
> >
> > Apologies for the spam to anyone who received my first response, but
> > it was accidentally sent as HTML and rejected by the mailing list.
> >
> > On Thu, Apr 8, 2021 at 11:20 AM Neal Shukla <nshukla@riotgames.com> wrote:
> > >
> > > System Info:
> > > CPU: Intel(R) Xeon(R) Gold 6150 CPU @ 2.70GHz
> > > Network Adapter/NIC: Intel X710
> > > Driver: i40e
> > > Kernel version: 5.8.15
> > > OS: Fedora 33
> > >
> >
> > Slight correction, we're actually on the 5.10.10 kernel.
> >
> > --Zvi

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Profiling XDP programs for performance issues
  2021-04-09  6:40         ` Magnus Karlsson
@ 2021-04-09 10:10           ` Jesper Dangaard Brouer
  2021-04-15  1:14             ` Neal Shukla
  0 siblings, 1 reply; 8+ messages in thread
From: Jesper Dangaard Brouer @ 2021-04-09 10:10 UTC (permalink / raw)
  To: Magnus Karlsson
  Cc: Neal Shukla, Zvi Effron, Toke Høiland-Jørgensen, Xdp,
	Björn Töpel, Karlsson, Magnus, brouer

On Fri, 9 Apr 2021 08:40:51 +0200
Magnus Karlsson <magnus.karlsson@gmail.com> wrote:

> On Fri, Apr 9, 2021 at 1:06 AM Neal Shukla <nshukla@riotgames.com> wrote:
> >
> > Using perf, we've confirmed that the line mentioned has a 25.58% cache miss
> > rate.  
> 
> Do these hit in the LLC or in DRAM? In any case, your best bet is
> likely to prefetch this into your L1/L2. In my experience, the best
> way to do this is not to use an explicit prefetch instruction but to
> touch/fetch the cache lines you need in the beginning of your
> computation and let the fetch latency and the usage of the first cache
> line hide the latencies of fetching the others. In your case, touch
> both metadata and packet at the same time. Work with the metadata and
> other things then come back to the packet data and hopefully the
> relevant part will reside in the cache or registers by now. If that
> does not work, touch packet number N+1 just before starting with
> packet N.
> 
> Very general recommendations but hope it helps anyway. How exactly to
> do this efficiently is very application dependent.

I see you use the i40e driver, and that driver does a net_prefetch(xdp->data)
*AFTER* the XDP hook. That could explain why you are seeing this.

Can you try the patch below, and see if it solves your observed issue?

> > On Thu, Apr 8, 2021 at 2:38 PM Zvi Effron <zeffron@riotgames.com> wrote:  
> > >
> > > Apologies for the spam to anyone who received my first response, but
> > > it was accidentally sent as HTML and rejected by the mailing list.
> > >
> > > On Thu, Apr 8, 2021 at 11:20 AM Neal Shukla <nshukla@riotgames.com> wrote:  
> > > >
> > > > System Info:
> > > > CPU: Intel(R) Xeon(R) Gold 6150 CPU @ 2.70GHz
> > > > Network Adapter/NIC: Intel X710
> > > > Driver: i40e
> > > > Kernel version: 5.8.15
> > > > OS: Fedora 33
> > > >  
> > >
> > > Slight correction, we're actually on the 5.10.10 kernel.

[PATCH] i40e: Move net_prefetch to benefit XDP

From: Jesper Dangaard Brouer <brouer@redhat.com>

DEBUG PATCH WITH XXX comments

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
 drivers/net/ethernet/intel/i40e/i40e_txrx.c |    6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
index e398b8ac2a85..c09b8a5e6a2a 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
@@ -2121,7 +2121,7 @@ static struct sk_buff *i40e_construct_skb(struct i40e_ring *rx_ring,
 	struct sk_buff *skb;
 
 	/* prefetch first cache line of first page */
-	net_prefetch(xdp->data);
+	net_prefetch(xdp->data); // XXX: Too late for XDP
 
 	/* Note, we get here by enabling legacy-rx via:
 	 *
@@ -2205,7 +2205,7 @@ static struct sk_buff *i40e_build_skb(struct i40e_ring *rx_ring,
 	 * likely have a consumer accessing first few bytes of meta
 	 * data, and then actual data.
 	 */
-	net_prefetch(xdp->data_meta);
+//	net_prefetch(xdp->data_meta); //XXX: too late for XDP
 
 	/* build an skb around the page buffer */
 	skb = build_skb(xdp->data_hard_start, truesize);
@@ -2513,6 +2513,7 @@ static int i40e_clean_rx_irq(struct i40e_ring *rx_ring, int budget)
 			/* At larger PAGE_SIZE, frame_sz depend on len size */
 			xdp.frame_sz = i40e_rx_frame_truesize(rx_ring, size);
 #endif
+			net_prefetch(xdp.data);
 			skb = i40e_run_xdp(rx_ring, &xdp);
 		}
 
@@ -2530,6 +2531,7 @@ static int i40e_clean_rx_irq(struct i40e_ring *rx_ring, int budget)
 		} else if (skb) {
 			i40e_add_rx_frag(rx_ring, rx_buffer, skb, size);
 		} else if (ring_uses_build_skb(rx_ring)) {
+			// XXX: net_prefetch called after i40e_run_xdp()
 			skb = i40e_build_skb(rx_ring, rx_buffer, &xdp);
 		} else {
 			skb = i40e_construct_skb(rx_ring, rx_buffer, &xdp);


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: Profiling XDP programs for performance issues
  2021-04-09 10:10           ` Jesper Dangaard Brouer
@ 2021-04-15  1:14             ` Neal Shukla
  0 siblings, 0 replies; 8+ messages in thread
From: Neal Shukla @ 2021-04-15  1:14 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Magnus Karlsson, Zvi Effron, Toke Høiland-Jørgensen,
	Xdp, Björn Töpel, Karlsson, Magnus

Sorry for the late update.

We used the patch and unfortunately saw no improvement.

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2021-04-15  1:14 UTC | newest]

Thread overview: 8+ messages
2021-04-08  2:19 Profiling XDP programs for performance issues Neal Shukla
2021-04-08 10:32 ` Toke Høiland-Jørgensen
2021-04-08 18:20   ` Neal Shukla
2021-04-08 21:37     ` Zvi Effron
2021-04-08 23:05       ` Neal Shukla
2021-04-09  6:40         ` Magnus Karlsson
2021-04-09 10:10           ` Jesper Dangaard Brouer
2021-04-15  1:14             ` Neal Shukla
