* Profiling XDP programs for performance issues
@ 2021-04-08  2:19 Neal Shukla
  2021-04-08 10:32 ` Toke Høiland-Jørgensen
  0 siblings, 1 reply; 8+ messages in thread
From: Neal Shukla @ 2021-04-08 2:19 UTC (permalink / raw)
  To: xdp-newbies

We’ve been introducing bpf_tail_call’s into our XDP programs and have run into
packet loss and latency increases when performing load tests. After profiling
our code we’ve come to the conclusion that this is the problem area in our code:
`int layer3_protocol = bpf_ntohs(ethernet_header->h_proto);`

This is the first time we read from the packet in the first XDP program. We have
yet to make a tail call at this point. However, we do write into the metadata
section prior to this line.

How We Profiled Our Code:
To profile our code, we used https://github.com/iovisor/bpftrace. We ran this
command while sending traffic to our machine:
`sudo bpftrace -e 'profile:hz:99 { @[kstack] = count(); }' > /tmp/stack_samples.out`

From there we got a kernel stack trace with the most frequently counted spots at
the bottom of the output file. The most commonly hit spot, aside from CPU idle,
looks like:
```
@[
    bpf_prog_986b0b3beb6f0873_some_program+290
    i40e_napi_poll+1897
    net_rx_action+309
    __softirqentry_text_start+202
    run_ksoftirqd+38
    smpboot_thread_fn+197
    kthread+283
    ret_from_fork+34
]: 8748
```

We then took the program tag from the stack frame and ran this command to
retrieve the jited code:
`sudo bpftool prog dump jited tag 986b0b3beb6f0873`

By converting the decimal offset (290) from the stack trace to hex (0x122)
we found the instruction it refers to in the jited code:
```
 11d: movzbq 0xc(%r15),%rsi
 122: movzbq 0xd(%r15),%rdi
 127: shl    $0x8,%rdi
 12b: or     %rsi,%rdi
 12e: ror    $0x8,%di
 132: movzwl %di,%edi
```
We've mapped this portion back to the line mentioned earlier:
`int layer3_protocol = bpf_ntohs(ethernet_header->h_proto);`

1) Are we correctly profiling our XDP programs?

2) Is there a reason why our first read into the packet would cause this issue?
And what would be the best way to solve the issue?
We've theorized it may have to do with cache or TLB misses, as we've added a lot
more instructions to our programs.

Thanks for your time!

^ permalink raw reply	[flat|nested] 8+ messages in thread
* Re: Profiling XDP programs for performance issues
  2021-04-08  2:19 Profiling XDP programs for performance issues Neal Shukla
@ 2021-04-08 10:32 ` Toke Høiland-Jørgensen
  2021-04-08 18:20   ` Neal Shukla
  0 siblings, 1 reply; 8+ messages in thread
From: Toke Høiland-Jørgensen @ 2021-04-08 10:32 UTC (permalink / raw)
  To: Neal Shukla, xdp-newbies
  Cc: Björn Töpel, Karlsson, Magnus, Jesper Dangaard Brouer

Neal Shukla <nshukla@riotgames.com> writes:

[...]
> 1) Are we correctly profiling our XDP programs?
>
> 2) Is there a reason why our first read into the packet would cause this issue?
> And what would be the best way to solve the issue?
> We've theorized it may have to do with cache or TLB misses as we've added a lot
> more instructions to our programs.

Yeah, this sounds like a caching issue. What system are you running this
on? Intel's DDIO feature that DMAs packets directly to L3 cache tends to
help with these sorts of things, but maybe your system doesn't have
that, or it's not being used for some reason?

Adding a few other people who have a better grasp of these details than
me, in the hope that they can be more helpful :)

-Toke
* Re: Profiling XDP programs for performance issues
  2021-04-08 10:32 ` Toke Høiland-Jørgensen
@ 2021-04-08 18:20 ` Neal Shukla
  2021-04-08 21:37   ` Zvi Effron
  0 siblings, 1 reply; 8+ messages in thread
From: Neal Shukla @ 2021-04-08 18:20 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: xdp-newbies, Björn Töpel, Karlsson, Magnus, Jesper Dangaard Brouer

System Info:
CPU: Intel(R) Xeon(R) Gold 6150 CPU @ 2.70GHz
Network Adapter/NIC: Intel X710
Driver: i40e
Kernel version: 5.8.15
OS: Fedora 33

It’s worth noting that we tried expanding the DDIO way mask to full (0x7ff)
and to a little more than half (0x7f0) with no material effect.

- Neal

On Thu, Apr 8, 2021 at 3:32 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>
[...]
> Yeah, this sounds like a caching issue. What system are you running this
> on? Intel's DDIO feature that DMAs packets directly to L3 cache tends to
> help with these sorts of things, but maybe your system doesn't have
> that, or it's not being used for some reason?
>
> Adding a few other people who have a better grasp of these details than
> me, in the hope that they can be more helpful :)
>
> -Toke
* Re: Profiling XDP programs for performance issues
  2021-04-08 18:20 ` Neal Shukla
@ 2021-04-08 21:37 ` Zvi Effron
  2021-04-08 23:05   ` Neal Shukla
  0 siblings, 1 reply; 8+ messages in thread
From: Zvi Effron @ 2021-04-08 21:37 UTC (permalink / raw)
  To: Neal Shukla
  Cc: Toke Høiland-Jørgensen, Xdp, Björn Töpel, Karlsson, Magnus,
      Jesper Dangaard Brouer

Apologies for the spam to anyone who received my first response, but
it was accidentally sent as HTML and rejected by the mailing list.

On Thu, Apr 8, 2021 at 11:20 AM Neal Shukla <nshukla@riotgames.com> wrote:
>
> System Info:
> CPU: Intel(R) Xeon(R) Gold 6150 CPU @ 2.70GHz
> Network Adapter/NIC: Intel X710
> Driver: i40e
> Kernel version: 5.8.15
> OS: Fedora 33
>

Slight correction, we're actually on the 5.10.10 kernel.

--Zvi
* Re: Profiling XDP programs for performance issues
  2021-04-08 21:37 ` Zvi Effron
@ 2021-04-08 23:05 ` Neal Shukla
  2021-04-09  6:40   ` Magnus Karlsson
  0 siblings, 1 reply; 8+ messages in thread
From: Neal Shukla @ 2021-04-08 23:05 UTC (permalink / raw)
  To: Zvi Effron
  Cc: Toke Høiland-Jørgensen, Xdp, Björn Töpel, Karlsson, Magnus,
      Jesper Dangaard Brouer

Using perf, we've confirmed that the line mentioned has a 25.58% cache miss
rate.

On Thu, Apr 8, 2021 at 2:38 PM Zvi Effron <zeffron@riotgames.com> wrote:
>
[...]
> Slight correction, we're actually on the 5.10.10 kernel.
>
> --Zvi
* Re: Profiling XDP programs for performance issues
  2021-04-08 23:05 ` Neal Shukla
@ 2021-04-09  6:40 ` Magnus Karlsson
  2021-04-09 10:10   ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 8+ messages in thread
From: Magnus Karlsson @ 2021-04-09 6:40 UTC (permalink / raw)
  To: Neal Shukla
  Cc: Zvi Effron, Toke Høiland-Jørgensen, Xdp, Björn Töpel,
      Karlsson, Magnus, Jesper Dangaard Brouer

On Fri, Apr 9, 2021 at 1:06 AM Neal Shukla <nshukla@riotgames.com> wrote:
>
> Using perf, we've confirmed that the line mentioned has a 25.58% cache miss
> rate.

Do these hit in the LLC or in DRAM? In any case, your best bet is
likely to prefetch this into your L1/L2. In my experience, the best
way to do this is not to use an explicit prefetch instruction but to
touch/fetch the cache lines you need at the beginning of your
computation and let the fetch latency and the usage of the first cache
line hide the latencies of fetching the others. In your case, touch
both metadata and packet at the same time. Work with the metadata and
other things, then come back to the packet data; hopefully the
relevant part will reside in the cache or registers by now. If that
does not work, touch packet number N+1 just before starting with
packet N.

Very general recommendations, but I hope they help anyway. How exactly
to do this efficiently is very application dependent.

/Magnus
* Re: Profiling XDP programs for performance issues
  2021-04-09  6:40 ` Magnus Karlsson
@ 2021-04-09 10:10 ` Jesper Dangaard Brouer
  2021-04-15  1:14   ` Neal Shukla
  0 siblings, 1 reply; 8+ messages in thread
From: Jesper Dangaard Brouer @ 2021-04-09 10:10 UTC (permalink / raw)
  To: Magnus Karlsson
  Cc: Neal Shukla, Zvi Effron, Toke Høiland-Jørgensen, Xdp,
      Björn Töpel, Karlsson, Magnus, brouer

On Fri, 9 Apr 2021 08:40:51 +0200
Magnus Karlsson <magnus.karlsson@gmail.com> wrote:

> On Fri, Apr 9, 2021 at 1:06 AM Neal Shukla <nshukla@riotgames.com> wrote:
> >
> > Using perf, we've confirmed that the line mentioned has a 25.58% cache miss
> > rate.
>
[...]
> Very general recommendations, but I hope they help anyway. How exactly
> to do this efficiently is very application dependent.

I see you use the i40e driver, and that driver does its
net_prefetch(xdp->data) *AFTER* the XDP hook, which could explain what
you are seeing. Can you try the patch below and see if it solves your
observed issue?

[PATCH] i40e: Move net_prefetch to benefit XDP

From: Jesper Dangaard Brouer <brouer@redhat.com>

DEBUG PATCH WITH XXX comments

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
 drivers/net/ethernet/intel/i40e/i40e_txrx.c |    6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
index e398b8ac2a85..c09b8a5e6a2a 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
@@ -2121,7 +2121,7 @@ static struct sk_buff *i40e_construct_skb(struct i40e_ring *rx_ring,
 	struct sk_buff *skb;
 
 	/* prefetch first cache line of first page */
-	net_prefetch(xdp->data);
+	net_prefetch(xdp->data); // XXX: Too late for XDP
 
 	/* Note, we get here by enabling legacy-rx via:
 	 *
@@ -2205,7 +2205,7 @@ static struct sk_buff *i40e_build_skb(struct i40e_ring *rx_ring,
 	 * likely have a consumer accessing first few bytes of meta
 	 * data, and then actual data.
 	 */
-	net_prefetch(xdp->data_meta);
+//	net_prefetch(xdp->data_meta); //XXX: too late for XDP
 
 	/* build an skb around the page buffer */
 	skb = build_skb(xdp->data_hard_start, truesize);
@@ -2513,6 +2513,7 @@ static int i40e_clean_rx_irq(struct i40e_ring *rx_ring, int budget)
 			/* At larger PAGE_SIZE, frame_sz depend on len size */
 			xdp.frame_sz = i40e_rx_frame_truesize(rx_ring, size);
 #endif
+			net_prefetch(xdp->data);
 			skb = i40e_run_xdp(rx_ring, &xdp);
 		}
 
@@ -2530,6 +2531,7 @@ static int i40e_clean_rx_irq(struct i40e_ring *rx_ring, int budget)
 		} else if (skb) {
 			i40e_add_rx_frag(rx_ring, rx_buffer, skb, size);
 		} else if (ring_uses_build_skb(rx_ring)) {
+			// XXX: net_prefetch called after i40e_run_xdp()
 			skb = i40e_build_skb(rx_ring, rx_buffer, &xdp);
 		} else {
 			skb = i40e_construct_skb(rx_ring, rx_buffer, &xdp);
* Re: Profiling XDP programs for performance issues
  2021-04-09 10:10 ` Jesper Dangaard Brouer
@ 2021-04-15  1:14 ` Neal Shukla
  0 siblings, 0 replies; 8+ messages in thread
From: Neal Shukla @ 2021-04-15 1:14 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Magnus Karlsson, Zvi Effron, Toke Høiland-Jørgensen, Xdp,
      Björn Töpel, Karlsson, Magnus

Sorry for the late update. We used the patch and unfortunately saw no
improvement.