* zero-copy between interfaces
@ 2020-01-13  0:18 Ryan Goodfellow
  2020-01-13  9:16 ` Magnus Karlsson
  (2 more replies)

From: Ryan Goodfellow @ 2020-01-13 0:18 UTC
To: xdp-newbies

[-- Attachment #1: Type: text/plain, Size: 3081 bytes --]

Greetings XDP folks. I've been working on a zero-copy XDP bridge
implementation similar to what's described in the following thread.

https://www.spinics.net/lists/xdp-newbies/msg01333.html

I now have an implementation that is working reasonably well under certain
conditions for various hardware. The implementation is primarily based on the
xdpsock_user program in the kernel under samples/bpf. You can find my program
and the corresponding BPF program here.

- https://gitlab.com/mergetb/tech/network-emulation/kernel/blob/v5.5-moa/samples/bpf/xdpsock_multidev.c
- https://gitlab.com/mergetb/tech/network-emulation/kernel/blob/v5.5-moa/samples/bpf/xdpsock_multidev_kern.c

I have a small testbed to run this code on that looks like the following.

Packet forwarding machine:
  CPU: Intel(R) Xeon(R) D-2146NT CPU @ 2.30GHz (8 core / 16 thread)
  Memory: 32 GB
  NICs:
  - Mellanox ConnectX 4 Dual 100G MCX416A-CCAT (connected at 40G)
  - Intel X722 10G SFP+

Sender/receiver machines:
  CPU: Intel(R) Xeon(R) D-2146NT CPU @ 2.30GHz (8 core / 16 thread)
  Memory: 32 GB
  NICs:
  - Mellanox ConnectX 4 40G MCX4131A-BCAT
  - Intel X722 10G SFP+

I could not get zero-copy to work with the i40e driver, as it would crash. I've
attached the corresponding traces from dmesg. The results below are with the
i40e running in SKB/copy mode. I do have an X710-DA4 that I could plug into the
server and test with instead of the X722, if that is of interest. In all cases I
used a single hardware queue, configured via the following.
    ethtool -L <dev> combined 1

The Mellanox cards in zero-copy mode create a sort of shadow set of queues, so I
used ntuple rules to push things through queue 1 (which shadows queue 0) as
follows.

    ethtool -N <dev> flow-type ether src <mac> action 1

The numbers that I have been able to achieve with this code are the following.
MTU is 1500 in all cases.

  mlx5:   ~2.4 Mpps, 29 Gbps  (driver mode, zero-copy)
  i40e:   ~700 Kpps, 8 Gbps   (skb mode, copy)
  virtio: ~200 Kpps, 2.4 Gbps (skb mode, copy, all qemu/kvm VMs)

Are these numbers in the ballpark of what's expected?

One thing I have noticed is that I cannot create large memory maps for the
packet buffers. For example, a frame size of 2048 with 524288 frames (around
1 GB of packet buffer) is fine. However, increasing the size by an order of
magnitude, which is well within the memory capacity of the host machine,
results in an error when creating the UMEM, and the kernel shows the attached
call trace. I'm going to begin investigating this in more detail soon, but if
anyone has advice on large XDP memory maps, that would be much appreciated.

The reason for wanting large memory maps is that our use case for XDP is
network emulation - and sometimes that means introducing delay factors that
can require rather large in-memory packet buffers.

If there is interest in including this program in the official BPF samples I'm
happy to submit a patch. Any comments on the program are also much appreciated.
Thanks ~ ry [-- Attachment #2: large-umem-trace.txt --] [-- Type: text/plain, Size: 3188 bytes --] [ 9879.806272] ------------[ cut here ]------------ [ 9879.811489] WARNING: CPU: 14 PID: 18118 at mm/page_alloc.c:4738 __alloc_pages_nodemask+0x1de/0x2b0 [ 9879.816743] Modules linked in: x86_pkg_temp_thermal i40e igb mlx5_core efivarfs nvme nvme_core [ 9879.822023] CPU: 14 PID: 18118 Comm: xdpsock_multide Not tainted 5.5.0-rc3-moa+ #1 [ 9879.827399] Hardware name: Supermicro SYS-E300-9D-8CN8TP/X11SDV-8C-TP8F, BIOS 1.1a 05/17/2019 [ 9879.832838] RIP: 0010:__alloc_pages_nodemask+0x1de/0x2b0 [ 9879.838257] Code: 0f 85 1a ff ff ff 65 48 8b 04 25 00 7d 01 00 48 05 28 08 00 00 41 bd 01 00 00 00 48 89 44 24 08 e9 fb fe ff ff 80 e7 20 75 02 <0f> 0b 45 31 ed eb 98 44 8b 64 24 18 65 8b 05 17 ff c0 5a 89 c0 48 [ 9879.849435] RSP: 0018:ffffa8fe00777db8 EFLAGS: 00010246 [ 9879.855031] RAX: 0000000000000000 RBX: 00000000000400c0 RCX: 0000000000000000 [ 9879.860665] RDX: 0000000000000000 RSI: 000000000000000b RDI: 0000000000040dc0 [ 9879.866260] RBP: 0000000000800000 R08: ffffffffa6e27560 R09: ffff93f01c1a0000 [ 9879.871862] R10: ffffcb5c5e48a9c0 R11: 0000000822898000 R12: ffff93f01828c600 [ 9879.877478] R13: 000000000000000b R14: 000000000000000b R15: ffff93f01c1a0000 [ 9879.883079] FS: 00007fd124b8c740(0000) GS:ffff93f020180000(0000) knlGS:0000000000000000 [ 9879.888724] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 9879.894361] CR2: 00007f04f8629ecc CR3: 000000080480a005 CR4: 00000000007606e0 [ 9879.900038] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 9879.905692] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 9879.911207] PKRU: 55555554 [ 9879.916581] Call Trace: [ 9879.921841] kmalloc_order+0x1b/0x80 [ 9879.926998] kmalloc_order_trace+0x1d/0xa0 [ 9879.932083] xdp_umem_create+0x380/0x450 [ 9879.937114] xsk_setsockopt+0x1e9/0x270 [ 9879.942151] __sys_setsockopt+0xd6/0x190 [ 9879.947071] __x64_sys_setsockopt+0x21/0x30 [ 
9879.951872] do_syscall_64+0x50/0x140 [ 9879.956564] entry_SYSCALL_64_after_hwframe+0x49/0xbe [ 9879.961189] RIP: 0033:0x7fd124ca787a [ 9879.965746] Code: ff ff ff c3 48 8b 15 15 d6 0b 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb bb 0f 1f 80 00 00 00 00 49 89 ca b8 36 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d e6 d5 0b 00 f7 d8 64 89 01 48 [ 9879.975218] RSP: 002b:00007fff8463f498 EFLAGS: 00000206 ORIG_RAX: 0000000000000036 [ 9879.979993] RAX: ffffffffffffffda RBX: 00007fd0a4b8c000 RCX: 00007fd124ca787a [ 9879.984779] RDX: 0000000000000004 RSI: 000000000000011b RDI: 0000000000000009 [ 9879.989489] RBP: 00007fff8463f580 R08: 0000000000000020 R09: 00005605752b5f70 [ 9879.994115] R10: 00007fff8463f4b0 R11: 0000000000000206 R12: 00007fff8463f590 [ 9879.998684] R13: 0000000080000000 R14: 00005605752b5a40 R15: 00005605752b5ac0 [ 9880.003156] ---[ end trace d2ad1cbfd1388dbe ]--- [ 9880.007555] xdp_umem: kalloc failed [ 9880.032714] xdpsock_multide[18118]: segfault at 98 ip 0000560573d2aba1 sp 00007fff8463f520 error 4 in xdpsock_multidev[560573d2a000+16000] [ 9880.045593] Code: ec 10 41 8b 3c 9c 8b 15 4d 07 02 00 c7 45 ec 00 00 00 00 e8 e1 0f 01 00 85 c0 75 51 48 8d 15 46 09 02 00 8b 45 ec 48 8b 14 da <39> 82 98 00 00 00 74 27 85 c0 74 15 48 8d 3d fc 55 01 00 e8 37 f5 [-- Attachment #3: i40e-zerocopy-trace.txt --] [-- Type: text/plain, Size: 8627 bytes --] [ 328.579154] i40e 0000:b7:00.2: failed to get tracking for 256 queues for VSI 0 err -12 [ 328.579280] i40e 0000:b7:00.2: setup of MAIN VSI failed [ 328.579367] i40e 0000:b7:00.2: can't remove VEB 162 with 0 VSIs left [ 328.579467] BUG: kernel NULL pointer dereference, address: 0000000000000000 [ 328.579573] #PF: supervisor read access in kernel mode [ 328.579655] #PF: error_code(0x0000) - not-present page [ 328.579737] PGD 0 P4D 0 [ 328.579780] Oops: 0000 [#1] SMP PTI [ 328.579835] CPU: 0 PID: 2157 Comm: xdpsock_multide Not tainted 5.5.0-rc3-moa+ #1 [ 328.579951] Hardware name: Supermicro 
SYS-E300-9D-8CN8TP/X11SDV-8C-TP8F, BIOS 1.1a 05/17/2019 [ 328.580093] RIP: 0010:i40e_xdp+0xfc/0x1d0 [i40e] [ 328.580162] Code: 00 ba 01 00 00 00 be 01 00 00 00 e8 3e e9 ff ff 66 83 bd f6 0c 00 00 00 74 27 31 c0 48 8b 95 90 0c 00 00 48 8b 8d d0 0c 00 00 <48> 8b 14 c2 48 83 c0 01 48 89 4a 20 0f b7 95 f6 0c 00 00 39 c2 7f [ 328.580455] RSP: 0018:ffffb2654341f888 EFLAGS: 00010246 [ 328.580538] RAX: 0000000000000000 RBX: ffffffffc0358301 RCX: ffffb26540719000 [ 328.580650] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff91571fe18810 [ 328.580762] RBP: ffff91570ef3b000 R08: 0000000000000000 R09: 00000000000005d5 [ 328.580875] R10: 0000000000aaaaaa R11: 0000000000000000 R12: 0000000000000000 [ 328.580987] R13: 0000000000000000 R14: ffffb26540719000 R15: ffffffffc032f300 [ 328.581099] FS: 00007f079f960740(0000) GS:ffff91571fe00000(0000) knlGS:0000000000000000 [ 328.581227] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 328.581316] CR2: 0000000000000000 CR3: 000000085578c005 CR4: 00000000007606f0 [ 328.581428] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 328.581541] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 328.585600] PKRU: 55555554 [ 328.589632] Call Trace: [ 328.593637] ? i40e_reconfig_rss_queues+0x100/0x100 [i40e] [ 328.597641] dev_xdp_install+0x4e/0x70 [ 328.601530] dev_change_xdp_fd+0x11a/0x260 [ 328.605335] do_setlink+0xdd0/0xe80 [ 328.609020] ? get_page_from_freelist+0x738/0x10d0 [ 328.612678] ? on_each_cpu+0x54/0x60 [ 328.616233] ? kmem_cache_alloc_node+0x43/0x1f0 [ 328.619856] ? optimize_nops.isra.0+0x90/0x90 [ 328.623457] rtnl_setlink+0xe5/0x160 [ 328.627015] ? security_capable+0x40/0x60 [ 328.630541] rtnetlink_rcv_msg+0x2b0/0x360 [ 328.634046] ? alloc_file+0x76/0xf0 [ 328.637513] ? alloc_file_pseudo+0xa3/0x110 [ 328.640949] ? 
rtnl_calcit.isra.0+0x110/0x110 [ 328.644369] netlink_rcv_skb+0x49/0x110 [ 328.647757] netlink_unicast+0x191/0x230 [ 328.651140] netlink_sendmsg+0x225/0x460 [ 328.654481] sock_sendmsg+0x5e/0x60 [ 328.657784] __sys_sendto+0xee/0x160 [ 328.661050] ? __sys_getsockname+0x7e/0xc0 [ 328.664306] ? entry_SYSCALL_64_after_hwframe+0x3e/0xbe [ 328.667564] ? entry_SYSCALL_64_after_hwframe+0x3e/0xbe [ 328.670715] ? trace_hardirqs_off_caller+0x2d/0xd0 [ 328.673841] ? do_syscall_64+0x12/0x140 [ 328.676955] __x64_sys_sendto+0x25/0x30 [ 328.680025] do_syscall_64+0x50/0x140 [ 328.683046] entry_SYSCALL_64_after_hwframe+0x49/0xbe [ 328.686097] RIP: 0033:0x7f079fa7b66d [ 328.689126] Code: ff ff ff ff eb bc 0f 1f 80 00 00 00 00 48 8d 05 79 3d 0c 00 41 89 ca 8b 00 85 c0 75 20 45 31 c9 45 31 c0 b8 2c 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 6b c3 66 2e 0f 1f 84 00 00 00 00 00 55 48 83 [ 328.695639] RSP: 002b:00007fffe2a66e88 EFLAGS: 00000246 ORIG_RAX: 000000000000002c [ 328.698886] RAX: ffffffffffffffda RBX: 0000000000000005 RCX: 00007f079fa7b66d [ 328.702098] RDX: 0000000000000034 RSI: 00007fffe2a66ea0 RDI: 0000000000000007 [ 328.705240] RBP: 00007fffe2a66f20 R08: 0000000000000000 R09: 0000000000000000 [ 328.708278] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000007 [ 328.711244] R13: 0000000000000006 R14: 0000000000000007 R15: 0000000000000020 [ 328.714150] Modules linked in: x86_pkg_temp_thermal i40e igb mlx5_core efivarfs nvme nvme_core [ 328.717149] CR2: 0000000000000000 [ 328.720135] ---[ end trace f2aa62f9801a8115 ]--- [ 328.723157] RIP: 0010:i40e_xdp+0xfc/0x1d0 [i40e] [ 328.726171] Code: 00 ba 01 00 00 00 be 01 00 00 00 e8 3e e9 ff ff 66 83 bd f6 0c 00 00 00 74 27 31 c0 48 8b 95 90 0c 00 00 48 8b 8d d0 0c 00 00 <48> 8b 14 c2 48 83 c0 01 48 89 4a 20 0f b7 95 f6 0c 00 00 39 c2 7f [ 328.732497] RSP: 0018:ffffb2654341f888 EFLAGS: 00010246 [ 328.735655] RAX: 0000000000000000 RBX: ffffffffc0358301 RCX: ffffb26540719000 [ 328.738890] RDX: 0000000000000000 RSI: 0000000000000000 
RDI: ffff91571fe18810 [ 328.742151] RBP: ffff91570ef3b000 R08: 0000000000000000 R09: 00000000000005d5 [ 328.745377] R10: 0000000000aaaaaa R11: 0000000000000000 R12: 0000000000000000 [ 328.748616] R13: 0000000000000000 R14: ffffb26540719000 R15: ffffffffc032f300 [ 328.751848] FS: 00007f079f960740(0000) GS:ffff91571fe00000(0000) knlGS:0000000000000000 [ 328.755130] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 328.758421] CR2: 0000000000000000 CR3: 000000085578c005 CR4: 00000000007606f0 [ 328.761777] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 328.765113] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 328.768421] PKRU: 55555554 [ 328.780850] BUG: kernel NULL pointer dereference, address: 0000000000000f40 [ 328.784237] #PF: supervisor read access in kernel mode [ 328.787635] #PF: error_code(0x0000) - not-present page [ 328.791042] PGD 0 P4D 0 [ 328.794431] Oops: 0000 [#2] SMP PTI [ 328.797827] CPU: 0 PID: 185 Comm: kworker/0:2 Tainted: G D 5.5.0-rc3-moa+ #1 [ 328.801358] Hardware name: Supermicro SYS-E300-9D-8CN8TP/X11SDV-8C-TP8F, BIOS 1.1a 05/17/2019 [ 328.804971] Workqueue: i40e i40e_service_task [i40e] [ 328.808596] RIP: 0010:i40e_notify_client_of_netdev_close+0x6/0x60 [i40e] [ 328.812301] Code: 40 10 e8 1d 9c ab ca 48 8b 44 24 28 65 48 33 04 25 28 00 00 00 75 06 48 83 c4 30 5b c3 e8 e2 3c d2 c9 66 90 0f 1f 44 00 00 53 <48> 8b 87 40 0f 00 00 48 8b 98 38 08 00 00 48 85 db 74 44 4c 8b 83 [ 328.820156] RSP: 0018:ffffb265405b3de0 EFLAGS: 00010247 [ 328.824151] RAX: ffff91570ef42c00 RBX: ffff91570ef58810 RCX: ffff915713eb6818 [ 328.828215] RDX: 0000000000000000 RSI: 0000000000000001 RDI: 0000000000000000 [ 328.832300] RBP: ffff91570ef58000 R08: 0000000065303469 R09: 8080808080808080 [ 328.836435] R10: ffff915717db05ac R11: 0000000000000018 R12: ffff91570ef58008 [ 328.840564] R13: 0000000000000053 R14: 0000000000000000 R15: 0ffffd2653fa0980 [ 328.844691] FS: 0000000000000000(0000) GS:ffff91571fe00000(0000) 
knlGS:0000000000000000 [ 328.848883] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 328.853088] CR2: 0000000000000f40 CR3: 000000001c80a006 CR4: 00000000007606f0 [ 328.857340] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 328.861571] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 328.865681] PKRU: 55555554 [ 328.869666] Call Trace: [ 328.873572] i40e_service_task+0x511/0xa40 [i40e] [ 328.877484] ? process_one_work+0x1d3/0x390 [ 328.881365] process_one_work+0x1e5/0x390 [ 328.885245] worker_thread+0x4a/0x3d0 [ 328.889121] kthread+0xfb/0x130 [ 328.893004] ? process_one_work+0x390/0x390 [ 328.896831] ? kthread_park+0x90/0x90 [ 328.900580] ret_from_fork+0x3a/0x50 [ 328.904312] Modules linked in: x86_pkg_temp_thermal i40e igb mlx5_core efivarfs nvme nvme_core [ 328.908185] CR2: 0000000000000f40 [ 328.911944] ---[ end trace f2aa62f9801a8116 ]--- [ 328.915635] RIP: 0010:i40e_xdp+0xfc/0x1d0 [i40e] [ 328.919262] Code: 00 ba 01 00 00 00 be 01 00 00 00 e8 3e e9 ff ff 66 83 bd f6 0c 00 00 00 74 27 31 c0 48 8b 95 90 0c 00 00 48 8b 8d d0 0c 00 00 <48> 8b 14 c2 48 83 c0 01 48 89 4a 20 0f b7 95 f6 0c 00 00 39 c2 7f [ 328.926889] RSP: 0018:ffffb2654341f888 EFLAGS: 00010246 [ 328.930743] RAX: 0000000000000000 RBX: ffffffffc0358301 RCX: ffffb26540719000 [ 328.934672] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff91571fe18810 [ 328.938595] RBP: ffff91570ef3b000 R08: 0000000000000000 R09: 00000000000005d5 [ 328.942481] R10: 0000000000aaaaaa R11: 0000000000000000 R12: 0000000000000000 [ 328.946304] R13: 0000000000000000 R14: ffffb26540719000 R15: ffffffffc032f300 [ 328.950102] FS: 0000000000000000(0000) GS:ffff91571fe00000(0000) knlGS:0000000000000000 [ 328.953961] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 328.957847] CR2: 0000000000000f40 CR3: 000000001c80a006 CR4: 00000000007606f0 [ 328.961743] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 328.965589] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 
0000000000000400 [ 328.969372] PKRU: 55555554 ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: zero-copy between interfaces 2020-01-13 0:18 zero-copy between interfaces Ryan Goodfellow @ 2020-01-13 9:16 ` Magnus Karlsson 2020-01-13 10:43 ` Toke Høiland-Jørgensen 2020-01-13 15:11 ` Ryan Goodfellow 2020-01-13 11:41 ` Jesper Dangaard Brouer 2020-01-17 12:32 ` [Intel-wired-lan] " =?unknown-8bit?q?Bj=C3=B6rn_T=C3=B6pel?= 2 siblings, 2 replies; 49+ messages in thread From: Magnus Karlsson @ 2020-01-13 9:16 UTC (permalink / raw) To: Ryan Goodfellow; +Cc: xdp-newbies On Mon, Jan 13, 2020 at 1:28 AM Ryan Goodfellow <rgoodfel@isi.edu> wrote: > > Greetings XDP folks. I've been working on a zero-copy XDP bridge > implementation similar to what's described in the following thread. > > https://www.spinics.net/lists/xdp-newbies/msg01333.html > > I now have an implementation that is working reasonably well under certain > conditions for various hardware. The implementation is primarily based on the > xdpsock_user program in the kernel under samples/bpf. You can find my program > and corresponding BPF program here. > > - https://gitlab.com/mergetb/tech/network-emulation/kernel/blob/v5.5-moa/samples/bpf/xdpsock_multidev.c > - https://gitlab.com/mergetb/tech/network-emulation/kernel/blob/v5.5-moa/samples/bpf/xdpsock_multidev_kern.c > > I have small testbed to run this code on that looks like the following. > > Packet forwarding machine: > CPU: Intel(R) Xeon(R) D-2146NT CPU @ 2.30GHz (8 core / 16 thread) > Memory: 32 GB > NICs: > - Mellanox ConnectX 4 Dual 100G MCX416A-CCAT (connected at 40G) > - Intel X722 10G SFP+ > > Sender/receiver machines > CPU: Intel(R) Xeon(R) D-2146NT CPU @ 2.30GHz (8 core / 16 thread) > Memory: 32 GB > NICs: > - Mellanox ConnectX 4 40G MCX4131A-BCAT > - Intel X722 10G SFP+ > > I could not get zero-copy to work with the i40e driver as it would crash. I've > attached the corresponding traces from dmesg. The results below are with the > i40e running in SKB/copy mode. 
I do have an X710-DA4 that I could plug into the
> server and test with instead of the X722 if that is of interest. In all cases I
> used a single hardware queue via the following.
>
> ethtool -L <dev> combined 1
>
> The Mellanox cards in zero-copy mode create a sort of shadow set of queues, I
> used ntuple rules to push things through queue 1 (shadows 0) as follows
>
> ethtool -N <dev> flow-type ether src <mac> action 1
>
> The numbers that I have been able to achive with this code are the following. MTU
> is 1500 in all cases.
>
> mlx5: pps ~ 2.4 Mpps, 29 Gbps (driver mode, zero-copy)
> i40e: pps ~ 700 Kpps, 8 Gbps (skb mode, copy)
> virtio: pps ~ 200 Kpps, 2.4 Gbps (skb mode, copy, all qemu/kvm VMs)
>
> Are these numbers in the ballpark of what's expected?
>
> One thing I have noticed is that I cannot create large memory maps for the
> packet buffers. For example a frame size of 2048 with 524288 frames (around
> 1G of packets) is fine. However, increasing size by an order of magnitude, which
> is well within the memory capacity of the host machine results in an error when
> creating the UMEM and the kernel shows the attached call trace. I'm going to
> begin investigating this in more detail soon, but if anyone has advice on large
> XDP memory maps that would be much appreciated.

Hi Ryan,

Thanks for taking XDP and AF_XDP for a spin. I will start by fixing
this out-of-memory issue. With your umem size, we are hitting the size
limit of kmalloc. I will fix this by using kvmalloc, which falls back to
vmalloc if kmalloc fails. That should hopefully make it possible for you
to allocate larger umems.

> The reason for wanting large memory maps is that our use case for XDP is network
> emulation - and sometimes that means introducing delay factors that can require
> rather large in-memory packet buffers.
>
> If there is interest in including this program in the official BPF samples I'm happy to
> submit a patch. Any comments on the program are also much appreciated.
More examples are always useful, but the question is whether it should
reside in samples or outside the kernel in some other repo. Is there
some good place in the xdp-project github that could be used for this
purpose?

/Magnus

> Thanks
>
> ~ ry
* Re: zero-copy between interfaces 2020-01-13 9:16 ` Magnus Karlsson @ 2020-01-13 10:43 ` Toke Høiland-Jørgensen 2020-01-13 15:25 ` Ryan Goodfellow 2020-01-13 15:11 ` Ryan Goodfellow 1 sibling, 1 reply; 49+ messages in thread From: Toke Høiland-Jørgensen @ 2020-01-13 10:43 UTC (permalink / raw) To: Magnus Karlsson, Ryan Goodfellow; +Cc: xdp-newbies Magnus Karlsson <magnus.karlsson@gmail.com> writes: > On Mon, Jan 13, 2020 at 1:28 AM Ryan Goodfellow <rgoodfel@isi.edu> wrote: >> >> Greetings XDP folks. I've been working on a zero-copy XDP bridge >> implementation similar to what's described in the following thread. >> >> https://www.spinics.net/lists/xdp-newbies/msg01333.html >> >> I now have an implementation that is working reasonably well under certain >> conditions for various hardware. The implementation is primarily based on the >> xdpsock_user program in the kernel under samples/bpf. You can find my program >> and corresponding BPF program here. >> >> - https://gitlab.com/mergetb/tech/network-emulation/kernel/blob/v5.5-moa/samples/bpf/xdpsock_multidev.c >> - https://gitlab.com/mergetb/tech/network-emulation/kernel/blob/v5.5-moa/samples/bpf/xdpsock_multidev_kern.c >> >> I have small testbed to run this code on that looks like the following. >> >> Packet forwarding machine: >> CPU: Intel(R) Xeon(R) D-2146NT CPU @ 2.30GHz (8 core / 16 thread) >> Memory: 32 GB >> NICs: >> - Mellanox ConnectX 4 Dual 100G MCX416A-CCAT (connected at 40G) >> - Intel X722 10G SFP+ >> >> Sender/receiver machines >> CPU: Intel(R) Xeon(R) D-2146NT CPU @ 2.30GHz (8 core / 16 thread) >> Memory: 32 GB >> NICs: >> - Mellanox ConnectX 4 40G MCX4131A-BCAT >> - Intel X722 10G SFP+ >> >> I could not get zero-copy to work with the i40e driver as it would crash. I've >> attached the corresponding traces from dmesg. The results below are with the >> i40e running in SKB/copy mode. I do have an X710-DA4 that I could plug into the >> server and test with instead of the X722 if that is of interest. 
In all cases I >> used a single hardware queue via the following. >> >> ethtool -L <dev> combined 1 >> >> The Mellanox cards in zero-copy mode create a sort of shadow set of queues, I >> used ntuple rules to push things through queue 1 (shadows 0) as follows >> >> ethtool -N <dev> flow-type ether src <mac> action 1 >> >> The numbers that I have been able to achive with this code are the following. MTU >> is 1500 in all cases. >> >> mlx5: pps ~ 2.4 Mpps, 29 Gbps (driver mode, zero-copy) >> i40e: pps ~ 700 Kpps, 8 Gbps (skb mode, copy) >> virtio: pps ~ 200 Kpps, 2.4 Gbps (skb mode, copy, all qemu/kvm VMs) >> >> Are these numbers in the ballpark of what's expected? >> >> One thing I have noticed is that I cannot create large memory maps for the >> packet buffers. For example a frame size of 2048 with 524288 frames (around >> 1G of packets) is fine. However, increasing size by an order of magnitude, which >> is well within the memory capacity of the host machine results in an error when >> creating the UMEM and the kernel shows the attached call trace. I'm going to >> begin investigating this in more detail soon, but if anyone has advice on large >> XDP memory maps that would be much appreciated. > > Hi Ryan, > > Thanks for taking XDP and AF_XDP for a sping. I will start by fixing > this out-of-memory issue. With your umem size, we are hitting the size > limit of kmalloc. I will fix this by using kvmalloc that tries to > allocate with vmalloc if kmalloc fails. Should hopefully make it > possible for you to allocate larger umems. > >> The reason for wanting large memory maps is that our use case for XDP is network >> emulation - and sometimes that means introducing delay factors that can require >> a rather large in-memory packet buffers. >> >> If there is interest in including this program in the official BPF samples I'm happy to >> submit a patch. Any comments on the program are also much appreciated. 
>
> More examples are always useful, but the question is if it should
> reside in samples or outside the kernel in some other repo? Is there
> some good place in xdp-project github that could be used for this
> purpose?

We could certainly create something; either a new xdp-samples
repository, or an example-programs/ subdir of the xdp-tutorial. Which of
those makes the most sense depends on the size of the program, I think...

-Toke
* Re: zero-copy between interfaces 2020-01-13 10:43 ` Toke Høiland-Jørgensen @ 2020-01-13 15:25 ` Ryan Goodfellow 2020-01-13 17:09 ` Toke Høiland-Jørgensen 0 siblings, 1 reply; 49+ messages in thread From: Ryan Goodfellow @ 2020-01-13 15:25 UTC (permalink / raw) To: Toke Høiland-Jørgensen; +Cc: Magnus Karlsson, xdp-newbies On Mon, Jan 13, 2020 at 11:43:02AM +0100, Toke Høiland-Jørgensen wrote: >> Magnus Karlsson <magnus.karlsson@gmail.com> writes: >>> On Mon, Jan 13, 2020 at 1:28 AM Ryan Goodfellow <rgoodfel@isi.edu> wrote: >>> The reason for wanting large memory maps is that our use case for XDP is network >>> emulation - and sometimes that means introducing delay factors that can require >>> a rather large in-memory packet buffers. >>> >>> If there is interest in including this program in the official BPF samples I'm happy to >>> submit a patch. Any comments on the program are also much appreciated. >> >> More examples are always useful, but the question is if it should >> reside in samples or outside the kernel in some other repo? Is there >> some good place in xdp-project github that could be used for this >> purpose? > > We could certainly create something; either a new xdp-samples > repository, or an example-programs/ subdir of the xdp-tutorial? Which of > those makes the most sense depends on the size of the program I think... > > -Toke > I'm happy to provide patches or pull-requests in either case. The userspace program is 1 file with 555 lines and the BPF program is 28 lines. I've tested the userspace program with the 5.5 kernel. The BPF program requires clang-9 to work properly (due to BTF features IIRC). - https://gitlab.com/mergetb/tech/network-emulation/kernel/blob/v5.5-moa/samples/bpf/xdpsock_multidev.c - https://gitlab.com/mergetb/tech/network-emulation/kernel/blob/v5.5-moa/samples/bpf/xdpsock_multidev_kern.c The primary usefulness of this program relative to what's out there is that it pushes packets between interfaces using a common memory map. 
-- ~ ry
* Re: zero-copy between interfaces 2020-01-13 15:25 ` Ryan Goodfellow @ 2020-01-13 17:09 ` Toke Høiland-Jørgensen 2020-01-14 7:47 ` Magnus Karlsson 0 siblings, 1 reply; 49+ messages in thread From: Toke Høiland-Jørgensen @ 2020-01-13 17:09 UTC (permalink / raw) To: Ryan Goodfellow; +Cc: Magnus Karlsson, xdp-newbies Ryan Goodfellow <rgoodfel@isi.edu> writes: > On Mon, Jan 13, 2020 at 11:43:02AM +0100, Toke Høiland-Jørgensen wrote: >>> Magnus Karlsson <magnus.karlsson@gmail.com> writes: >>>> On Mon, Jan 13, 2020 at 1:28 AM Ryan Goodfellow <rgoodfel@isi.edu> wrote: >>>> The reason for wanting large memory maps is that our use case for XDP is network >>>> emulation - and sometimes that means introducing delay factors that can require >>>> a rather large in-memory packet buffers. >>>> >>>> If there is interest in including this program in the official BPF samples I'm happy to >>>> submit a patch. Any comments on the program are also much appreciated. >>> >>> More examples are always useful, but the question is if it should >>> reside in samples or outside the kernel in some other repo? Is there >>> some good place in xdp-project github that could be used for this >>> purpose? >> >> We could certainly create something; either a new xdp-samples >> repository, or an example-programs/ subdir of the xdp-tutorial? Which of >> those makes the most sense depends on the size of the program I think... >> >> -Toke >> > > I'm happy to provide patches or pull-requests in either case. The userspace > program is 1 file with 555 lines and the BPF program is 28 lines. I've > tested the userspace program with the 5.5 kernel. The BPF program requires > clang-9 to work properly (due to BTF features IIRC). 
> > - https://gitlab.com/mergetb/tech/network-emulation/kernel/blob/v5.5-moa/samples/bpf/xdpsock_multidev.c > - https://gitlab.com/mergetb/tech/network-emulation/kernel/blob/v5.5-moa/samples/bpf/xdpsock_multidev_kern.c > > The primary usefulness of this program relative to what's out there is > that it pushes packets between interfaces using a common memory map. Hmm, yeah, this could live in either samples or as standalone. IDK, Magnus what do you think? -Toke ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: zero-copy between interfaces 2020-01-13 17:09 ` Toke Høiland-Jørgensen @ 2020-01-14 7:47 ` Magnus Karlsson 2020-01-14 8:11 ` Toke Høiland-Jørgensen 0 siblings, 1 reply; 49+ messages in thread From: Magnus Karlsson @ 2020-01-14 7:47 UTC (permalink / raw) To: Toke Høiland-Jørgensen; +Cc: Ryan Goodfellow, xdp-newbies On Mon, Jan 13, 2020 at 6:09 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote: > > Ryan Goodfellow <rgoodfel@isi.edu> writes: > > > On Mon, Jan 13, 2020 at 11:43:02AM +0100, Toke Høiland-Jørgensen wrote: > >>> Magnus Karlsson <magnus.karlsson@gmail.com> writes: > >>>> On Mon, Jan 13, 2020 at 1:28 AM Ryan Goodfellow <rgoodfel@isi.edu> wrote: > >>>> The reason for wanting large memory maps is that our use case for XDP is network > >>>> emulation - and sometimes that means introducing delay factors that can require > >>>> a rather large in-memory packet buffers. > >>>> > >>>> If there is interest in including this program in the official BPF samples I'm happy to > >>>> submit a patch. Any comments on the program are also much appreciated. > >>> > >>> More examples are always useful, but the question is if it should > >>> reside in samples or outside the kernel in some other repo? Is there > >>> some good place in xdp-project github that could be used for this > >>> purpose? > >> > >> We could certainly create something; either a new xdp-samples > >> repository, or an example-programs/ subdir of the xdp-tutorial? Which of > >> those makes the most sense depends on the size of the program I think... > >> > >> -Toke > >> > > > > I'm happy to provide patches or pull-requests in either case. The userspace > > program is 1 file with 555 lines and the BPF program is 28 lines. I've > > tested the userspace program with the 5.5 kernel. The BPF program requires > > clang-9 to work properly (due to BTF features IIRC). 
> > - https://gitlab.com/mergetb/tech/network-emulation/kernel/blob/v5.5-moa/samples/bpf/xdpsock_multidev.c
> > - https://gitlab.com/mergetb/tech/network-emulation/kernel/blob/v5.5-moa/samples/bpf/xdpsock_multidev_kern.c
> >
> > The primary usefulness of this program relative to what's out there is
> > that it pushes packets between interfaces using a common memory map.
>
> Hmm, yeah, this could live in either samples or as standalone. IDK,
> Magnus what do you think?

Let us try the kernel samples first. Then, if the folks on the mailing
list do not want more samples in there, we can fall back on the
xdp-tutorial repo.

/Magnus

> -Toke
* Re: zero-copy between interfaces 2020-01-14 7:47 ` Magnus Karlsson @ 2020-01-14 8:11 ` Toke Høiland-Jørgensen 0 siblings, 0 replies; 49+ messages in thread From: Toke Høiland-Jørgensen @ 2020-01-14 8:11 UTC (permalink / raw) To: Magnus Karlsson; +Cc: Ryan Goodfellow, xdp-newbies Magnus Karlsson <magnus.karlsson@gmail.com> writes: > On Mon, Jan 13, 2020 at 6:09 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote: >> >> Ryan Goodfellow <rgoodfel@isi.edu> writes: >> >> > On Mon, Jan 13, 2020 at 11:43:02AM +0100, Toke Høiland-Jørgensen wrote: >> >>> Magnus Karlsson <magnus.karlsson@gmail.com> writes: >> >>>> On Mon, Jan 13, 2020 at 1:28 AM Ryan Goodfellow <rgoodfel@isi.edu> wrote: >> >>>> The reason for wanting large memory maps is that our use case for XDP is network >> >>>> emulation - and sometimes that means introducing delay factors that can require >> >>>> a rather large in-memory packet buffers. >> >>>> >> >>>> If there is interest in including this program in the official BPF samples I'm happy to >> >>>> submit a patch. Any comments on the program are also much appreciated. >> >>> >> >>> More examples are always useful, but the question is if it should >> >>> reside in samples or outside the kernel in some other repo? Is there >> >>> some good place in xdp-project github that could be used for this >> >>> purpose? >> >> >> >> We could certainly create something; either a new xdp-samples >> >> repository, or an example-programs/ subdir of the xdp-tutorial? Which of >> >> those makes the most sense depends on the size of the program I think... >> >> >> >> -Toke >> >> >> > >> > I'm happy to provide patches or pull-requests in either case. The userspace >> > program is 1 file with 555 lines and the BPF program is 28 lines. I've >> > tested the userspace program with the 5.5 kernel. The BPF program requires >> > clang-9 to work properly (due to BTF features IIRC). 
>> > >> > - https://gitlab.com/mergetb/tech/network-emulation/kernel/blob/v5.5-moa/samples/bpf/xdpsock_multidev.c >> > - https://gitlab.com/mergetb/tech/network-emulation/kernel/blob/v5.5-moa/samples/bpf/xdpsock_multidev_kern.c >> > >> > The primary usefulness of this program relative to what's out there is >> > that it pushes packets between interfaces using a common memory map. >> >> Hmm, yeah, this could live in either samples or as standalone. IDK, >> Magnus what do you think? > > Let us try the kernel samples first. Then and if the folks on the > mailing list do not want more samples in there, we fall back on the > xdp-tutorial repo. ACK, SGTM :) -Toke ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: zero-copy between interfaces 2020-01-13 9:16 ` Magnus Karlsson 2020-01-13 10:43 ` Toke Høiland-Jørgensen @ 2020-01-13 15:11 ` Ryan Goodfellow 2020-01-14 9:59 ` Magnus Karlsson 1 sibling, 1 reply; 49+ messages in thread From: Ryan Goodfellow @ 2020-01-13 15:11 UTC (permalink / raw) To: Magnus Karlsson; +Cc: xdp-newbies On Mon, Jan 13, 2020 at 10:16:47AM +0100, Magnus Karlsson wrote: > Thanks for taking XDP and AF_XDP for a spin. I will start by fixing > this out-of-memory issue. With your umem size, we are hitting the size > limit of kmalloc. I will fix this by using kvmalloc, which tries to > allocate with vmalloc if kmalloc fails. That should hopefully make it > possible for you to allocate larger umems. > Thanks! -- ~ ry ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: zero-copy between interfaces 2020-01-13 15:11 ` Ryan Goodfellow @ 2020-01-14 9:59 ` Magnus Karlsson 2020-01-14 20:52 ` Ryan Goodfellow 0 siblings, 1 reply; 49+ messages in thread From: Magnus Karlsson @ 2020-01-14 9:59 UTC (permalink / raw) To: Ryan Goodfellow; +Cc: xdp-newbies On Mon, Jan 13, 2020 at 4:12 PM Ryan Goodfellow <rgoodfel@isi.edu> wrote: > > On Mon, Jan 13, 2020 at 10:16:47AM +0100, Magnus Karlsson wrote: > > Thanks for taking XDP and AF_XDP for a sping. I will start by fixing > > this out-of-memory issue. With your umem size, we are hitting the size > > limit of kmalloc. I will fix this by using kvmalloc that tries to > > allocate with vmalloc if kmalloc fails. Should hopefully make it > > possible for you to allocate larger umems. > > > > Thanks! Just sent out a patch on the mailing list. Would be great if you could try it out. /Magnus > -- > ~ ry ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: zero-copy between interfaces 2020-01-14 9:59 ` Magnus Karlsson @ 2020-01-14 20:52 ` Ryan Goodfellow 2020-01-15 1:41 ` Ryan Goodfellow 2020-01-17 17:40 ` William Tu 0 siblings, 2 replies; 49+ messages in thread From: Ryan Goodfellow @ 2020-01-14 20:52 UTC (permalink / raw) To: Magnus Karlsson; +Cc: xdp-newbies On Tue, Jan 14, 2020 at 10:59:19AM +0100, Magnus Karlsson wrote: > > Just sent out a patch on the mailing list. Would be great if you could > try it out. Thanks for the quick turnaround. I gave this patch a go, both in the bpf-next tree and manually applied to the 5.5.0-rc3 branch I've been working with up to this point. It does allow for allocating more memory, however packet forwarding no longer works. I did not see any complaints from dmesg, but here is an example iperf3 session from a client that worked before. ry@xd2:~$ iperf3 -c 10.1.0.2 Connecting to host 10.1.0.2, port 5201 [ 5] local 10.1.0.1 port 53304 connected to 10.1.0.2 port 5201 [ ID] Interval Transfer Bitrate Retr Cwnd [ 5] 0.00-1.00 sec 5.91 MBytes 49.5 Mbits/sec 2 1.41 KBytes [ 5] 1.00-2.00 sec 0.00 Bytes 0.00 bits/sec 1 1.41 KBytes [ 5] 2.00-3.00 sec 0.00 Bytes 0.00 bits/sec 0 1.41 KBytes [ 5] 3.00-4.00 sec 0.00 Bytes 0.00 bits/sec 1 1.41 KBytes [ 5] 4.00-5.00 sec 0.00 Bytes 0.00 bits/sec 0 1.41 KBytes [ 5] 5.00-6.00 sec 0.00 Bytes 0.00 bits/sec 0 1.41 KBytes [ 5] 6.00-7.00 sec 0.00 Bytes 0.00 bits/sec 1 1.41 KBytes [ 5] 7.00-8.00 sec 0.00 Bytes 0.00 bits/sec 0 1.41 KBytes [ 5] 8.00-9.00 sec 0.00 Bytes 0.00 bits/sec 0 1.41 KBytes ^C[ 5] 10.00-139.77 sec 0.00 Bytes 0.00 bits/sec 4 1.41 KBytes - - - - - - - - - - - - - - - - - - - - - - - - - [ ID] Interval Transfer Bitrate Retr [ 5] 0.00-139.77 sec 5.91 MBytes 355 Kbits/sec 9 sender [ 5] 0.00-139.77 sec 0.00 Bytes 0.00 bits/sec receiver iperf3: interrupt - the client has terminated I'll continue to investigate and report back with anything that I find. -- ~ ry ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: zero-copy between interfaces 2020-01-14 20:52 ` Ryan Goodfellow @ 2020-01-15 1:41 ` Ryan Goodfellow 2020-01-15 7:40 ` Magnus Karlsson 2020-01-17 17:40 ` William Tu 1 sibling, 1 reply; 49+ messages in thread From: Ryan Goodfellow @ 2020-01-15 1:41 UTC (permalink / raw) To: Magnus Karlsson; +Cc: xdp-newbies On Tue, Jan 14, 2020 at 03:52:50PM -0500, Ryan Goodfellow wrote: > On Tue, Jan 14, 2020 at 10:59:19AM +0100, Magnus Karlsson wrote: > > > > Just sent out a patch on the mailing list. Would be great if you could > > try it out. > > Thanks for the quick turnaround. I gave this patch a go, both in the bpf-next > tree and manually applied to the 5.5.0-rc3 branch I've been working with up to > this point. It does allow for allocating more memory, however packet > forwarding no longer works. I did not see any complaints from dmesg, but here > is an example iperf3 session from a client that worked before. > > ry@xd2:~$ iperf3 -c 10.1.0.2 > Connecting to host 10.1.0.2, port 5201 > [ 5] local 10.1.0.1 port 53304 connected to 10.1.0.2 port 5201 > [ ID] Interval Transfer Bitrate Retr Cwnd > [ 5] 0.00-1.00 sec 5.91 MBytes 49.5 Mbits/sec 2 1.41 KBytes > [ 5] 1.00-2.00 sec 0.00 Bytes 0.00 bits/sec 1 1.41 KBytes > [ 5] 2.00-3.00 sec 0.00 Bytes 0.00 bits/sec 0 1.41 KBytes > [ 5] 3.00-4.00 sec 0.00 Bytes 0.00 bits/sec 1 1.41 KBytes > [ 5] 4.00-5.00 sec 0.00 Bytes 0.00 bits/sec 0 1.41 KBytes > [ 5] 5.00-6.00 sec 0.00 Bytes 0.00 bits/sec 0 1.41 KBytes > [ 5] 6.00-7.00 sec 0.00 Bytes 0.00 bits/sec 1 1.41 KBytes > [ 5] 7.00-8.00 sec 0.00 Bytes 0.00 bits/sec 0 1.41 KBytes > [ 5] 8.00-9.00 sec 0.00 Bytes 0.00 bits/sec 0 1.41 KBytes > ^C[ 5] 10.00-139.77 sec 0.00 Bytes 0.00 bits/sec 4 1.41 KBytes > - - - - - - - - - - - - - - - - - - - - - - - - - > [ ID] Interval Transfer Bitrate Retr > [ 5] 0.00-139.77 sec 5.91 MBytes 355 Kbits/sec 9 sender > [ 5] 0.00-139.77 sec 0.00 Bytes 0.00 bits/sec receiver > iperf3: interrupt - the client has terminated > > I'll continue to investigate and 
report back with anything that I find. Interestingly I found this behavior to exist in the bpf-next tree independent of the patch being present. I also gave the 5.5.0-rc6 branch a try (without the patch) and packets forward OK. -- ~ ry ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: zero-copy between interfaces 2020-01-15 1:41 ` Ryan Goodfellow @ 2020-01-15 7:40 ` Magnus Karlsson 2020-01-15 8:20 ` Magnus Karlsson 0 siblings, 1 reply; 49+ messages in thread From: Magnus Karlsson @ 2020-01-15 7:40 UTC (permalink / raw) To: Ryan Goodfellow; +Cc: xdp-newbies On Wed, Jan 15, 2020 at 2:41 AM Ryan Goodfellow <rgoodfel@isi.edu> wrote: > > On Tue, Jan 14, 2020 at 03:52:50PM -0500, Ryan Goodfellow wrote: > > On Tue, Jan 14, 2020 at 10:59:19AM +0100, Magnus Karlsson wrote: > > > > > > Just sent out a patch on the mailing list. Would be great if you could > > > try it out. > > > > Thanks for the quick turnaround. I gave this patch a go, both in the bpf-next > > tree and manually applied to the 5.5.0-rc3 branch I've been working with up to > > this point. It does allow for allocating more memory, however packet > > forwarding no longer works. I did not see any complaints from dmesg, but here > > is an example iperf3 session from a client that worked before. > > > > ry@xd2:~$ iperf3 -c 10.1.0.2 > > Connecting to host 10.1.0.2, port 5201 > > [ 5] local 10.1.0.1 port 53304 connected to 10.1.0.2 port 5201 > > [ ID] Interval Transfer Bitrate Retr Cwnd > > [ 5] 0.00-1.00 sec 5.91 MBytes 49.5 Mbits/sec 2 1.41 KBytes > > [ 5] 1.00-2.00 sec 0.00 Bytes 0.00 bits/sec 1 1.41 KBytes > > [ 5] 2.00-3.00 sec 0.00 Bytes 0.00 bits/sec 0 1.41 KBytes > > [ 5] 3.00-4.00 sec 0.00 Bytes 0.00 bits/sec 1 1.41 KBytes > > [ 5] 4.00-5.00 sec 0.00 Bytes 0.00 bits/sec 0 1.41 KBytes > > [ 5] 5.00-6.00 sec 0.00 Bytes 0.00 bits/sec 0 1.41 KBytes > > [ 5] 6.00-7.00 sec 0.00 Bytes 0.00 bits/sec 1 1.41 KBytes > > [ 5] 7.00-8.00 sec 0.00 Bytes 0.00 bits/sec 0 1.41 KBytes > > [ 5] 8.00-9.00 sec 0.00 Bytes 0.00 bits/sec 0 1.41 KBytes > > ^C[ 5] 10.00-139.77 sec 0.00 Bytes 0.00 bits/sec 4 1.41 KBytes > > - - - - - - - - - - - - - - - - - - - - - - - - - > > [ ID] Interval Transfer Bitrate Retr > > [ 5] 0.00-139.77 sec 5.91 MBytes 355 Kbits/sec 9 sender > > [ 5] 0.00-139.77 sec 0.00 Bytes 
0.00 bits/sec receiver > > iperf3: interrupt - the client has terminated > > > > I'll continue to investigate and report back with anything that I find. > > Interestingly I found this behavior to exist in the bpf-next tree independent > of the patch being present. Ryan, Could you please do a bisect on it? In the 12 commits after the merge commit below there are a number of sensitive rewrites of the ring access functions. Maybe one of them breaks your code. When you say "packet forwarding no longer works", do you mean it works for a second or so, then no packets come through? What HW are you using? commit ce3cec27933c069d2015a81e59b93eb656fe7ee4 Merge: 99cacdc 1d9cb1f Author: Alexei Starovoitov <ast@kernel.org> Date: Fri Dec 20 16:00:10 2019 -0800 Merge branch 'xsk-cleanup' Magnus Karlsson says: ==================== This patch set cleans up the ring access functions of AF_XDP in hope that it will now be easier to understand and maintain. I used to get a headache every time I looked at this code in order to really understand it, but now I do think it is a lot less painful. <snip> /Magnus > I also gave the 5.5.0-rc6 branch a try (without the patch) and packets forward > OK. > > -- > ~ ry ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: zero-copy between interfaces 2020-01-15 7:40 ` Magnus Karlsson @ 2020-01-15 8:20 ` Magnus Karlsson 2020-01-16 2:04 ` Ryan Goodfellow 0 siblings, 1 reply; 49+ messages in thread From: Magnus Karlsson @ 2020-01-15 8:20 UTC (permalink / raw) To: Ryan Goodfellow; +Cc: xdp-newbies On Wed, Jan 15, 2020 at 8:40 AM Magnus Karlsson <magnus.karlsson@gmail.com> wrote: > > On Wed, Jan 15, 2020 at 2:41 AM Ryan Goodfellow <rgoodfel@isi.edu> wrote: > > > > On Tue, Jan 14, 2020 at 03:52:50PM -0500, Ryan Goodfellow wrote: > > > On Tue, Jan 14, 2020 at 10:59:19AM +0100, Magnus Karlsson wrote: > > > > > > > > Just sent out a patch on the mailing list. Would be great if you could > > > > try it out. > > > > > > Thanks for the quick turnaround. I gave this patch a go, both in the bpf-next > > > tree and manually applied to the 5.5.0-rc3 branch I've been working with up to > > > this point. It does allow for allocating more memory, however packet > > > forwarding no longer works. I did not see any complaints from dmesg, but here > > > is an example iperf3 session from a client that worked before. 
> > > > > > ry@xd2:~$ iperf3 -c 10.1.0.2 > > > Connecting to host 10.1.0.2, port 5201 > > > [ 5] local 10.1.0.1 port 53304 connected to 10.1.0.2 port 5201 > > > [ ID] Interval Transfer Bitrate Retr Cwnd > > > [ 5] 0.00-1.00 sec 5.91 MBytes 49.5 Mbits/sec 2 1.41 KBytes > > > [ 5] 1.00-2.00 sec 0.00 Bytes 0.00 bits/sec 1 1.41 KBytes > > > [ 5] 2.00-3.00 sec 0.00 Bytes 0.00 bits/sec 0 1.41 KBytes > > > [ 5] 3.00-4.00 sec 0.00 Bytes 0.00 bits/sec 1 1.41 KBytes > > > [ 5] 4.00-5.00 sec 0.00 Bytes 0.00 bits/sec 0 1.41 KBytes > > > [ 5] 5.00-6.00 sec 0.00 Bytes 0.00 bits/sec 0 1.41 KBytes > > > [ 5] 6.00-7.00 sec 0.00 Bytes 0.00 bits/sec 1 1.41 KBytes > > > [ 5] 7.00-8.00 sec 0.00 Bytes 0.00 bits/sec 0 1.41 KBytes > > > [ 5] 8.00-9.00 sec 0.00 Bytes 0.00 bits/sec 0 1.41 KBytes > > > ^C[ 5] 10.00-139.77 sec 0.00 Bytes 0.00 bits/sec 4 1.41 KBytes > > > - - - - - - - - - - - - - - - - - - - - - - - - - > > > [ ID] Interval Transfer Bitrate Retr > > > [ 5] 0.00-139.77 sec 5.91 MBytes 355 Kbits/sec 9 sender > > > [ 5] 0.00-139.77 sec 0.00 Bytes 0.00 bits/sec receiver > > > iperf3: interrupt - the client has terminated > > > > > > I'll continue to investigate and report back with anything that I find. > > > > Interestingly I found this behavior to exist in the bpf-next tree independent > > of the patch being present. > > Ryan, > > Could you please do a bisect on it? In the 12 commits after the merge > commit below there are number of sensitive rewrites of the ring access > functions. Maybe one of them breaks your code. When you say "packet > forwarding no longer works", do you mean it works for a second or so, > then no packets come through? What HW are you using? 
> > commit ce3cec27933c069d2015a81e59b93eb656fe7ee4 > Merge: 99cacdc 1d9cb1f > Author: Alexei Starovoitov <ast@kernel.org> > Date: Fri Dec 20 16:00:10 2019 -0800 > > Merge branch 'xsk-cleanup' > > Magnus Karlsson says: > > ==================== > This patch set cleans up the ring access functions of AF_XDP in hope > that it will now be easier to understand and maintain. I used to get a > headache every time I looked at this code in order to really understand it, > but now I do think it is a lot less painful. > <snip> > > /Magnus I see that you have debug messages in your application. Could you please run with those on and send me the output so I can see where it stops. A bisect that pin-points what commit that breaks your program plus the debug output should hopefully send us on the right path for a fix. Thanks: Magnus > > I also gave the 5.5.0-rc6 branch a try (without the patch) and packets forward > > OK. > > > > -- > > ~ ry ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: zero-copy between interfaces 2020-01-15 8:20 ` Magnus Karlsson @ 2020-01-16 2:04 ` Ryan Goodfellow 2020-01-16 14:32 ` Magnus Karlsson 2020-01-21 7:34 ` Magnus Karlsson 0 siblings, 2 replies; 49+ messages in thread From: Ryan Goodfellow @ 2020-01-16 2:04 UTC (permalink / raw) To: Magnus Karlsson; +Cc: xdp-newbies On Wed, Jan 15, 2020 at 09:20:30AM +0100, Magnus Karlsson wrote: > On Wed, Jan 15, 2020 at 8:40 AM Magnus Karlsson > <magnus.karlsson@gmail.com> wrote: > > > > On Wed, Jan 15, 2020 at 2:41 AM Ryan Goodfellow <rgoodfel@isi.edu> wrote: > > > > > > On Tue, Jan 14, 2020 at 03:52:50PM -0500, Ryan Goodfellow wrote: > > > > On Tue, Jan 14, 2020 at 10:59:19AM +0100, Magnus Karlsson wrote: > > > > > > > > > > Just sent out a patch on the mailing list. Would be great if you could > > > > > try it out. > > > > > > > > Thanks for the quick turnaround. I gave this patch a go, both in the bpf-next > > > > tree and manually applied to the 5.5.0-rc3 branch I've been working with up to > > > > this point. It does allow for allocating more memory, however packet > > > > forwarding no longer works. I did not see any complaints from dmesg, but here > > > > is an example iperf3 session from a client that worked before. 
> > > > > > > > ry@xd2:~$ iperf3 -c 10.1.0.2 > > > > Connecting to host 10.1.0.2, port 5201 > > > > [ 5] local 10.1.0.1 port 53304 connected to 10.1.0.2 port 5201 > > > > [ ID] Interval Transfer Bitrate Retr Cwnd > > > > [ 5] 0.00-1.00 sec 5.91 MBytes 49.5 Mbits/sec 2 1.41 KBytes > > > > [ 5] 1.00-2.00 sec 0.00 Bytes 0.00 bits/sec 1 1.41 KBytes > > > > [ 5] 2.00-3.00 sec 0.00 Bytes 0.00 bits/sec 0 1.41 KBytes > > > > [ 5] 3.00-4.00 sec 0.00 Bytes 0.00 bits/sec 1 1.41 KBytes > > > > [ 5] 4.00-5.00 sec 0.00 Bytes 0.00 bits/sec 0 1.41 KBytes > > > > [ 5] 5.00-6.00 sec 0.00 Bytes 0.00 bits/sec 0 1.41 KBytes > > > > [ 5] 6.00-7.00 sec 0.00 Bytes 0.00 bits/sec 1 1.41 KBytes > > > > [ 5] 7.00-8.00 sec 0.00 Bytes 0.00 bits/sec 0 1.41 KBytes > > > > [ 5] 8.00-9.00 sec 0.00 Bytes 0.00 bits/sec 0 1.41 KBytes > > > > ^C[ 5] 10.00-139.77 sec 0.00 Bytes 0.00 bits/sec 4 1.41 KBytes > > > > - - - - - - - - - - - - - - - - - - - - - - - - - > > > > [ ID] Interval Transfer Bitrate Retr > > > > [ 5] 0.00-139.77 sec 5.91 MBytes 355 Kbits/sec 9 sender > > > > [ 5] 0.00-139.77 sec 0.00 Bytes 0.00 bits/sec receiver > > > > iperf3: interrupt - the client has terminated > > > > > > > > I'll continue to investigate and report back with anything that I find. > > > > > > Interestingly I found this behavior to exist in the bpf-next tree independent > > > of the patch being present. > > > > Ryan, > > > > Could you please do a bisect on it? In the 12 commits after the merge > > commit below there are number of sensitive rewrites of the ring access > > functions. Maybe one of them breaks your code. When you say "packet > > forwarding no longer works", do you mean it works for a second or so, > > then no packets come through? What HW are you using? 
> > > > commit ce3cec27933c069d2015a81e59b93eb656fe7ee4 > > Merge: 99cacdc 1d9cb1f > > Author: Alexei Starovoitov <ast@kernel.org> > > Date: Fri Dec 20 16:00:10 2019 -0800 > > > > Merge branch 'xsk-cleanup' > > > > Magnus Karlsson says: > > > > ==================== > > This patch set cleans up the ring access functions of AF_XDP in hope > > that it will now be easier to understand and maintain. I used to get a > > headache every time I looked at this code in order to really understand it, > > but now I do think it is a lot less painful. > > <snip> > > > > /Magnus > > I see that you have debug messages in your application. Could you > please run with those on and send me the output so I can see where it > stops. A bisect that pin-points what commit that breaks your program > plus the debug output should hopefully send us on the right path for a > fix. > > Thanks: Magnus > Hi Magnus, I did a bisect starting from the head of the bpf-next tree (990bca1) down to the first commit before the patch series you identified (df034c9). The result was identifying df0ae6f as the commit that causes the issue I am seeing. I've posted output from the program in debugging mode here - https://gitlab.com/mergetb/tech/network-emulation/kernel/snippets/1930375 Yes, you are correct in that forwarding works for a brief period and then stops. I've noticed that the number of packets that are forwarded is equal to the size of the producer/consumer descriptor rings. I've posted two ping traces from a client ping that shows this. - https://gitlab.com/mergetb/tech/network-emulation/kernel/snippets/1930376 - https://gitlab.com/mergetb/tech/network-emulation/kernel/snippets/1930377 I've also noticed that when the forwarding stops, the CPU usage for the proc running the program is pegged, which is not the norm for this program as it uses a poll call with a timeout on the xsk fd. The hardware I am using is a Mellanox ConnectX4 2x100G card (MCX416A-CCAT) running the mlx5 driver. 
The program is running in zero copy mode. I also tested this code out in a virtual machine with virtio NICs in SKB mode which uses xdpgeneric - there were no issues in that setting. -- ~ ry ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: zero-copy between interfaces 2020-01-16 2:04 ` Ryan Goodfellow @ 2020-01-16 14:32 ` Magnus Karlsson 2020-01-17 9:45 ` Magnus Karlsson 2020-01-21 7:34 ` Magnus Karlsson 1 sibling, 1 reply; 49+ messages in thread From: Magnus Karlsson @ 2020-01-16 14:32 UTC (permalink / raw) To: Ryan Goodfellow; +Cc: xdp-newbies On Thu, Jan 16, 2020 at 3:04 AM Ryan Goodfellow <rgoodfel@isi.edu> wrote: > > On Wed, Jan 15, 2020 at 09:20:30AM +0100, Magnus Karlsson wrote: > > On Wed, Jan 15, 2020 at 8:40 AM Magnus Karlsson > > <magnus.karlsson@gmail.com> wrote: > > > > > > On Wed, Jan 15, 2020 at 2:41 AM Ryan Goodfellow <rgoodfel@isi.edu> wrote: > > > > > > > > On Tue, Jan 14, 2020 at 03:52:50PM -0500, Ryan Goodfellow wrote: > > > > > On Tue, Jan 14, 2020 at 10:59:19AM +0100, Magnus Karlsson wrote: > > > > > > > > > > > > Just sent out a patch on the mailing list. Would be great if you could > > > > > > try it out. > > > > > > > > > > Thanks for the quick turnaround. I gave this patch a go, both in the bpf-next > > > > > tree and manually applied to the 5.5.0-rc3 branch I've been working with up to > > > > > this point. It does allow for allocating more memory, however packet > > > > > forwarding no longer works. I did not see any complaints from dmesg, but here > > > > > is an example iperf3 session from a client that worked before. 
> > > > > > > > > > ry@xd2:~$ iperf3 -c 10.1.0.2 > > > > > Connecting to host 10.1.0.2, port 5201 > > > > > [ 5] local 10.1.0.1 port 53304 connected to 10.1.0.2 port 5201 > > > > > [ ID] Interval Transfer Bitrate Retr Cwnd > > > > > [ 5] 0.00-1.00 sec 5.91 MBytes 49.5 Mbits/sec 2 1.41 KBytes > > > > > [ 5] 1.00-2.00 sec 0.00 Bytes 0.00 bits/sec 1 1.41 KBytes > > > > > [ 5] 2.00-3.00 sec 0.00 Bytes 0.00 bits/sec 0 1.41 KBytes > > > > > [ 5] 3.00-4.00 sec 0.00 Bytes 0.00 bits/sec 1 1.41 KBytes > > > > > [ 5] 4.00-5.00 sec 0.00 Bytes 0.00 bits/sec 0 1.41 KBytes > > > > > [ 5] 5.00-6.00 sec 0.00 Bytes 0.00 bits/sec 0 1.41 KBytes > > > > > [ 5] 6.00-7.00 sec 0.00 Bytes 0.00 bits/sec 1 1.41 KBytes > > > > > [ 5] 7.00-8.00 sec 0.00 Bytes 0.00 bits/sec 0 1.41 KBytes > > > > > [ 5] 8.00-9.00 sec 0.00 Bytes 0.00 bits/sec 0 1.41 KBytes > > > > > ^C[ 5] 10.00-139.77 sec 0.00 Bytes 0.00 bits/sec 4 1.41 KBytes > > > > > - - - - - - - - - - - - - - - - - - - - - - - - - > > > > > [ ID] Interval Transfer Bitrate Retr > > > > > [ 5] 0.00-139.77 sec 5.91 MBytes 355 Kbits/sec 9 sender > > > > > [ 5] 0.00-139.77 sec 0.00 Bytes 0.00 bits/sec receiver > > > > > iperf3: interrupt - the client has terminated > > > > > > > > > > I'll continue to investigate and report back with anything that I find. > > > > > > > > Interestingly I found this behavior to exist in the bpf-next tree independent > > > > of the patch being present. > > > > > > Ryan, > > > > > > Could you please do a bisect on it? In the 12 commits after the merge > > > commit below there are number of sensitive rewrites of the ring access > > > functions. Maybe one of them breaks your code. When you say "packet > > > forwarding no longer works", do you mean it works for a second or so, > > > then no packets come through? What HW are you using? 
> > > > > > commit ce3cec27933c069d2015a81e59b93eb656fe7ee4 > > > Merge: 99cacdc 1d9cb1f > > > Author: Alexei Starovoitov <ast@kernel.org> > > > Date: Fri Dec 20 16:00:10 2019 -0800 > > > > > > Merge branch 'xsk-cleanup' > > > > > > Magnus Karlsson says: > > > > > > ==================== > > > This patch set cleans up the ring access functions of AF_XDP in hope > > > that it will now be easier to understand and maintain. I used to get a > > > headache every time I looked at this code in order to really understand it, > > > but now I do think it is a lot less painful. > > > <snip> > > > > > > /Magnus > > > > I see that you have debug messages in your application. Could you > > please run with those on and send me the output so I can see where it > > stops. A bisect that pin-points what commit that breaks your program > > plus the debug output should hopefully send us on the right path for a > > fix. > > > > Thanks: Magnus > > > > Hi Magnus, > > I did a bisect starting from the head of the bpf-next tree (990bca1) down to > the first commit before the patch series you identified (df034c9). The result > was identifying df0ae6f as the commit that causes the issue I am seeing. > > I've posted output from the program in debugging mode here > > - https://gitlab.com/mergetb/tech/network-emulation/kernel/snippets/1930375 Perfect. Thanks. > Yes, you are correct in that forwarding works for a brief period and then stops. > I've noticed that the number of packets that are forwarded is equal to the size > of the producer/consumer descriptor rings. I've posted two ping traces from a > client ping that shows this. > > - https://gitlab.com/mergetb/tech/network-emulation/kernel/snippets/1930376 > - https://gitlab.com/mergetb/tech/network-emulation/kernel/snippets/1930377 > > I've also noticed that when the forwarding stops, the CPU usage for the proc > running the program is pegged, which is not the norm for this program as it uses > a poll call with a timeout on the xsk fd. 
I will replicate your setup and try to reproduce it. Only have one port connected to my load generator now, but when I get into the office, I will connect two ports. In what loop does the execution get stuck when it hangs at 100% load? /Magnus > The hardware I am using is a Mellanox ConnectX4 2x100G card (MCX416A-CCAT) > running the mlx5 driver. The program is running in zero copy mode. I also tested > this code out in a virtual machine with virtio NICs in SKB mode which uses > xdpgeneric - there were no issues in that setting. > > -- > ~ ry ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: zero-copy between interfaces 2020-01-16 14:32 ` Magnus Karlsson @ 2020-01-17 9:45 ` Magnus Karlsson 2020-01-17 17:05 ` Ryan Goodfellow 0 siblings, 1 reply; 49+ messages in thread From: Magnus Karlsson @ 2020-01-17 9:45 UTC (permalink / raw) To: Ryan Goodfellow; +Cc: xdp-newbies On Thu, Jan 16, 2020 at 3:32 PM Magnus Karlsson <magnus.karlsson@gmail.com> wrote: > > On Thu, Jan 16, 2020 at 3:04 AM Ryan Goodfellow <rgoodfel@isi.edu> wrote: > > > > On Wed, Jan 15, 2020 at 09:20:30AM +0100, Magnus Karlsson wrote: > > > On Wed, Jan 15, 2020 at 8:40 AM Magnus Karlsson > > > <magnus.karlsson@gmail.com> wrote: > > > > > > > > On Wed, Jan 15, 2020 at 2:41 AM Ryan Goodfellow <rgoodfel@isi.edu> wrote: > > > > > > > > > > On Tue, Jan 14, 2020 at 03:52:50PM -0500, Ryan Goodfellow wrote: > > > > > > On Tue, Jan 14, 2020 at 10:59:19AM +0100, Magnus Karlsson wrote: > > > > > > > > > > > > > > Just sent out a patch on the mailing list. Would be great if you could > > > > > > > try it out. > > > > > > > > > > > > Thanks for the quick turnaround. I gave this patch a go, both in the bpf-next > > > > > > tree and manually applied to the 5.5.0-rc3 branch I've been working with up to > > > > > > this point. It does allow for allocating more memory, however packet > > > > > > forwarding no longer works. I did not see any complaints from dmesg, but here > > > > > > is an example iperf3 session from a client that worked before. 
> > > > > > > > > > > > ry@xd2:~$ iperf3 -c 10.1.0.2 > > > > > > Connecting to host 10.1.0.2, port 5201 > > > > > > [ 5] local 10.1.0.1 port 53304 connected to 10.1.0.2 port 5201 > > > > > > [ ID] Interval Transfer Bitrate Retr Cwnd > > > > > > [ 5] 0.00-1.00 sec 5.91 MBytes 49.5 Mbits/sec 2 1.41 KBytes > > > > > > [ 5] 1.00-2.00 sec 0.00 Bytes 0.00 bits/sec 1 1.41 KBytes > > > > > > [ 5] 2.00-3.00 sec 0.00 Bytes 0.00 bits/sec 0 1.41 KBytes > > > > > > [ 5] 3.00-4.00 sec 0.00 Bytes 0.00 bits/sec 1 1.41 KBytes > > > > > > [ 5] 4.00-5.00 sec 0.00 Bytes 0.00 bits/sec 0 1.41 KBytes > > > > > > [ 5] 5.00-6.00 sec 0.00 Bytes 0.00 bits/sec 0 1.41 KBytes > > > > > > [ 5] 6.00-7.00 sec 0.00 Bytes 0.00 bits/sec 1 1.41 KBytes > > > > > > [ 5] 7.00-8.00 sec 0.00 Bytes 0.00 bits/sec 0 1.41 KBytes > > > > > > [ 5] 8.00-9.00 sec 0.00 Bytes 0.00 bits/sec 0 1.41 KBytes > > > > > > ^C[ 5] 10.00-139.77 sec 0.00 Bytes 0.00 bits/sec 4 1.41 KBytes > > > > > > - - - - - - - - - - - - - - - - - - - - - - - - - > > > > > > [ ID] Interval Transfer Bitrate Retr > > > > > > [ 5] 0.00-139.77 sec 5.91 MBytes 355 Kbits/sec 9 sender > > > > > > [ 5] 0.00-139.77 sec 0.00 Bytes 0.00 bits/sec receiver > > > > > > iperf3: interrupt - the client has terminated > > > > > > > > > > > > I'll continue to investigate and report back with anything that I find. > > > > > > > > > > Interestingly I found this behavior to exist in the bpf-next tree independent > > > > > of the patch being present. > > > > > > > > Ryan, > > > > > > > > Could you please do a bisect on it? In the 12 commits after the merge > > > > commit below there are number of sensitive rewrites of the ring access > > > > functions. Maybe one of them breaks your code. When you say "packet > > > > forwarding no longer works", do you mean it works for a second or so, > > > > then no packets come through? What HW are you using? 
> > > > > > > > commit ce3cec27933c069d2015a81e59b93eb656fe7ee4 > > > > Merge: 99cacdc 1d9cb1f > > > > Author: Alexei Starovoitov <ast@kernel.org> > > > > Date: Fri Dec 20 16:00:10 2019 -0800 > > > > > > > > Merge branch 'xsk-cleanup' > > > > > > > > Magnus Karlsson says: > > > > > > > > ==================== > > > > This patch set cleans up the ring access functions of AF_XDP in hope > > > > that it will now be easier to understand and maintain. I used to get a > > > > headache every time I looked at this code in order to really understand it, > > > > but now I do think it is a lot less painful. > > > > <snip> > > > > > > > > /Magnus > > > > > > I see that you have debug messages in your application. Could you > > > please run with those on and send me the output so I can see where it > > > stops. A bisect that pin-points what commit that breaks your program > > > plus the debug output should hopefully send us on the right path for a > > > fix. > > > > > > Thanks: Magnus > > > > > > > Hi Magnus, > > > > I did a bisect starting from the head of the bpf-next tree (990bca1) down to > > the first commit before the patch series you identified (df034c9). The result > > was identifying df0ae6f as the commit that causes the issue I am seeing. > > > > I've posted output from the program in debugging mode here > > > > - https://gitlab.com/mergetb/tech/network-emulation/kernel/snippets/1930375 > > Perfect. Thanks. > > > Yes, you are correct in that forwarding works for a brief period and then stops. > > I've noticed that the number of packets that are forwarded is equal to the size > > of the producer/consumer descriptor rings. I've posted two ping traces from a > > client ping that shows this. 
> > > > - https://gitlab.com/mergetb/tech/network-emulation/kernel/snippets/1930376 > > - https://gitlab.com/mergetb/tech/network-emulation/kernel/snippets/1930377 > > > > I've also noticed that when the forwarding stops, the CPU usage for the proc > > running the program is pegged, which is not the norm for this program as it uses > > a poll call with a timeout on the xsk fd. > > I will replicate your setup and try to reproduce it. Only have one > port connected to my load generator now, but when I get into the > office, I will connect two ports. I have now run your application, but unfortunately I cannot recreate your problem. It works and runs for several minutes until I get bored and terminate it. Note that I use an i40e card that you get a crash with. So two problems I cannot reproduce, sigh. Here is my system info. Can you please dump yours? Please do the ethtool dump on your i40e card. mkarlsso@kurt:~/src/dna-linux$ sudo ethtool -i ens803f0 [sudo] password for mkarlsso: driver: i40e version: 2.8.20-k firmware-version: 5.05 0x800028a6 1.1568.0 expansion-rom-version: bus-info: 0000:86:00.0 supports-statistics: yes supports-test: yes supports-eeprom-access: yes supports-register-dump: yes supports-priv-flags: yes mkarlsso@kurt:~/src/dna-linux$ uname -a Linux kurt 5.5.0-rc4+ #72 SMP PREEMPT Thu Jan 16 10:03:20 CET 2020 x86_64 x86_64 x86_64 GNU/Linux mkarlsso@kurt:~/src/dna-linux$ git log -1 commit b65053cd94f46619b4aae746b98f2d8d9274540e (HEAD, bpf-next/master) Author: Andrii Nakryiko <andriin@fb.com> Date: Wed Jan 15 16:55:49 2020 -0800 selftests/bpf: Add whitelist/blacklist of test names to test_progs gcc version 9.2.1 20191008 (Ubuntu 9.2.1-9ubuntu2) I also noted that you use MAX_SOCKS in your XDP program. The size of the xsks_map is not dependent on the number of sockets in your case. It is dependent on the queue id you use. So I would introduce a MAX_QUEUE_ID and set it to e.g. 128 and use that instead. MAX_SOCKS is 4, so quite restrictive.
/Magnus > In what loop does the execution get stuck when it hangs at 100% load? > > /Magnus > > > The hardware I am using is a Mellanox ConnectX4 2x100G card (MCX416A-CCAT) > > running the mlx5 driver. The program is running in zero copy mode. I also tested > > this code out in a virtual machine with virtio NICs in SKB mode which uses > > xdpgeneric - there were no issues in that setting. > > > > -- > > ~ ry ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: zero-copy between interfaces 2020-01-17 9:45 ` Magnus Karlsson @ 2020-01-17 17:05 ` Ryan Goodfellow 0 siblings, 0 replies; 49+ messages in thread From: Ryan Goodfellow @ 2020-01-17 17:05 UTC (permalink / raw) To: Magnus Karlsson; +Cc: xdp-newbies On Fri, Jan 17, 2020 at 10:45:33AM +0100, Magnus Karlsson wrote: > If have now run your application, but unfortunately I cannot recreate > your problem. It works and runs for several minutes until I get bored > and terminate it. Note that I use an i40e card that you get a crash > with. So two problems I cannot reproduce, sigh. Here is my system > info. Can you please dump yours? Please do the ethtool dump on your > i40e card. > > mkarlsso@kurt:~/src/dna-linux$ sudo ethtool -i ens803f0 > [sudo] password for mkarlsso: > driver: i40e > version: 2.8.20-k > firmware-version: 5.05 0x800028a6 1.1568.0 > expansion-rom-version: > bus-info: 0000:86:00.0 > supports-statistics: yes > supports-test: yes > supports-eeprom-access: yes > supports-register-dump: yes > supports-priv-flags: yes > > mkarlsso@kurt:~/src/dna-linux$ uname -a > Linux kurt 5.5.0-rc4+ #72 SMP PREEMPT Thu Jan 16 10:03:20 CET 2020 > x86_64 x86_64 x86_64 GNU/Linux > > mkarlsso@kurt:~/src/dna-linux$ git log -1 > commit b65053cd94f46619b4aae746b98f2d8d9274540e (HEAD, bpf-next/master) > Author: Andrii Nakryiko <andriin@fb.com> > Date: Wed Jan 15 16:55:49 2020 -0800 > > selftests/bpf: Add whitelist/blacklist of test names to test_progs > > gcc version 9.2.1 20191008 (Ubuntu 9.2.1-9ubuntu2) > > I also noted that you use MAX_SOCKS in your XDP program. The size of > the xsks_map is not dependent on the number of sockets in your case. > It is dependent on the queue id you use. So I would introduce a > MAX_QUEUE_ID and set it to e.g. 128 and use that instead. MAX_SOCKS is > 4, so quite restrictive. > > /Magnus So I plugged in my X710-DA4 card, and this one actually works fine. Here is the requested output for that card. 
ry@turbine:~$ sudo ethtool -i enp101s0f0
driver: i40e
version: 2.8.20-k
firmware-version: 7.10 0x80006471 1.2527.0
expansion-rom-version:
bus-info: 0000:65:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: yes

ry@turbine:~$ sudo lspci -vvv | grep 710
65:00.0 Ethernet controller: Intel Corporation Ethernet Controller X710 for 10GbE SFP+ (rev 01)
        Subsystem: Intel Corporation Ethernet Converged Network Adapter X710-4
        Product Name: XL710 40GbE Controller
65:00.1 Ethernet controller: Intel Corporation Ethernet Controller X710 for 10GbE SFP+ (rev 01)
        Subsystem: Intel Corporation Ethernet Converged Network Adapter X710
        Product Name: XL710 40GbE Controller
65:00.2 Ethernet controller: Intel Corporation Ethernet Controller X710 for 10GbE SFP+ (rev 01)
        Subsystem: Intel Corporation Ethernet Converged Network Adapter X710
        Product Name: XL710 40GbE Controller
65:00.3 Ethernet controller: Intel Corporation Ethernet Controller X710 for 10GbE SFP+ (rev 01)
        Subsystem: Intel Corporation Ethernet Converged Network Adapter X710
        Product Name: XL710 40GbE Controller

Here is the output for the X722 card. As of a few weeks ago, firmware
version 3.33 was the latest I could find.
ry@turbine:~$ sudo ethtool -i eno7
driver: i40e
version: 2.8.20-k
firmware-version: 3.33 0x80001006 1.1747.0
expansion-rom-version:
bus-info: 0000:b7:00.2
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: yes

ry@turbine:~$ sudo lspci -vvv | grep 722
b7:00.0 Ethernet controller: Intel Corporation Ethernet Connection X722 for 10GBASE-T (rev 04)
        DeviceName: Intel LAN X722 #1
        Subsystem: Super Micro Computer Inc Ethernet Connection X722 for 10GBASE-T
b7:00.1 Ethernet controller: Intel Corporation Ethernet Connection X722 for 10GBASE-T (rev 04)
        DeviceName: Intel LAN X722 #2
        Subsystem: Super Micro Computer Inc Ethernet Connection X722 for 10GBASE-T
b7:00.2 Ethernet controller: Intel Corporation Ethernet Connection X722 for 10GbE SFP+ (rev 04)
        DeviceName: Intel LAN X722 #3
        Subsystem: Super Micro Computer Inc Ethernet Connection X722 for 10GbE SFP+
b7:00.3 Ethernet controller: Intel Corporation Ethernet Connection X722 for 10GbE SFP+ (rev 04)
        DeviceName: Intel LAN X722 #4
        Subsystem: Super Micro Computer Inc Ethernet Connection X722 for 10GbE SFP+

I verified that the driver still crashes with the current kernel/program
I am running, which works on the X710-DA4.

Other output as requested:

ry@turbine:~$ uname -a
Linux turbine 5.5.0-rc4-moa+ #16 SMP Fri Jan 17 10:52:42 EST 2020 x86_64 GNU/Linux

ry@turbine:~/kmoa/bpf-next$ git log -2
commit 60d71397d27e7859fdaaaaab6594e4d977ae46e2 (HEAD -> master)
Author: Ryan Goodfellow <rgoodfel@isi.edu>
Date:   Wed Jan 15 16:54:39 2020 -0500

    add xdpsock_multidev sample program

    This is a simple program that uses AF_XDP sockets to forward packets
    between two interfaces using a common memory region and no copying
    of packets.
    Signed-off-by: Ryan Goodfellow <rgoodfel@isi.edu>

commit 9173cac3b64e6785dd604f5075e6035b045a0026 (origin/master, origin/HEAD)
Author: Andrii Nakryiko <andriin@fb.com>
Date:   Wed Jan 15 11:08:56 2020 -0800

    libbpf: Support .text sub-calls relocations

    The LLVM patch https://reviews.llvm.org/D72197 makes LLVM emit function
    call relocations within the same section. This includes a default .text
    section, which contains any BPF sub-programs. This wasn't the case
    before and so libbpf was able to get a way with slightly simpler
    handling of subprogram call relocations.

    This patch adds support for .text section relocations. It needs
    to ensure correct order of relocations, so does two passes:
    - first, relocate .text instructions, if there are any relocations in it;
    - then process all the other programs and copy over patched .text
    instructions for all sub-program calls.

    v1->v2:
    - break early once .text program is processed.

    Signed-off-by: Andrii Nakryiko <andriin@fb.com>
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    Acked-by: Yonghong Song <yhs@fb.com>
    Cc: Alexei Starovoitov <ast@kernel.org>
    Link: https://lore.kernel.org/bpf/20200115190856.2391325-1-andriin@fb.com

Here is the info for the Mellanox card that is not working after the
df0ae6f commit.

ry@turbine:~$ sudo ethtool -i enp23s0f0
driver: mlx5_core
version: 5.0-0
firmware-version: 12.23.1020 (MT_2150110033)
expansion-rom-version:
bus-info: 0000:17:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: no
supports-register-dump: no
supports-priv-flags: yes

ry@turbine:~$ sudo lspci -vvv | grep Mellanox
17:00.0 Ethernet controller: Mellanox Technologies MT27700 Family [ConnectX-4]
        Subsystem: Mellanox Technologies ConnectX-4 Stand-up dual-port 100GbE MCX416A-CCAT
17:00.1 Ethernet controller: Mellanox Technologies MT27700 Family [ConnectX-4]
        Subsystem: Mellanox Technologies ConnectX-4 Stand-up dual-port 100GbE MCX416A-CCAT

--
~ ry

^ permalink raw reply	[flat|nested] 49+ messages in thread
* Re: zero-copy between interfaces 2020-01-16 2:04 ` Ryan Goodfellow 2020-01-16 14:32 ` Magnus Karlsson @ 2020-01-21 7:34 ` Magnus Karlsson 2020-01-21 13:40 ` Maxim Mikityanskiy 1 sibling, 1 reply; 49+ messages in thread From: Magnus Karlsson @ 2020-01-21 7:34 UTC (permalink / raw) To: Ryan Goodfellow, Maxim Mikityanskiy; +Cc: xdp-newbies On Thu, Jan 16, 2020 at 3:04 AM Ryan Goodfellow <rgoodfel@isi.edu> wrote: > > On Wed, Jan 15, 2020 at 09:20:30AM +0100, Magnus Karlsson wrote: > > On Wed, Jan 15, 2020 at 8:40 AM Magnus Karlsson > > <magnus.karlsson@gmail.com> wrote: > > > > > > On Wed, Jan 15, 2020 at 2:41 AM Ryan Goodfellow <rgoodfel@isi.edu> wrote: > > > > > > > > On Tue, Jan 14, 2020 at 03:52:50PM -0500, Ryan Goodfellow wrote: > > > > > On Tue, Jan 14, 2020 at 10:59:19AM +0100, Magnus Karlsson wrote: > > > > > > > > > > > > Just sent out a patch on the mailing list. Would be great if you could > > > > > > try it out. > > > > > > > > > > Thanks for the quick turnaround. I gave this patch a go, both in the bpf-next > > > > > tree and manually applied to the 5.5.0-rc3 branch I've been working with up to > > > > > this point. It does allow for allocating more memory, however packet > > > > > forwarding no longer works. I did not see any complaints from dmesg, but here > > > > > is an example iperf3 session from a client that worked before. 
> > > > > > > > > > ry@xd2:~$ iperf3 -c 10.1.0.2 > > > > > Connecting to host 10.1.0.2, port 5201 > > > > > [ 5] local 10.1.0.1 port 53304 connected to 10.1.0.2 port 5201 > > > > > [ ID] Interval Transfer Bitrate Retr Cwnd > > > > > [ 5] 0.00-1.00 sec 5.91 MBytes 49.5 Mbits/sec 2 1.41 KBytes > > > > > [ 5] 1.00-2.00 sec 0.00 Bytes 0.00 bits/sec 1 1.41 KBytes > > > > > [ 5] 2.00-3.00 sec 0.00 Bytes 0.00 bits/sec 0 1.41 KBytes > > > > > [ 5] 3.00-4.00 sec 0.00 Bytes 0.00 bits/sec 1 1.41 KBytes > > > > > [ 5] 4.00-5.00 sec 0.00 Bytes 0.00 bits/sec 0 1.41 KBytes > > > > > [ 5] 5.00-6.00 sec 0.00 Bytes 0.00 bits/sec 0 1.41 KBytes > > > > > [ 5] 6.00-7.00 sec 0.00 Bytes 0.00 bits/sec 1 1.41 KBytes > > > > > [ 5] 7.00-8.00 sec 0.00 Bytes 0.00 bits/sec 0 1.41 KBytes > > > > > [ 5] 8.00-9.00 sec 0.00 Bytes 0.00 bits/sec 0 1.41 KBytes > > > > > ^C[ 5] 10.00-139.77 sec 0.00 Bytes 0.00 bits/sec 4 1.41 KBytes > > > > > - - - - - - - - - - - - - - - - - - - - - - - - - > > > > > [ ID] Interval Transfer Bitrate Retr > > > > > [ 5] 0.00-139.77 sec 5.91 MBytes 355 Kbits/sec 9 sender > > > > > [ 5] 0.00-139.77 sec 0.00 Bytes 0.00 bits/sec receiver > > > > > iperf3: interrupt - the client has terminated > > > > > > > > > > I'll continue to investigate and report back with anything that I find. > > > > > > > > Interestingly I found this behavior to exist in the bpf-next tree independent > > > > of the patch being present. > > > > > > Ryan, > > > > > > Could you please do a bisect on it? In the 12 commits after the merge > > > commit below there are number of sensitive rewrites of the ring access > > > functions. Maybe one of them breaks your code. When you say "packet > > > forwarding no longer works", do you mean it works for a second or so, > > > then no packets come through? What HW are you using? 
> > > > > > commit ce3cec27933c069d2015a81e59b93eb656fe7ee4 > > > Merge: 99cacdc 1d9cb1f > > > Author: Alexei Starovoitov <ast@kernel.org> > > > Date: Fri Dec 20 16:00:10 2019 -0800 > > > > > > Merge branch 'xsk-cleanup' > > > > > > Magnus Karlsson says: > > > > > > ==================== > > > This patch set cleans up the ring access functions of AF_XDP in hope > > > that it will now be easier to understand and maintain. I used to get a > > > headache every time I looked at this code in order to really understand it, > > > but now I do think it is a lot less painful. > > > <snip> > > > > > > /Magnus > > > > I see that you have debug messages in your application. Could you > > please run with those on and send me the output so I can see where it > > stops. A bisect that pin-points what commit that breaks your program > > plus the debug output should hopefully send us on the right path for a > > fix. > > > > Thanks: Magnus > > > > Hi Magnus, > > I did a bisect starting from the head of the bpf-next tree (990bca1) down to > the first commit before the patch series you identified (df034c9). The result > was identifying df0ae6f as the commit that causes the issue I am seeing. > > I've posted output from the program in debugging mode here > > - https://gitlab.com/mergetb/tech/network-emulation/kernel/snippets/1930375 > > Yes, you are correct in that forwarding works for a brief period and then stops. > I've noticed that the number of packets that are forwarded is equal to the size > of the producer/consumer descriptor rings. I've posted two ping traces from a > client ping that shows this. > > - https://gitlab.com/mergetb/tech/network-emulation/kernel/snippets/1930376 > - https://gitlab.com/mergetb/tech/network-emulation/kernel/snippets/1930377 > > I've also noticed that when the forwarding stops, the CPU usage for the proc > running the program is pegged, which is not the norm for this program as it uses > a poll call with a timeout on the xsk fd. 
> > The hardware I am using is a Mellanox ConnectX4 2x100G card (MCX416A-CCAT)
> running the mlx5 driver. The program is running in zero copy mode. I also tested
> this code out in a virtual machine with virtio NICs in SKB mode which uses
> xdpgeneric - there were no issues in that setting.
>
> --
> ~ ry

Maxim,

Do you think you could help me debug this issue that Ryan is having? I
can unfortunately not reproduce the stalling issue with my Intel i40e
cards.

Ryan, Maxim is responsible for Mellanox's AF_XDP support and he has
also contributed to the core AF_XDP code. So you are in good hands
:-).

Thanks: Magnus

^ permalink raw reply	[flat|nested] 49+ messages in thread
* Re: zero-copy between interfaces 2020-01-21 7:34 ` Magnus Karlsson @ 2020-01-21 13:40 ` Maxim Mikityanskiy 2020-01-22 21:43 ` Ryan Goodfellow 0 siblings, 1 reply; 49+ messages in thread From: Maxim Mikityanskiy @ 2020-01-21 13:40 UTC (permalink / raw) To: Ryan Goodfellow Cc: Magnus Karlsson, xdp-newbies, Tariq Toukan, Saeed Mahameed, Moshe Shemesh On 2020-01-21 09:34, Magnus Karlsson wrote: > On Thu, Jan 16, 2020 at 3:04 AM Ryan Goodfellow <rgoodfel@isi.edu> wrote: >> >> On Wed, Jan 15, 2020 at 09:20:30AM +0100, Magnus Karlsson wrote: >>> On Wed, Jan 15, 2020 at 8:40 AM Magnus Karlsson >>> <magnus.karlsson@gmail.com> wrote: >>>> >>>> On Wed, Jan 15, 2020 at 2:41 AM Ryan Goodfellow <rgoodfel@isi.edu> wrote: >>>>> >>>>> On Tue, Jan 14, 2020 at 03:52:50PM -0500, Ryan Goodfellow wrote: >>>>>> On Tue, Jan 14, 2020 at 10:59:19AM +0100, Magnus Karlsson wrote: >>>>>>> >>>>>>> Just sent out a patch on the mailing list. Would be great if you could >>>>>>> try it out. >>>>>> >>>>>> Thanks for the quick turnaround. I gave this patch a go, both in the bpf-next >>>>>> tree and manually applied to the 5.5.0-rc3 branch I've been working with up to >>>>>> this point. It does allow for allocating more memory, however packet >>>>>> forwarding no longer works. I did not see any complaints from dmesg, but here >>>>>> is an example iperf3 session from a client that worked before. 
>>>>>> >>>>>> ry@xd2:~$ iperf3 -c 10.1.0.2 >>>>>> Connecting to host 10.1.0.2, port 5201 >>>>>> [ 5] local 10.1.0.1 port 53304 connected to 10.1.0.2 port 5201 >>>>>> [ ID] Interval Transfer Bitrate Retr Cwnd >>>>>> [ 5] 0.00-1.00 sec 5.91 MBytes 49.5 Mbits/sec 2 1.41 KBytes >>>>>> [ 5] 1.00-2.00 sec 0.00 Bytes 0.00 bits/sec 1 1.41 KBytes >>>>>> [ 5] 2.00-3.00 sec 0.00 Bytes 0.00 bits/sec 0 1.41 KBytes >>>>>> [ 5] 3.00-4.00 sec 0.00 Bytes 0.00 bits/sec 1 1.41 KBytes >>>>>> [ 5] 4.00-5.00 sec 0.00 Bytes 0.00 bits/sec 0 1.41 KBytes >>>>>> [ 5] 5.00-6.00 sec 0.00 Bytes 0.00 bits/sec 0 1.41 KBytes >>>>>> [ 5] 6.00-7.00 sec 0.00 Bytes 0.00 bits/sec 1 1.41 KBytes >>>>>> [ 5] 7.00-8.00 sec 0.00 Bytes 0.00 bits/sec 0 1.41 KBytes >>>>>> [ 5] 8.00-9.00 sec 0.00 Bytes 0.00 bits/sec 0 1.41 KBytes >>>>>> ^C[ 5] 10.00-139.77 sec 0.00 Bytes 0.00 bits/sec 4 1.41 KBytes >>>>>> - - - - - - - - - - - - - - - - - - - - - - - - - >>>>>> [ ID] Interval Transfer Bitrate Retr >>>>>> [ 5] 0.00-139.77 sec 5.91 MBytes 355 Kbits/sec 9 sender >>>>>> [ 5] 0.00-139.77 sec 0.00 Bytes 0.00 bits/sec receiver >>>>>> iperf3: interrupt - the client has terminated >>>>>> >>>>>> I'll continue to investigate and report back with anything that I find. >>>>> >>>>> Interestingly I found this behavior to exist in the bpf-next tree independent >>>>> of the patch being present. >>>> >>>> Ryan, >>>> >>>> Could you please do a bisect on it? In the 12 commits after the merge >>>> commit below there are number of sensitive rewrites of the ring access >>>> functions. Maybe one of them breaks your code. When you say "packet >>>> forwarding no longer works", do you mean it works for a second or so, >>>> then no packets come through? What HW are you using? 
>>>> >>>> commit ce3cec27933c069d2015a81e59b93eb656fe7ee4 >>>> Merge: 99cacdc 1d9cb1f >>>> Author: Alexei Starovoitov <ast@kernel.org> >>>> Date: Fri Dec 20 16:00:10 2019 -0800 >>>> >>>> Merge branch 'xsk-cleanup' >>>> >>>> Magnus Karlsson says: >>>> >>>> ==================== >>>> This patch set cleans up the ring access functions of AF_XDP in hope >>>> that it will now be easier to understand and maintain. I used to get a >>>> headache every time I looked at this code in order to really understand it, >>>> but now I do think it is a lot less painful. >>>> <snip> >>>> >>>> /Magnus >>> >>> I see that you have debug messages in your application. Could you >>> please run with those on and send me the output so I can see where it >>> stops. A bisect that pin-points what commit that breaks your program >>> plus the debug output should hopefully send us on the right path for a >>> fix. >>> >>> Thanks: Magnus >>> Hi Ryan, >> Hi Magnus, >> >> I did a bisect starting from the head of the bpf-next tree (990bca1) down to >> the first commit before the patch series you identified (df034c9). The result >> was identifying df0ae6f as the commit that causes the issue I am seeing. This commit and the commit before it remove batching in xskq_nb_avail. Before these two commits xskq_nb_avail may have returned values 0..16, and now it's not limited. It looks like a change too minor to cause this issue... >> I've posted output from the program in debugging mode here >> >> - https://gitlab.com/mergetb/tech/network-emulation/kernel/snippets/1930375 >> >> Yes, you are correct in that forwarding works for a brief period and then stops. >> I've noticed that the number of packets that are forwarded is equal to the size >> of the producer/consumer descriptor rings. I've posted two ping traces from a >> client ping that shows this. 
>> >> - https://gitlab.com/mergetb/tech/network-emulation/kernel/snippets/1930376 >> - https://gitlab.com/mergetb/tech/network-emulation/kernel/snippets/1930377 These snippets are not available. >> >> I've also noticed that when the forwarding stops, the CPU usage for the proc >> running the program is pegged, which is not the norm for this program as it uses >> a poll call with a timeout on the xsk fd. This information led me to a guess what may be happening. On the RX side, mlx5e allocates pages in bulks for performance reasons and to leverage hardware features targeted to performance. In AF_XDP mode, bulking of frames is also used (on x86, the bulk size is 64 with striding RQ enabled, and 8 otherwise, however, it's implementation details that might change later). If you don't put enough frames to XSK Fill Ring, the driver will be demanding more frames and return from poll() immediately. Basically, in the application, you should put as many frames to the Fill Ring as you can. Please check if that could be the root cause of your issue. I tracked this issue in our internal bug tracker in case we need to perform actual debugging of mlx5e. I'm looking forward to your feedback on my assumption above. >> The hardware I am using is a Mellanox ConnectX4 2x100G card (MCX416A-CCAT) >> running the mlx5 driver. This one should run without striding RQ, please verify it with ethtool --show-priv-flags (the flag name is rx_striding_rq). > The program is running in zero copy mode. I also tested >> this code out in a virtual machine with virtio NICs in SKB mode which uses >> xdpgeneric - there were no issues in that setting. >> >> -- >> ~ ry > > Maxim, > > Do you think you could help me debug this issue that Ryan is having? I > can unfortunately not reproduce the stalling issue with my Intel i40e > cards. > > Ryan, Maxim is Mellanox's responsible for AF_XDP support and he has > also contributed to the core AF_XDP code. So you are in good hands > :-). 
> > Thanks: Magnus > ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: zero-copy between interfaces 2020-01-21 13:40 ` Maxim Mikityanskiy @ 2020-01-22 21:43 ` Ryan Goodfellow 2020-01-27 14:01 ` Maxim Mikityanskiy 0 siblings, 1 reply; 49+ messages in thread From: Ryan Goodfellow @ 2020-01-22 21:43 UTC (permalink / raw) To: Maxim Mikityanskiy Cc: Magnus Karlsson, xdp-newbies, Tariq Toukan, Saeed Mahameed, Moshe Shemesh On Tue, Jan 21, 2020 at 01:40:50PM +0000, Maxim Mikityanskiy wrote: > >> I've posted output from the program in debugging mode here > >> > >> - https://gitlab.com/mergetb/tech/network-emulation/kernel/snippets/1930375 > >> > >> Yes, you are correct in that forwarding works for a brief period and then stops. > >> I've noticed that the number of packets that are forwarded is equal to the size > >> of the producer/consumer descriptor rings. I've posted two ping traces from a > >> client ping that shows this. > >> > >> - https://gitlab.com/mergetb/tech/network-emulation/kernel/snippets/1930376 > >> - https://gitlab.com/mergetb/tech/network-emulation/kernel/snippets/1930377 > > These snippets are not available. Apologies, I had the wrong permissions set. They should be available now. > > >> > >> I've also noticed that when the forwarding stops, the CPU usage for the proc > >> running the program is pegged, which is not the norm for this program as it uses > >> a poll call with a timeout on the xsk fd. > > This information led me to a guess what may be happening. On the RX > side, mlx5e allocates pages in bulks for performance reasons and to > leverage hardware features targeted to performance. In AF_XDP mode, > bulking of frames is also used (on x86, the bulk size is 64 with > striding RQ enabled, and 8 otherwise, however, it's implementation > details that might change later). If you don't put enough frames to XSK > Fill Ring, the driver will be demanding more frames and return from > poll() immediately. Basically, in the application, you should put as > many frames to the Fill Ring as you can. 
> Please check if that could be
> the root cause of your issue.

The code in this application makes an effort to replenish the fill ring as fast
as possible. The basic loop of the application is to first check if there are
any descriptors to be consumed from the completion queue or any descriptors that
can be added to the fill queue, and only then to move on to moving packets
through the rx and tx rings.

https://gitlab.com/mergetb/tech/network-emulation/kernel/blob/v5.5-moa/samples/bpf/xdpsock_multidev.c#L452-474

>
> I tracked this issue in our internal bug tracker in case we need to
> perform actual debugging of mlx5e. I'm looking forward to your feedback
> on my assumption above.
>
> >> The hardware I am using is a Mellanox ConnectX4 2x100G card (MCX416A-CCAT)
> >> running the mlx5 driver.
>
> This one should run without striding RQ, please verify it with ethtool
> --show-priv-flags (the flag name is rx_striding_rq).

I do not remember changing this option, so whatever the default is, is what it
was running with. I am traveling this week and do not have access to these
systems, but will ensure that this flag is set properly when I get back.

--
~ ry

^ permalink raw reply	[flat|nested] 49+ messages in thread
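[Editor's note] The loop order Ryan describes, drain the completion ring, top up the fill ring, and only then forward rx to tx, can be modeled with bare producer/consumer counters. The sketch below is an editor's simplification, not the sample program itself: the struct and helper names are invented for illustration, and a real implementation would use libbpf's AF_XDP ring accessors against real descriptor rings.

```c
/* Simplified model of the forwarding loop described above. Rings are
 * reduced to prod/cons counters; "free_frames" stands in for UMEM
 * frames that are not currently sitting on any ring. */

struct ring { unsigned int prod, cons, size; };

static unsigned int ring_avail(const struct ring *r) { return r->prod - r->cons; }
static unsigned int ring_free(const struct ring *r)  { return r->size - (r->prod - r->cons); }

struct fwd_state {
	struct ring cq;            /* completion ring (TX done)    */
	struct ring fq;            /* fill ring (RX buffers)       */
	struct ring rx;            /* RX descriptor ring           */
	struct ring tx;            /* TX descriptor ring           */
	unsigned int free_frames;  /* UMEM frames not on any ring  */
};

/* One pass of the loop. */
static void fwd_iteration(struct fwd_state *s)
{
	/* 1. Reclaim frames whose transmission has completed. */
	unsigned int done = ring_avail(&s->cq);
	s->cq.cons += done;
	s->free_frames += done;

	/* 2. Replenish the fill ring with everything available. */
	unsigned int n = ring_free(&s->fq);
	if (n > s->free_frames)
		n = s->free_frames;
	s->fq.prod += n;
	s->free_frames -= n;

	/* 3. Only then forward: move RX descriptors onto the TX ring. */
	unsigned int pkts = ring_avail(&s->rx);
	if (pkts > ring_free(&s->tx))
		pkts = ring_free(&s->tx);
	s->rx.cons += pkts;
	s->tx.prod += pkts;
}
```

Keeping step 2 as aggressive as possible is exactly the behavior Maxim asks about: the fill ring is topped up with every frame the application has on hand before any packets are moved.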
* Re: zero-copy between interfaces 2020-01-22 21:43 ` Ryan Goodfellow @ 2020-01-27 14:01 ` Maxim Mikityanskiy 2020-01-27 15:54 ` Magnus Karlsson 0 siblings, 1 reply; 49+ messages in thread From: Maxim Mikityanskiy @ 2020-01-27 14:01 UTC (permalink / raw) To: Ryan Goodfellow, Magnus Karlsson Cc: xdp-newbies, Tariq Toukan, Saeed Mahameed, Moshe Shemesh On 2020-01-22 23:43, Ryan Goodfellow wrote: > On Tue, Jan 21, 2020 at 01:40:50PM +0000, Maxim Mikityanskiy wrote: >>>> I've posted output from the program in debugging mode here >>>> >>>> - https://gitlab.com/mergetb/tech/network-emulation/kernel/snippets/1930375 >>>> >>>> Yes, you are correct in that forwarding works for a brief period and then stops. >>>> I've noticed that the number of packets that are forwarded is equal to the size >>>> of the producer/consumer descriptor rings. I've posted two ping traces from a >>>> client ping that shows this. >>>> >>>> - https://gitlab.com/mergetb/tech/network-emulation/kernel/snippets/1930376 >>>> - https://gitlab.com/mergetb/tech/network-emulation/kernel/snippets/1930377 >> >> These snippets are not available. > > Apologies, I had the wrong permissions set. They should be available now. > >> >>>> >>>> I've also noticed that when the forwarding stops, the CPU usage for the proc >>>> running the program is pegged, which is not the norm for this program as it uses >>>> a poll call with a timeout on the xsk fd. >> >> This information led me to a guess what may be happening. On the RX >> side, mlx5e allocates pages in bulks for performance reasons and to >> leverage hardware features targeted to performance. In AF_XDP mode, >> bulking of frames is also used (on x86, the bulk size is 64 with >> striding RQ enabled, and 8 otherwise, however, it's implementation >> details that might change later). If you don't put enough frames to XSK >> Fill Ring, the driver will be demanding more frames and return from >> poll() immediately. 
>> Basically, in the application, you should put as
>> many frames to the Fill Ring as you can. Please check if that could be
>> the root cause of your issue.
>
> The code in this application makes an effort to relenish the fill ring as fast
> as possible. The basic loop of the application is to first check if there are
> any descriptors to be consumed from the completion queue or any descriptors that
> can be added to the fill queue, and only then to move on to moving packets
> through the rx and tx rings.
>
> https://gitlab.com/mergetb/tech/network-emulation/kernel/blob/v5.5-moa/samples/bpf/xdpsock_multidev.c#L452-474

I reproduced your issue and did my investigation, and here is what I found:

1. Commit df0ae6f78a45 (that you found during bisect) introduces an
important behavioral change (which I thought was not that important).
xskq_nb_avail used to return min(entries, dcnt), but after the change it
just returns entries, which may be as big as the ring size.

2. xskq_peek_addr updates q->ring->consumer only when q->cons_tail
catches up with q->cons_head. So, before that patch and one previous
patch, cons_head - cons_tail was not more than 16, so the consumer index
was updated periodically. Now consumer is updated only when the whole
ring is exhausted.

3. The application can't replenish the fill ring if the consumer index
doesn't move. As a consequence, refilling the descriptors by the
application can't happen in parallel with using them by the driver. It
should have some performance penalty and possibly even lead to packet
drops, because the driver uses all the descriptors and only then
advances the consumer index, and then it has to wait until the
application refills the ring, busy-looping and losing packets.

4. As mlx5e allocates frames in batches, the consequences are even more
severe: it's a deadlock where the driver waits for the application, and
vice versa. The driver never reaches the point where cons_tail gets
equal to cons_head.
E.g., if cons_tail + 3 == cons_head, and the batch
size requested by the driver is 8, the driver won't peek anything from
the fill ring, waiting for the difference between cons_tail and cons_head
to grow to at least 8. On the other hand, the application can't put
anything on the ring, because it still thinks that the consumer index is
0. As cons_tail never reaches cons_head, the consumer index doesn't get
updated, hence the deadlock.

So, in my vision, the decision to remove RX_BATCH_SIZE and periodic
updates of the consumer index was wrong. It totally breaks mlx5e, which
does batching, and it will affect the performance of any driver, because
the application can't refill the ring until it gets completely empty and
the driver starts waiting for frames. I suggest that periodic updates of
the consumer index be re-added to xskq_cons_peek_addr.

Magnus, what do you think of the suggestion above?

Thanks,
Max

>>
>> I tracked this issue in our internal bug tracker in case we need to
>> perform actual debugging of mlx5e. I'm looking forward to your feedback
>> on my assumption above.
>>
>>>> The hardware I am using is a Mellanox ConnectX4 2x100G card (MCX416A-CCAT)
>>>> running the mlx5 driver.
>>
>> This one should run without striding RQ, please verify it with ethtool
>> --show-priv-flags (the flag name is rx_striding_rq).
>
> I do not remember changing this option, so whatever the default is, is what it
> was running with. I am traveling this week and do not have access to these
> systems, but will ensure that this flag is set properly when I get back.
>

^ permalink raw reply	[flat|nested] 49+ messages in thread
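[Editor's note] The stall Maxim describes in point 4 can be reproduced in miniature with plain counters. The model below is an editor's simplification for illustration only: the field names mirror the kernel's xsk_queue state, but the logic is reduced to the two interacting rules, the driver refuses to peek less than a full batch, and the consumer index is only published once cons_tail has caught up with cons_head.

```c
/* Miniature model of the post-df0ae6f fill-ring deadlock. */

#define RING_SIZE  8
#define MLX5_BATCH 8  /* batch size the driver insists on */

struct fill_ring {
	unsigned int producer;   /* advanced by the application        */
	unsigned int consumer;   /* published back to the application  */
	unsigned int cons_tail;  /* kernel-internal: entries consumed  */
	unsigned int cons_head;  /* kernel-internal: entries peeked    */
};

/* Post-df0ae6f rule: publish consumer only when tail catches head. */
static void maybe_publish_consumer(struct fill_ring *q)
{
	if (q->cons_tail == q->cons_head)
		q->consumer = q->cons_tail;
}

/* Driver side: refresh from shared state, refuse a partial batch. */
static int driver_can_peek(struct fill_ring *q)
{
	q->cons_head = q->producer;
	return q->cons_head - q->cons_tail >= MLX5_BATCH;
}

/* Application side: free slots as seen via the published consumer. */
static unsigned int app_free_slots(const struct fill_ring *q)
{
	return RING_SIZE - (q->producer - q->consumer);
}
```

With cons_tail = 5 and cons_head = producer = 8, only 3 entries are available, so the driver waits for 8; meanwhile the application still sees consumer = 0 and a full ring, so it cannot refill: neither side can make progress.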
* Re: zero-copy between interfaces 2020-01-27 14:01 ` Maxim Mikityanskiy @ 2020-01-27 15:54 ` Magnus Karlsson 2020-01-30 9:37 ` Maxim Mikityanskiy 0 siblings, 1 reply; 49+ messages in thread From: Magnus Karlsson @ 2020-01-27 15:54 UTC (permalink / raw) To: Maxim Mikityanskiy Cc: Ryan Goodfellow, xdp-newbies, Tariq Toukan, Saeed Mahameed, Moshe Shemesh On Mon, Jan 27, 2020 at 3:01 PM Maxim Mikityanskiy <maximmi@mellanox.com> wrote: > > On 2020-01-22 23:43, Ryan Goodfellow wrote: > > On Tue, Jan 21, 2020 at 01:40:50PM +0000, Maxim Mikityanskiy wrote: > >>>> I've posted output from the program in debugging mode here > >>>> > >>>> - https://gitlab.com/mergetb/tech/network-emulation/kernel/snippets/1930375 > >>>> > >>>> Yes, you are correct in that forwarding works for a brief period and then stops. > >>>> I've noticed that the number of packets that are forwarded is equal to the size > >>>> of the producer/consumer descriptor rings. I've posted two ping traces from a > >>>> client ping that shows this. > >>>> > >>>> - https://gitlab.com/mergetb/tech/network-emulation/kernel/snippets/1930376 > >>>> - https://gitlab.com/mergetb/tech/network-emulation/kernel/snippets/1930377 > >> > >> These snippets are not available. > > > > Apologies, I had the wrong permissions set. They should be available now. > > > >> > >>>> > >>>> I've also noticed that when the forwarding stops, the CPU usage for the proc > >>>> running the program is pegged, which is not the norm for this program as it uses > >>>> a poll call with a timeout on the xsk fd. > >> > >> This information led me to a guess what may be happening. On the RX > >> side, mlx5e allocates pages in bulks for performance reasons and to > >> leverage hardware features targeted to performance. In AF_XDP mode, > >> bulking of frames is also used (on x86, the bulk size is 64 with > >> striding RQ enabled, and 8 otherwise, however, it's implementation > >> details that might change later). 
If you don't put enough frames to XSK > >> Fill Ring, the driver will be demanding more frames and return from > >> poll() immediately. Basically, in the application, you should put as > >> many frames to the Fill Ring as you can. Please check if that could be > >> the root cause of your issue. > > > > The code in this application makes an effort to relenish the fill ring as fast > > as possible. The basic loop of the application is to first check if there are > > any descriptors to be consumed from the completion queue or any descriptors that > > can be added to the fill queue, and only then to move on to moving packets > > through the rx and tx rings. > > > > https://gitlab.com/mergetb/tech/network-emulation/kernel/blob/v5.5-moa/samples/bpf/xdpsock_multidev.c#L452-474 > > I reproduced your issue and did my investigation, and here is what I found: > > 1. Commit df0ae6f78a45 (that you found during bisect) introduces an > important behavioral change (which I thought was not that important). > xskq_nb_avail used to return min(entries, dcnt), but after the change it > just returns entries, which may be as big as the ring size. > > 2. xskq_peek_addr updates q->ring->consumer only when q->cons_tail > catches up with q->cons_head. So, before that patch and one previous > patch, cons_head - cons_tail was not more than 16, so the consumer index > was updated periodically. Now consumer is updated only when the whole > ring is exhausted. > > 3. The application can't replenish the fill ring if the consumer index > doesn't move. As a consequence, refilling the descriptors by the > application can't happen in parallel with using them by the driver. It > should have some performance penalty and possibly even lead to packet > drops, because the driver uses all the descriptors and only then > advances the consumer index, and then it has to wait until the > application refills the ring, busy-looping and losing packets. 
This will happen if user space always fills up the whole ring, which may or may not happen, depending on the app. With that said, it might provide better performance to update it once in a while. It might also be the case that we will get better performance with the new scheme if we only fill half the fill ring. I will look into this and see what I get. > 4. As mlx5e allocates frames in batches, the consequences are even more > severe: it's a deadlock where the driver waits for the application, and > vice versa. The driver never reaches the point where cons_tail gets > equal to cons_head. E.g., if cons_tail + 3 == cons_head, and the batch > size requested by the driver is 8, the driver won't peek anything from > the fill ring waiting for difference between cons_tail and cons_head to > increase to be at least 8. On the other hand, the application can't put > anything to the ring, because it still thinks that the consumer index is > 0. As cons_tail never reaches cons_head, the consumer index doesn't get > updated, hence the deadlock. Good thing that you detected this. Maybe I should get a Mellanox card :-). This is different from how we implemented Intel's drivers that just work on any batch size. If it gets 3 packets back, it will use those. How do you deal with the case when the application just puts a single buffer in the fill ring and wants to receive a single packet? Why does the Mellanox driver need a specific batch size? This is just for my understanding so we can find a good solution.
The reason I wanted to remove RX_BATCH_SIZE is that application developers complained about it giving rise to counter-intuitive behavior in user space. I will try to dig out the complaints and the explanations Björn and I had to send, about things that users really should not have to care about. It should just work. I also do not like to have arbitrary constants like this. Why 16? Would much prefer not having to deal with this, unless of course it horribly breaks something or gives rise to worse performance. Might still be the case here, but if not, I would like to remove it. Thanks: Magnus > Magnus, what do you think of the suggestion above? > > Thanks, > Max > > >> > >> I tracked this issue in our internal bug tracker in case we need to > >> perform actual debugging of mlx5e. I'm looking forward to your feedback > >> on my assumption above. > >> > >>>> The hardware I am using is a Mellanox ConnectX4 2x100G card (MCX416A-CCAT) > >>>> running the mlx5 driver. > >> > >> This one should run without striding RQ, please verify it with ethtool > >> --show-priv-flags (the flag name is rx_striding_rq). > > > > I do not remember changing this option, so whatever the default is, is what it > > was running with. I am traveling this week and do not have access to these > > systems, but will ensure that this flag is set properly when I get back. > > > ^ permalink raw reply [flat|nested] 49+ messages in thread
* RE: zero-copy between interfaces 2020-01-27 15:54 ` Magnus Karlsson @ 2020-01-30 9:37 ` Maxim Mikityanskiy 2020-01-30 9:59 ` Magnus Karlsson 0 siblings, 1 reply; 49+ messages in thread From: Maxim Mikityanskiy @ 2020-01-30 9:37 UTC (permalink / raw) To: Magnus Karlsson Cc: Ryan Goodfellow, xdp-newbies, Tariq Toukan, Saeed Mahameed, Moshe Shemesh > -----Original Message----- > From: Magnus Karlsson <magnus.karlsson@gmail.com> > Sent: 27 January, 2020 17:55 > To: Maxim Mikityanskiy <maximmi@mellanox.com> > Cc: Ryan Goodfellow <rgoodfel@isi.edu>; xdp-newbies@vger.kernel.org; Tariq > Toukan <tariqt@mellanox.com>; Saeed Mahameed <saeedm@mellanox.com>; Moshe > Shemesh <moshe@mellanox.com> > Subject: Re: zero-copy between interfaces > > On Mon, Jan 27, 2020 at 3:01 PM Maxim Mikityanskiy <maximmi@mellanox.com> > wrote: > > > > On 2020-01-22 23:43, Ryan Goodfellow wrote: > > > On Tue, Jan 21, 2020 at 01:40:50PM +0000, Maxim Mikityanskiy wrote: > > >>>> I've posted output from the program in debugging mode here > > >>>> > > >>>> - https://gitlab.com/mergetb/tech/network- > emulation/kernel/snippets/1930375 > > >>>> > > >>>> Yes, you are correct in that forwarding works for a brief period and > then stops. > > >>>> I've noticed that the number of packets that are forwarded is equal > to the size > > >>>> of the producer/consumer descriptor rings. I've posted two ping > traces from a > > >>>> client ping that shows this. > > >>>> > > >>>> - https://gitlab.com/mergetb/tech/network- > emulation/kernel/snippets/1930376 > > >>>> - https://gitlab.com/mergetb/tech/network- > emulation/kernel/snippets/1930377 > > >> > > >> These snippets are not available. > > > > > > Apologies, I had the wrong permissions set. They should be available > now. 
> > > > > >> > > >>>> > > >>>> I've also noticed that when the forwarding stops, the CPU usage for > the proc > > >>>> running the program is pegged, which is not the norm for this program > as it uses > > >>>> a poll call with a timeout on the xsk fd. > > >> > > >> This information led me to a guess what may be happening. On the RX > > >> side, mlx5e allocates pages in bulks for performance reasons and to > > >> leverage hardware features targeted to performance. In AF_XDP mode, > > >> bulking of frames is also used (on x86, the bulk size is 64 with > > >> striding RQ enabled, and 8 otherwise, however, it's implementation > > >> details that might change later). If you don't put enough frames to XSK > > >> Fill Ring, the driver will be demanding more frames and return from > > >> poll() immediately. Basically, in the application, you should put as > > >> many frames to the Fill Ring as you can. Please check if that could be > > >> the root cause of your issue. > > > > > > The code in this application makes an effort to relenish the fill ring > as fast > > > as possible. The basic loop of the application is to first check if > there are > > > any descriptors to be consumed from the completion queue or any > descriptors that > > > can be added to the fill queue, and only then to move on to moving > packets > > > through the rx and tx rings. > > > > > > https://gitlab.com/mergetb/tech/network-emulation/kernel/blob/v5.5- > moa/samples/bpf/xdpsock_multidev.c#L452-474 > > > > I reproduced your issue and did my investigation, and here is what I > found: > > > > 1. Commit df0ae6f78a45 (that you found during bisect) introduces an > > important behavioral change (which I thought was not that important). > > xskq_nb_avail used to return min(entries, dcnt), but after the change it > > just returns entries, which may be as big as the ring size. > > > > 2. xskq_peek_addr updates q->ring->consumer only when q->cons_tail > > catches up with q->cons_head. 
So, before that patch and one previous > > patch, cons_head - cons_tail was not more than 16, so the consumer index > > was updated periodically. Now consumer is updated only when the whole > > ring is exhausted. > > > > 3. The application can't replenish the fill ring if the consumer index > > doesn't move. As a consequence, refilling the descriptors by the > > application can't happen in parallel with using them by the driver. It > > should have some performance penalty and possibly even lead to packet > > drops, because the driver uses all the descriptors and only then > > advances the consumer index, and then it has to wait until the > > application refills the ring, busy-looping and losing packets. > > This will happen if user space always fills up the whole ring, which > might or might not happen all depending on the app. Yes, that's right, and as far as I know, it's common to fill as many frames as the application can (there was no reason to do it other way till now). > With that said, it > might provide better performance to update it once in a while. It > might also be the case that we will get better performance with the > new scheme if we only fill half the fill ring. Yes, it may improve performance. However, I don't think it's correct to set such a limitation on the app. > I will look into this > and see what I get. > > > 4. As mlx5e allocates frames in batches, the consequences are even more > > severe: it's a deadlock where the driver waits for the application, and > > vice versa. The driver never reaches the point where cons_tail gets > > equal to cons_head. E.g., if cons_tail + 3 == cons_head, and the batch > > size requested by the driver is 8, the driver won't peek anything from > > the fill ring waiting for difference between cons_tail and cons_head to > > increase to be at least 8. On the other hand, the application can't put > > anything to the ring, because it still thinks that the consumer index is > > 0. 
As cons_tail never reaches cons_head, the consumer index doesn't get > > updated, hence the deadlock. > > Good thing that you detected this. Maybe I should get a Mellanox card > :-). This is different from how we implemented Intel's drivers that > just work on any batch size. If it gets 3 packets back, it will use > those. How do you deal with the case when the application just puts a > single buffer in the fill ring and wants to receive a single packet? mlx5e will wait until the full batch is available. As AF_XDP is intended for high-performance apps, this scenario is less expected. We prefer to leverage our performance features. > Why does the Mellanox driver need a specific batch size? This is just > for my understanding so we can find a good solution. The main reason is our performance feature called striding RQ. Skipping all irrelevant details, a batch of 64 pages is posted to the NIC with a single request, and the NIC fills them progressively. This feature is turned on by default on modern NICs, and it's really good for performance. It might be possible to post a smaller batch though, I'm not sure about it, it needs to be checked, but anyway it's not something that we will likely do, because it's a complication of the data path, and if we know more frames are coming, it's much better for the performance to wait for them, rather than to post several incomplete batches. > > So, in my vision, the decision to remove RX_BATCH_SIZE and periodic > > updates of the consumer index was wrong. It totally breaks mlx5e, that > > does batching, and it will affect the performance of any driver, because > > the application can't refill the ring until it gets completely empty and > > the driver starts waiting for frames. I suggest that periodic updates of > > the consumer index should be readded to xskq_cons_peek_addr. > > The reason I wanted to remove RX_BATCH_SIZE is that application > developers complained about it giving rise to counter intuitive > behavior in user space. 
I will try to dig out the complaints and the > explanations Björn and I had to send which it seemed that users really > should not have to care about. It should just work. I think the counter that doesn't update till the very last moment and then advances by the ring size will also be something to complain about (and I am the first one to complain :D). Such bursts are not good in any case. > I also do not like > to have arbitrary constants like this. Why 16? I believe any batching mechanism has a constant that looks arbitrary. This constant should be the golden mean: if it's too small, there is little effect from batching; if it's too big, it gets too bursty. Basically, after your patch it just changed from 16 to the ring size. Maybe we can tie that constant to ring size? Make it ring_size / another_arbitrary_constant? :) > Would much prefer not > having to deal with this, unless of course it horribly breaks > something or gives rise to worse performance. Might still be the case > here, but if not, I would like to remove it. > > Thanks: Magnus > > > Magnus, what do you think of the suggestion above? > > > > Thanks, > > Max > > > > >> > > >> I tracked this issue in our internal bug tracker in case we need to > > >> perform actual debugging of mlx5e. I'm looking forward to your feedback > > >> on my assumption above. > > >> > > >>>> The hardware I am using is a Mellanox ConnectX4 2x100G card (MCX416A-CCAT) > > >>>> running the mlx5 driver. > > >> > > >> This one should run without striding RQ, please verify it with ethtool > > >> --show-priv-flags (the flag name is rx_striding_rq). > > > > > > I do not remember changing this option, so whatever the default is, is > what it > > > was running with. I am traveling this week and do not have access to > these > > > systems, but will ensure that this flag is set properly when I get back. > > > > > ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: zero-copy between interfaces 2020-01-30 9:37 ` Maxim Mikityanskiy @ 2020-01-30 9:59 ` Magnus Karlsson 2020-01-30 11:40 ` Magnus Karlsson 0 siblings, 1 reply; 49+ messages in thread From: Magnus Karlsson @ 2020-01-30 9:59 UTC (permalink / raw) To: Maxim Mikityanskiy Cc: Ryan Goodfellow, xdp-newbies, Tariq Toukan, Saeed Mahameed, Moshe Shemesh On Thu, Jan 30, 2020 at 10:37 AM Maxim Mikityanskiy <maximmi@mellanox.com> wrote: > > > -----Original Message----- > > From: Magnus Karlsson <magnus.karlsson@gmail.com> > > Sent: 27 January, 2020 17:55 > > To: Maxim Mikityanskiy <maximmi@mellanox.com> > > Cc: Ryan Goodfellow <rgoodfel@isi.edu>; xdp-newbies@vger.kernel.org; Tariq > > Toukan <tariqt@mellanox.com>; Saeed Mahameed <saeedm@mellanox.com>; Moshe > > Shemesh <moshe@mellanox.com> > > Subject: Re: zero-copy between interfaces > > > > On Mon, Jan 27, 2020 at 3:01 PM Maxim Mikityanskiy <maximmi@mellanox.com> > > wrote: > > > > > > On 2020-01-22 23:43, Ryan Goodfellow wrote: > > > > On Tue, Jan 21, 2020 at 01:40:50PM +0000, Maxim Mikityanskiy wrote: > > > >>>> I've posted output from the program in debugging mode here > > > >>>> > > > >>>> - https://gitlab.com/mergetb/tech/network- > > emulation/kernel/snippets/1930375 > > > >>>> > > > >>>> Yes, you are correct in that forwarding works for a brief period and > > then stops. > > > >>>> I've noticed that the number of packets that are forwarded is equal > > to the size > > > >>>> of the producer/consumer descriptor rings. I've posted two ping > > traces from a > > > >>>> client ping that shows this. > > > >>>> > > > >>>> - https://gitlab.com/mergetb/tech/network- > > emulation/kernel/snippets/1930376 > > > >>>> - https://gitlab.com/mergetb/tech/network- > > emulation/kernel/snippets/1930377 > > > >> > > > >> These snippets are not available. > > > > > > > > Apologies, I had the wrong permissions set. They should be available > > now. 
> > > > > > > >> > > > >>>> > > > >>>> I've also noticed that when the forwarding stops, the CPU usage for > > the proc > > > >>>> running the program is pegged, which is not the norm for this program > > as it uses > > > >>>> a poll call with a timeout on the xsk fd. > > > >> > > > >> This information led me to a guess what may be happening. On the RX > > > >> side, mlx5e allocates pages in bulks for performance reasons and to > > > >> leverage hardware features targeted to performance. In AF_XDP mode, > > > >> bulking of frames is also used (on x86, the bulk size is 64 with > > > >> striding RQ enabled, and 8 otherwise, however, it's implementation > > > >> details that might change later). If you don't put enough frames to XSK > > > >> Fill Ring, the driver will be demanding more frames and return from > > > >> poll() immediately. Basically, in the application, you should put as > > > >> many frames to the Fill Ring as you can. Please check if that could be > > > >> the root cause of your issue. > > > > > > > > The code in this application makes an effort to relenish the fill ring > > as fast > > > > as possible. The basic loop of the application is to first check if > > there are > > > > any descriptors to be consumed from the completion queue or any > > descriptors that > > > > can be added to the fill queue, and only then to move on to moving > > packets > > > > through the rx and tx rings. > > > > > > > > https://gitlab.com/mergetb/tech/network-emulation/kernel/blob/v5.5- > > moa/samples/bpf/xdpsock_multidev.c#L452-474 > > > > > > I reproduced your issue and did my investigation, and here is what I > > found: > > > > > > 1. Commit df0ae6f78a45 (that you found during bisect) introduces an > > > important behavioral change (which I thought was not that important). > > > xskq_nb_avail used to return min(entries, dcnt), but after the change it > > > just returns entries, which may be as big as the ring size. > > > > > > 2. 
xskq_peek_addr updates q->ring->consumer only when q->cons_tail > > > catches up with q->cons_head. So, before that patch and one previous > > > patch, cons_head - cons_tail was not more than 16, so the consumer index > > > was updated periodically. Now consumer is updated only when the whole > > > ring is exhausted. > > > > > > 3. The application can't replenish the fill ring if the consumer index > > > doesn't move. As a consequence, refilling the descriptors by the > > > application can't happen in parallel with using them by the driver. It > > > should have some performance penalty and possibly even lead to packet > > > drops, because the driver uses all the descriptors and only then > > > advances the consumer index, and then it has to wait until the > > > application refills the ring, busy-looping and losing packets. > > > > This will happen if user space always fills up the whole ring, which > > might or might not happen all depending on the app. > > Yes, that's right, and as far as I know, it's common to fill as many > frames as the application can (there was no reason to do it other way > till now). > > > With that said, it > > might provide better performance to update it once in a while. It > > might also be the case that we will get better performance with the > > new scheme if we only fill half the fill ring. > > Yes, it may improve performance. However, I don't think it's correct to > set such a limitation on the app. I will examine what provides the best performance. On one hand there is the number of updates to shared ring state (minimized by the current scheme); on the other, the ability for the user app to put buffers on the fill ring. Stating that putting e.g. half the packets on the fill ring provides better performance (for some application) is not a limitation. It is performance tuning advice :-). > > I will look into this > > and see what I get. > > > > > 4.
As mlx5e allocates frames in batches, the consequences are even more > > > severe: it's a deadlock where the driver waits for the application, and > > > vice versa. The driver never reaches the point where cons_tail gets > > > equal to cons_head. E.g., if cons_tail + 3 == cons_head, and the batch > > > size requested by the driver is 8, the driver won't peek anything from > > > the fill ring waiting for difference between cons_tail and cons_head to > > > increase to be at least 8. On the other hand, the application can't put > > > anything to the ring, because it still thinks that the consumer index is > > > 0. As cons_tail never reaches cons_head, the consumer index doesn't get > > > updated, hence the deadlock. > > > > Good thing that you detected this. Maybe I should get a Mellanox card > > :-). This is different from how we implemented Intel's drivers that > > just work on any batch size. If it gets 3 packets back, it will use > > those. How do you deal with the case when the application just puts a > > single buffer in the fill ring and wants to receive a single packet? > > mlx5e will wait until the full batch is available. As AF_XDP is intended > for high-performance apps, this scenario is less expected. We prefer to > leverage our performance features. That you cannot support all applications on top of AF_XDP with your zero-copy support seems broken to me. But I give you that you might support all the ones you care about with your current batching support. Like you say, most apps will put plenty of buffers on the fill ring, so this should not be a problem. Can you not implement some slow path for these cases? You must have one for the skb case. > > Why does the Mellanox driver need a specific batch size? This is just > > for my understanding so we can find a good solution. > > The main reason is our performance feature called striding RQ. 
Skipping > all irrelevant details, a batch of 64 pages is posted to the NIC with a > single request, and the NIC fills them progressively. This feature is > turned on by default on modern NICs, and it's really good for > performance. > > It might be possible to post a smaller batch though, I'm not sure about > it, it needs to be checked, but anyway it's not something that we will > likely do, because it's a complication of the data path, and if we know > more frames are coming, it's much better for the performance to wait for > them, rather than to post several incomplete batches. > > > > So, in my vision, the decision to remove RX_BATCH_SIZE and periodic > > > updates of the consumer index was wrong. It totally breaks mlx5e, that > > > does batching, and it will affect the performance of any driver, because > > > the application can't refill the ring until it gets completely empty and > > > the driver starts waiting for frames. I suggest that periodic updates of > > > the consumer index should be readded to xskq_cons_peek_addr. > > > > The reason I wanted to remove RX_BATCH_SIZE is that application > > developers complained about it giving rise to counter intuitive > > behavior in user space. I will try to dig out the complaints and the > > explanations Björn and I had to send which it seemed that users really > > should not have to care about. It should just work. > > I think the counter that doesn't update till the very last moment and > then advances by the ring size will also be something to complain about > (and I am the first one to complain :D). Such bursts are not good in any > case. Do you have any performance data that shows this scheme is bad for performance? The good thing about this scheme is that global state is updated less frequently. And the bad thing is what you mentioned. But which one has the greatest effect, is the question. > > I also do not like > > to have arbitrary constants like this. Why 16? 
> > I believe any batching mechanism has a constant that look arbitrary. > This constant should be the golden mean: if it's too small, there is > little effect from batching; if it's too big, it gets too bursty. > > Basically, after your patch it just changed from 16 to the ring size. > Maybe we can tie that constant to ring size? Make it ring_size / > another_arbitrary_constant? :) Agreed, I thought about this too. Something tied to ring_size might be a good compromise. Will examine this. But I want to base this on performance data not idle speculation, so need to run some experiments first. /Magnus > > Would much prefer not > > having to deal with this, unless of course it horribly breaks > > something or gives rise to worse performance. Might still be the case > > here, but if not, I would like to remove it. > > > > Thanks: Magnus > > > > > Magnus, what do you think of the suggestion above? > > > > > > Thanks, > > > Max > > > > > > >> > > > >> I tracked this issue in our internal bug tracker in case we need to > > > >> perform actual debugging of mlx5e. I'm looking forward to your feedback > > > >> on my assumption above. > > > >> > > > >>>> The hardware I am using is a Mellanox ConnectX4 2x100G card (MCX416A- > > CCAT) > > > >>>> running the mlx5 driver. > > > >> > > > >> This one should run without striding RQ, please verify it with ethtool > > > >> --show-priv-flags (the flag name is rx_striding_rq). > > > > > > > > I do not remember changing this option, so whatever the default is, is > > what it > > > > was running with. I am traveling this week and do not have access to > > these > > > > systems, but will ensure that this flag is set properly when I get back. > > > > > > > ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: zero-copy between interfaces 2020-01-30 9:59 ` Magnus Karlsson @ 2020-01-30 11:40 ` Magnus Karlsson 2020-02-04 16:10 ` Magnus Karlsson 0 siblings, 1 reply; 49+ messages in thread From: Magnus Karlsson @ 2020-01-30 11:40 UTC (permalink / raw) To: Maxim Mikityanskiy Cc: Ryan Goodfellow, xdp-newbies, Tariq Toukan, Saeed Mahameed, Moshe Shemesh On Thu, Jan 30, 2020 at 10:59 AM Magnus Karlsson <magnus.karlsson@gmail.com> wrote: > > On Thu, Jan 30, 2020 at 10:37 AM Maxim Mikityanskiy > <maximmi@mellanox.com> wrote: > > > > > -----Original Message----- > > > From: Magnus Karlsson <magnus.karlsson@gmail.com> > > > Sent: 27 January, 2020 17:55 > > > To: Maxim Mikityanskiy <maximmi@mellanox.com> > > > Cc: Ryan Goodfellow <rgoodfel@isi.edu>; xdp-newbies@vger.kernel.org; Tariq > > > Toukan <tariqt@mellanox.com>; Saeed Mahameed <saeedm@mellanox.com>; Moshe > > > Shemesh <moshe@mellanox.com> > > > Subject: Re: zero-copy between interfaces > > > > > > On Mon, Jan 27, 2020 at 3:01 PM Maxim Mikityanskiy <maximmi@mellanox.com> > > > wrote: > > > > > > > > On 2020-01-22 23:43, Ryan Goodfellow wrote: > > > > > On Tue, Jan 21, 2020 at 01:40:50PM +0000, Maxim Mikityanskiy wrote: > > > > >>>> I've posted output from the program in debugging mode here > > > > >>>> > > > > >>>> - https://gitlab.com/mergetb/tech/network- > > > emulation/kernel/snippets/1930375 > > > > >>>> > > > > >>>> Yes, you are correct in that forwarding works for a brief period and > > > then stops. > > > > >>>> I've noticed that the number of packets that are forwarded is equal > > > to the size > > > > >>>> of the producer/consumer descriptor rings. I've posted two ping > > > traces from a > > > > >>>> client ping that shows this. > > > > >>>> > > > > >>>> - https://gitlab.com/mergetb/tech/network- > > > emulation/kernel/snippets/1930376 > > > > >>>> - https://gitlab.com/mergetb/tech/network- > > > emulation/kernel/snippets/1930377 > > > > >> > > > > >> These snippets are not available. 
> > > > > > > > > > Apologies, I had the wrong permissions set. They should be available > > > now. > > > > > > > > > >> > > > > >>>> > > > > >>>> I've also noticed that when the forwarding stops, the CPU usage for > > > the proc > > > > >>>> running the program is pegged, which is not the norm for this program > > > as it uses > > > > >>>> a poll call with a timeout on the xsk fd. > > > > >> > > > > >> This information led me to a guess what may be happening. On the RX > > > > >> side, mlx5e allocates pages in bulks for performance reasons and to > > > > >> leverage hardware features targeted to performance. In AF_XDP mode, > > > > >> bulking of frames is also used (on x86, the bulk size is 64 with > > > > >> striding RQ enabled, and 8 otherwise, however, it's implementation > > > > >> details that might change later). If you don't put enough frames to XSK > > > > >> Fill Ring, the driver will be demanding more frames and return from > > > > >> poll() immediately. Basically, in the application, you should put as > > > > >> many frames to the Fill Ring as you can. Please check if that could be > > > > >> the root cause of your issue. > > > > > > > > > > The code in this application makes an effort to relenish the fill ring > > > as fast > > > > > as possible. The basic loop of the application is to first check if > > > there are > > > > > any descriptors to be consumed from the completion queue or any > > > descriptors that > > > > > can be added to the fill queue, and only then to move on to moving > > > packets > > > > > through the rx and tx rings. > > > > > > > > > > https://gitlab.com/mergetb/tech/network-emulation/kernel/blob/v5.5- > > > moa/samples/bpf/xdpsock_multidev.c#L452-474 > > > > > > > > I reproduced your issue and did my investigation, and here is what I > > > found: > > > > > > > > 1. Commit df0ae6f78a45 (that you found during bisect) introduces an > > > > important behavioral change (which I thought was not that important). 
> > > > xskq_nb_avail used to return min(entries, dcnt), but after the change it > > > > just returns entries, which may be as big as the ring size. > > > > > > > > 2. xskq_peek_addr updates q->ring->consumer only when q->cons_tail > > > > catches up with q->cons_head. So, before that patch and one previous > > > > patch, cons_head - cons_tail was not more than 16, so the consumer index > > > > was updated periodically. Now consumer is updated only when the whole > > > > ring is exhausted. > > > > > > > > 3. The application can't replenish the fill ring if the consumer index > > > > doesn't move. As a consequence, refilling the descriptors by the > > > > application can't happen in parallel with using them by the driver. It > > > > should have some performance penalty and possibly even lead to packet > > > > drops, because the driver uses all the descriptors and only then > > > > advances the consumer index, and then it has to wait until the > > > > application refills the ring, busy-looping and losing packets. > > > > > > This will happen if user space always fills up the whole ring, which > > > might or might not happen all depending on the app. > > > > Yes, that's right, and as far as I know, it's common to fill as many > > frames as the application can (there was no reason to do it other way > > till now). > > > > > With that said, it > > > might provide better performance to update it once in a while. It > > > might also be the case that we will get better performance with the > > > new scheme if we only fill half the fill ring. > > > > Yes, it may improve performance. However, I don't think it's correct to > > set such a limitation on the app. Actually, a much worse limitation to put on an application is to say that you have to have a certain amount of buffers on some ring for the zero-copy feature to work. For example that we need at least 64 buffers on the fill ring for all the NIC cards to work in zero-copy mode. 
That would be a bad thing to have to put in the documentation. An OS is supposed to abstract away HW differences, and with this current limitation in your driver, it shines through for sure. What we would like to put in the documentation is a statement along the lines of: "for best performance, make sure you have plenty of buffers on the fill ring so that the NIC can work as efficiently as possible". Not a statement that it does not work on Mellanox unless you put enough buffers on the fill ring. So my advice (and wish) is that you fix this in your driver. With that said, I will still look into what is the best way to get at least the sample to work for you. But there is no way to make sure every single app works for you in zero-copy mode, unless you support an arbitrary number of buffers on the fill ring. I guess that sooner or later, a customer of yours will get into this situation one way or the other, so why not fix it now? /Magnus > I will examine what provides the best performance. On one hand it is > the number of updates to shared ring state (minimized by current > scheme) and the ability for the user app to but buffers on the fill > ring. Stating that putting e.g. half the packets on the fill ring > provides better performance (for some application) is not a > limitation. It is performance tuning advise :-). > > > I will look into this > > > and see what I get. > > > > > > > 4. As mlx5e allocates frames in batches, the consequences are even more > > > > severe: it's a deadlock where the driver waits for the application, and > > > > vice versa. The driver never reaches the point where cons_tail gets > > > > equal to cons_head. E.g., if cons_tail + 3 == cons_head, and the batch > > > > size requested by the driver is 8, the driver won't peek anything from > > > > the fill ring waiting for difference between cons_tail and cons_head to > > > > increase to be at least 8.
On the other hand, the application can't put > > > > anything to the ring, because it still thinks that the consumer index is > > > > 0. As cons_tail never reaches cons_head, the consumer index doesn't get > > > > updated, hence the deadlock. > > > > > > Good thing that you detected this. Maybe I should get a Mellanox card > > > :-). This is different from how we implemented Intel's drivers that > > > just work on any batch size. If it gets 3 packets back, it will use > > > those. How do you deal with the case when the application just puts a > > > single buffer in the fill ring and wants to receive a single packet? > > > > mlx5e will wait until the full batch is available. As AF_XDP is intended > > for high-performance apps, this scenario is less expected. We prefer to > > leverage our performance features. > > That you cannot support all applications on top of AF_XDP with your > zero-copy support seems broken to me. But I give you that you might > support all the ones you care about with your current batching > support. Like you say, most apps will put plenty of buffers on the > fill ring, so this should not be a problem. Can you not implement some > slow path for these cases? You must have one for the skb case. > > > > Why does the Mellanox driver need a specific batch size? This is just > > > for my understanding so we can find a good solution. > > > > The main reason is our performance feature called striding RQ. Skipping > > all irrelevant details, a batch of 64 pages is posted to the NIC with a > > single request, and the NIC fills them progressively. This feature is > > turned on by default on modern NICs, and it's really good for > > performance. 
> > > > It might be possible to post a smaller batch though, I'm not sure about > > it, it needs to be checked, but anyway it's not something that we will > > likely do, because it's a complication of the data path, and if we know > > more frames are coming, it's much better for the performance to wait for > > them, rather than to post several incomplete batches. > > > > > > So, in my vision, the decision to remove RX_BATCH_SIZE and periodic > > > > updates of the consumer index was wrong. It totally breaks mlx5e, that > > > > does batching, and it will affect the performance of any driver, because > > > > the application can't refill the ring until it gets completely empty and > > > > the driver starts waiting for frames. I suggest that periodic updates of > > > > the consumer index should be readded to xskq_cons_peek_addr. > > > > > > The reason I wanted to remove RX_BATCH_SIZE is that application > > > developers complained about it giving rise to counter intuitive > > > behavior in user space. I will try to dig out the complaints and the > > > explanations Björn and I had to send which it seemed that users really > > > should not have to care about. It should just work. > > > > I think the counter that doesn't update till the very last moment and > > then advances by the ring size will also be something to complain about > > (and I am the first one to complain :D). Such bursts are not good in any > > case. > > Do you have any performance data that shows this scheme is bad for > performance? The good thing about this scheme is that global state is > updated less frequently. And the bad thing is what you mentioned. But > which one has the greatest effect, is the question. > > > > I also do not like > > > to have arbitrary constants like this. Why 16? > > > > I believe any batching mechanism has a constant that looks arbitrary. 
> > This constant should be the golden mean: if it's too small, there is > > little effect from batching; if it's too big, it gets too bursty. > > > > Basically, after your patch it just changed from 16 to the ring size. > > Maybe we can tie that constant to ring size? Make it ring_size / > > another_arbitrary_constant? :) > > Agreed, I thought about this too. Something tied to ring_size might be > a good compromise. Will examine this. But I want to base this on > performance data not idle speculation, so need to run some experiments > first. > > /Magnus > > > > Would much prefer not > > > having to deal with this, unless of course it horribly breaks > > > something or gives rise to worse performance. Might still be the case > > > here, but if not, I would like to remove it. > > > > > > Thanks: Magnus > > > > > > > Magnus, what do you think of the suggestion above? > > > > > > > > Thanks, > > > > Max > > > > > > > > >> > > > > >> I tracked this issue in our internal bug tracker in case we need to > > > > >> perform actual debugging of mlx5e. I'm looking forward to your feedback > > > > >> on my assumption above. > > > > >> > > > > >>>> The hardware I am using is a Mellanox ConnectX4 2x100G card (MCX416A- > > > CCAT) > > > > >>>> running the mlx5 driver. > > > > >> > > > > >> This one should run without striding RQ, please verify it with ethtool > > > > >> --show-priv-flags (the flag name is rx_striding_rq). > > > > > > > > > > I do not remember changing this option, so whatever the default is, is > > > what it > > > > > was running with. I am traveling this week and do not have access to > > > these > > > > > systems, but will ensure that this flag is set properly when I get back. > > > > > > > > > ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: zero-copy between interfaces 2020-01-30 11:40 ` Magnus Karlsson @ 2020-02-04 16:10 ` Magnus Karlsson 2020-02-05 13:31 ` Magnus Karlsson 0 siblings, 1 reply; 49+ messages in thread From: Magnus Karlsson @ 2020-02-04 16:10 UTC (permalink / raw) To: Maxim Mikityanskiy Cc: Ryan Goodfellow, xdp-newbies, Tariq Toukan, Saeed Mahameed, Moshe Shemesh On Thu, Jan 30, 2020 at 12:40 PM Magnus Karlsson <magnus.karlsson@gmail.com> wrote: > > On Thu, Jan 30, 2020 at 10:59 AM Magnus Karlsson > <magnus.karlsson@gmail.com> wrote: > > > > On Thu, Jan 30, 2020 at 10:37 AM Maxim Mikityanskiy > > <maximmi@mellanox.com> wrote: > > > > > > > -----Original Message----- > > > > From: Magnus Karlsson <magnus.karlsson@gmail.com> > > > > Sent: 27 January, 2020 17:55 > > > > To: Maxim Mikityanskiy <maximmi@mellanox.com> > > > > Cc: Ryan Goodfellow <rgoodfel@isi.edu>; xdp-newbies@vger.kernel.org; Tariq > > > > Toukan <tariqt@mellanox.com>; Saeed Mahameed <saeedm@mellanox.com>; Moshe > > > > Shemesh <moshe@mellanox.com> > > > > Subject: Re: zero-copy between interfaces > > > > > > > > On Mon, Jan 27, 2020 at 3:01 PM Maxim Mikityanskiy <maximmi@mellanox.com> > > > > wrote: > > > > > > > > > > On 2020-01-22 23:43, Ryan Goodfellow wrote: > > > > > > On Tue, Jan 21, 2020 at 01:40:50PM +0000, Maxim Mikityanskiy wrote: > > > > > >>>> I've posted output from the program in debugging mode here > > > > > >>>> > > > > > >>>> - https://gitlab.com/mergetb/tech/network- > > > > emulation/kernel/snippets/1930375 > > > > > >>>> > > > > > >>>> Yes, you are correct in that forwarding works for a brief period and > > > > then stops. > > > > > >>>> I've noticed that the number of packets that are forwarded is equal > > > > to the size > > > > > >>>> of the producer/consumer descriptor rings. I've posted two ping > > > > traces from a > > > > > >>>> client ping that shows this. 
> > > > > >>>> > > > > > >>>> - https://gitlab.com/mergetb/tech/network- > > > > emulation/kernel/snippets/1930376 > > > > > >>>> - https://gitlab.com/mergetb/tech/network- > > > > emulation/kernel/snippets/1930377 > > > > > >> > > > > > >> These snippets are not available. > > > > > > > > > > > > Apologies, I had the wrong permissions set. They should be available > > > > now. > > > > > > > > > > > >> > > > > > >>>> > > > > > >>>> I've also noticed that when the forwarding stops, the CPU usage for > > > > the proc > > > > > >>>> running the program is pegged, which is not the norm for this program > > > > as it uses > > > > > >>>> a poll call with a timeout on the xsk fd. > > > > > >> > > > > > >> This information led me to a guess what may be happening. On the RX > > > > > >> side, mlx5e allocates pages in bulks for performance reasons and to > > > > > >> leverage hardware features targeted to performance. In AF_XDP mode, > > > > > >> bulking of frames is also used (on x86, the bulk size is 64 with > > > > > >> striding RQ enabled, and 8 otherwise, however, it's implementation > > > > > >> details that might change later). If you don't put enough frames to XSK > > > > > >> Fill Ring, the driver will be demanding more frames and return from > > > > > >> poll() immediately. Basically, in the application, you should put as > > > > > >> many frames to the Fill Ring as you can. Please check if that could be > > > > > >> the root cause of your issue. > > > > > > > > > > > > The code in this application makes an effort to replenish the fill ring > > > > as fast > > > > > > as possible. The basic loop of the application is to first check if > > > > there are > > > > > > any descriptors to be consumed from the completion queue or any > > > > descriptors that > > > > > > can be added to the fill queue, and only then to move on to moving > > > > packets > > > > > > through the rx and tx rings. 
> > > > > > > > > > > > https://gitlab.com/mergetb/tech/network-emulation/kernel/blob/v5.5- > > > > moa/samples/bpf/xdpsock_multidev.c#L452-474 > > > > > > > > > > I reproduced your issue and did my investigation, and here is what I > > > > found: > > > > > > > > > > 1. Commit df0ae6f78a45 (that you found during bisect) introduces an > > > > > important behavioral change (which I thought was not that important). > > > > > xskq_nb_avail used to return min(entries, dcnt), but after the change it > > > > > just returns entries, which may be as big as the ring size. > > > > > > > > > > 2. xskq_peek_addr updates q->ring->consumer only when q->cons_tail > > > > > catches up with q->cons_head. So, before that patch and one previous > > > > > patch, cons_head - cons_tail was not more than 16, so the consumer index > > > > > was updated periodically. Now consumer is updated only when the whole > > > > > ring is exhausted. > > > > > > > > > > 3. The application can't replenish the fill ring if the consumer index > > > > > doesn't move. As a consequence, refilling the descriptors by the > > > > > application can't happen in parallel with using them by the driver. It > > > > > should have some performance penalty and possibly even lead to packet > > > > > drops, because the driver uses all the descriptors and only then > > > > > advances the consumer index, and then it has to wait until the > > > > > application refills the ring, busy-looping and losing packets. > > > > > > > > This will happen if user space always fills up the whole ring, which > > > > might or might not happen all depending on the app. > > > > > > Yes, that's right, and as far as I know, it's common to fill as many > > > frames as the application can (there was no reason to do it other way > > > till now). > > > > > > > With that said, it > > > > might provide better performance to update it once in a while. 
It > > > > might also be the case that we will get better performance with the > > > > new scheme if we only fill half the fill ring. > > > > > > Yes, it may improve performance. However, I don't think it's correct to > > > set such a limitation on the app. > > Actually, a much worse limitation to put on an application is to say > that you have to have a certain amount of buffers on some ring for the > zero-copy feature to work. For example that we need at least 64 > buffers on the fill ring for all the NIC cards to work in zero-copy > mode. That would be a bad thing to have to put in the documentation. > An OS is supposed to abstract away HW differences, and with this > current limitation in your driver, it shines through for sure. What we > would like to put in the documentation is a statement along the lines > of: "for best performance, make sure you have plenty of buffers on the > fill ring so that the NIC can work as efficiently as possible". Not a > statement that it does not work on Mellanox unless you put enough > buffers on the fill ring. So my advice (and wish) is that you fix this > in your driver. With that said, I will still look into what is the > best way to get at least the sample to work for you. But there is no > way to make sure every single app works for you in zero-copy mode, > unless you support arbitrary amount of buffers on the fill ring. I > guess that sooner or later, a customer of yours will get into this > situation one way or the other, so why not fix it now. > > /Magnus > > > I will examine what provides the best performance. On one hand it is > > the number of updates to shared ring state (minimized by current > > scheme) and the ability for the user app to put buffers on the fill > > ring. Stating that putting e.g. half the packets on the fill ring > > provides better performance (for some application) is not a > > limitation. It is performance tuning advice :-). I have now made a set of measurements. 
First I just made a variation study using the xdpsock app, varying the amount of packets the kernel can withdraw from a consumer ring (fill and Tx) before updating global state. For the 1 core case (app and driver on the same core) the more frequently you do this update, the better. The reason for this is that it costs very little to update the state since the application is not running. And it is beneficial for the app to have a freshly updated state when it starts to execute as it can operate on more packets. For the 2 core case (app on one core, driver on another) it is the complete opposite: the fewer updates to global state, the better. The reason for this is that it costs a lot to update global state as it triggers cache coherency actions between the two cores. What I did then was to compare the current scheme, update only when grabbing new packets, to a new scheme where we also update the global consumer pointer when we are exiting Rx or Tx processing in the NAPI context. On two cores the current scheme gets 0.5 to 1 Mpps more in throughput than also updating the pointer at the end of NAPI. But for the 1 core case, the new scheme is the best and generates between 0.2 and 0.3 Mpps more in throughput than the current one. But all in all, the current scheme is more beneficial than the proposed one if we say that both the 1 core and the 2 core cases are equally important. Something to note is that the xdpsock application only puts batch size (64) of packets in the fill ring in every iteration, and this might lead to some good pacing for the current scheme and the 2 core case. I.e., we do not get into the case of the fill ring only being full or empty. But I will run this on some real apps to get some more results, and I know that Björn has an optimized xdpsock application that puts many more packets into the fill queue than 64. This optimization might actually make the new proposal (also updating at the end of NAPI) be better and make the current scheme suffer. 
We will examine this further and get back. /Magnus > > > > I will look into this > > > > and see what I get. > > > > > > > > > 4. As mlx5e allocates frames in batches, the consequences are even more > > > > > severe: it's a deadlock where the driver waits for the application, and > > > > > vice versa. The driver never reaches the point where cons_tail gets > > > > > equal to cons_head. E.g., if cons_tail + 3 == cons_head, and the batch > > > > > size requested by the driver is 8, the driver won't peek anything from > > > > > the fill ring waiting for difference between cons_tail and cons_head to > > > > > increase to be at least 8. On the other hand, the application can't put > > > > > anything to the ring, because it still thinks that the consumer index is > > > > > 0. As cons_tail never reaches cons_head, the consumer index doesn't get > > > > > updated, hence the deadlock. > > > > > > > > Good thing that you detected this. Maybe I should get a Mellanox card > > > > :-). This is different from how we implemented Intel's drivers that > > > > just work on any batch size. If it gets 3 packets back, it will use > > > > those. How do you deal with the case when the application just puts a > > > > single buffer in the fill ring and wants to receive a single packet? > > > > > > mlx5e will wait until the full batch is available. As AF_XDP is intended > > > for high-performance apps, this scenario is less expected. We prefer to > > > leverage our performance features. > > > > That you cannot support all applications on top of AF_XDP with your > > zero-copy support seems broken to me. But I give you that you might > > support all the ones you care about with your current batching > > support. Like you say, most apps will put plenty of buffers on the > > fill ring, so this should not be a problem. Can you not implement some > > slow path for these cases? You must have one for the skb case. > > > > > > Why does the Mellanox driver need a specific batch size? 
This is just > > > > for my understanding so we can find a good solution. > > > > > > The main reason is our performance feature called striding RQ. Skipping > > > all irrelevant details, a batch of 64 pages is posted to the NIC with a > > > single request, and the NIC fills them progressively. This feature is > > > turned on by default on modern NICs, and it's really good for > > > performance. > > > > > > It might be possible to post a smaller batch though, I'm not sure about > > > it, it needs to be checked, but anyway it's not something that we will > > > likely do, because it's a complication of the data path, and if we know > > > more frames are coming, it's much better for the performance to wait for > > > them, rather than to post several incomplete batches. > > > > > > > > So, in my vision, the decision to remove RX_BATCH_SIZE and periodic > > > > > updates of the consumer index was wrong. It totally breaks mlx5e, that > > > > > does batching, and it will affect the performance of any driver, because > > > > > the application can't refill the ring until it gets completely empty and > > > > > the driver starts waiting for frames. I suggest that periodic updates of > > > > > the consumer index should be readded to xskq_cons_peek_addr. > > > > > > > > The reason I wanted to remove RX_BATCH_SIZE is that application > > > > developers complained about it giving rise to counter intuitive > > > > behavior in user space. I will try to dig out the complaints and the > > > > explanations Björn and I had to send which it seemed that users really > > > > should not have to care about. It should just work. > > > > > > I think the counter that doesn't update till the very last moment and > > > then advances by the ring size will also be something to complain about > > > (and I am the first one to complain :D). Such bursts are not good in any > > > case. > > > > Do you have any performance data that shows this scheme is bad for > > performance? 
The good thing about this scheme is that global state is > > updated less frequently. And the bad thing is what you mentioned. But > > which one has the greatest effect, is the question. > > > > > > I also do not like > > > > to have arbitrary constants like this. Why 16? > > > > > > I believe any batching mechanism has a constant that looks arbitrary. > > > This constant should be the golden mean: if it's too small, there is > > > little effect from batching; if it's too big, it gets too bursty. > > > > > > Basically, after your patch it just changed from 16 to the ring size. > > > Maybe we can tie that constant to ring size? Make it ring_size / > > > another_arbitrary_constant? :) > > > > Agreed, I thought about this too. Something tied to ring_size might be > > a good compromise. Will examine this. But I want to base this on > > performance data not idle speculation, so need to run some experiments > > first. > > > > /Magnus > > > > > > Would much prefer not > > > > having to deal with this, unless of course it horribly breaks > > > > something or gives rise to worse performance. Might still be the case > > > > here, but if not, I would like to remove it. > > > > > > > > Thanks: Magnus > > > > > > > > > Magnus, what do you think of the suggestion above? > > > > > > > > > > Thanks, > > > > > Max > > > > > > > > > > >> > > > > > >> I tracked this issue in our internal bug tracker in case we need to > > > > > >> perform actual debugging of mlx5e. I'm looking forward to your feedback > > > > > >> on my assumption above. > > > > > >> > > > > > >>>> The hardware I am using is a Mellanox ConnectX4 2x100G card (MCX416A- > > > > CCAT) > > > > > >>>> running the mlx5 driver. > > > > > >> > > > > > >> This one should run without striding RQ, please verify it with ethtool > > > > > >> --show-priv-flags (the flag name is rx_striding_rq). > > > > > > > > > > > > I do not remember changing this option, so whatever the default is, is > > > > what it > > > > > > was running with. 
I am traveling this week and do not have access to > > > > these > > > > > > systems, but will ensure that this flag is set properly when I get back. > > > > > > > > > > > ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: zero-copy between interfaces 2020-02-04 16:10 ` Magnus Karlsson @ 2020-02-05 13:31 ` Magnus Karlsson 2020-02-06 14:56 ` Maxim Mikityanskiy 0 siblings, 1 reply; 49+ messages in thread From: Magnus Karlsson @ 2020-02-05 13:31 UTC (permalink / raw) To: Maxim Mikityanskiy Cc: Ryan Goodfellow, xdp-newbies, Tariq Toukan, Saeed Mahameed, Moshe Shemesh On Tue, Feb 4, 2020 at 5:10 PM Magnus Karlsson <magnus.karlsson@gmail.com> wrote: > > On Thu, Jan 30, 2020 at 12:40 PM Magnus Karlsson > <magnus.karlsson@gmail.com> wrote: > > > > On Thu, Jan 30, 2020 at 10:59 AM Magnus Karlsson > > <magnus.karlsson@gmail.com> wrote: > > > > > > On Thu, Jan 30, 2020 at 10:37 AM Maxim Mikityanskiy > > > <maximmi@mellanox.com> wrote: > > > > > > > > > -----Original Message----- > > > > > From: Magnus Karlsson <magnus.karlsson@gmail.com> > > > > > Sent: 27 January, 2020 17:55 > > > > > To: Maxim Mikityanskiy <maximmi@mellanox.com> > > > > > Cc: Ryan Goodfellow <rgoodfel@isi.edu>; xdp-newbies@vger.kernel.org; Tariq > > > > > Toukan <tariqt@mellanox.com>; Saeed Mahameed <saeedm@mellanox.com>; Moshe > > > > > Shemesh <moshe@mellanox.com> > > > > > Subject: Re: zero-copy between interfaces > > > > > > > > > > On Mon, Jan 27, 2020 at 3:01 PM Maxim Mikityanskiy <maximmi@mellanox.com> > > > > > wrote: > > > > > > > > > > > > On 2020-01-22 23:43, Ryan Goodfellow wrote: > > > > > > > On Tue, Jan 21, 2020 at 01:40:50PM +0000, Maxim Mikityanskiy wrote: > > > > > > >>>> I've posted output from the program in debugging mode here > > > > > > >>>> > > > > > > >>>> - https://gitlab.com/mergetb/tech/network- > > > > > emulation/kernel/snippets/1930375 > > > > > > >>>> > > > > > > >>>> Yes, you are correct in that forwarding works for a brief period and > > > > > then stops. > > > > > > >>>> I've noticed that the number of packets that are forwarded is equal > > > > > to the size > > > > > > >>>> of the producer/consumer descriptor rings. 
I've posted two ping > > > > > traces from a > > > > > > >>>> client ping that shows this. > > > > > > >>>> > > > > > > >>>> - https://gitlab.com/mergetb/tech/network- > > > > > emulation/kernel/snippets/1930376 > > > > > > >>>> - https://gitlab.com/mergetb/tech/network- > > > > > emulation/kernel/snippets/1930377 > > > > > > >> > > > > > > >> These snippets are not available. > > > > > > > > > > > > > > Apologies, I had the wrong permissions set. They should be available > > > > > now. > > > > > > > > > > > > > >> > > > > > > >>>> > > > > > > >>>> I've also noticed that when the forwarding stops, the CPU usage for > > > > > the proc > > > > > > >>>> running the program is pegged, which is not the norm for this program > > > > > as it uses > > > > > > >>>> a poll call with a timeout on the xsk fd. > > > > > > >> > > > > > > >> This information led me to a guess what may be happening. On the RX > > > > > > >> side, mlx5e allocates pages in bulks for performance reasons and to > > > > > > >> leverage hardware features targeted to performance. In AF_XDP mode, > > > > > > >> bulking of frames is also used (on x86, the bulk size is 64 with > > > > > > >> striding RQ enabled, and 8 otherwise, however, it's implementation > > > > > > >> details that might change later). If you don't put enough frames to XSK > > > > > > >> Fill Ring, the driver will be demanding more frames and return from > > > > > > >> poll() immediately. Basically, in the application, you should put as > > > > > > >> many frames to the Fill Ring as you can. Please check if that could be > > > > > > >> the root cause of your issue. > > > > > > > > > > > > > > The code in this application makes an effort to replenish the fill ring > > > > > as fast > > > > > > > as possible. 
The basic loop of the application is to first check if > > > > > there are > > > > > > > any descriptors to be consumed from the completion queue or any > > > > > descriptors that > > > > > > > can be added to the fill queue, and only then to move on to moving > > > > > packets > > > > > > > through the rx and tx rings. > > > > > > > > > > > > > > https://gitlab.com/mergetb/tech/network-emulation/kernel/blob/v5.5- > > > > > moa/samples/bpf/xdpsock_multidev.c#L452-474 > > > > > > > > > > > > I reproduced your issue and did my investigation, and here is what I > > > > > found: > > > > > > > > > > > > 1. Commit df0ae6f78a45 (that you found during bisect) introduces an > > > > > > important behavioral change (which I thought was not that important). > > > > > > xskq_nb_avail used to return min(entries, dcnt), but after the change it > > > > > > just returns entries, which may be as big as the ring size. > > > > > > > > > > > > 2. xskq_peek_addr updates q->ring->consumer only when q->cons_tail > > > > > > catches up with q->cons_head. So, before that patch and one previous > > > > > > patch, cons_head - cons_tail was not more than 16, so the consumer index > > > > > > was updated periodically. Now consumer is updated only when the whole > > > > > > ring is exhausted. > > > > > > > > > > > > 3. The application can't replenish the fill ring if the consumer index > > > > > > doesn't move. As a consequence, refilling the descriptors by the > > > > > > application can't happen in parallel with using them by the driver. It > > > > > > should have some performance penalty and possibly even lead to packet > > > > > > drops, because the driver uses all the descriptors and only then > > > > > > advances the consumer index, and then it has to wait until the > > > > > > application refills the ring, busy-looping and losing packets. > > > > > > > > > > This will happen if user space always fills up the whole ring, which > > > > > might or might not happen all depending on the app. 
> > > > > > > > Yes, that's right, and as far as I know, it's common to fill as many > > > > frames as the application can (there was no reason to do it other way > > > > till now). > > > > > > > > > With that said, it > > > > > might provide better performance to update it once in a while. It > > > > > might also be the case that we will get better performance with the > > > > > new scheme if we only fill half the fill ring. > > > > > > > > Yes, it may improve performance. However, I don't think it's correct to > > > > set such a limitation on the app. > > > > Actually, a much worse limitation to put on an application is to say > > that you have to have a certain amount of buffers on some ring for the > > zero-copy feature to work. For example that we need at least 64 > > buffers on the fill ring for all the NIC cards to work in zero-copy > > mode. That would be a bad thing to have to put in the documentation. > > An OS is supposed to abstract away HW differences, and with this > > current limitation in your driver, it shines through for sure. What we > > would like to put in the documentation is a statement along the lines > > of: "for best performance, make sure you have plenty of buffers on the > > fill ring so that the NIC can work as efficiently as possible". Not a > > statement that it does not work on Mellanox unless you put enough > > buffers on the fill ring. So my advice (and wish) is that you fix this > > in your driver. With that said, I will still look into what is the > > best way to get at least the sample to work for you. But there is no > > way to make sure every single app works for you in zero-copy mode, > > unless you support arbitrary amount of buffers on the fill ring. I > > guess that sooner or later, a customer of yours will get into this > > situation one way or the other, so why not fix it now. > > > > /Magnus > > > > > I will examine what provides the best performance. 
On one hand it is > > > the number of updates to shared ring state (minimized by current > > > scheme) and the ability for the user app to put buffers on the fill > > > ring. Stating that putting e.g. half the packets on the fill ring > > > provides better performance (for some application) is not a > > > limitation. It is performance tuning advice :-). > > I have now made a set of measurements. First I just made a variation > study using the xdpsock app, varying the amount of packets the kernel > can withdraw from a consumer ring (fill and Tx) before updating global > state. For the 1 core case (app and driver on the same core) the more > frequently you do this update, the better. The reason for this is that > it costs very little to update the state since the application is not > running. And it is beneficial for the app to have a freshly updated > state when it starts to execute as it can operate on more packets. For > the 2 core case (app on one core, driver on another) it is the > complete opposite: the fewer updates to global state, the better. The > reason for this is that it costs a lot to update global state as it > triggers cache coherency actions between the two cores. > > What I did then was to compare the current scheme, update only when > grabbing new packets, to a new scheme where we also update the global > consumer pointer when we are exiting Rx or Tx processing in the NAPI > context. On two cores the current scheme gets 0.5 to 1 Mpps more in > throughput than also updating the pointer at the end of NAPI. But for > the 1 core case, the new scheme is the best and generates between 0.2 and > 0.3 Mpps more in throughput than the current one. But all in all, the > current scheme is more beneficial than the proposed one if we say that > both the 1 core and the 2 core cases are equally important. 
> > Something to note is that the xdpsock application only puts batch size > (64) of packets in the fill ring in every iteration, and this might > lead to some good pacing for the current scheme and the 2 core case. > I.e., we do not get into the case of the fill ring only being full or > empty. But I will run this on some real apps to get some more results, > and I know that Björn has an optimized xdpsock application that puts > many more packets into the fill queue than 64. This optimization might > actually make the new proposal (also updating at the end of NAPI) be > better and make the current scheme suffer. We will examine this > further and get back. Actually, after some more thought and discussions I think we should optimize for the 1 core case, since that is what gives the whole system the best performance, provided that you can scale your application with instantiation that is. For a 4 core system, 4 x the 1 core performance > 2 x 2 core performance by a lot. I think that the 1 core case is the one that is going to be used out there. At least that is what I hear and see. So, when the merge window opens I am going to submit a patch that updates the consumer pointer when we exit NAPI too. Will increase the performance of the 1 core case. /Magnus > /Magnus > > > > > > I will look into this > > > > > and see what I get. > > > > > > > > > > > 4. As mlx5e allocates frames in batches, the consequences are even more > > > > > > severe: it's a deadlock where the driver waits for the application, and > > > > > > vice versa. The driver never reaches the point where cons_tail gets > > > > > > equal to cons_head. E.g., if cons_tail + 3 == cons_head, and the batch > > > > > > size requested by the driver is 8, the driver won't peek anything from > > > > > > the fill ring waiting for difference between cons_tail and cons_head to > > > > > > increase to be at least 8. 
On the other hand, the application can't put > > > > > > anything to the ring, because it still thinks that the consumer index is > > > > > > 0. As cons_tail never reaches cons_head, the consumer index doesn't get > > > > > > updated, hence the deadlock. > > > > > > > > > > Good thing that you detected this. Maybe I should get a Mellanox card > > > > > :-). This is different from how we implemented Intel's drivers that > > > > > just work on any batch size. If it gets 3 packets back, it will use > > > > > those. How do you deal with the case when the application just puts a > > > > > single buffer in the fill ring and wants to receive a single packet? > > > > > > > > mlx5e will wait until the full batch is available. As AF_XDP is intended > > > > for high-performance apps, this scenario is less expected. We prefer to > > > > leverage our performance features. > > > > > > That you cannot support all applications on top of AF_XDP with your > > > zero-copy support seems broken to me. But I give you that you might > > > support all the ones you care about with your current batching > > > support. Like you say, most apps will put plenty of buffers on the > > > fill ring, so this should not be a problem. Can you not implement some > > > slow path for these cases? You must have one for the skb case. > > > > > > > > Why does the Mellanox driver need a specific batch size? This is just > > > > > for my understanding so we can find a good solution. > > > > > > > > The main reason is our performance feature called striding RQ. Skipping > > > > all irrelevant details, a batch of 64 pages is posted to the NIC with a > > > > single request, and the NIC fills them progressively. This feature is > > > > turned on by default on modern NICs, and it's really good for > > > > performance. 
> > > > > > > > It might be possible to post a smaller batch though, I'm not sure about > > > > it, it needs to be checked, but anyway it's not something that we will > > > > likely do, because it's a complication of the data path, and if we know > > > > more frames are coming, it's much better for the performance to wait for > > > > them, rather than to post several incomplete batches. > > > > > > > > > > So, in my vision, the decision to remove RX_BATCH_SIZE and periodic > > > > > > updates of the consumer index was wrong. It totally breaks mlx5e, which > > > > > > does batching, and it will affect the performance of any driver, because > > > > > > the application can't refill the ring until it gets completely empty and > > > > > > the driver starts waiting for frames. I suggest that periodic updates of > > > > > > the consumer index should be re-added to xskq_cons_peek_addr. > > > > > > > > > > The reason I wanted to remove RX_BATCH_SIZE is that application > > > > > developers complained about it giving rise to counterintuitive > > > > > behavior in user space. I will try to dig out the complaints and the > > > > > explanations Björn and I had to send, which it seemed that users really > > > > > should not have to care about. It should just work. > > > > > > > > I think the counter that doesn't update till the very last moment and > > > > then advances by the ring size will also be something to complain about > > > > (and I am the first one to complain :D). Such bursts are not good in any > > > > case. > > > Do you have any performance data that shows this scheme is bad for > > > performance? The good thing about this scheme is that global state is > > > updated less frequently. And the bad thing is what you mentioned. But > > > which one has the greatest effect, is the question. > > > > > > > > I also do not like > > > > > to have arbitrary constants like this. Why 16? > > > > I believe any batching mechanism has a constant that looks arbitrary. 
> > > > This constant should be the golden mean: if it's too small, there is > > > > little effect from batching; if it's too big, it gets too bursty. > > > > > > > > Basically, after your patch it just changed from 16 to the ring size. > > > > Maybe we can tie that constant to ring size? Make it ring_size / > > > > another_arbitrary_constant? :) > > > > > > Agreed, I thought about this too. Something tied to ring_size might be > > > a good compromise. Will examine this. But I want to base this on > > > performance data not idle speculation, so need to run some experiments > > > first. > > > > > > /Magnus > > > > > > > > Would much prefer not > > > > > having to deal with this, unless of course it horribly breaks > > > > > something or gives rise to worse performance. Might still be the case > > > > > here, but if not, I would like to remove it. > > > > > > > > > > Thanks: Magnus > > > > > > > > > > > Magnus, what do you think of the suggestion above? > > > > > > > > > > > > Thanks, > > > > > > Max > > > > > > > > > > > > >> > > > > > > >> I tracked this issue in our internal bug tracker in case we need to > > > > > > >> perform actual debugging of mlx5e. I'm looking forward to your feedback > > > > > > >> on my assumption above. > > > > > > >> > > > > > > >>>> The hardware I am using is a Mellanox ConnectX4 2x100G card (MCX416A- > > > > > CCAT) > > > > > > >>>> running the mlx5 driver. > > > > > > >> > > > > > > >> This one should run without striding RQ, please verify it with ethtool > > > > > > >> --show-priv-flags (the flag name is rx_striding_rq). > > > > > > > > > > > > > > I do not remember changing this option, so whatever the default is, is > > > > > what it > > > > > > > was running with. I am traveling this week and do not have access to > > > > > these > > > > > > > systems, but will ensure that this flag is set properly when I get back. > > > > > > > > > > > > > ^ permalink raw reply [flat|nested] 49+ messages in thread
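[Editor's note] The deadlock described in the quoted exchange above (cons_tail + 3 == cons_head, driver batch of 8, shared consumer index stuck at 0) can be reproduced with a small C model. This is a sketch of the mechanism only; the field names follow the mail, not the actual mlx5e or xsk_queue implementation.

```c
#include <assert.h>

#define RING_SIZE 8

struct ring {
    unsigned int producer;   /* shared: written by the application */
    unsigned int consumer;   /* shared: published back by the kernel */
};

struct kernel_q {
    unsigned int cons_tail;  /* entries actually consumed by the driver */
    unsigned int cons_head;  /* entries known to be available */
};

/* Driver side: a batching driver peeks 'batch' descriptors at once or
 * nothing at all, as described for striding RQ above. */
static unsigned int peek_batch(struct ring *r, struct kernel_q *q,
                               unsigned int batch)
{
    if (q->cons_head - q->cons_tail < batch)
        q->cons_head = r->producer;      /* refresh from shared state */
    if (q->cons_head - q->cons_tail < batch)
        return 0;                        /* wait for a full batch */
    return batch;
}

/* Application side: the ring looks full because the shared consumer
 * index only moves when cons_tail catches up with cons_head, which
 * never happens while the driver is waiting for a full batch. */
static int app_can_post(const struct ring *r)
{
    return (r->producer - r->consumer) < RING_SIZE;
}
```

With the ring full (producer == RING_SIZE, consumer == 0) and only 3 entries between cons_tail and cons_head, `peek_batch` returns 0 and `app_can_post` returns 0: the driver waits for the application and the application waits for the driver.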
* Re: zero-copy between interfaces 2020-02-05 13:31 ` Magnus Karlsson @ 2020-02-06 14:56 ` Maxim Mikityanskiy 2020-02-07 9:01 ` Magnus Karlsson 0 siblings, 1 reply; 49+ messages in thread From: Maxim Mikityanskiy @ 2020-02-06 14:56 UTC (permalink / raw) To: Magnus Karlsson Cc: Ryan Goodfellow, xdp-newbies, Tariq Toukan, Saeed Mahameed, Moshe Shemesh On 2020-02-05 15:31, Magnus Karlsson wrote: > On Tue, Feb 4, 2020 at 5:10 PM Magnus Karlsson > <magnus.karlsson@gmail.com> wrote: >> >> On Thu, Jan 30, 2020 at 12:40 PM Magnus Karlsson >> <magnus.karlsson@gmail.com> wrote: >>> >>> On Thu, Jan 30, 2020 at 10:59 AM Magnus Karlsson >>> <magnus.karlsson@gmail.com> wrote: >>>> >>>> On Thu, Jan 30, 2020 at 10:37 AM Maxim Mikityanskiy >>>> <maximmi@mellanox.com> wrote: >>>>> >>>>>> -----Original Message----- >>>>>> From: Magnus Karlsson <magnus.karlsson@gmail.com> >>>>>> Sent: 27 January, 2020 17:55 >>>>>> To: Maxim Mikityanskiy <maximmi@mellanox.com> >>>>>> Cc: Ryan Goodfellow <rgoodfel@isi.edu>; xdp-newbies@vger.kernel.org; Tariq >>>>>> Toukan <tariqt@mellanox.com>; Saeed Mahameed <saeedm@mellanox.com>; Moshe >>>>>> Shemesh <moshe@mellanox.com> >>>>>> Subject: Re: zero-copy between interfaces >>>>>> >>>>>> On Mon, Jan 27, 2020 at 3:01 PM Maxim Mikityanskiy <maximmi@mellanox.com> >>>>>> wrote: >>>>>>> >>>>>>> On 2020-01-22 23:43, Ryan Goodfellow wrote: >>>>>>>> On Tue, Jan 21, 2020 at 01:40:50PM +0000, Maxim Mikityanskiy wrote: >>>>>>>>>>> I've posted output from the program in debugging mode here >>>>>>>>>>> >>>>>>>>>>> - https://gitlab.com/mergetb/tech/network- >>>>>> emulation/kernel/snippets/1930375 >>>>>>>>>>> >>>>>>>>>>> Yes, you are correct in that forwarding works for a brief period and >>>>>> then stops. >>>>>>>>>>> I've noticed that the number of packets that are forwarded is equal >>>>>> to the size >>>>>>>>>>> of the producer/consumer descriptor rings. I've posted two ping >>>>>> traces from a >>>>>>>>>>> client ping that shows this. 
>>>>>>>>>>> >>>>>>>>>>> - https://gitlab.com/mergetb/tech/network- >>>>>> emulation/kernel/snippets/1930376 >>>>>>>>>>> - https://gitlab.com/mergetb/tech/network- >>>>>> emulation/kernel/snippets/1930377 >>>>>>>>> >>>>>>>>> These snippets are not available. >>>>>>>> >>>>>>>> Apologies, I had the wrong permissions set. They should be available >>>>>> now. >>>>>>>> >>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> I've also noticed that when the forwarding stops, the CPU usage for >>>>>> the proc >>>>>>>>>>> running the program is pegged, which is not the norm for this program >>>>>> as it uses >>>>>>>>>>> a poll call with a timeout on the xsk fd. >>>>>>>>> >>>>>>>>> This information led me to a guess about what may be happening. On the RX >>>>>>>>> side, mlx5e allocates pages in bulk for performance reasons and to >>>>>>>>> leverage hardware features targeted to performance. In AF_XDP mode, >>>>>>>>> bulking of frames is also used (on x86, the bulk size is 64 with >>>>>>>>> striding RQ enabled, and 8 otherwise; however, these are implementation >>>>>>>>> details that might change later). If you don't put enough frames to the XSK >>>>>>>>> Fill Ring, the driver will be demanding more frames and return from >>>>>>>>> poll() immediately. Basically, in the application, you should put as >>>>>>>>> many frames to the Fill Ring as you can. Please check if that could be >>>>>>>>> the root cause of your issue. >>>>>>>> >>>>>>>> The code in this application makes an effort to replenish the fill ring >>>>>> as fast >>>>>>>> as possible. The basic loop of the application is to first check if >>>>>> there are >>>>>>>> any descriptors to be consumed from the completion queue or any >>>>>> descriptors that >>>>>>>> can be added to the fill queue, and only then to move on to moving >>>>>> packets >>>>>>>> through the rx and tx rings. 
>>>>>>>> >>>>>>>> https://gitlab.com/mergetb/tech/network-emulation/kernel/blob/v5.5- >>>>>> moa/samples/bpf/xdpsock_multidev.c#L452-474 >>>>>>> >>>>>>> I reproduced your issue and did my investigation, and here is what I >>>>>> found: >>>>>>> >>>>>>> 1. Commit df0ae6f78a45 (that you found during bisect) introduces an >>>>>>> important behavioral change (which I thought was not that important). >>>>>>> xskq_nb_avail used to return min(entries, dcnt), but after the change it >>>>>>> just returns entries, which may be as big as the ring size. >>>>>>> >>>>>>> 2. xskq_peek_addr updates q->ring->consumer only when q->cons_tail >>>>>>> catches up with q->cons_head. So, before that patch and one previous >>>>>>> patch, cons_head - cons_tail was not more than 16, so the consumer index >>>>>>> was updated periodically. Now consumer is updated only when the whole >>>>>>> ring is exhausted. >>>>>>> >>>>>>> 3. The application can't replenish the fill ring if the consumer index >>>>>>> doesn't move. As a consequence, refilling the descriptors by the >>>>>>> application can't happen in parallel with using them by the driver. It >>>>>>> should have some performance penalty and possibly even lead to packet >>>>>>> drops, because the driver uses all the descriptors and only then >>>>>>> advances the consumer index, and then it has to wait until the >>>>>>> application refills the ring, busy-looping and losing packets. >>>>>> >>>>>> This will happen if user space always fills up the whole ring, which >>>>>> might or might not happen all depending on the app. >>>>> >>>>> Yes, that's right, and as far as I know, it's common to fill as many >>>>> frames as the application can (there was no reason to do it other way >>>>> till now). >>>>> >>>>>> With that said, it >>>>>> might provide better performance to update it once in a while. It >>>>>> might also be the case that we will get better performance with the >>>>>> new scheme if we only fill half the fill ring. 
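[Editor's note] The behavioral change attributed to commit df0ae6f78a45 above can be summarized in C. This is a simplified sketch, not the kernel's `xskq_nb_avail`: before the change, the number of available entries was capped at `RX_BATCH_SIZE` (16), so `cons_head` never ran more than 16 ahead of `cons_tail` and the shared consumer index was published at least every 16 frames; after it, the count can grow to the ring size, so the index moves only when the ring is fully drained.

```c
#include <assert.h>

#define RX_BATCH_SIZE 16   /* the "arbitrary" 16 debated in the thread */

/* Old behavior: min(entries, RX_BATCH_SIZE). */
static unsigned int nb_avail_old(unsigned int producer, unsigned int cons_tail)
{
    unsigned int entries = producer - cons_tail;

    return entries < RX_BATCH_SIZE ? entries : RX_BATCH_SIZE;
}

/* New behavior: just the raw entry count, possibly the whole ring. */
static unsigned int nb_avail_new(unsigned int producer, unsigned int cons_tail)
{
    return producer - cons_tail;
}
```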
>>>>> Yes, it may improve performance. However, I don't think it's correct to >>>>> set such a limitation on the app. >>> >>> Actually, a much worse limitation to put on an application is to say >>> that you have to have a certain amount of buffers on some ring for the >>> zero-copy feature to work. For example that we need at least 64 >>> buffers on the fill ring for all the NIC cards to work in zero-copy >>> mode. That would be a bad thing to have to put in the documentation. >>> An OS is supposed to abstract away HW differences, and with this >>> current limitation in your driver, it shines through for sure. What we >>> would like to put in the documentation is a statement along the lines >>> of: "for best performance, make sure you have plenty of buffers on the >>> fill ring so that the NIC can work as efficiently as possible". Not a >>> statement that it does not work on Mellanox unless you put enough >>> buffers on the fill ring. So my advice (and wish) is that you fix this >>> in your driver. With that said, I will still look into what is the >>> best way to get at least the sample to work for you. But there is no >>> way to make sure every single app works for you in zero-copy mode, >>> unless you support an arbitrary amount of buffers on the fill ring. I >>> guess that sooner or later, a customer of yours will get into this >>> situation one way or the other, so why not fix it now. Hi Magnus, We had an internal discussion about batching and AF_XDP in mlx5e. There are two types of RX queues supported by mlx5e: striding RQ and legacy RQ. Which type of RQ is used depends on the configuration, hardware support and defaults for different NICs, but basically in cases when striding RQ is enabled by default, it's faster than legacy RQ, and this is the case for modern NICs. All RX queues created in the driver are of the same type. 
Striding RQ has a requirement of allocating in batches, and the batch size is specified on queue creation, so there is no fallback possible for this case. Legacy RQ, on the other hand, does not require batching in XDP use cases, but now we do it anyway for performance reasons and for code unification with non-XDP queues. I understand your concern that the current API doesn't provide a completely opaque abstraction over the driver. However, we can't just throw away an important performance feature (striding RQ) to support some exotic case of a fill ring with a single frame only. AF_XDP is a framework for high-performance applications, so it's extremely unlikely that an AF_XDP application will only need to receive a single packet. Such applications just won't need AF_XDP. So, given that the issue can't be fixed without disabling striding RQ, and disabling striding RQ will just reduce the performance of all real-world applications, we decided to keep things as is for now, and if a customer complains about it, we will suggest that they disable striding RQ in their configuration, and we'll consider an option of disabling batching in legacy RQ for AF_XDP. BTW, if the current API can't provide a good enough abstraction over advanced features of mlx5e, maybe we should extend the API somehow? E.g., when need_wakeup for RX goes to "yes", also tell how many frames need to be refilled? >>> /Magnus >>> >>>> I will examine what provides the best performance. On one hand it is >>>> the number of updates to shared ring state (minimized by current >>>> scheme) and the ability for the user app to but buffers on the fill >>>> ring. Stating that putting e.g. half the packets on the fill ring >>>> provides better performance (for some application) is not a >>>> limitation. It is performance tuning advise :-). >> >> I have now made a set of measurements. 
First I just made a variation >> study using the xdpsock app, varying the amount of packets the kernel >> can withdraw from a consumer ring (fill and Tx) before updating global >> state. For the 1 core case (app and driver on the same core) the more >> frequent you do this update, the better. The reason for this is that >> it costs very little to update the state since the application is not >> running. And it is beneficial for the app to have a freshly updated >> state when it starts to execute as it can operate on more packets. For >> the 2 core case (app on one core, driver on another) it is the >> complete opposite: the fewer updates to global state, the better. The >> reason for this is that it costs a lot to update global state as it >> triggers cache coherency actions between the two cores. >> >> What I did then was to compare the current scheme, update only when >> grabbing new packets, to a new scheme were we also update the global >> consumer pointer when we are exiting Rx or Tx processing in the NAPI >> context. On two cores the current scheme gets 0.5 to 1 Mpps more in >> throughput than also updating the pointer at the end of NAPI. But for >> 1 core case, the new scheme is the best and generates between 0.2 and >> 0.3 Mpps more in throughput than the current one. But all in all, the >> current scheme is more beneficial than the proposed one if we say that >> both the 1 core and the 2 core case is equally important. >> >> Something to note is that the xdpsock application only puts batch size >> (64) of packets in the fill ring in every iteration, and this might >> lead to some good pacing for the current scheme and the 2 core case. >> I.e., we do not get into the case of the fill ring only being full or >> empty. But I will run this on some real apps to get some more results, >> and I know that Björn has an optimized xdpsock application that puts >> many more packets into the fill queue than 64. 
This optimization might >> actually make the new proposal (also updating at the end of NAPI) be >> better and make the current scheme suffer. We will examine this >> further and get back. > > Actually, after some more thought and discussions I think we should > optimize for the 1 core case, since that is what gives the whole > system the best performance, provided that you can scale your > application with instantiation that is. For a 4 core system, 4 x the 1 > core performance > 2 x 2 core performance by a lot. I think that the 1 > core case is the one that is going to be used out there. At least that > is what I hear and see. > > So, when the merge window opens I am going to submit a patch that > updates the consumer pointer when we exit NAPI too. Will increase the > performance of the 1 core case. That sounds good to me. It doesn't make sense to update it multiple times per NAPI (in the single core case the application won't run at that time anyway), so once per NAPI is the most frequent, and according to your experiments it should be the most efficient. It should make mlx5e work again. One concern though: you say you are going to submit it to -next, but a kernel with your patches has been released, and it has broken AF_XDP support in mlx5e. I can send a small fix to net that will revert the behavior back to updating the consumer index once every 16 frames (so it makes mlx5e work again), and your patch will go on top of my bugfix. Does it sound right to you? Thanks for taking time to do the tests! > /Magnus > >> /Magnus >> >>>>>> I will look into this >>>>>> and see what I get. >>>>>> >>>>>>> 4. As mlx5e allocates frames in batches, the consequences are even more >>>>>>> severe: it's a deadlock where the driver waits for the application, and >>>>>>> vice versa. The driver never reaches the point where cons_tail gets >>>>>>> equal to cons_head. 
E.g., if cons_tail + 3 == cons_head, and the batch >>>>>>> size requested by the driver is 8, the driver won't peek anything from >>>>>>> the fill ring waiting for difference between cons_tail and cons_head to >>>>>>> increase to be at least 8. On the other hand, the application can't put >>>>>>> anything to the ring, because it still thinks that the consumer index is >>>>>>> 0. As cons_tail never reaches cons_head, the consumer index doesn't get >>>>>>> updated, hence the deadlock. >>>>>> >>>>>> Good thing that you detected this. Maybe I should get a Mellanox card >>>>>> :-). This is different from how we implemented Intel's drivers that >>>>>> just work on any batch size. If it gets 3 packets back, it will use >>>>>> those. How do you deal with the case when the application just puts a >>>>>> single buffer in the fill ring and wants to receive a single packet? >>>>> >>>>> mlx5e will wait until the full batch is available. As AF_XDP is intended >>>>> for high-performance apps, this scenario is less expected. We prefer to >>>>> leverage our performance features. >>>> >>>> That you cannot support all applications on top of AF_XDP with your >>>> zero-copy support seems broken to me. But I give you that you might >>>> support all the ones you care about with your current batching >>>> support. Like you say, most apps will put plenty of buffers on the >>>> fill ring, so this should not be a problem. Can you not implement some >>>> slow path for these cases? You must have one for the skb case. >>>> >>>>>> Why does the Mellanox driver need a specific batch size? This is just >>>>>> for my understanding so we can find a good solution. >>>>> >>>>> The main reason is our performance feature called striding RQ. Skipping >>>>> all irrelevant details, a batch of 64 pages is posted to the NIC with a >>>>> single request, and the NIC fills them progressively. This feature is >>>>> turned on by default on modern NICs, and it's really good for >>>>> performance. 
>>>>> >>>>> It might be possible to post a smaller batch though, I'm not sure about >>>>> it, it needs to be checked, but anyway it's not something that we will >>>>> likely do, because it's a complication of the data path, and if we know >>>>> more frames are coming, it's much better for the performance to wait for >>>>> them, rather than to post several incomplete batches. >>>>> >>>>>>> So, in my vision, the decision to remove RX_BATCH_SIZE and periodic >>>>>>> updates of the consumer index was wrong. It totally breaks mlx5e, that >>>>>>> does batching, and it will affect the performance of any driver, because >>>>>>> the application can't refill the ring until it gets completely empty and >>>>>>> the driver starts waiting for frames. I suggest that periodic updates of >>>>>>> the consumer index should be readded to xskq_cons_peek_addr. >>>>>> >>>>>> The reason I wanted to remove RX_BATCH_SIZE is that application >>>>>> developers complained about it giving rise to counter intuitive >>>>>> behavior in user space. I will try to dig out the complaints and the >>>>>> explanations Björn and I had to send which it seemed that users really >>>>>> should not have to care about. It should just work. >>>>> >>>>> I think the counter that doesn't update till the very last moment and >>>>> then advances by the ring size will also be something to complain about >>>>> (and I am the first one to complain :D). Such bursts are not good in any >>>>> case. >>>> >>>> Do you have any performance data that shows this scheme is bad for >>>> performance? The good thing about this scheme is that global state is >>>> updated less frequently. And the bad thing is what you mentioned. But >>>> which one has the greatest effect, is the question. >>>> >>>>>> I also do not like >>>>>> to have arbitrary constants like this. Why 16? >>>>> >>>>> I believe any batching mechanism has a constant that look arbitrary. 
>>>>> This constant should be the golden mean: if it's too small, there is >>>>> little effect from batching; if it's too big, it gets too bursty. >>>>> >>>>> Basically, after your patch it just changed from 16 to the ring size. >>>>> Maybe we can tie that constant to ring size? Make it ring_size / >>>>> another_arbitrary_constant? :) >>>> >>>> Agreed, I thought about this too. Something tied to ring_size might be >>>> a good compromise. Will examine this. But I want to base this on >>>> performance data not idle speculation, so need to run some experiments >>>> first. >>>> >>>> /Magnus >>>> >>>>>> Would much prefer not >>>>>> having to deal with this, unless of course it horribly breaks >>>>>> something or gives rise to worse performance. Might still be the case >>>>>> here, but if not, I would like to remove it. >>>>>> >>>>>> Thanks: Magnus >>>>>> >>>>>>> Magnus, what do you think of the suggestion above? >>>>>>> >>>>>>> Thanks, >>>>>>> Max >>>>>>> >>>>>>>>> >>>>>>>>> I tracked this issue in our internal bug tracker in case we need to >>>>>>>>> perform actual debugging of mlx5e. I'm looking forward to your feedback >>>>>>>>> on my assumption above. >>>>>>>>> >>>>>>>>>>> The hardware I am using is a Mellanox ConnectX4 2x100G card (MCX416A- >>>>>> CCAT) >>>>>>>>>>> running the mlx5 driver. >>>>>>>>> >>>>>>>>> This one should run without striding RQ, please verify it with ethtool >>>>>>>>> --show-priv-flags (the flag name is rx_striding_rq). >>>>>>>> >>>>>>>> I do not remember changing this option, so whatever the default is, is >>>>>> what it >>>>>>>> was running with. I am traveling this week and do not have access to >>>>>> these >>>>>>>> systems, but will ensure that this flag is set properly when I get back. >>>>>>>> >>>>>>> ^ permalink raw reply [flat|nested] 49+ messages in thread
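[Editor's note] The "tie the constant to ring_size" idea batted around above could look like the following. This is a hypothetical sketch, not kernel code; the divisor 4 is a placeholder, exactly the kind of arbitrary constant the thread is debating (before the patch it was a fixed RX_BATCH_SIZE of 16, after it effectively the ring size).

```c
/* Derive the consumer-update interval from the ring size instead of a
 * fixed batch constant. Divisor 4 is an illustrative placeholder. */
static unsigned int update_threshold(unsigned int ring_size)
{
    unsigned int t = ring_size / 4;

    return t ? t : 1;   /* never drop below one entry for tiny rings */
}
```

A mid-point like this bounds burstiness (the shared consumer index never lags by more than a quarter of the ring) while still amortizing the cost of the shared-state write, which is the trade-off the measurements above are probing.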
* Re: zero-copy between interfaces 2020-02-06 14:56 ` Maxim Mikityanskiy @ 2020-02-07 9:01 ` Magnus Karlsson 0 siblings, 0 replies; 49+ messages in thread From: Magnus Karlsson @ 2020-02-07 9:01 UTC (permalink / raw) To: Maxim Mikityanskiy Cc: Ryan Goodfellow, xdp-newbies, Tariq Toukan, Saeed Mahameed, Moshe Shemesh On Thu, Feb 6, 2020 at 3:56 PM Maxim Mikityanskiy <maximmi@mellanox.com> wrote: > > On 2020-02-05 15:31, Magnus Karlsson wrote: > > On Tue, Feb 4, 2020 at 5:10 PM Magnus Karlsson > > <magnus.karlsson@gmail.com> wrote: > >> > >> On Thu, Jan 30, 2020 at 12:40 PM Magnus Karlsson > >> <magnus.karlsson@gmail.com> wrote: > >>> > >>> On Thu, Jan 30, 2020 at 10:59 AM Magnus Karlsson > >>> <magnus.karlsson@gmail.com> wrote: > >>>> > >>>> On Thu, Jan 30, 2020 at 10:37 AM Maxim Mikityanskiy > >>>> <maximmi@mellanox.com> wrote: > >>>>> > >>>>>> -----Original Message----- > >>>>>> From: Magnus Karlsson <magnus.karlsson@gmail.com> > >>>>>> Sent: 27 January, 2020 17:55 > >>>>>> To: Maxim Mikityanskiy <maximmi@mellanox.com> > >>>>>> Cc: Ryan Goodfellow <rgoodfel@isi.edu>; xdp-newbies@vger.kernel.org; Tariq > >>>>>> Toukan <tariqt@mellanox.com>; Saeed Mahameed <saeedm@mellanox.com>; Moshe > >>>>>> Shemesh <moshe@mellanox.com> > >>>>>> Subject: Re: zero-copy between interfaces > >>>>>> > >>>>>> On Mon, Jan 27, 2020 at 3:01 PM Maxim Mikityanskiy <maximmi@mellanox.com> > >>>>>> wrote: > >>>>>>> > >>>>>>> On 2020-01-22 23:43, Ryan Goodfellow wrote: > >>>>>>>> On Tue, Jan 21, 2020 at 01:40:50PM +0000, Maxim Mikityanskiy wrote: > >>>>>>>>>>> I've posted output from the program in debugging mode here > >>>>>>>>>>> > >>>>>>>>>>> - https://gitlab.com/mergetb/tech/network- > >>>>>> emulation/kernel/snippets/1930375 > >>>>>>>>>>> > >>>>>>>>>>> Yes, you are correct in that forwarding works for a brief period and > >>>>>> then stops. 
> >>>>>>>>>>> I've noticed that the number of packets that are forwarded is equal > >>>>>> to the size > >>>>>>>>>>> of the producer/consumer descriptor rings. I've posted two ping > >>>>>> traces from a > >>>>>>>>>>> client ping that shows this. > >>>>>>>>>>> > >>>>>>>>>>> - https://gitlab.com/mergetb/tech/network- > >>>>>> emulation/kernel/snippets/1930376 > >>>>>>>>>>> - https://gitlab.com/mergetb/tech/network- > >>>>>> emulation/kernel/snippets/1930377 > >>>>>>>>> > >>>>>>>>> These snippets are not available. > >>>>>>>> > >>>>>>>> Apologies, I had the wrong permissions set. They should be available > >>>>>> now. > >>>>>>>> > >>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>>> I've also noticed that when the forwarding stops, the CPU usage for > >>>>>> the proc > >>>>>>>>>>> running the program is pegged, which is not the norm for this program > >>>>>> as it uses > >>>>>>>>>>> a poll call with a timeout on the xsk fd. > >>>>>>>>> > >>>>>>>>> This information led me to a guess what may be happening. On the RX > >>>>>>>>> side, mlx5e allocates pages in bulks for performance reasons and to > >>>>>>>>> leverage hardware features targeted to performance. In AF_XDP mode, > >>>>>>>>> bulking of frames is also used (on x86, the bulk size is 64 with > >>>>>>>>> striding RQ enabled, and 8 otherwise, however, it's implementation > >>>>>>>>> details that might change later). If you don't put enough frames to XSK > >>>>>>>>> Fill Ring, the driver will be demanding more frames and return from > >>>>>>>>> poll() immediately. Basically, in the application, you should put as > >>>>>>>>> many frames to the Fill Ring as you can. Please check if that could be > >>>>>>>>> the root cause of your issue. > >>>>>>>> > >>>>>>>> The code in this application makes an effort to relenish the fill ring > >>>>>> as fast > >>>>>>>> as possible. 
The basic loop of the application is to first check if > >>>>>> there are > >>>>>>>> any descriptors to be consumed from the completion queue or any > >>>>>> descriptors that > >>>>>>>> can be added to the fill queue, and only then to move on to moving > >>>>>> packets > >>>>>>>> through the rx and tx rings. > >>>>>>>> > >>>>>>>> https://gitlab.com/mergetb/tech/network-emulation/kernel/blob/v5.5- > >>>>>> moa/samples/bpf/xdpsock_multidev.c#L452-474 > >>>>>>> > >>>>>>> I reproduced your issue and did my investigation, and here is what I > >>>>>> found: > >>>>>>> > >>>>>>> 1. Commit df0ae6f78a45 (that you found during bisect) introduces an > >>>>>>> important behavioral change (which I thought was not that important). > >>>>>>> xskq_nb_avail used to return min(entries, dcnt), but after the change it > >>>>>>> just returns entries, which may be as big as the ring size. > >>>>>>> > >>>>>>> 2. xskq_peek_addr updates q->ring->consumer only when q->cons_tail > >>>>>>> catches up with q->cons_head. So, before that patch and one previous > >>>>>>> patch, cons_head - cons_tail was not more than 16, so the consumer index > >>>>>>> was updated periodically. Now consumer is updated only when the whole > >>>>>>> ring is exhausted. > >>>>>>> > >>>>>>> 3. The application can't replenish the fill ring if the consumer index > >>>>>>> doesn't move. As a consequence, refilling the descriptors by the > >>>>>>> application can't happen in parallel with using them by the driver. It > >>>>>>> should have some performance penalty and possibly even lead to packet > >>>>>>> drops, because the driver uses all the descriptors and only then > >>>>>>> advances the consumer index, and then it has to wait until the > >>>>>>> application refills the ring, busy-looping and losing packets. > >>>>>> > >>>>>> This will happen if user space always fills up the whole ring, which > >>>>>> might or might not happen all depending on the app. 
> >>>>> > >>>>> Yes, that's right, and as far as I know, it's common to fill as many > >>>>> frames as the application can (there was no reason to do it other way > >>>>> till now). > >>>>> > >>>>>> With that said, it > >>>>>> might provide better performance to update it once in a while. It > >>>>>> might also be the case that we will get better performance with the > >>>>>> new scheme if we only fill half the fill ring. > >>>>> > >>>>> Yes, it may improve performance. However, I don't think it's correct to > >>>>> set such a limitation on the app. > >>> > >>> Actually, a much worse limitation to put on an application is to say > >>> that you have to have a certain amount of buffers on some ring for the > >>> zero-copy feature to work. For example that we need at least 64 > >>> buffers on the fill ring for all the NIC cards to work in zero-copy > >>> mode. That would be a bad thing to have to put in the documentation. > >>> An OS is supposed to abstract away HW differences, and with this > >>> current limitation in your driver, it shines through for sure. What we > >>> would like to put in the documentation is a statement along the lines > >>> of: "for best performance, make sure you have plenty of buffers on the > >>> fill ring so that the NIC can work as efficiently as possible". Not a > >>> statement that it does not work on Mellanox unless you put enough > >>> buffers on the fill ring. So my advice (and wish) is that you fix this > >>> in your driver. With that said, I will still look into what is the > >>> best way to get at least the sample to work for you. But there is no > >>> way to make sure every single app works for you in zero-copy mode, > >>> unless you support arbitrary amount of buffers on the fill ring. I > >>> guess that sooner or later, a customer of yours will get into this > >>> situation one way or the other, so why not fix it now. > > Hi Magnus, > > We made an internal discussion about batching and AF_XDP in mlx5e. 
> > There are two types of RX queues supported by mlx5e: striding RQ and > legacy RQ. Which type of RQ is used depends on the configuration, > hardware support and defaults for different NICs, but basically in cases > when striding RQ is enabled by default, it's faster than legacy RQ, and > this is the case for modern NICs. All RX queues created in the driver > are of the same type. Striding RQ has a requirement of allocating in > batches, and the batch size is specified on queue creation, so there is > no fallback possible for this case. Legacy RQ, on the other hand, does > not require batching in XDP use cases, but now we do it anyway for > performance reasons and for code unification with non-XDP queues. Thanks. I understand why this is a problem now. Let's see how we can work around it in the best way. > I understand your concern that the current API doesn't provide a > completely opaque abstraction over the driver. However, we can't just > throw away an important performance feature (striding RQ) to support > some exotic case of a fill ring with a single frame only. AF_XDP is a > framework for high-performance applications, so it's extremely unlikely > that an AF_XDP application will only need to receive a single packet. > Such applications just won't need AF_XDP. So, given that the issue can't > be fixed without disabling striding RQ, and disabling striding RQ will > just reduce the performance of all real-world applications, we decided > to keep things as is for now, and if a customer complains about it, we > will suggest that they disable striding RQ in their configuration, and > we'll consider an option of disabling batching in legacy RQ for AF_XDP. > > BTW, if the current API can't provide a good enough abstraction over > advanced features of mlx5e, maybe we should extend the API somehow? > E.g., when need_wakeup for RX goes to "yes", also tell how many frames > need to be refilled? Yes, let us set need_wakeup for Rx, if you are not already doing that. 
The Intel drivers do this when there are no entries at all on the fill ring and none on the HW Rx ring. You can set it when there are fewer than 64 on the fill ring (or whatever striding size that you have) and no entries on the HW Rx ring. But I prefer not to return a number. Why not just document that if the user gets a need_wakeup on the fill ring it needs to put entries on the fill ring and wake up the kernel? We do recommend putting as many entries in the fill ring as possible since this is good for performance (for all NICs). Putting too few might trigger another need_wakeup event on NICs that prefer to work on batches of packets. Or something like that. > >>> /Magnus > >>> > >>>> I will examine what provides the best performance. On one hand it is > >>>> the number of updates to shared ring state (minimized by current > >>>> scheme) and the ability for the user app to put buffers on the fill > >>>> ring. Stating that putting e.g. half the packets on the fill ring > >>>> provides better performance (for some application) is not a > >>>> limitation. It is performance tuning advice :-). > >> > >> I have now made a set of measurements. First I just made a variation > >> study using the xdpsock app, varying the amount of packets the kernel > >> can withdraw from a consumer ring (fill and Tx) before updating global > >> state. For the 1 core case (app and driver on the same core) the more > >> frequently you do this update, the better. The reason for this is that > >> it costs very little to update the state since the application is not > >> running. And it is beneficial for the app to have a freshly updated > >> state when it starts to execute as it can operate on more packets. For > >> the 2 core case (app on one core, driver on another) it is the > >> complete opposite: the fewer updates to global state, the better. The > >> reason for this is that it costs a lot to update global state as it > >> triggers cache coherency actions between the two cores. 
> >> > >> What I did then was to compare the current scheme, update only when > >> grabbing new packets, to a new scheme where we also update the global > >> consumer pointer when we are exiting Rx or Tx processing in the NAPI > >> context. On two cores the current scheme gets 0.5 to 1 Mpps more in > >> throughput than also updating the pointer at the end of NAPI. But for > >> 1 core case, the new scheme is the best and generates between 0.2 and > >> 0.3 Mpps more in throughput than the current one. But all in all, the > >> current scheme is more beneficial than the proposed one if we say that > >> both the 1 core and the 2 core case are equally important. > >> > >> Something to note is that the xdpsock application only puts batch size > >> (64) of packets in the fill ring in every iteration, and this might > >> lead to some good pacing for the current scheme and the 2 core case. > >> I.e., we do not get into the case of the fill ring only being full or > >> empty. But I will run this on some real apps to get some more results, > >> and I know that Björn has an optimized xdpsock application that puts > >> many more packets into the fill queue than 64. This optimization might > >> actually make the new proposal (also updating at the end of NAPI) be > >> better and make the current scheme suffer. We will examine this > >> further and get back. > > > > Actually, after some more thought and discussions I think we should > > optimize for the 1 core case, since that is what gives the whole > > system the best performance, provided that you can scale your > > application with instantiation that is. For a 4 core system, 4 x the 1 > > core performance > 2 x 2 core performance by a lot. I think that the 1 > > core case is the one that is going to be used out there. At least that > > is what I hear and see. > > > > So, when the merge window opens I am going to submit a patch that > > updates the consumer pointer when we exit NAPI too. 
Will increase the > > performance of the 1 core case. > > That sounds good to me. It doesn't make sense to update it multiple > times per NAPI (in the single core case the application won't run at > that time anyway), so once per NAPI is the most frequent, and according > to your experiments it should be the most efficient. It should make > mlx5e work again. One concern though: you say you are going to submit it > to -next, but a kernel with your patches has been released, and it has > broken AF_XDP support in mlx5e. I can send a small fix to net that will > revert the behavior back to updating the consumer index once every 16 > frames (so it makes mlx5e work again), and your patch will go on top of > my bugfix. Does it sound right to you? How about a patch set of two patches. In the cover letter you explain the situation. 1: The application might deadlock if the kernel requires multiple buffers to be put on the fill ring to proceed but the consumer pointer has not been updated by the kernel so it is not possible to put them there. 2: When the kernel needs more buffers, the driver should signal this through the fill ring's need_wakeup flag. Document this. There are two solutions to this: a) fix the publishing of the consumer pointer and document that the user application has to put entries in the fill ring when the need_wakeup flag is set for the fill ring (and implement this in your driver if you have not already done so). b) Fix the Mellanox driver so it can work on any amount of buffers put in the fill ring. We chose a) because it will result in the most performant system and optimizing for the "few packets in the fill ring case" is optimizing for something that will rarely happen. I can send you the first patch. I do not want to reintroduce the batch size, if I do not have to. As I have said before, there are already enough tunables in an AF_XDP application. We do not need another one. 
My scheme of publishing the consumer pointers at the end of the NAPI loop seems to work really well and requires no parameters. You can then add a patch for 2 and a cover letter making our case that we would like this fixed. I will send you the patch in a separate mail on this list. What do you think? Thanks: Magnus > Thanks for taking time to do the tests! > > > /Magnus > > > >> /Magnus > >> > >>>>>> I will look into this > >>>>>> and see what I get. > >>>>>> > >>>>>>> 4. As mlx5e allocates frames in batches, the consequences are even more > >>>>>>> severe: it's a deadlock where the driver waits for the application, and > >>>>>>> vice versa. The driver never reaches the point where cons_tail gets > >>>>>>> equal to cons_head. E.g., if cons_tail + 3 == cons_head, and the batch > >>>>>>> size requested by the driver is 8, the driver won't peek anything from > >>>>>>> the fill ring waiting for difference between cons_tail and cons_head to > >>>>>>> increase to be at least 8. On the other hand, the application can't put > >>>>>>> anything to the ring, because it still thinks that the consumer index is > >>>>>>> 0. As cons_tail never reaches cons_head, the consumer index doesn't get > >>>>>>> updated, hence the deadlock. > >>>>>> > >>>>>> Good thing that you detected this. Maybe I should get a Mellanox card > >>>>>> :-). This is different from how we implemented Intel's drivers that > >>>>>> just work on any batch size. If it gets 3 packets back, it will use > >>>>>> those. How do you deal with the case when the application just puts a > >>>>>> single buffer in the fill ring and wants to receive a single packet? > >>>>> > >>>>> mlx5e will wait until the full batch is available. As AF_XDP is intended > >>>>> for high-performance apps, this scenario is less expected. We prefer to > >>>>> leverage our performance features. > >>>> > >>>> That you cannot support all applications on top of AF_XDP with your > >>>> zero-copy support seems broken to me. 
But I give you that you might > >>>> support all the ones you care about with your current batching > >>>> support. Like you say, most apps will put plenty of buffers on the > >>>> fill ring, so this should not be a problem. Can you not implement some > >>>> slow path for these cases? You must have one for the skb case. > >>>> > >>>>>> Why does the Mellanox driver need a specific batch size? This is just > >>>>>> for my understanding so we can find a good solution. > >>>>> > >>>>> The main reason is our performance feature called striding RQ. Skipping > >>>>> all irrelevant details, a batch of 64 pages is posted to the NIC with a > >>>>> single request, and the NIC fills them progressively. This feature is > >>>>> turned on by default on modern NICs, and it's really good for > >>>>> performance. > >>>>> > >>>>> It might be possible to post a smaller batch though, I'm not sure about > >>>>> it, it needs to be checked, but anyway it's not something that we will > >>>>> likely do, because it's a complication of the data path, and if we know > >>>>> more frames are coming, it's much better for the performance to wait for > >>>>> them, rather than to post several incomplete batches. > >>>>> > >>>>>>> So, in my vision, the decision to remove RX_BATCH_SIZE and periodic > >>>>>>> updates of the consumer index was wrong. It totally breaks mlx5e, that > >>>>>>> does batching, and it will affect the performance of any driver, because > >>>>>>> the application can't refill the ring until it gets completely empty and > >>>>>>> the driver starts waiting for frames. I suggest that periodic updates of > >>>>>>> the consumer index should be readded to xskq_cons_peek_addr. > >>>>>> > >>>>>> The reason I wanted to remove RX_BATCH_SIZE is that application > >>>>>> developers complained about it giving rise to counter intuitive > >>>>>> behavior in user space. 
I will try to dig out the complaints and the > >>>>>> explanations Björn and I had to send which it seemed that users really > >>>>>> should not have to care about. It should just work. > >>>>> > >>>>> I think the counter that doesn't update till the very last moment and > >>>>> then advances by the ring size will also be something to complain about > >>>>> (and I am the first one to complain :D). Such bursts are not good in any > >>>>> case. > >>>> > >>>> Do you have any performance data that shows this scheme is bad for > >>>> performance? The good thing about this scheme is that global state is > >>>> updated less frequently. And the bad thing is what you mentioned. But > >>>> which one has the greatest effect, is the question. > >>>> > >>>>>> I also do not like > >>>>>> to have arbitrary constants like this. Why 16? > >>>>> > >>>>> I believe any batching mechanism has a constant that look arbitrary. > >>>>> This constant should be the golden mean: if it's too small, there is > >>>>> little effect from batching; if it's too big, it gets too bursty. > >>>>> > >>>>> Basically, after your patch it just changed from 16 to the ring size. > >>>>> Maybe we can tie that constant to ring size? Make it ring_size / > >>>>> another_arbitrary_constant? :) > >>>> > >>>> Agreed, I thought about this too. Something tied to ring_size might be > >>>> a good compromise. Will examine this. But I want to base this on > >>>> performance data not idle speculation, so need to run some experiments > >>>> first. > >>>> > >>>> /Magnus > >>>> > >>>>>> Would much prefer not > >>>>>> having to deal with this, unless of course it horribly breaks > >>>>>> something or gives rise to worse performance. Might still be the case > >>>>>> here, but if not, I would like to remove it. > >>>>>> > >>>>>> Thanks: Magnus > >>>>>> > >>>>>>> Magnus, what do you think of the suggestion above? 
> >>>>>>> > >>>>>>> Thanks, > >>>>>>> Max > >>>>>>> > >>>>>>>>> > >>>>>>>>> I tracked this issue in our internal bug tracker in case we need to > >>>>>>>>> perform actual debugging of mlx5e. I'm looking forward to your feedback > >>>>>>>>> on my assumption above. > >>>>>>>>> > >>>>>>>>>>> The hardware I am using is a Mellanox ConnectX4 2x100G card (MCX416A- > >>>>>> CCAT) > >>>>>>>>>>> running the mlx5 driver. > >>>>>>>>> > >>>>>>>>> This one should run without striding RQ, please verify it with ethtool > >>>>>>>>> --show-priv-flags (the flag name is rx_striding_rq). > >>>>>>>> > >>>>>>>> I do not remember changing this option, so whatever the default is, is > >>>>>> what it > >>>>>>>> was running with. I am traveling this week and do not have access to > >>>>>> these > >>>>>>>> systems, but will ensure that this flag is set properly when I get back. > >>>>>>>> > >>>>>>> > ^ permalink raw reply [flat|nested] 49+ messages in thread
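The deadlock Maxim describes in the thread above can be reproduced with a small model of the fill ring's index bookkeeping. This is only an illustrative sketch — `FillRing` and its fields are stand-ins, not the kernel's actual structures — set up to mirror his example where cons_tail + 3 == cons_head and the driver requests batches of 8:

```python
RING_SIZE = 8
BATCH = 8  # driver-side batch size, as with mlx5e striding RQ

class FillRing:
    """Toy model of AF_XDP fill ring index bookkeeping."""
    def __init__(self):
        self.producer = 0   # advanced by the application
        self.consumer = 0   # published by the kernel, read by the app
        self.cons_head = 0  # kernel's snapshot of the producer index
        self.cons_tail = 0  # kernel's private consumption point

    def app_free_entries(self):
        # The application can only see the *published* consumer index.
        return RING_SIZE - (self.producer - self.consumer)

    def driver_cached_entries(self):
        return self.cons_head - self.cons_tail

# State as in Maxim's example: the app filled the whole ring earlier,
# the kernel has consumed some entries but never published its progress
# (cons_head - cons_tail == 3, consumer still 0).
r = FillRing()
r.producer, r.consumer = 8, 0
r.cons_head, r.cons_tail = 8, 5

assert r.app_free_entries() == 0           # app: "ring is full", cannot refill
assert r.driver_cached_entries() < BATCH   # driver: waits for a full batch
# Neither side can make progress: deadlock.

# Publishing the consumer index when exiting NAPI (Magnus's proposal)
# unblocks the application:
r.consumer = r.cons_tail
assert r.app_free_entries() == 5
```

The same arithmetic also shows why the pre-df0ae6f78a45 behavior (publish every 16 consumed entries) avoided the deadlock: the published consumer index could never lag cons_tail by more than the periodic-update constant.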
* Re: zero-copy between interfaces 2020-01-14 20:52 ` Ryan Goodfellow 2020-01-15 1:41 ` Ryan Goodfellow @ 2020-01-17 17:40 ` William Tu 1 sibling, 0 replies; 49+ messages in thread From: William Tu @ 2020-01-17 17:40 UTC (permalink / raw) To: Ryan Goodfellow; +Cc: Magnus Karlsson, xdp-newbies On Tue, Jan 14, 2020 at 12:53 PM Ryan Goodfellow <rgoodfel@isi.edu> wrote: > > On Tue, Jan 14, 2020 at 10:59:19AM +0100, Magnus Karlsson wrote: > > > > Just sent out a patch on the mailing list. Would be great if you could > > try it out. > > Thanks for the quick turnaround. I gave this patch a go, both in the bpf-next > tree and manually applied to the 5.5.0-rc3 branch I've been working with up to > this point. It does allow for allocating more memory, however packet > forwarding no longer works. I did not see any complaints from dmesg, but here > is an example iperf3 session from a client that worked before. > > ry@xd2:~$ iperf3 -c 10.1.0.2 > Connecting to host 10.1.0.2, port 5201 > [ 5] local 10.1.0.1 port 53304 connected to 10.1.0.2 port 5201 > [ ID] Interval Transfer Bitrate Retr Cwnd > [ 5] 0.00-1.00 sec 5.91 MBytes 49.5 Mbits/sec 2 1.41 KBytes > [ 5] 1.00-2.00 sec 0.00 Bytes 0.00 bits/sec 1 1.41 KBytes > [ 5] 2.00-3.00 sec 0.00 Bytes 0.00 bits/sec 0 1.41 KBytes > [ 5] 3.00-4.00 sec 0.00 Bytes 0.00 bits/sec 1 1.41 KBytes > [ 5] 4.00-5.00 sec 0.00 Bytes 0.00 bits/sec 0 1.41 KBytes > [ 5] 5.00-6.00 sec 0.00 Bytes 0.00 bits/sec 0 1.41 KBytes > [ 5] 6.00-7.00 sec 0.00 Bytes 0.00 bits/sec 1 1.41 KBytes > [ 5] 7.00-8.00 sec 0.00 Bytes 0.00 bits/sec 0 1.41 KBytes > [ 5] 8.00-9.00 sec 0.00 Bytes 0.00 bits/sec 0 1.41 KBytes > ^C[ 5] 10.00-139.77 sec 0.00 Bytes 0.00 bits/sec 4 1.41 KBytes > - - - - - - - - - - - - - - - - - - - - - - - - - > [ ID] Interval Transfer Bitrate Retr > [ 5] 0.00-139.77 sec 5.91 MBytes 355 Kbits/sec 9 sender > [ 5] 0.00-139.77 sec 0.00 Bytes 0.00 bits/sec receiver > iperf3: interrupt - the client has terminated > > I'll continue to investigate and report 
back with anything that I find. > Hi Ryan, Not sure if this is the same, but we hit something similar in OVS AF_XDP's implementation. In our case, it happens when using native-mode, not the driver zero-copy mode, and iperf works a couple seconds then down to zero. FYI: https://mail.openvswitch.org/pipermail/ovs-dev/2019-November/365076.html and fixes https://github.com/openvswitch/ovs/commit/161773c72a33a86a23d4892e3c7448cee9946317 Regards, William ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: zero-copy between interfaces 2020-01-13 0:18 zero-copy between interfaces Ryan Goodfellow 2020-01-13 9:16 ` Magnus Karlsson @ 2020-01-13 11:41 ` Jesper Dangaard Brouer 2020-01-13 15:28 ` Ryan Goodfellow 2020-01-17 12:32 ` [Intel-wired-lan] " Björn Töpel 2 siblings, 1 reply; 49+ messages in thread From: Jesper Dangaard Brouer @ 2020-01-13 11:41 UTC (permalink / raw) To: Ryan Goodfellow; +Cc: xdp-newbies On Mon, 13 Jan 2020 00:18:36 +0000 Ryan Goodfellow <rgoodfel@isi.edu> wrote: > The numbers that I have been able to achive with this code are the following. MTU > is 1500 in all cases. > > mlx5: pps ~ 2.4 Mpps, 29 Gbps (driver mode, zero-copy) > i40e: pps ~ 700 Kpps, 8 Gbps (skb mode, copy) > virtio: pps ~ 200 Kpps, 2.4 Gbps (skb mode, copy, all qemu/kvm VMs) > > Are these numbers in the ballpark of what's expected? I would say they are too slow / low. Have you remembered to do bulking? -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat LinkedIn: http://www.linkedin.com/in/brouer ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: zero-copy between interfaces 2020-01-13 11:41 ` Jesper Dangaard Brouer @ 2020-01-13 15:28 ` Ryan Goodfellow 2020-01-13 17:04 ` Jesper Dangaard Brouer 0 siblings, 1 reply; 49+ messages in thread From: Ryan Goodfellow @ 2020-01-13 15:28 UTC (permalink / raw) To: Jesper Dangaard Brouer; +Cc: xdp-newbies On Mon, Jan 13, 2020 at 12:41:34PM +0100, Jesper Dangaard Brouer wrote: > On Mon, 13 Jan 2020 00:18:36 +0000 > Ryan Goodfellow <rgoodfel@isi.edu> wrote: > > > The numbers that I have been able to achive with this code are the following. MTU > > is 1500 in all cases. > > > > mlx5: pps ~ 2.4 Mpps, 29 Gbps (driver mode, zero-copy) > > i40e: pps ~ 700 Kpps, 8 Gbps (skb mode, copy) > > virtio: pps ~ 200 Kpps, 2.4 Gbps (skb mode, copy, all qemu/kvm VMs) > > > > Are these numbers in the ballpark of what's expected? > > I would say they are too slow / low. > > Have you remembered to do bulking? > I am using a batch size of 256. -- ~ ry ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: zero-copy between interfaces 2020-01-13 15:28 ` Ryan Goodfellow @ 2020-01-13 17:04 ` Jesper Dangaard Brouer 2020-01-17 17:54 ` Ryan Goodfellow 0 siblings, 1 reply; 49+ messages in thread From: Jesper Dangaard Brouer @ 2020-01-13 17:04 UTC (permalink / raw) To: Ryan Goodfellow; +Cc: brouer, xdp-newbies On Mon, 13 Jan 2020 10:28:00 -0500 Ryan Goodfellow <rgoodfel@isi.edu> wrote: > On Mon, Jan 13, 2020 at 12:41:34PM +0100, Jesper Dangaard Brouer wrote: > > On Mon, 13 Jan 2020 00:18:36 +0000 > > Ryan Goodfellow <rgoodfel@isi.edu> wrote: > > > > > The numbers that I have been able to achive with this code are the following. MTU > > > is 1500 in all cases. > > > > > > mlx5: pps ~ 2.4 Mpps, 29 Gbps (driver mode, zero-copy) > > > i40e: pps ~ 700 Kpps, 8 Gbps (skb mode, copy) > > > virtio: pps ~ 200 Kpps, 2.4 Gbps (skb mode, copy, all qemu/kvm VMs) > > > > > > Are these numbers in the ballpark of what's expected? > > > > I would say they are too slow / low. > > > > Have you remembered to do bulking? > > > > I am using a batch size of 256. Hmm... Maybe you can test with xdp_redirect_map program in samples/bpf/ and compare the performance on this hardware? -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat LinkedIn: http://www.linkedin.com/in/brouer ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: zero-copy between interfaces 2020-01-13 17:04 ` Jesper Dangaard Brouer @ 2020-01-17 17:54 ` Ryan Goodfellow 2020-01-18 10:14 ` Jesper Dangaard Brouer 0 siblings, 1 reply; 49+ messages in thread From: Ryan Goodfellow @ 2020-01-17 17:54 UTC (permalink / raw) To: Jesper Dangaard Brouer; +Cc: xdp-newbies On Mon, Jan 13, 2020 at 06:04:11PM +0100, Jesper Dangaard Brouer wrote: > On Mon, 13 Jan 2020 10:28:00 -0500 > Ryan Goodfellow <rgoodfel@isi.edu> wrote: > > > On Mon, Jan 13, 2020 at 12:41:34PM +0100, Jesper Dangaard Brouer wrote: > > > On Mon, 13 Jan 2020 00:18:36 +0000 > > > Ryan Goodfellow <rgoodfel@isi.edu> wrote: > > > > > > > The numbers that I have been able to achive with this code are the following. MTU > > > > is 1500 in all cases. > > > > > > > > mlx5: pps ~ 2.4 Mpps, 29 Gbps (driver mode, zero-copy) > > > > i40e: pps ~ 700 Kpps, 8 Gbps (skb mode, copy) > > > > virtio: pps ~ 200 Kpps, 2.4 Gbps (skb mode, copy, all qemu/kvm VMs) > > > > > > > > Are these numbers in the ballpark of what's expected? > > > > > > I would say they are too slow / low. > > > > > > Have you remembered to do bulking? > > > > > > > I am using a batch size of 256. > > Hmm... > > Maybe you can test with xdp_redirect_map program in samples/bpf/ and > compare the performance on this hardware? Hi Jesper, I tried to use this program, however it does not seem to work for bidirectional traffic across the two interfaces? Now that I have an i40e card that is working here is an update to the numbers. At 1500 MTU the i40e runs at 9.7 GBPS (using the X710-DA4 10G interface), with approx 800 Kpps. Dropping the MSS size down to 700 in iperf3 yields similar near-line rate performance at 1.6 Mpps. This appears to be the limit as dropping the MSS further results in degraded performance at similar packet rates (could be an iperf3 artifact). -- ~ ry ^ permalink raw reply [flat|nested] 49+ messages in thread
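As a rough cross-check on those rates (my arithmetic, not from the thread): at 10 Gbit/s the standard Ethernet per-frame overhead (preamble/SFD, FCS, inter-frame gap) bounds the achievable packet rate, and ~800 Kpps at MTU 1500 is essentially line rate, as is ~1.6 Mpps at an MSS of 700:

```python
def line_rate_pps(link_bps, l2_payload_bytes):
    # On-wire cost per frame: payload + 14 B Ethernet header + 4 B FCS
    # + 8 B preamble/SFD + 12 B inter-frame gap.
    wire_bits = (l2_payload_bytes + 14 + 4 + 8 + 12) * 8
    return link_bps / wire_bits

mtu_1500 = line_rate_pps(10e9, 1500)      # ~813 Kpps -> matches ~800 Kpps
mss_700 = line_rate_pps(10e9, 700 + 40)   # +40 B TCP/IP headers, ~1.6 Mpps
```

This suggests the X710 numbers are NIC-limited rather than AF_XDP-limited, which is consistent with the "degraded performance at similar packet rates" seen when dropping the MSS further.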
* Re: zero-copy between interfaces 2020-01-17 17:54 ` Ryan Goodfellow @ 2020-01-18 10:14 ` Jesper Dangaard Brouer 2020-01-18 14:08 ` Ryan Goodfellow 0 siblings, 1 reply; 49+ messages in thread From: Jesper Dangaard Brouer @ 2020-01-18 10:14 UTC (permalink / raw) To: Ryan Goodfellow; +Cc: xdp-newbies, brouer On Fri, 17 Jan 2020 12:54:09 -0500 Ryan Goodfellow <rgoodfel@isi.edu> wrote: > On Mon, Jan 13, 2020 at 06:04:11PM +0100, Jesper Dangaard Brouer wrote: > > On Mon, 13 Jan 2020 10:28:00 -0500 > > Ryan Goodfellow <rgoodfel@isi.edu> wrote: > > > > > On Mon, Jan 13, 2020 at 12:41:34PM +0100, Jesper Dangaard Brouer wrote: > > > > On Mon, 13 Jan 2020 00:18:36 +0000 > > > > Ryan Goodfellow <rgoodfel@isi.edu> wrote: > > > > > > > > > The numbers that I have been able to achive with this code are the following. MTU > > > > > is 1500 in all cases. > > > > > > > > > > mlx5: pps ~ 2.4 Mpps, 29 Gbps (driver mode, zero-copy) > > > > > i40e: pps ~ 700 Kpps, 8 Gbps (skb mode, copy) > > > > > virtio: pps ~ 200 Kpps, 2.4 Gbps (skb mode, copy, all qemu/kvm VMs) > > > > > > > > > > Are these numbers in the ballpark of what's expected? > > > > > > > > I would say they are too slow / low. > > > > > > > > Have you remembered to do bulking? > > > > > > > > > > I am using a batch size of 256. > > > > Hmm... > > > > Maybe you can test with xdp_redirect_map program in samples/bpf/ and > > compare the performance on this hardware? > > Hi Jesper, > > I tried to use this program, however it does not seem to work for bidirectional > traffic across the two interfaces? It does work bidirectional if you start more of these xdp_redirect_map programs. Do notice this is an example program. Look at xdp_fwd_*.c if you want a program that is functional and uses the existing IP route table for XDP acceleration. My point is that there are alternatives for doing zero-copy between interfaces... A xdp_redirect_map inside the kernel out another interface is already zero-copy. 
I'm wondering why did you choose/need AF_XDP technology for doing forwarding? -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat LinkedIn: http://www.linkedin.com/in/brouer ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: zero-copy between interfaces 2020-01-18 10:14 ` Jesper Dangaard Brouer @ 2020-01-18 14:08 ` Ryan Goodfellow 2020-01-26 4:53 ` Dan Siemon 0 siblings, 1 reply; 49+ messages in thread From: Ryan Goodfellow @ 2020-01-18 14:08 UTC (permalink / raw) To: Jesper Dangaard Brouer; +Cc: xdp-newbies On Sat, Jan 18, 2020 at 11:14:05AM +0100, Jesper Dangaard Brouer wrote: > On Fri, 17 Jan 2020 12:54:09 -0500 > Ryan Goodfellow <rgoodfel@isi.edu> wrote: > > I tried to use this program, however it does not seem to work for bidirectional > > traffic across the two interfaces? > > It does work bidirectional if you start more of these xdp_redirect_map > programs. Do notice this is an example program. Look at xdp_fwd_*.c > if you want a program that is functional and uses the existing IP route > table for XDP acceleration. > > My point is that there are alternatives for doing zero-copy between > interfaces... A xdp_redirect_map inside the kernel out another > interface is already zero-copy. > > I'm wondering why did you choose/need AF_XDP technology for doing forwarding? This is just a sample program used to demonstrate moving packets between different interfaces efficiently using AF_XDP. Our actual use case is performing network emulation in userspace. For example, representing impaired links or entire networks with link-by-link shaping specifications. We are using AF_XDP to get packets to/from our network emulation software as quickly as possible without having to go through the entire network stack, as the emulation host's network configuration does not influence the networks it's emulating. Traditionally we've used DPDK for this, but are porting to AF_XDP for the relative simplicity and flexibility it provides. Some specific benefits for us are: - Can attach to VTEPs which allows us to hook into some EVPN/VXLAN based networks we have easily. 
Alternatively with the BPF program flexibility, we also have the option to split out BGP control plane traffic from overlay traffic when attaching to the physical interface and pass it through to the kernel. Both of these approaches let the kernel manage the FDB for VTEPs as well as taking care of encap/decap (potentially offloaded to the NIC itself) and let our software focus on emulation. - Using XDP in virtual machines in our testing environment is straightforward, while this is possible with DPDK and virtio, the setup was rather convoluted. -- ~ ry ^ permalink raw reply [flat|nested] 49+ messages in thread
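The control-plane split described above is, at its core, a classify step in the attached XDP program. Below is a userspace model of that decision logic only — the real thing would be a BPF C program returning XDP_PASS or redirecting into an XSKMAP, and the offsets assume untagged IPv4 over Ethernet, so treat it as illustrative:

```python
import struct

BGP_PORT = 179                     # BGP control plane goes to the kernel stack
PASS, TO_XSK = "pass", "redirect"  # stand-ins for XDP_PASS / XSKMAP redirect

def classify(frame: bytes) -> str:
    """Pass BGP TCP traffic to the kernel; everything else to AF_XDP."""
    if len(frame) < 14 + 20 + 4:        # eth + min IPv4 header + TCP ports
        return TO_XSK
    (ethertype,) = struct.unpack_from("!H", frame, 12)
    if ethertype != 0x0800:             # not IPv4: treat as overlay traffic
        return TO_XSK
    ihl = (frame[14] & 0x0F) * 4        # IPv4 header length
    if frame[23] != 6 or len(frame) < 14 + ihl + 4:   # not TCP / truncated
        return TO_XSK
    sport, dport = struct.unpack_from("!HH", frame, 14 + ihl)
    return PASS if BGP_PORT in (sport, dport) else TO_XSK
```

In the BPF version, the "pass" branch lets the kernel manage the BGP session and the VTEP FDB as described, while the "redirect" branch hands overlay packets to the emulator's AF_XDP sockets.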
* Re: zero-copy between interfaces 2020-01-18 14:08 ` Ryan Goodfellow @ 2020-01-26 4:53 ` Dan Siemon 0 siblings, 0 replies; 49+ messages in thread From: Dan Siemon @ 2020-01-26 4:53 UTC (permalink / raw) To: Ryan Goodfellow, Jesper Dangaard Brouer; +Cc: xdp-newbies On Sat, 2020-01-18 at 09:08 -0500, Ryan Goodfellow wrote: > On Sat, Jan 18, 2020 at 11:14:05AM +0100, Jesper Dangaard > Brouerwrote: > > > > I'm wondering why did you choose/need AF_XDP technology for doing > > forwarding? > > This is just a sample program used to demonstrate moving packets > between > different interfaces efficiently using AF_XDP. > > Our actual use case is performing network emulation in userspace. For > example, > representing impaired links or entire networks with link-by-link > shaping > specifications. We are using AF_XDP to get packets to/from our > network emulation > software as quickly as possible without having to go through the > entire network > stack, as the emulation host's network configuration does not > influence the > networks it's emulating. > > Traditionally we've used DPDK for this, but are porting to AF_XDP for > the > relative simplicity and flexibility it provides. Some specific > benefits for us > are: > > - Can attach to VTEPs which allows us to hook into some EVPN/VXLAN > based > networks we have easily. Alternatively with the BPF program > flexibility, we > also have the option to split out BGP control plane traffic from > overlay > traffic when attaching to the physical interface and pass it > through to the > kernel. Both of these approaches let the kernel manage the FDB for > VTEPs as > well as taking care of encap/decap (potentially offloaded to the > NIC itself) > and let our software focus on emulation. > > - Using XDP in virtual machines in our testing environment is > straightforward, > while this is possible with DPDK and virtio, the setup was rather > convoluted. I'm in a very similar situation. The per-packet work we do is complicated. 
Doing it all in BPF would be much more painful if even possible. For us, AF_XDP is a nice, simple DPDK. We have no need for BPF in the AF_XDP path. ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: zero-copy between interfaces 2020-01-13 0:18 zero-copy between interfaces Ryan Goodfellow @ 2020-01-17 12:32 ` Björn Töpel 2020-01-13 11:41 ` Jesper Dangaard Brouer 2020-01-17 12:32 ` [Intel-wired-lan] " Björn Töpel 2 siblings, 0 replies; 49+ messages in thread From: Björn Töpel @ 2020-01-17 12:32 UTC (permalink / raw) To: Ryan Goodfellow Cc: xdp-newbies, Karlsson, Magnus, intel-wired-lan, Magnus Karlsson, Björn Töpel On Mon, 13 Jan 2020 at 01:28, Ryan Goodfellow <rgoodfel@isi.edu> wrote: > [...] > > I could not get zero-copy to work with the i40e driver as it would crash. I've > attached the corresponding traces from dmesg. Thanks Ryan! I had a look at the crash, and it's in the XDP setup: i40e_xdp_setup: ... for (i = 0; i < vsi->num_queue_pairs; i++) WRITE_ONCE(vsi->rx_rings[i]->xdp_prog, vsi->xdp_prog); and the vsi->rx_ring[0] is NULL. This is clearly broken. It would help with more lines from your dmesg: the cut i40e log hints that something is really broken: [ 328.579154] i40e 0000:b7:00.2: failed to get tracking for 256 queues for VSI 0 err -12 [ 328.579280] i40e 0000:b7:00.2: setup of MAIN VSI failed [ 328.579367] i40e 0000:b7:00.2: can't remove VEB 162 with 0 VSIs left Is it possible to dig out the complete log? Thanks! Björn ^ permalink raw reply [flat|nested] 49+ messages in thread
* [Intel-wired-lan] zero-copy between interfaces @ 2020-01-17 12:32 ` Björn Töpel 0 siblings, 0 replies; 49+ messages in thread From: Björn Töpel @ 2020-01-17 12:32 UTC (permalink / raw) To: intel-wired-lan On Mon, 13 Jan 2020 at 01:28, Ryan Goodfellow <rgoodfel@isi.edu> wrote: > [...] > > I could not get zero-copy to work with the i40e driver as it would crash. I've > attached the corresponding traces from dmesg. Thanks Ryan! I had a look at the crash, and it's in the XDP setup: i40e_xdp_setup: ... for (i = 0; i < vsi->num_queue_pairs; i++) WRITE_ONCE(vsi->rx_rings[i]->xdp_prog, vsi->xdp_prog); and the vsi->rx_ring[0] is NULL. This is clearly broken. It would help with more lines from your dmesg: the cut i40e log hints that something is really broken: [ 328.579154] i40e 0000:b7:00.2: failed to get tracking for 256 queues for VSI 0 err -12 [ 328.579280] i40e 0000:b7:00.2: setup of MAIN VSI failed [ 328.579367] i40e 0000:b7:00.2: can't remove VEB 162 with 0 VSIs left Is it possible to dig out the complete log? Thanks! Björn ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: zero-copy between interfaces 2020-01-17 12:32 ` [Intel-wired-lan] " Björn Töpel @ 2020-01-17 17:16 ` Ryan Goodfellow -1 siblings, 0 replies; 49+ messages in thread From: Ryan Goodfellow @ 2020-01-17 17:16 UTC (permalink / raw) To: Björn Töpel Cc: xdp-newbies, Karlsson, Magnus, intel-wired-lan, Magnus Karlsson, Björn Töpel On Fri, Jan 17, 2020 at 01:32:07PM +0100, Björn Töpel wrote: > On Mon, 13 Jan 2020 at 01:28, Ryan Goodfellow <rgoodfel@isi.edu> wrote: > > > [...] > > > > I could not get zero-copy to work with the i40e driver as it would crash. I've > > attached the corresponding traces from dmesg. > > Thanks Ryan! I had a look at the crash, and it's in the XDP setup: > > i40e_xdp_setup: > ... > for (i = 0; i < vsi->num_queue_pairs; i++) > WRITE_ONCE(vsi->rx_rings[i]->xdp_prog, vsi->xdp_prog); > > and the vsi->rx_ring[0] is NULL. This is clearly broken. > > It would help with more lines from your dmesg: the cut i40e log hints > that something is really broken: > > [ 328.579154] i40e 0000:b7:00.2: failed to get tracking for 256 > queues for VSI 0 err -12 > [ 328.579280] i40e 0000:b7:00.2: setup of MAIN VSI failed > [ 328.579367] i40e 0000:b7:00.2: can't remove VEB 162 with 0 VSIs left > > Is it possible to dig out the complete log? Hi Björn, I've linked a full dmesg log from an XDP setup crash. Note that there are two i40e cards on this machine. The X710 (0000:65:00.0, 0000:65:00.1) works fine, the X722 (0000:b7:00.0, 0000:b7:00.1, 0000:b7:00.2, 0000:b7:00.3) is the one that is crashing on XDP setup. 
https://gitlab.com/mergetb/tech/network-emulation/kernel/snippets/1931080 Some info that may be useful: ry@turbine:~$ sudo ethtool -i eno7 driver: i40e version: 2.8.20-k firmware-version: 3.33 0x80001006 1.1747.0 expansion-rom-version: bus-info: 0000:b7:00.2 supports-statistics: yes supports-test: yes supports-eeprom-access: yes supports-register-dump: yes supports-priv-flags: yes The firmware version 3.33 was the latest I could find as of a few weeks ago. ry@turbine:~$ sudo lspci -vvv | grep 722 b7:00.0 Ethernet controller: Intel Corporation Ethernet Connection X722 for 10GBASE-T (rev 04) DeviceName: Intel LAN X722 #1 Subsystem: Super Micro Computer Inc Ethernet Connection X722 for 10GBASE-T b7:00.1 Ethernet controller: Intel Corporation Ethernet Connection X722 for 10GBASE-T (rev 04) DeviceName: Intel LAN X722 #2 Subsystem: Super Micro Computer Inc Ethernet Connection X722 for 10GBASE-T b7:00.2 Ethernet controller: Intel Corporation Ethernet Connection X722 for 10GbE SFP+ (rev 04) DeviceName: Intel LAN X722 #3 Subsystem: Super Micro Computer Inc Ethernet Connection X722 for 10GbE SFP+ b7:00.3 Ethernet controller: Intel Corporation Ethernet Connection X722 for 10GbE SFP+ (rev 04) DeviceName: Intel LAN X722 #4 Subsystem: Super Micro Computer Inc Ethernet Connection X722 for 10GbE SFP+ ry@ryzen:~$ uname -a Linux ryzen 4.19.0-6-amd64 #1 SMP Debian 4.19.67-2+deb10u2 (2019-11-11) x86_64 GNU/Linux ry@turbine:~/kmoa/bpf-next$ git log -2 commit 60d71397d27e7859fdaaaaab6594e4d977ae46e2 (HEAD -> master) Author: Ryan Goodfellow <rgoodfel@isi.edu> Date: Wed Jan 15 16:54:39 2020 -0500 add xdpsock_multidev sample program This is a simple program that uses AF_XDP sockets to forward packets between two interfaces using a common memory region and no copying of packets. 
Signed-off-by: Ryan Goodfellow <rgoodfel@isi.edu> commit 9173cac3b64e6785dd604f5075e6035b045a0026 (origin/master, origin/HEAD) Author: Andrii Nakryiko <andriin@fb.com> Date: Wed Jan 15 11:08:56 2020 -0800 libbpf: Support .text sub-calls relocations The LLVM patch https://reviews.llvm.org/D72197 makes LLVM emit function call relocations within the same section. This includes a default .text section, which contains any BPF sub-programs. This wasn't the case before and so libbpf was able to get a way with slightly simpler handling of subprogram call relocations. This patch adds support for .text section relocations. It needs to ensure correct order of relocations, so does two passes: - first, relocate .text instructions, if there are any relocations in it; - then process all the other programs and copy over patched .text instructions for all sub-program calls. v1->v2: - break early once .text program is processed. Signed-off-by: Andrii Nakryiko <andriin@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Yonghong Song <yhs@fb.com> Cc: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20200115190856.2391325-1-andriin@fb.com -- ~ ry ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: zero-copy between interfaces
  2020-01-17 17:16 ` [Intel-wired-lan] " Ryan Goodfellow
@ 2020-01-17 18:10   ` Ryan Goodfellow
  -1 siblings, 0 replies; 49+ messages in thread
From: Ryan Goodfellow @ 2020-01-17 18:10 UTC (permalink / raw)
To: Björn Töpel
Cc: xdp-newbies, Karlsson, Magnus, intel-wired-lan, Magnus Karlsson, Björn Töpel

On Fri, Jan 17, 2020 at 12:16:37PM -0500, Ryan Goodfellow wrote:
> ry@ryzen:~$ uname -a
> Linux ryzen 4.19.0-6-amd64 #1 SMP Debian 4.19.67-2+deb10u2 (2019-11-11) x86_64 GNU/Linux

Correction, that was a terminal on the wrong machine:

ry@turbine:~$ uname -a
Linux turbine 5.5.0-rc4-moa+ #16 SMP Fri Jan 17 10:52:42 EST 2020 x86_64 GNU/Linux

--
~ ry

^ permalink raw reply	[flat|nested] 49+ messages in thread
* Re: zero-copy between interfaces
  2020-01-17 17:16 ` [Intel-wired-lan] " Ryan Goodfellow
@ 2020-01-20  8:24   ` Magnus Karlsson
  -1 siblings, 0 replies; 49+ messages in thread
From: Magnus Karlsson @ 2020-01-20 8:24 UTC (permalink / raw)
To: Ryan Goodfellow
Cc: Björn Töpel, xdp-newbies, Karlsson, Magnus, intel-wired-lan, Björn Töpel

On Fri, Jan 17, 2020 at 6:16 PM Ryan Goodfellow <rgoodfel@isi.edu> wrote:
>
> On Fri, Jan 17, 2020 at 01:32:07PM +0100, Björn Töpel wrote:
> > On Mon, 13 Jan 2020 at 01:28, Ryan Goodfellow <rgoodfel@isi.edu> wrote:
> > >
> > > [...]
> > >
> > > I could not get zero-copy to work with the i40e driver as it would crash. I've
> > > attached the corresponding traces from dmesg.
> >
> > Thanks Ryan! I had a look at the crash, and it's in the XDP setup:
> >
> > i40e_xdp_setup:
> >   ...
> >   for (i = 0; i < vsi->num_queue_pairs; i++)
> >           WRITE_ONCE(vsi->rx_rings[i]->xdp_prog, vsi->xdp_prog);
> >
> > and the vsi->rx_ring[0] is NULL. This is clearly broken.
> >
> > It would help with more lines from your dmesg: the cut i40e log hints
> > that something is really broken:
> >
> > [  328.579154] i40e 0000:b7:00.2: failed to get tracking for 256 queues for VSI 0 err -12
> > [  328.579280] i40e 0000:b7:00.2: setup of MAIN VSI failed
> > [  328.579367] i40e 0000:b7:00.2: can't remove VEB 162 with 0 VSIs left
> >
> > Is it possible to dig out the complete log?
>
> Hi Björn,
>
> I've linked a full dmesg log from an XDP setup crash. Note that there are
> two i40e cards on this machine. The X710 (0000:65:00.0, 0000:65:00.1) works
> fine, the X722 (0000:b7:00.0, 0000:b7:00.1, 0000:b7:00.2, 0000:b7:00.3) is the
> one that is crashing on XDP setup.

Ryan,

I was wondering if you could run two small experiments since I cannot
reproduce this?

1: Run your program using the two ports on your X710 card. Does it
work? This is my setup and works for me.

2: On your Mellanox setup, insert a kick_tx() call for each of your
two sockets before the poll() call in your forward() function. Just to
see if it works when we explicitly wake up the driver.

Thanks: Magnus

> https://gitlab.com/mergetb/tech/network-emulation/kernel/snippets/1931080
>
> Some info that may be useful:
>
> ry@turbine:~$ sudo ethtool -i eno7
> driver: i40e
> version: 2.8.20-k
> firmware-version: 3.33 0x80001006 1.1747.0
> expansion-rom-version:
> bus-info: 0000:b7:00.2
> supports-statistics: yes
> supports-test: yes
> supports-eeprom-access: yes
> supports-register-dump: yes
> supports-priv-flags: yes
>
> The firmware version 3.33 was the latest I could find as of a few weeks ago.
>
> ry@turbine:~$ sudo lspci -vvv | grep 722
> b7:00.0 Ethernet controller: Intel Corporation Ethernet Connection X722 for 10GBASE-T (rev 04)
>         DeviceName: Intel LAN X722 #1
>         Subsystem: Super Micro Computer Inc Ethernet Connection X722 for 10GBASE-T
> b7:00.1 Ethernet controller: Intel Corporation Ethernet Connection X722 for 10GBASE-T (rev 04)
>         DeviceName: Intel LAN X722 #2
>         Subsystem: Super Micro Computer Inc Ethernet Connection X722 for 10GBASE-T
> b7:00.2 Ethernet controller: Intel Corporation Ethernet Connection X722 for 10GbE SFP+ (rev 04)
>         DeviceName: Intel LAN X722 #3
>         Subsystem: Super Micro Computer Inc Ethernet Connection X722 for 10GbE SFP+
> b7:00.3 Ethernet controller: Intel Corporation Ethernet Connection X722 for 10GbE SFP+ (rev 04)
>         DeviceName: Intel LAN X722 #4
>         Subsystem: Super Micro Computer Inc Ethernet Connection X722 for 10GbE SFP+
>
> ry@ryzen:~$ uname -a
> Linux ryzen 4.19.0-6-amd64 #1 SMP Debian 4.19.67-2+deb10u2 (2019-11-11) x86_64 GNU/Linux
>
> ry@turbine:~/kmoa/bpf-next$ git log -2
> commit 60d71397d27e7859fdaaaaab6594e4d977ae46e2 (HEAD -> master)
> Author: Ryan Goodfellow <rgoodfel@isi.edu>
> Date:   Wed Jan 15 16:54:39 2020 -0500
>
>     add xdpsock_multidev sample program
>
>     This is a simple program that uses AF_XDP sockets to forward packets
>     between two interfaces using a common memory region and no copying of
>     packets.
>
>     Signed-off-by: Ryan Goodfellow <rgoodfel@isi.edu>
>
> commit 9173cac3b64e6785dd604f5075e6035b045a0026 (origin/master, origin/HEAD)
> Author: Andrii Nakryiko <andriin@fb.com>
> Date:   Wed Jan 15 11:08:56 2020 -0800
>
>     libbpf: Support .text sub-calls relocations
>
>     The LLVM patch https://reviews.llvm.org/D72197 makes LLVM emit function call
>     relocations within the same section. This includes a default .text section,
>     which contains any BPF sub-programs. This wasn't the case before and so libbpf
>     was able to get away with slightly simpler handling of subprogram call
>     relocations.
>
>     This patch adds support for .text section relocations. It needs to ensure
>     correct order of relocations, so does two passes:
>     - first, relocate .text instructions, if there are any relocations in it;
>     - then process all the other programs and copy over patched .text instructions
>       for all sub-program calls.
>
>     v1->v2:
>     - break early once .text program is processed.
>
>     Signed-off-by: Andrii Nakryiko <andriin@fb.com>
>     Signed-off-by: Alexei Starovoitov <ast@kernel.org>
>     Acked-by: Yonghong Song <yhs@fb.com>
>     Cc: Alexei Starovoitov <ast@kernel.org>
>     Link: https://lore.kernel.org/bpf/20200115190856.2391325-1-andriin@fb.com
>
> --
> ~ ry

^ permalink raw reply	[flat|nested] 49+ messages in thread
* Re: zero-copy between interfaces
  2020-01-20  8:24 ` [Intel-wired-lan] " Magnus Karlsson
@ 2020-01-20 18:33   ` Ryan Goodfellow
  -1 siblings, 0 replies; 49+ messages in thread
From: Ryan Goodfellow @ 2020-01-20 18:33 UTC (permalink / raw)
To: Magnus Karlsson
Cc: Björn Töpel, xdp-newbies, Karlsson, Magnus, intel-wired-lan, Björn Töpel

On Mon, Jan 20, 2020 at 09:24:05AM +0100, Magnus Karlsson wrote:
>
> I was wondering if you could run two small experiments since I cannot
> reproduce this?
>
> 1: Run your program using the two ports on your X710 card. Does it
> work? This is my setup and works for me.

The X710 card works without issue.

> 2: On your Mellanox setup, insert a kick_tx() call for each of your
> two sockets before the poll() call in your forward() function. Just to
> see if it works when we explicitly wake up the driver.

This did not have an effect on the observed behavior. Exactly N packets
go through the interface, where N is equal to the size of the FQ/CQ rings,
and then forwarding halts.

--
~ ry

^ permalink raw reply	[flat|nested] 49+ messages in thread
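Halting after exactly ring-size packets is the classic signature of UMEM buffers never being recycled: every RX consumes a fill-queue (FQ) entry, every completed TX parks the buffer on the completion queue (CQ), and unless the CQ is drained back into the FQ the forwarder runs dry. A toy accounting model of that symptom (the struct and counters here are illustrative, not libbpf API):

```c
#include <stdbool.h>

/* fq: buffers the kernel can receive into; cq: buffers waiting to be
 * reclaimed after transmit. Both start from a ring of fixed size. */
struct umem_model { int fq, cq; };

/* Forward one packet. Without recycling, each packet permanently moves
 * one buffer from the FQ to the CQ, so forwarding halts once the FQ is
 * empty, i.e. after exactly ring-size packets. */
static bool forward_one(struct umem_model *u, bool recycle)
{
	if (u->fq == 0)
		return false;	/* no buffer to receive into: halted */
	u->fq--;		/* RX consumes one fill-queue entry    */
	u->cq++;		/* after TX completes, it sits on CQ   */
	if (recycle) {		/* drain CQ back into FQ               */
		u->fq += u->cq;
		u->cq = 0;
	}
	return true;
}
```

Whether the missing recycle step here is in the application or in the mlx5 zero-copy path is exactly what the rest of the thread goes on to debug.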
* Re: zero-copy between interfaces
  2020-01-17 17:16 ` [Intel-wired-lan] " Ryan Goodfellow
@ 2020-01-20 17:04   ` Björn Töpel
  -1 siblings, 0 replies; 49+ messages in thread
From: Björn Töpel @ 2020-01-20 17:04 UTC (permalink / raw)
To: Ryan Goodfellow, Björn Töpel
Cc: xdp-newbies, Karlsson, Magnus, intel-wired-lan, Magnus Karlsson

On 2020-01-17 18:16, Ryan Goodfellow wrote:
[...]
>
> https://gitlab.com/mergetb/tech/network-emulation/kernel/snippets/1931080
>

Ryan, thanks a lot for the detailed report! Much appreciated!

Long story short, the i40e crash is that the driver tries to allocate 256
queues, but the HW is short on queues. The driver enters a broken state,
which triggers the crash. I'll make sure we'll get a patch for this.

Björn

^ permalink raw reply	[flat|nested] 49+ messages in thread
end of thread, other threads:[~2020-02-07  9:01 UTC | newest]

Thread overview: 49+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-01-13  0:18 zero-copy between interfaces Ryan Goodfellow
2020-01-13  9:16 ` Magnus Karlsson
2020-01-13 10:43   ` Toke Høiland-Jørgensen
2020-01-13 15:25     ` Ryan Goodfellow
2020-01-13 17:09       ` Toke Høiland-Jørgensen
2020-01-14  7:47         ` Magnus Karlsson
2020-01-14  8:11           ` Toke Høiland-Jørgensen
2020-01-13 15:11   ` Ryan Goodfellow
2020-01-14  9:59     ` Magnus Karlsson
2020-01-14 20:52       ` Ryan Goodfellow
2020-01-15  1:41         ` Ryan Goodfellow
2020-01-15  7:40           ` Magnus Karlsson
2020-01-15  8:20             ` Magnus Karlsson
2020-01-16  2:04               ` Ryan Goodfellow
2020-01-16 14:32                 ` Magnus Karlsson
2020-01-17  9:45                   ` Magnus Karlsson
2020-01-17 17:05                     ` Ryan Goodfellow
2020-01-21  7:34                       ` Magnus Karlsson
2020-01-21 13:40                         ` Maxim Mikityanskiy
2020-01-22 21:43                           ` Ryan Goodfellow
2020-01-27 14:01                             ` Maxim Mikityanskiy
2020-01-27 15:54                               ` Magnus Karlsson
2020-01-30  9:37                                 ` Maxim Mikityanskiy
2020-01-30  9:59                                   ` Magnus Karlsson
2020-01-30 11:40                                     ` Magnus Karlsson
2020-02-04 16:10                                       ` Magnus Karlsson
2020-02-05 13:31                                         ` Magnus Karlsson
2020-02-06 14:56                                           ` Maxim Mikityanskiy
2020-02-07  9:01                                             ` Magnus Karlsson
2020-01-17 17:40                   ` William Tu
2020-01-13 11:41 ` Jesper Dangaard Brouer
2020-01-13 15:28   ` Ryan Goodfellow
2020-01-13 17:04     ` Jesper Dangaard Brouer
2020-01-17 17:54       ` Ryan Goodfellow
2020-01-18 10:14         ` Jesper Dangaard Brouer
2020-01-18 14:08           ` Ryan Goodfellow
2020-01-26  4:53 ` Dan Siemon
2020-01-17 12:32 ` Björn Töpel
2020-01-17 12:32   ` [Intel-wired-lan] " Björn Töpel
2020-01-17 17:16   ` Ryan Goodfellow
2020-01-17 17:16     ` [Intel-wired-lan] " Ryan Goodfellow
2020-01-17 18:10     ` Ryan Goodfellow
2020-01-17 18:10       ` [Intel-wired-lan] " Ryan Goodfellow
2020-01-20  8:24     ` Magnus Karlsson
2020-01-20  8:24       ` [Intel-wired-lan] " Magnus Karlsson
2020-01-20 18:33       ` Ryan Goodfellow
2020-01-20 18:33         ` [Intel-wired-lan] " Ryan Goodfellow
2020-01-20 17:04     ` Björn Töpel
2020-01-20 17:04       ` [Intel-wired-lan] " Björn Töpel