* zero-copy between interfaces
@ 2020-01-13  0:18 Ryan Goodfellow
  2020-01-13  9:16 ` Magnus Karlsson
                   ` (2 more replies)
  0 siblings, 3 replies; 49+ messages in thread
From: Ryan Goodfellow @ 2020-01-13  0:18 UTC (permalink / raw)
  To: xdp-newbies

[-- Attachment #1: Type: text/plain, Size: 3081 bytes --]

Greetings XDP folks. I've been working on a zero-copy XDP bridge 
implementation similar to what's described in the following thread.

  https://www.spinics.net/lists/xdp-newbies/msg01333.html

I now have an implementation that is working reasonably well under certain
conditions on various hardware. The implementation is primarily based on the
xdpsock_user program in the kernel under samples/bpf. You can find my program
and corresponding BPF program here.

- https://gitlab.com/mergetb/tech/network-emulation/kernel/blob/v5.5-moa/samples/bpf/xdpsock_multidev.c
- https://gitlab.com/mergetb/tech/network-emulation/kernel/blob/v5.5-moa/samples/bpf/xdpsock_multidev_kern.c
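
For reference, the kernel side of such a bridge is typically just the stock
xdpsock redirect logic: an XSKMAP keyed by RX queue index plus a
bpf_redirect_map() call. A minimal sketch of that shape (illustrative only,
not the exact contents of xdpsock_multidev_kern.c):

    #include <linux/bpf.h>
    #include <bpf/bpf_helpers.h>   /* or "bpf_helpers.h" inside samples/bpf */

    struct {
            __uint(type, BPF_MAP_TYPE_XSKMAP);
            __uint(max_entries, 64);
            __uint(key_size, sizeof(int));
            __uint(value_size, sizeof(int));
    } xsks_map SEC(".maps");

    SEC("xdp_sock")
    int xdp_sock_prog(struct xdp_md *ctx)
    {
            /* redirect to the AF_XDP socket bound to this RX queue;
             * on 5.3+ kernels the flags argument doubles as the
             * fallback action if no socket is attached */
            return bpf_redirect_map(&xsks_map, ctx->rx_queue_index, XDP_PASS);
    }

    char _license[] SEC("license") = "GPL";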

I have a small testbed to run this code on, which looks like the following.

Packet forwarding machine:
    CPU: Intel(R) Xeon(R) D-2146NT CPU @ 2.30GHz (8 core / 16 thread)
    Memory: 32 GB
    NICs: 
    - Mellanox ConnectX 4 Dual 100G MCX416A-CCAT (connected at 40G)
    - Intel X722 10G SFP+

Sender/receiver machines
    CPU: Intel(R) Xeon(R) D-2146NT CPU @ 2.30GHz (8 core / 16 thread)
    Memory: 32 GB
    NICs: 
    - Mellanox ConnectX 4 40G MCX4131A-BCAT
    - Intel X722 10G SFP+

I could not get zero-copy to work with the i40e driver as it would crash. I've
attached the corresponding traces from dmesg. The results below are with the 
i40e running in SKB/copy mode. I do have an X710-DA4 that I could plug into the 
server and test with instead of the X722 if that is of interest. In all cases I 
used a single hardware queue via the following.

    ethtool -L <dev> combined 1

The Mellanox cards in zero-copy mode create a sort of shadow set of queues, so I
used ntuple rules to push traffic through queue 1 (which shadows queue 0) as follows:

    ethtool -N <dev> flow-type ether src <mac> action 1

The numbers that I have been able to achieve with this code are the following. MTU
is 1500 in all cases.

    mlx5: pps ~ 2.4 Mpps, 29 Gbps (driver mode, zero-copy)
    i40e: pps ~ 700 Kpps, 8 Gbps (skb mode, copy)
    virtio: pps ~ 200 Kpps, 2.4 Gbps (skb mode, copy, all qemu/kvm VMs)

Are these numbers in the ballpark of what's expected?

One thing I have noticed is that I cannot create large memory maps for the
packet buffers. For example, a frame size of 2048 with 524288 frames (around
1 GB of packet buffer) is fine. However, increasing the size by an order of
magnitude, which is well within the memory capacity of the host machine,
results in an error when creating the UMEM, and the kernel shows the attached
call trace. I'm going to begin investigating this in more detail soon, but any
advice on large XDP memory maps would be much appreciated.
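
For concreteness, the UMEM is registered in one shot from a single user-space
mapping, so the failure shows up at setsockopt(XDP_UMEM_REG) time
(xsk_setsockopt -> xdp_umem_create in the attached trace) rather than when the
memory is touched. Roughly what the failing path looks like, as a simplified
sketch using the libbpf xsk helpers (not the exact code from
xdpsock_multidev.c):

    #include <bpf/xsk.h>
    #include <sys/mman.h>

    #define FRAME_SIZE 2048
    #define NUM_FRAMES (524288ull * 10)          /* the 10x case that fails */

    static struct xsk_umem *umem;
    static struct xsk_ring_prod fq;              /* fill ring */
    static struct xsk_ring_cons cq;              /* completion ring */

    static int create_umem(void)
    {
            size_t size = (size_t)NUM_FRAMES * FRAME_SIZE;   /* ~10 GB */
            void *bufs = mmap(NULL, size, PROT_READ | PROT_WRITE,
                              MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

            if (bufs == MAP_FAILED)
                    return -1;

            /* this is the call that ends in the attached warning when the
             * kernel cannot kmalloc its page tracking array for the umem */
            return xsk_umem__create(&umem, bufs, size, &fq, &cq, NULL);
    }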

The reason for wanting large memory maps is that our use case for XDP is network
emulation - and sometimes that means introducing delay factors that can require
rather large in-memory packet buffers.

If there is interest in including this program in the official BPF samples, I'm
happy to submit a patch. Any comments on the program are also much appreciated.

Thanks

~ ry

[-- Attachment #2: large-umem-trace.txt --]
[-- Type: text/plain, Size: 3188 bytes --]

[ 9879.806272] ------------[ cut here ]------------
[ 9879.811489] WARNING: CPU: 14 PID: 18118 at mm/page_alloc.c:4738 __alloc_pages_nodemask+0x1de/0x2b0
[ 9879.816743] Modules linked in: x86_pkg_temp_thermal i40e igb mlx5_core efivarfs nvme nvme_core
[ 9879.822023] CPU: 14 PID: 18118 Comm: xdpsock_multide Not tainted 5.5.0-rc3-moa+ #1
[ 9879.827399] Hardware name: Supermicro SYS-E300-9D-8CN8TP/X11SDV-8C-TP8F, BIOS 1.1a 05/17/2019
[ 9879.832838] RIP: 0010:__alloc_pages_nodemask+0x1de/0x2b0
[ 9879.838257] Code: 0f 85 1a ff ff ff 65 48 8b 04 25 00 7d 01 00 48 05 28 08 00 00 41 bd 01 00 00 00 48 89 44 24 08 e9 fb fe ff ff 80 e7 20 75 02 <0f> 0b 45 31 ed eb 98 44 8b 64 24 18 65 8b 05 17 ff c0 5a 89 c0 48
[ 9879.849435] RSP: 0018:ffffa8fe00777db8 EFLAGS: 00010246
[ 9879.855031] RAX: 0000000000000000 RBX: 00000000000400c0 RCX: 0000000000000000
[ 9879.860665] RDX: 0000000000000000 RSI: 000000000000000b RDI: 0000000000040dc0
[ 9879.866260] RBP: 0000000000800000 R08: ffffffffa6e27560 R09: ffff93f01c1a0000
[ 9879.871862] R10: ffffcb5c5e48a9c0 R11: 0000000822898000 R12: ffff93f01828c600
[ 9879.877478] R13: 000000000000000b R14: 000000000000000b R15: ffff93f01c1a0000
[ 9879.883079] FS:  00007fd124b8c740(0000) GS:ffff93f020180000(0000) knlGS:0000000000000000
[ 9879.888724] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 9879.894361] CR2: 00007f04f8629ecc CR3: 000000080480a005 CR4: 00000000007606e0
[ 9879.900038] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 9879.905692] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 9879.911207] PKRU: 55555554
[ 9879.916581] Call Trace:
[ 9879.921841]  kmalloc_order+0x1b/0x80
[ 9879.926998]  kmalloc_order_trace+0x1d/0xa0
[ 9879.932083]  xdp_umem_create+0x380/0x450
[ 9879.937114]  xsk_setsockopt+0x1e9/0x270
[ 9879.942151]  __sys_setsockopt+0xd6/0x190
[ 9879.947071]  __x64_sys_setsockopt+0x21/0x30
[ 9879.951872]  do_syscall_64+0x50/0x140
[ 9879.956564]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 9879.961189] RIP: 0033:0x7fd124ca787a
[ 9879.965746] Code: ff ff ff c3 48 8b 15 15 d6 0b 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb bb 0f 1f 80 00 00 00 00 49 89 ca b8 36 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d e6 d5 0b 00 f7 d8 64 89 01 48
[ 9879.975218] RSP: 002b:00007fff8463f498 EFLAGS: 00000206 ORIG_RAX: 0000000000000036
[ 9879.979993] RAX: ffffffffffffffda RBX: 00007fd0a4b8c000 RCX: 00007fd124ca787a
[ 9879.984779] RDX: 0000000000000004 RSI: 000000000000011b RDI: 0000000000000009
[ 9879.989489] RBP: 00007fff8463f580 R08: 0000000000000020 R09: 00005605752b5f70
[ 9879.994115] R10: 00007fff8463f4b0 R11: 0000000000000206 R12: 00007fff8463f590
[ 9879.998684] R13: 0000000080000000 R14: 00005605752b5a40 R15: 00005605752b5ac0
[ 9880.003156] ---[ end trace d2ad1cbfd1388dbe ]---
[ 9880.007555] xdp_umem: kalloc failed
[ 9880.032714] xdpsock_multide[18118]: segfault at 98 ip 0000560573d2aba1 sp 00007fff8463f520 error 4 in xdpsock_multidev[560573d2a000+16000]
[ 9880.045593] Code: ec 10 41 8b 3c 9c 8b 15 4d 07 02 00 c7 45 ec 00 00 00 00 e8 e1 0f 01 00 85 c0 75 51 48 8d 15 46 09 02 00 8b 45 ec 48 8b 14 da <39> 82 98 00 00 00 74 27 85 c0 74 15 48 8d 3d fc 55 01 00 e8 37 f5


[-- Attachment #3: i40e-zerocopy-trace.txt --]
[-- Type: text/plain, Size: 8627 bytes --]

[  328.579154] i40e 0000:b7:00.2: failed to get tracking for 256 queues for VSI 0 err -12
[  328.579280] i40e 0000:b7:00.2: setup of MAIN VSI failed
[  328.579367] i40e 0000:b7:00.2: can't remove VEB 162 with 0 VSIs left
[  328.579467] BUG: kernel NULL pointer dereference, address: 0000000000000000
[  328.579573] #PF: supervisor read access in kernel mode
[  328.579655] #PF: error_code(0x0000) - not-present page
[  328.579737] PGD 0 P4D 0 
[  328.579780] Oops: 0000 [#1] SMP PTI
[  328.579835] CPU: 0 PID: 2157 Comm: xdpsock_multide Not tainted 5.5.0-rc3-moa+ #1
[  328.579951] Hardware name: Supermicro SYS-E300-9D-8CN8TP/X11SDV-8C-TP8F, BIOS 1.1a 05/17/2019
[  328.580093] RIP: 0010:i40e_xdp+0xfc/0x1d0 [i40e]
[  328.580162] Code: 00 ba 01 00 00 00 be 01 00 00 00 e8 3e e9 ff ff 66 83 bd f6 0c 00 00 00 74 27 31 c0 48 8b 95 90 0c 00 00 48 8b 8d d0 0c 00 00 <48> 8b 14 c2 48 83 c0 01 48 89 4a 20 0f b7 95 f6 0c 00 00 39 c2 7f
[  328.580455] RSP: 0018:ffffb2654341f888 EFLAGS: 00010246
[  328.580538] RAX: 0000000000000000 RBX: ffffffffc0358301 RCX: ffffb26540719000
[  328.580650] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff91571fe18810
[  328.580762] RBP: ffff91570ef3b000 R08: 0000000000000000 R09: 00000000000005d5
[  328.580875] R10: 0000000000aaaaaa R11: 0000000000000000 R12: 0000000000000000
[  328.580987] R13: 0000000000000000 R14: ffffb26540719000 R15: ffffffffc032f300
[  328.581099] FS:  00007f079f960740(0000) GS:ffff91571fe00000(0000) knlGS:0000000000000000
[  328.581227] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  328.581316] CR2: 0000000000000000 CR3: 000000085578c005 CR4: 00000000007606f0
[  328.581428] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  328.581541] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  328.585600] PKRU: 55555554
[  328.589632] Call Trace:
[  328.593637]  ? i40e_reconfig_rss_queues+0x100/0x100 [i40e]
[  328.597641]  dev_xdp_install+0x4e/0x70
[  328.601530]  dev_change_xdp_fd+0x11a/0x260
[  328.605335]  do_setlink+0xdd0/0xe80
[  328.609020]  ? get_page_from_freelist+0x738/0x10d0
[  328.612678]  ? on_each_cpu+0x54/0x60
[  328.616233]  ? kmem_cache_alloc_node+0x43/0x1f0
[  328.619856]  ? optimize_nops.isra.0+0x90/0x90
[  328.623457]  rtnl_setlink+0xe5/0x160
[  328.627015]  ? security_capable+0x40/0x60
[  328.630541]  rtnetlink_rcv_msg+0x2b0/0x360
[  328.634046]  ? alloc_file+0x76/0xf0
[  328.637513]  ? alloc_file_pseudo+0xa3/0x110
[  328.640949]  ? rtnl_calcit.isra.0+0x110/0x110
[  328.644369]  netlink_rcv_skb+0x49/0x110
[  328.647757]  netlink_unicast+0x191/0x230
[  328.651140]  netlink_sendmsg+0x225/0x460
[  328.654481]  sock_sendmsg+0x5e/0x60
[  328.657784]  __sys_sendto+0xee/0x160
[  328.661050]  ? __sys_getsockname+0x7e/0xc0
[  328.664306]  ? entry_SYSCALL_64_after_hwframe+0x3e/0xbe
[  328.667564]  ? entry_SYSCALL_64_after_hwframe+0x3e/0xbe
[  328.670715]  ? trace_hardirqs_off_caller+0x2d/0xd0
[  328.673841]  ? do_syscall_64+0x12/0x140
[  328.676955]  __x64_sys_sendto+0x25/0x30
[  328.680025]  do_syscall_64+0x50/0x140
[  328.683046]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
[  328.686097] RIP: 0033:0x7f079fa7b66d
[  328.689126] Code: ff ff ff ff eb bc 0f 1f 80 00 00 00 00 48 8d 05 79 3d 0c 00 41 89 ca 8b 00 85 c0 75 20 45 31 c9 45 31 c0 b8 2c 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 6b c3 66 2e 0f 1f 84 00 00 00 00 00 55 48 83
[  328.695639] RSP: 002b:00007fffe2a66e88 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
[  328.698886] RAX: ffffffffffffffda RBX: 0000000000000005 RCX: 00007f079fa7b66d
[  328.702098] RDX: 0000000000000034 RSI: 00007fffe2a66ea0 RDI: 0000000000000007
[  328.705240] RBP: 00007fffe2a66f20 R08: 0000000000000000 R09: 0000000000000000
[  328.708278] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000007
[  328.711244] R13: 0000000000000006 R14: 0000000000000007 R15: 0000000000000020
[  328.714150] Modules linked in: x86_pkg_temp_thermal i40e igb mlx5_core efivarfs nvme nvme_core
[  328.717149] CR2: 0000000000000000
[  328.720135] ---[ end trace f2aa62f9801a8115 ]---
[  328.723157] RIP: 0010:i40e_xdp+0xfc/0x1d0 [i40e]
[  328.726171] Code: 00 ba 01 00 00 00 be 01 00 00 00 e8 3e e9 ff ff 66 83 bd f6 0c 00 00 00 74 27 31 c0 48 8b 95 90 0c 00 00 48 8b 8d d0 0c 00 00 <48> 8b 14 c2 48 83 c0 01 48 89 4a 20 0f b7 95 f6 0c 00 00 39 c2 7f
[  328.732497] RSP: 0018:ffffb2654341f888 EFLAGS: 00010246
[  328.735655] RAX: 0000000000000000 RBX: ffffffffc0358301 RCX: ffffb26540719000
[  328.738890] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff91571fe18810
[  328.742151] RBP: ffff91570ef3b000 R08: 0000000000000000 R09: 00000000000005d5
[  328.745377] R10: 0000000000aaaaaa R11: 0000000000000000 R12: 0000000000000000
[  328.748616] R13: 0000000000000000 R14: ffffb26540719000 R15: ffffffffc032f300
[  328.751848] FS:  00007f079f960740(0000) GS:ffff91571fe00000(0000) knlGS:0000000000000000
[  328.755130] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  328.758421] CR2: 0000000000000000 CR3: 000000085578c005 CR4: 00000000007606f0
[  328.761777] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  328.765113] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  328.768421] PKRU: 55555554
[  328.780850] BUG: kernel NULL pointer dereference, address: 0000000000000f40
[  328.784237] #PF: supervisor read access in kernel mode
[  328.787635] #PF: error_code(0x0000) - not-present page
[  328.791042] PGD 0 P4D 0 
[  328.794431] Oops: 0000 [#2] SMP PTI
[  328.797827] CPU: 0 PID: 185 Comm: kworker/0:2 Tainted: G      D           5.5.0-rc3-moa+ #1
[  328.801358] Hardware name: Supermicro SYS-E300-9D-8CN8TP/X11SDV-8C-TP8F, BIOS 1.1a 05/17/2019
[  328.804971] Workqueue: i40e i40e_service_task [i40e]
[  328.808596] RIP: 0010:i40e_notify_client_of_netdev_close+0x6/0x60 [i40e]
[  328.812301] Code: 40 10 e8 1d 9c ab ca 48 8b 44 24 28 65 48 33 04 25 28 00 00 00 75 06 48 83 c4 30 5b c3 e8 e2 3c d2 c9 66 90 0f 1f 44 00 00 53 <48> 8b 87 40 0f 00 00 48 8b 98 38 08 00 00 48 85 db 74 44 4c 8b 83
[  328.820156] RSP: 0018:ffffb265405b3de0 EFLAGS: 00010247
[  328.824151] RAX: ffff91570ef42c00 RBX: ffff91570ef58810 RCX: ffff915713eb6818
[  328.828215] RDX: 0000000000000000 RSI: 0000000000000001 RDI: 0000000000000000
[  328.832300] RBP: ffff91570ef58000 R08: 0000000065303469 R09: 8080808080808080
[  328.836435] R10: ffff915717db05ac R11: 0000000000000018 R12: ffff91570ef58008
[  328.840564] R13: 0000000000000053 R14: 0000000000000000 R15: 0ffffd2653fa0980
[  328.844691] FS:  0000000000000000(0000) GS:ffff91571fe00000(0000) knlGS:0000000000000000
[  328.848883] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  328.853088] CR2: 0000000000000f40 CR3: 000000001c80a006 CR4: 00000000007606f0
[  328.857340] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  328.861571] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  328.865681] PKRU: 55555554
[  328.869666] Call Trace:
[  328.873572]  i40e_service_task+0x511/0xa40 [i40e]
[  328.877484]  ? process_one_work+0x1d3/0x390
[  328.881365]  process_one_work+0x1e5/0x390
[  328.885245]  worker_thread+0x4a/0x3d0
[  328.889121]  kthread+0xfb/0x130
[  328.893004]  ? process_one_work+0x390/0x390
[  328.896831]  ? kthread_park+0x90/0x90
[  328.900580]  ret_from_fork+0x3a/0x50
[  328.904312] Modules linked in: x86_pkg_temp_thermal i40e igb mlx5_core efivarfs nvme nvme_core
[  328.908185] CR2: 0000000000000f40
[  328.911944] ---[ end trace f2aa62f9801a8116 ]---
[  328.915635] RIP: 0010:i40e_xdp+0xfc/0x1d0 [i40e]
[  328.919262] Code: 00 ba 01 00 00 00 be 01 00 00 00 e8 3e e9 ff ff 66 83 bd f6 0c 00 00 00 74 27 31 c0 48 8b 95 90 0c 00 00 48 8b 8d d0 0c 00 00 <48> 8b 14 c2 48 83 c0 01 48 89 4a 20 0f b7 95 f6 0c 00 00 39 c2 7f
[  328.926889] RSP: 0018:ffffb2654341f888 EFLAGS: 00010246
[  328.930743] RAX: 0000000000000000 RBX: ffffffffc0358301 RCX: ffffb26540719000
[  328.934672] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff91571fe18810
[  328.938595] RBP: ffff91570ef3b000 R08: 0000000000000000 R09: 00000000000005d5
[  328.942481] R10: 0000000000aaaaaa R11: 0000000000000000 R12: 0000000000000000
[  328.946304] R13: 0000000000000000 R14: ffffb26540719000 R15: ffffffffc032f300
[  328.950102] FS:  0000000000000000(0000) GS:ffff91571fe00000(0000) knlGS:0000000000000000
[  328.953961] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  328.957847] CR2: 0000000000000f40 CR3: 000000001c80a006 CR4: 00000000007606f0
[  328.961743] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  328.965589] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  328.969372] PKRU: 55555554


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: zero-copy between interfaces
  2020-01-13  0:18 zero-copy between interfaces Ryan Goodfellow
@ 2020-01-13  9:16 ` Magnus Karlsson
  2020-01-13 10:43   ` Toke Høiland-Jørgensen
  2020-01-13 15:11   ` Ryan Goodfellow
  2020-01-13 11:41 ` Jesper Dangaard Brouer
  2020-01-17 12:32   ` [Intel-wired-lan] " Björn Töpel
  2 siblings, 2 replies; 49+ messages in thread
From: Magnus Karlsson @ 2020-01-13  9:16 UTC (permalink / raw)
  To: Ryan Goodfellow; +Cc: xdp-newbies

On Mon, Jan 13, 2020 at 1:28 AM Ryan Goodfellow <rgoodfel@isi.edu> wrote:
>
> Greetings XDP folks. I've been working on a zero-copy XDP bridge
> implementation similar to what's described in the following thread.
>
>   https://www.spinics.net/lists/xdp-newbies/msg01333.html
>
> I now have an implementation that is working reasonably well under certain
> conditions for various hardware. The implementation is primarily based on the
> xdpsock_user program in the kernel under samples/bpf. You can find my program
> and corresponding BPF program here.
>
> - https://gitlab.com/mergetb/tech/network-emulation/kernel/blob/v5.5-moa/samples/bpf/xdpsock_multidev.c
> - https://gitlab.com/mergetb/tech/network-emulation/kernel/blob/v5.5-moa/samples/bpf/xdpsock_multidev_kern.c
>
> I have small testbed to run this code on that looks like the following.
>
> Packet forwarding machine:
>     CPU: Intel(R) Xeon(R) D-2146NT CPU @ 2.30GHz (8 core / 16 thread)
>     Memory: 32 GB
>     NICs:
>     - Mellanox ConnectX 4 Dual 100G MCX416A-CCAT (connected at 40G)
>     - Intel X722 10G SFP+
>
> Sender/receiver machines
>     CPU: Intel(R) Xeon(R) D-2146NT CPU @ 2.30GHz (8 core / 16 thread)
>     Memory: 32 GB
>     NICs:
>     - Mellanox ConnectX 4 40G MCX4131A-BCAT
>     - Intel X722 10G SFP+
>
> I could not get zero-copy to work with the i40e driver as it would crash. I've
> attached the corresponding traces from dmesg. The results below are with the
> i40e running in SKB/copy mode. I do have an X710-DA4 that I could plug into the
> server and test with instead of the X722 if that is of interest. In all cases I
> used a single hardware queue via the following.
>
>     ethtool -L <dev> combined 1
>
> The Mellanox cards in zero-copy mode create a sort of shadow set of queues, I
> used ntuple rules to push things through queue 1 (shadows 0) as follows
>
>     ethtool -N <dev> flow-type ether src <mac> action 1
>
> The numbers that I have been able to achieve with this code are the following. MTU
> is 1500 in all cases.
>
>     mlx5: pps ~ 2.4 Mpps, 29 Gbps (driver mode, zero-copy)
>     i40e: pps ~ 700 Kpps, 8 Gbps (skb mode, copy)
>     virtio: pps ~ 200 Kpps, 2.4 Gbps (skb mode, copy, all qemu/kvm VMs)
>
> Are these numbers in the ballpark of what's expected?
>
> One thing I have noticed is that I cannot create large memory maps for the
> packet buffers. For example a frame size of 2048 with 524288 frames (around
> 1G of packets) is fine. However, increasing size by an order of magnitude, which
> is well within the memory capacity of the host machine results in an error when
> creating the UMEM and the kernel shows the attached call trace. I'm going to
> begin investigating this in more detail soon, but if anyone has advice on large
> XDP memory maps that would be much appreciated.

Hi Ryan,

Thanks for taking XDP and AF_XDP for a spin. I will start by fixing
this out-of-memory issue. With your umem size, we are hitting the size
limit of kmalloc. I will fix this by using kvmalloc, which falls back
to vmalloc if kmalloc fails. That should hopefully make it possible
for you to allocate larger umems.
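
In rough terms the change is just swapping the physically contiguous
allocation of the umem's page array for one that can fall back to vmalloc.
Illustrative sketch of the pattern (not the literal patch):

    /* net/xdp/xdp_umem.c, conceptually */

    /* before: needs a large physically contiguous chunk */
    umem->pages = kcalloc(umem->npgs, sizeof(*umem->pages), GFP_KERNEL);

    /* after: kvcalloc() tries kmalloc first, then falls back to vmalloc */
    umem->pages = kvcalloc(umem->npgs, sizeof(*umem->pages), GFP_KERNEL);

    /* ...and the matching free becomes kvfree(umem->pages); */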

> The reason for wanting large memory maps is that our use case for XDP is network
> emulation - and sometimes that means introducing delay factors that can require
> a rather large in-memory packet buffers.
>
> If there is interest in including this program in the official BPF samples I'm happy to
> submit a patch. Any comments on the program are also much appreciated.

More examples are always useful, but the question is whether it should
reside in samples or outside the kernel in some other repo. Is there
some good place in the xdp-project GitHub that could be used for this
purpose?

/Magnus

> Thanks
>
> ~ ry

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: zero-copy between interfaces
  2020-01-13  9:16 ` Magnus Karlsson
@ 2020-01-13 10:43   ` Toke Høiland-Jørgensen
  2020-01-13 15:25     ` Ryan Goodfellow
  2020-01-13 15:11   ` Ryan Goodfellow
  1 sibling, 1 reply; 49+ messages in thread
From: Toke Høiland-Jørgensen @ 2020-01-13 10:43 UTC (permalink / raw)
  To: Magnus Karlsson, Ryan Goodfellow; +Cc: xdp-newbies

Magnus Karlsson <magnus.karlsson@gmail.com> writes:

> On Mon, Jan 13, 2020 at 1:28 AM Ryan Goodfellow <rgoodfel@isi.edu> wrote:
>>
>> Greetings XDP folks. I've been working on a zero-copy XDP bridge
>> implementation similar to what's described in the following thread.
>>
>>   https://www.spinics.net/lists/xdp-newbies/msg01333.html
>>
>> I now have an implementation that is working reasonably well under certain
>> conditions for various hardware. The implementation is primarily based on the
>> xdpsock_user program in the kernel under samples/bpf. You can find my program
>> and corresponding BPF program here.
>>
>> - https://gitlab.com/mergetb/tech/network-emulation/kernel/blob/v5.5-moa/samples/bpf/xdpsock_multidev.c
>> - https://gitlab.com/mergetb/tech/network-emulation/kernel/blob/v5.5-moa/samples/bpf/xdpsock_multidev_kern.c
>>
>> I have small testbed to run this code on that looks like the following.
>>
>> Packet forwarding machine:
>>     CPU: Intel(R) Xeon(R) D-2146NT CPU @ 2.30GHz (8 core / 16 thread)
>>     Memory: 32 GB
>>     NICs:
>>     - Mellanox ConnectX 4 Dual 100G MCX416A-CCAT (connected at 40G)
>>     - Intel X722 10G SFP+
>>
>> Sender/receiver machines
>>     CPU: Intel(R) Xeon(R) D-2146NT CPU @ 2.30GHz (8 core / 16 thread)
>>     Memory: 32 GB
>>     NICs:
>>     - Mellanox ConnectX 4 40G MCX4131A-BCAT
>>     - Intel X722 10G SFP+
>>
>> I could not get zero-copy to work with the i40e driver as it would crash. I've
>> attached the corresponding traces from dmesg. The results below are with the
>> i40e running in SKB/copy mode. I do have an X710-DA4 that I could plug into the
>> server and test with instead of the X722 if that is of interest. In all cases I
>> used a single hardware queue via the following.
>>
>>     ethtool -L <dev> combined 1
>>
>> The Mellanox cards in zero-copy mode create a sort of shadow set of queues, I
>> used ntuple rules to push things through queue 1 (shadows 0) as follows
>>
>>     ethtool -N <dev> flow-type ether src <mac> action 1
>>
>> The numbers that I have been able to achieve with this code are the following. MTU
>> is 1500 in all cases.
>>
>>     mlx5: pps ~ 2.4 Mpps, 29 Gbps (driver mode, zero-copy)
>>     i40e: pps ~ 700 Kpps, 8 Gbps (skb mode, copy)
>>     virtio: pps ~ 200 Kpps, 2.4 Gbps (skb mode, copy, all qemu/kvm VMs)
>>
>> Are these numbers in the ballpark of what's expected?
>>
>> One thing I have noticed is that I cannot create large memory maps for the
>> packet buffers. For example a frame size of 2048 with 524288 frames (around
>> 1G of packets) is fine. However, increasing size by an order of magnitude, which
>> is well within the memory capacity of the host machine results in an error when
>> creating the UMEM and the kernel shows the attached call trace. I'm going to
>> begin investigating this in more detail soon, but if anyone has advice on large
>> XDP memory maps that would be much appreciated.
>
> Hi Ryan,
>
> Thanks for taking XDP and AF_XDP for a spin. I will start by fixing
> this out-of-memory issue. With your umem size, we are hitting the size
> limit of kmalloc. I will fix this by using kvmalloc that tries to
> allocate with vmalloc if kmalloc fails. Should hopefully make it
> possible for you to allocate larger umems.
>
>> The reason for wanting large memory maps is that our use case for XDP is network
>> emulation - and sometimes that means introducing delay factors that can require
>> a rather large in-memory packet buffers.
>>
>> If there is interest in including this program in the official BPF samples I'm happy to
>> submit a patch. Any comments on the program are also much appreciated.
>
> More examples are always useful, but the question is if it should
> reside in samples or outside the kernel in some other repo? Is there
> some good place in xdp-project github that could be used for this
> purpose?

We could certainly create something; either a new xdp-samples
repository or an example-programs/ subdir of the xdp-tutorial. Which of
those makes the most sense depends on the size of the program, I think...

-Toke

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: zero-copy between interfaces
  2020-01-13  0:18 zero-copy between interfaces Ryan Goodfellow
  2020-01-13  9:16 ` Magnus Karlsson
@ 2020-01-13 11:41 ` Jesper Dangaard Brouer
  2020-01-13 15:28   ` Ryan Goodfellow
  2020-01-17 12:32   ` [Intel-wired-lan] " Björn Töpel
  2 siblings, 1 reply; 49+ messages in thread
From: Jesper Dangaard Brouer @ 2020-01-13 11:41 UTC (permalink / raw)
  To: Ryan Goodfellow; +Cc: xdp-newbies

On Mon, 13 Jan 2020 00:18:36 +0000
Ryan Goodfellow <rgoodfel@isi.edu> wrote:

> The numbers that I have been able to achieve with this code are the following. MTU
> is 1500 in all cases.
> 
>     mlx5: pps ~ 2.4 Mpps, 29 Gbps (driver mode, zero-copy)
>     i40e: pps ~ 700 Kpps, 8 Gbps (skb mode, copy)
>     virtio: pps ~ 200 Kpps, 2.4 Gbps (skb mode, copy, all qemu/kvm VMs)
> 
> Are these numbers in the ballpark of what's expected?

I would say they are too slow / low.

Have you remembered to do bulking?


-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: zero-copy between interfaces
  2020-01-13  9:16 ` Magnus Karlsson
  2020-01-13 10:43   ` Toke Høiland-Jørgensen
@ 2020-01-13 15:11   ` Ryan Goodfellow
  2020-01-14  9:59     ` Magnus Karlsson
  1 sibling, 1 reply; 49+ messages in thread
From: Ryan Goodfellow @ 2020-01-13 15:11 UTC (permalink / raw)
  To: Magnus Karlsson; +Cc: xdp-newbies

On Mon, Jan 13, 2020 at 10:16:47AM +0100, Magnus Karlsson wrote:
> Thanks for taking XDP and AF_XDP for a spin. I will start by fixing
> this out-of-memory issue. With your umem size, we are hitting the size
> limit of kmalloc. I will fix this by using kvmalloc that tries to
> allocate with vmalloc if kmalloc fails. Should hopefully make it
> possible for you to allocate larger umems.
> 

Thanks!

-- 
~ ry

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: zero-copy between interfaces
  2020-01-13 10:43   ` Toke Høiland-Jørgensen
@ 2020-01-13 15:25     ` Ryan Goodfellow
  2020-01-13 17:09       ` Toke Høiland-Jørgensen
  0 siblings, 1 reply; 49+ messages in thread
From: Ryan Goodfellow @ 2020-01-13 15:25 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen; +Cc: Magnus Karlsson, xdp-newbies

On Mon, Jan 13, 2020 at 11:43:02AM +0100, Toke Høiland-Jørgensen wrote:
>> Magnus Karlsson <magnus.karlsson@gmail.com> writes:
>>> On Mon, Jan 13, 2020 at 1:28 AM Ryan Goodfellow <rgoodfel@isi.edu> wrote:
>>>  The reason for wanting large memory maps is that our use case for XDP is network
>>>  emulation - and sometimes that means introducing delay factors that can require
>>>  a rather large in-memory packet buffers.
>>> 
>>>  If there is interest in including this program in the official BPF samples I'm happy to
>>>  submit a patch. Any comments on the program are also much appreciated.
>> 
>>  More examples are always useful, but the question is if it should
>>  reside in samples or outside the kernel in some other repo? Is there
>>  some good place in xdp-project github that could be used for this
>>  purpose?
>
> We could certainly create something; either a new xdp-samples
> repository, or an example-programs/ subdir of the xdp-tutorial? Which of
> those makes the most sense depends on the size of the program I think...
> 
> -Toke
> 

I'm happy to provide patches or pull-requests in either case. The userspace
program is 1 file with 555 lines and the BPF program is 28 lines. I've 
tested the userspace program with the 5.5 kernel. The BPF program requires 
clang-9 to work properly (due to BTF features IIRC).
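
Building the BPF side standalone looks roughly like the usual clang
invocation below (illustrative; inside samples/bpf the kernel build system
takes care of this, and you will need the appropriate -I paths):

    clang-9 -O2 -g -target bpf -c xdpsock_multidev_kern.c -o xdpsock_multidev_kern.o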

- https://gitlab.com/mergetb/tech/network-emulation/kernel/blob/v5.5-moa/samples/bpf/xdpsock_multidev.c
- https://gitlab.com/mergetb/tech/network-emulation/kernel/blob/v5.5-moa/samples/bpf/xdpsock_multidev_kern.c

The primary usefulness of this program relative to what's out there is that it
pushes packets between interfaces using a common memory map.

-- 
~ ry

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: zero-copy between interfaces
  2020-01-13 11:41 ` Jesper Dangaard Brouer
@ 2020-01-13 15:28   ` Ryan Goodfellow
  2020-01-13 17:04     ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 49+ messages in thread
From: Ryan Goodfellow @ 2020-01-13 15:28 UTC (permalink / raw)
  To: Jesper Dangaard Brouer; +Cc: xdp-newbies

On Mon, Jan 13, 2020 at 12:41:34PM +0100, Jesper Dangaard Brouer wrote:
> On Mon, 13 Jan 2020 00:18:36 +0000
> Ryan Goodfellow <rgoodfel@isi.edu> wrote:
> 
> > The numbers that I have been able to achieve with this code are the following. MTU
> > is 1500 in all cases.
> > 
> >     mlx5: pps ~ 2.4 Mpps, 29 Gbps (driver mode, zero-copy)
> >     i40e: pps ~ 700 Kpps, 8 Gbps (skb mode, copy)
> >     virtio: pps ~ 200 Kpps, 2.4 Gbps (skb mode, copy, all qemu/kvm VMs)
> > 
> > Are these numbers in the ballpark of what's expected?
> 
> I would say they are too slow / low.
> 
> Have you remembered to do bulking?
> 

I am using a batch size of 256.
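
For reference, the forwarding loop moves descriptors in batches of that size;
boiled down, it is the usual l2fwd-style pattern (sketch using the libbpf xsk
helpers, not my exact code; the real loop also services the fill/completion
rings and kicks the kernel when needed):

    #include <bpf/xsk.h>

    #define BATCH 256

    /* both sockets share the same umem, so only descriptors move;
     * the packet data itself is never copied */
    static void forward(struct xsk_ring_cons *rx, struct xsk_ring_prod *tx)
    {
            __u32 idx_rx = 0, idx_tx = 0;
            unsigned int i, n;

            n = xsk_ring_cons__peek(rx, BATCH, &idx_rx);
            if (!n)
                    return;

            /* wait for TX ring space (simplified; real code polls/kicks) */
            while (xsk_ring_prod__reserve(tx, n, &idx_tx) != n)
                    ;

            for (i = 0; i < n; i++)
                    *xsk_ring_prod__tx_desc(tx, idx_tx + i) =
                            *xsk_ring_cons__rx_desc(rx, idx_rx + i);

            xsk_ring_prod__submit(tx, n);
            xsk_ring_cons__release(rx, n);
    }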

-- 
~ ry

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: zero-copy between interfaces
  2020-01-13 15:28   ` Ryan Goodfellow
@ 2020-01-13 17:04     ` Jesper Dangaard Brouer
  2020-01-17 17:54       ` Ryan Goodfellow
  0 siblings, 1 reply; 49+ messages in thread
From: Jesper Dangaard Brouer @ 2020-01-13 17:04 UTC (permalink / raw)
  To: Ryan Goodfellow; +Cc: brouer, xdp-newbies

On Mon, 13 Jan 2020 10:28:00 -0500
Ryan Goodfellow <rgoodfel@isi.edu> wrote:

> On Mon, Jan 13, 2020 at 12:41:34PM +0100, Jesper Dangaard Brouer wrote:
> > On Mon, 13 Jan 2020 00:18:36 +0000
> > Ryan Goodfellow <rgoodfel@isi.edu> wrote:
> >   
> > > The numbers that I have been able to achieve with this code are the following. MTU
> > > is 1500 in all cases.
> > > 
> > >     mlx5: pps ~ 2.4 Mpps, 29 Gbps (driver mode, zero-copy)
> > >     i40e: pps ~ 700 Kpps, 8 Gbps (skb mode, copy)
> > >     virtio: pps ~ 200 Kpps, 2.4 Gbps (skb mode, copy, all qemu/kvm VMs)
> > > 
> > > Are these numbers in the ballpark of what's expected?  
> > 
> > I would say they are too slow / low.
> > 
> > Have you remembered to do bulking?
> >   
> 
> I am using a batch size of 256.

Hmm...

Maybe you can test with the xdp_redirect_map program in samples/bpf/ and
compare the performance on this hardware?
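
Something along these lines, with the ingress and egress interfaces as the
two arguments (check the usage text on your tree for the exact flags):

    ./samples/bpf/xdp_redirect_map <ifname_in> <ifname_out>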

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: zero-copy between interfaces
  2020-01-13 15:25     ` Ryan Goodfellow
@ 2020-01-13 17:09       ` Toke Høiland-Jørgensen
  2020-01-14  7:47         ` Magnus Karlsson
  0 siblings, 1 reply; 49+ messages in thread
From: Toke Høiland-Jørgensen @ 2020-01-13 17:09 UTC (permalink / raw)
  To: Ryan Goodfellow; +Cc: Magnus Karlsson, xdp-newbies

Ryan Goodfellow <rgoodfel@isi.edu> writes:

> On Mon, Jan 13, 2020 at 11:43:02AM +0100, Toke Høiland-Jørgensen wrote:
>>> Magnus Karlsson <magnus.karlsson@gmail.com> writes:
>>>> On Mon, Jan 13, 2020 at 1:28 AM Ryan Goodfellow <rgoodfel@isi.edu> wrote:
>>>>  The reason for wanting large memory maps is that our use case for XDP is network
>>>>  emulation - and sometimes that means introducing delay factors that can require
>>>>  a rather large in-memory packet buffers.
>>>> 
>>>>  If there is interest in including this program in the official BPF samples I'm happy to
>>>>  submit a patch. Any comments on the program are also much appreciated.
>>> 
>>>  More examples are always useful, but the question is if it should
>>>  reside in samples or outside the kernel in some other repo? Is there
>>>  some good place in xdp-project github that could be used for this
>>>  purpose?
>>
>> We could certainly create something; either a new xdp-samples
>> repository, or an example-programs/ subdir of the xdp-tutorial? Which of
>> those makes the most sense depends on the size of the program I think...
>> 
>> -Toke
>> 
>
> I'm happy to provide patches or pull-requests in either case. The userspace
> program is 1 file with 555 lines and the BPF program is 28 lines. I've 
> tested the userspace program with the 5.5 kernel. The BPF program requires 
> clang-9 to work properly (due to BTF features IIRC).
>
> - https://gitlab.com/mergetb/tech/network-emulation/kernel/blob/v5.5-moa/samples/bpf/xdpsock_multidev.c
> - https://gitlab.com/mergetb/tech/network-emulation/kernel/blob/v5.5-moa/samples/bpf/xdpsock_multidev_kern.c
>
> The primary usefulness of this program relative to what's out there is
> that it pushes packets between interfaces using a common memory map.

Hmm, yeah, this could live either in samples or as a standalone repo. IDK,
Magnus, what do you think?

-Toke

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: zero-copy between interfaces
  2020-01-13 17:09       ` Toke Høiland-Jørgensen
@ 2020-01-14  7:47         ` Magnus Karlsson
  2020-01-14  8:11           ` Toke Høiland-Jørgensen
  0 siblings, 1 reply; 49+ messages in thread
From: Magnus Karlsson @ 2020-01-14  7:47 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen; +Cc: Ryan Goodfellow, xdp-newbies

On Mon, Jan 13, 2020 at 6:09 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>
> Ryan Goodfellow <rgoodfel@isi.edu> writes:
>
> > On Mon, Jan 13, 2020 at 11:43:02AM +0100, Toke Høiland-Jørgensen wrote:
> >>> Magnus Karlsson <magnus.karlsson@gmail.com> writes:
> >>>> On Mon, Jan 13, 2020 at 1:28 AM Ryan Goodfellow <rgoodfel@isi.edu> wrote:
> >>>>  The reason for wanting large memory maps is that our use case for XDP is network
> >>>>  emulation - and sometimes that means introducing delay factors that can require
> >>>>  a rather large in-memory packet buffers.
> >>>>
> >>>>  If there is interest in including this program in the official BPF samples I'm happy to
> >>>>  submit a patch. Any comments on the program are also much appreciated.
> >>>
> >>>  More examples are always useful, but the question is if it should
> >>>  reside in samples or outside the kernel in some other repo? Is there
> >>>  some good place in xdp-project github that could be used for this
> >>>  purpose?
> >>
> >> We could certainly create something; either a new xdp-samples
> >> repository, or an example-programs/ subdir of the xdp-tutorial? Which of
> >> those makes the most sense depends on the size of the program I think...
> >>
> >> -Toke
> >>
> >
> > I'm happy to provide patches or pull-requests in either case. The userspace
> > program is 1 file with 555 lines and the BPF program is 28 lines. I've
> > tested the userspace program with the 5.5 kernel. The BPF program requires
> > clang-9 to work properly (due to BTF features IIRC).
> >
> > - https://gitlab.com/mergetb/tech/network-emulation/kernel/blob/v5.5-moa/samples/bpf/xdpsock_multidev.c
> > - https://gitlab.com/mergetb/tech/network-emulation/kernel/blob/v5.5-moa/samples/bpf/xdpsock_multidev_kern.c
> >
> > The primary usefulness of this program relative to what's out there is
> > that it pushes packets between interfaces using a common memory map.
>
> Hmm, yeah, this could live in either samples or as standalone. IDK,
> Magnus what do you think?

Let us try the kernel samples first. Then, if the folks on the
mailing list do not want more samples in there, we can fall back on the
xdp-tutorial repo.

/Magnus

> -Toke
>

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: zero-copy between interfaces
  2020-01-14  7:47         ` Magnus Karlsson
@ 2020-01-14  8:11           ` Toke Høiland-Jørgensen
  0 siblings, 0 replies; 49+ messages in thread
From: Toke Høiland-Jørgensen @ 2020-01-14  8:11 UTC (permalink / raw)
  To: Magnus Karlsson; +Cc: Ryan Goodfellow, xdp-newbies

Magnus Karlsson <magnus.karlsson@gmail.com> writes:

> On Mon, Jan 13, 2020 at 6:09 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>>
>> Ryan Goodfellow <rgoodfel@isi.edu> writes:
>>
>> > On Mon, Jan 13, 2020 at 11:43:02AM +0100, Toke Høiland-Jørgensen wrote:
>> >>> Magnus Karlsson <magnus.karlsson@gmail.com> writes:
>> >>>> On Mon, Jan 13, 2020 at 1:28 AM Ryan Goodfellow <rgoodfel@isi.edu> wrote:
>> >>>>  The reason for wanting large memory maps is that our use case for XDP is network
>> >>>>  emulation - and sometimes that means introducing delay factors that can require
>> >>>>  a rather large in-memory packet buffers.
>> >>>>
>> >>>>  If there is interest in including this program in the official BPF samples I'm happy to
>> >>>>  submit a patch. Any comments on the program are also much appreciated.
>> >>>
>> >>>  More examples are always useful, but the question is if it should
>> >>>  reside in samples or outside the kernel in some other repo? Is there
>> >>>  some good place in xdp-project github that could be used for this
>> >>>  purpose?
>> >>
>> >> We could certainly create something; either a new xdp-samples
>> >> repository, or an example-programs/ subdir of the xdp-tutorial? Which of
>> >> those makes the most sense depends on the size of the program I think...
>> >>
>> >> -Toke
>> >>
>> >
>> > I'm happy to provide patches or pull-requests in either case. The userspace
>> > program is 1 file with 555 lines and the BPF program is 28 lines. I've
>> > tested the userspace program with the 5.5 kernel. The BPF program requires
>> > clang-9 to work properly (due to BTF features IIRC).
>> >
>> > - https://gitlab.com/mergetb/tech/network-emulation/kernel/blob/v5.5-moa/samples/bpf/xdpsock_multidev.c
>> > - https://gitlab.com/mergetb/tech/network-emulation/kernel/blob/v5.5-moa/samples/bpf/xdpsock_multidev_kern.c
>> >
>> > The primary usefulness of this program relative to what's out there is
>> > that it pushes packets between interfaces using a common memory map.
>>
>> Hmm, yeah, this could live in either samples or as standalone. IDK,
>> Magnus what do you think?
>
> Let us try the kernel samples first. Then and if the folks on the
> mailing list do not want more samples in there, we fall back on the
> xdp-tutorial repo.

ACK, SGTM :)

-Toke

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: zero-copy between interfaces
  2020-01-13 15:11   ` Ryan Goodfellow
@ 2020-01-14  9:59     ` Magnus Karlsson
  2020-01-14 20:52       ` Ryan Goodfellow
  0 siblings, 1 reply; 49+ messages in thread
From: Magnus Karlsson @ 2020-01-14  9:59 UTC (permalink / raw)
  To: Ryan Goodfellow; +Cc: xdp-newbies

On Mon, Jan 13, 2020 at 4:12 PM Ryan Goodfellow <rgoodfel@isi.edu> wrote:
>
> On Mon, Jan 13, 2020 at 10:16:47AM +0100, Magnus Karlsson wrote:
> > Thanks for taking XDP and AF_XDP for a spin. I will start by fixing
> > this out-of-memory issue. With your umem size, we are hitting the size
> > limit of kmalloc. I will fix this by using kvmalloc that tries to
> > allocate with vmalloc if kmalloc fails. Should hopefully make it
> > possible for you to allocate larger umems.
> >
>
> Thanks!

Just sent out a patch on the mailing list. Would be great if you could
try it out.

/Magnus

> --
> ~ ry

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: zero-copy between interfaces
  2020-01-14  9:59     ` Magnus Karlsson
@ 2020-01-14 20:52       ` Ryan Goodfellow
  2020-01-15  1:41         ` Ryan Goodfellow
  2020-01-17 17:40         ` William Tu
  0 siblings, 2 replies; 49+ messages in thread
From: Ryan Goodfellow @ 2020-01-14 20:52 UTC (permalink / raw)
  To: Magnus Karlsson; +Cc: xdp-newbies

On Tue, Jan 14, 2020 at 10:59:19AM +0100, Magnus Karlsson wrote:
> 
> Just sent out a patch on the mailing list. Would be great if you could
> try it out.

Thanks for the quick turnaround. I gave this patch a go, both in the bpf-next
tree and manually applied to the 5.5.0-rc3 branch I've been working with up to
this point. It does allow for allocating more memory; however, packet
forwarding no longer works. I did not see any complaints from dmesg, but here
is an example iperf3 session from a client that worked before.

ry@xd2:~$ iperf3 -c 10.1.0.2
Connecting to host 10.1.0.2, port 5201
[  5] local 10.1.0.1 port 53304 connected to 10.1.0.2 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  5.91 MBytes  49.5 Mbits/sec    2   1.41 KBytes
[  5]   1.00-2.00   sec  0.00 Bytes  0.00 bits/sec    1   1.41 KBytes
[  5]   2.00-3.00   sec  0.00 Bytes  0.00 bits/sec    0   1.41 KBytes
[  5]   3.00-4.00   sec  0.00 Bytes  0.00 bits/sec    1   1.41 KBytes
[  5]   4.00-5.00   sec  0.00 Bytes  0.00 bits/sec    0   1.41 KBytes
[  5]   5.00-6.00   sec  0.00 Bytes  0.00 bits/sec    0   1.41 KBytes
[  5]   6.00-7.00   sec  0.00 Bytes  0.00 bits/sec    1   1.41 KBytes
[  5]   7.00-8.00   sec  0.00 Bytes  0.00 bits/sec    0   1.41 KBytes
[  5]   8.00-9.00   sec  0.00 Bytes  0.00 bits/sec    0   1.41 KBytes
^C[  5]  10.00-139.77 sec  0.00 Bytes  0.00 bits/sec    4   1.41 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-139.77 sec  5.91 MBytes   355 Kbits/sec    9             sender
[  5]   0.00-139.77 sec  0.00 Bytes  0.00 bits/sec                  receiver
iperf3: interrupt - the client has terminated

I'll continue to investigate and report back with anything that I find.

-- 
~ ry

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: zero-copy between interfaces
  2020-01-14 20:52       ` Ryan Goodfellow
@ 2020-01-15  1:41         ` Ryan Goodfellow
  2020-01-15  7:40           ` Magnus Karlsson
  2020-01-17 17:40         ` William Tu
  1 sibling, 1 reply; 49+ messages in thread
From: Ryan Goodfellow @ 2020-01-15  1:41 UTC (permalink / raw)
  To: Magnus Karlsson; +Cc: xdp-newbies

On Tue, Jan 14, 2020 at 03:52:50PM -0500, Ryan Goodfellow wrote:
> On Tue, Jan 14, 2020 at 10:59:19AM +0100, Magnus Karlsson wrote:
> > 
> > Just sent out a patch on the mailing list. Would be great if you could
> > try it out.
> 
> Thanks for the quick turnaround. I gave this patch a go, both in the bpf-next
> tree and manually applied to the 5.5.0-rc3 branch I've been working with up to 
> this point. It does allow for allocating more memory, however packet 
> forwarding no longer works. I did not see any complaints from dmesg, but here 
> is an example iperf3 session from a client that worked before.
> 
> ry@xd2:~$ iperf3 -c 10.1.0.2
> Connecting to host 10.1.0.2, port 5201
> [  5] local 10.1.0.1 port 53304 connected to 10.1.0.2 port 5201
> [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
> [  5]   0.00-1.00   sec  5.91 MBytes  49.5 Mbits/sec    2   1.41 KBytes
> [  5]   1.00-2.00   sec  0.00 Bytes  0.00 bits/sec    1   1.41 KBytes
> [  5]   2.00-3.00   sec  0.00 Bytes  0.00 bits/sec    0   1.41 KBytes
> [  5]   3.00-4.00   sec  0.00 Bytes  0.00 bits/sec    1   1.41 KBytes
> [  5]   4.00-5.00   sec  0.00 Bytes  0.00 bits/sec    0   1.41 KBytes
> [  5]   5.00-6.00   sec  0.00 Bytes  0.00 bits/sec    0   1.41 KBytes
> [  5]   6.00-7.00   sec  0.00 Bytes  0.00 bits/sec    1   1.41 KBytes
> [  5]   7.00-8.00   sec  0.00 Bytes  0.00 bits/sec    0   1.41 KBytes
> [  5]   8.00-9.00   sec  0.00 Bytes  0.00 bits/sec    0   1.41 KBytes
> ^C[  5]  10.00-139.77 sec  0.00 Bytes  0.00 bits/sec    4   1.41 KBytes
> - - - - - - - - - - - - - - - - - - - - - - - - -
> [ ID] Interval           Transfer     Bitrate         Retr
> [  5]   0.00-139.77 sec  5.91 MBytes   355 Kbits/sec    9             sender
> [  5]   0.00-139.77 sec  0.00 Bytes  0.00 bits/sec                  receiver
> iperf3: interrupt - the client has terminated
> 
> I'll continue to investigate and report back with anything that I find.

Interestingly, I found this behavior exists in the bpf-next tree independent
of whether the patch is present.

I also gave the 5.5.0-rc6 branch a try (without the patch) and packets forward
OK.

-- 
~ ry

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: zero-copy between interfaces
  2020-01-15  1:41         ` Ryan Goodfellow
@ 2020-01-15  7:40           ` Magnus Karlsson
  2020-01-15  8:20             ` Magnus Karlsson
  0 siblings, 1 reply; 49+ messages in thread
From: Magnus Karlsson @ 2020-01-15  7:40 UTC (permalink / raw)
  To: Ryan Goodfellow; +Cc: xdp-newbies

On Wed, Jan 15, 2020 at 2:41 AM Ryan Goodfellow <rgoodfel@isi.edu> wrote:
>
> On Tue, Jan 14, 2020 at 03:52:50PM -0500, Ryan Goodfellow wrote:
> > On Tue, Jan 14, 2020 at 10:59:19AM +0100, Magnus Karlsson wrote:
> > >
> > > Just sent out a patch on the mailing list. Would be great if you could
> > > try it out.
> >
> > Thanks for the quick turnaround. I gave this patch a go, both in the bpf-next
> > tree and manually applied to the 5.5.0-rc3 branch I've been working with up to
> > this point. It does allow for allocating more memory, however packet
> > forwarding no longer works. I did not see any complaints from dmesg, but here
> > is an example iperf3 session from a client that worked before.
> >
> > ry@xd2:~$ iperf3 -c 10.1.0.2
> > Connecting to host 10.1.0.2, port 5201
> > [  5] local 10.1.0.1 port 53304 connected to 10.1.0.2 port 5201
> > [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
> > [  5]   0.00-1.00   sec  5.91 MBytes  49.5 Mbits/sec    2   1.41 KBytes
> > [  5]   1.00-2.00   sec  0.00 Bytes  0.00 bits/sec    1   1.41 KBytes
> > [  5]   2.00-3.00   sec  0.00 Bytes  0.00 bits/sec    0   1.41 KBytes
> > [  5]   3.00-4.00   sec  0.00 Bytes  0.00 bits/sec    1   1.41 KBytes
> > [  5]   4.00-5.00   sec  0.00 Bytes  0.00 bits/sec    0   1.41 KBytes
> > [  5]   5.00-6.00   sec  0.00 Bytes  0.00 bits/sec    0   1.41 KBytes
> > [  5]   6.00-7.00   sec  0.00 Bytes  0.00 bits/sec    1   1.41 KBytes
> > [  5]   7.00-8.00   sec  0.00 Bytes  0.00 bits/sec    0   1.41 KBytes
> > [  5]   8.00-9.00   sec  0.00 Bytes  0.00 bits/sec    0   1.41 KBytes
> > ^C[  5]  10.00-139.77 sec  0.00 Bytes  0.00 bits/sec    4   1.41 KBytes
> > - - - - - - - - - - - - - - - - - - - - - - - - -
> > [ ID] Interval           Transfer     Bitrate         Retr
> > [  5]   0.00-139.77 sec  5.91 MBytes   355 Kbits/sec    9             sender
> > [  5]   0.00-139.77 sec  0.00 Bytes  0.00 bits/sec                  receiver
> > iperf3: interrupt - the client has terminated
> >
> > I'll continue to investigate and report back with anything that I find.
>
> Interestingly I found this behavior to exist in the bpf-next tree independent
> of the patch being present.

Ryan,

Could you please do a bisect on it? In the 12 commits after the merge
commit below there are a number of sensitive rewrites of the ring access
functions. Maybe one of them breaks your code. When you say "packet
forwarding no longer works", do you mean it works for a second or so and
then no packets come through? What HW are you using?

commit ce3cec27933c069d2015a81e59b93eb656fe7ee4
Merge: 99cacdc 1d9cb1f
Author: Alexei Starovoitov <ast@kernel.org>
Date:   Fri Dec 20 16:00:10 2019 -0800

    Merge branch 'xsk-cleanup'

    Magnus Karlsson says:

    ====================
    This patch set cleans up the ring access functions of AF_XDP in hope
    that it will now be easier to understand and maintain. I used to get a
    headache every time I looked at this code in order to really understand it,
    but now I do think it is a lot less painful.
    <snip>
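
A bisect over just that range would look something like this (endpoints as I
read them; please double-check against your tree):

    git bisect start
    git bisect bad HEAD          # the bpf-next commit where forwarding breaks
    git bisect good 99cacdc      # first parent of the merge above, before the series
    # then for each step: rebuild, reboot, re-run the forwarding test, and
    # mark the result with "git bisect good" or "git bisect bad"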

/Magnus

> I also gave the 5.5.0-rc6 branch a try (without the patch) and packets forward
> OK.
>
> --
> ~ ry

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: zero-copy between interfaces
  2020-01-15  7:40           ` Magnus Karlsson
@ 2020-01-15  8:20             ` Magnus Karlsson
  2020-01-16  2:04               ` Ryan Goodfellow
  0 siblings, 1 reply; 49+ messages in thread
From: Magnus Karlsson @ 2020-01-15  8:20 UTC (permalink / raw)
  To: Ryan Goodfellow; +Cc: xdp-newbies

On Wed, Jan 15, 2020 at 8:40 AM Magnus Karlsson
<magnus.karlsson@gmail.com> wrote:
>
> On Wed, Jan 15, 2020 at 2:41 AM Ryan Goodfellow <rgoodfel@isi.edu> wrote:
> >
> > On Tue, Jan 14, 2020 at 03:52:50PM -0500, Ryan Goodfellow wrote:
> > > On Tue, Jan 14, 2020 at 10:59:19AM +0100, Magnus Karlsson wrote:
> > > >
> > > > Just sent out a patch on the mailing list. Would be great if you could
> > > > try it out.
> > >
> > > Thanks for the quick turnaround. I gave this patch a go, both in the bpf-next
> > > tree and manually applied to the 5.5.0-rc3 branch I've been working with up to
> > > this point. It does allow for allocating more memory, however packet
> > > forwarding no longer works. I did not see any complaints from dmesg, but here
> > > is an example iperf3 session from a client that worked before.
> > >
> > > ry@xd2:~$ iperf3 -c 10.1.0.2
> > > Connecting to host 10.1.0.2, port 5201
> > > [  5] local 10.1.0.1 port 53304 connected to 10.1.0.2 port 5201
> > > [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
> > > [  5]   0.00-1.00   sec  5.91 MBytes  49.5 Mbits/sec    2   1.41 KBytes
> > > [  5]   1.00-2.00   sec  0.00 Bytes  0.00 bits/sec    1   1.41 KBytes
> > > [  5]   2.00-3.00   sec  0.00 Bytes  0.00 bits/sec    0   1.41 KBytes
> > > [  5]   3.00-4.00   sec  0.00 Bytes  0.00 bits/sec    1   1.41 KBytes
> > > [  5]   4.00-5.00   sec  0.00 Bytes  0.00 bits/sec    0   1.41 KBytes
> > > [  5]   5.00-6.00   sec  0.00 Bytes  0.00 bits/sec    0   1.41 KBytes
> > > [  5]   6.00-7.00   sec  0.00 Bytes  0.00 bits/sec    1   1.41 KBytes
> > > [  5]   7.00-8.00   sec  0.00 Bytes  0.00 bits/sec    0   1.41 KBytes
> > > [  5]   8.00-9.00   sec  0.00 Bytes  0.00 bits/sec    0   1.41 KBytes
> > > ^C[  5]  10.00-139.77 sec  0.00 Bytes  0.00 bits/sec    4   1.41 KBytes
> > > - - - - - - - - - - - - - - - - - - - - - - - - -
> > > [ ID] Interval           Transfer     Bitrate         Retr
> > > [  5]   0.00-139.77 sec  5.91 MBytes   355 Kbits/sec    9             sender
> > > [  5]   0.00-139.77 sec  0.00 Bytes  0.00 bits/sec                  receiver
> > > iperf3: interrupt - the client has terminated
> > >
> > > I'll continue to investigate and report back with anything that I find.
> >
> > Interestingly I found this behavior to exist in the bpf-next tree independent
> > of the patch being present.
>
> Ryan,
>
> Could you please do a bisect on it? In the 12 commits after the merge
> commit below there are number of sensitive rewrites of the ring access
> functions. Maybe one of them breaks your code. When you say "packet
> forwarding no longer works", do you mean it works for a second or so,
> then no packets come through? What HW are you using?
>
> commit ce3cec27933c069d2015a81e59b93eb656fe7ee4
> Merge: 99cacdc 1d9cb1f
> Author: Alexei Starovoitov <ast@kernel.org>
> Date:   Fri Dec 20 16:00:10 2019 -0800
>
>     Merge branch 'xsk-cleanup'
>
>     Magnus Karlsson says:
>
>     ====================
>     This patch set cleans up the ring access functions of AF_XDP in hope
>     that it will now be easier to understand and maintain. I used to get a
>     headache every time I looked at this code in order to really understand it,
>     but now I do think it is a lot less painful.
>     <snip>
>
> /Magnus

I see that you have debug messages in your application. Could you
please run with those on and send me the output so I can see where it
stops? A bisect that pinpoints which commit breaks your program, plus
the debug output, should hopefully put us on the right path to a fix.

Thanks: Magnus

> > I also gave the 5.5.0-rc6 branch a try (without the patch) and packets forward
> > OK.
> >
> > --
> > ~ ry

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: zero-copy between interfaces
  2020-01-15  8:20             ` Magnus Karlsson
@ 2020-01-16  2:04               ` Ryan Goodfellow
  2020-01-16 14:32                 ` Magnus Karlsson
  2020-01-21  7:34                 ` Magnus Karlsson
  0 siblings, 2 replies; 49+ messages in thread
From: Ryan Goodfellow @ 2020-01-16  2:04 UTC (permalink / raw)
  To: Magnus Karlsson; +Cc: xdp-newbies

On Wed, Jan 15, 2020 at 09:20:30AM +0100, Magnus Karlsson wrote:
> On Wed, Jan 15, 2020 at 8:40 AM Magnus Karlsson
> <magnus.karlsson@gmail.com> wrote:
> >
> > On Wed, Jan 15, 2020 at 2:41 AM Ryan Goodfellow <rgoodfel@isi.edu> wrote:
> > >
> > > On Tue, Jan 14, 2020 at 03:52:50PM -0500, Ryan Goodfellow wrote:
> > > > On Tue, Jan 14, 2020 at 10:59:19AM +0100, Magnus Karlsson wrote:
> > > > >
> > > > > Just sent out a patch on the mailing list. Would be great if you could
> > > > > try it out.
> > > >
> > > > Thanks for the quick turnaround. I gave this patch a go, both in the bpf-next
> > > > tree and manually applied to the 5.5.0-rc3 branch I've been working with up to
> > > > this point. It does allow for allocating more memory, however packet
> > > > forwarding no longer works. I did not see any complaints from dmesg, but here
> > > > is an example iperf3 session from a client that worked before.
> > > >
> > > > ry@xd2:~$ iperf3 -c 10.1.0.2
> > > > Connecting to host 10.1.0.2, port 5201
> > > > [  5] local 10.1.0.1 port 53304 connected to 10.1.0.2 port 5201
> > > > [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
> > > > [  5]   0.00-1.00   sec  5.91 MBytes  49.5 Mbits/sec    2   1.41 KBytes
> > > > [  5]   1.00-2.00   sec  0.00 Bytes  0.00 bits/sec    1   1.41 KBytes
> > > > [  5]   2.00-3.00   sec  0.00 Bytes  0.00 bits/sec    0   1.41 KBytes
> > > > [  5]   3.00-4.00   sec  0.00 Bytes  0.00 bits/sec    1   1.41 KBytes
> > > > [  5]   4.00-5.00   sec  0.00 Bytes  0.00 bits/sec    0   1.41 KBytes
> > > > [  5]   5.00-6.00   sec  0.00 Bytes  0.00 bits/sec    0   1.41 KBytes
> > > > [  5]   6.00-7.00   sec  0.00 Bytes  0.00 bits/sec    1   1.41 KBytes
> > > > [  5]   7.00-8.00   sec  0.00 Bytes  0.00 bits/sec    0   1.41 KBytes
> > > > [  5]   8.00-9.00   sec  0.00 Bytes  0.00 bits/sec    0   1.41 KBytes
> > > > ^C[  5]  10.00-139.77 sec  0.00 Bytes  0.00 bits/sec    4   1.41 KBytes
> > > > - - - - - - - - - - - - - - - - - - - - - - - - -
> > > > [ ID] Interval           Transfer     Bitrate         Retr
> > > > [  5]   0.00-139.77 sec  5.91 MBytes   355 Kbits/sec    9             sender
> > > > [  5]   0.00-139.77 sec  0.00 Bytes  0.00 bits/sec                  receiver
> > > > iperf3: interrupt - the client has terminated
> > > >
> > > > I'll continue to investigate and report back with anything that I find.
> > >
> > > Interestingly I found this behavior to exist in the bpf-next tree independent
> > > of the patch being present.
> >
> > Ryan,
> >
> > Could you please do a bisect on it? In the 12 commits after the merge
> > commit below there are number of sensitive rewrites of the ring access
> > functions. Maybe one of them breaks your code. When you say "packet
> > forwarding no longer works", do you mean it works for a second or so,
> > then no packets come through? What HW are you using?
> >
> > commit ce3cec27933c069d2015a81e59b93eb656fe7ee4
> > Merge: 99cacdc 1d9cb1f
> > Author: Alexei Starovoitov <ast@kernel.org>
> > Date:   Fri Dec 20 16:00:10 2019 -0800
> >
> >     Merge branch 'xsk-cleanup'
> >
> >     Magnus Karlsson says:
> >
> >     ====================
> >     This patch set cleans up the ring access functions of AF_XDP in hope
> >     that it will now be easier to understand and maintain. I used to get a
> >     headache every time I looked at this code in order to really understand it,
> >     but now I do think it is a lot less painful.
> >     <snip>
> >
> > /Magnus
> 
> I see that you have debug messages in your application. Could you
> please run with those on and send me the output so I can see where it
> stops. A bisect that pin-points what commit that breaks your program
> plus the debug output should hopefully send us on the right path for a
> fix.
> 
> Thanks: Magnus
> 

Hi Magnus,

I did a bisect starting from the head of the bpf-next tree (990bca1) down to
the first commit before the patch series you identified (df034c9). The bisect
identified df0ae6f as the commit that causes the issue I am seeing.

I've posted output from the program in debugging mode here

- https://gitlab.com/mergetb/tech/network-emulation/kernel/snippets/1930375

Yes, you are correct: forwarding works for a brief period and then stops.
I've noticed that the number of packets that are forwarded is equal to the size
of the producer/consumer descriptor rings. I've posted two ping traces from a
client that show this.

- https://gitlab.com/mergetb/tech/network-emulation/kernel/snippets/1930376
- https://gitlab.com/mergetb/tech/network-emulation/kernel/snippets/1930377

I've also noticed that when the forwarding stops, the CPU usage for the proc
running the program is pegged, which is not the norm for this program as it uses
a poll call with a timeout on the xsk fd.
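
For context, the forwarding step being described is roughly of the following
shape. This is only a sketch built on the libbpf xsk.h helpers, not the actual
xdpsock_multidev code; BATCH, the function layout and the error handling are
illustrative assumptions.

    #include <poll.h>
    #include <bpf/xsk.h>

    #define BATCH 256

    /* Sketch: one poll()-driven forwarding step from one AF_XDP socket's RX
     * ring onto another socket's TX ring over a shared UMEM. Descriptors are
     * moved by value (addr/len), so packet data is never copied. */
    static void forward_once(struct xsk_socket *in, struct xsk_ring_cons *rx,
                             struct xsk_ring_prod *tx)
    {
        struct pollfd pfd = { .fd = xsk_socket__fd(in), .events = POLLIN };
        __u32 idx_rx = 0, idx_tx = 0;
        unsigned int rcvd, i;

        /* Sleep until RX descriptors arrive, or time out after 1s. */
        if (poll(&pfd, 1, 1000) <= 0)
            return;

        rcvd = xsk_ring_cons__peek(rx, BATCH, &idx_rx);
        if (!rcvd)
            return;

        /* Reserve TX slots for the whole batch. A full TX ring is where a
         * real loop must also drain the completion ring so that frames can
         * be reused; if that never happens, forwarding stops after roughly
         * one ring's worth of packets while the loop keeps spinning. */
        while (xsk_ring_prod__reserve(tx, rcvd, &idx_tx) != rcvd)
            ;

        for (i = 0; i < rcvd; i++)
            *xsk_ring_prod__tx_desc(tx, idx_tx + i) =
                *xsk_ring_cons__rx_desc(rx, idx_rx + i);

        xsk_ring_prod__submit(tx, rcvd);
        xsk_ring_cons__release(rx, rcvd);
    }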

The hardware I am using is a Mellanox ConnectX4 2x100G card (MCX416A-CCAT)
running the mlx5 driver. The program is running in zero copy mode. I also tested
this code out in a virtual machine with virtio NICs in SKB mode, which uses
xdpgeneric; there were no issues in that setting.

-- 
~ ry

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: zero-copy between interfaces
  2020-01-16  2:04               ` Ryan Goodfellow
@ 2020-01-16 14:32                 ` Magnus Karlsson
  2020-01-17  9:45                   ` Magnus Karlsson
  2020-01-21  7:34                 ` Magnus Karlsson
  1 sibling, 1 reply; 49+ messages in thread
From: Magnus Karlsson @ 2020-01-16 14:32 UTC (permalink / raw)
  To: Ryan Goodfellow; +Cc: xdp-newbies

On Thu, Jan 16, 2020 at 3:04 AM Ryan Goodfellow <rgoodfel@isi.edu> wrote:
>
> On Wed, Jan 15, 2020 at 09:20:30AM +0100, Magnus Karlsson wrote:
> > On Wed, Jan 15, 2020 at 8:40 AM Magnus Karlsson
> > <magnus.karlsson@gmail.com> wrote:
> > >
> > > On Wed, Jan 15, 2020 at 2:41 AM Ryan Goodfellow <rgoodfel@isi.edu> wrote:
> > > >
> > > > On Tue, Jan 14, 2020 at 03:52:50PM -0500, Ryan Goodfellow wrote:
> > > > > On Tue, Jan 14, 2020 at 10:59:19AM +0100, Magnus Karlsson wrote:
> > > > > >
> > > > > > Just sent out a patch on the mailing list. Would be great if you could
> > > > > > try it out.
> > > > >
> > > > > Thanks for the quick turnaround. I gave this patch a go, both in the bpf-next
> > > > > tree and manually applied to the 5.5.0-rc3 branch I've been working with up to
> > > > > this point. It does allow for allocating more memory, however packet
> > > > > forwarding no longer works. I did not see any complaints from dmesg, but here
> > > > > is an example iperf3 session from a client that worked before.
> > > > >
> > > > > ry@xd2:~$ iperf3 -c 10.1.0.2
> > > > > Connecting to host 10.1.0.2, port 5201
> > > > > [  5] local 10.1.0.1 port 53304 connected to 10.1.0.2 port 5201
> > > > > [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
> > > > > [  5]   0.00-1.00   sec  5.91 MBytes  49.5 Mbits/sec    2   1.41 KBytes
> > > > > [  5]   1.00-2.00   sec  0.00 Bytes  0.00 bits/sec    1   1.41 KBytes
> > > > > [  5]   2.00-3.00   sec  0.00 Bytes  0.00 bits/sec    0   1.41 KBytes
> > > > > [  5]   3.00-4.00   sec  0.00 Bytes  0.00 bits/sec    1   1.41 KBytes
> > > > > [  5]   4.00-5.00   sec  0.00 Bytes  0.00 bits/sec    0   1.41 KBytes
> > > > > [  5]   5.00-6.00   sec  0.00 Bytes  0.00 bits/sec    0   1.41 KBytes
> > > > > [  5]   6.00-7.00   sec  0.00 Bytes  0.00 bits/sec    1   1.41 KBytes
> > > > > [  5]   7.00-8.00   sec  0.00 Bytes  0.00 bits/sec    0   1.41 KBytes
> > > > > [  5]   8.00-9.00   sec  0.00 Bytes  0.00 bits/sec    0   1.41 KBytes
> > > > > ^C[  5]  10.00-139.77 sec  0.00 Bytes  0.00 bits/sec    4   1.41 KBytes
> > > > > - - - - - - - - - - - - - - - - - - - - - - - - -
> > > > > [ ID] Interval           Transfer     Bitrate         Retr
> > > > > [  5]   0.00-139.77 sec  5.91 MBytes   355 Kbits/sec    9             sender
> > > > > [  5]   0.00-139.77 sec  0.00 Bytes  0.00 bits/sec                  receiver
> > > > > iperf3: interrupt - the client has terminated
> > > > >
> > > > > I'll continue to investigate and report back with anything that I find.
> > > >
> > > > Interestingly I found this behavior to exist in the bpf-next tree independent
> > > > of the patch being present.
> > >
> > > Ryan,
> > >
> > > Could you please do a bisect on it? In the 12 commits after the merge
> > > commit below there are a number of sensitive rewrites of the ring access
> > > functions. Maybe one of them breaks your code. When you say "packet
> > > forwarding no longer works", do you mean it works for a second or so,
> > > then no packets come through? What HW are you using?
> > >
> > > commit ce3cec27933c069d2015a81e59b93eb656fe7ee4
> > > Merge: 99cacdc 1d9cb1f
> > > Author: Alexei Starovoitov <ast@kernel.org>
> > > Date:   Fri Dec 20 16:00:10 2019 -0800
> > >
> > >     Merge branch 'xsk-cleanup'
> > >
> > >     Magnus Karlsson says:
> > >
> > >     ====================
> > >     This patch set cleans up the ring access functions of AF_XDP in hope
> > >     that it will now be easier to understand and maintain. I used to get a
> > >     headache every time I looked at this code in order to really understand it,
> > >     but now I do think it is a lot less painful.
> > >     <snip>
> > >
> > > /Magnus
> >
> > I see that you have debug messages in your application. Could you
> > please run with those on and send me the output so I can see where it
> > stops. A bisect that pin-points which commit breaks your program
> > plus the debug output should hopefully send us on the right path for a
> > fix.
> >
> > Thanks: Magnus
> >
>
> Hi Magnus,
>
> I did a bisect starting from the head of the bpf-next tree (990bca1) down to
> the first commit before the patch series you identified (df034c9). The result
> was identifying df0ae6f as the commit that causes the issue I am seeing.
>
> I've posted output from the program in debugging mode here
>
> - https://gitlab.com/mergetb/tech/network-emulation/kernel/snippets/1930375

Perfect. Thanks.

> Yes, you are correct in that forwarding works for a brief period and then stops.
> I've noticed that the number of packets that are forwarded is equal to the size
> of the producer/consumer descriptor rings. I've posted two ping traces from a
> client ping that shows this.
>
> - https://gitlab.com/mergetb/tech/network-emulation/kernel/snippets/1930376
> - https://gitlab.com/mergetb/tech/network-emulation/kernel/snippets/1930377
>
> I've also noticed that when the forwarding stops, the CPU usage for the proc
> running the program is pegged, which is not the norm for this program as it uses
> a poll call with a timeout on the xsk fd.

I will replicate your setup and try to reproduce it. I only have one
port connected to my load generator now, but when I get into the
office, I will connect two ports.

In what loop does the execution get stuck when it hangs at 100% load?

/Magnus

> The hardware I am using is a Mellanox ConnectX4 2x100G card (MCX416A-CCAT)
> running the mlx5 driver. The program is running in zero copy mode. I also tested
> this code out in a virtual machine with virtio NICs in SKB mode which uses
> xdpgeneric - there were no issues in that setting.
>
> --
> ~ ry

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: zero-copy between interfaces
  2020-01-16 14:32                 ` Magnus Karlsson
@ 2020-01-17  9:45                   ` Magnus Karlsson
  2020-01-17 17:05                     ` Ryan Goodfellow
  0 siblings, 1 reply; 49+ messages in thread
From: Magnus Karlsson @ 2020-01-17  9:45 UTC (permalink / raw)
  To: Ryan Goodfellow; +Cc: xdp-newbies

On Thu, Jan 16, 2020 at 3:32 PM Magnus Karlsson
<magnus.karlsson@gmail.com> wrote:
>
> On Thu, Jan 16, 2020 at 3:04 AM Ryan Goodfellow <rgoodfel@isi.edu> wrote:
> >
> > On Wed, Jan 15, 2020 at 09:20:30AM +0100, Magnus Karlsson wrote:
> > > On Wed, Jan 15, 2020 at 8:40 AM Magnus Karlsson
> > > <magnus.karlsson@gmail.com> wrote:
> > > >
> > > > On Wed, Jan 15, 2020 at 2:41 AM Ryan Goodfellow <rgoodfel@isi.edu> wrote:
> > > > >
> > > > > On Tue, Jan 14, 2020 at 03:52:50PM -0500, Ryan Goodfellow wrote:
> > > > > > On Tue, Jan 14, 2020 at 10:59:19AM +0100, Magnus Karlsson wrote:
> > > > > > >
> > > > > > > Just sent out a patch on the mailing list. Would be great if you could
> > > > > > > try it out.
> > > > > >
> > > > > > Thanks for the quick turnaround. I gave this patch a go, both in the bpf-next
> > > > > > tree and manually applied to the 5.5.0-rc3 branch I've been working with up to
> > > > > > this point. It does allow for allocating more memory, however packet
> > > > > > forwarding no longer works. I did not see any complaints from dmesg, but here
> > > > > > is an example iperf3 session from a client that worked before.
> > > > > >
> > > > > > ry@xd2:~$ iperf3 -c 10.1.0.2
> > > > > > Connecting to host 10.1.0.2, port 5201
> > > > > > [  5] local 10.1.0.1 port 53304 connected to 10.1.0.2 port 5201
> > > > > > [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
> > > > > > [  5]   0.00-1.00   sec  5.91 MBytes  49.5 Mbits/sec    2   1.41 KBytes
> > > > > > [  5]   1.00-2.00   sec  0.00 Bytes  0.00 bits/sec    1   1.41 KBytes
> > > > > > [  5]   2.00-3.00   sec  0.00 Bytes  0.00 bits/sec    0   1.41 KBytes
> > > > > > [  5]   3.00-4.00   sec  0.00 Bytes  0.00 bits/sec    1   1.41 KBytes
> > > > > > [  5]   4.00-5.00   sec  0.00 Bytes  0.00 bits/sec    0   1.41 KBytes
> > > > > > [  5]   5.00-6.00   sec  0.00 Bytes  0.00 bits/sec    0   1.41 KBytes
> > > > > > [  5]   6.00-7.00   sec  0.00 Bytes  0.00 bits/sec    1   1.41 KBytes
> > > > > > [  5]   7.00-8.00   sec  0.00 Bytes  0.00 bits/sec    0   1.41 KBytes
> > > > > > [  5]   8.00-9.00   sec  0.00 Bytes  0.00 bits/sec    0   1.41 KBytes
> > > > > > ^C[  5]  10.00-139.77 sec  0.00 Bytes  0.00 bits/sec    4   1.41 KBytes
> > > > > > - - - - - - - - - - - - - - - - - - - - - - - - -
> > > > > > [ ID] Interval           Transfer     Bitrate         Retr
> > > > > > [  5]   0.00-139.77 sec  5.91 MBytes   355 Kbits/sec    9             sender
> > > > > > [  5]   0.00-139.77 sec  0.00 Bytes  0.00 bits/sec                  receiver
> > > > > > iperf3: interrupt - the client has terminated
> > > > > >
> > > > > > I'll continue to investigate and report back with anything that I find.
> > > > >
> > > > > Interestingly I found this behavior to exist in the bpf-next tree independent
> > > > > of the patch being present.
> > > >
> > > > Ryan,
> > > >
> > > > Could you please do a bisect on it? In the 12 commits after the merge
> > > > commit below there are a number of sensitive rewrites of the ring access
> > > > functions. Maybe one of them breaks your code. When you say "packet
> > > > forwarding no longer works", do you mean it works for a second or so,
> > > > then no packets come through? What HW are you using?
> > > >
> > > > commit ce3cec27933c069d2015a81e59b93eb656fe7ee4
> > > > Merge: 99cacdc 1d9cb1f
> > > > Author: Alexei Starovoitov <ast@kernel.org>
> > > > Date:   Fri Dec 20 16:00:10 2019 -0800
> > > >
> > > >     Merge branch 'xsk-cleanup'
> > > >
> > > >     Magnus Karlsson says:
> > > >
> > > >     ====================
> > > >     This patch set cleans up the ring access functions of AF_XDP in hope
> > > >     that it will now be easier to understand and maintain. I used to get a
> > > >     headache every time I looked at this code in order to really understand it,
> > > >     but now I do think it is a lot less painful.
> > > >     <snip>
> > > >
> > > > /Magnus
> > >
> > > I see that you have debug messages in your application. Could you
> > > please run with those on and send me the output so I can see where it
> > > stops. A bisect that pin-points which commit breaks your program
> > > plus the debug output should hopefully send us on the right path for a
> > > fix.
> > >
> > > Thanks: Magnus
> > >
> >
> > Hi Magnus,
> >
> > I did a bisect starting from the head of the bpf-next tree (990bca1) down to
> > the first commit before the patch series you identified (df034c9). The result
> > was identifying df0ae6f as the commit that causes the issue I am seeing.
> >
> > I've posted output from the program in debugging mode here
> >
> > - https://gitlab.com/mergetb/tech/network-emulation/kernel/snippets/1930375
>
> Perfect. Thanks.
>
> > Yes, you are correct in that forwarding works for a brief period and then stops.
> > I've noticed that the number of packets that are forwarded is equal to the size
> > of the producer/consumer descriptor rings. I've posted two ping traces from a
> > client ping that shows this.
> >
> > - https://gitlab.com/mergetb/tech/network-emulation/kernel/snippets/1930376
> > - https://gitlab.com/mergetb/tech/network-emulation/kernel/snippets/1930377
> >
> > I've also noticed that when the forwarding stops, the CPU usage for the proc
> > running the program is pegged, which is not the norm for this program as it uses
> > a poll call with a timeout on the xsk fd.
>
> I will replicate your setup and try to reproduce it. Only have one
> port connected to my load generator now, but when I get into the
> office, I will connect two ports.

I have now run your application, but unfortunately I cannot recreate
your problem. It works and runs for several minutes until I get bored
and terminate it. Note that I use an i40e card that you get a crash
with. So two problems I cannot reproduce, sigh. Here is my system
info. Can you please dump yours? Please do the ethtool dump on your
i40e card.

mkarlsso@kurt:~/src/dna-linux$ sudo ethtool -i ens803f0
[sudo] password for mkarlsso:
driver: i40e
version: 2.8.20-k
firmware-version: 5.05 0x800028a6 1.1568.0
expansion-rom-version:
bus-info: 0000:86:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: yes

mkarlsso@kurt:~/src/dna-linux$ uname -a
Linux kurt 5.5.0-rc4+ #72 SMP PREEMPT Thu Jan 16 10:03:20 CET 2020
x86_64 x86_64 x86_64 GNU/Linux

mkarlsso@kurt:~/src/dna-linux$ git log -1
commit b65053cd94f46619b4aae746b98f2d8d9274540e (HEAD, bpf-next/master)
Author: Andrii Nakryiko <andriin@fb.com>
Date:   Wed Jan 15 16:55:49 2020 -0800

    selftests/bpf: Add whitelist/blacklist of test names to test_progs

gcc version 9.2.1 20191008 (Ubuntu 9.2.1-9ubuntu2)

I also noted that you use MAX_SOCKS in your XDP program. The size of
the xsks_map is not dependent on the number of sockets in your case.
It is dependent on the queue id you use. So I would introduce a
MAX_QUEUE_ID and set it to e.g. 128 and use that instead. MAX_SOCKS is
4, so quite restrictive.
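
A sketch of what that could look like on the BPF side, written in the
bpf_map_def style used by the samples around this kernel version; MAX_QUEUE_ID
and the section/function names are assumptions, not the actual
xdpsock_multidev_kern.c code.

    #include <linux/bpf.h>
    #include <bpf/bpf_helpers.h>

    /* Size the XSKMAP by queue id rather than by the number of sockets; the
     * lookup key is the RX queue the frame arrived on. Illustrative only. */
    #define MAX_QUEUE_ID 128

    struct bpf_map_def SEC("maps") xsks_map = {
        .type        = BPF_MAP_TYPE_XSKMAP,
        .key_size    = sizeof(int),
        .value_size  = sizeof(int),
        .max_entries = MAX_QUEUE_ID,
    };

    SEC("xdp_sock")
    int xdp_sock_prog(struct xdp_md *ctx)
    {
        /* Redirect to the AF_XDP socket bound to this RX queue, if any. */
        return bpf_redirect_map(&xsks_map, ctx->rx_queue_index, 0);
    }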

/Magnus

> In what loop does the execution get stuck when it hangs at 100% load?
>
> /Magnus
>
> > The hardware I am using is a Mellanox ConnectX4 2x100G card (MCX416A-CCAT)
> > running the mlx5 driver. The program is running in zero copy mode. I also tested
> > this code out in a virtual machine with virtio NICs in SKB mode which uses
> > xdpgeneric - there were no issues in that setting.
> >
> > --
> > ~ ry

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: zero-copy between interfaces
  2020-01-13  0:18 zero-copy between interfaces Ryan Goodfellow
@ 2020-01-17 12:32   ` Björn Töpel
  2020-01-13 11:41 ` Jesper Dangaard Brouer
  2020-01-17 12:32   ` [Intel-wired-lan] " Björn Töpel
  2 siblings, 0 replies; 49+ messages in thread
From: Björn Töpel @ 2020-01-17 12:32 UTC (permalink / raw)
  To: Ryan Goodfellow
  Cc: xdp-newbies, Karlsson, Magnus, intel-wired-lan, Magnus Karlsson,
	Björn Töpel

On Mon, 13 Jan 2020 at 01:28, Ryan Goodfellow <rgoodfel@isi.edu> wrote:
>
[...]
>
> I could not get zero-copy to work with the i40e driver as it would crash. I've
> attached the corresponding traces from dmesg.

Thanks Ryan! I had a look at the crash, and it's in the XDP setup:

i40e_xdp_setup:
...
 for (i = 0; i < vsi->num_queue_pairs; i++)
     WRITE_ONCE(vsi->rx_rings[i]->xdp_prog, vsi->xdp_prog);

and the vsi->rx_rings[0] is NULL. This is clearly broken.

It would help to have more lines from your dmesg: the cut i40e log hints
that something is really broken:

[  328.579154] i40e 0000:b7:00.2: failed to get tracking for 256
queues for VSI 0 err -12
[  328.579280] i40e 0000:b7:00.2: setup of MAIN VSI failed
[  328.579367] i40e 0000:b7:00.2: can't remove VEB 162 with 0 VSIs left

Is it possible to dig out the complete log?

Thanks!
Björn

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: zero-copy between interfaces
  2020-01-17  9:45                   ` Magnus Karlsson
@ 2020-01-17 17:05                     ` Ryan Goodfellow
  0 siblings, 0 replies; 49+ messages in thread
From: Ryan Goodfellow @ 2020-01-17 17:05 UTC (permalink / raw)
  To: Magnus Karlsson; +Cc: xdp-newbies

On Fri, Jan 17, 2020 at 10:45:33AM +0100, Magnus Karlsson wrote:
> I have now run your application, but unfortunately I cannot recreate
> your problem. It works and runs for several minutes until I get bored
> and terminate it. Note that I use an i40e card that you get a crash
> with. So two problems I cannot reproduce, sigh. Here is my system
> info. Can you please dump yours? Please do the ethtool dump on your
> i40e card.
> 
> mkarlsso@kurt:~/src/dna-linux$ sudo ethtool -i ens803f0
> [sudo] password for mkarlsso:
> driver: i40e
> version: 2.8.20-k
> firmware-version: 5.05 0x800028a6 1.1568.0
> expansion-rom-version:
> bus-info: 0000:86:00.0
> supports-statistics: yes
> supports-test: yes
> supports-eeprom-access: yes
> supports-register-dump: yes
> supports-priv-flags: yes
> 
> mkarlsso@kurt:~/src/dna-linux$ uname -a
> Linux kurt 5.5.0-rc4+ #72 SMP PREEMPT Thu Jan 16 10:03:20 CET 2020
> x86_64 x86_64 x86_64 GNU/Linux
> 
> mkarlsso@kurt:~/src/dna-linux$ git log -1
> commit b65053cd94f46619b4aae746b98f2d8d9274540e (HEAD, bpf-next/master)
> Author: Andrii Nakryiko <andriin@fb.com>
> Date:   Wed Jan 15 16:55:49 2020 -0800
> 
>     selftests/bpf: Add whitelist/blacklist of test names to test_progs
> 
> gcc version 9.2.1 20191008 (Ubuntu 9.2.1-9ubuntu2)
> 
> I also noted that you use MAX_SOCKS in your XDP program. The size of
> the xsks_map is not dependent on the number of sockets in your case.
> It is dependent on the queue id you use. So I would introduce a
> MAX_QUEUE_ID and set it to e.g. 128 and use that instead. MAX_SOCKS is
> 4, so quite restrictive.
> 
> /Magnus

So I plugged in my X710-DA4 card, and this one actually works fine. Here is the
requested output for that card.

ry@turbine:~$ sudo ethtool -i enp101s0f0
driver: i40e
version: 2.8.20-k
firmware-version: 7.10 0x80006471 1.2527.0
expansion-rom-version:
bus-info: 0000:65:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: yes

ry@turbine:~$ sudo lspci -vvv | grep 710
65:00.0 Ethernet controller: Intel Corporation Ethernet Controller X710 for 10GbE SFP+ (rev 01)
	Subsystem: Intel Corporation Ethernet Converged Network Adapter X710-4
		Product Name: XL710 40GbE Controller
65:00.1 Ethernet controller: Intel Corporation Ethernet Controller X710 for 10GbE SFP+ (rev 01)
	Subsystem: Intel Corporation Ethernet Converged Network Adapter X710
		Product Name: XL710 40GbE Controller
65:00.2 Ethernet controller: Intel Corporation Ethernet Controller X710 for 10GbE SFP+ (rev 01)
	Subsystem: Intel Corporation Ethernet Converged Network Adapter X710
		Product Name: XL710 40GbE Controller
65:00.3 Ethernet controller: Intel Corporation Ethernet Controller X710 for 10GbE SFP+ (rev 01)
	Subsystem: Intel Corporation Ethernet Converged Network Adapter X710
		Product Name: XL710 40GbE Controller

Here is the output for the X722 card. As of a few weeks ago firmware version
3.33 was the latest I could find.

ry@turbine:~$ sudo ethtool -i eno7
driver: i40e
version: 2.8.20-k
firmware-version: 3.33 0x80001006 1.1747.0
expansion-rom-version:
bus-info: 0000:b7:00.2
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: yes

ry@turbine:~$ sudo lspci -vvv | grep 722
b7:00.0 Ethernet controller: Intel Corporation Ethernet Connection X722 for 10GBASE-T (rev 04)
	DeviceName: Intel LAN X722 #1
	Subsystem: Super Micro Computer Inc Ethernet Connection X722 for 10GBASE-T
b7:00.1 Ethernet controller: Intel Corporation Ethernet Connection X722 for 10GBASE-T (rev 04)
	DeviceName: Intel LAN X722 #2
	Subsystem: Super Micro Computer Inc Ethernet Connection X722 for 10GBASE-T
b7:00.2 Ethernet controller: Intel Corporation Ethernet Connection X722 for 10GbE SFP+ (rev 04)
	DeviceName: Intel LAN X722 #3
	Subsystem: Super Micro Computer Inc Ethernet Connection X722 for 10GbE SFP+
b7:00.3 Ethernet controller: Intel Corporation Ethernet Connection X722 for 10GbE SFP+ (rev 04)
	DeviceName: Intel LAN X722 #4
	Subsystem: Super Micro Computer Inc Ethernet Connection X722 for 10GbE SFP+

I verified that the driver still crashes with the current kernel/program I am
running, which works fine on the X710-DA4.

Other output as requested

ry@turbine:~$ uname -a
Linux turbine 5.5.0-rc4-moa+ #16 SMP Fri Jan 17 10:52:42 EST 2020 x86_64 GNU/Linux

ry@turbine:~/kmoa/bpf-next$ git log -2
commit 60d71397d27e7859fdaaaaab6594e4d977ae46e2 (HEAD -> master)
Author: Ryan Goodfellow <rgoodfel@isi.edu>
Date:   Wed Jan 15 16:54:39 2020 -0500

    add xdpsock_multidev sample program

    This is a simple program that uses AF_XDP sockets to forward packets
    between two interfaces using a common memory region and no copying of
    packets.

    Signed-off-by: Ryan Goodfellow <rgoodfel@isi.edu>

commit 9173cac3b64e6785dd604f5075e6035b045a0026 (origin/master, origin/HEAD)
Author: Andrii Nakryiko <andriin@fb.com>
Date:   Wed Jan 15 11:08:56 2020 -0800

    libbpf: Support .text sub-calls relocations

    The LLVM patch https://reviews.llvm.org/D72197 makes LLVM emit function call
    relocations within the same section. This includes a default .text section,
    which contains any BPF sub-programs. This wasn't the case before and so libbpf
    was able to get a way with slightly simpler handling of subprogram call
    relocations.

    This patch adds support for .text section relocations. It needs to ensure
    correct order of relocations, so does two passes:
    - first, relocate .text instructions, if there are any relocations in it;
    - then process all the other programs and copy over patched .text instructions
    for all sub-program calls.

    v1->v2:
    - break early once .text program is processed.

    Signed-off-by: Andrii Nakryiko <andriin@fb.com>
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    Acked-by: Yonghong Song <yhs@fb.com>
    Cc: Alexei Starovoitov <ast@kernel.org>
    Link: https://lore.kernel.org/bpf/20200115190856.2391325-1-andriin@fb.com

Here is the info for the Mellanox card that is not working after the df0ae6f
commit.

ry@turbine:~$ sudo ethtool -i enp23s0f0
driver: mlx5_core
version: 5.0-0
firmware-version: 12.23.1020 (MT_2150110033)
expansion-rom-version: 
bus-info: 0000:17:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: no
supports-register-dump: no
supports-priv-flags: yes

ry@turbine:~$ sudo lspci -vvv | grep Mellanox
17:00.0 Ethernet controller: Mellanox Technologies MT27700 Family [ConnectX-4]
        Subsystem: Mellanox Technologies ConnectX-4 Stand-up dual-port 100GbE MCX416A-CCAT
17:00.1 Ethernet controller: Mellanox Technologies MT27700 Family [ConnectX-4]
        Subsystem: Mellanox Technologies ConnectX-4 Stand-up dual-port 100GbE MCX416A-CCAT

-- 
~ ry

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: zero-copy between interfaces
  2020-01-17 12:32   ` [Intel-wired-lan] " Björn Töpel
@ 2020-01-17 17:16     ` Ryan Goodfellow
  -1 siblings, 0 replies; 49+ messages in thread
From: Ryan Goodfellow @ 2020-01-17 17:16 UTC (permalink / raw)
  To: Björn Töpel
  Cc: xdp-newbies, Karlsson, Magnus, intel-wired-lan, Magnus Karlsson,
	Björn Töpel

On Fri, Jan 17, 2020 at 01:32:07PM +0100, Björn Töpel wrote:
> On Mon, 13 Jan 2020 at 01:28, Ryan Goodfellow <rgoodfel@isi.edu> wrote:
> >
> [...]
> >
> > I could not get zero-copy to work with the i40e driver as it would crash. I've
> > attached the corresponding traces from dmesg.
> 
> Thanks Ryan! I had a look at the crash, and it's in the XDP setup:
> 
> i40e_xdp_setup:
> ...
>  for (i = 0; i < vsi->num_queue_pairs; i++)
>      WRITE_ONCE(vsi->rx_rings[i]->xdp_prog, vsi->xdp_prog);
> 
> and the vsi->rx_ring[0] is NULL. This is clearly broken.
> 
> It would help with more lines from your dmesg: the cut i40e log hints
> that something is really broken:
> 
> [  328.579154] i40e 0000:b7:00.2: failed to get tracking for 256
> queues for VSI 0 err -12
> [  328.579280] i40e 0000:b7:00.2: setup of MAIN VSI failed
> [  328.579367] i40e 0000:b7:00.2: can't remove VEB 162 with 0 VSIs left
> 
> Is it possible to dig out the complete log?

Hi Björn,

I've linked a full dmesg log from an XDP setup crash. Note that there are 
two i40e cards on this machine. The X710 (0000:65:00.0, 0000:65:00.1) works 
fine; the X722 (0000:b7:00.0, 0000:b7:00.1, 0000:b7:00.2, 0000:b7:00.3) is the
one that is crashing on XDP setup.

https://gitlab.com/mergetb/tech/network-emulation/kernel/snippets/1931080

Some info that may be useful:

ry@turbine:~$ sudo ethtool -i eno7
driver: i40e
version: 2.8.20-k
firmware-version: 3.33 0x80001006 1.1747.0
expansion-rom-version:
bus-info: 0000:b7:00.2
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: yes

The firmware version 3.33 was the latest I could find as of a few weeks ago.

ry@turbine:~$ sudo lspci -vvv | grep 722
b7:00.0 Ethernet controller: Intel Corporation Ethernet Connection X722 for 10GBASE-T (rev 04)
	DeviceName: Intel LAN X722 #1
	Subsystem: Super Micro Computer Inc Ethernet Connection X722 for 10GBASE-T
b7:00.1 Ethernet controller: Intel Corporation Ethernet Connection X722 for 10GBASE-T (rev 04)
	DeviceName: Intel LAN X722 #2
	Subsystem: Super Micro Computer Inc Ethernet Connection X722 for 10GBASE-T
b7:00.2 Ethernet controller: Intel Corporation Ethernet Connection X722 for 10GbE SFP+ (rev 04)
	DeviceName: Intel LAN X722 #3
	Subsystem: Super Micro Computer Inc Ethernet Connection X722 for 10GbE SFP+
b7:00.3 Ethernet controller: Intel Corporation Ethernet Connection X722 for 10GbE SFP+ (rev 04)
	DeviceName: Intel LAN X722 #4
	Subsystem: Super Micro Computer Inc Ethernet Connection X722 for 10GbE SFP+

ry@ryzen:~$ uname -a
Linux ryzen 4.19.0-6-amd64 #1 SMP Debian 4.19.67-2+deb10u2 (2019-11-11) x86_64 GNU/Linux

ry@turbine:~/kmoa/bpf-next$ git log -2
commit 60d71397d27e7859fdaaaaab6594e4d977ae46e2 (HEAD -> master)
Author: Ryan Goodfellow <rgoodfel@isi.edu>
Date:   Wed Jan 15 16:54:39 2020 -0500

    add xdpsock_multidev sample program

    This is a simple program that uses AF_XDP sockets to forward packets
    between two interfaces using a common memory region and no copying of
    packets.

    Signed-off-by: Ryan Goodfellow <rgoodfel@isi.edu>

commit 9173cac3b64e6785dd604f5075e6035b045a0026 (origin/master, origin/HEAD)
Author: Andrii Nakryiko <andriin@fb.com>
Date:   Wed Jan 15 11:08:56 2020 -0800

    libbpf: Support .text sub-calls relocations

    The LLVM patch https://reviews.llvm.org/D72197 makes LLVM emit function call
    relocations within the same section. This includes a default .text section,
    which contains any BPF sub-programs. This wasn't the case before and so libbpf
    was able to get a way with slightly simpler handling of subprogram call
    relocations.

    This patch adds support for .text section relocations. It needs to ensure
    correct order of relocations, so does two passes:
    - first, relocate .text instructions, if there are any relocations in it;
    - then process all the other programs and copy over patched .text instructions
    for all sub-program calls.

    v1->v2:
    - break early once .text program is processed.

    Signed-off-by: Andrii Nakryiko <andriin@fb.com>
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    Acked-by: Yonghong Song <yhs@fb.com>
    Cc: Alexei Starovoitov <ast@kernel.org>
    Link: https://lore.kernel.org/bpf/20200115190856.2391325-1-andriin@fb.com

-- 
~ ry

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: zero-copy between interfaces
  2020-01-14 20:52       ` Ryan Goodfellow
  2020-01-15  1:41         ` Ryan Goodfellow
@ 2020-01-17 17:40         ` William Tu
  1 sibling, 0 replies; 49+ messages in thread
From: William Tu @ 2020-01-17 17:40 UTC (permalink / raw)
  To: Ryan Goodfellow; +Cc: Magnus Karlsson, xdp-newbies

On Tue, Jan 14, 2020 at 12:53 PM Ryan Goodfellow <rgoodfel@isi.edu> wrote:
>
> On Tue, Jan 14, 2020 at 10:59:19AM +0100, Magnus Karlsson wrote:
> >
> > Just sent out a patch on the mailing list. Would be great if you could
> > try it out.
>
> Thanks for the quick turnaround. I gave this patch a go, both in the bpf-next
> tree and manually applied to the 5.5.0-rc3 branch I've been working with up to
> this point. It does allow for allocating more memory, however packet
> forwarding no longer works. I did not see any complaints from dmesg, but here
> is an example iperf3 session from a client that worked before.
>
> ry@xd2:~$ iperf3 -c 10.1.0.2
> Connecting to host 10.1.0.2, port 5201
> [  5] local 10.1.0.1 port 53304 connected to 10.1.0.2 port 5201
> [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
> [  5]   0.00-1.00   sec  5.91 MBytes  49.5 Mbits/sec    2   1.41 KBytes
> [  5]   1.00-2.00   sec  0.00 Bytes  0.00 bits/sec    1   1.41 KBytes
> [  5]   2.00-3.00   sec  0.00 Bytes  0.00 bits/sec    0   1.41 KBytes
> [  5]   3.00-4.00   sec  0.00 Bytes  0.00 bits/sec    1   1.41 KBytes
> [  5]   4.00-5.00   sec  0.00 Bytes  0.00 bits/sec    0   1.41 KBytes
> [  5]   5.00-6.00   sec  0.00 Bytes  0.00 bits/sec    0   1.41 KBytes
> [  5]   6.00-7.00   sec  0.00 Bytes  0.00 bits/sec    1   1.41 KBytes
> [  5]   7.00-8.00   sec  0.00 Bytes  0.00 bits/sec    0   1.41 KBytes
> [  5]   8.00-9.00   sec  0.00 Bytes  0.00 bits/sec    0   1.41 KBytes
> ^C[  5]  10.00-139.77 sec  0.00 Bytes  0.00 bits/sec    4   1.41 KBytes
> - - - - - - - - - - - - - - - - - - - - - - - - -
> [ ID] Interval           Transfer     Bitrate         Retr
> [  5]   0.00-139.77 sec  5.91 MBytes   355 Kbits/sec    9             sender
> [  5]   0.00-139.77 sec  0.00 Bytes  0.00 bits/sec                  receiver
> iperf3: interrupt - the client has terminated
>
> I'll continue to investigate and report back with anything that I find.
>
Hi Ryan,

Not sure if this is the same, but we hit something similar in OVS
AF_XDP's implementation.
In our case, it happens when using native mode, not the driver zero-copy
mode, and iperf works for a couple of seconds and then drops to zero. FYI:
https://mail.openvswitch.org/pipermail/ovs-dev/2019-November/365076.html
and fixes
https://github.com/openvswitch/ovs/commit/161773c72a33a86a23d4892e3c7448cee9946317

Regards,
William

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: zero-copy between interfaces
  2020-01-13 17:04     ` Jesper Dangaard Brouer
@ 2020-01-17 17:54       ` Ryan Goodfellow
  2020-01-18 10:14         ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 49+ messages in thread
From: Ryan Goodfellow @ 2020-01-17 17:54 UTC (permalink / raw)
  To: Jesper Dangaard Brouer; +Cc: xdp-newbies

On Mon, Jan 13, 2020 at 06:04:11PM +0100, Jesper Dangaard Brouer wrote:
> On Mon, 13 Jan 2020 10:28:00 -0500
> Ryan Goodfellow <rgoodfel@isi.edu> wrote:
> 
> > On Mon, Jan 13, 2020 at 12:41:34PM +0100, Jesper Dangaard Brouer wrote:
> > > On Mon, 13 Jan 2020 00:18:36 +0000
> > > Ryan Goodfellow <rgoodfel@isi.edu> wrote:
> > >   
> > > > The numbers that I have been able to achive with this code are the following. MTU
> > > > is 1500 in all cases.
> > > > 
> > > >     mlx5: pps ~ 2.4 Mpps, 29 Gbps (driver mode, zero-copy)
> > > >     i40e: pps ~ 700 Kpps, 8 Gbps (skb mode, copy)
> > > >     virtio: pps ~ 200 Kpps, 2.4 Gbps (skb mode, copy, all qemu/kvm VMs)
> > > > 
> > > > Are these numbers in the ballpark of what's expected?  
> > > 
> > > I would say they are too slow / low.
> > > 
> > > Have you remembered to do bulking?
> > >   
> > 
> > I am using a batch size of 256.
> 
> Hmm...
> 
> Maybe you can test with xdp_redirect_map program in samples/bpf/ and
> compare the performance on this hardware?

Hi Jesper,

I tried to use this program; however, it does not seem to work for bidirectional
traffic across the two interfaces?

Now that I have an i40e card that is working here is an update to the numbers.

At 1500 MTU the i40e runs at 9.7 Gbps (using the X710-DA4 10G interface), with
approximately 800 Kpps. Dropping the MSS down to 700 in iperf3 yields similar
near-line-rate performance at 1.6 Mpps. This appears to be the limit, as dropping
the MSS further results in degraded performance at similar packet rates (this
could be an iperf3 artifact).

-- 
~ ry

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: zero-copy between interfaces
  2020-01-17 17:16     ` [Intel-wired-lan] " Ryan Goodfellow
@ 2020-01-17 18:10       ` Ryan Goodfellow
  -1 siblings, 0 replies; 49+ messages in thread
From: Ryan Goodfellow @ 2020-01-17 18:10 UTC (permalink / raw)
  To: Björn Töpel
  Cc: xdp-newbies, Karlsson, Magnus, intel-wired-lan, Magnus Karlsson,
	Björn Töpel

On Fri, Jan 17, 2020 at 12:16:37PM -0500, Ryan Goodfellow wrote:
> ry@ryzen:~$ uname -a
> Linux ryzen 4.19.0-6-amd64 #1 SMP Debian 4.19.67-2+deb10u2 (2019-11-11) x86_64 GNU/Linux

Correction: that was a terminal on the wrong machine.

ry@turbine:~$ uname -a
Linux turbine 5.5.0-rc4-moa+ #16 SMP Fri Jan 17 10:52:42 EST 2020 x86_64 GNU/Linux


-- 
~ ry

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: zero-copy between interfaces
  2020-01-17 17:54       ` Ryan Goodfellow
@ 2020-01-18 10:14         ` Jesper Dangaard Brouer
  2020-01-18 14:08           ` Ryan Goodfellow
  0 siblings, 1 reply; 49+ messages in thread
From: Jesper Dangaard Brouer @ 2020-01-18 10:14 UTC (permalink / raw)
  To: Ryan Goodfellow; +Cc: xdp-newbies, brouer

On Fri, 17 Jan 2020 12:54:09 -0500
Ryan Goodfellow <rgoodfel@isi.edu> wrote:

> On Mon, Jan 13, 2020 at 06:04:11PM +0100, Jesper Dangaard Brouer wrote:
> > On Mon, 13 Jan 2020 10:28:00 -0500
> > Ryan Goodfellow <rgoodfel@isi.edu> wrote:
> >   
> > > On Mon, Jan 13, 2020 at 12:41:34PM +0100, Jesper Dangaard Brouer wrote:  
> > > > On Mon, 13 Jan 2020 00:18:36 +0000
> > > > Ryan Goodfellow <rgoodfel@isi.edu> wrote:
> > > >     
> > > > > The numbers that I have been able to achive with this code are the following. MTU
> > > > > is 1500 in all cases.
> > > > > 
> > > > >     mlx5: pps ~ 2.4 Mpps, 29 Gbps (driver mode, zero-copy)
> > > > >     i40e: pps ~ 700 Kpps, 8 Gbps (skb mode, copy)
> > > > >     virtio: pps ~ 200 Kpps, 2.4 Gbps (skb mode, copy, all qemu/kvm VMs)
> > > > > 
> > > > > Are these numbers in the ballpark of what's expected?    
> > > > 
> > > > I would say they are too slow / low.
> > > > 
> > > > Have you remembered to do bulking?
> > > >     
> > > 
> > > I am using a batch size of 256.  
> > 
> > Hmm...
> > 
> > Maybe you can test with xdp_redirect_map program in samples/bpf/ and
> > compare the performance on this hardware?  
> 
> Hi Jesper,
> 
> I tried to use this program, however it does not seem to work for bidirectional
> traffic across the two interfaces?

It does work bidirectionally if you start more of these xdp_redirect_map
programs, one per direction.  Do notice this is an example program.  Look at
xdp_fwd_*.c if you want a program that is functional and uses the existing
IP route table for XDP acceleration.

My point is that there are alternatives for doing zero-copy between
interfaces... An xdp_redirect_map redirect inside the kernel out another
interface is already zero-copy.
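
For reference, the in-kernel redirect described here boils down to an XDP
program of roughly this shape. This is a sketch in the spirit of the
xdp_redirect_map sample, not the sample itself; the map and section names are
illustrative.

    #include <linux/bpf.h>
    #include <bpf/bpf_helpers.h>

    /* One-entry DEVMAP holding the egress ifindex; user space fills slot 0.
     * Every frame on the attached interface is redirected out that device
     * without ever leaving the kernel. */
    struct bpf_map_def SEC("maps") tx_port = {
        .type        = BPF_MAP_TYPE_DEVMAP,
        .key_size    = sizeof(int),
        .value_size  = sizeof(int),
        .max_entries = 1,
    };

    SEC("xdp_redirect")
    int xdp_redirect_prog(struct xdp_md *ctx)
    {
        return bpf_redirect_map(&tx_port, 0, 0);
    }

Bidirectional forwarding then amounts to attaching one such program per
interface, each pointing at the other interface.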

I'm wondering why did you choose/need AF_XDP technology for doing forwarding?

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: zero-copy between interfaces
  2020-01-18 10:14         ` Jesper Dangaard Brouer
@ 2020-01-18 14:08           ` Ryan Goodfellow
  2020-01-26  4:53             ` Dan Siemon
  0 siblings, 1 reply; 49+ messages in thread
From: Ryan Goodfellow @ 2020-01-18 14:08 UTC (permalink / raw)
  To: Jesper Dangaard Brouer; +Cc: xdp-newbies

On Sat, Jan 18, 2020 at 11:14:05AM +0100, Jesper Dangaard Brouer wrote:
> On Fri, 17 Jan 2020 12:54:09 -0500
> Ryan Goodfellow <rgoodfel@isi.edu> wrote:
> > I tried to use this program, however it does not seem to work for bidirectional
> > traffic across the two interfaces?
> 
> It does work bidirectional if you start more of these xdp_redirect_map
> programs.  Do notice this is an example program.  Look at xdp_fwd_*.c
> if you want a program that is functional and uses the existing IP route
> table for XDP acceleration.
> 
> My point is that there are alternatives for doing zero-copy between
> interfaces... A xdp_redirect_map inside the kernel out another
> interface is already zero-copy.
> 
> I'm wondering why did you choose/need AF_XDP technology for doing forwarding?

This is just a sample program used to demonstrate moving packets between
different interfaces efficiently using AF_XDP.

Our actual use case is performing network emulation in userspace, for example
representing impaired links or entire networks with link-by-link shaping
specifications. We are using AF_XDP to get packets to/from our network emulation
software as quickly as possible without having to go through the entire network
stack, so that the emulation host's network configuration does not influence the
networks it's emulating.

Traditionally we've used DPDK for this, but are porting to AF_XDP for the 
relative simplicity and flexibility it provides. Some specific benefits for us
are:

- Can attach to VTEPs, which lets us easily hook into some EVPN/VXLAN based
  networks we have. Alternatively, given the flexibility of the BPF program, we
  also have the option to split out BGP control-plane traffic from overlay
  traffic when attaching to the physical interface and pass it through to the
  kernel (see the sketch after this list). Both of these approaches let the
  kernel manage the FDB for the VTEPs as well as take care of encap/decap
  (potentially offloaded to the NIC itself) and let our software focus on
  emulation.

- Using XDP in virtual machines in our testing environment is straightforward;
  while this is possible with DPDK and virtio, the setup was rather convoluted.
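
A rough sketch of the control-plane/overlay split mentioned in the first
bullet, assuming VXLAN on the standard UDP port 4789. The parsing is
deliberately minimal (no VLANs, IPv6 or IP options) and all names here are
illustrative, not our actual program.

    #include <linux/bpf.h>
    #include <linux/if_ether.h>
    #include <linux/in.h>
    #include <linux/ip.h>
    #include <linux/udp.h>
    #include <bpf/bpf_helpers.h>
    #include <bpf/bpf_endian.h>

    struct bpf_map_def SEC("maps") xsks_map = {
        .type        = BPF_MAP_TYPE_XSKMAP,
        .key_size    = sizeof(int),
        .value_size  = sizeof(int),
        .max_entries = 128,
    };

    /* Send VXLAN overlay traffic to the AF_XDP socket for emulation; let
     * everything else (e.g. BGP control-plane traffic) continue into the
     * kernel stack. */
    SEC("xdp_split")
    int xdp_split_prog(struct xdp_md *ctx)
    {
        void *data = (void *)(long)ctx->data;
        void *data_end = (void *)(long)ctx->data_end;
        struct ethhdr *eth = data;
        struct iphdr *ip;
        struct udphdr *udp;

        if ((void *)(eth + 1) > data_end)
            return XDP_PASS;
        if (eth->h_proto != bpf_htons(ETH_P_IP))
            return XDP_PASS;

        ip = (struct iphdr *)(eth + 1);
        if ((void *)(ip + 1) > data_end)
            return XDP_PASS;
        if (ip->protocol != IPPROTO_UDP || ip->ihl != 5)
            return XDP_PASS;

        udp = (struct udphdr *)(ip + 1);
        if ((void *)(udp + 1) > data_end)
            return XDP_PASS;

        if (udp->dest == bpf_htons(4789))   /* VXLAN overlay -> emulator */
            return bpf_redirect_map(&xsks_map, ctx->rx_queue_index, 0);

        return XDP_PASS;                    /* control plane -> kernel */
    }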

-- 
~ ry

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: zero-copy between interfaces
  2020-01-17 17:16     ` [Intel-wired-lan] " Ryan Goodfellow
@ 2020-01-20  8:24       ` Magnus Karlsson
  -1 siblings, 0 replies; 49+ messages in thread
From: Magnus Karlsson @ 2020-01-20  8:24 UTC (permalink / raw)
  To: Ryan Goodfellow
  Cc: Björn Töpel, xdp-newbies, Karlsson, Magnus,
	intel-wired-lan, Björn Töpel

On Fri, Jan 17, 2020 at 6:16 PM Ryan Goodfellow <rgoodfel@isi.edu> wrote:
>
> On Fri, Jan 17, 2020 at 01:32:07PM +0100, Björn Töpel wrote:
> > On Mon, 13 Jan 2020 at 01:28, Ryan Goodfellow <rgoodfel@isi.edu> wrote:
> > >
> > [...]
> > >
> > > I could not get zero-copy to work with the i40e driver as it would crash. I've
> > > attached the corresponding traces from dmesg.
> >
> > Thanks Ryan! I had a look at the crash, and it's in the XDP setup:
> >
> > i40e_xdp_setup:
> > ...
> >  for (i = 0; i < vsi->num_queue_pairs; i++)
> >      WRITE_ONCE(vsi->rx_rings[i]->xdp_prog, vsi->xdp_prog);
> >
> > and the vsi->rx_ring[0] is NULL. This is clearly broken.
> >
> > It would help with more lines from your dmesg: the cut i40e log hints
> > that something is really broken:
> >
> > [  328.579154] i40e 0000:b7:00.2: failed to get tracking for 256
> > queues for VSI 0 err -12
> > [  328.579280] i40e 0000:b7:00.2: setup of MAIN VSI failed
> > [  328.579367] i40e 0000:b7:00.2: can't remove VEB 162 with 0 VSIs left
> >
> > Is it possible to dig out the complete log?
>
> Hi Björn,
>
> I've linked a full dmesg log from an XDP setup crash. Note that there are
> two i40e cards on this machine. The X710 (0000:65:00.0, 0000:65:00.1) works
> fine, the X722 (0000:b7:00.0, 0000:b7:00.1, 0000:b7:00.2, 0000:b7:00.3) is the
> one that is crashing on XDP setup.

Ryan,

I was wondering if you could run two small experiments since I cannot
reproduce this?

1: Run your program using the two ports on your X710 card. Does it
work? This is my setup and works for me.
2: On your Mellanox setup, insert a kick_tx() call for each of your
two sockets before the poll() call in your forward() function. Just to
see if it works when we explicitly wake up the driver.

Thanks: Magnus

> https://gitlab.com/mergetb/tech/network-emulation/kernel/snippets/1931080
>
> Some info that may be useful:
>
> ry@turbine:~$ sudo ethtool -i eno7
> driver: i40e
> version: 2.8.20-k
> firmware-version: 3.33 0x80001006 1.1747.0
> expansion-rom-version:
> bus-info: 0000:b7:00.2
> supports-statistics: yes
> supports-test: yes
> supports-eeprom-access: yes
> supports-register-dump: yes
> supports-priv-flags: yes
>
> The firmware version 3.33 was the latest I could find as of a few weeks ago.
>
> ry@turbine:~$ sudo lspci -vvv | grep 722
> b7:00.0 Ethernet controller: Intel Corporation Ethernet Connection X722 for 10GBASE-T (rev 04)
>         DeviceName: Intel LAN X722 #1
>         Subsystem: Super Micro Computer Inc Ethernet Connection X722 for 10GBASE-T
> b7:00.1 Ethernet controller: Intel Corporation Ethernet Connection X722 for 10GBASE-T (rev 04)
>         DeviceName: Intel LAN X722 #2
>         Subsystem: Super Micro Computer Inc Ethernet Connection X722 for 10GBASE-T
> b7:00.2 Ethernet controller: Intel Corporation Ethernet Connection X722 for 10GbE SFP+ (rev 04)
>         DeviceName: Intel LAN X722 #3
>         Subsystem: Super Micro Computer Inc Ethernet Connection X722 for 10GbE SFP+
> b7:00.3 Ethernet controller: Intel Corporation Ethernet Connection X722 for 10GbE SFP+ (rev 04)
>         DeviceName: Intel LAN X722 #4
>         Subsystem: Super Micro Computer Inc Ethernet Connection X722 for 10GbE SFP+
>
> ry@ryzen:~$ uname -a
> Linux ryzen 4.19.0-6-amd64 #1 SMP Debian 4.19.67-2+deb10u2 (2019-11-11) x86_64 GNU/Linux
>
> ry@turbine:~/kmoa/bpf-next$ git log -2
> commit 60d71397d27e7859fdaaaaab6594e4d977ae46e2 (HEAD -> master)
> Author: Ryan Goodfellow <rgoodfel@isi.edu>
> Date:   Wed Jan 15 16:54:39 2020 -0500
>
>     add xdpsock_multidev sample program
>
>     This is a simple program that uses AF_XDP sockets to forward packets
>     between two interfaces using a common memory region and no copying of
>     packets.
>
>     Signed-off-by: Ryan Goodfellow <rgoodfel@isi.edu>
>
> commit 9173cac3b64e6785dd604f5075e6035b045a0026 (origin/master, origin/HEAD)
> Author: Andrii Nakryiko <andriin@fb.com>
> Date:   Wed Jan 15 11:08:56 2020 -0800
>
>     libbpf: Support .text sub-calls relocations
>
>     The LLVM patch https://reviews.llvm.org/D72197 makes LLVM emit function call
>     relocations within the same section. This includes a default .text section,
>     which contains any BPF sub-programs. This wasn't the case before and so libbpf
>     was able to get a way with slightly simpler handling of subprogram call
>     relocations.
>
>     This patch adds support for .text section relocations. It needs to ensure
>     correct order of relocations, so does two passes:
>     - first, relocate .text instructions, if there are any relocations in it;
>     - then process all the other programs and copy over patched .text instructions
>     for all sub-program calls.
>
>     v1->v2:
>     - break early once .text program is processed.
>
>     Signed-off-by: Andrii Nakryiko <andriin@fb.com>
>     Signed-off-by: Alexei Starovoitov <ast@kernel.org>
>     Acked-by: Yonghong Song <yhs@fb.com>
>     Cc: Alexei Starovoitov <ast@kernel.org>
>     Link: https://lore.kernel.org/bpf/20200115190856.2391325-1-andriin@fb.com
>
> --
> ~ ry

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: zero-copy between interfaces
  2020-01-17 17:16     ` [Intel-wired-lan] " Ryan Goodfellow
@ 2020-01-20 17:04       ` Björn Töpel
  -1 siblings, 0 replies; 49+ messages in thread
From: Björn Töpel @ 2020-01-20 17:04 UTC (permalink / raw)
  To: Ryan Goodfellow, Björn Töpel
  Cc: xdp-newbies, Karlsson, Magnus, intel-wired-lan, Magnus Karlsson


On 2020-01-17 18:16, Ryan Goodfellow wrote:
[...]
> 
> https://gitlab.com/mergetb/tech/network-emulation/kernel/snippets/1931080
> 

Ryan, thanks a lot for the detailed report! Much appreciated!

Long story short, the i40e crash is that the driver tries to allocate
256 queues, but the HW is short on queues. The driver enters a broken
state, which triggers the crash.

I'll make sure we'll get a patch for this.


Björn

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: zero-copy between interfaces
  2020-01-20  8:24       ` [Intel-wired-lan] " Magnus Karlsson
@ 2020-01-20 18:33         ` Ryan Goodfellow
  -1 siblings, 0 replies; 49+ messages in thread
From: Ryan Goodfellow @ 2020-01-20 18:33 UTC (permalink / raw)
  To: Magnus Karlsson
  Cc: Björn Töpel, xdp-newbies, Karlsson, Magnus,
	intel-wired-lan, Björn Töpel

On Mon, Jan 20, 2020 at 09:24:05AM +0100, Magnus Karlsson wrote:
> 
> I was wondering if you could run two small experiments since I cannot
> reproduce this?
> 
> 1: Run your program using the two ports on your X710 card. Does it
> work? This is my setup and works for me.

The X710 card works without issue.

> 2: On your Mellanox setup, insert a kick_tx() call for each of your
> two sockets before the poll() call in your forward() function. Just to
> see if it works when we explicitly wake up the driver.

This did not have an effect on the observed behavior. Exactly N packets go
through the interface where N is equal to the size of the FQ/CQ rings and then
forwarding halts.
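
For reference, the kick in question is just an empty sendto() on the socket
fd to wake the driver, roughly as the xdpsock sample does it. A minimal
sketch, assuming a libbpf xsk.h socket (error handling trimmed):

    #include <errno.h>
    #include <stdio.h>
    #include <sys/socket.h>
    #include <bpf/xsk.h>

    static void kick_tx(struct xsk_socket *xsk)
    {
            /* Zero-length send: only wakes the driver so it drains the
             * TX ring; the usual "try again" errnos are not fatal. */
            if (sendto(xsk_socket__fd(xsk), NULL, 0, MSG_DONTWAIT,
                       NULL, 0) >= 0 ||
                errno == EAGAIN || errno == EBUSY || errno == ENOBUFS)
                    return;
            perror("sendto");
    }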

--
~ ry

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: zero-copy between interfaces
  2020-01-16  2:04               ` Ryan Goodfellow
  2020-01-16 14:32                 ` Magnus Karlsson
@ 2020-01-21  7:34                 ` Magnus Karlsson
  2020-01-21 13:40                   ` Maxim Mikityanskiy
  1 sibling, 1 reply; 49+ messages in thread
From: Magnus Karlsson @ 2020-01-21  7:34 UTC (permalink / raw)
  To: Ryan Goodfellow, Maxim Mikityanskiy; +Cc: xdp-newbies

On Thu, Jan 16, 2020 at 3:04 AM Ryan Goodfellow <rgoodfel@isi.edu> wrote:
>
> On Wed, Jan 15, 2020 at 09:20:30AM +0100, Magnus Karlsson wrote:
> > On Wed, Jan 15, 2020 at 8:40 AM Magnus Karlsson
> > <magnus.karlsson@gmail.com> wrote:
> > >
> > > On Wed, Jan 15, 2020 at 2:41 AM Ryan Goodfellow <rgoodfel@isi.edu> wrote:
> > > >
> > > > On Tue, Jan 14, 2020 at 03:52:50PM -0500, Ryan Goodfellow wrote:
> > > > > On Tue, Jan 14, 2020 at 10:59:19AM +0100, Magnus Karlsson wrote:
> > > > > >
> > > > > > Just sent out a patch on the mailing list. Would be great if you could
> > > > > > try it out.
> > > > >
> > > > > Thanks for the quick turnaround. I gave this patch a go, both in the bpf-next
> > > > > tree and manually applied to the 5.5.0-rc3 branch I've been working with up to
> > > > > this point. It does allow for allocating more memory, however packet
> > > > > forwarding no longer works. I did not see any complaints from dmesg, but here
> > > > > is an example iperf3 session from a client that worked before.
> > > > >
> > > > > ry@xd2:~$ iperf3 -c 10.1.0.2
> > > > > Connecting to host 10.1.0.2, port 5201
> > > > > [  5] local 10.1.0.1 port 53304 connected to 10.1.0.2 port 5201
> > > > > [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
> > > > > [  5]   0.00-1.00   sec  5.91 MBytes  49.5 Mbits/sec    2   1.41 KBytes
> > > > > [  5]   1.00-2.00   sec  0.00 Bytes  0.00 bits/sec    1   1.41 KBytes
> > > > > [  5]   2.00-3.00   sec  0.00 Bytes  0.00 bits/sec    0   1.41 KBytes
> > > > > [  5]   3.00-4.00   sec  0.00 Bytes  0.00 bits/sec    1   1.41 KBytes
> > > > > [  5]   4.00-5.00   sec  0.00 Bytes  0.00 bits/sec    0   1.41 KBytes
> > > > > [  5]   5.00-6.00   sec  0.00 Bytes  0.00 bits/sec    0   1.41 KBytes
> > > > > [  5]   6.00-7.00   sec  0.00 Bytes  0.00 bits/sec    1   1.41 KBytes
> > > > > [  5]   7.00-8.00   sec  0.00 Bytes  0.00 bits/sec    0   1.41 KBytes
> > > > > [  5]   8.00-9.00   sec  0.00 Bytes  0.00 bits/sec    0   1.41 KBytes
> > > > > ^C[  5]  10.00-139.77 sec  0.00 Bytes  0.00 bits/sec    4   1.41 KBytes
> > > > > - - - - - - - - - - - - - - - - - - - - - - - - -
> > > > > [ ID] Interval           Transfer     Bitrate         Retr
> > > > > [  5]   0.00-139.77 sec  5.91 MBytes   355 Kbits/sec    9             sender
> > > > > [  5]   0.00-139.77 sec  0.00 Bytes  0.00 bits/sec                  receiver
> > > > > iperf3: interrupt - the client has terminated
> > > > >
> > > > > I'll continue to investigate and report back with anything that I find.
> > > >
> > > > Interestingly I found this behavior to exist in the bpf-next tree independent
> > > > of the patch being present.
> > >
> > > Ryan,
> > >
> > > Could you please do a bisect on it? In the 12 commits after the merge
> > > commit below there are number of sensitive rewrites of the ring access
> > > functions. Maybe one of them breaks your code. When you say "packet
> > > forwarding no longer works", do you mean it works for a second or so,
> > > then no packets come through? What HW are you using?
> > >
> > > commit ce3cec27933c069d2015a81e59b93eb656fe7ee4
> > > Merge: 99cacdc 1d9cb1f
> > > Author: Alexei Starovoitov <ast@kernel.org>
> > > Date:   Fri Dec 20 16:00:10 2019 -0800
> > >
> > >     Merge branch 'xsk-cleanup'
> > >
> > >     Magnus Karlsson says:
> > >
> > >     ====================
> > >     This patch set cleans up the ring access functions of AF_XDP in hope
> > >     that it will now be easier to understand and maintain. I used to get a
> > >     headache every time I looked at this code in order to really understand it,
> > >     but now I do think it is a lot less painful.
> > >     <snip>
> > >
> > > /Magnus
> >
> > I see that you have debug messages in your application. Could you
> > please run with those on and send me the output so I can see where it
> > stops. A bisect that pin-points what commit that breaks your program
> > plus the debug output should hopefully send us on the right path for a
> > fix.
> >
> > Thanks: Magnus
> >
>
> Hi Magnus,
>
> I did a bisect starting from the head of the bpf-next tree (990bca1) down to
> the first commit before the patch series you identified (df034c9). The result
> was identifying df0ae6f as the commit that causes the issue I am seeing.
>
> I've posted output from the program in debugging mode here
>
> - https://gitlab.com/mergetb/tech/network-emulation/kernel/snippets/1930375
>
> Yes, you are correct in that forwarding works for a brief period and then stops.
> I've noticed that the number of packets that are forwarded is equal to the size
> of the producer/consumer descriptor rings. I've posted two ping traces from a
> client ping that shows this.
>
> - https://gitlab.com/mergetb/tech/network-emulation/kernel/snippets/1930376
> - https://gitlab.com/mergetb/tech/network-emulation/kernel/snippets/1930377
>
> I've also noticed that when the forwarding stops, the CPU usage for the proc
> running the program is pegged, which is not the norm for this program as it uses
> a poll call with a timeout on the xsk fd.
>
> The hardware I am using is a Mellanox ConnectX4 2x100G card (MCX416A-CCAT)
> running the mlx5 driver. The program is running in zero copy mode. I also tested
> this code out in a virtual machine with virtio NICs in SKB mode which uses
> xdpgeneric - there were no issues in that setting.
>
> --
> ~ ry

Maxim,

Do you think you could help me debug this issue that Ryan is having? I
can unfortunately not reproduce the stalling issue with my Intel i40e
cards.

Ryan, Maxim is Mellanox's responsible for AF_XDP support and he has
also contributed to the core AF_XDP code. So you are in good hands
:-).

Thanks: Magnus

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: zero-copy between interfaces
  2020-01-21  7:34                 ` Magnus Karlsson
@ 2020-01-21 13:40                   ` Maxim Mikityanskiy
  2020-01-22 21:43                     ` Ryan Goodfellow
  0 siblings, 1 reply; 49+ messages in thread
From: Maxim Mikityanskiy @ 2020-01-21 13:40 UTC (permalink / raw)
  To: Ryan Goodfellow
  Cc: Magnus Karlsson, xdp-newbies, Tariq Toukan, Saeed Mahameed,
	Moshe Shemesh

On 2020-01-21 09:34, Magnus Karlsson wrote:
> On Thu, Jan 16, 2020 at 3:04 AM Ryan Goodfellow <rgoodfel@isi.edu> wrote:
>>
>> On Wed, Jan 15, 2020 at 09:20:30AM +0100, Magnus Karlsson wrote:
>>> On Wed, Jan 15, 2020 at 8:40 AM Magnus Karlsson
>>> <magnus.karlsson@gmail.com> wrote:
>>>>
>>>> On Wed, Jan 15, 2020 at 2:41 AM Ryan Goodfellow <rgoodfel@isi.edu> wrote:
>>>>>
>>>>> On Tue, Jan 14, 2020 at 03:52:50PM -0500, Ryan Goodfellow wrote:
>>>>>> On Tue, Jan 14, 2020 at 10:59:19AM +0100, Magnus Karlsson wrote:
>>>>>>>
>>>>>>> Just sent out a patch on the mailing list. Would be great if you could
>>>>>>> try it out.
>>>>>>
>>>>>> Thanks for the quick turnaround. I gave this patch a go, both in the bpf-next
>>>>>> tree and manually applied to the 5.5.0-rc3 branch I've been working with up to
>>>>>> this point. It does allow for allocating more memory, however packet
>>>>>> forwarding no longer works. I did not see any complaints from dmesg, but here
>>>>>> is an example iperf3 session from a client that worked before.
>>>>>>
>>>>>> ry@xd2:~$ iperf3 -c 10.1.0.2
>>>>>> Connecting to host 10.1.0.2, port 5201
>>>>>> [  5] local 10.1.0.1 port 53304 connected to 10.1.0.2 port 5201
>>>>>> [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
>>>>>> [  5]   0.00-1.00   sec  5.91 MBytes  49.5 Mbits/sec    2   1.41 KBytes
>>>>>> [  5]   1.00-2.00   sec  0.00 Bytes  0.00 bits/sec    1   1.41 KBytes
>>>>>> [  5]   2.00-3.00   sec  0.00 Bytes  0.00 bits/sec    0   1.41 KBytes
>>>>>> [  5]   3.00-4.00   sec  0.00 Bytes  0.00 bits/sec    1   1.41 KBytes
>>>>>> [  5]   4.00-5.00   sec  0.00 Bytes  0.00 bits/sec    0   1.41 KBytes
>>>>>> [  5]   5.00-6.00   sec  0.00 Bytes  0.00 bits/sec    0   1.41 KBytes
>>>>>> [  5]   6.00-7.00   sec  0.00 Bytes  0.00 bits/sec    1   1.41 KBytes
>>>>>> [  5]   7.00-8.00   sec  0.00 Bytes  0.00 bits/sec    0   1.41 KBytes
>>>>>> [  5]   8.00-9.00   sec  0.00 Bytes  0.00 bits/sec    0   1.41 KBytes
>>>>>> ^C[  5]  10.00-139.77 sec  0.00 Bytes  0.00 bits/sec    4   1.41 KBytes
>>>>>> - - - - - - - - - - - - - - - - - - - - - - - - -
>>>>>> [ ID] Interval           Transfer     Bitrate         Retr
>>>>>> [  5]   0.00-139.77 sec  5.91 MBytes   355 Kbits/sec    9             sender
>>>>>> [  5]   0.00-139.77 sec  0.00 Bytes  0.00 bits/sec                  receiver
>>>>>> iperf3: interrupt - the client has terminated
>>>>>>
>>>>>> I'll continue to investigate and report back with anything that I find.
>>>>>
>>>>> Interestingly I found this behavior to exist in the bpf-next tree independent
>>>>> of the patch being present.
>>>>
>>>> Ryan,
>>>>
>>>> Could you please do a bisect on it? In the 12 commits after the merge
>>>> commit below there are number of sensitive rewrites of the ring access
>>>> functions. Maybe one of them breaks your code. When you say "packet
>>>> forwarding no longer works", do you mean it works for a second or so,
>>>> then no packets come through? What HW are you using?
>>>>
>>>> commit ce3cec27933c069d2015a81e59b93eb656fe7ee4
>>>> Merge: 99cacdc 1d9cb1f
>>>> Author: Alexei Starovoitov <ast@kernel.org>
>>>> Date:   Fri Dec 20 16:00:10 2019 -0800
>>>>
>>>>      Merge branch 'xsk-cleanup'
>>>>
>>>>      Magnus Karlsson says:
>>>>
>>>>      ====================
>>>>      This patch set cleans up the ring access functions of AF_XDP in hope
>>>>      that it will now be easier to understand and maintain. I used to get a
>>>>      headache every time I looked at this code in order to really understand it,
>>>>      but now I do think it is a lot less painful.
>>>>      <snip>
>>>>
>>>> /Magnus
>>>
>>> I see that you have debug messages in your application. Could you
>>> please run with those on and send me the output so I can see where it
>>> stops. A bisect that pin-points what commit that breaks your program
>>> plus the debug output should hopefully send us on the right path for a
>>> fix.
>>>
>>> Thanks: Magnus
>>>

Hi Ryan,

>> Hi Magnus,
>>
>> I did a bisect starting from the head of the bpf-next tree (990bca1) down to
>> the first commit before the patch series you identified (df034c9). The result
>> was identifying df0ae6f as the commit that causes the issue I am seeing.

This commit and the one before it remove batching in xskq_nb_avail. 
Before these two commits, xskq_nb_avail returned at most 16 entries; 
now the value is not limited. It looks like too minor a change to cause 
this issue...

>> I've posted output from the program in debugging mode here
>>
>> - https://gitlab.com/mergetb/tech/network-emulation/kernel/snippets/1930375
>>
>> Yes, you are correct in that forwarding works for a brief period and then stops.
>> I've noticed that the number of packets that are forwarded is equal to the size
>> of the producer/consumer descriptor rings. I've posted two ping traces from a
>> client ping that shows this.
>>
>> - https://gitlab.com/mergetb/tech/network-emulation/kernel/snippets/1930376
>> - https://gitlab.com/mergetb/tech/network-emulation/kernel/snippets/1930377

These snippets are not available.

>>
>> I've also noticed that when the forwarding stops, the CPU usage for the proc
>> running the program is pegged, which is not the norm for this program as it uses
>> a poll call with a timeout on the xsk fd.

This information led me to a guess about what may be happening. On the RX 
side, mlx5e allocates pages in bulk for performance reasons and to 
leverage hardware features targeted at performance. In AF_XDP mode, 
bulking of frames is also used (on x86, the bulk size is 64 with 
striding RQ enabled, and 8 otherwise; these are implementation details 
that might change later). If you don't put enough frames into the XSK 
Fill Ring, the driver will keep demanding more frames and return from 
poll() immediately. Basically, in the application, you should put as 
many frames into the Fill Ring as you can. Please check if that could be 
the root cause of your issue.
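
To make "put as many frames as you can" concrete, here is a minimal sketch
using the libbpf xsk.h ring helpers; the free-list bookkeeping (frames and
nfree) is hypothetical application state, not something the library
provides:

    #include <bpf/xsk.h>

    /* Top up the fill ring with every frame currently on the free list. */
    static void refill_fq(struct xsk_ring_prod *fq, __u64 *frames,
                          unsigned int *nfree)
    {
            unsigned int i, n;
            __u32 idx;

            n = xsk_prod_nb_free(fq, *nfree);
            if (n > *nfree)
                    n = *nfree;
            if (n == 0 || xsk_ring_prod__reserve(fq, n, &idx) != n)
                    return;

            for (i = 0; i < n; i++)
                    *xsk_ring_prod__fill_addr(fq, idx + i) = frames[--(*nfree)];

            xsk_ring_prod__submit(fq, n);
    }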

I tracked this issue in our internal bug tracker in case we need to 
perform actual debugging of mlx5e. I'm looking forward to your feedback 
on my assumption above.

>> The hardware I am using is a Mellanox ConnectX4 2x100G card (MCX416A-CCAT)
>> running the mlx5 driver.

This one should run without striding RQ, please verify it with ethtool 
--show-priv-flags (the flag name is rx_striding_rq).
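
Something along these lines should show it, and the flag can also be
toggled for testing, assuming your setup allows changing it:

    ethtool --show-priv-flags <dev> | grep rx_striding_rq
    ethtool --set-priv-flags <dev> rx_striding_rq off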

> The program is running in zero copy mode. I also tested
>> this code out in a virtual machine with virtio NICs in SKB mode which uses
>> xdpgeneric - there were no issues in that setting.
>>
>> --
>> ~ ry
> 
> Maxim,
> 
> Do you think you could help me debug this issue that Ryan is having? I
> can unfortunately not reproduce the stalling issue with my Intel i40e
> cards.
> 
> Ryan, Maxim is Mellanox's responsible for AF_XDP support and he has
> also contributed to the core AF_XDP code. So you are in good hands
> :-).
> 
> Thanks: Magnus
> 


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: zero-copy between interfaces
  2020-01-21 13:40                   ` Maxim Mikityanskiy
@ 2020-01-22 21:43                     ` Ryan Goodfellow
  2020-01-27 14:01                       ` Maxim Mikityanskiy
  0 siblings, 1 reply; 49+ messages in thread
From: Ryan Goodfellow @ 2020-01-22 21:43 UTC (permalink / raw)
  To: Maxim Mikityanskiy
  Cc: Magnus Karlsson, xdp-newbies, Tariq Toukan, Saeed Mahameed,
	Moshe Shemesh

On Tue, Jan 21, 2020 at 01:40:50PM +0000, Maxim Mikityanskiy wrote:
> >> I've posted output from the program in debugging mode here
> >>
> >> - https://gitlab.com/mergetb/tech/network-emulation/kernel/snippets/1930375
> >>
> >> Yes, you are correct in that forwarding works for a brief period and then stops.
> >> I've noticed that the number of packets that are forwarded is equal to the size
> >> of the producer/consumer descriptor rings. I've posted two ping traces from a
> >> client ping that shows this.
> >>
> >> - https://gitlab.com/mergetb/tech/network-emulation/kernel/snippets/1930376
> >> - https://gitlab.com/mergetb/tech/network-emulation/kernel/snippets/1930377
> 
> These snippets are not available.

Apologies, I had the wrong permissions set. They should be available now.

> 
> >>
> >> I've also noticed that when the forwarding stops, the CPU usage for the proc
> >> running the program is pegged, which is not the norm for this program as it uses
> >> a poll call with a timeout on the xsk fd.
> 
> This information led me to a guess what may be happening. On the RX 
> side, mlx5e allocates pages in bulks for performance reasons and to 
> leverage hardware features targeted to performance. In AF_XDP mode, 
> bulking of frames is also used (on x86, the bulk size is 64 with 
> striding RQ enabled, and 8 otherwise, however, it's implementation 
> details that might change later). If you don't put enough frames to XSK 
> Fill Ring, the driver will be demanding more frames and return from 
> poll() immediately. Basically, in the application, you should put as 
> many frames to the Fill Ring as you can. Please check if that could be 
> the root cause of your issue.

The code in this application makes an effort to replenish the fill ring as
fast as possible. The basic loop of the application is to first check
whether there are any descriptors to be consumed from the completion queue
or any descriptors that can be added to the fill queue, and only then to
move packets through the rx and tx rings.

https://gitlab.com/mergetb/tech/network-emulation/kernel/blob/v5.5-moa/samples/bpf/xdpsock_multidev.c#L452-474

> 
> I tracked this issue in our internal bug tracker in case we need to 
> perform actual debugging of mlx5e. I'm looking forward to your feedback 
> on my assumption above.
> 
> >> The hardware I am using is a Mellanox ConnectX4 2x100G card (MCX416A-CCAT)
> >> running the mlx5 driver.
> 
> This one should run without striding RQ, please verify it with ethtool 
> --show-priv-flags (the flag name is rx_striding_rq).

I do not remember changing this option, so whatever the default is, is what it
was running with. I am traveling this week and do not have access to these
systems, but will ensure that this flag is set properly when I get back.

-- 
~ ry

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: zero-copy between interfaces
  2020-01-18 14:08           ` Ryan Goodfellow
@ 2020-01-26  4:53             ` Dan Siemon
  0 siblings, 0 replies; 49+ messages in thread
From: Dan Siemon @ 2020-01-26  4:53 UTC (permalink / raw)
  To: Ryan Goodfellow, Jesper Dangaard Brouer; +Cc: xdp-newbies

On Sat, 2020-01-18 at 09:08 -0500, Ryan Goodfellow wrote:
> On Sat, Jan 18, 2020 at 11:14:05AM +0100, Jesper Dangaard
> Brouerwrote:
> > 
> > I'm wondering why did you choose/need AF_XDP technology for doing
> > forwarding?
> 
> This is just a sample program used to demonstrate moving packets
> between
> different interfaces efficiently using AF_XDP.
> 
> Our actual use case is performing network emulation in userspace. For
> example,
> representing impaired links or entire networks with link-by-link
> shaping
> specifications. We are using AF_XDP to get packets to/from our
> network emulation
> software as quickly as possible without having to go through the
> entire network
> stack, as the emulation host's network configuration does not
> influence the
> networks it's emulating.
> 
> Traditionally we've used DPDK for this, but are porting to AF_XDP for
> the 
> relative simplicity and flexibility it provides. Some specific
> benefits for us
> are:
> 
> - Can attach to VTEPs which allows us to hook into some EVPN/VXLAN
> based
>   networks we have easily. Alternatively with the BPF program
> flexibility, we
>   also have the option to split out BGP control plane traffic from
> overlay
>   traffic when attaching to the physical interface and pass it
> through to the 
>   kernel. Both of these approaches let the kernel manage the FDB for
> VTEPs as 
>   well as taking care of encap/decap (potentially offloaded to the
> NIC itself) 
>   and let our software focus on emulation.
> 
> - Using XDP in virtual machines in our testing environment is
> straightforward,
>   while this is possible with DPDK and virtio, the setup was rather
> convoluted.

I'm in a very similar situation. The per-packet work we do is
complicated. Doing it all in BPF would be much more painful if even
possible. For us, AF_XDP is a nice, simple DPDK. We have no need for
BPF in the AF_XDP path.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: zero-copy between interfaces
  2020-01-22 21:43                     ` Ryan Goodfellow
@ 2020-01-27 14:01                       ` Maxim Mikityanskiy
  2020-01-27 15:54                         ` Magnus Karlsson
  0 siblings, 1 reply; 49+ messages in thread
From: Maxim Mikityanskiy @ 2020-01-27 14:01 UTC (permalink / raw)
  To: Ryan Goodfellow, Magnus Karlsson
  Cc: xdp-newbies, Tariq Toukan, Saeed Mahameed, Moshe Shemesh

On 2020-01-22 23:43, Ryan Goodfellow wrote:
> On Tue, Jan 21, 2020 at 01:40:50PM +0000, Maxim Mikityanskiy wrote:
>>>> I've posted output from the program in debugging mode here
>>>>
>>>> - https://gitlab.com/mergetb/tech/network-emulation/kernel/snippets/1930375
>>>>
>>>> Yes, you are correct in that forwarding works for a brief period and then stops.
>>>> I've noticed that the number of packets that are forwarded is equal to the size
>>>> of the producer/consumer descriptor rings. I've posted two ping traces from a
>>>> client ping that shows this.
>>>>
>>>> - https://gitlab.com/mergetb/tech/network-emulation/kernel/snippets/1930376
>>>> - https://gitlab.com/mergetb/tech/network-emulation/kernel/snippets/1930377
>>
>> These snippets are not available.
> 
> Apologies, I had the wrong permissions set. They should be available now.
> 
>>
>>>>
>>>> I've also noticed that when the forwarding stops, the CPU usage for the proc
>>>> running the program is pegged, which is not the norm for this program as it uses
>>>> a poll call with a timeout on the xsk fd.
>>
>> This information led me to a guess what may be happening. On the RX
>> side, mlx5e allocates pages in bulks for performance reasons and to
>> leverage hardware features targeted to performance. In AF_XDP mode,
>> bulking of frames is also used (on x86, the bulk size is 64 with
>> striding RQ enabled, and 8 otherwise, however, it's implementation
>> details that might change later). If you don't put enough frames to XSK
>> Fill Ring, the driver will be demanding more frames and return from
>> poll() immediately. Basically, in the application, you should put as
>> many frames to the Fill Ring as you can. Please check if that could be
>> the root cause of your issue.
> 
> The code in this application makes an effort to relenish the fill ring as fast
> as possible. The basic loop of the application is to first check if there are
> any descriptors to be consumed from the completion queue or any descriptors that
> can be added to the fill queue, and only then to move on to moving packets
> through the rx and tx rings.
> 
> https://gitlab.com/mergetb/tech/network-emulation/kernel/blob/v5.5-moa/samples/bpf/xdpsock_multidev.c#L452-474

I reproduced your issue and did my investigation, and here is what I found:

1. Commit df0ae6f78a45 (that you found during bisect) introduces an 
important behavioral change (which I thought was not that important). 
xskq_nb_avail used to return min(entries, dcnt), but after the change it 
just returns entries, which may be as big as the ring size.

2. xskq_peek_addr updates q->ring->consumer only when q->cons_tail 
catches up with q->cons_head. So, before that patch and one previous 
patch, cons_head - cons_tail was not more than 16, so the consumer index 
was updated periodically. Now consumer is updated only when the whole 
ring is exhausted.

3. The application can't replenish the fill ring if the consumer index 
doesn't move. As a consequence, refilling the descriptors by the 
application can't happen in parallel with using them by the driver. It 
should have some performance penalty and possibly even lead to packet 
drops, because the driver uses all the descriptors and only then 
advances the consumer index, and then it has to wait until the 
application refills the ring, busy-looping and losing packets.

4. As mlx5e allocates frames in batches, the consequences are even more 
severe: it's a deadlock where the driver waits for the application, and 
vice versa. The driver never reaches the point where cons_tail gets 
equal to cons_head. E.g., if cons_tail + 3 == cons_head, and the batch 
size requested by the driver is 8, the driver won't peek anything from 
the fill ring waiting for difference between cons_tail and cons_head to 
increase to be at least 8. On the other hand, the application can't put 
anything to the ring, because it still thinks that the consumer index is 
0. As cons_tail never reaches cons_head, the consumer index doesn't get 
updated, hence the deadlock.

So, in my view, the decision to remove RX_BATCH_SIZE and the periodic 
updates of the consumer index was wrong. It totally breaks mlx5e, which 
does batching, and it will affect the performance of any driver, because 
the application can't refill the ring until it is completely empty and 
the driver is already waiting for frames. I suggest that periodic updates 
of the consumer index be re-added to xskq_cons_peek_addr.
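
To make the before/after concrete, here is a simplified sketch from memory
(the cached-producer refresh and other details are omitted; the real code
is in df0ae6f78a45 and its parent commit):

    /* Before the cleanup: never report more than RX_BATCH_SIZE (16)
     * entries, so cons_head - cons_tail <= 16 and the consumer index
     * gets republished at least every 16 descriptors. */
    static u32 xskq_nb_avail(struct xsk_queue *q, u32 dcnt)
    {
            u32 entries = q->prod_tail - q->cons_tail;

            return (entries > dcnt) ? dcnt : entries;
    }

    /* After the cleanup: report everything that is available, so
     * cons_head can jump all the way up to the producer and the
     * consumer index is only republished once the whole batch --
     * potentially the whole ring -- has been consumed. */
    static u32 xskq_nb_avail(struct xsk_queue *q)
    {
            return q->prod_tail - q->cons_tail;
    }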

Magnus, what do you think of the suggestion above?

Thanks,
Max

>>
>> I tracked this issue in our internal bug tracker in case we need to
>> perform actual debugging of mlx5e. I'm looking forward to your feedback
>> on my assumption above.
>>
>>>> The hardware I am using is a Mellanox ConnectX4 2x100G card (MCX416A-CCAT)
>>>> running the mlx5 driver.
>>
>> This one should run without striding RQ, please verify it with ethtool
>> --show-priv-flags (the flag name is rx_striding_rq).
> 
> I do not remember changing this option, so whatever the default is, is what it
> was running with. I am traveling this week and do not have access to these
> systems, but will ensure that this flag is set properly when I get back.
> 


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: zero-copy between interfaces
  2020-01-27 14:01                       ` Maxim Mikityanskiy
@ 2020-01-27 15:54                         ` Magnus Karlsson
  2020-01-30  9:37                           ` Maxim Mikityanskiy
  0 siblings, 1 reply; 49+ messages in thread
From: Magnus Karlsson @ 2020-01-27 15:54 UTC (permalink / raw)
  To: Maxim Mikityanskiy
  Cc: Ryan Goodfellow, xdp-newbies, Tariq Toukan, Saeed Mahameed,
	Moshe Shemesh

On Mon, Jan 27, 2020 at 3:01 PM Maxim Mikityanskiy <maximmi@mellanox.com> wrote:
>
> On 2020-01-22 23:43, Ryan Goodfellow wrote:
> > On Tue, Jan 21, 2020 at 01:40:50PM +0000, Maxim Mikityanskiy wrote:
> >>>> I've posted output from the program in debugging mode here
> >>>>
> >>>> - https://gitlab.com/mergetb/tech/network-emulation/kernel/snippets/1930375
> >>>>
> >>>> Yes, you are correct in that forwarding works for a brief period and then stops.
> >>>> I've noticed that the number of packets that are forwarded is equal to the size
> >>>> of the producer/consumer descriptor rings. I've posted two ping traces from a
> >>>> client ping that shows this.
> >>>>
> >>>> - https://gitlab.com/mergetb/tech/network-emulation/kernel/snippets/1930376
> >>>> - https://gitlab.com/mergetb/tech/network-emulation/kernel/snippets/1930377
> >>
> >> These snippets are not available.
> >
> > Apologies, I had the wrong permissions set. They should be available now.
> >
> >>
> >>>>
> >>>> I've also noticed that when the forwarding stops, the CPU usage for the proc
> >>>> running the program is pegged, which is not the norm for this program as it uses
> >>>> a poll call with a timeout on the xsk fd.
> >>
> >> This information led me to a guess what may be happening. On the RX
> >> side, mlx5e allocates pages in bulks for performance reasons and to
> >> leverage hardware features targeted to performance. In AF_XDP mode,
> >> bulking of frames is also used (on x86, the bulk size is 64 with
> >> striding RQ enabled, and 8 otherwise, however, it's implementation
> >> details that might change later). If you don't put enough frames to XSK
> >> Fill Ring, the driver will be demanding more frames and return from
> >> poll() immediately. Basically, in the application, you should put as
> >> many frames to the Fill Ring as you can. Please check if that could be
> >> the root cause of your issue.
> >
> > The code in this application makes an effort to relenish the fill ring as fast
> > as possible. The basic loop of the application is to first check if there are
> > any descriptors to be consumed from the completion queue or any descriptors that
> > can be added to the fill queue, and only then to move on to moving packets
> > through the rx and tx rings.
> >
> > https://gitlab.com/mergetb/tech/network-emulation/kernel/blob/v5.5-moa/samples/bpf/xdpsock_multidev.c#L452-474
>
> I reproduced your issue and did my investigation, and here is what I found:
>
> 1. Commit df0ae6f78a45 (that you found during bisect) introduces an
> important behavioral change (which I thought was not that important).
> xskq_nb_avail used to return min(entries, dcnt), but after the change it
> just returns entries, which may be as big as the ring size.
>
> 2. xskq_peek_addr updates q->ring->consumer only when q->cons_tail
> catches up with q->cons_head. So, before that patch and one previous
> patch, cons_head - cons_tail was not more than 16, so the consumer index
> was updated periodically. Now consumer is updated only when the whole
> ring is exhausted.
>
> 3. The application can't replenish the fill ring if the consumer index
> doesn't move. As a consequence, refilling the descriptors by the
> application can't happen in parallel with using them by the driver. It
> should have some performance penalty and possibly even lead to packet
> drops, because the driver uses all the descriptors and only then
> advances the consumer index, and then it has to wait until the
> application refills the ring, busy-looping and losing packets.

This will happen if user space always fills up the whole ring, which
might or might not happen, depending on the app. With that said, it
might provide better performance to update it once in a while. It
might also be the case that we get better performance with the
new scheme if we only fill half the fill ring. I will look into this
and see what I get.

> 4. As mlx5e allocates frames in batches, the consequences are even more
> severe: it's a deadlock where the driver waits for the application, and
> vice versa. The driver never reaches the point where cons_tail gets
> equal to cons_head. E.g., if cons_tail + 3 == cons_head, and the batch
> size requested by the driver is 8, the driver won't peek anything from
> the fill ring waiting for difference between cons_tail and cons_head to
> increase to be at least 8. On the other hand, the application can't put
> anything to the ring, because it still thinks that the consumer index is
> 0. As cons_tail never reaches cons_head, the consumer index doesn't get
> updated, hence the deadlock.

Good thing that you detected this. Maybe I should get a Mellanox card
:-). This is different from how we implemented Intel's drivers, which
work with any batch size: if a driver gets 3 packets back, it will use
those. How do you deal with the case when the application just puts a
single buffer in the fill ring and wants to receive a single packet?
Why does the Mellanox driver need a specific batch size? This is just
for my understanding so we can find a good solution.

> So, in my vision, the decision to remove RX_BATCH_SIZE and periodic
> updates of the consumer index was wrong. It totally breaks mlx5e, that
> does batching, and it will affect the performance of any driver, because
> the application can't refill the ring until it gets completely empty and
> the driver starts waiting for frames. I suggest that periodic updates of
> the consumer index should be readded to xskq_cons_peek_addr.

The reason I wanted to remove RX_BATCH_SIZE is that application
developers complained about it giving rise to counterintuitive
behavior in user space. I will try to dig out the complaints and the
explanations Björn and I had to send about details that users really
should not have to care about. It should just work. I also do not like
having arbitrary constants like this. Why 16? I would much prefer not
having to deal with this, unless of course it horribly breaks
something or gives rise to worse performance. That might still be the
case here, but if not, I would like to remove it.

Thanks: Magnus

> Magnus, what do you think of the suggestion above?
>
> Thanks,
> Max
>
> >>
> >> I tracked this issue in our internal bug tracker in case we need to
> >> perform actual debugging of mlx5e. I'm looking forward to your feedback
> >> on my assumption above.
> >>
> >>>> The hardware I am using is a Mellanox ConnectX4 2x100G card (MCX416A-CCAT)
> >>>> running the mlx5 driver.
> >>
> >> This one should run without striding RQ, please verify it with ethtool
> >> --show-priv-flags (the flag name is rx_striding_rq).
> >
> > I do not remember changing this option, so whatever the default is, is what it
> > was running with. I am traveling this week and do not have access to these
> > systems, but will ensure that this flag is set properly when I get back.
> >
>

^ permalink raw reply	[flat|nested] 49+ messages in thread

* RE: zero-copy between interfaces
  2020-01-27 15:54                         ` Magnus Karlsson
@ 2020-01-30  9:37                           ` Maxim Mikityanskiy
  2020-01-30  9:59                             ` Magnus Karlsson
  0 siblings, 1 reply; 49+ messages in thread
From: Maxim Mikityanskiy @ 2020-01-30  9:37 UTC (permalink / raw)
  To: Magnus Karlsson
  Cc: Ryan Goodfellow, xdp-newbies, Tariq Toukan, Saeed Mahameed,
	Moshe Shemesh

> -----Original Message-----
> From: Magnus Karlsson <magnus.karlsson@gmail.com>
> Sent: 27 January, 2020 17:55
> To: Maxim Mikityanskiy <maximmi@mellanox.com>
> Cc: Ryan Goodfellow <rgoodfel@isi.edu>; xdp-newbies@vger.kernel.org; Tariq
> Toukan <tariqt@mellanox.com>; Saeed Mahameed <saeedm@mellanox.com>; Moshe
> Shemesh <moshe@mellanox.com>
> Subject: Re: zero-copy between interfaces
> 
> On Mon, Jan 27, 2020 at 3:01 PM Maxim Mikityanskiy <maximmi@mellanox.com>
> wrote:
> >
> > On 2020-01-22 23:43, Ryan Goodfellow wrote:
> > > On Tue, Jan 21, 2020 at 01:40:50PM +0000, Maxim Mikityanskiy wrote:
> > >>>> I've posted output from the program in debugging mode here
> > >>>>
> > >>>> - https://gitlab.com/mergetb/tech/network-
> emulation/kernel/snippets/1930375
> > >>>>
> > >>>> Yes, you are correct in that forwarding works for a brief period and
> then stops.
> > >>>> I've noticed that the number of packets that are forwarded is equal
> to the size
> > >>>> of the producer/consumer descriptor rings. I've posted two ping
> traces from a
> > >>>> client ping that shows this.
> > >>>>
> > >>>> - https://gitlab.com/mergetb/tech/network-
> emulation/kernel/snippets/1930376
> > >>>> - https://gitlab.com/mergetb/tech/network-
> emulation/kernel/snippets/1930377
> > >>
> > >> These snippets are not available.
> > >
> > > Apologies, I had the wrong permissions set. They should be available
> now.
> > >
> > >>
> > >>>>
> > >>>> I've also noticed that when the forwarding stops, the CPU usage for
> the proc
> > >>>> running the program is pegged, which is not the norm for this program
> as it uses
> > >>>> a poll call with a timeout on the xsk fd.
> > >>
> > >> This information led me to a guess what may be happening. On the RX
> > >> side, mlx5e allocates pages in bulks for performance reasons and to
> > >> leverage hardware features targeted to performance. In AF_XDP mode,
> > >> bulking of frames is also used (on x86, the bulk size is 64 with
> > >> striding RQ enabled, and 8 otherwise, however, it's implementation
> > >> details that might change later). If you don't put enough frames to XSK
> > >> Fill Ring, the driver will be demanding more frames and return from
> > >> poll() immediately. Basically, in the application, you should put as
> > >> many frames to the Fill Ring as you can. Please check if that could be
> > >> the root cause of your issue.
> > >
> > > The code in this application makes an effort to relenish the fill ring
> as fast
> > > as possible. The basic loop of the application is to first check if
> there are
> > > any descriptors to be consumed from the completion queue or any
> descriptors that
> > > can be added to the fill queue, and only then to move on to moving
> packets
> > > through the rx and tx rings.
> > >
> > > https://gitlab.com/mergetb/tech/network-emulation/kernel/blob/v5.5-
> moa/samples/bpf/xdpsock_multidev.c#L452-474
> >
> > I reproduced your issue and did my investigation, and here is what I
> found:
> >
> > 1. Commit df0ae6f78a45 (that you found during bisect) introduces an
> > important behavioral change (which I thought was not that important).
> > xskq_nb_avail used to return min(entries, dcnt), but after the change it
> > just returns entries, which may be as big as the ring size.
> >
> > 2. xskq_peek_addr updates q->ring->consumer only when q->cons_tail
> > catches up with q->cons_head. So, before that patch and one previous
> > patch, cons_head - cons_tail was not more than 16, so the consumer index
> > was updated periodically. Now consumer is updated only when the whole
> > ring is exhausted.
> >
> > 3. The application can't replenish the fill ring if the consumer index
> > doesn't move. As a consequence, refilling the descriptors by the
> > application can't happen in parallel with using them by the driver. It
> > should have some performance penalty and possibly even lead to packet
> > drops, because the driver uses all the descriptors and only then
> > advances the consumer index, and then it has to wait until the
> > application refills the ring, busy-looping and losing packets.
> 
> This will happen if user space always fills up the whole ring, which
> might or might not happen all depending on the app.

Yes, that's right, and as far as I know, it's common to fill as many
frames as the application can (there was no reason to do it any other
way until now).

> With that said, it
> might provide better performance to update it once in a while. It
> might also be the case that we will get better performance with the
> new scheme if we only fill half the fill ring.

Yes, it may improve performance. However, I don't think it's correct to
set such a limitation on the app.

> I will look into this
> and see what I get.
> 
> > 4. As mlx5e allocates frames in batches, the consequences are even more
> > severe: it's a deadlock where the driver waits for the application, and
> > vice versa. The driver never reaches the point where cons_tail gets
> > equal to cons_head. E.g., if cons_tail + 3 == cons_head, and the batch
> > size requested by the driver is 8, the driver won't peek anything from
> > the fill ring waiting for difference between cons_tail and cons_head to
> > increase to be at least 8. On the other hand, the application can't put
> > anything to the ring, because it still thinks that the consumer index is
> > 0. As cons_tail never reaches cons_head, the consumer index doesn't get
> > updated, hence the deadlock.
> 
> Good thing that you detected this. Maybe I should get a Mellanox card
> :-). This is different from how we implemented Intel's drivers that
> just work on any batch size. If it gets 3 packets back, it will use
> those. How do you deal with the case when the application just puts a
> single buffer in the fill ring and wants to receive a single packet?

mlx5e will wait until the full batch is available. As AF_XDP is intended
for high-performance apps, this scenario is less expected. We prefer to
leverage our performance features.

> Why does the Mellanox driver need a specific batch size? This is just
> for my understanding so we can find a good solution.

The main reason is our performance feature called striding RQ. Skipping
all irrelevant details, a batch of 64 pages is posted to the NIC with a
single request, and the NIC fills them progressively. This feature is
turned on by default on modern NICs, and it's really good for
performance.

It might be possible to post a smaller batch, though; I'm not sure, it
needs to be checked. In any case, it's not something we are likely to
do, because it complicates the data path, and if we know more frames
are coming, it's much better for performance to wait for them rather
than post several incomplete batches.

> > So, in my vision, the decision to remove RX_BATCH_SIZE and periodic
> > updates of the consumer index was wrong. It totally breaks mlx5e, that
> > does batching, and it will affect the performance of any driver, because
> > the application can't refill the ring until it gets completely empty and
> > the driver starts waiting for frames. I suggest that periodic updates of
> > the consumer index should be readded to xskq_cons_peek_addr.
> 
> The reason I wanted to remove RX_BATCH_SIZE is that application
> developers complained about it giving rise to counter intuitive
> behavior in user space. I will try to dig out the complaints and the
> explanations Björn and I had to send which it seemed that users really
> should not have to care about. It should just work.

I think the counter that doesn't update till the very last moment and
then advances by the ring size will also be something to complain about
(and I am the first one to complain :D). Such bursts are not good in any
case.

> I also do not like
> to have arbitrary constants like this. Why 16?

I believe any batching mechanism has a constant that looks arbitrary.
This constant should be the golden mean: if it's too small, there is
little effect from batching; if it's too big, it gets too bursty.

Basically, after your patch it just changed from 16 to the ring size.
Maybe we can tie that constant to ring size? Make it ring_size /
another_arbitrary_constant? :)
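
Purely illustrative, and the macro name is made up, but something like:

    /* republish the consumer index at least every ring_size/8 entries;
     * the divisor is of course just another knob to argue about */
    #define XSKQ_CONS_PUBLISH_BATCH(nentries) ((nentries) >> 3)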

> Would much prefer not
> having to deal with this, unless of course it horribly breaks
> something or gives rise to worse performance. Might still be the case
> here, but if not, I would like to remove it.
> 
> Thanks: Magnus
> 
> > Magnus, what do you think of the suggestion above?
> >
> > Thanks,
> > Max
> >
> > >>
> > >> I tracked this issue in our internal bug tracker in case we need to
> > >> perform actual debugging of mlx5e. I'm looking forward to your feedback
> > >> on my assumption above.
> > >>
> > >>>> The hardware I am using is a Mellanox ConnectX4 2x100G card (MCX416A-
> CCAT)
> > >>>> running the mlx5 driver.
> > >>
> > >> This one should run without striding RQ, please verify it with ethtool
> > >> --show-priv-flags (the flag name is rx_striding_rq).
> > >
> > > I do not remember changing this option, so whatever the default is, is
> what it
> > > was running with. I am traveling this week and do not have access to
> these
> > > systems, but will ensure that this flag is set properly when I get back.
> > >
> >

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: zero-copy between interfaces
  2020-01-30  9:37                           ` Maxim Mikityanskiy
@ 2020-01-30  9:59                             ` Magnus Karlsson
  2020-01-30 11:40                               ` Magnus Karlsson
  0 siblings, 1 reply; 49+ messages in thread
From: Magnus Karlsson @ 2020-01-30  9:59 UTC (permalink / raw)
  To: Maxim Mikityanskiy
  Cc: Ryan Goodfellow, xdp-newbies, Tariq Toukan, Saeed Mahameed,
	Moshe Shemesh

On Thu, Jan 30, 2020 at 10:37 AM Maxim Mikityanskiy
<maximmi@mellanox.com> wrote:
>
> > -----Original Message-----
> > From: Magnus Karlsson <magnus.karlsson@gmail.com>
> > Sent: 27 January, 2020 17:55
> > To: Maxim Mikityanskiy <maximmi@mellanox.com>
> > Cc: Ryan Goodfellow <rgoodfel@isi.edu>; xdp-newbies@vger.kernel.org; Tariq
> > Toukan <tariqt@mellanox.com>; Saeed Mahameed <saeedm@mellanox.com>; Moshe
> > Shemesh <moshe@mellanox.com>
> > Subject: Re: zero-copy between interfaces
> >
> > On Mon, Jan 27, 2020 at 3:01 PM Maxim Mikityanskiy <maximmi@mellanox.com>
> > wrote:
> > >
> > > On 2020-01-22 23:43, Ryan Goodfellow wrote:
> > > > On Tue, Jan 21, 2020 at 01:40:50PM +0000, Maxim Mikityanskiy wrote:
> > > >>>> I've posted output from the program in debugging mode here
> > > >>>>
> > > >>>> - https://gitlab.com/mergetb/tech/network-
> > emulation/kernel/snippets/1930375
> > > >>>>
> > > >>>> Yes, you are correct in that forwarding works for a brief period and
> > then stops.
> > > >>>> I've noticed that the number of packets that are forwarded is equal
> > to the size
> > > >>>> of the producer/consumer descriptor rings. I've posted two ping
> > traces from a
> > > >>>> client ping that shows this.
> > > >>>>
> > > >>>> - https://gitlab.com/mergetb/tech/network-
> > emulation/kernel/snippets/1930376
> > > >>>> - https://gitlab.com/mergetb/tech/network-
> > emulation/kernel/snippets/1930377
> > > >>
> > > >> These snippets are not available.
> > > >
> > > > Apologies, I had the wrong permissions set. They should be available
> > now.
> > > >
> > > >>
> > > >>>>
> > > >>>> I've also noticed that when the forwarding stops, the CPU usage for
> > the proc
> > > >>>> running the program is pegged, which is not the norm for this program
> > as it uses
> > > >>>> a poll call with a timeout on the xsk fd.
> > > >>
> > > >> This information led me to a guess what may be happening. On the RX
> > > >> side, mlx5e allocates pages in bulks for performance reasons and to
> > > >> leverage hardware features targeted to performance. In AF_XDP mode,
> > > >> bulking of frames is also used (on x86, the bulk size is 64 with
> > > >> striding RQ enabled, and 8 otherwise, however, it's implementation
> > > >> details that might change later). If you don't put enough frames to XSK
> > > >> Fill Ring, the driver will be demanding more frames and return from
> > > >> poll() immediately. Basically, in the application, you should put as
> > > >> many frames to the Fill Ring as you can. Please check if that could be
> > > >> the root cause of your issue.
> > > >
> > > > The code in this application makes an effort to relenish the fill ring
> > as fast
> > > > as possible. The basic loop of the application is to first check if
> > there are
> > > > any descriptors to be consumed from the completion queue or any
> > descriptors that
> > > > can be added to the fill queue, and only then to move on to moving
> > packets
> > > > through the rx and tx rings.
> > > >
> > > > https://gitlab.com/mergetb/tech/network-emulation/kernel/blob/v5.5-
> > moa/samples/bpf/xdpsock_multidev.c#L452-474
> > >
> > > I reproduced your issue and did my investigation, and here is what I
> > found:
> > >
> > > 1. Commit df0ae6f78a45 (that you found during bisect) introduces an
> > > important behavioral change (which I thought was not that important).
> > > xskq_nb_avail used to return min(entries, dcnt), but after the change it
> > > just returns entries, which may be as big as the ring size.
> > >
> > > 2. xskq_peek_addr updates q->ring->consumer only when q->cons_tail
> > > catches up with q->cons_head. So, before that patch and one previous
> > > patch, cons_head - cons_tail was not more than 16, so the consumer index
> > > was updated periodically. Now consumer is updated only when the whole
> > > ring is exhausted.
> > >
> > > 3. The application can't replenish the fill ring if the consumer index
> > > doesn't move. As a consequence, refilling the descriptors by the
> > > application can't happen in parallel with using them by the driver. It
> > > should have some performance penalty and possibly even lead to packet
> > > drops, because the driver uses all the descriptors and only then
> > > advances the consumer index, and then it has to wait until the
> > > application refills the ring, busy-looping and losing packets.
> >
> > This will happen if user space always fills up the whole ring, which
> > might or might not happen all depending on the app.
>
> Yes, that's right, and as far as I know, it's common to fill as many
> frames as the application can (there was no reason to do it other way
> till now).
>
> > With that said, it
> > might provide better performance to update it once in a while. It
> > might also be the case that we will get better performance with the
> > new scheme if we only fill half the fill ring.
>
> Yes, it may improve performance. However, I don't think it's correct to
> set such a limitation on the app.

I will examine what provides the best performance. On one hand there is
the number of updates to shared ring state (minimized by the current
scheme), and on the other the ability for the user app to put buffers
on the fill ring. Stating that putting e.g. half the packets on the
fill ring provides better performance (for some application) is not a
limitation. It is performance tuning advice :-).

> > I will look into this
> > and see what I get.
> >
> > > 4. As mlx5e allocates frames in batches, the consequences are even more
> > > severe: it's a deadlock where the driver waits for the application, and
> > > vice versa. The driver never reaches the point where cons_tail gets
> > > equal to cons_head. E.g., if cons_tail + 3 == cons_head, and the batch
> > > size requested by the driver is 8, the driver won't peek anything from
> > > the fill ring waiting for difference between cons_tail and cons_head to
> > > increase to be at least 8. On the other hand, the application can't put
> > > anything to the ring, because it still thinks that the consumer index is
> > > 0. As cons_tail never reaches cons_head, the consumer index doesn't get
> > > updated, hence the deadlock.
> >
> > Good thing that you detected this. Maybe I should get a Mellanox card
> > :-). This is different from how we implemented Intel's drivers that
> > just work on any batch size. If it gets 3 packets back, it will use
> > those. How do you deal with the case when the application just puts a
> > single buffer in the fill ring and wants to receive a single packet?
>
> mlx5e will wait until the full batch is available. As AF_XDP is intended
> for high-performance apps, this scenario is less expected. We prefer to
> leverage our performance features.

That you cannot support all applications on top of AF_XDP with your
zero-copy support seems broken to me. But I grant you that you might
support all the ones you care about with your current batching
support. Like you say, most apps will put plenty of buffers on the
fill ring, so this should not be a problem. Can you not implement some
slow path for these cases? You must have one for the skb case.

> > Why does the Mellanox driver need a specific batch size? This is just
> > for my understanding so we can find a good solution.
>
> The main reason is our performance feature called striding RQ. Skipping
> all irrelevant details, a batch of 64 pages is posted to the NIC with a
> single request, and the NIC fills them progressively. This feature is
> turned on by default on modern NICs, and it's really good for
> performance.
>
> It might be possible to post a smaller batch though, I'm not sure about
> it, it needs to be checked, but anyway it's not something that we will
> likely do, because it's a complication of the data path, and if we know
> more frames are coming, it's much better for the performance to wait for
> them, rather than to post several incomplete batches.
>
> > > So, in my vision, the decision to remove RX_BATCH_SIZE and periodic
> > > updates of the consumer index was wrong. It totally breaks mlx5e, that
> > > does batching, and it will affect the performance of any driver, because
> > > the application can't refill the ring until it gets completely empty and
> > > the driver starts waiting for frames. I suggest that periodic updates of
> > > the consumer index should be readded to xskq_cons_peek_addr.
> >
> > The reason I wanted to remove RX_BATCH_SIZE is that application
> > developers complained about it giving rise to counter intuitive
> > behavior in user space. I will try to dig out the complaints and the
> > explanations Björn and I had to send which it seemed that users really
> > should not have to care about. It should just work.
>
> I think the counter that doesn't update till the very last moment and
> then advances by the ring size will also be something to complain about
> (and I am the first one to complain :D). Such bursts are not good in any
> case.

Do you have any performance data that shows this scheme is bad for
performance? The good thing about this scheme is that global state is
updated less frequently. And the bad thing is what you mentioned. But
which one has the greatest effect, is the question.

> > I also do not like
> > to have arbitrary constants like this. Why 16?
>
> I believe any batching mechanism has a constant that look arbitrary.
> This constant should be the golden mean: if it's too small, there is
> little effect from batching; if it's too big, it gets too bursty.
>
> Basically, after your patch it just changed from 16 to the ring size.
> Maybe we can tie that constant to ring size? Make it ring_size /
> another_arbitrary_constant? :)

Agreed, I thought about this too. Something tied to ring_size might be
a good compromise. I will examine this. But I want to base it on
performance data, not idle speculation, so I need to run some
experiments first.

/Magnus

> > Would much prefer not
> > having to deal with this, unless of course it horribly breaks
> > something or gives rise to worse performance. Might still be the case
> > here, but if not, I would like to remove it.
> >
> > Thanks: Magnus
> >
> > > Magnus, what do you think of the suggestion above?
> > >
> > > Thanks,
> > > Max
> > >
> > > >>
> > > >> I tracked this issue in our internal bug tracker in case we need to
> > > >> perform actual debugging of mlx5e. I'm looking forward to your feedback
> > > >> on my assumption above.
> > > >>
> > > >>>> The hardware I am using is a Mellanox ConnectX4 2x100G card (MCX416A-
> > CCAT)
> > > >>>> running the mlx5 driver.
> > > >>
> > > >> This one should run without striding RQ, please verify it with ethtool
> > > >> --show-priv-flags (the flag name is rx_striding_rq).
> > > >
> > > > I do not remember changing this option, so whatever the default is, is
> > what it
> > > > was running with. I am traveling this week and do not have access to
> > these
> > > > systems, but will ensure that this flag is set properly when I get back.
> > > >
> > >

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: zero-copy between interfaces
  2020-01-30  9:59                             ` Magnus Karlsson
@ 2020-01-30 11:40                               ` Magnus Karlsson
  2020-02-04 16:10                                 ` Magnus Karlsson
  0 siblings, 1 reply; 49+ messages in thread
From: Magnus Karlsson @ 2020-01-30 11:40 UTC (permalink / raw)
  To: Maxim Mikityanskiy
  Cc: Ryan Goodfellow, xdp-newbies, Tariq Toukan, Saeed Mahameed,
	Moshe Shemesh

On Thu, Jan 30, 2020 at 10:59 AM Magnus Karlsson
<magnus.karlsson@gmail.com> wrote:
>
> On Thu, Jan 30, 2020 at 10:37 AM Maxim Mikityanskiy
> <maximmi@mellanox.com> wrote:
> >
> > > -----Original Message-----
> > > From: Magnus Karlsson <magnus.karlsson@gmail.com>
> > > Sent: 27 January, 2020 17:55
> > > To: Maxim Mikityanskiy <maximmi@mellanox.com>
> > > Cc: Ryan Goodfellow <rgoodfel@isi.edu>; xdp-newbies@vger.kernel.org; Tariq
> > > Toukan <tariqt@mellanox.com>; Saeed Mahameed <saeedm@mellanox.com>; Moshe
> > > Shemesh <moshe@mellanox.com>
> > > Subject: Re: zero-copy between interfaces
> > >
> > > On Mon, Jan 27, 2020 at 3:01 PM Maxim Mikityanskiy <maximmi@mellanox.com>
> > > wrote:
> > > >
> > > > On 2020-01-22 23:43, Ryan Goodfellow wrote:
> > > > > On Tue, Jan 21, 2020 at 01:40:50PM +0000, Maxim Mikityanskiy wrote:
> > > > >>>> I've posted output from the program in debugging mode here
> > > > >>>>
> > > > >>>> - https://gitlab.com/mergetb/tech/network-
> > > emulation/kernel/snippets/1930375
> > > > >>>>
> > > > >>>> Yes, you are correct in that forwarding works for a brief period and
> > > then stops.
> > > > >>>> I've noticed that the number of packets that are forwarded is equal
> > > to the size
> > > > >>>> of the producer/consumer descriptor rings. I've posted two ping
> > > traces from a
> > > > >>>> client ping that shows this.
> > > > >>>>
> > > > >>>> - https://gitlab.com/mergetb/tech/network-
> > > emulation/kernel/snippets/1930376
> > > > >>>> - https://gitlab.com/mergetb/tech/network-
> > > emulation/kernel/snippets/1930377
> > > > >>
> > > > >> These snippets are not available.
> > > > >
> > > > > Apologies, I had the wrong permissions set. They should be available
> > > now.
> > > > >
> > > > >>
> > > > >>>>
> > > > >>>> I've also noticed that when the forwarding stops, the CPU usage for
> > > the proc
> > > > >>>> running the program is pegged, which is not the norm for this program
> > > as it uses
> > > > >>>> a poll call with a timeout on the xsk fd.
> > > > >>
> > > > >> This information led me to a guess what may be happening. On the RX
> > > > >> side, mlx5e allocates pages in bulks for performance reasons and to
> > > > >> leverage hardware features targeted to performance. In AF_XDP mode,
> > > > >> bulking of frames is also used (on x86, the bulk size is 64 with
> > > > >> striding RQ enabled, and 8 otherwise, however, it's implementation
> > > > >> details that might change later). If you don't put enough frames to XSK
> > > > >> Fill Ring, the driver will be demanding more frames and return from
> > > > >> poll() immediately. Basically, in the application, you should put as
> > > > >> many frames to the Fill Ring as you can. Please check if that could be
> > > > >> the root cause of your issue.
> > > > >
> > > > > The code in this application makes an effort to replenish the fill ring
> > > as fast
> > > > > as possible. The basic loop of the application is to first check if
> > > there are
> > > > > any descriptors to be consumed from the completion queue or any
> > > descriptors that
> > > > > can be added to the fill queue, and only then to move on to moving
> > > packets
> > > > > through the rx and tx rings.
> > > > >
> > > > > https://gitlab.com/mergetb/tech/network-emulation/kernel/blob/v5.5-
> > > moa/samples/bpf/xdpsock_multidev.c#L452-474
> > > >
> > > > I reproduced your issue and did my investigation, and here is what I
> > > found:
> > > >
> > > > 1. Commit df0ae6f78a45 (that you found during bisect) introduces an
> > > > important behavioral change (which I thought was not that important).
> > > > xskq_nb_avail used to return min(entries, dcnt), but after the change it
> > > > just returns entries, which may be as big as the ring size.
> > > >
> > > > 2. xskq_peek_addr updates q->ring->consumer only when q->cons_tail
> > > > catches up with q->cons_head. So, before that patch and one previous
> > > > patch, cons_head - cons_tail was not more than 16, so the consumer index
> > > > was updated periodically. Now consumer is updated only when the whole
> > > > ring is exhausted.
> > > >
> > > > 3. The application can't replenish the fill ring if the consumer index
> > > > doesn't move. As a consequence, refilling the descriptors by the
> > > > application can't happen in parallel with using them by the driver. It
> > > > should have some performance penalty and possibly even lead to packet
> > > > drops, because the driver uses all the descriptors and only then
> > > > advances the consumer index, and then it has to wait until the
> > > > application refills the ring, busy-looping and losing packets.
> > >
> > > This will happen if user space always fills up the whole ring, which
> > > might or might not happen all depending on the app.
> >
> > Yes, that's right, and as far as I know, it's common to fill as many
> > frames as the application can (there was no reason to do it other way
> > till now).
> >
> > > With that said, it
> > > might provide better performance to update it once in a while. It
> > > might also be the case that we will get better performance with the
> > > new scheme if we only fill half the fill ring.
> >
> > Yes, it may improve performance. However, I don't think it's correct to
> > set such a limitation on the app.

Actually, a much worse limitation to put on an application is to say
that you have to have a certain number of buffers on some ring for the
zero-copy feature to work. For example, that we need at least 64
buffers on the fill ring for all the NIC cards to work in zero-copy
mode. That would be a bad thing to have to put in the documentation.
An OS is supposed to abstract away HW differences, and with this
current limitation in your driver, it shines through for sure. What we
would like to put in the documentation is a statement along the lines
of: "for best performance, make sure you have plenty of buffers on the
fill ring so that the NIC can work as efficiently as possible". Not a
statement that it does not work on Mellanox unless you put enough
buffers on the fill ring. So my advice (and wish) is that you fix this
in your driver. With that said, I will still look into what is the
best way to get at least the sample to work for you. But there is no
way to make sure every single app works for you in zero-copy mode,
unless you support an arbitrary number of buffers on the fill ring. I
guess that sooner or later, a customer of yours will get into this
situation one way or the other, so why not fix it now.
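
To be concrete, the kind of loop we would like to be able to document
is simply "top the fill ring up with whatever you have free", e.g.
using the xsk.h helpers from tools/lib/bpf. This is only a sketch; the
frames[]/n_free free list is application-specific and not part of any
API:

#include <bpf/xsk.h>    /* libbpf AF_XDP helpers */

/* Sketch: put as many free UMEM addresses as possible on the fill ring. */
static void refill_fq(struct xsk_ring_prod *fq, __u64 *frames, __u32 *n_free)
{
        __u32 idx, i, n;

        n = xsk_prod_nb_free(fq, *n_free);      /* free slots in the ring */
        if (n > *n_free)
                n = *n_free;
        if (!n || xsk_ring_prod__reserve(fq, n, &idx) != n)
                return;

        for (i = 0; i < n; i++)
                *xsk_ring_prod__fill_addr(fq, idx + i) = frames[--(*n_free)];

        xsk_ring_prod__submit(fq, n);
}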

/Magnus

> I will examine what provides the best performance. On one hand it is
> the number of updates to shared ring state (minimized by current
> scheme) and the ability for the user app to put buffers on the fill
> ring. Stating that putting e.g. half the packets on the fill ring
> provides better performance (for some application) is not a
> limitation. It is performance tuning advice :-).
>
> > > I will look into this
> > > and see what I get.
> > >
> > > > 4. As mlx5e allocates frames in batches, the consequences are even more
> > > > severe: it's a deadlock where the driver waits for the application, and
> > > > vice versa. The driver never reaches the point where cons_tail gets
> > > > equal to cons_head. E.g., if cons_tail + 3 == cons_head, and the batch
> > > > size requested by the driver is 8, the driver won't peek anything from
> > > > the fill ring waiting for difference between cons_tail and cons_head to
> > > > increase to be at least 8. On the other hand, the application can't put
> > > > anything to the ring, because it still thinks that the consumer index is
> > > > 0. As cons_tail never reaches cons_head, the consumer index doesn't get
> > > > updated, hence the deadlock.
> > >
> > > Good thing that you detected this. Maybe I should get a Mellanox card
> > > :-). This is different from how we implemented Intel's drivers that
> > > just work on any batch size. If it gets 3 packets back, it will use
> > > those. How do you deal with the case when the application just puts a
> > > single buffer in the fill ring and wants to receive a single packet?
> >
> > mlx5e will wait until the full batch is available. As AF_XDP is intended
> > for high-performance apps, this scenario is less expected. We prefer to
> > leverage our performance features.
>
> That you cannot support all applications on top of AF_XDP with your
> zero-copy support seems broken to me. But I give you that you might
> support all the ones you care about with your current batching
> support. Like you say, most apps will put plenty of buffers on the
> fill ring, so this should not be a problem. Can you not implement some
> slow path for these cases? You must have one for the skb case.
>
> > > Why does the Mellanox driver need a specific batch size? This is just
> > > for my understanding so we can find a good solution.
> >
> > The main reason is our performance feature called striding RQ. Skipping
> > all irrelevant details, a batch of 64 pages is posted to the NIC with a
> > single request, and the NIC fills them progressively. This feature is
> > turned on by default on modern NICs, and it's really good for
> > performance.
> >
> > It might be possible to post a smaller batch though, I'm not sure about
> > it, it needs to be checked, but anyway it's not something that we will
> > likely do, because it's a complication of the data path, and if we know
> > more frames are coming, it's much better for the performance to wait for
> > them, rather than to post several incomplete batches.
> >
> > > > So, in my vision, the decision to remove RX_BATCH_SIZE and periodic
> > > > updates of the consumer index was wrong. It totally breaks mlx5e, that
> > > > does batching, and it will affect the performance of any driver, because
> > > > the application can't refill the ring until it gets completely empty and
> > > > the driver starts waiting for frames. I suggest that periodic updates of
> > > > the consumer index should be readded to xskq_cons_peek_addr.
> > >
> > > The reason I wanted to remove RX_BATCH_SIZE is that application
> > > developers complained about it giving rise to counter intuitive
> > > behavior in user space. I will try to dig out the complaints and the
> > > explanations Björn and I had to send which it seemed that users really
> > > should not have to care about. It should just work.
> >
> > I think the counter that doesn't update till the very last moment and
> > then advances by the ring size will also be something to complain about
> > (and I am the first one to complain :D). Such bursts are not good in any
> > case.
>
> Do you have any performance data that shows this scheme is bad for
> performance? The good thing about this scheme is that global state is
> updated less frequently. And the bad thing is what you mentioned. But
> which one has the greatest effect, is the question.
>
> > > I also do not like
> > > to have arbitrary constants like this. Why 16?
> >
> > I believe any batching mechanism has a constant that looks arbitrary.
> > This constant should be the golden mean: if it's too small, there is
> > little effect from batching; if it's too big, it gets too bursty.
> >
> > Basically, after your patch it just changed from 16 to the ring size.
> > Maybe we can tie that constant to ring size? Make it ring_size /
> > another_arbitrary_constant? :)
>
> Agreed, I thought about this too. Something tied to ring_size might be
> a good compromise. Will examine this. But I want to base this on
> performance data not idle speculation, so need to run some experiments
> first.
>
> /Magnus
>
> > > Would much prefer not
> > > having to deal with this, unless of course it horribly breaks
> > > something or gives rise to worse performance. Might still be the case
> > > here, but if not, I would like to remove it.
> > >
> > > Thanks: Magnus
> > >
> > > > Magnus, what do you think of the suggestion above?
> > > >
> > > > Thanks,
> > > > Max
> > > >
> > > > >>
> > > > >> I tracked this issue in our internal bug tracker in case we need to
> > > > >> perform actual debugging of mlx5e. I'm looking forward to your feedback
> > > > >> on my assumption above.
> > > > >>
> > > > >>>> The hardware I am using is a Mellanox ConnectX4 2x100G card (MCX416A-
> > > CCAT)
> > > > >>>> running the mlx5 driver.
> > > > >>
> > > > >> This one should run without striding RQ, please verify it with ethtool
> > > > >> --show-priv-flags (the flag name is rx_striding_rq).
> > > > >
> > > > > I do not remember changing this option, so whatever the default is, is
> > > what it
> > > > > was running with. I am traveling this week and do not have access to
> > > these
> > > > > systems, but will ensure that this flag is set properly when I get back.
> > > > >
> > > >

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: zero-copy between interfaces
  2020-01-30 11:40                               ` Magnus Karlsson
@ 2020-02-04 16:10                                 ` Magnus Karlsson
  2020-02-05 13:31                                   ` Magnus Karlsson
  0 siblings, 1 reply; 49+ messages in thread
From: Magnus Karlsson @ 2020-02-04 16:10 UTC (permalink / raw)
  To: Maxim Mikityanskiy
  Cc: Ryan Goodfellow, xdp-newbies, Tariq Toukan, Saeed Mahameed,
	Moshe Shemesh

On Thu, Jan 30, 2020 at 12:40 PM Magnus Karlsson
<magnus.karlsson@gmail.com> wrote:
>
> On Thu, Jan 30, 2020 at 10:59 AM Magnus Karlsson
> <magnus.karlsson@gmail.com> wrote:
> >
> > On Thu, Jan 30, 2020 at 10:37 AM Maxim Mikityanskiy
> > <maximmi@mellanox.com> wrote:
> > >
> > > > -----Original Message-----
> > > > From: Magnus Karlsson <magnus.karlsson@gmail.com>
> > > > Sent: 27 January, 2020 17:55
> > > > To: Maxim Mikityanskiy <maximmi@mellanox.com>
> > > > Cc: Ryan Goodfellow <rgoodfel@isi.edu>; xdp-newbies@vger.kernel.org; Tariq
> > > > Toukan <tariqt@mellanox.com>; Saeed Mahameed <saeedm@mellanox.com>; Moshe
> > > > Shemesh <moshe@mellanox.com>
> > > > Subject: Re: zero-copy between interfaces
> > > >
> > > > On Mon, Jan 27, 2020 at 3:01 PM Maxim Mikityanskiy <maximmi@mellanox.com>
> > > > wrote:
> > > > >
> > > > > On 2020-01-22 23:43, Ryan Goodfellow wrote:
> > > > > > On Tue, Jan 21, 2020 at 01:40:50PM +0000, Maxim Mikityanskiy wrote:
> > > > > >>>> I've posted output from the program in debugging mode here
> > > > > >>>>
> > > > > >>>> - https://gitlab.com/mergetb/tech/network-
> > > > emulation/kernel/snippets/1930375
> > > > > >>>>
> > > > > >>>> Yes, you are correct in that forwarding works for a brief period and
> > > > then stops.
> > > > > >>>> I've noticed that the number of packets that are forwarded is equal
> > > > to the size
> > > > > >>>> of the producer/consumer descriptor rings. I've posted two ping
> > > > traces from a
> > > > > >>>> client ping that shows this.
> > > > > >>>>
> > > > > >>>> - https://gitlab.com/mergetb/tech/network-
> > > > emulation/kernel/snippets/1930376
> > > > > >>>> - https://gitlab.com/mergetb/tech/network-
> > > > emulation/kernel/snippets/1930377
> > > > > >>
> > > > > >> These snippets are not available.
> > > > > >
> > > > > > Apologies, I had the wrong permissions set. They should be available
> > > > now.
> > > > > >
> > > > > >>
> > > > > >>>>
> > > > > >>>> I've also noticed that when the forwarding stops, the CPU usage for
> > > > the proc
> > > > > >>>> running the program is pegged, which is not the norm for this program
> > > > as it uses
> > > > > >>>> a poll call with a timeout on the xsk fd.
> > > > > >>
> > > > > >> This information led me to a guess what may be happening. On the RX
> > > > > >> side, mlx5e allocates pages in bulks for performance reasons and to
> > > > > >> leverage hardware features targeted to performance. In AF_XDP mode,
> > > > > >> bulking of frames is also used (on x86, the bulk size is 64 with
> > > > > >> striding RQ enabled, and 8 otherwise, however, it's implementation
> > > > > >> details that might change later). If you don't put enough frames to XSK
> > > > > >> Fill Ring, the driver will be demanding more frames and return from
> > > > > >> poll() immediately. Basically, in the application, you should put as
> > > > > >> many frames to the Fill Ring as you can. Please check if that could be
> > > > > >> the root cause of your issue.
> > > > > >
> > > > > > The code in this application makes an effort to replenish the fill ring
> > > > as fast
> > > > > > as possible. The basic loop of the application is to first check if
> > > > there are
> > > > > > any descriptors to be consumed from the completion queue or any
> > > > descriptors that
> > > > > > can be added to the fill queue, and only then to move on to moving
> > > > packets
> > > > > > through the rx and tx rings.
> > > > > >
> > > > > > https://gitlab.com/mergetb/tech/network-emulation/kernel/blob/v5.5-
> > > > moa/samples/bpf/xdpsock_multidev.c#L452-474
> > > > >
> > > > > I reproduced your issue and did my investigation, and here is what I
> > > > found:
> > > > >
> > > > > 1. Commit df0ae6f78a45 (that you found during bisect) introduces an
> > > > > important behavioral change (which I thought was not that important).
> > > > > xskq_nb_avail used to return min(entries, dcnt), but after the change it
> > > > > just returns entries, which may be as big as the ring size.
> > > > >
> > > > > 2. xskq_peek_addr updates q->ring->consumer only when q->cons_tail
> > > > > catches up with q->cons_head. So, before that patch and one previous
> > > > > patch, cons_head - cons_tail was not more than 16, so the consumer index
> > > > > was updated periodically. Now consumer is updated only when the whole
> > > > > ring is exhausted.
> > > > >
> > > > > 3. The application can't replenish the fill ring if the consumer index
> > > > > doesn't move. As a consequence, refilling the descriptors by the
> > > > > application can't happen in parallel with using them by the driver. It
> > > > > should have some performance penalty and possibly even lead to packet
> > > > > drops, because the driver uses all the descriptors and only then
> > > > > advances the consumer index, and then it has to wait until the
> > > > > application refills the ring, busy-looping and losing packets.
> > > >
> > > > This will happen if user space always fills up the whole ring, which
> > > > might or might not happen all depending on the app.
> > >
> > > Yes, that's right, and as far as I know, it's common to fill as many
> > > frames as the application can (there was no reason to do it other way
> > > till now).
> > >
> > > > With that said, it
> > > > might provide better performance to update it once in a while. It
> > > > might also be the case that we will get better performance with the
> > > > new scheme if we only fill half the fill ring.
> > >
> > > Yes, it may improve performance. However, I don't think it's correct to
> > > set such a limitation on the app.
>
> Actually, a much worse limitation to put on an application is to say
> that you have to have a certain amount of buffers on some ring for the
> zero-copy feature to work. For example that we need at least 64
> buffers on the fill ring for all the NIC cards to work in zero-copy
> mode. That would be a bad thing to have to put in the documentation.
> An OS is supposed to abstract away HW differences, and with this
> current limitation in your driver, it shines through for sure. What we
> would like to put in the documentation is a statement along the lines
> of: "for best performance, make sure you have plenty of buffers on the
> fill ring so that the NIC can work as efficiently as possible". Not a
> statement that it does not work on Mellanox unless you put enough
> buffers on the fill ring. So my advice (and wish) is that you fix this
> in your driver. With that said, I will still look into what is the
> best way to get at least the sample to work for you. But there is no
> way to make sure every single app works for you in zero-copy mode,
> unless you support arbitrary amount of buffers on the fill ring. I
> guess that sooner or later, a customer of yours will get into this
> situation one way or the other, so why not fix it now.
>
> /Magnus
>
> > I will examine what provides the best performance. On one hand it is
> > the number of updates to shared ring state (minimized by current
> > scheme) and the ability for the user app to put buffers on the fill
> > ring. Stating that putting e.g. half the packets on the fill ring
> > provides better performance (for some application) is not a
> > limitation. It is performance tuning advice :-).

I have now made a set of measurements. First I just made a variation
study using the xdpsock app, varying the number of packets the kernel
can withdraw from a consumer ring (fill and Tx) before updating global
state. For the 1 core case (app and driver on the same core) the more
frequently you do this update, the better. The reason for this is that
it costs very little to update the state since the application is not
running. And it is beneficial for the app to have a freshly updated
state when it starts to execute as it can operate on more packets. For
the 2 core case (app on one core, driver on another) it is the
complete opposite: the fewer updates to global state, the better. The
reason for this is that it costs a lot to update global state as it
triggers cache coherency actions between the two cores.

What I did then was to compare the current scheme, update only when
grabbing new packets, to a new scheme where we also update the global
consumer pointer when we are exiting Rx or Tx processing in the NAPI
context. On two cores the current scheme gets 0.5 to 1 Mpps more in
throughput than also updating the pointer at the end of NAPI. But for
the 1 core case, the new scheme is the best and generates between 0.2 and
0.3 Mpps more in throughput than the current one. But all in all, the
current scheme is more beneficial than the proposed one if we say that
both the 1 core and the 2 core cases are equally important.
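
To make the comparison concrete, the "new scheme" adds roughly the
following at the end of the driver's Rx/Tx NAPI processing. This is
illustrative only, not the actual patch, and the helper names are made
up:

/* Besides releasing the consumer index when the kernel grabs a fresh
 * batch, also publish it once when NAPI processing finishes, so the
 * application sees up-to-date state as soon as it runs again. */
static inline void xskq_publish_cons(struct xsk_queue *q)
{
        smp_mb();       /* order descriptor reads before the index update */
        WRITE_ONCE(q->ring->consumer, q->cons_tail);
}

/* made-up wrapper, called by the driver at the end of its NAPI poll */
void xsk_umem_cons_flush(struct xdp_umem *umem)
{
        xskq_publish_cons(umem->fq);
}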

Something to note is that the xdpsock application only puts a batch
(64 packets) into the fill ring in every iteration, and this might
lead to some good pacing for the current scheme and the 2 core case.
I.e., we do not get into the case of the fill ring only being full or
empty. But I will run this on some real apps to get some more results,
and I know that Björn has an optimized xdpsock application that puts
many more packets into the fill queue than 64. This optimization might
actually make the new proposal (also updating at the end of NAPI) perform
better and make the current scheme suffer. We will examine this
further and get back.

/Magnus

> > > > I will look into this
> > > > and see what I get.
> > > >
> > > > > 4. As mlx5e allocates frames in batches, the consequences are even more
> > > > > severe: it's a deadlock where the driver waits for the application, and
> > > > > vice versa. The driver never reaches the point where cons_tail gets
> > > > > equal to cons_head. E.g., if cons_tail + 3 == cons_head, and the batch
> > > > > size requested by the driver is 8, the driver won't peek anything from
> > > > > the fill ring waiting for difference between cons_tail and cons_head to
> > > > > increase to be at least 8. On the other hand, the application can't put
> > > > > anything to the ring, because it still thinks that the consumer index is
> > > > > 0. As cons_tail never reaches cons_head, the consumer index doesn't get
> > > > > updated, hence the deadlock.
> > > >
> > > > Good thing that you detected this. Maybe I should get a Mellanox card
> > > > :-). This is different from how we implemented Intel's drivers that
> > > > just work on any batch size. If it gets 3 packets back, it will use
> > > > those. How do you deal with the case when the application just puts a
> > > > single buffer in the fill ring and wants to receive a single packet?
> > >
> > > mlx5e will wait until the full batch is available. As AF_XDP is intended
> > > for high-performance apps, this scenario is less expected. We prefer to
> > > leverage our performance features.
> >
> > That you cannot support all applications on top of AF_XDP with your
> > zero-copy support seems broken to me. But I give you that you might
> > support all the ones you care about with your current batching
> > support. Like you say, most apps will put plenty of buffers on the
> > fill ring, so this should not be a problem. Can you not implement some
> > slow path for these cases? You must have one for the skb case.
> >
> > > > Why does the Mellanox driver need a specific batch size? This is just
> > > > for my understanding so we can find a good solution.
> > >
> > > The main reason is our performance feature called striding RQ. Skipping
> > > all irrelevant details, a batch of 64 pages is posted to the NIC with a
> > > single request, and the NIC fills them progressively. This feature is
> > > turned on by default on modern NICs, and it's really good for
> > > performance.
> > >
> > > It might be possible to post a smaller batch though, I'm not sure about
> > > it, it needs to be checked, but anyway it's not something that we will
> > > likely do, because it's a complication of the data path, and if we know
> > > more frames are coming, it's much better for the performance to wait for
> > > them, rather than to post several incomplete batches.
> > >
> > > > > So, in my vision, the decision to remove RX_BATCH_SIZE and periodic
> > > > > updates of the consumer index was wrong. It totally breaks mlx5e, that
> > > > > does batching, and it will affect the performance of any driver, because
> > > > > the application can't refill the ring until it gets completely empty and
> > > > > the driver starts waiting for frames. I suggest that periodic updates of
> > > > > the consumer index should be readded to xskq_cons_peek_addr.
> > > >
> > > > The reason I wanted to remove RX_BATCH_SIZE is that application
> > > > developers complained about it giving rise to counter intuitive
> > > > behavior in user space. I will try to dig out the complaints and the
> > > > explanations Björn and I had to send which it seemed that users really
> > > > should not have to care about. It should just work.
> > >
> > > I think the counter that doesn't update till the very last moment and
> > > then advances by the ring size will also be something to complain about
> > > (and I am the first one to complain :D). Such bursts are not good in any
> > > case.
> >
> > Do you have any performance data that shows this scheme is bad for
> > performance? The good thing about this scheme is that global state is
> > updated less frequently. And the bad thing is what you mentioned. But
> > which one has the greatest effect, is the question.
> >
> > > > I also do not like
> > > > to have arbitrary constants like this. Why 16?
> > >
> > > I believe any batching mechanism has a constant that looks arbitrary.
> > > This constant should be the golden mean: if it's too small, there is
> > > little effect from batching; if it's too big, it gets too bursty.
> > >
> > > Basically, after your patch it just changed from 16 to the ring size.
> > > Maybe we can tie that constant to ring size? Make it ring_size /
> > > another_arbitrary_constant? :)
> >
> > Agreed, I thought about this too. Something tied to ring_size might be
> > a good compromise. Will examine this. But I want to base this on
> > performance data not idle speculation, so need to run some experiments
> > first.
> >
> > /Magnus
> >
> > > > Would much prefer not
> > > > having to deal with this, unless of course it horribly breaks
> > > > something or gives rise to worse performance. Might still be the case
> > > > here, but if not, I would like to remove it.
> > > >
> > > > Thanks: Magnus
> > > >
> > > > > Magnus, what do you think of the suggestion above?
> > > > >
> > > > > Thanks,
> > > > > Max
> > > > >
> > > > > >>
> > > > > >> I tracked this issue in our internal bug tracker in case we need to
> > > > > >> perform actual debugging of mlx5e. I'm looking forward to your feedback
> > > > > >> on my assumption above.
> > > > > >>
> > > > > >>>> The hardware I am using is a Mellanox ConnectX4 2x100G card (MCX416A-
> > > > CCAT)
> > > > > >>>> running the mlx5 driver.
> > > > > >>
> > > > > >> This one should run without striding RQ, please verify it with ethtool
> > > > > >> --show-priv-flags (the flag name is rx_striding_rq).
> > > > > >
> > > > > > I do not remember changing this option, so whatever the default is, is
> > > > what it
> > > > > > was running with. I am traveling this week and do not have access to
> > > > these
> > > > > > systems, but will ensure that this flag is set properly when I get back.
> > > > > >
> > > > >

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: zero-copy between interfaces
  2020-02-04 16:10                                 ` Magnus Karlsson
@ 2020-02-05 13:31                                   ` Magnus Karlsson
  2020-02-06 14:56                                     ` Maxim Mikityanskiy
  0 siblings, 1 reply; 49+ messages in thread
From: Magnus Karlsson @ 2020-02-05 13:31 UTC (permalink / raw)
  To: Maxim Mikityanskiy
  Cc: Ryan Goodfellow, xdp-newbies, Tariq Toukan, Saeed Mahameed,
	Moshe Shemesh

On Tue, Feb 4, 2020 at 5:10 PM Magnus Karlsson
<magnus.karlsson@gmail.com> wrote:
>
> On Thu, Jan 30, 2020 at 12:40 PM Magnus Karlsson
> <magnus.karlsson@gmail.com> wrote:
> >
> > On Thu, Jan 30, 2020 at 10:59 AM Magnus Karlsson
> > <magnus.karlsson@gmail.com> wrote:
> > >
> > > On Thu, Jan 30, 2020 at 10:37 AM Maxim Mikityanskiy
> > > <maximmi@mellanox.com> wrote:
> > > >
> > > > > -----Original Message-----
> > > > > From: Magnus Karlsson <magnus.karlsson@gmail.com>
> > > > > Sent: 27 January, 2020 17:55
> > > > > To: Maxim Mikityanskiy <maximmi@mellanox.com>
> > > > > Cc: Ryan Goodfellow <rgoodfel@isi.edu>; xdp-newbies@vger.kernel.org; Tariq
> > > > > Toukan <tariqt@mellanox.com>; Saeed Mahameed <saeedm@mellanox.com>; Moshe
> > > > > Shemesh <moshe@mellanox.com>
> > > > > Subject: Re: zero-copy between interfaces
> > > > >
> > > > > On Mon, Jan 27, 2020 at 3:01 PM Maxim Mikityanskiy <maximmi@mellanox.com>
> > > > > wrote:
> > > > > >
> > > > > > On 2020-01-22 23:43, Ryan Goodfellow wrote:
> > > > > > > On Tue, Jan 21, 2020 at 01:40:50PM +0000, Maxim Mikityanskiy wrote:
> > > > > > >>>> I've posted output from the program in debugging mode here
> > > > > > >>>>
> > > > > > >>>> - https://gitlab.com/mergetb/tech/network-
> > > > > emulation/kernel/snippets/1930375
> > > > > > >>>>
> > > > > > >>>> Yes, you are correct in that forwarding works for a brief period and
> > > > > then stops.
> > > > > > >>>> I've noticed that the number of packets that are forwarded is equal
> > > > > to the size
> > > > > > >>>> of the producer/consumer descriptor rings. I've posted two ping
> > > > > traces from a
> > > > > > >>>> client ping that shows this.
> > > > > > >>>>
> > > > > > >>>> - https://gitlab.com/mergetb/tech/network-
> > > > > emulation/kernel/snippets/1930376
> > > > > > >>>> - https://gitlab.com/mergetb/tech/network-
> > > > > emulation/kernel/snippets/1930377
> > > > > > >>
> > > > > > >> These snippets are not available.
> > > > > > >
> > > > > > > Apologies, I had the wrong permissions set. They should be available
> > > > > now.
> > > > > > >
> > > > > > >>
> > > > > > >>>>
> > > > > > >>>> I've also noticed that when the forwarding stops, the CPU usage for
> > > > > the proc
> > > > > > >>>> running the program is pegged, which is not the norm for this program
> > > > > as it uses
> > > > > > >>>> a poll call with a timeout on the xsk fd.
> > > > > > >>
> > > > > > >> This information led me to a guess what may be happening. On the RX
> > > > > > >> side, mlx5e allocates pages in bulks for performance reasons and to
> > > > > > >> leverage hardware features targeted to performance. In AF_XDP mode,
> > > > > > >> bulking of frames is also used (on x86, the bulk size is 64 with
> > > > > > >> striding RQ enabled, and 8 otherwise, however, it's implementation
> > > > > > >> details that might change later). If you don't put enough frames to XSK
> > > > > > >> Fill Ring, the driver will be demanding more frames and return from
> > > > > > >> poll() immediately. Basically, in the application, you should put as
> > > > > > >> many frames to the Fill Ring as you can. Please check if that could be
> > > > > > >> the root cause of your issue.
> > > > > > >
> > > > > > > The code in this application makes an effort to replenish the fill ring
> > > > > as fast
> > > > > > > as possible. The basic loop of the application is to first check if
> > > > > there are
> > > > > > > any descriptors to be consumed from the completion queue or any
> > > > > descriptors that
> > > > > > > can be added to the fill queue, and only then to move on to moving
> > > > > packets
> > > > > > > through the rx and tx rings.
> > > > > > >
> > > > > > > https://gitlab.com/mergetb/tech/network-emulation/kernel/blob/v5.5-
> > > > > moa/samples/bpf/xdpsock_multidev.c#L452-474
> > > > > >
> > > > > > I reproduced your issue and did my investigation, and here is what I
> > > > > found:
> > > > > >
> > > > > > 1. Commit df0ae6f78a45 (that you found during bisect) introduces an
> > > > > > important behavioral change (which I thought was not that important).
> > > > > > xskq_nb_avail used to return min(entries, dcnt), but after the change it
> > > > > > just returns entries, which may be as big as the ring size.
> > > > > >
> > > > > > 2. xskq_peek_addr updates q->ring->consumer only when q->cons_tail
> > > > > > catches up with q->cons_head. So, before that patch and one previous
> > > > > > patch, cons_head - cons_tail was not more than 16, so the consumer index
> > > > > > was updated periodically. Now consumer is updated only when the whole
> > > > > > ring is exhausted.
> > > > > >
> > > > > > 3. The application can't replenish the fill ring if the consumer index
> > > > > > doesn't move. As a consequence, refilling the descriptors by the
> > > > > > application can't happen in parallel with using them by the driver. It
> > > > > > should have some performance penalty and possibly even lead to packet
> > > > > > drops, because the driver uses all the descriptors and only then
> > > > > > advances the consumer index, and then it has to wait until the
> > > > > > application refills the ring, busy-looping and losing packets.
> > > > >
> > > > > This will happen if user space always fills up the whole ring, which
> > > > > might or might not happen all depending on the app.
> > > >
> > > > Yes, that's right, and as far as I know, it's common to fill as many
> > > > frames as the application can (there was no reason to do it other way
> > > > till now).
> > > >
> > > > > With that said, it
> > > > > might provide better performance to update it once in a while. It
> > > > > might also be the case that we will get better performance with the
> > > > > new scheme if we only fill half the fill ring.
> > > >
> > > > Yes, it may improve performance. However, I don't think it's correct to
> > > > set such a limitation on the app.
> >
> > Actually, a much worse limitation to put on an application is to say
> > that you have to have a certain amount of buffers on some ring for the
> > zero-copy feature to work. For example that we need at least 64
> > buffers on the fill ring for all the NIC cards to work in zero-copy
> > mode. That would be a bad thing to have to put in the documentation.
> > An OS is supposed to abstract away HW differences, and with this
> > current limitation in your driver, it shines through for sure. What we
> > would like to put in the documentation is a statement along the lines
> > of: "for best performance, make sure you have plenty of buffers on the
> > fill ring so that the NIC can work as efficiently as possible". Not a
> > statement that it does not work on Mellanox unless you put enough
> > buffers on the fill ring. So my advice (and wish) is that you fix this
> > in your driver. With that said, I will still look into what is the
> > best way to get at least the sample to work for you. But there is no
> > way to make sure every single app works for you in zero-copy mode,
> > unless you support arbitrary amount of buffers on the fill ring. I
> > guess that sooner or later, a customer of yours will get into this
> > situation one way or the other, so why not fix it now.
> >
> > /Magnus
> >
> > > I will examine what provides the best performance. On one hand it is
> > > the number of updates to shared ring state (minimized by current
> > > scheme) and the ability for the user app to put buffers on the fill
> > > ring. Stating that putting e.g. half the packets on the fill ring
> > > provides better performance (for some application) is not a
> > > limitation. It is performance tuning advice :-).
>
> I have now made a set of measurements. First I just made a variation
> study using the xdpsock app, varying the amount of packets the kernel
> can withdraw from a consumer ring (fill and Tx) before updating global
> state. For the 1 core case (app and driver on the same core) the more
> frequent you do this update, the better. The reason for this is that
> it costs very little to update the state since the application is not
> running. And it is beneficial for the app to have a freshly updated
> state when it starts to execute as it can operate on more packets. For
> the 2 core case (app on one core, driver on another) it is the
> complete opposite: the fewer updates to global state, the better. The
> reason for this is that it costs a lot to update global state as it
> triggers cache coherency actions between the two cores.
>
> What I did then was to compare the current scheme, update only when
> grabbing new packets, to a new scheme where we also update the global
> consumer pointer when we are exiting Rx or Tx processing in the NAPI
> context. On two cores the current scheme gets 0.5 to 1 Mpps more in
> throughput than also updating the pointer at the end of NAPI. But for
> 1 core case, the new scheme is the best and generates between 0.2 and
> 0.3 Mpps more in throughput than the current one. But all in all, the
> current scheme is more beneficial than the proposed one if we say that
> both the 1 core and the 2 core case is equally important.
>
> Something to note is that the xdpsock application only puts batch size
> (64) of packets in the fill ring in every iteration, and this might
> lead to some good pacing for the current scheme and the 2 core case.
> I.e., we do not get into the case of the fill ring only being full or
> empty. But I will run this on some real apps to get some more results,
> and I know that Björn has an optimized xdpsock application that puts
> many more packets into the fill queue than 64. This optimization might
> actually make the new proposal (also updating at the end of NAPI) be
> better and make the current scheme suffer. We will examine this
> further and get back.

Actually, after some more thought and discussions I think we should
optimize for the 1 core case, since that is what gives the whole
system the best performance, provided, that is, that you can scale your
application by running multiple instances. For a 4 core system, 4 x the 1
core performance > 2 x 2 core performance by a lot. I think that the 1
core case is the one that is going to be used out there. At least that
is what I hear and see.

So, when the merge window opens I am going to submit a patch that
updates the consumer pointer when we exit NAPI too. That will increase the
performance of the 1 core case.

/Magnus

> /Magnus
>
> > > > > I will look into this
> > > > > and see what I get.
> > > > >
> > > > > > 4. As mlx5e allocates frames in batches, the consequences are even more
> > > > > > severe: it's a deadlock where the driver waits for the application, and
> > > > > > vice versa. The driver never reaches the point where cons_tail gets
> > > > > > equal to cons_head. E.g., if cons_tail + 3 == cons_head, and the batch
> > > > > > size requested by the driver is 8, the driver won't peek anything from
> > > > > > the fill ring waiting for difference between cons_tail and cons_head to
> > > > > > increase to be at least 8. On the other hand, the application can't put
> > > > > > anything to the ring, because it still thinks that the consumer index is
> > > > > > 0. As cons_tail never reaches cons_head, the consumer index doesn't get
> > > > > > updated, hence the deadlock.
> > > > >
> > > > > Good thing that you detected this. Maybe I should get a Mellanox card
> > > > > :-). This is different from how we implemented Intel's drivers that
> > > > > just work on any batch size. If it gets 3 packets back, it will use
> > > > > those. How do you deal with the case when the application just puts a
> > > > > single buffer in the fill ring and wants to receive a single packet?
> > > >
> > > > mlx5e will wait until the full batch is available. As AF_XDP is intended
> > > > for high-performance apps, this scenario is less expected. We prefer to
> > > > leverage our performance features.
> > >
> > > That you cannot support all applications on top of AF_XDP with your
> > > zero-copy support seems broken to me. But I give you that you might
> > > support all the ones you care about with your current batching
> > > support. Like you say, most apps will put plenty of buffers on the
> > > fill ring, so this should not be a problem. Can you not implement some
> > > slow path for these cases? You must have one for the skb case.
> > >
> > > > > Why does the Mellanox driver need a specific batch size? This is just
> > > > > for my understanding so we can find a good solution.
> > > >
> > > > The main reason is our performance feature called striding RQ. Skipping
> > > > all irrelevant details, a batch of 64 pages is posted to the NIC with a
> > > > single request, and the NIC fills them progressively. This feature is
> > > > turned on by default on modern NICs, and it's really good for
> > > > performance.
> > > >
> > > > It might be possible to post a smaller batch though, I'm not sure about
> > > > it, it needs to be checked, but anyway it's not something that we will
> > > > likely do, because it's a complication of the data path, and if we know
> > > > more frames are coming, it's much better for the performance to wait for
> > > > them, rather than to post several incomplete batches.
> > > >
> > > > > > So, in my vision, the decision to remove RX_BATCH_SIZE and periodic
> > > > > > updates of the consumer index was wrong. It totally breaks mlx5e, that
> > > > > > does batching, and it will affect the performance of any driver, because
> > > > > > the application can't refill the ring until it gets completely empty and
> > > > > > the driver starts waiting for frames. I suggest that periodic updates of
> > > > > > the consumer index should be readded to xskq_cons_peek_addr.
> > > > >
> > > > > The reason I wanted to remove RX_BATCH_SIZE is that application
> > > > > developers complained about it giving rise to counter intuitive
> > > > > behavior in user space. I will try to dig out the complaints and the
> > > > > explanations Björn and I had to send which it seemed that users really
> > > > > should not have to care about. It should just work.
> > > >
> > > > I think the counter that doesn't update till the very last moment and
> > > > then advances by the ring size will also be something to complain about
> > > > (and I am the first one to complain :D). Such bursts are not good in any
> > > > case.
> > >
> > > Do you have any performance data that shows this scheme is bad for
> > > performance? The good thing about this scheme is that global state is
> > > updated less frequently. And the bad thing is what you mentioned. But
> > > which one has the greatest effect, is the question.
> > >
> > > > > I also do not like
> > > > > to have arbitrary constants like this. Why 16?
> > > >
>>>>> I believe any batching mechanism has a constant that looks arbitrary.
> > > > This constant should be the golden mean: if it's too small, there is
> > > > little effect from batching; if it's too big, it gets too bursty.
> > > >
> > > > Basically, after your patch it just changed from 16 to the ring size.
> > > > Maybe we can tie that constant to ring size? Make it ring_size /
> > > > another_arbitrary_constant? :)
> > >
> > > Agreed, I thought about this too. Something tied to ring_size might be
> > > a good compromise. Will examine this. But I want to base this on
> > > performance data not idle speculation, so need to run some experiments
> > > first.
> > >
> > > /Magnus
> > >
> > > > > Would much prefer not
> > > > > having to deal with this, unless of course it horribly breaks
> > > > > something or gives rise to worse performance. Might still be the case
> > > > > here, but if not, I would like to remove it.
> > > > >
> > > > > Thanks: Magnus
> > > > >
> > > > > > Magnus, what do you think of the suggestion above?
> > > > > >
> > > > > > Thanks,
> > > > > > Max
> > > > > >
> > > > > > >>
> > > > > > >> I tracked this issue in our internal bug tracker in case we need to
> > > > > > >> perform actual debugging of mlx5e. I'm looking forward to your feedback
> > > > > > >> on my assumption above.
> > > > > > >>
> > > > > > >>>> The hardware I am using is a Mellanox ConnectX4 2x100G card (MCX416A-
> > > > > CCAT)
> > > > > > >>>> running the mlx5 driver.
> > > > > > >>
> > > > > > >> This one should run without striding RQ, please verify it with ethtool
> > > > > > >> --show-priv-flags (the flag name is rx_striding_rq).
> > > > > > >
> > > > > > > I do not remember changing this option, so whatever the default is, is
> > > > > what it
> > > > > > > was running with. I am traveling this week and do not have access to
> > > > > these
> > > > > > > systems, but will ensure that this flag is set properly when I get back.
> > > > > > >
> > > > > >

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: zero-copy between interfaces
  2020-02-05 13:31                                   ` Magnus Karlsson
@ 2020-02-06 14:56                                     ` Maxim Mikityanskiy
  2020-02-07  9:01                                       ` Magnus Karlsson
  0 siblings, 1 reply; 49+ messages in thread
From: Maxim Mikityanskiy @ 2020-02-06 14:56 UTC (permalink / raw)
  To: Magnus Karlsson
  Cc: Ryan Goodfellow, xdp-newbies, Tariq Toukan, Saeed Mahameed,
	Moshe Shemesh

On 2020-02-05 15:31, Magnus Karlsson wrote:
> On Tue, Feb 4, 2020 at 5:10 PM Magnus Karlsson
> <magnus.karlsson@gmail.com> wrote:
>>
>> On Thu, Jan 30, 2020 at 12:40 PM Magnus Karlsson
>> <magnus.karlsson@gmail.com> wrote:
>>>
>>> On Thu, Jan 30, 2020 at 10:59 AM Magnus Karlsson
>>> <magnus.karlsson@gmail.com> wrote:
>>>>
>>>> On Thu, Jan 30, 2020 at 10:37 AM Maxim Mikityanskiy
>>>> <maximmi@mellanox.com> wrote:
>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Magnus Karlsson <magnus.karlsson@gmail.com>
>>>>>> Sent: 27 January, 2020 17:55
>>>>>> To: Maxim Mikityanskiy <maximmi@mellanox.com>
>>>>>> Cc: Ryan Goodfellow <rgoodfel@isi.edu>; xdp-newbies@vger.kernel.org; Tariq
>>>>>> Toukan <tariqt@mellanox.com>; Saeed Mahameed <saeedm@mellanox.com>; Moshe
>>>>>> Shemesh <moshe@mellanox.com>
>>>>>> Subject: Re: zero-copy between interfaces
>>>>>>
>>>>>> On Mon, Jan 27, 2020 at 3:01 PM Maxim Mikityanskiy <maximmi@mellanox.com>
>>>>>> wrote:
>>>>>>>
>>>>>>> On 2020-01-22 23:43, Ryan Goodfellow wrote:
>>>>>>>> On Tue, Jan 21, 2020 at 01:40:50PM +0000, Maxim Mikityanskiy wrote:
>>>>>>>>>>> I've posted output from the program in debugging mode here
>>>>>>>>>>>
>>>>>>>>>>> - https://gitlab.com/mergetb/tech/network-
>>>>>> emulation/kernel/snippets/1930375
>>>>>>>>>>>
>>>>>>>>>>> Yes, you are correct in that forwarding works for a brief period and
>>>>>> then stops.
>>>>>>>>>>> I've noticed that the number of packets that are forwarded is equal
>>>>>> to the size
>>>>>>>>>>> of the producer/consumer descriptor rings. I've posted two ping
>>>>>> traces from a
>>>>>>>>>>> client ping that shows this.
>>>>>>>>>>>
>>>>>>>>>>> - https://gitlab.com/mergetb/tech/network-
>>>>>> emulation/kernel/snippets/1930376
>>>>>>>>>>> - https://gitlab.com/mergetb/tech/network-
>>>>>> emulation/kernel/snippets/1930377
>>>>>>>>>
>>>>>>>>> These snippets are not available.
>>>>>>>>
>>>>>>>> Apologies, I had the wrong permissions set. They should be available
>>>>>> now.
>>>>>>>>
>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> I've also noticed that when the forwarding stops, the CPU usage for
>>>>>> the proc
>>>>>>>>>>> running the program is pegged, which is not the norm for this program
>>>>>> as it uses
>>>>>>>>>>> a poll call with a timeout on the xsk fd.
>>>>>>>>>
>>>>>>>>> This information led me to a guess what may be happening. On the RX
>>>>>>>>> side, mlx5e allocates pages in bulks for performance reasons and to
>>>>>>>>> leverage hardware features targeted to performance. In AF_XDP mode,
>>>>>>>>> bulking of frames is also used (on x86, the bulk size is 64 with
>>>>>>>>> striding RQ enabled, and 8 otherwise, however, it's implementation
>>>>>>>>> details that might change later). If you don't put enough frames to XSK
>>>>>>>>> Fill Ring, the driver will be demanding more frames and return from
>>>>>>>>> poll() immediately. Basically, in the application, you should put as
>>>>>>>>> many frames to the Fill Ring as you can. Please check if that could be
>>>>>>>>> the root cause of your issue.
>>>>>>>>
>>>>>>>> The code in this application makes an effort to replenish the fill ring
>>>>>> as fast
>>>>>>>> as possible. The basic loop of the application is to first check if
>>>>>> there are
>>>>>>>> any descriptors to be consumed from the completion queue or any
>>>>>> descriptors that
>>>>>>>> can be added to the fill queue, and only then to move on to moving
>>>>>> packets
>>>>>>>> through the rx and tx rings.
>>>>>>>>
>>>>>>>> https://gitlab.com/mergetb/tech/network-emulation/kernel/blob/v5.5-
>>>>>> moa/samples/bpf/xdpsock_multidev.c#L452-474
>>>>>>>
>>>>>>> I reproduced your issue and did my investigation, and here is what I
>>>>>> found:
>>>>>>>
>>>>>>> 1. Commit df0ae6f78a45 (that you found during bisect) introduces an
>>>>>>> important behavioral change (which I thought was not that important).
>>>>>>> xskq_nb_avail used to return min(entries, dcnt), but after the change it
>>>>>>> just returns entries, which may be as big as the ring size.
>>>>>>>
>>>>>>> 2. xskq_peek_addr updates q->ring->consumer only when q->cons_tail
>>>>>>> catches up with q->cons_head. So, before that patch and one previous
>>>>>>> patch, cons_head - cons_tail was not more than 16, so the consumer index
>>>>>>> was updated periodically. Now consumer is updated only when the whole
>>>>>>> ring is exhausted.
>>>>>>>
>>>>>>> 3. The application can't replenish the fill ring if the consumer index
>>>>>>> doesn't move. As a consequence, refilling the descriptors by the
>>>>>>> application can't happen in parallel with using them by the driver. It
>>>>>>> should have some performance penalty and possibly even lead to packet
>>>>>>> drops, because the driver uses all the descriptors and only then
>>>>>>> advances the consumer index, and then it has to wait until the
>>>>>>> application refills the ring, busy-looping and losing packets.
>>>>>>
>>>>>> This will happen if user space always fills up the whole ring, which
>>>>>> might or might not happen all depending on the app.
>>>>>
>>>>> Yes, that's right, and as far as I know, it's common to fill as many
>>>>> frames as the application can (there was no reason to do it other way
>>>>> till now).
>>>>>
>>>>>> With that said, it
>>>>>> might provide better performance to update it once in a while. It
>>>>>> might also be the case that we will get better performance with the
>>>>>> new scheme if we only fill half the fill ring.
>>>>>
>>>>> Yes, it may improve performance. However, I don't think it's correct to
>>>>> set such a limitation on the app.
>>>
>>> Actually, a much worse limitation to put on an application is to say
>>> that you have to have a certain amount of buffers on some ring for the
>>> zero-copy feature to work. For example that we need at least 64
>>> buffers on the fill ring for all the NIC cards to work in zero-copy
>>> mode. That would be a bad thing to have to put in the documentation.
>>> An OS is supposed to abstract away HW differences, and with this
>>> current limitation in your driver, it shines through for sure. What we
>>> would like to put in the documentation is a statement along the lines
>>> of: "for best performance, make sure you have plenty of buffers on the
>>> fill ring so that the NIC can work as efficiently as possible". Not a
>>> statement that it does not work on Mellanox unless you put enough
>>> buffers on the fill ring. So my advice (and wish) is that you fix this
>>> in your driver. With that said, I will still look into what is the
>>> best way to get at least the sample to work for you. But there is no
>>> way to make sure every single app works for you in zero-copy mode,
>>> unless you support arbitrary amount of buffers on the fill ring. I
>>> guess that sooner or later, a customer of yours will get into this
>>> situation one way or the other, so why not fix it now.

Hi Magnus,

We had an internal discussion about batching and AF_XDP in mlx5e.

There are two types of RX queues supported by mlx5e: striding RQ and 
legacy RQ. Which type of RQ is used depends on the configuration, 
hardware support and defaults for different NICs, but basically in cases 
when striding RQ is enabled by default, it's faster than legacy RQ, and 
this is the case for modern NICs. All RX queues created in the driver 
are of the same type. Striding RQ has a requirement of allocating in 
batches, and the batch size is specified on queue creation, so there is 
no fallback possible for this case. Legacy RQ, on the other hand, does 
not require batching in XDP use cases, but now we do it anyway for 
performance reasons and for code unification with non-XDP queues.
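
In pseudo-code, the constraint on the driver side is roughly the
following. The names are made up (this is not the mlx5e code); only the
batch-of-64 behaviour is real:

#define XSK_RQ_BATCH    64      /* bulk size with striding RQ on x86 */

/* Sketch: nothing is posted to the NIC until a full batch of frames is
 * available in the fill ring, and the whole batch is posted with a
 * single request. xsk_fq_cnt()/xsk_fq_pop()/post_wqe_batch() are
 * placeholders for the real fill-queue and posting helpers. */
static bool xsk_rq_refill(struct xsk_rq *rq)
{
        u64 addrs[XSK_RQ_BATCH];
        int i;

        if (xsk_fq_cnt(rq->umem) < XSK_RQ_BATCH)
                return false;   /* wait for the application to refill */

        for (i = 0; i < XSK_RQ_BATCH; i++)
                addrs[i] = xsk_fq_pop(rq->umem);

        post_wqe_batch(rq, addrs, XSK_RQ_BATCH);
        return true;
}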

I understand your concern that the current API doesn't provide a 
completely opaque abstraction over the driver. However, we can't just 
throw away an important performance feature (striding RQ) to support 
some exotic case of a fill ring with a single frame only. AF_XDP is a 
framework for high-performance applications, so it's extremely unlikely 
that an AF_XDP application will only need to receive a single packet. 
Such applications just won't need AF_XDP. So, given that the issue can't 
be fixed without disabling striding RQ, and disabling striding RQ will 
just reduce the performance of all real-world applications, we decided 
to keep things as is for now, and if a customer complains about it, we 
will suggest that they disable striding RQ in their configuration, and 
we'll consider an option of disabling batching in legacy RQ for AF_XDP.
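
For reference, disabling striding RQ would be along the lines of:

    ethtool --set-priv-flags <dev> rx_striding_rq off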

BTW, if the current API can't provide a good enough abstraction over 
advanced features of mlx5e, maybe we should extend the API somehow? 
E.g., when need_wakeup for RX goes to "yes", also tell how many frames 
need to be refilled?
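
One possible shape for that, purely hypothetical (nothing like this
exists in the uapi today):

/* Next to the ring flags word that carries XDP_RING_NEED_WAKEUP, the
 * driver could also publish how many fill-ring entries it is waiting
 * for, so the application knows how much it has to put. */
struct xdp_ring_hint {
        __u32 flags;            /* existing need_wakeup semantics */
        __u32 fill_wanted;      /* hypothetical: frames the driver needs */
};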

>>> /Magnus
>>>
>>>> I will examine what provides the best performance. On one hand it is
>>>> the number of updates to shared ring state (minimized by current
>>>> scheme) and the ability for the user app to put buffers on the fill
>>>> ring. Stating that putting e.g. half the packets on the fill ring
>>>> provides better performance (for some application) is not a
>>>> limitation. It is performance tuning advice :-).
>>
>> I have now made a set of measurements. First I just made a variation
>> study using the xdpsock app, varying the amount of packets the kernel
>> can withdraw from a consumer ring (fill and Tx) before updating global
>> state. For the 1 core case (app and driver on the same core) the more
>> frequent you do this update, the better. The reason for this is that
>> it costs very little to update the state since the application is not
>> running. And it is beneficial for the app to have a freshly updated
>> state when it starts to execute as it can operate on more packets. For
>> the 2 core case (app on one core, driver on another) it is the
>> complete opposite: the fewer updates to global state, the better. The
>> reason for this is that it costs a lot to update global state as it
>> triggers cache coherency actions between the two cores.
>>
>> What I did then was to compare the current scheme, update only when
>> grabbing new packets, to a new scheme where we also update the global
>> consumer pointer when we are exiting Rx or Tx processing in the NAPI
>> context. On two cores the current scheme gets 0.5 to 1 Mpps more in
>> throughput than also updating the pointer at the end of NAPI. But for
>> 1 core case, the new scheme is the best and generates between 0.2 and
>> 0.3 Mpps more in throughput than the current one. But all in all, the
>> current scheme is more beneficial than the proposed one if we say that
>> both the 1 core and the 2 core case is equally important.
>>
>> Something to note is that the xdpsock application only puts batch size
>> (64) of packets in the fill ring in every iteration, and this might
>> lead to some good pacing for the current scheme and the 2 core case.
>> I.e., we do not get into the case of the fill ring only being full or
>> empty. But I will run this on some real apps to get some more results,
>> and I know that Björn has an optimized xdpsock application that puts
>> many more packets into the fill queue than 64. This optimization might
>> actually make the new proposal (also updating at the end of NAPI) be
>> better and make the current scheme suffer. We will examine this
>> further and get back.
> 
> Actually, after some more thought and discussions I think we should
> optimize for the 1 core case, since that is what gives the whole
> system the best performance, provided that you can scale your
> application with instantiation that is. For a 4 core system, 4 x the 1
> core performance > 2 x 2 core performance by a lot. I think that the 1
> core case is the one that is going to be used out there. At least that
> is what I hear and see.
> 
> So, when the merge window opens I am going to submit a patch that
> updates the consumer pointer when we exit NAPI too. Will increase the
> performance of the 1 core case.

That sounds good to me. It doesn't make sense to update it multiple 
times per NAPI (in the single core case the application won't run at 
that time anyway), so once per NAPI is the most frequent, and according 
to your experiments it should be the most efficient. It should make 
mlx5e work again. One concern though: you say you are going to submit it 
to -next, but a kernel with your patches has been released, and it has 
broken AF_XDP support in mlx5e. I can send a small fix to net that will 
revert the behavior back to updating the consumer index once every 16 
frames (so it makes mlx5e work again), and your patch will go on top of 
my bugfix. Does it sound right to you?
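
To be concrete about what I mean by updating once every 16 frames, the 
idea is roughly this (simplified pseudo-C written only to illustrate the 
intent, not the actual patch; the sketch struct and names are made up 
and just mirror the q->cons_tail / q->cons_head / consumer index fields 
discussed above, memory barriers omitted):

    #include <linux/types.h>

    /* Publish the consumer index and advance the cached head by at most
     * RX_BATCH_SIZE entries at a time, instead of grabbing the whole
     * ring in one go. That way the application can refill the ring
     * while the driver is still consuming it.
     */
    #define RX_BATCH_SIZE 16u

    struct xskq_sketch {
            __u32 prod_tail;     /* cached producer index */
            __u32 cons_head;     /* how far the driver may consume */
            __u32 cons_tail;     /* what the driver has consumed */
            __u32 ring_consumer; /* consumer index shared with user space */
    };

    static void cons_refresh(struct xskq_sketch *q)
    {
            if (q->cons_tail == q->cons_head) {
                    __u32 avail = q->prod_tail - q->cons_tail;

                    q->ring_consumer = q->cons_tail;
                    q->cons_head = q->cons_tail +
                                   (avail < RX_BATCH_SIZE ? avail
                                                          : RX_BATCH_SIZE);
            }
    }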

Thanks for taking time to do the tests!

> /Magnus
> 
>> /Magnus
>>
>>>>>> I will look into this
>>>>>> and see what I get.
>>>>>>
>>>>>>> 4. As mlx5e allocates frames in batches, the consequences are even more
>>>>>>> severe: it's a deadlock where the driver waits for the application, and
>>>>>>> vice versa. The driver never reaches the point where cons_tail gets
>>>>>>> equal to cons_head. E.g., if cons_tail + 3 == cons_head, and the batch
>>>>>>> size requested by the driver is 8, the driver won't peek anything from
>>>>>>> the fill ring waiting for difference between cons_tail and cons_head to
>>>>>>> increase to be at least 8. On the other hand, the application can't put
>>>>>>> anything to the ring, because it still thinks that the consumer index is
>>>>>>> 0. As cons_tail never reaches cons_head, the consumer index doesn't get
>>>>>>> updated, hence the deadlock.
>>>>>>
>>>>>> Good thing that you detected this. Maybe I should get a Mellanox card
>>>>>> :-). This is different from how we implemented Intel's drivers that
>>>>>> just work on any batch size. If it gets 3 packets back, it will use
>>>>>> those. How do you deal with the case when the application just puts a
>>>>>> single buffer in the fill ring and wants to receive a single packet?
>>>>>
>>>>> mlx5e will wait until the full batch is available. As AF_XDP is intended
>>>>> for high-performance apps, this scenario is less expected. We prefer to
>>>>> leverage our performance features.
>>>>
>>>> That you cannot support all applications on top of AF_XDP with your
>>>> zero-copy support seems broken to me. But I give you that you might
>>>> support all the ones you care about with your current batching
>>>> support. Like you say, most apps will put plenty of buffers on the
>>>> fill ring, so this should not be a problem. Can you not implement some
>>>> slow path for these cases? You must have one for the skb case.
>>>>
>>>>>> Why does the Mellanox driver need a specific batch size? This is just
>>>>>> for my understanding so we can find a good solution.
>>>>>
>>>>> The main reason is our performance feature called striding RQ. Skipping
>>>>> all irrelevant details, a batch of 64 pages is posted to the NIC with a
>>>>> single request, and the NIC fills them progressively. This feature is
>>>>> turned on by default on modern NICs, and it's really good for
>>>>> performance.
>>>>>
>>>>> It might be possible to post a smaller batch though, I'm not sure about
>>>>> it, it needs to be checked, but anyway it's not something that we will
>>>>> likely do, because it's a complication of the data path, and if we know
>>>>> more frames are coming, it's much better for the performance to wait for
>>>>> them, rather than to post several incomplete batches.
>>>>>
>>>>>>> So, in my vision, the decision to remove RX_BATCH_SIZE and periodic
>>>>>>> updates of the consumer index was wrong. It totally breaks mlx5e, which
>>>>>>> does batching, and it will affect the performance of any driver, because
>>>>>>> the application can't refill the ring until it gets completely empty and
>>>>>>> the driver starts waiting for frames. I suggest that periodic updates of
>>>>>>> the consumer index should be readded to xskq_cons_peek_addr.
>>>>>>
>>>>>> The reason I wanted to remove RX_BATCH_SIZE is that application
>>>>>> developers complained about it giving rise to counterintuitive
>>>>>> behavior in user space. I will try to dig out the complaints and the
>>>>>> explanations Björn and I had to send about things that users really
>>>>>> should not have to care about. It should just work.
>>>>>
>>>>> I think the counter that doesn't update till the very last moment and
>>>>> then advances by the ring size will also be something to complain about
>>>>> (and I am the first one to complain :D). Such bursts are not good in any
>>>>> case.
>>>>
>>>> Do you have any performance data that shows this scheme is bad for
>>>> performance? The good thing about this scheme is that global state is
>>>> updated less frequently. And the bad thing is what you mentioned. But
>>>> which one has the greatest effect, is the question.
>>>>
>>>>>> I also do not like
>>>>>> to have arbitrary constants like this. Why 16?
>>>>>
>>>>> I believe any batching mechanism has a constant that looks arbitrary.
>>>>> This constant should be the golden mean: if it's too small, there is
>>>>> little effect from batching; if it's too big, it gets too bursty.
>>>>>
>>>>> Basically, after your patch it just changed from 16 to the ring size.
>>>>> Maybe we can tie that constant to ring size? Make it ring_size /
>>>>> another_arbitrary_constant? :)
>>>>
>>>> Agreed, I thought about this too. Something tied to ring_size might be
>>>> a good compromise. Will examine this. But I want to base this on
>>>> performance data, not idle speculation, so need to run some experiments
>>>> first.
>>>>
>>>> /Magnus
>>>>
>>>>>> Would much prefer not
>>>>>> having to deal with this, unless of course it horribly breaks
>>>>>> something or gives rise to worse performance. Might still be the case
>>>>>> here, but if not, I would like to remove it.
>>>>>>
>>>>>> Thanks: Magnus
>>>>>>
>>>>>>> Magnus, what do you think of the suggestion above?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Max
>>>>>>>
>>>>>>>>>
>>>>>>>>> I tracked this issue in our internal bug tracker in case we need to
>>>>>>>>> perform actual debugging of mlx5e. I'm looking forward to your feedback
>>>>>>>>> on my assumption above.
>>>>>>>>>
>>>>>>>>>>> The hardware I am using is a Mellanox ConnectX4 2x100G card (MCX416A-
>>>>>> CCAT)
>>>>>>>>>>> running the mlx5 driver.
>>>>>>>>>
>>>>>>>>> This one should run without striding RQ, please verify it with ethtool
>>>>>>>>> --show-priv-flags (the flag name is rx_striding_rq).
>>>>>>>>
>>>>>>>> I do not remember changing this option, so whatever the default is, is
>>>>>> what it
>>>>>>>> was running with. I am traveling this week and do not have access to
>>>>>> these
>>>>>>>> systems, but will ensure that this flag is set properly when I get back.
>>>>>>>>
>>>>>>>

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: zero-copy between interfaces
  2020-02-06 14:56                                     ` Maxim Mikityanskiy
@ 2020-02-07  9:01                                       ` Magnus Karlsson
  0 siblings, 0 replies; 49+ messages in thread
From: Magnus Karlsson @ 2020-02-07  9:01 UTC (permalink / raw)
  To: Maxim Mikityanskiy
  Cc: Ryan Goodfellow, xdp-newbies, Tariq Toukan, Saeed Mahameed,
	Moshe Shemesh

On Thu, Feb 6, 2020 at 3:56 PM Maxim Mikityanskiy <maximmi@mellanox.com> wrote:
>
> On 2020-02-05 15:31, Magnus Karlsson wrote:
> > On Tue, Feb 4, 2020 at 5:10 PM Magnus Karlsson
> > <magnus.karlsson@gmail.com> wrote:
> >>
> >> On Thu, Jan 30, 2020 at 12:40 PM Magnus Karlsson
> >> <magnus.karlsson@gmail.com> wrote:
> >>>
> >>> On Thu, Jan 30, 2020 at 10:59 AM Magnus Karlsson
> >>> <magnus.karlsson@gmail.com> wrote:
> >>>>
> >>>> On Thu, Jan 30, 2020 at 10:37 AM Maxim Mikityanskiy
> >>>> <maximmi@mellanox.com> wrote:
> >>>>>
> >>>>>> -----Original Message-----
> >>>>>> From: Magnus Karlsson <magnus.karlsson@gmail.com>
> >>>>>> Sent: 27 January, 2020 17:55
> >>>>>> To: Maxim Mikityanskiy <maximmi@mellanox.com>
> >>>>>> Cc: Ryan Goodfellow <rgoodfel@isi.edu>; xdp-newbies@vger.kernel.org; Tariq
> >>>>>> Toukan <tariqt@mellanox.com>; Saeed Mahameed <saeedm@mellanox.com>; Moshe
> >>>>>> Shemesh <moshe@mellanox.com>
> >>>>>> Subject: Re: zero-copy between interfaces
> >>>>>>
> >>>>>> On Mon, Jan 27, 2020 at 3:01 PM Maxim Mikityanskiy <maximmi@mellanox.com>
> >>>>>> wrote:
> >>>>>>>
> >>>>>>> On 2020-01-22 23:43, Ryan Goodfellow wrote:
> >>>>>>>> On Tue, Jan 21, 2020 at 01:40:50PM +0000, Maxim Mikityanskiy wrote:
> >>>>>>>>>>> I've posted output from the program in debugging mode here
> >>>>>>>>>>>
> >>>>>>>>>>> - https://gitlab.com/mergetb/tech/network-
> >>>>>> emulation/kernel/snippets/1930375
> >>>>>>>>>>>
> >>>>>>>>>>> Yes, you are correct in that forwarding works for a brief period and
> >>>>>> then stops.
> >>>>>>>>>>> I've noticed that the number of packets that are forwarded is equal
> >>>>>> to the size
> >>>>>>>>>>> of the producer/consumer descriptor rings. I've posted two ping
> >>>>>> traces from a
> >>>>>>>>>>> client ping that shows this.
> >>>>>>>>>>>
> >>>>>>>>>>> - https://gitlab.com/mergetb/tech/network-
> >>>>>> emulation/kernel/snippets/1930376
> >>>>>>>>>>> - https://gitlab.com/mergetb/tech/network-
> >>>>>> emulation/kernel/snippets/1930377
> >>>>>>>>>
> >>>>>>>>> These snippets are not available.
> >>>>>>>>
> >>>>>>>> Apologies, I had the wrong permissions set. They should be available
> >>>>>> now.
> >>>>>>>>
> >>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> I've also noticed that when the forwarding stops, the CPU usage for
> >>>>>> the proc
> >>>>>>>>>>> running the program is pegged, which is not the norm for this program
> >>>>>> as it uses
> >>>>>>>>>>> a poll call with a timeout on the xsk fd.
> >>>>>>>>>
> >>>>>>>>> This information led me to a guess what may be happening. On the RX
> >>>>>>>>> side, mlx5e allocates pages in bulks for performance reasons and to
> >>>>>>>>> leverage hardware features targeted to performance. In AF_XDP mode,
> >>>>>>>>> bulking of frames is also used (on x86, the bulk size is 64 with
> >>>>>>>>> striding RQ enabled, and 8 otherwise, however, it's implementation
> >>>>>>>>> details that might change later). If you don't put enough frames to XSK
> >>>>>>>>> Fill Ring, the driver will be demanding more frames and return from
> >>>>>>>>> poll() immediately. Basically, in the application, you should put as
> >>>>>>>>> many frames to the Fill Ring as you can. Please check if that could be
> >>>>>>>>> the root cause of your issue.
> >>>>>>>>
> >>>>>>>> The code in this application makes an effort to replenish the fill ring
> >>>>>> as fast
> >>>>>>>> as possible. The basic loop of the application is to first check if
> >>>>>> there are
> >>>>>>>> any descriptors to be consumed from the completion queue or any
> >>>>>> descriptors that
> >>>>>>>> can be added to the fill queue, and only then to move on to moving
> >>>>>> packets
> >>>>>>>> through the rx and tx rings.
> >>>>>>>>
> >>>>>>>> https://gitlab.com/mergetb/tech/network-emulation/kernel/blob/v5.5-
> >>>>>> moa/samples/bpf/xdpsock_multidev.c#L452-474
> >>>>>>>
> >>>>>>> I reproduced your issue and did my investigation, and here is what I
> >>>>>> found:
> >>>>>>>
> >>>>>>> 1. Commit df0ae6f78a45 (that you found during bisect) introduces an
> >>>>>>> important behavioral change (which I thought was not that important).
> >>>>>>> xskq_nb_avail used to return min(entries, dcnt), but after the change it
> >>>>>>> just returns entries, which may be as big as the ring size.
> >>>>>>>
> >>>>>>> 2. xskq_peek_addr updates q->ring->consumer only when q->cons_tail
> >>>>>>> catches up with q->cons_head. So, before that patch and one previous
> >>>>>>> patch, cons_head - cons_tail was not more than 16, so the consumer index
> >>>>>>> was updated periodically. Now consumer is updated only when the whole
> >>>>>>> ring is exhausted.
> >>>>>>>
> >>>>>>> 3. The application can't replenish the fill ring if the consumer index
> >>>>>>> doesn't move. As a consequence, refilling the descriptors by the
> >>>>>>> application can't happen in parallel with using them by the driver. It
> >>>>>>> should have some performance penalty and possibly even lead to packet
> >>>>>>> drops, because the driver uses all the descriptors and only then
> >>>>>>> advances the consumer index, and then it has to wait until the
> >>>>>>> application refills the ring, busy-looping and losing packets.
> >>>>>>
> >>>>>> This will happen if user space always fills up the whole ring, which
> >>>>>> might or might not happen all depending on the app.
> >>>>>
> >>>>> Yes, that's right, and as far as I know, it's common to fill as many
> >>>>> frames as the application can (there was no reason to do it any other
> >>>>> way till now).
> >>>>>
> >>>>>> With that said, it
> >>>>>> might provide better performance to update it once in a while. It
> >>>>>> might also be the case that we will get better performance with the
> >>>>>> new scheme if we only fill half the fill ring.
> >>>>>
> >>>>> Yes, it may improve performance. However, I don't think it's correct to
> >>>>> set such a limitation on the app.
> >>>
> >>> Actually, a much worse limitation to put on an application is to say
> >>> that you have to have a certain number of buffers on some ring for the
> >>> zero-copy feature to work. For example that we need at least 64
> >>> buffers on the fill ring for all the NIC cards to work in zero-copy
> >>> mode. That would be a bad thing to have to put in the documentation.
> >>> An OS is supposed to abstract away HW differences, and with this
> >>> current limitation in your driver, it shines through for sure. What we
> >>> would like to put in the documentation is a statement along the lines
> >>> of: "for best performance, make sure you have plenty of buffers on the
> >>> fill ring so that the NIC can work as efficiently as possible". Not a
> >>> statement that it does not work on Mellanox unless you put enough
> >>> buffers on the fill ring. So my advice (and wish) is that you fix this
> >>> in your driver. With that said, I will still look into what is the
> >>> best way to get at least the sample to work for you. But there is no
> >>> way to make sure every single app works for you in zero-copy mode,
> >>> unless you support an arbitrary number of buffers on the fill ring. I
> >>> guess that sooner or later, a customer of yours will get into this
> >>> situation one way or the other, so why not fix it now.
>
> Hi Magnus,
>
> We had an internal discussion about batching and AF_XDP in mlx5e.
>
> There are two types of RX queues supported by mlx5e: striding RQ and
> legacy RQ. Which type of RQ is used depends on the configuration,
> hardware support and defaults for different NICs, but basically in cases
> when striding RQ is enabled by default, it's faster than legacy RQ, and
> this is the case for modern NICs. All RX queues created in the driver
> are of the same type. Striding RQ has a requirement of allocating in
> batches, and the batch size is specified on queue creation, so there is
> no fallback possible for this case. Legacy RQ, on the other hand, does
> not require batching in XDP use cases, but now we do it anyway for
> performance reasons and for code unification with non-XDP queues.

Thanks. I understand why this is a problem now. Let's see how we can
work around it in the best way.

> I understand your concern that the current API doesn't provide a
> completely opaque abstraction over the driver. However, we can't just
> throw away an important performance feature (striding RQ) to support
> some exotic case of a fill ring with a single frame only. AF_XDP is a
> framework for high-performance applications, so it's extremely unlikely
> that an AF_XDP application will only need to receive a single packet.
> Such applications just won't need AF_XDP. So, given that the issue can't
> be fixed without disabling striding RQ, and disabling striding RQ will
> just reduce the performance of all real-world applications, we decided
> to keep things as is for now, and if a customer complains about it, we
> will suggest that they disable striding RQ in their configuration, and
> we'll consider an option of disabling batching in legacy RQ for AF_XDP.
>
> BTW, if the current API can't provide a good enough abstraction over
> advanced features of mlx5e, maybe we should extend the API somehow?
> E.g., when need_wakeup for RX goes to "yes", also tell how many frames
> need to be refilled?

Yes, let us set need_wakeup for Rx, if you are not already doing
that. The Intel drivers do this when there are no entries at all on
the fill ring and none on the HW Rx ring. You can set it when there
are fewer than 64 on the fill ring (or whatever striding size that you
have) and no entries on the HW Rx ring. But I prefer not to return a
number. Why not just document that if the user gets a need_wakeup on
the fill ring, it needs to put entries on the fill ring and wake up
the kernel? We do recommend putting as many entries in the fill ring
as possible since this is good for performance (for all NICs). Putting
too few might trigger another need_wakeup event on NICs that prefer to
work on batches of packets. Or something like that.
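
From the application side, the contract would then be roughly the
sketch below (using the libbpf xsk.h helpers; refill_and_wake,
free_addrs and free_cnt are just placeholder names, error handling
omitted):

    #include <sys/socket.h>
    #include <bpf/xsk.h>

    /* Top up the fill ring from a list of free frame addresses and, if
     * the driver has flagged the fill ring with need_wakeup, kick the
     * kernel so Rx processing can continue.
     */
    static void refill_and_wake(struct xsk_socket *xsk,
                                struct xsk_ring_prod *fill,
                                const __u64 *free_addrs, __u32 free_cnt)
    {
            __u32 idx, i, n;

            n = xsk_ring_prod__reserve(fill, free_cnt, &idx);
            for (i = 0; i < n; i++)
                    *xsk_ring_prod__fill_addr(fill, idx + i) = free_addrs[i];
            xsk_ring_prod__submit(fill, n);

            if (xsk_ring_prod__needs_wakeup(fill))
                    recvfrom(xsk_socket__fd(xsk), NULL, 0, MSG_DONTWAIT,
                             NULL, NULL);
    }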

> >>> /Magnus
> >>>
> >>>> I will examine what provides the best performance. On one hand it is
> >>>> the number of updates to shared ring state (minimized by current
> >>>> scheme) and the ability for the user app to put buffers on the fill
> >>>> ring. Stating that putting e.g. half the packets on the fill ring
> >>>> provides better performance (for some application) is not a
> >>>> limitation. It is performance tuning advice :-).
> >>
> >> I have now made a set of measurements. First I just made a variation
> >> study using the xdpsock app, varying the number of packets the kernel
> >> can withdraw from a consumer ring (fill and Tx) before updating global
> >> state. For the 1 core case (app and driver on the same core) the more
> >> frequently you do this update, the better. The reason for this is that
> >> it costs very little to update the state since the application is not
> >> running. And it is beneficial for the app to have a freshly updated
> >> state when it starts to execute as it can operate on more packets. For
> >> the 2 core case (app on one core, driver on another) it is the
> >> complete opposite: the fewer updates to global state, the better. The
> >> reason for this is that it costs a lot to update global state as it
> >> triggers cache coherency actions between the two cores.
> >>
> >> What I did then was to compare the current scheme, update only when
> >> grabbing new packets, to a new scheme where we also update the global
> >> consumer pointer when we are exiting Rx or Tx processing in the NAPI
> >> context. On two cores the current scheme gets 0.5 to 1 Mpps more in
> >> throughput than also updating the pointer at the end of NAPI. But for
> >> 1 core case, the new scheme is the best and generates between 0.2 and
> >> 0.3 Mpps more in throughput than the current one. But all in all, the
> >> current scheme is more beneficial than the proposed one if we say that
> >> both the 1 core and the 2 core cases are equally important.
> >>
> >> Something to note is that the xdpsock application only puts batch size
> >> (64) of packets in the fill ring in every iteration, and this might
> >> lead to some good pacing for the current scheme and the 2 core case.
> >> I.e., we do not get into the case of the fill ring only being full or
> >> empty. But I will run this on some real apps to get some more results,
> >> and I know that Björn has an optimized xdpsock application that puts
> >> many more packets into the fill queue than 64. This optimization might
> >> actually make the new proposal (also updating at the end of NAPI) be
> >> better and make the current scheme suffer. We will examine this
> >> further and get back.
> >
> > Actually, after some more thought and discussions I think we should
> > optimize for the 1 core case, since that is what gives the whole
> > system the best performance, provided that you can scale your
> > application by running more instances, that is. For a 4 core system, 4 x the 1
> > core performance > 2 x 2 core performance by a lot. I think that the 1
> > core case is the one that is going to be used out there. At least that
> > is what I hear and see.
> >
> > So, when the merge window opens I am going to submit a patch that
> > updates the consumer pointer when we exit NAPI too. Will increase the
> > performance of the 1 core case.
>
> That sounds good to me. It doesn't make sense to update it multiple
> times per NAPI (in the single core case the application won't run at
> that time anyway), so once per NAPI is the most frequent, and according
> to your experiments it should be the most efficient. It should make
> mlx5e work again. One concern though: you say you are going to submit it
> to -next, but a kernel with your patches has been released, and it has
> broken AF_XDP support in mlx5e. I can send a small fix to net that will
> revert the behavior back to updating the consumer index once every 16
> frames (so it makes mlx5e work again), and your patch will go on top of
> my bugfix. Does it sound right to you?

How about a patch set of two patches? In the cover letter you explain
the situation. 1: The application might deadlock if the kernel
requires multiple buffers to be put on the fill ring to proceed, but
the consumer pointer has not been updated by the kernel, so it is not
possible to put them there. 2: When the kernel needs more buffers, the
driver should signal this through the fill ring's need_wakeup flag.
Document this. There are two possible solutions: a) fix the publishing
of the consumer pointer and document that the user application has to
put entries in the fill ring when the need_wakeup flag is set for the
fill ring (and implement this in your driver if you have not already
done so). b) Fix the Mellanox driver so it can work on any number of
buffers put in the fill ring. We chose a) because it will result in
the most performant system, and optimizing for the "few packets in the
fill ring" case is optimizing for something that will rarely happen.

I can send you the first patch. I do not want to reintroduce the batch
size, if I do not have to. As I have said before, there are already
enough tunables in an AF_XDP application. We do not need another one.
My scheme of publishing the consumer pointers at the end of the NAPI
loop seems to work really well and requires no parameters. You can
then add a patch for 2 and a cover letter making our case that we
would like this fixed. I will send you the patch in a separate mail on
this list. What do you think?

Thanks: Magnus

> Thanks for taking time to do the tests!
>
> > /Magnus
> >
> >> /Magnus
> >>
> >>>>>> I will look into this
> >>>>>> and see what I get.
> >>>>>>
> >>>>>>> 4. As mlx5e allocates frames in batches, the consequences are even more
> >>>>>>> severe: it's a deadlock where the driver waits for the application, and
> >>>>>>> vice versa. The driver never reaches the point where cons_tail gets
> >>>>>>> equal to cons_head. E.g., if cons_tail + 3 == cons_head, and the batch
> >>>>>>> size requested by the driver is 8, the driver won't peek anything from
> >>>>>>> the fill ring waiting for difference between cons_tail and cons_head to
> >>>>>>> increase to be at least 8. On the other hand, the application can't put
> >>>>>>> anything to the ring, because it still thinks that the consumer index is
> >>>>>>> 0. As cons_tail never reaches cons_head, the consumer index doesn't get
> >>>>>>> updated, hence the deadlock.
> >>>>>>
> >>>>>> Good thing that you detected this. Maybe I should get a Mellanox card
> >>>>>> :-). This is different from how we implemented Intel's drivers that
> >>>>>> just work on any batch size. If it gets 3 packets back, it will use
> >>>>>> those. How do you deal with the case when the application just puts a
> >>>>>> single buffer in the fill ring and wants to receive a single packet?
> >>>>>
> >>>>> mlx5e will wait until the full batch is available. As AF_XDP is intended
> >>>>> for high-performance apps, this scenario is less expected. We prefer to
> >>>>> leverage our performance features.
> >>>>
> >>>> That you cannot support all applications on top of AF_XDP with your
> >>>> zero-copy support seems broken to me. But I give you that you might
> >>>> support all the ones you care about with your current batching
> >>>> support. Like you say, most apps will put plenty of buffers on the
> >>>> fill ring, so this should not be a problem. Can you not implement some
> >>>> slow path for these cases? You must have one for the skb case.
> >>>>
> >>>>>> Why does the Mellanox driver need a specific batch size? This is just
> >>>>>> for my understanding so we can find a good solution.
> >>>>>
> >>>>> The main reason is our performance feature called striding RQ. Skipping
> >>>>> all irrelevant details, a batch of 64 pages is posted to the NIC with a
> >>>>> single request, and the NIC fills them progressively. This feature is
> >>>>> turned on by default on modern NICs, and it's really good for
> >>>>> performance.
> >>>>>
> >>>>> It might be possible to post a smaller batch though, I'm not sure about
> >>>>> it, it needs to be checked, but anyway it's not something that we will
> >>>>> likely do, because it's a complication of the data path, and if we know
> >>>>> more frames are coming, it's much better for the performance to wait for
> >>>>> them, rather than to post several incomplete batches.
> >>>>>
> >>>>>>> So, in my vision, the decision to remove RX_BATCH_SIZE and periodic
> >>>>>>> updates of the consumer index was wrong. It totally breaks mlx5e, which
> >>>>>>> does batching, and it will affect the performance of any driver, because
> >>>>>>> the application can't refill the ring until it gets completely empty and
> >>>>>>> the driver starts waiting for frames. I suggest that periodic updates of
> >>>>>>> the consumer index should be readded to xskq_cons_peek_addr.
> >>>>>>
> >>>>>> The reason I wanted to remove RX_BATCH_SIZE is that application
> >>>>>> developers complained about it giving rise to counterintuitive
> >>>>>> behavior in user space. I will try to dig out the complaints and the
> >>>>>> explanations Björn and I had to send about things that users really
> >>>>>> should not have to care about. It should just work.
> >>>>>
> >>>>> I think the counter that doesn't update till the very last moment and
> >>>>> then advances by the ring size will also be something to complain about
> >>>>> (and I am the first one to complain :D). Such bursts are not good in any
> >>>>> case.
> >>>>
> >>>> Do you have any performance data that shows this scheme is bad for
> >>>> performance? The good thing about this scheme is that global state is
> >>>> updated less frequently. And the bad thing is what you mentioned. But
> >>>> which one has the greatest effect, is the question.
> >>>>
> >>>>>> I also do not like
> >>>>>> to have arbitrary constants like this. Why 16?
> >>>>>
> >>>>> I believe any batching mechanism has a constant that looks arbitrary.
> >>>>> This constant should be the golden mean: if it's too small, there is
> >>>>> little effect from batching; if it's too big, it gets too bursty.
> >>>>>
> >>>>> Basically, after your patch it just changed from 16 to the ring size.
> >>>>> Maybe we can tie that constant to ring size? Make it ring_size /
> >>>>> another_arbitrary_constant? :)
> >>>>
> >>>> Agreed, I thought about this too. Something tied to ring_size might be
> >>>> a good compromise. Will examine this. But I want to base this on
> >>>> performance data, not idle speculation, so need to run some experiments
> >>>> first.
> >>>>
> >>>> /Magnus
> >>>>
> >>>>>> Would much prefer not
> >>>>>> having to deal with this, unless of course it horribly breaks
> >>>>>> something or gives rise to worse performance. Might still be the case
> >>>>>> here, but if not, I would like to remove it.
> >>>>>>
> >>>>>> Thanks: Magnus
> >>>>>>
> >>>>>>> Magnus, what do you think of the suggestion above?
> >>>>>>>
> >>>>>>> Thanks,
> >>>>>>> Max
> >>>>>>>
> >>>>>>>>>
> >>>>>>>>> I tracked this issue in our internal bug tracker in case we need to
> >>>>>>>>> perform actual debugging of mlx5e. I'm looking forward to your feedback
> >>>>>>>>> on my assumption above.
> >>>>>>>>>
> >>>>>>>>>>> The hardware I am using is a Mellanox ConnectX4 2x100G card (MCX416A-
> >>>>>> CCAT)
> >>>>>>>>>>> running the mlx5 driver.
> >>>>>>>>>
> >>>>>>>>> This one should run without striding RQ, please verify it with ethtool
> >>>>>>>>> --show-priv-flags (the flag name is rx_striding_rq).
> >>>>>>>>
> >>>>>>>> I do not remember changing this option, so whatever the default is, is
> >>>>>> what it
> >>>>>>>> was running with. I am traveling this week and do not have access to
> >>>>>> these
> >>>>>>>> systems, but will ensure that this flag is set properly when I get back.
> >>>>>>>>
> >>>>>>>
>

^ permalink raw reply	[flat|nested] 49+ messages in thread

end of thread, other threads:[~2020-02-07  9:01 UTC | newest]

Thread overview: 49+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-01-13  0:18 zero-copy between interfaces Ryan Goodfellow
2020-01-13  9:16 ` Magnus Karlsson
2020-01-13 10:43   ` Toke Høiland-Jørgensen
2020-01-13 15:25     ` Ryan Goodfellow
2020-01-13 17:09       ` Toke Høiland-Jørgensen
2020-01-14  7:47         ` Magnus Karlsson
2020-01-14  8:11           ` Toke Høiland-Jørgensen
2020-01-13 15:11   ` Ryan Goodfellow
2020-01-14  9:59     ` Magnus Karlsson
2020-01-14 20:52       ` Ryan Goodfellow
2020-01-15  1:41         ` Ryan Goodfellow
2020-01-15  7:40           ` Magnus Karlsson
2020-01-15  8:20             ` Magnus Karlsson
2020-01-16  2:04               ` Ryan Goodfellow
2020-01-16 14:32                 ` Magnus Karlsson
2020-01-17  9:45                   ` Magnus Karlsson
2020-01-17 17:05                     ` Ryan Goodfellow
2020-01-21  7:34                 ` Magnus Karlsson
2020-01-21 13:40                   ` Maxim Mikityanskiy
2020-01-22 21:43                     ` Ryan Goodfellow
2020-01-27 14:01                       ` Maxim Mikityanskiy
2020-01-27 15:54                         ` Magnus Karlsson
2020-01-30  9:37                           ` Maxim Mikityanskiy
2020-01-30  9:59                             ` Magnus Karlsson
2020-01-30 11:40                               ` Magnus Karlsson
2020-02-04 16:10                                 ` Magnus Karlsson
2020-02-05 13:31                                   ` Magnus Karlsson
2020-02-06 14:56                                     ` Maxim Mikityanskiy
2020-02-07  9:01                                       ` Magnus Karlsson
2020-01-17 17:40         ` William Tu
2020-01-13 11:41 ` Jesper Dangaard Brouer
2020-01-13 15:28   ` Ryan Goodfellow
2020-01-13 17:04     ` Jesper Dangaard Brouer
2020-01-17 17:54       ` Ryan Goodfellow
2020-01-18 10:14         ` Jesper Dangaard Brouer
2020-01-18 14:08           ` Ryan Goodfellow
2020-01-26  4:53             ` Dan Siemon
2020-01-17 12:32 ` Björn Töpel
2020-01-17 12:32   ` [Intel-wired-lan] " Björn Töpel
2020-01-17 17:16   ` Ryan Goodfellow
2020-01-17 17:16     ` [Intel-wired-lan] " Ryan Goodfellow
2020-01-17 18:10     ` Ryan Goodfellow
2020-01-17 18:10       ` [Intel-wired-lan] " Ryan Goodfellow
2020-01-20  8:24     ` Magnus Karlsson
2020-01-20  8:24       ` [Intel-wired-lan] " Magnus Karlsson
2020-01-20 18:33       ` Ryan Goodfellow
2020-01-20 18:33         ` [Intel-wired-lan] " Ryan Goodfellow
2020-01-20 17:04     ` Björn Töpel
2020-01-20 17:04       ` [Intel-wired-lan] " Björn Töpel

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.