* zero-copy between interfaces
@ 2020-01-13  0:18 Ryan Goodfellow
  2020-01-13  9:16 ` Magnus Karlsson
  (2 more replies)

From: Ryan Goodfellow @ 2020-01-13 0:18 UTC
To: xdp-newbies

[-- Attachment #1: Type: text/plain, Size: 3081 bytes --]

Greetings XDP folks. I've been working on a zero-copy XDP bridge
implementation similar to what's described in the following thread.

https://www.spinics.net/lists/xdp-newbies/msg01333.html

I now have an implementation that is working reasonably well under certain
conditions for various hardware. The implementation is primarily based on the
xdpsock_user program in the kernel under samples/bpf. You can find my program
and the corresponding BPF program here.

- https://gitlab.com/mergetb/tech/network-emulation/kernel/blob/v5.5-moa/samples/bpf/xdpsock_multidev.c
- https://gitlab.com/mergetb/tech/network-emulation/kernel/blob/v5.5-moa/samples/bpf/xdpsock_multidev_kern.c

I have a small testbed to run this code on that looks like the following.

Packet forwarding machine:
  CPU: Intel(R) Xeon(R) D-2146NT CPU @ 2.30GHz (8 core / 16 thread)
  Memory: 32 GB
  NICs:
  - Mellanox ConnectX 4 Dual 100G MCX416A-CCAT (connected at 40G)
  - Intel X722 10G SFP+

Sender/receiver machines:
  CPU: Intel(R) Xeon(R) D-2146NT CPU @ 2.30GHz (8 core / 16 thread)
  Memory: 32 GB
  NICs:
  - Mellanox ConnectX 4 40G MCX4131A-BCAT
  - Intel X722 10G SFP+

I could not get zero-copy to work with the i40e driver, as it would crash. I've
attached the corresponding traces from dmesg. The results below are with the
i40e running in SKB/copy mode. I do have an X710-DA4 that I could plug into the
server and test with instead of the X722, if that is of interest. In all cases I
used a single hardware queue, configured via the following.
    ethtool -L <dev> combined 1

The Mellanox cards in zero-copy mode create a sort of shadow set of queues, so I
used ntuple rules to push things through queue 1 (which shadows queue 0) as
follows.

    ethtool -N <dev> flow-type ether src <mac> action 1

The numbers that I have been able to achieve with this code are the following.
MTU is 1500 in all cases.

  mlx5:   ~2.4 Mpps, 29 Gbps  (driver mode, zero-copy)
  i40e:   ~700 Kpps, 8 Gbps   (skb mode, copy)
  virtio: ~200 Kpps, 2.4 Gbps (skb mode, copy, all qemu/kvm VMs)

Are these numbers in the ballpark of what's expected?

One thing I have noticed is that I cannot create large memory maps for the
packet buffers. For example, a frame size of 2048 with 524288 frames (around
1 GB of packet buffer) is fine. However, increasing the size by an order of
magnitude, which is well within the memory capacity of the host machine,
results in an error when creating the UMEM, and the kernel shows the attached
call trace. I'm going to begin investigating this in more detail soon, but if
anyone has advice on large XDP memory maps, that would be much appreciated.

The reason for wanting large memory maps is that our use case for XDP is
network emulation - and sometimes that means introducing delay factors that
can require rather large in-memory packet buffers.

If there is interest in including this program in the official BPF samples I'm
happy to submit a patch. Any comments on the program are also much appreciated.
Thanks ~ ry [-- Attachment #2: large-umem-trace.txt --] [-- Type: text/plain, Size: 3188 bytes --] [ 9879.806272] ------------[ cut here ]------------ [ 9879.811489] WARNING: CPU: 14 PID: 18118 at mm/page_alloc.c:4738 __alloc_pages_nodemask+0x1de/0x2b0 [ 9879.816743] Modules linked in: x86_pkg_temp_thermal i40e igb mlx5_core efivarfs nvme nvme_core [ 9879.822023] CPU: 14 PID: 18118 Comm: xdpsock_multide Not tainted 5.5.0-rc3-moa+ #1 [ 9879.827399] Hardware name: Supermicro SYS-E300-9D-8CN8TP/X11SDV-8C-TP8F, BIOS 1.1a 05/17/2019 [ 9879.832838] RIP: 0010:__alloc_pages_nodemask+0x1de/0x2b0 [ 9879.838257] Code: 0f 85 1a ff ff ff 65 48 8b 04 25 00 7d 01 00 48 05 28 08 00 00 41 bd 01 00 00 00 48 89 44 24 08 e9 fb fe ff ff 80 e7 20 75 02 <0f> 0b 45 31 ed eb 98 44 8b 64 24 18 65 8b 05 17 ff c0 5a 89 c0 48 [ 9879.849435] RSP: 0018:ffffa8fe00777db8 EFLAGS: 00010246 [ 9879.855031] RAX: 0000000000000000 RBX: 00000000000400c0 RCX: 0000000000000000 [ 9879.860665] RDX: 0000000000000000 RSI: 000000000000000b RDI: 0000000000040dc0 [ 9879.866260] RBP: 0000000000800000 R08: ffffffffa6e27560 R09: ffff93f01c1a0000 [ 9879.871862] R10: ffffcb5c5e48a9c0 R11: 0000000822898000 R12: ffff93f01828c600 [ 9879.877478] R13: 000000000000000b R14: 000000000000000b R15: ffff93f01c1a0000 [ 9879.883079] FS: 00007fd124b8c740(0000) GS:ffff93f020180000(0000) knlGS:0000000000000000 [ 9879.888724] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 9879.894361] CR2: 00007f04f8629ecc CR3: 000000080480a005 CR4: 00000000007606e0 [ 9879.900038] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 9879.905692] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 9879.911207] PKRU: 55555554 [ 9879.916581] Call Trace: [ 9879.921841] kmalloc_order+0x1b/0x80 [ 9879.926998] kmalloc_order_trace+0x1d/0xa0 [ 9879.932083] xdp_umem_create+0x380/0x450 [ 9879.937114] xsk_setsockopt+0x1e9/0x270 [ 9879.942151] __sys_setsockopt+0xd6/0x190 [ 9879.947071] __x64_sys_setsockopt+0x21/0x30 [ 
9879.951872] do_syscall_64+0x50/0x140 [ 9879.956564] entry_SYSCALL_64_after_hwframe+0x49/0xbe [ 9879.961189] RIP: 0033:0x7fd124ca787a [ 9879.965746] Code: ff ff ff c3 48 8b 15 15 d6 0b 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb bb 0f 1f 80 00 00 00 00 49 89 ca b8 36 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d e6 d5 0b 00 f7 d8 64 89 01 48 [ 9879.975218] RSP: 002b:00007fff8463f498 EFLAGS: 00000206 ORIG_RAX: 0000000000000036 [ 9879.979993] RAX: ffffffffffffffda RBX: 00007fd0a4b8c000 RCX: 00007fd124ca787a [ 9879.984779] RDX: 0000000000000004 RSI: 000000000000011b RDI: 0000000000000009 [ 9879.989489] RBP: 00007fff8463f580 R08: 0000000000000020 R09: 00005605752b5f70 [ 9879.994115] R10: 00007fff8463f4b0 R11: 0000000000000206 R12: 00007fff8463f590 [ 9879.998684] R13: 0000000080000000 R14: 00005605752b5a40 R15: 00005605752b5ac0 [ 9880.003156] ---[ end trace d2ad1cbfd1388dbe ]--- [ 9880.007555] xdp_umem: kalloc failed [ 9880.032714] xdpsock_multide[18118]: segfault at 98 ip 0000560573d2aba1 sp 00007fff8463f520 error 4 in xdpsock_multidev[560573d2a000+16000] [ 9880.045593] Code: ec 10 41 8b 3c 9c 8b 15 4d 07 02 00 c7 45 ec 00 00 00 00 e8 e1 0f 01 00 85 c0 75 51 48 8d 15 46 09 02 00 8b 45 ec 48 8b 14 da <39> 82 98 00 00 00 74 27 85 c0 74 15 48 8d 3d fc 55 01 00 e8 37 f5 [-- Attachment #3: i40e-zerocopy-trace.txt --] [-- Type: text/plain, Size: 8627 bytes --] [ 328.579154] i40e 0000:b7:00.2: failed to get tracking for 256 queues for VSI 0 err -12 [ 328.579280] i40e 0000:b7:00.2: setup of MAIN VSI failed [ 328.579367] i40e 0000:b7:00.2: can't remove VEB 162 with 0 VSIs left [ 328.579467] BUG: kernel NULL pointer dereference, address: 0000000000000000 [ 328.579573] #PF: supervisor read access in kernel mode [ 328.579655] #PF: error_code(0x0000) - not-present page [ 328.579737] PGD 0 P4D 0 [ 328.579780] Oops: 0000 [#1] SMP PTI [ 328.579835] CPU: 0 PID: 2157 Comm: xdpsock_multide Not tainted 5.5.0-rc3-moa+ #1 [ 328.579951] Hardware name: Supermicro 
SYS-E300-9D-8CN8TP/X11SDV-8C-TP8F, BIOS 1.1a 05/17/2019 [ 328.580093] RIP: 0010:i40e_xdp+0xfc/0x1d0 [i40e] [ 328.580162] Code: 00 ba 01 00 00 00 be 01 00 00 00 e8 3e e9 ff ff 66 83 bd f6 0c 00 00 00 74 27 31 c0 48 8b 95 90 0c 00 00 48 8b 8d d0 0c 00 00 <48> 8b 14 c2 48 83 c0 01 48 89 4a 20 0f b7 95 f6 0c 00 00 39 c2 7f [ 328.580455] RSP: 0018:ffffb2654341f888 EFLAGS: 00010246 [ 328.580538] RAX: 0000000000000000 RBX: ffffffffc0358301 RCX: ffffb26540719000 [ 328.580650] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff91571fe18810 [ 328.580762] RBP: ffff91570ef3b000 R08: 0000000000000000 R09: 00000000000005d5 [ 328.580875] R10: 0000000000aaaaaa R11: 0000000000000000 R12: 0000000000000000 [ 328.580987] R13: 0000000000000000 R14: ffffb26540719000 R15: ffffffffc032f300 [ 328.581099] FS: 00007f079f960740(0000) GS:ffff91571fe00000(0000) knlGS:0000000000000000 [ 328.581227] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 328.581316] CR2: 0000000000000000 CR3: 000000085578c005 CR4: 00000000007606f0 [ 328.581428] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 328.581541] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 328.585600] PKRU: 55555554 [ 328.589632] Call Trace: [ 328.593637] ? i40e_reconfig_rss_queues+0x100/0x100 [i40e] [ 328.597641] dev_xdp_install+0x4e/0x70 [ 328.601530] dev_change_xdp_fd+0x11a/0x260 [ 328.605335] do_setlink+0xdd0/0xe80 [ 328.609020] ? get_page_from_freelist+0x738/0x10d0 [ 328.612678] ? on_each_cpu+0x54/0x60 [ 328.616233] ? kmem_cache_alloc_node+0x43/0x1f0 [ 328.619856] ? optimize_nops.isra.0+0x90/0x90 [ 328.623457] rtnl_setlink+0xe5/0x160 [ 328.627015] ? security_capable+0x40/0x60 [ 328.630541] rtnetlink_rcv_msg+0x2b0/0x360 [ 328.634046] ? alloc_file+0x76/0xf0 [ 328.637513] ? alloc_file_pseudo+0xa3/0x110 [ 328.640949] ? 
rtnl_calcit.isra.0+0x110/0x110 [ 328.644369] netlink_rcv_skb+0x49/0x110 [ 328.647757] netlink_unicast+0x191/0x230 [ 328.651140] netlink_sendmsg+0x225/0x460 [ 328.654481] sock_sendmsg+0x5e/0x60 [ 328.657784] __sys_sendto+0xee/0x160 [ 328.661050] ? __sys_getsockname+0x7e/0xc0 [ 328.664306] ? entry_SYSCALL_64_after_hwframe+0x3e/0xbe [ 328.667564] ? entry_SYSCALL_64_after_hwframe+0x3e/0xbe [ 328.670715] ? trace_hardirqs_off_caller+0x2d/0xd0 [ 328.673841] ? do_syscall_64+0x12/0x140 [ 328.676955] __x64_sys_sendto+0x25/0x30 [ 328.680025] do_syscall_64+0x50/0x140 [ 328.683046] entry_SYSCALL_64_after_hwframe+0x49/0xbe [ 328.686097] RIP: 0033:0x7f079fa7b66d [ 328.689126] Code: ff ff ff ff eb bc 0f 1f 80 00 00 00 00 48 8d 05 79 3d 0c 00 41 89 ca 8b 00 85 c0 75 20 45 31 c9 45 31 c0 b8 2c 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 6b c3 66 2e 0f 1f 84 00 00 00 00 00 55 48 83 [ 328.695639] RSP: 002b:00007fffe2a66e88 EFLAGS: 00000246 ORIG_RAX: 000000000000002c [ 328.698886] RAX: ffffffffffffffda RBX: 0000000000000005 RCX: 00007f079fa7b66d [ 328.702098] RDX: 0000000000000034 RSI: 00007fffe2a66ea0 RDI: 0000000000000007 [ 328.705240] RBP: 00007fffe2a66f20 R08: 0000000000000000 R09: 0000000000000000 [ 328.708278] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000007 [ 328.711244] R13: 0000000000000006 R14: 0000000000000007 R15: 0000000000000020 [ 328.714150] Modules linked in: x86_pkg_temp_thermal i40e igb mlx5_core efivarfs nvme nvme_core [ 328.717149] CR2: 0000000000000000 [ 328.720135] ---[ end trace f2aa62f9801a8115 ]--- [ 328.723157] RIP: 0010:i40e_xdp+0xfc/0x1d0 [i40e] [ 328.726171] Code: 00 ba 01 00 00 00 be 01 00 00 00 e8 3e e9 ff ff 66 83 bd f6 0c 00 00 00 74 27 31 c0 48 8b 95 90 0c 00 00 48 8b 8d d0 0c 00 00 <48> 8b 14 c2 48 83 c0 01 48 89 4a 20 0f b7 95 f6 0c 00 00 39 c2 7f [ 328.732497] RSP: 0018:ffffb2654341f888 EFLAGS: 00010246 [ 328.735655] RAX: 0000000000000000 RBX: ffffffffc0358301 RCX: ffffb26540719000 [ 328.738890] RDX: 0000000000000000 RSI: 0000000000000000 
RDI: ffff91571fe18810 [ 328.742151] RBP: ffff91570ef3b000 R08: 0000000000000000 R09: 00000000000005d5 [ 328.745377] R10: 0000000000aaaaaa R11: 0000000000000000 R12: 0000000000000000 [ 328.748616] R13: 0000000000000000 R14: ffffb26540719000 R15: ffffffffc032f300 [ 328.751848] FS: 00007f079f960740(0000) GS:ffff91571fe00000(0000) knlGS:0000000000000000 [ 328.755130] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 328.758421] CR2: 0000000000000000 CR3: 000000085578c005 CR4: 00000000007606f0 [ 328.761777] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 328.765113] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 328.768421] PKRU: 55555554 [ 328.780850] BUG: kernel NULL pointer dereference, address: 0000000000000f40 [ 328.784237] #PF: supervisor read access in kernel mode [ 328.787635] #PF: error_code(0x0000) - not-present page [ 328.791042] PGD 0 P4D 0 [ 328.794431] Oops: 0000 [#2] SMP PTI [ 328.797827] CPU: 0 PID: 185 Comm: kworker/0:2 Tainted: G D 5.5.0-rc3-moa+ #1 [ 328.801358] Hardware name: Supermicro SYS-E300-9D-8CN8TP/X11SDV-8C-TP8F, BIOS 1.1a 05/17/2019 [ 328.804971] Workqueue: i40e i40e_service_task [i40e] [ 328.808596] RIP: 0010:i40e_notify_client_of_netdev_close+0x6/0x60 [i40e] [ 328.812301] Code: 40 10 e8 1d 9c ab ca 48 8b 44 24 28 65 48 33 04 25 28 00 00 00 75 06 48 83 c4 30 5b c3 e8 e2 3c d2 c9 66 90 0f 1f 44 00 00 53 <48> 8b 87 40 0f 00 00 48 8b 98 38 08 00 00 48 85 db 74 44 4c 8b 83 [ 328.820156] RSP: 0018:ffffb265405b3de0 EFLAGS: 00010247 [ 328.824151] RAX: ffff91570ef42c00 RBX: ffff91570ef58810 RCX: ffff915713eb6818 [ 328.828215] RDX: 0000000000000000 RSI: 0000000000000001 RDI: 0000000000000000 [ 328.832300] RBP: ffff91570ef58000 R08: 0000000065303469 R09: 8080808080808080 [ 328.836435] R10: ffff915717db05ac R11: 0000000000000018 R12: ffff91570ef58008 [ 328.840564] R13: 0000000000000053 R14: 0000000000000000 R15: 0ffffd2653fa0980 [ 328.844691] FS: 0000000000000000(0000) GS:ffff91571fe00000(0000) 
knlGS:0000000000000000 [ 328.848883] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 328.853088] CR2: 0000000000000f40 CR3: 000000001c80a006 CR4: 00000000007606f0 [ 328.857340] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 328.861571] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 328.865681] PKRU: 55555554 [ 328.869666] Call Trace: [ 328.873572] i40e_service_task+0x511/0xa40 [i40e] [ 328.877484] ? process_one_work+0x1d3/0x390 [ 328.881365] process_one_work+0x1e5/0x390 [ 328.885245] worker_thread+0x4a/0x3d0 [ 328.889121] kthread+0xfb/0x130 [ 328.893004] ? process_one_work+0x390/0x390 [ 328.896831] ? kthread_park+0x90/0x90 [ 328.900580] ret_from_fork+0x3a/0x50 [ 328.904312] Modules linked in: x86_pkg_temp_thermal i40e igb mlx5_core efivarfs nvme nvme_core [ 328.908185] CR2: 0000000000000f40 [ 328.911944] ---[ end trace f2aa62f9801a8116 ]--- [ 328.915635] RIP: 0010:i40e_xdp+0xfc/0x1d0 [i40e] [ 328.919262] Code: 00 ba 01 00 00 00 be 01 00 00 00 e8 3e e9 ff ff 66 83 bd f6 0c 00 00 00 74 27 31 c0 48 8b 95 90 0c 00 00 48 8b 8d d0 0c 00 00 <48> 8b 14 c2 48 83 c0 01 48 89 4a 20 0f b7 95 f6 0c 00 00 39 c2 7f [ 328.926889] RSP: 0018:ffffb2654341f888 EFLAGS: 00010246 [ 328.930743] RAX: 0000000000000000 RBX: ffffffffc0358301 RCX: ffffb26540719000 [ 328.934672] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff91571fe18810 [ 328.938595] RBP: ffff91570ef3b000 R08: 0000000000000000 R09: 00000000000005d5 [ 328.942481] R10: 0000000000aaaaaa R11: 0000000000000000 R12: 0000000000000000 [ 328.946304] R13: 0000000000000000 R14: ffffb26540719000 R15: ffffffffc032f300 [ 328.950102] FS: 0000000000000000(0000) GS:ffff91571fe00000(0000) knlGS:0000000000000000 [ 328.953961] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 328.957847] CR2: 0000000000000f40 CR3: 000000001c80a006 CR4: 00000000007606f0 [ 328.961743] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 328.965589] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 
0000000000000400 [ 328.969372] PKRU: 55555554 ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: zero-copy between interfaces 2020-01-13 0:18 zero-copy between interfaces Ryan Goodfellow @ 2020-01-13 9:16 ` Magnus Karlsson 2020-01-13 10:43 ` Toke Høiland-Jørgensen 2020-01-13 15:11 ` Ryan Goodfellow 2020-01-13 11:41 ` Jesper Dangaard Brouer 2020-01-17 12:32 ` [Intel-wired-lan] " =?unknown-8bit?q?Bj=C3=B6rn_T=C3=B6pel?= 2 siblings, 2 replies; 49+ messages in thread From: Magnus Karlsson @ 2020-01-13 9:16 UTC (permalink / raw) To: Ryan Goodfellow; +Cc: xdp-newbies On Mon, Jan 13, 2020 at 1:28 AM Ryan Goodfellow <rgoodfel@isi.edu> wrote: > > Greetings XDP folks. I've been working on a zero-copy XDP bridge > implementation similar to what's described in the following thread. > > https://www.spinics.net/lists/xdp-newbies/msg01333.html > > I now have an implementation that is working reasonably well under certain > conditions for various hardware. The implementation is primarily based on the > xdpsock_user program in the kernel under samples/bpf. You can find my program > and corresponding BPF program here. > > - https://gitlab.com/mergetb/tech/network-emulation/kernel/blob/v5.5-moa/samples/bpf/xdpsock_multidev.c > - https://gitlab.com/mergetb/tech/network-emulation/kernel/blob/v5.5-moa/samples/bpf/xdpsock_multidev_kern.c > > I have small testbed to run this code on that looks like the following. > > Packet forwarding machine: > CPU: Intel(R) Xeon(R) D-2146NT CPU @ 2.30GHz (8 core / 16 thread) > Memory: 32 GB > NICs: > - Mellanox ConnectX 4 Dual 100G MCX416A-CCAT (connected at 40G) > - Intel X722 10G SFP+ > > Sender/receiver machines > CPU: Intel(R) Xeon(R) D-2146NT CPU @ 2.30GHz (8 core / 16 thread) > Memory: 32 GB > NICs: > - Mellanox ConnectX 4 40G MCX4131A-BCAT > - Intel X722 10G SFP+ > > I could not get zero-copy to work with the i40e driver as it would crash. I've > attached the corresponding traces from dmesg. The results below are with the > i40e running in SKB/copy mode. 
I do have an X710-DA4 that I could plug into the
> server and test with instead of the X722 if that is of interest. In all cases I
> used a single hardware queue via the following.
>
> ethtool -L <dev> combined 1
>
> The Mellanox cards in zero-copy mode create a sort of shadow set of queues, I
> used ntuple rules to push things through queue 1 (shadows 0) as follows
>
> ethtool -N <dev> flow-type ether src <mac> action 1
>
> The numbers that I have been able to achive with this code are the following. MTU
> is 1500 in all cases.
>
> mlx5: pps ~ 2.4 Mpps, 29 Gbps (driver mode, zero-copy)
> i40e: pps ~ 700 Kpps, 8 Gbps (skb mode, copy)
> virtio: pps ~ 200 Kpps, 2.4 Gbps (skb mode, copy, all qemu/kvm VMs)
>
> Are these numbers in the ballpark of what's expected?
>
> One thing I have noticed is that I cannot create large memory maps for the
> packet buffers. For example a frame size of 2048 with 524288 frames (around
> 1G of packets) is fine. However, increasing size by an order of magnitude, which
> is well within the memory capacity of the host machine results in an error when
> creating the UMEM and the kernel shows the attached call trace. I'm going to
> begin investigating this in more detail soon, but if anyone has advice on large
> XDP memory maps that would be much appreciated.

Hi Ryan,

Thanks for taking XDP and AF_XDP for a spin. I will start by fixing
this out-of-memory issue. With your umem size, we are hitting the size
limit of kmalloc. I will fix this by using kvmalloc, which falls back to
vmalloc if kmalloc fails. That should hopefully make it possible for you
to allocate larger umems.

> The reason for wanting large memory maps is that our use case for XDP is network
> emulation - and sometimes that means introducing delay factors that can require
> rather large in-memory packet buffers.
>
> If there is interest in including this program in the official BPF samples I'm happy to
> submit a patch. Any comments on the program are also much appreciated.
More examples are always useful, but the question is whether it should
reside in samples or outside the kernel in some other repo. Is there
some good place in the xdp-project github that could be used for this
purpose?

/Magnus

> Thanks
>
> ~ ry
* Re: zero-copy between interfaces 2020-01-13 9:16 ` Magnus Karlsson @ 2020-01-13 10:43 ` Toke Høiland-Jørgensen 2020-01-13 15:25 ` Ryan Goodfellow 2020-01-13 15:11 ` Ryan Goodfellow 1 sibling, 1 reply; 49+ messages in thread From: Toke Høiland-Jørgensen @ 2020-01-13 10:43 UTC (permalink / raw) To: Magnus Karlsson, Ryan Goodfellow; +Cc: xdp-newbies Magnus Karlsson <magnus.karlsson@gmail.com> writes: > On Mon, Jan 13, 2020 at 1:28 AM Ryan Goodfellow <rgoodfel@isi.edu> wrote: >> >> Greetings XDP folks. I've been working on a zero-copy XDP bridge >> implementation similar to what's described in the following thread. >> >> https://www.spinics.net/lists/xdp-newbies/msg01333.html >> >> I now have an implementation that is working reasonably well under certain >> conditions for various hardware. The implementation is primarily based on the >> xdpsock_user program in the kernel under samples/bpf. You can find my program >> and corresponding BPF program here. >> >> - https://gitlab.com/mergetb/tech/network-emulation/kernel/blob/v5.5-moa/samples/bpf/xdpsock_multidev.c >> - https://gitlab.com/mergetb/tech/network-emulation/kernel/blob/v5.5-moa/samples/bpf/xdpsock_multidev_kern.c >> >> I have small testbed to run this code on that looks like the following. >> >> Packet forwarding machine: >> CPU: Intel(R) Xeon(R) D-2146NT CPU @ 2.30GHz (8 core / 16 thread) >> Memory: 32 GB >> NICs: >> - Mellanox ConnectX 4 Dual 100G MCX416A-CCAT (connected at 40G) >> - Intel X722 10G SFP+ >> >> Sender/receiver machines >> CPU: Intel(R) Xeon(R) D-2146NT CPU @ 2.30GHz (8 core / 16 thread) >> Memory: 32 GB >> NICs: >> - Mellanox ConnectX 4 40G MCX4131A-BCAT >> - Intel X722 10G SFP+ >> >> I could not get zero-copy to work with the i40e driver as it would crash. I've >> attached the corresponding traces from dmesg. The results below are with the >> i40e running in SKB/copy mode. I do have an X710-DA4 that I could plug into the >> server and test with instead of the X722 if that is of interest. 
In all cases I >> used a single hardware queue via the following. >> >> ethtool -L <dev> combined 1 >> >> The Mellanox cards in zero-copy mode create a sort of shadow set of queues, I >> used ntuple rules to push things through queue 1 (shadows 0) as follows >> >> ethtool -N <dev> flow-type ether src <mac> action 1 >> >> The numbers that I have been able to achive with this code are the following. MTU >> is 1500 in all cases. >> >> mlx5: pps ~ 2.4 Mpps, 29 Gbps (driver mode, zero-copy) >> i40e: pps ~ 700 Kpps, 8 Gbps (skb mode, copy) >> virtio: pps ~ 200 Kpps, 2.4 Gbps (skb mode, copy, all qemu/kvm VMs) >> >> Are these numbers in the ballpark of what's expected? >> >> One thing I have noticed is that I cannot create large memory maps for the >> packet buffers. For example a frame size of 2048 with 524288 frames (around >> 1G of packets) is fine. However, increasing size by an order of magnitude, which >> is well within the memory capacity of the host machine results in an error when >> creating the UMEM and the kernel shows the attached call trace. I'm going to >> begin investigating this in more detail soon, but if anyone has advice on large >> XDP memory maps that would be much appreciated. > > Hi Ryan, > > Thanks for taking XDP and AF_XDP for a sping. I will start by fixing > this out-of-memory issue. With your umem size, we are hitting the size > limit of kmalloc. I will fix this by using kvmalloc that tries to > allocate with vmalloc if kmalloc fails. Should hopefully make it > possible for you to allocate larger umems. > >> The reason for wanting large memory maps is that our use case for XDP is network >> emulation - and sometimes that means introducing delay factors that can require >> a rather large in-memory packet buffers. >> >> If there is interest in including this program in the official BPF samples I'm happy to >> submit a patch. Any comments on the program are also much appreciated. 
>
> More examples are always useful, but the question is if it should
> reside in samples or outside the kernel in some other repo? Is there
> some good place in xdp-project github that could be used for this
> purpose?

We could certainly create something; either a new xdp-samples
repository, or an example-programs/ subdir of the xdp-tutorial. Which of
those makes the most sense depends on the size of the program, I think...

-Toke
* Re: zero-copy between interfaces 2020-01-13 10:43 ` Toke Høiland-Jørgensen @ 2020-01-13 15:25 ` Ryan Goodfellow 2020-01-13 17:09 ` Toke Høiland-Jørgensen 0 siblings, 1 reply; 49+ messages in thread From: Ryan Goodfellow @ 2020-01-13 15:25 UTC (permalink / raw) To: Toke Høiland-Jørgensen; +Cc: Magnus Karlsson, xdp-newbies On Mon, Jan 13, 2020 at 11:43:02AM +0100, Toke Høiland-Jørgensen wrote: >> Magnus Karlsson <magnus.karlsson@gmail.com> writes: >>> On Mon, Jan 13, 2020 at 1:28 AM Ryan Goodfellow <rgoodfel@isi.edu> wrote: >>> The reason for wanting large memory maps is that our use case for XDP is network >>> emulation - and sometimes that means introducing delay factors that can require >>> a rather large in-memory packet buffers. >>> >>> If there is interest in including this program in the official BPF samples I'm happy to >>> submit a patch. Any comments on the program are also much appreciated. >> >> More examples are always useful, but the question is if it should >> reside in samples or outside the kernel in some other repo? Is there >> some good place in xdp-project github that could be used for this >> purpose? > > We could certainly create something; either a new xdp-samples > repository, or an example-programs/ subdir of the xdp-tutorial? Which of > those makes the most sense depends on the size of the program I think... > > -Toke > I'm happy to provide patches or pull-requests in either case. The userspace program is 1 file with 555 lines and the BPF program is 28 lines. I've tested the userspace program with the 5.5 kernel. The BPF program requires clang-9 to work properly (due to BTF features IIRC). - https://gitlab.com/mergetb/tech/network-emulation/kernel/blob/v5.5-moa/samples/bpf/xdpsock_multidev.c - https://gitlab.com/mergetb/tech/network-emulation/kernel/blob/v5.5-moa/samples/bpf/xdpsock_multidev_kern.c The primary usefulness of this program relative to what's out there is that it pushes packets between interfaces using a common memory map. 
-- ~ ry
* Re: zero-copy between interfaces 2020-01-13 15:25 ` Ryan Goodfellow @ 2020-01-13 17:09 ` Toke Høiland-Jørgensen 2020-01-14 7:47 ` Magnus Karlsson 0 siblings, 1 reply; 49+ messages in thread From: Toke Høiland-Jørgensen @ 2020-01-13 17:09 UTC (permalink / raw) To: Ryan Goodfellow; +Cc: Magnus Karlsson, xdp-newbies Ryan Goodfellow <rgoodfel@isi.edu> writes: > On Mon, Jan 13, 2020 at 11:43:02AM +0100, Toke Høiland-Jørgensen wrote: >>> Magnus Karlsson <magnus.karlsson@gmail.com> writes: >>>> On Mon, Jan 13, 2020 at 1:28 AM Ryan Goodfellow <rgoodfel@isi.edu> wrote: >>>> The reason for wanting large memory maps is that our use case for XDP is network >>>> emulation - and sometimes that means introducing delay factors that can require >>>> a rather large in-memory packet buffers. >>>> >>>> If there is interest in including this program in the official BPF samples I'm happy to >>>> submit a patch. Any comments on the program are also much appreciated. >>> >>> More examples are always useful, but the question is if it should >>> reside in samples or outside the kernel in some other repo? Is there >>> some good place in xdp-project github that could be used for this >>> purpose? >> >> We could certainly create something; either a new xdp-samples >> repository, or an example-programs/ subdir of the xdp-tutorial? Which of >> those makes the most sense depends on the size of the program I think... >> >> -Toke >> > > I'm happy to provide patches or pull-requests in either case. The userspace > program is 1 file with 555 lines and the BPF program is 28 lines. I've > tested the userspace program with the 5.5 kernel. The BPF program requires > clang-9 to work properly (due to BTF features IIRC). 
> > - https://gitlab.com/mergetb/tech/network-emulation/kernel/blob/v5.5-moa/samples/bpf/xdpsock_multidev.c > - https://gitlab.com/mergetb/tech/network-emulation/kernel/blob/v5.5-moa/samples/bpf/xdpsock_multidev_kern.c > > The primary usefulness of this program relative to what's out there is > that it pushes packets between interfaces using a common memory map. Hmm, yeah, this could live in either samples or as standalone. IDK, Magnus what do you think? -Toke ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: zero-copy between interfaces 2020-01-13 17:09 ` Toke Høiland-Jørgensen @ 2020-01-14 7:47 ` Magnus Karlsson 2020-01-14 8:11 ` Toke Høiland-Jørgensen 0 siblings, 1 reply; 49+ messages in thread From: Magnus Karlsson @ 2020-01-14 7:47 UTC (permalink / raw) To: Toke Høiland-Jørgensen; +Cc: Ryan Goodfellow, xdp-newbies On Mon, Jan 13, 2020 at 6:09 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote: > > Ryan Goodfellow <rgoodfel@isi.edu> writes: > > > On Mon, Jan 13, 2020 at 11:43:02AM +0100, Toke Høiland-Jørgensen wrote: > >>> Magnus Karlsson <magnus.karlsson@gmail.com> writes: > >>>> On Mon, Jan 13, 2020 at 1:28 AM Ryan Goodfellow <rgoodfel@isi.edu> wrote: > >>>> The reason for wanting large memory maps is that our use case for XDP is network > >>>> emulation - and sometimes that means introducing delay factors that can require > >>>> a rather large in-memory packet buffers. > >>>> > >>>> If there is interest in including this program in the official BPF samples I'm happy to > >>>> submit a patch. Any comments on the program are also much appreciated. > >>> > >>> More examples are always useful, but the question is if it should > >>> reside in samples or outside the kernel in some other repo? Is there > >>> some good place in xdp-project github that could be used for this > >>> purpose? > >> > >> We could certainly create something; either a new xdp-samples > >> repository, or an example-programs/ subdir of the xdp-tutorial? Which of > >> those makes the most sense depends on the size of the program I think... > >> > >> -Toke > >> > > > > I'm happy to provide patches or pull-requests in either case. The userspace > > program is 1 file with 555 lines and the BPF program is 28 lines. I've > > tested the userspace program with the 5.5 kernel. The BPF program requires > > clang-9 to work properly (due to BTF features IIRC). 
> > - https://gitlab.com/mergetb/tech/network-emulation/kernel/blob/v5.5-moa/samples/bpf/xdpsock_multidev.c
> > - https://gitlab.com/mergetb/tech/network-emulation/kernel/blob/v5.5-moa/samples/bpf/xdpsock_multidev_kern.c
> >
> > The primary usefulness of this program relative to what's out there is
> > that it pushes packets between interfaces using a common memory map.
>
> Hmm, yeah, this could live in either samples or as standalone. IDK,
> Magnus what do you think?

Let us try the kernel samples first. Then, if the folks on the mailing
list do not want more samples in there, we can fall back on the
xdp-tutorial repo.

/Magnus

> -Toke
* Re: zero-copy between interfaces 2020-01-14 7:47 ` Magnus Karlsson @ 2020-01-14 8:11 ` Toke Høiland-Jørgensen 0 siblings, 0 replies; 49+ messages in thread From: Toke Høiland-Jørgensen @ 2020-01-14 8:11 UTC (permalink / raw) To: Magnus Karlsson; +Cc: Ryan Goodfellow, xdp-newbies Magnus Karlsson <magnus.karlsson@gmail.com> writes: > On Mon, Jan 13, 2020 at 6:09 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote: >> >> Ryan Goodfellow <rgoodfel@isi.edu> writes: >> >> > On Mon, Jan 13, 2020 at 11:43:02AM +0100, Toke Høiland-Jørgensen wrote: >> >>> Magnus Karlsson <magnus.karlsson@gmail.com> writes: >> >>>> On Mon, Jan 13, 2020 at 1:28 AM Ryan Goodfellow <rgoodfel@isi.edu> wrote: >> >>>> The reason for wanting large memory maps is that our use case for XDP is network >> >>>> emulation - and sometimes that means introducing delay factors that can require >> >>>> a rather large in-memory packet buffers. >> >>>> >> >>>> If there is interest in including this program in the official BPF samples I'm happy to >> >>>> submit a patch. Any comments on the program are also much appreciated. >> >>> >> >>> More examples are always useful, but the question is if it should >> >>> reside in samples or outside the kernel in some other repo? Is there >> >>> some good place in xdp-project github that could be used for this >> >>> purpose? >> >> >> >> We could certainly create something; either a new xdp-samples >> >> repository, or an example-programs/ subdir of the xdp-tutorial? Which of >> >> those makes the most sense depends on the size of the program I think... >> >> >> >> -Toke >> >> >> > >> > I'm happy to provide patches or pull-requests in either case. The userspace >> > program is 1 file with 555 lines and the BPF program is 28 lines. I've >> > tested the userspace program with the 5.5 kernel. The BPF program requires >> > clang-9 to work properly (due to BTF features IIRC). 
>> > >> > - https://gitlab.com/mergetb/tech/network-emulation/kernel/blob/v5.5-moa/samples/bpf/xdpsock_multidev.c >> > - https://gitlab.com/mergetb/tech/network-emulation/kernel/blob/v5.5-moa/samples/bpf/xdpsock_multidev_kern.c >> > >> > The primary usefulness of this program relative to what's out there is >> > that it pushes packets between interfaces using a common memory map. >> >> Hmm, yeah, this could live in either samples or as standalone. IDK, >> Magnus what do you think? > > Let us try the kernel samples first. Then and if the folks on the > mailing list do not want more samples in there, we fall back on the > xdp-tutorial repo. ACK, SGTM :) -Toke ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: zero-copy between interfaces 2020-01-13 9:16 ` Magnus Karlsson 2020-01-13 10:43 ` Toke Høiland-Jørgensen @ 2020-01-13 15:11 ` Ryan Goodfellow 2020-01-14 9:59 ` Magnus Karlsson 1 sibling, 1 reply; 49+ messages in thread From: Ryan Goodfellow @ 2020-01-13 15:11 UTC (permalink / raw) To: Magnus Karlsson; +Cc: xdp-newbies On Mon, Jan 13, 2020 at 10:16:47AM +0100, Magnus Karlsson wrote: > Thanks for taking XDP and AF_XDP for a spin. I will start by fixing > this out-of-memory issue. With your umem size, we are hitting the size > limit of kmalloc. I will fix this by using kvmalloc, which tries to > allocate with vmalloc if kmalloc fails. That should hopefully make it > possible for you to allocate larger umems. > Thanks! -- ~ ry ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: zero-copy between interfaces 2020-01-13 15:11 ` Ryan Goodfellow @ 2020-01-14 9:59 ` Magnus Karlsson 2020-01-14 20:52 ` Ryan Goodfellow 0 siblings, 1 reply; 49+ messages in thread From: Magnus Karlsson @ 2020-01-14 9:59 UTC (permalink / raw) To: Ryan Goodfellow; +Cc: xdp-newbies On Mon, Jan 13, 2020 at 4:12 PM Ryan Goodfellow <rgoodfel@isi.edu> wrote: > > On Mon, Jan 13, 2020 at 10:16:47AM +0100, Magnus Karlsson wrote: > > Thanks for taking XDP and AF_XDP for a sping. I will start by fixing > > this out-of-memory issue. With your umem size, we are hitting the size > > limit of kmalloc. I will fix this by using kvmalloc that tries to > > allocate with vmalloc if kmalloc fails. Should hopefully make it > > possible for you to allocate larger umems. > > > > Thanks! Just sent out a patch on the mailing list. Would be great if you could try it out. /Magnus > -- > ~ ry ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: zero-copy between interfaces 2020-01-14 9:59 ` Magnus Karlsson @ 2020-01-14 20:52 ` Ryan Goodfellow 2020-01-15 1:41 ` Ryan Goodfellow 2020-01-17 17:40 ` William Tu 0 siblings, 2 replies; 49+ messages in thread From: Ryan Goodfellow @ 2020-01-14 20:52 UTC (permalink / raw) To: Magnus Karlsson; +Cc: xdp-newbies On Tue, Jan 14, 2020 at 10:59:19AM +0100, Magnus Karlsson wrote: > > Just sent out a patch on the mailing list. Would be great if you could > try it out. Thanks for the quick turnaround. I gave this patch a go, both in the bpf-next tree and manually applied to the 5.5.0-rc3 branch I've been working with up to this point. It does allow for allocating more memory, however packet forwarding no longer works. I did not see any complaints from dmesg, but here is an example iperf3 session from a client that worked before. ry@xd2:~$ iperf3 -c 10.1.0.2 Connecting to host 10.1.0.2, port 5201 [ 5] local 10.1.0.1 port 53304 connected to 10.1.0.2 port 5201 [ ID] Interval Transfer Bitrate Retr Cwnd [ 5] 0.00-1.00 sec 5.91 MBytes 49.5 Mbits/sec 2 1.41 KBytes [ 5] 1.00-2.00 sec 0.00 Bytes 0.00 bits/sec 1 1.41 KBytes [ 5] 2.00-3.00 sec 0.00 Bytes 0.00 bits/sec 0 1.41 KBytes [ 5] 3.00-4.00 sec 0.00 Bytes 0.00 bits/sec 1 1.41 KBytes [ 5] 4.00-5.00 sec 0.00 Bytes 0.00 bits/sec 0 1.41 KBytes [ 5] 5.00-6.00 sec 0.00 Bytes 0.00 bits/sec 0 1.41 KBytes [ 5] 6.00-7.00 sec 0.00 Bytes 0.00 bits/sec 1 1.41 KBytes [ 5] 7.00-8.00 sec 0.00 Bytes 0.00 bits/sec 0 1.41 KBytes [ 5] 8.00-9.00 sec 0.00 Bytes 0.00 bits/sec 0 1.41 KBytes ^C[ 5] 10.00-139.77 sec 0.00 Bytes 0.00 bits/sec 4 1.41 KBytes - - - - - - - - - - - - - - - - - - - - - - - - - [ ID] Interval Transfer Bitrate Retr [ 5] 0.00-139.77 sec 5.91 MBytes 355 Kbits/sec 9 sender [ 5] 0.00-139.77 sec 0.00 Bytes 0.00 bits/sec receiver iperf3: interrupt - the client has terminated I'll continue to investigate and report back with anything that I find. -- ~ ry ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: zero-copy between interfaces 2020-01-14 20:52 ` Ryan Goodfellow @ 2020-01-15 1:41 ` Ryan Goodfellow 2020-01-15 7:40 ` Magnus Karlsson 2020-01-17 17:40 ` William Tu 1 sibling, 1 reply; 49+ messages in thread From: Ryan Goodfellow @ 2020-01-15 1:41 UTC (permalink / raw) To: Magnus Karlsson; +Cc: xdp-newbies On Tue, Jan 14, 2020 at 03:52:50PM -0500, Ryan Goodfellow wrote: > On Tue, Jan 14, 2020 at 10:59:19AM +0100, Magnus Karlsson wrote: > > > > Just sent out a patch on the mailing list. Would be great if you could > > try it out. > > Thanks for the quick turnaround. I gave this patch a go, both in the bpf-next > tree and manually applied to the 5.5.0-rc3 branch I've been working with up to > this point. It does allow for allocating more memory, however packet > forwarding no longer works. I did not see any complaints from dmesg, but here > is an example iperf3 session from a client that worked before. > > ry@xd2:~$ iperf3 -c 10.1.0.2 > Connecting to host 10.1.0.2, port 5201 > [ 5] local 10.1.0.1 port 53304 connected to 10.1.0.2 port 5201 > [ ID] Interval Transfer Bitrate Retr Cwnd > [ 5] 0.00-1.00 sec 5.91 MBytes 49.5 Mbits/sec 2 1.41 KBytes > [ 5] 1.00-2.00 sec 0.00 Bytes 0.00 bits/sec 1 1.41 KBytes > [ 5] 2.00-3.00 sec 0.00 Bytes 0.00 bits/sec 0 1.41 KBytes > [ 5] 3.00-4.00 sec 0.00 Bytes 0.00 bits/sec 1 1.41 KBytes > [ 5] 4.00-5.00 sec 0.00 Bytes 0.00 bits/sec 0 1.41 KBytes > [ 5] 5.00-6.00 sec 0.00 Bytes 0.00 bits/sec 0 1.41 KBytes > [ 5] 6.00-7.00 sec 0.00 Bytes 0.00 bits/sec 1 1.41 KBytes > [ 5] 7.00-8.00 sec 0.00 Bytes 0.00 bits/sec 0 1.41 KBytes > [ 5] 8.00-9.00 sec 0.00 Bytes 0.00 bits/sec 0 1.41 KBytes > ^C[ 5] 10.00-139.77 sec 0.00 Bytes 0.00 bits/sec 4 1.41 KBytes > - - - - - - - - - - - - - - - - - - - - - - - - - > [ ID] Interval Transfer Bitrate Retr > [ 5] 0.00-139.77 sec 5.91 MBytes 355 Kbits/sec 9 sender > [ 5] 0.00-139.77 sec 0.00 Bytes 0.00 bits/sec receiver > iperf3: interrupt - the client has terminated > > I'll continue to investigate and 
report back with anything that I find. Interestingly I found this behavior to exist in the bpf-next tree independent of the patch being present. I also gave the 5.5.0-rc6 branch a try (without the patch) and packets forward OK. -- ~ ry ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: zero-copy between interfaces 2020-01-15 1:41 ` Ryan Goodfellow @ 2020-01-15 7:40 ` Magnus Karlsson 2020-01-15 8:20 ` Magnus Karlsson 0 siblings, 1 reply; 49+ messages in thread From: Magnus Karlsson @ 2020-01-15 7:40 UTC (permalink / raw) To: Ryan Goodfellow; +Cc: xdp-newbies On Wed, Jan 15, 2020 at 2:41 AM Ryan Goodfellow <rgoodfel@isi.edu> wrote: > > On Tue, Jan 14, 2020 at 03:52:50PM -0500, Ryan Goodfellow wrote: > > On Tue, Jan 14, 2020 at 10:59:19AM +0100, Magnus Karlsson wrote: > > > > > > Just sent out a patch on the mailing list. Would be great if you could > > > try it out. > > > > Thanks for the quick turnaround. I gave this patch a go, both in the bpf-next > > tree and manually applied to the 5.5.0-rc3 branch I've been working with up to > > this point. It does allow for allocating more memory, however packet > > forwarding no longer works. I did not see any complaints from dmesg, but here > > is an example iperf3 session from a client that worked before. > > > > ry@xd2:~$ iperf3 -c 10.1.0.2 > > Connecting to host 10.1.0.2, port 5201 > > [ 5] local 10.1.0.1 port 53304 connected to 10.1.0.2 port 5201 > > [ ID] Interval Transfer Bitrate Retr Cwnd > > [ 5] 0.00-1.00 sec 5.91 MBytes 49.5 Mbits/sec 2 1.41 KBytes > > [ 5] 1.00-2.00 sec 0.00 Bytes 0.00 bits/sec 1 1.41 KBytes > > [ 5] 2.00-3.00 sec 0.00 Bytes 0.00 bits/sec 0 1.41 KBytes > > [ 5] 3.00-4.00 sec 0.00 Bytes 0.00 bits/sec 1 1.41 KBytes > > [ 5] 4.00-5.00 sec 0.00 Bytes 0.00 bits/sec 0 1.41 KBytes > > [ 5] 5.00-6.00 sec 0.00 Bytes 0.00 bits/sec 0 1.41 KBytes > > [ 5] 6.00-7.00 sec 0.00 Bytes 0.00 bits/sec 1 1.41 KBytes > > [ 5] 7.00-8.00 sec 0.00 Bytes 0.00 bits/sec 0 1.41 KBytes > > [ 5] 8.00-9.00 sec 0.00 Bytes 0.00 bits/sec 0 1.41 KBytes > > ^C[ 5] 10.00-139.77 sec 0.00 Bytes 0.00 bits/sec 4 1.41 KBytes > > - - - - - - - - - - - - - - - - - - - - - - - - - > > [ ID] Interval Transfer Bitrate Retr > > [ 5] 0.00-139.77 sec 5.91 MBytes 355 Kbits/sec 9 sender > > [ 5] 0.00-139.77 sec 0.00 Bytes 
0.00 bits/sec receiver > > iperf3: interrupt - the client has terminated > > > > I'll continue to investigate and report back with anything that I find. > > Interestingly I found this behavior to exist in the bpf-next tree independent > of the patch being present. Ryan, Could you please do a bisect on it? In the 12 commits after the merge commit below there are a number of sensitive rewrites of the ring access functions. Maybe one of them breaks your code. When you say "packet forwarding no longer works", do you mean it works for a second or so, then no packets come through? What HW are you using? commit ce3cec27933c069d2015a81e59b93eb656fe7ee4 Merge: 99cacdc 1d9cb1f Author: Alexei Starovoitov <ast@kernel.org> Date: Fri Dec 20 16:00:10 2019 -0800 Merge branch 'xsk-cleanup' Magnus Karlsson says: ==================== This patch set cleans up the ring access functions of AF_XDP in hope that it will now be easier to understand and maintain. I used to get a headache every time I looked at this code in order to really understand it, but now I do think it is a lot less painful. <snip> /Magnus > I also gave the 5.5.0-rc6 branch a try (without the patch) and packets forward > OK. > > -- > ~ ry ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: zero-copy between interfaces 2020-01-15 7:40 ` Magnus Karlsson @ 2020-01-15 8:20 ` Magnus Karlsson 2020-01-16 2:04 ` Ryan Goodfellow 0 siblings, 1 reply; 49+ messages in thread From: Magnus Karlsson @ 2020-01-15 8:20 UTC (permalink / raw) To: Ryan Goodfellow; +Cc: xdp-newbies On Wed, Jan 15, 2020 at 8:40 AM Magnus Karlsson <magnus.karlsson@gmail.com> wrote: > > On Wed, Jan 15, 2020 at 2:41 AM Ryan Goodfellow <rgoodfel@isi.edu> wrote: > > > > On Tue, Jan 14, 2020 at 03:52:50PM -0500, Ryan Goodfellow wrote: > > > On Tue, Jan 14, 2020 at 10:59:19AM +0100, Magnus Karlsson wrote: > > > > > > > > Just sent out a patch on the mailing list. Would be great if you could > > > > try it out. > > > > > > Thanks for the quick turnaround. I gave this patch a go, both in the bpf-next > > > tree and manually applied to the 5.5.0-rc3 branch I've been working with up to > > > this point. It does allow for allocating more memory, however packet > > > forwarding no longer works. I did not see any complaints from dmesg, but here > > > is an example iperf3 session from a client that worked before. 
> > > > > > ry@xd2:~$ iperf3 -c 10.1.0.2 > > > Connecting to host 10.1.0.2, port 5201 > > > [ 5] local 10.1.0.1 port 53304 connected to 10.1.0.2 port 5201 > > > [ ID] Interval Transfer Bitrate Retr Cwnd > > > [ 5] 0.00-1.00 sec 5.91 MBytes 49.5 Mbits/sec 2 1.41 KBytes > > > [ 5] 1.00-2.00 sec 0.00 Bytes 0.00 bits/sec 1 1.41 KBytes > > > [ 5] 2.00-3.00 sec 0.00 Bytes 0.00 bits/sec 0 1.41 KBytes > > > [ 5] 3.00-4.00 sec 0.00 Bytes 0.00 bits/sec 1 1.41 KBytes > > > [ 5] 4.00-5.00 sec 0.00 Bytes 0.00 bits/sec 0 1.41 KBytes > > > [ 5] 5.00-6.00 sec 0.00 Bytes 0.00 bits/sec 0 1.41 KBytes > > > [ 5] 6.00-7.00 sec 0.00 Bytes 0.00 bits/sec 1 1.41 KBytes > > > [ 5] 7.00-8.00 sec 0.00 Bytes 0.00 bits/sec 0 1.41 KBytes > > > [ 5] 8.00-9.00 sec 0.00 Bytes 0.00 bits/sec 0 1.41 KBytes > > > ^C[ 5] 10.00-139.77 sec 0.00 Bytes 0.00 bits/sec 4 1.41 KBytes > > > - - - - - - - - - - - - - - - - - - - - - - - - - > > > [ ID] Interval Transfer Bitrate Retr > > > [ 5] 0.00-139.77 sec 5.91 MBytes 355 Kbits/sec 9 sender > > > [ 5] 0.00-139.77 sec 0.00 Bytes 0.00 bits/sec receiver > > > iperf3: interrupt - the client has terminated > > > > > > I'll continue to investigate and report back with anything that I find. > > > > Interestingly I found this behavior to exist in the bpf-next tree independent > > of the patch being present. > > Ryan, > > Could you please do a bisect on it? In the 12 commits after the merge > commit below there are number of sensitive rewrites of the ring access > functions. Maybe one of them breaks your code. When you say "packet > forwarding no longer works", do you mean it works for a second or so, > then no packets come through? What HW are you using? 
> > commit ce3cec27933c069d2015a81e59b93eb656fe7ee4 > Merge: 99cacdc 1d9cb1f > Author: Alexei Starovoitov <ast@kernel.org> > Date: Fri Dec 20 16:00:10 2019 -0800 > > Merge branch 'xsk-cleanup' > > Magnus Karlsson says: > > ==================== > This patch set cleans up the ring access functions of AF_XDP in hope > that it will now be easier to understand and maintain. I used to get a > headache every time I looked at this code in order to really understand it, > but now I do think it is a lot less painful. > <snip> > > /Magnus I see that you have debug messages in your application. Could you please run with those on and send me the output so I can see where it stops. A bisect that pin-points what commit that breaks your program plus the debug output should hopefully send us on the right path for a fix. Thanks: Magnus > > I also gave the 5.5.0-rc6 branch a try (without the patch) and packets forward > > OK. > > > > -- > > ~ ry ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: zero-copy between interfaces 2020-01-15 8:20 ` Magnus Karlsson @ 2020-01-16 2:04 ` Ryan Goodfellow 2020-01-16 14:32 ` Magnus Karlsson 2020-01-21 7:34 ` Magnus Karlsson 0 siblings, 2 replies; 49+ messages in thread From: Ryan Goodfellow @ 2020-01-16 2:04 UTC (permalink / raw) To: Magnus Karlsson; +Cc: xdp-newbies On Wed, Jan 15, 2020 at 09:20:30AM +0100, Magnus Karlsson wrote: > On Wed, Jan 15, 2020 at 8:40 AM Magnus Karlsson > <magnus.karlsson@gmail.com> wrote: > > > > On Wed, Jan 15, 2020 at 2:41 AM Ryan Goodfellow <rgoodfel@isi.edu> wrote: > > > > > > On Tue, Jan 14, 2020 at 03:52:50PM -0500, Ryan Goodfellow wrote: > > > > On Tue, Jan 14, 2020 at 10:59:19AM +0100, Magnus Karlsson wrote: > > > > > > > > > > Just sent out a patch on the mailing list. Would be great if you could > > > > > try it out. > > > > > > > > Thanks for the quick turnaround. I gave this patch a go, both in the bpf-next > > > > tree and manually applied to the 5.5.0-rc3 branch I've been working with up to > > > > this point. It does allow for allocating more memory, however packet > > > > forwarding no longer works. I did not see any complaints from dmesg, but here > > > > is an example iperf3 session from a client that worked before. 
> > > > > > > > ry@xd2:~$ iperf3 -c 10.1.0.2 > > > > Connecting to host 10.1.0.2, port 5201 > > > > [ 5] local 10.1.0.1 port 53304 connected to 10.1.0.2 port 5201 > > > > [ ID] Interval Transfer Bitrate Retr Cwnd > > > > [ 5] 0.00-1.00 sec 5.91 MBytes 49.5 Mbits/sec 2 1.41 KBytes > > > > [ 5] 1.00-2.00 sec 0.00 Bytes 0.00 bits/sec 1 1.41 KBytes > > > > [ 5] 2.00-3.00 sec 0.00 Bytes 0.00 bits/sec 0 1.41 KBytes > > > > [ 5] 3.00-4.00 sec 0.00 Bytes 0.00 bits/sec 1 1.41 KBytes > > > > [ 5] 4.00-5.00 sec 0.00 Bytes 0.00 bits/sec 0 1.41 KBytes > > > > [ 5] 5.00-6.00 sec 0.00 Bytes 0.00 bits/sec 0 1.41 KBytes > > > > [ 5] 6.00-7.00 sec 0.00 Bytes 0.00 bits/sec 1 1.41 KBytes > > > > [ 5] 7.00-8.00 sec 0.00 Bytes 0.00 bits/sec 0 1.41 KBytes > > > > [ 5] 8.00-9.00 sec 0.00 Bytes 0.00 bits/sec 0 1.41 KBytes > > > > ^C[ 5] 10.00-139.77 sec 0.00 Bytes 0.00 bits/sec 4 1.41 KBytes > > > > - - - - - - - - - - - - - - - - - - - - - - - - - > > > > [ ID] Interval Transfer Bitrate Retr > > > > [ 5] 0.00-139.77 sec 5.91 MBytes 355 Kbits/sec 9 sender > > > > [ 5] 0.00-139.77 sec 0.00 Bytes 0.00 bits/sec receiver > > > > iperf3: interrupt - the client has terminated > > > > > > > > I'll continue to investigate and report back with anything that I find. > > > > > > Interestingly I found this behavior to exist in the bpf-next tree independent > > > of the patch being present. > > > > Ryan, > > > > Could you please do a bisect on it? In the 12 commits after the merge > > commit below there are number of sensitive rewrites of the ring access > > functions. Maybe one of them breaks your code. When you say "packet > > forwarding no longer works", do you mean it works for a second or so, > > then no packets come through? What HW are you using? 
> > > > commit ce3cec27933c069d2015a81e59b93eb656fe7ee4 > > Merge: 99cacdc 1d9cb1f > > Author: Alexei Starovoitov <ast@kernel.org> > > Date: Fri Dec 20 16:00:10 2019 -0800 > > > > Merge branch 'xsk-cleanup' > > > > Magnus Karlsson says: > > > > ==================== > > This patch set cleans up the ring access functions of AF_XDP in hope > > that it will now be easier to understand and maintain. I used to get a > > headache every time I looked at this code in order to really understand it, > > but now I do think it is a lot less painful. > > <snip> > > > > /Magnus > > I see that you have debug messages in your application. Could you > please run with those on and send me the output so I can see where it > stops. A bisect that pin-points what commit that breaks your program > plus the debug output should hopefully send us on the right path for a > fix. > > Thanks: Magnus > Hi Magnus, I did a bisect starting from the head of the bpf-next tree (990bca1) down to the first commit before the patch series you identified (df034c9). The result was identifying df0ae6f as the commit that causes the issue I am seeing. I've posted output from the program in debugging mode here - https://gitlab.com/mergetb/tech/network-emulation/kernel/snippets/1930375 Yes, you are correct in that forwarding works for a brief period and then stops. I've noticed that the number of packets that are forwarded is equal to the size of the producer/consumer descriptor rings. I've posted two ping traces from a client ping that shows this. - https://gitlab.com/mergetb/tech/network-emulation/kernel/snippets/1930376 - https://gitlab.com/mergetb/tech/network-emulation/kernel/snippets/1930377 I've also noticed that when the forwarding stops, the CPU usage for the proc running the program is pegged, which is not the norm for this program as it uses a poll call with a timeout on the xsk fd. The hardware I am using is a Mellanox ConnectX4 2x100G card (MCX416A-CCAT) running the mlx5 driver. 
The program is running in zero copy mode. I also tested this code out in a virtual machine with virtio NICs in SKB mode which uses xdpgeneric - there were no issues in that setting. -- ~ ry ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: zero-copy between interfaces 2020-01-16 2:04 ` Ryan Goodfellow @ 2020-01-16 14:32 ` Magnus Karlsson 2020-01-17 9:45 ` Magnus Karlsson 2020-01-21 7:34 ` Magnus Karlsson 1 sibling, 1 reply; 49+ messages in thread From: Magnus Karlsson @ 2020-01-16 14:32 UTC (permalink / raw) To: Ryan Goodfellow; +Cc: xdp-newbies On Thu, Jan 16, 2020 at 3:04 AM Ryan Goodfellow <rgoodfel@isi.edu> wrote: > > On Wed, Jan 15, 2020 at 09:20:30AM +0100, Magnus Karlsson wrote: > > On Wed, Jan 15, 2020 at 8:40 AM Magnus Karlsson > > <magnus.karlsson@gmail.com> wrote: > > > > > > On Wed, Jan 15, 2020 at 2:41 AM Ryan Goodfellow <rgoodfel@isi.edu> wrote: > > > > > > > > On Tue, Jan 14, 2020 at 03:52:50PM -0500, Ryan Goodfellow wrote: > > > > > On Tue, Jan 14, 2020 at 10:59:19AM +0100, Magnus Karlsson wrote: > > > > > > > > > > > > Just sent out a patch on the mailing list. Would be great if you could > > > > > > try it out. > > > > > > > > > > Thanks for the quick turnaround. I gave this patch a go, both in the bpf-next > > > > > tree and manually applied to the 5.5.0-rc3 branch I've been working with up to > > > > > this point. It does allow for allocating more memory, however packet > > > > > forwarding no longer works. I did not see any complaints from dmesg, but here > > > > > is an example iperf3 session from a client that worked before. 
> > > > > > > > > > ry@xd2:~$ iperf3 -c 10.1.0.2 > > > > > Connecting to host 10.1.0.2, port 5201 > > > > > [ 5] local 10.1.0.1 port 53304 connected to 10.1.0.2 port 5201 > > > > > [ ID] Interval Transfer Bitrate Retr Cwnd > > > > > [ 5] 0.00-1.00 sec 5.91 MBytes 49.5 Mbits/sec 2 1.41 KBytes > > > > > [ 5] 1.00-2.00 sec 0.00 Bytes 0.00 bits/sec 1 1.41 KBytes > > > > > [ 5] 2.00-3.00 sec 0.00 Bytes 0.00 bits/sec 0 1.41 KBytes > > > > > [ 5] 3.00-4.00 sec 0.00 Bytes 0.00 bits/sec 1 1.41 KBytes > > > > > [ 5] 4.00-5.00 sec 0.00 Bytes 0.00 bits/sec 0 1.41 KBytes > > > > > [ 5] 5.00-6.00 sec 0.00 Bytes 0.00 bits/sec 0 1.41 KBytes > > > > > [ 5] 6.00-7.00 sec 0.00 Bytes 0.00 bits/sec 1 1.41 KBytes > > > > > [ 5] 7.00-8.00 sec 0.00 Bytes 0.00 bits/sec 0 1.41 KBytes > > > > > [ 5] 8.00-9.00 sec 0.00 Bytes 0.00 bits/sec 0 1.41 KBytes > > > > > ^C[ 5] 10.00-139.77 sec 0.00 Bytes 0.00 bits/sec 4 1.41 KBytes > > > > > - - - - - - - - - - - - - - - - - - - - - - - - - > > > > > [ ID] Interval Transfer Bitrate Retr > > > > > [ 5] 0.00-139.77 sec 5.91 MBytes 355 Kbits/sec 9 sender > > > > > [ 5] 0.00-139.77 sec 0.00 Bytes 0.00 bits/sec receiver > > > > > iperf3: interrupt - the client has terminated > > > > > > > > > > I'll continue to investigate and report back with anything that I find. > > > > > > > > Interestingly I found this behavior to exist in the bpf-next tree independent > > > > of the patch being present. > > > > > > Ryan, > > > > > > Could you please do a bisect on it? In the 12 commits after the merge > > > commit below there are number of sensitive rewrites of the ring access > > > functions. Maybe one of them breaks your code. When you say "packet > > > forwarding no longer works", do you mean it works for a second or so, > > > then no packets come through? What HW are you using? 
> > > > > > commit ce3cec27933c069d2015a81e59b93eb656fe7ee4 > > > Merge: 99cacdc 1d9cb1f > > > Author: Alexei Starovoitov <ast@kernel.org> > > > Date: Fri Dec 20 16:00:10 2019 -0800 > > > > > > Merge branch 'xsk-cleanup' > > > > > > Magnus Karlsson says: > > > > > > ==================== > > > This patch set cleans up the ring access functions of AF_XDP in hope > > > that it will now be easier to understand and maintain. I used to get a > > > headache every time I looked at this code in order to really understand it, > > > but now I do think it is a lot less painful. > > > <snip> > > > > > > /Magnus > > > > I see that you have debug messages in your application. Could you > > please run with those on and send me the output so I can see where it > > stops. A bisect that pin-points what commit that breaks your program > > plus the debug output should hopefully send us on the right path for a > > fix. > > > > Thanks: Magnus > > > > Hi Magnus, > > I did a bisect starting from the head of the bpf-next tree (990bca1) down to > the first commit before the patch series you identified (df034c9). The result > was identifying df0ae6f as the commit that causes the issue I am seeing. > > I've posted output from the program in debugging mode here > > - https://gitlab.com/mergetb/tech/network-emulation/kernel/snippets/1930375 Perfect. Thanks. > Yes, you are correct in that forwarding works for a brief period and then stops. > I've noticed that the number of packets that are forwarded is equal to the size > of the producer/consumer descriptor rings. I've posted two ping traces from a > client ping that shows this. > > - https://gitlab.com/mergetb/tech/network-emulation/kernel/snippets/1930376 > - https://gitlab.com/mergetb/tech/network-emulation/kernel/snippets/1930377 > > I've also noticed that when the forwarding stops, the CPU usage for the proc > running the program is pegged, which is not the norm for this program as it uses > a poll call with a timeout on the xsk fd. 
I will replicate your setup and try to reproduce it. Only have one port connected to my load generator now, but when I get into the office, I will connect two ports. In what loop does the execution get stuck when it hangs at 100% load? /Magnus > The hardware I am using is a Mellanox ConnectX4 2x100G card (MCX416A-CCAT) > running the mlx5 driver. The program is running in zero copy mode. I also tested > this code out in a virtual machine with virtio NICs in SKB mode which uses > xdpgeneric - there were no issues in that setting. > > -- > ~ ry ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: zero-copy between interfaces 2020-01-16 14:32 ` Magnus Karlsson @ 2020-01-17 9:45 ` Magnus Karlsson 2020-01-17 17:05 ` Ryan Goodfellow 0 siblings, 1 reply; 49+ messages in thread From: Magnus Karlsson @ 2020-01-17 9:45 UTC (permalink / raw) To: Ryan Goodfellow; +Cc: xdp-newbies On Thu, Jan 16, 2020 at 3:32 PM Magnus Karlsson <magnus.karlsson@gmail.com> wrote: > > On Thu, Jan 16, 2020 at 3:04 AM Ryan Goodfellow <rgoodfel@isi.edu> wrote: > > > > On Wed, Jan 15, 2020 at 09:20:30AM +0100, Magnus Karlsson wrote: > > > On Wed, Jan 15, 2020 at 8:40 AM Magnus Karlsson > > > <magnus.karlsson@gmail.com> wrote: > > > > > > > > On Wed, Jan 15, 2020 at 2:41 AM Ryan Goodfellow <rgoodfel@isi.edu> wrote: > > > > > > > > > > On Tue, Jan 14, 2020 at 03:52:50PM -0500, Ryan Goodfellow wrote: > > > > > > On Tue, Jan 14, 2020 at 10:59:19AM +0100, Magnus Karlsson wrote: > > > > > > > > > > > > > > Just sent out a patch on the mailing list. Would be great if you could > > > > > > > try it out. > > > > > > > > > > > > Thanks for the quick turnaround. I gave this patch a go, both in the bpf-next > > > > > > tree and manually applied to the 5.5.0-rc3 branch I've been working with up to > > > > > > this point. It does allow for allocating more memory, however packet > > > > > > forwarding no longer works. I did not see any complaints from dmesg, but here > > > > > > is an example iperf3 session from a client that worked before. 
> > > > > > > > > > > > ry@xd2:~$ iperf3 -c 10.1.0.2 > > > > > > Connecting to host 10.1.0.2, port 5201 > > > > > > [ 5] local 10.1.0.1 port 53304 connected to 10.1.0.2 port 5201 > > > > > > [ ID] Interval Transfer Bitrate Retr Cwnd > > > > > > [ 5] 0.00-1.00 sec 5.91 MBytes 49.5 Mbits/sec 2 1.41 KBytes > > > > > > [ 5] 1.00-2.00 sec 0.00 Bytes 0.00 bits/sec 1 1.41 KBytes > > > > > > [ 5] 2.00-3.00 sec 0.00 Bytes 0.00 bits/sec 0 1.41 KBytes > > > > > > [ 5] 3.00-4.00 sec 0.00 Bytes 0.00 bits/sec 1 1.41 KBytes > > > > > > [ 5] 4.00-5.00 sec 0.00 Bytes 0.00 bits/sec 0 1.41 KBytes > > > > > > [ 5] 5.00-6.00 sec 0.00 Bytes 0.00 bits/sec 0 1.41 KBytes > > > > > > [ 5] 6.00-7.00 sec 0.00 Bytes 0.00 bits/sec 1 1.41 KBytes > > > > > > [ 5] 7.00-8.00 sec 0.00 Bytes 0.00 bits/sec 0 1.41 KBytes > > > > > > [ 5] 8.00-9.00 sec 0.00 Bytes 0.00 bits/sec 0 1.41 KBytes > > > > > > ^C[ 5] 10.00-139.77 sec 0.00 Bytes 0.00 bits/sec 4 1.41 KBytes > > > > > > - - - - - - - - - - - - - - - - - - - - - - - - - > > > > > > [ ID] Interval Transfer Bitrate Retr > > > > > > [ 5] 0.00-139.77 sec 5.91 MBytes 355 Kbits/sec 9 sender > > > > > > [ 5] 0.00-139.77 sec 0.00 Bytes 0.00 bits/sec receiver > > > > > > iperf3: interrupt - the client has terminated > > > > > > > > > > > > I'll continue to investigate and report back with anything that I find. > > > > > > > > > > Interestingly I found this behavior to exist in the bpf-next tree independent > > > > > of the patch being present. > > > > > > > > Ryan, > > > > > > > > Could you please do a bisect on it? In the 12 commits after the merge > > > > commit below there are number of sensitive rewrites of the ring access > > > > functions. Maybe one of them breaks your code. When you say "packet > > > > forwarding no longer works", do you mean it works for a second or so, > > > > then no packets come through? What HW are you using? 
> > > > > > > > commit ce3cec27933c069d2015a81e59b93eb656fe7ee4 > > > > Merge: 99cacdc 1d9cb1f > > > > Author: Alexei Starovoitov <ast@kernel.org> > > > > Date: Fri Dec 20 16:00:10 2019 -0800 > > > > > > > > Merge branch 'xsk-cleanup' > > > > > > > > Magnus Karlsson says: > > > > > > > > ==================== > > > > This patch set cleans up the ring access functions of AF_XDP in hope > > > > that it will now be easier to understand and maintain. I used to get a > > > > headache every time I looked at this code in order to really understand it, > > > > but now I do think it is a lot less painful. > > > > <snip> > > > > > > > > /Magnus > > > > > > I see that you have debug messages in your application. Could you > > > please run with those on and send me the output so I can see where it > > > stops. A bisect that pin-points what commit that breaks your program > > > plus the debug output should hopefully send us on the right path for a > > > fix. > > > > > > Thanks: Magnus > > > > > > > Hi Magnus, > > > > I did a bisect starting from the head of the bpf-next tree (990bca1) down to > > the first commit before the patch series you identified (df034c9). The result > > was identifying df0ae6f as the commit that causes the issue I am seeing. > > > > I've posted output from the program in debugging mode here > > > > - https://gitlab.com/mergetb/tech/network-emulation/kernel/snippets/1930375 > > Perfect. Thanks. > > > Yes, you are correct in that forwarding works for a brief period and then stops. > > I've noticed that the number of packets that are forwarded is equal to the size > > of the producer/consumer descriptor rings. I've posted two ping traces from a > > client ping that shows this. 
> > > > - https://gitlab.com/mergetb/tech/network-emulation/kernel/snippets/1930376 > > - https://gitlab.com/mergetb/tech/network-emulation/kernel/snippets/1930377 > > > > I've also noticed that when the forwarding stops, the CPU usage for the proc > > running the program is pegged, which is not the norm for this program as it uses > > a poll call with a timeout on the xsk fd. > > I will replicate your setup and try to reproduce it. Only have one > port connected to my load generator now, but when I get into the > office, I will connect two ports. I have now run your application, but unfortunately I cannot recreate your problem. It works and runs for several minutes until I get bored and terminate it. Note that I use an i40e card that you get a crash with. So two problems I cannot reproduce, sigh. Here is my system info. Can you please dump yours? Please do the ethtool dump on your i40e card. mkarlsso@kurt:~/src/dna-linux$ sudo ethtool -i ens803f0 [sudo] password for mkarlsso: driver: i40e version: 2.8.20-k firmware-version: 5.05 0x800028a6 1.1568.0 expansion-rom-version: bus-info: 0000:86:00.0 supports-statistics: yes supports-test: yes supports-eeprom-access: yes supports-register-dump: yes supports-priv-flags: yes mkarlsso@kurt:~/src/dna-linux$ uname -a Linux kurt 5.5.0-rc4+ #72 SMP PREEMPT Thu Jan 16 10:03:20 CET 2020 x86_64 x86_64 x86_64 GNU/Linux mkarlsso@kurt:~/src/dna-linux$ git log -1 commit b65053cd94f46619b4aae746b98f2d8d9274540e (HEAD, bpf-next/master) Author: Andrii Nakryiko <andriin@fb.com> Date: Wed Jan 15 16:55:49 2020 -0800 selftests/bpf: Add whitelist/blacklist of test names to test_progs gcc version 9.2.1 20191008 (Ubuntu 9.2.1-9ubuntu2) I also noted that you use MAX_SOCKS in your XDP program. The size of the xsks_map is not dependent on the number of sockets in your case. It is dependent on the queue id you use. So I would introduce a MAX_QUEUE_ID and set it to e.g. 128 and use that instead. MAX_SOCKS is 4, so quite restrictive.
/Magnus > In what loop does the execution get stuck when it hangs at 100% load? > > /Magnus > > > The hardware I am using is a Mellanox ConnectX4 2x100G card (MCX416A-CCAT) > > running the mlx5 driver. The program is running in zero copy mode. I also tested > > this code out in a virtual machine with virtio NICs in SKB mode which uses > > xdpgeneric - there were no issues in that setting. > > > > -- > > ~ ry ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: zero-copy between interfaces 2020-01-17 9:45 ` Magnus Karlsson @ 2020-01-17 17:05 ` Ryan Goodfellow 0 siblings, 0 replies; 49+ messages in thread From: Ryan Goodfellow @ 2020-01-17 17:05 UTC (permalink / raw) To: Magnus Karlsson; +Cc: xdp-newbies On Fri, Jan 17, 2020 at 10:45:33AM +0100, Magnus Karlsson wrote: > If have now run your application, but unfortunately I cannot recreate > your problem. It works and runs for several minutes until I get bored > and terminate it. Note that I use an i40e card that you get a crash > with. So two problems I cannot reproduce, sigh. Here is my system > info. Can you please dump yours? Please do the ethtool dump on your > i40e card. > > mkarlsso@kurt:~/src/dna-linux$ sudo ethtool -i ens803f0 > [sudo] password for mkarlsso: > driver: i40e > version: 2.8.20-k > firmware-version: 5.05 0x800028a6 1.1568.0 > expansion-rom-version: > bus-info: 0000:86:00.0 > supports-statistics: yes > supports-test: yes > supports-eeprom-access: yes > supports-register-dump: yes > supports-priv-flags: yes > > mkarlsso@kurt:~/src/dna-linux$ uname -a > Linux kurt 5.5.0-rc4+ #72 SMP PREEMPT Thu Jan 16 10:03:20 CET 2020 > x86_64 x86_64 x86_64 GNU/Linux > > mkarlsso@kurt:~/src/dna-linux$ git log -1 > commit b65053cd94f46619b4aae746b98f2d8d9274540e (HEAD, bpf-next/master) > Author: Andrii Nakryiko <andriin@fb.com> > Date: Wed Jan 15 16:55:49 2020 -0800 > > selftests/bpf: Add whitelist/blacklist of test names to test_progs > > gcc version 9.2.1 20191008 (Ubuntu 9.2.1-9ubuntu2) > > I also noted that you use MAX_SOCKS in your XDP program. The size of > the xsks_map is not dependent on the number of sockets in your case. > It is dependent on the queue id you use. So I would introduce a > MAX_QUEUE_ID and set it to e.g. 128 and use that instead. MAX_SOCKS is > 4, so quite restrictive. > > /Magnus So I plugged in my X710-DA4 card, and this one actually works fine. Here is the requested output for that card. 
ry@turbine:~$ sudo ethtool -i enp101s0f0
driver: i40e
version: 2.8.20-k
firmware-version: 7.10 0x80006471 1.2527.0
expansion-rom-version:
bus-info: 0000:65:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: yes

ry@turbine:~$ sudo lspci -vvv | grep 710
65:00.0 Ethernet controller: Intel Corporation Ethernet Controller X710 for 10GbE SFP+ (rev 01)
        Subsystem: Intel Corporation Ethernet Converged Network Adapter X710-4
        Product Name: XL710 40GbE Controller
65:00.1 Ethernet controller: Intel Corporation Ethernet Controller X710 for 10GbE SFP+ (rev 01)
        Subsystem: Intel Corporation Ethernet Converged Network Adapter X710
        Product Name: XL710 40GbE Controller
65:00.2 Ethernet controller: Intel Corporation Ethernet Controller X710 for 10GbE SFP+ (rev 01)
        Subsystem: Intel Corporation Ethernet Converged Network Adapter X710
        Product Name: XL710 40GbE Controller
65:00.3 Ethernet controller: Intel Corporation Ethernet Controller X710 for 10GbE SFP+ (rev 01)
        Subsystem: Intel Corporation Ethernet Converged Network Adapter X710
        Product Name: XL710 40GbE Controller

Here is the output for the X722 card. As of a few weeks ago, firmware
version 3.33 was the latest I could find.
ry@turbine:~$ sudo ethtool -i eno7
driver: i40e
version: 2.8.20-k
firmware-version: 3.33 0x80001006 1.1747.0
expansion-rom-version:
bus-info: 0000:b7:00.2
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: yes

ry@turbine:~$ sudo lspci -vvv | grep 722
b7:00.0 Ethernet controller: Intel Corporation Ethernet Connection X722 for 10GBASE-T (rev 04)
        DeviceName: Intel LAN X722 #1
        Subsystem: Super Micro Computer Inc Ethernet Connection X722 for 10GBASE-T
b7:00.1 Ethernet controller: Intel Corporation Ethernet Connection X722 for 10GBASE-T (rev 04)
        DeviceName: Intel LAN X722 #2
        Subsystem: Super Micro Computer Inc Ethernet Connection X722 for 10GBASE-T
b7:00.2 Ethernet controller: Intel Corporation Ethernet Connection X722 for 10GbE SFP+ (rev 04)
        DeviceName: Intel LAN X722 #3
        Subsystem: Super Micro Computer Inc Ethernet Connection X722 for 10GbE SFP+
b7:00.3 Ethernet controller: Intel Corporation Ethernet Connection X722 for 10GbE SFP+ (rev 04)
        DeviceName: Intel LAN X722 #4
        Subsystem: Super Micro Computer Inc Ethernet Connection X722 for 10GbE SFP+

I verified that the driver still crashes with the current kernel/program
I am running, which works on the X710-DA4.

Other output as requested:

ry@turbine:~$ uname -a
Linux turbine 5.5.0-rc4-moa+ #16 SMP Fri Jan 17 10:52:42 EST 2020 x86_64 GNU/Linux

ry@turbine:~/kmoa/bpf-next$ git log -2
commit 60d71397d27e7859fdaaaaab6594e4d977ae46e2 (HEAD -> master)
Author: Ryan Goodfellow <rgoodfel@isi.edu>
Date:   Wed Jan 15 16:54:39 2020 -0500

    add xdpsock_multidev sample program

    This is a simple program that uses AF_XDP sockets to forward packets
    between two interfaces using a common memory region and no copying
    of packets.
    Signed-off-by: Ryan Goodfellow <rgoodfel@isi.edu>

commit 9173cac3b64e6785dd604f5075e6035b045a0026 (origin/master, origin/HEAD)
Author: Andrii Nakryiko <andriin@fb.com>
Date:   Wed Jan 15 11:08:56 2020 -0800

    libbpf: Support .text sub-calls relocations

    The LLVM patch https://reviews.llvm.org/D72197 makes LLVM emit function
    call relocations within the same section. This includes a default .text
    section, which contains any BPF sub-programs. This wasn't the case
    before and so libbpf was able to get a way with slightly simpler
    handling of subprogram call relocations.

    This patch adds support for .text section relocations. It needs
    to ensure correct order of relocations, so does two passes:
    - first, relocate .text instructions, if there are any relocations in it;
    - then process all the other programs and copy over patched .text
    instructions for all sub-program calls.

    v1->v2:
    - break early once .text program is processed.

    Signed-off-by: Andrii Nakryiko <andriin@fb.com>
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    Acked-by: Yonghong Song <yhs@fb.com>
    Cc: Alexei Starovoitov <ast@kernel.org>
    Link: https://lore.kernel.org/bpf/20200115190856.2391325-1-andriin@fb.com

Here is the info for the Mellanox card that is not working after the
df0ae6f commit.

ry@turbine:~$ sudo ethtool -i enp23s0f0
driver: mlx5_core
version: 5.0-0
firmware-version: 12.23.1020 (MT_2150110033)
expansion-rom-version:
bus-info: 0000:17:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: no
supports-register-dump: no
supports-priv-flags: yes

ry@turbine:~$ sudo lspci -vvv | grep Mellanox
17:00.0 Ethernet controller: Mellanox Technologies MT27700 Family [ConnectX-4]
        Subsystem: Mellanox Technologies ConnectX-4 Stand-up dual-port 100GbE MCX416A-CCAT
17:00.1 Ethernet controller: Mellanox Technologies MT27700 Family [ConnectX-4]
        Subsystem: Mellanox Technologies ConnectX-4 Stand-up dual-port 100GbE MCX416A-CCAT

--
~ ry

^ permalink raw reply	[flat|nested] 49+ messages in thread
* Re: zero-copy between interfaces 2020-01-16 2:04 ` Ryan Goodfellow 2020-01-16 14:32 ` Magnus Karlsson @ 2020-01-21 7:34 ` Magnus Karlsson 2020-01-21 13:40 ` Maxim Mikityanskiy 1 sibling, 1 reply; 49+ messages in thread From: Magnus Karlsson @ 2020-01-21 7:34 UTC (permalink / raw) To: Ryan Goodfellow, Maxim Mikityanskiy; +Cc: xdp-newbies On Thu, Jan 16, 2020 at 3:04 AM Ryan Goodfellow <rgoodfel@isi.edu> wrote: > > On Wed, Jan 15, 2020 at 09:20:30AM +0100, Magnus Karlsson wrote: > > On Wed, Jan 15, 2020 at 8:40 AM Magnus Karlsson > > <magnus.karlsson@gmail.com> wrote: > > > > > > On Wed, Jan 15, 2020 at 2:41 AM Ryan Goodfellow <rgoodfel@isi.edu> wrote: > > > > > > > > On Tue, Jan 14, 2020 at 03:52:50PM -0500, Ryan Goodfellow wrote: > > > > > On Tue, Jan 14, 2020 at 10:59:19AM +0100, Magnus Karlsson wrote: > > > > > > > > > > > > Just sent out a patch on the mailing list. Would be great if you could > > > > > > try it out. > > > > > > > > > > Thanks for the quick turnaround. I gave this patch a go, both in the bpf-next > > > > > tree and manually applied to the 5.5.0-rc3 branch I've been working with up to > > > > > this point. It does allow for allocating more memory, however packet > > > > > forwarding no longer works. I did not see any complaints from dmesg, but here > > > > > is an example iperf3 session from a client that worked before. 
> > > > > > > > > > ry@xd2:~$ iperf3 -c 10.1.0.2 > > > > > Connecting to host 10.1.0.2, port 5201 > > > > > [ 5] local 10.1.0.1 port 53304 connected to 10.1.0.2 port 5201 > > > > > [ ID] Interval Transfer Bitrate Retr Cwnd > > > > > [ 5] 0.00-1.00 sec 5.91 MBytes 49.5 Mbits/sec 2 1.41 KBytes > > > > > [ 5] 1.00-2.00 sec 0.00 Bytes 0.00 bits/sec 1 1.41 KBytes > > > > > [ 5] 2.00-3.00 sec 0.00 Bytes 0.00 bits/sec 0 1.41 KBytes > > > > > [ 5] 3.00-4.00 sec 0.00 Bytes 0.00 bits/sec 1 1.41 KBytes > > > > > [ 5] 4.00-5.00 sec 0.00 Bytes 0.00 bits/sec 0 1.41 KBytes > > > > > [ 5] 5.00-6.00 sec 0.00 Bytes 0.00 bits/sec 0 1.41 KBytes > > > > > [ 5] 6.00-7.00 sec 0.00 Bytes 0.00 bits/sec 1 1.41 KBytes > > > > > [ 5] 7.00-8.00 sec 0.00 Bytes 0.00 bits/sec 0 1.41 KBytes > > > > > [ 5] 8.00-9.00 sec 0.00 Bytes 0.00 bits/sec 0 1.41 KBytes > > > > > ^C[ 5] 10.00-139.77 sec 0.00 Bytes 0.00 bits/sec 4 1.41 KBytes > > > > > - - - - - - - - - - - - - - - - - - - - - - - - - > > > > > [ ID] Interval Transfer Bitrate Retr > > > > > [ 5] 0.00-139.77 sec 5.91 MBytes 355 Kbits/sec 9 sender > > > > > [ 5] 0.00-139.77 sec 0.00 Bytes 0.00 bits/sec receiver > > > > > iperf3: interrupt - the client has terminated > > > > > > > > > > I'll continue to investigate and report back with anything that I find. > > > > > > > > Interestingly I found this behavior to exist in the bpf-next tree independent > > > > of the patch being present. > > > > > > Ryan, > > > > > > Could you please do a bisect on it? In the 12 commits after the merge > > > commit below there are number of sensitive rewrites of the ring access > > > functions. Maybe one of them breaks your code. When you say "packet > > > forwarding no longer works", do you mean it works for a second or so, > > > then no packets come through? What HW are you using? 
> > > > > > commit ce3cec27933c069d2015a81e59b93eb656fe7ee4 > > > Merge: 99cacdc 1d9cb1f > > > Author: Alexei Starovoitov <ast@kernel.org> > > > Date: Fri Dec 20 16:00:10 2019 -0800 > > > > > > Merge branch 'xsk-cleanup' > > > > > > Magnus Karlsson says: > > > > > > ==================== > > > This patch set cleans up the ring access functions of AF_XDP in hope > > > that it will now be easier to understand and maintain. I used to get a > > > headache every time I looked at this code in order to really understand it, > > > but now I do think it is a lot less painful. > > > <snip> > > > > > > /Magnus > > > > I see that you have debug messages in your application. Could you > > please run with those on and send me the output so I can see where it > > stops. A bisect that pin-points what commit that breaks your program > > plus the debug output should hopefully send us on the right path for a > > fix. > > > > Thanks: Magnus > > > > Hi Magnus, > > I did a bisect starting from the head of the bpf-next tree (990bca1) down to > the first commit before the patch series you identified (df034c9). The result > was identifying df0ae6f as the commit that causes the issue I am seeing. > > I've posted output from the program in debugging mode here > > - https://gitlab.com/mergetb/tech/network-emulation/kernel/snippets/1930375 > > Yes, you are correct in that forwarding works for a brief period and then stops. > I've noticed that the number of packets that are forwarded is equal to the size > of the producer/consumer descriptor rings. I've posted two ping traces from a > client ping that shows this. > > - https://gitlab.com/mergetb/tech/network-emulation/kernel/snippets/1930376 > - https://gitlab.com/mergetb/tech/network-emulation/kernel/snippets/1930377 > > I've also noticed that when the forwarding stops, the CPU usage for the proc > running the program is pegged, which is not the norm for this program as it uses > a poll call with a timeout on the xsk fd. 
> > The hardware I am using is a Mellanox ConnectX4 2x100G card (MCX416A-CCAT)
> running the mlx5 driver. The program is running in zero copy mode. I also tested
> this code out in a virtual machine with virtio NICs in SKB mode which uses
> xdpgeneric - there were no issues in that setting.
>
> --
> ~ ry

Maxim,

Do you think you could help me debug this issue that Ryan is having? I
can unfortunately not reproduce the stalling issue with my Intel i40e
cards.

Ryan, Maxim is responsible for Mellanox's AF_XDP support and he has
also contributed to the core AF_XDP code. So you are in good hands
:-).

Thanks: Magnus

^ permalink raw reply	[flat|nested] 49+ messages in thread
* Re: zero-copy between interfaces 2020-01-21 7:34 ` Magnus Karlsson @ 2020-01-21 13:40 ` Maxim Mikityanskiy 2020-01-22 21:43 ` Ryan Goodfellow 0 siblings, 1 reply; 49+ messages in thread From: Maxim Mikityanskiy @ 2020-01-21 13:40 UTC (permalink / raw) To: Ryan Goodfellow Cc: Magnus Karlsson, xdp-newbies, Tariq Toukan, Saeed Mahameed, Moshe Shemesh On 2020-01-21 09:34, Magnus Karlsson wrote: > On Thu, Jan 16, 2020 at 3:04 AM Ryan Goodfellow <rgoodfel@isi.edu> wrote: >> >> On Wed, Jan 15, 2020 at 09:20:30AM +0100, Magnus Karlsson wrote: >>> On Wed, Jan 15, 2020 at 8:40 AM Magnus Karlsson >>> <magnus.karlsson@gmail.com> wrote: >>>> >>>> On Wed, Jan 15, 2020 at 2:41 AM Ryan Goodfellow <rgoodfel@isi.edu> wrote: >>>>> >>>>> On Tue, Jan 14, 2020 at 03:52:50PM -0500, Ryan Goodfellow wrote: >>>>>> On Tue, Jan 14, 2020 at 10:59:19AM +0100, Magnus Karlsson wrote: >>>>>>> >>>>>>> Just sent out a patch on the mailing list. Would be great if you could >>>>>>> try it out. >>>>>> >>>>>> Thanks for the quick turnaround. I gave this patch a go, both in the bpf-next >>>>>> tree and manually applied to the 5.5.0-rc3 branch I've been working with up to >>>>>> this point. It does allow for allocating more memory, however packet >>>>>> forwarding no longer works. I did not see any complaints from dmesg, but here >>>>>> is an example iperf3 session from a client that worked before. 
>>>>>> >>>>>> ry@xd2:~$ iperf3 -c 10.1.0.2 >>>>>> Connecting to host 10.1.0.2, port 5201 >>>>>> [ 5] local 10.1.0.1 port 53304 connected to 10.1.0.2 port 5201 >>>>>> [ ID] Interval Transfer Bitrate Retr Cwnd >>>>>> [ 5] 0.00-1.00 sec 5.91 MBytes 49.5 Mbits/sec 2 1.41 KBytes >>>>>> [ 5] 1.00-2.00 sec 0.00 Bytes 0.00 bits/sec 1 1.41 KBytes >>>>>> [ 5] 2.00-3.00 sec 0.00 Bytes 0.00 bits/sec 0 1.41 KBytes >>>>>> [ 5] 3.00-4.00 sec 0.00 Bytes 0.00 bits/sec 1 1.41 KBytes >>>>>> [ 5] 4.00-5.00 sec 0.00 Bytes 0.00 bits/sec 0 1.41 KBytes >>>>>> [ 5] 5.00-6.00 sec 0.00 Bytes 0.00 bits/sec 0 1.41 KBytes >>>>>> [ 5] 6.00-7.00 sec 0.00 Bytes 0.00 bits/sec 1 1.41 KBytes >>>>>> [ 5] 7.00-8.00 sec 0.00 Bytes 0.00 bits/sec 0 1.41 KBytes >>>>>> [ 5] 8.00-9.00 sec 0.00 Bytes 0.00 bits/sec 0 1.41 KBytes >>>>>> ^C[ 5] 10.00-139.77 sec 0.00 Bytes 0.00 bits/sec 4 1.41 KBytes >>>>>> - - - - - - - - - - - - - - - - - - - - - - - - - >>>>>> [ ID] Interval Transfer Bitrate Retr >>>>>> [ 5] 0.00-139.77 sec 5.91 MBytes 355 Kbits/sec 9 sender >>>>>> [ 5] 0.00-139.77 sec 0.00 Bytes 0.00 bits/sec receiver >>>>>> iperf3: interrupt - the client has terminated >>>>>> >>>>>> I'll continue to investigate and report back with anything that I find. >>>>> >>>>> Interestingly I found this behavior to exist in the bpf-next tree independent >>>>> of the patch being present. >>>> >>>> Ryan, >>>> >>>> Could you please do a bisect on it? In the 12 commits after the merge >>>> commit below there are number of sensitive rewrites of the ring access >>>> functions. Maybe one of them breaks your code. When you say "packet >>>> forwarding no longer works", do you mean it works for a second or so, >>>> then no packets come through? What HW are you using? 
>>>> >>>> commit ce3cec27933c069d2015a81e59b93eb656fe7ee4 >>>> Merge: 99cacdc 1d9cb1f >>>> Author: Alexei Starovoitov <ast@kernel.org> >>>> Date: Fri Dec 20 16:00:10 2019 -0800 >>>> >>>> Merge branch 'xsk-cleanup' >>>> >>>> Magnus Karlsson says: >>>> >>>> ==================== >>>> This patch set cleans up the ring access functions of AF_XDP in hope >>>> that it will now be easier to understand and maintain. I used to get a >>>> headache every time I looked at this code in order to really understand it, >>>> but now I do think it is a lot less painful. >>>> <snip> >>>> >>>> /Magnus >>> >>> I see that you have debug messages in your application. Could you >>> please run with those on and send me the output so I can see where it >>> stops. A bisect that pin-points what commit that breaks your program >>> plus the debug output should hopefully send us on the right path for a >>> fix. >>> >>> Thanks: Magnus >>> Hi Ryan, >> Hi Magnus, >> >> I did a bisect starting from the head of the bpf-next tree (990bca1) down to >> the first commit before the patch series you identified (df034c9). The result >> was identifying df0ae6f as the commit that causes the issue I am seeing. This commit and the commit before it remove batching in xskq_nb_avail. Before these two commits xskq_nb_avail may have returned values 0..16, and now it's not limited. It looks like a change too minor to cause this issue... >> I've posted output from the program in debugging mode here >> >> - https://gitlab.com/mergetb/tech/network-emulation/kernel/snippets/1930375 >> >> Yes, you are correct in that forwarding works for a brief period and then stops. >> I've noticed that the number of packets that are forwarded is equal to the size >> of the producer/consumer descriptor rings. I've posted two ping traces from a >> client ping that shows this. 
>> >> - https://gitlab.com/mergetb/tech/network-emulation/kernel/snippets/1930376 >> - https://gitlab.com/mergetb/tech/network-emulation/kernel/snippets/1930377 These snippets are not available. >> >> I've also noticed that when the forwarding stops, the CPU usage for the proc >> running the program is pegged, which is not the norm for this program as it uses >> a poll call with a timeout on the xsk fd. This information led me to a guess what may be happening. On the RX side, mlx5e allocates pages in bulks for performance reasons and to leverage hardware features targeted to performance. In AF_XDP mode, bulking of frames is also used (on x86, the bulk size is 64 with striding RQ enabled, and 8 otherwise, however, it's implementation details that might change later). If you don't put enough frames to XSK Fill Ring, the driver will be demanding more frames and return from poll() immediately. Basically, in the application, you should put as many frames to the Fill Ring as you can. Please check if that could be the root cause of your issue. I tracked this issue in our internal bug tracker in case we need to perform actual debugging of mlx5e. I'm looking forward to your feedback on my assumption above. >> The hardware I am using is a Mellanox ConnectX4 2x100G card (MCX416A-CCAT) >> running the mlx5 driver. This one should run without striding RQ, please verify it with ethtool --show-priv-flags (the flag name is rx_striding_rq). > The program is running in zero copy mode. I also tested >> this code out in a virtual machine with virtio NICs in SKB mode which uses >> xdpgeneric - there were no issues in that setting. >> >> -- >> ~ ry > > Maxim, > > Do you think you could help me debug this issue that Ryan is having? I > can unfortunately not reproduce the stalling issue with my Intel i40e > cards. > > Ryan, Maxim is Mellanox's responsible for AF_XDP support and he has > also contributed to the core AF_XDP code. So you are in good hands > :-). 
> > Thanks: Magnus > ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: zero-copy between interfaces 2020-01-21 13:40 ` Maxim Mikityanskiy @ 2020-01-22 21:43 ` Ryan Goodfellow 2020-01-27 14:01 ` Maxim Mikityanskiy 0 siblings, 1 reply; 49+ messages in thread From: Ryan Goodfellow @ 2020-01-22 21:43 UTC (permalink / raw) To: Maxim Mikityanskiy Cc: Magnus Karlsson, xdp-newbies, Tariq Toukan, Saeed Mahameed, Moshe Shemesh On Tue, Jan 21, 2020 at 01:40:50PM +0000, Maxim Mikityanskiy wrote: > >> I've posted output from the program in debugging mode here > >> > >> - https://gitlab.com/mergetb/tech/network-emulation/kernel/snippets/1930375 > >> > >> Yes, you are correct in that forwarding works for a brief period and then stops. > >> I've noticed that the number of packets that are forwarded is equal to the size > >> of the producer/consumer descriptor rings. I've posted two ping traces from a > >> client ping that shows this. > >> > >> - https://gitlab.com/mergetb/tech/network-emulation/kernel/snippets/1930376 > >> - https://gitlab.com/mergetb/tech/network-emulation/kernel/snippets/1930377 > > These snippets are not available. Apologies, I had the wrong permissions set. They should be available now. > > >> > >> I've also noticed that when the forwarding stops, the CPU usage for the proc > >> running the program is pegged, which is not the norm for this program as it uses > >> a poll call with a timeout on the xsk fd. > > This information led me to a guess what may be happening. On the RX > side, mlx5e allocates pages in bulks for performance reasons and to > leverage hardware features targeted to performance. In AF_XDP mode, > bulking of frames is also used (on x86, the bulk size is 64 with > striding RQ enabled, and 8 otherwise, however, it's implementation > details that might change later). If you don't put enough frames to XSK > Fill Ring, the driver will be demanding more frames and return from > poll() immediately. Basically, in the application, you should put as > many frames to the Fill Ring as you can. 
> Please check if that could be
> the root cause of your issue.

The code in this application makes an effort to replenish the fill ring as fast
as possible. The basic loop of the application is to first check if there are
any descriptors to be consumed from the completion queue or any descriptors that
can be added to the fill queue, and only then to move on to moving packets
through the rx and tx rings.

https://gitlab.com/mergetb/tech/network-emulation/kernel/blob/v5.5-moa/samples/bpf/xdpsock_multidev.c#L452-474

>
> I tracked this issue in our internal bug tracker in case we need to
> perform actual debugging of mlx5e. I'm looking forward to your feedback
> on my assumption above.
>
> >> The hardware I am using is a Mellanox ConnectX4 2x100G card (MCX416A-CCAT)
> >> running the mlx5 driver.
>
> This one should run without striding RQ, please verify it with ethtool
> --show-priv-flags (the flag name is rx_striding_rq).

I do not remember changing this option, so whatever the default is, is what it
was running with. I am traveling this week and do not have access to these
systems, but will ensure that this flag is set properly when I get back.

--
~ ry

^ permalink raw reply	[flat|nested] 49+ messages in thread
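[Editor's note] The loop order Ryan describes, drain the completion ring, top up the fill ring, and only then forward rx to tx, can be modeled with bare producer/consumer counters. The sketch below is an editor's simplification, not the sample program itself: the struct and helper names are invented for illustration, and a real implementation would use libbpf's AF_XDP ring accessors against real descriptor rings.

```c
/* Simplified model of the forwarding loop described above. Rings are
 * reduced to prod/cons counters; "free_frames" stands in for UMEM
 * frames that are not currently sitting on any ring. */

struct ring { unsigned int prod, cons, size; };

static unsigned int ring_avail(const struct ring *r) { return r->prod - r->cons; }
static unsigned int ring_free(const struct ring *r)  { return r->size - (r->prod - r->cons); }

struct fwd_state {
	struct ring cq;            /* completion ring (TX done)    */
	struct ring fq;            /* fill ring (RX buffers)       */
	struct ring rx;            /* RX descriptor ring           */
	struct ring tx;            /* TX descriptor ring           */
	unsigned int free_frames;  /* UMEM frames not on any ring  */
};

/* One pass of the loop. */
static void fwd_iteration(struct fwd_state *s)
{
	/* 1. Reclaim frames whose transmission has completed. */
	unsigned int done = ring_avail(&s->cq);
	s->cq.cons += done;
	s->free_frames += done;

	/* 2. Replenish the fill ring with everything available. */
	unsigned int n = ring_free(&s->fq);
	if (n > s->free_frames)
		n = s->free_frames;
	s->fq.prod += n;
	s->free_frames -= n;

	/* 3. Only then forward: move RX descriptors onto the TX ring. */
	unsigned int pkts = ring_avail(&s->rx);
	if (pkts > ring_free(&s->tx))
		pkts = ring_free(&s->tx);
	s->rx.cons += pkts;
	s->tx.prod += pkts;
}
```

Keeping step 2 as aggressive as possible is exactly the behavior Maxim asks about: the fill ring is topped up with every frame the application has on hand before any packets are moved.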
* Re: zero-copy between interfaces 2020-01-22 21:43 ` Ryan Goodfellow @ 2020-01-27 14:01 ` Maxim Mikityanskiy 2020-01-27 15:54 ` Magnus Karlsson 0 siblings, 1 reply; 49+ messages in thread From: Maxim Mikityanskiy @ 2020-01-27 14:01 UTC (permalink / raw) To: Ryan Goodfellow, Magnus Karlsson Cc: xdp-newbies, Tariq Toukan, Saeed Mahameed, Moshe Shemesh On 2020-01-22 23:43, Ryan Goodfellow wrote: > On Tue, Jan 21, 2020 at 01:40:50PM +0000, Maxim Mikityanskiy wrote: >>>> I've posted output from the program in debugging mode here >>>> >>>> - https://gitlab.com/mergetb/tech/network-emulation/kernel/snippets/1930375 >>>> >>>> Yes, you are correct in that forwarding works for a brief period and then stops. >>>> I've noticed that the number of packets that are forwarded is equal to the size >>>> of the producer/consumer descriptor rings. I've posted two ping traces from a >>>> client ping that shows this. >>>> >>>> - https://gitlab.com/mergetb/tech/network-emulation/kernel/snippets/1930376 >>>> - https://gitlab.com/mergetb/tech/network-emulation/kernel/snippets/1930377 >> >> These snippets are not available. > > Apologies, I had the wrong permissions set. They should be available now. > >> >>>> >>>> I've also noticed that when the forwarding stops, the CPU usage for the proc >>>> running the program is pegged, which is not the norm for this program as it uses >>>> a poll call with a timeout on the xsk fd. >> >> This information led me to a guess what may be happening. On the RX >> side, mlx5e allocates pages in bulks for performance reasons and to >> leverage hardware features targeted to performance. In AF_XDP mode, >> bulking of frames is also used (on x86, the bulk size is 64 with >> striding RQ enabled, and 8 otherwise, however, it's implementation >> details that might change later). If you don't put enough frames to XSK >> Fill Ring, the driver will be demanding more frames and return from >> poll() immediately. 
>> Basically, in the application, you should put as
>> many frames to the Fill Ring as you can. Please check if that could be
>> the root cause of your issue.
>
> The code in this application makes an effort to relenish the fill ring as fast
> as possible. The basic loop of the application is to first check if there are
> any descriptors to be consumed from the completion queue or any descriptors that
> can be added to the fill queue, and only then to move on to moving packets
> through the rx and tx rings.
>
> https://gitlab.com/mergetb/tech/network-emulation/kernel/blob/v5.5-moa/samples/bpf/xdpsock_multidev.c#L452-474

I reproduced your issue and did my investigation, and here is what I found:

1. Commit df0ae6f78a45 (that you found during bisect) introduces an
important behavioral change (which I thought was not that important).
xskq_nb_avail used to return min(entries, dcnt), but after the change it
just returns entries, which may be as big as the ring size.

2. xskq_peek_addr updates q->ring->consumer only when q->cons_tail
catches up with q->cons_head. So, before that patch and one previous
patch, cons_head - cons_tail was not more than 16, so the consumer index
was updated periodically. Now consumer is updated only when the whole
ring is exhausted.

3. The application can't replenish the fill ring if the consumer index
doesn't move. As a consequence, refilling the descriptors by the
application can't happen in parallel with using them by the driver. It
should have some performance penalty and possibly even lead to packet
drops, because the driver uses all the descriptors and only then
advances the consumer index, and then it has to wait until the
application refills the ring, busy-looping and losing packets.

4. As mlx5e allocates frames in batches, the consequences are even more
severe: it's a deadlock where the driver waits for the application, and
vice versa. The driver never reaches the point where cons_tail gets
equal to cons_head.
E.g., if cons_tail + 3 == cons_head, and the batch
size requested by the driver is 8, the driver won't peek anything from
the fill ring, waiting for the difference between cons_tail and cons_head
to grow to at least 8. On the other hand, the application can't put
anything on the ring, because it still thinks that the consumer index is
0. As cons_tail never reaches cons_head, the consumer index doesn't get
updated, hence the deadlock.

So, in my vision, the decision to remove RX_BATCH_SIZE and periodic
updates of the consumer index was wrong. It totally breaks mlx5e, which
does batching, and it will affect the performance of any driver, because
the application can't refill the ring until it gets completely empty and
the driver starts waiting for frames. I suggest that periodic updates of
the consumer index be re-added to xskq_cons_peek_addr.

Magnus, what do you think of the suggestion above?

Thanks,
Max

>>
>> I tracked this issue in our internal bug tracker in case we need to
>> perform actual debugging of mlx5e. I'm looking forward to your feedback
>> on my assumption above.
>>
>>>> The hardware I am using is a Mellanox ConnectX4 2x100G card (MCX416A-CCAT)
>>>> running the mlx5 driver.
>>
>> This one should run without striding RQ, please verify it with ethtool
>> --show-priv-flags (the flag name is rx_striding_rq).
>
> I do not remember changing this option, so whatever the default is, is what it
> was running with. I am traveling this week and do not have access to these
> systems, but will ensure that this flag is set properly when I get back.
>

^ permalink raw reply	[flat|nested] 49+ messages in thread
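[Editor's note] The stall Maxim describes in point 4 can be reproduced in miniature with plain counters. The model below is an editor's simplification for illustration only: the field names mirror the kernel's xsk_queue state, but the logic is reduced to the two interacting rules, the driver refuses to peek less than a full batch, and the consumer index is only published once cons_tail has caught up with cons_head.

```c
/* Miniature model of the post-df0ae6f fill-ring deadlock. */

#define RING_SIZE  8
#define MLX5_BATCH 8  /* batch size the driver insists on */

struct fill_ring {
	unsigned int producer;   /* advanced by the application        */
	unsigned int consumer;   /* published back to the application  */
	unsigned int cons_tail;  /* kernel-internal: entries consumed  */
	unsigned int cons_head;  /* kernel-internal: entries peeked    */
};

/* Post-df0ae6f rule: publish consumer only when tail catches head. */
static void maybe_publish_consumer(struct fill_ring *q)
{
	if (q->cons_tail == q->cons_head)
		q->consumer = q->cons_tail;
}

/* Driver side: refresh from shared state, refuse a partial batch. */
static int driver_can_peek(struct fill_ring *q)
{
	q->cons_head = q->producer;
	return q->cons_head - q->cons_tail >= MLX5_BATCH;
}

/* Application side: free slots as seen via the published consumer. */
static unsigned int app_free_slots(const struct fill_ring *q)
{
	return RING_SIZE - (q->producer - q->consumer);
}
```

With cons_tail = 5 and cons_head = producer = 8, only 3 entries are available, so the driver waits for 8; meanwhile the application still sees consumer = 0 and a full ring, so it cannot refill: neither side can make progress.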
* Re: zero-copy between interfaces 2020-01-27 14:01 ` Maxim Mikityanskiy @ 2020-01-27 15:54 ` Magnus Karlsson 2020-01-30 9:37 ` Maxim Mikityanskiy 0 siblings, 1 reply; 49+ messages in thread From: Magnus Karlsson @ 2020-01-27 15:54 UTC (permalink / raw) To: Maxim Mikityanskiy Cc: Ryan Goodfellow, xdp-newbies, Tariq Toukan, Saeed Mahameed, Moshe Shemesh On Mon, Jan 27, 2020 at 3:01 PM Maxim Mikityanskiy <maximmi@mellanox.com> wrote: > > On 2020-01-22 23:43, Ryan Goodfellow wrote: > > On Tue, Jan 21, 2020 at 01:40:50PM +0000, Maxim Mikityanskiy wrote: > >>>> I've posted output from the program in debugging mode here > >>>> > >>>> - https://gitlab.com/mergetb/tech/network-emulation/kernel/snippets/1930375 > >>>> > >>>> Yes, you are correct in that forwarding works for a brief period and then stops. > >>>> I've noticed that the number of packets that are forwarded is equal to the size > >>>> of the producer/consumer descriptor rings. I've posted two ping traces from a > >>>> client ping that shows this. > >>>> > >>>> - https://gitlab.com/mergetb/tech/network-emulation/kernel/snippets/1930376 > >>>> - https://gitlab.com/mergetb/tech/network-emulation/kernel/snippets/1930377 > >> > >> These snippets are not available. > > > > Apologies, I had the wrong permissions set. They should be available now. > > > >> > >>>> > >>>> I've also noticed that when the forwarding stops, the CPU usage for the proc > >>>> running the program is pegged, which is not the norm for this program as it uses > >>>> a poll call with a timeout on the xsk fd. > >> > >> This information led me to a guess what may be happening. On the RX > >> side, mlx5e allocates pages in bulks for performance reasons and to > >> leverage hardware features targeted to performance. In AF_XDP mode, > >> bulking of frames is also used (on x86, the bulk size is 64 with > >> striding RQ enabled, and 8 otherwise, however, it's implementation > >> details that might change later). 
If you don't put enough frames to XSK > >> Fill Ring, the driver will be demanding more frames and return from > >> poll() immediately. Basically, in the application, you should put as > >> many frames to the Fill Ring as you can. Please check if that could be > >> the root cause of your issue. > > > > The code in this application makes an effort to relenish the fill ring as fast > > as possible. The basic loop of the application is to first check if there are > > any descriptors to be consumed from the completion queue or any descriptors that > > can be added to the fill queue, and only then to move on to moving packets > > through the rx and tx rings. > > > > https://gitlab.com/mergetb/tech/network-emulation/kernel/blob/v5.5-moa/samples/bpf/xdpsock_multidev.c#L452-474 > > I reproduced your issue and did my investigation, and here is what I found: > > 1. Commit df0ae6f78a45 (that you found during bisect) introduces an > important behavioral change (which I thought was not that important). > xskq_nb_avail used to return min(entries, dcnt), but after the change it > just returns entries, which may be as big as the ring size. > > 2. xskq_peek_addr updates q->ring->consumer only when q->cons_tail > catches up with q->cons_head. So, before that patch and one previous > patch, cons_head - cons_tail was not more than 16, so the consumer index > was updated periodically. Now consumer is updated only when the whole > ring is exhausted. > > 3. The application can't replenish the fill ring if the consumer index > doesn't move. As a consequence, refilling the descriptors by the > application can't happen in parallel with using them by the driver. It > should have some performance penalty and possibly even lead to packet > drops, because the driver uses all the descriptors and only then > advances the consumer index, and then it has to wait until the > application refills the ring, busy-looping and losing packets. 
This will happen if user space always fills up the whole ring, which may or may not happen, depending on the app. With that said, it might provide better performance to update it once in a while. It might also be the case that we will get better performance with the new scheme if we only fill half the fill ring. I will look into this and see what I get. > 4. As mlx5e allocates frames in batches, the consequences are even more > severe: it's a deadlock where the driver waits for the application, and > vice versa. The driver never reaches the point where cons_tail gets > equal to cons_head. E.g., if cons_tail + 3 == cons_head, and the batch > size requested by the driver is 8, the driver won't peek anything from > the fill ring waiting for difference between cons_tail and cons_head to > increase to be at least 8. On the other hand, the application can't put > anything to the ring, because it still thinks that the consumer index is > 0. As cons_tail never reaches cons_head, the consumer index doesn't get > updated, hence the deadlock. Good thing that you detected this. Maybe I should get a Mellanox card :-). This is different from how we implemented Intel's drivers that just work on any batch size. If it gets 3 packets back, it will use those. How do you deal with the case when the application just puts a single buffer in the fill ring and wants to receive a single packet? Why does the Mellanox driver need a specific batch size? This is just for my understanding so we can find a good solution.
The reason I wanted to remove RX_BATCH_SIZE is that application developers complained about it giving rise to counter-intuitive behavior in user space. I will try to dig out the complaints and the explanations Björn and I had to send, about things that users really should not have to care about. It should just work. I also do not like to have arbitrary constants like this. Why 16? Would much prefer not having to deal with this, unless of course it horribly breaks something or gives rise to worse performance. Might still be the case here, but if not, I would like to remove it. Thanks: Magnus > Magnus, what do you think of the suggestion above? > > Thanks, > Max > > >> > >> I tracked this issue in our internal bug tracker in case we need to > >> perform actual debugging of mlx5e. I'm looking forward to your feedback > >> on my assumption above. > >> > >>>> The hardware I am using is a Mellanox ConnectX4 2x100G card (MCX416A-CCAT) > >>>> running the mlx5 driver. > >> > >> This one should run without striding RQ, please verify it with ethtool > >> --show-priv-flags (the flag name is rx_striding_rq). > > > > I do not remember changing this option, so whatever the default is, is what it > > was running with. I am traveling this week and do not have access to these > > systems, but will ensure that this flag is set properly when I get back. > > > ^ permalink raw reply [flat|nested] 49+ messages in thread
* RE: zero-copy between interfaces 2020-01-27 15:54 ` Magnus Karlsson @ 2020-01-30 9:37 ` Maxim Mikityanskiy 2020-01-30 9:59 ` Magnus Karlsson 0 siblings, 1 reply; 49+ messages in thread From: Maxim Mikityanskiy @ 2020-01-30 9:37 UTC (permalink / raw) To: Magnus Karlsson Cc: Ryan Goodfellow, xdp-newbies, Tariq Toukan, Saeed Mahameed, Moshe Shemesh > -----Original Message----- > From: Magnus Karlsson <magnus.karlsson@gmail.com> > Sent: 27 January, 2020 17:55 > To: Maxim Mikityanskiy <maximmi@mellanox.com> > Cc: Ryan Goodfellow <rgoodfel@isi.edu>; xdp-newbies@vger.kernel.org; Tariq > Toukan <tariqt@mellanox.com>; Saeed Mahameed <saeedm@mellanox.com>; Moshe > Shemesh <moshe@mellanox.com> > Subject: Re: zero-copy between interfaces > > On Mon, Jan 27, 2020 at 3:01 PM Maxim Mikityanskiy <maximmi@mellanox.com> > wrote: > > > > On 2020-01-22 23:43, Ryan Goodfellow wrote: > > > On Tue, Jan 21, 2020 at 01:40:50PM +0000, Maxim Mikityanskiy wrote: > > >>>> I've posted output from the program in debugging mode here > > >>>> > > >>>> - https://gitlab.com/mergetb/tech/network- > emulation/kernel/snippets/1930375 > > >>>> > > >>>> Yes, you are correct in that forwarding works for a brief period and > then stops. > > >>>> I've noticed that the number of packets that are forwarded is equal > to the size > > >>>> of the producer/consumer descriptor rings. I've posted two ping > traces from a > > >>>> client ping that shows this. > > >>>> > > >>>> - https://gitlab.com/mergetb/tech/network- > emulation/kernel/snippets/1930376 > > >>>> - https://gitlab.com/mergetb/tech/network- > emulation/kernel/snippets/1930377 > > >> > > >> These snippets are not available. > > > > > > Apologies, I had the wrong permissions set. They should be available > now. 
> > > > > >> > > >>>> > > >>>> I've also noticed that when the forwarding stops, the CPU usage for > the proc > > >>>> running the program is pegged, which is not the norm for this program > as it uses > > >>>> a poll call with a timeout on the xsk fd. > > >> > > >> This information led me to a guess what may be happening. On the RX > > >> side, mlx5e allocates pages in bulks for performance reasons and to > > >> leverage hardware features targeted to performance. In AF_XDP mode, > > >> bulking of frames is also used (on x86, the bulk size is 64 with > > >> striding RQ enabled, and 8 otherwise, however, it's implementation > > >> details that might change later). If you don't put enough frames to XSK > > >> Fill Ring, the driver will be demanding more frames and return from > > >> poll() immediately. Basically, in the application, you should put as > > >> many frames to the Fill Ring as you can. Please check if that could be > > >> the root cause of your issue. > > > > > > The code in this application makes an effort to relenish the fill ring > as fast > > > as possible. The basic loop of the application is to first check if > there are > > > any descriptors to be consumed from the completion queue or any > descriptors that > > > can be added to the fill queue, and only then to move on to moving > packets > > > through the rx and tx rings. > > > > > > https://gitlab.com/mergetb/tech/network-emulation/kernel/blob/v5.5- > moa/samples/bpf/xdpsock_multidev.c#L452-474 > > > > I reproduced your issue and did my investigation, and here is what I > found: > > > > 1. Commit df0ae6f78a45 (that you found during bisect) introduces an > > important behavioral change (which I thought was not that important). > > xskq_nb_avail used to return min(entries, dcnt), but after the change it > > just returns entries, which may be as big as the ring size. > > > > 2. xskq_peek_addr updates q->ring->consumer only when q->cons_tail > > catches up with q->cons_head. 
So, before that patch and one previous > > patch, cons_head - cons_tail was not more than 16, so the consumer index > > was updated periodically. Now consumer is updated only when the whole > > ring is exhausted. > > > > 3. The application can't replenish the fill ring if the consumer index > > doesn't move. As a consequence, refilling the descriptors by the > > application can't happen in parallel with using them by the driver. It > > should have some performance penalty and possibly even lead to packet > > drops, because the driver uses all the descriptors and only then > > advances the consumer index, and then it has to wait until the > > application refills the ring, busy-looping and losing packets. > > This will happen if user space always fills up the whole ring, which > might or might not happen all depending on the app. Yes, that's right, and as far as I know, it's common to fill as many frames as the application can (there was no reason to do it other way till now). > With that said, it > might provide better performance to update it once in a while. It > might also be the case that we will get better performance with the > new scheme if we only fill half the fill ring. Yes, it may improve performance. However, I don't think it's correct to set such a limitation on the app. > I will look into this > and see what I get. > > > 4. As mlx5e allocates frames in batches, the consequences are even more > > severe: it's a deadlock where the driver waits for the application, and > > vice versa. The driver never reaches the point where cons_tail gets > > equal to cons_head. E.g., if cons_tail + 3 == cons_head, and the batch > > size requested by the driver is 8, the driver won't peek anything from > > the fill ring waiting for difference between cons_tail and cons_head to > > increase to be at least 8. On the other hand, the application can't put > > anything to the ring, because it still thinks that the consumer index is > > 0. 
As cons_tail never reaches cons_head, the consumer index doesn't get > > updated, hence the deadlock. > > Good thing that you detected this. Maybe I should get a Mellanox card > :-). This is different from how we implemented Intel's drivers that > just work on any batch size. If it gets 3 packets back, it will use > those. How do you deal with the case when the application just puts a > single buffer in the fill ring and wants to receive a single packet? mlx5e will wait until the full batch is available. As AF_XDP is intended for high-performance apps, this scenario is less expected. We prefer to leverage our performance features. > Why does the Mellanox driver need a specific batch size? This is just > for my understanding so we can find a good solution. The main reason is our performance feature called striding RQ. Skipping all irrelevant details, a batch of 64 pages is posted to the NIC with a single request, and the NIC fills them progressively. This feature is turned on by default on modern NICs, and it's really good for performance. It might be possible to post a smaller batch though, I'm not sure about it, it needs to be checked, but anyway it's not something that we will likely do, because it's a complication of the data path, and if we know more frames are coming, it's much better for the performance to wait for them, rather than to post several incomplete batches. > > So, in my vision, the decision to remove RX_BATCH_SIZE and periodic > > updates of the consumer index was wrong. It totally breaks mlx5e, that > > does batching, and it will affect the performance of any driver, because > > the application can't refill the ring until it gets completely empty and > > the driver starts waiting for frames. I suggest that periodic updates of > > the consumer index should be readded to xskq_cons_peek_addr. > > The reason I wanted to remove RX_BATCH_SIZE is that application > developers complained about it giving rise to counter intuitive > behavior in user space. 
I will try to dig out the complaints and the > explanations Björn and I had to send which it seemed that users really > should not have to care about. It should just work. I think the counter that doesn't update till the very last moment and then advances by the ring size will also be something to complain about (and I am the first one to complain :D). Such bursts are not good in any case. > I also do not like > to have arbitrary constants like this. Why 16? I believe any batching mechanism has a constant that looks arbitrary. This constant should be the golden mean: if it's too small, there is little effect from batching; if it's too big, it gets too bursty. Basically, after your patch it just changed from 16 to the ring size. Maybe we can tie that constant to ring size? Make it ring_size / another_arbitrary_constant? :) > Would much prefer not > having to deal with this, unless of course it horribly breaks > something or gives rise to worse performance. Might still be the case > here, but if not, I would like to remove it. > > Thanks: Magnus > > > Magnus, what do you think of the suggestion above? > > > > Thanks, > > Max > > > > >> > > >> I tracked this issue in our internal bug tracker in case we need to > > >> perform actual debugging of mlx5e. I'm looking forward to your feedback > > >> on my assumption above. > > >> > > >>>> The hardware I am using is a Mellanox ConnectX4 2x100G card (MCX416A-CCAT) > > >>>> running the mlx5 driver. > > >> > > >> This one should run without striding RQ, please verify it with ethtool > > >> --show-priv-flags (the flag name is rx_striding_rq). > > > > > > I do not remember changing this option, so whatever the default is, is > what it > > > was running with. I am traveling this week and do not have access to > these > > > systems, but will ensure that this flag is set properly when I get back. > > > > > ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: zero-copy between interfaces 2020-01-30 9:37 ` Maxim Mikityanskiy @ 2020-01-30 9:59 ` Magnus Karlsson 2020-01-30 11:40 ` Magnus Karlsson 0 siblings, 1 reply; 49+ messages in thread From: Magnus Karlsson @ 2020-01-30 9:59 UTC (permalink / raw) To: Maxim Mikityanskiy Cc: Ryan Goodfellow, xdp-newbies, Tariq Toukan, Saeed Mahameed, Moshe Shemesh On Thu, Jan 30, 2020 at 10:37 AM Maxim Mikityanskiy <maximmi@mellanox.com> wrote: > > > -----Original Message----- > > From: Magnus Karlsson <magnus.karlsson@gmail.com> > > Sent: 27 January, 2020 17:55 > > To: Maxim Mikityanskiy <maximmi@mellanox.com> > > Cc: Ryan Goodfellow <rgoodfel@isi.edu>; xdp-newbies@vger.kernel.org; Tariq > > Toukan <tariqt@mellanox.com>; Saeed Mahameed <saeedm@mellanox.com>; Moshe > > Shemesh <moshe@mellanox.com> > > Subject: Re: zero-copy between interfaces > > > > On Mon, Jan 27, 2020 at 3:01 PM Maxim Mikityanskiy <maximmi@mellanox.com> > > wrote: > > > > > > On 2020-01-22 23:43, Ryan Goodfellow wrote: > > > > On Tue, Jan 21, 2020 at 01:40:50PM +0000, Maxim Mikityanskiy wrote: > > > >>>> I've posted output from the program in debugging mode here > > > >>>> > > > >>>> - https://gitlab.com/mergetb/tech/network- > > emulation/kernel/snippets/1930375 > > > >>>> > > > >>>> Yes, you are correct in that forwarding works for a brief period and > > then stops. > > > >>>> I've noticed that the number of packets that are forwarded is equal > > to the size > > > >>>> of the producer/consumer descriptor rings. I've posted two ping > > traces from a > > > >>>> client ping that shows this. > > > >>>> > > > >>>> - https://gitlab.com/mergetb/tech/network- > > emulation/kernel/snippets/1930376 > > > >>>> - https://gitlab.com/mergetb/tech/network- > > emulation/kernel/snippets/1930377 > > > >> > > > >> These snippets are not available. > > > > > > > > Apologies, I had the wrong permissions set. They should be available > > now. 
> > > > > > > >> > > > >>>> > > > >>>> I've also noticed that when the forwarding stops, the CPU usage for > > the proc > > > >>>> running the program is pegged, which is not the norm for this program > > as it uses > > > >>>> a poll call with a timeout on the xsk fd. > > > >> > > > >> This information led me to a guess what may be happening. On the RX > > > >> side, mlx5e allocates pages in bulks for performance reasons and to > > > >> leverage hardware features targeted to performance. In AF_XDP mode, > > > >> bulking of frames is also used (on x86, the bulk size is 64 with > > > >> striding RQ enabled, and 8 otherwise, however, it's implementation > > > >> details that might change later). If you don't put enough frames to XSK > > > >> Fill Ring, the driver will be demanding more frames and return from > > > >> poll() immediately. Basically, in the application, you should put as > > > >> many frames to the Fill Ring as you can. Please check if that could be > > > >> the root cause of your issue. > > > > > > > > The code in this application makes an effort to relenish the fill ring > > as fast > > > > as possible. The basic loop of the application is to first check if > > there are > > > > any descriptors to be consumed from the completion queue or any > > descriptors that > > > > can be added to the fill queue, and only then to move on to moving > > packets > > > > through the rx and tx rings. > > > > > > > > https://gitlab.com/mergetb/tech/network-emulation/kernel/blob/v5.5- > > moa/samples/bpf/xdpsock_multidev.c#L452-474 > > > > > > I reproduced your issue and did my investigation, and here is what I > > found: > > > > > > 1. Commit df0ae6f78a45 (that you found during bisect) introduces an > > > important behavioral change (which I thought was not that important). > > > xskq_nb_avail used to return min(entries, dcnt), but after the change it > > > just returns entries, which may be as big as the ring size. > > > > > > 2. 
xskq_peek_addr updates q->ring->consumer only when q->cons_tail > > > catches up with q->cons_head. So, before that patch and one previous > > > patch, cons_head - cons_tail was not more than 16, so the consumer index > > > was updated periodically. Now consumer is updated only when the whole > > > ring is exhausted. > > > > > > 3. The application can't replenish the fill ring if the consumer index > > > doesn't move. As a consequence, refilling the descriptors by the > > > application can't happen in parallel with using them by the driver. It > > > should have some performance penalty and possibly even lead to packet > > > drops, because the driver uses all the descriptors and only then > > > advances the consumer index, and then it has to wait until the > > > application refills the ring, busy-looping and losing packets. > > > > This will happen if user space always fills up the whole ring, which > > might or might not happen all depending on the app. > > Yes, that's right, and as far as I know, it's common to fill as many > frames as the application can (there was no reason to do it other way > till now). > > > With that said, it > > might provide better performance to update it once in a while. It > > might also be the case that we will get better performance with the > > new scheme if we only fill half the fill ring. > > Yes, it may improve performance. However, I don't think it's correct to > set such a limitation on the app. I will examine what provides the best performance. On one hand there is the number of updates to shared ring state (minimized by the current scheme); on the other, the ability for the user app to put buffers on the fill ring. Stating that putting e.g. half the packets on the fill ring provides better performance (for some application) is not a limitation. It is performance tuning advice :-). > > I will look into this > > and see what I get. > > > > > 4.
As mlx5e allocates frames in batches, the consequences are even more > > > severe: it's a deadlock where the driver waits for the application, and > > > vice versa. The driver never reaches the point where cons_tail gets > > > equal to cons_head. E.g., if cons_tail + 3 == cons_head, and the batch > > > size requested by the driver is 8, the driver won't peek anything from > > > the fill ring waiting for difference between cons_tail and cons_head to > > > increase to be at least 8. On the other hand, the application can't put > > > anything to the ring, because it still thinks that the consumer index is > > > 0. As cons_tail never reaches cons_head, the consumer index doesn't get > > > updated, hence the deadlock. > > > > Good thing that you detected this. Maybe I should get a Mellanox card > > :-). This is different from how we implemented Intel's drivers that > > just work on any batch size. If it gets 3 packets back, it will use > > those. How do you deal with the case when the application just puts a > > single buffer in the fill ring and wants to receive a single packet? > > mlx5e will wait until the full batch is available. As AF_XDP is intended > for high-performance apps, this scenario is less expected. We prefer to > leverage our performance features. That you cannot support all applications on top of AF_XDP with your zero-copy support seems broken to me. But I give you that you might support all the ones you care about with your current batching support. Like you say, most apps will put plenty of buffers on the fill ring, so this should not be a problem. Can you not implement some slow path for these cases? You must have one for the skb case. > > Why does the Mellanox driver need a specific batch size? This is just > > for my understanding so we can find a good solution. > > The main reason is our performance feature called striding RQ. 
Skipping > all irrelevant details, a batch of 64 pages is posted to the NIC with a > single request, and the NIC fills them progressively. This feature is > turned on by default on modern NICs, and it's really good for > performance. > > It might be possible to post a smaller batch though, I'm not sure about > it, it needs to be checked, but anyway it's not something that we will > likely do, because it's a complication of the data path, and if we know > more frames are coming, it's much better for the performance to wait for > them, rather than to post several incomplete batches. > > > > So, in my vision, the decision to remove RX_BATCH_SIZE and periodic > > > updates of the consumer index was wrong. It totally breaks mlx5e, that > > > does batching, and it will affect the performance of any driver, because > > > the application can't refill the ring until it gets completely empty and > > > the driver starts waiting for frames. I suggest that periodic updates of > > > the consumer index should be readded to xskq_cons_peek_addr. > > > > The reason I wanted to remove RX_BATCH_SIZE is that application > > developers complained about it giving rise to counter intuitive > > behavior in user space. I will try to dig out the complaints and the > > explanations Björn and I had to send which it seemed that users really > > should not have to care about. It should just work. > > I think the counter that doesn't update till the very last moment and > then advances by the ring size will also be something to complain about > (and I am the first one to complain :D). Such bursts are not good in any > case. Do you have any performance data that shows this scheme is bad for performance? The good thing about this scheme is that global state is updated less frequently. And the bad thing is what you mentioned. But which one has the greatest effect, is the question. > > I also do not like > > to have arbitrary constants like this. Why 16? 
> > I believe any batching mechanism has a constant that look arbitrary. > This constant should be the golden mean: if it's too small, there is > little effect from batching; if it's too big, it gets too bursty. > > Basically, after your patch it just changed from 16 to the ring size. > Maybe we can tie that constant to ring size? Make it ring_size / > another_arbitrary_constant? :) Agreed, I thought about this too. Something tied to ring_size might be a good compromise. Will examine this. But I want to base this on performance data not idle speculation, so need to run some experiments first. /Magnus > > Would much prefer not > > having to deal with this, unless of course it horribly breaks > > something or gives rise to worse performance. Might still be the case > > here, but if not, I would like to remove it. > > > > Thanks: Magnus > > > > > Magnus, what do you think of the suggestion above? > > > > > > Thanks, > > > Max > > > > > > >> > > > >> I tracked this issue in our internal bug tracker in case we need to > > > >> perform actual debugging of mlx5e. I'm looking forward to your feedback > > > >> on my assumption above. > > > >> > > > >>>> The hardware I am using is a Mellanox ConnectX4 2x100G card (MCX416A- > > CCAT) > > > >>>> running the mlx5 driver. > > > >> > > > >> This one should run without striding RQ, please verify it with ethtool > > > >> --show-priv-flags (the flag name is rx_striding_rq). > > > > > > > > I do not remember changing this option, so whatever the default is, is > > what it > > > > was running with. I am traveling this week and do not have access to > > these > > > > systems, but will ensure that this flag is set properly when I get back. > > > > > > > ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: zero-copy between interfaces 2020-01-30 9:59 ` Magnus Karlsson @ 2020-01-30 11:40 ` Magnus Karlsson 2020-02-04 16:10 ` Magnus Karlsson 0 siblings, 1 reply; 49+ messages in thread From: Magnus Karlsson @ 2020-01-30 11:40 UTC (permalink / raw) To: Maxim Mikityanskiy Cc: Ryan Goodfellow, xdp-newbies, Tariq Toukan, Saeed Mahameed, Moshe Shemesh On Thu, Jan 30, 2020 at 10:59 AM Magnus Karlsson <magnus.karlsson@gmail.com> wrote: > > On Thu, Jan 30, 2020 at 10:37 AM Maxim Mikityanskiy > <maximmi@mellanox.com> wrote: > > > > > -----Original Message----- > > > From: Magnus Karlsson <magnus.karlsson@gmail.com> > > > Sent: 27 January, 2020 17:55 > > > To: Maxim Mikityanskiy <maximmi@mellanox.com> > > > Cc: Ryan Goodfellow <rgoodfel@isi.edu>; xdp-newbies@vger.kernel.org; Tariq > > > Toukan <tariqt@mellanox.com>; Saeed Mahameed <saeedm@mellanox.com>; Moshe > > > Shemesh <moshe@mellanox.com> > > > Subject: Re: zero-copy between interfaces > > > > > > On Mon, Jan 27, 2020 at 3:01 PM Maxim Mikityanskiy <maximmi@mellanox.com> > > > wrote: > > > > > > > > On 2020-01-22 23:43, Ryan Goodfellow wrote: > > > > > On Tue, Jan 21, 2020 at 01:40:50PM +0000, Maxim Mikityanskiy wrote: > > > > >>>> I've posted output from the program in debugging mode here > > > > >>>> > > > > >>>> - https://gitlab.com/mergetb/tech/network- > > > emulation/kernel/snippets/1930375 > > > > >>>> > > > > >>>> Yes, you are correct in that forwarding works for a brief period and > > > then stops. > > > > >>>> I've noticed that the number of packets that are forwarded is equal > > > to the size > > > > >>>> of the producer/consumer descriptor rings. I've posted two ping > > > traces from a > > > > >>>> client ping that shows this. > > > > >>>> > > > > >>>> - https://gitlab.com/mergetb/tech/network- > > > emulation/kernel/snippets/1930376 > > > > >>>> - https://gitlab.com/mergetb/tech/network- > > > emulation/kernel/snippets/1930377 > > > > >> > > > > >> These snippets are not available. 
> > > > > > > > > > Apologies, I had the wrong permissions set. They should be available > > > now. > > > > > > > > > >> > > > > >>>> > > > > >>>> I've also noticed that when the forwarding stops, the CPU usage for > > > the proc > > > > >>>> running the program is pegged, which is not the norm for this program > > > as it uses > > > > >>>> a poll call with a timeout on the xsk fd. > > > > >> > > > > >> This information led me to a guess what may be happening. On the RX > > > > >> side, mlx5e allocates pages in bulks for performance reasons and to > > > > >> leverage hardware features targeted to performance. In AF_XDP mode, > > > > >> bulking of frames is also used (on x86, the bulk size is 64 with > > > > >> striding RQ enabled, and 8 otherwise, however, it's implementation > > > > >> details that might change later). If you don't put enough frames to XSK > > > > >> Fill Ring, the driver will be demanding more frames and return from > > > > >> poll() immediately. Basically, in the application, you should put as > > > > >> many frames to the Fill Ring as you can. Please check if that could be > > > > >> the root cause of your issue. > > > > > > > > > > The code in this application makes an effort to relenish the fill ring > > > as fast > > > > > as possible. The basic loop of the application is to first check if > > > there are > > > > > any descriptors to be consumed from the completion queue or any > > > descriptors that > > > > > can be added to the fill queue, and only then to move on to moving > > > packets > > > > > through the rx and tx rings. > > > > > > > > > > https://gitlab.com/mergetb/tech/network-emulation/kernel/blob/v5.5- > > > moa/samples/bpf/xdpsock_multidev.c#L452-474 > > > > > > > > I reproduced your issue and did my investigation, and here is what I > > > found: > > > > > > > > 1. Commit df0ae6f78a45 (that you found during bisect) introduces an > > > > important behavioral change (which I thought was not that important). 
> > > > xskq_nb_avail used to return min(entries, dcnt), but after the change it > > > > just returns entries, which may be as big as the ring size. > > > > > > > > 2. xskq_peek_addr updates q->ring->consumer only when q->cons_tail > > > > catches up with q->cons_head. So, before that patch and one previous > > > > patch, cons_head - cons_tail was not more than 16, so the consumer index > > > > was updated periodically. Now consumer is updated only when the whole > > > > ring is exhausted. > > > > > > > > 3. The application can't replenish the fill ring if the consumer index > > > > doesn't move. As a consequence, refilling the descriptors by the > > > > application can't happen in parallel with using them by the driver. It > > > > should have some performance penalty and possibly even lead to packet > > > > drops, because the driver uses all the descriptors and only then > > > > advances the consumer index, and then it has to wait until the > > > > application refills the ring, busy-looping and losing packets. > > > > > > This will happen if user space always fills up the whole ring, which > > > might or might not happen all depending on the app. > > > > Yes, that's right, and as far as I know, it's common to fill as many > > frames as the application can (there was no reason to do it other way > > till now). > > > > > With that said, it > > > might provide better performance to update it once in a while. It > > > might also be the case that we will get better performance with the > > > new scheme if we only fill half the fill ring. > > > > Yes, it may improve performance. However, I don't think it's correct to > > set such a limitation on the app. Actually, a much worse limitation to put on an application is to say that you have to have a certain amount of buffers on some ring for the zero-copy feature to work. For example that we need at least 64 buffers on the fill ring for all the NIC cards to work in zero-copy mode. 
That would be a bad thing to have to put in the documentation. An OS is supposed to abstract away HW differences, and with this current limitation in your driver, it shines through for sure. What we would like to put in the documentation is a statement along the lines of: "for best performance, make sure you have plenty of buffers on the fill ring so that the NIC can work as efficiently as possible". Not a statement that it does not work on Mellanox unless you put enough buffers on the fill ring. So my advice (and wish) is that you fix this in your driver. With that said, I will still look into what is the best way to get at least the sample to work for you. But there is no way to make sure every single app works for you in zero-copy mode, unless you support an arbitrary number of buffers on the fill ring. I guess that sooner or later, a customer of yours will get into this situation one way or the other, so why not fix it now? /Magnus > I will examine what provides the best performance. On one hand it is > the number of updates to shared ring state (minimized by current > scheme) and the ability for the user app to but buffers on the fill > ring. Stating that putting e.g. half the packets on the fill ring > provides better performance (for some application) is not a > limitation. It is performance tuning advise :-). > > > I will look into this > > > and see what I get. > > > > > > > 4. As mlx5e allocates frames in batches, the consequences are even more > > > > severe: it's a deadlock where the driver waits for the application, and > > > > vice versa. The driver never reaches the point where cons_tail gets > > > > equal to cons_head. E.g., if cons_tail + 3 == cons_head, and the batch > > > > size requested by the driver is 8, the driver won't peek anything from > > > > the fill ring waiting for difference between cons_tail and cons_head to > > > > increase to be at least 8.
On the other hand, the application can't put > > > > anything to the ring, because it still thinks that the consumer index is > > > > 0. As cons_tail never reaches cons_head, the consumer index doesn't get > > > > updated, hence the deadlock. > > > > > > Good thing that you detected this. Maybe I should get a Mellanox card > > > :-). This is different from how we implemented Intel's drivers that > > > just work on any batch size. If it gets 3 packets back, it will use > > > those. How do you deal with the case when the application just puts a > > > single buffer in the fill ring and wants to receive a single packet? > > > > mlx5e will wait until the full batch is available. As AF_XDP is intended > > for high-performance apps, this scenario is less expected. We prefer to > > leverage our performance features. > > That you cannot support all applications on top of AF_XDP with your > zero-copy support seems broken to me. But I give you that you might > support all the ones you care about with your current batching > support. Like you say, most apps will put plenty of buffers on the > fill ring, so this should not be a problem. Can you not implement some > slow path for these cases? You must have one for the skb case. > > > > Why does the Mellanox driver need a specific batch size? This is just > > > for my understanding so we can find a good solution. > > > > The main reason is our performance feature called striding RQ. Skipping > > all irrelevant details, a batch of 64 pages is posted to the NIC with a > > single request, and the NIC fills them progressively. This feature is > > turned on by default on modern NICs, and it's really good for > > performance. 
> > > > It might be possible to post a smaller batch though, I'm not sure about > > it, it needs to be checked, but anyway it's not something that we will > > likely do, because it's a complication of the data path, and if we know > > more frames are coming, it's much better for the performance to wait for > > them, rather than to post several incomplete batches. > > > > > > So, in my vision, the decision to remove RX_BATCH_SIZE and periodic > > > > updates of the consumer index was wrong. It totally breaks mlx5e, that > > > > does batching, and it will affect the performance of any driver, because > > > > the application can't refill the ring until it gets completely empty and > > > > the driver starts waiting for frames. I suggest that periodic updates of > > > > the consumer index should be readded to xskq_cons_peek_addr. > > > > > > The reason I wanted to remove RX_BATCH_SIZE is that application > > > developers complained about it giving rise to counter intuitive > > > behavior in user space. I will try to dig out the complaints and the > > > explanations Björn and I had to send which it seemed that users really > > > should not have to care about. It should just work. > > > > I think the counter that doesn't update till the very last moment and > > then advances by the ring size will also be something to complain about > > (and I am the first one to complain :D). Such bursts are not good in any > > case. > > Do you have any performance data that shows this scheme is bad for > performance? The good thing about this scheme is that global state is > updated less frequently. And the bad thing is what you mentioned. But > which one has the greatest effect, is the question. > > > > I also do not like > > > to have arbitrary constants like this. Why 16? > > > > I believe any batching mechanism has a constant that looks arbitrary. 
> > This constant should be the golden mean: if it's too small, there is > > little effect from batching; if it's too big, it gets too bursty. > > > > Basically, after your patch it just changed from 16 to the ring size. > > Maybe we can tie that constant to ring size? Make it ring_size / > > another_arbitrary_constant? :) > > Agreed, I thought about this too. Something tied to ring_size might be > a good compromise. Will examine this. But I want to base this on > performance data not idle speculation, so need to run some experiments > first. > > /Magnus > > > > Would much prefer not > > > having to deal with this, unless of course it horribly breaks > > > something or gives rise to worse performance. Might still be the case > > > here, but if not, I would like to remove it. > > > > > > Thanks: Magnus > > > > > > > Magnus, what do you think of the suggestion above? > > > > > > > > Thanks, > > > > Max > > > > > > > > >> > > > > >> I tracked this issue in our internal bug tracker in case we need to > > > > >> perform actual debugging of mlx5e. I'm looking forward to your feedback > > > > >> on my assumption above. > > > > >> > > > > >>>> The hardware I am using is a Mellanox ConnectX4 2x100G card (MCX416A- > > > CCAT) > > > > >>>> running the mlx5 driver. > > > > >> > > > > >> This one should run without striding RQ, please verify it with ethtool > > > > >> --show-priv-flags (the flag name is rx_striding_rq). > > > > > > > > > > I do not remember changing this option, so whatever the default is, is > > > what it > > > > > was running with. I am traveling this week and do not have access to > > > these > > > > > systems, but will ensure that this flag is set properly when I get back. > > > > > > > > > ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: zero-copy between interfaces 2020-01-30 11:40 ` Magnus Karlsson @ 2020-02-04 16:10 ` Magnus Karlsson 2020-02-05 13:31 ` Magnus Karlsson 0 siblings, 1 reply; 49+ messages in thread From: Magnus Karlsson @ 2020-02-04 16:10 UTC (permalink / raw) To: Maxim Mikityanskiy Cc: Ryan Goodfellow, xdp-newbies, Tariq Toukan, Saeed Mahameed, Moshe Shemesh On Thu, Jan 30, 2020 at 12:40 PM Magnus Karlsson <magnus.karlsson@gmail.com> wrote: > > On Thu, Jan 30, 2020 at 10:59 AM Magnus Karlsson > <magnus.karlsson@gmail.com> wrote: > > > > On Thu, Jan 30, 2020 at 10:37 AM Maxim Mikityanskiy > > <maximmi@mellanox.com> wrote: > > > > > > > -----Original Message----- > > > > From: Magnus Karlsson <magnus.karlsson@gmail.com> > > > > Sent: 27 January, 2020 17:55 > > > > To: Maxim Mikityanskiy <maximmi@mellanox.com> > > > > Cc: Ryan Goodfellow <rgoodfel@isi.edu>; xdp-newbies@vger.kernel.org; Tariq > > > > Toukan <tariqt@mellanox.com>; Saeed Mahameed <saeedm@mellanox.com>; Moshe > > > > Shemesh <moshe@mellanox.com> > > > > Subject: Re: zero-copy between interfaces > > > > > > > > On Mon, Jan 27, 2020 at 3:01 PM Maxim Mikityanskiy <maximmi@mellanox.com> > > > > wrote: > > > > > > > > > > On 2020-01-22 23:43, Ryan Goodfellow wrote: > > > > > > On Tue, Jan 21, 2020 at 01:40:50PM +0000, Maxim Mikityanskiy wrote: > > > > > >>>> I've posted output from the program in debugging mode here > > > > > >>>> > > > > > >>>> - https://gitlab.com/mergetb/tech/network- > > > > emulation/kernel/snippets/1930375 > > > > > >>>> > > > > > >>>> Yes, you are correct in that forwarding works for a brief period and > > > > then stops. > > > > > >>>> I've noticed that the number of packets that are forwarded is equal > > > > to the size > > > > > >>>> of the producer/consumer descriptor rings. I've posted two ping > > > > traces from a > > > > > >>>> client ping that shows this. 
> > > > > >>>> > > > > > >>>> - https://gitlab.com/mergetb/tech/network- > > > > emulation/kernel/snippets/1930376 > > > > > >>>> - https://gitlab.com/mergetb/tech/network- > > > > emulation/kernel/snippets/1930377 > > > > > >> > > > > > >> These snippets are not available. > > > > > > > > > > > > Apologies, I had the wrong permissions set. They should be available > > > > now. > > > > > > > > > > > >> > > > > > >>>> > > > > > >>>> I've also noticed that when the forwarding stops, the CPU usage for > > > > the proc > > > > > >>>> running the program is pegged, which is not the norm for this program > > > > as it uses > > > > > >>>> a poll call with a timeout on the xsk fd. > > > > > >> > > > > > >> This information led me to a guess what may be happening. On the RX > > > > > >> side, mlx5e allocates pages in bulks for performance reasons and to > > > > > >> leverage hardware features targeted to performance. In AF_XDP mode, > > > > > >> bulking of frames is also used (on x86, the bulk size is 64 with > > > > > >> striding RQ enabled, and 8 otherwise, however, it's implementation > > > > > >> details that might change later). If you don't put enough frames to XSK > > > > > >> Fill Ring, the driver will be demanding more frames and return from > > > > > >> poll() immediately. Basically, in the application, you should put as > > > > > >> many frames to the Fill Ring as you can. Please check if that could be > > > > > >> the root cause of your issue. > > > > > > > > > > > > The code in this application makes an effort to replenish the fill ring > > > > as fast > > > > > > as possible. The basic loop of the application is to first check if > > > > there are > > > > > > any descriptors to be consumed from the completion queue or any > > > > descriptors that > > > > > > can be added to the fill queue, and only then to move on to moving > > > > packets > > > > > > through the rx and tx rings. 
> > > > > > > > > > > > https://gitlab.com/mergetb/tech/network-emulation/kernel/blob/v5.5- > > > > moa/samples/bpf/xdpsock_multidev.c#L452-474 > > > > > > > > > > I reproduced your issue and did my investigation, and here is what I > > > > found: > > > > > > > > > > 1. Commit df0ae6f78a45 (that you found during bisect) introduces an > > > > > important behavioral change (which I thought was not that important). > > > > > xskq_nb_avail used to return min(entries, dcnt), but after the change it > > > > > just returns entries, which may be as big as the ring size. > > > > > > > > > > 2. xskq_peek_addr updates q->ring->consumer only when q->cons_tail > > > > > catches up with q->cons_head. So, before that patch and one previous > > > > > patch, cons_head - cons_tail was not more than 16, so the consumer index > > > > > was updated periodically. Now consumer is updated only when the whole > > > > > ring is exhausted. > > > > > > > > > > 3. The application can't replenish the fill ring if the consumer index > > > > > doesn't move. As a consequence, refilling the descriptors by the > > > > > application can't happen in parallel with using them by the driver. It > > > > > should have some performance penalty and possibly even lead to packet > > > > > drops, because the driver uses all the descriptors and only then > > > > > advances the consumer index, and then it has to wait until the > > > > > application refills the ring, busy-looping and losing packets. > > > > > > > > This will happen if user space always fills up the whole ring, which > > > > might or might not happen all depending on the app. > > > > > > Yes, that's right, and as far as I know, it's common to fill as many > > > frames as the application can (there was no reason to do it other way > > > till now). > > > > > > > With that said, it > > > > might provide better performance to update it once in a while. 
It > > > > might also be the case that we will get better performance with the > > > > new scheme if we only fill half the fill ring. > > > > > > Yes, it may improve performance. However, I don't think it's correct to > > > set such a limitation on the app. > > Actually, a much worse limitation to put on an application is to say > that you have to have a certain amount of buffers on some ring for the > zero-copy feature to work. For example that we need at least 64 > buffers on the fill ring for all the NIC cards to work in zero-copy > mode. That would be a bad thing to have to put in the documentation. > An OS is supposed to abstract away HW differences, and with this > current limitation in your driver, it shines through for sure. What we > would like to put in the documentation is a statement along the lines > of: "for best performance, make sure you have plenty of buffers on the > fill ring so that the NIC can work as efficiently as possible". Not a > statement that it does not work on Mellanox unless you put enough > buffers on the fill ring. So my advice (and wish) is that you fix this > in your driver. With that said, I will still look into what is the > best way to get at least the sample to work for you. But there is no > way to make sure every single app works for you in zero-copy mode, > unless you support arbitrary amount of buffers on the fill ring. I > guess that sooner or later, a customer of yours will get into this > situation one way or the other, so why not fix it now. > > /Magnus > > > I will examine what provides the best performance. On one hand it is > > the number of updates to shared ring state (minimized by current > > scheme) and the ability for the user app to put buffers on the fill > > ring. Stating that putting e.g. half the packets on the fill ring > > provides better performance (for some application) is not a > > limitation. It is performance tuning advice :-). I have now made a set of measurements. 
First I just made a variation study using the xdpsock app, varying the amount of packets the kernel can withdraw from a consumer ring (fill and Tx) before updating global state. For the 1 core case (app and driver on the same core) the more frequently you do this update, the better. The reason for this is that it costs very little to update the state since the application is not running. And it is beneficial for the app to have a freshly updated state when it starts to execute as it can operate on more packets. For the 2 core case (app on one core, driver on another) it is the complete opposite: the fewer updates to global state, the better. The reason for this is that it costs a lot to update global state as it triggers cache coherency actions between the two cores. What I did then was to compare the current scheme, update only when grabbing new packets, to a new scheme where we also update the global consumer pointer when we are exiting Rx or Tx processing in the NAPI context. On two cores the current scheme gets 0.5 to 1 Mpps more in throughput than also updating the pointer at the end of NAPI. But for the 1 core case, the new scheme is the best and generates between 0.2 and 0.3 Mpps more in throughput than the current one. But all in all, the current scheme is more beneficial than the proposed one if we say that both the 1 core and the 2 core cases are equally important. Something to note is that the xdpsock application only puts batch size (64) of packets in the fill ring in every iteration, and this might lead to some good pacing for the current scheme and the 2 core case. I.e., we do not get into the case of the fill ring only being full or empty. But I will run this on some real apps to get some more results, and I know that Björn has an optimized xdpsock application that puts many more packets into the fill queue than 64. This optimization might actually make the new proposal (also updating at the end of NAPI) be better and make the current scheme suffer. 
We will examine this further and get back. /Magnus > > > > I will look into this > > > > and see what I get. > > > > > > > > > 4. As mlx5e allocates frames in batches, the consequences are even more > > > > > severe: it's a deadlock where the driver waits for the application, and > > > > > vice versa. The driver never reaches the point where cons_tail gets > > > > > equal to cons_head. E.g., if cons_tail + 3 == cons_head, and the batch > > > > > size requested by the driver is 8, the driver won't peek anything from > > > > > the fill ring waiting for difference between cons_tail and cons_head to > > > > > increase to be at least 8. On the other hand, the application can't put > > > > > anything to the ring, because it still thinks that the consumer index is > > > > > 0. As cons_tail never reaches cons_head, the consumer index doesn't get > > > > > updated, hence the deadlock. > > > > > > > > Good thing that you detected this. Maybe I should get a Mellanox card > > > > :-). This is different from how we implemented Intel's drivers that > > > > just work on any batch size. If it gets 3 packets back, it will use > > > > those. How do you deal with the case when the application just puts a > > > > single buffer in the fill ring and wants to receive a single packet? > > > > > > mlx5e will wait until the full batch is available. As AF_XDP is intended > > > for high-performance apps, this scenario is less expected. We prefer to > > > leverage our performance features. > > > > That you cannot support all applications on top of AF_XDP with your > > zero-copy support seems broken to me. But I give you that you might > > support all the ones you care about with your current batching > > support. Like you say, most apps will put plenty of buffers on the > > fill ring, so this should not be a problem. Can you not implement some > > slow path for these cases? You must have one for the skb case. > > > > > > Why does the Mellanox driver need a specific batch size? 
This is just > > > > for my understanding so we can find a good solution. > > > > > > The main reason is our performance feature called striding RQ. Skipping > > > all irrelevant details, a batch of 64 pages is posted to the NIC with a > > > single request, and the NIC fills them progressively. This feature is > > > turned on by default on modern NICs, and it's really good for > > > performance. > > > > > > It might be possible to post a smaller batch though, I'm not sure about > > > it, it needs to be checked, but anyway it's not something that we will > > > likely do, because it's a complication of the data path, and if we know > > > more frames are coming, it's much better for the performance to wait for > > > them, rather than to post several incomplete batches. > > > > > > > > So, in my vision, the decision to remove RX_BATCH_SIZE and periodic > > > > > updates of the consumer index was wrong. It totally breaks mlx5e, that > > > > > does batching, and it will affect the performance of any driver, because > > > > > the application can't refill the ring until it gets completely empty and > > > > > the driver starts waiting for frames. I suggest that periodic updates of > > > > > the consumer index should be readded to xskq_cons_peek_addr. > > > > > > > > The reason I wanted to remove RX_BATCH_SIZE is that application > > > > developers complained about it giving rise to counter intuitive > > > > behavior in user space. I will try to dig out the complaints and the > > > > explanations Björn and I had to send which it seemed that users really > > > > should not have to care about. It should just work. > > > > > > I think the counter that doesn't update till the very last moment and > > > then advances by the ring size will also be something to complain about > > > (and I am the first one to complain :D). Such bursts are not good in any > > > case. > > > > Do you have any performance data that shows this scheme is bad for > > performance? 
The good thing about this scheme is that global state is > > updated less frequently. And the bad thing is what you mentioned. But > > which one has the greatest effect, is the question. > > > > > > I also do not like > > > > to have arbitrary constants like this. Why 16? > > > > > > I believe any batching mechanism has a constant that looks arbitrary. > > > This constant should be the golden mean: if it's too small, there is > > > little effect from batching; if it's too big, it gets too bursty. > > > > > > Basically, after your patch it just changed from 16 to the ring size. > > > Maybe we can tie that constant to ring size? Make it ring_size / > > > another_arbitrary_constant? :) > > > > Agreed, I thought about this too. Something tied to ring_size might be > > a good compromise. Will examine this. But I want to base this on > > performance data not idle speculation, so need to run some experiments > > first. > > > > /Magnus > > > > > > Would much prefer not > > > > having to deal with this, unless of course it horribly breaks > > > > something or gives rise to worse performance. Might still be the case > > > > here, but if not, I would like to remove it. > > > > > > > > Thanks: Magnus > > > > > > > > > Magnus, what do you think of the suggestion above? > > > > > > > > > > Thanks, > > > > > Max > > > > > > > > > > >> > > > > > >> I tracked this issue in our internal bug tracker in case we need to > > > > > >> perform actual debugging of mlx5e. I'm looking forward to your feedback > > > > > >> on my assumption above. > > > > > >> > > > > > >>>> The hardware I am using is a Mellanox ConnectX4 2x100G card (MCX416A- > > > > CCAT) > > > > > >>>> running the mlx5 driver. > > > > > >> > > > > > >> This one should run without striding RQ, please verify it with ethtool > > > > > >> --show-priv-flags (the flag name is rx_striding_rq). > > > > > > > > > > > > I do not remember changing this option, so whatever the default is, is > > > > what it > > > > > > was running with. 
I am traveling this week and do not have access to > > > > these > > > > > > systems, but will ensure that this flag is set properly when I get back. > > > > > > > > > > > ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: zero-copy between interfaces 2020-02-04 16:10 ` Magnus Karlsson @ 2020-02-05 13:31 ` Magnus Karlsson 2020-02-06 14:56 ` Maxim Mikityanskiy 0 siblings, 1 reply; 49+ messages in thread From: Magnus Karlsson @ 2020-02-05 13:31 UTC (permalink / raw) To: Maxim Mikityanskiy Cc: Ryan Goodfellow, xdp-newbies, Tariq Toukan, Saeed Mahameed, Moshe Shemesh On Tue, Feb 4, 2020 at 5:10 PM Magnus Karlsson <magnus.karlsson@gmail.com> wrote: > > On Thu, Jan 30, 2020 at 12:40 PM Magnus Karlsson > <magnus.karlsson@gmail.com> wrote: > > > > On Thu, Jan 30, 2020 at 10:59 AM Magnus Karlsson > > <magnus.karlsson@gmail.com> wrote: > > > > > > On Thu, Jan 30, 2020 at 10:37 AM Maxim Mikityanskiy > > > <maximmi@mellanox.com> wrote: > > > > > > > > > -----Original Message----- > > > > > From: Magnus Karlsson <magnus.karlsson@gmail.com> > > > > > Sent: 27 January, 2020 17:55 > > > > > To: Maxim Mikityanskiy <maximmi@mellanox.com> > > > > > Cc: Ryan Goodfellow <rgoodfel@isi.edu>; xdp-newbies@vger.kernel.org; Tariq > > > > > Toukan <tariqt@mellanox.com>; Saeed Mahameed <saeedm@mellanox.com>; Moshe > > > > > Shemesh <moshe@mellanox.com> > > > > > Subject: Re: zero-copy between interfaces > > > > > > > > > > On Mon, Jan 27, 2020 at 3:01 PM Maxim Mikityanskiy <maximmi@mellanox.com> > > > > > wrote: > > > > > > > > > > > > On 2020-01-22 23:43, Ryan Goodfellow wrote: > > > > > > > On Tue, Jan 21, 2020 at 01:40:50PM +0000, Maxim Mikityanskiy wrote: > > > > > > >>>> I've posted output from the program in debugging mode here > > > > > > >>>> > > > > > > >>>> - https://gitlab.com/mergetb/tech/network- > > > > > emulation/kernel/snippets/1930375 > > > > > > >>>> > > > > > > >>>> Yes, you are correct in that forwarding works for a brief period and > > > > > then stops. > > > > > > >>>> I've noticed that the number of packets that are forwarded is equal > > > > > to the size > > > > > > >>>> of the producer/consumer descriptor rings. 
I've posted two ping > > > > > traces from a > > > > > > >>>> client ping that shows this. > > > > > > >>>> > > > > > > >>>> - https://gitlab.com/mergetb/tech/network- > > > > > emulation/kernel/snippets/1930376 > > > > > > >>>> - https://gitlab.com/mergetb/tech/network- > > > > > emulation/kernel/snippets/1930377 > > > > > > >> > > > > > > >> These snippets are not available. > > > > > > > > > > > > > > Apologies, I had the wrong permissions set. They should be available > > > > > now. > > > > > > > > > > > > > >> > > > > > > >>>> > > > > > > >>>> I've also noticed that when the forwarding stops, the CPU usage for > > > > > the proc > > > > > > >>>> running the program is pegged, which is not the norm for this program > > > > > as it uses > > > > > > >>>> a poll call with a timeout on the xsk fd. > > > > > > >> > > > > > > >> This information led me to a guess what may be happening. On the RX > > > > > > >> side, mlx5e allocates pages in bulks for performance reasons and to > > > > > > >> leverage hardware features targeted to performance. In AF_XDP mode, > > > > > > >> bulking of frames is also used (on x86, the bulk size is 64 with > > > > > > >> striding RQ enabled, and 8 otherwise, however, it's implementation > > > > > > >> details that might change later). If you don't put enough frames to XSK > > > > > > >> Fill Ring, the driver will be demanding more frames and return from > > > > > > >> poll() immediately. Basically, in the application, you should put as > > > > > > >> many frames to the Fill Ring as you can. Please check if that could be > > > > > > >> the root cause of your issue. > > > > > > > > > > > > > > The code in this application makes an effort to replenish the fill ring > > > > > as fast > > > > > > > as possible. 
The basic loop of the application is to first check if > > > > > there are > > > > > > > any descriptors to be consumed from the completion queue or any > > > > > descriptors that > > > > > > > can be added to the fill queue, and only then to move on to moving > > > > > packets > > > > > > > through the rx and tx rings. > > > > > > > > > > > > > > https://gitlab.com/mergetb/tech/network-emulation/kernel/blob/v5.5- > > > > > moa/samples/bpf/xdpsock_multidev.c#L452-474 > > > > > > > > > > > > I reproduced your issue and did my investigation, and here is what I > > > > > found: > > > > > > > > > > > > 1. Commit df0ae6f78a45 (that you found during bisect) introduces an > > > > > > important behavioral change (which I thought was not that important). > > > > > > xskq_nb_avail used to return min(entries, dcnt), but after the change it > > > > > > just returns entries, which may be as big as the ring size. > > > > > > > > > > > > 2. xskq_peek_addr updates q->ring->consumer only when q->cons_tail > > > > > > catches up with q->cons_head. So, before that patch and one previous > > > > > > patch, cons_head - cons_tail was not more than 16, so the consumer index > > > > > > was updated periodically. Now consumer is updated only when the whole > > > > > > ring is exhausted. > > > > > > > > > > > > 3. The application can't replenish the fill ring if the consumer index > > > > > > doesn't move. As a consequence, refilling the descriptors by the > > > > > > application can't happen in parallel with using them by the driver. It > > > > > > should have some performance penalty and possibly even lead to packet > > > > > > drops, because the driver uses all the descriptors and only then > > > > > > advances the consumer index, and then it has to wait until the > > > > > > application refills the ring, busy-looping and losing packets. > > > > > > > > > > This will happen if user space always fills up the whole ring, which > > > > > might or might not happen all depending on the app. 
> > > > > > > > Yes, that's right, and as far as I know, it's common to fill as many > > > > frames as the application can (there was no reason to do it other way > > > > till now). > > > > > > > > > With that said, it > > > > > might provide better performance to update it once in a while. It > > > > > might also be the case that we will get better performance with the > > > > > new scheme if we only fill half the fill ring. > > > > > > > > Yes, it may improve performance. However, I don't think it's correct to > > > > set such a limitation on the app. > > > > Actually, a much worse limitation to put on an application is to say > > that you have to have a certain amount of buffers on some ring for the > > zero-copy feature to work. For example that we need at least 64 > > buffers on the fill ring for all the NIC cards to work in zero-copy > > mode. That would be a bad thing to have to put in the documentation. > > An OS is supposed to abstract away HW differences, and with this > > current limitation in your driver, it shines through for sure. What we > > would like to put in the documentation is a statement along the lines > > of: "for best performance, make sure you have plenty of buffers on the > > fill ring so that the NIC can work as efficiently as possible". Not a > > statement that it does not work on Mellanox unless you put enough > > buffers on the fill ring. So my advice (and wish) is that you fix this > > in your driver. With that said, I will still look into what is the > > best way to get at least the sample to work for you. But there is no > > way to make sure every single app works for you in zero-copy mode, > > unless you support arbitrary amount of buffers on the fill ring. I > > guess that sooner or later, a customer of yours will get into this > > situation one way or the other, so why not fix it now. > > > > /Magnus > > > > > I will examine what provides the best performance. 
On one hand it is > > > the number of updates to shared ring state (minimized by current > > > scheme) and the ability for the user app to put buffers on the fill > > > ring. Stating that putting e.g. half the packets on the fill ring > > > provides better performance (for some application) is not a > > > limitation. It is performance tuning advice :-). > > I have now made a set of measurements. First I just made a variation > study using the xdpsock app, varying the amount of packets the kernel > can withdraw from a consumer ring (fill and Tx) before updating global > state. For the 1 core case (app and driver on the same core) the more > frequently you do this update, the better. The reason for this is that > it costs very little to update the state since the application is not > running. And it is beneficial for the app to have a freshly updated > state when it starts to execute as it can operate on more packets. For > the 2 core case (app on one core, driver on another) it is the > complete opposite: the fewer updates to global state, the better. The > reason for this is that it costs a lot to update global state as it > triggers cache coherency actions between the two cores. > > What I did then was to compare the current scheme, update only when > grabbing new packets, to a new scheme where we also update the global > consumer pointer when we are exiting Rx or Tx processing in the NAPI > context. On two cores the current scheme gets 0.5 to 1 Mpps more in > throughput than also updating the pointer at the end of NAPI. But for > the 1 core case, the new scheme is the best and generates between 0.2 and > 0.3 Mpps more in throughput than the current one. But all in all, the > current scheme is more beneficial than the proposed one if we say that > both the 1 core and the 2 core cases are equally important. 
> > Something to note is that the xdpsock application only puts batch size > (64) of packets in the fill ring in every iteration, and this might > lead to some good pacing for the current scheme and the 2 core case. > I.e., we do not get into the case of the fill ring only being full or > empty. But I will run this on some real apps to get some more results, > and I know that Björn has an optimized xdpsock application that puts > many more packets into the fill queue than 64. This optimization might > actually make the new proposal (also updating at the end of NAPI) be > better and make the current scheme suffer. We will examine this > further and get back. Actually, after some more thought and discussions I think we should optimize for the 1 core case, since that is what gives the whole system the best performance, provided that you can scale your application with instantiation that is. For a 4 core system, 4 x the 1 core performance > 2 x 2 core performance by a lot. I think that the 1 core case is the one that is going to be used out there. At least that is what I hear and see. So, when the merge window opens I am going to submit a patch that updates the consumer pointer when we exit NAPI too. Will increase the performance of the 1 core case. /Magnus > /Magnus > > > > > > I will look into this > > > > > and see what I get. > > > > > > > > > > > 4. As mlx5e allocates frames in batches, the consequences are even more > > > > > > severe: it's a deadlock where the driver waits for the application, and > > > > > > vice versa. The driver never reaches the point where cons_tail gets > > > > > > equal to cons_head. E.g., if cons_tail + 3 == cons_head, and the batch > > > > > > size requested by the driver is 8, the driver won't peek anything from > > > > > > the fill ring waiting for difference between cons_tail and cons_head to > > > > > > increase to be at least 8. 
On the other hand, the application can't put > > > > > > anything to the ring, because it still thinks that the consumer index is > > > > > > 0. As cons_tail never reaches cons_head, the consumer index doesn't get > > > > > > updated, hence the deadlock. > > > > > > > > > > Good thing that you detected this. Maybe I should get a Mellanox card > > > > > :-). This is different from how we implemented Intel's drivers that > > > > > just work on any batch size. If it gets 3 packets back, it will use > > > > > those. How do you deal with the case when the application just puts a > > > > > single buffer in the fill ring and wants to receive a single packet? > > > > > > > > mlx5e will wait until the full batch is available. As AF_XDP is intended > > > > for high-performance apps, this scenario is less expected. We prefer to > > > > leverage our performance features. > > > > > > That you cannot support all applications on top of AF_XDP with your > > > zero-copy support seems broken to me. But I give you that you might > > > support all the ones you care about with your current batching > > > support. Like you say, most apps will put plenty of buffers on the > > > fill ring, so this should not be a problem. Can you not implement some > > > slow path for these cases? You must have one for the skb case. > > > > > > > > Why does the Mellanox driver need a specific batch size? This is just > > > > > for my understanding so we can find a good solution. > > > > > > > > The main reason is our performance feature called striding RQ. Skipping > > > > all irrelevant details, a batch of 64 pages is posted to the NIC with a > > > > single request, and the NIC fills them progressively. This feature is > > > > turned on by default on modern NICs, and it's really good for > > > > performance. 
> > > > > > > > It might be possible to post a smaller batch though, I'm not sure about > > > > it, it needs to be checked, but anyway it's not something that we will > > > > likely do, because it's a complication of the data path, and if we know > > > > more frames are coming, it's much better for the performance to wait for > > > > them, rather than to post several incomplete batches. > > > > > > > > > > So, in my vision, the decision to remove RX_BATCH_SIZE and periodic > > > > > > updates of the consumer index was wrong. It totally breaks mlx5e, which > > > > > > does batching, and it will affect the performance of any driver, because > > > > > > the application can't refill the ring until it gets completely empty and > > > > > > the driver starts waiting for frames. I suggest that periodic updates of > > > > > > the consumer index should be re-added to xskq_cons_peek_addr. > > > > > > > > > > The reason I wanted to remove RX_BATCH_SIZE is that application > > > > > developers complained about it giving rise to counterintuitive > > > > > behavior in user space. I will try to dig out the complaints and the > > > > > explanations Björn and I had to send, which it seemed that users really > > > > > should not have to care about. It should just work. > > > > > > > > I think the counter that doesn't update till the very last moment and > > > > then advances by the ring size will also be something to complain about > > > > (and I am the first one to complain :D). Such bursts are not good in any > > > > case. > > > Do you have any performance data that shows this scheme is bad for > > > performance? The good thing about this scheme is that global state is > > > updated less frequently. And the bad thing is what you mentioned. But > > > which one has the greatest effect, is the question. > > > > > > > > I also do not like > > > > > to have arbitrary constants like this. Why 16? > > > > I believe any batching mechanism has a constant that looks arbitrary. 
> > > > This constant should be the golden mean: if it's too small, there is > > > > little effect from batching; if it's too big, it gets too bursty. > > > > > > > > Basically, after your patch it just changed from 16 to the ring size. > > > > Maybe we can tie that constant to ring size? Make it ring_size / > > > > another_arbitrary_constant? :) > > > > > > Agreed, I thought about this too. Something tied to ring_size might be > > > a good compromise. Will examine this. But I want to base this on > > > performance data not idle speculation, so need to run some experiments > > > first. > > > > > > /Magnus > > > > > > > > Would much prefer not > > > > > having to deal with this, unless of course it horribly breaks > > > > > something or gives rise to worse performance. Might still be the case > > > > > here, but if not, I would like to remove it. > > > > > > > > > > Thanks: Magnus > > > > > > > > > > > Magnus, what do you think of the suggestion above? > > > > > > > > > > > > Thanks, > > > > > > Max > > > > > > > > > > > > >> > > > > > > >> I tracked this issue in our internal bug tracker in case we need to > > > > > > >> perform actual debugging of mlx5e. I'm looking forward to your feedback > > > > > > >> on my assumption above. > > > > > > >> > > > > > > >>>> The hardware I am using is a Mellanox ConnectX4 2x100G card (MCX416A- > > > > > CCAT) > > > > > > >>>> running the mlx5 driver. > > > > > > >> > > > > > > >> This one should run without striding RQ, please verify it with ethtool > > > > > > >> --show-priv-flags (the flag name is rx_striding_rq). > > > > > > > > > > > > > > I do not remember changing this option, so whatever the default is, is > > > > > what it > > > > > > > was running with. I am traveling this week and do not have access to > > > > > these > > > > > > > systems, but will ensure that this flag is set properly when I get back. > > > > > > > > > > > > > ^ permalink raw reply [flat|nested] 49+ messages in thread
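[Editor's note] The deadlock described in the quoted exchange above (cons_tail + 3 == cons_head, driver batch of 8, shared consumer index stuck at 0) can be reproduced with a small C model. This is a sketch of the mechanism only; the field names follow the mail, not the actual mlx5e or xsk_queue implementation.

```c
#include <assert.h>

#define RING_SIZE 8

struct ring {
    unsigned int producer;   /* shared: written by the application */
    unsigned int consumer;   /* shared: published back by the kernel */
};

struct kernel_q {
    unsigned int cons_tail;  /* entries actually consumed by the driver */
    unsigned int cons_head;  /* entries known to be available */
};

/* Driver side: a batching driver peeks 'batch' descriptors at once or
 * nothing at all, as described for striding RQ above. */
static unsigned int peek_batch(struct ring *r, struct kernel_q *q,
                               unsigned int batch)
{
    if (q->cons_head - q->cons_tail < batch)
        q->cons_head = r->producer;      /* refresh from shared state */
    if (q->cons_head - q->cons_tail < batch)
        return 0;                        /* wait for a full batch */
    return batch;
}

/* Application side: the ring looks full because the shared consumer
 * index only moves when cons_tail catches up with cons_head, which
 * never happens while the driver is waiting for a full batch. */
static int app_can_post(const struct ring *r)
{
    return (r->producer - r->consumer) < RING_SIZE;
}
```

With the ring full (producer == RING_SIZE, consumer == 0) and only 3 entries between cons_tail and cons_head, `peek_batch` returns 0 and `app_can_post` returns 0: the driver waits for the application and the application waits for the driver.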
* Re: zero-copy between interfaces 2020-02-05 13:31 ` Magnus Karlsson @ 2020-02-06 14:56 ` Maxim Mikityanskiy 2020-02-07 9:01 ` Magnus Karlsson 0 siblings, 1 reply; 49+ messages in thread From: Maxim Mikityanskiy @ 2020-02-06 14:56 UTC (permalink / raw) To: Magnus Karlsson Cc: Ryan Goodfellow, xdp-newbies, Tariq Toukan, Saeed Mahameed, Moshe Shemesh On 2020-02-05 15:31, Magnus Karlsson wrote: > On Tue, Feb 4, 2020 at 5:10 PM Magnus Karlsson > <magnus.karlsson@gmail.com> wrote: >> >> On Thu, Jan 30, 2020 at 12:40 PM Magnus Karlsson >> <magnus.karlsson@gmail.com> wrote: >>> >>> On Thu, Jan 30, 2020 at 10:59 AM Magnus Karlsson >>> <magnus.karlsson@gmail.com> wrote: >>>> >>>> On Thu, Jan 30, 2020 at 10:37 AM Maxim Mikityanskiy >>>> <maximmi@mellanox.com> wrote: >>>>> >>>>>> -----Original Message----- >>>>>> From: Magnus Karlsson <magnus.karlsson@gmail.com> >>>>>> Sent: 27 January, 2020 17:55 >>>>>> To: Maxim Mikityanskiy <maximmi@mellanox.com> >>>>>> Cc: Ryan Goodfellow <rgoodfel@isi.edu>; xdp-newbies@vger.kernel.org; Tariq >>>>>> Toukan <tariqt@mellanox.com>; Saeed Mahameed <saeedm@mellanox.com>; Moshe >>>>>> Shemesh <moshe@mellanox.com> >>>>>> Subject: Re: zero-copy between interfaces >>>>>> >>>>>> On Mon, Jan 27, 2020 at 3:01 PM Maxim Mikityanskiy <maximmi@mellanox.com> >>>>>> wrote: >>>>>>> >>>>>>> On 2020-01-22 23:43, Ryan Goodfellow wrote: >>>>>>>> On Tue, Jan 21, 2020 at 01:40:50PM +0000, Maxim Mikityanskiy wrote: >>>>>>>>>>> I've posted output from the program in debugging mode here >>>>>>>>>>> >>>>>>>>>>> - https://gitlab.com/mergetb/tech/network- >>>>>> emulation/kernel/snippets/1930375 >>>>>>>>>>> >>>>>>>>>>> Yes, you are correct in that forwarding works for a brief period and >>>>>> then stops. >>>>>>>>>>> I've noticed that the number of packets that are forwarded is equal >>>>>> to the size >>>>>>>>>>> of the producer/consumer descriptor rings. I've posted two ping >>>>>> traces from a >>>>>>>>>>> client ping that shows this. 
>>>>>>>>>>> >>>>>>>>>>> - https://gitlab.com/mergetb/tech/network- >>>>>> emulation/kernel/snippets/1930376 >>>>>>>>>>> - https://gitlab.com/mergetb/tech/network- >>>>>> emulation/kernel/snippets/1930377 >>>>>>>>> >>>>>>>>> These snippets are not available. >>>>>>>> >>>>>>>> Apologies, I had the wrong permissions set. They should be available >>>>>> now. >>>>>>>> >>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> I've also noticed that when the forwarding stops, the CPU usage for >>>>>> the proc >>>>>>>>>>> running the program is pegged, which is not the norm for this program >>>>>> as it uses >>>>>>>>>>> a poll call with a timeout on the xsk fd. >>>>>>>>> >>>>>>>>> This information led me to a guess about what may be happening. On the RX >>>>>>>>> side, mlx5e allocates pages in bulk for performance reasons and to >>>>>>>>> leverage hardware features targeted to performance. In AF_XDP mode, >>>>>>>>> bulking of frames is also used (on x86, the bulk size is 64 with >>>>>>>>> striding RQ enabled, and 8 otherwise; however, these are implementation >>>>>>>>> details that might change later). If you don't put enough frames to the XSK >>>>>>>>> Fill Ring, the driver will be demanding more frames and return from >>>>>>>>> poll() immediately. Basically, in the application, you should put as >>>>>>>>> many frames to the Fill Ring as you can. Please check if that could be >>>>>>>>> the root cause of your issue. >>>>>>>> >>>>>>>> The code in this application makes an effort to replenish the fill ring >>>>>> as fast >>>>>>>> as possible. The basic loop of the application is to first check if >>>>>> there are >>>>>>>> any descriptors to be consumed from the completion queue or any >>>>>> descriptors that >>>>>>>> can be added to the fill queue, and only then to move on to moving >>>>>> packets >>>>>>>> through the rx and tx rings. 
>>>>>>>> >>>>>>>> https://gitlab.com/mergetb/tech/network-emulation/kernel/blob/v5.5- >>>>>> moa/samples/bpf/xdpsock_multidev.c#L452-474 >>>>>>> >>>>>>> I reproduced your issue and did my investigation, and here is what I >>>>>> found: >>>>>>> >>>>>>> 1. Commit df0ae6f78a45 (that you found during bisect) introduces an >>>>>>> important behavioral change (which I thought was not that important). >>>>>>> xskq_nb_avail used to return min(entries, dcnt), but after the change it >>>>>>> just returns entries, which may be as big as the ring size. >>>>>>> >>>>>>> 2. xskq_peek_addr updates q->ring->consumer only when q->cons_tail >>>>>>> catches up with q->cons_head. So, before that patch and one previous >>>>>>> patch, cons_head - cons_tail was not more than 16, so the consumer index >>>>>>> was updated periodically. Now consumer is updated only when the whole >>>>>>> ring is exhausted. >>>>>>> >>>>>>> 3. The application can't replenish the fill ring if the consumer index >>>>>>> doesn't move. As a consequence, refilling the descriptors by the >>>>>>> application can't happen in parallel with using them by the driver. It >>>>>>> should have some performance penalty and possibly even lead to packet >>>>>>> drops, because the driver uses all the descriptors and only then >>>>>>> advances the consumer index, and then it has to wait until the >>>>>>> application refills the ring, busy-looping and losing packets. >>>>>> >>>>>> This will happen if user space always fills up the whole ring, which >>>>>> might or might not happen all depending on the app. >>>>> >>>>> Yes, that's right, and as far as I know, it's common to fill as many >>>>> frames as the application can (there was no reason to do it other way >>>>> till now). >>>>> >>>>>> With that said, it >>>>>> might provide better performance to update it once in a while. It >>>>>> might also be the case that we will get better performance with the >>>>>> new scheme if we only fill half the fill ring. 
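[Editor's note] The behavioral change attributed to commit df0ae6f78a45 above can be summarized in C. This is a simplified sketch, not the kernel's `xskq_nb_avail`: before the change, the number of available entries was capped at `RX_BATCH_SIZE` (16), so `cons_head` never ran more than 16 ahead of `cons_tail` and the shared consumer index was published at least every 16 frames; after it, the count can grow to the ring size, so the index moves only when the ring is fully drained.

```c
#include <assert.h>

#define RX_BATCH_SIZE 16   /* the "arbitrary" 16 debated in the thread */

/* Old behavior: min(entries, RX_BATCH_SIZE). */
static unsigned int nb_avail_old(unsigned int producer, unsigned int cons_tail)
{
    unsigned int entries = producer - cons_tail;

    return entries < RX_BATCH_SIZE ? entries : RX_BATCH_SIZE;
}

/* New behavior: just the raw entry count, possibly the whole ring. */
static unsigned int nb_avail_new(unsigned int producer, unsigned int cons_tail)
{
    return producer - cons_tail;
}
```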
>>>>> Yes, it may improve performance. However, I don't think it's correct to >>>>> set such a limitation on the app. >>> >>> Actually, a much worse limitation to put on an application is to say >>> that you have to have a certain amount of buffers on some ring for the >>> zero-copy feature to work. For example that we need at least 64 >>> buffers on the fill ring for all the NIC cards to work in zero-copy >>> mode. That would be a bad thing to have to put in the documentation. >>> An OS is supposed to abstract away HW differences, and with this >>> current limitation in your driver, it shines through for sure. What we >>> would like to put in the documentation is a statement along the lines >>> of: "for best performance, make sure you have plenty of buffers on the >>> fill ring so that the NIC can work as efficiently as possible". Not a >>> statement that it does not work on Mellanox unless you put enough >>> buffers on the fill ring. So my advice (and wish) is that you fix this >>> in your driver. With that said, I will still look into what is the >>> best way to get at least the sample to work for you. But there is no >>> way to make sure every single app works for you in zero-copy mode, >>> unless you support an arbitrary amount of buffers on the fill ring. I >>> guess that sooner or later, a customer of yours will get into this >>> situation one way or the other, so why not fix it now. Hi Magnus, We had an internal discussion about batching and AF_XDP in mlx5e. There are two types of RX queues supported by mlx5e: striding RQ and legacy RQ. Which type of RQ is used depends on the configuration, hardware support and defaults for different NICs, but basically in cases when striding RQ is enabled by default, it's faster than legacy RQ, and this is the case for modern NICs. All RX queues created in the driver are of the same type. 
Striding RQ has a requirement of allocating in batches, and the batch size is specified on queue creation, so there is no fallback possible for this case. Legacy RQ, on the other hand, does not require batching in XDP use cases, but now we do it anyway for performance reasons and for code unification with non-XDP queues. I understand your concern that the current API doesn't provide a completely opaque abstraction over the driver. However, we can't just throw away an important performance feature (striding RQ) to support some exotic case of a fill ring with a single frame only. AF_XDP is a framework for high-performance applications, so it's extremely unlikely that an AF_XDP application will only need to receive a single packet. Such applications just won't need AF_XDP. So, given that the issue can't be fixed without disabling striding RQ, and disabling striding RQ will just reduce the performance of all real-world applications, we decided to keep things as is for now, and if a customer complains about it, we will suggest that they disable striding RQ in their configuration, and we'll consider an option of disabling batching in legacy RQ for AF_XDP. BTW, if the current API can't provide a good enough abstraction over advanced features of mlx5e, maybe we should extend the API somehow? E.g., when need_wakeup for RX goes to "yes", also tell how many frames need to be refilled? >>> /Magnus >>> >>>> I will examine what provides the best performance. On one hand it is >>>> the number of updates to shared ring state (minimized by current >>>> scheme) and the ability for the user app to but buffers on the fill >>>> ring. Stating that putting e.g. half the packets on the fill ring >>>> provides better performance (for some application) is not a >>>> limitation. It is performance tuning advise :-). >> >> I have now made a set of measurements. 
First I just made a variation >> study using the xdpsock app, varying the amount of packets the kernel >> can withdraw from a consumer ring (fill and Tx) before updating global >> state. For the 1 core case (app and driver on the same core) the more >> frequent you do this update, the better. The reason for this is that >> it costs very little to update the state since the application is not >> running. And it is beneficial for the app to have a freshly updated >> state when it starts to execute as it can operate on more packets. For >> the 2 core case (app on one core, driver on another) it is the >> complete opposite: the fewer updates to global state, the better. The >> reason for this is that it costs a lot to update global state as it >> triggers cache coherency actions between the two cores. >> >> What I did then was to compare the current scheme, update only when >> grabbing new packets, to a new scheme were we also update the global >> consumer pointer when we are exiting Rx or Tx processing in the NAPI >> context. On two cores the current scheme gets 0.5 to 1 Mpps more in >> throughput than also updating the pointer at the end of NAPI. But for >> 1 core case, the new scheme is the best and generates between 0.2 and >> 0.3 Mpps more in throughput than the current one. But all in all, the >> current scheme is more beneficial than the proposed one if we say that >> both the 1 core and the 2 core case is equally important. >> >> Something to note is that the xdpsock application only puts batch size >> (64) of packets in the fill ring in every iteration, and this might >> lead to some good pacing for the current scheme and the 2 core case. >> I.e., we do not get into the case of the fill ring only being full or >> empty. But I will run this on some real apps to get some more results, >> and I know that Björn has an optimized xdpsock application that puts >> many more packets into the fill queue than 64. 
This optimization might >> actually make the new proposal (also updating at the end of NAPI) be >> better and make the current scheme suffer. We will examine this >> further and get back. > > Actually, after some more thought and discussions I think we should > optimize for the 1 core case, since that is what gives the whole > system the best performance, provided that you can scale your > application with instantiation that is. For a 4 core system, 4 x the 1 > core performance > 2 x 2 core performance by a lot. I think that the 1 > core case is the one that is going to be used out there. At least that > is what I hear and see. > > So, when the merge window opens I am going to submit a patch that > updates the consumer pointer when we exit NAPI too. Will increase the > performance of the 1 core case. That sounds good to me. It doesn't make sense to update it multiple times per NAPI (in the single core case the application won't run at that time anyway), so once per NAPI is the most frequent, and according to your experiments it should be the most efficient. It should make mlx5e work again. One concern though: you say you are going to submit it to -next, but a kernel with your patches has been released, and it has broken AF_XDP support in mlx5e. I can send a small fix to net that will revert the behavior back to updating the consumer index once every 16 frames (so it makes mlx5e work again), and your patch will go on top of my bugfix. Does it sound right to you? Thanks for taking time to do the tests! > /Magnus > >> /Magnus >> >>>>>> I will look into this >>>>>> and see what I get. >>>>>> >>>>>>> 4. As mlx5e allocates frames in batches, the consequences are even more >>>>>>> severe: it's a deadlock where the driver waits for the application, and >>>>>>> vice versa. The driver never reaches the point where cons_tail gets >>>>>>> equal to cons_head. 
E.g., if cons_tail + 3 == cons_head, and the batch >>>>>>> size requested by the driver is 8, the driver won't peek anything from >>>>>>> the fill ring waiting for difference between cons_tail and cons_head to >>>>>>> increase to be at least 8. On the other hand, the application can't put >>>>>>> anything to the ring, because it still thinks that the consumer index is >>>>>>> 0. As cons_tail never reaches cons_head, the consumer index doesn't get >>>>>>> updated, hence the deadlock. >>>>>> >>>>>> Good thing that you detected this. Maybe I should get a Mellanox card >>>>>> :-). This is different from how we implemented Intel's drivers that >>>>>> just work on any batch size. If it gets 3 packets back, it will use >>>>>> those. How do you deal with the case when the application just puts a >>>>>> single buffer in the fill ring and wants to receive a single packet? >>>>> >>>>> mlx5e will wait until the full batch is available. As AF_XDP is intended >>>>> for high-performance apps, this scenario is less expected. We prefer to >>>>> leverage our performance features. >>>> >>>> That you cannot support all applications on top of AF_XDP with your >>>> zero-copy support seems broken to me. But I give you that you might >>>> support all the ones you care about with your current batching >>>> support. Like you say, most apps will put plenty of buffers on the >>>> fill ring, so this should not be a problem. Can you not implement some >>>> slow path for these cases? You must have one for the skb case. >>>> >>>>>> Why does the Mellanox driver need a specific batch size? This is just >>>>>> for my understanding so we can find a good solution. >>>>> >>>>> The main reason is our performance feature called striding RQ. Skipping >>>>> all irrelevant details, a batch of 64 pages is posted to the NIC with a >>>>> single request, and the NIC fills them progressively. This feature is >>>>> turned on by default on modern NICs, and it's really good for >>>>> performance. 
>>>>> >>>>> It might be possible to post a smaller batch though, I'm not sure about >>>>> it, it needs to be checked, but anyway it's not something that we will >>>>> likely do, because it's a complication of the data path, and if we know >>>>> more frames are coming, it's much better for the performance to wait for >>>>> them, rather than to post several incomplete batches. >>>>> >>>>>>> So, in my vision, the decision to remove RX_BATCH_SIZE and periodic >>>>>>> updates of the consumer index was wrong. It totally breaks mlx5e, that >>>>>>> does batching, and it will affect the performance of any driver, because >>>>>>> the application can't refill the ring until it gets completely empty and >>>>>>> the driver starts waiting for frames. I suggest that periodic updates of >>>>>>> the consumer index should be readded to xskq_cons_peek_addr. >>>>>> >>>>>> The reason I wanted to remove RX_BATCH_SIZE is that application >>>>>> developers complained about it giving rise to counter intuitive >>>>>> behavior in user space. I will try to dig out the complaints and the >>>>>> explanations Björn and I had to send which it seemed that users really >>>>>> should not have to care about. It should just work. >>>>> >>>>> I think the counter that doesn't update till the very last moment and >>>>> then advances by the ring size will also be something to complain about >>>>> (and I am the first one to complain :D). Such bursts are not good in any >>>>> case. >>>> >>>> Do you have any performance data that shows this scheme is bad for >>>> performance? The good thing about this scheme is that global state is >>>> updated less frequently. And the bad thing is what you mentioned. But >>>> which one has the greatest effect, is the question. >>>> >>>>>> I also do not like >>>>>> to have arbitrary constants like this. Why 16? >>>>> >>>>> I believe any batching mechanism has a constant that look arbitrary. 
>>>>> This constant should be the golden mean: if it's too small, there is >>>>> little effect from batching; if it's too big, it gets too bursty. >>>>> >>>>> Basically, after your patch it just changed from 16 to the ring size. >>>>> Maybe we can tie that constant to ring size? Make it ring_size / >>>>> another_arbitrary_constant? :) >>>> >>>> Agreed, I thought about this too. Something tied to ring_size might be >>>> a good compromise. Will examine this. But I want to base this on >>>> performance data not idle speculation, so need to run some experiments >>>> first. >>>> >>>> /Magnus >>>> >>>>>> Would much prefer not >>>>>> having to deal with this, unless of course it horribly breaks >>>>>> something or gives rise to worse performance. Might still be the case >>>>>> here, but if not, I would like to remove it. >>>>>> >>>>>> Thanks: Magnus >>>>>> >>>>>>> Magnus, what do you think of the suggestion above? >>>>>>> >>>>>>> Thanks, >>>>>>> Max >>>>>>> >>>>>>>>> >>>>>>>>> I tracked this issue in our internal bug tracker in case we need to >>>>>>>>> perform actual debugging of mlx5e. I'm looking forward to your feedback >>>>>>>>> on my assumption above. >>>>>>>>> >>>>>>>>>>> The hardware I am using is a Mellanox ConnectX4 2x100G card (MCX416A- >>>>>> CCAT) >>>>>>>>>>> running the mlx5 driver. >>>>>>>>> >>>>>>>>> This one should run without striding RQ, please verify it with ethtool >>>>>>>>> --show-priv-flags (the flag name is rx_striding_rq). >>>>>>>> >>>>>>>> I do not remember changing this option, so whatever the default is, is >>>>>> what it >>>>>>>> was running with. I am traveling this week and do not have access to >>>>>> these >>>>>>>> systems, but will ensure that this flag is set properly when I get back. >>>>>>>> >>>>>>> ^ permalink raw reply [flat|nested] 49+ messages in thread
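[Editor's note] The "tie the constant to ring_size" idea batted around above could look like the following. This is a hypothetical sketch, not kernel code; the divisor 4 is a placeholder, exactly the kind of arbitrary constant the thread is debating (before the patch it was a fixed RX_BATCH_SIZE of 16, after it effectively the ring size).

```c
/* Derive the consumer-update interval from the ring size instead of a
 * fixed batch constant. Divisor 4 is an illustrative placeholder. */
static unsigned int update_threshold(unsigned int ring_size)
{
    unsigned int t = ring_size / 4;

    return t ? t : 1;   /* never drop below one entry for tiny rings */
}
```

A mid-point like this bounds burstiness (the shared consumer index never lags by more than a quarter of the ring) while still amortizing the cost of the shared-state write, which is the trade-off the measurements above are probing.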
* Re: zero-copy between interfaces 2020-02-06 14:56 ` Maxim Mikityanskiy @ 2020-02-07 9:01 ` Magnus Karlsson 0 siblings, 0 replies; 49+ messages in thread From: Magnus Karlsson @ 2020-02-07 9:01 UTC (permalink / raw) To: Maxim Mikityanskiy Cc: Ryan Goodfellow, xdp-newbies, Tariq Toukan, Saeed Mahameed, Moshe Shemesh On Thu, Feb 6, 2020 at 3:56 PM Maxim Mikityanskiy <maximmi@mellanox.com> wrote: > > On 2020-02-05 15:31, Magnus Karlsson wrote: > > On Tue, Feb 4, 2020 at 5:10 PM Magnus Karlsson > > <magnus.karlsson@gmail.com> wrote: > >> > >> On Thu, Jan 30, 2020 at 12:40 PM Magnus Karlsson > >> <magnus.karlsson@gmail.com> wrote: > >>> > >>> On Thu, Jan 30, 2020 at 10:59 AM Magnus Karlsson > >>> <magnus.karlsson@gmail.com> wrote: > >>>> > >>>> On Thu, Jan 30, 2020 at 10:37 AM Maxim Mikityanskiy > >>>> <maximmi@mellanox.com> wrote: > >>>>> > >>>>>> -----Original Message----- > >>>>>> From: Magnus Karlsson <magnus.karlsson@gmail.com> > >>>>>> Sent: 27 January, 2020 17:55 > >>>>>> To: Maxim Mikityanskiy <maximmi@mellanox.com> > >>>>>> Cc: Ryan Goodfellow <rgoodfel@isi.edu>; xdp-newbies@vger.kernel.org; Tariq > >>>>>> Toukan <tariqt@mellanox.com>; Saeed Mahameed <saeedm@mellanox.com>; Moshe > >>>>>> Shemesh <moshe@mellanox.com> > >>>>>> Subject: Re: zero-copy between interfaces > >>>>>> > >>>>>> On Mon, Jan 27, 2020 at 3:01 PM Maxim Mikityanskiy <maximmi@mellanox.com> > >>>>>> wrote: > >>>>>>> > >>>>>>> On 2020-01-22 23:43, Ryan Goodfellow wrote: > >>>>>>>> On Tue, Jan 21, 2020 at 01:40:50PM +0000, Maxim Mikityanskiy wrote: > >>>>>>>>>>> I've posted output from the program in debugging mode here > >>>>>>>>>>> > >>>>>>>>>>> - https://gitlab.com/mergetb/tech/network- > >>>>>> emulation/kernel/snippets/1930375 > >>>>>>>>>>> > >>>>>>>>>>> Yes, you are correct in that forwarding works for a brief period and > >>>>>> then stops. 
> >>>>>>>>>>> I've noticed that the number of packets that are forwarded is equal > >>>>>> to the size > >>>>>>>>>>> of the producer/consumer descriptor rings. I've posted two ping > >>>>>> traces from a > >>>>>>>>>>> client ping that shows this. > >>>>>>>>>>> > >>>>>>>>>>> - https://gitlab.com/mergetb/tech/network- > >>>>>> emulation/kernel/snippets/1930376 > >>>>>>>>>>> - https://gitlab.com/mergetb/tech/network- > >>>>>> emulation/kernel/snippets/1930377 > >>>>>>>>> > >>>>>>>>> These snippets are not available. > >>>>>>>> > >>>>>>>> Apologies, I had the wrong permissions set. They should be available > >>>>>> now. > >>>>>>>> > >>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>>> I've also noticed that when the forwarding stops, the CPU usage for > >>>>>> the proc > >>>>>>>>>>> running the program is pegged, which is not the norm for this program > >>>>>> as it uses > >>>>>>>>>>> a poll call with a timeout on the xsk fd. > >>>>>>>>> > >>>>>>>>> This information led me to a guess what may be happening. On the RX > >>>>>>>>> side, mlx5e allocates pages in bulks for performance reasons and to > >>>>>>>>> leverage hardware features targeted to performance. In AF_XDP mode, > >>>>>>>>> bulking of frames is also used (on x86, the bulk size is 64 with > >>>>>>>>> striding RQ enabled, and 8 otherwise, however, it's implementation > >>>>>>>>> details that might change later). If you don't put enough frames to XSK > >>>>>>>>> Fill Ring, the driver will be demanding more frames and return from > >>>>>>>>> poll() immediately. Basically, in the application, you should put as > >>>>>>>>> many frames to the Fill Ring as you can. Please check if that could be > >>>>>>>>> the root cause of your issue. > >>>>>>>> > >>>>>>>> The code in this application makes an effort to relenish the fill ring > >>>>>> as fast > >>>>>>>> as possible. 
The basic loop of the application is to first check if > >>>>>> there are > >>>>>>>> any descriptors to be consumed from the completion queue or any > >>>>>> descriptors that > >>>>>>>> can be added to the fill queue, and only then to move on to moving > >>>>>> packets > >>>>>>>> through the rx and tx rings. > >>>>>>>> > >>>>>>>> https://gitlab.com/mergetb/tech/network-emulation/kernel/blob/v5.5- > >>>>>> moa/samples/bpf/xdpsock_multidev.c#L452-474 > >>>>>>> > >>>>>>> I reproduced your issue and did my investigation, and here is what I > >>>>>> found: > >>>>>>> > >>>>>>> 1. Commit df0ae6f78a45 (that you found during bisect) introduces an > >>>>>>> important behavioral change (which I thought was not that important). > >>>>>>> xskq_nb_avail used to return min(entries, dcnt), but after the change it > >>>>>>> just returns entries, which may be as big as the ring size. > >>>>>>> > >>>>>>> 2. xskq_peek_addr updates q->ring->consumer only when q->cons_tail > >>>>>>> catches up with q->cons_head. So, before that patch and one previous > >>>>>>> patch, cons_head - cons_tail was not more than 16, so the consumer index > >>>>>>> was updated periodically. Now consumer is updated only when the whole > >>>>>>> ring is exhausted. > >>>>>>> > >>>>>>> 3. The application can't replenish the fill ring if the consumer index > >>>>>>> doesn't move. As a consequence, refilling the descriptors by the > >>>>>>> application can't happen in parallel with using them by the driver. It > >>>>>>> should have some performance penalty and possibly even lead to packet > >>>>>>> drops, because the driver uses all the descriptors and only then > >>>>>>> advances the consumer index, and then it has to wait until the > >>>>>>> application refills the ring, busy-looping and losing packets. > >>>>>> > >>>>>> This will happen if user space always fills up the whole ring, which > >>>>>> might or might not happen all depending on the app. 
> >>>>> > >>>>> Yes, that's right, and as far as I know, it's common to fill as many > >>>>> frames as the application can (there was no reason to do it other way > >>>>> till now). > >>>>> > >>>>>> With that said, it > >>>>>> might provide better performance to update it once in a while. It > >>>>>> might also be the case that we will get better performance with the > >>>>>> new scheme if we only fill half the fill ring. > >>>>> > >>>>> Yes, it may improve performance. However, I don't think it's correct to > >>>>> set such a limitation on the app. > >>> > >>> Actually, a much worse limitation to put on an application is to say > >>> that you have to have a certain amount of buffers on some ring for the > >>> zero-copy feature to work. For example that we need at least 64 > >>> buffers on the fill ring for all the NIC cards to work in zero-copy > >>> mode. That would be a bad thing to have to put in the documentation. > >>> An OS is supposed to abstract away HW differences, and with this > >>> current limitation in your driver, it shines through for sure. What we > >>> would like to put in the documentation is a statement along the lines > >>> of: "for best performance, make sure you have plenty of buffers on the > >>> fill ring so that the NIC can work as efficiently as possible". Not a > >>> statement that it does not work on Mellanox unless you put enough > >>> buffers on the fill ring. So my advice (and wish) is that you fix this > >>> in your driver. With that said, I will still look into what is the > >>> best way to get at least the sample to work for you. But there is no > >>> way to make sure every single app works for you in zero-copy mode, > >>> unless you support arbitrary amount of buffers on the fill ring. I > >>> guess that sooner or later, a customer of yours will get into this > >>> situation one way or the other, so why not fix it now. > > Hi Magnus, > > We made an internal discussion about batching and AF_XDP in mlx5e. 
> > There are two types of RX queues supported by mlx5e: striding RQ and > legacy RQ. Which type of RQ is used depends on the configuration, > hardware support and defaults for different NICs, but basically in cases > when striding RQ is enabled by default, it's faster than legacy RQ, and > this is the case for modern NICs. All RX queues created in the driver > are of the same type. Striding RQ has a requirement of allocating in > batches, and the batch size is specified on queue creation, so there is > no fallback possible for this case. Legacy RQ, on the other hand, does > not require batching in XDP use cases, but now we do it anyway for > performance reasons and for code unification with non-XDP queues. Thanks. I understand why this is a problem now. Let's see how we can work around it in the best way. > I understand your concern that the current API doesn't provide a > completely opaque abstraction over the driver. However, we can't just > throw away an important performance feature (striding RQ) to support > some exotic case of a fill ring with a single frame only. AF_XDP is a > framework for high-performance applications, so it's extremely unlikely > that an AF_XDP application will only need to receive a single packet. > Such applications just won't need AF_XDP. So, given that the issue can't > be fixed without disabling striding RQ, and disabling striding RQ will > just reduce the performance of all real-world applications, we decided > to keep things as is for now, and if a customer complains about it, we > will suggest that they disable striding RQ in their configuration, and > we'll consider an option of disabling batching in legacy RQ for AF_XDP. > > BTW, if the current API can't provide a good enough abstraction over > advanced features of mlx5e, maybe we should extend the API somehow? > E.g., when need_wakeup for RX goes to "yes", also tell how many frames > need to be refilled? Yes, let us set need_wakeup for Rx, if you are not already doing that. 
The Intel drivers do this when there are no entries at all on the fill ring and none on the HW Rx ring. You can set it when there are fewer than 64 on the fill ring (or whatever striding size that you have) and no entries on the HW Rx ring. But I prefer not to return a number. Why not just document that if the user gets a need_wakeup on the fill ring it needs to put entries on the fill ring and wake up the kernel? We do recommend putting as many entries in the fill ring as possible since this is good for performance (for all NICs). Putting too few might trigger another need_wakeup event on NICs that prefer to work on batches of packets. Or something like that. > >>> /Magnus > >>> > >>>> I will examine what provides the best performance. On one hand it is > >>>> the number of updates to shared ring state (minimized by current > >>>> scheme) and the ability for the user app to put buffers on the fill > >>>> ring. Stating that putting e.g. half the packets on the fill ring > >>>> provides better performance (for some application) is not a > >>>> limitation. It is performance tuning advice :-). > >> > >> I have now made a set of measurements. First I just made a variation > >> study using the xdpsock app, varying the amount of packets the kernel > >> can withdraw from a consumer ring (fill and Tx) before updating global > >> state. For the 1 core case (app and driver on the same core) the more > >> frequently you do this update, the better. The reason for this is that > >> it costs very little to update the state since the application is not > >> running. And it is beneficial for the app to have a freshly updated > >> state when it starts to execute as it can operate on more packets. For > >> the 2 core case (app on one core, driver on another) it is the > >> complete opposite: the fewer updates to global state, the better. The > >> reason for this is that it costs a lot to update global state as it > >> triggers cache coherency actions between the two cores. 
> >> > >> What I did then was to compare the current scheme, update only when > >> grabbing new packets, to a new scheme where we also update the global > >> consumer pointer when we are exiting Rx or Tx processing in the NAPI > >> context. On two cores the current scheme gets 0.5 to 1 Mpps more in > >> throughput than also updating the pointer at the end of NAPI. But for > >> 1 core case, the new scheme is the best and generates between 0.2 and > >> 0.3 Mpps more in throughput than the current one. But all in all, the > >> current scheme is more beneficial than the proposed one if we say that > >> both the 1 core and the 2 core case are equally important. > >> > >> Something to note is that the xdpsock application only puts batch size > >> (64) of packets in the fill ring in every iteration, and this might > >> lead to some good pacing for the current scheme and the 2 core case. > >> I.e., we do not get into the case of the fill ring only being full or > >> empty. But I will run this on some real apps to get some more results, > >> and I know that Björn has an optimized xdpsock application that puts > >> many more packets into the fill queue than 64. This optimization might > >> actually make the new proposal (also updating at the end of NAPI) be > >> better and make the current scheme suffer. We will examine this > >> further and get back. > > > > Actually, after some more thought and discussions I think we should > > optimize for the 1 core case, since that is what gives the whole > > system the best performance, provided that you can scale your > > application with instantiation that is. For a 4 core system, 4 x the 1 > > core performance > 2 x 2 core performance by a lot. I think that the 1 > > core case is the one that is going to be used out there. At least that > > is what I hear and see. > > > > So, when the merge window opens I am going to submit a patch that > > updates the consumer pointer when we exit NAPI too. 
Will increase the > > performance of the 1 core case. > > That sounds good to me. It doesn't make sense to update it multiple > times per NAPI (in the single core case the application won't run at > that time anyway), so once per NAPI is the most frequent, and according > to your experiments it should be the most efficient. It should make > mlx5e work again. One concern though: you say you are going to submit it > to -next, but a kernel with your patches has been released, and it has > broken AF_XDP support in mlx5e. I can send a small fix to net that will > revert the behavior back to updating the consumer index once every 16 > frames (so it makes mlx5e work again), and your patch will go on top of > my bugfix. Does it sound right to you? How about a patch set of two patches. In the cover letter you explain the situation. 1: The application might deadlock if the kernel requires multiple buffers to be put on the fill ring to proceed but the consumer pointer has not been updated by the kernel so it is not possible to put them there. 2: When the kernel needs more buffers, the driver should signal this through the fill ring's need_wakeup flag. Document this. There are two solutions to this: a) fix the publishing of the consumer pointer and document that the user application has to put entries in the fill ring when the need_wakeup flag is set for the fill ring (and implement this in your driver if you have not already done so). b) Fix the Mellanox driver so it can work on any amount of buffers put in the fill ring. We chose a) because it will result in the most performant system and optimizing for the "few packets in the fill ring case" is optimizing for something that will rarely happen. I can send you the first patch. I do not want to reintroduce the batch size, if I do not have to. As I have said before, there are already enough tunables in an AF_XDP application. We do not need another one. 
My scheme of publishing the consumer pointers at the end of the NAPI loop seems to work really well and requires no parameters. You can then add a patch for 2 and a cover letter making our case that we would like this fixed. I will send you the patch in a separate mail on this list. What do you think? Thanks: Magnus > Thanks for taking time to do the tests! > > > /Magnus > > > >> /Magnus > >> > >>>>>> I will look into this > >>>>>> and see what I get. > >>>>>> > >>>>>>> 4. As mlx5e allocates frames in batches, the consequences are even more > >>>>>>> severe: it's a deadlock where the driver waits for the application, and > >>>>>>> vice versa. The driver never reaches the point where cons_tail gets > >>>>>>> equal to cons_head. E.g., if cons_tail + 3 == cons_head, and the batch > >>>>>>> size requested by the driver is 8, the driver won't peek anything from > >>>>>>> the fill ring waiting for difference between cons_tail and cons_head to > >>>>>>> increase to be at least 8. On the other hand, the application can't put > >>>>>>> anything to the ring, because it still thinks that the consumer index is > >>>>>>> 0. As cons_tail never reaches cons_head, the consumer index doesn't get > >>>>>>> updated, hence the deadlock. > >>>>>> > >>>>>> Good thing that you detected this. Maybe I should get a Mellanox card > >>>>>> :-). This is different from how we implemented Intel's drivers that > >>>>>> just work on any batch size. If it gets 3 packets back, it will use > >>>>>> those. How do you deal with the case when the application just puts a > >>>>>> single buffer in the fill ring and wants to receive a single packet? > >>>>> > >>>>> mlx5e will wait until the full batch is available. As AF_XDP is intended > >>>>> for high-performance apps, this scenario is less expected. We prefer to > >>>>> leverage our performance features. > >>>> > >>>> That you cannot support all applications on top of AF_XDP with your > >>>> zero-copy support seems broken to me. 
But I give you that you might > >>>> support all the ones you care about with your current batching > >>>> support. Like you say, most apps will put plenty of buffers on the > >>>> fill ring, so this should not be a problem. Can you not implement some > >>>> slow path for these cases? You must have one for the skb case. > >>>> > >>>>>> Why does the Mellanox driver need a specific batch size? This is just > >>>>>> for my understanding so we can find a good solution. > >>>>> > >>>>> The main reason is our performance feature called striding RQ. Skipping > >>>>> all irrelevant details, a batch of 64 pages is posted to the NIC with a > >>>>> single request, and the NIC fills them progressively. This feature is > >>>>> turned on by default on modern NICs, and it's really good for > >>>>> performance. > >>>>> > >>>>> It might be possible to post a smaller batch though, I'm not sure about > >>>>> it, it needs to be checked, but anyway it's not something that we will > >>>>> likely do, because it's a complication of the data path, and if we know > >>>>> more frames are coming, it's much better for the performance to wait for > >>>>> them, rather than to post several incomplete batches. > >>>>> > >>>>>>> So, in my vision, the decision to remove RX_BATCH_SIZE and periodic > >>>>>>> updates of the consumer index was wrong. It totally breaks mlx5e, that > >>>>>>> does batching, and it will affect the performance of any driver, because > >>>>>>> the application can't refill the ring until it gets completely empty and > >>>>>>> the driver starts waiting for frames. I suggest that periodic updates of > >>>>>>> the consumer index should be readded to xskq_cons_peek_addr. > >>>>>> > >>>>>> The reason I wanted to remove RX_BATCH_SIZE is that application > >>>>>> developers complained about it giving rise to counter intuitive > >>>>>> behavior in user space. 
I will try to dig out the complaints and the > >>>>>> explanations Björn and I had to send which it seemed that users really > >>>>>> should not have to care about. It should just work. > >>>>> > >>>>> I think the counter that doesn't update till the very last moment and > >>>>> then advances by the ring size will also be something to complain about > >>>>> (and I am the first one to complain :D). Such bursts are not good in any > >>>>> case. > >>>> > >>>> Do you have any performance data that shows this scheme is bad for > >>>> performance? The good thing about this scheme is that global state is > >>>> updated less frequently. And the bad thing is what you mentioned. But > >>>> which one has the greatest effect, is the question. > >>>> > >>>>>> I also do not like > >>>>>> to have arbitrary constants like this. Why 16? > >>>>> > >>>>> I believe any batching mechanism has a constant that look arbitrary. > >>>>> This constant should be the golden mean: if it's too small, there is > >>>>> little effect from batching; if it's too big, it gets too bursty. > >>>>> > >>>>> Basically, after your patch it just changed from 16 to the ring size. > >>>>> Maybe we can tie that constant to ring size? Make it ring_size / > >>>>> another_arbitrary_constant? :) > >>>> > >>>> Agreed, I thought about this too. Something tied to ring_size might be > >>>> a good compromise. Will examine this. But I want to base this on > >>>> performance data not idle speculation, so need to run some experiments > >>>> first. > >>>> > >>>> /Magnus > >>>> > >>>>>> Would much prefer not > >>>>>> having to deal with this, unless of course it horribly breaks > >>>>>> something or gives rise to worse performance. Might still be the case > >>>>>> here, but if not, I would like to remove it. > >>>>>> > >>>>>> Thanks: Magnus > >>>>>> > >>>>>>> Magnus, what do you think of the suggestion above? 
> >>>>>>> > >>>>>>> Thanks, > >>>>>>> Max > >>>>>>> > >>>>>>>>> > >>>>>>>>> I tracked this issue in our internal bug tracker in case we need to > >>>>>>>>> perform actual debugging of mlx5e. I'm looking forward to your feedback > >>>>>>>>> on my assumption above. > >>>>>>>>> > >>>>>>>>>>> The hardware I am using is a Mellanox ConnectX4 2x100G card (MCX416A- > >>>>>> CCAT) > >>>>>>>>>>> running the mlx5 driver. > >>>>>>>>> > >>>>>>>>> This one should run without striding RQ, please verify it with ethtool > >>>>>>>>> --show-priv-flags (the flag name is rx_striding_rq). > >>>>>>>> > >>>>>>>> I do not remember changing this option, so whatever the default is, is > >>>>>> what it > >>>>>>>> was running with. I am traveling this week and do not have access to > >>>>>> these > >>>>>>>> systems, but will ensure that this flag is set properly when I get back. > >>>>>>>> > >>>>>>> > ^ permalink raw reply [flat|nested] 49+ messages in thread
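The deadlock Maxim describes in the thread above can be reproduced with a small model of the fill ring's index bookkeeping. This is only an illustrative sketch — `FillRing` and its fields are stand-ins, not the kernel's actual structures — set up to mirror his example where cons_tail + 3 == cons_head and the driver requests batches of 8:

```python
RING_SIZE = 8
BATCH = 8  # driver-side batch size, as with mlx5e striding RQ

class FillRing:
    """Toy model of AF_XDP fill ring index bookkeeping."""
    def __init__(self):
        self.producer = 0   # advanced by the application
        self.consumer = 0   # published by the kernel, read by the app
        self.cons_head = 0  # kernel's snapshot of the producer index
        self.cons_tail = 0  # kernel's private consumption point

    def app_free_entries(self):
        # The application can only see the *published* consumer index.
        return RING_SIZE - (self.producer - self.consumer)

    def driver_cached_entries(self):
        return self.cons_head - self.cons_tail

# State as in Maxim's example: the app filled the whole ring earlier,
# the kernel has consumed some entries but never published its progress
# (cons_head - cons_tail == 3, consumer still 0).
r = FillRing()
r.producer, r.consumer = 8, 0
r.cons_head, r.cons_tail = 8, 5

assert r.app_free_entries() == 0           # app: "ring is full", cannot refill
assert r.driver_cached_entries() < BATCH   # driver: waits for a full batch
# Neither side can make progress: deadlock.

# Publishing the consumer index when exiting NAPI (Magnus's proposal)
# unblocks the application:
r.consumer = r.cons_tail
assert r.app_free_entries() == 5
```

The same arithmetic also shows why the pre-df0ae6f78a45 behavior (publish every 16 consumed entries) avoided the deadlock: the published consumer index could never lag cons_tail by more than the periodic-update constant.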
* Re: zero-copy between interfaces 2020-01-14 20:52 ` Ryan Goodfellow 2020-01-15 1:41 ` Ryan Goodfellow @ 2020-01-17 17:40 ` William Tu 1 sibling, 0 replies; 49+ messages in thread From: William Tu @ 2020-01-17 17:40 UTC (permalink / raw) To: Ryan Goodfellow; +Cc: Magnus Karlsson, xdp-newbies On Tue, Jan 14, 2020 at 12:53 PM Ryan Goodfellow <rgoodfel@isi.edu> wrote: > > On Tue, Jan 14, 2020 at 10:59:19AM +0100, Magnus Karlsson wrote: > > > > Just sent out a patch on the mailing list. Would be great if you could > > try it out. > > Thanks for the quick turnaround. I gave this patch a go, both in the bpf-next > tree and manually applied to the 5.5.0-rc3 branch I've been working with up to > this point. It does allow for allocating more memory, however packet > forwarding no longer works. I did not see any complaints from dmesg, but here > is an example iperf3 session from a client that worked before. > > ry@xd2:~$ iperf3 -c 10.1.0.2 > Connecting to host 10.1.0.2, port 5201 > [ 5] local 10.1.0.1 port 53304 connected to 10.1.0.2 port 5201 > [ ID] Interval Transfer Bitrate Retr Cwnd > [ 5] 0.00-1.00 sec 5.91 MBytes 49.5 Mbits/sec 2 1.41 KBytes > [ 5] 1.00-2.00 sec 0.00 Bytes 0.00 bits/sec 1 1.41 KBytes > [ 5] 2.00-3.00 sec 0.00 Bytes 0.00 bits/sec 0 1.41 KBytes > [ 5] 3.00-4.00 sec 0.00 Bytes 0.00 bits/sec 1 1.41 KBytes > [ 5] 4.00-5.00 sec 0.00 Bytes 0.00 bits/sec 0 1.41 KBytes > [ 5] 5.00-6.00 sec 0.00 Bytes 0.00 bits/sec 0 1.41 KBytes > [ 5] 6.00-7.00 sec 0.00 Bytes 0.00 bits/sec 1 1.41 KBytes > [ 5] 7.00-8.00 sec 0.00 Bytes 0.00 bits/sec 0 1.41 KBytes > [ 5] 8.00-9.00 sec 0.00 Bytes 0.00 bits/sec 0 1.41 KBytes > ^C[ 5] 10.00-139.77 sec 0.00 Bytes 0.00 bits/sec 4 1.41 KBytes > - - - - - - - - - - - - - - - - - - - - - - - - - > [ ID] Interval Transfer Bitrate Retr > [ 5] 0.00-139.77 sec 5.91 MBytes 355 Kbits/sec 9 sender > [ 5] 0.00-139.77 sec 0.00 Bytes 0.00 bits/sec receiver > iperf3: interrupt - the client has terminated > > I'll continue to investigate and report 
back with anything that I find. > Hi Ryan, Not sure if this is the same, but we hit something similar in OVS AF_XDP's implementation. In our case, it happens when using native-mode, not the driver zero-copy mode, and iperf works a couple seconds then down to zero. FYI: https://mail.openvswitch.org/pipermail/ovs-dev/2019-November/365076.html and fixes https://github.com/openvswitch/ovs/commit/161773c72a33a86a23d4892e3c7448cee9946317 Regards, William ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: zero-copy between interfaces 2020-01-13 0:18 zero-copy between interfaces Ryan Goodfellow 2020-01-13 9:16 ` Magnus Karlsson @ 2020-01-13 11:41 ` Jesper Dangaard Brouer 2020-01-13 15:28 ` Ryan Goodfellow 2020-01-17 12:32 ` [Intel-wired-lan] " Björn Töpel 2 siblings, 1 reply; 49+ messages in thread From: Jesper Dangaard Brouer @ 2020-01-13 11:41 UTC (permalink / raw) To: Ryan Goodfellow; +Cc: xdp-newbies On Mon, 13 Jan 2020 00:18:36 +0000 Ryan Goodfellow <rgoodfel@isi.edu> wrote: > The numbers that I have been able to achive with this code are the following. MTU > is 1500 in all cases. > > mlx5: pps ~ 2.4 Mpps, 29 Gbps (driver mode, zero-copy) > i40e: pps ~ 700 Kpps, 8 Gbps (skb mode, copy) > virtio: pps ~ 200 Kpps, 2.4 Gbps (skb mode, copy, all qemu/kvm VMs) > > Are these numbers in the ballpark of what's expected? I would say they are too slow / low. Have you remembered to do bulking? -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat LinkedIn: http://www.linkedin.com/in/brouer ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: zero-copy between interfaces 2020-01-13 11:41 ` Jesper Dangaard Brouer @ 2020-01-13 15:28 ` Ryan Goodfellow 2020-01-13 17:04 ` Jesper Dangaard Brouer 0 siblings, 1 reply; 49+ messages in thread From: Ryan Goodfellow @ 2020-01-13 15:28 UTC (permalink / raw) To: Jesper Dangaard Brouer; +Cc: xdp-newbies On Mon, Jan 13, 2020 at 12:41:34PM +0100, Jesper Dangaard Brouer wrote: > On Mon, 13 Jan 2020 00:18:36 +0000 > Ryan Goodfellow <rgoodfel@isi.edu> wrote: > > > The numbers that I have been able to achive with this code are the following. MTU > > is 1500 in all cases. > > > > mlx5: pps ~ 2.4 Mpps, 29 Gbps (driver mode, zero-copy) > > i40e: pps ~ 700 Kpps, 8 Gbps (skb mode, copy) > > virtio: pps ~ 200 Kpps, 2.4 Gbps (skb mode, copy, all qemu/kvm VMs) > > > > Are these numbers in the ballpark of what's expected? > > I would say they are too slow / low. > > Have you remembered to do bulking? > I am using a batch size of 256. -- ~ ry ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: zero-copy between interfaces 2020-01-13 15:28 ` Ryan Goodfellow @ 2020-01-13 17:04 ` Jesper Dangaard Brouer 2020-01-17 17:54 ` Ryan Goodfellow 0 siblings, 1 reply; 49+ messages in thread From: Jesper Dangaard Brouer @ 2020-01-13 17:04 UTC (permalink / raw) To: Ryan Goodfellow; +Cc: brouer, xdp-newbies On Mon, 13 Jan 2020 10:28:00 -0500 Ryan Goodfellow <rgoodfel@isi.edu> wrote: > On Mon, Jan 13, 2020 at 12:41:34PM +0100, Jesper Dangaard Brouer wrote: > > On Mon, 13 Jan 2020 00:18:36 +0000 > > Ryan Goodfellow <rgoodfel@isi.edu> wrote: > > > > > The numbers that I have been able to achive with this code are the following. MTU > > > is 1500 in all cases. > > > > > > mlx5: pps ~ 2.4 Mpps, 29 Gbps (driver mode, zero-copy) > > > i40e: pps ~ 700 Kpps, 8 Gbps (skb mode, copy) > > > virtio: pps ~ 200 Kpps, 2.4 Gbps (skb mode, copy, all qemu/kvm VMs) > > > > > > Are these numbers in the ballpark of what's expected? > > > > I would say they are too slow / low. > > > > Have you remembered to do bulking? > > > > I am using a batch size of 256. Hmm... Maybe you can test with xdp_redirect_map program in samples/bpf/ and compare the performance on this hardware? -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat LinkedIn: http://www.linkedin.com/in/brouer ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: zero-copy between interfaces 2020-01-13 17:04 ` Jesper Dangaard Brouer @ 2020-01-17 17:54 ` Ryan Goodfellow 2020-01-18 10:14 ` Jesper Dangaard Brouer 0 siblings, 1 reply; 49+ messages in thread From: Ryan Goodfellow @ 2020-01-17 17:54 UTC (permalink / raw) To: Jesper Dangaard Brouer; +Cc: xdp-newbies On Mon, Jan 13, 2020 at 06:04:11PM +0100, Jesper Dangaard Brouer wrote: > On Mon, 13 Jan 2020 10:28:00 -0500 > Ryan Goodfellow <rgoodfel@isi.edu> wrote: > > > On Mon, Jan 13, 2020 at 12:41:34PM +0100, Jesper Dangaard Brouer wrote: > > > On Mon, 13 Jan 2020 00:18:36 +0000 > > > Ryan Goodfellow <rgoodfel@isi.edu> wrote: > > > > > > > The numbers that I have been able to achive with this code are the following. MTU > > > > is 1500 in all cases. > > > > > > > > mlx5: pps ~ 2.4 Mpps, 29 Gbps (driver mode, zero-copy) > > > > i40e: pps ~ 700 Kpps, 8 Gbps (skb mode, copy) > > > > virtio: pps ~ 200 Kpps, 2.4 Gbps (skb mode, copy, all qemu/kvm VMs) > > > > > > > > Are these numbers in the ballpark of what's expected? > > > > > > I would say they are too slow / low. > > > > > > Have you remembered to do bulking? > > > > > > > I am using a batch size of 256. > > Hmm... > > Maybe you can test with xdp_redirect_map program in samples/bpf/ and > compare the performance on this hardware? Hi Jesper, I tried to use this program, however it does not seem to work for bidirectional traffic across the two interfaces? Now that I have an i40e card that is working here is an update to the numbers. At 1500 MTU the i40e runs at 9.7 GBPS (using the X710-DA4 10G interface), with approx 800 Kpps. Dropping the MSS size down to 700 in iperf3 yields similar near-line rate performance at 1.6 Mpps. This appears to be the limit as dropping the MSS further results in degraded performance at similar packet rates (could be an iperf3 artifact). -- ~ ry ^ permalink raw reply [flat|nested] 49+ messages in thread
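As a rough cross-check on those rates (my arithmetic, not from the thread): at 10 Gbit/s the standard Ethernet per-frame overhead (preamble/SFD, FCS, inter-frame gap) bounds the achievable packet rate, and ~800 Kpps at MTU 1500 is essentially line rate, as is ~1.6 Mpps at an MSS of 700:

```python
def line_rate_pps(link_bps, l2_payload_bytes):
    # On-wire cost per frame: payload + 14 B Ethernet header + 4 B FCS
    # + 8 B preamble/SFD + 12 B inter-frame gap.
    wire_bits = (l2_payload_bytes + 14 + 4 + 8 + 12) * 8
    return link_bps / wire_bits

mtu_1500 = line_rate_pps(10e9, 1500)      # ~813 Kpps -> matches ~800 Kpps
mss_700 = line_rate_pps(10e9, 700 + 40)   # +40 B TCP/IP headers, ~1.6 Mpps
```

This suggests the X710 numbers are NIC-limited rather than AF_XDP-limited, which is consistent with the "degraded performance at similar packet rates" seen when dropping the MSS further.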
* Re: zero-copy between interfaces 2020-01-17 17:54 ` Ryan Goodfellow @ 2020-01-18 10:14 ` Jesper Dangaard Brouer 2020-01-18 14:08 ` Ryan Goodfellow 0 siblings, 1 reply; 49+ messages in thread From: Jesper Dangaard Brouer @ 2020-01-18 10:14 UTC (permalink / raw) To: Ryan Goodfellow; +Cc: xdp-newbies, brouer On Fri, 17 Jan 2020 12:54:09 -0500 Ryan Goodfellow <rgoodfel@isi.edu> wrote: > On Mon, Jan 13, 2020 at 06:04:11PM +0100, Jesper Dangaard Brouer wrote: > > On Mon, 13 Jan 2020 10:28:00 -0500 > > Ryan Goodfellow <rgoodfel@isi.edu> wrote: > > > > > On Mon, Jan 13, 2020 at 12:41:34PM +0100, Jesper Dangaard Brouer wrote: > > > > On Mon, 13 Jan 2020 00:18:36 +0000 > > > > Ryan Goodfellow <rgoodfel@isi.edu> wrote: > > > > > > > > > The numbers that I have been able to achive with this code are the following. MTU > > > > > is 1500 in all cases. > > > > > > > > > > mlx5: pps ~ 2.4 Mpps, 29 Gbps (driver mode, zero-copy) > > > > > i40e: pps ~ 700 Kpps, 8 Gbps (skb mode, copy) > > > > > virtio: pps ~ 200 Kpps, 2.4 Gbps (skb mode, copy, all qemu/kvm VMs) > > > > > > > > > > Are these numbers in the ballpark of what's expected? > > > > > > > > I would say they are too slow / low. > > > > > > > > Have you remembered to do bulking? > > > > > > > > > > I am using a batch size of 256. > > > > Hmm... > > > > Maybe you can test with xdp_redirect_map program in samples/bpf/ and > > compare the performance on this hardware? > > Hi Jesper, > > I tried to use this program, however it does not seem to work for bidirectional > traffic across the two interfaces? It does work bidirectional if you start more of these xdp_redirect_map programs. Do notice this is an example program. Look at xdp_fwd_*.c if you want a program that is functional and uses the existing IP route table for XDP acceleration. My point is that there are alternatives for doing zero-copy between interfaces... A xdp_redirect_map inside the kernel out another interface is already zero-copy. 
I'm wondering why did you choose/need AF_XDP technology for doing forwarding? -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat LinkedIn: http://www.linkedin.com/in/brouer ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: zero-copy between interfaces 2020-01-18 10:14 ` Jesper Dangaard Brouer @ 2020-01-18 14:08 ` Ryan Goodfellow 2020-01-26 4:53 ` Dan Siemon 0 siblings, 1 reply; 49+ messages in thread From: Ryan Goodfellow @ 2020-01-18 14:08 UTC (permalink / raw) To: Jesper Dangaard Brouer; +Cc: xdp-newbies On Sat, Jan 18, 2020 at 11:14:05AM +0100, Jesper Dangaard Brouer wrote: > On Fri, 17 Jan 2020 12:54:09 -0500 > Ryan Goodfellow <rgoodfel@isi.edu> wrote: > > I tried to use this program, however it does not seem to work for bidirectional > > traffic across the two interfaces? > > It does work bidirectional if you start more of these xdp_redirect_map > programs. Do notice this is an example program. Look at xdp_fwd_*.c > if you want a program that is functional and uses the existing IP route > table for XDP acceleration. > > My point is that there are alternatives for doing zero-copy between > interfaces... A xdp_redirect_map inside the kernel out another > interface is already zero-copy. > > I'm wondering why did you choose/need AF_XDP technology for doing forwarding? This is just a sample program used to demonstrate moving packets between different interfaces efficiently using AF_XDP. Our actual use case is performing network emulation in userspace. For example, representing impaired links or entire networks with link-by-link shaping specifications. We are using AF_XDP to get packets to/from our network emulation software as quickly as possible without having to go through the entire network stack, as the emulation host's network configuration does not influence the networks it's emulating. Traditionally we've used DPDK for this, but are porting to AF_XDP for the relative simplicity and flexibility it provides. Some specific benefits for us are: - Can attach to VTEPs which allows us to hook into some EVPN/VXLAN based networks we have easily. 
Alternatively with the BPF program flexibility, we also have the option to split out BGP control plane traffic from overlay traffic when attaching to the physical interface and pass it through to the kernel. Both of these approaches let the kernel manage the FDB for VTEPs as well as taking care of encap/decap (potentially offloaded to the NIC itself) and let our software focus on emulation. - Using XDP in virtual machines in our testing environment is straightforward, while this is possible with DPDK and virtio, the setup was rather convoluted. -- ~ ry ^ permalink raw reply [flat|nested] 49+ messages in thread
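The control-plane split described above is, at its core, a classify step in the attached XDP program. Below is a userspace model of that decision logic only — the real thing would be a BPF C program returning XDP_PASS or redirecting into an XSKMAP, and the offsets assume untagged IPv4 over Ethernet, so treat it as illustrative:

```python
import struct

BGP_PORT = 179                     # BGP control plane goes to the kernel stack
PASS, TO_XSK = "pass", "redirect"  # stand-ins for XDP_PASS / XSKMAP redirect

def classify(frame: bytes) -> str:
    """Pass BGP TCP traffic to the kernel; everything else to AF_XDP."""
    if len(frame) < 14 + 20 + 4:        # eth + min IPv4 header + TCP ports
        return TO_XSK
    (ethertype,) = struct.unpack_from("!H", frame, 12)
    if ethertype != 0x0800:             # not IPv4: treat as overlay traffic
        return TO_XSK
    ihl = (frame[14] & 0x0F) * 4        # IPv4 header length
    if frame[23] != 6 or len(frame) < 14 + ihl + 4:   # not TCP / truncated
        return TO_XSK
    sport, dport = struct.unpack_from("!HH", frame, 14 + ihl)
    return PASS if BGP_PORT in (sport, dport) else TO_XSK
```

In the BPF version, the "pass" branch lets the kernel manage the BGP session and the VTEP FDB as described, while the "redirect" branch hands overlay packets to the emulator's AF_XDP sockets.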
* Re: zero-copy between interfaces 2020-01-18 14:08 ` Ryan Goodfellow @ 2020-01-26 4:53 ` Dan Siemon 0 siblings, 0 replies; 49+ messages in thread From: Dan Siemon @ 2020-01-26 4:53 UTC (permalink / raw) To: Ryan Goodfellow, Jesper Dangaard Brouer; +Cc: xdp-newbies On Sat, 2020-01-18 at 09:08 -0500, Ryan Goodfellow wrote: > On Sat, Jan 18, 2020 at 11:14:05AM +0100, Jesper Dangaard > Brouerwrote: > > > > I'm wondering why did you choose/need AF_XDP technology for doing > > forwarding? > > This is just a sample program used to demonstrate moving packets > between > different interfaces efficiently using AF_XDP. > > Our actual use case is performing network emulation in userspace. For > example, > representing impaired links or entire networks with link-by-link > shaping > specifications. We are using AF_XDP to get packets to/from our > network emulation > software as quickly as possible without having to go through the > entire network > stack, as the emulation host's network configuration does not > influence the > networks it's emulating. > > Traditionally we've used DPDK for this, but are porting to AF_XDP for > the > relative simplicity and flexibility it provides. Some specific > benefits for us > are: > > - Can attach to VTEPs which allows us to hook into some EVPN/VXLAN > based > networks we have easily. Alternatively with the BPF program > flexibility, we > also have the option to split out BGP control plane traffic from > overlay > traffic when attaching to the physical interface and pass it > through to the > kernel. Both of these approaches let the kernel manage the FDB for > VTEPs as > well as taking care of encap/decap (potentially offloaded to the > NIC itself) > and let our software focus on emulation. > > - Using XDP in virtual machines in our testing environment is > straightforward, > while this is possible with DPDK and virtio, the setup was rather > convoluted. I'm in a very similar situation. The per-packet work we do is complicated. 
Doing it all in BPF would be much more painful if even possible. For us, AF_XDP is a nice, simple DPDK. We have no need for BPF in the AF_XDP path. ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: zero-copy between interfaces 2020-01-13 0:18 zero-copy between interfaces Ryan Goodfellow @ 2020-01-17 12:32 ` Björn Töpel 2020-01-13 11:41 ` Jesper Dangaard Brouer 2020-01-17 12:32 ` [Intel-wired-lan] " Björn Töpel 2 siblings, 0 replies; 49+ messages in thread From: Björn Töpel @ 2020-01-17 12:32 UTC (permalink / raw) To: Ryan Goodfellow Cc: xdp-newbies, Karlsson, Magnus, intel-wired-lan, Magnus Karlsson, Björn Töpel On Mon, 13 Jan 2020 at 01:28, Ryan Goodfellow <rgoodfel@isi.edu> wrote: > [...] > > I could not get zero-copy to work with the i40e driver as it would crash. I've > attached the corresponding traces from dmesg. Thanks Ryan! I had a look at the crash, and it's in the XDP setup: i40e_xdp_setup: ... for (i = 0; i < vsi->num_queue_pairs; i++) WRITE_ONCE(vsi->rx_rings[i]->xdp_prog, vsi->xdp_prog); and the vsi->rx_ring[0] is NULL. This is clearly broken. It would help with more lines from your dmesg: the cut i40e log hints that something is really broken: [ 328.579154] i40e 0000:b7:00.2: failed to get tracking for 256 queues for VSI 0 err -12 [ 328.579280] i40e 0000:b7:00.2: setup of MAIN VSI failed [ 328.579367] i40e 0000:b7:00.2: can't remove VEB 162 with 0 VSIs left Is it possible to dig out the complete log? Thanks! Björn ^ permalink raw reply [flat|nested] 49+ messages in thread
* [Intel-wired-lan] zero-copy between interfaces @ 2020-01-17 12:32 ` Björn Töpel 0 siblings, 0 replies; 49+ messages in thread From: Björn Töpel @ 2020-01-17 12:32 UTC (permalink / raw) To: intel-wired-lan On Mon, 13 Jan 2020 at 01:28, Ryan Goodfellow <rgoodfel@isi.edu> wrote: > [...] > > I could not get zero-copy to work with the i40e driver as it would crash. I've > attached the corresponding traces from dmesg. Thanks Ryan! I had a look at the crash, and it's in the XDP setup: i40e_xdp_setup: ... for (i = 0; i < vsi->num_queue_pairs; i++) WRITE_ONCE(vsi->rx_rings[i]->xdp_prog, vsi->xdp_prog); and the vsi->rx_ring[0] is NULL. This is clearly broken. It would help with more lines from your dmesg: the cut i40e log hints that something is really broken: [ 328.579154] i40e 0000:b7:00.2: failed to get tracking for 256 queues for VSI 0 err -12 [ 328.579280] i40e 0000:b7:00.2: setup of MAIN VSI failed [ 328.579367] i40e 0000:b7:00.2: can't remove VEB 162 with 0 VSIs left Is it possible to dig out the complete log? Thanks! Björn ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: zero-copy between interfaces 2020-01-17 12:32 ` [Intel-wired-lan] " Björn Töpel @ 2020-01-17 17:16 ` Ryan Goodfellow -1 siblings, 0 replies; 49+ messages in thread From: Ryan Goodfellow @ 2020-01-17 17:16 UTC (permalink / raw) To: Björn Töpel Cc: xdp-newbies, Karlsson, Magnus, intel-wired-lan, Magnus Karlsson, Björn Töpel On Fri, Jan 17, 2020 at 01:32:07PM +0100, Björn Töpel wrote: > On Mon, 13 Jan 2020 at 01:28, Ryan Goodfellow <rgoodfel@isi.edu> wrote: > > > [...] > > > > I could not get zero-copy to work with the i40e driver as it would crash. I've > > attached the corresponding traces from dmesg. > > Thanks Ryan! I had a look at the crash, and it's in the XDP setup: > > i40e_xdp_setup: > ... > for (i = 0; i < vsi->num_queue_pairs; i++) > WRITE_ONCE(vsi->rx_rings[i]->xdp_prog, vsi->xdp_prog); > > and the vsi->rx_ring[0] is NULL. This is clearly broken. > > It would help with more lines from your dmesg: the cut i40e log hints > that something is really broken: > > [ 328.579154] i40e 0000:b7:00.2: failed to get tracking for 256 > queues for VSI 0 err -12 > [ 328.579280] i40e 0000:b7:00.2: setup of MAIN VSI failed > [ 328.579367] i40e 0000:b7:00.2: can't remove VEB 162 with 0 VSIs left > > Is it possible to dig out the complete log? Hi Björn, I've linked a full dmesg log from an XDP setup crash. Note that there are two i40e cards on this machine. The X710 (0000:65:00.0, 0000:65:00.1) works fine, the X722 (0000:b7:00.0, 0000:b7:00.1, 0000:b7:00.2, 0000:b7:00.3) is the one that is crashing on XDP setup. 
https://gitlab.com/mergetb/tech/network-emulation/kernel/snippets/1931080 Some info that may be useful: ry@turbine:~$ sudo ethtool -i eno7 driver: i40e version: 2.8.20-k firmware-version: 3.33 0x80001006 1.1747.0 expansion-rom-version: bus-info: 0000:b7:00.2 supports-statistics: yes supports-test: yes supports-eeprom-access: yes supports-register-dump: yes supports-priv-flags: yes The firmware version 3.33 was the latest I could find as of a few weeks ago. ry@turbine:~$ sudo lspci -vvv | grep 722 b7:00.0 Ethernet controller: Intel Corporation Ethernet Connection X722 for 10GBASE-T (rev 04) DeviceName: Intel LAN X722 #1 Subsystem: Super Micro Computer Inc Ethernet Connection X722 for 10GBASE-T b7:00.1 Ethernet controller: Intel Corporation Ethernet Connection X722 for 10GBASE-T (rev 04) DeviceName: Intel LAN X722 #2 Subsystem: Super Micro Computer Inc Ethernet Connection X722 for 10GBASE-T b7:00.2 Ethernet controller: Intel Corporation Ethernet Connection X722 for 10GbE SFP+ (rev 04) DeviceName: Intel LAN X722 #3 Subsystem: Super Micro Computer Inc Ethernet Connection X722 for 10GbE SFP+ b7:00.3 Ethernet controller: Intel Corporation Ethernet Connection X722 for 10GbE SFP+ (rev 04) DeviceName: Intel LAN X722 #4 Subsystem: Super Micro Computer Inc Ethernet Connection X722 for 10GbE SFP+ ry@ryzen:~$ uname -a Linux ryzen 4.19.0-6-amd64 #1 SMP Debian 4.19.67-2+deb10u2 (2019-11-11) x86_64 GNU/Linux ry@turbine:~/kmoa/bpf-next$ git log -2 commit 60d71397d27e7859fdaaaaab6594e4d977ae46e2 (HEAD -> master) Author: Ryan Goodfellow <rgoodfel@isi.edu> Date: Wed Jan 15 16:54:39 2020 -0500 add xdpsock_multidev sample program This is a simple program that uses AF_XDP sockets to forward packets between two interfaces using a common memory region and no copying of packets. 
Signed-off-by: Ryan Goodfellow <rgoodfel@isi.edu> commit 9173cac3b64e6785dd604f5075e6035b045a0026 (origin/master, origin/HEAD) Author: Andrii Nakryiko <andriin@fb.com> Date: Wed Jan 15 11:08:56 2020 -0800 libbpf: Support .text sub-calls relocations The LLVM patch https://reviews.llvm.org/D72197 makes LLVM emit function call relocations within the same section. This includes a default .text section, which contains any BPF sub-programs. This wasn't the case before and so libbpf was able to get a way with slightly simpler handling of subprogram call relocations. This patch adds support for .text section relocations. It needs to ensure correct order of relocations, so does two passes: - first, relocate .text instructions, if there are any relocations in it; - then process all the other programs and copy over patched .text instructions for all sub-program calls. v1->v2: - break early once .text program is processed. Signed-off-by: Andrii Nakryiko <andriin@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Yonghong Song <yhs@fb.com> Cc: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20200115190856.2391325-1-andriin@fb.com -- ~ ry ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: zero-copy between interfaces
  2020-01-17 17:16 ` [Intel-wired-lan] " Ryan Goodfellow
@ 2020-01-17 18:10   ` Ryan Goodfellow
  -1 siblings, 0 replies; 49+ messages in thread
From: Ryan Goodfellow @ 2020-01-17 18:10 UTC (permalink / raw)
To: Björn Töpel
Cc: xdp-newbies, Karlsson, Magnus, intel-wired-lan, Magnus Karlsson, Björn Töpel

On Fri, Jan 17, 2020 at 12:16:37PM -0500, Ryan Goodfellow wrote:
> ry@ryzen:~$ uname -a
> Linux ryzen 4.19.0-6-amd64 #1 SMP Debian 4.19.67-2+deb10u2 (2019-11-11) x86_64 GNU/Linux

Correction, that was a terminal on the wrong machine:

ry@turbine:~$ uname -a
Linux turbine 5.5.0-rc4-moa+ #16 SMP Fri Jan 17 10:52:42 EST 2020 x86_64 GNU/Linux

--
~ ry

^ permalink raw reply	[flat|nested] 49+ messages in thread
* Re: zero-copy between interfaces
  2020-01-17 17:16 ` [Intel-wired-lan] " Ryan Goodfellow
@ 2020-01-20  8:24   ` Magnus Karlsson
  -1 siblings, 0 replies; 49+ messages in thread
From: Magnus Karlsson @ 2020-01-20 8:24 UTC (permalink / raw)
To: Ryan Goodfellow
Cc: Björn Töpel, xdp-newbies, Karlsson, Magnus, intel-wired-lan, Björn Töpel

On Fri, Jan 17, 2020 at 6:16 PM Ryan Goodfellow <rgoodfel@isi.edu> wrote:
>
> On Fri, Jan 17, 2020 at 01:32:07PM +0100, Björn Töpel wrote:
> > On Mon, 13 Jan 2020 at 01:28, Ryan Goodfellow <rgoodfel@isi.edu> wrote:
> > >
> > > [...]
> > >
> > > I could not get zero-copy to work with the i40e driver as it would crash. I've
> > > attached the corresponding traces from dmesg.
> >
> > Thanks Ryan! I had a look at the crash, and it's in the XDP setup:
> >
> > i40e_xdp_setup:
> >   ...
> >   for (i = 0; i < vsi->num_queue_pairs; i++)
> >           WRITE_ONCE(vsi->rx_rings[i]->xdp_prog, vsi->xdp_prog);
> >
> > and the vsi->rx_ring[0] is NULL. This is clearly broken.
> >
> > It would help with more lines from your dmesg: the cut i40e log hints
> > that something is really broken:
> >
> > [  328.579154] i40e 0000:b7:00.2: failed to get tracking for 256 queues for VSI 0 err -12
> > [  328.579280] i40e 0000:b7:00.2: setup of MAIN VSI failed
> > [  328.579367] i40e 0000:b7:00.2: can't remove VEB 162 with 0 VSIs left
> >
> > Is it possible to dig out the complete log?
>
> Hi Björn,
>
> I've linked a full dmesg log from an XDP setup crash. Note that there are
> two i40e cards on this machine. The X710 (0000:65:00.0, 0000:65:00.1) works
> fine, the X722 (0000:b7:00.0, 0000:b7:00.1, 0000:b7:00.2, 0000:b7:00.3) is the
> one that is crashing on XDP setup.

Ryan,

I was wondering if you could run two small experiments since I cannot
reproduce this?

1: Run your program using the two ports on your X710 card. Does it
work? This is my setup and works for me.

2: On your Mellanox setup, insert a kick_tx() call for each of your
two sockets before the poll() call in your forward() function. Just to
see if it works when we explicitly wake up the driver.

Thanks: Magnus

> https://gitlab.com/mergetb/tech/network-emulation/kernel/snippets/1931080
>
> Some info that may be useful:
>
> ry@turbine:~$ sudo ethtool -i eno7
> driver: i40e
> version: 2.8.20-k
> firmware-version: 3.33 0x80001006 1.1747.0
> expansion-rom-version:
> bus-info: 0000:b7:00.2
> supports-statistics: yes
> supports-test: yes
> supports-eeprom-access: yes
> supports-register-dump: yes
> supports-priv-flags: yes
>
> The firmware version 3.33 was the latest I could find as of a few weeks ago.
>
> ry@turbine:~$ sudo lspci -vvv | grep 722
> b7:00.0 Ethernet controller: Intel Corporation Ethernet Connection X722 for 10GBASE-T (rev 04)
>         DeviceName: Intel LAN X722 #1
>         Subsystem: Super Micro Computer Inc Ethernet Connection X722 for 10GBASE-T
> b7:00.1 Ethernet controller: Intel Corporation Ethernet Connection X722 for 10GBASE-T (rev 04)
>         DeviceName: Intel LAN X722 #2
>         Subsystem: Super Micro Computer Inc Ethernet Connection X722 for 10GBASE-T
> b7:00.2 Ethernet controller: Intel Corporation Ethernet Connection X722 for 10GbE SFP+ (rev 04)
>         DeviceName: Intel LAN X722 #3
>         Subsystem: Super Micro Computer Inc Ethernet Connection X722 for 10GbE SFP+
> b7:00.3 Ethernet controller: Intel Corporation Ethernet Connection X722 for 10GbE SFP+ (rev 04)
>         DeviceName: Intel LAN X722 #4
>         Subsystem: Super Micro Computer Inc Ethernet Connection X722 for 10GbE SFP+
>
> ry@ryzen:~$ uname -a
> Linux ryzen 4.19.0-6-amd64 #1 SMP Debian 4.19.67-2+deb10u2 (2019-11-11) x86_64 GNU/Linux
>
> ry@turbine:~/kmoa/bpf-next$ git log -2
> commit 60d71397d27e7859fdaaaaab6594e4d977ae46e2 (HEAD -> master)
> Author: Ryan Goodfellow <rgoodfel@isi.edu>
> Date:   Wed Jan 15 16:54:39 2020 -0500
>
>     add xdpsock_multidev sample program
>
>     This is a simple program that uses AF_XDP sockets to forward packets
>     between two interfaces using a common memory region and no copying of
>     packets.
>
>     Signed-off-by: Ryan Goodfellow <rgoodfel@isi.edu>
>
> commit 9173cac3b64e6785dd604f5075e6035b045a0026 (origin/master, origin/HEAD)
> Author: Andrii Nakryiko <andriin@fb.com>
> Date:   Wed Jan 15 11:08:56 2020 -0800
>
>     libbpf: Support .text sub-calls relocations
>
>     The LLVM patch https://reviews.llvm.org/D72197 makes LLVM emit function call
>     relocations within the same section. This includes a default .text section,
>     which contains any BPF sub-programs. This wasn't the case before and so libbpf
>     was able to get away with slightly simpler handling of subprogram call
>     relocations.
>
>     This patch adds support for .text section relocations. It needs to ensure
>     correct order of relocations, so does two passes:
>     - first, relocate .text instructions, if there are any relocations in it;
>     - then process all the other programs and copy over patched .text instructions
>       for all sub-program calls.
>
>     v1->v2:
>     - break early once .text program is processed.
>
>     Signed-off-by: Andrii Nakryiko <andriin@fb.com>
>     Signed-off-by: Alexei Starovoitov <ast@kernel.org>
>     Acked-by: Yonghong Song <yhs@fb.com>
>     Cc: Alexei Starovoitov <ast@kernel.org>
>     Link: https://lore.kernel.org/bpf/20200115190856.2391325-1-andriin@fb.com
>
> --
> ~ ry

^ permalink raw reply	[flat|nested] 49+ messages in thread
* Re: zero-copy between interfaces
  2020-01-20  8:24 ` [Intel-wired-lan] " Magnus Karlsson
@ 2020-01-20 18:33   ` Ryan Goodfellow
  -1 siblings, 0 replies; 49+ messages in thread
From: Ryan Goodfellow @ 2020-01-20 18:33 UTC (permalink / raw)
To: Magnus Karlsson
Cc: Björn Töpel, xdp-newbies, Karlsson, Magnus, intel-wired-lan, Björn Töpel

On Mon, Jan 20, 2020 at 09:24:05AM +0100, Magnus Karlsson wrote:
>
> I was wondering if you could run two small experiments since I cannot
> reproduce this?
>
> 1: Run your program using the two ports on your X710 card. Does it
> work? This is my setup and works for me.

The X710 card works without issue.

> 2: On your Mellanox setup, insert a kick_tx() call for each of your
> two sockets before the poll() call in your forward() function. Just to
> see if it works when we explicitly wake up the driver.

This did not have an effect on the observed behavior. Exactly N packets
go through the interface, where N is equal to the size of the FQ/CQ rings,
and then forwarding halts.

--
~ ry

^ permalink raw reply	[flat|nested] 49+ messages in thread
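Halting after exactly ring-size packets is the classic signature of UMEM buffers never being recycled: every RX consumes a fill-queue (FQ) entry, every completed TX parks the buffer on the completion queue (CQ), and unless the CQ is drained back into the FQ the forwarder runs dry. A toy accounting model of that symptom (the struct and counters here are illustrative, not libbpf API):

```c
#include <stdbool.h>

/* fq: buffers the kernel can receive into; cq: buffers waiting to be
 * reclaimed after transmit. Both start from a ring of fixed size. */
struct umem_model { int fq, cq; };

/* Forward one packet. Without recycling, each packet permanently moves
 * one buffer from the FQ to the CQ, so forwarding halts once the FQ is
 * empty, i.e. after exactly ring-size packets. */
static bool forward_one(struct umem_model *u, bool recycle)
{
	if (u->fq == 0)
		return false;	/* no buffer to receive into: halted */
	u->fq--;		/* RX consumes one fill-queue entry    */
	u->cq++;		/* after TX completes, it sits on CQ   */
	if (recycle) {		/* drain CQ back into FQ               */
		u->fq += u->cq;
		u->cq = 0;
	}
	return true;
}
```

Whether the missing recycle step here is in the application or in the mlx5 zero-copy path is exactly what the rest of the thread goes on to debug.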
* Re: zero-copy between interfaces
  2020-01-17 17:16 ` [Intel-wired-lan] " Ryan Goodfellow
@ 2020-01-20 17:04   ` Björn Töpel
  -1 siblings, 0 replies; 49+ messages in thread
From: Björn Töpel @ 2020-01-20 17:04 UTC (permalink / raw)
To: Ryan Goodfellow, Björn Töpel
Cc: xdp-newbies, Karlsson, Magnus, intel-wired-lan, Magnus Karlsson

On 2020-01-17 18:16, Ryan Goodfellow wrote:
[...]
>
> https://gitlab.com/mergetb/tech/network-emulation/kernel/snippets/1931080
>

Ryan, thanks a lot for the detailed report! Much appreciated!

Long story short, the i40e crash is that the driver tries to allocate 256
queues, but the HW is short on queues. The driver enters a broken state,
which triggers the crash. I'll make sure we'll get a patch for this.

Björn

^ permalink raw reply	[flat|nested] 49+ messages in thread
end of thread, other threads:[~2020-02-07  9:01 UTC | newest]

Thread overview: 49+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-01-13  0:18 zero-copy between interfaces Ryan Goodfellow
2020-01-13  9:16 ` Magnus Karlsson
2020-01-13 10:43   ` Toke Høiland-Jørgensen
2020-01-13 15:25     ` Ryan Goodfellow
2020-01-13 17:09       ` Toke Høiland-Jørgensen
2020-01-14  7:47         ` Magnus Karlsson
2020-01-14  8:11           ` Toke Høiland-Jørgensen
2020-01-13 15:11   ` Ryan Goodfellow
2020-01-14  9:59     ` Magnus Karlsson
2020-01-14 20:52       ` Ryan Goodfellow
2020-01-15  1:41         ` Ryan Goodfellow
2020-01-15  7:40           ` Magnus Karlsson
2020-01-15  8:20             ` Magnus Karlsson
2020-01-16  2:04               ` Ryan Goodfellow
2020-01-16 14:32                 ` Magnus Karlsson
2020-01-17  9:45                   ` Magnus Karlsson
2020-01-17 17:05                     ` Ryan Goodfellow
2020-01-21  7:34                       ` Magnus Karlsson
2020-01-21 13:40                         ` Maxim Mikityanskiy
2020-01-22 21:43                           ` Ryan Goodfellow
2020-01-27 14:01                             ` Maxim Mikityanskiy
2020-01-27 15:54                               ` Magnus Karlsson
2020-01-30  9:37                                 ` Maxim Mikityanskiy
2020-01-30  9:59                                   ` Magnus Karlsson
2020-01-30 11:40                                     ` Magnus Karlsson
2020-02-04 16:10                                       ` Magnus Karlsson
2020-02-05 13:31                                         ` Magnus Karlsson
2020-02-06 14:56                                           ` Maxim Mikityanskiy
2020-02-07  9:01                                             ` Magnus Karlsson
2020-01-17 17:40                   ` William Tu
2020-01-13 11:41 ` Jesper Dangaard Brouer
2020-01-13 15:28   ` Ryan Goodfellow
2020-01-13 17:04     ` Jesper Dangaard Brouer
2020-01-17 17:54       ` Ryan Goodfellow
2020-01-18 10:14         ` Jesper Dangaard Brouer
2020-01-18 14:08           ` Ryan Goodfellow
2020-01-26  4:53 ` Dan Siemon
2020-01-17 12:32 ` Björn Töpel
2020-01-17 12:32   ` [Intel-wired-lan] " Björn Töpel
2020-01-17 17:16   ` Ryan Goodfellow
2020-01-17 17:16     ` [Intel-wired-lan] " Ryan Goodfellow
2020-01-17 18:10     ` Ryan Goodfellow
2020-01-17 18:10       ` [Intel-wired-lan] " Ryan Goodfellow
2020-01-20  8:24     ` Magnus Karlsson
2020-01-20  8:24       ` [Intel-wired-lan] " Magnus Karlsson
2020-01-20 18:33       ` Ryan Goodfellow
2020-01-20 18:33         ` [Intel-wired-lan] " Ryan Goodfellow
2020-01-20 17:04     ` Björn Töpel
2020-01-20 17:04       ` [Intel-wired-lan] " Björn Töpel