* kTLS in combination with mlx4 is very unstable
@ 2018-04-22 21:21 Andre Tomt
  2018-04-23 11:44 ` Andre Tomt
  2018-04-24 17:01 ` Dave Watson
  0 siblings, 2 replies; 5+ messages in thread
From: Andre Tomt @ 2018-04-22 21:21 UTC (permalink / raw)
  To: netdev; +Cc: davejwatson, Tariq Toukan

Hello!

kTLS looks fun, so I decided to play with it. It is quite spiffy - 
however with mlx4 I get kernel crashes I'm not seeing when testing on ixgbe.

For testing I'm using a git build of the "stream reflector" cubemap[1] 
configured with kTLS and 8 worker threads running on 4 physical cores, 
loading it up with a ~13Mbps MPEG-TS stream pulled from satellite TV.

The kernel seems to get increasingly unstable as I load it up with 
client connections. At about 9Gbps and 700 connections, it is okay at 
least for a while - it might run fine for say 45 minutes. Once it gets 
to 20 - 30Gbps, the kernel will usually start spewing OOPSes within 
minutes and the traffic drops.

Some bad interaction between mlx4 and kTLS?

Hardware is a quad core Xeon-D 1520 using a dual port Mellanox 
ConnectX-3 VPI with a single 40Gbps ethernet link configured. Mellanox 
mlx4 driver settings are kernel.org upstream defaults. Interface is 
configured with FQ qdisc and sockets are using BBR congestion control.

Tested on kernel 4.14.34, 4.15.17, and 4.16.2 - 4.16.3.

[1] https://git.sesse.net/?p=cubemap

First OOPS (from 4.16.3):
> [  660.467358] BUG: stack guard page was hit at 00000000b136e403 (stack is 00000000ded3f179..00000000835ee6c5)
> [  660.467422] kernel stack overflow (double-fault): 0000 [#1] SMP PTI
> [  660.467457] Modules linked in: coretemp intel_rapl sb_edac x86_pkg_temp_thermal intel_powerclamp iTCO_wdt gpio_ich iTCO_vendor_support kvm_intel mxm_wmi xfs libcrc32c kvm crc32c_generic irqbypass nls_iso8859_1 crct10dif_pclmul crc32_pclmul nls_cp437 ghash_clmulni_intel vfat fat aesni_intel aes_x86_64 crypto_simd cryptd glue_helper intel_pch_thermal mei_me sg mei lpc_ich mfd_core evdev ipmi_si ipmi_devintf ipmi_msghandler wmi acpi_pad tls ip_tables x_tables autofs4 ext4 crc16 mbcache jbd2 hid_generic usbhid hid mlx4_ib mlx4_en ib_core sd_mod ast i2c_algo_bit ttm drm_kms_helper syscopyarea sysfillrect ehci_pci xhci_pci sysimgblt fb_sys_fops ahci libahci xhci_hcd ehci_hcd libata crc32c_intel nvme drm usbcore scsi_mod mlx4_core ixgbe i2c_core mdio usb_common devlink hwmon nvme_core rtc_cmos
> [  660.467856] CPU: 4 PID: 660 Comm: cubemap Not tainted 4.16.0-1 #1
> [  660.467890] Hardware name: Supermicro Super Server/X10SDV-4C-TLN2F, BIOS 1.2c 09/19/2017
> [  660.467939] RIP: 0010:__kmalloc+0x7/0x1f0
> [  660.467962] RSP: 0018:ffffabafc27b8000 EFLAGS: 00010206
> [  660.467992] RAX: 000000000000000d RBX: 0000000000000010 RCX: ffffabafc27b8070
> [  660.468030] RDX: ffff98a0d0235490 RSI: 0000000001080020 RDI: 000000000000001d
> [  660.468069] RBP: 000000000000000d R08: ffff98a0d5be4860 R09: ffff98a0ec299180
> [  660.468106] R10: ffffabafc27b80b8 R11: 0000000000000010 R12: 0000000000000010
> [  660.468145] R13: ffff98a0ec299180 R14: ffff98a0ec299180 R15: 0000000000000000
> [  660.468184] FS:  00007f8a35ffb700(0000) GS:ffff98a17fd00000(0000) knlGS:0000000000000000
> [  660.468227] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  660.468258] CR2: ffffabafc27b7ff8 CR3: 00000004698ee001 CR4: 00000000003606e0
> [  660.468297] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [  660.468334] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [  660.468373] Call Trace:
> [  660.468401]  gcmaes_encrypt.constprop.5+0x137/0x240 [aesni_intel]
> [  660.468439]  ? generic_gcmaes_encrypt+0x5f/0x80 [aesni_intel]
> [  660.468476]  ? gcmaes_wrapper_encrypt+0x36/0x80 [aesni_intel]
> [  660.468511]  ? tls_push_record+0x1d3/0x390 [tls]
> [  660.468537]  ? tls_push_record+0x1d3/0x390 [tls]
> [  660.468565]  ? tls_write_space+0x6a/0x80 [tls]
> [  660.468593]  ? do_tcp_sendpages+0x8d/0x580
> [  660.468618]  ? tls_push_sg+0x74/0x130 [tls]
> [  660.468643]  ? tls_push_record+0x24a/0x390 [tls]
> [  660.468671]  ? tls_write_space+0x6a/0x80 [tls]
> [  660.468697]  ? do_tcp_sendpages+0x8d/0x580
> [  660.468722]  ? tls_push_sg+0x74/0x130 [tls]
> [  660.468748]  ? tls_push_record+0x24a/0x390 [tls]
> [  660.468776]  ? tls_write_space+0x6a/0x80 [tls]
> [  660.468802]  ? do_tcp_sendpages+0x8d/0x580
> [  660.468826]  ? tls_push_sg+0x74/0x130 [tls]
> [  660.468852]  ? tls_push_record+0x24a/0x390 [tls]
> [  660.468880]  ? tls_write_space+0x6a/0x80 [tls]
> [  660.468906]  ? do_tcp_sendpages+0x8d/0x580
> [  660.468931]  ? tls_push_sg+0x74/0x130 [tls]
> [  660.468957]  ? tls_push_record+0x24a/0x390 [tls]
> [  660.470165]  ? tls_write_space+0x6a/0x80 [tls]
> [  660.471363]  ? do_tcp_sendpages+0x8d/0x580
> [  660.472555]  ? tls_push_sg+0x74/0x130 [tls]
> [  660.473713]  ? tls_push_record+0x24a/0x390 [tls]
> [  660.474838]  ? tls_write_space+0x6a/0x80 [tls]
> [  660.475927]  ? do_tcp_sendpages+0x8d/0x580
> [  660.476977]  ? tls_push_sg+0x74/0x130 [tls]
> [  660.477999]  ? tls_push_record+0x24a/0x390 [tls]
> [  660.478968]  ? tls_write_space+0x6a/0x80 [tls]
> [  660.479902]  ? do_tcp_sendpages+0x8d/0x580
> [  660.480790]  ? tls_push_sg+0x74/0x130 [tls]
> [  660.481644]  ? tls_push_record+0x24a/0x390 [tls]
> [  660.482483]  ? tls_write_space+0x6a/0x80 [tls]
> [  660.483301]  ? do_tcp_sendpages+0x8d/0x580
> [  660.484099]  ? tls_push_sg+0x74/0x130 [tls]
> [  660.484891]  ? tls_push_record+0x24a/0x390 [tls]
> [  660.485674]  ? tls_write_space+0x6a/0x80 [tls]
> [  660.486455]  ? do_tcp_sendpages+0x8d/0x580
> [  660.487220]  ? tls_push_sg+0x74/0x130 [tls]
> [  660.487890]  ? tls_push_record+0x24a/0x390 [tls]
> [  660.488328]  ? tls_write_space+0x6a/0x80 [tls]
> [  660.488748]  ? do_tcp_sendpages+0x8d/0x580
> [  660.489167]  ? tls_push_sg+0x74/0x130 [tls]
> [  660.489565]  ? tls_push_record+0x24a/0x390 [tls]
> [  660.489970]  ? tls_write_space+0x6a/0x80 [tls]
> [  660.490370]  ? do_tcp_sendpages+0x8d/0x580
> [  660.490771]  ? tls_push_sg+0x74/0x130 [tls]
> [  660.491165]  ? tls_push_record+0x24a/0x390 [tls]
> [  660.491550]  ? tls_write_space+0x6a/0x80 [tls]
> [  660.491914]  ? do_tcp_sendpages+0x8d/0x580
> [  660.492274]  ? tls_push_sg+0x74/0x130 [tls]
> [  660.492641]  ? tls_push_record+0x24a/0x390 [tls]
> [  660.493008]  ? tls_write_space+0x6a/0x80 [tls]
> [  660.493374]  ? do_tcp_sendpages+0x8d/0x580
> [  660.493787]  ? tls_push_sg+0x74/0x130 [tls]
> [  660.494177]  ? tls_push_record+0x24a/0x390 [tls]
> [  660.494585]  ? tls_write_space+0x6a/0x80 [tls]
> [  660.494972]  ? do_tcp_sendpages+0x8d/0x580
> [  660.495359]  ? tls_push_sg+0x74/0x130 [tls]
> [  660.495742]  ? tls_push_record+0x24a/0x390 [tls]
> [  660.496128]  ? tls_write_space+0x6a/0x80 [tls]
> [  660.496512]  ? do_tcp_sendpages+0x8d/0x580
> [  660.496901]  ? tls_push_sg+0x74/0x130 [tls]
> [  660.497301]  ? tls_push_record+0x24a/0x390 [tls]
> [  660.497697]  ? tls_write_space+0x6a/0x80 [tls]
> [  660.498096]  ? do_tcp_sendpages+0x8d/0x580
> [  660.498490]  ? tls_push_sg+0x74/0x130 [tls]
> [  660.498884]  ? tls_push_record+0x24a/0x390 [tls]
> [  660.499291]  ? tls_write_space+0x6a/0x80 [tls]
> [  660.499700]  ? do_tcp_sendpages+0x8d/0x580
> [  660.500103]  ? tls_push_sg+0x74/0x130 [tls]
> [  660.500511]  ? tls_push_record+0x24a/0x390 [tls]
> [  660.500909]  ? tls_write_space+0x6a/0x80 [tls]
> [  660.501326]  ? do_tcp_sendpages+0x8d/0x580
> [  660.501737]  ? tls_push_sg+0x74/0x130 [tls]
> [  660.502131]  ? tls_push_record+0x24a/0x390 [tls]
> [  660.502525]  ? tls_write_space+0x6a/0x80 [tls]
> [  660.502928]  ? do_tcp_sendpages+0x8d/0x580
> [  660.503331]  ? tls_push_sg+0x74/0x130 [tls]
> [  660.503724]  ? tls_push_record+0x24a/0x390 [tls]
> [  660.504127]  ? tls_write_space+0x6a/0x80 [tls]
> [  660.504547]  ? do_tcp_sendpages+0x8d/0x580
> [  660.504949]  ? tls_push_sg+0x74/0x130 [tls]
> [  660.505348]  ? tls_push_record+0x24a/0x390 [tls]
> [  660.505769]  ? tls_write_space+0x6a/0x80 [tls]
> [  660.506207]  ? do_tcp_sendpages+0x8d/0x580
> [  660.506622]  ? tls_push_sg+0x74/0x130 [tls]
> [  660.507030]  ? tls_push_record+0x24a/0x390 [tls]
> [  660.507435]  ? tls_write_space+0x6a/0x80 [tls]
> [  660.507841]  ? do_tcp_sendpages+0x8d/0x580
> [  660.508518]  ? tls_push_sg+0x74/0x130 [tls]
> [  660.509261]  ? tls_push_record+0x24a/0x390 [tls]
> [  660.510011]  ? tls_write_space+0x6a/0x80 [tls]
> [  660.510187] BUG: stack guard page was hit at 00000000e0315e51 (stack is 00000000bea6f919..0000000005fc5eb4)
> [  660.510473] BUG: stack guard page was hit at 000000004b958a15 (stack is 000000001f2af2d1..000000006295a4b1)
> [  660.510758]  ? do_tcp_sendpages+0x8d/0x580
> [  660.513094]  ? tls_push_sg+0x74/0x130 [tls]
> [  660.513886]  ? tls_push_record+0x24a/0x390 [tls]
> [  660.514680]  ? tls_write_space+0x6a/0x80 [tls]
> [  660.515487]  ? do_tcp_sendpages+0x8d/0x580
> [  660.515750] BUG: stack guard page was hit at 00000000bc93cf0d (stack is 0000000031a15c9c..0000000029a82776)
> [  660.516295]  ? tls_push_sg+0x74/0x130 [tls]
> [  660.518017]  ? tls_push_record+0x24a/0x390 [tls]
> [  660.518883]  ? tls_write_space+0x6a/0x80 [tls]
> [  660.519752]  ? do_tcp_sendpages+0x8d/0x580
> [  660.519816] BUG: stack guard page was hit at 000000002d1db286 (stack is 00000000b5bb06d4..000000007a29c8f2)
> [  660.520544]  ? tls_push_sg+0x74/0x130 [tls]
> [  660.522315]  ? tls_push_record+0x24a/0x390 [tls]
> [  660.523162]  ? tls_write_space+0x6a/0x80 [tls]
> [  660.524006]  ? do_tcp_sendpages+0x8d/0x580
> [  660.524849]  ? tls_push_sg+0x74/0x130 [tls]
> [  660.525695]  ? tls_push_record+0x24a/0x390 [tls]
> [  660.526545]  ? tls_write_space+0x6a/0x80 [tls]
> [  660.527399]  ? do_tcp_sendpages+0x8d/0x580
> [  660.528247]  ? tls_push_sg+0x74/0x130 [tls]
> [  660.529099]  ? tls_push_record+0x24a/0x390 [tls]
> [  660.529955]  ? tls_write_space+0x6a/0x80 [tls]
> [  660.530797]  ? do_tcp_sendpages+0x8d/0x580
> [  660.531643]  ? tls_push_sg+0x74/0x130 [tls]
> [  660.532010] BUG: stack guard page was hit at 0000000027abda92 (stack is 00000000aadcb221..00000000a587b67b)
> [  660.532535]  ? tls_push_record+0x24a/0x390 [tls]
> [  660.534511]  ? tls_write_space+0x6a/0x80 [tls]
> [  660.535506]  ? do_tcp_sendpages+0x8d/0x580
> [  660.536500]  ? tls_push_sg+0x74/0x130 [tls]
> [  660.537495]  ? tls_push_record+0x24a/0x390 [tls]
> [  660.538493]  ? tls_write_space+0x6a/0x80 [tls]
> [  660.539482]  ? do_tcp_sendpages+0x8d/0x580
> [  660.540462]  ? tls_push_sg+0x74/0x130 [tls]
> [  660.541447]  ? tls_push_record+0x24a/0x390 [tls]
> [  660.542430]  ? tls_write_space+0x6a/0x80 [tls]
> [  660.543411]  ? do_tcp_sendpages+0x8d/0x580
> [  660.544395]  ? tls_push_sg+0x74/0x130 [tls]
> [  660.545382]  ? tls_push_record+0x24a/0x390 [tls]
> [  660.546365]  ? tls_write_space+0x6a/0x80 [tls]
> [  660.547347]  ? do_tcp_sendpages+0x8d/0x580
> [  660.548334]  ? tls_push_sg+0x74/0x130 [tls]
> [  660.549318]  ? tls_push_record+0x24a/0x390 [tls]
> [  660.550300]  ? tls_write_space+0x6a/0x80 [tls]
> [  660.551284]  ? do_tcp_sendpages+0x8d/0x580
> [  660.552267]  ? tls_push_sg+0x74/0x130 [tls]
> [  660.553250]  ? tls_push_record+0x24a/0x390 [tls]
> [  660.554205]  ? tls_write_space+0x6a/0x80 [tls]
> [  660.555158]  ? do_tcp_sendpages+0x8d/0x580
> [  660.556083]  ? tls_push_sg+0x74/0x130 [tls]
> [  660.557009]  ? tls_push_record+0x24a/0x390 [tls]
> [  660.557936]  ? tls_write_space+0x6a/0x80 [tls]
> [  660.558862]  ? do_tcp_sendpages+0x8d/0x580
> [  660.559786]  ? tls_push_sg+0x74/0x130 [tls]
> [  660.560681]  ? tls_push_record+0x24a/0x390 [tls]
> [  660.561547]  ? tls_write_space+0x6a/0x80 [tls]
> [  660.562413]  ? do_tcp_sendpages+0x8d/0x580
> [  660.563279]  ? tls_push_sg+0x74/0x130 [tls]
> [  660.564143]  ? tls_push_record+0x24a/0x390 [tls]
> [  660.564979]  ? tls_write_space+0x6a/0x80 [tls]
> [  660.565783]  ? do_tcp_sendpages+0x8d/0x580
> [  660.566587]  ? tls_push_sg+0x74/0x130 [tls]
> [  660.567392]  ? tls_push_record+0x24a/0x390 [tls]
> [  660.568197]  ? tls_write_space+0x6a/0x80 [tls]
> [  660.569000]  ? do_tcp_sendpages+0x8d/0x580
> [  660.569804]  ? tls_push_sg+0x74/0x130 [tls]
> [  660.570609]  ? tls_push_record+0x24a/0x390 [tls]
> [  660.571415]  ? tls_write_space+0x6a/0x80 [tls]
> [  660.572218]  ? do_tcp_sendpages+0x8d/0x580
> [  660.573023]  ? tls_push_sg+0x74/0x130 [tls]
> [  660.573830]  ? tls_push_record+0x24a/0x390 [tls]
> [  660.574634]  ? tls_write_space+0x6a/0x80 [tls]
> [  660.575437]  ? do_tcp_sendpages+0x8d/0x580
> [  660.576210]  ? tls_push_sg+0x74/0x130 [tls]
> [  660.576953]  ? tls_push_record+0x24a/0x390 [tls]
> [  660.577698]  ? tls_write_space+0x6a/0x80 [tls]
> [  660.578441]  ? do_tcp_sendpages+0x8d/0x580
> [  660.579183]  ? tls_push_sg+0x74/0x130 [tls]
> [  660.579929]  ? tls_push_record+0x24a/0x390 [tls]
> [  660.580673]  ? tls_write_space+0x6a/0x80 [tls]
> [  660.581417]  ? do_tcp_sendpages+0x8d/0x580
> [  660.582159]  ? tls_push_sg+0x74/0x130 [tls]
> [  660.582904]  ? tls_push_record+0x24a/0x390 [tls]
> [  660.583649]  ? tls_write_space+0x6a/0x80 [tls]
> [  660.584394]  ? do_tcp_sendpages+0x8d/0x580
> [  660.585137]  ? tls_push_sg+0x74/0x130 [tls]
> [  660.585882]  ? tls_push_record+0x24a/0x390 [tls]
> [  660.586628]  ? tls_write_space+0x6a/0x80 [tls]
> [  660.587372]  ? do_tcp_sendpages+0x8d/0x580
> [  660.588115]  ? tls_push_sg+0x74/0x130 [tls]
> [  660.588861]  ? tls_push_record+0x24a/0x390 [tls]
> [  660.589605]  ? tls_write_space+0x6a/0x80 [tls]
> [  660.590350]  ? do_tcp_sendpages+0x8d/0x580
> [  660.591093]  ? tls_push_sg+0x74/0x130 [tls]
> [  660.591840]  ? tls_push_record+0x24a/0x390 [tls]
> [  660.592585]  ? tls_write_space+0x6a/0x80 [tls]
> [  660.593328]  ? do_tcp_sendpages+0x8d/0x580
> [  660.594072]  ? tls_push_sg+0x74/0x130 [tls]
> [  660.594816]  ? tls_push_record+0x24a/0x390 [tls]
> [  660.595563]  ? tls_write_space+0x6a/0x80 [tls]
> [  660.596308]  ? do_tcp_sendpages+0x8d/0x580
> [  660.597050]  ? tls_push_sg+0x74/0x130 [tls]
> [  660.597794]  ? tls_push_record+0x24a/0x390 [tls]
> [  660.598540]  ? tls_write_space+0x6a/0x80 [tls]
> [  660.599281]  ? do_tcp_sendpages+0x8d/0x580
> [  660.600025]  ? tls_push_sg+0x74/0x130 [tls]
> [  660.600772]  ? tls_push_record+0x24a/0x390 [tls]
> [  660.601517]  ? tls_write_space+0x6a/0x80 [tls]
> [  660.602260]  ? do_tcp_sendpages+0x8d/0x580
> [  660.603003]  ? tls_push_sg+0x74/0x130 [tls]
> [  660.603750]  ? tls_push_record+0x24a/0x390 [tls]
> [  660.604495]  ? tls_write_space+0x6a/0x80 [tls]
> [  660.605238]  ? do_tcp_sendpages+0x8d/0x580
> [  660.605981]  ? tls_push_sg+0x74/0x130 [tls]
> [  660.606726]  ? tls_push_record+0x24a/0x390 [tls]
> [  660.607474]  ? tls_sw_sendpage+0x14a/0x390 [tls]
> [  660.608214]  ? direct_splice_actor+0x40/0x40
> [  660.608951]  ? inet_sendpage+0x40/0xf0
> [  660.609689]  ? kernel_sendpage+0x1a/0x30
> [  660.610426]  ? sock_sendpage+0x20/0x30
> [  660.611161]  ? pipe_to_sendpage+0x5f/0x70
> [  660.611898]  ? __splice_from_pipe+0x80/0x180
> [  660.612637]  ? generic_file_splice_read+0x100/0x150
> [  660.613382]  ? direct_splice_actor+0x40/0x40
> [  660.614128]  ? splice_from_pipe+0x4f/0x70
> [  660.614871]  ? direct_splice_actor+0x35/0x40
> [  660.615619]  ? splice_direct_to_actor+0xce/0x1d0
> [  660.616368]  ? generic_pipe_buf_nosteal+0x10/0x10
> [  660.617122]  ? do_splice_direct+0x8c/0xa0
> [  660.617876]  ? do_sendfile+0x19d/0x380
> [  660.618626]  ? SyS_sendfile64+0x4c/0x90
> [  660.619376]  ? do_syscall_64+0x7a/0x390
> [  660.620121]  ? do_page_fault+0x31/0x130
> [  660.620863]  ? entry_SYSCALL_64_after_hwframe+0x3d/0xa2
> [  660.621618] Code: 24 08 4c 89 e9 48 89 de e8 d7 66 63 00 4d 8b 17 5a 4d 85 d2 75 d7 e9 e9 fe ff ff 48 89 c5 e9 e1 fe ff ff 90 0f 1f 44 00 00 41 57 <41> 56 41 55 41 54 55 53 48 81 ff 00 20 00 00 0f 87 a4 01 00 00 
> [  660.623316] RIP: __kmalloc+0x7/0x1f0 RSP: ffffabafc27b8000
> [  660.624168] ---[ end trace 7f6206177c0cc58f ]---
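
The trace above repeats the same few frames many times before the stack guard page is hit. As a rough aid for triaging logs like this one, here is a minimal Python sketch that extracts the function names from a pasted trace and looks for a repeating cycle. It is not part of the original report. The regular expression and the helper names `frames()` and `find_cycle()` are assumptions made for illustration, and frames marked `?` are unwinder guesses, so the output is a hint rather than proof of the call chain.

```python
import re

# Matches "] [? ]symbol+0xOFF/0xSIZE" as printed in the oops call trace.
FRAME_RE = re.compile(r"\]\s+(?:\?\s+)?([A-Za-z_][\w.]*)\+0x[0-9a-f]+/0x[0-9a-f]+")


def frames(log_text):
    """Return the function names of all call-trace lines, in log order."""
    names = []
    for line in log_text.splitlines():
        match = FRAME_RE.search(line)
        if match:
            names.append(match.group(1))
    return names


def find_cycle(names, max_period=8, min_repeats=3):
    """Return (start, period, repeats) for the longest periodic run, or None.

    A run is periodic with period p when names[k] == names[k + p] over it.
    Ties are resolved in favour of the shortest period.
    """
    best = None
    for period in range(1, max_period + 1):
        i = 0
        while i + period <= len(names):
            j = i
            while j + period < len(names) and names[j] == names[j + period]:
                j += 1
            repeats = (j - i + period) // period
            covered = repeats * period
            if repeats >= min_repeats and (best is None or covered > best[1] * best[2]):
                best = (i, period, repeats)
            i = j + 1 if j > i else i + 1
    return best
```

Feeding the full oops text to `frames()` and the result to `find_cycle()` should report a period of four frames (`tls_write_space`, `do_tcp_sendpages`, `tls_push_sg`, `tls_push_record` in some rotation), which points at recursion rather than one deep but finite call chain.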


* Re: kTLS in combination with mlx4 is very unstable
  2018-04-22 21:21 kTLS in combination with mlx4 is very unstable Andre Tomt
@ 2018-04-23 11:44 ` Andre Tomt
  2018-04-24 17:01 ` Dave Watson
  1 sibling, 0 replies; 5+ messages in thread
From: Andre Tomt @ 2018-04-23 11:44 UTC (permalink / raw)
  To: netdev; +Cc: davejwatson, Tariq Toukan

On 22 April 2018 23:21, Andre Tomt wrote:
> Hello!
> 
> kTLS looks fun, so I decided to play with it. It is quite spiffy - 
> however with mlx4 I get kernel crashes I'm not seeing when testing on 
> ixgbe.
> 
> For testing I'm using a git build of the "stream reflector" cubemap[1] 
> configured with kTLS and 8 worker threads running on 4 physical cores, 
> loading it up with a ~13Mbps MPEG-TS stream pulled from satellite TV.
> 
> The kernel seems to get increasingly unstable as I load it up with 
> client connections. At about 9Gbps and 700 connections, it is okay at 
> least for a while - it might run fine for say 45 minutes. Once it gets 
> to 20 - 30Gbps, the kernel will usually start spewing OOPSes within 
> minutes and the traffic drops.
> 
> Some bad interaction between mlx4 and kTLS?
> 
> Hardware is a quad core Xeon-D 1520 using a dual port Mellanox 
> ConnectX-3 VPI with a single 40Gbps ethernet link configured. Mellanox 
> mlx4 driver settings are kernel.org upstream defaults. Interface is 
> configured with FQ qdisc and sockets are using BBR congestion control.
> 
> Tested on kernel 4.14.34, 4.15.17, and 4.16.2 - 4.16.3.
> 
> [1] https://git.sesse.net/?p=cubemap
> 
> First OOPS (from 4.16.3)

It also blows up with a similar trace on 4.17-rc2.


* Re: kTLS in combination with mlx4 is very unstable
  2018-04-22 21:21 kTLS in combination with mlx4 is very unstable Andre Tomt
  2018-04-23 11:44 ` Andre Tomt
@ 2018-04-24 17:01 ` Dave Watson
  2018-05-01 16:09   ` Dave Watson
  1 sibling, 1 reply; 5+ messages in thread
From: Dave Watson @ 2018-04-24 17:01 UTC (permalink / raw)
  To: Andre Tomt; +Cc: netdev, borisp, Aviad Yehezkel

On 04/22/18 11:21 PM, Andre Tomt wrote:
> kTLS looks fun, so I decided to play with it. It is quite spiffy - however
> with mlx4 I get kernel crashes I'm not seeing when testing on ixgbe.
> 
> For testing I'm using a git build of the "stream reflector" cubemap[1]
> configured with kTLS and 8 worker threads running on 4 physical cores,
> loading it up with a ~13Mbps MPEG-TS stream pulled from satellite TV.
> 
> The kernel seems to get increasingly unstable as I load it up with client
> connections. At about 9Gbps and 700 connections, it is okay at least for a
> while - it might run fine for say 45 minutes. Once it gets to 20 - 30Gbps,
> the kernel will usually start spewing OOPSes within minutes and the traffic
> drops.
> 
> Some bad interaction between mlx4 and kTLS?

I'm not aware of any mlx4-specific issues, but it looks like
there is enough information here to fix the stack overflow from
recursive callbacks. I'll see if I can come up with something.

Thanks for the report.
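
To illustrate what "recursive callbacks" means in this context: the trace suggests that tls_push_record ends up in do_tcp_sendpages, which invokes the socket's write-space callback (tls_write_space), which in turn calls tls_push_record again. The following toy model in Python shows that pattern and one common way to break it, a re-entrancy flag. Everything here (`ToySocket`, `guard`, `in_send`, `lower_send`) is hypothetical and for illustration only. It is not kernel code and not necessarily the shape of whatever fix gets written.

```python
class ToySocket:
    """Hypothetical stand-in for a socket with a write-space callback."""

    def __init__(self, pending_records, guard):
        self.pending = pending_records  # records still waiting to be sent
        self.guard = guard              # enable the re-entrancy check
        self.in_send = False            # True while inside lower_send()
        self.depth = 0                  # current nesting of push_record()
        self.max_depth = 0              # deepest nesting observed

    def write_space(self):
        # Callback fired by the lower layer when buffer space frees up.
        if self.guard and self.in_send:
            return  # already sending; the outer loop will pick up the rest
        self.push_record()

    def push_record(self):
        self.depth += 1
        self.max_depth = max(self.max_depth, self.depth)
        while self.pending > 0:
            self.pending -= 1
            self.in_send = True
            self.lower_send()  # may call back into write_space()
            self.in_send = False
        self.depth -= 1

    def lower_send(self):
        # The lower layer reports free space right after accepting data.
        self.write_space()


unguarded = ToySocket(pending_records=50, guard=False)
unguarded.write_space()

guarded = ToySocket(pending_records=50, guard=True)
guarded.write_space()

print(unguarded.max_depth, guarded.max_depth)
```

Without the flag, nesting grows with the number of pending records, which on a fixed-size kernel stack eventually hits the guard page. With the flag, the same work is done iteratively at constant depth.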

> 
> [1] https://git.sesse.net/?p=cubemap
> 
> First OOPS (from 4.16.3):
> > [  660.467358] BUG: stack guard page was hit at 00000000b136e403 (stack is 00000000ded3f179..00000000835ee6c5)
> > [  660.467422] kernel stack overflow (double-fault): 0000 [#1] SMP PTI
> > [  660.467457] Modules linked in: coretemp intel_rapl sb_edac x86_pkg_temp_thermal intel_powerclamp iTCO_wdt gpio_ich iTCO_vendor_support kvm_intel mxm_wmi xfs libcrc32c kvm crc32c_generic irqbypass nls_iso8859_1 crct10dif_pclmul crc32_pclmul nls_cp437 ghash_clmulni_intel vfat fat aesni_intel aes_x86_64 crypto_simd cryptd glue_helper intel_pch_thermal mei_me sg mei lpc_ich mfd_core evdev ipmi_si ipmi_devintf ipmi_msghandler wmi acpi_pad tls ip_tables x_tables autofs4 ext4 crc16 mbcache jbd2 hid_generic usbhid hid mlx4_ib mlx4_en ib_core sd_mod ast i2c_algo_bit ttm drm_kms_helper syscopyarea sysfillrect ehci_pci xhci_pci sysimgblt fb_sys_fops ahci libahci xhci_hcd ehci_hcd libata crc32c_intel nvme drm usbcore scsi_mod mlx4_core ixgbe i2c_core mdio usb_common devlink hwmon nvme_core rtc_cmos
> > [  660.467856] CPU: 4 PID: 660 Comm: cubemap Not tainted 4.16.0-1 #1
> > [  660.467890] Hardware name: Supermicro Super Server/X10SDV-4C-TLN2F, BIOS 1.2c 09/19/2017
> > [  660.467939] RIP: 0010:__kmalloc+0x7/0x1f0
> > [  660.467962] RSP: 0018:ffffabafc27b8000 EFLAGS: 00010206
> > [  660.467992] RAX: 000000000000000d RBX: 0000000000000010 RCX: ffffabafc27b8070
> > [  660.468030] RDX: ffff98a0d0235490 RSI: 0000000001080020 RDI: 000000000000001d
> > [  660.468069] RBP: 000000000000000d R08: ffff98a0d5be4860 R09: ffff98a0ec299180
> > [  660.468106] R10: ffffabafc27b80b8 R11: 0000000000000010 R12: 0000000000000010
> > [  660.468145] R13: ffff98a0ec299180 R14: ffff98a0ec299180 R15: 0000000000000000
> > [  660.468184] FS:  00007f8a35ffb700(0000) GS:ffff98a17fd00000(0000) knlGS:0000000000000000
> > [  660.468227] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [  660.468258] CR2: ffffabafc27b7ff8 CR3: 00000004698ee001 CR4: 00000000003606e0
> > [  660.468297] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > [  660.468334] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> > [  660.468373] Call Trace:
> > [  660.468401]  gcmaes_encrypt.constprop.5+0x137/0x240 [aesni_intel]
> > [  660.468439]  ? generic_gcmaes_encrypt+0x5f/0x80 [aesni_intel]
> > [  660.468476]  ? gcmaes_wrapper_encrypt+0x36/0x80 [aesni_intel]
> > [  660.468511]  ? tls_push_record+0x1d3/0x390 [tls]
> > [  660.468537]  ? tls_push_record+0x1d3/0x390 [tls]
> > [  660.468565]  ? tls_write_space+0x6a/0x80 [tls]
> > [  660.468593]  ? do_tcp_sendpages+0x8d/0x580
> > [  660.468618]  ? tls_push_sg+0x74/0x130 [tls]
> > [  660.468643]  ? tls_push_record+0x24a/0x390 [tls]
> > [  660.468671]  ? tls_write_space+0x6a/0x80 [tls]
> > [  660.468697]  ? do_tcp_sendpages+0x8d/0x580
> > [  660.468722]  ? tls_push_sg+0x74/0x130 [tls]
> > [  660.468748]  ? tls_push_record+0x24a/0x390 [tls]
> > [  660.468776]  ? tls_write_space+0x6a/0x80 [tls]
> > [  660.468802]  ? do_tcp_sendpages+0x8d/0x580
> > [  660.468826]  ? tls_push_sg+0x74/0x130 [tls]
> > [  660.468852]  ? tls_push_record+0x24a/0x390 [tls]
> > [  660.468880]  ? tls_write_space+0x6a/0x80 [tls]
> > [  660.468906]  ? do_tcp_sendpages+0x8d/0x580
> > [  660.468931]  ? tls_push_sg+0x74/0x130 [tls]
> > [  660.468957]  ? tls_push_record+0x24a/0x390 [tls]
> > [  660.470165]  ? tls_write_space+0x6a/0x80 [tls]
> > [  660.471363]  ? do_tcp_sendpages+0x8d/0x580
> > [  660.472555]  ? tls_push_sg+0x74/0x130 [tls]
> > [  660.473713]  ? tls_push_record+0x24a/0x390 [tls]
> > [  660.474838]  ? tls_write_space+0x6a/0x80 [tls]
> > [  660.475927]  ? do_tcp_sendpages+0x8d/0x580
> > [  660.476977]  ? tls_push_sg+0x74/0x130 [tls]
> > [  660.477999]  ? tls_push_record+0x24a/0x390 [tls]
> > [  660.478968]  ? tls_write_space+0x6a/0x80 [tls]
> > [  660.479902]  ? do_tcp_sendpages+0x8d/0x580
> > [  660.480790]  ? tls_push_sg+0x74/0x130 [tls]
> > [  660.481644]  ? tls_push_record+0x24a/0x390 [tls]
> > [  660.482483]  ? tls_write_space+0x6a/0x80 [tls]
> > [  660.483301]  ? do_tcp_sendpages+0x8d/0x580
> > [  660.484099]  ? tls_push_sg+0x74/0x130 [tls]
> > [  660.484891]  ? tls_push_record+0x24a/0x390 [tls]
> > [  660.485674]  ? tls_write_space+0x6a/0x80 [tls]
> > [  660.486455]  ? do_tcp_sendpages+0x8d/0x580
> > [  660.487220]  ? tls_push_sg+0x74/0x130 [tls]
> > [  660.487890]  ? tls_push_record+0x24a/0x390 [tls]
> > [  660.488328]  ? tls_write_space+0x6a/0x80 [tls]
> > [  660.488748]  ? do_tcp_sendpages+0x8d/0x580
> > [  660.489167]  ? tls_push_sg+0x74/0x130 [tls]
> > [  660.489565]  ? tls_push_record+0x24a/0x390 [tls]
> > [  660.489970]  ? tls_write_space+0x6a/0x80 [tls]
> > [  660.490370]  ? do_tcp_sendpages+0x8d/0x580
> > [  660.490771]  ? tls_push_sg+0x74/0x130 [tls]
> > [  660.491165]  ? tls_push_record+0x24a/0x390 [tls]
> > [  660.491550]  ? tls_write_space+0x6a/0x80 [tls]
> > [  660.491914]  ? do_tcp_sendpages+0x8d/0x580
> > [  660.492274]  ? tls_push_sg+0x74/0x130 [tls]
> > [  660.492641]  ? tls_push_record+0x24a/0x390 [tls]
> > [  660.493008]  ? tls_write_space+0x6a/0x80 [tls]
> > [  660.493374]  ? do_tcp_sendpages+0x8d/0x580
> > [  660.493787]  ? tls_push_sg+0x74/0x130 [tls]
> > [  660.494177]  ? tls_push_record+0x24a/0x390 [tls]
> > [  660.494585]  ? tls_write_space+0x6a/0x80 [tls]
> > [  660.494972]  ? do_tcp_sendpages+0x8d/0x580
> > [  660.495359]  ? tls_push_sg+0x74/0x130 [tls]
> > [  660.495742]  ? tls_push_record+0x24a/0x390 [tls]
> > [  660.496128]  ? tls_write_space+0x6a/0x80 [tls]
> > [  660.496512]  ? do_tcp_sendpages+0x8d/0x580
> > [  660.496901]  ? tls_push_sg+0x74/0x130 [tls]
> > [  660.497301]  ? tls_push_record+0x24a/0x390 [tls]
> > [  660.497697]  ? tls_write_space+0x6a/0x80 [tls]
> > [  660.498096]  ? do_tcp_sendpages+0x8d/0x580
> > [  660.498490]  ? tls_push_sg+0x74/0x130 [tls]
> > [  660.498884]  ? tls_push_record+0x24a/0x390 [tls]
> > [  660.499291]  ? tls_write_space+0x6a/0x80 [tls]
> > [  660.499700]  ? do_tcp_sendpages+0x8d/0x580
> > [  660.500103]  ? tls_push_sg+0x74/0x130 [tls]
> > [  660.500511]  ? tls_push_record+0x24a/0x390 [tls]
> > [  660.500909]  ? tls_write_space+0x6a/0x80 [tls]
> > [  660.501326]  ? do_tcp_sendpages+0x8d/0x580
> > [  660.501737]  ? tls_push_sg+0x74/0x130 [tls]
> > [  660.502131]  ? tls_push_record+0x24a/0x390 [tls]
> > [  660.502525]  ? tls_write_space+0x6a/0x80 [tls]
> > [  660.502928]  ? do_tcp_sendpages+0x8d/0x580
> > [  660.503331]  ? tls_push_sg+0x74/0x130 [tls]
> > [  660.503724]  ? tls_push_record+0x24a/0x390 [tls]
> > [  660.504127]  ? tls_write_space+0x6a/0x80 [tls]
> > [  660.504547]  ? do_tcp_sendpages+0x8d/0x580
> > [  660.504949]  ? tls_push_sg+0x74/0x130 [tls]
> > [  660.505348]  ? tls_push_record+0x24a/0x390 [tls]
> > [  660.505769]  ? tls_write_space+0x6a/0x80 [tls]
> > [  660.506207]  ? do_tcp_sendpages+0x8d/0x580
> > [  660.506622]  ? tls_push_sg+0x74/0x130 [tls]
> > [  660.507030]  ? tls_push_record+0x24a/0x390 [tls]
> > [  660.507435]  ? tls_write_space+0x6a/0x80 [tls]
> > [  660.507841]  ? do_tcp_sendpages+0x8d/0x580
> > [  660.508518]  ? tls_push_sg+0x74/0x130 [tls]
> > [  660.509261]  ? tls_push_record+0x24a/0x390 [tls]
> > [  660.510011]  ? tls_write_space+0x6a/0x80 [tls]
> > [  660.510187] BUG: stack guard page was hit at 00000000e0315e51 (stack is 00000000bea6f919..0000000005fc5eb4)
> > [  660.510473] BUG: stack guard page was hit at 000000004b958a15 (stack is 000000001f2af2d1..000000006295a4b1)
> > [  660.510758]  ? do_tcp_sendpages+0x8d/0x580
> > [  660.513094]  ? tls_push_sg+0x74/0x130 [tls]
> > [  660.513886]  ? tls_push_record+0x24a/0x390 [tls]
> > [  660.514680]  ? tls_write_space+0x6a/0x80 [tls]
> > [  660.515487]  ? do_tcp_sendpages+0x8d/0x580
> > [  660.515750] BUG: stack guard page was hit at 00000000bc93cf0d (stack is 0000000031a15c9c..0000000029a82776)
> > [  660.516295]  ? tls_push_sg+0x74/0x130 [tls]
> > [  660.518017]  ? tls_push_record+0x24a/0x390 [tls]
> > [  660.518883]  ? tls_write_space+0x6a/0x80 [tls]
> > [  660.519752]  ? do_tcp_sendpages+0x8d/0x580
> > [  660.519816] BUG: stack guard page was hit at 000000002d1db286 (stack is 00000000b5bb06d4..000000007a29c8f2)
> > [  660.520544]  ? tls_push_sg+0x74/0x130 [tls]
> > [  660.522315]  ? tls_push_record+0x24a/0x390 [tls]
> > [  660.523162]  ? tls_write_space+0x6a/0x80 [tls]
> > [  660.524006]  ? do_tcp_sendpages+0x8d/0x580
> > [  660.524849]  ? tls_push_sg+0x74/0x130 [tls]
> > [  660.525695]  ? tls_push_record+0x24a/0x390 [tls]
> > [  660.526545]  ? tls_write_space+0x6a/0x80 [tls]
> > [  660.527399]  ? do_tcp_sendpages+0x8d/0x580
> > [  660.528247]  ? tls_push_sg+0x74/0x130 [tls]
> > [  660.529099]  ? tls_push_record+0x24a/0x390 [tls]
> > [  660.529955]  ? tls_write_space+0x6a/0x80 [tls]
> > [  660.530797]  ? do_tcp_sendpages+0x8d/0x580
> > [  660.531643]  ? tls_push_sg+0x74/0x130 [tls]
> > [  660.532010] BUG: stack guard page was hit at 0000000027abda92 (stack is 00000000aadcb221..00000000a587b67b)
> > [  660.532535]  ? tls_push_record+0x24a/0x390 [tls]
> > [  660.534511]  ? tls_write_space+0x6a/0x80 [tls]
> > [  660.535506]  ? do_tcp_sendpages+0x8d/0x580
> > [  660.536500]  ? tls_push_sg+0x74/0x130 [tls]
> > [  660.537495]  ? tls_push_record+0x24a/0x390 [tls]
> > [  660.538493]  ? tls_write_space+0x6a/0x80 [tls]
> > [  660.539482]  ? do_tcp_sendpages+0x8d/0x580
> > [  660.540462]  ? tls_push_sg+0x74/0x130 [tls]
> > [  660.541447]  ? tls_push_record+0x24a/0x390 [tls]
> > [  660.542430]  ? tls_write_space+0x6a/0x80 [tls]
> > [  660.543411]  ? do_tcp_sendpages+0x8d/0x580
> > [  660.544395]  ? tls_push_sg+0x74/0x130 [tls]
> > [  660.545382]  ? tls_push_record+0x24a/0x390 [tls]
> > [  660.546365]  ? tls_write_space+0x6a/0x80 [tls]
> > [  660.547347]  ? do_tcp_sendpages+0x8d/0x580
> > [  660.548334]  ? tls_push_sg+0x74/0x130 [tls]
> > [  660.549318]  ? tls_push_record+0x24a/0x390 [tls]
> > [  660.550300]  ? tls_write_space+0x6a/0x80 [tls]
> > [  660.551284]  ? do_tcp_sendpages+0x8d/0x580
> > [  660.552267]  ? tls_push_sg+0x74/0x130 [tls]
> > [  660.553250]  ? tls_push_record+0x24a/0x390 [tls]
> > [  660.554205]  ? tls_write_space+0x6a/0x80 [tls]
> > [  660.555158]  ? do_tcp_sendpages+0x8d/0x580
> > [  660.556083]  ? tls_push_sg+0x74/0x130 [tls]
> > [  660.557009]  ? tls_push_record+0x24a/0x390 [tls]
> > [  660.557936]  ? tls_write_space+0x6a/0x80 [tls]
> > [  660.558862]  ? do_tcp_sendpages+0x8d/0x580
> > [  660.559786]  ? tls_push_sg+0x74/0x130 [tls]
> > [  660.560681]  ? tls_push_record+0x24a/0x390 [tls]
> > [  660.561547]  ? tls_write_space+0x6a/0x80 [tls]
> > [  660.562413]  ? do_tcp_sendpages+0x8d/0x580
> > [  660.563279]  ? tls_push_sg+0x74/0x130 [tls]
> > [  660.564143]  ? tls_push_record+0x24a/0x390 [tls]
> > [  660.564979]  ? tls_write_space+0x6a/0x80 [tls]
> > [  660.565783]  ? do_tcp_sendpages+0x8d/0x580
> > [  660.566587]  ? tls_push_sg+0x74/0x130 [tls]
> > [  660.567392]  ? tls_push_record+0x24a/0x390 [tls]
> > [  660.568197]  ? tls_write_space+0x6a/0x80 [tls]
> > [  660.569000]  ? do_tcp_sendpages+0x8d/0x580
> > [  660.569804]  ? tls_push_sg+0x74/0x130 [tls]
> > [  660.570609]  ? tls_push_record+0x24a/0x390 [tls]
> > [  660.571415]  ? tls_write_space+0x6a/0x80 [tls]
> > [  660.572218]  ? do_tcp_sendpages+0x8d/0x580
> > [  660.573023]  ? tls_push_sg+0x74/0x130 [tls]
> > [  660.573830]  ? tls_push_record+0x24a/0x390 [tls]
> > [  660.574634]  ? tls_write_space+0x6a/0x80 [tls]
> > [  660.575437]  ? do_tcp_sendpages+0x8d/0x580
> > [  660.576210]  ? tls_push_sg+0x74/0x130 [tls]
> > [  660.576953]  ? tls_push_record+0x24a/0x390 [tls]
> > [  660.577698]  ? tls_write_space+0x6a/0x80 [tls]
> > [  660.578441]  ? do_tcp_sendpages+0x8d/0x580
> > [  660.579183]  ? tls_push_sg+0x74/0x130 [tls]
> > [  660.579929]  ? tls_push_record+0x24a/0x390 [tls]
> > [  660.580673]  ? tls_write_space+0x6a/0x80 [tls]
> > [  660.581417]  ? do_tcp_sendpages+0x8d/0x580
> > [  660.582159]  ? tls_push_sg+0x74/0x130 [tls]
> > [  660.582904]  ? tls_push_record+0x24a/0x390 [tls]
> > [  660.583649]  ? tls_write_space+0x6a/0x80 [tls]
> > [  660.584394]  ? do_tcp_sendpages+0x8d/0x580
> > [  660.585137]  ? tls_push_sg+0x74/0x130 [tls]
> > [  660.585882]  ? tls_push_record+0x24a/0x390 [tls]
> > [  660.586628]  ? tls_write_space+0x6a/0x80 [tls]
> > [  660.587372]  ? do_tcp_sendpages+0x8d/0x580
> > [  660.588115]  ? tls_push_sg+0x74/0x130 [tls]
> > [  660.588861]  ? tls_push_record+0x24a/0x390 [tls]
> > [  660.589605]  ? tls_write_space+0x6a/0x80 [tls]
> > [  660.590350]  ? do_tcp_sendpages+0x8d/0x580
> > [  660.591093]  ? tls_push_sg+0x74/0x130 [tls]
> > [  660.591840]  ? tls_push_record+0x24a/0x390 [tls]
> > [  660.592585]  ? tls_write_space+0x6a/0x80 [tls]
> > [  660.593328]  ? do_tcp_sendpages+0x8d/0x580
> > [  660.594072]  ? tls_push_sg+0x74/0x130 [tls]
> > [  660.594816]  ? tls_push_record+0x24a/0x390 [tls]
> > [  660.595563]  ? tls_write_space+0x6a/0x80 [tls]
> > [  660.596308]  ? do_tcp_sendpages+0x8d/0x580
> > [  660.597050]  ? tls_push_sg+0x74/0x130 [tls]
> > [  660.597794]  ? tls_push_record+0x24a/0x390 [tls]
> > [  660.598540]  ? tls_write_space+0x6a/0x80 [tls]
> > [  660.599281]  ? do_tcp_sendpages+0x8d/0x580
> > [  660.600025]  ? tls_push_sg+0x74/0x130 [tls]
> > [  660.600772]  ? tls_push_record+0x24a/0x390 [tls]
> > [  660.601517]  ? tls_write_space+0x6a/0x80 [tls]
> > [  660.602260]  ? do_tcp_sendpages+0x8d/0x580
> > [  660.603003]  ? tls_push_sg+0x74/0x130 [tls]
> > [  660.603750]  ? tls_push_record+0x24a/0x390 [tls]
> > [  660.604495]  ? tls_write_space+0x6a/0x80 [tls]
> > [  660.605238]  ? do_tcp_sendpages+0x8d/0x580
> > [  660.605981]  ? tls_push_sg+0x74/0x130 [tls]
> > [  660.606726]  ? tls_push_record+0x24a/0x390 [tls]
> > [  660.607474]  ? tls_sw_sendpage+0x14a/0x390 [tls]
> > [  660.608214]  ? direct_splice_actor+0x40/0x40
> > [  660.608951]  ? inet_sendpage+0x40/0xf0
> > [  660.609689]  ? kernel_sendpage+0x1a/0x30
> > [  660.610426]  ? sock_sendpage+0x20/0x30
> > [  660.611161]  ? pipe_to_sendpage+0x5f/0x70
> > [  660.611898]  ? __splice_from_pipe+0x80/0x180
> > [  660.612637]  ? generic_file_splice_read+0x100/0x150
> > [  660.613382]  ? direct_splice_actor+0x40/0x40
> > [  660.614128]  ? splice_from_pipe+0x4f/0x70
> > [  660.614871]  ? direct_splice_actor+0x35/0x40
> > [  660.615619]  ? splice_direct_to_actor+0xce/0x1d0
> > [  660.616368]  ? generic_pipe_buf_nosteal+0x10/0x10
> > [  660.617122]  ? do_splice_direct+0x8c/0xa0
> > [  660.617876]  ? do_sendfile+0x19d/0x380
> > [  660.618626]  ? SyS_sendfile64+0x4c/0x90
> > [  660.619376]  ? do_syscall_64+0x7a/0x390
> > [  660.620121]  ? do_page_fault+0x31/0x130
> > [  660.620863]  ? entry_SYSCALL_64_after_hwframe+0x3d/0xa2
> > [  660.621618] Code: 24 08 4c 89 e9 48 89 de e8 d7 66 63 00 4d 8b 17 5a 4d 85 d2 75 d7 e9 e9 fe ff ff 48 89 c5 e9 e1 fe ff ff 90 0f 1f 44 00 00 41 57 <41> 56 41 55 41 54 55 53 48 81 ff 00 20 00 00 0f 87 a4 01 00 00
> > [  660.623316] RIP: __kmalloc+0x7/0x1f0 RSP: ffffabafc27b8000
> > [  660.624168] ---[ end trace 7f6206177c0cc58f ]---
> 


* Re: kTLS in combination with mlx4 is very unstable
  2018-04-24 17:01 ` Dave Watson
@ 2018-05-01 16:09   ` Dave Watson
  2018-05-01 17:41     ` Andre Tomt
  0 siblings, 1 reply; 5+ messages in thread
From: Dave Watson @ 2018-05-01 16:09 UTC (permalink / raw)
  To: Andre Tomt; +Cc: netdev, borisp, Aviad Yehezkel

Hi Andre, 

On 04/24/18 10:01 AM, Dave Watson wrote:
> On 04/22/18 11:21 PM, Andre Tomt wrote:
> > The kernel seems to get increasingly unstable as I load it up with client
> > connections. At about 9Gbps and 700 connections, it is okay at least for a
> > while - it might run fine for say 45 minutes. Once it gets to 20 - 30Gbps,
> > the kernel will usually start spewing OOPSes within minutes and the traffic
> > drops.
> > 
> > Some bad interaction between mlx4 and kTLS?

I tried to repro, but wasn't able to - of course I don't have an mlx4
test setup.  If I manually add a tls_write_space call after
do_tcp_sendpages, I get a similar stack though.

Something like the following should work; can you test it?  Thanks

diff --git a/include/net/tls.h b/include/net/tls.h
index 8c56809..ee78f33 100644
--- a/include/net/tls.h
+++ b/include/net/tls.h
@@ -187,6 +187,7 @@ struct tls_context {
        struct scatterlist *partially_sent_record;
        u16 partially_sent_offset;
        unsigned long flags;
+       bool in_tcp_sendpages;
 
        u16 pending_open_record_frags;
        int (*push_pending_record)(struct sock *sk, int flags);
diff --git a/net/tls/tls_main.c b/net/tls/tls_main.c
index 3aafb87..095af65 100644
--- a/net/tls/tls_main.c
+++ b/net/tls/tls_main.c
@@ -114,6 +114,7 @@ int tls_push_sg(struct sock *sk,
        size = sg->length - offset;
        offset += sg->offset;
 
+       ctx->in_tcp_sendpages = 1;
        while (1) {
                if (sg_is_last(sg))
                        sendpage_flags = flags;
@@ -148,6 +149,8 @@ int tls_push_sg(struct sock *sk,
        }
 
        clear_bit(TLS_PENDING_CLOSED_RECORD, &ctx->flags);
+       ctx->in_tcp_sendpages = 0;
+       ctx->sk_write_space(sk);
 
        return 0;
 }
@@ -217,6 +220,9 @@ static void tls_write_space(struct sock *sk)
 {
        struct tls_context *ctx = tls_get_ctx(sk);
 
+       if (ctx->in_tcp_sendpages)
+               return;
+
        if (!sk->sk_write_pending && tls_is_pending_closed_record(ctx)) {
                gfp_t sk_allocation = sk->sk_allocation;
                int rc;
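
The idea behind the patch can be sketched outside the kernel. The
following is a minimal user-space illustration, not the kernel code: all
demo_* names and the struct layout are hypothetical, and it assumes the
lower send routine may invoke the write-space callback synchronously,
which is what the repeating tls_write_space -> tls_push_record ->
tls_push_sg -> do_tcp_sendpages frames in the trace suggest. The in_send
flag plays the role of in_tcp_sendpages.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct tls_context; not the kernel layout. */
struct demo_ctx {
    bool use_guard;  /* true: behave like the patched code */
    bool in_send;    /* plays the role of in_tcp_sendpages */
    int pending;     /* records still waiting to be pushed */
    int depth;       /* current nesting of demo_push() */
    int max_depth;   /* deepest nesting observed */
};

static void demo_write_space(struct demo_ctx *ctx);

/* Stand-in for do_tcp_sendpages(): assumed to signal write space
 * synchronously, before returning to its caller. */
static void demo_lower_send(struct demo_ctx *ctx)
{
    demo_write_space(ctx);
}

/* Stand-in for tls_push_sg(): pushes one record to the lower layer. */
static void demo_push(struct demo_ctx *ctx)
{
    assert(ctx->pending > 0);

    ctx->depth++;
    if (ctx->depth > ctx->max_depth)
        ctx->max_depth = ctx->depth;

    ctx->pending--;
    ctx->in_send = true;
    demo_lower_send(ctx);  /* may call back into demo_write_space() */
    ctx->in_send = false;

    ctx->depth--;
}

/* Stand-in for tls_write_space(): with the guard enabled it returns
 * early while a send is in progress instead of pushing again. */
static void demo_write_space(struct demo_ctx *ctx)
{
    if (ctx->use_guard && ctx->in_send)
        return;
    if (ctx->pending > 0)
        demo_push(ctx);
}

/* Drain `records` pending records and report the deepest nesting seen. */
int demo_run(bool use_guard, int records)
{
    struct demo_ctx ctx = { .use_guard = use_guard, .pending = records };

    /* Each iteration models one independent write-space notification. */
    while (ctx.pending > 0)
        demo_write_space(&ctx);

    return ctx.max_depth;
}
```

Under these assumptions, demo_run(false, 50) nests 50 calls deep (one
level per pending record, which is how a fixed-size stack gets
exhausted), while demo_run(true, 50) never nests deeper than 1.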


* Re: kTLS in combination with mlx4 is very unstable
  2018-05-01 16:09   ` Dave Watson
@ 2018-05-01 17:41     ` Andre Tomt
  0 siblings, 0 replies; 5+ messages in thread
From: Andre Tomt @ 2018-05-01 17:41 UTC (permalink / raw)
  To: Dave Watson; +Cc: netdev, borisp, Aviad Yehezkel

On 01. mai 2018 18:09, Dave Watson wrote:
> On 04/24/18 10:01 AM, Dave Watson wrote:
>> On 04/22/18 11:21 PM, Andre Tomt wrote:
>>> The kernel seems to get increasingly unstable as I load it up with client
>>> connections. At about 9Gbps and 700 connections, it is okay at least for a
>>> while - it might run fine for say 45 minutes. Once it gets to 20 - 30Gbps,
>>> the kernel will usually start spewing OOPSes within minutes and the traffic
>>> drops.
>>>
>>> Some bad interaction between mlx4 and kTLS?
> I tried to repro, but wasn't able to - of course I don't have an mlx4
> test setup.  If I manually add a tls_write_space call after
> do_tcp_sendpages, I get a similar stack though.
> 
> Something like the following should work, can you test?  Thanks

Thank you!

This does indeed seem to have fixed this problem. It has been sustaining 
~36Gbps and about 3000 clients for about an hour now without any crashes.

Tested on 4.17-rc3 git snapshot as of a few hours ago.

As for performance, I am very happy with kTLS. This is some very cool 
stuff. I dig it. I'm getting a bit over 10Gbps per 2.5GHz Broadwell-DE 
core on this low-power quad-core system. Granted, network conditions are 
nearly ideal and all the data is hot in the page cache, but still. I'm 
going to have to add another port. ;-)
