netdev.vger.kernel.org archive mirror
* [PATCH net 0/1] net/tls(TLS_SW): double free in tls_tx_records
       [not found] ` <CAOrEdsmxstWoBz2AotrTx_OBFN_jycqnSqtsvLxuCLGtKKi_dA@mail.gmail.com>
@ 2019-09-09 17:46   ` Pooja Trivedi
  2019-09-09 18:15     ` [PATCH net 1/1] net/tls(TLS_SW): Fix list_del double free caused by a race condition " Pooja Trivedi
  0 siblings, 1 reply; 14+ messages in thread
From: Pooja Trivedi @ 2019-09-09 17:46 UTC (permalink / raw)
  To: netdev

[-- Attachment #1: Type: text/plain, Size: 6590 bytes --]

The TLS module crashes while running SSL record encryption via kTLS
send/sendfile with a crypto accelerator (Nitrox).

Following are the preconditions and steps to reproduce the issue:

Preconditions:
1) Kernel 5.3-rc4 installed
2) Nitrox5 card plugged in (crypto accelerator)

Steps to reproduce the issue:
1) Installed n5pf.ko (drivers/crypto/cavium/nitrox)
2) Installed tls.ko if not already installed by default (net/tls)
3) Obtained the uperf tool from GitHub
   3.1) Modified uperf to use the tls module via setsockopt() (see the
        sketch after this list).
   3.2) Modified uperf to support sendfile over SSL.
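
A minimal sketch of the kind of kTLS enablement step 3.1 refers to. This is
not the actual uperf change; the function name is made up, and in a real
client the key material and record sequence come from the TLS handshake:

#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <linux/tls.h>

#ifndef SOL_TLS
#define SOL_TLS 282
#endif
#ifndef TCP_ULP
#define TCP_ULP 31
#endif

static int enable_ktls_tx(int sock)
{
	struct tls12_crypto_info_aes_gcm_128 ci;

	/* Attach the "tls" upper layer protocol to the connected TCP socket */
	if (setsockopt(sock, SOL_TCP, TCP_ULP, "tls", sizeof("tls")))
		return -1;

	memset(&ci, 0, sizeof(ci));
	ci.info.version = TLS_1_2_VERSION;
	ci.info.cipher_type = TLS_CIPHER_AES_GCM_128;
	/* ci.key, ci.iv, ci.salt and ci.rec_seq must be filled in from the
	 * negotiated TLS session (left zeroed here as placeholders) */

	/* Hand the TX crypto state to the kernel; subsequent send()/sendfile()
	 * calls go through net/tls (TLS_SW) record encryption */
	return setsockopt(sock, SOL_TLS, TLS_TX, &ci, sizeof(ci));
}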

Test:
1) Ran uperf with 4 threads
2) Each thread sends data using sendfile over the SSL connection.

A few seconds into the test, the kernel crashes because of record list
corruption:

[  270.888952] ------------[ cut here ]------------
[  270.890450] list_del corruption, ffff91cc3753a800->prev is
LIST_POISON2 (dead000000000122)
[  270.891194] WARNING: CPU: 1 PID: 7387 at lib/list_debug.c:50
__list_del_entry_valid+0x62/0x90
[  270.892037] Modules linked in: n5pf(OE) netconsole tls(OE) bonding
intel_rapl_msr intel_rapl_common sb_edac x86_pkg_temp_thermal
intel_powerclamp coretemp kvm_intel kvm iTCO_wdt iTCO_vendor_support
irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel
aesni_intel crypto_simd mei_me cryptd glue_helper ipmi_si sg mei
lpc_ich pcspkr joydev ioatdma i2c_i801 ipmi_devintf ipmi_msghandler
wmi ip_tables xfs libcrc32c sd_mod mgag200 drm_vram_helper ttm
drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm isci
libsas ahci scsi_transport_sas libahci crc32c_intel serio_raw igb
libata ptp pps_core dca i2c_algo_bit dm_mirror dm_region_hash dm_log
dm_mod [last unloaded: nitrox_drv]
[  270.896836] CPU: 1 PID: 7387 Comm: uperf Kdump: loaded Tainted: G
        OE     5.3.0-rc4 #1
[  270.897711] Hardware name: Supermicro SYS-1027R-N3RF/X9DRW, BIOS
3.0c 03/24/2014
[  270.898597] RIP: 0010:__list_del_entry_valid+0x62/0x90
[  270.899478] Code: 00 00 00 c3 48 89 fe 48 89 c2 48 c7 c7 e0 f9 ee
8d e8 b2 cf c8 ff 0f 0b 31 c0 c3 48 89 fe 48 c7 c7 18 fa ee 8d e8 9e
cf c8 ff <0f> 0b 31 c0 c3 48 89 f2 48 89 fe 48 c7 c7 50 fa ee 8d e8 87
cf c8
[  270.901321] RSP: 0018:ffffb6ea86eb7c20 EFLAGS: 00010282
[  270.902240] RAX: 0000000000000000 RBX: ffff91cc3753c000 RCX: 0000000000000000
[  270.903157] RDX: ffff91bc3f867080 RSI: ffff91bc3f857738 RDI: ffff91bc3f857738
[  270.904074] RBP: ffff91bc36020940 R08: 0000000000000560 R09: 0000000000000000
[  270.904988] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[  270.905902] R13: ffff91cc3753a800 R14: ffff91cc37cc6400 R15: ffff91cc3753a800
[  270.906809] FS:  00007f454a88d700(0000) GS:ffff91bc3f840000(0000)
knlGS:0000000000000000
[  270.907715] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  270.908606] CR2: 00007f453c00292c CR3: 000000103554e003 CR4: 00000000001606e0
[  270.909490] Call Trace:
[  270.910373]  tls_tx_records+0x138/0x1c0 [tls]
[  270.911262]  tls_sw_sendpage+0x3e0/0x420 [tls]
[  270.912154]  inet_sendpage+0x52/0x90
[  270.913045]  ? direct_splice_actor+0x40/0x40
[  270.913941]  kernel_sendpage+0x1a/0x30
[  270.914831]  sock_sendpage+0x20/0x30
[  270.915714]  pipe_to_sendpage+0x62/0x90
[  270.916592]  __splice_from_pipe+0x80/0x180
[  270.917461]  ? direct_splice_actor+0x40/0x40
[  270.918334]  splice_from_pipe+0x5d/0x90
[  270.919208]  direct_splice_actor+0x35/0x40
[  270.920086]  splice_direct_to_actor+0x103/0x230
[  270.920966]  ? generic_pipe_buf_nosteal+0x10/0x10
[  270.921850]  do_splice_direct+0x9a/0xd0
[  270.922733]  do_sendfile+0x1c9/0x3d0
[  270.923612]  __x64_sys_sendfile64+0x5c/0xc0

Observations:
1) This issue is observed after applying commit a42055e8d2c3 ("net/tls: Add
support for async encryption of records for performance")
2) The 5.2.2 kernel exhibits the same issue

Attached is the complete crash log.

Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=204669

linux-crypto original post:
https://marc.info/?l=linux-crypto-vger&m=156700690108854&w=2


After adding custom profiling around lock_sock/release_sock, and after
capturing backtraces at and around the time of the crash/race condition,
it appears that the socket lock is not a reliable way to synchronize write
access to tls_tx_records, especially since the socket lock can be released
under TCP memory pressure.
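
For reference, a rough sketch of the kind of ad-hoc instrumentation meant
here (not the exact code that was used; the helper name and the 1 ms
threshold are illustrative). It would live in net/tls/tls_sw.c and wrap
the lock_sock() call sites of interest:

static inline void tls_debug_lock_sock(struct sock *sk, const char *where)
{
	ktime_t t0 = ktime_get();

	lock_sock(sk);
	/* Flag call sites that had to wait noticeably long for the socket
	 * lock and capture the call stack asking for it */
	if (ktime_to_ns(ktime_sub(ktime_get(), t0)) > 1000 * 1000) {
		pr_info("tls: %s waited >1ms for the socket lock\n", where);
		dump_stack();
	}
}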

One potential way for the race condition to appear:

When under tcp memory pressure, Thread 1 takes the following code path:
do_sendfile ---> ... ---> .... ---> tls_sw_sendpage --->
tls_sw_do_sendpage ---> tls_tx_records ---> tls_push_sg --->
do_tcp_sendpages ---> sk_stream_wait_memory ---> sk_wait_event

sk_wait_event releases the socket lock and sleeps waiting for memory:

#define sk_wait_event(__sk, __timeo, __condition, __wait)		\
	({	int __rc;						\
		release_sock(__sk);					\
		__rc = __condition;					\
		if (!__rc) {						\
			*(__timeo) = wait_woken(__wait,			\
						TASK_INTERRUPTIBLE,	\
						*(__timeo));		\
		}							\
		sched_annotate_sleep();					\
		lock_sock(__sk);					\
		__rc = __condition;					\
		__rc;							\
	})

Thread 2 code path:
tx_work_handler ---> tls_tx_records

Thread 2 is able to obtain the socket lock and go through the
transmission of the ctx->tx_list, deleting the sent ones (as in the
for loop below).

int tls_tx_records(struct sock *sk, int flags)
{
	....
	....
	list_for_each_entry_safe(rec, tmp, &ctx->tx_list, list) {
		if (READ_ONCE(rec->tx_ready)) {
			if (flags == -1)
				tx_flags = rec->tx_flags;
			else
				tx_flags = flags;

			msg_en = &rec->msg_encrypted;
			rc = tls_push_sg(sk, tls_ctx,
					 &msg_en->sg.data[msg_en->sg.curr],
					 0, tx_flags);
			if (rc)
				goto tx_err;

			list_del(&rec->list); // **** crash location ****
			sk_msg_free(sk, &rec->msg_plaintext);
			kfree(rec);
		} else {
			break;
		}
	}
	....
	....
}

When Thread 1 wakes up from the tls_push_sg call and attempts list_del on
the previously grabbed record, which Thread 2 has already sent and deleted,
it causes the crash.


To fix this race, a flag or bool inside ctx can be used to synchronize
access to tls_tx_records.

[-- Attachment #2: tls_crash_log_5_3_rc4.txt --]
[-- Type: text/plain, Size: 11274 bytes --]

[  270.888952] ------------[ cut here ]------------
[  270.890450] list_del corruption, ffff91cc3753a800->prev is LIST_POISON2 (dead000000000122)
[  270.891194] WARNING: CPU: 1 PID: 7387 at lib/list_debug.c:50 __list_del_entry_valid+0x62/0x90
[  270.892037] Modules linked in: n5pf(OE) netconsole tls(OE) bonding intel_rapl_msr intel_rapl_common sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm iTCO_wdt iTCO_vendor_support irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel crypto_simd mei_me cryptd glue_helper ipmi_si sg mei lpc_ich pcspkr joydev ioatdma i2c_i801 ipmi_devintf ipmi_msghandler wmi ip_tables xfs libcrc32c sd_mod mgag200 drm_vram_helper ttm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm isci libsas ahci scsi_transport_sas libahci crc32c_intel serio_raw igb libata ptp pps_core dca i2c_algo_bit dm_mirror dm_region_hash dm_log dm_mod [last unloaded: nitrox_drv]
[  270.896836] CPU: 1 PID: 7387 Comm: uperf Kdump: loaded Tainted: G           OE     5.3.0-rc4 #1
[  270.897711] Hardware name: Supermicro SYS-1027R-N3RF/X9DRW, BIOS 3.0c 03/24/2014
[  270.898597] RIP: 0010:__list_del_entry_valid+0x62/0x90
[  270.899478] Code: 00 00 00 c3 48 89 fe 48 89 c2 48 c7 c7 e0 f9 ee 8d e8 b2 cf c8 ff 0f 0b 31 c0 c3 48 89 fe 48 c7 c7 18 fa ee 8d e8 9e cf c8 ff <0f> 0b 31 c0 c3 48 89 f2 48 89 fe 48 c7 c7 50 fa ee 8d e8 87 cf c8
[  270.901321] RSP: 0018:ffffb6ea86eb7c20 EFLAGS: 00010282
[  270.902240] RAX: 0000000000000000 RBX: ffff91cc3753c000 RCX: 0000000000000000
[  270.903157] RDX: ffff91bc3f867080 RSI: ffff91bc3f857738 RDI: ffff91bc3f857738
[  270.904074] RBP: ffff91bc36020940 R08: 0000000000000560 R09: 0000000000000000
[  270.904988] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[  270.905902] R13: ffff91cc3753a800 R14: ffff91cc37cc6400 R15: ffff91cc3753a800
[  270.906809] FS:  00007f454a88d700(0000) GS:ffff91bc3f840000(0000) knlGS:0000000000000000
[  270.907715] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  270.908606] CR2: 00007f453c00292c CR3: 000000103554e003 CR4: 00000000001606e0
[  270.909490] Call Trace:
[  270.910373]  tls_tx_records+0x138/0x1c0 [tls]
[  270.911262]  tls_sw_sendpage+0x3e0/0x420 [tls]
[  270.912154]  inet_sendpage+0x52/0x90
[  270.913045]  ? direct_splice_actor+0x40/0x40
[  270.913941]  kernel_sendpage+0x1a/0x30
[  270.914831]  sock_sendpage+0x20/0x30
[  270.915714]  pipe_to_sendpage+0x62/0x90
[  270.916592]  __splice_from_pipe+0x80/0x180
[  270.917461]  ? direct_splice_actor+0x40/0x40
[  270.918334]  splice_from_pipe+0x5d/0x90
[  270.919208]  direct_splice_actor+0x35/0x40
[  270.920086]  splice_direct_to_actor+0x103/0x230
[  270.920966]  ? generic_pipe_buf_nosteal+0x10/0x10
[  270.921850]  do_splice_direct+0x9a/0xd0
[  270.922733]  do_sendfile+0x1c9/0x3d0
[  270.923612]  __x64_sys_sendfile64+0x5c/0xc0
[  270.924485]  do_syscall_64+0x5b/0x1d0
[  270.925344]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  270.926188] RIP: 0033:0x7f454c9bf6ba
[  270.927010] Code: 31 c0 5b c3 0f 1f 40 00 48 89 da 4c 89 ce 44 89 c7 5b e9 09 fe ff ff 66 0f 1f 84 00 00 00 00 00 49 89 ca b8 28 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 86 27 2d 00 f7 d8 64 89 01 48
[  270.928710] RSP: 002b:00007f454a88cd08 EFLAGS: 00000246 ORIG_RAX: 0000000000000028
[  270.929573] RAX: ffffffffffffffda RBX: 0000000002025460 RCX: 00007f454c9bf6ba
[  270.930433] RDX: 00007f454a88cd18 RSI: 0000000000000005 RDI: 000000000000000a
[  270.931294] RBP: 000000000000000a R08: 00007f454cc920ec R09: 00007f454cc92140
[  270.932146] R10: 0000000000001000 R11: 0000000000000246 R12: 0000000000000000
[  270.932988] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000002030788
[  270.933809] ---[ end trace 4ee70802e847a9de ]---
[  270.934624] ------------[ cut here ]------------
[  270.935957] list_del corruption, ffff91cc3753c000->prev is LIST_POISON2 (dead000000000122)
[  270.937381] WARNING: CPU: 1 PID: 7387 at lib/list_debug.c:50 __list_del_entry_valid+0x62/0x90
[  270.938384] Modules linked in: n5pf(OE) netconsole tls(OE) bonding intel_rapl_msr intel_rapl_common sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm iTCO_wdt iTCO_vendor_support irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel crypto_simd mei_me cryptd glue_helper ipmi_si sg mei lpc_ich pcspkr joydev ioatdma i2c_i801 ipmi_devintf ipmi_msghandler wmi ip_tables xfs libcrc32c sd_mod mgag200 drm_vram_helper ttm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm isci libsas ahci scsi_transport_sas libahci crc32c_intel serio_raw igb libata ptp pps_core dca i2c_algo_bit dm_mirror dm_region_hash dm_log dm_mod [last unloaded: nitrox_drv]
[  270.943119] CPU: 1 PID: 7387 Comm: uperf Kdump: loaded Tainted: G        W  OE     5.3.0-rc4 #1
[  270.943920] Hardware name: Supermicro SYS-1027R-N3RF/X9DRW, BIOS 3.0c 03/24/2014
[  270.944725] RIP: 0010:__list_del_entry_valid+0x62/0x90
[  270.945522] Code: 00 00 00 c3 48 89 fe 48 89 c2 48 c7 c7 e0 f9 ee 8d e8 b2 cf c8 ff 0f 0b 31 c0 c3 48 89 fe 48 c7 c7 18 fa ee 8d e8 9e cf c8 ff <0f> 0b 31 c0 c3 48 89 f2 48 89 fe 48 c7 c7 50 fa ee 8d e8 87 cf c8
[  270.947193] RSP: 0018:ffffb6ea86eb7c20 EFLAGS: 00010282
[  270.948037] RAX: 0000000000000000 RBX: ffff91cc3753a800 RCX: 0000000000000000
[  270.948882] RDX: ffff91bc3f867080 RSI: ffff91bc3f857738 RDI: ffff91bc3f857738
[  270.949723] RBP: ffff91bc36020940 R08: 000000000000058d R09: 0000000000000000
[  270.950562] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[  270.951387] R13: ffff91cc3753c000 R14: ffff91cc37cc6400 R15: ffff91cc3753c000
[  270.952220] FS:  00007f454a88d700(0000) GS:ffff91bc3f840000(0000) knlGS:0000000000000000
[  270.953065] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  270.953914] CR2: 00007f453c00292c CR3: 000000103554e003 CR4: 00000000001606e0
[  270.954777] Call Trace:
[  270.955636]  tls_tx_records+0x138/0x1c0 [tls]
[  270.956499]  tls_sw_sendpage+0x3e0/0x420 [tls]
[  270.957366]  inet_sendpage+0x52/0x90
[  270.958232]  ? direct_splice_actor+0x40/0x40
[  270.959101]  kernel_sendpage+0x1a/0x30
[  270.959967]  sock_sendpage+0x20/0x30
[  270.960830]  pipe_to_sendpage+0x62/0x90
[  270.961693]  __splice_from_pipe+0x80/0x180
[  270.962556]  ? direct_splice_actor+0x40/0x40
[  270.963413]  splice_from_pipe+0x5d/0x90
[  270.964274]  direct_splice_actor+0x35/0x40
[  270.965136]  splice_direct_to_actor+0x103/0x230
[  270.966001]  ? generic_pipe_buf_nosteal+0x10/0x10
[  270.966867]  do_splice_direct+0x9a/0xd0
[  270.967727]  do_sendfile+0x1c9/0x3d0
[  270.968565]  __x64_sys_sendfile64+0x5c/0xc0
[  270.969380]  do_syscall_64+0x5b/0x1d0
[  270.970185]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  270.970985] RIP: 0033:0x7f454c9bf6ba
[  270.971778] Code: 31 c0 5b c3 0f 1f 40 00 48 89 da 4c 89 ce 44 89 c7 5b e9 09 fe ff ff 66 0f 1f 84 00 00 00 00 00 49 89 ca b8 28 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 86 27 2d 00 f7 d8 64 89 01 48
[  270.973438] RSP: 002b:00007f454a88cd08 EFLAGS: 00000246 ORIG_RAX: 0000000000000028
[  270.974293] RAX: ffffffffffffffda RBX: 0000000002025460 RCX: 00007f454c9bf6ba
[  270.975148] RDX: 00007f454a88cd18 RSI: 0000000000000005 RDI: 000000000000000a
[  270.976009] RBP: 000000000000000a R08: 00007f454cc920ec R09: 00007f454cc92140
[  270.976860] R10: 0000000000001000 R11: 0000000000000246 R12: 0000000000000000
[  270.977699] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000002030788
[  270.978514] ---[ end trace 4ee70802e847a9df ]---
[  270.979330] BUG: unable to handle page fault for address: ffff91d383e05784
[  270.980749] #PF: supervisor read access in kernel mode
[  270.981820] #PF: error_code(0x0000) - not-present page
[  270.982571] PGD 1032e01067 P4D 1032e01067 PUD 0 
[  270.983303] Oops: 0000 [#1] SMP PTI
[  270.984020] CPU: 1 PID: 7387 Comm: uperf Kdump: loaded Tainted: G        W  OE     5.3.0-rc4 #1
[  270.984748] Hardware name: Supermicro SYS-1027R-N3RF/X9DRW, BIOS 3.0c 03/24/2014
[  270.985473] RIP: 0010:tls_push_sg+0x27/0x190 [tls]
[  270.986184] Code: 00 00 00 0f 1f 44 00 00 41 57 44 0f b7 d1 41 56 41 55 49 89 d5 41 54 45 89 c4 41 81 cc 00 00 02 00 55 48 89 fd 53 48 83 ec 10 <44> 8b 4a 0c 48 89 74 24 08 44 89 44 24 04 45 89 ce 45 29 d6 44 03
[  270.987631] RSP: 0018:ffffb6ea86eb7be0 EFLAGS: 00010282
[  270.988350] RAX: 000000074c8caca0 RBX: ffff91cc3753b800 RCX: 0000000000000000
[  270.989081] RDX: ffff91d383e05778 RSI: ffff91cc37cc6400 RDI: ffff91bc36020940
[  270.989813] RBP: ffff91bc36020940 R08: 0000000000000000 R09: 0000000000000000
[  270.990548] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000020000
[  270.991272] R13: ffff91d383e05778 R14: ffff91cc37cc6400 R15: ffff91cc3753a800
[  270.991999] FS:  00007f454a88d700(0000) GS:ffff91bc3f840000(0000) knlGS:0000000000000000
[  270.992727] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  270.993448] CR2: ffff91d383e05784 CR3: 000000103554e003 CR4: 00000000001606e0
[  270.994177] Call Trace:
[  270.994905]  tls_tx_records+0x128/0x1c0 [tls]
[  270.995638]  tls_sw_sendpage+0x3e0/0x420 [tls]
[  270.996365]  inet_sendpage+0x52/0x90
[  270.997095]  ? direct_splice_actor+0x40/0x40
[  270.997826]  kernel_sendpage+0x1a/0x30
[  270.998555]  sock_sendpage+0x20/0x30
[  270.999281]  pipe_to_sendpage+0x62/0x90
[  271.000010]  __splice_from_pipe+0x80/0x180
[  271.000740]  ? direct_splice_actor+0x40/0x40
[  271.001467]  splice_from_pipe+0x5d/0x90
[  271.002199]  direct_splice_actor+0x35/0x40
[  271.002930]  splice_direct_to_actor+0x103/0x230
[  271.003661]  ? generic_pipe_buf_nosteal+0x10/0x10
[  271.004387]  do_splice_direct+0x9a/0xd0
[  271.005118]  do_sendfile+0x1c9/0x3d0
[  271.005849]  __x64_sys_sendfile64+0x5c/0xc0
[  271.006579]  do_syscall_64+0x5b/0x1d0
[  271.007301]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  271.008031] RIP: 0033:0x7f454c9bf6ba
[  271.008737] Code: 31 c0 5b c3 0f 1f 40 00 48 89 da 4c 89 ce 44 89 c7 5b e9 09 fe ff ff 66 0f 1f 84 00 00 00 00 00 49 89 ca b8 28 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 86 27 2d 00 f7 d8 64 89 01 48
[  271.010187] RSP: 002b:00007f454a88cd08 EFLAGS: 00000246 ORIG_RAX: 0000000000000028
[  271.010915] RAX: ffffffffffffffda RBX: 0000000002025460 RCX: 00007f454c9bf6ba
[  271.011648] RDX: 00007f454a88cd18 RSI: 0000000000000005 RDI: 000000000000000a
[  271.012374] RBP: 000000000000000a R08: 00007f454cc920ec R09: 00007f454cc92140
[  271.013109] R10: 0000000000001000 R11: 0000000000000246 R12: 0000000000000000
[  271.013848] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000002030788
[  271.014577] Modules linked in: n5pf(OE) netconsole tls(OE) bonding intel_rapl_msr intel_rapl_common sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm iTCO_wdt iTCO_vendor_support irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel crypto_simd mei_me cryptd glue_helper ipmi_si sg mei lpc_ich pcspkr joydev ioatdma i2c_i801 ipmi_devintf ipmi_msghandler wmi ip_tables xfs libcrc32c sd_mod mgag200 drm_vram_helper ttm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm isci libsas ahci scsi_transport_sas libahci crc32c_intel serio_raw igb libata ptp pps_core dca i2c_algo_bit dm_mirror dm_region_hash dm_log dm_mod [last unloaded: nitrox_drv]
[  271.019344] CR2: ffff91d383e05784


^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH net 1/1] net/tls(TLS_SW): Fix list_del double free caused by a race condition in tls_tx_records
  2019-09-09 17:46   ` [PATCH net 0/1] net/tls(TLS_SW): double free in tls_tx_records Pooja Trivedi
@ 2019-09-09 18:15     ` Pooja Trivedi
  2019-09-17 21:13       ` [PATCH V2 " Pooja Trivedi
  0 siblings, 1 reply; 14+ messages in thread
From: Pooja Trivedi @ 2019-09-09 18:15 UTC (permalink / raw)
  To: netdev

    Enclosing tls_tx_records within a lock_sock/release_sock pair is not
    sufficient to ensure write synchronization, because the socket lock is
    released by sk_wait_event under memory pressure while it sleeps waiting
    for memory, allowing another writer into tls_tx_records. This causes a
    race condition with record deletion after transmission.

    To fix this bug, use a flag set in the tx_bitmask field of the TLS
    context to ensure a single writer in tls_tx_records at a time.

    The bug resulted in the following crash:


    [  270.888952] ------------[ cut here ]------------
    [  270.890450] list_del corruption, ffff91cc3753a800->prev is
    LIST_POISON2 (dead000000000122)
    [  270.891194] WARNING: CPU: 1 PID: 7387 at lib/list_debug.c:50
    __list_del_entry_valid+0x62/0x90
    [  270.892037] Modules linked in: n5pf(OE) netconsole tls(OE) bonding
    intel_rapl_msr intel_rapl_common sb_edac x86_pkg_temp_thermal
    intel_powerclamp coretemp kvm_intel kvm iTCO_wdt iTCO_vendor_support
    irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel
    aesni_intel crypto_simd mei_me cryptd glue_helper ipmi_si sg mei
    lpc_ich pcspkr joydev ioatdma i2c_i801 ipmi_devintf ipmi_msghandler
    wmi ip_tables xfs libcrc32c sd_mod mgag200 drm_vram_helper ttm
    drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm isci
    libsas ahci scsi_transport_sas libahci crc32c_intel serio_raw igb
    libata ptp pps_core dca i2c_algo_bit dm_mirror dm_region_hash dm_log
    dm_mod [last unloaded: nitrox_drv]
    [  270.896836] CPU: 1 PID: 7387 Comm: uperf Kdump: loaded Tainted: G
            OE     5.3.0-rc4 #1
    [  270.897711] Hardware name: Supermicro SYS-1027R-N3RF/X9DRW, BIOS
    3.0c 03/24/2014
    [  270.898597] RIP: 0010:__list_del_entry_valid+0x62/0x90
    [  270.899478] Code: 00 00 00 c3 48 89 fe 48 89 c2 48 c7 c7 e0 f9 ee
    8d e8 b2 cf c8 ff 0f 0b 31 c0 c3 48 89 fe 48 c7 c7 18 fa ee 8d e8 9e
    cf c8 ff <0f> 0b 31 c0 c3 48 89 f2 48 89 fe 48 c7 c7 50 fa ee 8d e8 87
    cf c8
    [  270.901321] RSP: 0018:ffffb6ea86eb7c20 EFLAGS: 00010282
    [  270.902240] RAX: 0000000000000000 RBX: ffff91cc3753c000 RCX:
0000000000000000
    [  270.903157] RDX: ffff91bc3f867080 RSI: ffff91bc3f857738 RDI:
ffff91bc3f857738
    [  270.904074] RBP: ffff91bc36020940 R08: 0000000000000560 R09:
0000000000000000
    [  270.904988] R10: 0000000000000000 R11: 0000000000000000 R12:
0000000000000000
    [  270.905902] R13: ffff91cc3753a800 R14: ffff91cc37cc6400 R15:
ffff91cc3753a800
    [  270.906809] FS:  00007f454a88d700(0000) GS:ffff91bc3f840000(0000)
    knlGS:0000000000000000
    [  270.907715] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [  270.908606] CR2: 00007f453c00292c CR3: 000000103554e003 CR4:
00000000001606e0
    [  270.909490] Call Trace:
    [  270.910373]  tls_tx_records+0x138/0x1c0 [tls]
    [  270.911262]  tls_sw_sendpage+0x3e0/0x420 [tls]
    [  270.912154]  inet_sendpage+0x52/0x90
    [  270.913045]  ? direct_splice_actor+0x40/0x40
    [  270.913941]  kernel_sendpage+0x1a/0x30
    [  270.914831]  sock_sendpage+0x20/0x30
    [  270.915714]  pipe_to_sendpage+0x62/0x90
    [  270.916592]  __splice_from_pipe+0x80/0x180
    [  270.917461]  ? direct_splice_actor+0x40/0x40
    [  270.918334]  splice_from_pipe+0x5d/0x90
    [  270.919208]  direct_splice_actor+0x35/0x40
    [  270.920086]  splice_direct_to_actor+0x103/0x230
    [  270.920966]  ? generic_pipe_buf_nosteal+0x10/0x10
    [  270.921850]  do_splice_direct+0x9a/0xd0
    [  270.922733]  do_sendfile+0x1c9/0x3d0
    [  270.923612]  __x64_sys_sendfile64+0x5c/0xc0

    Signed-off-by: Pooja Trivedi <pooja.trivedi@stackpath.com>

----------

diff --git a/include/net/tls.h b/include/net/tls.h
index 41b2d41..f346a54 100644
--- a/include/net/tls.h
+++ b/include/net/tls.h
@@ -161,6 +161,7 @@ struct tls_sw_context_tx {
 
 #define BIT_TX_SCHEDULED	0
 #define BIT_TX_CLOSING		1
+#define BIT_TX_IN_PROGRESS	2
 	unsigned long tx_bitmask;
 };
 
diff --git a/net/tls/tls_sw.c b/net/tls/tls_sw.c
index 91d21b0..6e99c61 100644
--- a/net/tls/tls_sw.c
+++ b/net/tls/tls_sw.c
@@ -367,6 +367,10 @@ int tls_tx_records(struct sock *sk, int flags)
 	struct sk_msg *msg_en;
 	int tx_flags, rc = 0;
 
+	/* If another writer is already in tls_tx_records, backoff and leave */
+	if (test_and_set_bit(BIT_TX_IN_PROGRESS, &ctx->tx_bitmask))
+		return 0;
+
 	if (tls_is_partially_sent_record(tls_ctx)) {
 		rec = list_first_entry(&ctx->tx_list,
 				       struct tls_rec, list);
@@ -415,6 +419,9 @@ int tls_tx_records(struct sock *sk, int flags)
 	if (rc < 0 && rc != -EAGAIN)
 		tls_err_abort(sk, EBADMSG);
 
+	/* clear the bit so another writer can get into tls_tx_records */
+	clear_bit(BIT_TX_IN_PROGRESS, &ctx->tx_bitmask);
+
 	return rc;
 }

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH V2 net 1/1] net/tls(TLS_SW): Fix list_del double free caused by a race condition in tls_tx_records
  2019-09-09 18:15     ` [PATCH net 1/1] net/tls(TLS_SW): Fix list_del double free caused by a race condition " Pooja Trivedi
@ 2019-09-17 21:13       ` Pooja Trivedi
  2019-09-17 21:24         ` Pooja Trivedi
  2019-09-18 21:25         ` Jakub Kicinski
  0 siblings, 2 replies; 14+ messages in thread
From: Pooja Trivedi @ 2019-09-17 21:13 UTC (permalink / raw)
  To: netdev
  Cc: davem, jakub.kicinski, daniel, john.fastabend, davejwatson,
	aviadye, borisp, Pooja Trivedi

From: Pooja Trivedi <pooja.trivedi@stackpath.com>

Enclosing tls_tx_records within a lock_sock/release_sock pair is not
sufficient to ensure write synchronization, because the socket lock is
released by sk_wait_event under memory pressure while it sleeps waiting
for memory, allowing another writer into tls_tx_records. This causes a
race condition with record deletion after transmission.

To fix this bug, use a flag set in the tx_bitmask field of the TLS context
to ensure a single writer in tls_tx_records at a time.

The bug resulted in the following crash:

[  270.888952] ------------[ cut here ]------------
[  270.890450] list_del corruption, ffff91cc3753a800->prev is
LIST_POISON2 (dead000000000122)
[  270.891194] WARNING: CPU: 1 PID: 7387 at lib/list_debug.c:50
__list_del_entry_valid+0x62/0x90
[  270.892037] Modules linked in: n5pf(OE) netconsole tls(OE) bonding
intel_rapl_msr intel_rapl_common sb_edac x86_pkg_temp_thermal
intel_powerclamp coretemp kvm_intel kvm iTCO_wdt iTCO_vendor_support
irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel
aesni_intel crypto_simd mei_me cryptd glue_helper ipmi_si sg mei
lpc_ich pcspkr joydev ioatdma i2c_i801 ipmi_devintf ipmi_msghandler
wmi ip_tables xfs libcrc32c sd_mod mgag200 drm_vram_helper ttm
drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm isci
libsas ahci scsi_transport_sas libahci crc32c_intel serio_raw igb
libata ptp pps_core dca i2c_algo_bit dm_mirror dm_region_hash dm_log
dm_mod [last unloaded: nitrox_drv]
[  270.896836] CPU: 1 PID: 7387 Comm: uperf Kdump: loaded Tainted: G
        OE     5.3.0-rc4 #1
[  270.897711] Hardware name: Supermicro SYS-1027R-N3RF/X9DRW, BIOS
3.0c 03/24/2014
[  270.898597] RIP: 0010:__list_del_entry_valid+0x62/0x90
[  270.899478] Code: 00 00 00 c3 48 89 fe 48 89 c2 48 c7 c7 e0 f9 ee
8d e8 b2 cf c8 ff 0f 0b 31 c0 c3 48 89 fe 48 c7 c7 18 fa ee 8d e8 9e
cf c8 ff <0f> 0b 31 c0 c3 48 89 f2 48 89 fe 48 c7 c7 50 fa ee 8d e8 87
cf c8
[  270.901321] RSP: 0018:ffffb6ea86eb7c20 EFLAGS: 00010282
[  270.902240] RAX: 0000000000000000 RBX: ffff91cc3753c000 RCX: 0000000000000000
[  270.903157] RDX: ffff91bc3f867080 RSI: ffff91bc3f857738 RDI: ffff91bc3f857738
[  270.904074] RBP: ffff91bc36020940 R08: 0000000000000560 R09: 0000000000000000
[  270.904988] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[  270.905902] R13: ffff91cc3753a800 R14: ffff91cc37cc6400 R15: ffff91cc3753a800
[  270.906809] FS:  00007f454a88d700(0000) GS:ffff91bc3f840000(0000)
knlGS:0000000000000000
[  270.907715] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  270.908606] CR2: 00007f453c00292c CR3: 000000103554e003 CR4: 00000000001606e0
[  270.909490] Call Trace:
[  270.910373]  tls_tx_records+0x138/0x1c0 [tls]
[  270.911262]  tls_sw_sendpage+0x3e0/0x420 [tls]
[  270.912154]  inet_sendpage+0x52/0x90
[  270.913045]  ? direct_splice_actor+0x40/0x40
[  270.913941]  kernel_sendpage+0x1a/0x30
[  270.914831]  sock_sendpage+0x20/0x30
[  270.915714]  pipe_to_sendpage+0x62/0x90
[  270.916592]  __splice_from_pipe+0x80/0x180
[  270.917461]  ? direct_splice_actor+0x40/0x40
[  270.918334]  splice_from_pipe+0x5d/0x90
[  270.919208]  direct_splice_actor+0x35/0x40
[  270.920086]  splice_direct_to_actor+0x103/0x230
[  270.920966]  ? generic_pipe_buf_nosteal+0x10/0x10
[  270.921850]  do_splice_direct+0x9a/0xd0
[  270.922733]  do_sendfile+0x1c9/0x3d0
[  270.923612]  __x64_sys_sendfile64+0x5c/0xc0

Signed-off-by: Pooja Trivedi <pooja.trivedi@stackpath.com>
---
 include/net/tls.h | 1 +
 net/tls/tls_sw.c  | 7 +++++++
 2 files changed, 8 insertions(+)

diff --git a/include/net/tls.h b/include/net/tls.h
index 41b2d41..f346a54 100644
--- a/include/net/tls.h
+++ b/include/net/tls.h
@@ -161,6 +161,7 @@ struct tls_sw_context_tx {
 
 #define BIT_TX_SCHEDULED	0
 #define BIT_TX_CLOSING		1
+#define BIT_TX_IN_PROGRESS	2
 	unsigned long tx_bitmask;
 };
 
diff --git a/net/tls/tls_sw.c b/net/tls/tls_sw.c
index 91d21b0..6e99c61 100644
--- a/net/tls/tls_sw.c
+++ b/net/tls/tls_sw.c
@@ -367,6 +367,10 @@ int tls_tx_records(struct sock *sk, int flags)
 	struct sk_msg *msg_en;
 	int tx_flags, rc = 0;
 
+	/* If another writer is already in tls_tx_records, backoff and leave */
+	if (test_and_set_bit(BIT_TX_IN_PROGRESS, &ctx->tx_bitmask))
+		return 0;
+
 	if (tls_is_partially_sent_record(tls_ctx)) {
 		rec = list_first_entry(&ctx->tx_list,
 				       struct tls_rec, list);
@@ -415,6 +419,9 @@ int tls_tx_records(struct sock *sk, int flags)
 	if (rc < 0 && rc != -EAGAIN)
 		tls_err_abort(sk, EBADMSG);
 
+	/* clear the bit so another writer can get into tls_tx_records */
+	clear_bit(BIT_TX_IN_PROGRESS, &ctx->tx_bitmask);
+
 	return rc;
 }
 
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH V2 net 1/1] net/tls(TLS_SW): Fix list_del double free caused by a race condition in tls_tx_records
  2019-09-17 21:13       ` [PATCH V2 " Pooja Trivedi
@ 2019-09-17 21:24         ` Pooja Trivedi
  2019-09-18 21:25         ` Jakub Kicinski
  1 sibling, 0 replies; 14+ messages in thread
From: Pooja Trivedi @ 2019-09-17 21:24 UTC (permalink / raw)
  To: netdev
  Cc: davem, jakub.kicinski, daniel, john.fastabend, davejwatson,
	aviadye, borisp, Pooja Trivedi

Hello,

I see that the 'State' of the initially submitted patch has changed to
"Changes Requested" but I have not seen a follow-up email regarding
what needs to be changed. I also realized that the tabs had not been
preserved in my initial submission, hence the resubmission.

Thanks,
Pooja.



On Tue, Sep 17, 2019 at 5:14 PM Pooja Trivedi <poojatrivedi@gmail.com> wrote:
>
> From: Pooja Trivedi <pooja.trivedi@stackpath.com>
>
> Enclosing tls_tx_records within lock_sock/release_sock pair to ensure
> write-synchronization is not sufficient because socket lock gets released
> under memory pressure situation by sk_wait_event while it sleeps waiting
> for memory, allowing another writer into tls_tx_records. This causes a
> race condition with record deletion post transmission.
>
> To fix this bug, use a flag set in tx_bitmask field of TLS context to
> ensure single writer in tls_tx_records at a time
>
> The bug resulted in the following crash:
>
> [  270.888952] ------------[ cut here ]------------
> [  270.890450] list_del corruption, ffff91cc3753a800->prev is
> LIST_POISON2 (dead000000000122)
> [  270.891194] WARNING: CPU: 1 PID: 7387 at lib/list_debug.c:50
> __list_del_entry_valid+0x62/0x90
> [  270.892037] Modules linked in: n5pf(OE) netconsole tls(OE) bonding
> intel_rapl_msr intel_rapl_common sb_edac x86_pkg_temp_thermal
> intel_powerclamp coretemp kvm_intel kvm iTCO_wdt iTCO_vendor_support
> irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel
> aesni_intel crypto_simd mei_me cryptd glue_helper ipmi_si sg mei
> lpc_ich pcspkr joydev ioatdma i2c_i801 ipmi_devintf ipmi_msghandler
> wmi ip_tables xfs libcrc32c sd_mod mgag200 drm_vram_helper ttm
> drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm isci
> libsas ahci scsi_transport_sas libahci crc32c_intel serio_raw igb
> libata ptp pps_core dca i2c_algo_bit dm_mirror dm_region_hash dm_log
> dm_mod [last unloaded: nitrox_drv]
> [  270.896836] CPU: 1 PID: 7387 Comm: uperf Kdump: loaded Tainted: G
>         OE     5.3.0-rc4 #1
> [  270.897711] Hardware name: Supermicro SYS-1027R-N3RF/X9DRW, BIOS
> 3.0c 03/24/2014
> [  270.898597] RIP: 0010:__list_del_entry_valid+0x62/0x90
> [  270.899478] Code: 00 00 00 c3 48 89 fe 48 89 c2 48 c7 c7 e0 f9 ee
> 8d e8 b2 cf c8 ff 0f 0b 31 c0 c3 48 89 fe 48 c7 c7 18 fa ee 8d e8 9e
> cf c8 ff <0f> 0b 31 c0 c3 48 89 f2 48 89 fe 48 c7 c7 50 fa ee 8d e8 87
> cf c8
> [  270.901321] RSP: 0018:ffffb6ea86eb7c20 EFLAGS: 00010282
> [  270.902240] RAX: 0000000000000000 RBX: ffff91cc3753c000 RCX: 0000000000000000
> [  270.903157] RDX: ffff91bc3f867080 RSI: ffff91bc3f857738 RDI: ffff91bc3f857738
> [  270.904074] RBP: ffff91bc36020940 R08: 0000000000000560 R09: 0000000000000000
> [  270.904988] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
> [  270.905902] R13: ffff91cc3753a800 R14: ffff91cc37cc6400 R15: ffff91cc3753a800
> [  270.906809] FS:  00007f454a88d700(0000) GS:ffff91bc3f840000(0000)
> knlGS:0000000000000000
> [  270.907715] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  270.908606] CR2: 00007f453c00292c CR3: 000000103554e003 CR4: 00000000001606e0
> [  270.909490] Call Trace:
> [  270.910373]  tls_tx_records+0x138/0x1c0 [tls]
> [  270.911262]  tls_sw_sendpage+0x3e0/0x420 [tls]
> [  270.912154]  inet_sendpage+0x52/0x90
> [  270.913045]  ? direct_splice_actor+0x40/0x40
> [  270.913941]  kernel_sendpage+0x1a/0x30
> [  270.914831]  sock_sendpage+0x20/0x30
> [  270.915714]  pipe_to_sendpage+0x62/0x90
> [  270.916592]  __splice_from_pipe+0x80/0x180
> [  270.917461]  ? direct_splice_actor+0x40/0x40
> [  270.918334]  splice_from_pipe+0x5d/0x90
> [  270.919208]  direct_splice_actor+0x35/0x40
> [  270.920086]  splice_direct_to_actor+0x103/0x230
> [  270.920966]  ? generic_pipe_buf_nosteal+0x10/0x10
> [  270.921850]  do_splice_direct+0x9a/0xd0
> [  270.922733]  do_sendfile+0x1c9/0x3d0
> [  270.923612]  __x64_sys_sendfile64+0x5c/0xc0
>
> Signed-off-by: Pooja Trivedi <pooja.trivedi@stackpath.com>
> ---
>  include/net/tls.h | 1 +
>  net/tls/tls_sw.c  | 7 +++++++
>  2 files changed, 8 insertions(+)
>
> diff --git a/include/net/tls.h b/include/net/tls.h
> index 41b2d41..f346a54 100644
> --- a/include/net/tls.h
> +++ b/include/net/tls.h
> @@ -161,6 +161,7 @@ struct tls_sw_context_tx {
>
>  #define BIT_TX_SCHEDULED       0
>  #define BIT_TX_CLOSING         1
> +#define BIT_TX_IN_PROGRESS     2
>         unsigned long tx_bitmask;
>  };
>
> diff --git a/net/tls/tls_sw.c b/net/tls/tls_sw.c
> index 91d21b0..6e99c61 100644
> --- a/net/tls/tls_sw.c
> +++ b/net/tls/tls_sw.c
> @@ -367,6 +367,10 @@ int tls_tx_records(struct sock *sk, int flags)
>         struct sk_msg *msg_en;
>         int tx_flags, rc = 0;
>
> +       /* If another writer is already in tls_tx_records, backoff and leave */
> +       if (test_and_set_bit(BIT_TX_IN_PROGRESS, &ctx->tx_bitmask))
> +               return 0;
> +
>         if (tls_is_partially_sent_record(tls_ctx)) {
>                 rec = list_first_entry(&ctx->tx_list,
>                                        struct tls_rec, list);
> @@ -415,6 +419,9 @@ int tls_tx_records(struct sock *sk, int flags)
>         if (rc < 0 && rc != -EAGAIN)
>                 tls_err_abort(sk, EBADMSG);
>
> +       /* clear the bit so another writer can get into tls_tx_records */
> +       clear_bit(BIT_TX_IN_PROGRESS, &ctx->tx_bitmask);
> +
>         return rc;
>  }
>
> --
> 1.8.3.1
>

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH V2 net 1/1] net/tls(TLS_SW): Fix list_del double free caused by a race condition in tls_tx_records
  2019-09-17 21:13       ` [PATCH V2 " Pooja Trivedi
  2019-09-17 21:24         ` Pooja Trivedi
@ 2019-09-18 21:25         ` Jakub Kicinski
  2019-09-18 21:37           ` Pooja Trivedi
  1 sibling, 1 reply; 14+ messages in thread
From: Jakub Kicinski @ 2019-09-18 21:25 UTC (permalink / raw)
  To: Pooja Trivedi
  Cc: netdev, davem, daniel, john.fastabend, davejwatson, aviadye,
	borisp, Pooja Trivedi, Mallesham Jatharakonda

On Tue, 17 Sep 2019 21:13:56 +0000, Pooja Trivedi wrote:
> From: Pooja Trivedi <pooja.trivedi@stackpath.com>

Ugh the same problem was diagnosed recently by Mallesham but I just
realized he took the conversation off list so you can't see it.

> Enclosing tls_tx_records within lock_sock/release_sock pair to ensure
> write-synchronization is not sufficient because socket lock gets released
> under memory pressure situation by sk_wait_event while it sleeps waiting
> for memory, allowing another writer into tls_tx_records. This causes a
> race condition with record deletion post transmission.
> 
> To fix this bug, use a flag set in tx_bitmask field of TLS context to
> ensure single writer in tls_tx_records at a time

Could you point me to the place where socket lock gets released in/under
tls_tx_records()? I thought it's only done in tls_sw_do_sendpage()/
tls_sw_do_sendmsg().

FWIW this was my answer to Mallesham:

If I understand you correctly after we release and re-acquire socket
lock msg_pl may be pointing to already freed message? Could we perhaps
reload the pointer from the context/record? Something like:

	if (ret) {
		rec = ctx->open_rec;
		if (rec)
			tls_trim_both_msgs(sk, &rec->msg_plaintext.sg.size);
		goto sendpage_end;
	}

I'm not 100% sure if that makes sense, perhaps John will find time to
look or you could experiment?

We could try to add some state like we have ctx->in_tcp_sendpages to
let the async processing know it's not needed since there's still a
writer present, but I get a feeling that'd end up being more complex.

> The bug resulted in the following crash:
> 
> [  270.888952] ------------[ cut here ]------------
> [  270.890450] list_del corruption, ffff91cc3753a800->prev is
> LIST_POISON2 (dead000000000122)
> [  270.891194] WARNING: CPU: 1 PID: 7387 at lib/list_debug.c:50
> __list_del_entry_valid+0x62/0x90
> [  270.892037] Modules linked in: n5pf(OE) netconsole tls(OE) bonding
> intel_rapl_msr intel_rapl_common sb_edac x86_pkg_temp_thermal
> intel_powerclamp coretemp kvm_intel kvm iTCO_wdt iTCO_vendor_support
> irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel
> aesni_intel crypto_simd mei_me cryptd glue_helper ipmi_si sg mei
> lpc_ich pcspkr joydev ioatdma i2c_i801 ipmi_devintf ipmi_msghandler
> wmi ip_tables xfs libcrc32c sd_mod mgag200 drm_vram_helper ttm
> drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm isci
> libsas ahci scsi_transport_sas libahci crc32c_intel serio_raw igb
> libata ptp pps_core dca i2c_algo_bit dm_mirror dm_region_hash dm_log
> dm_mod [last unloaded: nitrox_drv]
> [  270.896836] CPU: 1 PID: 7387 Comm: uperf Kdump: loaded Tainted: G
>         OE     5.3.0-rc4 #1
> [  270.897711] Hardware name: Supermicro SYS-1027R-N3RF/X9DRW, BIOS
> 3.0c 03/24/2014
> [  270.898597] RIP: 0010:__list_del_entry_valid+0x62/0x90
> [  270.899478] Code: 00 00 00 c3 48 89 fe 48 89 c2 48 c7 c7 e0 f9 ee
> 8d e8 b2 cf c8 ff 0f 0b 31 c0 c3 48 89 fe 48 c7 c7 18 fa ee 8d e8 9e
> cf c8 ff <0f> 0b 31 c0 c3 48 89 f2 48 89 fe 48 c7 c7 50 fa ee 8d e8 87
> cf c8
> [  270.901321] RSP: 0018:ffffb6ea86eb7c20 EFLAGS: 00010282
> [  270.902240] RAX: 0000000000000000 RBX: ffff91cc3753c000 RCX: 0000000000000000
> [  270.903157] RDX: ffff91bc3f867080 RSI: ffff91bc3f857738 RDI: ffff91bc3f857738
> [  270.904074] RBP: ffff91bc36020940 R08: 0000000000000560 R09: 0000000000000000
> [  270.904988] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
> [  270.905902] R13: ffff91cc3753a800 R14: ffff91cc37cc6400 R15: ffff91cc3753a800
> [  270.906809] FS:  00007f454a88d700(0000) GS:ffff91bc3f840000(0000)
> knlGS:0000000000000000
> [  270.907715] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  270.908606] CR2: 00007f453c00292c CR3: 000000103554e003 CR4: 00000000001606e0
> [  270.909490] Call Trace:
> [  270.910373]  tls_tx_records+0x138/0x1c0 [tls]
> [  270.911262]  tls_sw_sendpage+0x3e0/0x420 [tls]
> [  270.912154]  inet_sendpage+0x52/0x90
> [  270.913045]  ? direct_splice_actor+0x40/0x40
> [  270.913941]  kernel_sendpage+0x1a/0x30
> [  270.914831]  sock_sendpage+0x20/0x30
> [  270.915714]  pipe_to_sendpage+0x62/0x90
> [  270.916592]  __splice_from_pipe+0x80/0x180
> [  270.917461]  ? direct_splice_actor+0x40/0x40
> [  270.918334]  splice_from_pipe+0x5d/0x90
> [  270.919208]  direct_splice_actor+0x35/0x40
> [  270.920086]  splice_direct_to_actor+0x103/0x230
> [  270.920966]  ? generic_pipe_buf_nosteal+0x10/0x10
> [  270.921850]  do_splice_direct+0x9a/0xd0
> [  270.922733]  do_sendfile+0x1c9/0x3d0
> [  270.923612]  __x64_sys_sendfile64+0x5c/0xc0
> 
> Signed-off-by: Pooja Trivedi <pooja.trivedi@stackpath.com>
> ---
>  include/net/tls.h | 1 +
>  net/tls/tls_sw.c  | 7 +++++++
>  2 files changed, 8 insertions(+)
> 
> diff --git a/include/net/tls.h b/include/net/tls.h
> index 41b2d41..f346a54 100644
> --- a/include/net/tls.h
> +++ b/include/net/tls.h
> @@ -161,6 +161,7 @@ struct tls_sw_context_tx {
>  
>  #define BIT_TX_SCHEDULED	0
>  #define BIT_TX_CLOSING		1
> +#define BIT_TX_IN_PROGRESS	2
>  	unsigned long tx_bitmask;
>  };
>  
> diff --git a/net/tls/tls_sw.c b/net/tls/tls_sw.c
> index 91d21b0..6e99c61 100644
> --- a/net/tls/tls_sw.c
> +++ b/net/tls/tls_sw.c
> @@ -367,6 +367,10 @@ int tls_tx_records(struct sock *sk, int flags)
>  	struct sk_msg *msg_en;
>  	int tx_flags, rc = 0;
>  
> +	/* If another writer is already in tls_tx_records, backoff and leave */
> +	if (test_and_set_bit(BIT_TX_IN_PROGRESS, &ctx->tx_bitmask))
> +		return 0;
> +
>  	if (tls_is_partially_sent_record(tls_ctx)) {
>  		rec = list_first_entry(&ctx->tx_list,
>  				       struct tls_rec, list);
> @@ -415,6 +419,9 @@ int tls_tx_records(struct sock *sk, int flags)
>  	if (rc < 0 && rc != -EAGAIN)
>  		tls_err_abort(sk, EBADMSG);
>  
> +	/* clear the bit so another writer can get into tls_tx_records */
> +	clear_bit(BIT_TX_IN_PROGRESS, &ctx->tx_bitmask);
> +
>  	return rc;
>  }
>  


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH V2 net 1/1] net/tls(TLS_SW): Fix list_del double free caused by a race condition in tls_tx_records
  2019-09-18 21:25         ` Jakub Kicinski
@ 2019-09-18 21:37           ` Pooja Trivedi
  2019-09-18 21:45             ` Jakub Kicinski
  0 siblings, 1 reply; 14+ messages in thread
From: Pooja Trivedi @ 2019-09-18 21:37 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: netdev, davem, daniel, john.fastabend, davejwatson, aviadye,
	borisp, Pooja Trivedi, Mallesham Jatharakonda

Hi Jakub,

I have explained one potential way for the race to happen in my
original message to the netdev mailing list here:
https://marc.info/?l=linux-netdev&m=156805120229554&w=2

Here is the part out of there that's relevant to your question:

-----------------------------------------

One potential way for race condition to appear:

When under tcp memory pressure, Thread 1 takes the following code path:
do_sendfile ---> ... ---> .... ---> tls_sw_sendpage --->
tls_sw_do_sendpage ---> tls_tx_records ---> tls_push_sg --->
do_tcp_sendpages ---> sk_stream_wait_memory ---> sk_wait_event

sk_wait_event releases the socket lock and sleeps waiting for memory:

#define sk_wait_event(__sk, __timeo, __condition, __wait)       \
     ({  int __rc;                       \
         release_sock(__sk);                 \
         __rc = __condition;                 \
         if (!__rc) {                        \
             *(__timeo) = wait_woken(__wait,         \
                         TASK_INTERRUPTIBLE, \
                         *(__timeo));        \
         }                           \
         sched_annotate_sleep();                 \
         lock_sock(__sk);                    \
         __rc = __condition;                 \
         __rc;                           \
     })

Thread 2 code path:
tx_work_handler ---> tls_tx_records

Thread 2 is able to obtain the socket lock and go through the
transmission of the ctx->tx_list, deleting the sent ones (as in the
for loop below).

int tls_tx_records(struct sock *sk, int flags)
{
     ....
     ....
     ....
     ....
     list_for_each_entry_safe(rec, tmp, &ctx->tx_list, list) {
          if (READ_ONCE(rec->tx_ready)) {
              if (flags == -1)
                  tx_flags = rec->tx_flags;
              else
                  tx_flags = flags;

              msg_en = &rec->msg_encrypted;
              rc = tls_push_sg(sk, tls_ctx,
                       &msg_en->sg.data[msg_en->sg.curr],
                       0, tx_flags);
              if (rc)
                  goto tx_err;

              list_del(&rec->list); // **** crash location ****
              sk_msg_free(sk, &rec->msg_plaintext);
              kfree(rec);
          } else {
              break;
          }
      }
     ....
     ....
     ....
     ....
}

When Thread 1 wakes up from tls_push_sg call and attempts list_del on
previously grabbed record which was sent and deleted by Thread 2, it
causes the crash.

To fix this race, a flag or bool inside of ctx can be used to
synchronize access to tls_tx_records.

-----------------------------------------

Let me know if you need more information. Thanks!






On Wed, Sep 18, 2019 at 5:25 PM Jakub Kicinski
<jakub.kicinski@netronome.com> wrote:
>
> On Tue, 17 Sep 2019 21:13:56 +0000, Pooja Trivedi wrote:
> > From: Pooja Trivedi <pooja.trivedi@stackpath.com>
>
> Ugh the same problem was diagnosed recently by Mallesham but I just
> realized he took the conversation off list so you can't see it.
>
> > Enclosing tls_tx_records within lock_sock/release_sock pair to ensure
> > write-synchronization is not sufficient because socket lock gets released
> > under memory pressure situation by sk_wait_event while it sleeps waiting
> > for memory, allowing another writer into tls_tx_records. This causes a
> > race condition with record deletion post transmission.
> >
> > To fix this bug, use a flag set in tx_bitmask field of TLS context to
> > ensure single writer in tls_tx_records at a time
>
> Could you point me to the place where socket lock gets released in/under
> tls_tx_records()? I thought it's only done in tls_sw_do_sendpage()/
> tls_sw_do_sendmsg().
>
> FWIW this was my answer to Mallesham:
>
> If I understand you correctly after we release and re-acquire socket
> lock msg_pl may be pointing to already freed message? Could we perhaps
> reload the pointer from the context/record? Something like:
>
>         if (ret) {
>                 rec = ctx->open_rec;
>                 if (rec)
>                         tls_trim_both_msgs(sk, &rec->msg_plaintext.sg.size);
>                 goto sendpage_end;
>         }
>
> I'm not 100% sure if that makes sense, perhaps John will find time to
> look or you could experiment?
>
> We could try to add some state like we have ctx->in_tcp_sendpages to
> let the async processing know it's not needed since there's still a
> writer present, but I get a feeling that'd end up being more complex.
>
> > The bug resulted in the following crash:
> >
> > [  270.888952] ------------[ cut here ]------------
> > [  270.890450] list_del corruption, ffff91cc3753a800->prev is
> > LIST_POISON2 (dead000000000122)
> > [  270.891194] WARNING: CPU: 1 PID: 7387 at lib/list_debug.c:50
> > __list_del_entry_valid+0x62/0x90
> > [  270.892037] Modules linked in: n5pf(OE) netconsole tls(OE) bonding
> > intel_rapl_msr intel_rapl_common sb_edac x86_pkg_temp_thermal
> > intel_powerclamp coretemp kvm_intel kvm iTCO_wdt iTCO_vendor_support
> > irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel
> > aesni_intel crypto_simd mei_me cryptd glue_helper ipmi_si sg mei
> > lpc_ich pcspkr joydev ioatdma i2c_i801 ipmi_devintf ipmi_msghandler
> > wmi ip_tables xfs libcrc32c sd_mod mgag200 drm_vram_helper ttm
> > drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm isci
> > libsas ahci scsi_transport_sas libahci crc32c_intel serio_raw igb
> > libata ptp pps_core dca i2c_algo_bit dm_mirror dm_region_hash dm_log
> > dm_mod [last unloaded: nitrox_drv]
> > [  270.896836] CPU: 1 PID: 7387 Comm: uperf Kdump: loaded Tainted: G
> >         OE     5.3.0-rc4 #1
> > [  270.897711] Hardware name: Supermicro SYS-1027R-N3RF/X9DRW, BIOS
> > 3.0c 03/24/2014
> > [  270.898597] RIP: 0010:__list_del_entry_valid+0x62/0x90
> > [  270.899478] Code: 00 00 00 c3 48 89 fe 48 89 c2 48 c7 c7 e0 f9 ee
> > 8d e8 b2 cf c8 ff 0f 0b 31 c0 c3 48 89 fe 48 c7 c7 18 fa ee 8d e8 9e
> > cf c8 ff <0f> 0b 31 c0 c3 48 89 f2 48 89 fe 48 c7 c7 50 fa ee 8d e8 87
> > cf c8
> > [  270.901321] RSP: 0018:ffffb6ea86eb7c20 EFLAGS: 00010282
> > [  270.902240] RAX: 0000000000000000 RBX: ffff91cc3753c000 RCX: 0000000000000000
> > [  270.903157] RDX: ffff91bc3f867080 RSI: ffff91bc3f857738 RDI: ffff91bc3f857738
> > [  270.904074] RBP: ffff91bc36020940 R08: 0000000000000560 R09: 0000000000000000
> > [  270.904988] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
> > [  270.905902] R13: ffff91cc3753a800 R14: ffff91cc37cc6400 R15: ffff91cc3753a800
> > [  270.906809] FS:  00007f454a88d700(0000) GS:ffff91bc3f840000(0000)
> > knlGS:0000000000000000
> > [  270.907715] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [  270.908606] CR2: 00007f453c00292c CR3: 000000103554e003 CR4: 00000000001606e0
> > [  270.909490] Call Trace:
> > [  270.910373]  tls_tx_records+0x138/0x1c0 [tls]
> > [  270.911262]  tls_sw_sendpage+0x3e0/0x420 [tls]
> > [  270.912154]  inet_sendpage+0x52/0x90
> > [  270.913045]  ? direct_splice_actor+0x40/0x40
> > [  270.913941]  kernel_sendpage+0x1a/0x30
> > [  270.914831]  sock_sendpage+0x20/0x30
> > [  270.915714]  pipe_to_sendpage+0x62/0x90
> > [  270.916592]  __splice_from_pipe+0x80/0x180
> > [  270.917461]  ? direct_splice_actor+0x40/0x40
> > [  270.918334]  splice_from_pipe+0x5d/0x90
> > [  270.919208]  direct_splice_actor+0x35/0x40
> > [  270.920086]  splice_direct_to_actor+0x103/0x230
> > [  270.920966]  ? generic_pipe_buf_nosteal+0x10/0x10
> > [  270.921850]  do_splice_direct+0x9a/0xd0
> > [  270.922733]  do_sendfile+0x1c9/0x3d0
> > [  270.923612]  __x64_sys_sendfile64+0x5c/0xc0
> >
> > Signed-off-by: Pooja Trivedi <pooja.trivedi@stackpath.com>
> > ---
> >  include/net/tls.h | 1 +
> >  net/tls/tls_sw.c  | 7 +++++++
> >  2 files changed, 8 insertions(+)
> >
> > diff --git a/include/net/tls.h b/include/net/tls.h
> > index 41b2d41..f346a54 100644
> > --- a/include/net/tls.h
> > +++ b/include/net/tls.h
> > @@ -161,6 +161,7 @@ struct tls_sw_context_tx {
> >
> >  #define BIT_TX_SCHEDULED     0
> >  #define BIT_TX_CLOSING               1
> > +#define BIT_TX_IN_PROGRESS   2
> >       unsigned long tx_bitmask;
> >  };
> >
> > diff --git a/net/tls/tls_sw.c b/net/tls/tls_sw.c
> > index 91d21b0..6e99c61 100644
> > --- a/net/tls/tls_sw.c
> > +++ b/net/tls/tls_sw.c
> > @@ -367,6 +367,10 @@ int tls_tx_records(struct sock *sk, int flags)
> >       struct sk_msg *msg_en;
> >       int tx_flags, rc = 0;
> >
> > +     /* If another writer is already in tls_tx_records, backoff and leave */
> > +     if (test_and_set_bit(BIT_TX_IN_PROGRESS, &ctx->tx_bitmask))
> > +             return 0;
> > +
> >       if (tls_is_partially_sent_record(tls_ctx)) {
> >               rec = list_first_entry(&ctx->tx_list,
> >                                      struct tls_rec, list);
> > @@ -415,6 +419,9 @@ int tls_tx_records(struct sock *sk, int flags)
> >       if (rc < 0 && rc != -EAGAIN)
> >               tls_err_abort(sk, EBADMSG);
> >
> > +     /* clear the bit so another writer can get into tls_tx_records */
> > +     clear_bit(BIT_TX_IN_PROGRESS, &ctx->tx_bitmask);
> > +
> >       return rc;
> >  }
> >
>

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH V2 net 1/1] net/tls(TLS_SW): Fix list_del double free caused by a race condition in tls_tx_records
  2019-09-18 21:37           ` Pooja Trivedi
@ 2019-09-18 21:45             ` Jakub Kicinski
  2019-09-22  3:19               ` Pooja Trivedi
  0 siblings, 1 reply; 14+ messages in thread
From: Jakub Kicinski @ 2019-09-18 21:45 UTC (permalink / raw)
  To: Pooja Trivedi
  Cc: netdev, davem, daniel, john.fastabend, davejwatson, aviadye,
	borisp, Pooja Trivedi, Mallesham Jatharakonda

On Wed, 18 Sep 2019 17:37:44 -0400, Pooja Trivedi wrote:
> Hi Jakub,
> 
> I have explained one potential way for the race to happen in my
> original message to the netdev mailing list here:
> https://marc.info/?l=linux-netdev&m=156805120229554&w=2
> 
> Here is the part out of there that's relevant to your question:
> 
> -----------------------------------------
> 
> One potential way for race condition to appear:
> 
> When under tcp memory pressure, Thread 1 takes the following code path:
> do_sendfile ---> ... ---> .... ---> tls_sw_sendpage --->
> tls_sw_do_sendpage ---> tls_tx_records ---> tls_push_sg --->
> do_tcp_sendpages ---> sk_stream_wait_memory ---> sk_wait_event

Ugh, so do_tcp_sendpages() can also release the lock :/

Since the problem occurs in tls_sw_do_sendpage() and
tls_sw_do_sendmsg() as well, should we perhaps fix it at that level?

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH V2 net 1/1] net/tls(TLS_SW): Fix list_del double free caused by a race condition in tls_tx_records
  2019-09-18 21:45             ` Jakub Kicinski
@ 2019-09-22  3:19               ` Pooja Trivedi
  2019-09-24  0:28                 ` Jakub Kicinski
  0 siblings, 1 reply; 14+ messages in thread
From: Pooja Trivedi @ 2019-09-22  3:19 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: netdev, davem, daniel, john.fastabend, davejwatson, aviadye,
	borisp, Pooja Trivedi, Mallesham Jatharakonda

On Wed, Sep 18, 2019 at 5:45 PM Jakub Kicinski
<jakub.kicinski@netronome.com> wrote:
>
> On Wed, 18 Sep 2019 17:37:44 -0400, Pooja Trivedi wrote:
> > Hi Jakub,
> >
> > I have explained one potential way for the race to happen in my
> > original message to the netdev mailing list here:
> > https://marc.info/?l=linux-netdev&m=156805120229554&w=2
> >
> > Here is the part out of there that's relevant to your question:
> >
> > -----------------------------------------
> >
> > One potential way for race condition to appear:
> >
> > When under tcp memory pressure, Thread 1 takes the following code path:
> > do_sendfile ---> ... ---> .... ---> tls_sw_sendpage --->
> > tls_sw_do_sendpage ---> tls_tx_records ---> tls_push_sg --->
> > do_tcp_sendpages ---> sk_stream_wait_memory ---> sk_wait_event
>
> Ugh, so do_tcp_sendpages() can also release the lock :/
>
> Since the problem occurs in tls_sw_do_sendpage() and
> tls_sw_do_sendmsg() as well, should we perhaps fix it at that level?

That won't do, because tls_tx_records also gets called when completion
callbacks schedule delayed work, and that was the code path that caused
the crash in my test: Cavium's Nitrox crypto offload driver calls
tls_encrypt_done, which calls schedule_delayed_work, and the scheduled
delayed work is then processed by tx_work_handler. Notice in my previous
reply:
"Thread 2 code path:
tx_work_handler ---> tls_tx_records"
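
For context, the Thread 2 entry point looks roughly like this in the 5.3
tree (paraphrased from net/tls/tls_sw.c and trimmed to the relevant calls):

static void tx_work_handler(struct work_struct *work)
{
	struct delayed_work *delayed_work = to_delayed_work(work);
	struct tx_work *tx_work = container_of(delayed_work,
					       struct tx_work, work);
	struct sock *sk = tx_work->sk;
	struct tls_context *tls_ctx = tls_get_ctx(sk);
	struct tls_sw_context_tx *ctx = tls_sw_ctx_tx(tls_ctx);

	if (!test_and_clear_bit(BIT_TX_SCHEDULED, &ctx->tx_bitmask))
		return;

	/* Takes the socket lock and transmits whatever records in
	 * ctx->tx_list are ready -- racing with the sleeping sender */
	lock_sock(sk);
	tls_tx_records(sk, -1);
	release_sock(sk);
}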

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH V2 net 1/1] net/tls(TLS_SW): Fix list_del double free caused by a race condition in tls_tx_records
  2019-09-22  3:19               ` Pooja Trivedi
@ 2019-09-24  0:28                 ` Jakub Kicinski
  2019-09-24 16:48                   ` Pooja Trivedi
  0 siblings, 1 reply; 14+ messages in thread
From: Jakub Kicinski @ 2019-09-24  0:28 UTC (permalink / raw)
  To: Pooja Trivedi
  Cc: netdev, davem, daniel, john.fastabend, davejwatson, aviadye,
	borisp, Pooja Trivedi, Mallesham Jatharakonda

On Sat, 21 Sep 2019 23:19:20 -0400, Pooja Trivedi wrote:
> On Wed, Sep 18, 2019 at 5:45 PM Jakub Kicinski wrote:
> > On Wed, 18 Sep 2019 17:37:44 -0400, Pooja Trivedi wrote:  
> > > Hi Jakub,
> > >
> > > I have explained one potential way for the race to happen in my
> > > original message to the netdev mailing list here:
> > > https://marc.info/?l=linux-netdev&m=156805120229554&w=2
> > >
> > > Here is the part out of there that's relevant to your question:
> > >
> > > -----------------------------------------
> > >
> > > One potential way for race condition to appear:
> > >
> > > When under tcp memory pressure, Thread 1 takes the following code path:
> > > do_sendfile ---> ... ---> .... ---> tls_sw_sendpage --->
> > > tls_sw_do_sendpage ---> tls_tx_records ---> tls_push_sg --->
> > > do_tcp_sendpages ---> sk_stream_wait_memory ---> sk_wait_event  
> >
> > Ugh, so do_tcp_sendpages() can also release the lock :/
> >
> > Since the problem occurs in tls_sw_do_sendpage() and
> > tls_sw_do_sendmsg() as well, should we perhaps fix it at that level?  
> 
> That won't do because tls_tx_records also gets called when completion
> callbacks schedule delayed work. That was the code path that caused
> the crash for my test. Cavium's nitrox crypto offload driver calling
> tls_encrypt_done, which calls schedule_delayed_work. Delayed work that
> was scheduled would then be processed by tx_work_handler.
> Notice in my previous reply,
> "Thread 2 code path:
> tx_work_handler ---> tls_tx_records"
> 
> "Thread 2 code path:
> tx_work_handler ---> tls_tx_records"

Right, the work handler would obviously also have to obey the exclusion
mechanism of choice.

Having said that this really does feel like we are trying to lock code,
not data here :(

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH V2 net 1/1] net/tls(TLS_SW): Fix list_del double free caused by a race condition in tls_tx_records
  2019-09-24  0:28                 ` Jakub Kicinski
@ 2019-09-24 16:48                   ` Pooja Trivedi
  2019-09-28  0:37                     ` Jakub Kicinski
  0 siblings, 1 reply; 14+ messages in thread
From: Pooja Trivedi @ 2019-09-24 16:48 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: netdev, davem, daniel, john.fastabend, davejwatson, aviadye,
	borisp, Pooja Trivedi, Mallesham Jatharakonda

On Mon, Sep 23, 2019 at 8:28 PM Jakub Kicinski
<jakub.kicinski@netronome.com> wrote:
>
> On Sat, 21 Sep 2019 23:19:20 -0400, Pooja Trivedi wrote:
> > On Wed, Sep 18, 2019 at 5:45 PM Jakub Kicinski wrote:
> > > On Wed, 18 Sep 2019 17:37:44 -0400, Pooja Trivedi wrote:
> > > > Hi Jakub,
> > > >
> > > > I have explained one potential way for the race to happen in my
> > > > original message to the netdev mailing list here:
> > > > https://marc.info/?l=linux-netdev&m=156805120229554&w=2
> > > >
> > > > Here is the part out of there that's relevant to your question:
> > > >
> > > > -----------------------------------------
> > > >
> > > > One potential way for race condition to appear:
> > > >
> > > > When under tcp memory pressure, Thread 1 takes the following code path:
> > > > do_sendfile ---> ... ---> .... ---> tls_sw_sendpage --->
> > > > tls_sw_do_sendpage ---> tls_tx_records ---> tls_push_sg --->
> > > > do_tcp_sendpages ---> sk_stream_wait_memory ---> sk_wait_event
> > >
> > > Ugh, so do_tcp_sendpages() can also release the lock :/
> > >
> > > Since the problem occurs in tls_sw_do_sendpage() and
> > > tls_sw_do_sendmsg() as well, should we perhaps fix it at that level?
> >
> > That won't do because tls_tx_records also gets called when completion
> > callbacks schedule delayed work. That was the code path that caused
> > the crash for my test. Cavium's nitrox crypto offload driver calling
> > tls_encrypt_done, which calls schedule_delayed_work. Delayed work that
> > was scheduled would then be processed by tx_work_handler.
> > Notice in my previous reply,
> > "Thread 2 code path:
> > tx_work_handler ---> tls_tx_records"
> >
> > "Thread 2 code path:
> > tx_work_handler ---> tls_tx_records"
>
> Right, the work handler would obviously also have to obey the exclusion
> mechanism of choice.
>
> Having said that this really does feel like we are trying to lock code,
> not data here :(


Agreed, and that is exactly the thought process I went through. So what
are some other options?

1) A lock member inside of ctx to protect tx_list
We are load testing ktls offload with nitrox and the performance was
quite adversely affected by this. This approach can be explored more,
but the original design of using socket lock didn't follow this model
either.
2) Allow tagging of individual records inside of tx_list to indicate
whether they have been 'processed'
This approach would likely protect the data without compromising
performance. It would allow Thread 2 to proceed with the TX portion of
tls_tx_records while Thread 1 sleeps waiting for memory. There would
need to be careful cleanup and backtracking after the thread wakes up
to ensure a consistent state of tx_list and record transmission.
The approach has several problems, however -- (a) it could cause
out-of-order record tx, (b) if Thread 1 is waiting for memory, Thread 2
most likely will be too, and (c) again, the socket lock wasn't designed
to follow this model to begin with
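
For illustration only, a rough sketch of what option (2) might look
like. The tx_done member, the tls_tx_records_tagged() name and the
simplified walk below are all invented for this sketch (they are not
existing kernel code), and the sketch shows only the tagging mechanics;
it does not address problems (a)-(c) above.

/* Hypothetical per-record tag: a new member in struct tls_rec plus a
 * walk that skips records another writer has already claimed.  The
 * push/free step is elided; see the real tls_tx_records().
 */
struct tls_rec {
        struct list_head list;
        bool tx_done;           /* hypothetical: claimed/pushed to TCP */
        /* ... existing members (encrypted msg, AAD, IV, ...) elided ... */
};

static int tls_tx_records_tagged(struct sock *sk, int flags)
{
        struct tls_sw_context_tx *ctx = tls_sw_ctx_tx(tls_get_ctx(sk));
        struct tls_rec *rec, *tmp;
        int rc = 0;

        list_for_each_entry_safe(rec, tmp, &ctx->tx_list, list) {
                if (rec->tx_done)
                        continue;       /* another writer already handled it */
                rec->tx_done = true;    /* claim it before the push can sleep */

                /* ... push the encrypted record into TCP as today, then
                 * list_del(&rec->list) and free it; if the push sleeps,
                 * a second writer skips this record instead of freeing
                 * it a second time ...
                 */
        }
        return rc;
}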


Given that the socket lock was essentially working as a code protector --
an exclusion mechanism allowing only a single writer through
tls_tx_records at a time -- what other clean ways do we have to fix
the race without a significant refactor of the design and code?

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH V2 net 1/1] net/tls(TLS_SW): Fix list_del double free caused by a race condition in tls_tx_records
  2019-09-24 16:48                   ` Pooja Trivedi
@ 2019-09-28  0:37                     ` Jakub Kicinski
  2019-10-09 16:57                       ` Pooja Trivedi
  0 siblings, 1 reply; 14+ messages in thread
From: Jakub Kicinski @ 2019-09-28  0:37 UTC (permalink / raw)
  To: Pooja Trivedi
  Cc: netdev, davem, daniel, john.fastabend, davejwatson, aviadye,
	borisp, Pooja Trivedi, Mallesham Jatharakonda

On Tue, 24 Sep 2019 12:48:26 -0400, Pooja Trivedi wrote:
> On Mon, Sep 23, 2019 at 8:28 PM Jakub Kicinski wrote:
> > On Sat, 21 Sep 2019 23:19:20 -0400, Pooja Trivedi wrote:  
> > > On Wed, Sep 18, 2019 at 5:45 PM Jakub Kicinski wrote:  
> > > > On Wed, 18 Sep 2019 17:37:44 -0400, Pooja Trivedi wrote:  
> > > > > Hi Jakub,
> > > > >
> > > > > I have explained one potential way for the race to happen in my
> > > > > original message to the netdev mailing list here:
> > > > > https://marc.info/?l=linux-netdev&m=156805120229554&w=2
> > > > >
> > > > > Here is the part out of there that's relevant to your question:
> > > > >
> > > > > -----------------------------------------
> > > > >
> > > > > One potential way for race condition to appear:
> > > > >
> > > > > When under tcp memory pressure, Thread 1 takes the following code path:
> > > > > do_sendfile ---> ... ---> .... ---> tls_sw_sendpage --->
> > > > > tls_sw_do_sendpage ---> tls_tx_records ---> tls_push_sg --->
> > > > > do_tcp_sendpages ---> sk_stream_wait_memory ---> sk_wait_event  
> > > >
> > > > Ugh, so do_tcp_sendpages() can also release the lock :/
> > > >
> > > > Since the problem occurs in tls_sw_do_sendpage() and
> > > > tls_sw_do_sendmsg() as well, should we perhaps fix it at that level?  
> > >
> > > That won't do because tls_tx_records also gets called when completion
> > > callbacks schedule delayed work. That was the code path that caused
> > > the crash for my test. Cavium's nitrox crypto offload driver calling
> > > tls_encrypt_done, which calls schedule_delayed_work. Delayed work that
> > > was scheduled would then be processed by tx_work_handler.
> > > Notice in my previous reply,
> > > "Thread 2 code path:
> > > tx_work_handler ---> tls_tx_records"
> > >
> > > "Thread 2 code path:
> > > tx_work_handler ---> tls_tx_records"  
> >
> > Right, the work handler would obviously also have to obey the exclusion
> > mechanism of choice.
> >
> > Having said that this really does feel like we are trying to lock code,
> > not data here :(  
> 
> Agree with you and exactly the thought process I went through. So what
> are some other options?
> 
> 1) A lock member inside of ctx to protect tx_list
> We are load testing ktls offload with nitrox and the performance was
> quite adversely affected by this. This approach can be explored more,
> but the original design of using socket lock didn't follow this model
> either.
> 2) Allow tagging of individual record inside of tx_list to indicate if
> it has been 'processed'
> This approach would likely protect the data without compromising
> performance. It will allow Thread 2 to proceed with the TX portion of
> tls_tx_records while Thread 1 sleeps waiting for memory. There will
> need to be careful cleanup and backtracking after the thread wakes up
> to ensure a consistent state of tx_list and record transmission.
> The approach has several problems, however -- (a) It could cause
> out-of-order record tx (b) If Thread 1 is waiting for memory, Thread 2
> most likely will (c) Again, socket lock wasn't designed to follow this
> model to begin with
> 
> 
> Given that socket lock essentially was working as a code protector --
> as an exclusion mechanism to allow only a single writer through
> tls_tx_records at a time -- what other clean ways do we have to fix
> the race without a significant refactor of the design and code?

Very sorry about the delay. I don't think we can maintain the correct
semantics without sleeping :( If we just bail in tls_tx_records() when
there's already another writer, the later writer will return from the
system call, even though the data is not pushed into the TCP layer.

What was the reason for the performance impact on (1)? My feeling is that
we need to make writers wait to maintain socket write semantics, and
that implies putting writers to sleep, which is indeed very costly..

Perhaps something along the lines of:

	if (ctx->in_tcp_sendpages) {
		rc = sk_stream_wait_memory(sk, &timeo);
		...
	}

in tls_tx_records() would be the "most correct" solution? If we get
there and there is already a writer, that means the first writer has
to be waiting for memory, and so should the second..
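
Filled out a little, a minimal, untested sketch of that idea -- assuming
the flag is the in_tcp_sendpages field on the tls_context (as used by
tls_push_sg()) and that a sndtimeo-based timeout is acceptable -- might
look like:

int tls_tx_records(struct sock *sk, int flags)
{
        struct tls_context *tls_ctx = tls_get_ctx(sk);
        long timeo = sock_sndtimeo(sk, flags & MSG_DONTWAIT);
        int rc;

        /* Another writer is already pushing tx_list into TCP and is
         * (or soon will be) asleep in sk_stream_wait_memory(); wait
         * along with it instead of walking tx_list concurrently.
         */
        while (tls_ctx->in_tcp_sendpages) {
                rc = sk_stream_wait_memory(sk, &timeo);
                if (rc)
                        return rc;
        }

        /* ... existing walk of the tx_list in tls_sw_ctx_tx(tls_ctx),
         *     unchanged ...
         */
        return 0;
}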

WDYT?

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH V2 net 1/1] net/tls(TLS_SW): Fix list_del double free caused by a race condition in tls_tx_records
  2019-09-28  0:37                     ` Jakub Kicinski
@ 2019-10-09 16:57                       ` Pooja Trivedi
  2019-10-17 23:48                         ` Jakub Kicinski
  0 siblings, 1 reply; 14+ messages in thread
From: Pooja Trivedi @ 2019-10-09 16:57 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: netdev, davem, daniel, john.fastabend, davejwatson, aviadye,
	borisp, Pooja Trivedi, Mallesham Jatharakonda

On Fri, Sep 27, 2019 at 05:37:53PM -0700, Jakub Kicinski wrote:
> On Tue, 24 Sep 2019 12:48:26 -0400, Pooja Trivedi wrote:
> > On Mon, Sep 23, 2019 at 8:28 PM Jakub Kicinski wrote:
> > > On Sat, 21 Sep 2019 23:19:20 -0400, Pooja Trivedi wrote:  
> > > > On Wed, Sep 18, 2019 at 5:45 PM Jakub Kicinski wrote:  
> > > > > On Wed, 18 Sep 2019 17:37:44 -0400, Pooja Trivedi wrote:  
> > > > > > Hi Jakub,
> > > > > >
> > > > > > I have explained one potential way for the race to happen in my
> > > > > > original message to the netdev mailing list here:
> > > > > > https://marc.info/?l=linux-netdev&m=156805120229554&w=2
> > > > > >
> > > > > > Here is the part out of there that's relevant to your question:
> > > > > >
> > > > > > -----------------------------------------
> > > > > >
> > > > > > One potential way for race condition to appear:
> > > > > >
> > > > > > When under tcp memory pressure, Thread 1 takes the following code path:
> > > > > > do_sendfile ---> ... ---> .... ---> tls_sw_sendpage --->
> > > > > > tls_sw_do_sendpage ---> tls_tx_records ---> tls_push_sg --->
> > > > > > do_tcp_sendpages ---> sk_stream_wait_memory ---> sk_wait_event  
> > > > >
> > > > > Ugh, so do_tcp_sendpages() can also release the lock :/
> > > > >
> > > > > Since the problem occurs in tls_sw_do_sendpage() and
> > > > > tls_sw_do_sendmsg() as well, should we perhaps fix it at that level?  
> > > >
> > > > That won't do because tls_tx_records also gets called when completion
> > > > callbacks schedule delayed work. That was the code path that caused
> > > > the crash for my test. Cavium's nitrox crypto offload driver calling
> > > > tls_encrypt_done, which calls schedule_delayed_work. Delayed work that
> > > > was scheduled would then be processed by tx_work_handler.
> > > > Notice in my previous reply,
> > > > "Thread 2 code path:
> > > > tx_work_handler ---> tls_tx_records"
> > > >
> > > > "Thread 2 code path:
> > > > tx_work_handler ---> tls_tx_records"  
> > >
> > > Right, the work handler would obviously also have to obey the exclusion
> > > mechanism of choice.
> > >
> > > Having said that this really does feel like we are trying to lock code,
> > > not data here :(  
> > 
> > Agree with you and exactly the thought process I went through. So what
> > are some other options?
> > 
> > 1) A lock member inside of ctx to protect tx_list
> > We are load testing ktls offload with nitrox and the performance was
> > quite adversely affected by this. This approach can be explored more,
> > but the original design of using socket lock didn't follow this model
> > either.
> > 2) Allow tagging of individual record inside of tx_list to indicate if
> > it has been 'processed'
> > This approach would likely protect the data without compromising
> > performance. It will allow Thread 2 to proceed with the TX portion of
> > tls_tx_records while Thread 1 sleeps waiting for memory. There will
> > need to be careful cleanup and backtracking after the thread wakes up
> > to ensure a consistent state of tx_list and record transmission.
> > The approach has several problems, however -- (a) It could cause
> > out-of-order record tx (b) If Thread 1 is waiting for memory, Thread 2
> > most likely will (c) Again, socket lock wasn't designed to follow this
> > model to begin with
> > 
> > 
> > Given that socket lock essentially was working as a code protector --
> > as an exclusion mechanism to allow only a single writer through
> > tls_tx_records at a time -- what other clean ways do we have to fix
> > the race without a significant refactor of the design and code?
> 
> Very sorry about the delay. I don't think we can maintain the correct
> semantics without sleeping :( If we just bail in tls_tx_records() when
> there's already another writer the later writer will return from the
> system call, even though the data is not pushed into the TCP layer.
> 

Thanks for your response and sorry about the delay!

I am trying the following scenarios in my head to see how valid your
concern is. Play along with me please.

The two main writers in the picture here are:
Thread 1 -- Enqueue thread (sendfile system call) -- pushes records to
the card, and also performs completions (push to tcp) if records are ready
Thread 2 -- Work handler (tx_work_handler) -- bottom-half completion routine

With the submitted patch,
Case 1 (your concern): Thread 2 grabs the socket lock, calls
tls_tx_records, runs into memory pressure, releases the socket lock, and
waits for memory. Now Thread 1 grabs the socket lock, calls
tls_tx_records, and bails.
In this case, the sendfile system call will bail out without performing
completions. Is that really a problem? When Thread 1 ultimately
proceeds, it will perform the completions anyway.

Case 2: The threads grab the socket lock in the reverse sequence of Case 1.
So Thread 1 grabs the socket lock first and ends up waiting for memory.
Thread 2 comes in later and bails from tls_tx_records.
In this case, I realized that the submitted patch does not reschedule
the work before bailing. So I think an amendment to that patch, along
the lines of what's shown below, would fare better.

********************************************************************************

diff --git a/include/net/tls.h b/include/net/tls.h
index c664e6dba0d1..4ae3b4822770 100644
--- a/include/net/tls.h
+++ b/include/net/tls.h
@@ -162,6 +162,7 @@ struct tls_sw_context_tx {
 
 #define BIT_TX_SCHEDULED	0
 #define BIT_TX_CLOSING		1
+#define BIT_TX_IN_PROGRESS	2
 	unsigned long tx_bitmask;
 };
 
diff --git a/net/tls/tls_sw.c b/net/tls/tls_sw.c
index c2b5e0d2ba1a..608db30bbe62 100644
--- a/net/tls/tls_sw.c
+++ b/net/tls/tls_sw.c
@@ -359,7 +359,7 @@ static void tls_free_open_rec(struct sock *sk)
 	}
 }
 
-int tls_tx_records(struct sock *sk, int flags)
+int tls_tx_records(struct sock *sk, int flags, bool bh)
 {
 	struct tls_context *tls_ctx = tls_get_ctx(sk);
 	struct tls_sw_context_tx *ctx = tls_sw_ctx_tx(tls_ctx);
@@ -367,6 +367,15 @@ int tls_tx_records(struct sock *sk, int flags)
 	struct sk_msg *msg_en;
 	int tx_flags, rc = 0;
 
+	/* If another writer is already in tls_tx_records, backoff and leave */
+	if (test_and_set_bit(BIT_TX_IN_PROGRESS, &ctx->tx_bitmask)) {
+		if (bh) {
+			if (!test_and_set_bit(BIT_TX_SCHEDULED, &ctx->tx_bitmask))
+				schedule_delayed_work(&ctx->tx_work.work, 1);
+		}
+		return 0;
+	}
+
 	if (tls_is_partially_sent_record(tls_ctx)) {
 		rec = list_first_entry(&ctx->tx_list,
 				       struct tls_rec, list);
@@ -415,6 +424,9 @@ int tls_tx_records(struct sock *sk, int flags)
 	if (rc < 0 && rc != -EAGAIN)
 		tls_err_abort(sk, EBADMSG);
 
+	/* clear the bit so another writer can get into tls_tx_records */
+	clear_bit(BIT_TX_IN_PROGRESS, &ctx->tx_bitmask);
+
 	return rc;
 }
 
@@ -747,7 +759,7 @@ static int tls_push_record(struct sock *sk, int flags,
 		ctx->open_rec = tmp;
 	}
 
-	return tls_tx_records(sk, flags);
+	return tls_tx_records(sk, flags, false);
 }
 
 static int bpf_exec_tx_verdict(struct sk_msg *msg, struct sock *sk,
@@ -1084,7 +1096,7 @@ int tls_sw_sendmsg(struct sock *sk, struct msghdr *msg, size_t size)
 	/* Transmit if any encryptions have completed */
 	if (test_and_clear_bit(BIT_TX_SCHEDULED, &ctx->tx_bitmask)) {
 		cancel_delayed_work(&ctx->tx_work.work);
-		tls_tx_records(sk, msg->msg_flags);
+		tls_tx_records(sk, msg->msg_flags, false);
 	}
 
 send_end:
@@ -1208,7 +1220,7 @@ static int tls_sw_do_sendpage(struct sock *sk, struct page *page,
 		/* Transmit if any encryptions have completed */
 		if (test_and_clear_bit(BIT_TX_SCHEDULED, &ctx->tx_bitmask)) {
 			cancel_delayed_work(&ctx->tx_work.work);
-			tls_tx_records(sk, flags);
+			tls_tx_records(sk, flags, false);
 		}
 	}
 sendpage_end:
@@ -2072,7 +2084,7 @@ void tls_sw_release_resources_tx(struct sock *sk)
 	if (atomic_read(&ctx->encrypt_pending))
 		crypto_wait_req(-EINPROGRESS, &ctx->async_wait);
 
-	tls_tx_records(sk, -1);
+	tls_tx_records(sk, -1, false);
 
 	/* Free up un-sent records in tx_list. First, free
 	 * the partially sent record if any at head of tx_list.
@@ -2171,7 +2183,7 @@ static void tx_work_handler(struct work_struct *work)
 	if (!test_and_clear_bit(BIT_TX_SCHEDULED, &ctx->tx_bitmask))
 		return;
 	lock_sock(sk);
-	tls_tx_records(sk, -1);
+	tls_tx_records(sk, -1, true);
 	release_sock(sk);
 }

********************************************************************************

> What was reason for the performance impact on (1)? 


While the performance impact still needs to be investigated, that effort was cut short by other
issues with that approach, such as hard lockups. The basic problem is that the socket lock is
touched at multiple layers (tls, tcp, etc.).

Here are the two approaches we tried along the lines of using an additional lock...

Approach 1 -- Protect tx_list with a spinlock_t tx_list_lock :

********************************************************************************

--- test/linux-5.3.2/include/net/tls.h	2019-10-01 02:24:50.000000000 -0400
+++ linux-5.3.2/include/net/tls.h	2019-10-01 09:34:34.490353833 -0400
@@ -155,6 +155,7 @@
 	struct tx_work tx_work;
 	struct tls_rec *open_rec;
 	struct list_head tx_list;
+	spinlock_t tx_list_lock;	/* protects records list */
 	atomic_t encrypt_pending;
 	int async_notify;
 	int async_capable;
--- test/linux-5.3.2/net/tls/tls_sw.c	2019-10-01 02:24:50.000000000 -0400
+++ tls/tls_sw.c	2019-10-01 09:55:55.113629719 -0400
@@ -366,7 +366,9 @@
 	struct tls_rec *rec, *tmp;
 	struct sk_msg *msg_en;
 	int tx_flags, rc = 0;
-
+	unsigned long lock_flags;
+	
+	spin_lock_irqsave(&ctx->tx_list_lock, lock_flags);	
 	if (tls_is_partially_sent_record(tls_ctx)) {
 		rec = list_first_entry(&ctx->tx_list,
 				       struct tls_rec, list);
@@ -412,6 +414,7 @@
 	}
 
 tx_err:
+	spin_unlock_irqrestore(&ctx->tx_list_lock, lock_flags);	
 	if (rc < 0 && rc != -EAGAIN)
 		tls_err_abort(sk, EBADMSG);
 
@@ -1195,6 +1198,7 @@
 wait_for_sndbuf:
 		set_bit(SOCK_NOSPACE, &sk->sk_socket->flags);
 wait_for_memory:
+		//printk("%s %d\n", __func__, __LINE__);
 		ret = sk_stream_wait_memory(sk, &timeo);
 		if (ret) {
 			tls_trim_both_msgs(sk, msg_pl->sg.size);
@@ -2259,6 +2263,7 @@
 		aead = &sw_ctx_tx->aead_send;
 		INIT_LIST_HEAD(&sw_ctx_tx->tx_list);
 		INIT_DELAYED_WORK(&sw_ctx_tx->tx_work.work, tx_work_handler);
+		spin_lock_init(&sw_ctx_tx->tx_list_lock);
 		sw_ctx_tx->tx_work.sk = sk;
 	} else {
 		crypto_init_wait(&sw_ctx_rx->async_wait);

********************************************************************************

Approach 1 produced the following crash:

--------------------------------------------------------------------------------

[  771.311524] 00000000: 7d 33 88 93 0f 93 b2 42
[ 1658.735003] NMI watchdog: Watchdog detected hard LOCKUP on cpu 15
[ 1658.735004] Modules linked in: n5pf(OE) tls(OE) netconsole ip6t_rpfilter ipt_REJECT nf_reject_ipv4 ip6t_REJECT nf_reject_ipv6 xt_conntrack ip_set nfnetlink ebtable_nat ebtable_broute ip6table_nat ip6table_mangle ip6table_security ip6table_raw iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_mangle iptable_security iptable_raw ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter intel_rapl_msr intel_rapl_common sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm iTCO_wdt iTCO_vendor_support mxm_wmi irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel ses crypto_simd enclosure cryptd mei_me pcspkr glue_helper sg mei i2c_i801 lpc_ich ipmi_si ipmi_devintf ipmi_msghandler wmi acpi_pad ip_tables xfs libcrc32c sd_mod ast drm_vram_helper ttm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm qede qed igb liquidio mpt3sas ahci crc32c_intel libahci dca libata crc8 raid_class ptp scsi_transport_sas i2c_algo_bit
[ 1658.735018]  pps_core dm_mirror dm_region_hash dm_log dm_mod
[ 1658.735020] CPU: 15 PID: 435 Comm: kworker/15:1 Kdump: loaded Tainted: G           OE     5.3.2 #2
[ 1658.735020] Hardware name: Echostreams DOPPLER/S7076GM2NR, BIOS V7.401 05/12/2017
[ 1658.735020] Workqueue: events tx_work_handler [tls]
[ 1658.735021] RIP: 0010:native_queued_spin_lock_slowpath+0x5d/0x1f0
[ 1658.735021] Code: ff ff 75 3e f0 0f ba 2f 08 0f 82 27 01 00 00 31 d2 8b 07 30 e4 09 d0 a9 00 01 ff ff 75 1b 85 c0 74 0e 8b 07 84 c0 74 08 f3 90 <8b> 07 84 c0 75 f8 b8 01 00 00 00 66 89 07 c3 f6 c4 01 75 04 c6 47
[ 1658.735022] RSP: 0018:ffffa02087083e00 EFLAGS: 00000002
[ 1658.735022] RAX: 0000000000000101 RBX: 0000000000000282 RCX: ffff939d9f46a960
[ 1658.735023] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff938e441c4ca8
[ 1658.735023] RBP: ffff939d9df38940 R08: 000073746e657665 R09: 8080808080808080
[ 1658.735023] R10: 0000000000000001 R11: ffff939d97c09438 R12: 00000000ffffffff
[ 1658.735023] R13: ffff939d9f470700 R14: 0000000000000000 R15: ffff938e441c4c00
[ 1658.735024] FS:  0000000000000000(0000) GS:ffff939d9f440000(0000) knlGS:0000000000000000
[ 1658.735024] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1658.735024] CR2: 00007f0b08006000 CR3: 00000010369f2003 CR4: 00000000003606e0
[ 1658.735025] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1658.735025] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 1658.735025] Call Trace:
[ 1658.735025]  queued_spin_lock_slowpath+0x7/0x17
[ 1658.735026]  _raw_spin_lock_irqsave+0x35/0x40
[ 1658.735026]  tls_tx_records+0x3f/0x200 [tls]
[ 1658.735026]  tx_work_handler+0x4b/0x60 [tls]
[ 1658.735026]  process_one_work+0x171/0x380
[ 1658.735027]  worker_thread+0x49/0x3f0
[ 1658.735027]  kthread+0xf8/0x130
[ 1658.735027]  ? max_active_store+0x80/0x80
[ 1658.735027]  ? kthread_bind+0x10/0x10
[ 1658.735027]  ret_from_fork+0x35/0x40
[ 1658.735028] Kernel panic - not syncing: Hard LOCKUP
[ 1658.735028] CPU: 15 PID: 435 Comm: kworker/15:1 Kdump: loaded Tainted: G           OE     5.3.2 #2
[ 1658.735028] Hardware name: Echostreams DOPPLER/S7076GM2NR, BIOS V7.401 05/12/2017
[ 1658.735029] Workqueue: events tx_work_handler [tls]
[ 1658.735029] Call Trace:
[ 1658.735029]  <NMI>
[ 1658.735029]  dump_stack+0x5a/0x73
[ 1658.735030]  panic+0x102/0x2ec
[ 1658.735030]  ? ret_from_fork+0x34/0x40
[ 1658.735030]  nmi_panic+0x37/0x40
[ 1658.735030]  watchdog_overflow_callback+0xef/0x110
[ 1658.735031]  __perf_event_overflow+0x51/0xe0
[ 1658.735031]  handle_pmi_common+0x1b2/0x270
[ 1658.735031]  intel_pmu_handle_irq+0xad/0x170
[ 1658.735032]  perf_event_nmi_handler+0x2e/0x50
[ 1658.735032]  nmi_handle+0x6e/0x110
[ 1658.735032]  default_do_nmi+0x3e/0x110
[ 1658.735033]  do_nmi+0x164/0x1c0
[ 1658.735033]  end_repeat_nmi+0x16/0x50
[ 1658.735033] RIP: 0010:native_queued_spin_lock_slowpath+0x5d/0x1f0
[ 1658.735034] Code: ff ff 75 3e f0 0f ba 2f 08 0f 82 27 01 00 00 31 d2 8b 07 30 e4 09 d0 a9 00 01 ff ff 75 1b 85 c0 74 0e 8b 07 84 c0 74 08 f3 90 <8b> 07 84 c0 75 f8 b8 01 00 00 00 66 89 07 c3 f6 c4 01 75 04 c6 47
[ 1658.735034] RSP: 0018:ffffa02087083e00 EFLAGS: 00000002
[ 1658.735035] RAX: 0000000000000101 RBX: 0000000000000282 RCX: ffff939d9f46a960
[ 1658.735035] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff938e441c4ca8
[ 1658.735036] RBP: ffff939d9df38940 R08: 000073746e657665 R09: 8080808080808080
[ 1658.735036] R10: 0000000000000001 R11: ffff939d97c09438 R12: 00000000ffffffff
[ 1658.735036] R13: ffff939d9f470700 R14: 0000000000000000 R15: ffff938e441c4c00
[ 1658.735037]  ? native_queued_spin_lock_slowpath+0x5d/0x1f0
[ 1658.735037]  ? native_queued_spin_lock_slowpath+0x5d/0x1f0
[ 1658.735037]  </NMI>
[ 1658.735037]  queued_spin_lock_slowpath+0x7/0x17
[ 1658.735037]  _raw_spin_lock_irqsave+0x35/0x40
[ 1658.735038]  tls_tx_records+0x3f/0x200 [tls]
[ 1658.735038]  tx_work_handler+0x4b/0x60 [tls]
[ 1658.735038]  process_one_work+0x171/0x380
[ 1658.735038]  worker_thread+0x49/0x3f0
[ 1658.735039]  kthread+0xf8/0x130
[ 1658.735039]  ? max_active_store+0x80/0x80
[ 1658.735039]  ? kthread_bind+0x10/0x10
[ 1658.735039]  ret_from_fork+0x35/0x40

--------------------------------------------------------------------------------

Approach 2 -- Protect tx_list with a spinlock_t tx_list_lock and move lock/release of socket lock to tls_push_sg :

********************************************************************************

diff --git a/test/linux-5.3.2/include/net/tls.h b/linux-5.3.2/include/net/tls.h
index 41b2d41..8606e55 100644
--- a/test/linux-5.3.2/include/net/tls.h
+++ b/linux-5.3.2/include/net/tls.h
@@ -155,6 +155,7 @@ struct tls_sw_context_tx {
 	struct tx_work tx_work;
 	struct tls_rec *open_rec;
 	struct list_head tx_list;
+	spinlock_t tx_list_lock;	/* protects records list */
 	atomic_t encrypt_pending;
 	int async_notify;
 	int async_capable;
diff --git a/test/linux-5.3.2/net/tls/tls_main.c b/tls/tls_main.c
index 43252a8..c31180c 100644
--- a/test/linux-5.3.2/net/tls/tls_main.c
+++ b/tls/tls_main.c
@@ -106,11 +106,13 @@ int tls_push_sg(struct sock *sk,
 	struct page *p;
 	size_t size;
 	int offset = first_offset;
+	int rc = 0;
 
 	size = sg->length - offset;
 	offset += sg->offset;
 
 	ctx->in_tcp_sendpages = true;
+	lock_sock(sk);
 	while (1) {
 		if (sg_is_last(sg))
 			sendpage_flags = flags;
@@ -132,7 +134,8 @@ retry:
 			ctx->partially_sent_offset = offset;
 			ctx->partially_sent_record = (void *)sg;
 			ctx->in_tcp_sendpages = false;
-			return ret;
+			rc = ret;
+			goto end;
 		}
 
 		put_page(p);
@@ -146,8 +149,9 @@ retry:
 	}
 
 	ctx->in_tcp_sendpages = false;
-
-	return 0;
+end:
+	release_sock(sk);
+	return rc;
 }
 
 static int tls_handle_open_record(struct sock *sk, int flags)
diff --git a/test/linux-5.3.2/net/tls/tls_sw.c b/tls/tls_sw.c
index 91d21b0..32042f3 100644
--- a/test/linux-5.3.2/net/tls/tls_sw.c
+++ b/tls/tls_sw.c
@@ -366,7 +366,9 @@ int tls_tx_records(struct sock *sk, int flags)
 	struct tls_rec *rec, *tmp;
 	struct sk_msg *msg_en;
 	int tx_flags, rc = 0;
+	unsigned long lock_flags;
 
+	spin_lock_irqsave(&ctx->tx_list_lock, lock_flags);	
 	if (tls_is_partially_sent_record(tls_ctx)) {
 		rec = list_first_entry(&ctx->tx_list,
 				       struct tls_rec, list);
@@ -412,6 +414,7 @@ int tls_tx_records(struct sock *sk, int flags)
 	}
 
 tx_err:
+	spin_unlock_irqrestore(&ctx->tx_list_lock, lock_flags);
 	if (rc < 0 && rc != -EAGAIN)
 		tls_err_abort(sk, EBADMSG);
 
@@ -1110,7 +1113,9 @@ static int tls_sw_do_sendpage(struct sock *sk, struct page *page,
 	int record_room;
 	int ret = 0;
 	bool eor;
+	int flag = 0;
 
+	lock_sock(sk);
 	eor = !(flags & (MSG_MORE | MSG_SENDPAGE_NOTLAST));
 	sk_clear_bit(SOCKWQ_ASYNC_NOSPACE, sk);
 
@@ -1195,6 +1200,7 @@ alloc_payload:
 wait_for_sndbuf:
 		set_bit(SOCK_NOSPACE, &sk->sk_socket->flags);
 wait_for_memory:
+		//printk("%s %d\n", __func__, __LINE__);
 		ret = sk_stream_wait_memory(sk, &timeo);
 		if (ret) {
 			tls_trim_both_msgs(sk, msg_pl->sg.size);
@@ -1208,10 +1214,14 @@ wait_for_memory:
 		/* Transmit if any encryptions have completed */
 		if (test_and_clear_bit(BIT_TX_SCHEDULED, &ctx->tx_bitmask)) {
 			cancel_delayed_work(&ctx->tx_work.work);
+			release_sock(sk);
+			flag = 1;
 			tls_tx_records(sk, flags);
 		}
 	}
 sendpage_end:
+	if (flag == 0)
+		release_sock(sk);
 	ret = sk_stream_error(sk, flags, ret);
 	return copied ? copied : ret;
 }
@@ -1225,9 +1235,7 @@ int tls_sw_sendpage(struct sock *sk, struct page *page,
 		      MSG_SENDPAGE_NOTLAST | MSG_SENDPAGE_NOPOLICY))
 		return -ENOTSUPP;
 
-	lock_sock(sk);
 	ret = tls_sw_do_sendpage(sk, page, offset, size, flags);
-	release_sock(sk);
 	return ret;
 }
 
@@ -2172,9 +2180,9 @@ static void tx_work_handler(struct work_struct *work)
 
 	if (!test_and_clear_bit(BIT_TX_SCHEDULED, &ctx->tx_bitmask))
 		return;
-	lock_sock(sk);
+	//lock_sock(sk);
 	tls_tx_records(sk, -1);
-	release_sock(sk);
+	//release_sock(sk);
 }
 
 void tls_sw_write_space(struct sock *sk, struct tls_context *ctx)
@@ -2259,6 +2267,7 @@ int tls_set_sw_offload(struct sock *sk, struct tls_context *ctx, int tx)
 		aead = &sw_ctx_tx->aead_send;
 		INIT_LIST_HEAD(&sw_ctx_tx->tx_list);
 		INIT_DELAYED_WORK(&sw_ctx_tx->tx_work.work, tx_work_handler);
+		spin_lock_init(&sw_ctx_tx->tx_list_lock);
 		sw_ctx_tx->tx_work.sk = sk;
 	} else {
 		crypto_init_wait(&sw_ctx_rx->async_wait);

********************************************************************************

Approach 2 produced the following crash:

--------------------------------------------------------------------------------

[  177.884608] 00000000: 7d 33 88 93 0f 93 b2 42
[  571.779993] NMI watchdog: Watchdog detected hard LOCKUP on cpu 24
[  571.779993] Modules linked in: n5pf(OE) tls(OE) netconsole ip6t_rpfilter ipt_REJECT nf_reject_ipv4 ip6t_REJECT nf_reject_ipv6 xt_conntrack ip_set nfnetlink ebtable_nat ebtable_broute ip6table_nat ip6table_mangle ip6table_security ip6table_raw iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_mangle iptable_security iptable_raw ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter intel_rapl_msr intel_rapl_common sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel iTCO_wdt kvm iTCO_vendor_support mxm_wmi irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel ses enclosure crypto_simd mei_me cryptd glue_helper pcspkr sg mei i2c_i801 lpc_ich ipmi_si ipmi_devintf ipmi_msghandler wmi acpi_pad ip_tables xfs libcrc32c sd_mod ast drm_vram_helper ttm drm_kms_helper qede syscopyarea sysfillrect sysimgblt fb_sys_fops qed drm liquidio igb ahci libahci mpt3sas crc32c_intel libata dca crc8 ptp raid_class scsi_transport_sas i2c_algo_bit
[  571.780006]  pps_core dm_mirror dm_region_hash dm_log dm_mod
[  571.780007] CPU: 24 PID: 13915 Comm: uperf Kdump: loaded Tainted: G           OE     5.3.2 #2
[  571.780007] Hardware name: Echostreams DOPPLER/S7076GM2NR, BIOS V7.401 05/12/2017
[  571.780008] RIP: 0010:native_queued_spin_lock_slowpath+0x5f/0x1f0
[  571.780008] Code: 75 3e f0 0f ba 2f 08 0f 82 27 01 00 00 31 d2 8b 07 30 e4 09 d0 a9 00 01 ff ff 75 1b 85 c0 74 0e 8b 07 84 c0 74 08 f3 90 8b 07 <84> c0 75 f8 b8 01 00 00 00 66 89 07 c3 f6 c4 01 75 04 c6 47 01 00
[  571.780008] RSP: 0018:ffffbba507f6fbf8 EFLAGS: 00000002
[  571.780009] RAX: 0000000000000101 RBX: 0000000000000296 RCX: dead000000000122
[  571.780009] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff96ea43c27428
[  571.780009] RBP: ffff96f199250940 R08: 0000000000000001 R09: ffff96f19258c5c0
[  571.780010] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[  571.780010] R13: 0000000000001000 R14: ffff96f1c3c84000 R15: ffff96ea43c27380
[  571.780010] FS:  00007f3d40d6e700(0000) GS:ffff96f99f680000(0000) knlGS:0000000000000000
[  571.780010] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  571.780011] CR2: 00007f75400058b0 CR3: 0000000859128001 CR4: 00000000003606e0
[  571.780011] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  571.780011] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  571.780011] Call Trace:
[  571.780011]  queued_spin_lock_slowpath+0x7/0x17
[  571.780012]  _raw_spin_lock_irqsave+0x35/0x40
[  571.780012]  tls_tx_records+0x3f/0x200 [tls]
[  571.780012]  tls_sw_sendpage+0x3ea/0x420 [tls]
[  571.780012]  inet_sendpage+0x52/0x80
[  571.780012]  ? direct_splice_actor+0x40/0x40
[  571.780013]  kernel_sendpage+0x1a/0x30
[  571.780013]  sock_sendpage+0x23/0x30
[  571.780013]  pipe_to_sendpage+0x5d/0xa0
[  571.780013]  __splice_from_pipe+0xff/0x170
[  571.780013]  ? generic_file_splice_read+0x19a/0x1b0
[  571.780014]  ? direct_splice_actor+0x40/0x40
[  571.780014]  splice_from_pipe+0x5b/0x90
[  571.780014]  direct_splice_actor+0x35/0x40
[  571.780014]  splice_direct_to_actor+0xb3/0x240
[  571.780015]  ? generic_pipe_buf_nosteal+0x10/0x10
[  571.780015]  do_splice_direct+0x87/0xd0
[  571.780015]  do_sendfile+0x1c6/0x3d0
[  571.780015]  __x64_sys_sendfile64+0x5c/0xb0
[  571.780015]  do_syscall_64+0x5b/0x1d0
[  571.780016]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  571.780016] RIP: 0033:0x7f3d436a1f1a
[  571.780016] Code: 31 c0 5b c3 0f 1f 40 00 48 89 da 4c 89 ce 44 89 c7 5b e9 09 fe ff ff 66 0f 1f 84 00 00 00 00 00 49 89 ca b8 28 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 26 2f 2d 00 f7 d8 64 89 01 48
[  571.780016] RSP: 002b:00007f3d40d6dd08 EFLAGS: 00000206 ORIG_RAX: 0000000000000028
[  571.780017] RAX: ffffffffffffffda RBX: 00000000010f9460 RCX: 00007f3d436a1f1a
[  571.780017] RDX: 00007f3d40d6dd18 RSI: 0000000000000005 RDI: 000000000000000c
[  571.780018] RBP: 000000000000000c R08: 0000000000001000 R09: 00007f3d43975140
[  571.780018] R10: 0000000000001000 R11: 0000000000000206 R12: 0000000000000000
[  571.780018] R13: 0000000000001000 R14: 0000000000000000 R15: 00000000011054d0
[  571.780018] Kernel panic - not syncing: Hard LOCKUP
[  571.780019] CPU: 24 PID: 13915 Comm: uperf Kdump: loaded Tainted: G           OE     5.3.2 #2
[  571.780019] Hardware name: Echostreams DOPPLER/S7076GM2NR, BIOS V7.401 05/12/2017
[  571.780019] Call Trace:
[  571.780019]  <NMI>
[  571.780020]  dump_stack+0x5a/0x73
[  571.780020]  panic+0x102/0x2ec
[  571.780020]  nmi_panic+0x37/0x40
[  571.780020]  watchdog_overflow_callback+0xef/0x110
[  571.780020]  __perf_event_overflow+0x51/0xe0
[  571.780021]  handle_pmi_common+0x1b2/0x270
[  571.780021]  intel_pmu_handle_irq+0xad/0x170
[  571.780021]  perf_event_nmi_handler+0x2e/0x50
[  571.780021]  nmi_handle+0x6e/0x110
[  571.780021]  default_do_nmi+0xc3/0x110
[  571.780021]  do_nmi+0x164/0x1c0
[  571.780022]  end_repeat_nmi+0x16/0x50
[  571.780022] RIP: 0010:native_queued_spin_lock_slowpath+0x5f/0x1f0
[  571.780022] Code: 75 3e f0 0f ba 2f 08 0f 82 27 01 00 00 31 d2 8b 07 30 e4 09 d0 a9 00 01 ff ff 75 1b 85 c0 74 0e 8b 07 84 c0 74 08 f3 90 8b 07 <84> c0 75 f8 b8 01 00 00 00 66 89 07 c3 f6 c4 01 75 04 c6 47 01 00
[  571.780023] RSP: 0018:ffffbba507f6fbf8 EFLAGS: 00000002
[  571.780023] RAX: 0000000000000101 RBX: 0000000000000296 RCX: dead000000000122
[  571.780023] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff96ea43c27428
[  571.780023] RBP: ffff96f199250940 R08: 0000000000000001 R09: ffff96f19258c5c0
[  571.780024] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[  571.780024] R13: 0000000000001000 R14: ffff96f1c3c84000 R15: ffff96ea43c27380
[  571.780024]  ? native_queued_spin_lock_slowpath+0x5f/0x1f0
[  571.780024]  ? native_queued_spin_lock_slowpath+0x5f/0x1f0
[  571.780025]  </NMI>
[  571.780025]  queued_spin_lock_slowpath+0x7/0x17
[  571.780025]  _raw_spin_lock_irqsave+0x35/0x40
[  571.780025]  tls_tx_records+0x3f/0x200 [tls]
[  571.780025]  tls_sw_sendpage+0x3ea/0x420 [tls]
[  571.780026]  inet_sendpage+0x52/0x80
[  571.780026]  ? direct_splice_actor+0x40/0x40
[  571.780026]  kernel_sendpage+0x1a/0x30
[  571.780026]  sock_sendpage+0x23/0x30
[  571.780026]  pipe_to_sendpage+0x5d/0xa0
[  571.780027]  __splice_from_pipe+0xff/0x170
[  571.780027]  ? generic_file_splice_read+0x19a/0x1b0
[  571.780027]  ? direct_splice_actor+0x40/0x40
[  571.780027]  splice_from_pipe+0x5b/0x90
[  571.780027]  direct_splice_actor+0x35/0x40
[  571.780028]  splice_direct_to_actor+0xb3/0x240
[  571.780028]  ? generic_pipe_buf_nosteal+0x10/0x10
[  571.780028]  do_splice_direct+0x87/0xd0
[  571.780028]  do_sendfile+0x1c6/0x3d0
[  571.780028]  __x64_sys_sendfile64+0x5c/0xb0
[  571.780029]  do_syscall_64+0x5b/0x1d0
[  571.780029]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  571.780029] RIP: 0033:0x7f3d436a1f1a
[  571.780029] Code: 31 c0 5b c3 0f 1f 40 00 48 89 da 4c 89 ce 44 89 c7 5b e9 09 fe ff ff 66 0f 1f 84 00 00 00 00 00 49 89 ca b8 28 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 26 2f 2d 00 f7 d8 64 89 01 48
[  571.780030] RSP: 002b:00007f3d40d6dd08 EFLAGS: 00000206 ORIG_RAX: 0000000000000028
[  571.780030] RAX: ffffffffffffffda RBX: 00000000010f9460 RCX: 00007f3d436a1f1a
[  571.780030] RDX: 00007f3d40d6dd18 RSI: 0000000000000005 RDI: 000000000000000c
[  571.780031] RBP: 000000000000000c R08: 0000000000001000 R09: 00007f3d43975140
[  571.780031] R10: 0000000000001000 R11: 0000000000000206 R12: 0000000000000000
[  571.780031] R13: 0000000000001000 R14: 0000000000000000 R15: 00000000011054d0

--------------------------------------------------------------------------------

> My feeling is that
> we need to make writers wait to maintain socket write semantics, and
> that implies putting writers to sleep, which is indeed very costly..
> 
> Perhaps something along the lines of:
> 
> 	if (ctx->in_tcp_sendpages) {
> 		rc = sk_stream_wait_memory(sk, &timeo);
> 		...
> 	}
> 
> in tls_tx_records() would be the "most correct" solution? If we get
> there and there is already a writer, that means the first writer has
> to be waiting for memory, and so should the second..
> 
> WDYT?

Hmmm, I am not sure what the benefit would be of having two threads do the completions, as explained above with the Case 1 and Case 2 scenarios. If we also made the later thread wait while the earlier one waits for memory, just so the later one can perform the completions too, it would either have no completions left to take care of or only a handful (a few records that may have sneaked in between the earlier thread returning and the later one getting in).

The only patch that we have been able to make work consistently, without crashing and without compromising performance, is the previously submitted one where the later thread bails out of tls_tx_records. And as mentioned, it can perhaps be made more efficient by rescheduling the delayed work in the case where the work handler turns out to be the later thread that has to bail.

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH V2 net 1/1] net/tls(TLS_SW): Fix list_del double free caused by a race condition in tls_tx_records
  2019-10-09 16:57                       ` Pooja Trivedi
@ 2019-10-17 23:48                         ` Jakub Kicinski
  2019-10-18  0:25                           ` Jakub Kicinski
  0 siblings, 1 reply; 14+ messages in thread
From: Jakub Kicinski @ 2019-10-17 23:48 UTC (permalink / raw)
  To: Pooja Trivedi
  Cc: netdev, davem, daniel, john.fastabend, davejwatson, aviadye,
	borisp, Pooja Trivedi, Mallesham Jatharakonda

On Wed, 9 Oct 2019 16:57:39 +0000, Pooja Trivedi wrote:
> On Fri, Sep 27, 2019 at 05:37:53PM -0700, Jakub Kicinski wrote:
> > On Tue, 24 Sep 2019 12:48:26 -0400, Pooja Trivedi wrote:  
> > > On Mon, Sep 23, 2019 at 8:28 PM Jakub Kicinski wrote:  
> > > > On Sat, 21 Sep 2019 23:19:20 -0400, Pooja Trivedi wrote:    
> > > > > On Wed, Sep 18, 2019 at 5:45 PM Jakub Kicinski wrote:    
> > > > > > On Wed, 18 Sep 2019 17:37:44 -0400, Pooja Trivedi wrote:    
> > > > > > > Hi Jakub,
> > > > > > >
> > > > > > > I have explained one potential way for the race to happen in my
> > > > > > > original message to the netdev mailing list here:
> > > > > > > https://marc.info/?l=linux-netdev&m=156805120229554&w=2
> > > > > > >
> > > > > > > Here is the part out of there that's relevant to your question:
> > > > > > >
> > > > > > > -----------------------------------------
> > > > > > >
> > > > > > > One potential way for race condition to appear:
> > > > > > >
> > > > > > > When under tcp memory pressure, Thread 1 takes the following code path:
> > > > > > > do_sendfile ---> ... ---> .... ---> tls_sw_sendpage --->
> > > > > > > tls_sw_do_sendpage ---> tls_tx_records ---> tls_push_sg --->
> > > > > > > do_tcp_sendpages ---> sk_stream_wait_memory ---> sk_wait_event    
> > > > > >
> > > > > > Ugh, so do_tcp_sendpages() can also release the lock :/
> > > > > >
> > > > > > Since the problem occurs in tls_sw_do_sendpage() and
> > > > > > tls_sw_do_sendmsg() as well, should we perhaps fix it at that level?    
> > > > >
> > > > > That won't do because tls_tx_records also gets called when completion
> > > > > callbacks schedule delayed work. That was the code path that caused
> > > > > the crash for my test. Cavium's nitrox crypto offload driver calling
> > > > > tls_encrypt_done, which calls schedule_delayed_work. Delayed work that
> > > > > was scheduled would then be processed by tx_work_handler.
> > > > > Notice in my previous reply,
> > > > > "Thread 2 code path:
> > > > > tx_work_handler ---> tls_tx_records"
> > > > >
> > > > > "Thread 2 code path:
> > > > > tx_work_handler ---> tls_tx_records"    
> > > >
> > > > Right, the work handler would obviously also have to obey the exclusion
> > > > mechanism of choice.
> > > >
> > > > Having said that this really does feel like we are trying to lock code,
> > > > not data here :(    
> > > 
> > > Agree with you and exactly the thought process I went through. So what
> > > are some other options?
> > > 
> > > 1) A lock member inside of ctx to protect tx_list
> > > We are load testing ktls offload with nitrox and the performance was
> > > quite adversely affected by this. This approach can be explored more,
> > > but the original design of using socket lock didn't follow this model
> > > either.
> > > 2) Allow tagging of individual record inside of tx_list to indicate if
> > > it has been 'processed'
> > > This approach would likely protect the data without compromising
> > > performance. It will allow Thread 2 to proceed with the TX portion of
> > > tls_tx_records while Thread 1 sleeps waiting for memory. There will
> > > need to be careful cleanup and backtracking after the thread wakes up
> > > to ensure a consistent state of tx_list and record transmission.
> > > The approach has several problems, however -- (a) It could cause
> > > out-of-order record tx (b) If Thread 1 is waiting for memory, Thread 2
> > > most likely will (c) Again, socket lock wasn't designed to follow this
> > > model to begin with
> > > 
> > > 
> > > Given that socket lock essentially was working as a code protector --
> > > as an exclusion mechanism to allow only a single writer through
> > > tls_tx_records at a time -- what other clean ways do we have to fix
> > > the race without a significant refactor of the design and code?  
> > 
> > Very sorry about the delay. I don't think we can maintain the correct
> > semantics without sleeping :( If we just bail in tls_tx_records() when
> > there's already another writer the later writer will return from the
> > system call, even though the data is not pushed into the TCP layer.
> 
> Thanks for your response and sorry about the delay!
> 
> I am trying the following scenarios in my head to see how valid your
> concern is. Play along with me please.
> 
> The two main writers in picture here are
> Thread 1 -- Enqueue thread (sendfile system call) -- pushes records to
> card, also performs completions (push to tcp) if records are ready
> Thread 2 -- Work handler (tx_work_handler) -- bottom-half completions routine
> 
> With the submitted patch,
> Case 1 (your concern) : Thread 2 grabs socket lock, calls
> tls_tx_records, runs into memory pressure, releases socket lock, waits
> for memory. Now Thread 1 grabs socket lock, calls tls_tx_records, bails.
> In this case, sendfile system call will bail out without performing
> completions. Is that really a problem? When Thread 1 ultimately
> proceeds, it will perform the completions anyway.

I think there is value in preserving normal socket semantics as much
as possible. If a writer pushes more data to a TCP connection than the
send buffer can hold, the call should block.

> Case 2: Threads grab socket lock in a reverse sequence of Case 1. So
> Thread 1 grabs socket lock first and ends up waiting for memory.
> Thread 2 comes in later and bails from tls_tx_records.
> In this case, in the submitted patch, I realized that we are not rescheduling 
> the work before bailing. So I think an amendment to that patch, 
> something along the lines of what's shown below, would fare better.
> 

> ********************************************************************************
> 
> > What was reason for the performance impact on (1)?   
> 
> 
> While the performance impact still needs to be investigated, that effort has stopped short due 
> to other issues with that approach like hard lockups. The basic problem is that the socket lock 
> is touched at multiple layers (tls, tcp etc.).
> 
> Here are the two approaches we tried along the lines of using an additional lock...
> 
> Approach 1 -- Protect tx_list with a spinlock_t tx_list_lock :

> Approach 2 -- Protect tx_list with a spinlock_t tx_list_lock and move lock/release of socket lock to tls_push_sg :

You can't sleep with spin locks, we'd need a mutex...
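
A rough, untested sketch of that mutex variant, for discussion only --
the tx_lock member and the __tls_tx_records_locked() helper below are
assumptions for illustration, not existing code:

/* Sketch only: serialize writers with a sleepable lock so the tx_list
 * walk may block inside do_tcp_sendpages().  Assumes a new member in
 * struct tls_sw_context_tx:
 *
 *      struct mutex tx_lock;   // mutex_init() next to INIT_LIST_HEAD(&tx_list)
 *
 * and that today's body of tls_tx_records() moves into a hypothetical
 * __tls_tx_records_locked() helper.
 */
int tls_tx_records(struct sock *sk, int flags)
{
        struct tls_sw_context_tx *ctx = tls_sw_ctx_tx(tls_get_ctx(sk));
        int rc;

        mutex_lock(&ctx->tx_lock);      /* may sleep; all callers run in process context */
        rc = __tls_tx_records_locked(sk, flags);
        mutex_unlock(&ctx->tx_lock);

        return rc;
}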

> > My feeling is that
> > we need to make writers wait to maintain socket write semantics, and
> > that implies putting writers to sleep, which is indeed very costly..
> > 
> > Perhaps something along the lines of:
> > 
> > 	if (ctx->in_tcp_sendpages) {
> > 		rc = sk_stream_wait_memory(sk, &timeo);
> > 		...
> > 	}
> > 
> > in tls_tx_records() would be the "most correct" solution? If we get
> > there and there is already a writer, that means the first writer has
> > to be waiting for memory, and so should the second..
> > 
> > WDYT?  
> 
> Hmmm, I am not sure what the benefit would be of having two threads
> do the completions, as explained above with Case 1 and Case 2
> scenarios. If we made the later thread wait also while earlier one
> waits for memory, just so the later one can also perform the
> completions, it will either have no completions left to take care of
> or have minimal (a few records that may have sneaked in while the
> earlier thread returned and second made it in).

Semantics matter.

> The only patch that we have been able to make consistently work
> without crashing and also without compromising performance, is the
> previously submitted one where later thread bails out of
> tls_tx_records. And as mentioned, it can perhaps be made more
> efficient by rescheduling delayed work in the case where work handler
> thread turns out to be the later thread that has to bail.

Let me try to find a way to repro this reliably without any funky
accelerators. The sleep in do_tcp_sendpages() should affect all cases.
I should have some time today and tomorrow to look into this, bear with
me..

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH V2 net 1/1] net/tls(TLS_SW): Fix list_del double free caused by a race condition in tls_tx_records
  2019-10-17 23:48                         ` Jakub Kicinski
@ 2019-10-18  0:25                           ` Jakub Kicinski
  0 siblings, 0 replies; 14+ messages in thread
From: Jakub Kicinski @ 2019-10-18  0:25 UTC (permalink / raw)
  To: Pooja Trivedi
  Cc: netdev, davem, daniel, john.fastabend, davejwatson, aviadye,
	borisp, Pooja Trivedi, Mallesham Jatharakonda

On Thu, 17 Oct 2019 16:48:25 -0700, Jakub Kicinski wrote:
> > The only patch that we have been able to make consistently work
> > without crashing and also without compromising performance, is the
> > previously submitted one where later thread bails out of
> > tls_tx_records. And as mentioned, it can perhaps be made more
> > efficient by rescheduling delayed work in the case where work handler
> > thread turns out to be the later thread that has to bail.  
> 
> Let me try to find a way to repro this reliably without any funky
> accelerators. The sleep in do_tcp_sendpages() should affect all cases.
> I should have some time today and tomorrow to look into this, bear with
> me..

Could you please try this?

---->8-----

diff --git a/net/tls/tls_sw.c b/net/tls/tls_sw.c
index c2b5e0d2ba1a..ab7b0af162a7 100644
--- a/net/tls/tls_sw.c
+++ b/net/tls/tls_sw.c
@@ -1204,12 +1204,10 @@ static int tls_sw_do_sendpage(struct sock *sk, struct page *page,
 		goto alloc_payload;
 	}
 
-	if (num_async) {
-		/* Transmit if any encryptions have completed */
-		if (test_and_clear_bit(BIT_TX_SCHEDULED, &ctx->tx_bitmask)) {
-			cancel_delayed_work(&ctx->tx_work.work);
-			tls_tx_records(sk, flags);
-		}
+	/* Transmit if any encryptions have completed */
+	if (test_and_clear_bit(BIT_TX_SCHEDULED, &ctx->tx_bitmask)) {
+		cancel_delayed_work(&ctx->tx_work.work);
+		tls_tx_records(sk, flags);
 	}
 sendpage_end:
 	ret = sk_stream_error(sk, flags, ret);
@@ -2171,7 +2169,8 @@ static void tx_work_handler(struct work_struct *work)
 	if (!test_and_clear_bit(BIT_TX_SCHEDULED, &ctx->tx_bitmask))
 		return;
 	lock_sock(sk);
-	tls_tx_records(sk, -1);
+	if (!sk->sk_write_pending)
+		tls_tx_records(sk, -1);
 	release_sock(sk);
 }
 

^ permalink raw reply related	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2019-10-18  0:26 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <CAOrEdsmpHT-=zH9zyHv=pLX2ENb1S0AnkrcWVgMxqWrxKsF3yw@mail.gmail.com>
     [not found] ` <CAOrEdsmxstWoBz2AotrTx_OBFN_jycqnSqtsvLxuCLGtKKi_dA@mail.gmail.com>
2019-09-09 17:46   ` [PATCH net 0/1] net/tls(TLS_SW): double free in tls_tx_records Pooja Trivedi
2019-09-09 18:15     ` [PATCH net 1/1] net/tls(TLS_SW): Fix list_del double free caused by a race condition " Pooja Trivedi
2019-09-17 21:13       ` [PATCH V2 " Pooja Trivedi
2019-09-17 21:24         ` Pooja Trivedi
2019-09-18 21:25         ` Jakub Kicinski
2019-09-18 21:37           ` Pooja Trivedi
2019-09-18 21:45             ` Jakub Kicinski
2019-09-22  3:19               ` Pooja Trivedi
2019-09-24  0:28                 ` Jakub Kicinski
2019-09-24 16:48                   ` Pooja Trivedi
2019-09-28  0:37                     ` Jakub Kicinski
2019-10-09 16:57                       ` Pooja Trivedi
2019-10-17 23:48                         ` Jakub Kicinski
2019-10-18  0:25                           ` Jakub Kicinski
