* livelock with KASAN_SW_TAGS @ 2019-02-14 3:32 Qian Cai 2019-02-14 16:35 ` Will Deacon 0 siblings, 1 reply; 10+ messages in thread From: Qian Cai @ 2019-02-14 3:32 UTC (permalink / raw) To: Andrey Konovalov Cc: aryabinin, Will Deacon, Linux ARM, kasan-dev, Catalin Marinas Running LTP msgstress03 [1] triggers endless soft lockups below after a few minutes on a ThunderX2 server. It works fine with KASAN_GENERIC and finish the test in roughly 11 minutes. [1] https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/syscalls/ipc/msgstress/msgstress03.c # ./scripts/faddr2line vmlinux _raw_spin_unlock_irqrestore+0x44/0xac _raw_spin_unlock_irqrestore+0x44/0xac: arch_local_irq_restore at arch/arm64/include/asm/irqflags.h:88 (inlined by) __raw_spin_unlock_irqrestore at include/linux/spinlock_api_smp.h:160 (inlined by) _raw_spin_unlock_irqrestore at kernel/locking/spinlock.c:184 This code: asm volatile( "msr daif, %0 : : "r" (flags) : "memory"); [ 545.622261] watchdog: BUG: soft lockup - CPU#12 stuck for 23s! [kworker/12:1:1439] [ 545.629847] Modules linked in: thunderx2_pmu ip_tables xfs libcrc32c sd_mod ahci libahci mlx5_core libata dm_mirror dm_region_hash dm_log dm_mod efivarfs [ 545.643692] CPU: 12 PID: 1439 Comm: kworker/12:1 Tainted: G W 5.0.0-rc6+ #44 [ 545.652082] Hardware name: HPE Apollo 70 /C01_APACHE_MB , BIOS L50_5.13_1.0.6 07/10/2018 [ 545.661947] Workqueue: events free_obj_work [ 545.666173] pstate: 60400009 (nZCv daif +PAN -UAO) [ 545.670981] pc : _raw_spin_unlock_irqrestore+0x44/0xac [ 545.676166] lr : _raw_spin_unlock_irqrestore+0x38/0xac [ 545.681389] sp : 30ff80098de3fa70 [ 545.684724] x29: 30ff80098de3fa90 x28: 00000000000044c0 [ 545.690048] x27: 0000000000000001 x26: 8aff808ba9f844d0 [ 545.695380] x25: 0000000000000001 x24: 0000000000000001 [ 545.700705] x23: 0000000000000000 x22: 62ff808b7d00df00 [ 545.706038] x21: 62ff808b7d00df00 x20: 0000000000000000 [ 545.711366] x19: ffff1000105d6ac0 x18: ffff1000148a5538 [ 545.716720] x17: 000000000000001b x16: 00000000000000ff [ 545.722040] x15: ffff1000148a5000 x14: 000000000000004c [ 545.727358] x13: 0000000000000001 x12: efff100000000000 [ 545.732678] x11: 0000000000000002 x10: 00000000000000ff [ 545.738034] x9 : 00000000ffffffff x8 : 4cff800832347840 [ 545.743354] x7 : bbbbbbbbbbbbbbbb x6 : 0000000000000008 [ 545.748711] x5 : 000000000000005a x4 : 0000000000000000 [ 545.754041] x3 : 30ff80098de3fa04 x2 : 0000000000000003 [ 545.759422] x1 : 62ff808b7d00df18 x0 : 4cff800832347850 [ 545.764761] Call trace: [ 545.767225] _raw_spin_unlock_irqrestore+0x44/0xac [ 545.772027] free_debug_processing+0x308/0x410 [ 545.776489] kmem_cache_free+0x604/0x9f0 [ 545.780442] free_obj_work+0x92c/0xa44 [ 545.784241] process_one_work+0x894/0x1280 [ 545.788349] worker_thread+0x684/0xa1c [ 545.792122] kthread+0x2cc/0x2e8 [ 545.795367] ret_from_fork+0x10/0x18 _______________________________________________ linux-arm-kernel mailing list linux-arm-kernel@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-arm-kernel ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: livelock with KASAN_SW_TAGS 2019-02-14 3:32 livelock with KASAN_SW_TAGS Qian Cai @ 2019-02-14 16:35 ` Will Deacon 2019-02-14 16:50 ` Qian Cai 0 siblings, 1 reply; 10+ messages in thread From: Will Deacon @ 2019-02-14 16:35 UTC (permalink / raw) To: Qian Cai Cc: Andrey Konovalov, Catalin Marinas, Linux ARM, kasan-dev, aryabinin Hi Qian Cai, On Wed, Feb 13, 2019 at 10:32:11PM -0500, Qian Cai wrote: > Running LTP msgstress03 [1] triggers endless soft lockups below after a few > minutes on a ThunderX2 server. It works fine with KASAN_GENERIC and finish the > test in roughly 11 minutes. I've not been able to reproduce this failure under KVM, however the test only takes around 18s to complete on the host and the guest, so it feels like something is amiss here. Please could you share more information about how you're triggering this problem? For example: - Kernel version and .config - Clang version - LTP version and command-line - Machine configuration (CPUs, memory, etc) > # ./scripts/faddr2line vmlinux _raw_spin_unlock_irqrestore+0x44/0xac > _raw_spin_unlock_irqrestore+0x44/0xac: > arch_local_irq_restore at arch/arm64/include/asm/irqflags.h:88 > (inlined by) __raw_spin_unlock_irqrestore at include/linux/spinlock_api_smp.h:160 > (inlined by) _raw_spin_unlock_irqrestore at kernel/locking/spinlock.c:184 > > This code: > > asm volatile( > "msr daif, %0 > : > : "r" (flags) > : "memory"); This is just the instruction when we unmask interrupts, so that's when the soft watchdog fires. I suspect the real issue is that we're spending lots of time in: > [ 545.772027] free_debug_processing+0x308/0x410 with irqs disabled. Does the machine eventually come back to life? Will _______________________________________________ linux-arm-kernel mailing list linux-arm-kernel@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-arm-kernel ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: livelock with KASAN_SW_TAGS 2019-02-14 16:35 ` Will Deacon @ 2019-02-14 16:50 ` Qian Cai 2019-02-14 18:01 ` Will Deacon 0 siblings, 1 reply; 10+ messages in thread From: Qian Cai @ 2019-02-14 16:50 UTC (permalink / raw) To: Will Deacon Cc: Andrey Konovalov, Catalin Marinas, Linux ARM, kasan-dev, aryabinin On 2/14/19 11:35 AM, Will Deacon wrote: > Hi Qian Cai, > > On Wed, Feb 13, 2019 at 10:32:11PM -0500, Qian Cai wrote: >> Running LTP msgstress03 [1] triggers endless soft lockups below after a few >> minutes on a ThunderX2 server. It works fine with KASAN_GENERIC and finish the >> test in roughly 11 minutes. > > I've not been able to reproduce this failure under KVM, however the test > only takes around 18s to complete on the host and the guest, so it feels > like something is amiss here. Please could you share more information about > how you're triggering this problem? For example: > > - Kernel version and .config Latest mainline at 1f947a7a01 ("Merge branch 'akpm' (patches from Andrew)") plus a few KASAN_SW_TAGS patches in order to boot. https://marc.info/?l=linux-mm&m=155006632110129&w=2 (all 5) https://marc.info/?l=linux-mm&m=154968731424637&w=2 https://marc.info/?l=linux-mm&m=155010395725051&w=2 https://git.sr.ht/~cai/linux-debug/tree/master/config > - Clang version clang-7.0.1 > - LTP version and command-line latest git at db57ddc1497e /opt/ltp/testcases/bin/msgstress03 > - Machine configuration (CPUs, memory, etc) https://git.sr.ht/~cai/linux-debug/tree/master/dmesg # lscpu Architecture: aarch64 Byte Order: Little Endian CPU(s): 256 On-line CPU(s) list: 0-255 Thread(s) per core: 4 Core(s) per socket: 32 Socket(s): 2 NUMA node(s): 2 Vendor ID: Cavium Model: 1 Model name: ThunderX2 99xx Stepping: 0x1 BogoMIPS: 400.00 L1d cache: 32K L1i cache: 32K L2 cache: 256K L3 cache: 32768K NUMA node0 CPU(s): 0-127 NUMA node1 CPU(s): 128-255 Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics cpuid asimdrdm # free -m total used free shared buff/cache available Mem: 84871 8347 75942 55 581 69699 Swap: 4095 0 4095 > >> # ./scripts/faddr2line vmlinux _raw_spin_unlock_irqrestore+0x44/0xac >> _raw_spin_unlock_irqrestore+0x44/0xac: >> arch_local_irq_restore at arch/arm64/include/asm/irqflags.h:88 >> (inlined by) __raw_spin_unlock_irqrestore at include/linux/spinlock_api_smp.h:160 >> (inlined by) _raw_spin_unlock_irqrestore at kernel/locking/spinlock.c:184 >> >> This code: >> >> asm volatile( >> "msr daif, %0 >> : >> : "r" (flags) >> : "memory"); > > This is just the instruction when we unmask interrupts, so that's when the > soft watchdog fires. I suspect the real issue is that we're spending lots > of time in: > >> [ 545.772027] free_debug_processing+0x308/0x410 > > with irqs disabled. Does the machine eventually come back to life? > Never. _______________________________________________ linux-arm-kernel mailing list linux-arm-kernel@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-arm-kernel ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: livelock with KASAN_SW_TAGS 2019-02-14 16:50 ` Qian Cai @ 2019-02-14 18:01 ` Will Deacon 2019-02-15 4:04 ` Qian Cai 0 siblings, 1 reply; 10+ messages in thread From: Will Deacon @ 2019-02-14 18:01 UTC (permalink / raw) To: Qian Cai Cc: Andrey Konovalov, Catalin Marinas, Linux ARM, kasan-dev, aryabinin On Thu, Feb 14, 2019 at 11:50:59AM -0500, Qian Cai wrote: > On 2/14/19 11:35 AM, Will Deacon wrote: > > On Wed, Feb 13, 2019 at 10:32:11PM -0500, Qian Cai wrote: > >> Running LTP msgstress03 [1] triggers endless soft lockups below after a few > >> minutes on a ThunderX2 server. It works fine with KASAN_GENERIC and finish the > >> test in roughly 11 minutes. > > > > I've not been able to reproduce this failure under KVM, however the test > > only takes around 18s to complete on the host and the guest, so it feels > > like something is amiss here. Please could you share more information about > > how you're triggering this problem? For example: > > > > - Kernel version and .config > > Latest mainline at 1f947a7a01 ("Merge branch 'akpm' (patches from Andrew)") plus > a few KASAN_SW_TAGS patches in order to boot. > > https://marc.info/?l=linux-mm&m=155006632110129&w=2 (all 5) > https://marc.info/?l=linux-mm&m=154968731424637&w=2 > https://marc.info/?l=linux-mm&m=155010395725051&w=2 > > https://git.sr.ht/~cai/linux-debug/tree/master/config I struggled to get this config to boot under KVM :( > > - Clang version > > clang-7.0.1 It would be helpful to know if the issue persists with the latest nightly build of clang. Anyway, please could you annotate the goto loop in free_debug_processing() so that the object and tail pointer are printed each time around? It would be useful to know if we're failing to exit that. Thanks, Will _______________________________________________ linux-arm-kernel mailing list linux-arm-kernel@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-arm-kernel ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: livelock with KASAN_SW_TAGS 2019-02-14 18:01 ` Will Deacon @ 2019-02-15 4:04 ` Qian Cai 2019-02-15 14:23 ` Will Deacon 0 siblings, 1 reply; 10+ messages in thread From: Qian Cai @ 2019-02-15 4:04 UTC (permalink / raw) To: Will Deacon Cc: Andrey Konovalov, Catalin Marinas, Linux ARM, kasan-dev, aryabinin On 2/14/19 1:01 PM, Will Deacon wrote: > On Thu, Feb 14, 2019 at 11:50:59AM -0500, Qian Cai wrote: >> On 2/14/19 11:35 AM, Will Deacon wrote: >>> On Wed, Feb 13, 2019 at 10:32:11PM -0500, Qian Cai wrote: >>>> Running LTP msgstress03 [1] triggers endless soft lockups below after a few >>>> minutes on a ThunderX2 server. It works fine with KASAN_GENERIC and finish the >>>> test in roughly 11 minutes. >>> >>> I've not been able to reproduce this failure under KVM, however the test >>> only takes around 18s to complete on the host and the guest, so it feels >>> like something is amiss here. Please could you share more information about >>> how you're triggering this problem? For example: >>> >>> - Kernel version and .config >> >> Latest mainline at 1f947a7a01 ("Merge branch 'akpm' (patches from Andrew)") plus >> a few KASAN_SW_TAGS patches in order to boot. >> >> https://marc.info/?l=linux-mm&m=155006632110129&w=2 (all 5) >> https://marc.info/?l=linux-mm&m=154968731424637&w=2 >> https://marc.info/?l=linux-mm&m=155010395725051&w=2 >> >> https://git.sr.ht/~cai/linux-debug/tree/master/config > > I struggled to get this config to boot under KVM :( > >>> - Clang version >> >> clang-7.0.1 > > It would be helpful to know if the issue persists with the latest nightly > build of clang. > > Anyway, please could you annotate the goto loop in free_debug_processing() > so that the object and tail pointer are printed each time around? It would > be useful to know if we're failing to exit that. > Well, I am not sure I understand what your debugging strategy. Maybe, you can send along a debug patch you have in mind for me to run. From the trace, it definitely exit the "goto next_object" and reached further in this line in free_debug_processing(), spin_unlock_irqrestore(&n->list_lock, flags); Once the machine is restricted to 16 CPUs (nr_cpus=16), although it still trigger soft lockups and msgstress03 would seem running forever, the machine is still responsible and is able to login via ssh. Hence, it is possible to capture a task dump (echo t >/proc/sysrq-trigger) while this is happening. https://git.sr.ht/~cai/linux-debug/tree/master/console Some traces looks strange that looks like running free_debug_processing() in a loop, [ 1986.002139] Call trace: [ 1986.002145] _raw_spin_unlock_irqrestore+0x44/0xac [ 1986.002152] free_debug_processing+0x2f4/0x3e4 [ 1986.002157] kmem_cache_free+0x44c/0x870 [ 1986.002163] free_object_rcu+0x200/0x228 [ 1986.002169] rcu_process_callbacks+0xb00/0x12c0 [ 1986.002175] __do_softirq+0x644/0xfd0 [ 1986.002181] irq_exit+0x29c/0x370 [ 1986.002187] __handle_domain_irq+0xe0/0x1c4 [ 1986.002192] gic_handle_irq+0x1c4/0x3b0 [ 1986.002197] el1_irq+0xb0/0x140 [ 1986.002203] lock_release+0x660/0x7dc [ 1986.002209] rcu_lock_release+0x20/0x28 [ 1986.002214] do_msgrcv+0x708/0xed0 [ 1986.002219] ksys_msgrcv+0x4c/0x60 [ 1986.002224] __arm64_sys_msgrcv+0xb8/0x194 [ 1986.002230] el0_svc_handler+0x230/0x3bc [ 1986.002236] el0_svc+0x8/0xc [ 1986.007106] OUTLINED_FUNCTION_169+0x4/0xc [ 1986.011885] free_debug_processing+0x2f4/0x3e4 [ 1986.017186] load_msg+0x4c/0x324 [ 1986.021617] kmem_cache_free+0x44c/0x870 [ 1986.026917] ksys_msgsnd+0x1e0/0xe5c [ 1988.050035] _raw_spin_unlock_irqrestore+0x44/0xac [ 1988.054821] free_debug_processing+0x2f4/0x3e4 [ 1988.059260] kfree+0x3f8/0x7ac [ 1988.062313] free_msg+0x50/0xb0 [ 1988.065450] do_msgrcv+0xd80/0xed0 [ 1988.068846] ksys_msgrcv+0x4c/0x60 [ 1988.072243] __arm64_sys_msgrcv+0xb8/0x194 [ 1988.076336] el0_svc_handler+0x230/0x3bc [ 1988.080255] el0_svc+0x Occasionally, msgstress03 would trigger a panic. [ 997.982080] Internal error: synchronous external abort: 96000610 [#1] SMP [ 997.988886] Modules linked in: thunderx2_pmu efivarfs ip_tables xfs libcrc32c sd_mod ahci libahci mlx5_core libata dm_mirror dm_region_hash dm_log dm_mod [ 998.002665] CPU: 229 PID: 1323 Comm: kworker/229:1 Kdump: loaded Tainted: G W 5.0.0-rc6+ #57 [ 998.012402] Hardware name: HPE Apollo 70 /C01_APACHE_MB , BIOS L50_5.13_1.0.6 07/10/2018 [ 998.022243] Workqueue: events free_obj_work [ 998.026434] pstate: 00400009 (nzcv daif +PAN -UAO) [ 998.031229] pc : kmem_cache_free+0x410/0x870 [ 998.035500] lr : free_obj_work+0x92c/0xa44 [ 998.039592] sp : 29ff808baebefb50 [ 998.042908] x29: 29ff808baebefca0 x28: 73ff808a939eb6e0 [ 998.048223] x27: 33ff800820013488 x26: 73ff808a939eb710 [ 998.053537] x25: 0000000000a3008e x24: ffff100014b73000 [ 998.058850] x23: 00000000000000ff x22: dead000000000100 [ 998.064164] x21: efff100000000000 x20: efff100000000000 [ 998.069480] x19: ffff100014b73d70 x18: ffff1000148a5538 [ 998.074795] x17: 000000000000001b x16: 0000000000a3008d [ 998.080108] x15: b1ff808a939ee2a0 x14: 0000000000000000 [ 998.085421] x13: 9dff808ba9713850 x12: 0000000000000000 [ 998.090737] x11: 007ffffffc000201 x10: ffff1000138d4000 [ 998.096055] x9 : 0000000000000000 x8 : 9dff808ba9713840 [ 998.101371] x7 : bbbbbbbbbbbbbbbb x6 : 0000000000000008 [ 998.106686] x5 : 000000000000005a x4 : 0000000000000000 [ 998.111999] x3 : 29ff808baebef9f4 x2 : 0000000000000003 [ 998.117312] x1 : cbff80082000df18 x0 : ffff808a939eb710 [ 998.122632] Process kworker/229:1 (pid: 1323, stack limit = 0x00000000d1a3376a) [ 998.129937] Call trace: [ 998.132388] kmem_cache_free+0x410/0x870 [ 998.136320] process_one_work+0x894/0x1280 [ 998.140418] worker_thread+0x684/0xa1c [ 998.144174] kthread+0x2cc/0x2e8 [ 998.147409] ret_from_fork+0x10/0x18 [ 998.150989] Code: a94b7bfd a94a4ff4 a94957f6 a9485ff8 (a94767fa) _______________________________________________ linux-arm-kernel mailing list linux-arm-kernel@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-arm-kernel ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: livelock with KASAN_SW_TAGS 2019-02-15 4:04 ` Qian Cai @ 2019-02-15 14:23 ` Will Deacon 2019-02-15 14:26 ` Will Deacon 2019-02-19 17:56 ` Andrey Konovalov 0 siblings, 2 replies; 10+ messages in thread From: Will Deacon @ 2019-02-15 14:23 UTC (permalink / raw) To: Qian Cai Cc: Andrey Konovalov, Catalin Marinas, Linux ARM, kasan-dev, aryabinin On Thu, Feb 14, 2019 at 11:04:25PM -0500, Qian Cai wrote: > On 2/14/19 1:01 PM, Will Deacon wrote: > > On Thu, Feb 14, 2019 at 11:50:59AM -0500, Qian Cai wrote: > >> On 2/14/19 11:35 AM, Will Deacon wrote: > >>> On Wed, Feb 13, 2019 at 10:32:11PM -0500, Qian Cai wrote: > >>>> Running LTP msgstress03 [1] triggers endless soft lockups below after a few > >>>> minutes on a ThunderX2 server. It works fine with KASAN_GENERIC and finish the > >>>> test in roughly 11 minutes. > >>> > >>> I've not been able to reproduce this failure under KVM, however the test > >>> only takes around 18s to complete on the host and the guest, so it feels > >>> like something is amiss here. Please could you share more information about > >>> how you're triggering this problem? For example: > >>> > >>> - Kernel version and .config > >> > >> Latest mainline at 1f947a7a01 ("Merge branch 'akpm' (patches from Andrew)") plus > >> a few KASAN_SW_TAGS patches in order to boot. > >> > >> https://marc.info/?l=linux-mm&m=155006632110129&w=2 (all 5) > >> https://marc.info/?l=linux-mm&m=154968731424637&w=2 > >> https://marc.info/?l=linux-mm&m=155010395725051&w=2 > >> > >> https://git.sr.ht/~cai/linux-debug/tree/master/config > > > > I struggled to get this config to boot under KVM :( > > > >>> - Clang version > >> > >> clang-7.0.1 > > > > It would be helpful to know if the issue persists with the latest nightly > > build of clang. > > > > Anyway, please could you annotate the goto loop in free_debug_processing() > > so that the object and tail pointer are printed each time around? It would > > be useful to know if we're failing to exit that. > > > > Well, I am not sure I understand what your debugging strategy. Maybe, you can > send along a debug patch you have in mind for me to run. > > From the trace, it definitely exit the "goto next_object" and reached further in > this line in free_debug_processing(), > > spin_unlock_irqrestore(&n->list_lock, flags); > > Once the machine is restricted to 16 CPUs (nr_cpus=16), although it still > trigger soft lockups and msgstress03 would seem running forever, the machine is > still responsible and is able to login via ssh. Hence, it is possible to capture > a task dump (echo t >/proc/sysrq-trigger) while this is happening. > > https://git.sr.ht/~cai/linux-debug/tree/master/console > > Some traces looks strange that looks like running free_debug_processing() in a loop, > > [ 1986.002139] Call trace: > [ 1986.002145] _raw_spin_unlock_irqrestore+0x44/0xac > [ 1986.002152] free_debug_processing+0x2f4/0x3e4 > [ 1986.002157] kmem_cache_free+0x44c/0x870 > [ 1986.002163] free_object_rcu+0x200/0x228 > [ 1986.002169] rcu_process_callbacks+0xb00/0x12c0 > [ 1986.002175] __do_softirq+0x644/0xfd0 > [ 1986.002181] irq_exit+0x29c/0x370 > [ 1986.002187] __handle_domain_irq+0xe0/0x1c4 > [ 1986.002192] gic_handle_irq+0x1c4/0x3b0 > [ 1986.002197] el1_irq+0xb0/0x140 > [ 1986.002203] lock_release+0x660/0x7dc > [ 1986.002209] rcu_lock_release+0x20/0x28 > [ 1986.002214] do_msgrcv+0x708/0xed0 > [ 1986.002219] ksys_msgrcv+0x4c/0x60 > [ 1986.002224] __arm64_sys_msgrcv+0xb8/0x194 > [ 1986.002230] el0_svc_handler+0x230/0x3bc > [ 1986.002236] el0_svc+0x8/0xc > [ 1986.007106] OUTLINED_FUNCTION_169+0x4/0xc > [ 1986.011885] free_debug_processing+0x2f4/0x3e4 > [ 1986.017186] load_msg+0x4c/0x324 > [ 1986.021617] kmem_cache_free+0x44c/0x870 > [ 1986.026917] ksys_msgsnd+0x1e0/0xe5c > [ 1988.050035] _raw_spin_unlock_irqrestore+0x44/0xac > [ 1988.054821] free_debug_processing+0x2f4/0x3e4 > [ 1988.059260] kfree+0x3f8/0x7ac > [ 1988.062313] free_msg+0x50/0xb0 > [ 1988.065450] do_msgrcv+0xd80/0xed0 > [ 1988.068846] ksys_msgrcv+0x4c/0x60 > [ 1988.072243] __arm64_sys_msgrcv+0xb8/0x194 > [ 1988.076336] el0_svc_handler+0x230/0x3bc > [ 1988.080255] el0_svc+0x I'm hoping that Andrey can make sense of this, since he recently hacked up freelist_ptr(), although only if CONFIG_SLAB_FREELIST_HARDENED=y, which isn't the case in your .config. One thing I did spot is that KASAN doesn't play nicely with page poisoning, because the pointer returned by kmap_atomic() which is used to poison and unpoison a page on free and alloc respectively is tagged. The simplest way to fix this was to disable KASAN around these operations (diff below). I think the unfortunate reality is that KASAN_SW_TAGS will always be flaky because most kernel code isn't (and wasn't) written with it in mind. The reason that's acceptable is because it's a debug option, but it also means I can't justify spending very much time on these sort of reports, sorry. Will _______________________________________________ linux-arm-kernel mailing list linux-arm-kernel@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-arm-kernel ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: livelock with KASAN_SW_TAGS 2019-02-15 14:23 ` Will Deacon @ 2019-02-15 14:26 ` Will Deacon 2019-02-19 17:56 ` Andrey Konovalov 1 sibling, 0 replies; 10+ messages in thread From: Will Deacon @ 2019-02-15 14:26 UTC (permalink / raw) To: Qian Cai Cc: Andrey Konovalov, Catalin Marinas, Linux ARM, kasan-dev, aryabinin [this time, with the diff] On Fri, Feb 15, 2019 at 02:23:56PM +0000, Will Deacon wrote: > One thing I did spot is that KASAN doesn't play nicely with page poisoning, > because the pointer returned by kmap_atomic() which is used to poison and > unpoison a page on free and alloc respectively is tagged. The simplest way > to fix this was to disable KASAN around these operations (diff below). Will --->8 diff --git a/mm/page_poison.c b/mm/page_poison.c index f0c15e9017c0..24aa8789971d 100644 --- a/mm/page_poison.c +++ b/mm/page_poison.c @@ -3,6 +3,7 @@ #include <linux/string.h> #include <linux/mm.h> #include <linux/highmem.h> +#include <linux/kasan.h> #include <linux/page_ext.h> #include <linux/poison.h> #include <linux/ratelimit.h> @@ -116,10 +117,12 @@ void kernel_poison_pages(struct page *page, int numpages, int enable) if (!page_poisoning_enabled()) return; + kasan_disable_current(); if (enable) unpoison_pages(page, numpages); else poison_pages(page, numpages); + kasan_enable_current(); } #ifndef CONFIG_ARCH_SUPPORTS_DEBUG_PAGEALLOC _______________________________________________ linux-arm-kernel mailing list linux-arm-kernel@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-arm-kernel ^ permalink raw reply related [flat|nested] 10+ messages in thread
* Re: livelock with KASAN_SW_TAGS 2019-02-15 14:23 ` Will Deacon 2019-02-15 14:26 ` Will Deacon @ 2019-02-19 17:56 ` Andrey Konovalov [not found] ` <1550601754.6911.41.camel@lca.pw> 1 sibling, 1 reply; 10+ messages in thread From: Andrey Konovalov @ 2019-02-19 17:56 UTC (permalink / raw) To: Will Deacon Cc: Andrey Ryabinin, Qian Cai, Linux ARM, kasan-dev, Catalin Marinas On Fri, Feb 15, 2019 at 3:23 PM Will Deacon <will.deacon@arm.com> wrote: > > On Thu, Feb 14, 2019 at 11:04:25PM -0500, Qian Cai wrote: > > On 2/14/19 1:01 PM, Will Deacon wrote: > > > On Thu, Feb 14, 2019 at 11:50:59AM -0500, Qian Cai wrote: > > >> On 2/14/19 11:35 AM, Will Deacon wrote: > > >>> On Wed, Feb 13, 2019 at 10:32:11PM -0500, Qian Cai wrote: > > >>>> Running LTP msgstress03 [1] triggers endless soft lockups below after a few > > >>>> minutes on a ThunderX2 server. It works fine with KASAN_GENERIC and finish the > > >>>> test in roughly 11 minutes. > > >>> > > >>> I've not been able to reproduce this failure under KVM, however the test > > >>> only takes around 18s to complete on the host and the guest, so it feels > > >>> like something is amiss here. Please could you share more information about > > >>> how you're triggering this problem? For example: > > >>> > > >>> - Kernel version and .config > > >> > > >> Latest mainline at 1f947a7a01 ("Merge branch 'akpm' (patches from Andrew)") plus > > >> a few KASAN_SW_TAGS patches in order to boot. > > >> > > >> https://marc.info/?l=linux-mm&m=155006632110129&w=2 (all 5) > > >> https://marc.info/?l=linux-mm&m=154968731424637&w=2 > > >> https://marc.info/?l=linux-mm&m=155010395725051&w=2 > > >> > > >> https://git.sr.ht/~cai/linux-debug/tree/master/config > > > > > > I struggled to get this config to boot under KVM :( > > > > > >>> - Clang version > > >> > > >> clang-7.0.1 > > > > > > It would be helpful to know if the issue persists with the latest nightly > > > build of clang. > > > > > > Anyway, please could you annotate the goto loop in free_debug_processing() > > > so that the object and tail pointer are printed each time around? It would > > > be useful to know if we're failing to exit that. > > > > > > > Well, I am not sure I understand what your debugging strategy. Maybe, you can > > send along a debug patch you have in mind for me to run. > > > > From the trace, it definitely exit the "goto next_object" and reached further in > > this line in free_debug_processing(), > > > > spin_unlock_irqrestore(&n->list_lock, flags); > > > > Once the machine is restricted to 16 CPUs (nr_cpus=16), although it still > > trigger soft lockups and msgstress03 would seem running forever, the machine is > > still responsible and is able to login via ssh. Hence, it is possible to capture > > a task dump (echo t >/proc/sysrq-trigger) while this is happening. > > > > https://git.sr.ht/~cai/linux-debug/tree/master/console > > > > Some traces looks strange that looks like running free_debug_processing() in a loop, > > > > [ 1986.002139] Call trace: > > [ 1986.002145] _raw_spin_unlock_irqrestore+0x44/0xac > > [ 1986.002152] free_debug_processing+0x2f4/0x3e4 > > [ 1986.002157] kmem_cache_free+0x44c/0x870 > > [ 1986.002163] free_object_rcu+0x200/0x228 > > [ 1986.002169] rcu_process_callbacks+0xb00/0x12c0 > > [ 1986.002175] __do_softirq+0x644/0xfd0 > > [ 1986.002181] irq_exit+0x29c/0x370 > > [ 1986.002187] __handle_domain_irq+0xe0/0x1c4 > > [ 1986.002192] gic_handle_irq+0x1c4/0x3b0 > > [ 1986.002197] el1_irq+0xb0/0x140 > > [ 1986.002203] lock_release+0x660/0x7dc > > [ 1986.002209] rcu_lock_release+0x20/0x28 > > [ 1986.002214] do_msgrcv+0x708/0xed0 > > [ 1986.002219] ksys_msgrcv+0x4c/0x60 > > [ 1986.002224] __arm64_sys_msgrcv+0xb8/0x194 > > [ 1986.002230] el0_svc_handler+0x230/0x3bc > > [ 1986.002236] el0_svc+0x8/0xc > > [ 1986.007106] OUTLINED_FUNCTION_169+0x4/0xc > > [ 1986.011885] free_debug_processing+0x2f4/0x3e4 > > [ 1986.017186] load_msg+0x4c/0x324 > > [ 1986.021617] kmem_cache_free+0x44c/0x870 > > [ 1986.026917] ksys_msgsnd+0x1e0/0xe5c > > [ 1988.050035] _raw_spin_unlock_irqrestore+0x44/0xac > > [ 1988.054821] free_debug_processing+0x2f4/0x3e4 > > [ 1988.059260] kfree+0x3f8/0x7ac > > [ 1988.062313] free_msg+0x50/0xb0 > > [ 1988.065450] do_msgrcv+0xd80/0xed0 > > [ 1988.068846] ksys_msgrcv+0x4c/0x60 > > [ 1988.072243] __arm64_sys_msgrcv+0xb8/0x194 > > [ 1988.076336] el0_svc_handler+0x230/0x3bc > > [ 1988.080255] el0_svc+0x > > I'm hoping that Andrey can make sense of this, since he recently hacked up > freelist_ptr(), although only if CONFIG_SLAB_FREELIST_HARDENED=y, which > isn't the case in your .config. So far, I've been unable to trigger this in QEMU as well. Qian, could you check if this still happens after adding that -pg flag in KASAN Makefile? > > One thing I did spot is that KASAN doesn't play nicely with page poisoning, > because the pointer returned by kmap_atomic() which is used to poison and > unpoison a page on free and alloc respectively is tagged. The simplest way > to fix this was to disable KASAN around these operations (diff below). There's a patch in the mm tree that addresses this already. > > I think the unfortunate reality is that KASAN_SW_TAGS will always be flaky > because most kernel code isn't (and wasn't) written with it in mind. The > reason that's acceptable is because it's a debug option, but it also means I > can't justify spending very much time on these sort of reports, sorry. > > Will _______________________________________________ linux-arm-kernel mailing list linux-arm-kernel@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-arm-kernel ^ permalink raw reply [flat|nested] 10+ messages in thread
[parent not found: <1550601754.6911.41.camel@lca.pw>]
* Re: soft lockups with SLAB_CONSISTENCY_CHECKS + KASAN_SW_TAGS (was: livelock with KASAN_SW_TAGS) [not found] ` <1550601754.6911.41.camel@lca.pw> @ 2019-02-19 18:56 ` Andrey Konovalov 2019-02-19 19:08 ` Qian Cai 0 siblings, 1 reply; 10+ messages in thread From: Andrey Konovalov @ 2019-02-19 18:56 UTC (permalink / raw) To: Qian Cai Cc: Catalin Marinas, Will Deacon, kasan-dev, Linux ARM, Andrey Ryabinin, Dmitry Vyukov On Tue, Feb 19, 2019 at 7:42 PM Qian Cai <cai@lca.pw> wrote: > > On Tue, 2019-02-19 at 18:56 +0100, Andrey Konovalov wrote: > > > > Once the machine is restricted to 16 CPUs (nr_cpus=16), although it still > > > > trigger soft lockups and msgstress03 would seem running forever, the > > > > machine is > > > > still responsible and is able to login via ssh. Hence, it is possible to > > > > capture > > > > a task dump (echo t >/proc/sysrq-trigger) while this is happening. > > > > > > > > https://git.sr.ht/~cai/linux-debug/tree/master/console > > > > > > > > Some traces looks strange that looks like running free_debug_processing() > > > > in a loop, > > > > > > > > [ 1986.002139] Call trace: > > > > [ 1986.002145] _raw_spin_unlock_irqrestore+0x44/0xac > > > > [ 1986.002152] free_debug_processing+0x2f4/0x3e4 > > > > [ 1986.002157] kmem_cache_free+0x44c/0x870 > > > > [ 1986.002163] free_object_rcu+0x200/0x228 > > > > [ 1986.002169] rcu_process_callbacks+0xb00/0x12c0 > > > > [ 1986.002175] __do_softirq+0x644/0xfd0 > > > > [ 1986.002181] irq_exit+0x29c/0x370 > > > > [ 1986.002187] __handle_domain_irq+0xe0/0x1c4 > > > > [ 1986.002192] gic_handle_irq+0x1c4/0x3b0 > > > > [ 1986.002197] el1_irq+0xb0/0x140 > > > > [ 1986.002203] lock_release+0x660/0x7dc > > > > [ 1986.002209] rcu_lock_release+0x20/0x28 > > > > [ 1986.002214] do_msgrcv+0x708/0xed0 > > > > [ 1986.002219] ksys_msgrcv+0x4c/0x60 > > > > [ 1986.002224] __arm64_sys_msgrcv+0xb8/0x194 > > > > [ 1986.002230] el0_svc_handler+0x230/0x3bc > > > > [ 1986.002236] el0_svc+0x8/0xc > > > > [ 1986.007106] OUTLINED_FUNCTION_169+0x4/0xc > > > > [ 1986.011885] free_debug_processing+0x2f4/0x3e4 > > > > [ 1986.017186] load_msg+0x4c/0x324 > > > > [ 1986.021617] kmem_cache_free+0x44c/0x870 > > > > [ 1986.026917] ksys_msgsnd+0x1e0/0xe5c > > > > [ 1988.050035] _raw_spin_unlock_irqrestore+0x44/0xac > > > > [ 1988.054821] free_debug_processing+0x2f4/0x3e4 > > > > [ 1988.059260] kfree+0x3f8/0x7ac > > > > [ 1988.062313] free_msg+0x50/0xb0 > > > > [ 1988.065450] do_msgrcv+0xd80/0xed0 > > > > [ 1988.068846] ksys_msgrcv+0x4c/0x60 > > > > [ 1988.072243] __arm64_sys_msgrcv+0xb8/0x194 > > > > [ 1988.076336] el0_svc_handler+0x230/0x3bc > > > > [ 1988.080255] el0_svc+0x > > > > > > I'm hoping that Andrey can make sense of this, since he recently hacked up > > > freelist_ptr(), although only if CONFIG_SLAB_FREELIST_HARDENED=y, which > > > isn't the case in your .config. > > > > So far, I've been unable to trigger this in QEMU as well. > > > > Qian, could you check if this still happens after adding that -pg flag > > in KASAN Makefile? > > Yes, it still happen. Although the reproducer (LTP msgstress0[3-4]) is making > slow progress, so not strict a live lock now. The situation gets worse if the > system has more CPUs probably because more CPUs are trying to acquire the > spinlock in free_debug_processing() and then flush the console with soft > lockups. > > One workaround is to add "KASAN_SANITIZE_string.o := n" to lib/Makefile which > will stop inserting KASAN instruments for check_bytes8(), and then the > reproducers are running smoothly without triggering any soft lockups. > > It looks like check_bytes8() is a big CPU consumer especially with KASAN > instruments added. > > 0000000000001a2c <check_bytes8>: > 1a2c: d2c20008 mov x8, #0x100000000000 > 1a30: f2fdffe8 movk x8, #0xefff, lsl #48 > 1a34: 34000202 cbz w2, 1a74 <check_bytes8+0x48> > 1a38: aa1e03e3 mov x3, x30 > 1a3c: 94000047 bl 1b58 <OUTLINED_FUNCTION_2> > 1a40: aa0303fe mov x30, x3 > 1a44: 54000060 b.eq 1a50 <check_bytes8+0x24> // b.none > 1a48: 7103fd3f cmp w9, #0xff > 1a4c: 54000101 b.ne 1a6c <check_bytes8+0x40> // b.any > 1a50: 39400009 ldrb w9, [x0] > 1a54: 6b21013f cmp w9, w1, uxtb > 1a58: 54000101 b.ne 1a78 <check_bytes8+0x4c> // b.any > 1a5c: 91000400 add x0, x0, #0x1 > 1a60: 51000442 sub w2, w2, #0x1 > 1a64: 35fffea2 cbnz w2, 1a38 <check_bytes8+0xc> > 1a68: 14000003 b 1a74 <check_bytes8+0x48> > 1a6c: d4212400 brk #0x920 > 1a70: 17fffff8 b 1a50 <check_bytes8+0x24> > 1a74: aa1f03e0 mov x0, xzr > 1a78: d65f03c0 ret > > This function is called by over and over again (with interrupts disabled), > > free_debug_processing [1] > free_consistency_checks > check_object > memchr_inv [2] > check_bytes8 > > [1] iterate all objects in the slab. > [2] while (words) { words--; Ah, so it doesn't lock, it's just very slow? memchr_inv() is the only caller of check_bytes8(), so we could remove instrumentation from the latter, and add one KASAN range check into the former. But I'd say this is the expected behavior, KASAN slows down stuff and I don't think it makes much sense to enable it together with other memory debugging options. > > I also noticed that even the single "top" command is now consuming 30% - 40% > CPUs all the time. Sometimes, it could jump to 80% or so. > > 5969 root 20 0 24512 10560 4736 R 83.8 0.0 3:25.79 top _______________________________________________ linux-arm-kernel mailing list linux-arm-kernel@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-arm-kernel ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: soft lockups with SLAB_CONSISTENCY_CHECKS + KASAN_SW_TAGS (was: livelock with KASAN_SW_TAGS) 2019-02-19 18:56 ` soft lockups with SLAB_CONSISTENCY_CHECKS + KASAN_SW_TAGS (was: livelock with KASAN_SW_TAGS) Andrey Konovalov @ 2019-02-19 19:08 ` Qian Cai 0 siblings, 0 replies; 10+ messages in thread From: Qian Cai @ 2019-02-19 19:08 UTC (permalink / raw) To: Andrey Konovalov Cc: Catalin Marinas, Will Deacon, kasan-dev, Linux ARM, Andrey Ryabinin, Dmitry Vyukov On Tue, 2019-02-19 at 19:56 +0100, Andrey Konovalov wrote: > Ah, so it doesn't lock, it's just very slow? memchr_inv() is the only > caller of check_bytes8(), so we could remove instrumentation from the > latter, and add one KASAN range check into the former. But I'd say > this is the expected behavior, KASAN slows down stuff and I don't > think it makes much sense to enable it together with other memory > debugging options. > Yes, just very slow. _______________________________________________ linux-arm-kernel mailing list linux-arm-kernel@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-arm-kernel ^ permalink raw reply [flat|nested] 10+ messages in thread
end of thread, other threads:[~2019-02-19 19:08 UTC | newest] Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2019-02-14 3:32 livelock with KASAN_SW_TAGS Qian Cai 2019-02-14 16:35 ` Will Deacon 2019-02-14 16:50 ` Qian Cai 2019-02-14 18:01 ` Will Deacon 2019-02-15 4:04 ` Qian Cai 2019-02-15 14:23 ` Will Deacon 2019-02-15 14:26 ` Will Deacon 2019-02-19 17:56 ` Andrey Konovalov [not found] ` <1550601754.6911.41.camel@lca.pw> 2019-02-19 18:56 ` soft lockups with SLAB_CONSISTENCY_CHECKS + KASAN_SW_TAGS (was: livelock with KASAN_SW_TAGS) Andrey Konovalov 2019-02-19 19:08 ` Qian Cai
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).