linux-arm-kernel.lists.infradead.org archive mirror
 help / color / mirror / Atom feed
* livelock with KASAN_SW_TAGS
@ 2019-02-14  3:32 Qian Cai
  2019-02-14 16:35 ` Will Deacon
  0 siblings, 1 reply; 10+ messages in thread
From: Qian Cai @ 2019-02-14  3:32 UTC (permalink / raw)
  To: Andrey Konovalov
  Cc: aryabinin, Will Deacon, Linux ARM, kasan-dev, Catalin Marinas

Running LTP msgstress03 [1] triggers endless soft lockups below after a few
minutes on a ThunderX2 server. It works fine with KASAN_GENERIC and finish the
test in roughly 11 minutes.

[1]
https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/syscalls/ipc/msgstress/msgstress03.c

# ./scripts/faddr2line vmlinux _raw_spin_unlock_irqrestore+0x44/0xac
_raw_spin_unlock_irqrestore+0x44/0xac:
arch_local_irq_restore at arch/arm64/include/asm/irqflags.h:88
(inlined by) __raw_spin_unlock_irqrestore at include/linux/spinlock_api_smp.h:160
(inlined by) _raw_spin_unlock_irqrestore at kernel/locking/spinlock.c:184

This code:

	asm volatile(
		"msr	daif, %0
	:
	: "r" (flags)
	: "memory");

[  545.622261] watchdog: BUG: soft lockup - CPU#12 stuck for 23s!
[kworker/12:1:1439]
[  545.629847] Modules linked in: thunderx2_pmu ip_tables xfs libcrc32c sd_mod
ahci libahci mlx5_core libata dm_mirror dm_region_hash dm_log dm_mod efivarfs
[  545.643692] CPU: 12 PID: 1439 Comm: kworker/12:1 Tainted: G        W
5.0.0-rc6+ #44
[  545.652082] Hardware name: HPE Apollo 70             /C01_APACHE_MB         ,
BIOS L50_5.13_1.0.6 07/10/2018
[  545.661947] Workqueue: events free_obj_work
[  545.666173] pstate: 60400009 (nZCv daif +PAN -UAO)
[  545.670981] pc : _raw_spin_unlock_irqrestore+0x44/0xac
[  545.676166] lr : _raw_spin_unlock_irqrestore+0x38/0xac
[  545.681389] sp : 30ff80098de3fa70
[  545.684724] x29: 30ff80098de3fa90 x28: 00000000000044c0
[  545.690048] x27: 0000000000000001 x26: 8aff808ba9f844d0
[  545.695380] x25: 0000000000000001 x24: 0000000000000001
[  545.700705] x23: 0000000000000000 x22: 62ff808b7d00df00
[  545.706038] x21: 62ff808b7d00df00 x20: 0000000000000000
[  545.711366] x19: ffff1000105d6ac0 x18: ffff1000148a5538
[  545.716720] x17: 000000000000001b x16: 00000000000000ff
[  545.722040] x15: ffff1000148a5000 x14: 000000000000004c
[  545.727358] x13: 0000000000000001 x12: efff100000000000
[  545.732678] x11: 0000000000000002 x10: 00000000000000ff
[  545.738034] x9 : 00000000ffffffff x8 : 4cff800832347840
[  545.743354] x7 : bbbbbbbbbbbbbbbb x6 : 0000000000000008
[  545.748711] x5 : 000000000000005a x4 : 0000000000000000
[  545.754041] x3 : 30ff80098de3fa04 x2 : 0000000000000003
[  545.759422] x1 : 62ff808b7d00df18 x0 : 4cff800832347850
[  545.764761] Call trace:
[  545.767225]  _raw_spin_unlock_irqrestore+0x44/0xac
[  545.772027]  free_debug_processing+0x308/0x410
[  545.776489]  kmem_cache_free+0x604/0x9f0
[  545.780442]  free_obj_work+0x92c/0xa44
[  545.784241]  process_one_work+0x894/0x1280
[  545.788349]  worker_thread+0x684/0xa1c
[  545.792122]  kthread+0x2cc/0x2e8
[  545.795367]  ret_from_fork+0x10/0x18

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: livelock with KASAN_SW_TAGS
  2019-02-14  3:32 livelock with KASAN_SW_TAGS Qian Cai
@ 2019-02-14 16:35 ` Will Deacon
  2019-02-14 16:50   ` Qian Cai
  0 siblings, 1 reply; 10+ messages in thread
From: Will Deacon @ 2019-02-14 16:35 UTC (permalink / raw)
  To: Qian Cai
  Cc: Andrey Konovalov, Catalin Marinas, Linux ARM, kasan-dev, aryabinin

Hi Qian Cai,

On Wed, Feb 13, 2019 at 10:32:11PM -0500, Qian Cai wrote:
> Running LTP msgstress03 [1] triggers endless soft lockups below after a few
> minutes on a ThunderX2 server. It works fine with KASAN_GENERIC and finish the
> test in roughly 11 minutes.

I've not been able to reproduce this failure under KVM, however the test
only takes around 18s to complete on the host and the guest, so it feels
like something is amiss here. Please could you share more information about
how you're triggering this problem? For example:

  - Kernel version and .config
  - Clang version
  - LTP version and command-line
  - Machine configuration (CPUs, memory, etc)

> # ./scripts/faddr2line vmlinux _raw_spin_unlock_irqrestore+0x44/0xac
> _raw_spin_unlock_irqrestore+0x44/0xac:
> arch_local_irq_restore at arch/arm64/include/asm/irqflags.h:88
> (inlined by) __raw_spin_unlock_irqrestore at include/linux/spinlock_api_smp.h:160
> (inlined by) _raw_spin_unlock_irqrestore at kernel/locking/spinlock.c:184
> 
> This code:
> 
> 	asm volatile(
> 		"msr	daif, %0
> 	:
> 	: "r" (flags)
> 	: "memory");

This is just the instruction when we unmask interrupts, so that's when the
soft watchdog fires. I suspect the real issue is that we're spending lots
of time in:

> [  545.772027]  free_debug_processing+0x308/0x410

with irqs disabled. Does the machine eventually come back to life?

Will

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: livelock with KASAN_SW_TAGS
  2019-02-14 16:35 ` Will Deacon
@ 2019-02-14 16:50   ` Qian Cai
  2019-02-14 18:01     ` Will Deacon
  0 siblings, 1 reply; 10+ messages in thread
From: Qian Cai @ 2019-02-14 16:50 UTC (permalink / raw)
  To: Will Deacon
  Cc: Andrey Konovalov, Catalin Marinas, Linux ARM, kasan-dev, aryabinin



On 2/14/19 11:35 AM, Will Deacon wrote:
> Hi Qian Cai,
> 
> On Wed, Feb 13, 2019 at 10:32:11PM -0500, Qian Cai wrote:
>> Running LTP msgstress03 [1] triggers endless soft lockups below after a few
>> minutes on a ThunderX2 server. It works fine with KASAN_GENERIC and finish the
>> test in roughly 11 minutes.
> 
> I've not been able to reproduce this failure under KVM, however the test
> only takes around 18s to complete on the host and the guest, so it feels
> like something is amiss here. Please could you share more information about
> how you're triggering this problem? For example:
> 
>   - Kernel version and .config

Latest mainline at 1f947a7a01 ("Merge branch 'akpm' (patches from Andrew)") plus
a few KASAN_SW_TAGS patches in order to boot.

https://marc.info/?l=linux-mm&m=155006632110129&w=2 (all 5)
https://marc.info/?l=linux-mm&m=154968731424637&w=2
https://marc.info/?l=linux-mm&m=155010395725051&w=2

https://git.sr.ht/~cai/linux-debug/tree/master/config

>   - Clang version

clang-7.0.1

>   - LTP version and command-line

latest git at db57ddc1497e

/opt/ltp/testcases/bin/msgstress03

>   - Machine configuration (CPUs, memory, etc)

https://git.sr.ht/~cai/linux-debug/tree/master/dmesg

# lscpu
Architecture:        aarch64
Byte Order:          Little Endian
CPU(s):              256
On-line CPU(s) list: 0-255
Thread(s) per core:  4
Core(s) per socket:  32
Socket(s):           2
NUMA node(s):        2
Vendor ID:           Cavium
Model:               1
Model name:          ThunderX2 99xx
Stepping:            0x1
BogoMIPS:            400.00
L1d cache:           32K
L1i cache:           32K
L2 cache:            256K
L3 cache:            32768K
NUMA node0 CPU(s):   0-127
NUMA node1 CPU(s):   128-255
Flags:               fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics cpuid
asimdrdm

# free -m
              total        used        free      shared  buff/cache   available
Mem:          84871        8347       75942          55         581       69699
Swap:          4095           0        4095

> 
>> # ./scripts/faddr2line vmlinux _raw_spin_unlock_irqrestore+0x44/0xac
>> _raw_spin_unlock_irqrestore+0x44/0xac:
>> arch_local_irq_restore at arch/arm64/include/asm/irqflags.h:88
>> (inlined by) __raw_spin_unlock_irqrestore at include/linux/spinlock_api_smp.h:160
>> (inlined by) _raw_spin_unlock_irqrestore at kernel/locking/spinlock.c:184
>>
>> This code:
>>
>> 	asm volatile(
>> 		"msr	daif, %0
>> 	:
>> 	: "r" (flags)
>> 	: "memory");
> 
> This is just the instruction when we unmask interrupts, so that's when the
> soft watchdog fires. I suspect the real issue is that we're spending lots
> of time in:
> 
>> [  545.772027]  free_debug_processing+0x308/0x410
> 
> with irqs disabled. Does the machine eventually come back to life?
>

Never.

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: livelock with KASAN_SW_TAGS
  2019-02-14 16:50   ` Qian Cai
@ 2019-02-14 18:01     ` Will Deacon
  2019-02-15  4:04       ` Qian Cai
  0 siblings, 1 reply; 10+ messages in thread
From: Will Deacon @ 2019-02-14 18:01 UTC (permalink / raw)
  To: Qian Cai
  Cc: Andrey Konovalov, Catalin Marinas, Linux ARM, kasan-dev, aryabinin

On Thu, Feb 14, 2019 at 11:50:59AM -0500, Qian Cai wrote:
> On 2/14/19 11:35 AM, Will Deacon wrote:
> > On Wed, Feb 13, 2019 at 10:32:11PM -0500, Qian Cai wrote:
> >> Running LTP msgstress03 [1] triggers endless soft lockups below after a few
> >> minutes on a ThunderX2 server. It works fine with KASAN_GENERIC and finish the
> >> test in roughly 11 minutes.
> > 
> > I've not been able to reproduce this failure under KVM, however the test
> > only takes around 18s to complete on the host and the guest, so it feels
> > like something is amiss here. Please could you share more information about
> > how you're triggering this problem? For example:
> > 
> >   - Kernel version and .config
> 
> Latest mainline at 1f947a7a01 ("Merge branch 'akpm' (patches from Andrew)") plus
> a few KASAN_SW_TAGS patches in order to boot.
> 
> https://marc.info/?l=linux-mm&m=155006632110129&w=2 (all 5)
> https://marc.info/?l=linux-mm&m=154968731424637&w=2
> https://marc.info/?l=linux-mm&m=155010395725051&w=2
> 
> https://git.sr.ht/~cai/linux-debug/tree/master/config

I struggled to get this config to boot under KVM :(

> >   - Clang version
> 
> clang-7.0.1

It would be helpful to know if the issue persists with the latest nightly
build of clang.

Anyway, please could you annotate the goto loop in free_debug_processing()
so that the object and tail pointer are printed each time around? It would
be useful to know if we're failing to exit that.

Thanks,

Will

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: livelock with KASAN_SW_TAGS
  2019-02-14 18:01     ` Will Deacon
@ 2019-02-15  4:04       ` Qian Cai
  2019-02-15 14:23         ` Will Deacon
  0 siblings, 1 reply; 10+ messages in thread
From: Qian Cai @ 2019-02-15  4:04 UTC (permalink / raw)
  To: Will Deacon
  Cc: Andrey Konovalov, Catalin Marinas, Linux ARM, kasan-dev, aryabinin



On 2/14/19 1:01 PM, Will Deacon wrote:
> On Thu, Feb 14, 2019 at 11:50:59AM -0500, Qian Cai wrote:
>> On 2/14/19 11:35 AM, Will Deacon wrote:
>>> On Wed, Feb 13, 2019 at 10:32:11PM -0500, Qian Cai wrote:
>>>> Running LTP msgstress03 [1] triggers endless soft lockups below after a few
>>>> minutes on a ThunderX2 server. It works fine with KASAN_GENERIC and finish the
>>>> test in roughly 11 minutes.
>>>
>>> I've not been able to reproduce this failure under KVM, however the test
>>> only takes around 18s to complete on the host and the guest, so it feels
>>> like something is amiss here. Please could you share more information about
>>> how you're triggering this problem? For example:
>>>
>>>   - Kernel version and .config
>>
>> Latest mainline at 1f947a7a01 ("Merge branch 'akpm' (patches from Andrew)") plus
>> a few KASAN_SW_TAGS patches in order to boot.
>>
>> https://marc.info/?l=linux-mm&m=155006632110129&w=2 (all 5)
>> https://marc.info/?l=linux-mm&m=154968731424637&w=2
>> https://marc.info/?l=linux-mm&m=155010395725051&w=2
>>
>> https://git.sr.ht/~cai/linux-debug/tree/master/config
> 
> I struggled to get this config to boot under KVM :(
> 
>>>   - Clang version
>>
>> clang-7.0.1
> 
> It would be helpful to know if the issue persists with the latest nightly
> build of clang.
> 
> Anyway, please could you annotate the goto loop in free_debug_processing()
> so that the object and tail pointer are printed each time around? It would
> be useful to know if we're failing to exit that.
>

Well, I am not sure I understand what your debugging strategy. Maybe, you can
send along a debug patch you have in mind for me to run.

From the trace, it definitely exit the "goto next_object" and reached further in
this line in free_debug_processing(),

spin_unlock_irqrestore(&n->list_lock, flags);

Once the machine is restricted to 16 CPUs (nr_cpus=16), although it still
trigger soft lockups and msgstress03 would seem running forever, the machine is
still responsible and is able to login via ssh. Hence, it is possible to capture
a task dump (echo t >/proc/sysrq-trigger) while this is happening.

https://git.sr.ht/~cai/linux-debug/tree/master/console

Some traces looks strange that looks like running free_debug_processing() in a loop,

[ 1986.002139] Call trace:
[ 1986.002145]  _raw_spin_unlock_irqrestore+0x44/0xac
[ 1986.002152]  free_debug_processing+0x2f4/0x3e4
[ 1986.002157]  kmem_cache_free+0x44c/0x870
[ 1986.002163]  free_object_rcu+0x200/0x228
[ 1986.002169]  rcu_process_callbacks+0xb00/0x12c0
[ 1986.002175]  __do_softirq+0x644/0xfd0
[ 1986.002181]  irq_exit+0x29c/0x370
[ 1986.002187]  __handle_domain_irq+0xe0/0x1c4
[ 1986.002192]  gic_handle_irq+0x1c4/0x3b0
[ 1986.002197]  el1_irq+0xb0/0x140
[ 1986.002203]  lock_release+0x660/0x7dc
[ 1986.002209]  rcu_lock_release+0x20/0x28
[ 1986.002214]  do_msgrcv+0x708/0xed0
[ 1986.002219]  ksys_msgrcv+0x4c/0x60
[ 1986.002224]  __arm64_sys_msgrcv+0xb8/0x194
[ 1986.002230]  el0_svc_handler+0x230/0x3bc
[ 1986.002236]  el0_svc+0x8/0xc
[ 1986.007106]  OUTLINED_FUNCTION_169+0x4/0xc
[ 1986.011885]  free_debug_processing+0x2f4/0x3e4
[ 1986.017186]  load_msg+0x4c/0x324
[ 1986.021617]  kmem_cache_free+0x44c/0x870
[ 1986.026917]  ksys_msgsnd+0x1e0/0xe5c
[ 1988.050035]  _raw_spin_unlock_irqrestore+0x44/0xac
[ 1988.054821]  free_debug_processing+0x2f4/0x3e4
[ 1988.059260]  kfree+0x3f8/0x7ac
[ 1988.062313]  free_msg+0x50/0xb0
[ 1988.065450]  do_msgrcv+0xd80/0xed0
[ 1988.068846]  ksys_msgrcv+0x4c/0x60
[ 1988.072243]  __arm64_sys_msgrcv+0xb8/0x194
[ 1988.076336]  el0_svc_handler+0x230/0x3bc
[ 1988.080255]  el0_svc+0x

Occasionally, msgstress03 would trigger a panic.

[  997.982080] Internal error: synchronous external abort: 96000610 [#1] SMP
[  997.988886] Modules linked in: thunderx2_pmu efivarfs ip_tables xfs libcrc32c
sd_mod ahci libahci mlx5_core libata dm_mirror dm_region_hash dm_log dm_mod
[  998.002665] CPU: 229 PID: 1323 Comm: kworker/229:1 Kdump: loaded Tainted: G
     W         5.0.0-rc6+ #57
[  998.012402] Hardware name: HPE Apollo 70             /C01_APACHE_MB         ,
BIOS L50_5.13_1.0.6 07/10/2018
[  998.022243] Workqueue: events free_obj_work
[  998.026434] pstate: 00400009 (nzcv daif +PAN -UAO)
[  998.031229] pc : kmem_cache_free+0x410/0x870
[  998.035500] lr : free_obj_work+0x92c/0xa44
[  998.039592] sp : 29ff808baebefb50
[  998.042908] x29: 29ff808baebefca0 x28: 73ff808a939eb6e0
[  998.048223] x27: 33ff800820013488 x26: 73ff808a939eb710
[  998.053537] x25: 0000000000a3008e x24: ffff100014b73000
[  998.058850] x23: 00000000000000ff x22: dead000000000100
[  998.064164] x21: efff100000000000 x20: efff100000000000
[  998.069480] x19: ffff100014b73d70 x18: ffff1000148a5538
[  998.074795] x17: 000000000000001b x16: 0000000000a3008d
[  998.080108] x15: b1ff808a939ee2a0 x14: 0000000000000000
[  998.085421] x13: 9dff808ba9713850 x12: 0000000000000000
[  998.090737] x11: 007ffffffc000201 x10: ffff1000138d4000
[  998.096055] x9 : 0000000000000000 x8 : 9dff808ba9713840
[  998.101371] x7 : bbbbbbbbbbbbbbbb x6 : 0000000000000008
[  998.106686] x5 : 000000000000005a x4 : 0000000000000000
[  998.111999] x3 : 29ff808baebef9f4 x2 : 0000000000000003
[  998.117312] x1 : cbff80082000df18 x0 : ffff808a939eb710
[  998.122632] Process kworker/229:1 (pid: 1323, stack limit = 0x00000000d1a3376a)
[  998.129937] Call trace:
[  998.132388]  kmem_cache_free+0x410/0x870
[  998.136320]  process_one_work+0x894/0x1280
[  998.140418]  worker_thread+0x684/0xa1c
[  998.144174]  kthread+0x2cc/0x2e8
[  998.147409]  ret_from_fork+0x10/0x18
[  998.150989] Code: a94b7bfd a94a4ff4 a94957f6 a9485ff8 (a94767fa)

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: livelock with KASAN_SW_TAGS
  2019-02-15  4:04       ` Qian Cai
@ 2019-02-15 14:23         ` Will Deacon
  2019-02-15 14:26           ` Will Deacon
  2019-02-19 17:56           ` Andrey Konovalov
  0 siblings, 2 replies; 10+ messages in thread
From: Will Deacon @ 2019-02-15 14:23 UTC (permalink / raw)
  To: Qian Cai
  Cc: Andrey Konovalov, Catalin Marinas, Linux ARM, kasan-dev, aryabinin

On Thu, Feb 14, 2019 at 11:04:25PM -0500, Qian Cai wrote:
> On 2/14/19 1:01 PM, Will Deacon wrote:
> > On Thu, Feb 14, 2019 at 11:50:59AM -0500, Qian Cai wrote:
> >> On 2/14/19 11:35 AM, Will Deacon wrote:
> >>> On Wed, Feb 13, 2019 at 10:32:11PM -0500, Qian Cai wrote:
> >>>> Running LTP msgstress03 [1] triggers endless soft lockups below after a few
> >>>> minutes on a ThunderX2 server. It works fine with KASAN_GENERIC and finish the
> >>>> test in roughly 11 minutes.
> >>>
> >>> I've not been able to reproduce this failure under KVM, however the test
> >>> only takes around 18s to complete on the host and the guest, so it feels
> >>> like something is amiss here. Please could you share more information about
> >>> how you're triggering this problem? For example:
> >>>
> >>>   - Kernel version and .config
> >>
> >> Latest mainline at 1f947a7a01 ("Merge branch 'akpm' (patches from Andrew)") plus
> >> a few KASAN_SW_TAGS patches in order to boot.
> >>
> >> https://marc.info/?l=linux-mm&m=155006632110129&w=2 (all 5)
> >> https://marc.info/?l=linux-mm&m=154968731424637&w=2
> >> https://marc.info/?l=linux-mm&m=155010395725051&w=2
> >>
> >> https://git.sr.ht/~cai/linux-debug/tree/master/config
> > 
> > I struggled to get this config to boot under KVM :(
> > 
> >>>   - Clang version
> >>
> >> clang-7.0.1
> > 
> > It would be helpful to know if the issue persists with the latest nightly
> > build of clang.
> > 
> > Anyway, please could you annotate the goto loop in free_debug_processing()
> > so that the object and tail pointer are printed each time around? It would
> > be useful to know if we're failing to exit that.
> >
> 
> Well, I am not sure I understand what your debugging strategy. Maybe, you can
> send along a debug patch you have in mind for me to run.
> 
> From the trace, it definitely exit the "goto next_object" and reached further in
> this line in free_debug_processing(),
> 
> spin_unlock_irqrestore(&n->list_lock, flags);
> 
> Once the machine is restricted to 16 CPUs (nr_cpus=16), although it still
> trigger soft lockups and msgstress03 would seem running forever, the machine is
> still responsible and is able to login via ssh. Hence, it is possible to capture
> a task dump (echo t >/proc/sysrq-trigger) while this is happening.
> 
> https://git.sr.ht/~cai/linux-debug/tree/master/console
> 
> Some traces looks strange that looks like running free_debug_processing() in a loop,
> 
> [ 1986.002139] Call trace:
> [ 1986.002145]  _raw_spin_unlock_irqrestore+0x44/0xac
> [ 1986.002152]  free_debug_processing+0x2f4/0x3e4
> [ 1986.002157]  kmem_cache_free+0x44c/0x870
> [ 1986.002163]  free_object_rcu+0x200/0x228
> [ 1986.002169]  rcu_process_callbacks+0xb00/0x12c0
> [ 1986.002175]  __do_softirq+0x644/0xfd0
> [ 1986.002181]  irq_exit+0x29c/0x370
> [ 1986.002187]  __handle_domain_irq+0xe0/0x1c4
> [ 1986.002192]  gic_handle_irq+0x1c4/0x3b0
> [ 1986.002197]  el1_irq+0xb0/0x140
> [ 1986.002203]  lock_release+0x660/0x7dc
> [ 1986.002209]  rcu_lock_release+0x20/0x28
> [ 1986.002214]  do_msgrcv+0x708/0xed0
> [ 1986.002219]  ksys_msgrcv+0x4c/0x60
> [ 1986.002224]  __arm64_sys_msgrcv+0xb8/0x194
> [ 1986.002230]  el0_svc_handler+0x230/0x3bc
> [ 1986.002236]  el0_svc+0x8/0xc
> [ 1986.007106]  OUTLINED_FUNCTION_169+0x4/0xc
> [ 1986.011885]  free_debug_processing+0x2f4/0x3e4
> [ 1986.017186]  load_msg+0x4c/0x324
> [ 1986.021617]  kmem_cache_free+0x44c/0x870
> [ 1986.026917]  ksys_msgsnd+0x1e0/0xe5c
> [ 1988.050035]  _raw_spin_unlock_irqrestore+0x44/0xac
> [ 1988.054821]  free_debug_processing+0x2f4/0x3e4
> [ 1988.059260]  kfree+0x3f8/0x7ac
> [ 1988.062313]  free_msg+0x50/0xb0
> [ 1988.065450]  do_msgrcv+0xd80/0xed0
> [ 1988.068846]  ksys_msgrcv+0x4c/0x60
> [ 1988.072243]  __arm64_sys_msgrcv+0xb8/0x194
> [ 1988.076336]  el0_svc_handler+0x230/0x3bc
> [ 1988.080255]  el0_svc+0x

I'm hoping that Andrey can make sense of this, since he recently hacked up
freelist_ptr(), although only if CONFIG_SLAB_FREELIST_HARDENED=y, which
isn't the case in your .config.

One thing I did spot is that KASAN doesn't play nicely with page poisoning,
because the pointer returned by kmap_atomic() which is used to poison and
unpoison a page on free and alloc respectively is tagged. The simplest way
to fix this was to disable KASAN around these operations (diff below).

I think the unfortunate reality is that KASAN_SW_TAGS will always be flaky
because most kernel code isn't (and wasn't) written with it in mind.  The
reason that's acceptable is because it's a debug option, but it also means I
can't justify spending very much time on these sort of reports, sorry.

Will

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: livelock with KASAN_SW_TAGS
  2019-02-15 14:23         ` Will Deacon
@ 2019-02-15 14:26           ` Will Deacon
  2019-02-19 17:56           ` Andrey Konovalov
  1 sibling, 0 replies; 10+ messages in thread
From: Will Deacon @ 2019-02-15 14:26 UTC (permalink / raw)
  To: Qian Cai
  Cc: Andrey Konovalov, Catalin Marinas, Linux ARM, kasan-dev, aryabinin

[this time, with the diff]

On Fri, Feb 15, 2019 at 02:23:56PM +0000, Will Deacon wrote:
> One thing I did spot is that KASAN doesn't play nicely with page poisoning,
> because the pointer returned by kmap_atomic() which is used to poison and
> unpoison a page on free and alloc respectively is tagged. The simplest way
> to fix this was to disable KASAN around these operations (diff below).

Will

--->8

diff --git a/mm/page_poison.c b/mm/page_poison.c
index f0c15e9017c0..24aa8789971d 100644
--- a/mm/page_poison.c
+++ b/mm/page_poison.c
@@ -3,6 +3,7 @@
 #include <linux/string.h>
 #include <linux/mm.h>
 #include <linux/highmem.h>
+#include <linux/kasan.h>
 #include <linux/page_ext.h>
 #include <linux/poison.h>
 #include <linux/ratelimit.h>
@@ -116,10 +117,12 @@ void kernel_poison_pages(struct page *page, int numpages, int enable)
 	if (!page_poisoning_enabled())
 		return;
 
+	kasan_disable_current();
 	if (enable)
 		unpoison_pages(page, numpages);
 	else
 		poison_pages(page, numpages);
+	kasan_enable_current();
 }
 
 #ifndef CONFIG_ARCH_SUPPORTS_DEBUG_PAGEALLOC

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: livelock with KASAN_SW_TAGS
  2019-02-15 14:23         ` Will Deacon
  2019-02-15 14:26           ` Will Deacon
@ 2019-02-19 17:56           ` Andrey Konovalov
       [not found]             ` <1550601754.6911.41.camel@lca.pw>
  1 sibling, 1 reply; 10+ messages in thread
From: Andrey Konovalov @ 2019-02-19 17:56 UTC (permalink / raw)
  To: Will Deacon
  Cc: Andrey Ryabinin, Qian Cai, Linux ARM, kasan-dev, Catalin Marinas

On Fri, Feb 15, 2019 at 3:23 PM Will Deacon <will.deacon@arm.com> wrote:
>
> On Thu, Feb 14, 2019 at 11:04:25PM -0500, Qian Cai wrote:
> > On 2/14/19 1:01 PM, Will Deacon wrote:
> > > On Thu, Feb 14, 2019 at 11:50:59AM -0500, Qian Cai wrote:
> > >> On 2/14/19 11:35 AM, Will Deacon wrote:
> > >>> On Wed, Feb 13, 2019 at 10:32:11PM -0500, Qian Cai wrote:
> > >>>> Running LTP msgstress03 [1] triggers endless soft lockups below after a few
> > >>>> minutes on a ThunderX2 server. It works fine with KASAN_GENERIC and finish the
> > >>>> test in roughly 11 minutes.
> > >>>
> > >>> I've not been able to reproduce this failure under KVM, however the test
> > >>> only takes around 18s to complete on the host and the guest, so it feels
> > >>> like something is amiss here. Please could you share more information about
> > >>> how you're triggering this problem? For example:
> > >>>
> > >>>   - Kernel version and .config
> > >>
> > >> Latest mainline at 1f947a7a01 ("Merge branch 'akpm' (patches from Andrew)") plus
> > >> a few KASAN_SW_TAGS patches in order to boot.
> > >>
> > >> https://marc.info/?l=linux-mm&m=155006632110129&w=2 (all 5)
> > >> https://marc.info/?l=linux-mm&m=154968731424637&w=2
> > >> https://marc.info/?l=linux-mm&m=155010395725051&w=2
> > >>
> > >> https://git.sr.ht/~cai/linux-debug/tree/master/config
> > >
> > > I struggled to get this config to boot under KVM :(
> > >
> > >>>   - Clang version
> > >>
> > >> clang-7.0.1
> > >
> > > It would be helpful to know if the issue persists with the latest nightly
> > > build of clang.
> > >
> > > Anyway, please could you annotate the goto loop in free_debug_processing()
> > > so that the object and tail pointer are printed each time around? It would
> > > be useful to know if we're failing to exit that.
> > >
> >
> > Well, I am not sure I understand what your debugging strategy. Maybe, you can
> > send along a debug patch you have in mind for me to run.
> >
> > From the trace, it definitely exit the "goto next_object" and reached further in
> > this line in free_debug_processing(),
> >
> > spin_unlock_irqrestore(&n->list_lock, flags);
> >
> > Once the machine is restricted to 16 CPUs (nr_cpus=16), although it still
> > trigger soft lockups and msgstress03 would seem running forever, the machine is
> > still responsible and is able to login via ssh. Hence, it is possible to capture
> > a task dump (echo t >/proc/sysrq-trigger) while this is happening.
> >
> > https://git.sr.ht/~cai/linux-debug/tree/master/console
> >
> > Some traces looks strange that looks like running free_debug_processing() in a loop,
> >
> > [ 1986.002139] Call trace:
> > [ 1986.002145]  _raw_spin_unlock_irqrestore+0x44/0xac
> > [ 1986.002152]  free_debug_processing+0x2f4/0x3e4
> > [ 1986.002157]  kmem_cache_free+0x44c/0x870
> > [ 1986.002163]  free_object_rcu+0x200/0x228
> > [ 1986.002169]  rcu_process_callbacks+0xb00/0x12c0
> > [ 1986.002175]  __do_softirq+0x644/0xfd0
> > [ 1986.002181]  irq_exit+0x29c/0x370
> > [ 1986.002187]  __handle_domain_irq+0xe0/0x1c4
> > [ 1986.002192]  gic_handle_irq+0x1c4/0x3b0
> > [ 1986.002197]  el1_irq+0xb0/0x140
> > [ 1986.002203]  lock_release+0x660/0x7dc
> > [ 1986.002209]  rcu_lock_release+0x20/0x28
> > [ 1986.002214]  do_msgrcv+0x708/0xed0
> > [ 1986.002219]  ksys_msgrcv+0x4c/0x60
> > [ 1986.002224]  __arm64_sys_msgrcv+0xb8/0x194
> > [ 1986.002230]  el0_svc_handler+0x230/0x3bc
> > [ 1986.002236]  el0_svc+0x8/0xc
> > [ 1986.007106]  OUTLINED_FUNCTION_169+0x4/0xc
> > [ 1986.011885]  free_debug_processing+0x2f4/0x3e4
> > [ 1986.017186]  load_msg+0x4c/0x324
> > [ 1986.021617]  kmem_cache_free+0x44c/0x870
> > [ 1986.026917]  ksys_msgsnd+0x1e0/0xe5c
> > [ 1988.050035]  _raw_spin_unlock_irqrestore+0x44/0xac
> > [ 1988.054821]  free_debug_processing+0x2f4/0x3e4
> > [ 1988.059260]  kfree+0x3f8/0x7ac
> > [ 1988.062313]  free_msg+0x50/0xb0
> > [ 1988.065450]  do_msgrcv+0xd80/0xed0
> > [ 1988.068846]  ksys_msgrcv+0x4c/0x60
> > [ 1988.072243]  __arm64_sys_msgrcv+0xb8/0x194
> > [ 1988.076336]  el0_svc_handler+0x230/0x3bc
> > [ 1988.080255]  el0_svc+0x
>
> I'm hoping that Andrey can make sense of this, since he recently hacked up
> freelist_ptr(), although only if CONFIG_SLAB_FREELIST_HARDENED=y, which
> isn't the case in your .config.

So far, I've been unable to trigger this in QEMU as well.

Qian, could you check if this still happens after adding that -pg flag
in KASAN Makefile?

>
> One thing I did spot is that KASAN doesn't play nicely with page poisoning,
> because the pointer returned by kmap_atomic() which is used to poison and
> unpoison a page on free and alloc respectively is tagged. The simplest way
> to fix this was to disable KASAN around these operations (diff below).

There's a patch in the mm tree that addresses this already.


>
> I think the unfortunate reality is that KASAN_SW_TAGS will always be flaky
> because most kernel code isn't (and wasn't) written with it in mind.  The
> reason that's acceptable is because it's a debug option, but it also means I
> can't justify spending very much time on these sort of reports, sorry.
>
> Will

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: soft lockups with SLAB_CONSISTENCY_CHECKS + KASAN_SW_TAGS (was: livelock with KASAN_SW_TAGS)
       [not found]             ` <1550601754.6911.41.camel@lca.pw>
@ 2019-02-19 18:56               ` Andrey Konovalov
  2019-02-19 19:08                 ` Qian Cai
  0 siblings, 1 reply; 10+ messages in thread
From: Andrey Konovalov @ 2019-02-19 18:56 UTC (permalink / raw)
  To: Qian Cai
  Cc: Catalin Marinas, Will Deacon, kasan-dev, Linux ARM,
	Andrey Ryabinin, Dmitry Vyukov

On Tue, Feb 19, 2019 at 7:42 PM Qian Cai <cai@lca.pw> wrote:
>
> On Tue, 2019-02-19 at 18:56 +0100, Andrey Konovalov wrote:
> > > > Once the machine is restricted to 16 CPUs (nr_cpus=16), although it still
> > > > trigger soft lockups and msgstress03 would seem running forever, the
> > > > machine is
> > > > still responsible and is able to login via ssh. Hence, it is possible to
> > > > capture
> > > > a task dump (echo t >/proc/sysrq-trigger) while this is happening.
> > > >
> > > > https://git.sr.ht/~cai/linux-debug/tree/master/console
> > > >
> > > > Some traces looks strange that looks like running free_debug_processing()
> > > > in a loop,
> > > >
> > > > [ 1986.002139] Call trace:
> > > > [ 1986.002145]  _raw_spin_unlock_irqrestore+0x44/0xac
> > > > [ 1986.002152]  free_debug_processing+0x2f4/0x3e4
> > > > [ 1986.002157]  kmem_cache_free+0x44c/0x870
> > > > [ 1986.002163]  free_object_rcu+0x200/0x228
> > > > [ 1986.002169]  rcu_process_callbacks+0xb00/0x12c0
> > > > [ 1986.002175]  __do_softirq+0x644/0xfd0
> > > > [ 1986.002181]  irq_exit+0x29c/0x370
> > > > [ 1986.002187]  __handle_domain_irq+0xe0/0x1c4
> > > > [ 1986.002192]  gic_handle_irq+0x1c4/0x3b0
> > > > [ 1986.002197]  el1_irq+0xb0/0x140
> > > > [ 1986.002203]  lock_release+0x660/0x7dc
> > > > [ 1986.002209]  rcu_lock_release+0x20/0x28
> > > > [ 1986.002214]  do_msgrcv+0x708/0xed0
> > > > [ 1986.002219]  ksys_msgrcv+0x4c/0x60
> > > > [ 1986.002224]  __arm64_sys_msgrcv+0xb8/0x194
> > > > [ 1986.002230]  el0_svc_handler+0x230/0x3bc
> > > > [ 1986.002236]  el0_svc+0x8/0xc
> > > > [ 1986.007106]  OUTLINED_FUNCTION_169+0x4/0xc
> > > > [ 1986.011885]  free_debug_processing+0x2f4/0x3e4
> > > > [ 1986.017186]  load_msg+0x4c/0x324
> > > > [ 1986.021617]  kmem_cache_free+0x44c/0x870
> > > > [ 1986.026917]  ksys_msgsnd+0x1e0/0xe5c
> > > > [ 1988.050035]  _raw_spin_unlock_irqrestore+0x44/0xac
> > > > [ 1988.054821]  free_debug_processing+0x2f4/0x3e4
> > > > [ 1988.059260]  kfree+0x3f8/0x7ac
> > > > [ 1988.062313]  free_msg+0x50/0xb0
> > > > [ 1988.065450]  do_msgrcv+0xd80/0xed0
> > > > [ 1988.068846]  ksys_msgrcv+0x4c/0x60
> > > > [ 1988.072243]  __arm64_sys_msgrcv+0xb8/0x194
> > > > [ 1988.076336]  el0_svc_handler+0x230/0x3bc
> > > > [ 1988.080255]  el0_svc+0x
> > >
> > > I'm hoping that Andrey can make sense of this, since he recently hacked up
> > > freelist_ptr(), although only if CONFIG_SLAB_FREELIST_HARDENED=y, which
> > > isn't the case in your .config.
> >
> > So far, I've been unable to trigger this in QEMU as well.
> >
> > Qian, could you check if this still happens after adding that -pg flag
> > in KASAN Makefile?
>
> Yes, it still happen. Although the reproducer (LTP msgstress0[3-4]) is making
> slow progress, so not strict a live lock now. The situation gets worse if the
> system has more CPUs probably because more CPUs are trying to acquire the
> spinlock in free_debug_processing() and then flush the console with soft
> lockups.
>
> One workaround is to add "KASAN_SANITIZE_string.o := n" to lib/Makefile which
> will stop inserting KASAN instruments for check_bytes8(), and then the
> reproducers are running smoothly without triggering any soft lockups.
>
> It looks like check_bytes8() is a big CPU consumer especially with KASAN
> instruments added.
>
> 0000000000001a2c <check_bytes8>:
>     1a2c:       d2c20008        mov     x8, #0x100000000000
>     1a30:       f2fdffe8        movk    x8, #0xefff, lsl #48
>     1a34:       34000202        cbz     w2, 1a74 <check_bytes8+0x48>
>     1a38:       aa1e03e3        mov     x3, x30
>     1a3c:       94000047        bl      1b58 <OUTLINED_FUNCTION_2>
>     1a40:       aa0303fe        mov     x30, x3
>     1a44:       54000060        b.eq    1a50 <check_bytes8+0x24>  // b.none
>     1a48:       7103fd3f        cmp     w9, #0xff
>     1a4c:       54000101        b.ne    1a6c <check_bytes8+0x40>  // b.any
>     1a50:       39400009        ldrb    w9, [x0]
>     1a54:       6b21013f        cmp     w9, w1, uxtb
>     1a58:       54000101        b.ne    1a78 <check_bytes8+0x4c>  // b.any
>     1a5c:       91000400        add     x0, x0, #0x1
>     1a60:       51000442        sub     w2, w2, #0x1
>     1a64:       35fffea2        cbnz    w2, 1a38 <check_bytes8+0xc>
>     1a68:       14000003        b       1a74 <check_bytes8+0x48>
>     1a6c:       d4212400        brk     #0x920
>     1a70:       17fffff8        b       1a50 <check_bytes8+0x24>
>     1a74:       aa1f03e0        mov     x0, xzr
>     1a78:       d65f03c0        ret
>
> This function is called by over and over again (with interrupts disabled),
>
> free_debug_processing [1]
>   free_consistency_checks
>     check_object
>       memchr_inv [2]
>         check_bytes8
>
> [1] iterate all objects in the slab.
> [2] while (words) { words--;

Ah, so it doesn't lock, it's just very slow? memchr_inv() is the only
caller of check_bytes8(), so we could remove instrumentation from the
latter, and add one KASAN range check into the former. But I'd say
this is the expected behavior, KASAN slows down stuff and I don't
think it makes much sense to enable it together with other memory
debugging options.


>
> I also noticed that even the single "top" command is now consuming 30% - 40%
> CPUs all the time. Sometimes, it could jump to 80% or so.
>
> 5969 root      20   0   24512  10560   4736 R  83.8   0.0   3:25.79 top

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: soft lockups with SLAB_CONSISTENCY_CHECKS + KASAN_SW_TAGS (was: livelock with KASAN_SW_TAGS)
  2019-02-19 18:56               ` soft lockups with SLAB_CONSISTENCY_CHECKS + KASAN_SW_TAGS (was: livelock with KASAN_SW_TAGS) Andrey Konovalov
@ 2019-02-19 19:08                 ` Qian Cai
  0 siblings, 0 replies; 10+ messages in thread
From: Qian Cai @ 2019-02-19 19:08 UTC (permalink / raw)
  To: Andrey Konovalov
  Cc: Catalin Marinas, Will Deacon, kasan-dev, Linux ARM,
	Andrey Ryabinin, Dmitry Vyukov

On Tue, 2019-02-19 at 19:56 +0100, Andrey Konovalov wrote:
> Ah, so it doesn't lock, it's just very slow? memchr_inv() is the only
> caller of check_bytes8(), so we could remove instrumentation from the
> latter, and add one KASAN range check into the former. But I'd say
> this is the expected behavior, KASAN slows down stuff and I don't
> think it makes much sense to enable it together with other memory
> debugging options.
> 

Yes, just very slow.

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2019-02-19 19:08 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-02-14  3:32 livelock with KASAN_SW_TAGS Qian Cai
2019-02-14 16:35 ` Will Deacon
2019-02-14 16:50   ` Qian Cai
2019-02-14 18:01     ` Will Deacon
2019-02-15  4:04       ` Qian Cai
2019-02-15 14:23         ` Will Deacon
2019-02-15 14:26           ` Will Deacon
2019-02-19 17:56           ` Andrey Konovalov
     [not found]             ` <1550601754.6911.41.camel@lca.pw>
2019-02-19 18:56               ` soft lockups with SLAB_CONSISTENCY_CHECKS + KASAN_SW_TAGS (was: livelock with KASAN_SW_TAGS) Andrey Konovalov
2019-02-19 19:08                 ` Qian Cai

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).