netdev.vger.kernel.org archive mirror
* INFO: task hung in ip6gre_exit_batch_net
@ 2018-06-04 15:03 syzbot
  2018-06-04 15:22 ` Dmitry Vyukov
  0 siblings, 1 reply; 13+ messages in thread
From: syzbot @ 2018-06-04 15:03 UTC (permalink / raw)
  To: christian.brauner, davem, dsahern, fw, jbenc, ktkhai,
	linux-kernel, lucien.xin, mschiffer, netdev, syzkaller-bugs,
	vyasevich

Hello,

syzbot found the following crash on:

HEAD commit:    bc2dbc5420e8 Merge branch 'akpm' (patches from Andrew)
git tree:       upstream
console output: https://syzkaller.appspot.com/x/log.txt?x=164e42b7800000
kernel config:  https://syzkaller.appspot.com/x/.config?x=982e2df1b9e60b02
dashboard link: https://syzkaller.appspot.com/bug?extid=bf78a74f82c1cf19069e
compiler:       gcc (GCC) 8.0.1 20180413 (experimental)

Unfortunately, I don't have any reproducer for this crash yet.

IMPORTANT: if you fix the bug, please add the following tag to the commit:
Reported-by: syzbot+bf78a74f82c1cf19069e@syzkaller.appspotmail.com

INFO: task kworker/u4:1:22 blocked for more than 120 seconds.
       Not tainted 4.17.0-rc6+ #68
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kworker/u4:1    D13944    22      2 0x80000000
Workqueue: netns cleanup_net
Call Trace:
  context_switch kernel/sched/core.c:2859 [inline]
  __schedule+0x801/0x1e30 kernel/sched/core.c:3501
  schedule+0xef/0x430 kernel/sched/core.c:3545
  schedule_preempt_disabled+0x10/0x20 kernel/sched/core.c:3603
  __mutex_lock_common kernel/locking/mutex.c:833 [inline]
  __mutex_lock+0xe38/0x17f0 kernel/locking/mutex.c:893
  mutex_lock_nested+0x16/0x20 kernel/locking/mutex.c:908
  rtnl_lock+0x17/0x20 net/core/rtnetlink.c:74
  ip6gre_exit_batch_net+0xd5/0x7d0 net/ipv6/ip6_gre.c:1585
  ops_exit_list.isra.7+0x105/0x160 net/core/net_namespace.c:155
  cleanup_net+0x51d/0xb20 net/core/net_namespace.c:523
  process_one_work+0xc1e/0x1b50 kernel/workqueue.c:2145
  worker_thread+0x1cc/0x1440 kernel/workqueue.c:2279
  kthread+0x345/0x410 kernel/kthread.c:240
  ret_from_fork+0x3a/0x50 arch/x86/entry/entry_64.S:412

Showing all locks held in the system:
4 locks held by kworker/u4:1/22:
  #0: 0000000049a7b590 ((wq_completion)"%s""netns"){+.+.}, at:  
__write_once_size include/linux/compiler.h:215 [inline]
  #0: 0000000049a7b590 ((wq_completion)"%s""netns"){+.+.}, at:  
arch_atomic64_set arch/x86/include/asm/atomic64_64.h:34 [inline]
  #0: 0000000049a7b590 ((wq_completion)"%s""netns"){+.+.}, at: atomic64_set  
include/asm-generic/atomic-instrumented.h:40 [inline]
  #0: 0000000049a7b590 ((wq_completion)"%s""netns"){+.+.}, at:  
atomic_long_set include/asm-generic/atomic-long.h:57 [inline]
  #0: 0000000049a7b590 ((wq_completion)"%s""netns"){+.+.}, at: set_work_data  
kernel/workqueue.c:617 [inline]
  #0: 0000000049a7b590 ((wq_completion)"%s""netns"){+.+.}, at:  
set_work_pool_and_clear_pending kernel/workqueue.c:644 [inline]
  #0: 0000000049a7b590 ((wq_completion)"%s""netns"){+.+.}, at:  
process_one_work+0xaef/0x1b50 kernel/workqueue.c:2116
  #1: 0000000030a00b6b (net_cleanup_work){+.+.}, at:  
process_one_work+0xb46/0x1b50 kernel/workqueue.c:2120
  #2: 000000007eb35e65 (pernet_ops_rwsem){++++}, at: cleanup_net+0x11a/0xb20  
net/core/net_namespace.c:490
  #3: 000000007eb32c75 (rtnl_mutex){+.+.}, at: rtnl_lock+0x17/0x20  
net/core/rtnetlink.c:74
3 locks held by kworker/1:1/24:
  #0: 000000001c9e6580 ((wq_completion)"%s"("ipv6_addrconf")){+.+.}, at:  
__write_once_size include/linux/compiler.h:215 [inline]
  #0: 000000001c9e6580 ((wq_completion)"%s"("ipv6_addrconf")){+.+.}, at:  
arch_atomic64_set arch/x86/include/asm/atomic64_64.h:34 [inline]
  #0: 000000001c9e6580 ((wq_completion)"%s"("ipv6_addrconf")){+.+.}, at:  
atomic64_set include/asm-generic/atomic-instrumented.h:40 [inline]
  #0: 000000001c9e6580 ((wq_completion)"%s"("ipv6_addrconf")){+.+.}, at:  
atomic_long_set include/asm-generic/atomic-long.h:57 [inline]
  #0: 000000001c9e6580 ((wq_completion)"%s"("ipv6_addrconf")){+.+.}, at:  
set_work_data kernel/workqueue.c:617 [inline]
  #0: 000000001c9e6580 ((wq_completion)"%s"("ipv6_addrconf")){+.+.}, at:  
set_work_pool_and_clear_pending kernel/workqueue.c:644 [inline]
  #0: 000000001c9e6580 ((wq_completion)"%s"("ipv6_addrconf")){+.+.}, at:  
process_one_work+0xaef/0x1b50 kernel/workqueue.c:2116
  #1: 000000009edcfbe7 ((addr_chk_work).work){+.+.}, at:  
process_one_work+0xb46/0x1b50 kernel/workqueue.c:2120
  #2: 000000007eb32c75 (rtnl_mutex){+.+.}, at: rtnl_lock+0x17/0x20  
net/core/rtnetlink.c:74
2 locks held by khungtaskd/893:
  #0: 000000007eeb621a (rcu_read_lock){....}, at:  
check_hung_uninterruptible_tasks kernel/hung_task.c:175 [inline]
  #0: 000000007eeb621a (rcu_read_lock){....}, at: watchdog+0x1ff/0xf60  
kernel/hung_task.c:249
  #1: 00000000239f1b5e (tasklist_lock){.+.+}, at:  
debug_show_all_locks+0xde/0x34a kernel/locking/lockdep.c:4470
2 locks held by getty/4481:
  #0: 00000000cc114025 (&tty->ldisc_sem){++++}, at:  
ldsem_down_read+0x37/0x40 drivers/tty/tty_ldsem.c:365
  #1: 000000006ad1f3fc (&ldata->atomic_read_lock){+.+.}, at:  
n_tty_read+0x321/0x1cc0 drivers/tty/n_tty.c:2131
2 locks held by getty/4482:
  #0: 00000000226a16cc (&tty->ldisc_sem){++++}, at:  
ldsem_down_read+0x37/0x40 drivers/tty/tty_ldsem.c:365
  #1: 000000008cee8cdc (&ldata->atomic_read_lock){+.+.}, at:  
n_tty_read+0x321/0x1cc0 drivers/tty/n_tty.c:2131
2 locks held by getty/4483:
  #0: 0000000067bd3c39 (&tty->ldisc_sem){++++}, at:  
ldsem_down_read+0x37/0x40 drivers/tty/tty_ldsem.c:365
  #1: 000000005d8bc81d (&ldata->atomic_read_lock){+.+.}, at:  
n_tty_read+0x321/0x1cc0 drivers/tty/n_tty.c:2131
2 locks held by getty/4484:
  #0: 00000000f0f8d839 (&tty->ldisc_sem){++++}, at:  
ldsem_down_read+0x37/0x40 drivers/tty/tty_ldsem.c:365
  #1: 00000000a9d5f091 (&ldata->atomic_read_lock){+.+.}, at:  
n_tty_read+0x321/0x1cc0 drivers/tty/n_tty.c:2131
2 locks held by getty/4485:
  #0: 000000002c96ee9a (&tty->ldisc_sem){++++}, at:  
ldsem_down_read+0x37/0x40 drivers/tty/tty_ldsem.c:365
  #1: 0000000033338ac7 (&ldata->atomic_read_lock){+.+.}, at:  
n_tty_read+0x321/0x1cc0 drivers/tty/n_tty.c:2131
2 locks held by getty/4486:
  #0: 00000000f6db39b5 (&tty->ldisc_sem){++++}, at:  
ldsem_down_read+0x37/0x40 drivers/tty/tty_ldsem.c:365
  #1: 00000000bb7c6099 (&ldata->atomic_read_lock){+.+.}, at:  
n_tty_read+0x321/0x1cc0 drivers/tty/n_tty.c:2131
2 locks held by getty/4487:
  #0: 000000006be9659d (&tty->ldisc_sem){++++}, at:  
ldsem_down_read+0x37/0x40 drivers/tty/tty_ldsem.c:365
  #1: 00000000e2edd3d0 (&ldata->atomic_read_lock){+.+.}, at:  
n_tty_read+0x321/0x1cc0 drivers/tty/n_tty.c:2131
3 locks held by kworker/1:3/27147:
  #0: 00000000c208aa7f ((wq_completion)"events"){+.+.}, at:  
__write_once_size include/linux/compiler.h:215 [inline]
  #0: 00000000c208aa7f ((wq_completion)"events"){+.+.}, at:  
arch_atomic64_set arch/x86/include/asm/atomic64_64.h:34 [inline]
  #0: 00000000c208aa7f ((wq_completion)"events"){+.+.}, at: atomic64_set  
include/asm-generic/atomic-instrumented.h:40 [inline]
  #0: 00000000c208aa7f ((wq_completion)"events"){+.+.}, at: atomic_long_set  
include/asm-generic/atomic-long.h:57 [inline]
  #0: 00000000c208aa7f ((wq_completion)"events"){+.+.}, at: set_work_data  
kernel/workqueue.c:617 [inline]
  #0: 00000000c208aa7f ((wq_completion)"events"){+.+.}, at:  
set_work_pool_and_clear_pending kernel/workqueue.c:644 [inline]
  #0: 00000000c208aa7f ((wq_completion)"events"){+.+.}, at:  
process_one_work+0xaef/0x1b50 kernel/workqueue.c:2116
  #1: 00000000fa87e61f (deferred_process_work){+.+.}, at:  
process_one_work+0xb46/0x1b50 kernel/workqueue.c:2120
  #2: 000000007eb32c75 (rtnl_mutex){+.+.}, at: rtnl_lock+0x17/0x20  
net/core/rtnetlink.c:74
1 lock held by syz-executor7/29665:
  #0: 000000006e20d618 (sk_lock-AF_INET6){+.+.}, at: lock_sock  
include/net/sock.h:1469 [inline]
  #0: 000000006e20d618 (sk_lock-AF_INET6){+.+.}, at:  
tls_sw_sendmsg+0x1b9/0x12e0 net/tls/tls_sw.c:384
1 lock held by syz-executor7/29689:
  #0: 000000007eb32c75 (rtnl_mutex){+.+.}, at: rtnl_lock+0x17/0x20  
net/core/rtnetlink.c:74

=============================================

NMI backtrace for cpu 0
CPU: 0 PID: 893 Comm: khungtaskd Not tainted 4.17.0-rc6+ #68
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS  
Google 01/01/2011
Call Trace:
  __dump_stack lib/dump_stack.c:77 [inline]
  dump_stack+0x1b9/0x294 lib/dump_stack.c:113
  nmi_cpu_backtrace.cold.4+0x19/0xce lib/nmi_backtrace.c:103
  nmi_trigger_cpumask_backtrace+0x151/0x192 lib/nmi_backtrace.c:62
  arch_trigger_cpumask_backtrace+0x14/0x20 arch/x86/kernel/apic/hw_nmi.c:38
  trigger_all_cpu_backtrace include/linux/nmi.h:138 [inline]
  check_hung_task kernel/hung_task.c:132 [inline]
  check_hung_uninterruptible_tasks kernel/hung_task.c:190 [inline]
  watchdog+0xc10/0xf60 kernel/hung_task.c:249
  kthread+0x345/0x410 kernel/kthread.c:240
  ret_from_fork+0x3a/0x50 arch/x86/entry/entry_64.S:412
Sending NMI from CPU 0 to CPUs 1:
NMI backtrace for cpu 1 skipped: idling at native_safe_halt+0x6/0x10  
arch/x86/include/asm/irqflags.h:54


---
This bug is generated by a bot. It may contain errors.
See https://goo.gl/tpsmEJ for more information about syzbot.
syzbot engineers can be reached at syzkaller@googlegroups.com.

syzbot will keep track of this bug report. See:
https://goo.gl/tpsmEJ#bug-status-tracking for how to communicate with  
syzbot.


* Re: INFO: task hung in ip6gre_exit_batch_net
  2018-06-04 15:03 INFO: task hung in ip6gre_exit_batch_net syzbot
@ 2018-06-04 15:22 ` Dmitry Vyukov
  2018-06-05  9:03   ` Kirill Tkhai
  0 siblings, 1 reply; 13+ messages in thread
From: Dmitry Vyukov @ 2018-06-04 15:22 UTC (permalink / raw)
  To: syzbot
  Cc: Christian Brauner, David Miller, David Ahern, Florian Westphal,
	Jiri Benc, Kirill Tkhai, LKML, Xin Long, mschiffer, netdev,
	syzkaller-bugs, Vladislav Yasevich

On Mon, Jun 4, 2018 at 5:03 PM, syzbot
<syzbot+bf78a74f82c1cf19069e@syzkaller.appspotmail.com> wrote:
> Hello,
>
> syzbot found the following crash on:
>
> HEAD commit:    bc2dbc5420e8 Merge branch 'akpm' (patches from Andrew)
> git tree:       upstream
> console output: https://syzkaller.appspot.com/x/log.txt?x=164e42b7800000
> kernel config:  https://syzkaller.appspot.com/x/.config?x=982e2df1b9e60b02
> dashboard link: https://syzkaller.appspot.com/bug?extid=bf78a74f82c1cf19069e
> compiler:       gcc (GCC) 8.0.1 20180413 (experimental)
>
> Unfortunately, I don't have any reproducer for this crash yet.
>
> IMPORTANT: if you fix the bug, please add the following tag to the commit:
> Reported-by: syzbot+bf78a74f82c1cf19069e@syzkaller.appspotmail.com

Another hang on rtnl lock:

#syz dup: INFO: task hung in netdev_run_todo

May be related to "unregister_netdevice: waiting for DEV to become free":
https://syzkaller.appspot.com/bug?id=1a97a5bd119fd97995f752819fd87840ab9479a9

Any other explanations for massive hangs on rtnl lock for minutes?
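
For reference, the blocked callsite in the trace is the rtnl_lock() at
the top of the batched pernet exit op; a simplified sketch of
ip6gre_exit_batch_net() (net/ipv6/ip6_gre.c in this tree, not verbatim):

static void __net_exit ip6gre_exit_batch_net(struct list_head *net_list)
{
    struct net *net;
    LIST_HEAD(list);

    rtnl_lock();    /* ip6_gre.c:1585 in the report below */
    list_for_each_entry(net, net_list, exit_list)
        ip6gre_destroy_tunnels(net, &list);
    unregister_netdevice_many(&list);
    rtnl_unlock();
}

The whole batch of dying namespaces is torn down under a single
rtnl_lock(), so the netns cleanup worker blocks right there whenever
another rtnl holder runs for a long time.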


> INFO: task kworker/u4:1:22 blocked for more than 120 seconds.
>       Not tainted 4.17.0-rc6+ #68
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> kworker/u4:1    D13944    22      2 0x80000000
> Workqueue: netns cleanup_net
> Call Trace:
>  context_switch kernel/sched/core.c:2859 [inline]
>  __schedule+0x801/0x1e30 kernel/sched/core.c:3501
>  schedule+0xef/0x430 kernel/sched/core.c:3545
>  schedule_preempt_disabled+0x10/0x20 kernel/sched/core.c:3603
>  __mutex_lock_common kernel/locking/mutex.c:833 [inline]
>  __mutex_lock+0xe38/0x17f0 kernel/locking/mutex.c:893
>  mutex_lock_nested+0x16/0x20 kernel/locking/mutex.c:908
>  rtnl_lock+0x17/0x20 net/core/rtnetlink.c:74
>  ip6gre_exit_batch_net+0xd5/0x7d0 net/ipv6/ip6_gre.c:1585
>  ops_exit_list.isra.7+0x105/0x160 net/core/net_namespace.c:155
>  cleanup_net+0x51d/0xb20 net/core/net_namespace.c:523
>  process_one_work+0xc1e/0x1b50 kernel/workqueue.c:2145
>  worker_thread+0x1cc/0x1440 kernel/workqueue.c:2279
>  kthread+0x345/0x410 kernel/kthread.c:240
>  ret_from_fork+0x3a/0x50 arch/x86/entry/entry_64.S:412
>
> Showing all locks held in the system:
> 4 locks held by kworker/u4:1/22:
>  #0: 0000000049a7b590 ((wq_completion)"%s""netns"){+.+.}, at:
> __write_once_size include/linux/compiler.h:215 [inline]
>  #0: 0000000049a7b590 ((wq_completion)"%s""netns"){+.+.}, at:
> arch_atomic64_set arch/x86/include/asm/atomic64_64.h:34 [inline]
>  #0: 0000000049a7b590 ((wq_completion)"%s""netns"){+.+.}, at: atomic64_set
> include/asm-generic/atomic-instrumented.h:40 [inline]
>  #0: 0000000049a7b590 ((wq_completion)"%s""netns"){+.+.}, at:
> atomic_long_set include/asm-generic/atomic-long.h:57 [inline]
>  #0: 0000000049a7b590 ((wq_completion)"%s""netns"){+.+.}, at: set_work_data
> kernel/workqueue.c:617 [inline]
>  #0: 0000000049a7b590 ((wq_completion)"%s""netns"){+.+.}, at:
> set_work_pool_and_clear_pending kernel/workqueue.c:644 [inline]
>  #0: 0000000049a7b590 ((wq_completion)"%s""netns"){+.+.}, at:
> process_one_work+0xaef/0x1b50 kernel/workqueue.c:2116
>  #1: 0000000030a00b6b (net_cleanup_work){+.+.}, at:
> process_one_work+0xb46/0x1b50 kernel/workqueue.c:2120
>  #2: 000000007eb35e65 (pernet_ops_rwsem){++++}, at: cleanup_net+0x11a/0xb20
> net/core/net_namespace.c:490
>  #3: 000000007eb32c75 (rtnl_mutex){+.+.}, at: rtnl_lock+0x17/0x20
> net/core/rtnetlink.c:74
> 3 locks held by kworker/1:1/24:
>  #0: 000000001c9e6580 ((wq_completion)"%s"("ipv6_addrconf")){+.+.}, at:
> __write_once_size include/linux/compiler.h:215 [inline]
>  #0: 000000001c9e6580 ((wq_completion)"%s"("ipv6_addrconf")){+.+.}, at:
> arch_atomic64_set arch/x86/include/asm/atomic64_64.h:34 [inline]
>  #0: 000000001c9e6580 ((wq_completion)"%s"("ipv6_addrconf")){+.+.}, at:
> atomic64_set include/asm-generic/atomic-instrumented.h:40 [inline]
>  #0: 000000001c9e6580 ((wq_completion)"%s"("ipv6_addrconf")){+.+.}, at:
> atomic_long_set include/asm-generic/atomic-long.h:57 [inline]
>  #0: 000000001c9e6580 ((wq_completion)"%s"("ipv6_addrconf")){+.+.}, at:
> set_work_data kernel/workqueue.c:617 [inline]
>  #0: 000000001c9e6580 ((wq_completion)"%s"("ipv6_addrconf")){+.+.}, at:
> set_work_pool_and_clear_pending kernel/workqueue.c:644 [inline]
>  #0: 000000001c9e6580 ((wq_completion)"%s"("ipv6_addrconf")){+.+.}, at:
> process_one_work+0xaef/0x1b50 kernel/workqueue.c:2116
>  #1: 000000009edcfbe7 ((addr_chk_work).work){+.+.}, at:
> process_one_work+0xb46/0x1b50 kernel/workqueue.c:2120
>  #2: 000000007eb32c75 (rtnl_mutex){+.+.}, at: rtnl_lock+0x17/0x20
> net/core/rtnetlink.c:74
> 2 locks held by khungtaskd/893:
>  #0: 000000007eeb621a (rcu_read_lock){....}, at:
> check_hung_uninterruptible_tasks kernel/hung_task.c:175 [inline]
>  #0: 000000007eeb621a (rcu_read_lock){....}, at: watchdog+0x1ff/0xf60
> kernel/hung_task.c:249
>  #1: 00000000239f1b5e (tasklist_lock){.+.+}, at:
> debug_show_all_locks+0xde/0x34a kernel/locking/lockdep.c:4470
> 2 locks held by getty/4481:
>  #0: 00000000cc114025 (&tty->ldisc_sem){++++}, at: ldsem_down_read+0x37/0x40
> drivers/tty/tty_ldsem.c:365
>  #1: 000000006ad1f3fc (&ldata->atomic_read_lock){+.+.}, at:
> n_tty_read+0x321/0x1cc0 drivers/tty/n_tty.c:2131
> 2 locks held by getty/4482:
>  #0: 00000000226a16cc (&tty->ldisc_sem){++++}, at: ldsem_down_read+0x37/0x40
> drivers/tty/tty_ldsem.c:365
>  #1: 000000008cee8cdc (&ldata->atomic_read_lock){+.+.}, at:
> n_tty_read+0x321/0x1cc0 drivers/tty/n_tty.c:2131
> 2 locks held by getty/4483:
>  #0: 0000000067bd3c39 (&tty->ldisc_sem){++++}, at: ldsem_down_read+0x37/0x40
> drivers/tty/tty_ldsem.c:365
>  #1: 000000005d8bc81d (&ldata->atomic_read_lock){+.+.}, at:
> n_tty_read+0x321/0x1cc0 drivers/tty/n_tty.c:2131
> 2 locks held by getty/4484:
>  #0: 00000000f0f8d839 (&tty->ldisc_sem){++++}, at: ldsem_down_read+0x37/0x40
> drivers/tty/tty_ldsem.c:365
>  #1: 00000000a9d5f091 (&ldata->atomic_read_lock){+.+.}, at:
> n_tty_read+0x321/0x1cc0 drivers/tty/n_tty.c:2131
> 2 locks held by getty/4485:
>  #0: 000000002c96ee9a (&tty->ldisc_sem){++++}, at: ldsem_down_read+0x37/0x40
> drivers/tty/tty_ldsem.c:365
>  #1: 0000000033338ac7 (&ldata->atomic_read_lock){+.+.}, at:
> n_tty_read+0x321/0x1cc0 drivers/tty/n_tty.c:2131
> 2 locks held by getty/4486:
>  #0: 00000000f6db39b5 (&tty->ldisc_sem){++++}, at: ldsem_down_read+0x37/0x40
> drivers/tty/tty_ldsem.c:365
>  #1: 00000000bb7c6099 (&ldata->atomic_read_lock){+.+.}, at:
> n_tty_read+0x321/0x1cc0 drivers/tty/n_tty.c:2131
> 2 locks held by getty/4487:
>  #0: 000000006be9659d (&tty->ldisc_sem){++++}, at: ldsem_down_read+0x37/0x40
> drivers/tty/tty_ldsem.c:365
>  #1: 00000000e2edd3d0 (&ldata->atomic_read_lock){+.+.}, at:
> n_tty_read+0x321/0x1cc0 drivers/tty/n_tty.c:2131
> 3 locks held by kworker/1:3/27147:
>  #0: 00000000c208aa7f ((wq_completion)"events"){+.+.}, at: __write_once_size
> include/linux/compiler.h:215 [inline]
>  #0: 00000000c208aa7f ((wq_completion)"events"){+.+.}, at: arch_atomic64_set
> arch/x86/include/asm/atomic64_64.h:34 [inline]
>  #0: 00000000c208aa7f ((wq_completion)"events"){+.+.}, at: atomic64_set
> include/asm-generic/atomic-instrumented.h:40 [inline]
>  #0: 00000000c208aa7f ((wq_completion)"events"){+.+.}, at: atomic_long_set
> include/asm-generic/atomic-long.h:57 [inline]
>  #0: 00000000c208aa7f ((wq_completion)"events"){+.+.}, at: set_work_data
> kernel/workqueue.c:617 [inline]
>  #0: 00000000c208aa7f ((wq_completion)"events"){+.+.}, at:
> set_work_pool_and_clear_pending kernel/workqueue.c:644 [inline]
>  #0: 00000000c208aa7f ((wq_completion)"events"){+.+.}, at:
> process_one_work+0xaef/0x1b50 kernel/workqueue.c:2116
>  #1: 00000000fa87e61f (deferred_process_work){+.+.}, at:
> process_one_work+0xb46/0x1b50 kernel/workqueue.c:2120
>  #2: 000000007eb32c75 (rtnl_mutex){+.+.}, at: rtnl_lock+0x17/0x20
> net/core/rtnetlink.c:74
> 1 lock held by syz-executor7/29665:
>  #0: 000000006e20d618 (sk_lock-AF_INET6){+.+.}, at: lock_sock
> include/net/sock.h:1469 [inline]
>  #0: 000000006e20d618 (sk_lock-AF_INET6){+.+.}, at:
> tls_sw_sendmsg+0x1b9/0x12e0 net/tls/tls_sw.c:384
> 1 lock held by syz-executor7/29689:
>  #0: 000000007eb32c75 (rtnl_mutex){+.+.}, at: rtnl_lock+0x17/0x20
> net/core/rtnetlink.c:74
>
> =============================================
>
> NMI backtrace for cpu 0
> CPU: 0 PID: 893 Comm: khungtaskd Not tainted 4.17.0-rc6+ #68
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
> Google 01/01/2011
> Call Trace:
>  __dump_stack lib/dump_stack.c:77 [inline]
>  dump_stack+0x1b9/0x294 lib/dump_stack.c:113
>  nmi_cpu_backtrace.cold.4+0x19/0xce lib/nmi_backtrace.c:103
>  nmi_trigger_cpumask_backtrace+0x151/0x192 lib/nmi_backtrace.c:62
>  arch_trigger_cpumask_backtrace+0x14/0x20 arch/x86/kernel/apic/hw_nmi.c:38
>  trigger_all_cpu_backtrace include/linux/nmi.h:138 [inline]
>  check_hung_task kernel/hung_task.c:132 [inline]
>  check_hung_uninterruptible_tasks kernel/hung_task.c:190 [inline]
>  watchdog+0xc10/0xf60 kernel/hung_task.c:249
>  kthread+0x345/0x410 kernel/kthread.c:240
>  ret_from_fork+0x3a/0x50 arch/x86/entry/entry_64.S:412
> Sending NMI from CPU 0 to CPUs 1:
> NMI backtrace for cpu 1 skipped: idling at native_safe_halt+0x6/0x10
> arch/x86/include/asm/irqflags.h:54
>
>
> ---
> This bug is generated by a bot. It may contain errors.
> See https://goo.gl/tpsmEJ for more information about syzbot.
> syzbot engineers can be reached at syzkaller@googlegroups.com.
>
> syzbot will keep track of this bug report. See:
> https://goo.gl/tpsmEJ#bug-status-tracking for how to communicate with
> syzbot.
>
> --
> You received this message because you are subscribed to the Google Groups
> "syzkaller-bugs" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to syzkaller-bugs+unsubscribe@googlegroups.com.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/syzkaller-bugs/0000000000006e4595056dd23c16%40google.com.
> For more options, visit https://groups.google.com/d/optout.


* Re: INFO: task hung in ip6gre_exit_batch_net
  2018-06-04 15:22 ` Dmitry Vyukov
@ 2018-06-05  9:03   ` Kirill Tkhai
  2018-06-05  9:36     ` Dmitry Vyukov
  0 siblings, 1 reply; 13+ messages in thread
From: Kirill Tkhai @ 2018-06-05  9:03 UTC (permalink / raw)
  To: Dmitry Vyukov, syzbot
  Cc: Christian Brauner, David Miller, David Ahern, Florian Westphal,
	Jiri Benc, LKML, Xin Long, mschiffer, netdev, syzkaller-bugs,
	Vladislav Yasevich

Hi, Dmitry!

On 04.06.2018 18:22, Dmitry Vyukov wrote:
> On Mon, Jun 4, 2018 at 5:03 PM, syzbot
> <syzbot+bf78a74f82c1cf19069e@syzkaller.appspotmail.com> wrote:
>> Hello,
>>
>> syzbot found the following crash on:
>>
>> HEAD commit:    bc2dbc5420e8 Merge branch 'akpm' (patches from Andrew)
>> git tree:       upstream
>> console output: https://syzkaller.appspot.com/x/log.txt?x=164e42b7800000
>> kernel config:  https://syzkaller.appspot.com/x/.config?x=982e2df1b9e60b02
>> dashboard link: https://syzkaller.appspot.com/bug?extid=bf78a74f82c1cf19069e
>> compiler:       gcc (GCC) 8.0.1 20180413 (experimental)
>>
>> Unfortunately, I don't have any reproducer for this crash yet.
>>
>> IMPORTANT: if you fix the bug, please add the following tag to the commit:
>> Reported-by: syzbot+bf78a74f82c1cf19069e@syzkaller.appspotmail.com
> 
> Another hang on rtnl lock:
> 
> #syz dup: INFO: task hung in netdev_run_todo
> 
> May be related to "unregister_netdevice: waiting for DEV to become free":
> https://syzkaller.appspot.com/bug?id=1a97a5bd119fd97995f752819fd87840ab9479a9
> 
> Any other explanations for massive hangs on rtnl lock for minutes?

To exclude the situation where a task exits with rtnl_mutex still held:

would the pr_warn() output from print_held_locks_bug() be included in the
console output if it appeared?
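
For reference, that check fires from the task-exit path; a rough sketch
of the relevant lockdep code (simplified, not verbatim):

/* do_exit() calls debug_check_no_locks_held(); if the dying task still
 * holds any lockdep-tracked lock, print_held_locks_bug() emits a
 * pr_warn("WARNING: ... still has locks held!") splat. */
void debug_check_no_locks_held(void)
{
    if (unlikely(current->lockdep_depth > 0))
        print_held_locks_bug();
}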

Kirill


* Re: INFO: task hung in ip6gre_exit_batch_net
  2018-06-05  9:03   ` Kirill Tkhai
@ 2018-06-05  9:36     ` Dmitry Vyukov
  2018-06-05 13:55       ` Kirill Tkhai
  0 siblings, 1 reply; 13+ messages in thread
From: Dmitry Vyukov @ 2018-06-05  9:36 UTC (permalink / raw)
  To: Kirill Tkhai
  Cc: syzbot, Christian Brauner, David Miller, David Ahern,
	Florian Westphal, Jiri Benc, LKML, Xin Long, mschiffer, netdev,
	syzkaller-bugs, Vladislav Yasevich

On Tue, Jun 5, 2018 at 11:03 AM, Kirill Tkhai <ktkhai@virtuozzo.com> wrote:
> Hi, Dmitry!
>
> On 04.06.2018 18:22, Dmitry Vyukov wrote:
>> On Mon, Jun 4, 2018 at 5:03 PM, syzbot
>> <syzbot+bf78a74f82c1cf19069e@syzkaller.appspotmail.com> wrote:
>>> Hello,
>>>
>>> syzbot found the following crash on:
>>>
>>> HEAD commit:    bc2dbc5420e8 Merge branch 'akpm' (patches from Andrew)
>>> git tree:       upstream
>>> console output: https://syzkaller.appspot.com/x/log.txt?x=164e42b7800000
>>> kernel config:  https://syzkaller.appspot.com/x/.config?x=982e2df1b9e60b02
>>> dashboard link: https://syzkaller.appspot.com/bug?extid=bf78a74f82c1cf19069e
>>> compiler:       gcc (GCC) 8.0.1 20180413 (experimental)
>>>
>>> Unfortunately, I don't have any reproducer for this crash yet.
>>>
>>> IMPORTANT: if you fix the bug, please add the following tag to the commit:
>>> Reported-by: syzbot+bf78a74f82c1cf19069e@syzkaller.appspotmail.com
>>
>> Another hang on rtnl lock:
>>
>> #syz dup: INFO: task hung in netdev_run_todo
>>
>> May be related to "unregister_netdevice: waiting for DEV to become free":
>> https://syzkaller.appspot.com/bug?id=1a97a5bd119fd97995f752819fd87840ab9479a9

netdev_wait_allrefs does not hold rtnl lock during waiting, so it must
be something different.
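
For context, a simplified sketch of that waiting loop (from
net/core/dev.c, not verbatim): it sleeps with rtnl released, taking the
lock only briefly to rebroadcast the unregister notification:

static void netdev_wait_allrefs(struct net_device *dev)
{
    unsigned long rebroadcast_time = jiffies;
    int refcnt = netdev_refcnt_read(dev);

    while (refcnt != 0) {
        if (time_after(jiffies, rebroadcast_time + 1 * HZ)) {
            rtnl_lock();
            /* Rebroadcast unregister notification */
            call_netdevice_notifiers(NETDEV_UNREGISTER, dev);
            __rtnl_unlock();
            rebroadcast_time = jiffies;
        }
        msleep(250);    /* rtnl is NOT held while sleeping */
        refcnt = netdev_refcnt_read(dev);
    }
}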


>> Any other explanations for massive hangs on rtnl lock for minutes?
>
> To exclude the situation where a task exits with rtnl_mutex still held:
>
> would the pr_warn() output from print_held_locks_bug() be included in the
> console output if it appeared?

Yes, everything containing "WARNING:" is detected as a bug.


* Re: INFO: task hung in ip6gre_exit_batch_net
  2018-06-05  9:36     ` Dmitry Vyukov
@ 2018-06-05 13:55       ` Kirill Tkhai
  2018-06-07 18:23         ` Dmitry Vyukov
  0 siblings, 1 reply; 13+ messages in thread
From: Kirill Tkhai @ 2018-06-05 13:55 UTC (permalink / raw)
  To: Dmitry Vyukov
  Cc: syzbot, Christian Brauner, David Miller, David Ahern,
	Florian Westphal, Jiri Benc, LKML, Xin Long, mschiffer, netdev,
	syzkaller-bugs, Vladislav Yasevich

On 05.06.2018 12:36, Dmitry Vyukov wrote:
> On Tue, Jun 5, 2018 at 11:03 AM, Kirill Tkhai <ktkhai@virtuozzo.com> wrote:
>> Hi, Dmitry!
>>
>> On 04.06.2018 18:22, Dmitry Vyukov wrote:
>>> On Mon, Jun 4, 2018 at 5:03 PM, syzbot
>>> <syzbot+bf78a74f82c1cf19069e@syzkaller.appspotmail.com> wrote:
>>>> Hello,
>>>>
>>>> syzbot found the following crash on:
>>>>
>>>> HEAD commit:    bc2dbc5420e8 Merge branch 'akpm' (patches from Andrew)
>>>> git tree:       upstream
>>>> console output: https://syzkaller.appspot.com/x/log.txt?x=164e42b7800000
>>>> kernel config:  https://syzkaller.appspot.com/x/.config?x=982e2df1b9e60b02
>>>> dashboard link: https://syzkaller.appspot.com/bug?extid=bf78a74f82c1cf19069e
>>>> compiler:       gcc (GCC) 8.0.1 20180413 (experimental)
>>>>
>>>> Unfortunately, I don't have any reproducer for this crash yet.
>>>>
>>>> IMPORTANT: if you fix the bug, please add the following tag to the commit:
>>>> Reported-by: syzbot+bf78a74f82c1cf19069e@syzkaller.appspotmail.com
>>>
>>> Another hang on rtnl lock:
>>>
>>> #syz dup: INFO: task hung in netdev_run_todo
>>>
>>> May be related to "unregister_netdevice: waiting for DEV to become free":
>>> https://syzkaller.appspot.com/bug?id=1a97a5bd119fd97995f752819fd87840ab9479a9
> 
> netdev_wait_allrefs does not hold rtnl lock during waiting, so it must
> be something different.
> 
> 
>>> Any other explanations for massive hangs on rtnl lock for minutes?
>>
>> To exclude the situation where a task exits with rtnl_mutex still held:
>>
>> would the pr_warn() output from print_held_locks_bug() be included in the
>> console output if it appeared?
> 
> Yes, everything containing "WARNING:" is detected as a bug.

OK, then a dead task not releasing the lock is excluded.

One more assumption: someone corrupted the memory around rtnl_mutex, so it only looks locked.
(I read the lockdep "(rtnl_mutex){+.+.}" prints in the initial message as "nobody owns rtnl_mutex").
A crash dump of the VM may help here.
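
If we get such a dump: the owner task is encoded directly in the lock
word nowadays, so the dump can show whether rtnl_mutex really has an
owner. A rough sketch of the decoding (close to include/linux/mutex.h,
not guaranteed verbatim):

/* The low bits of mutex->owner are flag bits; masking them off yields
 * the owning task_struct pointer, NULL when the mutex is unlocked. */
static inline struct task_struct *__mutex_owner(struct mutex *lock)
{
    return (struct task_struct *)(atomic_long_read(&lock->owner) & ~0x07);
}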

Also, there may be a locking code BUG, but this seems the least probable to me.

Kirill


* Re: INFO: task hung in ip6gre_exit_batch_net
  2018-06-05 13:55       ` Kirill Tkhai
@ 2018-06-07 18:23         ` Dmitry Vyukov
  2018-06-07 18:54           ` Kirill Tkhai
  0 siblings, 1 reply; 13+ messages in thread
From: Dmitry Vyukov @ 2018-06-07 18:23 UTC (permalink / raw)
  To: Kirill Tkhai
  Cc: syzbot, Christian Brauner, David Miller, David Ahern,
	Florian Westphal, Jiri Benc, LKML, Xin Long, mschiffer, netdev,
	syzkaller-bugs, Vladislav Yasevich

On Tue, Jun 5, 2018 at 3:55 PM, Kirill Tkhai <ktkhai@virtuozzo.com> wrote:
> On 05.06.2018 12:36, Dmitry Vyukov wrote:
>> On Tue, Jun 5, 2018 at 11:03 AM, Kirill Tkhai <ktkhai@virtuozzo.com> wrote:
>>> Hi, Dmitry!
>>>
>>> On 04.06.2018 18:22, Dmitry Vyukov wrote:
>>>> On Mon, Jun 4, 2018 at 5:03 PM, syzbot
>>>> <syzbot+bf78a74f82c1cf19069e@syzkaller.appspotmail.com> wrote:
>>>>> Hello,
>>>>>
>>>>> syzbot found the following crash on:
>>>>>
>>>>> HEAD commit:    bc2dbc5420e8 Merge branch 'akpm' (patches from Andrew)
>>>>> git tree:       upstream
>>>>> console output: https://syzkaller.appspot.com/x/log.txt?x=164e42b7800000
>>>>> kernel config:  https://syzkaller.appspot.com/x/.config?x=982e2df1b9e60b02
>>>>> dashboard link: https://syzkaller.appspot.com/bug?extid=bf78a74f82c1cf19069e
>>>>> compiler:       gcc (GCC) 8.0.1 20180413 (experimental)
>>>>>
>>>>> Unfortunately, I don't have any reproducer for this crash yet.
>>>>>
>>>>> IMPORTANT: if you fix the bug, please add the following tag to the commit:
>>>>> Reported-by: syzbot+bf78a74f82c1cf19069e@syzkaller.appspotmail.com
>>>>
>>>> Another hang on rtnl lock:
>>>>
>>>> #syz dup: INFO: task hung in netdev_run_todo
>>>>
>>>> May be related to "unregister_netdevice: waiting for DEV to become free":
>>>> https://syzkaller.appspot.com/bug?id=1a97a5bd119fd97995f752819fd87840ab9479a9
>>
>> netdev_wait_allrefs does not hold rtnl lock during waiting, so it must
>> be something different.
>>
>>
>>>> Any other explanations for massive hangs on rtnl lock for minutes?
>>>
>>> To exclude the situation where a task exits with rtnl_mutex still held:
>>>
>>> would the pr_warn() output from print_held_locks_bug() be included in the
>>> console output if it appeared?
>>
>> Yes, everything containing "WARNING:" is detected as a bug.
>
> OK, then a dead task not releasing the lock is excluded.
>
> One more assumption: someone corrupted the memory around rtnl_mutex, so it only looks locked.
> (I read the lockdep "(rtnl_mutex){+.+.}" prints in the initial message as "nobody owns rtnl_mutex").
> A crash dump of the VM may help here.

I can't find any legend for these +'s and .'s, but {+.+.} is present
in large amounts in just about any task-hung report for different mutexes,
so I would not expect it to mean corruption.

There are dozens of known corruptions that syzkaller can trigger, but
usually they are reliably caught by KASAN. If any of them led to
silent memory corruption, we would get dozens of assorted crashes
throughout the kernel. We've seen that at some points, but not
recently. So I would assume that memory is not corrupted in all these
cases:
https://syzkaller.appspot.com/bug?id=2503c576cabb08d41812e732b390141f01a59545

I wonder if it can be just that slow, but not actually hung... net
namespace destruction is super slow, so perhaps under heavy load it
all stalls for minutes...
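
A rough sketch of why that path is slow (simplified from
net/core/net_namespace.c, not verbatim): cleanup_net() runs every
registered pernet exit op for the whole batch of dying namespaces, and
several of those ops serialize on rtnl and on RCU grace periods:

static void cleanup_net(struct work_struct *work)
{
    struct pernet_operations *ops;
    LIST_HEAD(net_exit_list);

    down_read(&pernet_ops_rwsem);
    /* ... snapshot all namespaces queued for destruction ... */

    /* ops such as ip6gre_exit_batch_net() take rtnl_lock() here */
    list_for_each_entry_reverse(ops, &pernet_list, list)
        ops_exit_list(ops, &net_exit_list);

    up_read(&pernet_ops_rwsem);
    /* ... free the namespaces ... */
}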


> Also, there may be a locking code BUG, but this seems the least probable to me.
>
> Kirill


* Re: INFO: task hung in ip6gre_exit_batch_net
  2018-06-07 18:23         ` Dmitry Vyukov
@ 2018-06-07 18:54           ` Kirill Tkhai
  2018-06-07 19:03             ` Dmitry Vyukov
  0 siblings, 1 reply; 13+ messages in thread
From: Kirill Tkhai @ 2018-06-07 18:54 UTC (permalink / raw)
  To: Dmitry Vyukov
  Cc: syzbot, Christian Brauner, David Miller, David Ahern,
	Florian Westphal, Jiri Benc, LKML, Xin Long, mschiffer, netdev,
	syzkaller-bugs, Vladislav Yasevich

On 07.06.2018 21:23, Dmitry Vyukov wrote:
> On Tue, Jun 5, 2018 at 3:55 PM, Kirill Tkhai <ktkhai@virtuozzo.com> wrote:
>> On 05.06.2018 12:36, Dmitry Vyukov wrote:
>>> On Tue, Jun 5, 2018 at 11:03 AM, Kirill Tkhai <ktkhai@virtuozzo.com> wrote:
>>>> Hi, Dmitry!
>>>>
>>>> On 04.06.2018 18:22, Dmitry Vyukov wrote:
>>>>> On Mon, Jun 4, 2018 at 5:03 PM, syzbot
>>>>> <syzbot+bf78a74f82c1cf19069e@syzkaller.appspotmail.com> wrote:
>>>>>> Hello,
>>>>>>
>>>>>> syzbot found the following crash on:
>>>>>>
>>>>>> HEAD commit:    bc2dbc5420e8 Merge branch 'akpm' (patches from Andrew)
>>>>>> git tree:       upstream
>>>>>> console output: https://syzkaller.appspot.com/x/log.txt?x=164e42b7800000
>>>>>> kernel config:  https://syzkaller.appspot.com/x/.config?x=982e2df1b9e60b02
>>>>>> dashboard link: https://syzkaller.appspot.com/bug?extid=bf78a74f82c1cf19069e
>>>>>> compiler:       gcc (GCC) 8.0.1 20180413 (experimental)
>>>>>>
>>>>>> Unfortunately, I don't have any reproducer for this crash yet.
>>>>>>
>>>>>> IMPORTANT: if you fix the bug, please add the following tag to the commit:
>>>>>> Reported-by: syzbot+bf78a74f82c1cf19069e@syzkaller.appspotmail.com
>>>>>
>>>>> Another hang on rtnl lock:
>>>>>
>>>>> #syz dup: INFO: task hung in netdev_run_todo
>>>>>
>>>>> May be related to "unregister_netdevice: waiting for DEV to become free":
>>>>> https://syzkaller.appspot.com/bug?id=1a97a5bd119fd97995f752819fd87840ab9479a9
>>>
>>> netdev_wait_allrefs does not hold rtnl lock during waiting, so it must
>>> be something different.
>>>
>>>
>>>>> Any other explanations for massive hangs on rtnl lock for minutes?
>>>>
>>>> To exclude the situation where a task exits with rtnl_mutex still held:
>>>>
>>>> would the pr_warn() output from print_held_locks_bug() be included in the
>>>> console output if it appeared?
>>>
>>> Yes, everything containing "WARNING:" is detected as a bug.
>>
>> OK, then a dead task not releasing the lock is excluded.
>>
>> One more assumption: someone corrupted the memory around rtnl_mutex, so it only looks locked.
>> (I read the lockdep "(rtnl_mutex){+.+.}" prints in the initial message as "nobody owns rtnl_mutex").
>> A crash dump of the VM may help here.
> 
> I can't find any legend for these +'s and .'s, but {+.+.} is present
> in large amounts in just about any task-hung report for different mutexes,
> so I would not expect it to mean corruption.
> 
> There are dozens of known corruptions that syzkaller can trigger, but
> usually they are reliably caught by KASAN. If any of them led to
> silent memory corruption, we would get dozens of assorted crashes
> throughout the kernel. We've seen that at some points, but not
> recently. So I would assume that memory is not corrupted in all these
> cases:
> https://syzkaller.appspot.com/bug?id=2503c576cabb08d41812e732b390141f01a59545

This BUG clarifies the {+.+.}:

4 locks held by kworker/0:145/381:
 #0:  ((wq_completion)"hwsim_wq"){+.+.}, at: [<000000003f9487f0>] work_static include/linux/workqueue.h:198 [inline]
 #0:  ((wq_completion)"hwsim_wq"){+.+.}, at: [<000000003f9487f0>] set_work_data kernel/workqueue.c:619 [inline]
 #0:  ((wq_completion)"hwsim_wq"){+.+.}, at: [<000000003f9487f0>] set_work_pool_and_clear_pending kernel/workqueue.c:646 [inline]
 #0:  ((wq_completion)"hwsim_wq"){+.+.}, at: [<000000003f9487f0>] process_one_work+0xb12/0x1bb0 kernel/workqueue.c:2084
 #1:  ((work_completion)(&data->destroy_work)){+.+.}, at: [<00000000bbdd2115>] process_one_work+0xb89/0x1bb0 kernel/workqueue.c:2088
 #2:  (rtnl_mutex){+.+.}, at: [<000000009c9d14f8>] rtnl_lock+0x17/0x20 net/core/rtnetlink.c:74
 #3:  (rcu_sched_state.exp_mutex){+.+.}, at: [<000000001ba1a807>] exp_funnel_lock kernel/rcu/tree_exp.h:272 [inline]
 #3:  (rcu_sched_state.exp_mutex){+.+.}, at: [<000000001ba1a807>] _synchronize_rcu_expedited.constprop.72+0x9fa/0xac0 kernel/rcu/tree_exp.h:596

There we have rtnl_mutex locked, and its {+.+.} looks just like the ones
above. It's definitely locked, since one more lock is taken after it.

This BUG happens because there are many rtnl_mutex waiters while the owner
is synchronizing RCU. Rather clear to me, in comparison to this topic's hang.
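
The holder-side pattern from that held-locks list looks like this
(a sketch with a hypothetical function name, not the exact
mac80211_hwsim code):

static void destroy_work_fn(struct work_struct *work)
{
    rtnl_lock();
    /* ... tear down the radio and its netdev ... */
    synchronize_rcu_expedited();    /* slow under heavy load */
    rtnl_unlock();                  /* only now can waiters proceed */
}

Every other rtnl_lock() caller in the system queues up behind this for
the whole expedited grace period.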
 
> I wonder if it can be just that slow, but not actually hung... net
> namespace destruction is super slow, so perhaps under heavy load it
> all stalls for minutes...

Thanks,
Kirill


* Re: INFO: task hung in ip6gre_exit_batch_net
  2018-06-07 18:54           ` Kirill Tkhai
@ 2018-06-07 19:03             ` Dmitry Vyukov
  2018-06-07 19:59               ` Kirill Tkhai
  0 siblings, 1 reply; 13+ messages in thread
From: Dmitry Vyukov @ 2018-06-07 19:03 UTC (permalink / raw)
  To: Kirill Tkhai
  Cc: syzbot, Christian Brauner, David Miller, David Ahern,
	Florian Westphal, Jiri Benc, LKML, Xin Long, mschiffer, netdev,
	syzkaller-bugs, Vladislav Yasevich

On Thu, Jun 7, 2018 at 8:54 PM, Kirill Tkhai <ktkhai@virtuozzo.com> wrote:
>>>>> Hi, Dmitry!
>>>>>
>>>>> On 04.06.2018 18:22, Dmitry Vyukov wrote:
>>>>>> On Mon, Jun 4, 2018 at 5:03 PM, syzbot
>>>>>> <syzbot+bf78a74f82c1cf19069e@syzkaller.appspotmail.com> wrote:
>>>>>>> Hello,
>>>>>>>
>>>>>>> syzbot found the following crash on:
>>>>>>>
>>>>>>> HEAD commit:    bc2dbc5420e8 Merge branch 'akpm' (patches from Andrew)
>>>>>>> git tree:       upstream
>>>>>>> console output: https://syzkaller.appspot.com/x/log.txt?x=164e42b7800000
>>>>>>> kernel config:  https://syzkaller.appspot.com/x/.config?x=982e2df1b9e60b02
>>>>>>> dashboard link: https://syzkaller.appspot.com/bug?extid=bf78a74f82c1cf19069e
>>>>>>> compiler:       gcc (GCC) 8.0.1 20180413 (experimental)
>>>>>>>
>>>>>>> Unfortunately, I don't have any reproducer for this crash yet.
>>>>>>>
>>>>>>> IMPORTANT: if you fix the bug, please add the following tag to the commit:
>>>>>>> Reported-by: syzbot+bf78a74f82c1cf19069e@syzkaller.appspotmail.com
>>>>>>
>>>>>> Another hang on rtnl lock:
>>>>>>
>>>>>> #syz dup: INFO: task hung in netdev_run_todo
>>>>>>
>>>>>> May be related to "unregister_netdevice: waiting for DEV to become free":
>>>>>> https://syzkaller.appspot.com/bug?id=1a97a5bd119fd97995f752819fd87840ab9479a9
>>>>
>>>> netdev_wait_allrefs does not hold rtnl lock during waiting, so it must
>>>> be something different.
>>>>
>>>>
>>>>>> Any other explanations for massive hangs on rtnl lock for minutes?
>>>>>
>>>>> To exclude the situation where a task exits with rtnl_mutex still held:
>>>>>
>>>>> would the pr_warn() output from print_held_locks_bug() be included in the
>>>>> console output if it appeared?
>>>>
>>>> Yes, everything containing "WARNING:" is detected as a bug.
>>>
>>> OK, then a dead task not releasing the lock is excluded.
>>>
>>> One more assumption: someone corrupted the memory around rtnl_mutex, so it only looks locked.
>>> (I read the lockdep "(rtnl_mutex){+.+.}" prints in the initial message as "nobody owns rtnl_mutex").
>>> A crash dump of the VM may help here.
>>
>> I can't find any legend for these +'s and .'s, but {+.+.} is present
>> in large amounts in just about any task-hung report for different mutexes,
>> so I would not expect it to mean corruption.
>>
>> There are dozens of known corruptions that syzkaller can trigger, but
>> usually they are reliably caught by KASAN. If any of them led to
>> silent memory corruption, we would get dozens of assorted crashes
>> throughout the kernel. We've seen that at some points, but not
>> recently. So I would assume that memory is not corrupted in all these
>> cases:
>> https://syzkaller.appspot.com/bug?id=2503c576cabb08d41812e732b390141f01a59545
>
> This BUG clarifies the {+.+.}:
>
> 4 locks held by kworker/0:145/381:
>  #0:  ((wq_completion)"hwsim_wq"){+.+.}, at: [<000000003f9487f0>] work_static include/linux/workqueue.h:198 [inline]
>  #0:  ((wq_completion)"hwsim_wq"){+.+.}, at: [<000000003f9487f0>] set_work_data kernel/workqueue.c:619 [inline]
>  #0:  ((wq_completion)"hwsim_wq"){+.+.}, at: [<000000003f9487f0>] set_work_pool_and_clear_pending kernel/workqueue.c:646 [inline]
>  #0:  ((wq_completion)"hwsim_wq"){+.+.}, at: [<000000003f9487f0>] process_one_work+0xb12/0x1bb0 kernel/workqueue.c:2084
>  #1:  ((work_completion)(&data->destroy_work)){+.+.}, at: [<00000000bbdd2115>] process_one_work+0xb89/0x1bb0 kernel/workqueue.c:2088
>  #2:  (rtnl_mutex){+.+.}, at: [<000000009c9d14f8>] rtnl_lock+0x17/0x20 net/core/rtnetlink.c:74
>  #3:  (rcu_sched_state.exp_mutex){+.+.}, at: [<000000001ba1a807>] exp_funnel_lock kernel/rcu/tree_exp.h:272 [inline]
>  #3:  (rcu_sched_state.exp_mutex){+.+.}, at: [<000000001ba1a807>] _synchronize_rcu_expedited.constprop.72+0x9fa/0xac0 kernel/rcu/tree_exp.h:596
>
> There we have rtnl_mutex locked, and its {+.+.} looks just like the ones
> above. It's definitely locked, since one more lock is taken after it.
>
> This BUG happens because there are many rtnl_mutex waiters while the owner
> is synchronizing RCU. Rather clear to me, in comparison to this topic's hang.


You mean that it's not hung, but rather needs more than 2 minutes to
complete, right?


>> I wonder if it can be just that slow, but not actually hung... net
>> namespace destruction is super slow, so perhaps under heavy load it
>> all stalls for minutes...


* Re: INFO: task hung in ip6gre_exit_batch_net
  2018-06-07 19:03             ` Dmitry Vyukov
@ 2018-06-07 19:59               ` Kirill Tkhai
  2018-06-08  8:18                 ` Dmitry Vyukov
  0 siblings, 1 reply; 13+ messages in thread
From: Kirill Tkhai @ 2018-06-07 19:59 UTC (permalink / raw)
  To: Dmitry Vyukov
  Cc: syzbot, Christian Brauner, David Miller, David Ahern,
	Florian Westphal, Jiri Benc, LKML, Xin Long, mschiffer, netdev,
	syzkaller-bugs, Vladislav Yasevich

On 07.06.2018 22:03, Dmitry Vyukov wrote:
> On Thu, Jun 7, 2018 at 8:54 PM, Kirill Tkhai <ktkhai@virtuozzo.com> wrote:
>>>>>> Hi, Dmitry!
>>>>>>
>>>>>> On 04.06.2018 18:22, Dmitry Vyukov wrote:
>>>>>>> On Mon, Jun 4, 2018 at 5:03 PM, syzbot
>>>>>>> <syzbot+bf78a74f82c1cf19069e@syzkaller.appspotmail.com> wrote:
>>>>>>>> Hello,
>>>>>>>>
>>>>>>>> syzbot found the following crash on:
>>>>>>>>
>>>>>>>> HEAD commit:    bc2dbc5420e8 Merge branch 'akpm' (patches from Andrew)
>>>>>>>> git tree:       upstream
>>>>>>>> console output: https://syzkaller.appspot.com/x/log.txt?x=164e42b7800000
>>>>>>>> kernel config:  https://syzkaller.appspot.com/x/.config?x=982e2df1b9e60b02
>>>>>>>> dashboard link: https://syzkaller.appspot.com/bug?extid=bf78a74f82c1cf19069e
>>>>>>>> compiler:       gcc (GCC) 8.0.1 20180413 (experimental)
>>>>>>>>
>>>>>>>> Unfortunately, I don't have any reproducer for this crash yet.
>>>>>>>>
>>>>>>>> IMPORTANT: if you fix the bug, please add the following tag to the commit:
>>>>>>>> Reported-by: syzbot+bf78a74f82c1cf19069e@syzkaller.appspotmail.com
>>>>>>>
>>>>>>> Another hang on rtnl lock:
>>>>>>>
>>>>>>> #syz dup: INFO: task hung in netdev_run_todo
>>>>>>>
>>>>>>> May be related to "unregister_netdevice: waiting for DEV to become free":
>>>>>>> https://syzkaller.appspot.com/bug?id=1a97a5bd119fd97995f752819fd87840ab9479a9
>>>>>
>>>>> netdev_wait_allrefs does not hold rtnl lock during waiting, so it must
>>>>> be something different.
>>>>>
>>>>>
>>>>>>> Any other explanations for massive hangs on rtnl lock for minutes?
>>>>>>
>>>>>> To exclude the situation where a task exits with rtnl_mutex still held:
>>>>>>
>>>>>> would the pr_warn() output from print_held_locks_bug() be included in the
>>>>>> console output if it appeared?
>>>>>
>>>>> Yes, everything containing "WARNING:" is detected as a bug.
>>>>
>>>> OK, then a dead task not releasing the lock is excluded.
>>>>
>>>> One more assumption: someone corrupted the memory around rtnl_mutex, so it only looks locked.
>>>> (I read the lockdep "(rtnl_mutex){+.+.}" prints in the initial message as "nobody owns rtnl_mutex").
>>>> A crash dump of the VM may help here.
>>>
>>> I can't find any legend for these +'s and .'s, but {+.+.} is present
>>> in large amounts in just about any task-hung report for different mutexes,
>>> so I would not expect it to mean corruption.
>>>
>>> There are dozens of known corruptions that syzkaller can trigger, but
>>> usually they are reliably caught by KASAN. If any of them led to
>>> silent memory corruption, we would get dozens of assorted crashes
>>> throughout the kernel. We've seen that at some points, but not
>>> recently. So I would assume that memory is not corrupted in all these
>>> cases:
>>> https://syzkaller.appspot.com/bug?id=2503c576cabb08d41812e732b390141f01a59545
>>
>> This BUG clarifies the {+.+.}:
>>
>> 4 locks held by kworker/0:145/381:
>>  #0:  ((wq_completion)"hwsim_wq"){+.+.}, at: [<000000003f9487f0>] work_static include/linux/workqueue.h:198 [inline]
>>  #0:  ((wq_completion)"hwsim_wq"){+.+.}, at: [<000000003f9487f0>] set_work_data kernel/workqueue.c:619 [inline]
>>  #0:  ((wq_completion)"hwsim_wq"){+.+.}, at: [<000000003f9487f0>] set_work_pool_and_clear_pending kernel/workqueue.c:646 [inline]
>>  #0:  ((wq_completion)"hwsim_wq"){+.+.}, at: [<000000003f9487f0>] process_one_work+0xb12/0x1bb0 kernel/workqueue.c:2084
>>  #1:  ((work_completion)(&data->destroy_work)){+.+.}, at: [<00000000bbdd2115>] process_one_work+0xb89/0x1bb0 kernel/workqueue.c:2088
>>  #2:  (rtnl_mutex){+.+.}, at: [<000000009c9d14f8>] rtnl_lock+0x17/0x20 net/core/rtnetlink.c:74
>>  #3:  (rcu_sched_state.exp_mutex){+.+.}, at: [<000000001ba1a807>] exp_funnel_lock kernel/rcu/tree_exp.h:272 [inline]
>>  #3:  (rcu_sched_state.exp_mutex){+.+.}, at: [<000000001ba1a807>] _synchronize_rcu_expedited.constprop.72+0x9fa/0xac0 kernel/rcu/tree_exp.h:596
>>
>> There we have rtnl_mutex locked, and its {+.+.} looks just like the ones
>> above. It's definitely locked, since one more lock is taken after it.
>>
>> This BUG happens because there are many rtnl_mutex waiters while the owner
>> is synchronizing RCU. Rather clear to me, in comparison to this topic's hang.
> 
> 
> You mean that it's not hung, but rather needs more than 2 minutes to
> complete, right?

Yeah, I think this is possible; I've seen situations like that.
Suppose synchronize_rcu_expedited() executes for X seconds. Then it
takes just 120/X calls of "this function under rtnl_mutex" to cause
a soft lockup of someone else who wants the mutex too.

Also, despite CFS being a fair scheduler, when the calls are made
from a workqueue, every work item will cause a sleep, so every work
item gets executed in a separate worker task. Every worker will have
its own time slice. This increases the probability that these tasks
will take CPU time before the task at the head of the hang report.
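
To make the arithmetic concrete (illustrative numbers only; X is an
assumption here, not a measurement):

    holders_needed = 120s (hung_task threshold) / X
    X = 10s  ->  12 back-to-back rtnl_mutex holders
    X = 30s  ->   4 back-to-back rtnl_mutex holders

So a fairly small pile of queued works is already enough.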

Kirill


* Re: INFO: task hung in ip6gre_exit_batch_net
  2018-06-07 19:59               ` Kirill Tkhai
@ 2018-06-08  8:18                 ` Dmitry Vyukov
  2018-06-08  8:31                   ` Kirill Tkhai
  0 siblings, 1 reply; 13+ messages in thread
From: Dmitry Vyukov @ 2018-06-08  8:18 UTC (permalink / raw)
  To: Kirill Tkhai
  Cc: syzbot, Christian Brauner, David Miller, David Ahern,
	Florian Westphal, Jiri Benc, LKML, Xin Long, mschiffer, netdev,
	syzkaller-bugs, Vladislav Yasevich

On Thu, Jun 7, 2018 at 9:59 PM, Kirill Tkhai <ktkhai@virtuozzo.com> wrote:
> On 07.06.2018 22:03, Dmitry Vyukov wrote:
>> On Thu, Jun 7, 2018 at 8:54 PM, Kirill Tkhai <ktkhai@virtuozzo.com> wrote:
>>>>>>> Hi, Dmitry!
>>>>>>>
>>>>>>> On 04.06.2018 18:22, Dmitry Vyukov wrote:
>>>>>>>> On Mon, Jun 4, 2018 at 5:03 PM, syzbot
>>>>>>>> <syzbot+bf78a74f82c1cf19069e@syzkaller.appspotmail.com> wrote:
>>>>>>>>> Hello,
>>>>>>>>>
>>>>>>>>> syzbot found the following crash on:
>>>>>>>>>
>>>>>>>>> HEAD commit:    bc2dbc5420e8 Merge branch 'akpm' (patches from Andrew)
>>>>>>>>> git tree:       upstream
>>>>>>>>> console output: https://syzkaller.appspot.com/x/log.txt?x=164e42b7800000
>>>>>>>>> kernel config:  https://syzkaller.appspot.com/x/.config?x=982e2df1b9e60b02
>>>>>>>>> dashboard link: https://syzkaller.appspot.com/bug?extid=bf78a74f82c1cf19069e
>>>>>>>>> compiler:       gcc (GCC) 8.0.1 20180413 (experimental)
>>>>>>>>>
>>>>>>>>> Unfortunately, I don't have any reproducer for this crash yet.
>>>>>>>>>
>>>>>>>>> IMPORTANT: if you fix the bug, please add the following tag to the commit:
>>>>>>>>> Reported-by: syzbot+bf78a74f82c1cf19069e@syzkaller.appspotmail.com
>>>>>>>>
>>>>>>>> Another hang on rtnl lock:
>>>>>>>>
>>>>>>>> #syz dup: INFO: task hung in netdev_run_todo
>>>>>>>>
>>>>>>>> May be related to "unregister_netdevice: waiting for DEV to become free":
>>>>>>>> https://syzkaller.appspot.com/bug?id=1a97a5bd119fd97995f752819fd87840ab9479a9
>>>>>>
>>>>>> netdev_wait_allrefs does not hold rtnl lock during waiting, so it must
>>>>>> be something different.
>>>>>>
>>>>>>
>>>>>>>> Any other explanations for massive hangs on rtnl lock for minutes?
>>>>>>>
>>>>>>> To exclude the situation where a task exits with rtnl_mutex still held:
>>>>>>>
>>>>>>> would the pr_warn() output from print_held_locks_bug() be included in the
>>>>>>> console output if it appeared?
>>>>>>
>>>>>> Yes, everything containing "WARNING:" is detected as a bug.
>>>>>
>>>>> OK, then a dead task not releasing the lock is excluded.
>>>>>
>>>>> One more assumption: someone corrupted the memory around rtnl_mutex, so it only looks locked.
>>>>> (I read the lockdep "(rtnl_mutex){+.+.}" prints in the initial message as "nobody owns rtnl_mutex").
>>>>> A crash dump of the VM may help here.
>>>>
>>>> I can't find any legend for these +'s and .'s, but {+.+.} is present
>>>> in large amounts in just about any task-hung report for different mutexes,
>>>> so I would not expect it to mean corruption.
>>>>
>>>> There are dozens of known corruptions that syzkaller can trigger, but
>>>> usually they are reliably caught by KASAN. If any of them led to
>>>> silent memory corruption, we would get dozens of assorted crashes
>>>> throughout the kernel. We've seen that at some points, but not
>>>> recently. So I would assume that memory is not corrupted in all these
>>>> cases:
>>>> https://syzkaller.appspot.com/bug?id=2503c576cabb08d41812e732b390141f01a59545
>>>
>>> This BUG clarifies the {+.+.}:
>>>
>>> 4 locks held by kworker/0:145/381:
>>>  #0:  ((wq_completion)"hwsim_wq"){+.+.}, at: [<000000003f9487f0>] work_static include/linux/workqueue.h:198 [inline]
>>>  #0:  ((wq_completion)"hwsim_wq"){+.+.}, at: [<000000003f9487f0>] set_work_data kernel/workqueue.c:619 [inline]
>>>  #0:  ((wq_completion)"hwsim_wq"){+.+.}, at: [<000000003f9487f0>] set_work_pool_and_clear_pending kernel/workqueue.c:646 [inline]
>>>  #0:  ((wq_completion)"hwsim_wq"){+.+.}, at: [<000000003f9487f0>] process_one_work+0xb12/0x1bb0 kernel/workqueue.c:2084
>>>  #1:  ((work_completion)(&data->destroy_work)){+.+.}, at: [<00000000bbdd2115>] process_one_work+0xb89/0x1bb0 kernel/workqueue.c:2088
>>>  #2:  (rtnl_mutex){+.+.}, at: [<000000009c9d14f8>] rtnl_lock+0x17/0x20 net/core/rtnetlink.c:74
>>>  #3:  (rcu_sched_state.exp_mutex){+.+.}, at: [<000000001ba1a807>] exp_funnel_lock kernel/rcu/tree_exp.h:272 [inline]
>>>  #3:  (rcu_sched_state.exp_mutex){+.+.}, at: [<000000001ba1a807>] _synchronize_rcu_expedited.constprop.72+0x9fa/0xac0 kernel/rcu/tree_exp.h:596
>>>
>>> There we have rtnl_mutex locked, and its {+.+.} looks just like the ones
>>> above. It's definitely locked, since one more lock is taken after it.
>>>
>>> This BUG happens because there are many rtnl_mutex waiters while the owner
>>> is synchronizing RCU. Rather clear to me, in comparison to this topic's hang.
>>
>>
>> You mean that it's not hung, but rather needs more than 2 minutes to
>> complete, right?
>
> Yeah, I think this is possible; I've seen situations like that.
> Suppose synchronize_rcu_expedited() executes for X seconds. Then it
> takes just 120/X calls of "this function under rtnl_mutex" to cause
> a soft lockup of someone else who wants the mutex too.
>
> Also, despite CFS being a fair scheduler, when the calls are made
> from a workqueue, every work item will cause a sleep, so every work
> item gets executed in a separate worker task. Every worker will have
> its own time slice. This increases the probability that these tasks
> will take CPU time before the task at the head of the hang report.


OK, let's stick with this theory for now. Looking at the crash frequencies here:
https://syzkaller.appspot.com/bug?id=2503c576cabb08d41812e732b390141f01a59545
I can actually believe that these are just flakes due to too-slow execution.

I've noted that we need to either reduce the load or increase the timeouts:
https://github.com/google/syzkaller/issues/516#issuecomment-395685629

Let's keep duping such new reports onto "INFO: task hung in
netdev_run_todo" so that they are all collected at a single location.

Thanks for the help


* Re: INFO: task hung in ip6gre_exit_batch_net
  2018-06-08  8:18                 ` Dmitry Vyukov
@ 2018-06-08  8:31                   ` Kirill Tkhai
  2018-06-08  8:38                     ` Dmitry Vyukov
  0 siblings, 1 reply; 13+ messages in thread
From: Kirill Tkhai @ 2018-06-08  8:31 UTC (permalink / raw)
  To: Dmitry Vyukov
  Cc: syzbot, Christian Brauner, David Miller, David Ahern,
	Florian Westphal, Jiri Benc, LKML, Xin Long, mschiffer, netdev,
	syzkaller-bugs, Vladislav Yasevich

On 08.06.2018 11:18, Dmitry Vyukov wrote:
> On Thu, Jun 7, 2018 at 9:59 PM, Kirill Tkhai <ktkhai@virtuozzo.com> wrote:
>> On 07.06.2018 22:03, Dmitry Vyukov wrote:
>>> On Thu, Jun 7, 2018 at 8:54 PM, Kirill Tkhai <ktkhai@virtuozzo.com> wrote:
>>>>>>>> Hi, Dmitry!
>>>>>>>>
>>>>>>>> On 04.06.2018 18:22, Dmitry Vyukov wrote:
>>>>>>>>> On Mon, Jun 4, 2018 at 5:03 PM, syzbot
>>>>>>>>> <syzbot+bf78a74f82c1cf19069e@syzkaller.appspotmail.com> wrote:
>>>>>>>>>> Hello,
>>>>>>>>>>
>>>>>>>>>> syzbot found the following crash on:
>>>>>>>>>>
>>>>>>>>>> HEAD commit:    bc2dbc5420e8 Merge branch 'akpm' (patches from Andrew)
>>>>>>>>>> git tree:       upstream
>>>>>>>>>> console output: https://syzkaller.appspot.com/x/log.txt?x=164e42b7800000
>>>>>>>>>> kernel config:  https://syzkaller.appspot.com/x/.config?x=982e2df1b9e60b02
>>>>>>>>>> dashboard link: https://syzkaller.appspot.com/bug?extid=bf78a74f82c1cf19069e
>>>>>>>>>> compiler:       gcc (GCC) 8.0.1 20180413 (experimental)
>>>>>>>>>>
>>>>>>>>>> Unfortunately, I don't have any reproducer for this crash yet.
>>>>>>>>>>
>>>>>>>>>> IMPORTANT: if you fix the bug, please add the following tag to the commit:
>>>>>>>>>> Reported-by: syzbot+bf78a74f82c1cf19069e@syzkaller.appspotmail.com
>>>>>>>>>
>>>>>>>>> Another hang on rtnl lock:
>>>>>>>>>
>>>>>>>>> #syz dup: INFO: task hung in netdev_run_todo
>>>>>>>>>
>>>>>>>>> May be related to "unregister_netdevice: waiting for DEV to become free":
>>>>>>>>> https://syzkaller.appspot.com/bug?id=1a97a5bd119fd97995f752819fd87840ab9479a9
>>>>>>>
>>>>>>> netdev_wait_allrefs does not hold rtnl lock during waiting, so it must
>>>>>>> be something different.
>>>>>>>
>>>>>>>
>>>>>>>>> Any other explanations for massive hangs on rtnl lock for minutes?
>>>>>>>>
>>>>>>>> To exclude the situation where a task exits with rtnl_mutex still held:
>>>>>>>>
>>>>>>>> would the pr_warn() output from print_held_locks_bug() be included in the
>>>>>>>> console output if it appeared?
>>>>>>>
>>>>>>> Yes, everything containing "WARNING:" is detected as a bug.
>>>>>>
>>>>>> OK, then a dead task not releasing the lock is excluded.
>>>>>>
>>>>>> One more assumption: someone corrupted the memory around rtnl_mutex, so it only looks locked.
>>>>>> (I read the lockdep "(rtnl_mutex){+.+.}" prints in the initial message as "nobody owns rtnl_mutex").
>>>>>> A crash dump of the VM may help here.
>>>>>
>>>>> I can't find any legend for these +'s and .'s, but {+.+.} is present
>>>>> in large amounts in just about any task-hung report for different mutexes,
>>>>> so I would not expect it to mean corruption.
>>>>>
>>>>> There are dozens of known corruptions that syzkaller can trigger, but
>>>>> usually they are reliably caught by KASAN. If any of them led to
>>>>> silent memory corruption, we would get dozens of assorted crashes
>>>>> throughout the kernel. We've seen that at some points, but not
>>>>> recently. So I would assume that memory is not corrupted in all these
>>>>> cases:
>>>>> https://syzkaller.appspot.com/bug?id=2503c576cabb08d41812e732b390141f01a59545
>>>>
>>>> This BUG clarifies the {+.+.}:
>>>>
>>>> 4 locks held by kworker/0:145/381:
>>>>  #0:  ((wq_completion)"hwsim_wq"){+.+.}, at: [<000000003f9487f0>] work_static include/linux/workqueue.h:198 [inline]
>>>>  #0:  ((wq_completion)"hwsim_wq"){+.+.}, at: [<000000003f9487f0>] set_work_data kernel/workqueue.c:619 [inline]
>>>>  #0:  ((wq_completion)"hwsim_wq"){+.+.}, at: [<000000003f9487f0>] set_work_pool_and_clear_pending kernel/workqueue.c:646 [inline]
>>>>  #0:  ((wq_completion)"hwsim_wq"){+.+.}, at: [<000000003f9487f0>] process_one_work+0xb12/0x1bb0 kernel/workqueue.c:2084
>>>>  #1:  ((work_completion)(&data->destroy_work)){+.+.}, at: [<00000000bbdd2115>] process_one_work+0xb89/0x1bb0 kernel/workqueue.c:2088
>>>>  #2:  (rtnl_mutex){+.+.}, at: [<000000009c9d14f8>] rtnl_lock+0x17/0x20 net/core/rtnetlink.c:74
>>>>  #3:  (rcu_sched_state.exp_mutex){+.+.}, at: [<000000001ba1a807>] exp_funnel_lock kernel/rcu/tree_exp.h:272 [inline]
>>>>  #3:  (rcu_sched_state.exp_mutex){+.+.}, at: [<000000001ba1a807>] _synchronize_rcu_expedited.constprop.72+0x9fa/0xac0 kernel/rcu/tree_exp.h:596
>>>>
>>>> There we have rtnl_mutex locked, and its {+.+.} looks just like the ones
>>>> above. It's definitely locked, since one more lock is taken after it.
>>>>
>>>> This BUG happens because there are many rtnl_mutex waiters while the owner
>>>> is synchronizing RCU. Rather clear to me, in comparison to this topic's hang.
>>>
>>>
>>> You mean that it's not hung, but rather needs more than 2 minutes to
>>> complete, right?
>>
>> Yeah, I think this is possible; I've seen situations like that.
>> Suppose synchronize_rcu_expedited() executes for X seconds. Then it
>> takes just 120/X calls of "this function under rtnl_mutex" to cause
>> a soft lockup of someone else who wants the mutex too.
>>
>> Also, despite CFS being a fair scheduler, when the calls are made
>> from a workqueue, every work item will cause a sleep, so every work
>> item gets executed in a separate worker task. Every worker will have
>> its own time slice. This increases the probability that these tasks
>> will take CPU time before the task at the head of the hang report.
> 
> 
> OK, let's stick with this theory for now. Looking at the crash frequencies here:
> https://syzkaller.appspot.com/bug?id=2503c576cabb08d41812e732b390141f01a59545
> I can actually believe that these are just flakes due to too-slow execution.
> 
> I've noted that we need either reduce load and/or increase timeouts:
> https://github.com/google/syzkaller/issues/516#issuecomment-395685629

Hm, I'm not sure we should hide such situations from syzbot,
because load like this may occur in real life under real workloads.
These statistics may help us understand whether rtnl_mutex already
needs a redesign. Also, these hangs may happen in a place that can
be rewritten without rtnl_mutex, so they focus attention on it.
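
[Editorial aside: to make the 120/X arithmetic quoted above concrete, here is
a minimal sketch of the starvation pattern being described. This is
hypothetical illustration code, not taken from any real driver; the work
function name is invented:

    #include <linux/workqueue.h>
    #include <linux/rtnetlink.h>
    #include <linux/rcupdate.h>

    /*
     * Each work item pins rtnl_mutex across an expedited RCU grace
     * period.  If one grace period takes X seconds on a heavily loaded
     * VM, then roughly 120/X such works queued back to back keep the
     * mutex occupied long enough that an unrelated rtnl_lock() waiter
     * trips the 120-second hung-task detector, even though nothing is
     * actually deadlocked.
     */
    static void slow_rtnl_work(struct work_struct *work)
    {
            rtnl_lock();
            synchronize_rcu_expedited();    /* the X-second step */
            rtnl_unlock();
    }

Because each work sleeps, every queued instance runs in its own worker task
with its own time slice, which is the point above about workers crowding out
the reported victim. With X = 2 seconds, for example, 60 such works suffice
to starve every other rtnl_lock() caller for the full two minutes.]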

> Let's keep duping new reports like this onto "INFO: task hung in
> netdev_run_todo" so that they are all collected in a single location.
> 
> Thanks for the help

Thanks,
Kirill


* Re: INFO: task hung in ip6gre_exit_batch_net
  2018-06-08  8:31                   ` Kirill Tkhai
@ 2018-06-08  8:38                     ` Dmitry Vyukov
  2018-06-08  8:43                       ` Kirill Tkhai
  0 siblings, 1 reply; 13+ messages in thread
From: Dmitry Vyukov @ 2018-06-08  8:38 UTC (permalink / raw)
  To: Kirill Tkhai
  Cc: syzbot, Christian Brauner, David Miller, David Ahern,
	Florian Westphal, Jiri Benc, LKML, Xin Long, mschiffer, netdev,
	syzkaller-bugs, Vladislav Yasevich

On Fri, Jun 8, 2018 at 10:31 AM, Kirill Tkhai <ktkhai@virtuozzo.com> wrote:
>>>>>>>>> Hi, Dmitry!
>>>>>>>>>
>>>>>>>>> On 04.06.2018 18:22, Dmitry Vyukov wrote:
>>>>>>>>>> On Mon, Jun 4, 2018 at 5:03 PM, syzbot
>>>>>>>>>> <syzbot+bf78a74f82c1cf19069e@syzkaller.appspotmail.com> wrote:
>>>>>>>>>>> Hello,
>>>>>>>>>>>
>>>>>>>>>>> syzbot found the following crash on:
>>>>>>>>>>>
>>>>>>>>>>> HEAD commit:    bc2dbc5420e8 Merge branch 'akpm' (patches from Andrew)
>>>>>>>>>>> git tree:       upstream
>>>>>>>>>>> console output: https://syzkaller.appspot.com/x/log.txt?x=164e42b7800000
>>>>>>>>>>> kernel config:  https://syzkaller.appspot.com/x/.config?x=982e2df1b9e60b02
>>>>>>>>>>> dashboard link: https://syzkaller.appspot.com/bug?extid=bf78a74f82c1cf19069e
>>>>>>>>>>> compiler:       gcc (GCC) 8.0.1 20180413 (experimental)
>>>>>>>>>>>
>>>>>>>>>>> Unfortunately, I don't have any reproducer for this crash yet.
>>>>>>>>>>>
>>>>>>>>>>> IMPORTANT: if you fix the bug, please add the following tag to the commit:
>>>>>>>>>>> Reported-by: syzbot+bf78a74f82c1cf19069e@syzkaller.appspotmail.com
>>>>>>>>>>
>>>>>>>>>> Another hang on rtnl lock:
>>>>>>>>>>
>>>>>>>>>> #syz dup: INFO: task hung in netdev_run_todo
>>>>>>>>>>
>>>>>>>>>> May be related to "unregister_netdevice: waiting for DEV to become free":
>>>>>>>>>> https://syzkaller.appspot.com/bug?id=1a97a5bd119fd97995f752819fd87840ab9479a9
>>>>>>>>
>>>>>>>> netdev_wait_allrefs does not hold rtnl lock during waiting, so it must
>>>>>>>> be something different.
>>>>>>>>
>>>>>>>>
>>>>>>>>>> Any other explanations for massive hangs on rtnl lock for minutes?
>>>>>>>>>
>>>>>>>>> To exclude the situation where a task exits with rtnl_mutex held:
>>>>>>>>>
>>>>>>>>> would the pr_warn() messages from print_held_locks_bug() be included in the
>>>>>>>>> console output if they appear?
>>>>>>>>
>>>>>>>> Yes, everything containing "WARNING:" is detected as a bug.
>>>>>>>
>>>>>>> OK, then dead task not releasing the lock is excluded.
>>>>>>>
>>>>>>> One more assumption: someone corrupted the memory around rtnl_mutex, so it merely looks locked.
>>>>>>> (I read the lockdep "(rtnl_mutex){+.+.}" prints in the initial message as "nobody owns rtnl_mutex".)
>>>>>>> A crash dump of the VM might help here.
>>>>>>
>>>>>> I can't find any legend for these +'s and .'s, but {+.+.} shows up
>>>>>> in large numbers in virtually every task-hung report for different mutexes,
>>>>>> so I would not expect it to mean corruption.
>>>>>>
>>>>>> There are dozens of known corruptions that syzkaller can trigger, but
>>>>>> they are usually reliably caught by KASAN. If any of them led to
>>>>>> silent memory corruption, we would see dozens of assorted crashes
>>>>>> throughout the kernel. We've seen that at some points, but not
>>>>>> recently. So I would assume that memory is not corrupted in all these
>>>>>> cases:
>>>>>> https://syzkaller.appspot.com/bug?id=2503c576cabb08d41812e732b390141f01a59545
>>>>>
>>>>> This BUG clarifies the {+.+.}:
>>>>>
>>>>> 4 locks held by kworker/0:145/381:
>>>>>  #0:  ((wq_completion)"hwsim_wq"){+.+.}, at: [<000000003f9487f0>] work_static include/linux/workqueue.h:198 [inline]
>>>>>  #0:  ((wq_completion)"hwsim_wq"){+.+.}, at: [<000000003f9487f0>] set_work_data kernel/workqueue.c:619 [inline]
>>>>>  #0:  ((wq_completion)"hwsim_wq"){+.+.}, at: [<000000003f9487f0>] set_work_pool_and_clear_pending kernel/workqueue.c:646 [inline]
>>>>>  #0:  ((wq_completion)"hwsim_wq"){+.+.}, at: [<000000003f9487f0>] process_one_work+0xb12/0x1bb0 kernel/workqueue.c:2084
>>>>>  #1:  ((work_completion)(&data->destroy_work)){+.+.}, at: [<00000000bbdd2115>] process_one_work+0xb89/0x1bb0 kernel/workqueue.c:2088
>>>>>  #2:  (rtnl_mutex){+.+.}, at: [<000000009c9d14f8>] rtnl_lock+0x17/0x20 net/core/rtnetlink.c:74
>>>>>  #3:  (rcu_sched_state.exp_mutex){+.+.}, at: [<000000001ba1a807>] exp_funnel_lock kernel/rcu/tree_exp.h:272 [inline]
>>>>>  #3:  (rcu_sched_state.exp_mutex){+.+.}, at: [<000000001ba1a807>] _synchronize_rcu_expedited.constprop.72+0x9fa/0xac0 kernel/rcu/tree_exp.h:596
>>>>>
>>>>> There we have rtnl_mutex locked, and the {+.+.} is the same as above. It's definitely
>>>>> locked, since one more lock is taken after it.
>>>>>
>>>>> This BUG happens because there are many rtnl_mutex waiters while the owner
>>>>> is synchronizing RCU. Rather clear to me, in comparison to this topic's hang.
>>>>
>>>>
>>>> You mean that it's not hung, but rather needs more than 2 minutes to
>>>> complete, right?
>>>
>>> Yeah, I think this is possible; I've seen situations like that.
>>> Suppose synchronize_rcu_expedited() executes for X seconds. Then
>>> it takes just 120/X calls of "this function under rtnl_mutex" to cause
>>> a soft lockup of someone else who wants the mutex too.
>>>
>>> Also, even though CFS is a fair scheduler, when the calls are
>>> made from a workqueue, every work item will sleep, so every work item
>>> will be executed in a separate worker task. Every worker will have its
>>> own time slice. This increases the probability that these tasks will
>>> take CPU time before the task in the header of the hang.
>>
>>
>> OK, let's stick with this theory for now. Looking at the crash frequencies here:
>> https://syzkaller.appspot.com/bug?id=2503c576cabb08d41812e732b390141f01a59545
>> I can actually believe that these are just flakes due to overly slow execution.
>>
>> I've noted that we need to either reduce load and/or increase timeouts:
>> https://github.com/google/syzkaller/issues/516#issuecomment-395685629
>
> Hm, I'm not sure we should hide such situations from syzbot,
> because load like this may occur in real life under real workloads.
> These statistics may help us understand whether rtnl_mutex already
> needs a redesign. Also, these hangs may happen in a place that can
> be rewritten without rtnl_mutex, so they focus attention on it.

If somebody wants to act on these reports:
https://syzkaller.appspot.com/bug?id=2503c576cabb08d41812e732b390141f01a59545
that's even better. The point is that testing must not have false
positives, one way or another. If we do nothing, then syzbot will
slowly discover all 250 usages of rtnl_lock and produce a unique bug
report for each of them. Each and every one of these bug reports will
need to be handled by somebody.

Does somebody want to act on these and improve rtnl performance in the
foreseeable future?


* Re: INFO: task hung in ip6gre_exit_batch_net
  2018-06-08  8:38                     ` Dmitry Vyukov
@ 2018-06-08  8:43                       ` Kirill Tkhai
  0 siblings, 0 replies; 13+ messages in thread
From: Kirill Tkhai @ 2018-06-08  8:43 UTC (permalink / raw)
  To: Dmitry Vyukov
  Cc: syzbot, Christian Brauner, David Miller, David Ahern,
	Florian Westphal, Jiri Benc, LKML, Xin Long, mschiffer, netdev,
	syzkaller-bugs, Vladislav Yasevich

On 08.06.2018 11:38, Dmitry Vyukov wrote:
> On Fri, Jun 8, 2018 at 10:31 AM, Kirill Tkhai <ktkhai@virtuozzo.com> wrote:
>>>>>>>>>> Hi, Dmitry!
>>>>>>>>>>
>>>>>>>>>> On 04.06.2018 18:22, Dmitry Vyukov wrote:
>>>>>>>>>>> On Mon, Jun 4, 2018 at 5:03 PM, syzbot
>>>>>>>>>>> <syzbot+bf78a74f82c1cf19069e@syzkaller.appspotmail.com> wrote:
>>>>>>>>>>>> Hello,
>>>>>>>>>>>>
>>>>>>>>>>>> syzbot found the following crash on:
>>>>>>>>>>>>
>>>>>>>>>>>> HEAD commit:    bc2dbc5420e8 Merge branch 'akpm' (patches from Andrew)
>>>>>>>>>>>> git tree:       upstream
>>>>>>>>>>>> console output: https://syzkaller.appspot.com/x/log.txt?x=164e42b7800000
>>>>>>>>>>>> kernel config:  https://syzkaller.appspot.com/x/.config?x=982e2df1b9e60b02
>>>>>>>>>>>> dashboard link: https://syzkaller.appspot.com/bug?extid=bf78a74f82c1cf19069e
>>>>>>>>>>>> compiler:       gcc (GCC) 8.0.1 20180413 (experimental)
>>>>>>>>>>>>
>>>>>>>>>>>> Unfortunately, I don't have any reproducer for this crash yet.
>>>>>>>>>>>>
>>>>>>>>>>>> IMPORTANT: if you fix the bug, please add the following tag to the commit:
>>>>>>>>>>>> Reported-by: syzbot+bf78a74f82c1cf19069e@syzkaller.appspotmail.com
>>>>>>>>>>>
>>>>>>>>>>> Another hang on rtnl lock:
>>>>>>>>>>>
>>>>>>>>>>> #syz dup: INFO: task hung in netdev_run_todo
>>>>>>>>>>>
>>>>>>>>>>> May be related to "unregister_netdevice: waiting for DEV to become free":
>>>>>>>>>>> https://syzkaller.appspot.com/bug?id=1a97a5bd119fd97995f752819fd87840ab9479a9
>>>>>>>>>
>>>>>>>>> netdev_wait_allrefs does not hold rtnl lock during waiting, so it must
>>>>>>>>> be something different.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>>> Any other explanations for massive hangs on rtnl lock for minutes?
>>>>>>>>>>
>>>>>>>>>> To exclude the situation where a task exits with rtnl_mutex held:
>>>>>>>>>>
>>>>>>>>>> would the pr_warn() messages from print_held_locks_bug() be included in the
>>>>>>>>>> console output if they appear?
>>>>>>>>>
>>>>>>>>> Yes, everything containing "WARNING:" is detected as a bug.
>>>>>>>>
>>>>>>>> OK, then dead task not releasing the lock is excluded.
>>>>>>>>
>>>>>>>> One more assumption: someone corrupted the memory around rtnl_mutex, so it merely looks locked.
>>>>>>>> (I read the lockdep "(rtnl_mutex){+.+.}" prints in the initial message as "nobody owns rtnl_mutex".)
>>>>>>>> A crash dump of the VM might help here.
>>>>>>>
>>>>>>> I can't find any legend for these +'s and .'s, but {+.+.} shows up
>>>>>>> in large numbers in virtually every task-hung report for different mutexes,
>>>>>>> so I would not expect it to mean corruption.
>>>>>>>
>>>>>>> There are dozens of known corruptions that syzkaller can trigger, but
>>>>>>> they are usually reliably caught by KASAN. If any of them led to
>>>>>>> silent memory corruption, we would see dozens of assorted crashes
>>>>>>> throughout the kernel. We've seen that at some points, but not
>>>>>>> recently. So I would assume that memory is not corrupted in all these
>>>>>>> cases:
>>>>>>> https://syzkaller.appspot.com/bug?id=2503c576cabb08d41812e732b390141f01a59545
>>>>>>
>>>>>> This BUG clarifies the {+.+.}:
>>>>>>
>>>>>> 4 locks held by kworker/0:145/381:
>>>>>>  #0:  ((wq_completion)"hwsim_wq"){+.+.}, at: [<000000003f9487f0>] work_static include/linux/workqueue.h:198 [inline]
>>>>>>  #0:  ((wq_completion)"hwsim_wq"){+.+.}, at: [<000000003f9487f0>] set_work_data kernel/workqueue.c:619 [inline]
>>>>>>  #0:  ((wq_completion)"hwsim_wq"){+.+.}, at: [<000000003f9487f0>] set_work_pool_and_clear_pending kernel/workqueue.c:646 [inline]
>>>>>>  #0:  ((wq_completion)"hwsim_wq"){+.+.}, at: [<000000003f9487f0>] process_one_work+0xb12/0x1bb0 kernel/workqueue.c:2084
>>>>>>  #1:  ((work_completion)(&data->destroy_work)){+.+.}, at: [<00000000bbdd2115>] process_one_work+0xb89/0x1bb0 kernel/workqueue.c:2088
>>>>>>  #2:  (rtnl_mutex){+.+.}, at: [<000000009c9d14f8>] rtnl_lock+0x17/0x20 net/core/rtnetlink.c:74
>>>>>>  #3:  (rcu_sched_state.exp_mutex){+.+.}, at: [<000000001ba1a807>] exp_funnel_lock kernel/rcu/tree_exp.h:272 [inline]
>>>>>>  #3:  (rcu_sched_state.exp_mutex){+.+.}, at: [<000000001ba1a807>] _synchronize_rcu_expedited.constprop.72+0x9fa/0xac0 kernel/rcu/tree_exp.h:596
>>>>>>
>>>>>> There we have rtnl_mutex locked, and the {+.+.} is the same as above. It's definitely
>>>>>> locked, since one more lock is taken after it.
>>>>>>
>>>>>> This BUG happens because there are many rtnl_mutex waiters while the owner
>>>>>> is synchronizing RCU. Rather clear to me, in comparison to this topic's hang.
>>>>>
>>>>>
>>>>> You mean that it's not hung, but rather needs more than 2 minutes to
>>>>> complete, right?
>>>>
>>>> Yeah, I think this is possible; I've seen situations like that.
>>>> Suppose synchronize_rcu_expedited() executes for X seconds. Then
>>>> it takes just 120/X calls of "this function under rtnl_mutex" to cause
>>>> a soft lockup of someone else who wants the mutex too.
>>>>
>>>> Also, even though CFS is a fair scheduler, when the calls are
>>>> made from a workqueue, every work item will sleep, so every work item
>>>> will be executed in a separate worker task. Every worker will have its
>>>> own time slice. This increases the probability that these tasks will
>>>> take CPU time before the task in the header of the hang.
>>>
>>>
>>> OK, let's stick with this theory for now. Looking at the crash frequencies here:
>>> https://syzkaller.appspot.com/bug?id=2503c576cabb08d41812e732b390141f01a59545
>>> I can actually believe that these are just flakes due to overly slow execution.
>>>
>>> I've noted that we need to either reduce load and/or increase timeouts:
>>> https://github.com/google/syzkaller/issues/516#issuecomment-395685629
>>
>> Hm, I'm not sure we should hide such situations from syzbot,
>> because load like this may occur in real life under real workloads.
>> These statistics may help us understand whether rtnl_mutex already
>> needs a redesign. Also, these hangs may happen in a place that can
>> be rewritten without rtnl_mutex, so they focus attention on it.
> 
> If somebody wants to act on these reports:
> https://syzkaller.appspot.com/bug?id=2503c576cabb08d41812e732b390141f01a59545
> that's even better. The point is that testing must not have false
> positives, one way or another. If we do nothing, then syzbot will
> slowly discover all 250 usages of rtnl_lock and produce a unique bug
> report for each of them. Each and every one of these bug reports will
> need to be handled by somebody.
> 
> Does somebody want to act on these and improve rtnl performance in the
> foreseeable future?

I just analyzed this question a little, and it looks like the
preparations alone for improving the performance will take a very long
time, and the performance won't change until the preparations are
finished. So this does not look like "the foreseeable future".
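
[Editorial aside: a sketch of why the preparations are large. Today every
caller takes one global lock, with no way to express a narrower scope:

    rtnl_lock();
    /* may touch any net device in any namespace */
    rtnl_unlock();

A finer-grained design -- purely hypothetical here, the function names are
invented -- would have to thread a namespace through every call site:

    rtnl_net_lock(net);
    /* may only touch devices in this netns */
    rtnl_net_unlock(net);

and each of the hundreds of rtnl_lock() sites would first have to be
audited to prove it never reaches outside its own namespace, which is the
kind of long preparatory work referred to above.]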

Kirill


end of thread

Thread overview: 13+ messages
2018-06-04 15:03 INFO: task hung in ip6gre_exit_batch_net syzbot
2018-06-04 15:22 ` Dmitry Vyukov
2018-06-05  9:03   ` Kirill Tkhai
2018-06-05  9:36     ` Dmitry Vyukov
2018-06-05 13:55       ` Kirill Tkhai
2018-06-07 18:23         ` Dmitry Vyukov
2018-06-07 18:54           ` Kirill Tkhai
2018-06-07 19:03             ` Dmitry Vyukov
2018-06-07 19:59               ` Kirill Tkhai
2018-06-08  8:18                 ` Dmitry Vyukov
2018-06-08  8:31                   ` Kirill Tkhai
2018-06-08  8:38                     ` Dmitry Vyukov
2018-06-08  8:43                       ` Kirill Tkhai
