All of lore.kernel.org
 help / color / mirror / Atom feed
* [syzbot] WARNING in call_timer_fn
@ 2022-11-16 15:25 syzbot
  2022-11-16 18:29 ` Thomas Gleixner
       [not found] ` <20221117024511.3606-1-hdanton@sina.com>
  0 siblings, 2 replies; 10+ messages in thread
From: syzbot @ 2022-11-16 15:25 UTC (permalink / raw)
  To: ben-linux, bp, daniel.sneddon, dave.hansen, hpa, linux-kernel,
	mingo, netdev, pbonzini, sathyanarayanan.kuppuswamy,
	syzkaller-bugs, tglx, x86

Hello,

syzbot found the following issue on:

HEAD commit:    7eba4505394e net: dcb: move getapptrust to separate function
git tree:       net-next
console output: https://syzkaller.appspot.com/x/log.txt?x=16626ce9880000
kernel config:  https://syzkaller.appspot.com/x/.config?x=893a728fb1a6b263
dashboard link: https://syzkaller.appspot.com/bug?extid=6fb78d577e89e69602f9
compiler:       gcc (Debian 10.2.1-6) 10.2.1 20210110, GNU ld (GNU Binutils for Debian) 2.35.2
syz repro:      https://syzkaller.appspot.com/x/repro.syz?x=16899f35880000

Downloadable assets:
disk image: https://storage.googleapis.com/syzbot-assets/68b1a18c1aad/disk-7eba4505.raw.xz
vmlinux: https://storage.googleapis.com/syzbot-assets/f03970850c94/vmlinux-7eba4505.xz
kernel image: https://storage.googleapis.com/syzbot-assets/4775c120c4c6/bzImage-7eba4505.xz

IMPORTANT: if you fix the issue, please add the following tag to the commit:
Reported-by: syzbot+6fb78d577e89e69602f9@syzkaller.appspotmail.com

------------[ cut here ]------------
WARNING: CPU: 0 PID: 5978 at kernel/workqueue.c:1438 __queue_work+0xf70/0x13b0 kernel/workqueue.c:1438
Modules linked in:
CPU: 0 PID: 5978 Comm: syz-executor.3 Not tainted 6.1.0-rc4-syzkaller-01070-g7eba4505394e #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/26/2022
RIP: 0010:__queue_work+0xf70/0x13b0 kernel/workqueue.c:1438
Code: e0 07 83 c0 03 38 d0 7c 09 84 d2 74 05 e8 e8 64 7b 00 8b 5b 2c 31 ff 83 e3 20 89 de e8 09 5e 2e 00 85 db 75 42 e8 50 61 2e 00 <0f> 0b e9 7e f7 ff ff e8 44 61 2e 00 0f 0b e9 10 f7 ff ff e8 38 61
RSP: 0018:ffffc90000007c98 EFLAGS: 00010046
RAX: 0000000000000000 RBX: 0000000000000100 RCX: 0000000000000100
RDX: ffff888022ba9d40 RSI: ffffffff8151be20 RDI: 0000000000000005
RBP: ffffc90000007d60 R08: 0000000000000005 R09: 0000000000000000
R10: 0000000000000100 R11: 0000000000000001 R12: ffff88806ccf4b30
R13: ffff88806ccf4b78 R14: 0000000000000000 R15: ffff88807568f000
FS:  0000000000000000(0000) GS:ffff8880b9a00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00005555558e1708 CR3: 0000000076b84000 CR4: 00000000003506f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 <IRQ>
 call_timer_fn+0x1da/0x7c0 kernel/time/timer.c:1474
 expire_timers kernel/time/timer.c:1514 [inline]
 __run_timers.part.0+0x4a3/0xaf0 kernel/time/timer.c:1790
 __run_timers kernel/time/timer.c:1768 [inline]
 run_timer_softirq+0xb7/0x1d0 kernel/time/timer.c:1803
 __do_softirq+0x1fb/0xadc kernel/softirq.c:571
 invoke_softirq kernel/softirq.c:445 [inline]
 __irq_exit_rcu+0x123/0x180 kernel/softirq.c:650
 irq_exit_rcu+0x9/0x20 kernel/softirq.c:662
 sysvec_apic_timer_interrupt+0x97/0xc0 arch/x86/kernel/apic/apic.c:1107
 </IRQ>
 <TASK>
 asm_sysvec_apic_timer_interrupt+0x1a/0x20 arch/x86/include/asm/idtentry.h:649
RIP: 0010:lock_acquire+0x227/0x630 kernel/locking/lockdep.c:5636
Code: 76 9f 7e 83 f8 01 0f 85 3a 03 00 00 9c 58 f6 c4 02 0f 85 25 03 00 00 48 83 7c 24 08 00 74 01 fb 48 b8 00 00 00 00 00 fc ff df <48> 01 c3 48 c7 03 00 00 00 00 48 c7 43 08 00 00 00 00 48 8b 84 24
RSP: 0018:ffffc90006f678d0 EFLAGS: 00000206
RAX: dffffc0000000000 RBX: 1ffff92000decf1d RCX: 0000000000000001
RDX: 1ffff110045754f1 RSI: 0000000000000000 RDI: 0000000000000000
RBP: 0000000000000001 R08: 0000000000000001 R09: ffffffff912a75ef
R10: fffffbfff2254ebd R11: 1ffffffff212a1df R12: 0000000000000000
R13: 0000000000000000 R14: ffff88807568f138 R15: 0000000000000000
 __flush_workqueue+0x118/0x13a0 kernel/workqueue.c:2809
 drain_workqueue+0x1a9/0x3c0 kernel/workqueue.c:2974
 hci_dev_close_sync+0x2f3/0x1200 net/bluetooth/hci_sync.c:4809
 hci_dev_do_close+0x31/0x70 net/bluetooth/hci_core.c:554
 hci_unregister_dev+0x183/0x4e0 net/bluetooth/hci_core.c:2702
 vhci_release+0x80/0xf0 drivers/bluetooth/hci_vhci.c:568
 __fput+0x27c/0xa90 fs/file_table.c:320
 task_work_run+0x16f/0x270 kernel/task_work.c:179
 exit_task_work include/linux/task_work.h:38 [inline]
 do_exit+0xb3d/0x2a30 kernel/exit.c:820
 do_group_exit+0xd4/0x2a0 kernel/exit.c:950
 __do_sys_exit_group kernel/exit.c:961 [inline]
 __se_sys_exit_group kernel/exit.c:959 [inline]
 __x64_sys_exit_group+0x3e/0x50 kernel/exit.c:959
 do_syscall_x64 arch/x86/entry/common.c:50 [inline]
 do_syscall_64+0x39/0xb0 arch/x86/entry/common.c:80
 entry_SYSCALL_64_after_hwframe+0x63/0xcd
RIP: 0033:0x7f3d8a68b639
Code: Unable to access opcode bytes at 0x7f3d8a68b60f.
RSP: 002b:00007ffd554b9878 EFLAGS: 00000246 ORIG_RAX: 00000000000000e7
RAX: ffffffffffffffda RBX: 0000000000000029 RCX: 00007f3d8a68b639
RDX: 00007f3d8a68cc8a RSI: 0000000000000000 RDI: 0000000000000007
RBP: 0000000000000007 R08: ffffffffff000000 R09: 0000000000000029
R10: 00000000000003b8 R11: 0000000000000246 R12: 00007ffd554b9f00
R13: 0000000000000003 R14: 00007ffd554b9e9c R15: 00007f3d8a782b60
 </TASK>
----------------
Code disassembly (best guess):
   0:	76 9f                	jbe    0xffffffa1
   2:	7e 83                	jle    0xffffff87
   4:	f8                   	clc
   5:	01 0f                	add    %ecx,(%rdi)
   7:	85 3a                	test   %edi,(%rdx)
   9:	03 00                	add    (%rax),%eax
   b:	00 9c 58 f6 c4 02 0f 	add    %bl,0xf02c4f6(%rax,%rbx,2)
  12:	85 25 03 00 00 48    	test   %esp,0x48000003(%rip)        # 0x4800001b
  18:	83 7c 24 08 00       	cmpl   $0x0,0x8(%rsp)
  1d:	74 01                	je     0x20
  1f:	fb                   	sti
  20:	48 b8 00 00 00 00 00 	movabs $0xdffffc0000000000,%rax
  27:	fc ff df
* 2a:	48 01 c3             	add    %rax,%rbx <-- trapping instruction
  2d:	48 c7 03 00 00 00 00 	movq   $0x0,(%rbx)
  34:	48 c7 43 08 00 00 00 	movq   $0x0,0x8(%rbx)
  3b:	00
  3c:	48                   	rex.W
  3d:	8b                   	.byte 0x8b
  3e:	84                   	.byte 0x84
  3f:	24                   	.byte 0x24


---
This report is generated by a bot. It may contain errors.
See https://goo.gl/tpsmEJ for more information about syzbot.
syzbot engineers can be reached at syzkaller@googlegroups.com.

syzbot will keep track of this issue. See:
https://goo.gl/tpsmEJ#status for how to communicate with syzbot.
syzbot can test patches for this issue, for details see:
https://goo.gl/tpsmEJ#testing-patches

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [syzbot] WARNING in call_timer_fn
  2022-11-16 15:25 [syzbot] WARNING in call_timer_fn syzbot
@ 2022-11-16 18:29 ` Thomas Gleixner
       [not found] ` <20221117024511.3606-1-hdanton@sina.com>
  1 sibling, 0 replies; 10+ messages in thread
From: Thomas Gleixner @ 2022-11-16 18:29 UTC (permalink / raw)
  To: syzbot, ben-linux, bp, daniel.sneddon, dave.hansen, hpa,
	linux-kernel, mingo, netdev, pbonzini,
	sathyanarayanan.kuppuswamy, syzkaller-bugs, x86, Steven Rostedt,
	Marcel Holtmann, Luiz Augusto von Dentz

On Wed, Nov 16 2022 at 07:25, syzbot wrote:
> HEAD commit:    7eba4505394e net: dcb: move getapptrust to separate function
> git tree:       net-next
> console output: https://syzkaller.appspot.com/x/log.txt?x=16626ce9880000
> kernel config:  https://syzkaller.appspot.com/x/.config?x=893a728fb1a6b263
> dashboard link: https://syzkaller.appspot.com/bug?extid=6fb78d577e89e69602f9
> compiler:       gcc (Debian 10.2.1-6) 10.2.1 20210110, GNU ld (GNU Binutils for Debian) 2.35.2
> syz repro:      https://syzkaller.appspot.com/x/repro.syz?x=16899f35880000
>
> Downloadable assets:
> disk image: https://storage.googleapis.com/syzbot-assets/68b1a18c1aad/disk-7eba4505.raw.xz
> vmlinux: https://storage.googleapis.com/syzbot-assets/f03970850c94/vmlinux-7eba4505.xz
> kernel image: https://storage.googleapis.com/syzbot-assets/4775c120c4c6/bzImage-7eba4505.xz
>
> IMPORTANT: if you fix the issue, please add the following tag to the commit:
> Reported-by: syzbot+6fb78d577e89e69602f9@syzkaller.appspotmail.com
>
> ------------[ cut here ]------------
> WARNING: CPU: 0 PID: 5978 at kernel/workqueue.c:1438 __queue_work+0xf70/0x13b0 kernel/workqueue.c:1438
> Modules linked in:
> CPU: 0 PID: 5978 Comm: syz-executor.3 Not tainted 6.1.0-rc4-syzkaller-01070-g7eba4505394e #0
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/26/2022
> RIP: 0010:__queue_work+0xf70/0x13b0 kernel/workqueue.c:1438
> Code: e0 07 83 c0 03 38 d0 7c 09 84 d2 74 05 e8 e8 64 7b 00 8b 5b 2c 31 ff 83 e3 20 89 de e8 09 5e 2e 00 85 db 75 42 e8 50 61 2e 00 <0f> 0b e9 7e f7 ff ff e8 44 61 2e 00 0f 0b e9 10 f7 ff ff e8 38 61
> RSP: 0018:ffffc90000007c98 EFLAGS: 00010046
> RAX: 0000000000000000 RBX: 0000000000000100 RCX: 0000000000000100
> RDX: ffff888022ba9d40 RSI: ffffffff8151be20 RDI: 0000000000000005
> RBP: ffffc90000007d60 R08: 0000000000000005 R09: 0000000000000000
> R10: 0000000000000100 R11: 0000000000000001 R12: ffff88806ccf4b30
> R13: ffff88806ccf4b78 R14: 0000000000000000 R15: ffff88807568f000
> FS:  0000000000000000(0000) GS:ffff8880b9a00000(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 00005555558e1708 CR3: 0000000076b84000 CR4: 00000000003506f0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> Call Trace:
>  <IRQ>
>  call_timer_fn+0x1da/0x7c0 kernel/time/timer.c:1474
>  expire_timers kernel/time/timer.c:1514 [inline]
>  __run_timers.part.0+0x4a3/0xaf0 kernel/time/timer.c:1790
>  __run_timers kernel/time/timer.c:1768 [inline]
>  run_timer_softirq+0xb7/0x1d0 kernel/time/timer.c:1803
>  __do_softirq+0x1fb/0xadc kernel/softirq.c:571
>  invoke_softirq kernel/softirq.c:445 [inline]
>  __irq_exit_rcu+0x123/0x180 kernel/softirq.c:650
>  irq_exit_rcu+0x9/0x20 kernel/softirq.c:662
>  sysvec_apic_timer_interrupt+0x97/0xc0 arch/x86/kernel/apic/apic.c:1107
>  </IRQ>
>  <TASK>
>  asm_sysvec_apic_timer_interrupt+0x1a/0x20 arch/x86/include/asm/idtentry.h:649
> RIP: 0010:lock_acquire+0x227/0x630 kernel/locking/lockdep.c:5636
> Code: 76 9f 7e 83 f8 01 0f 85 3a 03 00 00 9c 58 f6 c4 02 0f 85 25 03 00 00 48 83 7c 24 08 00 74 01 fb 48 b8 00 00 00 00 00 fc ff df <48> 01 c3 48 c7 03 00 00 00 00 48 c7 43 08 00 00 00 00 48 8b 84 24
> RSP: 0018:ffffc90006f678d0 EFLAGS: 00000206
> RAX: dffffc0000000000 RBX: 1ffff92000decf1d RCX: 0000000000000001
> RDX: 1ffff110045754f1 RSI: 0000000000000000 RDI: 0000000000000000
> RBP: 0000000000000001 R08: 0000000000000001 R09: ffffffff912a75ef
> R10: fffffbfff2254ebd R11: 1ffffffff212a1df R12: 0000000000000000
> R13: 0000000000000000 R14: ffff88807568f138 R15: 0000000000000000
>  __flush_workqueue+0x118/0x13a0 kernel/workqueue.c:2809
>  drain_workqueue+0x1a9/0x3c0 kernel/workqueue.c:2974
>  hci_dev_close_sync+0x2f3/0x1200 net/bluetooth/hci_sync.c:4809
>  hci_dev_do_close+0x31/0x70 net/bluetooth/hci_core.c:554
>  hci_unregister_dev+0x183/0x4e0 net/bluetooth/hci_core.c:2702
>  vhci_release+0x80/0xf0 drivers/bluetooth/hci_vhci.c:568
>  __fput+0x27c/0xa90 fs/file_table.c:320
>  task_work_run+0x16f/0x270 kernel/task_work.c:179
>  exit_task_work include/linux/task_work.h:38 [inline]
>  do_exit+0xb3d/0x2a30 kernel/exit.c:820
>  do_group_exit+0xd4/0x2a0 kernel/exit.c:950
>  __do_sys_exit_group kernel/exit.c:961 [inline]
>  __se_sys_exit_group kernel/exit.c:959 [inline]
>  __x64_sys_exit_group+0x3e/0x50 kernel/exit.c:959
>  do_syscall_x64 arch/x86/entry/common.c:50 [inline]
>  do_syscall_64+0x39/0xb0 arch/x86/entry/common.c:80
>  entry_SYSCALL_64_after_hwframe+0x63/0xcd
> RIP: 0033:0x7f3d8a68b639
> Code: Unable to access opcode bytes at 0x7f3d8a68b60f.
> RSP: 002b:00007ffd554b9878 EFLAGS: 00000246 ORIG_RAX: 00000000000000e7
> RAX: ffffffffffffffda RBX: 0000000000000029 RCX: 00007f3d8a68b639
> RDX: 00007f3d8a68cc8a RSI: 0000000000000000 RDI: 0000000000000007
> RBP: 0000000000000007 R08: ffffffffff000000 R09: 0000000000000029
> R10: 00000000000003b8 R11: 0000000000000246 R12: 00007ffd554b9f00
> R13: 0000000000000003 R14: 00007ffd554b9e9c R15: 00007f3d8a782b60

That's precisely the case I described here:

  https://lore.kernel.org/all/87iljhsftt.ffs@tglx

and the final fix is here:

  https://lore.kernel.org/all/20221115195802.415956561@linutronix.de

but that will take a bit to get upstream. In the meanwhile someone could
sprinkle 'if (in_shutdown)' conditionals into the code to prevent the
splat.

Thanks,

        tglx


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [syzbot] WARNING in call_timer_fn
       [not found] ` <20221117024511.3606-1-hdanton@sina.com>
@ 2022-11-17 11:54   ` Thomas Gleixner
       [not found]   ` <20221117125523.3783-1-hdanton@sina.com>
  1 sibling, 0 replies; 10+ messages in thread
From: Thomas Gleixner @ 2022-11-17 11:54 UTC (permalink / raw)
  To: Hillf Danton
  Cc: syzbot, linux-kernel, pbonzini, syzkaller-bugs, Steven Rostedt,
	Marcel Holtmann, Luiz Augusto von Dentz

On Thu, Nov 17 2022 at 10:45, Hillf Danton wrote:
> On 16 Nov 2022 19:29:27 +0100 Thomas Gleixner <tglx@linutronix.de>
>> 
>
> I doubt your fix helps in the domain of workqueue in general, and in this
> report if it is delayed work that triggered the warning.

Indeed. I misread the backtrace and my brain connected the wrong dots
here. It's a different bluethooth hickup.

> --- x/net/bluetooth/hci_sync.c
> +++ y/net/bluetooth/hci_sync.c
> @@ -4806,6 +4806,7 @@ int hci_dev_close_sync(struct hci_dev *h
>  	/* Avoid potential lockdep warnings from the *_flush() calls by
>  	 * ensuring the workqueue is empty up front.
>  	 */
> +	cancel_delayed_work_sync(&hdev->cmd_timer);
>  	drain_workqueue(hdev->workqueue);

The work has been canceled already before in the same function and there
are some more delayed works which can trigger this.

So no, this whole close_sync() function is prone to teardown races and
just slapping a single cancel here without deeper analysis does not cut
it.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [syzbot] WARNING in call_timer_fn
       [not found]   ` <20221117125523.3783-1-hdanton@sina.com>
@ 2022-11-17 16:06     ` Thomas Gleixner
  2022-11-17 21:04       ` Luiz Augusto von Dentz
  0 siblings, 1 reply; 10+ messages in thread
From: Thomas Gleixner @ 2022-11-17 16:06 UTC (permalink / raw)
  To: Hillf Danton
  Cc: syzbot, linux-kernel, pbonzini, syzkaller-bugs, Steven Rostedt,
	Marcel Holtmann, Luiz Augusto von Dentz

On Thu, Nov 17 2022 at 20:55, Hillf Danton wrote:
> On 17 Nov 2022 12:54:28 +0100 Thomas Gleixner <tglx@linutronix.de>
>> 
>> The work has been canceled already before in the same function and there
>> are some more delayed works which can trigger this.
>> 
>> So no, this whole close_sync() function is prone to teardown races and
>> just slapping a single cancel here without deeper analysis does not cut
>> it.
>
> Agree.
>
> A set of sync cancelations can do the job, given what is defined in struct
> hci_dev wrt workqueue.

It's only part of the solution because you also have to prevent that
work is queued from other parts of the code....


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [syzbot] WARNING in call_timer_fn
  2022-11-17 16:06     ` Thomas Gleixner
@ 2022-11-17 21:04       ` Luiz Augusto von Dentz
  2022-11-17 21:16         ` Luiz Augusto von Dentz
  0 siblings, 1 reply; 10+ messages in thread
From: Luiz Augusto von Dentz @ 2022-11-17 21:04 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Hillf Danton, syzbot, linux-kernel, pbonzini, syzkaller-bugs,
	Steven Rostedt, Marcel Holtmann

Hi Thomas,

On Thu, Nov 17, 2022 at 8:06 AM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> On Thu, Nov 17 2022 at 20:55, Hillf Danton wrote:
> > On 17 Nov 2022 12:54:28 +0100 Thomas Gleixner <tglx@linutronix.de>
> >>
> >> The work has been canceled already before in the same function and there
> >> are some more delayed works which can trigger this.
> >>
> >> So no, this whole close_sync() function is prone to teardown races and
> >> just slapping a single cancel here without deeper analysis does not cut
> >> it.
> >
> > Agree.
> >
> > A set of sync cancelations can do the job, given what is defined in struct
> > hci_dev wrt workqueue.
>
> It's only part of the solution because you also have to prevent that
> work is queued from other parts of the code....

I thought we would have something similar to shutdown_timer (e.g.
shutdown_delayed_work) so we can safely free its object/struct, at
least that was the impression I got when discussing with Steven.

-- 
Luiz Augusto von Dentz

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [syzbot] WARNING in call_timer_fn
  2022-11-17 21:04       ` Luiz Augusto von Dentz
@ 2022-11-17 21:16         ` Luiz Augusto von Dentz
  2022-11-18  0:53           ` Tetsuo Handa
       [not found]           ` <20221118012805.3862-1-hdanton@sina.com>
  0 siblings, 2 replies; 10+ messages in thread
From: Luiz Augusto von Dentz @ 2022-11-17 21:16 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Hillf Danton, syzbot, linux-kernel, pbonzini, syzkaller-bugs,
	Steven Rostedt, Marcel Holtmann, Tetsuo Handa

Hi Tetsuo,

On Thu, Nov 17, 2022 at 1:04 PM Luiz Augusto von Dentz
<luiz.dentz@gmail.com> wrote:
>
> Hi Thomas,
>
> On Thu, Nov 17, 2022 at 8:06 AM Thomas Gleixner <tglx@linutronix.de> wrote:
> >
> > On Thu, Nov 17 2022 at 20:55, Hillf Danton wrote:
> > > On 17 Nov 2022 12:54:28 +0100 Thomas Gleixner <tglx@linutronix.de>
> > >>
> > >> The work has been canceled already before in the same function and there
> > >> are some more delayed works which can trigger this.
> > >>
> > >> So no, this whole close_sync() function is prone to teardown races and
> > >> just slapping a single cancel here without deeper analysis does not cut
> > >> it.
> > >
> > > Agree.
> > >
> > > A set of sync cancelations can do the job, given what is defined in struct
> > > hci_dev wrt workqueue.
> >
> > It's only part of the solution because you also have to prevent that
> > work is queued from other parts of the code....
>
> I thought we would have something similar to shutdown_timer (e.g.
> shutdown_delayed_work) so we can safely free its object/struct, at
> least that was the impression I got when discussing with Steven.

Wasn't the following patch suppose to address such problem:

https://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next.git/commit/?id=deee93d13d385103205879a8a0915036ecd83261

It was merged in the last pull request to net-next:

https://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next.git/commit/?id=a507ea32b9c2c407012bf89109ac0cf89fae313c

-- 
Luiz Augusto von Dentz

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [syzbot] WARNING in call_timer_fn
  2022-11-17 21:16         ` Luiz Augusto von Dentz
@ 2022-11-18  0:53           ` Tetsuo Handa
  2022-11-18  1:17             ` Thomas Gleixner
       [not found]           ` <20221118012805.3862-1-hdanton@sina.com>
  1 sibling, 1 reply; 10+ messages in thread
From: Tetsuo Handa @ 2022-11-18  0:53 UTC (permalink / raw)
  To: Luiz Augusto von Dentz, Thomas Gleixner
  Cc: Hillf Danton, syzbot, linux-kernel, pbonzini, syzkaller-bugs,
	Steven Rostedt, Marcel Holtmann

On 2022/11/18 6:16, Luiz Augusto von Dentz wrote:
> Wasn't the following patch suppose to address such problem:
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next.git/commit/?id=deee93d13d385103205879a8a0915036ecd83261
> 
> It was merged in the last pull request to net-next:
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next.git/commit/?id=a507ea32b9c2c407012bf89109ac0cf89fae313c
> 

No. Commit deee93d13d38 ("Bluetooth: use hdev->workqueue when queuing hdev->{cmd,ncmd}_timer works")
is for handling queue_work() from hci_cmd_timeout(struct work_struct *work) from process_one_work()
(that is, called from a kernel workqueue thread from process context).

But this report says that queue_work() is called from timer interrupt handler from interrupt context
while drain_workqueue(hdev->workqueue) is in progress from process context.

But... is is_chained_work() check appropriate?
Why can't we exclude "timer interrupt handler" from "somebody else" ?

The comment for drain_workqueue() says

 * Wait until the workqueue becomes empty.  While draining is in progress,
 * only chain queueing is allowed.  IOW, only currently pending or running
 * work items on @wq can queue further work items on it.  @wq is flushed
 * repeatedly until it becomes empty.  The number of flushing is determined
 * by the depth of chaining and should be relatively short.  Whine if it
 * takes too long.

but why limited to "only currently pending or running work items on @wq" (that is,
only process context) ?

Although drain_workqueue() is also called from destroy_workqueue() (which would cause
use-after-free bug if interrupt handler calls queue_work() some time after drain_workqueue()),
I think that we can wait for drain_workqueue() to call __flush_workqueue() again if a further
work item is queued from interrupt handler...

Anyway, stopping all works and delayed works before calling drain_workqueue() would address
this problem.


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [syzbot] WARNING in call_timer_fn
  2022-11-18  0:53           ` Tetsuo Handa
@ 2022-11-18  1:17             ` Thomas Gleixner
  0 siblings, 0 replies; 10+ messages in thread
From: Thomas Gleixner @ 2022-11-18  1:17 UTC (permalink / raw)
  To: Tetsuo Handa, Luiz Augusto von Dentz
  Cc: Hillf Danton, syzbot, linux-kernel, pbonzini, syzkaller-bugs,
	Steven Rostedt, Marcel Holtmann

On Fri, Nov 18 2022 at 09:53, Tetsuo Handa wrote:
> On 2022/11/18 6:16, Luiz Augusto von Dentz wrote:
> The comment for drain_workqueue() says
>
>  * Wait until the workqueue becomes empty.  While draining is in progress,
>  * only chain queueing is allowed.  IOW, only currently pending or running
>  * work items on @wq can queue further work items on it.  @wq is flushed
>  * repeatedly until it becomes empty.  The number of flushing is determined
>  * by the depth of chaining and should be relatively short.  Whine if it
>  * takes too long.
>
> but why limited to "only currently pending or running work items on @wq" (that is,
> only process context) ?
>
> Although drain_workqueue() is also called from destroy_workqueue() (which would cause
> use-after-free bug if interrupt handler calls queue_work() some time after drain_workqueue()),
> I think that we can wait for drain_workqueue() to call __flush_workqueue() again if a further
> work item is queued from interrupt handler...

Which is correct because at that point it's expecting to only accept the
pending and chained pending work otherwise it will go on in circles and/or
just be unable to provide the functionality of draining work, right?

> Anyway, stopping all works and delayed works before calling
> drain_workqueue() would address this problem.

Only partially.

You also have to make sure that none of the works can be rearmed or
rescheduled after that point by any other context, e.g interrupts ...

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [syzbot] WARNING in call_timer_fn
       [not found]           ` <20221118012805.3862-1-hdanton@sina.com>
@ 2022-11-18 20:01             ` Luiz Augusto von Dentz
  0 siblings, 0 replies; 10+ messages in thread
From: Luiz Augusto von Dentz @ 2022-11-18 20:01 UTC (permalink / raw)
  To: Hillf Danton
  Cc: Thomas Gleixner, syzbot, linux-kernel, pbonzini, syzkaller-bugs,
	Steven Rostedt, Marcel Holtmann, Tetsuo Handa

Hi Hillf,

On Thu, Nov 17, 2022 at 5:28 PM Hillf Danton <hdanton@sina.com> wrote:
>
> On Thu, Nov 17, 2022 at 1:04 PM Luiz Augusto von Dentz <luiz.dentz@gmail.com> wrote:
> > >
> > > I thought we would have something similar to shutdown_timer (e.g.
> > > shutdown_delayed_work) so we can safely free its object/struct, at
> > > least that was the impression I got when discussing with Steven.
>
> Because of the cough in Bluetooth's throat does not mean it makes sense
> to ask workqueue to take a flu jab. Why not cure it directly locally,
> given the cases of workqueue under the drivers dir. And timer?

Like Thomas said we can only resolve this partially with the likes of
cancel_workqueue(_sync), though we can use HCI_UNREGISTER to stop
rescheduling cmd_timer, so imo having something similar to
shutdown_timer makes things simpler for subsystems/drivers, anyway I'm
fine fixing it directly since it is probably going to be simpler to
backport, perhaps something like the following is all we need:

diff --git a/net/bluetooth/hci_core.c b/net/bluetooth/hci_core.c
index 0540555b3704..977684f5fb57 100644
--- a/net/bluetooth/hci_core.c
+++ b/net/bluetooth/hci_core.c
@@ -4090,6 +4090,7 @@ static void hci_cmd_work(struct work_struct *work)

                        rcu_read_lock();
                        if (test_bit(HCI_RESET, &hdev->flags) ||
+                           test_bit(HCI_UNREGISTER, &hdev->flags) ||
                            hci_dev_test_flag(hdev, HCI_CMD_DRAIN_WORKQUEUE))
                                cancel_delayed_work(&hdev->cmd_timer);
                        else

-- 
Luiz Augusto von Dentz

^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [syzbot] WARNING in call_timer_fn
       [not found] <20221117010835.3474-1-hdanton@sina.com>
@ 2022-11-17  1:35 ` syzbot
  0 siblings, 0 replies; 10+ messages in thread
From: syzbot @ 2022-11-17  1:35 UTC (permalink / raw)
  To: hdanton, linux-kernel, syzkaller-bugs

Hello,

syzbot has tested the proposed patch but the reproducer is still triggering an issue:
INFO: rcu detected stall in corrupted

rcu: INFO: rcu_preempt detected expedited stalls on CPUs/tasks: { P4107 } 2643 jiffies s: 2809 root: 0x0/T
rcu: blocking rcu_node structures (internal RCU debug):


Tested on:

commit:         7eba4505 net: dcb: move getapptrust to separate function
git tree:       https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git
console output: https://syzkaller.appspot.com/x/log.txt?x=12b65845880000
kernel config:  https://syzkaller.appspot.com/x/.config?x=893a728fb1a6b263
dashboard link: https://syzkaller.appspot.com/bug?extid=6fb78d577e89e69602f9
compiler:       gcc (Debian 10.2.1-6) 10.2.1 20210110, GNU ld (GNU Binutils for Debian) 2.35.2
patch:          https://syzkaller.appspot.com/x/patch.diff?x=14ef644e880000


^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2022-11-18 20:01 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-11-16 15:25 [syzbot] WARNING in call_timer_fn syzbot
2022-11-16 18:29 ` Thomas Gleixner
     [not found] ` <20221117024511.3606-1-hdanton@sina.com>
2022-11-17 11:54   ` Thomas Gleixner
     [not found]   ` <20221117125523.3783-1-hdanton@sina.com>
2022-11-17 16:06     ` Thomas Gleixner
2022-11-17 21:04       ` Luiz Augusto von Dentz
2022-11-17 21:16         ` Luiz Augusto von Dentz
2022-11-18  0:53           ` Tetsuo Handa
2022-11-18  1:17             ` Thomas Gleixner
     [not found]           ` <20221118012805.3862-1-hdanton@sina.com>
2022-11-18 20:01             ` Luiz Augusto von Dentz
     [not found] <20221117010835.3474-1-hdanton@sina.com>
2022-11-17  1:35 ` syzbot

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.