All of lore.kernel.org
 help / color / mirror / Atom feed
* INFO: task hung in blk_queue_enter
@ 2018-04-28 11:24 syzbot
  2018-05-15 11:45 ` Tetsuo Handa
  0 siblings, 1 reply; 32+ messages in thread
From: syzbot @ 2018-04-28 11:24 UTC (permalink / raw)
  To: axboe, linux-block, linux-kernel, syzkaller-bugs

Hello,

syzbot hit the following crash on upstream commit
d8a332730e757129e70675679f2b2a03f1ecf65e (Fri Apr 27 17:39:38 2018 +0000)
Merge tag 'char-misc-4.17-rc3' of  
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc
syzbot dashboard link:  
https://syzkaller.appspot.com/bug?extid=c4f9cebf9d651f6e54de

So far this crash happened 13 times on upstream.
C reproducer: https://syzkaller.appspot.com/x/repro.c?id=4727023951937536
syzkaller reproducer:  
https://syzkaller.appspot.com/x/repro.syz?id=6091190189424640
Raw console output:  
https://syzkaller.appspot.com/x/log.txt?id=5018098448990208
Kernel config:  
https://syzkaller.appspot.com/x/.config?id=7043958930931867332
compiler: gcc (GCC) 8.0.1 20180413 (experimental)

IMPORTANT: if you fix the bug, please add the following tag to the commit:
Reported-by: syzbot+c4f9cebf9d651f6e54de@syzkaller.appspotmail.com
It will help syzbot understand when the bug is fixed. See footer for  
details.
If you forward the report, please keep this part and the footer.

random: sshd: uninitialized urandom read (32 bytes read)
print_req_error: I/O error, dev loop0, sector 7680
print_req_error: I/O error, dev loop0, sector 10240
print_req_error: I/O error, dev loop0, sector 7680
print_req_error: I/O error, dev loop0, sector 12800
INFO: task kworker/u4:2:31 blocked for more than 120 seconds.
       Not tainted 4.17.0-rc2+ #20
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kworker/u4:2    D12840    31      2 0x80000000
Workqueue: writeback wb_workfn (flush-7:0)
Call Trace:
  context_switch kernel/sched/core.c:2848 [inline]
  __schedule+0x801/0x1e30 kernel/sched/core.c:3490
  schedule+0xef/0x430 kernel/sched/core.c:3549
  blk_queue_enter+0x8da/0xdf0 block/blk-core.c:953
  generic_make_request+0x144/0x1510 block/blk-core.c:2395
  submit_bio+0xba/0x460 block/blk-core.c:2568
  submit_bh_wbc+0x5e8/0x7c0 fs/buffer.c:3076
  __block_write_full_page+0x82e/0xef0 fs/buffer.c:1758
  block_write_full_page+0x1ff/0x250 fs/buffer.c:2944
  blkdev_writepage+0x24/0x30 fs/block_dev.c:566
  __writepage+0x69/0xe0 mm/page-writeback.c:2302
  write_cache_pages+0x920/0x1690 mm/page-writeback.c:2240
  generic_writepages+0xcc/0x120 mm/page-writeback.c:2326
  blkdev_writepages+0x1d/0x30 fs/block_dev.c:1951
  do_writepages+0x9a/0x1a0 mm/page-writeback.c:2341
  __writeback_single_inode+0x1e4/0x15c0 fs/fs-writeback.c:1323
  writeback_sb_inodes+0x6f0/0x11a0 fs/fs-writeback.c:1587
  __writeback_inodes_wb+0x1b0/0x320 fs/fs-writeback.c:1656
  wb_writeback+0x9e4/0xf50 fs/fs-writeback.c:1765
  wb_check_background_flush fs/fs-writeback.c:1833 [inline]
  wb_do_writeback fs/fs-writeback.c:1921 [inline]
  wb_workfn+0xd2f/0x1770 fs/fs-writeback.c:1949
  process_one_work+0xc1e/0x1b50 kernel/workqueue.c:2145
  worker_thread+0x1cc/0x1440 kernel/workqueue.c:2279
  kthread+0x345/0x410 kernel/kthread.c:238
  ret_from_fork+0x3a/0x50 arch/x86/entry/entry_64.S:412

Showing all locks held in the system:
3 locks held by kworker/u4:2/31:
  #0:         (ptrval) ((wq_completion)"writeback"){+.+.}, at:  
__write_once_size include/linux/compiler.h:215 [inline]
  #0:         (ptrval) ((wq_completion)"writeback"){+.+.}, at:  
arch_atomic64_set arch/x86/include/asm/atomic64_64.h:34 [inline]
  #0:         (ptrval) ((wq_completion)"writeback"){+.+.}, at: atomic64_set  
include/asm-generic/atomic-instrumented.h:40 [inline]
  #0:         (ptrval) ((wq_completion)"writeback"){+.+.}, at:  
atomic_long_set include/asm-generic/atomic-long.h:57 [inline]
  #0:         (ptrval) ((wq_completion)"writeback"){+.+.}, at: set_work_data  
kernel/workqueue.c:617 [inline]
  #0:         (ptrval) ((wq_completion)"writeback"){+.+.}, at:  
set_work_pool_and_clear_pending kernel/workqueue.c:644 [inline]
  #0:         (ptrval) ((wq_completion)"writeback"){+.+.}, at:  
process_one_work+0xaef/0x1b50 kernel/workqueue.c:2116
  #1:         (ptrval) ((work_completion)(&(&wb->dwork)->work)){+.+.}, at:  
process_one_work+0xb46/0x1b50 kernel/workqueue.c:2120
  #2:         (ptrval) (&type->s_umount_key#35){.+.+}, at:  
trylock_super+0x22/0x110 fs/super.c:399
2 locks held by khungtaskd/891:
  #0:         (ptrval) (rcu_read_lock){....}, at:  
check_hung_uninterruptible_tasks kernel/hung_task.c:175 [inline]
  #0:         (ptrval) (rcu_read_lock){....}, at: watchdog+0x1ff/0xf60  
kernel/hung_task.c:249
  #1:         (ptrval) (tasklist_lock){.+.+}, at:  
debug_show_all_locks+0xde/0x34a kernel/locking/lockdep.c:4470
2 locks held by getty/4512:
  #0:         (ptrval) (&tty->ldisc_sem){++++}, at:  
ldsem_down_read+0x37/0x40 drivers/tty/tty_ldsem.c:365
  #1:         (ptrval) (&ldata->atomic_read_lock){+.+.}, at:  
n_tty_read+0x321/0x1cc0 drivers/tty/n_tty.c:2131
2 locks held by getty/4513:
  #0:         (ptrval) (&tty->ldisc_sem){++++}, at:  
ldsem_down_read+0x37/0x40 drivers/tty/tty_ldsem.c:365
  #1:         (ptrval) (&ldata->atomic_read_lock){+.+.}, at:  
n_tty_read+0x321/0x1cc0 drivers/tty/n_tty.c:2131
2 locks held by getty/4514:
  #0:         (ptrval) (&tty->ldisc_sem){++++}, at:  
ldsem_down_read+0x37/0x40 drivers/tty/tty_ldsem.c:365
  #1:         (ptrval) (&ldata->atomic_read_lock){+.+.}, at:  
n_tty_read+0x321/0x1cc0 drivers/tty/n_tty.c:2131
2 locks held by getty/4515:
  #0:         (ptrval) (&tty->ldisc_sem){++++}, at:  
ldsem_down_read+0x37/0x40 drivers/tty/tty_ldsem.c:365
  #1:         (ptrval) (&ldata->atomic_read_lock){+.+.}, at:  
n_tty_read+0x321/0x1cc0 drivers/tty/n_tty.c:2131
2 locks held by getty/4516:
  #0:         (ptrval) (&tty->ldisc_sem){++++}, at:  
ldsem_down_read+0x37/0x40 drivers/tty/tty_ldsem.c:365
  #1:         (ptrval) (&ldata->atomic_read_lock){+.+.}, at:  
n_tty_read+0x321/0x1cc0 drivers/tty/n_tty.c:2131
2 locks held by getty/4517:
  #0:         (ptrval) (&tty->ldisc_sem){++++}, at:  
ldsem_down_read+0x37/0x40 drivers/tty/tty_ldsem.c:365
  #1:         (ptrval) (&ldata->atomic_read_lock){+.+.}, at:  
n_tty_read+0x321/0x1cc0 drivers/tty/n_tty.c:2131
2 locks held by getty/4518:
  #0:         (ptrval) (&tty->ldisc_sem){++++}, at:  
ldsem_down_read+0x37/0x40 drivers/tty/tty_ldsem.c:365
  #1:         (ptrval) (&ldata->atomic_read_lock){+.+.}, at:  
n_tty_read+0x321/0x1cc0 drivers/tty/n_tty.c:2131
1 lock held by syz-executor516/4581:
  #0:         (ptrval) (&lo->lo_ctl_mutex/1){+.+.}, at: lo_ioctl+0x8d/0x2130  
drivers/block/loop.c:1391

=============================================

NMI backtrace for cpu 1
CPU: 1 PID: 891 Comm: khungtaskd Not tainted 4.17.0-rc2+ #20
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS  
Google 01/01/2011
Call Trace:
  __dump_stack lib/dump_stack.c:77 [inline]
  dump_stack+0x1b9/0x294 lib/dump_stack.c:113
  nmi_cpu_backtrace.cold.4+0x19/0xce lib/nmi_backtrace.c:103
  nmi_trigger_cpumask_backtrace+0x151/0x192 lib/nmi_backtrace.c:62
  arch_trigger_cpumask_backtrace+0x14/0x20 arch/x86/kernel/apic/hw_nmi.c:38
  trigger_all_cpu_backtrace include/linux/nmi.h:138 [inline]
  check_hung_task kernel/hung_task.c:132 [inline]
  check_hung_uninterruptible_tasks kernel/hung_task.c:190 [inline]
  watchdog+0xc10/0xf60 kernel/hung_task.c:249
  kthread+0x345/0x410 kernel/kthread.c:238
  ret_from_fork+0x3a/0x50 arch/x86/entry/entry_64.S:412
Sending NMI from CPU 1 to CPUs 0:
NMI backtrace for cpu 0
CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.17.0-rc2+ #20
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS  
Google 01/01/2011
RIP: 0010:ksoftirqd_running kernel/softirq.c:88 [inline]
RIP: 0010:invoke_softirq kernel/softirq.c:355 [inline]
RIP: 0010:irq_exit+0x125/0x200 kernel/softirq.c:405
RSP: 0018:ffff8801dae07cd8 EFLAGS: 00000046
RAX: dffffc0000000000 RBX: ffff8801d9a9c200 RCX: 0000000000000000
RDX: 1ffff1003b353842 RSI: 0000000000000000 RDI: ffff8801d9a9c210
RBP: ffff8801dae07ce0 R08: ffffed003b5c46c3 R09: ffffed003b5c46c2
R10: ffffed003b5c46c2 R11: ffff8801dae23613 R12: ffff8801dae2c580
R13: ffffffff88c75dc0 R14: ffff8801dae07e68 R15: 0000000000000000
FS:  0000000000000000(0000) GS:ffff8801dae00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffffffffff600400 CR3: 00000001b5553000 CR4: 00000000001406f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
  <IRQ>
  scheduler_ipi+0x52b/0xa30 kernel/sched/core.c:1778
  smp_reschedule_interrupt+0xed/0x660 arch/x86/kernel/smp.c:277
  reschedule_interrupt+0xf/0x20 arch/x86/entry/entry_64.S:887
  </IRQ>
RIP: 0010:native_safe_halt+0x6/0x10 arch/x86/include/asm/irqflags.h:54
RSP: 0018:ffffffff88c07bc0 EFLAGS: 00000286 ORIG_RAX: ffffffffffffff02
RAX: dffffc0000000000 RBX: 1ffffffff1180f7b RCX: 0000000000000000
RDX: 1ffffffff11a3160 RSI: 0000000000000001 RDI: ffffffff88d18b00
RBP: ffffffff88c07bc0 R08: ffffed003b5c46c3 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
R13: ffffffff88c07c78 R14: ffffffff897bf060 R15: 0000000000000000
  arch_safe_halt arch/x86/include/asm/paravirt.h:94 [inline]
  default_idle+0xc2/0x440 arch/x86/kernel/process.c:354
  arch_cpu_idle+0x10/0x20 arch/x86/kernel/process.c:345
  default_idle_call+0x6d/0x90 kernel/sched/idle.c:93
  cpuidle_idle_call kernel/sched/idle.c:153 [inline]
  do_idle+0x395/0x560 kernel/sched/idle.c:262
  cpu_startup_entry+0x104/0x120 kernel/sched/idle.c:368
  rest_init+0xe1/0xe4 init/main.c:441
  start_kernel+0x906/0x92d init/main.c:737
  x86_64_start_reservations+0x29/0x2b arch/x86/kernel/head64.c:445
  x86_64_start_kernel+0x76/0x79 arch/x86/kernel/head64.c:426
  secondary_startup_64+0xa5/0xb0 arch/x86/kernel/head_64.S:242
Code: 00 01 5b 5d c3 65 48 8b 1d 89 a4 ba 7e 48 85 db 74 2c 48 b8 00 00 00  
00 00 fc ff df 48 8d 7b 10 48 89 fa 48 c1 ea 03 80 3c 02 00 <0f> 85 c2 00  
00 00 48 8b 43 10 48 85 c0 0f 84 31 ff ff ff 48 c7


---
This bug is generated by a dumb bot. It may contain errors.
See https://goo.gl/tpsmEJ for details.
Direct all questions to syzkaller@googlegroups.com.

syzbot will keep track of this bug report.
If you forgot to add the Reported-by tag, once the fix for this bug is  
merged
into any tree, please reply to this email with:
#syz fix: exact-commit-title
If you want to test a patch for this bug, please reply with:
#syz test: git://repo/address.git branch
and provide the patch inline or as an attachment.
To mark this as a duplicate of another syzbot report, please reply with:
#syz dup: exact-subject-of-another-report
If it's a one-off invalid bug report, please reply with:
#syz invalid
Note: if the crash happens again, it will cause creation of a new bug  
report.
Note: all commands must start from beginning of the line in the email body.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: INFO: task hung in blk_queue_enter
  2018-04-28 11:24 INFO: task hung in blk_queue_enter syzbot
@ 2018-05-15 11:45 ` Tetsuo Handa
  2018-05-16 13:05   ` Tetsuo Handa
  0 siblings, 1 reply; 32+ messages in thread
From: Tetsuo Handa @ 2018-05-15 11:45 UTC (permalink / raw)
  To: syzbot, axboe, linux-block, syzkaller-bugs
  Cc: linux-kernel, linux-ext4, dvyukov

I managed to obtain SysRq-t when khungtaskd fires using debug printk()
( https://groups.google.com/forum/#!topic/syzkaller-bugs/OTuOsVebAiE ).
Only 4 threads shown below seems to be relevant to this problem.

[  246.929688]   task                        PC stack   pid father
[  249.888937] jbd2/sda1-8     D17736  2271      2 0x80000000
[  249.894586] Call Trace:
[  249.897193]  __schedule+0x801/0x1e30
[  249.900954]  schedule+0xef/0x430
[  249.904356]  io_schedule+0x1c/0x70
[  249.907935]  bit_wait_io+0x18/0x90
[  249.911492]  __wait_on_bit+0xb3/0x130
[  249.915313]  out_of_line_wait_on_bit+0x204/0x3a0
[  249.920105]  __wait_on_buffer+0x76/0x90
[  249.924102]  jbd2_journal_commit_transaction+0x655b/0x8c18
[  249.929893]  kjournald2+0x26c/0xb30
[  249.933579]  kthread+0x345/0x410
[  249.936966]  ret_from_fork+0x3a/0x50
[  254.408972] syz-executor7   D18976  5010   4828 0x00000004
[  254.414639] Call Trace:
[  254.417263]  __schedule+0x801/0x1e30
[  254.421070]  schedule+0xef/0x430
[  254.424467]  blk_queue_enter+0x8da/0xdf0
[  254.428584]  generic_make_request+0x144/0x1510
[  254.433217]  blk_queue_split+0x374/0x2090
[  254.437425]  blk_mq_make_request+0x2d0/0x25c0
[  254.442004]  generic_make_request+0x795/0x1510
[  254.446663]  submit_bio+0xba/0x460
[  254.451104]  mpage_readpages+0x6d7/0x970
[  254.455224]  blkdev_readpages+0x2c/0x40
[  254.459220]  __do_page_cache_readahead+0x79a/0xdc0
[  254.464205]  ondemand_readahead+0x550/0xc40
[  254.468559]  page_cache_sync_readahead+0xd1/0x110
[  254.473430]  generic_file_read_iter+0x1a74/0x2f00
[  254.478423]  blkdev_read_iter+0x120/0x190
[  254.482594]  generic_file_splice_read+0x552/0x910
[  254.487484]  do_splice_to+0x12e/0x190
[  254.491311]  splice_direct_to_actor+0x268/0x8d0
[  254.496017]  do_splice_direct+0x2cc/0x400
[  254.500224]  do_sendfile+0x60f/0xe00
[  254.503971]  __x64_sys_sendfile64+0x155/0x240
[  254.508507]  do_syscall_64+0x1b1/0x800
[  254.512431]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
[  254.568779] syz-executor7   D25408  5021   4828 0x00000004
[  254.574429] Call Trace:
[  254.577125]  __schedule+0x801/0x1e30
[  254.580882]  schedule+0xef/0x430
[  254.584297]  blk_mq_freeze_queue_wait+0x1ce/0x460
[  254.589192]  blk_freeze_queue+0x4a/0x80
[  254.593209]  blk_mq_freeze_queue+0x15/0x20
[  254.597464]  loop_clr_fd+0x226/0xb80
[  254.601208]  lo_ioctl+0x642/0x2130
[  254.604774]  blkdev_ioctl+0x9b6/0x2020
[  254.608715]  block_ioctl+0xee/0x130
[  254.612362]  do_vfs_ioctl+0x1cf/0x16a0
[  254.616294]  ksys_ioctl+0xa9/0xd0
[  254.619770]  __x64_sys_ioctl+0x73/0xb0
[  254.623674]  do_syscall_64+0x1b1/0x800
[  254.627596]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
[  255.022839] 1 lock held by syz-executor7/5021:
[  255.028209]  #0:         (ptrval) (&lo->lo_ctl_mutex/1){+.+.}, at: lo_ioctl+0x8d/0x2130
[  254.719133] blkid           D23880  5071   2713 0x00000000
[  254.724791] Call Trace:
[  254.727402]  __schedule+0x801/0x1e30
[  254.731159]  schedule+0xef/0x430
[  254.734562]  schedule_preempt_disabled+0x10/0x20
[  254.739333]  __mutex_lock+0xe38/0x17f0
[  254.743280]  mutex_lock_killable_nested+0x16/0x20
[  254.748135]  lo_ioctl+0x8d/0x2130
[  254.751611]  blkdev_ioctl+0x9b6/0x2020
[  254.755550]  block_ioctl+0xee/0x130
[  254.759200]  do_vfs_ioctl+0x1cf/0x16a0
[  254.763123]  ksys_ioctl+0xa9/0xd0
[  254.766601]  __x64_sys_ioctl+0x73/0xb0
[  254.770506]  do_syscall_64+0x1b1/0x800
[  254.774425]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
[  255.036399] 1 lock held by blkid/5071:
[  255.041075]  #0:         (ptrval) (&lo->lo_ctl_mutex/1){+.+.}, at: lo_ioctl+0x8d/0x2130



syz-executor7(PID=5021) is doing ioctl(LOOP_CLR_FD).

blkid(PID=5071) is doing ioctl() but is blocked on lo_ctl_mutex which is
held by syz-executor7(PID=5021). Therefore, I assume that blkid(PID=5071)
is irrelevant for this hung up problem.



syz-executor7(PID=5021) is stuck at blk_mq_freeze_queue_wait() from
blk_freeze_queue() from blk_mq_freeze_queue() from loop_clr_fd().

/*
 * Guarantee no request is in use, so we can change any data structure of
 * the queue afterward.
 */
void blk_freeze_queue(struct request_queue *q)
{
	/*
	 * In the !blk_mq case we are only calling this to kill the
	 * q_usage_counter, otherwise this increases the freeze depth
	 * and waits for it to return to zero.  For this reason there is
	 * no blk_unfreeze_queue(), and blk_freeze_queue() is not
	 * exported to drivers as the only user for unfreeze is blk_mq.
	 */
	blk_freeze_queue_start(q);
	if (!q->mq_ops)
		blk_drain_queue(q);
	blk_mq_freeze_queue_wait(q);
}

q->mq_freeze_depth is incremented at blk_freeze_queue_start() and
the caller waits until q->q_usage_counter becomes 0.

void blk_freeze_queue_start(struct request_queue *q)
{
	int freeze_depth;

	freeze_depth = atomic_inc_return(&q->mq_freeze_depth);
	if (freeze_depth == 1) {
		percpu_ref_kill(&q->q_usage_counter);
		if (q->mq_ops)
			blk_mq_run_hw_queues(q, false);
	}
}

void blk_mq_freeze_queue_wait(struct request_queue *q)
{
	wait_event(q->mq_freeze_wq, percpu_ref_is_zero(&q->q_usage_counter));
}

I couldn't check whether freeze_depth in blk_freeze_queue_start() was 1,
but presumably q->mq_freeze_depth > 0 because syz-executor7(PID=5010) is
stuck at wait_event() in blk_queue_enter().

syz-executor7   D18976  5010   4828 0x00000004
Call Trace:
  context_switch kernel/sched/core.c:2859 [inline]
  __schedule+0x801/0x1e30 kernel/sched/core.c:3501
  schedule+0xef/0x430 kernel/sched/core.c:3545
  blk_queue_enter+0x8da/0xdf0 block/blk-core.c:953
  generic_make_request+0x144/0x1510 block/blk-core.c:2395
  blk_queue_split+0x374/0x2090 block/blk-merge.c:215
  blk_mq_make_request+0x2d0/0x25c0 block/blk-mq.c:1861
  generic_make_request+0x795/0x1510 block/blk-core.c:2460
  submit_bio+0xba/0x460 block/blk-core.c:2568
  mpage_bio_submit fs/mpage.c:66 [inline]
  mpage_readpages+0x6d7/0x970 fs/mpage.c:393
  blkdev_readpages+0x2c/0x40 fs/block_dev.c:577
  read_pages mm/readahead.c:121 [inline]
  __do_page_cache_readahead+0x79a/0xdc0 mm/readahead.c:199
  ra_submit mm/internal.h:66 [inline]
  ondemand_readahead+0x550/0xc40 mm/readahead.c:478
  page_cache_sync_readahead+0xd1/0x110 mm/readahead.c:510
  generic_file_buffered_read mm/filemap.c:2092 [inline]
  generic_file_read_iter+0x1a74/0x2f00 mm/filemap.c:2362
  blkdev_read_iter+0x120/0x190 fs/block_dev.c:1930
  call_read_iter include/linux/fs.h:1778 [inline]
  generic_file_splice_read+0x552/0x910 fs/splice.c:307
  do_splice_to+0x12e/0x190 fs/splice.c:880
  splice_direct_to_actor+0x268/0x8d0 fs/splice.c:952
  do_splice_direct+0x2cc/0x400 fs/splice.c:1061
  do_sendfile+0x60f/0xe00 fs/read_write.c:1440
  __do_sys_sendfile64 fs/read_write.c:1495 [inline]
  __se_sys_sendfile64 fs/read_write.c:1487 [inline]
  __x64_sys_sendfile64+0x155/0x240 fs/read_write.c:1487
  do_syscall_64+0x1b1/0x800 arch/x86/entry/common.c:287
  entry_SYSCALL_64_after_hwframe+0x49/0xbe

Since flags == 0, preempt == false. Since stuck at wait_event(), success == false.
Thus, atomic_read(&q->mq_freeze_depth) > 0 if blk_queue_dying(q) == false. And I
guess blk_queue_dying(q) == false because we are just trying to freeze/unfreeze.

/**
 * blk_queue_enter() - try to increase q->q_usage_counter
 * @q: request queue pointer
 * @flags: BLK_MQ_REQ_NOWAIT and/or BLK_MQ_REQ_PREEMPT
 */
int blk_queue_enter(struct request_queue *q, blk_mq_req_flags_t flags)
{
	const bool preempt = flags & BLK_MQ_REQ_PREEMPT;

	while (true) {
		bool success = false;

		rcu_read_lock();
		if (percpu_ref_tryget_live(&q->q_usage_counter)) {
			/*
			 * The code that sets the PREEMPT_ONLY flag is
			 * responsible for ensuring that that flag is globally
			 * visible before the queue is unfrozen.
			 */
			if (preempt || !blk_queue_preempt_only(q)) {
				success = true;
			} else {
				percpu_ref_put(&q->q_usage_counter);
			}
		}
		rcu_read_unlock();

		if (success)
			return 0;

		if (flags & BLK_MQ_REQ_NOWAIT)
			return -EBUSY;

		/*
		 * read pair of barrier in blk_freeze_queue_start(),
		 * we need to order reading __PERCPU_REF_DEAD flag of
		 * .q_usage_counter and reading .mq_freeze_depth or
		 * queue dying flag, otherwise the following wait may
		 * never return if the two reads are reordered.
		 */
		smp_rmb();

		wait_event(q->mq_freeze_wq,
			   (atomic_read(&q->mq_freeze_depth) == 0 &&
			    (preempt || !blk_queue_preempt_only(q))) ||
			   blk_queue_dying(q));
		if (blk_queue_dying(q))
			return -ENODEV;
	}
}



jbd2/sda1-8(PID=2271) is stuck waiting for journal commit operation.
I don't know how this thread is involved to this problem.

jbd2/sda1-8     D17736  2271      2 0x80000000
Call Trace:
  context_switch kernel/sched/core.c:2859 [inline]
  __schedule+0x801/0x1e30 kernel/sched/core.c:3501
  schedule+0xef/0x430 kernel/sched/core.c:3545
  io_schedule+0x1c/0x70 kernel/sched/core.c:5165
  bit_wait_io+0x18/0x90 kernel/sched/wait_bit.c:207
  __wait_on_bit+0xb3/0x130 kernel/sched/wait_bit.c:48
  out_of_line_wait_on_bit+0x204/0x3a0 kernel/sched/wait_bit.c:63
  wait_on_bit_io include/linux/wait_bit.h:101 [inline]
  __wait_on_buffer+0x76/0x90 fs/buffer.c:118
  wait_on_buffer include/linux/buffer_head.h:356 [inline]
  journal_wait_on_commit_record fs/jbd2/commit.c:174 [inline]
  jbd2_journal_commit_transaction+0x655b/0x8c18 fs/jbd2/commit.c:865
  kjournald2+0x26c/0xb30 fs/jbd2/journal.c:229
  kthread+0x345/0x410 kernel/kthread.c:240
  ret_from_fork+0x3a/0x50 arch/x86/entry/entry_64.S:412



Can we find the culprit from these hints?

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: INFO: task hung in blk_queue_enter
  2018-05-15 11:45 ` Tetsuo Handa
@ 2018-05-16 13:05   ` Tetsuo Handa
  2018-05-16 14:56       ` Bart Van Assche
  2018-05-16 17:33     ` Alan Jenkins
  0 siblings, 2 replies; 32+ messages in thread
From: Tetsuo Handa @ 2018-05-16 13:05 UTC (permalink / raw)
  To: syzbot, linux-block, syzkaller-bugs, Dan Williams, Jens Axboe
  Cc: linux-kernel, linux-ext4, dvyukov, Alan Jenkins, Bart Van Assche,
	Christoph Hellwig, Hannes Reinecke, Johannes Thumshirn,
	Keith Busch, Martin K. Petersen, Martin Steigerwald, Ming Lei,
	Oleksandr Natalenko, Ross Zwisler

Tetsuo Handa wrote:
> I couldn't check whether freeze_depth in blk_freeze_queue_start() was 1,
> but presumably q->mq_freeze_depth > 0 because syz-executor7(PID=5010) is
> stuck at wait_event() in blk_queue_enter().
> 
> Since flags == 0, preempt == false. Since stuck at wait_event(), success == false.
> Thus, atomic_read(&q->mq_freeze_depth) > 0 if blk_queue_dying(q) == false. And I
> guess blk_queue_dying(q) == false because we are just trying to freeze/unfreeze.
> 

I was able to reproduce the hung up using modified reproducer, and got values
using below debug printk() patch.

  --- a/block/blk-core.c
  +++ b/block/blk-core.c
  @@ -950,10 +950,12 @@ int blk_queue_enter(struct request_queue *q, blk_mq_req_flags_t flags)
   		 */
   		smp_rmb();
   
  -		wait_event(q->mq_freeze_wq,
  -			   (atomic_read(&q->mq_freeze_depth) == 0 &&
  -			    (preempt || !blk_queue_preempt_only(q))) ||
  -			   blk_queue_dying(q));
  +		while (wait_event_timeout(q->mq_freeze_wq,
  +					  (atomic_read(&q->mq_freeze_depth) == 0 &&
  +					   (preempt || !blk_queue_preempt_only(q))) ||
  +					  blk_queue_dying(q), 10 * HZ) == 0)
  +			printk("%s(%u): q->mq_freeze_depth=%d preempt=%d blk_queue_preempt_only(q)=%d blk_queue_dying(q)=%d\n",
  +			       current->comm, current->pid, atomic_read(&q->mq_freeze_depth), preempt, blk_queue_preempt_only(q), blk_queue_dying(q));
   		if (blk_queue_dying(q))
   			return -ENODEV;
   	}

[   75.869126] print_req_error: I/O error, dev loop0, sector 0
[   85.983146] a.out(8838): q->mq_freeze_depth=1 preempt=0 blk_queue_preempt_only(q)=0 blk_queue_dying(q)=0
[   96.222884] a.out(8838): q->mq_freeze_depth=1 preempt=0 blk_queue_preempt_only(q)=0 blk_queue_dying(q)=0
[  106.463322] a.out(8838): q->mq_freeze_depth=1 preempt=0 blk_queue_preempt_only(q)=0 blk_queue_dying(q)=0
[  116.702912] a.out(8838): q->mq_freeze_depth=1 preempt=0 blk_queue_preempt_only(q)=0 blk_queue_dying(q)=0

One ore more threads are waiting for q->mq_freeze_depth to become 0. But the
thread who incremented q->mq_freeze_depth at blk_freeze_queue_start(q) from
blk_freeze_queue() is waiting at blk_mq_freeze_queue_wait(). Therefore,
atomic_read(&q->mq_freeze_depth) == 0 condition for wait_event() in
blk_queue_enter() will never be satisfied. But what does that wait_event()
want to do? Isn't "start freezing" a sort of blk_queue_dying(q) == true?
Since percpu_ref_tryget_live(&q->q_usage_counter) failed and the queue is
about to be frozen, shouldn't we treat atomic_read(&q->mq_freeze_depth) != 0
as if blk_queue_dying(q) == true? That is, something like below:

diff --git a/block/blk-core.c b/block/blk-core.c
index 85909b4..59e2496 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -951,10 +951,10 @@ int blk_queue_enter(struct request_queue *q, blk_mq_req_flags_t flags)
 		smp_rmb();
 
 		wait_event(q->mq_freeze_wq,
-			   (atomic_read(&q->mq_freeze_depth) == 0 &&
-			    (preempt || !blk_queue_preempt_only(q))) ||
+			   atomic_read(&q->mq_freeze_depth) ||
+			   (preempt || !blk_queue_preempt_only(q)) ||
 			   blk_queue_dying(q));
-		if (blk_queue_dying(q))
+		if (atomic_read(&q->mq_freeze_depth) || blk_queue_dying(q))
 			return -ENODEV;
 	}
 }

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: INFO: task hung in blk_queue_enter
  2018-05-16 13:05   ` Tetsuo Handa
@ 2018-05-16 14:56       ` Bart Van Assche
  2018-05-16 17:33     ` Alan Jenkins
  1 sibling, 0 replies; 32+ messages in thread
From: Bart Van Assche @ 2018-05-16 14:56 UTC (permalink / raw)
  To: syzbot+c4f9cebf9d651f6e54de, syzkaller-bugs, dan.j.williams,
	linux-block, penguin-kernel, axboe
  Cc: linux-kernel, jthumshirn, alan.christopher.jenkins, hch,
	martin.petersen, dvyukov, ming.lei, martin, oleksandr, hare,
	ross.zwisler, keith.busch, linux-ext4

T24gV2VkLCAyMDE4LTA1LTE2IGF0IDIyOjA1ICswOTAwLCBUZXRzdW8gSGFuZGEgd3JvdGU6DQo+
IE9uZSBvcmUgbW9yZSB0aHJlYWRzIGFyZSB3YWl0aW5nIGZvciBxLT5tcV9mcmVlemVfZGVwdGgg
dG8gYmVjb21lIDAuIEJ1dCB0aGUNCj4gdGhyZWFkIHdobyBpbmNyZW1lbnRlZCBxLT5tcV9mcmVl
emVfZGVwdGggYXQgYmxrX2ZyZWV6ZV9xdWV1ZV9zdGFydChxKSBmcm9tDQo+IGJsa19mcmVlemVf
cXVldWUoKSBpcyB3YWl0aW5nIGF0IGJsa19tcV9mcmVlemVfcXVldWVfd2FpdCgpLiBUaGVyZWZv
cmUsDQo+IGF0b21pY19yZWFkKCZxLT5tcV9mcmVlemVfZGVwdGgpID09IDAgY29uZGl0aW9uIGZv
ciB3YWl0X2V2ZW50KCkgaW4NCj4gYmxrX3F1ZXVlX2VudGVyKCkgd2lsbCBuZXZlciBiZSBzYXRp
c2ZpZWQuIEJ1dCB3aGF0IGRvZXMgdGhhdCB3YWl0X2V2ZW50KCkNCj4gd2FudCB0byBkbz8gSXNu
J3QgInN0YXJ0IGZyZWV6aW5nIiBhIHNvcnQgb2YgYmxrX3F1ZXVlX2R5aW5nKHEpID09IHRydWU/
DQo+IFNpbmNlIHBlcmNwdV9yZWZfdHJ5Z2V0X2xpdmUoJnEtPnFfdXNhZ2VfY291bnRlcikgZmFp
bGVkIGFuZCB0aGUgcXVldWUgaXMNCj4gYWJvdXQgdG8gYmUgZnJvemVuLCBzaG91bGRuJ3Qgd2Ug
dHJlYXQgYXRvbWljX3JlYWQoJnEtPm1xX2ZyZWV6ZV9kZXB0aCkgIT0gMA0KPiBhcyBpZiBibGtf
cXVldWVfZHlpbmcocSkgPT0gdHJ1ZT8gVGhhdCBpcywgc29tZXRoaW5nIGxpa2UgYmVsb3c6DQo+
IA0KPiBkaWZmIC0tZ2l0IGEvYmxvY2svYmxrLWNvcmUuYyBiL2Jsb2NrL2Jsay1jb3JlLmMNCj4g
aW5kZXggODU5MDliNC4uNTllMjQ5NiAxMDA2NDQNCj4gLS0tIGEvYmxvY2svYmxrLWNvcmUuYw0K
PiArKysgYi9ibG9jay9ibGstY29yZS5jDQo+IEBAIC05NTEsMTAgKzk1MSwxMCBAQCBpbnQgYmxr
X3F1ZXVlX2VudGVyKHN0cnVjdCByZXF1ZXN0X3F1ZXVlICpxLCBibGtfbXFfcmVxX2ZsYWdzX3Qg
ZmxhZ3MpDQo+ICAJCXNtcF9ybWIoKTsNCj4gIA0KPiAgCQl3YWl0X2V2ZW50KHEtPm1xX2ZyZWV6
ZV93cSwNCj4gLQkJCSAgIChhdG9taWNfcmVhZCgmcS0+bXFfZnJlZXplX2RlcHRoKSA9PSAwICYm
DQo+IC0JCQkgICAgKHByZWVtcHQgfHwgIWJsa19xdWV1ZV9wcmVlbXB0X29ubHkocSkpKSB8fA0K
PiArCQkJICAgYXRvbWljX3JlYWQoJnEtPm1xX2ZyZWV6ZV9kZXB0aCkgfHwNCj4gKwkJCSAgIChw
cmVlbXB0IHx8ICFibGtfcXVldWVfcHJlZW1wdF9vbmx5KHEpKSB8fA0KPiAgCQkJICAgYmxrX3F1
ZXVlX2R5aW5nKHEpKTsNCj4gLQkJaWYgKGJsa19xdWV1ZV9keWluZyhxKSkNCj4gKwkJaWYgKGF0
b21pY19yZWFkKCZxLT5tcV9mcmVlemVfZGVwdGgpIHx8IGJsa19xdWV1ZV9keWluZyhxKSkNCj4g
IAkJCXJldHVybiAtRU5PREVWOw0KPiAgCX0NCj4gIH0NCg0KVGhhdCBjaGFuZ2UgbG9va3Mgd3Jv
bmcgdG8gbWUuIEFkZGl0aW9uYWxseSwgSSB0aGluayB0aGF0IHlvdSBhcmUgbG9va2luZyBpbg0K
dGhlIHdyb25nIGRpcmVjdGlvbi4gU2luY2UgYmxrX21xX2ZyZWV6ZV9xdWV1ZV93YWl0KCkgYW5k
IGJsa19xdWV1ZV9lbnRlcigpDQp3b3JrIGZpbmUgZm9yIGFsbCBibG9jayBkcml2ZXJzIGV4Y2Vw
dCB0aGUgbG9vcCBkcml2ZXIgSSB0aGluayB0aGF0IHlvdSBzaG91bGQNCmhhdmUgYSBjbG9zZXIg
bG9vayBhdCBob3cgdGhlIGxvb3AgZHJpdmVyIHVzZXMgdGhpcyBibG9jayBsYXllciBmdW5jdGlv
bmFsaXR5Lg0KDQpUaGFua3MsDQoNCkJhcnQuDQoNCg0KDQo=

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: INFO: task hung in blk_queue_enter
@ 2018-05-16 14:56       ` Bart Van Assche
  0 siblings, 0 replies; 32+ messages in thread
From: Bart Van Assche @ 2018-05-16 14:56 UTC (permalink / raw)
  To: syzbot+c4f9cebf9d651f6e54de, syzkaller-bugs, dan.j.williams,
	linux-block, penguin-kernel, axboe
  Cc: linux-kernel, jthumshirn, alan.christopher.jenkins, hch,
	martin.petersen, dvyukov, ming.lei, martin, oleksandr, hare,
	ross.zwisler, keith.busch, linux-ext4

On Wed, 2018-05-16 at 22:05 +0900, Tetsuo Handa wrote:
> One ore more threads are waiting for q->mq_freeze_depth to become 0. But the
> thread who incremented q->mq_freeze_depth at blk_freeze_queue_start(q) from
> blk_freeze_queue() is waiting at blk_mq_freeze_queue_wait(). Therefore,
> atomic_read(&q->mq_freeze_depth) == 0 condition for wait_event() in
> blk_queue_enter() will never be satisfied. But what does that wait_event()
> want to do? Isn't "start freezing" a sort of blk_queue_dying(q) == true?
> Since percpu_ref_tryget_live(&q->q_usage_counter) failed and the queue is
> about to be frozen, shouldn't we treat atomic_read(&q->mq_freeze_depth) != 0
> as if blk_queue_dying(q) == true? That is, something like below:
> 
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 85909b4..59e2496 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -951,10 +951,10 @@ int blk_queue_enter(struct request_queue *q, blk_mq_req_flags_t flags)
>  		smp_rmb();
>  
>  		wait_event(q->mq_freeze_wq,
> -			   (atomic_read(&q->mq_freeze_depth) == 0 &&
> -			    (preempt || !blk_queue_preempt_only(q))) ||
> +			   atomic_read(&q->mq_freeze_depth) ||
> +			   (preempt || !blk_queue_preempt_only(q)) ||
>  			   blk_queue_dying(q));
> -		if (blk_queue_dying(q))
> +		if (atomic_read(&q->mq_freeze_depth) || blk_queue_dying(q))
>  			return -ENODEV;
>  	}
>  }

That change looks wrong to me. Additionally, I think that you are looking in
the wrong direction. Since blk_mq_freeze_queue_wait() and blk_queue_enter()
work fine for all block drivers except the loop driver I think that you should
have a closer look at how the loop driver uses this block layer functionality.

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: INFO: task hung in blk_queue_enter
  2018-05-16 14:56       ` Bart Van Assche
@ 2018-05-16 15:16         ` Dmitry Vyukov
  -1 siblings, 0 replies; 32+ messages in thread
From: Dmitry Vyukov @ 2018-05-16 15:16 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: syzbot+c4f9cebf9d651f6e54de, syzkaller-bugs, dan.j.williams,
	linux-block, penguin-kernel@I-love.SAKURA.ne.jp, axboe,
	linux-kernel, jthumshirn, alan.christopher.jenkins, hch,
	martin.petersen, ming.lei, martin, oleksandr, hare, ross.zwisler,
	keith.busch, linux-ext4

On Wed, May 16, 2018 at 4:56 PM, Bart Van Assche <Bart.VanAssche@wdc.com> wrote:
> On Wed, 2018-05-16 at 22:05 +0900, Tetsuo Handa wrote:
>> One ore more threads are waiting for q->mq_freeze_depth to become 0. But the
>> thread who incremented q->mq_freeze_depth at blk_freeze_queue_start(q) from
>> blk_freeze_queue() is waiting at blk_mq_freeze_queue_wait(). Therefore,
>> atomic_read(&q->mq_freeze_depth) == 0 condition for wait_event() in
>> blk_queue_enter() will never be satisfied. But what does that wait_event()
>> want to do? Isn't "start freezing" a sort of blk_queue_dying(q) == true?
>> Since percpu_ref_tryget_live(&q->q_usage_counter) failed and the queue is
>> about to be frozen, shouldn't we treat atomic_read(&q->mq_freeze_depth) != 0
>> as if blk_queue_dying(q) == true? That is, something like below:
>>
>> diff --git a/block/blk-core.c b/block/blk-core.c
>> index 85909b4..59e2496 100644
>> --- a/block/blk-core.c
>> +++ b/block/blk-core.c
>> @@ -951,10 +951,10 @@ int blk_queue_enter(struct request_queue *q, blk_mq_req_flags_t flags)
>>               smp_rmb();
>>
>>               wait_event(q->mq_freeze_wq,
>> -                        (atomic_read(&q->mq_freeze_depth) == 0 &&
>> -                         (preempt || !blk_queue_preempt_only(q))) ||
>> +                        atomic_read(&q->mq_freeze_depth) ||
>> +                        (preempt || !blk_queue_preempt_only(q)) ||
>>                          blk_queue_dying(q));
>> -             if (blk_queue_dying(q))
>> +             if (atomic_read(&q->mq_freeze_depth) || blk_queue_dying(q))
>>                       return -ENODEV;
>>       }
>>  }
>
> That change looks wrong to me.

Hi Bart,

Why does it look wrong to you?

> Additionally, I think that you are looking in
> the wrong direction. Since blk_mq_freeze_queue_wait() and blk_queue_enter()
> work fine for all block drivers except the loop driver I think that you should
> have a closer look at how the loop driver uses this block layer functionality.
>
> Thanks,
>
> Bart.
>
>
>

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: INFO: task hung in blk_queue_enter
@ 2018-05-16 15:16         ` Dmitry Vyukov
  0 siblings, 0 replies; 32+ messages in thread
From: Dmitry Vyukov @ 2018-05-16 15:16 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: syzbot+c4f9cebf9d651f6e54de, syzkaller-bugs, dan.j.williams,
	linux-block, penguin-kernel@I-love.SAKURA.ne.jp, axboe,
	linux-kernel, jthumshirn, alan.christopher.jenkins, hch,
	martin.petersen, ming.lei, martin,

On Wed, May 16, 2018 at 4:56 PM, Bart Van Assche <Bart.VanAssche@wdc.com> wrote:
> On Wed, 2018-05-16 at 22:05 +0900, Tetsuo Handa wrote:
>> One ore more threads are waiting for q->mq_freeze_depth to become 0. But the
>> thread who incremented q->mq_freeze_depth at blk_freeze_queue_start(q) from
>> blk_freeze_queue() is waiting at blk_mq_freeze_queue_wait(). Therefore,
>> atomic_read(&q->mq_freeze_depth) == 0 condition for wait_event() in
>> blk_queue_enter() will never be satisfied. But what does that wait_event()
>> want to do? Isn't "start freezing" a sort of blk_queue_dying(q) == true?
>> Since percpu_ref_tryget_live(&q->q_usage_counter) failed and the queue is
>> about to be frozen, shouldn't we treat atomic_read(&q->mq_freeze_depth) != 0
>> as if blk_queue_dying(q) == true? That is, something like below:
>>
>> diff --git a/block/blk-core.c b/block/blk-core.c
>> index 85909b4..59e2496 100644
>> --- a/block/blk-core.c
>> +++ b/block/blk-core.c
>> @@ -951,10 +951,10 @@ int blk_queue_enter(struct request_queue *q, blk_mq_req_flags_t flags)
>>               smp_rmb();
>>
>>               wait_event(q->mq_freeze_wq,
>> -                        (atomic_read(&q->mq_freeze_depth) == 0 &&
>> -                         (preempt || !blk_queue_preempt_only(q))) ||
>> +                        atomic_read(&q->mq_freeze_depth) ||
>> +                        (preempt || !blk_queue_preempt_only(q)) ||
>>                          blk_queue_dying(q));
>> -             if (blk_queue_dying(q))
>> +             if (atomic_read(&q->mq_freeze_depth) || blk_queue_dying(q))
>>                       return -ENODEV;
>>       }
>>  }
>
> That change looks wrong to me.

Hi Bart,

Why does it look wrong to you?

> Additionally, I think that you are looking in
> the wrong direction. Since blk_mq_freeze_queue_wait() and blk_queue_enter()
> work fine for all block drivers except the loop driver I think that you should
> have a closer look at how the loop driver uses this block layer functionality.
>
> Thanks,
>
> Bart.
>
>
>

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: INFO: task hung in blk_queue_enter
  2018-05-16 15:16         ` Dmitry Vyukov
  (?)
@ 2018-05-16 15:37           ` Bart Van Assche
  -1 siblings, 0 replies; 32+ messages in thread
From: Bart Van Assche @ 2018-05-16 15:37 UTC (permalink / raw)
  To: dvyukov
  Cc: linux-kernel, linux-block, jthumshirn, alan.christopher.jenkins,
	syzbot+c4f9cebf9d651f6e54de, penguin-kernel, martin.petersen,
	axboe, dan.j.williams, hch, oleksandr, ming.lei, martin, hare,
	syzkaller-bugs, ross.zwisler, keith.busch, linux-ext4

T24gV2VkLCAyMDE4LTA1LTE2IGF0IDE3OjE2ICswMjAwLCBEbWl0cnkgVnl1a292IHdyb3RlOg0K
PiBPbiBXZWQsIE1heSAxNiwgMjAxOCBhdCA0OjU2IFBNLCBCYXJ0IFZhbiBBc3NjaGUgPEJhcnQu
VmFuQXNzY2hlQHdkYy5jb20+IHdyb3RlOg0KPiA+IE9uIFdlZCwgMjAxOC0wNS0xNiBhdCAyMjow
NSArMDkwMCwgVGV0c3VvIEhhbmRhIHdyb3RlOg0KPiA+ID4gZGlmZiAtLWdpdCBhL2Jsb2NrL2Js
ay1jb3JlLmMgYi9ibG9jay9ibGstY29yZS5jDQo+ID4gPiBpbmRleCA4NTkwOWI0Li41OWUyNDk2
IDEwMDY0NA0KPiA+ID4gLS0tIGEvYmxvY2svYmxrLWNvcmUuYw0KPiA+ID4gKysrIGIvYmxvY2sv
YmxrLWNvcmUuYw0KPiA+ID4gQEAgLTk1MSwxMCArOTUxLDEwIEBAIGludCBibGtfcXVldWVfZW50
ZXIoc3RydWN0IHJlcXVlc3RfcXVldWUgKnEsIGJsa19tcV9yZXFfZmxhZ3NfdCBmbGFncykNCj4g
PiA+ICAgICAgICAgICAgICAgc21wX3JtYigpOw0KPiA+ID4gDQo+ID4gPiAgICAgICAgICAgICAg
IHdhaXRfZXZlbnQocS0+bXFfZnJlZXplX3dxLA0KPiA+ID4gLSAgICAgICAgICAgICAgICAgICAg
ICAgIChhdG9taWNfcmVhZCgmcS0+bXFfZnJlZXplX2RlcHRoKSA9PSAwICYmDQo+ID4gPiAtICAg
ICAgICAgICAgICAgICAgICAgICAgIChwcmVlbXB0IHx8ICFibGtfcXVldWVfcHJlZW1wdF9vbmx5
KHEpKSkgfHwNCj4gPiA+ICsgICAgICAgICAgICAgICAgICAgICAgICBhdG9taWNfcmVhZCgmcS0+
bXFfZnJlZXplX2RlcHRoKSB8fA0KPiA+ID4gKyAgICAgICAgICAgICAgICAgICAgICAgIChwcmVl
bXB0IHx8ICFibGtfcXVldWVfcHJlZW1wdF9vbmx5KHEpKSB8fA0KPiA+ID4gICAgICAgICAgICAg
ICAgICAgICAgICAgIGJsa19xdWV1ZV9keWluZyhxKSk7DQo+ID4gPiAtICAgICAgICAgICAgIGlm
IChibGtfcXVldWVfZHlpbmcocSkpDQo+ID4gPiArICAgICAgICAgICAgIGlmIChhdG9taWNfcmVh
ZCgmcS0+bXFfZnJlZXplX2RlcHRoKSB8fCBibGtfcXVldWVfZHlpbmcocSkpDQo+ID4gPiAgICAg
ICAgICAgICAgICAgICAgICAgcmV0dXJuIC1FTk9ERVY7DQo+ID4gPiAgICAgICB9DQo+ID4gPiAg
fQ0KPiA+IA0KPiA+IFRoYXQgY2hhbmdlIGxvb2tzIHdyb25nIHRvIG1lLg0KPiANCj4gSGkgQmFy
dCwNCj4gDQo+IFdoeSBkb2VzIGl0IGxvb2sgd3JvbmcgdG8geW91Pw0KDQpCZWNhdXNlIHRoYXQg
Y2hhbmdlIGNvbmZsaWN0cyB3aXRoIHRoZSBwdXJwb3NlIG9mIHF1ZXVlIGZyZWV6aW5nIGFuZCBh
bHNvIGJlY2F1c2UNCnRoYXQgY2hhbmdlIHdvdWxkIGluamVjdCBJL08gZXJyb3JzIGluIGNvZGUg
cGF0aHMgdGhhdCBzaG91bGRuJ3QgaW5qZWN0IEkvTyBlcnJvcnMuDQpQbGVhc2UgaGF2ZSBhIGxv
b2sgYXQgZS5nLiBnZW5lcmljX21ha2VfcmVxdWVzdCgpLiBGcm9tIHRoZSBzdGFydCBvZiB0aGF0
IGZ1bmN0aW9uOg0KDQoJaWYgKGJsa19xdWV1ZV9lbnRlcihxLCBmbGFncykgPCAwKSB7DQoJCWlm
ICghYmxrX3F1ZXVlX2R5aW5nKHEpICYmIChiaW8tPmJpX29wZiAmIFJFUV9OT1dBSVQpKQ0KCQkJ
YmlvX3dvdWxkYmxvY2tfZXJyb3IoYmlvKTsNCgkJZWxzZQ0KCQkJYmlvX2lvX2Vycm9yKGJpbyk7
DQoJCXJldHVybiByZXQ7DQoJfQ0KDQpUaGUgYWJvdmUgcGF0Y2ggY2hhbmdlcyB0aGUgYmVoYXZp
b3Igb2YgYmxrX3F1ZXVlX2VudGVyKCkgY29kZSBmcm9tIHdhaXRpbmcgd2hpbGUNCnEtPm1xX2Zy
ZWV6ZV9kZXB0aCAhPSAwIGludG8gcmV0dXJuaW5nIC1FTk9ERVYgd2hpbGUgdGhlIHJlcXVlc3Qg
cXVldWUgaXMgZnJvemVuLg0KVGhhdCB3aWxsIGNhdXNlIGdlbmVyaWNfbWFrZV9yZXF1ZXN0KCkg
dG8gY2FsbCBiaW9faW9fZXJyb3IoYmlvKSB3aGlsZSBhIHJlcXVlc3QNCnF1ZXVlIGlzIGZyb3pl
biBpZiBSRVFfTk9XQUlUIGhhcyBub3QgYmVlbiBzZXQsIHdoaWNoIGlzIHRoZSBkZWZhdWx0IGJl
aGF2aW9yLiBTbw0KYW55IG9wZXJhdGlvbiB0aGF0IGZyZWV6ZXMgdGhlIHF1ZXVlIHRlbXBvcmFy
aWx5LCBlLmcuIGNoYW5naW5nIHRoZSBxdWV1ZSBkZXB0aCwNCmNvbmN1cnJlbnRseSB3aXRoIEkv
TyBwcm9jZXNzaW5nIGNhbiBjYXVzZSBJL08gdG8gZmFpbCB3aXRoIC1FTk9ERVYuIEFzIHlvdQ0K
cHJvYmFibHkga25vdyBmYWlsdXJlIG9mIHdyaXRlIHJlcXVlc3RzIGhhcyB2ZXJ5IGFubm95aW5n
IGNvbnNlcXVlbmNlcy4gSXQgZS5nLg0KY2F1c2VzIGZpbGVzeXN0ZW1zIHRvIGdvIGludG8gcmVh
ZC1vbmx5IG1vZGUuIFRoYXQncyB3aHkgSSB0aGluayB0aGF0IHRoZSBhYm92ZQ0KY2hhbmdlIGlz
IGNvbXBsZXRlbHkgd3JvbmcuDQoNCkJhcnQuDQoNCg0K

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: INFO: task hung in blk_queue_enter
@ 2018-05-16 15:37           ` Bart Van Assche
  0 siblings, 0 replies; 32+ messages in thread
From: Bart Van Assche @ 2018-05-16 15:37 UTC (permalink / raw)
  To: dvyukov
  Cc: linux-kernel, linux-block, jthumshirn, alan.christopher.jenkins,
	syzbot+c4f9cebf9d651f6e54de, penguin-kernel, martin.petersen,
	axboe, dan.j.williams, hch, oleksandr, ming.lei, martin, hare,
	syzkaller-bugs, ross.zwisler, keith.busch, linux-ext4

On Wed, 2018-05-16 at 17:16 +0200, Dmitry Vyukov wrote:
> On Wed, May 16, 2018 at 4:56 PM, Bart Van Assche <Bart.VanAssche@wdc.com> wrote:
> > On Wed, 2018-05-16 at 22:05 +0900, Tetsuo Handa wrote:
> > > diff --git a/block/blk-core.c b/block/blk-core.c
> > > index 85909b4..59e2496 100644
> > > --- a/block/blk-core.c
> > > +++ b/block/blk-core.c
> > > @@ -951,10 +951,10 @@ int blk_queue_enter(struct request_queue *q, blk_mq_req_flags_t flags)
> > >               smp_rmb();
> > > 
> > >               wait_event(q->mq_freeze_wq,
> > > -                        (atomic_read(&q->mq_freeze_depth) == 0 &&
> > > -                         (preempt || !blk_queue_preempt_only(q))) ||
> > > +                        atomic_read(&q->mq_freeze_depth) ||
> > > +                        (preempt || !blk_queue_preempt_only(q)) ||
> > >                          blk_queue_dying(q));
> > > -             if (blk_queue_dying(q))
> > > +             if (atomic_read(&q->mq_freeze_depth) || blk_queue_dying(q))
> > >                       return -ENODEV;
> > >       }
> > >  }
> > 
> > That change looks wrong to me.
> 
> Hi Bart,
> 
> Why does it look wrong to you?

Because that change conflicts with the purpose of queue freezing and also because
that change would inject I/O errors in code paths that shouldn't inject I/O errors.
Please have a look at e.g. generic_make_request(). From the start of that function:

	if (blk_queue_enter(q, flags) < 0) {
		if (!blk_queue_dying(q) && (bio->bi_opf & REQ_NOWAIT))
			bio_wouldblock_error(bio);
		else
			bio_io_error(bio);
		return ret;
	}

The above patch changes the behavior of blk_queue_enter() code from waiting while
q->mq_freeze_depth != 0 into returning -ENODEV while the request queue is frozen.
That will cause generic_make_request() to call bio_io_error(bio) while a request
queue is frozen if REQ_NOWAIT has not been set, which is the default behavior. So
any operation that freezes the queue temporarily, e.g. changing the queue depth,
concurrently with I/O processing can cause I/O to fail with -ENODEV. As you
probably know failure of write requests has very annoying consequences. It e.g.
causes filesystems to go into read-only mode. That's why I think that the above
change is completely wrong.

Bart.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: INFO: task hung in blk_queue_enter
@ 2018-05-16 15:37           ` Bart Van Assche
  0 siblings, 0 replies; 32+ messages in thread
From: Bart Van Assche @ 2018-05-16 15:37 UTC (permalink / raw)
  To: dvyukov
  Cc: linux-kernel, linux-block, jthumshirn, alan.christopher.jenkins,
	syzbot+c4f9cebf9d651f6e54de, penguin-kernel, martin.petersen,
	axboe, dan.j.williams, hch, oleksandr, ming.lei, martin,

On Wed, 2018-05-16 at 17:16 +0200, Dmitry Vyukov wrote:
> On Wed, May 16, 2018 at 4:56 PM, Bart Van Assche <Bart.VanAssche@wdc.com> wrote:
> > On Wed, 2018-05-16 at 22:05 +0900, Tetsuo Handa wrote:
> > > diff --git a/block/blk-core.c b/block/blk-core.c
> > > index 85909b4..59e2496 100644
> > > --- a/block/blk-core.c
> > > +++ b/block/blk-core.c
> > > @@ -951,10 +951,10 @@ int blk_queue_enter(struct request_queue *q, blk_mq_req_flags_t flags)
> > >               smp_rmb();
> > > 
> > >               wait_event(q->mq_freeze_wq,
> > > -                        (atomic_read(&q->mq_freeze_depth) == 0 &&
> > > -                         (preempt || !blk_queue_preempt_only(q))) ||
> > > +                        atomic_read(&q->mq_freeze_depth) ||
> > > +                        (preempt || !blk_queue_preempt_only(q)) ||
> > >                          blk_queue_dying(q));
> > > -             if (blk_queue_dying(q))
> > > +             if (atomic_read(&q->mq_freeze_depth) || blk_queue_dying(q))
> > >                       return -ENODEV;
> > >       }
> > >  }
> > 
> > That change looks wrong to me.
> 
> Hi Bart,
> 
> Why does it look wrong to you?

Because that change conflicts with the purpose of queue freezing and also because
that change would inject I/O errors in code paths that shouldn't inject I/O errors.
Please have a look at e.g. generic_make_request(). From the start of that function:

	if (blk_queue_enter(q, flags) < 0) {
		if (!blk_queue_dying(q) && (bio->bi_opf & REQ_NOWAIT))
			bio_wouldblock_error(bio);
		else
			bio_io_error(bio);
		return ret;
	}

The above patch changes the behavior of blk_queue_enter() code from waiting while
q->mq_freeze_depth != 0 into returning -ENODEV while the request queue is frozen.
That will cause generic_make_request() to call bio_io_error(bio) while a request
queue is frozen if REQ_NOWAIT has not been set, which is the default behavior. So
any operation that freezes the queue temporarily, e.g. changing the queue depth,
concurrently with I/O processing can cause I/O to fail with -ENODEV. As you
probably know failure of write requests has very annoying consequences. It e.g.
causes filesystems to go into read-only mode. That's why I think that the above
change is completely wrong.

Bart.



^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: INFO: task hung in blk_queue_enter
  2018-05-16 13:05   ` Tetsuo Handa
  2018-05-16 14:56       ` Bart Van Assche
@ 2018-05-16 17:33     ` Alan Jenkins
  1 sibling, 0 replies; 32+ messages in thread
From: Alan Jenkins @ 2018-05-16 17:33 UTC (permalink / raw)
  To: Tetsuo Handa, syzkaller-bugs
  Cc: syzbot, linux-block, Dan Williams, Jens Axboe, linux-kernel,
	linux-ext4, dvyukov, Bart Van Assche, Christoph Hellwig,
	Hannes Reinecke, Johannes Thumshirn, Keith Busch,
	Martin K. Petersen, Martin Steigerwald, Ming Lei,
	Oleksandr Natalenko, Ross Zwisler

 > jbd2/sda1-8(PID=2271) is stuck waiting for journal commit operation.
 > I don't know how this thread is involved to this problem.

It feels like it should be a necessary link in the chain. This is the 
filesystem underneath the loop device. If that hangs, we would expect 
the loop device to hang, but not vice versa.

AIUI, you couldn't reproduce this hang on your own machine.

Don't you think, your fuzzer has just found a nice exploit for reiserfs?

Alan


(It's interesting that you found this hang just after we fixed 
block_queue_enter() to wait in D state instead of S state.[1]  I don't 
know syszkaller, so I assume from this it only flagged this case up 
because of the hung task warning. I.e. syzkaller didn't have its own 
watchdog that would have failed the test anyway.

[1] the commit that caused you to CC me. "block: do not use 
interruptible wait anywhere"

https://github.com/torvalds/linux/commit/1dc3039bc87ae7d19a990c3ee71cfd8a9068f428

)

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: INFO: task hung in blk_queue_enter
  2018-05-16 15:37           ` Bart Van Assche
  (?)
  (?)
@ 2018-05-21 21:52           ` Tetsuo Handa
  2018-05-22 11:20             ` Tetsuo Handa
  -1 siblings, 1 reply; 32+ messages in thread
From: Tetsuo Handa @ 2018-05-21 21:52 UTC (permalink / raw)
  To: Bart.VanAssche, dvyukov
  Cc: linux-kernel, linux-block, jthumshirn, alan.christopher.jenkins,
	syzbot+c4f9cebf9d651f6e54de, martin.petersen, axboe,
	dan.j.williams, hch, oleksandr, ming.lei, martin, hare,
	syzkaller-bugs, ross.zwisler, keith.busch, linux-ext4

Bart Van Assche wrote:
> On Wed, 2018-05-16 at 17:16 +0200, Dmitry Vyukov wrote:
> > On Wed, May 16, 2018 at 4:56 PM, Bart Van Assche <Bart.VanAssche@wdc.com> wrote:
> > > On Wed, 2018-05-16 at 22:05 +0900, Tetsuo Handa wrote:
> > > > diff --git a/block/blk-core.c b/block/blk-core.c
> > > > index 85909b4..59e2496 100644
> > > > --- a/block/blk-core.c
> > > > +++ b/block/blk-core.c
> > > > @@ -951,10 +951,10 @@ int blk_queue_enter(struct request_queue *q, blk_mq_req_flags_t flags)
> > > >               smp_rmb();
> > > > 
> > > >               wait_event(q->mq_freeze_wq,
> > > > -                        (atomic_read(&q->mq_freeze_depth) == 0 &&
> > > > -                         (preempt || !blk_queue_preempt_only(q))) ||
> > > > +                        atomic_read(&q->mq_freeze_depth) ||
> > > > +                        (preempt || !blk_queue_preempt_only(q)) ||
> > > >                          blk_queue_dying(q));
> > > > -             if (blk_queue_dying(q))
> > > > +             if (atomic_read(&q->mq_freeze_depth) || blk_queue_dying(q))
> > > >                       return -ENODEV;
> > > >       }
> > > >  }
> > > 
> > > That change looks wrong to me.
> > 
> > Hi Bart,
> > 
> > Why does it look wrong to you?
> 
> Because that change conflicts with the purpose of queue freezing and also because
> that change would inject I/O errors in code paths that shouldn't inject I/O errors.

But waiting there until atomic_read(&q->mq_freeze_depth) becomes 0 is causing deadlock.
wait_event() never returns is a bug. I think we should not wait for q->mq_freeze_depth.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: INFO: task hung in blk_queue_enter
  2018-05-21 21:52           ` Tetsuo Handa
@ 2018-05-22 11:20             ` Tetsuo Handa
  2018-06-01 10:10               ` Tetsuo Handa
  0 siblings, 1 reply; 32+ messages in thread
From: Tetsuo Handa @ 2018-05-22 11:20 UTC (permalink / raw)
  To: Bart.VanAssche, dvyukov
  Cc: linux-kernel, linux-block, jthumshirn, alan.christopher.jenkins,
	syzbot+c4f9cebf9d651f6e54de, martin.petersen, axboe,
	dan.j.williams, hch, oleksandr, ming.lei, martin, hare,
	syzkaller-bugs, ross.zwisler, keith.busch, linux-ext4

I checked counter values using debug printk() patch shown below, and
found that q->q_usage_counter.count == 1 when this deadlock occurs.

Since sum of percpu_count did not change after percpu_ref_kill(), this is
not a race condition while folding percpu counter values into atomic counter
value. That is, for some reason, someone who is responsible for calling
percpu_ref_put(&q->q_usage_counter) (presumably via blk_queue_exit()) is
unable to call percpu_ref_put().

diff --git a/block/blk-core.c b/block/blk-core.c
index 85909b4..6933020 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -908,6 +908,12 @@ struct request_queue *blk_alloc_queue(gfp_t gfp_mask)
 }
 EXPORT_SYMBOL(blk_alloc_queue);
 
+static unsigned long __percpu *percpu_count_ptr(struct percpu_ref *ref)
+{
+	return (unsigned long __percpu *)
+		(ref->percpu_count_ptr & ~__PERCPU_REF_ATOMIC_DEAD);
+}
+
 /**
  * blk_queue_enter() - try to increase q->q_usage_counter
  * @q: request queue pointer
@@ -950,10 +956,22 @@ int blk_queue_enter(struct request_queue *q, blk_mq_req_flags_t flags)
 		 */
 		smp_rmb();
 
-		wait_event(q->mq_freeze_wq,
-			   (atomic_read(&q->mq_freeze_depth) == 0 &&
-			    (preempt || !blk_queue_preempt_only(q))) ||
-			   blk_queue_dying(q));
+		while (wait_event_timeout(q->mq_freeze_wq,
+					  (atomic_read(&q->mq_freeze_depth) == 0 &&
+					   (preempt || !blk_queue_preempt_only(q))) ||
+					  blk_queue_dying(q), 3 * HZ) == 0) {
+			struct percpu_ref *ref = &q->q_usage_counter;
+			unsigned long __percpu *percpu_count = percpu_count_ptr(ref);
+			unsigned long count = 0;
+			int cpu;
+
+			for_each_possible_cpu(cpu)
+				count += *per_cpu_ptr(percpu_count, cpu);
+
+			printk("%s(%d): %px %ld %ld\n", current->comm, current->pid,
+			       ref, atomic_long_read(&ref->count), count);
+		}
+
 		if (blk_queue_dying(q))
 			return -ENODEV;
 	}
diff --git a/lib/percpu-refcount.c b/lib/percpu-refcount.c
index 9f96fa7..72773ce 100644
--- a/lib/percpu-refcount.c
+++ b/lib/percpu-refcount.c
@@ -133,8 +133,8 @@ static void percpu_ref_switch_to_atomic_rcu(struct rcu_head *rcu)
 	for_each_possible_cpu(cpu)
 		count += *per_cpu_ptr(percpu_count, cpu);
 
-	pr_debug("global %ld percpu %ld",
-		 atomic_long_read(&ref->count), (long)count);
+	printk("%px global %ld percpu %ld\n", ref,
+	       atomic_long_read(&ref->count), (long)count);
 
 	/*
 	 * It's crucial that we sum the percpu counters _before_ adding the sum
@@ -150,6 +150,8 @@ static void percpu_ref_switch_to_atomic_rcu(struct rcu_head *rcu)
 	 */
 	atomic_long_add((long)count - PERCPU_COUNT_BIAS, &ref->count);
 
+	printk("%px global %ld\n", ref, atomic_long_read(&ref->count));
+
 	WARN_ONCE(atomic_long_read(&ref->count) <= 0,
 		  "percpu ref (%pf) <= 0 (%ld) after switching to atomic",
 		  ref->release, atomic_long_read(&ref->count));

If I change blk_queue_enter() not to wait at wait_event() if
q->mq_freeze_depth != 0, this deadlock problem does not occur.

Also, I found that if blk_freeze_queue_start() tries to wait for
counters to become 1 before calling percpu_ref_kill() (like shown
below), this deadlock problem does not occur.

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 9ce9cac..4bff534 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -134,12 +134,36 @@ void blk_mq_in_flight_rw(struct request_queue *q, struct hd_struct *part,
 	blk_mq_queue_tag_busy_iter(q, blk_mq_check_inflight_rw, &mi);
 }
 
+#define PERCPU_COUNT_BIAS       (1LU << (BITS_PER_LONG - 1))
+
+static unsigned long __percpu *percpu_count_ptr(struct percpu_ref *ref)
+{
+	return (unsigned long __percpu *)
+		(ref->percpu_count_ptr & ~__PERCPU_REF_ATOMIC_DEAD);
+}
+
 void blk_freeze_queue_start(struct request_queue *q)
 {
 	int freeze_depth;
 
 	freeze_depth = atomic_inc_return(&q->mq_freeze_depth);
 	if (freeze_depth == 1) {
+		int i;
+		for (i = 0; i < 10; i++) {
+			struct percpu_ref *ref = &q->q_usage_counter;
+			unsigned long __percpu *percpu_count = percpu_count_ptr(ref);
+			unsigned long count = 0;
+			int cpu;
+
+			for_each_possible_cpu(cpu)
+				count += *per_cpu_ptr(percpu_count, cpu);
+
+			if (atomic_long_read(&ref->count) + count - PERCPU_COUNT_BIAS == 1)
+				break;
+			printk("%s(%d):! %px %ld %ld\n", current->comm, current->pid,
+			       ref, atomic_long_read(&ref->count), count);
+			schedule_timeout_uninterruptible(HZ / 10);
+		}
 		percpu_ref_kill(&q->q_usage_counter);
 		if (q->mq_ops)
 			blk_mq_run_hw_queues(q, false);

But I don't know how to find someone who is failing to call percpu_ref_put()...

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: INFO: task hung in blk_queue_enter
  2018-05-22 11:20             ` Tetsuo Handa
@ 2018-06-01 10:10               ` Tetsuo Handa
  2018-06-01 17:52                 ` Jens Axboe
  2018-06-04 11:46                 ` Dmitry Vyukov
  0 siblings, 2 replies; 32+ messages in thread
From: Tetsuo Handa @ 2018-06-01 10:10 UTC (permalink / raw)
  To: Bart.VanAssche, dvyukov
  Cc: linux-kernel, linux-block, jthumshirn, alan.christopher.jenkins,
	syzbot+c4f9cebf9d651f6e54de, martin.petersen, axboe,
	dan.j.williams, hch, oleksandr, ming.lei, martin, hare,
	syzkaller-bugs, ross.zwisler, keith.busch, linux-ext4

Tetsuo Handa wrote:
> Since sum of percpu_count did not change after percpu_ref_kill(), this is
> not a race condition while folding percpu counter values into atomic counter
> value. That is, for some reason, someone who is responsible for calling
> percpu_ref_put(&q->q_usage_counter) (presumably via blk_queue_exit()) is
> unable to call percpu_ref_put().
> But I don't know how to find someone who is failing to call percpu_ref_put()...

I found the someone. It was already there in the backtrace...

----------------------------------------
[   62.065852] a.out           D    0  4414   4337 0x00000000
[   62.067677] Call Trace:
[   62.068545]  __schedule+0x40b/0x860
[   62.069726]  schedule+0x31/0x80
[   62.070796]  schedule_timeout+0x1c1/0x3c0
[   62.072159]  ? __next_timer_interrupt+0xd0/0xd0
[   62.073670]  blk_queue_enter+0x218/0x520
[   62.074985]  ? remove_wait_queue+0x70/0x70
[   62.076361]  generic_make_request+0x3d/0x540
[   62.077785]  ? __bio_clone_fast+0x6b/0x80
[   62.079147]  ? bio_clone_fast+0x2c/0x70
[   62.080456]  blk_queue_split+0x29b/0x560
[   62.081772]  ? blk_queue_split+0x29b/0x560
[   62.083162]  blk_mq_make_request+0x7c/0x430
[   62.084562]  generic_make_request+0x276/0x540
[   62.086034]  submit_bio+0x6e/0x140
[   62.087185]  ? submit_bio+0x6e/0x140
[   62.088384]  ? guard_bio_eod+0x9d/0x1d0
[   62.089681]  do_mpage_readpage+0x328/0x730
[   62.091045]  ? __add_to_page_cache_locked+0x12e/0x1a0
[   62.092726]  mpage_readpages+0x120/0x190
[   62.094034]  ? check_disk_change+0x70/0x70
[   62.095454]  ? check_disk_change+0x70/0x70
[   62.096849]  ? alloc_pages_current+0x65/0xd0
[   62.098277]  blkdev_readpages+0x18/0x20
[   62.099568]  __do_page_cache_readahead+0x298/0x360
[   62.101157]  ondemand_readahead+0x1f6/0x490
[   62.102546]  ? ondemand_readahead+0x1f6/0x490
[   62.103995]  page_cache_sync_readahead+0x29/0x40
[   62.105539]  generic_file_read_iter+0x7d0/0x9d0
[   62.107067]  ? futex_wait+0x221/0x240
[   62.108303]  ? trace_hardirqs_on+0xd/0x10
[   62.109654]  blkdev_read_iter+0x30/0x40
[   62.110954]  generic_file_splice_read+0xc5/0x140
[   62.112538]  do_splice_to+0x74/0x90
[   62.113726]  splice_direct_to_actor+0xa4/0x1f0
[   62.115209]  ? generic_pipe_buf_nosteal+0x10/0x10
[   62.116773]  do_splice_direct+0x8a/0xb0
[   62.118056]  do_sendfile+0x1aa/0x390
[   62.119255]  __x64_sys_sendfile64+0x4e/0xc0
[   62.120666]  do_syscall_64+0x6e/0x210
[   62.121909]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
----------------------------------------

The someone is blk_queue_split() from blk_mq_make_request() who depends on an
assumption that blk_queue_enter() from recursively called generic_make_request()
does not get blocked due to percpu_ref_tryget_live(&q->q_usage_counter) failure.

----------------------------------------
generic_make_request(struct bio *bio) {
  if (blk_queue_enter(q, flags) < 0) { /* <= percpu_ref_tryget_live() succeeds. */
    if (!blk_queue_dying(q) && (bio->bi_opf & REQ_NOWAIT))
      bio_wouldblock_error(bio);
    else
      bio_io_error(bio);
    return ret;
  }
(...snipped...)
  ret = q->make_request_fn(q, bio);
(...snipped...)
  if (q)
    blk_queue_exit(q);
}
----------------------------------------

where q->make_request_fn == blk_mq_make_request which does

----------------------------------------
blk_mq_make_request(struct request_queue *q, struct bio *bio) {
   blk_queue_split(q, &bio);
}

blk_queue_split(struct request_queue *q, struct bio **bio) {
  generic_make_request(*bio); /* <= percpu_ref_tryget_live() fails and waits until atomic_read(&q->mq_freeze_depth) becomes 0. */
}
----------------------------------------

and meanwhile atomic_inc_return(&q->mq_freeze_depth) and
percpu_ref_kill() are called by	blk_freeze_queue_start()...

Now, it is up to you about how to fix this race problem.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: INFO: task hung in blk_queue_enter
  2018-06-01 10:10               ` Tetsuo Handa
@ 2018-06-01 17:52                 ` Jens Axboe
  2018-06-01 23:49                   ` Ming Lei
  2018-06-05  0:27                   ` Tetsuo Handa
  2018-06-04 11:46                 ` Dmitry Vyukov
  1 sibling, 2 replies; 32+ messages in thread
From: Jens Axboe @ 2018-06-01 17:52 UTC (permalink / raw)
  To: Tetsuo Handa, Bart.VanAssche, dvyukov
  Cc: linux-kernel, linux-block, jthumshirn, alan.christopher.jenkins,
	syzbot+c4f9cebf9d651f6e54de, martin.petersen, dan.j.williams,
	hch, oleksandr, ming.lei, martin, hare, syzkaller-bugs,
	ross.zwisler, keith.busch, linux-ext4

On 6/1/18 4:10 AM, Tetsuo Handa wrote:
> Tetsuo Handa wrote:
>> Since sum of percpu_count did not change after percpu_ref_kill(), this is
>> not a race condition while folding percpu counter values into atomic counter
>> value. That is, for some reason, someone who is responsible for calling
>> percpu_ref_put(&q->q_usage_counter) (presumably via blk_queue_exit()) is
>> unable to call percpu_ref_put().
>> But I don't know how to find someone who is failing to call percpu_ref_put()...
> 
> I found the someone. It was already there in the backtrace...
> 
> ----------------------------------------
> [   62.065852] a.out           D    0  4414   4337 0x00000000
> [   62.067677] Call Trace:
> [   62.068545]  __schedule+0x40b/0x860
> [   62.069726]  schedule+0x31/0x80
> [   62.070796]  schedule_timeout+0x1c1/0x3c0
> [   62.072159]  ? __next_timer_interrupt+0xd0/0xd0
> [   62.073670]  blk_queue_enter+0x218/0x520
> [   62.074985]  ? remove_wait_queue+0x70/0x70
> [   62.076361]  generic_make_request+0x3d/0x540
> [   62.077785]  ? __bio_clone_fast+0x6b/0x80
> [   62.079147]  ? bio_clone_fast+0x2c/0x70
> [   62.080456]  blk_queue_split+0x29b/0x560
> [   62.081772]  ? blk_queue_split+0x29b/0x560
> [   62.083162]  blk_mq_make_request+0x7c/0x430
> [   62.084562]  generic_make_request+0x276/0x540
> [   62.086034]  submit_bio+0x6e/0x140
> [   62.087185]  ? submit_bio+0x6e/0x140
> [   62.088384]  ? guard_bio_eod+0x9d/0x1d0
> [   62.089681]  do_mpage_readpage+0x328/0x730
> [   62.091045]  ? __add_to_page_cache_locked+0x12e/0x1a0
> [   62.092726]  mpage_readpages+0x120/0x190
> [   62.094034]  ? check_disk_change+0x70/0x70
> [   62.095454]  ? check_disk_change+0x70/0x70
> [   62.096849]  ? alloc_pages_current+0x65/0xd0
> [   62.098277]  blkdev_readpages+0x18/0x20
> [   62.099568]  __do_page_cache_readahead+0x298/0x360
> [   62.101157]  ondemand_readahead+0x1f6/0x490
> [   62.102546]  ? ondemand_readahead+0x1f6/0x490
> [   62.103995]  page_cache_sync_readahead+0x29/0x40
> [   62.105539]  generic_file_read_iter+0x7d0/0x9d0
> [   62.107067]  ? futex_wait+0x221/0x240
> [   62.108303]  ? trace_hardirqs_on+0xd/0x10
> [   62.109654]  blkdev_read_iter+0x30/0x40
> [   62.110954]  generic_file_splice_read+0xc5/0x140
> [   62.112538]  do_splice_to+0x74/0x90
> [   62.113726]  splice_direct_to_actor+0xa4/0x1f0
> [   62.115209]  ? generic_pipe_buf_nosteal+0x10/0x10
> [   62.116773]  do_splice_direct+0x8a/0xb0
> [   62.118056]  do_sendfile+0x1aa/0x390
> [   62.119255]  __x64_sys_sendfile64+0x4e/0xc0
> [   62.120666]  do_syscall_64+0x6e/0x210
> [   62.121909]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
> ----------------------------------------
> 
> The someone is blk_queue_split() from blk_mq_make_request() who depends on an
> assumption that blk_queue_enter() from recursively called generic_make_request()
> does not get blocked due to percpu_ref_tryget_live(&q->q_usage_counter) failure.
> 
> ----------------------------------------
> generic_make_request(struct bio *bio) {
>   if (blk_queue_enter(q, flags) < 0) { /* <= percpu_ref_tryget_live() succeeds. */
>     if (!blk_queue_dying(q) && (bio->bi_opf & REQ_NOWAIT))
>       bio_wouldblock_error(bio);
>     else
>       bio_io_error(bio);
>     return ret;
>   }
> (...snipped...)
>   ret = q->make_request_fn(q, bio);
> (...snipped...)
>   if (q)
>     blk_queue_exit(q);
> }
> ----------------------------------------
> 
> where q->make_request_fn == blk_mq_make_request which does
> 
> ----------------------------------------
> blk_mq_make_request(struct request_queue *q, struct bio *bio) {
>    blk_queue_split(q, &bio);
> }
> 
> blk_queue_split(struct request_queue *q, struct bio **bio) {
>   generic_make_request(*bio); /* <= percpu_ref_tryget_live() fails and waits until atomic_read(&q->mq_freeze_depth) becomes 0. */
> }
> ----------------------------------------
> 
> and meanwhile atomic_inc_return(&q->mq_freeze_depth) and
> percpu_ref_kill() are called by	blk_freeze_queue_start()...
> 
> Now, it is up to you about how to fix this race problem.

Ahh, nicely spotted! One idea would be the one below. For this case,
we're recursing, so we can either do a non-block queue enter, or we
can just do a live enter.

diff --git a/block/blk-core.c b/block/blk-core.c
index 85909b431eb0..7de12e0dcc3d 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2392,7 +2392,9 @@ blk_qc_t generic_make_request(struct bio *bio)
 
 	if (bio->bi_opf & REQ_NOWAIT)
 		flags = BLK_MQ_REQ_NOWAIT;
-	if (blk_queue_enter(q, flags) < 0) {
+	if (bio->bi_opf & REQ_QUEUE_ENTERED)
+		blk_queue_enter_live(q);
+	else if (blk_queue_enter(q, flags) < 0) {
 		if (!blk_queue_dying(q) && (bio->bi_opf & REQ_NOWAIT))
 			bio_wouldblock_error(bio);
 		else
diff --git a/block/blk-merge.c b/block/blk-merge.c
index 782940c65d8a..e1063e8f7eda 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -210,6 +210,16 @@ void blk_queue_split(struct request_queue *q, struct bio **bio)
 		/* there isn't chance to merge the splitted bio */
 		split->bi_opf |= REQ_NOMERGE;
 
+		/*
+		 * Since we're recursing into make_request here, ensure
+		 * that we mark this bio as already having entered the queue.
+		 * If not, and the queue is going away, we can get stuck
+		 * forever on waiting for the queue reference to drop. But
+		 * that will never happen, as we're already holding a
+		 * reference to it.
+		 */
+		(*bio)->bi_opf |= REQ_QUEUE_ENTERED;
+
 		bio_chain(split, *bio);
 		trace_block_split(q, split, (*bio)->bi_iter.bi_sector);
 		generic_make_request(*bio);
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 17b18b91ebac..a6bc789b83e7 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -282,6 +282,8 @@ enum req_flag_bits {
 	/* command specific flags for REQ_OP_WRITE_ZEROES: */
 	__REQ_NOUNMAP,		/* do not free blocks when zeroing */
 
+	__REQ_QUEUE_ENTERED,	/* queue already entered for this request */
+
 	/* for driver use */
 	__REQ_DRV,
 
@@ -302,9 +304,8 @@ enum req_flag_bits {
 #define REQ_RAHEAD		(1ULL << __REQ_RAHEAD)
 #define REQ_BACKGROUND		(1ULL << __REQ_BACKGROUND)
 #define REQ_NOWAIT		(1ULL << __REQ_NOWAIT)
-
 #define REQ_NOUNMAP		(1ULL << __REQ_NOUNMAP)
-
+#define REQ_QUEUE_ENTERED	(1ULL << __REQ_QUEUE_ENTERED)
 #define REQ_DRV			(1ULL << __REQ_DRV)
 
 #define REQ_FAILFAST_MASK \

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: INFO: task hung in blk_queue_enter
  2018-06-01 17:52                 ` Jens Axboe
@ 2018-06-01 23:49                   ` Ming Lei
  2018-06-02  0:49                     ` Jens Axboe
  2018-06-05  0:27                   ` Tetsuo Handa
  1 sibling, 1 reply; 32+ messages in thread
From: Ming Lei @ 2018-06-01 23:49 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Tetsuo Handa, Bart.VanAssche, dvyukov, linux-kernel, linux-block,
	jthumshirn, alan.christopher.jenkins,
	syzbot+c4f9cebf9d651f6e54de, martin.petersen, dan.j.williams,
	hch, oleksandr, martin, hare, syzkaller-bugs, ross.zwisler,
	keith.busch, linux-ext4

On Fri, Jun 01, 2018 at 11:52:45AM -0600, Jens Axboe wrote:
> On 6/1/18 4:10 AM, Tetsuo Handa wrote:
> > Tetsuo Handa wrote:
> >> Since sum of percpu_count did not change after percpu_ref_kill(), this is
> >> not a race condition while folding percpu counter values into atomic counter
> >> value. That is, for some reason, someone who is responsible for calling
> >> percpu_ref_put(&q->q_usage_counter) (presumably via blk_queue_exit()) is
> >> unable to call percpu_ref_put().
> >> But I don't know how to find someone who is failing to call percpu_ref_put()...
> > 
> > I found the someone. It was already there in the backtrace...
> > 
> > ----------------------------------------
> > [   62.065852] a.out           D    0  4414   4337 0x00000000
> > [   62.067677] Call Trace:
> > [   62.068545]  __schedule+0x40b/0x860
> > [   62.069726]  schedule+0x31/0x80
> > [   62.070796]  schedule_timeout+0x1c1/0x3c0
> > [   62.072159]  ? __next_timer_interrupt+0xd0/0xd0
> > [   62.073670]  blk_queue_enter+0x218/0x520
> > [   62.074985]  ? remove_wait_queue+0x70/0x70
> > [   62.076361]  generic_make_request+0x3d/0x540
> > [   62.077785]  ? __bio_clone_fast+0x6b/0x80
> > [   62.079147]  ? bio_clone_fast+0x2c/0x70
> > [   62.080456]  blk_queue_split+0x29b/0x560
> > [   62.081772]  ? blk_queue_split+0x29b/0x560
> > [   62.083162]  blk_mq_make_request+0x7c/0x430
> > [   62.084562]  generic_make_request+0x276/0x540
> > [   62.086034]  submit_bio+0x6e/0x140
> > [   62.087185]  ? submit_bio+0x6e/0x140
> > [   62.088384]  ? guard_bio_eod+0x9d/0x1d0
> > [   62.089681]  do_mpage_readpage+0x328/0x730
> > [   62.091045]  ? __add_to_page_cache_locked+0x12e/0x1a0
> > [   62.092726]  mpage_readpages+0x120/0x190
> > [   62.094034]  ? check_disk_change+0x70/0x70
> > [   62.095454]  ? check_disk_change+0x70/0x70
> > [   62.096849]  ? alloc_pages_current+0x65/0xd0
> > [   62.098277]  blkdev_readpages+0x18/0x20
> > [   62.099568]  __do_page_cache_readahead+0x298/0x360
> > [   62.101157]  ondemand_readahead+0x1f6/0x490
> > [   62.102546]  ? ondemand_readahead+0x1f6/0x490
> > [   62.103995]  page_cache_sync_readahead+0x29/0x40
> > [   62.105539]  generic_file_read_iter+0x7d0/0x9d0
> > [   62.107067]  ? futex_wait+0x221/0x240
> > [   62.108303]  ? trace_hardirqs_on+0xd/0x10
> > [   62.109654]  blkdev_read_iter+0x30/0x40
> > [   62.110954]  generic_file_splice_read+0xc5/0x140
> > [   62.112538]  do_splice_to+0x74/0x90
> > [   62.113726]  splice_direct_to_actor+0xa4/0x1f0
> > [   62.115209]  ? generic_pipe_buf_nosteal+0x10/0x10
> > [   62.116773]  do_splice_direct+0x8a/0xb0
> > [   62.118056]  do_sendfile+0x1aa/0x390
> > [   62.119255]  __x64_sys_sendfile64+0x4e/0xc0
> > [   62.120666]  do_syscall_64+0x6e/0x210
> > [   62.121909]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
> > ----------------------------------------
> > 
> > The someone is blk_queue_split() from blk_mq_make_request() who depends on an
> > assumption that blk_queue_enter() from recursively called generic_make_request()
> > does not get blocked due to percpu_ref_tryget_live(&q->q_usage_counter) failure.
> > 
> > ----------------------------------------
> > generic_make_request(struct bio *bio) {
> >   if (blk_queue_enter(q, flags) < 0) { /* <= percpu_ref_tryget_live() succeeds. */
> >     if (!blk_queue_dying(q) && (bio->bi_opf & REQ_NOWAIT))
> >       bio_wouldblock_error(bio);
> >     else
> >       bio_io_error(bio);
> >     return ret;
> >   }
> > (...snipped...)
> >   ret = q->make_request_fn(q, bio);
> > (...snipped...)
> >   if (q)
> >     blk_queue_exit(q);
> > }
> > ----------------------------------------
> > 
> > where q->make_request_fn == blk_mq_make_request which does
> > 
> > ----------------------------------------
> > blk_mq_make_request(struct request_queue *q, struct bio *bio) {
> >    blk_queue_split(q, &bio);
> > }
> > 
> > blk_queue_split(struct request_queue *q, struct bio **bio) {
> >   generic_make_request(*bio); /* <= percpu_ref_tryget_live() fails and waits until atomic_read(&q->mq_freeze_depth) becomes 0. */
> > }
> > ----------------------------------------
> > 
> > and meanwhile atomic_inc_return(&q->mq_freeze_depth) and
> > percpu_ref_kill() are called by	blk_freeze_queue_start()...
> > 
> > Now, it is up to you about how to fix this race problem.
> 
> Ahh, nicely spotted! One idea would be the one below. For this case,
> we're recursing, so we can either do a non-block queue enter, or we
> can just do a live enter.
> 
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 85909b431eb0..7de12e0dcc3d 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -2392,7 +2392,9 @@ blk_qc_t generic_make_request(struct bio *bio)
>  
>  	if (bio->bi_opf & REQ_NOWAIT)
>  		flags = BLK_MQ_REQ_NOWAIT;
> -	if (blk_queue_enter(q, flags) < 0) {
> +	if (bio->bi_opf & REQ_QUEUE_ENTERED)
> +		blk_queue_enter_live(q);
> +	else if (blk_queue_enter(q, flags) < 0) {
>  		if (!blk_queue_dying(q) && (bio->bi_opf & REQ_NOWAIT))
>  			bio_wouldblock_error(bio);
>  		else
> diff --git a/block/blk-merge.c b/block/blk-merge.c
> index 782940c65d8a..e1063e8f7eda 100644
> --- a/block/blk-merge.c
> +++ b/block/blk-merge.c
> @@ -210,6 +210,16 @@ void blk_queue_split(struct request_queue *q, struct bio **bio)
>  		/* there isn't chance to merge the splitted bio */
>  		split->bi_opf |= REQ_NOMERGE;
>  
> +		/*
> +		 * Since we're recursing into make_request here, ensure
> +		 * that we mark this bio as already having entered the queue.
> +		 * If not, and the queue is going away, we can get stuck
> +		 * forever on waiting for the queue reference to drop. But
> +		 * that will never happen, as we're already holding a
> +		 * reference to it.
> +		 */
> +		(*bio)->bi_opf |= REQ_QUEUE_ENTERED;
> +
>  		bio_chain(split, *bio);
>  		trace_block_split(q, split, (*bio)->bi_iter.bi_sector);
>  		generic_make_request(*bio);
> diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
> index 17b18b91ebac..a6bc789b83e7 100644
> --- a/include/linux/blk_types.h
> +++ b/include/linux/blk_types.h
> @@ -282,6 +282,8 @@ enum req_flag_bits {
>  	/* command specific flags for REQ_OP_WRITE_ZEROES: */
>  	__REQ_NOUNMAP,		/* do not free blocks when zeroing */
>  
> +	__REQ_QUEUE_ENTERED,	/* queue already entered for this request */
> +
>  	/* for driver use */
>  	__REQ_DRV,
>  
> @@ -302,9 +304,8 @@ enum req_flag_bits {
>  #define REQ_RAHEAD		(1ULL << __REQ_RAHEAD)
>  #define REQ_BACKGROUND		(1ULL << __REQ_BACKGROUND)
>  #define REQ_NOWAIT		(1ULL << __REQ_NOWAIT)
> -
>  #define REQ_NOUNMAP		(1ULL << __REQ_NOUNMAP)
> -
> +#define REQ_QUEUE_ENTERED	(1ULL << __REQ_QUEUE_ENTERED)
>  #define REQ_DRV			(1ULL << __REQ_DRV)
>  
>  #define REQ_FAILFAST_MASK \

bio clone inherits .bi_opf, so this way may not work for cloned bio.

Given most of bio_split() is run in .make_request_fn(), holding the
queue ref may not be needed in case of bio splitting, so another
candidate fix might be to introduce one extra parameter of 'need_queue_ref'
to generic_make_request().

Thanks,
Ming

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: INFO: task hung in blk_queue_enter
  2018-06-01 23:49                   ` Ming Lei
@ 2018-06-02  0:49                     ` Jens Axboe
  2018-06-02  0:56                       ` Jens Axboe
  2018-06-02  2:36                       ` Ming Lei
  0 siblings, 2 replies; 32+ messages in thread
From: Jens Axboe @ 2018-06-02  0:49 UTC (permalink / raw)
  To: Ming Lei
  Cc: Tetsuo Handa, Bart.VanAssche, dvyukov, linux-kernel, linux-block,
	jthumshirn, alan.christopher.jenkins,
	syzbot+c4f9cebf9d651f6e54de, martin.petersen, dan.j.williams,
	hch, oleksandr, martin, hare, syzkaller-bugs, ross.zwisler,
	keith.busch, linux-ext4

On 6/1/18 5:49 PM, Ming Lei wrote:
> On Fri, Jun 01, 2018 at 11:52:45AM -0600, Jens Axboe wrote:
>> On 6/1/18 4:10 AM, Tetsuo Handa wrote:
>>> Tetsuo Handa wrote:
>>>> Since sum of percpu_count did not change after percpu_ref_kill(), this is
>>>> not a race condition while folding percpu counter values into atomic counter
>>>> value. That is, for some reason, someone who is responsible for calling
>>>> percpu_ref_put(&q->q_usage_counter) (presumably via blk_queue_exit()) is
>>>> unable to call percpu_ref_put().
>>>> But I don't know how to find someone who is failing to call percpu_ref_put()...
>>>
>>> I found the someone. It was already there in the backtrace...
>>>
>>> ----------------------------------------
>>> [   62.065852] a.out           D    0  4414   4337 0x00000000
>>> [   62.067677] Call Trace:
>>> [   62.068545]  __schedule+0x40b/0x860
>>> [   62.069726]  schedule+0x31/0x80
>>> [   62.070796]  schedule_timeout+0x1c1/0x3c0
>>> [   62.072159]  ? __next_timer_interrupt+0xd0/0xd0
>>> [   62.073670]  blk_queue_enter+0x218/0x520
>>> [   62.074985]  ? remove_wait_queue+0x70/0x70
>>> [   62.076361]  generic_make_request+0x3d/0x540
>>> [   62.077785]  ? __bio_clone_fast+0x6b/0x80
>>> [   62.079147]  ? bio_clone_fast+0x2c/0x70
>>> [   62.080456]  blk_queue_split+0x29b/0x560
>>> [   62.081772]  ? blk_queue_split+0x29b/0x560
>>> [   62.083162]  blk_mq_make_request+0x7c/0x430
>>> [   62.084562]  generic_make_request+0x276/0x540
>>> [   62.086034]  submit_bio+0x6e/0x140
>>> [   62.087185]  ? submit_bio+0x6e/0x140
>>> [   62.088384]  ? guard_bio_eod+0x9d/0x1d0
>>> [   62.089681]  do_mpage_readpage+0x328/0x730
>>> [   62.091045]  ? __add_to_page_cache_locked+0x12e/0x1a0
>>> [   62.092726]  mpage_readpages+0x120/0x190
>>> [   62.094034]  ? check_disk_change+0x70/0x70
>>> [   62.095454]  ? check_disk_change+0x70/0x70
>>> [   62.096849]  ? alloc_pages_current+0x65/0xd0
>>> [   62.098277]  blkdev_readpages+0x18/0x20
>>> [   62.099568]  __do_page_cache_readahead+0x298/0x360
>>> [   62.101157]  ondemand_readahead+0x1f6/0x490
>>> [   62.102546]  ? ondemand_readahead+0x1f6/0x490
>>> [   62.103995]  page_cache_sync_readahead+0x29/0x40
>>> [   62.105539]  generic_file_read_iter+0x7d0/0x9d0
>>> [   62.107067]  ? futex_wait+0x221/0x240
>>> [   62.108303]  ? trace_hardirqs_on+0xd/0x10
>>> [   62.109654]  blkdev_read_iter+0x30/0x40
>>> [   62.110954]  generic_file_splice_read+0xc5/0x140
>>> [   62.112538]  do_splice_to+0x74/0x90
>>> [   62.113726]  splice_direct_to_actor+0xa4/0x1f0
>>> [   62.115209]  ? generic_pipe_buf_nosteal+0x10/0x10
>>> [   62.116773]  do_splice_direct+0x8a/0xb0
>>> [   62.118056]  do_sendfile+0x1aa/0x390
>>> [   62.119255]  __x64_sys_sendfile64+0x4e/0xc0
>>> [   62.120666]  do_syscall_64+0x6e/0x210
>>> [   62.121909]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
>>> ----------------------------------------
>>>
>>> The someone is blk_queue_split() from blk_mq_make_request() who depends on an
>>> assumption that blk_queue_enter() from recursively called generic_make_request()
>>> does not get blocked due to percpu_ref_tryget_live(&q->q_usage_counter) failure.
>>>
>>> ----------------------------------------
>>> generic_make_request(struct bio *bio) {
>>>   if (blk_queue_enter(q, flags) < 0) { /* <= percpu_ref_tryget_live() succeeds. */
>>>     if (!blk_queue_dying(q) && (bio->bi_opf & REQ_NOWAIT))
>>>       bio_wouldblock_error(bio);
>>>     else
>>>       bio_io_error(bio);
>>>     return ret;
>>>   }
>>> (...snipped...)
>>>   ret = q->make_request_fn(q, bio);
>>> (...snipped...)
>>>   if (q)
>>>     blk_queue_exit(q);
>>> }
>>> ----------------------------------------
>>>
>>> where q->make_request_fn == blk_mq_make_request which does
>>>
>>> ----------------------------------------
>>> blk_mq_make_request(struct request_queue *q, struct bio *bio) {
>>>    blk_queue_split(q, &bio);
>>> }
>>>
>>> blk_queue_split(struct request_queue *q, struct bio **bio) {
>>>   generic_make_request(*bio); /* <= percpu_ref_tryget_live() fails and waits until atomic_read(&q->mq_freeze_depth) becomes 0. */
>>> }
>>> ----------------------------------------
>>>
>>> and meanwhile atomic_inc_return(&q->mq_freeze_depth) and
>>> percpu_ref_kill() are called by	blk_freeze_queue_start()...
>>>
>>> Now, it is up to you about how to fix this race problem.
>>
>> Ahh, nicely spotted! One idea would be the one below. For this case,
>> we're recursing, so we can either do a non-block queue enter, or we
>> can just do a live enter.
>>
>> diff --git a/block/blk-core.c b/block/blk-core.c
>> index 85909b431eb0..7de12e0dcc3d 100644
>> --- a/block/blk-core.c
>> +++ b/block/blk-core.c
>> @@ -2392,7 +2392,9 @@ blk_qc_t generic_make_request(struct bio *bio)
>>  
>>  	if (bio->bi_opf & REQ_NOWAIT)
>>  		flags = BLK_MQ_REQ_NOWAIT;
>> -	if (blk_queue_enter(q, flags) < 0) {
>> +	if (bio->bi_opf & REQ_QUEUE_ENTERED)
>> +		blk_queue_enter_live(q);
>> +	else if (blk_queue_enter(q, flags) < 0) {
>>  		if (!blk_queue_dying(q) && (bio->bi_opf & REQ_NOWAIT))
>>  			bio_wouldblock_error(bio);
>>  		else
>> diff --git a/block/blk-merge.c b/block/blk-merge.c
>> index 782940c65d8a..e1063e8f7eda 100644
>> --- a/block/blk-merge.c
>> +++ b/block/blk-merge.c
>> @@ -210,6 +210,16 @@ void blk_queue_split(struct request_queue *q, struct bio **bio)
>>  		/* there isn't chance to merge the splitted bio */
>>  		split->bi_opf |= REQ_NOMERGE;
>>  
>> +		/*
>> +		 * Since we're recursing into make_request here, ensure
>> +		 * that we mark this bio as already having entered the queue.
>> +		 * If not, and the queue is going away, we can get stuck
>> +		 * forever on waiting for the queue reference to drop. But
>> +		 * that will never happen, as we're already holding a
>> +		 * reference to it.
>> +		 */
>> +		(*bio)->bi_opf |= REQ_QUEUE_ENTERED;
>> +
>>  		bio_chain(split, *bio);
>>  		trace_block_split(q, split, (*bio)->bi_iter.bi_sector);
>>  		generic_make_request(*bio);
>> diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
>> index 17b18b91ebac..a6bc789b83e7 100644
>> --- a/include/linux/blk_types.h
>> +++ b/include/linux/blk_types.h
>> @@ -282,6 +282,8 @@ enum req_flag_bits {
>>  	/* command specific flags for REQ_OP_WRITE_ZEROES: */
>>  	__REQ_NOUNMAP,		/* do not free blocks when zeroing */
>>  
>> +	__REQ_QUEUE_ENTERED,	/* queue already entered for this request */
>> +
>>  	/* for driver use */
>>  	__REQ_DRV,
>>  
>> @@ -302,9 +304,8 @@ enum req_flag_bits {
>>  #define REQ_RAHEAD		(1ULL << __REQ_RAHEAD)
>>  #define REQ_BACKGROUND		(1ULL << __REQ_BACKGROUND)
>>  #define REQ_NOWAIT		(1ULL << __REQ_NOWAIT)
>> -
>>  #define REQ_NOUNMAP		(1ULL << __REQ_NOUNMAP)
>> -
>> +#define REQ_QUEUE_ENTERED	(1ULL << __REQ_QUEUE_ENTERED)
>>  #define REQ_DRV			(1ULL << __REQ_DRV)
>>  
>>  #define REQ_FAILFAST_MASK \
> 
> bio clone inherits .bi_opf, so this way may not work for cloned bio.
> 
> Given most of bio_split() is run in .make_request_fn(), holding the
> queue ref may not be needed in case of bio splitting, so another
> candidate fix might be to introduce one extra parameter of
> 'need_queue_ref' to generic_make_request().

I did consider that, but that's pretty silly for a corner case and you'd
have to change a crap ton of calls to it. I think it'd be cleaner to
clear the bit when we need to, potentially even adding a debug check to
blk_queue_enter_live() that complains if the ref was not already
elevated. Though that would be expensive, compared to the percpu inc
now. Not saying the bit is necessarily the best way forward, but I do
like it a LOT more than adding an argument to generic_make_request.

It should have been a bio flag to begin with, that better captures the
scope and also avoids clones inheriting anything they shouldn't. It's
also not applicable to a request.

diff --git a/block/blk-core.c b/block/blk-core.c
index 85909b431eb0..8f1063f50863 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2392,7 +2392,9 @@ blk_qc_t generic_make_request(struct bio *bio)
 
 	if (bio->bi_opf & REQ_NOWAIT)
 		flags = BLK_MQ_REQ_NOWAIT;
-	if (blk_queue_enter(q, flags) < 0) {
+	if (bio->bi_flags & BIO_QUEUE_ENTERED)
+		blk_queue_enter_live(q);
+	else if (blk_queue_enter(q, flags) < 0) {
 		if (!blk_queue_dying(q) && (bio->bi_opf & REQ_NOWAIT))
 			bio_wouldblock_error(bio);
 		else
diff --git a/block/blk-merge.c b/block/blk-merge.c
index 782940c65d8a..9ee9474e579c 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -210,6 +210,16 @@ void blk_queue_split(struct request_queue *q, struct bio **bio)
 		/* there isn't chance to merge the splitted bio */
 		split->bi_opf |= REQ_NOMERGE;
 
+		/*
+		 * Since we're recursing into make_request here, ensure
+		 * that we mark this bio as already having entered the queue.
+		 * If not, and the queue is going away, we can get stuck
+		 * forever on waiting for the queue reference to drop. But
+		 * that will never happen, as we're already holding a
+		 * reference to it.
+		 */
+		(*bio)->bi_flags |= BIO_QUEUE_ENTERED;
+
 		bio_chain(split, *bio);
 		trace_block_split(q, split, (*bio)->bi_iter.bi_sector);
 		generic_make_request(*bio);
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 17b18b91ebac..4687d7c99076 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -186,6 +186,8 @@ struct bio {
 				 * throttling rules. Don't do it again. */
 #define BIO_TRACE_COMPLETION 10	/* bio_endio() should trace the final completion
 				 * of this bio. */
+#define BIO_QUEUE_ENTERED 11	/* can use blk_queue_enter_live() */
+
 /* See BVEC_POOL_OFFSET below before adding new flags */
 
 /*

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: INFO: task hung in blk_queue_enter
  2018-06-02  0:49                     ` Jens Axboe
@ 2018-06-02  0:56                       ` Jens Axboe
  2018-06-02  2:36                       ` Ming Lei
  1 sibling, 0 replies; 32+ messages in thread
From: Jens Axboe @ 2018-06-02  0:56 UTC (permalink / raw)
  To: Ming Lei
  Cc: Tetsuo Handa, Bart.VanAssche, dvyukov, linux-kernel, linux-block,
	jthumshirn, alan.christopher.jenkins,
	syzbot+c4f9cebf9d651f6e54de, martin.petersen, dan.j.williams,
	hch, oleksandr, martin, hare, syzkaller-bugs, ross.zwisler,
	keith.busch, linux-ext4

On 6/1/18 6:49 PM, Jens Axboe wrote:
> On 6/1/18 5:49 PM, Ming Lei wrote:
>> On Fri, Jun 01, 2018 at 11:52:45AM -0600, Jens Axboe wrote:
>>> On 6/1/18 4:10 AM, Tetsuo Handa wrote:
>>>> Tetsuo Handa wrote:
>>>>> Since sum of percpu_count did not change after percpu_ref_kill(), this is
>>>>> not a race condition while folding percpu counter values into atomic counter
>>>>> value. That is, for some reason, someone who is responsible for calling
>>>>> percpu_ref_put(&q->q_usage_counter) (presumably via blk_queue_exit()) is
>>>>> unable to call percpu_ref_put().
>>>>> But I don't know how to find someone who is failing to call percpu_ref_put()...
>>>>
>>>> I found the someone. It was already there in the backtrace...
>>>>
>>>> ----------------------------------------
>>>> [   62.065852] a.out           D    0  4414   4337 0x00000000
>>>> [   62.067677] Call Trace:
>>>> [   62.068545]  __schedule+0x40b/0x860
>>>> [   62.069726]  schedule+0x31/0x80
>>>> [   62.070796]  schedule_timeout+0x1c1/0x3c0
>>>> [   62.072159]  ? __next_timer_interrupt+0xd0/0xd0
>>>> [   62.073670]  blk_queue_enter+0x218/0x520
>>>> [   62.074985]  ? remove_wait_queue+0x70/0x70
>>>> [   62.076361]  generic_make_request+0x3d/0x540
>>>> [   62.077785]  ? __bio_clone_fast+0x6b/0x80
>>>> [   62.079147]  ? bio_clone_fast+0x2c/0x70
>>>> [   62.080456]  blk_queue_split+0x29b/0x560
>>>> [   62.081772]  ? blk_queue_split+0x29b/0x560
>>>> [   62.083162]  blk_mq_make_request+0x7c/0x430
>>>> [   62.084562]  generic_make_request+0x276/0x540
>>>> [   62.086034]  submit_bio+0x6e/0x140
>>>> [   62.087185]  ? submit_bio+0x6e/0x140
>>>> [   62.088384]  ? guard_bio_eod+0x9d/0x1d0
>>>> [   62.089681]  do_mpage_readpage+0x328/0x730
>>>> [   62.091045]  ? __add_to_page_cache_locked+0x12e/0x1a0
>>>> [   62.092726]  mpage_readpages+0x120/0x190
>>>> [   62.094034]  ? check_disk_change+0x70/0x70
>>>> [   62.095454]  ? check_disk_change+0x70/0x70
>>>> [   62.096849]  ? alloc_pages_current+0x65/0xd0
>>>> [   62.098277]  blkdev_readpages+0x18/0x20
>>>> [   62.099568]  __do_page_cache_readahead+0x298/0x360
>>>> [   62.101157]  ondemand_readahead+0x1f6/0x490
>>>> [   62.102546]  ? ondemand_readahead+0x1f6/0x490
>>>> [   62.103995]  page_cache_sync_readahead+0x29/0x40
>>>> [   62.105539]  generic_file_read_iter+0x7d0/0x9d0
>>>> [   62.107067]  ? futex_wait+0x221/0x240
>>>> [   62.108303]  ? trace_hardirqs_on+0xd/0x10
>>>> [   62.109654]  blkdev_read_iter+0x30/0x40
>>>> [   62.110954]  generic_file_splice_read+0xc5/0x140
>>>> [   62.112538]  do_splice_to+0x74/0x90
>>>> [   62.113726]  splice_direct_to_actor+0xa4/0x1f0
>>>> [   62.115209]  ? generic_pipe_buf_nosteal+0x10/0x10
>>>> [   62.116773]  do_splice_direct+0x8a/0xb0
>>>> [   62.118056]  do_sendfile+0x1aa/0x390
>>>> [   62.119255]  __x64_sys_sendfile64+0x4e/0xc0
>>>> [   62.120666]  do_syscall_64+0x6e/0x210
>>>> [   62.121909]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
>>>> ----------------------------------------
>>>>
>>>> The someone is blk_queue_split() from blk_mq_make_request() who depends on an
>>>> assumption that blk_queue_enter() from recursively called generic_make_request()
>>>> does not get blocked due to percpu_ref_tryget_live(&q->q_usage_counter) failure.
>>>>
>>>> ----------------------------------------
>>>> generic_make_request(struct bio *bio) {
>>>>   if (blk_queue_enter(q, flags) < 0) { /* <= percpu_ref_tryget_live() succeeds. */
>>>>     if (!blk_queue_dying(q) && (bio->bi_opf & REQ_NOWAIT))
>>>>       bio_wouldblock_error(bio);
>>>>     else
>>>>       bio_io_error(bio);
>>>>     return ret;
>>>>   }
>>>> (...snipped...)
>>>>   ret = q->make_request_fn(q, bio);
>>>> (...snipped...)
>>>>   if (q)
>>>>     blk_queue_exit(q);
>>>> }
>>>> ----------------------------------------
>>>>
>>>> where q->make_request_fn == blk_mq_make_request which does
>>>>
>>>> ----------------------------------------
>>>> blk_mq_make_request(struct request_queue *q, struct bio *bio) {
>>>>    blk_queue_split(q, &bio);
>>>> }
>>>>
>>>> blk_queue_split(struct request_queue *q, struct bio **bio) {
>>>>   generic_make_request(*bio); /* <= percpu_ref_tryget_live() fails and waits until atomic_read(&q->mq_freeze_depth) becomes 0. */
>>>> }
>>>> ----------------------------------------
>>>>
>>>> and meanwhile atomic_inc_return(&q->mq_freeze_depth) and
>>>> percpu_ref_kill() are called by	blk_freeze_queue_start()...
>>>>
>>>> Now, it is up to you about how to fix this race problem.
>>>
>>> Ahh, nicely spotted! One idea would be the one below. For this case,
>>> we're recursing, so we can either do a non-block queue enter, or we
>>> can just do a live enter.
>>>
>>> diff --git a/block/blk-core.c b/block/blk-core.c
>>> index 85909b431eb0..7de12e0dcc3d 100644
>>> --- a/block/blk-core.c
>>> +++ b/block/blk-core.c
>>> @@ -2392,7 +2392,9 @@ blk_qc_t generic_make_request(struct bio *bio)
>>>  
>>>  	if (bio->bi_opf & REQ_NOWAIT)
>>>  		flags = BLK_MQ_REQ_NOWAIT;
>>> -	if (blk_queue_enter(q, flags) < 0) {
>>> +	if (bio->bi_opf & REQ_QUEUE_ENTERED)
>>> +		blk_queue_enter_live(q);
>>> +	else if (blk_queue_enter(q, flags) < 0) {
>>>  		if (!blk_queue_dying(q) && (bio->bi_opf & REQ_NOWAIT))
>>>  			bio_wouldblock_error(bio);
>>>  		else
>>> diff --git a/block/blk-merge.c b/block/blk-merge.c
>>> index 782940c65d8a..e1063e8f7eda 100644
>>> --- a/block/blk-merge.c
>>> +++ b/block/blk-merge.c
>>> @@ -210,6 +210,16 @@ void blk_queue_split(struct request_queue *q, struct bio **bio)
>>>  		/* there isn't chance to merge the splitted bio */
>>>  		split->bi_opf |= REQ_NOMERGE;
>>>  
>>> +		/*
>>> +		 * Since we're recursing into make_request here, ensure
>>> +		 * that we mark this bio as already having entered the queue.
>>> +		 * If not, and the queue is going away, we can get stuck
>>> +		 * forever on waiting for the queue reference to drop. But
>>> +		 * that will never happen, as we're already holding a
>>> +		 * reference to it.
>>> +		 */
>>> +		(*bio)->bi_opf |= REQ_QUEUE_ENTERED;
>>> +
>>>  		bio_chain(split, *bio);
>>>  		trace_block_split(q, split, (*bio)->bi_iter.bi_sector);
>>>  		generic_make_request(*bio);
>>> diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
>>> index 17b18b91ebac..a6bc789b83e7 100644
>>> --- a/include/linux/blk_types.h
>>> +++ b/include/linux/blk_types.h
>>> @@ -282,6 +282,8 @@ enum req_flag_bits {
>>>  	/* command specific flags for REQ_OP_WRITE_ZEROES: */
>>>  	__REQ_NOUNMAP,		/* do not free blocks when zeroing */
>>>  
>>> +	__REQ_QUEUE_ENTERED,	/* queue already entered for this request */
>>> +
>>>  	/* for driver use */
>>>  	__REQ_DRV,
>>>  
>>> @@ -302,9 +304,8 @@ enum req_flag_bits {
>>>  #define REQ_RAHEAD		(1ULL << __REQ_RAHEAD)
>>>  #define REQ_BACKGROUND		(1ULL << __REQ_BACKGROUND)
>>>  #define REQ_NOWAIT		(1ULL << __REQ_NOWAIT)
>>> -
>>>  #define REQ_NOUNMAP		(1ULL << __REQ_NOUNMAP)
>>> -
>>> +#define REQ_QUEUE_ENTERED	(1ULL << __REQ_QUEUE_ENTERED)
>>>  #define REQ_DRV			(1ULL << __REQ_DRV)
>>>  
>>>  #define REQ_FAILFAST_MASK \
>>
>> bio clone inherits .bi_opf, so this way may not work for cloned bio.
>>
>> Given most of bio_split() is run in .make_request_fn(), holding the
>> queue ref may not be needed in case of bio splitting, so another
>> candidate fix might be to introduce one extra parameter of
>> 'need_queue_ref' to generic_make_request().
> 
> I did consider that, but that's pretty silly for a corner case and you'd
> have to change a crap ton of calls to it. I think it'd be cleaner to
> clear the bit when we need to, potentially even adding a debug check to
> blk_queue_enter_live() that complains if the ref was not already
> elevated. Though that would be expensive, compared to the percpu inc
> now. Not saying the bit is necessarily the best way forward, but I do
> like it a LOT more than adding an argument to generic_make_request.
> 
> It should have been a bio flag to begin with, that better captures the
> scope and also avoids clones inheriting anything they shouldn't. It's
> also not applicable to a request.

A little too much inheritance from the previous patch, better version
below.

diff --git a/block/blk-core.c b/block/blk-core.c
index 85909b431eb0..71cb228d7374 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2392,7 +2392,9 @@ blk_qc_t generic_make_request(struct bio *bio)
 
 	if (bio->bi_opf & REQ_NOWAIT)
 		flags = BLK_MQ_REQ_NOWAIT;
-	if (blk_queue_enter(q, flags) < 0) {
+	if (bio_flagged(bio, BIO_QUEUE_ENTERED))
+		blk_queue_enter_live(q);
+	else if (blk_queue_enter(q, flags) < 0) {
 		if (!blk_queue_dying(q) && (bio->bi_opf & REQ_NOWAIT))
 			bio_wouldblock_error(bio);
 		else
diff --git a/block/blk-merge.c b/block/blk-merge.c
index 782940c65d8a..481dc02668f9 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -210,6 +210,16 @@ void blk_queue_split(struct request_queue *q, struct bio **bio)
 		/* there isn't chance to merge the splitted bio */
 		split->bi_opf |= REQ_NOMERGE;
 
+		/*
+		 * Since we're recursing into make_request here, ensure
+		 * that we mark this bio as already having entered the queue.
+		 * If not, and the queue is going away, we can get stuck
+		 * forever on waiting for the queue reference to drop. But
+		 * that will never happen, as we're already holding a
+		 * reference to it.
+		 */
+		bio_set_flag(*bio, BIO_QUEUE_ENTERED);
+
 		bio_chain(split, *bio);
 		trace_block_split(q, split, (*bio)->bi_iter.bi_sector);
 		generic_make_request(*bio);
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 17b18b91ebac..1602bf4ab4cd 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -186,6 +186,8 @@ struct bio {
 				 * throttling rules. Don't do it again. */
 #define BIO_TRACE_COMPLETION 10	/* bio_endio() should trace the final completion
 				 * of this bio. */
+#define BIO_QUEUE_ENTERED 11	/* can use blk_queue_enter_live() */
+
 /* See BVEC_POOL_OFFSET below before adding new flags */
 
 /*

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: INFO: task hung in blk_queue_enter
  2018-06-02  0:49                     ` Jens Axboe
  2018-06-02  0:56                       ` Jens Axboe
@ 2018-06-02  2:36                       ` Ming Lei
  2018-06-02  4:31                         ` Jens Axboe
  1 sibling, 1 reply; 32+ messages in thread
From: Ming Lei @ 2018-06-02  2:36 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Tetsuo Handa, Bart.VanAssche, dvyukov, linux-kernel, linux-block,
	jthumshirn, alan.christopher.jenkins,
	syzbot+c4f9cebf9d651f6e54de, martin.petersen, dan.j.williams,
	hch, oleksandr, martin, hare, syzkaller-bugs, ross.zwisler,
	keith.busch, linux-ext4

On Fri, Jun 01, 2018 at 06:49:10PM -0600, Jens Axboe wrote:
> On 6/1/18 5:49 PM, Ming Lei wrote:
> > On Fri, Jun 01, 2018 at 11:52:45AM -0600, Jens Axboe wrote:
> >> On 6/1/18 4:10 AM, Tetsuo Handa wrote:
> >>> Tetsuo Handa wrote:
> >>>> Since sum of percpu_count did not change after percpu_ref_kill(), this is
> >>>> not a race condition while folding percpu counter values into atomic counter
> >>>> value. That is, for some reason, someone who is responsible for calling
> >>>> percpu_ref_put(&q->q_usage_counter) (presumably via blk_queue_exit()) is
> >>>> unable to call percpu_ref_put().
> >>>> But I don't know how to find someone who is failing to call percpu_ref_put()...
> >>>
> >>> I found the someone. It was already there in the backtrace...
> >>>
> >>> ----------------------------------------
> >>> [   62.065852] a.out           D    0  4414   4337 0x00000000
> >>> [   62.067677] Call Trace:
> >>> [   62.068545]  __schedule+0x40b/0x860
> >>> [   62.069726]  schedule+0x31/0x80
> >>> [   62.070796]  schedule_timeout+0x1c1/0x3c0
> >>> [   62.072159]  ? __next_timer_interrupt+0xd0/0xd0
> >>> [   62.073670]  blk_queue_enter+0x218/0x520
> >>> [   62.074985]  ? remove_wait_queue+0x70/0x70
> >>> [   62.076361]  generic_make_request+0x3d/0x540
> >>> [   62.077785]  ? __bio_clone_fast+0x6b/0x80
> >>> [   62.079147]  ? bio_clone_fast+0x2c/0x70
> >>> [   62.080456]  blk_queue_split+0x29b/0x560
> >>> [   62.081772]  ? blk_queue_split+0x29b/0x560
> >>> [   62.083162]  blk_mq_make_request+0x7c/0x430
> >>> [   62.084562]  generic_make_request+0x276/0x540
> >>> [   62.086034]  submit_bio+0x6e/0x140
> >>> [   62.087185]  ? submit_bio+0x6e/0x140
> >>> [   62.088384]  ? guard_bio_eod+0x9d/0x1d0
> >>> [   62.089681]  do_mpage_readpage+0x328/0x730
> >>> [   62.091045]  ? __add_to_page_cache_locked+0x12e/0x1a0
> >>> [   62.092726]  mpage_readpages+0x120/0x190
> >>> [   62.094034]  ? check_disk_change+0x70/0x70
> >>> [   62.095454]  ? check_disk_change+0x70/0x70
> >>> [   62.096849]  ? alloc_pages_current+0x65/0xd0
> >>> [   62.098277]  blkdev_readpages+0x18/0x20
> >>> [   62.099568]  __do_page_cache_readahead+0x298/0x360
> >>> [   62.101157]  ondemand_readahead+0x1f6/0x490
> >>> [   62.102546]  ? ondemand_readahead+0x1f6/0x490
> >>> [   62.103995]  page_cache_sync_readahead+0x29/0x40
> >>> [   62.105539]  generic_file_read_iter+0x7d0/0x9d0
> >>> [   62.107067]  ? futex_wait+0x221/0x240
> >>> [   62.108303]  ? trace_hardirqs_on+0xd/0x10
> >>> [   62.109654]  blkdev_read_iter+0x30/0x40
> >>> [   62.110954]  generic_file_splice_read+0xc5/0x140
> >>> [   62.112538]  do_splice_to+0x74/0x90
> >>> [   62.113726]  splice_direct_to_actor+0xa4/0x1f0
> >>> [   62.115209]  ? generic_pipe_buf_nosteal+0x10/0x10
> >>> [   62.116773]  do_splice_direct+0x8a/0xb0
> >>> [   62.118056]  do_sendfile+0x1aa/0x390
> >>> [   62.119255]  __x64_sys_sendfile64+0x4e/0xc0
> >>> [   62.120666]  do_syscall_64+0x6e/0x210
> >>> [   62.121909]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
> >>> ----------------------------------------
> >>>
> >>> The someone is blk_queue_split() from blk_mq_make_request() who depends on an
> >>> assumption that blk_queue_enter() from recursively called generic_make_request()
> >>> does not get blocked due to percpu_ref_tryget_live(&q->q_usage_counter) failure.
> >>>
> >>> ----------------------------------------
> >>> generic_make_request(struct bio *bio) {
> >>>   if (blk_queue_enter(q, flags) < 0) { /* <= percpu_ref_tryget_live() succeeds. */
> >>>     if (!blk_queue_dying(q) && (bio->bi_opf & REQ_NOWAIT))
> >>>       bio_wouldblock_error(bio);
> >>>     else
> >>>       bio_io_error(bio);
> >>>     return ret;
> >>>   }
> >>> (...snipped...)
> >>>   ret = q->make_request_fn(q, bio);
> >>> (...snipped...)
> >>>   if (q)
> >>>     blk_queue_exit(q);
> >>> }
> >>> ----------------------------------------
> >>>
> >>> where q->make_request_fn == blk_mq_make_request which does
> >>>
> >>> ----------------------------------------
> >>> blk_mq_make_request(struct request_queue *q, struct bio *bio) {
> >>>    blk_queue_split(q, &bio);
> >>> }
> >>>
> >>> blk_queue_split(struct request_queue *q, struct bio **bio) {
> >>>   generic_make_request(*bio); /* <= percpu_ref_tryget_live() fails and waits until atomic_read(&q->mq_freeze_depth) becomes 0. */
> >>> }
> >>> ----------------------------------------
> >>>
> >>> and meanwhile atomic_inc_return(&q->mq_freeze_depth) and
> >>> percpu_ref_kill() are called by	blk_freeze_queue_start()...
> >>>
> >>> Now, it is up to you about how to fix this race problem.
> >>
> >> Ahh, nicely spotted! One idea would be the one below. For this case,
> >> we're recursing, so we can either do a non-block queue enter, or we
> >> can just do a live enter.
> >>
> >> diff --git a/block/blk-core.c b/block/blk-core.c
> >> index 85909b431eb0..7de12e0dcc3d 100644
> >> --- a/block/blk-core.c
> >> +++ b/block/blk-core.c
> >> @@ -2392,7 +2392,9 @@ blk_qc_t generic_make_request(struct bio *bio)
> >>  
> >>  	if (bio->bi_opf & REQ_NOWAIT)
> >>  		flags = BLK_MQ_REQ_NOWAIT;
> >> -	if (blk_queue_enter(q, flags) < 0) {
> >> +	if (bio->bi_opf & REQ_QUEUE_ENTERED)
> >> +		blk_queue_enter_live(q);
> >> +	else if (blk_queue_enter(q, flags) < 0) {
> >>  		if (!blk_queue_dying(q) && (bio->bi_opf & REQ_NOWAIT))
> >>  			bio_wouldblock_error(bio);
> >>  		else
> >> diff --git a/block/blk-merge.c b/block/blk-merge.c
> >> index 782940c65d8a..e1063e8f7eda 100644
> >> --- a/block/blk-merge.c
> >> +++ b/block/blk-merge.c
> >> @@ -210,6 +210,16 @@ void blk_queue_split(struct request_queue *q, struct bio **bio)
> >>  		/* there isn't chance to merge the splitted bio */
> >>  		split->bi_opf |= REQ_NOMERGE;
> >>  
> >> +		/*
> >> +		 * Since we're recursing into make_request here, ensure
> >> +		 * that we mark this bio as already having entered the queue.
> >> +		 * If not, and the queue is going away, we can get stuck
> >> +		 * forever on waiting for the queue reference to drop. But
> >> +		 * that will never happen, as we're already holding a
> >> +		 * reference to it.
> >> +		 */
> >> +		(*bio)->bi_opf |= REQ_QUEUE_ENTERED;
> >> +
> >>  		bio_chain(split, *bio);
> >>  		trace_block_split(q, split, (*bio)->bi_iter.bi_sector);
> >>  		generic_make_request(*bio);
> >> diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
> >> index 17b18b91ebac..a6bc789b83e7 100644
> >> --- a/include/linux/blk_types.h
> >> +++ b/include/linux/blk_types.h
> >> @@ -282,6 +282,8 @@ enum req_flag_bits {
> >>  	/* command specific flags for REQ_OP_WRITE_ZEROES: */
> >>  	__REQ_NOUNMAP,		/* do not free blocks when zeroing */
> >>  
> >> +	__REQ_QUEUE_ENTERED,	/* queue already entered for this request */
> >> +
> >>  	/* for driver use */
> >>  	__REQ_DRV,
> >>  
> >> @@ -302,9 +304,8 @@ enum req_flag_bits {
> >>  #define REQ_RAHEAD		(1ULL << __REQ_RAHEAD)
> >>  #define REQ_BACKGROUND		(1ULL << __REQ_BACKGROUND)
> >>  #define REQ_NOWAIT		(1ULL << __REQ_NOWAIT)
> >> -
> >>  #define REQ_NOUNMAP		(1ULL << __REQ_NOUNMAP)
> >> -
> >> +#define REQ_QUEUE_ENTERED	(1ULL << __REQ_QUEUE_ENTERED)
> >>  #define REQ_DRV			(1ULL << __REQ_DRV)
> >>  
> >>  #define REQ_FAILFAST_MASK \
> > 
> > bio clone inherits .bi_opf, so this way may not work for cloned bio.
> > 
> > Given most of bio_split() is run in .make_request_fn(), holding the
> > queue ref may not be needed in case of bio splitting, so another
> > candidate fix might be to introduce one extra parameter of
> > 'need_queue_ref' to generic_make_request().
> 
> I did consider that, but that's pretty silly for a corner case and you'd
> have to change a crap ton of calls to it. I think it'd be cleaner to

Seems not necessary to change any call to it, since all blk_queue_split()
are called from q->make_request_fn() now.

> clear the bit when we need to, potentially even adding a debug check to
> blk_queue_enter_live() that complains if the ref was not already
> elevated. Though that would be expensive, compared to the percpu inc
> now. Not saying the bit is necessarily the best way forward, but I do
> like it a LOT more than adding an argument to generic_make_request.

We may introduce generic_make_request_with_ref() for blk_queue_split(),
what do you think of the following patch?

diff --git a/block/blk-core.c b/block/blk-core.c
index cee03cad99f2..e10ee3f9f338 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2363,7 +2363,7 @@ generic_make_request_checks(struct bio *bio)
  * a lower device by calling into generic_make_request recursively, which
  * means the bio should NOT be touched after the call to ->make_request_fn.
  */
-blk_qc_t generic_make_request(struct bio *bio)
+blk_qc_t __generic_make_request(struct bio *bio, bool got_ref)
 {
 	/*
 	 * bio_list_on_stack[0] contains bios submitted by the current
@@ -2379,12 +2379,14 @@ blk_qc_t generic_make_request(struct bio *bio)
 
 	if (bio->bi_opf & REQ_NOWAIT)
 		flags = BLK_MQ_REQ_NOWAIT;
-	if (blk_queue_enter(q, flags) < 0) {
-		if (!blk_queue_dying(q) && (bio->bi_opf & REQ_NOWAIT))
-			bio_wouldblock_error(bio);
-		else
-			bio_io_error(bio);
-		return ret;
+	if (!got_ref) {
+		if (blk_queue_enter(q, flags) < 0) {
+			if (!blk_queue_dying(q) && (bio->bi_opf & REQ_NOWAIT))
+				bio_wouldblock_error(bio);
+			else
+				bio_io_error(bio);
+			return ret;
+		}
 	}
 
 	if (!generic_make_request_checks(bio))
@@ -2426,8 +2428,9 @@ blk_qc_t generic_make_request(struct bio *bio)
 		bool enter_succeeded = true;
 
 		if (unlikely(q != bio->bi_disk->queue)) {
-			if (q)
+			if (q && !got_ref)
 				blk_queue_exit(q);
+			got_ref = false;
 			q = bio->bi_disk->queue;
 			flags = 0;
 			if (bio->bi_opf & REQ_NOWAIT)
@@ -2476,7 +2479,7 @@ blk_qc_t generic_make_request(struct bio *bio)
 		blk_queue_exit(q);
 	return ret;
 }
-EXPORT_SYMBOL(generic_make_request);
+EXPORT_SYMBOL(__generic_make_request);
 
 /**
  * direct_make_request - hand a buffer directly to its device driver for I/O
diff --git a/block/blk-merge.c b/block/blk-merge.c
index 5573d0fbec53..997616f1c85e 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -180,6 +180,9 @@ static struct bio *blk_bio_segment_split(struct request_queue *q,
 	return do_split ? new : NULL;
 }
 
+/*
+ * This function has to be called from q->make_request_fn().
+ */
 void blk_queue_split(struct request_queue *q, struct bio **bio)
 {
 	struct bio *split, *res;
@@ -212,7 +215,7 @@ void blk_queue_split(struct request_queue *q, struct bio **bio)
 
 		bio_chain(split, *bio);
 		trace_block_split(q, split, (*bio)->bi_iter.bi_sector);
-		generic_make_request(*bio);
+		generic_make_request_with_ref(*bio);
 		*bio = split;
 	}
 }
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 4efd9af62e25..051018e0126e 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -956,9 +956,18 @@ static inline void rq_flush_dcache_pages(struct request *rq)
 }
 #endif
 
+blk_qc_t generic_make_request(struct bio *bio)
+{
+	return __generic_make_request(bio, false);
+}
+
+blk_qc_t generic_make_request_with_ref(struct bio *bio)
+{
+	return __generic_make_request(bio, true);
+}
+
 extern int blk_register_queue(struct gendisk *disk);
 extern void blk_unregister_queue(struct gendisk *disk);
-extern blk_qc_t generic_make_request(struct bio *bio);
 extern blk_qc_t direct_make_request(struct bio *bio);
 extern void blk_rq_init(struct request_queue *q, struct request *rq);
 extern void blk_init_request_from_bio(struct request *req, struct bio *bio);
> 
> It should have been a bio flag to begin with, that better captures the
> scope and also avoids clones inheriting anything they shouldn't. It's
> also not applicable to a request.
> 
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 85909b431eb0..8f1063f50863 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -2392,7 +2392,9 @@ blk_qc_t generic_make_request(struct bio *bio)
>  
>  	if (bio->bi_opf & REQ_NOWAIT)
>  		flags = BLK_MQ_REQ_NOWAIT;
> -	if (blk_queue_enter(q, flags) < 0) {
> +	if (bio->bi_flags & BIO_QUEUE_ENTERED)
> +		blk_queue_enter_live(q);
> +	else if (blk_queue_enter(q, flags) < 0) {
>  		if (!blk_queue_dying(q) && (bio->bi_opf & REQ_NOWAIT))
>  			bio_wouldblock_error(bio);
>  		else
> diff --git a/block/blk-merge.c b/block/blk-merge.c
> index 782940c65d8a..9ee9474e579c 100644
> --- a/block/blk-merge.c
> +++ b/block/blk-merge.c
> @@ -210,6 +210,16 @@ void blk_queue_split(struct request_queue *q, struct bio **bio)
>  		/* there isn't chance to merge the splitted bio */
>  		split->bi_opf |= REQ_NOMERGE;
>  
> +		/*
> +		 * Since we're recursing into make_request here, ensure
> +		 * that we mark this bio as already having entered the queue.
> +		 * If not, and the queue is going away, we can get stuck
> +		 * forever on waiting for the queue reference to drop. But
> +		 * that will never happen, as we're already holding a
> +		 * reference to it.
> +		 */
> +		(*bio)->bi_flags |= BIO_QUEUE_ENTERED;
> +
>  		bio_chain(split, *bio);
>  		trace_block_split(q, split, (*bio)->bi_iter.bi_sector);
>  		generic_make_request(*bio);
> diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
> index 17b18b91ebac..4687d7c99076 100644
> --- a/include/linux/blk_types.h
> +++ b/include/linux/blk_types.h
> @@ -186,6 +186,8 @@ struct bio {
>  				 * throttling rules. Don't do it again. */
>  #define BIO_TRACE_COMPLETION 10	/* bio_endio() should trace the final completion
>  				 * of this bio. */
> +#define BIO_QUEUE_ENTERED 11	/* can use blk_queue_enter_live() */
> +
>  /* See BVEC_POOL_OFFSET below before adding new flags */

In DM or MD, there is the following use case:

1) bio comes from fs, then one clone is made from this bio

2) the cloned bio is submitted via generic_make_request() from
another context, such as wq.

3) then the flag of BIO_QUEUE_ENTERED is inherited and causes
the queue usage counter isn't held.


Thanks,
Ming

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: INFO: task hung in blk_queue_enter
  2018-06-02  2:36                       ` Ming Lei
@ 2018-06-02  4:31                         ` Jens Axboe
  2018-06-02  4:54                           ` Ming Lei
  0 siblings, 1 reply; 32+ messages in thread
From: Jens Axboe @ 2018-06-02  4:31 UTC (permalink / raw)
  To: Ming Lei
  Cc: Tetsuo Handa, Bart.VanAssche, dvyukov, linux-kernel, linux-block,
	jthumshirn, alan.christopher.jenkins,
	syzbot+c4f9cebf9d651f6e54de, martin.petersen, dan.j.williams,
	hch, oleksandr, martin, hare, syzkaller-bugs, ross.zwisler,
	keith.busch, linux-ext4

On 6/1/18 8:36 PM, Ming Lei wrote:
>> clear the bit when we need to, potentially even adding a debug check to
>> blk_queue_enter_live() that complains if the ref was not already
>> elevated. Though that would be expensive, compared to the percpu inc
>> now. Not saying the bit is necessarily the best way forward, but I do
>> like it a LOT more than adding an argument to generic_make_request.
> 
> We may introduce generic_make_request_with_ref() for blk_queue_split(),
> what do you think of the following patch?

Functionally it'll work as well.

> In DM or MD, there is the following use case:
> 
> 1) bio comes from fs, then one clone is made from this bio
> 
> 2) the cloned bio is submitted via generic_make_request() from
> another context, such as wq.
> 
> 3) then the flag of BIO_QUEUE_ENTERED is inherited and causes
> the queue usage counter isn't held.

We don't copy flags on a clone.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: INFO: task hung in blk_queue_enter
  2018-06-02  4:31                         ` Jens Axboe
@ 2018-06-02  4:54                           ` Ming Lei
  2018-06-02  8:07                               ` Martin Steigerwald
  2018-06-02 13:48                               ` Jens Axboe
  0 siblings, 2 replies; 32+ messages in thread
From: Ming Lei @ 2018-06-02  4:54 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Tetsuo Handa, Bart.VanAssche, dvyukov, linux-kernel, linux-block,
	jthumshirn, alan.christopher.jenkins,
	syzbot+c4f9cebf9d651f6e54de, martin.petersen, dan.j.williams,
	hch, oleksandr, martin, hare, syzkaller-bugs, ross.zwisler,
	keith.busch, linux-ext4

On Fri, Jun 01, 2018 at 10:31:13PM -0600, Jens Axboe wrote:
> On 6/1/18 8:36 PM, Ming Lei wrote:
> >> clear the bit when we need to, potentially even adding a debug check to
> >> blk_queue_enter_live() that complains if the ref was not already
> >> elevated. Though that would be expensive, compared to the percpu inc
> >> now. Not saying the bit is necessarily the best way forward, but I do
> >> like it a LOT more than adding an argument to generic_make_request.
> > 
> > We may introduce generic_make_request_with_ref() for blk_queue_split(),
> > what do you think of the following patch?
> 
> Functionally it'll work as well.

If you are fine with this one, I'm glad to prepare for a normal one,
so that we can save one precious bio flag, and the total number is
just 13.

> 
> > In DM or MD, there is the following use case:
> > 
> > 1) bio comes from fs, then one clone is made from this bio
> > 
> > 2) the cloned bio is submitted via generic_make_request() from
> > another context, such as wq.
> > 
> > 3) then the flag of BIO_QUEUE_ENTERED is inherited and causes
> > the queue usage counter isn't held.
> 
> We don't copy flags on a clone.

Yeah, you are right, seems I ignored that.

Thanks,
Ming

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: INFO: task hung in blk_queue_enter
  2018-06-02  4:54                           ` Ming Lei
@ 2018-06-02  8:07                               ` Martin Steigerwald
  2018-06-02 13:48                               ` Jens Axboe
  1 sibling, 0 replies; 32+ messages in thread
From: Martin Steigerwald @ 2018-06-02  8:07 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, Tetsuo Handa, Bart.VanAssche, dvyukov, linux-kernel,
	linux-block, jthumshirn, alan.christopher.jenkins,
	syzbot+c4f9cebf9d651f6e54de, martin.petersen, dan.j.williams,
	hch, oleksandr, hare, syzkaller-bugs, ross.zwisler, keith.busch,
	linux-ext4

Ming Lei - 02.06.18, 06:54:
> On Fri, Jun 01, 2018 at 10:31:13PM -0600, Jens Axboe wrote:
> > On 6/1/18 8:36 PM, Ming Lei wrote:
> > >> clear the bit when we need to, potentially even adding a debug
> > >> check to blk_queue_enter_live() that complains if the ref was
> > >> not already elevated. Though that would be expensive, compared
> > >> to the percpu inc now. Not saying the bit is necessarily the
> > >> best way forward, but I do like it a LOT more than adding an
> > >> argument to generic_make_request.> >=20
> > > We may introduce generic_make_request_with_ref() for
> > > blk_queue_split(), what do you think of the following patch?
> >=20
> > Functionally it'll work as well.
>=20
> If you are fine with this one, I'm glad to prepare for a normal one,
> so that we can save one precious bio flag, and the total number is
> just 13.
[=E2=80=A6]

I am completely lost about what you are talking about.

That is fine with me, but in case there is anything to test=E2=80=A6 let me=
=20
know. I do not promise quick responses tough, as I have been quite a bit=20
overwhelmed with stuff lately.

Currently I run 4.17-rc7 without any blk-mq related patches and it=20
appears to be free of hangs or stalls related to blk-mq.

Thanks,
=2D-=20
Martin

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: INFO: task hung in blk_queue_enter
@ 2018-06-02  8:07                               ` Martin Steigerwald
  0 siblings, 0 replies; 32+ messages in thread
From: Martin Steigerwald @ 2018-06-02  8:07 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, Tetsuo Handa, Bart.VanAssche, dvyukov, linux-kernel,
	linux-block, jthumshirn, alan.christopher.jenkins,
	syzbot+c4f9cebf9d651f6e54de, martin.petersen, dan.j.williams,
	hch, oleksandr, hare, syzkaller-bugs, ross.zwisler, keith.busch,
	linux-ext4

Ming Lei - 02.06.18, 06:54:
> On Fri, Jun 01, 2018 at 10:31:13PM -0600, Jens Axboe wrote:
> > On 6/1/18 8:36 PM, Ming Lei wrote:
> > >> clear the bit when we need to, potentially even adding a debug
> > >> check to blk_queue_enter_live() that complains if the ref was
> > >> not already elevated. Though that would be expensive, compared
> > >> to the percpu inc now. Not saying the bit is necessarily the
> > >> best way forward, but I do like it a LOT more than adding an
> > >> argument to generic_make_request.> > 
> > > We may introduce generic_make_request_with_ref() for
> > > blk_queue_split(), what do you think of the following patch?
> > 
> > Functionally it'll work as well.
> 
> If you are fine with this one, I'm glad to prepare for a normal one,
> so that we can save one precious bio flag, and the total number is
> just 13.
[…]

I am completely lost about what you are talking about.

That is fine with me, but in case there is anything to test… let me 
know. I do not promise quick responses tough, as I have been quite a bit 
overwhelmed with stuff lately.

Currently I run 4.17-rc7 without any blk-mq related patches and it 
appears to be free of hangs or stalls related to blk-mq.

Thanks,
-- 
Martin

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: INFO: task hung in blk_queue_enter
  2018-06-02  4:54                           ` Ming Lei
@ 2018-06-02 13:48                               ` Jens Axboe
  2018-06-02 13:48                               ` Jens Axboe
  1 sibling, 0 replies; 32+ messages in thread
From: Jens Axboe @ 2018-06-02 13:48 UTC (permalink / raw)
  To: Ming Lei
  Cc: Tetsuo Handa, Bart.VanAssche, dvyukov, linux-kernel, linux-block,
	jthumshirn, alan.christopher.jenkins,
	syzbot+c4f9cebf9d651f6e54de, martin.petersen, dan.j.williams,
	hch, oleksandr, martin, hare, syzkaller-bugs, ross.zwisler,
	keith.busch, linux-ext4


> On Jun 1, 2018, at 10:54 PM, Ming Lei <ming.lei@redhat.com> wrote:
>=20
>> On Fri, Jun 01, 2018 at 10:31:13PM -0600, Jens Axboe wrote:
>> On 6/1/18 8:36 PM, Ming Lei wrote:
>>>> clear the bit when we need to, potentially even adding a debug check to=

>>>> blk_queue_enter_live() that complains if the ref was not already
>>>> elevated. Though that would be expensive, compared to the percpu inc
>>>> now. Not saying the bit is necessarily the best way forward, but I do
>>>> like it a LOT more than adding an argument to generic_make_request.
>>>=20
>>> We may introduce generic_make_request_with_ref() for blk_queue_split(),
>>> what do you think of the following patch?
>>=20
>> Functionally it'll work as well.
>=20
> If you are fine with this one, I'm glad to prepare for a normal one,
> so that we can save one precious bio flag, and the total number is
> just 13.

We=E2=80=99ve never had much issue with bio specific flags, I prefer that ap=
proach to a wrapper around generic_make_request().=20

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: INFO: task hung in blk_queue_enter
@ 2018-06-02 13:48                               ` Jens Axboe
  0 siblings, 0 replies; 32+ messages in thread
From: Jens Axboe @ 2018-06-02 13:48 UTC (permalink / raw)
  To: Ming Lei
  Cc: Tetsuo Handa, Bart.VanAssche, dvyukov, linux-kernel, linux-block,
	jthumshirn, alan.christopher.jenkins,
	syzbot+c4f9cebf9d651f6e54de, martin.petersen, dan.j.williams,
	hch, oleksandr, martin, hare, syzkaller-bugs, ross.zwisler,
	keith.busch, linux-ext4


> On Jun 1, 2018, at 10:54 PM, Ming Lei <ming.lei@redhat.com> wrote:
> 
>> On Fri, Jun 01, 2018 at 10:31:13PM -0600, Jens Axboe wrote:
>> On 6/1/18 8:36 PM, Ming Lei wrote:
>>>> clear the bit when we need to, potentially even adding a debug check to
>>>> blk_queue_enter_live() that complains if the ref was not already
>>>> elevated. Though that would be expensive, compared to the percpu inc
>>>> now. Not saying the bit is necessarily the best way forward, but I do
>>>> like it a LOT more than adding an argument to generic_make_request.
>>> 
>>> We may introduce generic_make_request_with_ref() for blk_queue_split(),
>>> what do you think of the following patch?
>> 
>> Functionally it'll work as well.
> 
> If you are fine with this one, I'm glad to prepare for a normal one,
> so that we can save one precious bio flag, and the total number is
> just 13.

We’ve never had much issue with bio specific flags, I prefer that approach to a wrapper around generic_make_request(). 

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: INFO: task hung in blk_queue_enter
  2018-06-01 10:10               ` Tetsuo Handa
  2018-06-01 17:52                 ` Jens Axboe
@ 2018-06-04 11:46                 ` Dmitry Vyukov
  2018-06-04 13:13                   ` Tetsuo Handa
  1 sibling, 1 reply; 32+ messages in thread
From: Dmitry Vyukov @ 2018-06-04 11:46 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: Bart Van Assche, LKML, linux-block, Johannes Thumshirn,
	Alan Jenkins, syzbot, Martin K. Petersen, Jens Axboe,
	Dan Williams, Christoph Hellwig, oleksandr, ming.lei, martin,
	Hannes Reinecke, syzkaller-bugs, Ross Zwisler, keith.busch,
	linux-ext4

On Fri, Jun 1, 2018 at 12:10 PM, Tetsuo Handa
<penguin-kernel@i-love.sakura.ne.jp> wrote:
> Tetsuo Handa wrote:
>> Since sum of percpu_count did not change after percpu_ref_kill(), this is
>> not a race condition while folding percpu counter values into atomic counter
>> value. That is, for some reason, someone who is responsible for calling
>> percpu_ref_put(&q->q_usage_counter) (presumably via blk_queue_exit()) is
>> unable to call percpu_ref_put().
>> But I don't know how to find someone who is failing to call percpu_ref_put()...
>
> I found the someone. It was already there in the backtrace...

Nice!

Do I understand it correctly that this bug is probably the root cause
of a whole lot of syzbot "task hung" reports? E.g. this one too?
https://syzkaller.appspot.com/bug?id=cdc4add60bb95a4da3fec27c5fe6d75196b7f976
I guess we will need to sweep close everything related to
filesystems/block devices when this is committed?


> ----------------------------------------
> [   62.065852] a.out           D    0  4414   4337 0x00000000
> [   62.067677] Call Trace:
> [   62.068545]  __schedule+0x40b/0x860
> [   62.069726]  schedule+0x31/0x80
> [   62.070796]  schedule_timeout+0x1c1/0x3c0
> [   62.072159]  ? __next_timer_interrupt+0xd0/0xd0
> [   62.073670]  blk_queue_enter+0x218/0x520
> [   62.074985]  ? remove_wait_queue+0x70/0x70
> [   62.076361]  generic_make_request+0x3d/0x540
> [   62.077785]  ? __bio_clone_fast+0x6b/0x80
> [   62.079147]  ? bio_clone_fast+0x2c/0x70
> [   62.080456]  blk_queue_split+0x29b/0x560
> [   62.081772]  ? blk_queue_split+0x29b/0x560
> [   62.083162]  blk_mq_make_request+0x7c/0x430
> [   62.084562]  generic_make_request+0x276/0x540
> [   62.086034]  submit_bio+0x6e/0x140
> [   62.087185]  ? submit_bio+0x6e/0x140
> [   62.088384]  ? guard_bio_eod+0x9d/0x1d0
> [   62.089681]  do_mpage_readpage+0x328/0x730
> [   62.091045]  ? __add_to_page_cache_locked+0x12e/0x1a0
> [   62.092726]  mpage_readpages+0x120/0x190
> [   62.094034]  ? check_disk_change+0x70/0x70
> [   62.095454]  ? check_disk_change+0x70/0x70
> [   62.096849]  ? alloc_pages_current+0x65/0xd0
> [   62.098277]  blkdev_readpages+0x18/0x20
> [   62.099568]  __do_page_cache_readahead+0x298/0x360
> [   62.101157]  ondemand_readahead+0x1f6/0x490
> [   62.102546]  ? ondemand_readahead+0x1f6/0x490
> [   62.103995]  page_cache_sync_readahead+0x29/0x40
> [   62.105539]  generic_file_read_iter+0x7d0/0x9d0
> [   62.107067]  ? futex_wait+0x221/0x240
> [   62.108303]  ? trace_hardirqs_on+0xd/0x10
> [   62.109654]  blkdev_read_iter+0x30/0x40
> [   62.110954]  generic_file_splice_read+0xc5/0x140
> [   62.112538]  do_splice_to+0x74/0x90
> [   62.113726]  splice_direct_to_actor+0xa4/0x1f0
> [   62.115209]  ? generic_pipe_buf_nosteal+0x10/0x10
> [   62.116773]  do_splice_direct+0x8a/0xb0
> [   62.118056]  do_sendfile+0x1aa/0x390
> [   62.119255]  __x64_sys_sendfile64+0x4e/0xc0
> [   62.120666]  do_syscall_64+0x6e/0x210
> [   62.121909]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
> ----------------------------------------
>
> The someone is blk_queue_split() from blk_mq_make_request() who depends on an
> assumption that blk_queue_enter() from recursively called generic_make_request()
> does not get blocked due to percpu_ref_tryget_live(&q->q_usage_counter) failure.
>
> ----------------------------------------
> generic_make_request(struct bio *bio) {
>   if (blk_queue_enter(q, flags) < 0) { /* <= percpu_ref_tryget_live() succeeds. */
>     if (!blk_queue_dying(q) && (bio->bi_opf & REQ_NOWAIT))
>       bio_wouldblock_error(bio);
>     else
>       bio_io_error(bio);
>     return ret;
>   }
> (...snipped...)
>   ret = q->make_request_fn(q, bio);
> (...snipped...)
>   if (q)
>     blk_queue_exit(q);
> }
> ----------------------------------------
>
> where q->make_request_fn == blk_mq_make_request which does
>
> ----------------------------------------
> blk_mq_make_request(struct request_queue *q, struct bio *bio) {
>    blk_queue_split(q, &bio);
> }
>
> blk_queue_split(struct request_queue *q, struct bio **bio) {
>   generic_make_request(*bio); /* <= percpu_ref_tryget_live() fails and waits until atomic_read(&q->mq_freeze_depth) becomes 0. */
> }
> ----------------------------------------
>
> and meanwhile atomic_inc_return(&q->mq_freeze_depth) and
> percpu_ref_kill() are called by blk_freeze_queue_start()...
>
> Now, it is up to you about how to fix this race problem.
>

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: INFO: task hung in blk_queue_enter
  2018-06-04 11:46                 ` Dmitry Vyukov
@ 2018-06-04 13:13                   ` Tetsuo Handa
  0 siblings, 0 replies; 32+ messages in thread
From: Tetsuo Handa @ 2018-06-04 13:13 UTC (permalink / raw)
  To: Dmitry Vyukov
  Cc: Bart Van Assche, LKML, linux-block, Johannes Thumshirn,
	Alan Jenkins, syzbot, Martin K. Petersen, Jens Axboe,
	Dan Williams, Christoph Hellwig, oleksandr, ming.lei, martin,
	Hannes Reinecke, syzkaller-bugs, Ross Zwisler, keith.busch,
	linux-ext4

On 2018/06/04 20:46, Dmitry Vyukov wrote:
> On Fri, Jun 1, 2018 at 12:10 PM, Tetsuo Handa
> <penguin-kernel@i-love.sakura.ne.jp> wrote:
>> Tetsuo Handa wrote:
>>> Since sum of percpu_count did not change after percpu_ref_kill(), this is
>>> not a race condition while folding percpu counter values into atomic counter
>>> value. That is, for some reason, someone who is responsible for calling
>>> percpu_ref_put(&q->q_usage_counter) (presumably via blk_queue_exit()) is
>>> unable to call percpu_ref_put().
>>> But I don't know how to find someone who is failing to call percpu_ref_put()...
>>
>> I found the someone. It was already there in the backtrace...
> 
> Nice!
> 
> Do I understand it correctly that this bug is probably the root cause
> of a whole lot of syzbot "task hung" reports? E.g. this one too?
> https://syzkaller.appspot.com/bug?id=cdc4add60bb95a4da3fec27c5fe6d75196b7f976
> I guess we will need to sweep close everything related to
> filesystems/block devices when this is committed?
> 

I can't tell it. We still have lockdep warning for loop module. Also, I suspect
that hung tasks waiting for page bit might be related to binder module.

Since we are about to merge "kernel/hung_task.c: show all hung tasks before panic"
patch, we might be able to get more hints after 4.18-rc1.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: INFO: task hung in blk_queue_enter
  2018-06-01 17:52                 ` Jens Axboe
  2018-06-01 23:49                   ` Ming Lei
@ 2018-06-05  0:27                   ` Tetsuo Handa
  2018-06-05  0:41                     ` Ming Lei
  1 sibling, 1 reply; 32+ messages in thread
From: Tetsuo Handa @ 2018-06-05  0:27 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Bart.VanAssche, dvyukov, linux-kernel, linux-block, jthumshirn,
	alan.christopher.jenkins, syzbot+c4f9cebf9d651f6e54de,
	martin.petersen, dan.j.williams, hch, oleksandr, ming.lei,
	martin, hare, syzkaller-bugs, ross.zwisler, keith.busch,
	linux-ext4

Jens Axboe wrote:
> On 6/1/18 4:10 AM, Tetsuo Handa wrote:
> > Tetsuo Handa wrote:
> >> Since sum of percpu_count did not change after percpu_ref_kill(), this is
> >> not a race condition while folding percpu counter values into atomic counter
> >> value. That is, for some reason, someone who is responsible for calling
> >> percpu_ref_put(&q->q_usage_counter) (presumably via blk_queue_exit()) is
> >> unable to call percpu_ref_put().
> >> But I don't know how to find someone who is failing to call percpu_ref_put()...
> > 
> > I found the someone. It was already there in the backtrace...
> > 
> 
> Ahh, nicely spotted! One idea would be the one below. For this case,
> we're recursing, so we can either do a non-block queue enter, or we
> can just do a live enter.
> 

While "block: don't use blocking queue entered for recursive bio submits" was
already applied, syzbot is still reporting a hung task with same signature but
different trace.

https://syzkaller.appspot.com/text?tag=CrashLog&x=1432cedf800000
----------------------------------------
[  492.512243] INFO: task syz-executor1:20263 blocked for more than 120 seconds.
[  492.519604]       Not tainted 4.17.0+ #83
[  492.523793] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  492.531787] syz-executor1   D23384 20263   4574 0x00000004
[  492.537443] Call Trace:
[  492.540041]  __schedule+0x801/0x1e30
[  492.580958]  schedule+0xef/0x430
[  492.610154]  blk_queue_enter+0x8da/0xdf0
[  492.716327]  generic_make_request+0x651/0x1790
[  492.765680]  submit_bio+0xba/0x460
[  492.793198]  submit_bio_wait+0x134/0x1e0
[  492.801891]  blkdev_issue_flush+0x204/0x300
[  492.806236]  blkdev_fsync+0x93/0xd0
[  492.813620]  vfs_fsync_range+0x140/0x220
[  492.817702]  vfs_fsync+0x29/0x30
[  492.821081]  __loop_update_dio+0x4de/0x6a0
[  492.825341]  lo_ioctl+0xd28/0x2190
[  492.833442]  blkdev_ioctl+0x9b6/0x2020
[  492.872146]  block_ioctl+0xee/0x130
[  492.880139]  do_vfs_ioctl+0x1cf/0x16a0
[  492.927550]  ksys_ioctl+0xa9/0xd0
[  492.931036]  __x64_sys_ioctl+0x73/0xb0
[  492.934952]  do_syscall_64+0x1b1/0x800
[  492.963624]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
[  493.212768] 1 lock held by syz-executor1/20263:
[  493.217448]  #0: 00000000956bf5a3 (&lo->lo_ctl_mutex/1){+.+.}, at: lo_ioctl+0x8d/0x2190
----------------------------------------

Is it OK to call [__]loop_update_dio() between blk_mq_freeze_queue() and
blk_mq_unfreeze_queue(), for vfs_fsync() from __loop_update_dio() is calling
blk_queue_enter() after blk_mq_freeze_queue() started blocking blk_queue_enter()
by caling atomic_inc_return() and percpu_ref_kill() ?
 

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: INFO: task hung in blk_queue_enter
  2018-06-05  0:27                   ` Tetsuo Handa
@ 2018-06-05  0:41                     ` Ming Lei
  2018-06-07  3:29                         ` Ming Lei
  0 siblings, 1 reply; 32+ messages in thread
From: Ming Lei @ 2018-06-05  0:41 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: Jens Axboe, Bart.VanAssche, dvyukov, linux-kernel, linux-block,
	jthumshirn, alan.christopher.jenkins,
	syzbot+c4f9cebf9d651f6e54de, martin.petersen, dan.j.williams,
	hch, oleksandr, martin, hare, syzkaller-bugs, ross.zwisler,
	keith.busch, linux-ext4

On Tue, Jun 05, 2018 at 09:27:41AM +0900, Tetsuo Handa wrote:
> Jens Axboe wrote:
> > On 6/1/18 4:10 AM, Tetsuo Handa wrote:
> > > Tetsuo Handa wrote:
> > >> Since sum of percpu_count did not change after percpu_ref_kill(), this is
> > >> not a race condition while folding percpu counter values into atomic counter
> > >> value. That is, for some reason, someone who is responsible for calling
> > >> percpu_ref_put(&q->q_usage_counter) (presumably via blk_queue_exit()) is
> > >> unable to call percpu_ref_put().
> > >> But I don't know how to find someone who is failing to call percpu_ref_put()...
> > > 
> > > I found the someone. It was already there in the backtrace...
> > > 
> > 
> > Ahh, nicely spotted! One idea would be the one below. For this case,
> > we're recursing, so we can either do a non-block queue enter, or we
> > can just do a live enter.
> > 
> 
> While "block: don't use blocking queue entered for recursive bio submits" was
> already applied, syzbot is still reporting a hung task with same signature but
> different trace.
> 
> https://syzkaller.appspot.com/text?tag=CrashLog&x=1432cedf800000
> ----------------------------------------
> [  492.512243] INFO: task syz-executor1:20263 blocked for more than 120 seconds.
> [  492.519604]       Not tainted 4.17.0+ #83
> [  492.523793] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [  492.531787] syz-executor1   D23384 20263   4574 0x00000004
> [  492.537443] Call Trace:
> [  492.540041]  __schedule+0x801/0x1e30
> [  492.580958]  schedule+0xef/0x430
> [  492.610154]  blk_queue_enter+0x8da/0xdf0
> [  492.716327]  generic_make_request+0x651/0x1790
> [  492.765680]  submit_bio+0xba/0x460
> [  492.793198]  submit_bio_wait+0x134/0x1e0
> [  492.801891]  blkdev_issue_flush+0x204/0x300
> [  492.806236]  blkdev_fsync+0x93/0xd0
> [  492.813620]  vfs_fsync_range+0x140/0x220
> [  492.817702]  vfs_fsync+0x29/0x30
> [  492.821081]  __loop_update_dio+0x4de/0x6a0
> [  492.825341]  lo_ioctl+0xd28/0x2190
> [  492.833442]  blkdev_ioctl+0x9b6/0x2020
> [  492.872146]  block_ioctl+0xee/0x130
> [  492.880139]  do_vfs_ioctl+0x1cf/0x16a0
> [  492.927550]  ksys_ioctl+0xa9/0xd0
> [  492.931036]  __x64_sys_ioctl+0x73/0xb0
> [  492.934952]  do_syscall_64+0x1b1/0x800
> [  492.963624]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
> [  493.212768] 1 lock held by syz-executor1/20263:
> [  493.217448]  #0: 00000000956bf5a3 (&lo->lo_ctl_mutex/1){+.+.}, at: lo_ioctl+0x8d/0x2190
> ----------------------------------------
> 
> Is it OK to call [__]loop_update_dio() between blk_mq_freeze_queue() and
> blk_mq_unfreeze_queue(), for vfs_fsync() from __loop_update_dio() is calling
> blk_queue_enter() after blk_mq_freeze_queue() started blocking blk_queue_enter()
> by caling atomic_inc_return() and percpu_ref_kill() ?
>  

The vfs_fsync() isn't necessary in loop_update_dio() since both
generic_file_write_iter() and generic_file_read_iter() can handle
buffered io vs dio well.

I will send one patch to remove the vfs_sync() later.

Thanks,
Ming

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: INFO: task hung in blk_queue_enter
  2018-06-05  0:41                     ` Ming Lei
@ 2018-06-07  3:29                         ` Ming Lei
  0 siblings, 0 replies; 32+ messages in thread
From: Ming Lei @ 2018-06-07  3:29 UTC (permalink / raw)
  To: Ming Lei
  Cc: Tetsuo Handa, Jens Axboe, Bart Van Assche, Dmitry Vyukov,
	Linux Kernel Mailing List, linux-block, Johannes Thumshirn,
	alan.christopher.jenkins, syzbot+c4f9cebf9d651f6e54de,
	Martin K. Petersen, Dan Williams, Christoph Hellwig,
	=Oleksandr Natalenko, martin, Hannes Reinecke, syzkaller-bugs,
	Ross Zwisler, Keith Busch, open list:EXT4 FILE SYSTEM

On Tue, Jun 5, 2018 at 8:41 AM, Ming Lei <ming.lei@redhat.com> wrote:
> On Tue, Jun 05, 2018 at 09:27:41AM +0900, Tetsuo Handa wrote:
>> Jens Axboe wrote:
>> > On 6/1/18 4:10 AM, Tetsuo Handa wrote:
>> > > Tetsuo Handa wrote:
>> > >> Since sum of percpu_count did not change after percpu_ref_kill(), this is
>> > >> not a race condition while folding percpu counter values into atomic counter
>> > >> value. That is, for some reason, someone who is responsible for calling
>> > >> percpu_ref_put(&q->q_usage_counter) (presumably via blk_queue_exit()) is
>> > >> unable to call percpu_ref_put().
>> > >> But I don't know how to find someone who is failing to call percpu_ref_put()...
>> > >
>> > > I found the someone. It was already there in the backtrace...
>> > >
>> >
>> > Ahh, nicely spotted! One idea would be the one below. For this case,
>> > we're recursing, so we can either do a non-block queue enter, or we
>> > can just do a live enter.
>> >
>>
>> While "block: don't use blocking queue entered for recursive bio submits" was
>> already applied, syzbot is still reporting a hung task with same signature but
>> different trace.
>>
>> https://syzkaller.appspot.com/text?tag=CrashLog&x=1432cedf800000
>> ----------------------------------------
>> [  492.512243] INFO: task syz-executor1:20263 blocked for more than 120 seconds.
>> [  492.519604]       Not tainted 4.17.0+ #83
>> [  492.523793] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>> [  492.531787] syz-executor1   D23384 20263   4574 0x00000004
>> [  492.537443] Call Trace:
>> [  492.540041]  __schedule+0x801/0x1e30
>> [  492.580958]  schedule+0xef/0x430
>> [  492.610154]  blk_queue_enter+0x8da/0xdf0
>> [  492.716327]  generic_make_request+0x651/0x1790
>> [  492.765680]  submit_bio+0xba/0x460
>> [  492.793198]  submit_bio_wait+0x134/0x1e0
>> [  492.801891]  blkdev_issue_flush+0x204/0x300
>> [  492.806236]  blkdev_fsync+0x93/0xd0
>> [  492.813620]  vfs_fsync_range+0x140/0x220
>> [  492.817702]  vfs_fsync+0x29/0x30
>> [  492.821081]  __loop_update_dio+0x4de/0x6a0
>> [  492.825341]  lo_ioctl+0xd28/0x2190
>> [  492.833442]  blkdev_ioctl+0x9b6/0x2020
>> [  492.872146]  block_ioctl+0xee/0x130
>> [  492.880139]  do_vfs_ioctl+0x1cf/0x16a0
>> [  492.927550]  ksys_ioctl+0xa9/0xd0
>> [  492.931036]  __x64_sys_ioctl+0x73/0xb0
>> [  492.934952]  do_syscall_64+0x1b1/0x800
>> [  492.963624]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
>> [  493.212768] 1 lock held by syz-executor1/20263:
>> [  493.217448]  #0: 00000000956bf5a3 (&lo->lo_ctl_mutex/1){+.+.}, at: lo_ioctl+0x8d/0x2190
>> ----------------------------------------
>>
>> Is it OK to call [__]loop_update_dio() between blk_mq_freeze_queue() and
>> blk_mq_unfreeze_queue(), for vfs_fsync() from __loop_update_dio() is calling
>> blk_queue_enter() after blk_mq_freeze_queue() started blocking blk_queue_enter()
>> by caling atomic_inc_return() and percpu_ref_kill() ?
>>
>
> The vfs_fsync() isn't necessary in loop_update_dio() since both
> generic_file_write_iter() and generic_file_read_iter() can handle
> buffered io vs dio well.
>
> I will send one patch to remove the vfs_sync() later.

Hi Tetsuo,

The issue might be fixed by removing this vfs_sync(), but I'd like to
understand the idea behind since vfs_sync() shouldn't have caused
any IO to this loop queue.

I also tried to do the test via the following c syzbot, but can't reproduce
it yet after running it for several hours.

https://syzkaller.appspot.com/x/repro.c?id=4727023951937536

Could you share us how you reproduce it?

Thanks,
Ming Lei

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: INFO: task hung in blk_queue_enter
@ 2018-06-07  3:29                         ` Ming Lei
  0 siblings, 0 replies; 32+ messages in thread
From: Ming Lei @ 2018-06-07  3:29 UTC (permalink / raw)
  To: Ming Lei
  Cc: Tetsuo Handa, Jens Axboe, Bart Van Assche, Dmitry Vyukov,
	Linux Kernel Mailing List, linux-block, Johannes Thumshirn,
	alan.christopher.jenkins, syzbot+c4f9cebf9d651f6e54de,
	Martin K. Petersen, Dan Williams, Christoph Hellwig,
	=Oleksandr Natalenko, martin, Hannes Reinecke, syzkaller-bugs,
	Ross Zwisler, Keith Busch,

On Tue, Jun 5, 2018 at 8:41 AM, Ming Lei <ming.lei@redhat.com> wrote:
> On Tue, Jun 05, 2018 at 09:27:41AM +0900, Tetsuo Handa wrote:
>> Jens Axboe wrote:
>> > On 6/1/18 4:10 AM, Tetsuo Handa wrote:
>> > > Tetsuo Handa wrote:
>> > >> Since sum of percpu_count did not change after percpu_ref_kill(), this is
>> > >> not a race condition while folding percpu counter values into atomic counter
>> > >> value. That is, for some reason, someone who is responsible for calling
>> > >> percpu_ref_put(&q->q_usage_counter) (presumably via blk_queue_exit()) is
>> > >> unable to call percpu_ref_put().
>> > >> But I don't know how to find someone who is failing to call percpu_ref_put()...
>> > >
>> > > I found the someone. It was already there in the backtrace...
>> > >
>> >
>> > Ahh, nicely spotted! One idea would be the one below. For this case,
>> > we're recursing, so we can either do a non-block queue enter, or we
>> > can just do a live enter.
>> >
>>
>> While "block: don't use blocking queue entered for recursive bio submits" was
>> already applied, syzbot is still reporting a hung task with same signature but
>> different trace.
>>
>> https://syzkaller.appspot.com/text?tag=CrashLog&x=1432cedf800000
>> ----------------------------------------
>> [  492.512243] INFO: task syz-executor1:20263 blocked for more than 120 seconds.
>> [  492.519604]       Not tainted 4.17.0+ #83
>> [  492.523793] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>> [  492.531787] syz-executor1   D23384 20263   4574 0x00000004
>> [  492.537443] Call Trace:
>> [  492.540041]  __schedule+0x801/0x1e30
>> [  492.580958]  schedule+0xef/0x430
>> [  492.610154]  blk_queue_enter+0x8da/0xdf0
>> [  492.716327]  generic_make_request+0x651/0x1790
>> [  492.765680]  submit_bio+0xba/0x460
>> [  492.793198]  submit_bio_wait+0x134/0x1e0
>> [  492.801891]  blkdev_issue_flush+0x204/0x300
>> [  492.806236]  blkdev_fsync+0x93/0xd0
>> [  492.813620]  vfs_fsync_range+0x140/0x220
>> [  492.817702]  vfs_fsync+0x29/0x30
>> [  492.821081]  __loop_update_dio+0x4de/0x6a0
>> [  492.825341]  lo_ioctl+0xd28/0x2190
>> [  492.833442]  blkdev_ioctl+0x9b6/0x2020
>> [  492.872146]  block_ioctl+0xee/0x130
>> [  492.880139]  do_vfs_ioctl+0x1cf/0x16a0
>> [  492.927550]  ksys_ioctl+0xa9/0xd0
>> [  492.931036]  __x64_sys_ioctl+0x73/0xb0
>> [  492.934952]  do_syscall_64+0x1b1/0x800
>> [  492.963624]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
>> [  493.212768] 1 lock held by syz-executor1/20263:
>> [  493.217448]  #0: 00000000956bf5a3 (&lo->lo_ctl_mutex/1){+.+.}, at: lo_ioctl+0x8d/0x2190
>> ----------------------------------------
>>
>> Is it OK to call [__]loop_update_dio() between blk_mq_freeze_queue() and
>> blk_mq_unfreeze_queue(), for vfs_fsync() from __loop_update_dio() is calling
>> blk_queue_enter() after blk_mq_freeze_queue() started blocking blk_queue_enter()
>> by caling atomic_inc_return() and percpu_ref_kill() ?
>>
>
> The vfs_fsync() isn't necessary in loop_update_dio() since both
> generic_file_write_iter() and generic_file_read_iter() can handle
> buffered io vs dio well.
>
> I will send one patch to remove the vfs_sync() later.

Hi Tetsuo,

The issue might be fixed by removing this vfs_sync(), but I'd like to
understand the idea behind since vfs_sync() shouldn't have caused
any IO to this loop queue.

I also tried to do the test via the following c syzbot, but can't reproduce
it yet after running it for several hours.

https://syzkaller.appspot.com/x/repro.c?id=4727023951937536

Could you share us how you reproduce it?

Thanks,
Ming Lei

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: INFO: task hung in blk_queue_enter
  2018-06-07  3:29                         ` Ming Lei
  (?)
@ 2018-06-07 13:19                         ` Tetsuo Handa
  -1 siblings, 0 replies; 32+ messages in thread
From: Tetsuo Handa @ 2018-06-07 13:19 UTC (permalink / raw)
  To: Ming Lei, Ming Lei
  Cc: Jens Axboe, Bart Van Assche, Dmitry Vyukov,
	Linux Kernel Mailing List, linux-block, Johannes Thumshirn,
	alan.christopher.jenkins, syzbot+c4f9cebf9d651f6e54de,
	Martin K. Petersen, Dan Williams, Christoph Hellwig,
	Oleksandr Natalenko, martin, Hannes Reinecke, syzkaller-bugs,
	Ross Zwisler, Keith Busch, open list:EXT4 FILE SYSTEM

On 2018/06/07 12:29, Ming Lei wrote:
>> The vfs_fsync() isn't necessary in loop_update_dio() since both
>> generic_file_write_iter() and generic_file_read_iter() can handle
>> buffered io vs dio well.
>>
>> I will send one patch to remove the vfs_sync() later.
> 
> Hi Tetsuo,
> 
> The issue might be fixed by removing this vfs_sync(), but I'd like to
> understand the idea behind since vfs_sync() shouldn't have caused
> any IO to this loop queue.
> 
> I also tried to do the test via the following c syzbot, but can't reproduce
> it yet after running it for several hours.
> 
> https://syzkaller.appspot.com/x/repro.c?id=4727023951937536
> 
> Could you share us how you reproduce it?

I just reported that syzbot started hitting vfs_sync() case.
I haven't succeeded reproducing vfs_sync() case in my environment.

But in general, I modify "any reproducer provided by syzbot that loops
forever" to call exit() if open() etc. failed, for continuing the loop
unlikely helps after hitting "Too many open files" error.

   res = syscall(__NR_memfd_create, 0x200002c0, 0);
   if (res != -1)
     r[1] = res;
+  else
+    exit(1);

Thus, I restart the reproducer in a loop from shell
rather than keep running for hours.

   while :; do echo -n .; ./a.out; done

^ permalink raw reply	[flat|nested] 32+ messages in thread

end of thread, other threads:[~2018-06-07 13:19 UTC | newest]

Thread overview: 32+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-04-28 11:24 INFO: task hung in blk_queue_enter syzbot
2018-05-15 11:45 ` Tetsuo Handa
2018-05-16 13:05   ` Tetsuo Handa
2018-05-16 14:56     ` Bart Van Assche
2018-05-16 14:56       ` Bart Van Assche
2018-05-16 15:16       ` Dmitry Vyukov
2018-05-16 15:16         ` Dmitry Vyukov
2018-05-16 15:37         ` Bart Van Assche
2018-05-16 15:37           ` Bart Van Assche
2018-05-16 15:37           ` Bart Van Assche
2018-05-21 21:52           ` Tetsuo Handa
2018-05-22 11:20             ` Tetsuo Handa
2018-06-01 10:10               ` Tetsuo Handa
2018-06-01 17:52                 ` Jens Axboe
2018-06-01 23:49                   ` Ming Lei
2018-06-02  0:49                     ` Jens Axboe
2018-06-02  0:56                       ` Jens Axboe
2018-06-02  2:36                       ` Ming Lei
2018-06-02  4:31                         ` Jens Axboe
2018-06-02  4:54                           ` Ming Lei
2018-06-02  8:07                             ` Martin Steigerwald
2018-06-02  8:07                               ` Martin Steigerwald
2018-06-02 13:48                             ` Jens Axboe
2018-06-02 13:48                               ` Jens Axboe
2018-06-05  0:27                   ` Tetsuo Handa
2018-06-05  0:41                     ` Ming Lei
2018-06-07  3:29                       ` Ming Lei
2018-06-07  3:29                         ` Ming Lei
2018-06-07 13:19                         ` Tetsuo Handa
2018-06-04 11:46                 ` Dmitry Vyukov
2018-06-04 13:13                   ` Tetsuo Handa
2018-05-16 17:33     ` Alan Jenkins

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.