* blktests with zbd/006 ZNS triggers a possible false positive RCU stall
@ 2022-04-14 22:02 Luis Chamberlain
  2022-04-15  1:09 ` Davidlohr Bueso
  2022-04-20  5:54 ` Shinichiro Kawasaki
  0 siblings, 2 replies; 15+ messages in thread
From: Luis Chamberlain @ 2022-04-14 22:02 UTC (permalink / raw)
  To: shinichiro.kawasaki, Klaus Jensen, Paul E. McKenney,
	Davidlohr Bueso, linux-block
  Cc: Luis Chamberlain, Pankaj Raghav, Pankaj Raghav, Adam Manzanares

Hey folks,

While enhancing kdevops [0] to embrace automation of testing with
blktests for ZNS I ended up spotting a possible false positive RCU stall
when running zbd/006 after zbd/005. The curious thing though is that
this possible RCU stall only shows up when using the qemu
ZNS drive, not when using nbd. Insofar as kdevops is concerned
it creates ZNS drives for you when you enable the config option
CONFIG_QEMU_ENABLE_NVME_ZNS=y. So picking any of the ZNS drives
suffices. When configuring blktests you can just enable the zbd
guest, so only a pair of guests are created: the zbd guest and the
respective development guest, the zbd-dev guest. When using
CONFIG_KDEVOPS_HOSTS_PREFIX="linux517" this means you end up with
just two guests:

  * linux517-blktests-zbd
  * linux517-blktests-zbd-dev

The RCU stall can be triggered easily as follows:

make menuconfig # make sure to enable CONFIG_QEMU_ENABLE_NVME_ZNS=y and blktests
make
make bringup # bring up guests
make linux # build and boot into v5.17-rc7
make blktests # build and install blktests

Now let's ssh to the guest while leaving a console attached
with `sudo virsh console vagrant_linux517-blktests-zbd` in a window:

ssh linux517-blktests-zbd
sudo su -
cd /usr/local/blktests
export TEST_DEVS=/dev/nvme9n1
i=0; while true; do ./check zbd/005 zbd/006; if [[ $? -ne 0 ]]; then echo "BAD at $i"; break; else echo GOOOD $i ; fi; let i=$i+1; done;

The above should never fail, but you should eventually see an RCU
stall candidate on the console. The full details can be observed on the
gist [1] but for completeness I list some of it below. It may be a false
positive at this point, not sure.

[493272.711271] run blktests zbd/005 at 2022-04-14 20:03:22
[493305.769531] run blktests zbd/006 at 2022-04-14 20:03:55
[493336.979482] nvme nvme9: I/O 192 QID 5 timeout, aborting
[493336.981666] nvme nvme9: Abort status: 0x0
[493367.699440] nvme nvme9: I/O 192 QID 5 timeout, reset controller
[493388.819341] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
[493425.817272] rcu:    4-....: (0 ticks this GP) idle=c48/0/0x0 softirq=11316030/11316030 fqs=939  (false positive?)
[493425.819275]         (detected by 7, t=14522 jiffies, g=31237493, q=6271)
[493425.820683] Sending NMI from CPU 7 to CPUs 0:
[493425.821917] NMI backtrace for cpu 0 skipped: idling at native_safe_halt+0xb/0x10
[493425.822797] Sending NMI from CPU 7 to CPUs 1:
[493425.825449] NMI backtrace for cpu 1 skipped: idling at native_safe_halt+0xb/0x10
[493425.826424] Sending NMI from CPU 7 to CPUs 2:
[493425.829057] NMI backtrace for cpu 2 skipped: idling at native_safe_halt+0xb/0x10
[493425.830019] Sending NMI from CPU 7 to CPUs 3:
[493425.832656] NMI backtrace for cpu 3
[493425.832673] CPU: 3 PID: 0 Comm: swapper/3 Kdump: loaded Tainted: G E     5.17.0-rc7 #1
[493425.832677] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
[493425.832685] RIP: 0010:native_queued_spin_lock_slowpath+0x28/0x200
[493425.832707] Code: 00 00 0f 1f 44 00 00 41 54 55 53 48 89 fb 66 90 ba <etc>
[493425.832709] RSP: 0018:ffffb481c0148f00 EFLAGS: 00000002
[493425.832712] RAX: 0000000000000001 RBX: ffffffff9b2c4f00 RCX: 0000000000000000
[493425.832714] RDX: 0000000000000001 RSI: 0000000000000001 RDI: ffffffff9b2c4f00
[493425.832714] RBP: ffff8f8037cee7c0 R08: 0001c9d05bf762ee R09: 7fffffffffffffff
[493425.832716] R10: ffffffff9b2060c0 R11: 000000000000819d R12: 0000000000000246
[493425.832717] R13: ffffffff9b2c4f00 R14: ffff8f8037cee838 R15: ffffffff9b206108
[493425.832718] FS:  0000000000000000(0000) GS:ffff8f8037cc0000(0000) knlGS:0000000000000000
[493425.832719] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[493425.832721] CR2: 00007fad480b44c4 CR3: 00000001032d8002 CR4: 0000000000770ee0
[493425.832724] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[493425.832725] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[493425.832726] PKRU: 55555554
[493425.832733] Call Trace:
[493425.832742]  <IRQ>
[493425.832749]  _raw_spin_lock_irqsave+0x44/0x50
[493425.832756]  rcu_core+0xd3/0x750
[493425.832771]  ? kvm_clock_get_cycles+0x14/0x30
[493425.832780]  ? ktime_get+0x35/0x90
[493425.832789]  __do_softirq+0xec/0x2ea
[493425.832796]  __irq_exit_rcu+0xbc/0x110
[493425.832804]  sysvec_apic_timer_interrupt+0xa2/0xd0
[493425.832811]  </IRQ>
[493425.832812]  <TASK>
[493425.832812]  asm_sysvec_apic_timer_interrupt+0x12/0x20
[493425.832816] RIP: 0010:native_safe_halt+0xb/0x10
[493425.832819] Code: 65 ff ff ff 7f 5b c3 65 48 8b 04 25 c0 cb 01 00 f0 <etc>
[493425.832821] RSP: 0018:ffffb481c009bef0 EFLAGS: 00000206
[493425.832822] RAX: ffffffff9a8dba90 RBX: 0000000000000003 RCX: 0000000000000000
[493425.832823] RDX: 0000000000000000 RSI: ffffffff9b0f7c8e RDI: ffffffff9b0d564d
[493425.832825] RBP: ffff8f7f008a8000 R08: 0001c0c4a6b05dd9 R09: 0000000000000000
[493425.832826] R10: 00000001075920c2 R11: 0000000000000001 R12: 0000000000000000
[493425.832827] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[493425.832829]  ? mwait_idle+0x80/0x80
[493425.832846]  default_idle+0xa/0x10
[493425.832848]  default_idle_call+0x33/0xe0
[493425.832850]  do_idle+0x210/0x2a0
[493425.832859]  cpu_startup_entry+0x19/0x20
[493425.832862]  secondary_startup_64_no_verify+0xc3/0xcb
[493425.832881]  </TASK>
[493425.833607] Sending NMI from CPU 7 to CPUs 4:
[493425.894551] NMI backtrace for cpu 4 skipped: idling at native_safe_halt+0xb/0x10
[493425.895524] Sending NMI from CPU 7 to CPUs 5:
[493425.897668] NMI backtrace for cpu 5
[493425.897673] CPU: 5 PID: 0 Comm: swapper/5 Kdump: loaded Tainted: G E     5.17.0-rc7 #1
[493425.897676] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
[493425.897678] RIP: 0010:native_queued_spin_lock_slowpath+0x13/0x200
[493425.897687] Code: c0 75 c5 4d 89 50 08 4c 89 c0 c3 0f 0b 66 2e 0f 1f <etc>
[493425.897689] RSP: 0018:ffffb481c01a0f00 EFLAGS: 00000002
[493425.897692] RAX: 0000000000000001 RBX: ffffffff9b2c4f00 RCX: 0000000000000000
[493425.897694] RDX: 0000000000000001 RSI: 0000000000000001 RDI: ffffffff9b2c4f00
[493425.897695] RBP: ffff8f8037d6e7c0 R08: 0001c9d05bf75cd7 R09: 7fffffffffffffff
[493425.897696] R10: ffffffff9b2060c0 R11: 0000000000007459 R12: 0000000000000246
[493425.897697] R13: ffffffff9b2c4f00 R14: ffff8f8037d6e838 R15: ffffffff9b206108
[493425.897698] FS:  0000000000000000(0000) GS:ffff8f8037d40000(0000) knlGS:0000000000000000
[493425.897700] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[493425.897702] CR2: 00007f9956af9d38 CR3: 00000001069c6001 CR4: 0000000000770ee0
[493425.897705] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[493425.897706] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[493425.897707] PKRU: 55555554
[493425.897708] Call Trace:
[493425.897711]  <IRQ>
[493425.897715]  _raw_spin_lock_irqsave+0x44/0x50
[493425.897721]  rcu_core+0xd3/0x750
[493425.897727]  ? kvm_clock_get_cycles+0x14/0x30
[493425.897731]  ? ktime_get+0x35/0x90
[493425.897735]  __do_softirq+0xec/0x2ea
[493425.897740]  __irq_exit_rcu+0xbc/0x110
[493425.897745]  sysvec_apic_timer_interrupt+0xa2/0xd0
[493425.897750]  </IRQ>
[493425.897751]  <TASK>
[493425.897752]  asm_sysvec_apic_timer_interrupt+0x12/0x20
[493425.897756] RIP: 0010:native_safe_halt+0xb/0x10
[493425.897758] Code: 65 ff ff ff 7f 5b c3 65 48 8b 04 25 c0 cb 01 00 f0 <etc>
[493425.897760] RSP: 0018:ffffb481c00abef0 EFLAGS: 00000206
[493425.897762] RAX: ffffffff9a8dba90 RBX: 0000000000000005 RCX: 0000000000000000
[493425.897763] RDX: 0000000000000000 RSI: ffffffff9b0f7c8e RDI: ffffffff9b0d564d
[493425.897764] RBP: ffff8f7f008ade00 R08: 0001c0c4a6b3512d R09: 0000000000000001
[493425.897765] R10: 000000000001c400 R11: 0000000000000000 R12: 0000000000000000
[493425.897766] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[493425.897767]  ? mwait_idle+0x80/0x80
[493425.897771]  default_idle+0xa/0x10
[493425.897773]  default_idle_call+0x33/0xe0
[493425.897776]  do_idle+0x210/0x2a0
[493425.897782]  cpu_startup_entry+0x19/0x20
[493425.897784]  secondary_startup_64_no_verify+0xc3/0xcb
[493425.897790]  </TASK>
[493425.898650] Sending NMI from CPU 7 to CPUs 6:
[493425.958744] NMI backtrace for cpu 6 skipped: idling at native_safe_halt+0xb/0x10
[493425.959718] NMI backtrace for cpu 7
[493425.961758] CPU: 7 PID: 0 Comm: swapper/7 Kdump: loaded Tainted: G E     5.17.0-rc7 #1
[493425.963599] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
[493425.965345] Call Trace:
[493425.966192]  <IRQ>
[493425.966951]  dump_stack_lvl+0x48/0x5e
[493425.967921]  nmi_cpu_backtrace.cold+0x30/0x77
[493425.968777]  ? lapic_can_unplug_cpu+0x80/0x80
[493425.969477]  nmi_trigger_cpumask_backtrace+0x12a/0x150
[493425.970354]  rcu_dump_cpu_stacks+0xd8/0x100
[493425.971049]  rcu_sched_clock_irq+0xa23/0xb70
[493425.971726]  update_process_times+0x93/0xc0
[493425.972394]  tick_sched_handle+0x22/0x60
[493425.973045]  tick_sched_timer+0x84/0xb0
[493425.973677]  ? can_stop_idle_tick+0xd0/0xd0
[493425.974354]  __hrtimer_run_queues+0x127/0x2c0
[493425.975031]  hrtimer_interrupt+0x106/0x220
[493425.975683]  __sysvec_apic_timer_interrupt+0x7c/0x160
[493425.976429]  sysvec_apic_timer_interrupt+0x9d/0xd0
[493425.977150]  </IRQ>
[493425.977602]  <TASK>
[493425.978066]  asm_sysvec_apic_timer_interrupt+0x12/0x20
[493425.978829] RIP: 0010:native_safe_halt+0xb/0x10
[493425.979520] Code: 65 ff ff ff 7f 5b c3 65 48 8b 04 25 c0 cb 01 00 f0 <etc>
[493425.981866] RSP: 0018:ffffb481c00bbef0 EFLAGS: 00000206
[493425.982672] RAX: ffffffff9a8dba90 RBX: 0000000000000007 RCX: 0000000000000001
[493425.983644] RDX: 0000000000000002 RSI: ffffffff9b0f7c8e RDI: ffffffff9b0d564d
[493425.984619] RBP: ffff8f7f00934680 R08: 0001c9c79fd7ad0b R09: 0000000000000000
[493425.985588] R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000000
[493425.986672] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[493425.987650]  ? mwait_idle+0x80/0x80
[493425.988258]  default_idle+0xa/0x10
[493425.988842]  default_idle_call+0x33/0xe0
[493425.989480]  do_idle+0x210/0x2a0
[493425.990063]  cpu_startup_entry+0x19/0x20
[493425.990706]  secondary_startup_64_no_verify+0xc3/0xcb
[493425.991455]  </TASK>
[493426.706021] nvme nvme9: 8/0/0 default/read/poll queues
[493452.960178] run blktests zbd/005 at 2022-04-14 20:06:23
[493485.422746] run blktests zbd/006 at 2022-04-14 20:06:55
[493517.203362] nvme nvme9: I/O 516 QID 8 timeout, aborting
[493517.205394] nvme nvme9: Abort status: 0x0

[0] https://github.com/mcgrof/kdevops
[1] https://gist.github.com/mcgrof/5647efdc01bdb4715cfd9b2852fb13ce

  Luis


* Re: blktests with zbd/006 ZNS triggers a possible false positive RCU stall
  2022-04-14 22:02 blktests with zbd/006 ZNS triggers a possible false positive RCU stall Luis Chamberlain
@ 2022-04-15  1:09 ` Davidlohr Bueso
  2022-04-15  3:54   ` Paul E. McKenney
  2022-04-15 17:33   ` Luis Chamberlain
  2022-04-20  5:54 ` Shinichiro Kawasaki
  1 sibling, 2 replies; 15+ messages in thread
From: Davidlohr Bueso @ 2022-04-15  1:09 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: shinichiro.kawasaki, Klaus Jensen, Paul E. McKenney, linux-block,
	Pankaj Raghav, Pankaj Raghav, Adam Manzanares

On Thu, 14 Apr 2022, Luis Chamberlain wrote:

>Hey folks,
>
>While enhancing kdevops [0] to embrace automation of testing with
>blktests for ZNS I ended up spotting a possible false positive RCU stall
>when running zbd/006 after zbd/005. The curious thing though is that
>this possible RCU stall is only possible when using the qemu
>ZNS drive, not when using nbd. In so far as kdevops is concerned
>it creates ZNS drives for you when you enable the config option
>CONFIG_QEMU_ENABLE_NVME_ZNS=y. So picking any of the ZNS drives
>suffices. When configuring blktests you can just enable the zbd
>guest, so only a pair of guests are reated the zbd guest and the
>respective development guest, zbd-dev guest. When using
>CONFIG_KDEVOPS_HOSTS_PREFIX="linux517" this means you end up with
>just two guests:
>
>  * linux517-blktests-zbd
>  * linux517-blktests-zbd-dev
>
>The RCU stall can be triggered easily as follows:
>
>make menuconfig # make sure to enable CONFIG_QEMU_ENABLE_NVME_ZNS=y and blktests
>make
>make bringup # bring up guests
>make linux # build and boot into v5.17-rc7
>make blktests # build and install blktests
>
>Now let's ssh to the guest while leaving a console attached
>with `sudo virsh vagrant_linux517-blktests-zbd` in a window:
>
>ssh linux517-blktests-zbd
>sudo su -
>cd /usr/local/blktests
>export TEST_DEVS=/dev/nvme9n1
>i=0; while true; do ./check zbd/005 zbd/006; if [[ $? -ne 0 ]]; then echo "BAD at $i"; break; else echo GOOOD $i ; fi; let i=$i+1; done;
>
>The above should never fail, but you should eventually see an RCU
>stall candidate on the console. The full details can be observed on the
>gist [1] but for completeness I list some of it below. It may be a false
>positive at this point, not sure.
>
>[493272.711271] run blktests zbd/005 at 2022-04-14 20:03:22
>[493305.769531] run blktests zbd/006 at 2022-04-14 20:03:55
>[493336.979482] nvme nvme9: I/O 192 QID 5 timeout, aborting
>[493336.981666] nvme nvme9: Abort status: 0x0
>[493367.699440] nvme nvme9: I/O 192 QID 5 timeout, reset controller
>[493388.819341] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
>[493425.817272] rcu:    4-....: (0 ticks this GP) idle=c48/0/0x0 softirq=11316030/11316030 fqs=939  (false positive?)
>[493425.819275]         (detected by 7, t=14522 jiffies, g=31237493, q=6271)

Ok so CPU-7 detected stalls on CPU-4, which is in dyntick-idle mode,
an extended quiescent state (EQS) that overcomes the limitations of
not having a tick (NO_HZ). So the false-positive guess looks correct here,
in that idle threads in this state are not in fact blocking the grace period
kthread.

No idea, however, why this would happen when using qemu as opposed to
nbd.

Thanks,
Davidlohr


* Re: blktests with zbd/006 ZNS triggers a possible false positive RCU stall
  2022-04-15  1:09 ` Davidlohr Bueso
@ 2022-04-15  3:54   ` Paul E. McKenney
  2022-04-15  4:30     ` Davidlohr Bueso
  2022-04-15 17:33   ` Luis Chamberlain
  1 sibling, 1 reply; 15+ messages in thread
From: Paul E. McKenney @ 2022-04-15  3:54 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: Luis Chamberlain, shinichiro.kawasaki, Klaus Jensen, linux-block,
	Pankaj Raghav, Pankaj Raghav, Adam Manzanares

On Thu, Apr 14, 2022 at 06:09:45PM -0700, Davidlohr Bueso wrote:
> On Thu, 14 Apr 2022, Luis Chamberlain wrote:
> 
> > Hey folks,
> > 
> > While enhancing kdevops [0] to embrace automation of testing with
> > blktests for ZNS I ended up spotting a possible false positive RCU stall
> > when running zbd/006 after zbd/005. The curious thing though is that
> > this possible RCU stall is only possible when using the qemu
> > ZNS drive, not when using nbd. In so far as kdevops is concerned
> > it creates ZNS drives for you when you enable the config option
> > CONFIG_QEMU_ENABLE_NVME_ZNS=y. So picking any of the ZNS drives
> > suffices. When configuring blktests you can just enable the zbd
> > guest, so only a pair of guests are reated the zbd guest and the
> > respective development guest, zbd-dev guest. When using
> > CONFIG_KDEVOPS_HOSTS_PREFIX="linux517" this means you end up with
> > just two guests:
> > 
> >  * linux517-blktests-zbd
> >  * linux517-blktests-zbd-dev
> > 
> > The RCU stall can be triggered easily as follows:
> > 
> > make menuconfig # make sure to enable CONFIG_QEMU_ENABLE_NVME_ZNS=y and blktests
> > make
> > make bringup # bring up guests
> > make linux # build and boot into v5.17-rc7
> > make blktests # build and install blktests
> > 
> > Now let's ssh to the guest while leaving a console attached
> > with `sudo virsh vagrant_linux517-blktests-zbd` in a window:
> > 
> > ssh linux517-blktests-zbd
> > sudo su -
> > cd /usr/local/blktests
> > export TEST_DEVS=/dev/nvme9n1
> > i=0; while true; do ./check zbd/005 zbd/006; if [[ $? -ne 0 ]]; then echo "BAD at $i"; break; else echo GOOOD $i ; fi; let i=$i+1; done;
> > 
> > The above should never fail, but you should eventually see an RCU
> > stall candidate on the console. The full details can be observed on the
> > gist [1] but for completeness I list some of it below. It may be a false
> > positive at this point, not sure.
> > 
> > [493272.711271] run blktests zbd/005 at 2022-04-14 20:03:22
> > [493305.769531] run blktests zbd/006 at 2022-04-14 20:03:55
> > [493336.979482] nvme nvme9: I/O 192 QID 5 timeout, aborting
> > [493336.981666] nvme nvme9: Abort status: 0x0
> > [493367.699440] nvme nvme9: I/O 192 QID 5 timeout, reset controller
> > [493388.819341] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
> > [493425.817272] rcu:    4-....: (0 ticks this GP) idle=c48/0/0x0 softirq=11316030/11316030 fqs=939  (false positive?)
> > [493425.819275]         (detected by 7, t=14522 jiffies, g=31237493, q=6271)
> 
> Ok so CPU-7 detected stalls on CPU-4, which is in dyntick-idle mode,
> which is an extended quiescent state (EQS) to overcome the limitations of
> not having a tick (NO_HZ). So the false positive looks correct here in
> that idle threads in this state are not in fact blocking the grace period
> kthread.
> 
> No idea, however, why this would happen when using qemu as opposed to
> nbd.

This one is a bit strange, no two ways about it.

In theory, vCPU preemption is a possibility.  In practice, if the
RCU grace-period kthread's vCPU was preempted for so long, I would
have expected the RCU CPU stall warning to complain about starvation.
But it still might be worth tracing context switches on the underlying
hypervisor.

Another possibility is that the activity of the RCU CPU stall warning
kicked CPU 4 into action so that by the time the message was printed, it
had exited its RCU read-side critical section and transitioned to idle.
(This sort of behavior is rare, but it does sometimes happen.)

Would it be possible to ftrace __rcu_read_lock() and __rcu_read_unlock()?
That would be one really big trace, though.  Another possibility is to
instrument these two functions so that __rcu_read_unlock() complains
if the RCU read-side critical section has lasted for more than (say)
10 seconds.
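
For example, a minimal ftrace sketch along those lines, assuming tracefs is
mounted at /sys/kernel/tracing and that both functions show up in
available_filter_functions on this PREEMPT_RCU config, might be:

cd /sys/kernel/tracing
# trace only the RCU read-side entry/exit functions
echo __rcu_read_lock __rcu_read_unlock > set_ftrace_filter
echo function > current_tracer
echo 1 > tracing_on
# ... run the zbd/005 + zbd/006 loop until the stall reproduces ...
echo 0 > tracing_on
# the (very large) result ends up in the trace file
less trace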

Other thoughts?

							Thanx, Paul


* Re: blktests with zbd/006 ZNS triggers a possible false positive RCU stall
  2022-04-15  3:54   ` Paul E. McKenney
@ 2022-04-15  4:30     ` Davidlohr Bueso
  2022-04-15 17:35       ` Luis Chamberlain
  0 siblings, 1 reply; 15+ messages in thread
From: Davidlohr Bueso @ 2022-04-15  4:30 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Luis Chamberlain, shinichiro.kawasaki, Klaus Jensen, linux-block,
	Pankaj Raghav, Pankaj Raghav, Adam Manzanares

On Thu, 14 Apr 2022, Paul E. McKenney wrote:

>On Thu, Apr 14, 2022 at 06:09:45PM -0700, Davidlohr Bueso wrote:
>> No idea, however, why this would happen when using qemu as opposed to
>> nbd.
>
>This one is a bit strange, no two ways about it.
>
>In theory, vCPU preemption is a possibility.  In practice, if the
>RCU grace-period kthread's vCPU was preempted for so long, I would
>have expected the RCU CPU stall warning to complain about starvation.
>But it still might be worth tracing context switches on the underlying
>hypervisor.

Right, but both cases are VMs which is what throws me off, regardless
of the zns drive. Or is this not the case, Luis?

Thanks,
Davidlohr


* Re: blktests with zbd/006 ZNS triggers a possible false positive RCU stall
  2022-04-15  1:09 ` Davidlohr Bueso
  2022-04-15  3:54   ` Paul E. McKenney
@ 2022-04-15 17:33   ` Luis Chamberlain
  2022-04-15 17:42     ` Paul E. McKenney
  1 sibling, 1 reply; 15+ messages in thread
From: Luis Chamberlain @ 2022-04-15 17:33 UTC (permalink / raw)
  To: Davidlohr Bueso, Klaus Jensen, Paul E. McKenney
  Cc: shinichiro.kawasaki, Klaus Jensen, linux-block, Pankaj Raghav,
	Pankaj Raghav, Adam Manzanares

On Thu, Apr 14, 2022 at 06:09:45PM -0700, Davidlohr Bueso wrote:
> No idea, however, why this would happen when using qemu as opposed to
> nbd.

Sorry, I keep saying nbd when I mean null_blk, but yeah, same concept;
what I meant is that an nvme controller is involved *with* the RCU splat.

The qemu nvme driver and qemu controller would be one code runtime surface
to consider here; the two consecutive nvme controller IO timeouts seem
to be telling of a backlog of some sort:

[493305.769531] run blktests zbd/006 at 2022-04-14 20:03:55
[493336.979482] nvme nvme9: I/O 192 QID 5 timeout, aborting
[493336.981666] nvme nvme9: Abort status: 0x0
[493367.699440] nvme nvme9: I/O 192 QID 5 timeout, reset controller
<-- rcu possible stall and NMIs-->
[493426.706021] nvme nvme9: 8/0/0 default/read/poll queues

So that seems to be the pattern here. The first puzzle is *why* we end up
with the IO timeout on the qemu nvme controller, and for this reason I CC'd
Klaus, in case he has any ideas.

The first "timeout, aborting" happens when we have a first timeout of
a request for IO tag 192 sent to the controller and it just times out.
This seems to happen about 30 seconds after the test zbd/006 starts.

The "reset controller" will happen if we already issued the IO command
with tag 192 before and had already timed out before and hadn't been returned
to the driver. Resetting the controller is the stop gap measure.

nvme_timeout() is just the shared .timeout blk_mq_ops callback between
nvme_mq_admin_ops and nvme_mq_ops.

Inspecting the logs from running tests zbd/005 and zbd/006 in a loop I can
see that not all nvme controller resets end up in the RCU stall splat
though.

The timeout of an IO happens about 30 seconds after the test starts, then
the reset 30 seconds after that first timeout, then the RCU stall about
30 seconds after that. For instance, follow these now-mapped times:

Apr 13 23:02:33 linux517-blktests-zbd unknown: run blktests zbd/006 at 2022-04-13 23:02:33
30 seconds go by ...
Apr 13 23:03:05 linux517-blktests-zbd kernel: nvme nvme9: I/O 651 QID 3 timeout, aborting
Apr 13 23:03:05 linux517-blktests-zbd kernel: nvme nvme9: Abort status: 0x0
30 seconds go by ...
Apr 13 23:03:36 linux517-blktests-zbd kernel: nvme nvme9: I/O 651 QID 3 timeout, reset controller
30 seconds go by ...
Apr 13 23:04:04 linux517-blktests-zbd kernel: rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:

So the reset happens and yet we stay put. The reset work for the PCI
nvme controller is nvme_reset_work() in drivers/nvme/host/pci.c and
I don't see that issuing stuff back up the admin queue; however,
the first IO timeout did send an admin abort, in the same file:

drivers/nvme/host/pci.c
static enum blk_eh_timer_return nvme_timeout(struct request *req, bool reserved)
{
	...
	dev_warn(nvmeq->dev->ctrl.device,                                       
                "I/O %d QID %d timeout, aborting\n",                            
                 req->tag, nvmeq->qid);                                         
                                                                                
        abort_req = blk_mq_alloc_request(dev->ctrl.admin_q, nvme_req_op(&cmd),  
                                         BLK_MQ_REQ_NOWAIT);                    
        if (IS_ERR(abort_req)) {                                                
                atomic_inc(&dev->ctrl.abort_limit);                             
                return BLK_EH_RESET_TIMER;                                      
        }                                                                       
        nvme_init_request(abort_req, &cmd);                                     
                                                                                
        abort_req->end_io_data = NULL;                                          
        blk_execute_rq_nowait(abort_req, false, abort_endio);  
	...
}

So this all seems by design. What's puzzling is why it takes
about 30 seconds to get to the possible RCU stall splat *after* we
reset the nvme controller. It makes me wonder if we left something
lingering in a black hole somewhere after our first reset. Is it
right to gather from the RCU splat that whatever it was, it was stuck
on a long-held spin lock?

  Luis


* Re: blktests with zbd/006 ZNS triggers a possible false positive RCU stall
  2022-04-15  4:30     ` Davidlohr Bueso
@ 2022-04-15 17:35       ` Luis Chamberlain
  0 siblings, 0 replies; 15+ messages in thread
From: Luis Chamberlain @ 2022-04-15 17:35 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: Paul E. McKenney, shinichiro.kawasaki, Klaus Jensen, linux-block,
	Pankaj Raghav, Pankaj Raghav, Adam Manzanares

On Thu, Apr 14, 2022 at 09:30:21PM -0700, Davidlohr Bueso wrote:
> On Thu, 14 Apr 2022, Paul E. McKenney wrote:
> 
> > On Thu, Apr 14, 2022 at 06:09:45PM -0700, Davidlohr Bueso wrote:
> > > No idea, however, why this would happen when using qemu as opposed to
> > > nbd.
> > 
> > This one is a bit strange, no two ways about it.
> > 
> > In theory, vCPU preemption is a possibility.  In practice, if the
> > RCU grace-period kthread's vCPU was preempted for so long, I would
> > have expected the RCU CPU stall warning to complain about starvation.
> > But it still might be worth tracing context switches on the underlying
> > hypervisor.
> 
> Right, but both cases are VMs which is what throws me off, regardless
> of the zns drive. Or is this not the case, Luis?

The nvme controller timeout and rcu splat happen on an 8-core KVM
guest. The hypervisor reports no issues.

Does that answer the question?

 Luis


* Re: blktests with zbd/006 ZNS triggers a possible false positive RCU stall
  2022-04-15 17:33   ` Luis Chamberlain
@ 2022-04-15 17:42     ` Paul E. McKenney
  0 siblings, 0 replies; 15+ messages in thread
From: Paul E. McKenney @ 2022-04-15 17:42 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: Davidlohr Bueso, Klaus Jensen, shinichiro.kawasaki, linux-block,
	Pankaj Raghav, Pankaj Raghav, Adam Manzanares

On Fri, Apr 15, 2022 at 10:33:00AM -0700, Luis Chamberlain wrote:
> On Thu, Apr 14, 2022 at 06:09:45PM -0700, Davidlohr Bueso wrote:
> > No idea, however, why this would happen when using qemu as opposed to
> > nbd.
> 
> Sorry I keep saying nbd often when I mean null_blk, but yeah, same concept,
> in that I meant an nvme controller is involved *with* the RCU splat.
> 
> The qemu nvme driver and qemu controller would be one code run time surface
> to conside here, the two consecutive nvme controller IO timeouts seem
> to be telling of a backlog of some sort:
> 
> [493305.769531] run blktests zbd/006 at 2022-04-14 20:03:55
> [493336.979482] nvme nvme9: I/O 192 QID 5 timeout, aborting
> [493336.981666] nvme nvme9: Abort status: 0x0
> [493367.699440] nvme nvme9: I/O 192 QID 5 timeout, reset controller
> <-- rcu possible stall and NMIs-->
> [493426.706021] nvme nvme9: 8/0/0 default/read/poll queues
> 
> So that seems to be pattern here. First puzzle is *why* we end up with
> the IO timeout on the qemu nvme controller. And for this reason I CC'd
> Klaus, in case he might have any ideas.
> 
> The first "timeout, aborting" happens when we have a first timeout of
> a request for IO tag 192 sent to the controller and it just times out.
> This seems to happen about 30 seconds after the test zbd/006 starts.
> 
> The "reset controller" will happen if we already issued the IO command
> with tag 192 before and had already timed out before and hadn't been returned
> to the driver. Resetting the controller is the stop gap measure.
> 
> The nvme_timeout() is just shared blk_mq_ops between nvme_mq_admin_ops
> and nvme_mq_ops.
> 
> Inspecting the logs of running test zbd/005 and zbd/006 in a loop I can
> see that not all nvme resets of the controller end up in the RCU stall splat
> though.
> 
> The timeout of an IO happens about after 30 seconds the test starts, then
> the reset 30 seconds after that first timeout, then the RCU stall about
> 30 seconds after. For instance follow these now mapped times:
> 
> Apr 13 23:02:33 linux517-blktests-zbd unknown: run blktests zbd/006 at 2022-04-13 23:02:33
> 30 seconds go by ...
> Apr 13 23:03:05 linux517-blktests-zbd kernel: nvme nvme9: I/O 651 QID 3 timeout, aborting
> Apr 13 23:03:05 linux517-blktests-zbd kernel: nvme nvme9: Abort status: 0x0
> 30 seconds go by ...
> Apr 13 23:03:36 linux517-blktests-zbd kernel: nvme nvme9: I/O 651 QID 3 timeout, reset controller
> 30 seconds go by ...
> Apr 13 23:04:04 linux517-blktests-zbd kernel: rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
> 
> So the reset happens and yet we stay put. The reset work for the PCI
> nvme controller is on nvme_reset_work() drivers/nvme/host/pci.c and
> I don't see that issue'ing stuff back up the admin queue, however,
> the first IO timeout did send an admin abort, on the same file:
> 
> drivers/nvme/host/pci.c
> static enum blk_eh_timer_return nvme_timeout(struct request *req, bool reserved)
> {
> 	...
> 	dev_warn(nvmeq->dev->ctrl.device,                                       
>                 "I/O %d QID %d timeout, aborting\n",                            
>                  req->tag, nvmeq->qid);                                         
>                                                                                 
>         abort_req = blk_mq_alloc_request(dev->ctrl.admin_q, nvme_req_op(&cmd),  
>                                          BLK_MQ_REQ_NOWAIT);                    
>         if (IS_ERR(abort_req)) {                                                
>                 atomic_inc(&dev->ctrl.abort_limit);                             
>                 return BLK_EH_RESET_TIMER;                                      
>         }                                                                       
>         nvme_init_request(abort_req, &cmd);                                     
>                                                                                 
>         abort_req->end_io_data = NULL;                                          
>         blk_execute_rq_nowait(abort_req, false, abort_endio);  
> 	...
> }
> 
> So this is all seems by design. What's puzzling is why it takes
> about 30 seconds to get to the possible RCU stall splat *after* we
> reset the nvme controller. It make me wonder if we left something
> lingering in a black hole somewhere after our first reset. Is it
> right to gather from the RCU splat that whatever it was, it was stuck
> in a long spin lock?

One way that this could happen would be for an earlier RCU grace period
to have taken up the first 9 seconds of that 30-second interval.

Tracing might help here.  The rcu_gp_init() function is invoked at the
beginning of a grace period and the rcu_gp_cleanup() function is invoked
at its end.  Either way, it might be interesting to see how the grace
periods line up with those other events.
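
One hedged way to line them up, assuming the rcu:rcu_grace_period trace
event is available on this config (rcu_gp_init() and rcu_gp_cleanup() emit
its "start" and "end" phases), is something like:

cd /sys/kernel/tracing
echo 1 > events/rcu/rcu_grace_period/enable
echo 1 > tracing_on
cat trace_pipe > /root/rcu-gp.log &
# reproduce, then compare the grace-period "start"/"end" timestamps in
# the log against the nvme timeout/reset messages in dmesg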

Oh, and another way that this sort of thing can happen is if something
prevents the jiffies counter from advancing.  Rare, but it does happen,
for example: https://paulmck.livejournal.com/62071.html

							Thanx, Paul


* Re: blktests with zbd/006 ZNS triggers a possible false positive RCU stall
  2022-04-14 22:02 blktests with zbd/006 ZNS triggers a possible false positive RCU stall Luis Chamberlain
  2022-04-15  1:09 ` Davidlohr Bueso
@ 2022-04-20  5:54 ` Shinichiro Kawasaki
  2022-04-21 18:00   ` Luis Chamberlain
  1 sibling, 1 reply; 15+ messages in thread
From: Shinichiro Kawasaki @ 2022-04-20  5:54 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: Klaus Jensen, Paul E. McKenney, Davidlohr Bueso, linux-block,
	Pankaj Raghav, Pankaj Raghav, Adam Manzanares

On Apr 14, 2022 / 15:02, Luis Chamberlain wrote:
> Hey folks,
> 
> While enhancing kdevops [0] to embrace automation of testing with
> blktests for ZNS I ended up spotting a possible false positive RCU stall
> when running zbd/006 after zbd/005. The curious thing though is that
> this possible RCU stall is only possible when using the qemu
> ZNS drive, not when using nbd. In so far as kdevops is concerned
> it creates ZNS drives for you when you enable the config option
> CONFIG_QEMU_ENABLE_NVME_ZNS=y. So picking any of the ZNS drives
> suffices. When configuring blktests you can just enable the zbd
> guest, so only a pair of guests are reated the zbd guest and the
> respective development guest, zbd-dev guest. When using
> CONFIG_KDEVOPS_HOSTS_PREFIX="linux517" this means you end up with
> just two guests:
> 
>   * linux517-blktests-zbd
>   * linux517-blktests-zbd-dev
> 
> The RCU stall can be triggered easily as follows:
> 
> make menuconfig # make sure to enable CONFIG_QEMU_ENABLE_NVME_ZNS=y and blktests
> make
> make bringup # bring up guests
> make linux # build and boot into v5.17-rc7
> make blktests # build and install blktests
> 
> Now let's ssh to the guest while leaving a console attached
> with `sudo virsh vagrant_linux517-blktests-zbd` in a window:
> 
> ssh linux517-blktests-zbd
> sudo su -
> cd /usr/local/blktests
> export TEST_DEVS=/dev/nvme9n1
> i=0; while true; do ./check zbd/005 zbd/006; if [[ $? -ne 0 ]]; then echo "BAD at $i"; break; else echo GOOOD $i ; fi; let i=$i+1; done;
> 
> The above should never fail, but you should eventually see an RCU
> stall candidate on the console. The full details can be observed on the
> gist [1] but for completeness I list some of it below. It may be a false
> positive at this point, not sure.
> 
> [493272.711271] run blktests zbd/005 at 2022-04-14 20:03:22
> [493305.769531] run blktests zbd/006 at 2022-04-14 20:03:55
> [493336.979482] nvme nvme9: I/O 192 QID 5 timeout, aborting
> [493336.981666] nvme nvme9: Abort status: 0x0
> [493367.699440] nvme nvme9: I/O 192 QID 5 timeout, reset controller
> [493388.819341] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:

Hello Luis,

I run the blktests zbd group on several QEMU ZNS emulation devices for every
rcX kernel release, but I have never observed the symptom above. Now I'm
repeating zbd/005 and zbd/006 using v5.18-rc3 and a QEMU ZNS device, and have
not observed the symptom so far after 400 repeats.

I would like to run the test using the same ZNS setup as yours. Can you share
how your ZNS device is set up? I would like to know the device size and the
QEMU -device options, such as zoned.zone_size or zoned.max_active.

-- 
Best Regards,
Shin'ichiro Kawasaki


* Re: blktests with zbd/006 ZNS triggers a possible false positive RCU stall
  2022-04-20  5:54 ` Shinichiro Kawasaki
@ 2022-04-21 18:00   ` Luis Chamberlain
  2022-04-27  5:08     ` Shinichiro Kawasaki
  0 siblings, 1 reply; 15+ messages in thread
From: Luis Chamberlain @ 2022-04-21 18:00 UTC (permalink / raw)
  To: Shinichiro Kawasaki
  Cc: Klaus Jensen, Paul E. McKenney, Davidlohr Bueso, linux-block,
	Pankaj Raghav, Pankaj Raghav, Adam Manzanares

On Wed, Apr 20, 2022 at 05:54:29AM +0000, Shinichiro Kawasaki wrote:
> On Apr 14, 2022 / 15:02, Luis Chamberlain wrote:
> > Hey folks,
> > 
> > While enhancing kdevops [0] to embrace automation of testing with
> > blktests for ZNS I ended up spotting a possible false positive RCU stall
> > when running zbd/006 after zbd/005. The curious thing though is that
> > this possible RCU stall is only possible when using the qemu
> > ZNS drive, not when using nbd. In so far as kdevops is concerned
> > it creates ZNS drives for you when you enable the config option
> > CONFIG_QEMU_ENABLE_NVME_ZNS=y. So picking any of the ZNS drives
> > suffices. When configuring blktests you can just enable the zbd
> > guest, so only a pair of guests are reated the zbd guest and the
> > respective development guest, zbd-dev guest. When using
> > CONFIG_KDEVOPS_HOSTS_PREFIX="linux517" this means you end up with
> > just two guests:
> > 
> >   * linux517-blktests-zbd
> >   * linux517-blktests-zbd-dev
> > 
> > The RCU stall can be triggered easily as follows:
> > 
> > make menuconfig # make sure to enable CONFIG_QEMU_ENABLE_NVME_ZNS=y and blktests
> > make
> > make bringup # bring up guests
> > make linux # build and boot into v5.17-rc7
> > make blktests # build and install blktests
> > 
> > Now let's ssh to the guest while leaving a console attached
> > with `sudo virsh vagrant_linux517-blktests-zbd` in a window:
> > 
> > ssh linux517-blktests-zbd
> > sudo su -
> > cd /usr/local/blktests
> > export TEST_DEVS=/dev/nvme9n1
> > i=0; while true; do ./check zbd/005 zbd/006; if [[ $? -ne 0 ]]; then echo "BAD at $i"; break; else echo GOOOD $i ; fi; let i=$i+1; done;
> > 
> > The above should never fail, but you should eventually see an RCU
> > stall candidate on the console. The full details can be observed on the
> > gist [1] but for completeness I list some of it below. It may be a false
> > positive at this point, not sure.
> > 
> > [493272.711271] run blktests zbd/005 at 2022-04-14 20:03:22
> > [493305.769531] run blktests zbd/006 at 2022-04-14 20:03:55
> > [493336.979482] nvme nvme9: I/O 192 QID 5 timeout, aborting
> > [493336.981666] nvme nvme9: Abort status: 0x0
> > [493367.699440] nvme nvme9: I/O 192 QID 5 timeout, reset controller
> > [493388.819341] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
> 
> Hello Luis,
> 
> I run blktests zbd group on several QEMU ZNS emulation devices for every rcX
> kernel releases. But, I have not ever observed the symptom above. Now I'm
> repeating zbd/005 and zbd/006 using v5.18-rc3 and a QEMU ZNS device, and do
> not observe the symptom so far, after 400 times repeat.

Did you try v5.17-rc7 ?

> I would like to run the test using same ZNS set up as yours. Can you share how
> your ZNS device is set up? I would like to know device size and QEMU -device
> options, such as zoned.zone_size or zoned.max_active.

It is as easy as the above make commands and the follow-up login commands.
I'll be bumping the kernel tested for fstests and blktests on kdevops
soon, but the baseline is sadly not yet done for all filesystems and
blktests. Once the baseline is completed though, it should be easy to
bump the kernel and confirm whether old failures stop failing / find
new issues.

  Luis


* Re: blktests with zbd/006 ZNS triggers a possible false positive RCU stall
  2022-04-21 18:00   ` Luis Chamberlain
@ 2022-04-27  5:08     ` Shinichiro Kawasaki
  2022-04-27  5:42       ` Luis Chamberlain
  2022-04-27  7:41       ` Klaus Jensen
  0 siblings, 2 replies; 15+ messages in thread
From: Shinichiro Kawasaki @ 2022-04-27  5:08 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: Klaus Jensen, Paul E. McKenney, Davidlohr Bueso, linux-block,
	Pankaj Raghav, Pankaj Raghav, Adam Manzanares

On Apr 21, 2022 / 11:00, Luis Chamberlain wrote:
> On Wed, Apr 20, 2022 at 05:54:29AM +0000, Shinichiro Kawasaki wrote:
> > On Apr 14, 2022 / 15:02, Luis Chamberlain wrote:
> > > Hey folks,
> > > 
> > > While enhancing kdevops [0] to embrace automation of testing with
> > > blktests for ZNS I ended up spotting a possible false positive RCU stall
> > > when running zbd/006 after zbd/005. The curious thing though is that
> > > this possible RCU stall is only possible when using the qemu
> > > ZNS drive, not when using nbd. In so far as kdevops is concerned
> > > it creates ZNS drives for you when you enable the config option
> > > CONFIG_QEMU_ENABLE_NVME_ZNS=y. So picking any of the ZNS drives
> > > suffices. When configuring blktests you can just enable the zbd
> > > guest, so only a pair of guests are reated the zbd guest and the
> > > respective development guest, zbd-dev guest. When using
> > > CONFIG_KDEVOPS_HOSTS_PREFIX="linux517" this means you end up with
> > > just two guests:
> > > 
> > >   * linux517-blktests-zbd
> > >   * linux517-blktests-zbd-dev
> > > 
> > > The RCU stall can be triggered easily as follows:
> > > 
> > > make menuconfig # make sure to enable CONFIG_QEMU_ENABLE_NVME_ZNS=y and blktests
> > > make
> > > make bringup # bring up guests
> > > make linux # build and boot into v5.17-rc7
> > > make blktests # build and install blktests
> > > 
> > > Now let's ssh to the guest while leaving a console attached
> > > with `sudo virsh vagrant_linux517-blktests-zbd` in a window:
> > > 
> > > ssh linux517-blktests-zbd
> > > sudo su -
> > > cd /usr/local/blktests
> > > export TEST_DEVS=/dev/nvme9n1
> > > i=0; while true; do ./check zbd/005 zbd/006; if [[ $? -ne 0 ]]; then echo "BAD at $i"; break; else echo GOOOD $i ; fi; let i=$i+1; done;
> > > 
> > > The above should never fail, but you should eventually see an RCU
> > > stall candidate on the console. The full details can be observed on the
> > > gist [1] but for completeness I list some of it below. It may be a false
> > > positive at this point, not sure.
> > > 
> > > [493272.711271] run blktests zbd/005 at 2022-04-14 20:03:22
> > > [493305.769531] run blktests zbd/006 at 2022-04-14 20:03:55
> > > [493336.979482] nvme nvme9: I/O 192 QID 5 timeout, aborting
> > > [493336.981666] nvme nvme9: Abort status: 0x0
> > > [493367.699440] nvme nvme9: I/O 192 QID 5 timeout, reset controller
> > > [493388.819341] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
> > 
> > Hello Luis,
> > 
> > I run blktests zbd group on several QEMU ZNS emulation devices for every rcX
> > kernel releases. But, I have not ever observed the symptom above. Now I'm
> > repeating zbd/005 and zbd/006 using v5.18-rc3 and a QEMU ZNS device, and do
> > not observe the symptom so far, after 400 times repeat.
> 
> Did you try v5.17-rc7 ?

I hadn't tried it. Then I tried v5.17-rc7 and observed the same symptom.

> 
> > I would like to run the test using same ZNS set up as yours. Can you share how
> > your ZNS device is set up? I would like to know device size and QEMU -device
> > options, such as zoned.zone_size or zoned.max_active.
> 
> It is as easy as the above make commands, and follow up login commands.

I managed to run kdevops on my machine and saw the I/O timeout and abort
messages. Using a similar QEMU ZNS setup to kdevops, the messages were
recreated in my test environment as well (the reset controller message and
the RCU related error were not observed).

[  214.134083][ T1028] run blktests zbd/005 at 2022-04-22 21:29:54
[  246.383978][ T1142] run blktests zbd/006 at 2022-04-22 21:30:26
[  276.784284][  T386] nvme nvme6: I/O 494 QID 4 timeout, aborting
[  276.788391][    C0] nvme nvme6: Abort status: 0x0

The conditions to recreate the I/O timeout error are as follows (a minimal
reproduction sketch follows the list):

- A larger QEMU ZNS drive (10GB)
    - I use QEMU ZNS drives of 1GB size for my test runs. With this smaller
      size, the I/O timeout is not observed.

- Issue a zone reset command for all zones (with the 'blkzone reset' command)
  to the drive just after zbd/005 completes.
    - The test case zbd/006 calls the zone reset command. Repeating zbd/005
      and the zone reset command is enough to recreate the I/O timeout.
    - When a 10 second sleep is added between the zbd/005 run and the zone
      reset command, the I/O timeout was not observed.
    - The data write pattern of zbd/005 looks important. A simple dd command
      to fill the device before 'blkzone reset' did not recreate the I/O
      timeout.
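
A minimal sketch of that reproduction, assuming the ZNS drive shows up as
/dev/nvme9n1 as in Luis' setup:

cd /usr/local/blktests
export TEST_DEVS=/dev/nvme9n1
./check zbd/005
# resetting all zones right away recreates the I/O timeout;
# a "sleep 10" inserted here makes it go away
blkzone reset /dev/nvme9n1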

I dug into the QEMU code and found that it takes a long time to complete the
zone reset command with the all-zones flag. It takes more than 30 seconds and
appears to trigger the I/O timeout in the block layer. QEMU issues a fallocate
punch hole to the backing file for each zone, so that the data of each zone is
zero cleared. Each fallocate call is quick, but a 0.7 second delay between the
calls was often observed. I guess some fsync or fdatasync operation is running
and causing the delay.
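
One way to observe those per-zone fallocate calls from the host is roughly
the following sketch; it assumes a single qemu-system-x86_64 process backs
the guest:

# -ttt prints absolute timestamps and -T the time spent inside each call,
# so the gaps *between* the per-zone fallocate() calls stand out
strace -f -ttt -T -e trace=fallocate -p "$(pidof qemu-system-x86_64)"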

In other words, QEMU ZNS zone reset for all zones is slow, depending on the
ZNS drive's size and state. A performance improvement of zone reset is
desired in QEMU. I will look for the chance to work on it.

-- 
Best Regards,
Shin'ichiro Kawasaki


* Re: blktests with zbd/006 ZNS triggers a possible false positive RCU stall
  2022-04-27  5:08     ` Shinichiro Kawasaki
@ 2022-04-27  5:42       ` Luis Chamberlain
  2022-04-27  7:41       ` Klaus Jensen
  1 sibling, 0 replies; 15+ messages in thread
From: Luis Chamberlain @ 2022-04-27  5:42 UTC (permalink / raw)
  To: Shinichiro Kawasaki
  Cc: Klaus Jensen, Paul E. McKenney, Davidlohr Bueso, linux-block,
	Pankaj Raghav, Pankaj Raghav, Adam Manzanares

On Wed, Apr 27, 2022 at 05:08:25AM +0000, Shinichiro Kawasaki wrote:
> The conditions to recreate the I/O timeout error are as follows:
> 
> - Larger size of QEMU ZNS drive (10GB)
>     - I use QEMU ZNS drives with 1GB size for my test runs. With this smaller
>       size, the I/O timeout is not observed.
> 
> - Issue zone reset command for all zones (with 'blkzone reset' command) just
>   after zbd/005 completion to the drive.
>     - The test case zbd/006 calls the zone reset command. It's enough to repeat
>       zbd/005 and zone reset command to recreate the I/O timeout.
>     - When 10 seconds sleep is added between zbd/005 run and zone reset command,
>       the I/O timeout was not observed.
>     - The data write pattern of zbd/005 looks important. Simple dd command to
>       fill the device before 'blkzone reset' did not recreate the I/O timeout.
> 
> I dug into QEMU code and found that it takes long time to complete zone reset
> command with all zones flag. It takes more than 30 seconds and looks triggering
> the I/O timeout in the block layer. The QEMU calls fallocate punch hole to the
> backend file for each zone, so that data of each zone is zero cleared. Each
> fallocate call is quick but between the calls, 0.7 second delay was observed
> often. I guess some fsync or fdatasync operation would be running and causing
> the delay.
> 
> In other words, QEMU ZNS zone reset for all zones is so slow depending on the
> ZNS drive's size and status. Performance improvement of zone reset is desired in
> QEMU. I will seek for the chance to work on it.

Awesome find Shinichiro!

  Luis


* Re: blktests with zbd/006 ZNS triggers a possible false positive RCU stall
  2022-04-27  5:08     ` Shinichiro Kawasaki
  2022-04-27  5:42       ` Luis Chamberlain
@ 2022-04-27  7:41       ` Klaus Jensen
  2022-04-27  8:39         ` Damien Le Moal
  2022-04-27  8:53         ` Shinichiro Kawasaki
  1 sibling, 2 replies; 15+ messages in thread
From: Klaus Jensen @ 2022-04-27  7:41 UTC (permalink / raw)
  To: Shinichiro Kawasaki
  Cc: Luis Chamberlain, Paul E. McKenney, Davidlohr Bueso, linux-block,
	Pankaj Raghav, Pankaj Raghav, Adam Manzanares

On Apr 27 05:08, Shinichiro Kawasaki wrote:
> On Apr 21, 2022 / 11:00, Luis Chamberlain wrote:
> > On Wed, Apr 20, 2022 at 05:54:29AM +0000, Shinichiro Kawasaki wrote:
> > > On Apr 14, 2022 / 15:02, Luis Chamberlain wrote:
> > > > Hey folks,
> > > > 
> > > > While enhancing kdevops [0] to embrace automation of testing with
> > > > blktests for ZNS I ended up spotting a possible false positive RCU stall
> > > > when running zbd/006 after zbd/005. The curious thing though is that
> > > > this possible RCU stall is only possible when using the qemu
> > > > ZNS drive, not when using nbd. In so far as kdevops is concerned
> > > > it creates ZNS drives for you when you enable the config option
> > > > CONFIG_QEMU_ENABLE_NVME_ZNS=y. So picking any of the ZNS drives
> > > > suffices. When configuring blktests you can just enable the zbd
> > > > guest, so only a pair of guests are reated the zbd guest and the
> > > > respective development guest, zbd-dev guest. When using
> > > > CONFIG_KDEVOPS_HOSTS_PREFIX="linux517" this means you end up with
> > > > just two guests:
> > > > 
> > > >   * linux517-blktests-zbd
> > > >   * linux517-blktests-zbd-dev
> > > > 
> > > > The RCU stall can be triggered easily as follows:
> > > > 
> > > > make menuconfig # make sure to enable CONFIG_QEMU_ENABLE_NVME_ZNS=y and blktests
> > > > make
> > > > make bringup # bring up guests
> > > > make linux # build and boot into v5.17-rc7
> > > > make blktests # build and install blktests
> > > > 
> > > > Now let's ssh to the guest while leaving a console attached
> > > > with `sudo virsh vagrant_linux517-blktests-zbd` in a window:
> > > > 
> > > > ssh linux517-blktests-zbd
> > > > sudo su -
> > > > cd /usr/local/blktests
> > > > export TEST_DEVS=/dev/nvme9n1
> > > > i=0; while true; do ./check zbd/005 zbd/006; if [[ $? -ne 0 ]]; then echo "BAD at $i"; break; else echo GOOOD $i ; fi; let i=$i+1; done;
> > > > 
> > > > The above should never fail, but you should eventually see an RCU
> > > > stall candidate on the console. The full details can be observed on the
> > > > gist [1] but for completeness I list some of it below. It may be a false
> > > > positive at this point, not sure.
> > > > 
> > > > [493272.711271] run blktests zbd/005 at 2022-04-14 20:03:22
> > > > [493305.769531] run blktests zbd/006 at 2022-04-14 20:03:55
> > > > [493336.979482] nvme nvme9: I/O 192 QID 5 timeout, aborting
> > > > [493336.981666] nvme nvme9: Abort status: 0x0
> > > > [493367.699440] nvme nvme9: I/O 192 QID 5 timeout, reset controller
> > > > [493388.819341] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
> > > 
> > > Hello Luis,
> > > 
> > > I run blktests zbd group on several QEMU ZNS emulation devices for every rcX
> > > kernel releases. But, I have not ever observed the symptom above. Now I'm
> > > repeating zbd/005 and zbd/006 using v5.18-rc3 and a QEMU ZNS device, and do
> > > not observe the symptom so far, after 400 times repeat.
> > 
> > Did you try v5.17-rc7 ?
> 
> I hadn't tried it. Then I tried v5.17-rc7 and observed the same symptom.
> 
> > 
> > > I would like to run the test using same ZNS set up as yours. Can you share how
> > > your ZNS device is set up? I would like to know device size and QEMU -device
> > > options, such as zoned.zone_size or zoned.max_active.
> > 
> > It is as easy as the above make commands, and follow up login commands.
> 
> I managed to run kdevops on my machine, and saw the I/O timeout and abort
> messages. Using similar QEMU ZNS set up as kdevops, the messages were recreated
> in my test environment also (the reset controller message and RCU relegated
> error were not observed).
> 

Can you extract the relevant part of the QEMU parameters? I tried to
reproduce this, but could not with a 10G drive, neither with discard=on nor
off, qcow2 nor raw.

> [  214.134083][ T1028] run blktests zbd/005 at 2022-04-22 21:29:54
> [  246.383978][ T1142] run blktests zbd/006 at 2022-04-22 21:30:26
> [  276.784284][  T386] nvme nvme6: I/O 494 QID 4 timeout, aborting
> [  276.788391][    C0] nvme nvme6: Abort status: 0x0
> 
> The conditions to recreate the I/O timeout error are as follows:
> 
> - Larger size of QEMU ZNS drive (10GB)
>     - I use QEMU ZNS drives with 1GB size for my test runs. With this smaller
>       size, the I/O timeout is not observed.
> 
> - Issue zone reset command for all zones (with 'blkzone reset' command) just
>   after zbd/005 completion to the drive.
>     - The test case zbd/006 calls the zone reset command. It's enough to repeat
>       zbd/005 and zone reset command to recreate the I/O timeout.
>     - When 10 seconds sleep is added between zbd/005 run and zone reset command,
>       the I/O timeout was not observed.
>     - The data write pattern of zbd/005 looks important. Simple dd command to
>       fill the device before 'blkzone reset' did not recreate the I/O timeout.
> 
> I dug into QEMU code and found that it takes long time to complete zone reset
> command with all zones flag. It takes more than 30 seconds and looks triggering
> the I/O timeout in the block layer. The QEMU calls fallocate punch hole to the
> backend file for each zone, so that data of each zone is zero cleared. Each
> fallocate call is quick but between the calls, 0.7 second delay was observed
> often. I guess some fsync or fdatasync operation would be running and causing
> the delay.
> 

QEMU uses a write zeroes for zone reset. This is because of the
requirement that blocks in empty zones must be considered deallocated.

When the drive is configured with `discard=on`, these write zeroes
*should* turn into discards. However, I also tested with discard=off and
I could not reproduce it.

It might make sense to force QEMU to use a discard for zone reset in all
cases, and then change the reported DLFEAT appropriately, since we
cannot guarantee zeroes then.
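
For reference, a hedged sketch of the relevant QEMU command-line fragment;
the zoned.* properties belong to the nvme-ns device, while the file name and
sizes are placeholders, and the actual kdevops invocation may differ:

-drive file=zns.raw,id=nvmezns0,format=raw,if=none,discard=on \
-device nvme,id=nvme0,serial=deadbeef \
-device nvme-ns,drive=nvmezns0,bus=nvme0,zoned=true,zoned.zone_size=128M,zoned.zone_capacity=128M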

> In other words, QEMU ZNS zone reset for all zones is so slow depending on the
> ZNS drive's size and status. Performance improvement of zone reset is desired in
> QEMU. I will seek for the chance to work on it.
> 

Currently, each zone is a separate discard/write zeroes call. It would be
fair to special-case the all-zones reset and do it in much larger chunks.


* Re: blktests with zbd/006 ZNS triggers a possible false positive RCU stall
  2022-04-27  7:41       ` Klaus Jensen
@ 2022-04-27  8:39         ` Damien Le Moal
  2022-04-27  8:55           ` Klaus Jensen
  2022-04-27  8:53         ` Shinichiro Kawasaki
  1 sibling, 1 reply; 15+ messages in thread
From: Damien Le Moal @ 2022-04-27  8:39 UTC (permalink / raw)
  To: Klaus Jensen, Shinichiro Kawasaki
  Cc: Luis Chamberlain, Paul E. McKenney, Davidlohr Bueso, linux-block,
	Pankaj Raghav, Pankaj Raghav, Adam Manzanares

On 4/27/22 16:41, Klaus Jensen wrote:
> On Apr 27 05:08, Shinichiro Kawasaki wrote:
>> On Apr 21, 2022 / 11:00, Luis Chamberlain wrote:
>>> On Wed, Apr 20, 2022 at 05:54:29AM +0000, Shinichiro Kawasaki wrote:
>>>> On Apr 14, 2022 / 15:02, Luis Chamberlain wrote:
>>>>> Hey folks,
>>>>>
>>>>> While enhancing kdevops [0] to embrace automation of testing with
>>>>> blktests for ZNS I ended up spotting a possible false positive RCU stall
>>>>> when running zbd/006 after zbd/005. The curious thing though is that
>>>>> this possible RCU stall is only possible when using the qemu
>>>>> ZNS drive, not when using nbd. In so far as kdevops is concerned
>>>>> it creates ZNS drives for you when you enable the config option
>>>>> CONFIG_QEMU_ENABLE_NVME_ZNS=y. So picking any of the ZNS drives
>>>>> suffices. When configuring blktests you can just enable the zbd
>>>>> guest, so only a pair of guests are reated the zbd guest and the
>>>>> respective development guest, zbd-dev guest. When using
>>>>> CONFIG_KDEVOPS_HOSTS_PREFIX="linux517" this means you end up with
>>>>> just two guests:
>>>>>
>>>>>   * linux517-blktests-zbd
>>>>>   * linux517-blktests-zbd-dev
>>>>>
>>>>> The RCU stall can be triggered easily as follows:
>>>>>
>>>>> make menuconfig # make sure to enable CONFIG_QEMU_ENABLE_NVME_ZNS=y and blktests
>>>>> make
>>>>> make bringup # bring up guests
>>>>> make linux # build and boot into v5.17-rc7
>>>>> make blktests # build and install blktests
>>>>>
>>>>> Now let's ssh to the guest while leaving a console attached
>>>>> with `sudo virsh vagrant_linux517-blktests-zbd` in a window:
>>>>>
>>>>> ssh linux517-blktests-zbd
>>>>> sudo su -
>>>>> cd /usr/local/blktests
>>>>> export TEST_DEVS=/dev/nvme9n1
>>>>> i=0; while true; do ./check zbd/005 zbd/006; if [[ $? -ne 0 ]]; then echo "BAD at $i"; break; else echo GOOOD $i ; fi; let i=$i+1; done;
>>>>>
>>>>> The above should never fail, but you should eventually see an RCU
>>>>> stall candidate on the console. The full details can be observed on the
>>>>> gist [1] but for completeness I list some of it below. It may be a false
>>>>> positive at this point, not sure.
>>>>>
>>>>> [493272.711271] run blktests zbd/005 at 2022-04-14 20:03:22
>>>>> [493305.769531] run blktests zbd/006 at 2022-04-14 20:03:55
>>>>> [493336.979482] nvme nvme9: I/O 192 QID 5 timeout, aborting
>>>>> [493336.981666] nvme nvme9: Abort status: 0x0
>>>>> [493367.699440] nvme nvme9: I/O 192 QID 5 timeout, reset controller
>>>>> [493388.819341] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
>>>>
>>>> Hello Luis,
>>>>
>>>> I run blktests zbd group on several QEMU ZNS emulation devices for every rcX
>>>> kernel releases. But, I have not ever observed the symptom above. Now I'm
>>>> repeating zbd/005 and zbd/006 using v5.18-rc3 and a QEMU ZNS device, and do
>>>> not observe the symptom so far, after 400 times repeat.
>>>
>>> Did you try v5.17-rc7 ?
>>
>> I hadn't tried it. Then I tried v5.17-rc7 and observed the same symptom.
>>
>>>
>>>> I would like to run the test using same ZNS set up as yours. Can you share how
>>>> your ZNS device is set up? I would like to know device size and QEMU -device
>>>> options, such as zoned.zone_size or zoned.max_active.
>>>
>>> It is as easy as the above make commands, and follow up login commands.
>>
>> I managed to run kdevops on my machine, and saw the I/O timeout and abort
>> messages. Using similar QEMU ZNS set up as kdevops, the messages were recreated
>> in my test environment also (the reset controller message and RCU relegated
>> error were not observed).
>>
> 
> Can you extract the relevant part of the QEMU parameters? I tried to
> reproduce this, but could not with a 10G, neither with discard=on or
> off, qcow2 or raw.
> 
>> [  214.134083][ T1028] run blktests zbd/005 at 2022-04-22 21:29:54
>> [  246.383978][ T1142] run blktests zbd/006 at 2022-04-22 21:30:26
>> [  276.784284][  T386] nvme nvme6: I/O 494 QID 4 timeout, aborting
>> [  276.788391][    C0] nvme nvme6: Abort status: 0x0
>>
>> The conditions to recreate the I/O timeout error are as follows:
>>
>> - Larger size of QEMU ZNS drive (10GB)
>>     - I use QEMU ZNS drives with 1GB size for my test runs. With this smaller
>>       size, the I/O timeout is not observed.
>>
>> - Issue zone reset command for all zones (with 'blkzone reset' command) just
>>   after zbd/005 completion to the drive.
>>     - The test case zbd/006 calls the zone reset command. It's enough to repeat
>>       zbd/005 and zone reset command to recreate the I/O timeout.
>>     - When 10 seconds sleep is added between zbd/005 run and zone reset command,
>>       the I/O timeout was not observed.
>>     - The data write pattern of zbd/005 looks important. Simple dd command to
>>       fill the device before 'blkzone reset' did not recreate the I/O timeout.
>>
>> I dug into QEMU code and found that it takes long time to complete zone reset
>> command with all zones flag. It takes more than 30 seconds and looks triggering
>> the I/O timeout in the block layer. The QEMU calls fallocate punch hole to the
>> backend file for each zone, so that data of each zone is zero cleared. Each
>> fallocate call is quick but between the calls, 0.7 second delay was observed
>> often. I guess some fsync or fdatasync operation would be running and causing
>> the delay.
>>
> 
> QEMU uses a write zeroes for zone reset. This is because of the
> requirement that block in empty zones must be considered deallocated.
> 
> When the drive is configured with `discard=on`, these write zeroes
> *should* turn into discards. However, I also tested with discard=off and
> I could not reproduce it.
> 
> It might make sense to force QEMU to use a discard for zone reset in all
> cases, and then change the reported DLFEAT appropriately, since we
> cannot guarantee zeroes then.

Why not punch a hole in the backing store file with fallocate() with mode
set to FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE? That would be far
faster than a write zeroes, which may actually perform the writes, leading
to long command processing times. Reading a hole in a file is guaranteed
to return zeroes, at least on Linux.

If the backing store is a block device, then sure, write zeroes is the
only solution. Discard should be used with caution since it is a hint
only and some drives may actually do nothing.

> 
>> In other words, QEMU ZNS zone reset for all zones is so slow depending on the
>> ZNS drive's size and status. Performance improvement of zone reset is desired in
>> QEMU. I will seek for the chance to work on it.
>>
> 
> Currently, each zone is a separate discard/write zero call. It would be
> fair to special case all zones and do it in much larger chunks.

Yep, for a backing file, a full-file fallocate(FALLOC_FL_PUNCH_HOLE)
would do nicely. Or truncate(0) + truncate(storage size) would work too.

Since resets are always either all zones or a single zone, specially
optimized handling of the reset-all case will definitely have huge
benefits for that command.

-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: blktests with zbd/006 ZNS triggers a possible false positive RCU stall
  2022-04-27  7:41       ` Klaus Jensen
  2022-04-27  8:39         ` Damien Le Moal
@ 2022-04-27  8:53         ` Shinichiro Kawasaki
  1 sibling, 0 replies; 15+ messages in thread
From: Shinichiro Kawasaki @ 2022-04-27  8:53 UTC (permalink / raw)
  To: Klaus Jensen
  Cc: Luis Chamberlain, Paul E. McKenney, Davidlohr Bueso, linux-block,
	Pankaj Raghav, Pankaj Raghav, Adam Manzanares

On Apr 27, 2022 / 09:41, Klaus Jensen wrote:
> On Apr 27 05:08, Shinichiro Kawasaki wrote:
> > On Apr 21, 2022 / 11:00, Luis Chamberlain wrote:
> > > On Wed, Apr 20, 2022 at 05:54:29AM +0000, Shinichiro Kawasaki wrote:
> > > > On Apr 14, 2022 / 15:02, Luis Chamberlain wrote:
> > > > > Hey folks,
> > > > > 
> > > > > While enhancing kdevops [0] to embrace automation of testing with
> > > > > blktests for ZNS I ended up spotting a possible false positive RCU stall
> > > > > when running zbd/006 after zbd/005. The curious thing though is that
> > > > > this possible RCU stall is only possible when using the qemu
> > > > > ZNS drive, not when using nbd. In so far as kdevops is concerned
> > > > > it creates ZNS drives for you when you enable the config option
> > > > > CONFIG_QEMU_ENABLE_NVME_ZNS=y. So picking any of the ZNS drives
> > > > > suffices. When configuring blktests you can just enable the zbd
> > > > > guest, so only a pair of guests are reated the zbd guest and the
> > > > > respective development guest, zbd-dev guest. When using
> > > > > CONFIG_KDEVOPS_HOSTS_PREFIX="linux517" this means you end up with
> > > > > just two guests:
> > > > > 
> > > > >   * linux517-blktests-zbd
> > > > >   * linux517-blktests-zbd-dev
> > > > > 
> > > > > The RCU stall can be triggered easily as follows:
> > > > > 
> > > > > make menuconfig # make sure to enable CONFIG_QEMU_ENABLE_NVME_ZNS=y and blktests
> > > > > make
> > > > > make bringup # bring up guests
> > > > > make linux # build and boot into v5.17-rc7
> > > > > make blktests # build and install blktests
> > > > > 
> > > > > Now let's ssh to the guest while leaving a console attached
> > > > > with `sudo virsh vagrant_linux517-blktests-zbd` in a window:
> > > > > 
> > > > > ssh linux517-blktests-zbd
> > > > > sudo su -
> > > > > cd /usr/local/blktests
> > > > > export TEST_DEVS=/dev/nvme9n1
> > > > > i=0; while true; do ./check zbd/005 zbd/006; if [[ $? -ne 0 ]]; then echo "BAD at $i"; break; else echo GOOOD $i ; fi; let i=$i+1; done;
> > > > > 
> > > > > The above should never fail, but you should eventually see an RCU
> > > > > stall candidate on the console. The full details can be observed on the
> > > > > gist [1] but for completeness I list some of it below. It may be a false
> > > > > positive at this point, not sure.
> > > > > 
> > > > > [493272.711271] run blktests zbd/005 at 2022-04-14 20:03:22
> > > > > [493305.769531] run blktests zbd/006 at 2022-04-14 20:03:55
> > > > > [493336.979482] nvme nvme9: I/O 192 QID 5 timeout, aborting
> > > > > [493336.981666] nvme nvme9: Abort status: 0x0
> > > > > [493367.699440] nvme nvme9: I/O 192 QID 5 timeout, reset controller
> > > > > [493388.819341] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
> > > > 
> > > > Hello Luis,
> > > > 
> > > > I run blktests zbd group on several QEMU ZNS emulation devices for every rcX
> > > > kernel releases. But, I have not ever observed the symptom above. Now I'm
> > > > repeating zbd/005 and zbd/006 using v5.18-rc3 and a QEMU ZNS device, and do
> > > > not observe the symptom so far, after 400 times repeat.
> > > 
> > > Did you try v5.17-rc7 ?
> > 
> > I hadn't tried it. Then I tried v5.17-rc7 and observed the same symptom.
> > 
> > > 
> > > > I would like to run the test using same ZNS set up as yours. Can you share how
> > > > your ZNS device is set up? I would like to know device size and QEMU -device
> > > > options, such as zoned.zone_size or zoned.max_active.
> > > 
> > > It is as easy as the above make commands, and follow up login commands.
> > 
> > I managed to run kdevops on my machine, and saw the I/O timeout and abort
> > messages. Using similar QEMU ZNS set up as kdevops, the messages were recreated
> > in my test environment also (the reset controller message and RCU relegated
> > error were not observed).
> > 
> 
> Can you extract the relevant part of the QEMU parameters? I tried to
> reproduce this, but could not with a 10G, neither with discard=on or
> off, qcow2 or raw.

Sure, the QEMU options I used for the ZNS drive were as follows:

-device nvme,id=nvme6,serial=XXX,logical_block_size=4096,physical_block_size=4096 \
-drive file=YYY.img,id=ZZZ,format=raw,if=none \
-device nvme-ns,drive=ZZZ,bus=nvme6,nsid=1,zoned=true,zoned.max_open=0,zoned.max_active=0,zoned.zone_size=134217728,zoned.zone_capacity=0

I prepared the QEMU ZNS image file in raw format on an ext4 filesystem.

# qemu-img create -f raw YYY.img 10240M
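
(For reference, zoned.zone_size=134217728 is 128 MiB, so the 10240 MiB
image gives 80 zones. As a back-of-the-envelope estimate, if the ~0.7
second delay observed between the per-zone fallocate calls hits most of
those zones, a reset-all takes on the order of 80 x 0.7 ~= 56 seconds,
which is consistent with the >30 second command time and the I/O
timeout.)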

> 
> > [  214.134083][ T1028] run blktests zbd/005 at 2022-04-22 21:29:54
> > [  246.383978][ T1142] run blktests zbd/006 at 2022-04-22 21:30:26
> > [  276.784284][  T386] nvme nvme6: I/O 494 QID 4 timeout, aborting
> > [  276.788391][    C0] nvme nvme6: Abort status: 0x0
> > 
> > The conditions to recreate the I/O timeout error are as follows:
> > 
> > - Larger size of QEMU ZNS drive (10GB)
> >     - I use QEMU ZNS drives with 1GB size for my test runs. With this smaller
> >       size, the I/O timeout is not observed.
> > 
> > - Issue zone reset command for all zones (with 'blkzone reset' command) just
> >   after zbd/005 completion to the drive.
> >     - The test case zbd/006 calls the zone reset command. It's enough to repeat
> >       zbd/005 and zone reset command to recreate the I/O timeout.
> >     - When 10 seconds sleep is added between zbd/005 run and zone reset command,
> >       the I/O timeout was not observed.
> >     - The data write pattern of zbd/005 looks important. Simple dd command to
> >       fill the device before 'blkzone reset' did not recreate the I/O timeout.
> > 
> > I dug into QEMU code and found that it takes long time to complete zone reset
> > command with all zones flag. It takes more than 30 seconds and looks triggering
> > the I/O timeout in the block layer. The QEMU calls fallocate punch hole to the
> > backend file for each zone, so that data of each zone is zero cleared. Each
> > fallocate call is quick but between the calls, 0.7 second delay was observed
> > often. I guess some fsync or fdatasync operation would be running and causing
> > the delay.
> > 
> 
> QEMU uses a write zeroes for zone reset. This is because of the
> requirement that block in empty zones must be considered deallocated.
> 
> When the drive is configured with `discard=on`, these write zeroes
> *should* turn into discards. However, I also tested with discard=off and
> I could not reproduce it.

I took a look at the QEMU code and found that the discard=on/off option
affects the BDRV_O_UNMAP flag, which is referenced in
bdrv_co_pwrite_zeroes(). However, it looks like the zone reset path does
not hit this function; instead, bdrv_co_pwritev_part() is in the call
path. So the discard option probably does not affect the zone reset I/O
timeout.

> 
> It might make sense to force QEMU to use a discard for zone reset in all
> cases, and then change the reported DLFEAT appropriately, since we
> cannot guarantee zeroes then.
> 
> > In other words, QEMU ZNS zone reset for all zones is so slow depending on the
> > ZNS drive's size and status. Performance improvement of zone reset is desired in
> > QEMU. I will seek for the chance to work on it.
> > 
> 
> Currently, each zone is a separate discard/write zero call. It would be
> fair to special case all zones and do it in much larger chunks.

Yes, this looks like the correct approach for a fix.

BTW, it would be better to move this discussion from the linux-block
list to the qemu-devel list :)

-- 
Best Regards,
Shin'ichiro Kawasaki

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: blktests with zbd/006 ZNS triggers a possible false positive RCU stall
  2022-04-27  8:39         ` Damien Le Moal
@ 2022-04-27  8:55           ` Klaus Jensen
  0 siblings, 0 replies; 15+ messages in thread
From: Klaus Jensen @ 2022-04-27  8:55 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: Shinichiro Kawasaki, Luis Chamberlain, Paul E. McKenney,
	Davidlohr Bueso, linux-block, Pankaj Raghav, Pankaj Raghav,
	Adam Manzanares

[-- Attachment #1: Type: text/plain, Size: 7922 bytes --]

On Apr 27 17:39, Damien Le Moal wrote:
> On 4/27/22 16:41, Klaus Jensen wrote:
> > On Apr 27 05:08, Shinichiro Kawasaki wrote:
> >> On Apr 21, 2022 / 11:00, Luis Chamberlain wrote:
> >>> On Wed, Apr 20, 2022 at 05:54:29AM +0000, Shinichiro Kawasaki wrote:
> >>>> On Apr 14, 2022 / 15:02, Luis Chamberlain wrote:
> >>>>> Hey folks,
> >>>>>
> >>>>> While enhancing kdevops [0] to embrace automation of testing with
> >>>>> blktests for ZNS I ended up spotting a possible false positive RCU stall
> >>>>> when running zbd/006 after zbd/005. The curious thing though is that
> >>>>> this possible RCU stall is only possible when using the qemu
> >>>>> ZNS drive, not when using nbd. In so far as kdevops is concerned
> >>>>> it creates ZNS drives for you when you enable the config option
> >>>>> CONFIG_QEMU_ENABLE_NVME_ZNS=y. So picking any of the ZNS drives
> >>>>> suffices. When configuring blktests you can just enable the zbd
> >>>>> guest, so only a pair of guests are reated the zbd guest and the
> >>>>> respective development guest, zbd-dev guest. When using
> >>>>> CONFIG_KDEVOPS_HOSTS_PREFIX="linux517" this means you end up with
> >>>>> just two guests:
> >>>>>
> >>>>>   * linux517-blktests-zbd
> >>>>>   * linux517-blktests-zbd-dev
> >>>>>
> >>>>> The RCU stall can be triggered easily as follows:
> >>>>>
> >>>>> make menuconfig # make sure to enable CONFIG_QEMU_ENABLE_NVME_ZNS=y and blktests
> >>>>> make
> >>>>> make bringup # bring up guests
> >>>>> make linux # build and boot into v5.17-rc7
> >>>>> make blktests # build and install blktests
> >>>>>
> >>>>> Now let's ssh to the guest while leaving a console attached
> >>>>> with `sudo virsh vagrant_linux517-blktests-zbd` in a window:
> >>>>>
> >>>>> ssh linux517-blktests-zbd
> >>>>> sudo su -
> >>>>> cd /usr/local/blktests
> >>>>> export TEST_DEVS=/dev/nvme9n1
> >>>>> i=0; while true; do ./check zbd/005 zbd/006; if [[ $? -ne 0 ]]; then echo "BAD at $i"; break; else echo GOOOD $i ; fi; let i=$i+1; done;
> >>>>>
> >>>>> The above should never fail, but you should eventually see an RCU
> >>>>> stall candidate on the console. The full details can be observed on the
> >>>>> gist [1] but for completeness I list some of it below. It may be a false
> >>>>> positive at this point, not sure.
> >>>>>
> >>>>> [493272.711271] run blktests zbd/005 at 2022-04-14 20:03:22
> >>>>> [493305.769531] run blktests zbd/006 at 2022-04-14 20:03:55
> >>>>> [493336.979482] nvme nvme9: I/O 192 QID 5 timeout, aborting
> >>>>> [493336.981666] nvme nvme9: Abort status: 0x0
> >>>>> [493367.699440] nvme nvme9: I/O 192 QID 5 timeout, reset controller
> >>>>> [493388.819341] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
> >>>>
> >>>> Hello Luis,
> >>>>
> >>>> I run blktests zbd group on several QEMU ZNS emulation devices for every rcX
> >>>> kernel releases. But, I have not ever observed the symptom above. Now I'm
> >>>> repeating zbd/005 and zbd/006 using v5.18-rc3 and a QEMU ZNS device, and do
> >>>> not observe the symptom so far, after 400 times repeat.
> >>>
> >>> Did you try v5.17-rc7 ?
> >>
> >> I hadn't tried it. Then I tried v5.17-rc7 and observed the same symptom.
> >>
> >>>
> >>>> I would like to run the test using same ZNS set up as yours. Can you share how
> >>>> your ZNS device is set up? I would like to know device size and QEMU -device
> >>>> options, such as zoned.zone_size or zoned.max_active.
> >>>
> >>> It is as easy as the above make commands, and follow up login commands.
> >>
> >> I managed to run kdevops on my machine, and saw the I/O timeout and abort
> >> messages. Using similar QEMU ZNS set up as kdevops, the messages were recreated
> >> in my test environment also (the reset controller message and RCU relegated
> >> error were not observed).
> >>
> > 
> > Can you extract the relevant part of the QEMU parameters? I tried to
> > reproduce this, but could not with a 10G, neither with discard=on or
> > off, qcow2 or raw.
> > 
> >> [  214.134083][ T1028] run blktests zbd/005 at 2022-04-22 21:29:54
> >> [  246.383978][ T1142] run blktests zbd/006 at 2022-04-22 21:30:26
> >> [  276.784284][  T386] nvme nvme6: I/O 494 QID 4 timeout, aborting
> >> [  276.788391][    C0] nvme nvme6: Abort status: 0x0
> >>
> >> The conditions to recreate the I/O timeout error are as follows:
> >>
> >> - Larger size of QEMU ZNS drive (10GB)
> >>     - I use QEMU ZNS drives with 1GB size for my test runs. With this smaller
> >>       size, the I/O timeout is not observed.
> >>
> >> - Issue zone reset command for all zones (with 'blkzone reset' command) just
> >>   after zbd/005 completion to the drive.
> >>     - The test case zbd/006 calls the zone reset command. It's enough to repeat
> >>       zbd/005 and zone reset command to recreate the I/O timeout.
> >>     - When 10 seconds sleep is added between zbd/005 run and zone reset command,
> >>       the I/O timeout was not observed.
> >>     - The data write pattern of zbd/005 looks important. Simple dd command to
> >>       fill the device before 'blkzone reset' did not recreate the I/O timeout.
> >>
> >> I dug into QEMU code and found that it takes long time to complete zone reset
> >> command with all zones flag. It takes more than 30 seconds and looks triggering
> >> the I/O timeout in the block layer. The QEMU calls fallocate punch hole to the
> >> backend file for each zone, so that data of each zone is zero cleared. Each
> >> fallocate call is quick but between the calls, 0.7 second delay was observed
> >> often. I guess some fsync or fdatasync operation would be running and causing
> >> the delay.
> >>
> > 
> > QEMU uses a write zeroes for zone reset. This is because of the
> > requirement that block in empty zones must be considered deallocated.
> > 
> > When the drive is configured with `discard=on`, these write zeroes
> > *should* turn into discards. However, I also tested with discard=off and
> > I could not reproduce it.
> > 
> > It might make sense to force QEMU to use a discard for zone reset in all
> > cases, and then change the reported DLFEAT appropriately, since we
> > cannot guarantee zeroes then.
> 
> Why not punch a hole in the backing store file with fallocate() with mode
> set to FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE ? That would be way
> faster than a write zeroes which potentially actually do the writes,
> leading to large command processing times. Reading in a hole in a file is
> guaranteed to return zeroes, at least on Linux.
> 

In almost all cases, that is what QEMU will do since we set the
BDRV_REQ_MAY_UNMAP flag.

> If the backingstore is a block device, then sure, write zeroes is the only
> solution. Discard should be used with caution since that is a hint only
> and some drives may actually do nothing.
> 

Yes, that is why we do it with an explicit write zeroes, which
guarantees zeroing behavior regardless of whether or not discards "do
something" (that depends on the underlying block implementation).
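
As an aside, the same distinction shows up with the plain Linux block
ioctls (nothing to do with QEMU's internal block layer, just for
illustration): BLKDISCARD is advisory, while BLKZEROOUT guarantees the
range reads back as zeroes.

#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/fs.h>

/* Advisory: the device may deallocate the range, or may do nothing at
 * all; the data read back afterwards is undefined. */
static int discard_range(int fd, uint64_t off, uint64_t len)
{
        uint64_t range[2] = { off, len };
        return ioctl(fd, BLKDISCARD, range);
}

/* Guaranteed: the range reads back as zeroes afterwards, whether the
 * device deallocates it or physically writes zeroes. */
static int zero_range(int fd, uint64_t off, uint64_t len)
{
        uint64_t range[2] = { off, len };
        return ioctl(fd, BLKZEROOUT, range);
}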

> > 
> >> In other words, QEMU ZNS zone reset for all zones is so slow depending on the
> >> ZNS drive's size and status. Performance improvement of zone reset is desired in
> >> QEMU. I will seek for the chance to work on it.
> >>
> > 
> > Currently, each zone is a separate discard/write zero call. It would be
> > fair to special case all zones and do it in much larger chunks.
> 
> Yep, for a backing file, a full file fallocate(FALLOC_FL_PUNCH_HOLE) would
> do nicely. Or truncate(0) + truncate(storage size) would do too.
> 
> Since resets are always all zones or one zone, special optimized handling
> of the reset all case will definitely have huge benefits for that command.
> 

Yes. I think that is the best approach. While we might reset a bunch of
zones that are already empty, it should speed things up substantially.

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2022-04-27  8:55 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-04-14 22:02 blktests with zbd/006 ZNS triggers a possible false positive RCU stall Luis Chamberlain
2022-04-15  1:09 ` Davidlohr Bueso
2022-04-15  3:54   ` Paul E. McKenney
2022-04-15  4:30     ` Davidlohr Bueso
2022-04-15 17:35       ` Luis Chamberlain
2022-04-15 17:33   ` Luis Chamberlain
2022-04-15 17:42     ` Paul E. McKenney
2022-04-20  5:54 ` Shinichiro Kawasaki
2022-04-21 18:00   ` Luis Chamberlain
2022-04-27  5:08     ` Shinichiro Kawasaki
2022-04-27  5:42       ` Luis Chamberlain
2022-04-27  7:41       ` Klaus Jensen
2022-04-27  8:39         ` Damien Le Moal
2022-04-27  8:55           ` Klaus Jensen
2022-04-27  8:53         ` Shinichiro Kawasaki

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.