On Apr 27 05:08, Shinichiro Kawasaki wrote:
> On Apr 21, 2022 / 11:00, Luis Chamberlain wrote:
> > On Wed, Apr 20, 2022 at 05:54:29AM +0000, Shinichiro Kawasaki wrote:
> > > On Apr 14, 2022 / 15:02, Luis Chamberlain wrote:
> > > > Hey folks,
> > > >
> > > > While enhancing kdevops [0] to embrace automation of testing with
> > > > blktests for ZNS, I ended up spotting a possible false positive RCU stall
> > > > when running zbd/006 after zbd/005. The curious thing though is that
> > > > this possible RCU stall is only possible when using the qemu
> > > > ZNS drive, not when using nbd. In so far as kdevops is concerned
> > > > it creates ZNS drives for you when you enable the config option
> > > > CONFIG_QEMU_ENABLE_NVME_ZNS=y. So picking any of the ZNS drives
> > > > suffices. When configuring blktests you can just enable the zbd
> > > > guest, so only a pair of guests are created: the zbd guest and the
> > > > respective development guest, the zbd-dev guest. When using
> > > > CONFIG_KDEVOPS_HOSTS_PREFIX="linux517" this means you end up with
> > > > just two guests:
> > > >
> > > > * linux517-blktests-zbd
> > > > * linux517-blktests-zbd-dev
> > > >
> > > > The RCU stall can be triggered easily as follows:
> > > >
> > > > make menuconfig # make sure to enable CONFIG_QEMU_ENABLE_NVME_ZNS=y and blktests
> > > > make
> > > > make bringup # bring up guests
> > > > make linux # build and boot into v5.17-rc7
> > > > make blktests # build and install blktests
> > > >
> > > > Now let's ssh to the guest while leaving a console attached
> > > > with `sudo virsh console vagrant_linux517-blktests-zbd` in a window:
> > > >
> > > > ssh linux517-blktests-zbd
> > > > sudo su -
> > > > cd /usr/local/blktests
> > > > export TEST_DEVS=/dev/nvme9n1
> > > > i=0; while true; do ./check zbd/005 zbd/006; if [[ $? -ne 0 ]]; then echo "BAD at $i"; break; else echo GOOOD $i ; fi; let i=$i+1; done;
> > > >
> > > > The above should never fail, but you should eventually see an RCU
> > > > stall candidate on the console. The full details can be observed on the
> > > > gist [1] but for completeness I list some of it below. It may be a false
> > > > positive at this point, not sure.
> > > >
> > > > [493272.711271] run blktests zbd/005 at 2022-04-14 20:03:22
> > > > [493305.769531] run blktests zbd/006 at 2022-04-14 20:03:55
> > > > [493336.979482] nvme nvme9: I/O 192 QID 5 timeout, aborting
> > > > [493336.981666] nvme nvme9: Abort status: 0x0
> > > > [493367.699440] nvme nvme9: I/O 192 QID 5 timeout, reset controller
> > > > [493388.819341] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
> > >
> > > Hello Luis,
> > >
> > > I run the blktests zbd group on several QEMU ZNS emulation devices for every
> > > rcX kernel release, but I have never observed the symptom above. Now I'm
> > > repeating zbd/005 and zbd/006 using v5.18-rc3 and a QEMU ZNS device, and have
> > > not observed the symptom so far, after 400 repeats.
> >
> > Did you try v5.17-rc7 ?
>
> I hadn't tried it. Then I tried v5.17-rc7 and observed the same symptom.
>
> > > I would like to run the test using the same ZNS setup as yours. Can you share
> > > how your ZNS device is set up? I would like to know the device size and the
> > > QEMU -device options, such as zoned.zone_size or zoned.max_active.
> >
> > It is as easy as the above make commands, and follow-up login commands.
>
> I managed to run kdevops on my machine, and saw the I/O timeout and abort
> messages.
> Using a similar QEMU ZNS setup to the kdevops one, the messages were also
> recreated in my test environment (the reset controller message and RCU
> related error were not observed).
>

Can you extract the relevant part of the QEMU parameters? I tried to
reproduce this, but could not with a 10G drive, with either discard=on or
off, qcow2 or raw.

> [ 214.134083][ T1028] run blktests zbd/005 at 2022-04-22 21:29:54
> [ 246.383978][ T1142] run blktests zbd/006 at 2022-04-22 21:30:26
> [ 276.784284][ T386] nvme nvme6: I/O 494 QID 4 timeout, aborting
> [ 276.788391][ C0] nvme nvme6: Abort status: 0x0
>
> The conditions to recreate the I/O timeout error are as follows:
>
> - Larger size of QEMU ZNS drive (10GB)
>   - I use QEMU ZNS drives with 1GB size for my test runs. With this smaller
>     size, the I/O timeout is not observed.
>
> - Issue the zone reset command for all zones (with the 'blkzone reset'
>   command) to the drive just after zbd/005 completion.
>   - The test case zbd/006 calls the zone reset command. It's enough to repeat
>     zbd/005 and the zone reset command to recreate the I/O timeout.
>   - When a 10 second sleep is added between the zbd/005 run and the zone reset
>     command, the I/O timeout was not observed.
>   - The data write pattern of zbd/005 looks important. A simple dd command to
>     fill the device before 'blkzone reset' did not recreate the I/O timeout.
>
> I dug into the QEMU code and found that it takes a long time to complete the
> zone reset command with the all-zones flag. It takes more than 30 seconds and
> looks to be triggering the I/O timeout in the block layer. QEMU calls
> fallocate punch hole on the backend file for each zone, so that the data of
> each zone is zero cleared. Each fallocate call is quick, but between the calls
> a 0.7 second delay was often observed. I guess some fsync or fdatasync
> operation is running and causing the delay.
>

QEMU uses write zeroes for zone reset. This is because of the requirement
that blocks in empty zones must be considered deallocated. When the drive is
configured with `discard=on`, these write zeroes *should* turn into discards.
However, I also tested with discard=off and I could not reproduce it.

It might make sense to force QEMU to use a discard for zone reset in all
cases, and then change the reported DLFEAT appropriately, since we cannot
guarantee zeroes then.

> In other words, QEMU ZNS zone reset for all zones can be very slow depending
> on the ZNS drive's size and status. Performance improvement of zone reset is
> desired in QEMU. I will seek the chance to work on it.
>

Currently, each zone is a separate discard/write zeroes call. It would be fair
to special-case all zones and do it in much larger chunks.
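
To make the per-call overhead concrete, here is a rough, self-contained C
sketch (not QEMU code) comparing the two strategies against a backing file:
punching a hole per zone, similar to what the per-zone reset path effectively
does to the backend file, versus punching one hole across the whole covered
range, as a special-cased all-zones reset could. The backing file path, zone
size and zone count are made-up example values (80 x 128 MiB, roughly the
10G case).

/*
 * Minimal sketch, not QEMU code: compare per-zone hole punching with a
 * single large hole punch over the same range. The backing file path, zone
 * size and zone count below are made-up example values.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

#define ZONE_SIZE	(128ULL << 20)	/* assumed 128 MiB zones */
#define NR_ZONES	80		/* ~10G worth of zones   */

static double now_sec(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main(void)
{
	/* made-up backing file; must be at least NR_ZONES * ZONE_SIZE large */
	int fd = open("/tmp/zns-backing.img", O_RDWR);
	double t0;

	if (fd < 0) {
		perror("open");
		return 1;
	}

	/* per-zone reset: one punch-hole call per zone */
	t0 = now_sec();
	for (unsigned int z = 0; z < NR_ZONES; z++) {
		if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
			      (off_t)z * ZONE_SIZE, ZONE_SIZE) < 0) {
			perror("fallocate (per zone)");
			return 1;
		}
	}
	printf("per-zone punch: %.3f s\n", now_sec() - t0);

	/* all-zones reset special-cased as one large punch-hole call */
	t0 = now_sec();
	if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
		      0, (off_t)NR_ZONES * ZONE_SIZE) < 0) {
		perror("fallocate (all zones)");
		return 1;
	}
	printf("single punch:   %.3f s\n", now_sec() - t0);

	close(fd);
	return 0;
}

Whether the single call actually ends up faster depends on the filesystem and
how it batches the deallocation, but it at least collapses the per-call
overhead (and whatever fsync/fdatasync activity causes the ~0.7 second gaps)
into one operation per reset-all.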