From: Damien Le Moal <damien.lemoal@opensource.wdc.com>
To: Klaus Jensen <its@irrelevant.dk>,
	Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Cc: Luis Chamberlain <mcgrof@kernel.org>,
	"Paul E. McKenney" <paulmck@kernel.org>,
	Davidlohr Bueso <dave@stgolabs.net>,
	"linux-block@vger.kernel.org" <linux-block@vger.kernel.org>,
	Pankaj Raghav <pankydev8@gmail.com>,
	Pankaj Raghav <p.raghav@samsung.com>,
	Adam Manzanares <a.manzanares@samsung.com>
Subject: Re: blktests with zbd/006 ZNS triggers a possible false positive RCU stall
Date: Wed, 27 Apr 2022 17:39:46 +0900	[thread overview]
Message-ID: <a57a8b0a-f5fd-6925-89e4-68b90ea5d387@opensource.wdc.com> (raw)
In-Reply-To: <YmjzrLo0/zW3Ou03@apples>

On 4/27/22 16:41, Klaus Jensen wrote:
> On Apr 27 05:08, Shinichiro Kawasaki wrote:
>> On Apr 21, 2022 / 11:00, Luis Chamberlain wrote:
>>> On Wed, Apr 20, 2022 at 05:54:29AM +0000, Shinichiro Kawasaki wrote:
>>>> On Apr 14, 2022 / 15:02, Luis Chamberlain wrote:
>>>>> Hey folks,
>>>>>
>>>>> While enhancing kdevops [0] to embrace automation of testing with
>>>>> blktests for ZNS I ended up spotting a possible false positive RCU stall
>>>>> when running zbd/006 after zbd/005. The curious thing though is that
>>>>> this possible RCU stall only occurs when using the qemu
>>>>> ZNS drive, not when using nbd. As far as kdevops is concerned,
>>>>> it creates ZNS drives for you when you enable the config option
>>>>> CONFIG_QEMU_ENABLE_NVME_ZNS=y. So picking any of the ZNS drives
>>>>> suffices. When configuring blktests you can just enable the zbd
>>>>> guest, so only a pair of guests are created: the zbd guest and the
>>>>> respective development guest, the zbd-dev guest. When using
>>>>> CONFIG_KDEVOPS_HOSTS_PREFIX="linux517" this means you end up with
>>>>> just two guests:
>>>>>
>>>>>   * linux517-blktests-zbd
>>>>>   * linux517-blktests-zbd-dev
>>>>>
>>>>> The RCU stall can be triggered easily as follows:
>>>>>
>>>>> make menuconfig # make sure to enable CONFIG_QEMU_ENABLE_NVME_ZNS=y and blktests
>>>>> make
>>>>> make bringup # bring up guests
>>>>> make linux # build and boot into v5.17-rc7
>>>>> make blktests # build and install blktests
>>>>>
>>>>> Now let's ssh to the guest while leaving a console attached
>>>>> with `sudo virsh console vagrant_linux517-blktests-zbd` in a window:
>>>>>
>>>>> ssh linux517-blktests-zbd
>>>>> sudo su -
>>>>> cd /usr/local/blktests
>>>>> export TEST_DEVS=/dev/nvme9n1
>>>>> i=0; while true; do ./check zbd/005 zbd/006; if [[ $? -ne 0 ]]; then echo "BAD at $i"; break; else echo GOOOD $i ; fi; let i=$i+1; done;
>>>>>
>>>>> The above should never fail, but you should eventually see an RCU
>>>>> stall candidate on the console. The full details can be observed on the
>>>>> gist [1] but for completeness I list some of it below. It may be a false
>>>>> positive at this point, not sure.
>>>>>
>>>>> [493272.711271] run blktests zbd/005 at 2022-04-14 20:03:22
>>>>> [493305.769531] run blktests zbd/006 at 2022-04-14 20:03:55
>>>>> [493336.979482] nvme nvme9: I/O 192 QID 5 timeout, aborting
>>>>> [493336.981666] nvme nvme9: Abort status: 0x0
>>>>> [493367.699440] nvme nvme9: I/O 192 QID 5 timeout, reset controller
>>>>> [493388.819341] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
>>>>
>>>> Hello Luis,
>>>>
>>>> I run the blktests zbd group on several QEMU ZNS emulation devices for every rcX
>>>> kernel release, but I have never observed the symptom above. Now I'm
>>>> repeating zbd/005 and zbd/006 using v5.18-rc3 and a QEMU ZNS device, and I do
>>>> not observe the symptom so far, after 400 repetitions.
>>>
>>> Did you try v5.17-rc7 ?
>>
>> I hadn't tried it. Then I tried v5.17-rc7 and observed the same symptom.
>>
>>>
>>>> I would like to run the test using the same ZNS setup as yours. Can you share how
>>>> your ZNS device is set up? I would like to know the device size and QEMU -device
>>>> options, such as zoned.zone_size or zoned.max_active.
>>>
>>> It is as easy as the above make commands, and follow up login commands.
>>
>> I managed to run kdevops on my machine, and saw the I/O timeout and abort
>> messages. Using a QEMU ZNS setup similar to the kdevops one, the messages were
>> also recreated in my test environment (the reset controller message and RCU-related
>> error were not observed).
>>
> 
> Can you extract the relevant part of the QEMU parameters? I tried to
> reproduce this, but could not with a 10G drive, with either discard=on or
> off, qcow2 or raw.
> 
>> [  214.134083][ T1028] run blktests zbd/005 at 2022-04-22 21:29:54
>> [  246.383978][ T1142] run blktests zbd/006 at 2022-04-22 21:30:26
>> [  276.784284][  T386] nvme nvme6: I/O 494 QID 4 timeout, aborting
>> [  276.788391][    C0] nvme nvme6: Abort status: 0x0
>>
>> The conditions to recreate the I/O timeout error are as follows:
>>
>> - Larger size of QEMU ZNS drive (10GB)
>>     - I use QEMU ZNS drives with 1GB size for my test runs. With this smaller
>>       size, the I/O timeout is not observed.
>>
>> - Issue a zone reset command for all zones (with the 'blkzone reset' command)
>>   to the drive just after zbd/005 completes.
>>     - The test case zbd/006 calls the zone reset command. Repeating
>>       zbd/005 and the zone reset command is enough to recreate the I/O timeout.
>>     - When a 10 second sleep is added between the zbd/005 run and the zone
>>       reset command, the I/O timeout is not observed.
>>     - The data write pattern of zbd/005 looks important. A simple dd command to
>>       fill the device before 'blkzone reset' did not recreate the I/O timeout.
>>
>> I dug into the QEMU code and found that it takes a long time to complete the
>> zone reset command with the all-zones flag. It takes more than 30 seconds, which
>> appears to trigger the I/O timeout in the block layer. QEMU calls fallocate punch
>> hole on the backend file for each zone, so that the data of each zone is zero
>> cleared. Each fallocate call is quick, but a 0.7 second delay was often observed
>> between the calls. I guess some fsync or fdatasync operation is running and
>> causing the delay.
>>
> 
> QEMU uses a write zeroes for zone reset. This is because of the
> requirement that blocks in empty zones must be considered deallocated.
> 
> When the drive is configured with `discard=on`, these write zeroes
> *should* turn into discards. However, I also tested with discard=off and
> I could not reproduce it.
> 
> It might make sense to force QEMU to use a discard for zone reset in all
> cases, and then change the reported DLFEAT appropriately, since we
> cannot guarantee zeroes then.

Why not punch a hole in the backing store file with fallocate(), with mode
set to FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE? That would be much
faster than a write zeroes, which may actually perform the writes,
leading to long command processing times. Reading a hole in a file is
guaranteed to return zeroes, at least on Linux.
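
For illustration, here is a minimal userspace sketch of that idea, assuming a
regular file backing store and hypothetical names (this is not QEMU's actual
code path):

/*
 * Hypothetical sketch: deallocate one zone's byte range in a regular
 * backing file so that subsequent reads of that range return zeroes.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>

static int zone_reset_punch_hole(int backing_fd, off_t zone_start_bytes,
                                 off_t zone_len_bytes)
{
    /* FALLOC_FL_KEEP_SIZE is mandatory with FALLOC_FL_PUNCH_HOLE. */
    if (fallocate(backing_fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                  zone_start_bytes, zone_len_bytes) < 0) {
        perror("fallocate(FALLOC_FL_PUNCH_HOLE)");
        return -1;
    }
    return 0;
}

The punched range stays part of the file (KEEP_SIZE), but its blocks are
deallocated, so the zone reads back as zeroes without writing anything.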

If the backing store is a block device, then sure, write zeroes is the only
solution. Discard should be used with caution since it is only a hint
and some drives may actually do nothing.
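
For a raw block device backing store, a userspace equivalent of such an
explicit zero-out would look roughly like the sketch below (hypothetical
helper; QEMU would go through its own block layer rather than a raw ioctl):

/*
 * Hypothetical sketch: zero a zone's byte range on a block device
 * backing store using the BLKZEROOUT ioctl (what blkdiscard -z uses).
 */
#include <stdint.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <linux/fs.h>

static int zone_reset_blkzeroout(int bdev_fd, uint64_t zone_start_bytes,
                                 uint64_t zone_len_bytes)
{
    uint64_t range[2] = { zone_start_bytes, zone_len_bytes };

    /* The kernel may offload this to the device's Write Zeroes support. */
    if (ioctl(bdev_fd, BLKZEROOUT, &range) < 0) {
        perror("ioctl(BLKZEROOUT)");
        return -1;
    }
    return 0;
}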

> 
>> In other words, QEMU ZNS zone reset for all zones can be very slow depending on
>> the ZNS drive's size and status. A performance improvement of zone reset in QEMU
>> is desired. I will look for a chance to work on it.
>>
> 
> Currently, each zone is a separate discard/write zero call. It would be
> fair to special-case the all-zones reset and do it in much larger chunks.

Yep, for a backing file, a full-file fallocate(FALLOC_FL_PUNCH_HOLE) would
do nicely. Or truncate(0) + truncate(storage size) would also work.

Since resets are always either all zones or a single zone, optimized handling
of the reset-all case will definitely bring huge benefits for that command.
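
As a rough sketch of that special case, again assuming a regular file backing
store and hypothetical names:

/*
 * Hypothetical sketch: handle a "reset all zones" request with a single
 * deallocation covering the whole backing file, instead of one
 * fallocate() call per zone.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>

static int zone_reset_all_zones(int backing_fd)
{
    struct stat st;

    if (fstat(backing_fd, &st) < 0) {
        perror("fstat");
        return -1;
    }

    /* One syscall for the entire file; the per-zone loop goes away. */
    if (fallocate(backing_fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                  0, st.st_size) < 0) {
        perror("fallocate(FALLOC_FL_PUNCH_HOLE)");
        return -1;
    }

    /*
     * Alternative mentioned above: ftruncate(backing_fd, 0) followed by
     * ftruncate(backing_fd, st.st_size) achieves the same result.
     */
    return 0;
}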


-- 
Damien Le Moal
Western Digital Research

Thread overview: 15+ messages
2022-04-14 22:02 blktests with zbd/006 ZNS triggers a possible false positive RCU stall Luis Chamberlain
2022-04-15  1:09 ` Davidlohr Bueso
2022-04-15  3:54   ` Paul E. McKenney
2022-04-15  4:30     ` Davidlohr Bueso
2022-04-15 17:35       ` Luis Chamberlain
2022-04-15 17:33   ` Luis Chamberlain
2022-04-15 17:42     ` Paul E. McKenney
2022-04-20  5:54 ` Shinichiro Kawasaki
2022-04-21 18:00   ` Luis Chamberlain
2022-04-27  5:08     ` Shinichiro Kawasaki
2022-04-27  5:42       ` Luis Chamberlain
2022-04-27  7:41       ` Klaus Jensen
2022-04-27  8:39         ` Damien Le Moal [this message]
2022-04-27  8:55           ` Klaus Jensen
2022-04-27  8:53         ` Shinichiro Kawasaki
