From: Coly Li <colyli@suse.de>
To: Damien Le Moal <Damien.LeMoal@wdc.com>,
"linux-bcache@vger.kernel.org" <linux-bcache@vger.kernel.org>
Cc: Johannes Thumshirn <Johannes.Thumshirn@wdc.com>,
"linux-block@vger.kernel.org" <linux-block@vger.kernel.org>,
"hare@suse.com" <hare@suse.com>
Subject: Re: [RFC PATCH v4 1/3] bcache: export bcache zone information for zoned backing device
Date: Tue, 2 Jun 2020 20:50:45 +0800 [thread overview]
Message-ID: <c6492f79-423e-b37c-261c-80e4b26081ab@suse.de> (raw)
In-Reply-To: <9ad1e32ddd1d93ad4120723d23446160a0877752.camel@wdc.com>
On 2020/6/2 16:48, Damien Le Moal wrote:
> On Mon, 2020-06-01 at 20:34 +0800, Coly Li wrote:
>> [...]
>>>> +
>>>> + /* convert back to LBA of the bcache device*/
>>>> + zone->start -= data_offset;
>>>> + zone->wp -= data_offset;
>>>
>>> This has to be done depending on the zone type and zone condition: zone->wp is
>>> "invalid" for conventional zones, and sequential zones that are full, read-only
>>> or offline. So you need something like this:
>>>
>>> /* Remap LBA to the bcache device */
>>> zone->start -= data_offset;
>>> switch(zone->cond) {
>>> case BLK_ZONE_COND_NOT_WP:
>>> case BLK_ZONE_COND_READONLY:
>>> case BLK_ZONE_COND_FULL:
>>> case BLK_ZONE_COND_OFFLINE:
>>> break;
>>> case BLK_ZONE_COND_EMPTY:
>>> zone->wp = zone->start;
>>> break;
>>> default:
>>> zone->wp -= data_offset;
>>> break;
>>> }
>>>
>>> return args->orig_cb(zone, idx, args->orig_data);
>>>
>>
>> Hmm, do we have a unified spec to handle the wp on different zone
>> condition ?
>
> This comes from SCSI ZBC and ATA ZAC specifications, version r05 of the
> documents for both (ratified specifications). These 2 specifications
> define identical semantic and states for SMR zones. Upcoming NVMe ZNS
> has similar and compatible zone management too.
>
>> In zonefs I see zone->wp sets to zone->start for,
>> - BLK_ZONE_COND_OFFLINE
>> - BLK_ZONE_COND_READONLY
>
> This is because zonefs uses a zone write pointer position for the file
> size: file size = zone->wp - zone->start;
> For the read-only and offline zone conditions, since the wp given by
> the drive is defined as "invalid", so unknown, zonefs changes the wp to
> zone start to make the file size 0 and prevent any read IO issuing.
>
>>
>> In sd_zbc.c, I see wp sets to end of the zone for
>> - BLK_ZONE_COND_FULL
>
> This is only to be nice to host software users, so that the wp position
> relative to the zone start is exactly the zone size, indicating that
> the entire zone was written. This allows to have equivalence between
> the tests
>
> zone->cond == BLK_ZONE_COND_FULL
>
> and
>
> zone->wp == zone->start + zone->len
>
> to make coding easier. But the fact remain that the value given for the
> wp of a full zone by the disk is undefined. Generally, drives return
> 0xfffff...ffff.
>
Copied. Thanks for the above information.
>> And in dm.c I see zone->wp is set to zone->start for,
>> - BLK_ZONE_COND_EMPTY
>
> That is because like bcache, dm may remap zones to indicate sector
> values that make sense for the target device. The values given by the
> physical underlying disk cannot be used as is. The code snippet above I
> sent does the same for the same reason.
>
>> It seems except for BLK_ZONE_COND_NOT_WP, for other conditions the
>> writer pointer should be taken cared by device driver already.
>
> The physical drive device driver gives you what the disk indicated,
> modulo the change for full zone. Bcache *is* a device driver too,
> driver of the target device. The report zones method for that device
> needs to take care of the zone remapping if needed. That is not for the
> underlying physical device driver lower in the stack to do this as that
> layer does not know how the drive is being used.
>
Copied, thanks for the information :-)
>> So I write such switch-case in the following way by the inspair of your
>> comments,
>>
>> /* Zone 0 should not be reported */
>> if(WARN_ON_ONCE(zone->start < data_offset))
>> return -EIO;
>
> That depends on the start sector specified when blkdev_report_zones()
> is called. Beware to have the start sector being >= zone size. That is,
> that report start sector needs remapping too.
>
Because zones before data_offset should not visible by upper layer code,
such zones should not be sent to cached_dev_report_zones_cb(). This is a
double check that caller of cached_dev_report_zones_cb() does things in
correct way.
>> /* convert back to LBA of the bcache device*/
>> zone->start -= data_offset;
>> if (zone->cond != BLK_ZONE_COND_NOT_WP)
>> zone->wp -= data_offset;
>
> Since you have the BLK_ZONE_COND_NOT_WP condition in the switch-case,
> this is not needed.
>
>>
>> switch (zone->cond) {
>> case BLK_ZONE_COND_NOT_WP:
>> /* No write pointer available */
>> break;
>> case BLK_ZONE_COND_EMPTY:
>
> zone->wp = zone->start;
>
Correct me if I am wrong. I assume when the zone is in empty condition,
LBA of zone->wp should be exactly equal to zone->start before bcache
does the conversion. Therefore 'zone->wp - data_offset' should still be
equal to 'zone->start - data_offset'. Therefore explicitly handle it in
'case BLK_ZONE_COND_EMPTY:' should be equal to handle it in 'default:' part.
Just want to double check I understand the wp correctly, thanks.
> needs to ba added here so that your wp is remapped.
>
>> case BLK_ZONE_COND_READONLY:
>> case BLK_ZONE_COND_OFFLINE:
>> /*
>> * If underlying device driver does not properly
>> * set writer pointer, warn and fix it.
>> */
>> if (WARN_ON_ONCE(zone->wp != zone->start))
>> zone->wp = zone->start;
>
> This is not needed. The wp value is undefined for these conditions.
> touching it, looking at it or even thinking of it does not make sense
> :) So leave the wp as is for these 2 cases.
Copied.
>
>> break;
>> case BLK_ZONE_COND_FULL:
>> /*
>> * If underlying device driver does not properly
>> * set writer pointer, warn and fix it.
>> */
>> if (WARN_ON_ONCE(zone->wp != zone->start + zone->len))
>> zone->wp = zone->start + zone->len;
>
> Simply unconditionally set the wp to zone->start + zone->len. No WARN
> ON needed at all. I actually forgot this case in the code snippet I
> sent.
Copied.
>
>> break;
>> default:
>> break;
>> }
>>
>> return args->orig_cb(zone, idx, args->orig_data);
>>
>> The above code converts zone->wp by minus data_offset for
>> non-BLK_ZONE_COND_NOT_WP condition. And for other necessary conditions,
>> I just check whether the underlying device driver updates write pointer
>> properly (as I observed in other drivers), if not then drop a warning
>> and fix the write pointer to the expected value.
>
> The wp essentially comes from the drive. Except for a full zone, the
> value is passed on as is by the driver. The physical device driver does
> not "set" the wp.
>
>> Now the write pointer is only fixed when it was wrong value. If the
>> underlying device driver does not maintain the value properly, we figure
>> out and fix it.
>
> The wp will be a wrong value only if the drive you are using has a
> firmware bug. The "fixing" needed is because bcache device needs
> remapping (off-by-one-zone) of the physical device zone
> information. That is for bcache to do. And that remapping needs to be
> done carefully since some wp values are undefined. Your code above as
> is will for instance change the wp value from "undefined"
> (0xffffffffffffffff for drives out there, generally, but it could be
> anything) to "undefined - data_offset" for any readonly or offline
> zone. The result is still "undefined"... Doing that kind of operation
> on the wp is not necessary at all and does not serve any useful
> purpose.
I see. Now it is more clear to me. Thanks.
Coly Li
next prev parent reply other threads:[~2020-06-02 12:50 UTC|newest]
Thread overview: 18+ messages / expand[flat|nested] mbox.gz Atom feed top
2020-05-22 12:18 [RFC PATCH v4 0/3] bcache: support zoned device as bcache backing device Coly Li
2020-05-22 12:18 ` [RFC PATCH v4 1/3] bcache: export bcache zone information for zoned " Coly Li
2020-05-25 1:10 ` Damien Le Moal
2020-06-01 12:34 ` Coly Li
2020-06-02 8:48 ` Damien Le Moal
2020-06-02 12:50 ` Coly Li [this message]
2020-06-03 0:58 ` Damien Le Moal
2020-05-22 12:18 ` [RFC PATCH v4 2/3] bcache: handle zone management bios for bcache device Coly Li
2020-05-25 1:24 ` Damien Le Moal
2020-06-01 16:06 ` Coly Li
2020-06-02 8:54 ` Damien Le Moal
2020-06-02 10:18 ` Coly Li
2020-06-03 0:51 ` Damien Le Moal
2020-05-22 12:18 ` [RFC PATCH v4 3/3] bcache: reject writeback cache mode for zoned backing device Coly Li
2020-05-25 1:26 ` Damien Le Moal
2020-06-01 16:09 ` Coly Li
2020-05-25 5:25 ` [RFC PATCH v4 0/3] bcache: support zoned device as bcache " Damien Le Moal
2020-05-25 8:14 ` Coly Li
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=c6492f79-423e-b37c-261c-80e4b26081ab@suse.de \
--to=colyli@suse.de \
--cc=Damien.LeMoal@wdc.com \
--cc=Johannes.Thumshirn@wdc.com \
--cc=hare@suse.com \
--cc=linux-bcache@vger.kernel.org \
--cc=linux-block@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).