Linux-Block Archive on lore.kernel.org
 help / color / Atom feed
From: Damien Le Moal <Damien.LeMoal@wdc.com>
To: "colyli@suse.de" <colyli@suse.de>,
	"linux-bcache@vger.kernel.org" <linux-bcache@vger.kernel.org>
Cc: Johannes Thumshirn <Johannes.Thumshirn@wdc.com>,
	"linux-block@vger.kernel.org" <linux-block@vger.kernel.org>,
	"hare@suse.com" <hare@suse.com>
Subject: Re: [RFC PATCH v4 1/3] bcache: export bcache zone information for zoned backing device
Date: Tue, 2 Jun 2020 08:48:29 +0000
Message-ID: <9ad1e32ddd1d93ad4120723d23446160a0877752.camel@wdc.com> (raw)
In-Reply-To: <1c124ebe-3f1c-6715-0c1b-245ae18ec012@suse.de>

On Mon, 2020-06-01 at 20:34 +0800, Coly Li wrote:
> [...]
> > > +
> > > +	/* convert back to LBA of the bcache device*/
> > > +	zone->start -= data_offset;
> > > +	zone->wp -= data_offset;
> > 
> > This has to be done depending on the zone type and zone condition: zone->wp is
> > "invalid" for conventional zones, and sequential zones that are full, read-only
> > or offline. So you need something like this:
> > 
> > 	/* Remap LBA to the bcache device */
> > 	zone->start -= data_offset;
> > 	switch(zone->cond) {
> > 	case BLK_ZONE_COND_NOT_WP:
> > 	case BLK_ZONE_COND_READONLY:
> > 	case BLK_ZONE_COND_FULL:
> > 	case BLK_ZONE_COND_OFFLINE:
> > 		break;
> > 	case BLK_ZONE_COND_EMPTY:
> > 		zone->wp = zone->start;
> > 		break;
> > 	default:
> > 		zone->wp -= data_offset;
> > 		break;
> > 	}
> > 
> > 	return args->orig_cb(zone, idx, args->orig_data);
> > 
> 
> Hmm, do we have a unified spec to handle the wp on different zone
> condition ?

This comes from SCSI ZBC and ATA ZAC specifications, version r05 of the
documents for both (ratified specifications). These 2 specifications
define identical semantic and states for SMR zones. Upcoming NVMe ZNS
has similar and compatible zone management too.

> In zonefs I see zone->wp sets to zone->start for,
> - BLK_ZONE_COND_OFFLINE
> - BLK_ZONE_COND_READONLY

This is because zonefs uses a zone write pointer position for the file
size: file size = zone->wp - zone->start;
For the read-only and offline zone conditions, since the wp given by
the drive is defined as "invalid", so unknown, zonefs changes the wp to
zone start to make the file size 0 and prevent any read IO issuing.

> 
> In sd_zbc.c, I see wp sets to end of the zone for
> - BLK_ZONE_COND_FULL

This is only to be nice to host software users, so that the wp position
relative to the zone start is exactly the zone size, indicating that
the entire zone was written. This allows to have equivalence between
the tests

zone->cond == BLK_ZONE_COND_FULL

and

zone->wp == zone->start + zone->len

to make coding easier. But the fact remain that the value given for the
wp of a full zone by the disk is undefined. Generally, drives return
0xfffff...ffff.

> And in dm.c I see zone->wp is set to zone->start for,
> - BLK_ZONE_COND_EMPTY

That is because like bcache, dm may remap zones to indicate sector
values that make sense for the target device. The values given by the
physical underlying disk cannot be used as is. The code snippet above I
sent does the same for the same reason.

> It seems except for BLK_ZONE_COND_NOT_WP, for other conditions the
> writer pointer should be taken cared by device driver already.

The physical drive device driver gives you what the disk indicated,
modulo the change for full zone. Bcache *is* a device driver too,
driver of the target device. The report zones method for that device
needs to take care of the zone remapping if needed. That is not for the
underlying physical device driver lower in the stack to do this as that
layer does not know how the drive is being used.

> So I write such switch-case in the following way by the inspair of your
> comments,
> 
>         /* Zone 0 should not be reported */
>         if(WARN_ON_ONCE(zone->start < data_offset))
>                 return -EIO;

That depends on the start sector specified when blkdev_report_zones()
is called. Beware to have the start sector being >= zone size. That is,
that report start sector needs remapping too.

>         /* convert back to LBA of the bcache device*/
>         zone->start -= data_offset;
>         if (zone->cond != BLK_ZONE_COND_NOT_WP)
>                 zone->wp -= data_offset;

Since you have the BLK_ZONE_COND_NOT_WP condition in the switch-case,
this is not needed. 

> 
>         switch (zone->cond) {
>         case BLK_ZONE_COND_NOT_WP:
>                 /* No write pointer available */
>                 break;
>         case BLK_ZONE_COND_EMPTY:

zone->wp = zone->start;

needs to ba added here so that your wp is remapped.

>         case BLK_ZONE_COND_READONLY:
>         case BLK_ZONE_COND_OFFLINE:
>                 /*
>                  * If underlying device driver does not properly
>                  * set writer pointer, warn and fix it.
>                  */
>                 if (WARN_ON_ONCE(zone->wp != zone->start))
>                         zone->wp = zone->start;

This is not needed. The wp value is undefined for these conditions.
touching it, looking at it or even thinking of it does not make sense
:) So leave the wp as is for these 2 cases.

>                 break;
>         case BLK_ZONE_COND_FULL:
>                 /*
>                  * If underlying device driver does not properly
>                  * set writer pointer, warn and fix it.
>                  */
>                 if (WARN_ON_ONCE(zone->wp != zone->start + zone->len))
>                         zone->wp = zone->start + zone->len;

Simply unconditionally set the wp to zone->start + zone->len. No WARN
ON needed at all. I actually forgot this case in the code snippet I
sent.

>                 break;
>         default:
>                 break;
>         }
> 
>         return args->orig_cb(zone, idx, args->orig_data);
> 
> The above code converts zone->wp by minus data_offset for
> non-BLK_ZONE_COND_NOT_WP condition. And for other necessary conditions,
> I just check whether the underlying device driver updates write pointer
> properly (as I observed in other drivers), if not then drop a warning
> and fix the write pointer to the expected value.

The wp essentially comes from the drive. Except for a full zone, the
value is passed on as is by the driver. The physical device driver does
not "set" the wp.

> Now the write pointer is only fixed when it was wrong value. If the
> underlying device driver does not maintain the value properly, we figure
> out and fix it.

The wp will be a wrong value only if the drive you are using has a
firmware bug. The "fixing" needed is because bcache device needs
remapping (off-by-one-zone) of the physical device zone
information. That is for bcache to do. And that remapping needs to be
done carefully since some wp values are undefined. Your code above as
is will for instance change the wp value from "undefined"
(0xffffffffffffffff for drives out there, generally, but it could be
anything) to "undefined - data_offset" for any readonly or offline
zone. The result is still "undefined"... Doing that kind of operation
on the wp is not necessary at all and does not serve any useful
purpose.



-- 
Damien Le Moal
Western Digital Research

  reply index

Thread overview: 18+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-05-22 12:18 [RFC PATCH v4 0/3] bcache: support zoned device as bcache " Coly Li
2020-05-22 12:18 ` [RFC PATCH v4 1/3] bcache: export bcache zone information for zoned " Coly Li
2020-05-25  1:10   ` Damien Le Moal
2020-06-01 12:34     ` Coly Li
2020-06-02  8:48       ` Damien Le Moal [this message]
2020-06-02 12:50         ` Coly Li
2020-06-03  0:58           ` Damien Le Moal
2020-05-22 12:18 ` [RFC PATCH v4 2/3] bcache: handle zone management bios for bcache device Coly Li
2020-05-25  1:24   ` Damien Le Moal
2020-06-01 16:06     ` Coly Li
2020-06-02  8:54       ` Damien Le Moal
2020-06-02 10:18         ` Coly Li
2020-06-03  0:51           ` Damien Le Moal
2020-05-22 12:18 ` [RFC PATCH v4 3/3] bcache: reject writeback cache mode for zoned backing device Coly Li
2020-05-25  1:26   ` Damien Le Moal
2020-06-01 16:09     ` Coly Li
2020-05-25  5:25 ` [RFC PATCH v4 0/3] bcache: support zoned device as bcache " Damien Le Moal
2020-05-25  8:14   ` Coly Li

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=9ad1e32ddd1d93ad4120723d23446160a0877752.camel@wdc.com \
    --to=damien.lemoal@wdc.com \
    --cc=Johannes.Thumshirn@wdc.com \
    --cc=colyli@suse.de \
    --cc=hare@suse.com \
    --cc=linux-bcache@vger.kernel.org \
    --cc=linux-block@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Linux-Block Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/linux-block/0 linux-block/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 linux-block linux-block/ https://lore.kernel.org/linux-block \
		linux-block@vger.kernel.org
	public-inbox-index linux-block

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.linux-block


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git