From: Damien Le Moal <Damien.LeMoal@wdc.com>
To: Christoph Hellwig <hch@lst.de>
Cc: "linux-block@vger.kernel.org" <linux-block@vger.kernel.org>,
	Jens Axboe <axboe@kernel.dk>,
	Chaitanya Kulkarni <Chaitanya.Kulkarni@wdc.com>,
	"linux-scsi@vger.kernel.org" <linux-scsi@vger.kernel.org>,
	"Martin K . Petersen" <martin.petersen@oracle.com>,
	"linux-nvme@lists.infradead.org" <linux-nvme@lists.infradead.org>,
	Keith Busch <Keith.Busch@wdc.com>
Subject: Re: [PATCH v2 1/2] block: introduce zone_write_granularity limit
Date: Wed, 20 Jan 2021 11:00:29 +0000	[thread overview]
Message-ID: <BL0PR04MB65144D0B02939681617B61A8E7A20@BL0PR04MB6514.namprd04.prod.outlook.com> (raw)
In-Reply-To: 20210120101018.GB25746@lst.de

On 2021/01/20 19:10, Christoph Hellwig wrote:
> On Tue, Jan 19, 2021 at 10:17:22PM +0900, Damien Le Moal wrote:
>> Per ZBC and ZAC specifications, host-managed SMR hard-disks mandate that
>> all writes into sequential write required zones be aligned to the device
>> physical block size. However, NVMe ZNS does not have this constraint and
>> allows write operations into sequential zones to be logical block size
>> aligned. This inconsistency does not help with portability of software
>> across device types.
>> To solve this, introduce the zone_write_granularity queue limit to
>> indicate the alignment constraint, in bytes, of write operations into
>> zones of a zoned block device. This new limit is exported as a
>> read-only sysfs queue attribute and the helper
>> blk_queue_zone_write_granularity() introduced for drivers to set this
>> limit. The scsi disk driver is modified to use this helper to set
>> host-managed SMR disk zone write granularity to the disk physical block
>> size. The nvme driver ZNS support uses this helper to set the new limit
>> to the logical block size of the zoned namespace.
>>
>> Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
>> ---
>>  Documentation/block/queue-sysfs.rst |  7 +++++++
>>  block/blk-settings.c                | 28 ++++++++++++++++++++++++++++
>>  block/blk-sysfs.c                   |  7 +++++++
>>  drivers/nvme/host/zns.c             |  1 +
>>  drivers/scsi/sd_zbc.c               | 10 ++++++++++
>>  include/linux/blkdev.h              |  3 +++
>>  6 files changed, 56 insertions(+)
>>
>> diff --git a/Documentation/block/queue-sysfs.rst b/Documentation/block/queue-sysfs.rst
>> index 2638d3446b79..c8bf8bc3c03a 100644
>> --- a/Documentation/block/queue-sysfs.rst
>> +++ b/Documentation/block/queue-sysfs.rst
>> @@ -273,4 +273,11 @@ devices are described in the ZBC (Zoned Block Commands) and ZAC
>>  do not support zone commands, they will be treated as regular block devices
>>  and zoned will report "none".
>>  
>> +zone_write_granularity (RO)
>> +---------------------------
>> +This indicates the alignment constraint, in bytes, for write operations in
>> +sequential zones of zoned block devices (devices with a zoned attribute
>> +that reports "host-managed" or "host-aware"). This value is always 0 for
>> +regular block devices.
>> +
>>  Jens Axboe <jens.axboe@oracle.com>, February 2009
>> diff --git a/block/blk-settings.c b/block/blk-settings.c
>> index 43990b1d148b..6be6ed9485e3 100644
>> --- a/block/blk-settings.c
>> +++ b/block/blk-settings.c
>> @@ -60,6 +60,7 @@ void blk_set_default_limits(struct queue_limits *lim)
>>  	lim->io_opt = 0;
>>  	lim->misaligned = 0;
>>  	lim->zoned = BLK_ZONED_NONE;
>> +	lim->zone_write_granularity = 0;
> 
> I think this should default to 512 just like the logical and physical
> block size.

Hmm. I wanted to keep this limit at 0 for regular devices. If we default to 512,
regular devices will see that value. They can ignore it of course, but having it
as 0 makes it clear that it should be ignored.
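To illustrate the distinction, here is a minimal standalone userspace sketch
(mocked types borrowing the kernel's field names, not actual kernel code):

```c
#include <assert.h>
#include <stdbool.h>

/* Minimal userspace mock of the kernel's struct queue_limits,
 * only the two fields relevant to this discussion. */
struct queue_limits {
	unsigned int logical_block_size;
	unsigned int zone_write_granularity;
};

/* With a 0 default, "does this device constrain zone writes?" is a
 * single test. A 512 default would be indistinguishable from a real
 * 512 B constraint on a zoned device. */
static bool has_zone_write_constraint(const struct queue_limits *lim)
{
	return lim->zone_write_granularity != 0;
}
```

A consumer seeing 0 knows unambiguously that the attribute does not apply.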

> 
>>  }
>>  EXPORT_SYMBOL(blk_set_default_limits);
>>  
>> @@ -366,6 +367,31 @@ void blk_queue_physical_block_size(struct request_queue *q, unsigned int size)
>>  }
>>  EXPORT_SYMBOL(blk_queue_physical_block_size);
>>  
>> +/**
>> + * blk_queue_zone_write_granularity - set zone write granularity for the queue
>> + * @q:  the request queue for the zoned device
>> + * @size:  the zone write granularity size, in bytes
>> + *
>> + * Description:
>> + *   This should be set to the lowest possible size allowing writes in
>> + *   sequential zones of a zoned block device.
>> + */
>> +void blk_queue_zone_write_granularity(struct request_queue *q,
>> +				      unsigned int size)
>> +{
>> +	if (WARN_ON(!blk_queue_is_zoned(q)))
>> +		return;
>> +
>> +	q->limits.zone_write_granularity = size;
>> +
>> +	if (q->limits.zone_write_granularity < q->limits.logical_block_size)
>> +		q->limits.zone_write_granularity = q->limits.logical_block_size;
> 
> I think this should be a WARN_ON_ONCE.

OK.
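For reference, the revised helper would look roughly like this (a standalone
userspace sketch with mocked types, not the actual v3 patch; the kernel version
warns with WARN_ON_ONCE rather than fprintf, and the io_min clamp discussed
below is dropped):

```c
#include <assert.h>
#include <stdio.h>

/* Userspace mock; the real helper operates on a struct request_queue. */
struct queue_limits {
	unsigned int logical_block_size;
	unsigned int zone_write_granularity;
};

/* Warn and round up when the requested granularity is below the
 * logical block size; otherwise take the requested size as-is. */
static void set_zone_write_granularity(struct queue_limits *lim,
				       unsigned int size)
{
	if (size < lim->logical_block_size) {
		/* kernel code would use WARN_ON_ONCE() here */
		fprintf(stderr,
			"zone write granularity %u below logical block size %u\n",
			size, lim->logical_block_size);
		size = lim->logical_block_size;
	}
	lim->zone_write_granularity = size;
}
```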

> 
>> +	if (q->limits.zone_write_granularity < q->limits.io_min)
>> +		q->limits.zone_write_granularity = q->limits.io_min;
> 
> I don't think this makes sense at all.

Argh! Yes, io_min is only a hint! I misread the comments for blk_limits_io_min().
Will remove this.

> 
>> +static ssize_t queue_zone_write_granularity_show(struct request_queue *q, char *page)
> 
> Overly long line.
> 
>> +	/*
>> +	 * Per ZBC and ZAC specifications, writes in sequential write required
>> +	 * zones of host-managed devices must be aligned to the device physical
>> +	 * block size.
>> +	 */
>> +	if (blk_queue_zoned_model(q) == BLK_ZONED_HM)
>> +		blk_queue_zone_write_granularity(q, sdkp->physical_block_size);
>> +	else
>> +		blk_queue_zone_write_granularity(q, sdkp->device->sector_size);
> 
> Do we really want to special case HA drives here?  I thought we generally
> either treat them as drive managed (if they have partitions) or else like
> host managed ones.

If the HA drive is treated as drive-managed (i.e. it has a partition), then the
model here will be seen as BLK_ZONED_NONE and we should ignore it, or better,
set the zone_write_granularity to 0. If the drive is actually used as a zoned
drive, then we will see BLK_ZONED_HA here and we can use the logical block size
since HA drives do not have that restriction on physical block alignment.
So my code above is wrong. The else should be

else if (blk_queue_zoned_model(q) == BLK_ZONED_HA)

I think we also need to set q->limits.zone_write_granularity = 0 at the end of
blk_queue_set_zoned() when model == BLK_ZONED_NONE.

Will send a v3 with these fixes.
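The corrected model-to-granularity mapping described above can be sketched as
follows (a standalone mock mirroring the sd_zbc logic, with a hypothetical
helper name; not the actual driver code):

```c
#include <assert.h>

/* Mocked zoned-model enum matching the kernel's names. */
enum blk_zoned_model { BLK_ZONED_NONE, BLK_ZONED_HA, BLK_ZONED_HM };

/* HM drives: ZBC/ZAC mandate physical-block alignment for writes in
 * sequential write required zones. HA drives: no such restriction,
 * so the logical block size suffices. Anything else (drive-managed
 * or non-zoned): no constraint, granularity is 0. */
static unsigned int zone_write_granularity_for(enum blk_zoned_model model,
					       unsigned int physical_block_size,
					       unsigned int logical_block_size)
{
	if (model == BLK_ZONED_HM)
		return physical_block_size;
	else if (model == BLK_ZONED_HA)
		return logical_block_size;
	return 0;
}
```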


-- 
Damien Le Moal
Western Digital Research

