Re: [PATCH V12 2/3] nvmet: add ZBD over ZNS backend support

From: Chaitanya Kulkarni <Chaitanya.Kulkarni@wdc.com>
To: Damien Le Moal <Damien.LeMoal@wdc.com>,
	"linux-nvme@lists.infradead.org" <linux-nvme@lists.infradead.org>
Cc: "hch@lst.de" <hch@lst.de>,
	"kbusch@kernel.org" <kbusch@kernel.org>,
	"sagi@grimberg.me" <sagi@grimberg.me>
Subject: Re: [PATCH V12 2/3] nvmet: add ZBD over ZNS backend support
Date: Sat, 13 Mar 2021 02:40:59 +0000	[thread overview]
Message-ID: <BYAPR04MB4965BE436BDFBA0BAEC43C61866E9@BYAPR04MB4965.namprd04.prod.outlook.com> (raw)
In-Reply-To: BL0PR04MB6514F4EC40A62AB4763560E8E76F9@BL0PR04MB6514.namprd04.prod.outlook.com

On 3/11/21 23:26, Damien Le Moal wrote:
> On 2021/03/12 15:29, Chaitanya Kulkarni wrote:
> [...]
> +void nvmet_bdev_execute_zone_mgmt_recv(struct nvmet_req *req)
>>>> +{
>>>> +	sector_t sect = nvmet_lba_to_sect(req->ns, req->cmd->zmr.slba);
>>>> +	u32 bufsize = (le32_to_cpu(req->cmd->zmr.numd) + 1) << 2;
>>>> +	struct nvmet_report_zone_data data = { .ns = req->ns };
>>>> +	unsigned int nr_zones;
>>>> +	int reported_zones;
>>>> +	u16 status;
>>>> +
>>>> +	status = nvmet_bdev_zns_checks(req);
>>>> +	if (status)
>>>> +		goto out;
>>>> +
>>>> +	data.rz = __vmalloc(bufsize, GFP_KERNEL | __GFP_NORETRY | __GFP_ZERO);
>>>> +	if (!data.rz) {
>>>> +		status = NVME_SC_INTERNAL;
>>>> +		goto out;
>>>> +	}
>>>> +
>>>> +	nr_zones = (bufsize - sizeof(struct nvme_zone_report)) /
>>>> +			sizeof(struct nvme_zone_descriptor);
>>>> +	if (!nr_zones) {
>>>> +		status = NVME_SC_INVALID_FIELD | NVME_SC_DNR;
>>>> +		goto out_free_report_zones;
>>>> +	}
>>>> +
>>>> +	reported_zones = blkdev_report_zones(req->ns->bdev, sect, nr_zones,
>>>> +					     nvmet_bdev_report_zone_cb, &data);
>>>> +	if (reported_zones < 0) {
>>>> +		status = NVME_SC_INTERNAL;
>>>> +		goto out_free_report_zones;
>>>> +	}
>>> There is a problem here: the code as is ignores the request reporting option
>>> field which can lead to an invalid zone report being returned. I think you need
>>> to modify nvmet_bdev_report_zone_cb() to look at the reporting option field
>>> passed by the initiator and filter the zone report since blkdev_report_zones()
>>> does not handle that argument.
>> The reporting options are set by the host statistically in
>> nvme_ns_report_zones()
>> arefrom:-  nvme_ns_report_zones()
>>          c.zmr.zra = NVME_ZRA_ZONE_REPORT;
>>          c.zmr.zrasf = NVME_ZRASF_ZONE_REPORT_ALL;
>>          c.zmr.pr = NVME_REPORT_ZONE_PARTIAL;
>>
>> All the above values are validated in the nvmet_bdev_zns_checks() helper
>> called from nvmet_bdev_execute_zone_mgmt_recv() before we allocate the
>> report zone buffer.
>>
>> 1. c.zmr.zra indicates the action which Reports zone descriptor entries
>>    through the Report Zones data structure.
>>
>>    We validate this value is been set to NVME_ZRA_ZONE_REPORT in the
>>    nvmet_bdev_zns_chceks(). We are calling report zone after checking
>>    zone receive action it NVME_ZRA_ZONE_REPORT so not filtering is needed
>>    in the nvmet_bdev_report_zone_cb().
>>
>> 2. c.zmr.zrasf indicates the action specific field which is set to
>>    NVME_ZRASF_ZONE_REPORT_ALL.
>>
>>    We validate this value is been set to NVME_ZRASF_ZONE_REPORT_ALL in the
>>    nvmet_bdev_zns_chceks(). Since host wants all the zones we don't need to
>>    filter any zone states in the nvmet_bdev_report_zone_cb().
>>
>> 3. c.zmr.pr is set to NVME_REPORT_ZONE_PARTIAL which value = 1 i.e value in
>>    the Report Zone data structure Number of Zones field indicates the
>> number of
>>    fully transferred zone descriptors in the data buffer, which we set from
>>    return value of the blkdev_report_zones() :-
>>   
>>
>>    reported_zones = blkdev_report_zones(req->ns->bdev, sect, nr_zones,
>> 					     nvmet_bdev_report_zone_cb, &data);
>> <snip>   data.rz->nr_zones = cpu_to_le64(reported_zones);
>>
>>    So no filtering is needed in nvmet_bdev_report_zone_cb() for c.zmr.pr.
>>
>> Can you please explain what filtering is missing in the current code ?
>>
>> Maybe I'm looking into an old spec.
> report zones command has the reporting options (ro) field (bits 15:08 of dword
> 13) where the user can specify the following values:
>
> Value Description
> 0h List all zones.
> 1h List the zones in the ZSE:Empty state.
> 2h List the zones in the ZSIO:Implicitly Opened state.
> 3h List the zones in the ZSEO:Explicitly Opened state.
> 4h List the zones in the ZSC:Closed state.
> 5h List the zones in the ZSF:Full state.
> 6h List the zones in the ZSRO:Read Only state.
> 7h List the zones in the ZSO:Offline state.
>
> to filter the zone report based on zone condition. blkdev_report_zones() will
> always to a "list all zones", that is, ro == 0h.
>
> But on the initiator side, if the client issue a report zones command through an
> ioctl (passthrough/direct access not suing the block layer BLKREPORTZONES
> ioctl), it may specify a different value for the ro field. Processing that
> command using blkdev_report_zones() like you are doing here without any
> additional filtering will give an incorrect report. Filtering based on the user
> specified ro field needs to be added in nvmet_bdev_report_zone_cb().
>
> The current code here is fine of the initiator/client side uses the block layer
> and execute all report zones through blkdev_report_zones(). But things will
> break if the client starts doing passthrough commands using nvme ioctl. No ?

Okay, so I'm using the right spec. With your explanation it needs a
filtering,
will add it to the next version.

>>>> +
>>>> +	data.rz->nr_zones = cpu_to_le64(reported_zones);
>>>> +
>>>> +	status = nvmet_copy_to_sgl(req, 0, data.rz, bufsize);
>>>> +
>>>> +out_free_report_zones:
>>>> +	kvfree(data.rz);
>>>> +out:
>>>> +	nvmet_req_complete(req, status);
>>>> +}
>>>> +
>>>> +void nvmet_bdev_execute_zone_mgmt_send(struct nvmet_req *req)
>>>> +{
>>>> +	sector_t sect = nvmet_lba_to_sect(req->ns, req->cmd->zms.slba);
>>>> +	sector_t nr_sect = bdev_zone_sectors(req->ns->bdev);
>>>> +	u16 status = NVME_SC_SUCCESS;
>>>> +	u8 zsa = req->cmd->zms.zsa;
>>>> +	enum req_opf op;
>>>> +	int ret;
>>>> +	const unsigned int zsa_to_op[] = {
>>>> +		[NVME_ZONE_OPEN]	= REQ_OP_ZONE_OPEN,
>>>> +		[NVME_ZONE_CLOSE]	= REQ_OP_ZONE_CLOSE,
>>>> +		[NVME_ZONE_FINISH]	= REQ_OP_ZONE_FINISH,
>>>> +		[NVME_ZONE_RESET]	= REQ_OP_ZONE_RESET,
>>>> +	};
>>>> +
>>>> +	if (zsa > ARRAY_SIZE(zsa_to_op) || !zsa_to_op[zsa]) {
>>> What is the point of the "!zsa_to_op[zsa]I see, I'll add the async I/O interface." here ? All the REQ_OP_ZONE_XXX are
>>> non 0, always...
>> Well this is just making sure that we receive the right action since sparse
>> array will return 0 for any other values than listed above having
>> !zsa_to_op[zsa] check we can return an error.
> But zsa is unsigned and you check it against the array size. So it can only be
> within 0 and array size - 1. That is enough... I really do not see the point of
> clutering the condition with something that is always true...

Okay.

>
> [...]
>>>> +
>>>> +	ret = blkdev_zone_mgmt(req->ns->bdev, op, sect, nr_sect, GFP_KERNEL);
>>>> +	if (ret)
>>>> +		status = NVME_SC_INTERNAL;
>>>> +out:
>>>> +	nvmet_req_complete(req, status);
>>>> +}
>>>> +
>>>> +void nvmet_bdev_execute_zone_append(struct nvmet_req *req)
>>>> +{
>>>> +	sector_t sect = nvmet_lba_to_sect(req->ns, req->cmd->rw.slba);
>>>> +	u16 status = NVME_SC_SUCCESS;
>>>> +	unsigned int total_len = 0;
>>>> +	struct scatterlist *sg;
>>>> +	int ret = 0, sg_cnt;
>>>> +	struct bio *bio;
>>>> +
>>>> +	if (!nvmet_check_transfer_len(req, nvmet_rw_data_len(req)))
>>>> +		return;
>>> No nvmet_req_complete() call ? Is that done in nvmet_check_transfer_len() ?
>> Yes it does, you had the same comment on earlier version, it can be
>> confusing
>> that is why I proposed a helper for check transfer len and !req->sg_cnt
>> check,
>> but we don't want that helper.
> Just add a comment saying that nvmet_check_transfer_len() calls
> nvmet_req_complete(). That will help people like me with a short memory span :)

Okay.

>>>> +
>>>> +	if (!req->sg_cnt) {
>>>> +		nvmet_req_complete(req, 0);
>>>> +		return;
>>>> +	}
>>>> +
>>>> +	if (req->transfer_len <= NVMET_MAX_INLINE_DATA_LEN) {
>>>> +		bio = &req->b.inline_bio;
>>>> +		bio_init(bio, req->inline_bvec, ARRAY_SIZE(req->inline_bvec));
>>>> +	} else {
>>>> +		bio = bio_alloc(GFP_KERNEL, req->sg_cnt);
>>>> +	}
>>>> +
>>>> +	bio_set_dev(bio, req->ns->bdev);
>>>> +	bio->bi_iter.bi_sector = sect;
>>>> +	bio->bi_opf = REQ_OP_ZONE_APPEND | REQ_SYNC | REQ_IDLE;
>>>> +	if (req->cmd->rw.control & cpu_to_le16(NVME_RW_FUA))
>>>> +		bio->bi_opf |= REQ_FUA;
>>>> +
>>>> +	for_each_sg(req->sg, sg, req->sg_cnt, sg_cnt) {
>>>> +		struct page *p = sg_page(sg);
>>>> +		unsigned int l = sg->length;
>>>> +		unsigned int o = sg->offset;
>>>> +
>>>> +		ret = bio_add_zone_append_page(bio, p, l, o);
>>>> +		if (ret != sg->length) {
>>>> +			status = NVME_SC_INTERNAL;
>>>> +			goto out_bio_put;
>>>> +		}
>>>> +
>>>> +		total_len += sg->length;
>>>> +	}
>>>> +
>>>> +	if (total_len != nvmet_rw_data_len(req)) {
>>>> +		status = NVME_SC_INTERNAL | NVME_SC_DNR;
>>>> +		goto out_bio_put;
>>>> +	}
>>>> +
>>>> +	ret = submit_bio_wait(bio);
>>> submit_bio_wait() ? Why blocking here ? That would be bad for performance. Is it
>>> mandatory to block here ? The result handling could be done in the bio_end
>>> callback no ?
>> I did initially, but zonefs uses sync I/O, I'm not sure about the btrfs,
>> if it does
>> please let me know I'll make it async.
>>
>> If there is no async caller in the kernel for REQ_OP_ZONE_APPEND
>> shouldwe make this async ?
> This should not depend on what the user does, at all.
> Yes, for now zonefs only uses zone append for sync write() calls. But I intend
> to have zone append used for async writes too. And in btrfs, append writes are
> used for all data IOs, sync or async. That is on the initiator side anyway. The
> target side should not assume any special behavior of the initiator. So if there
> is no technical reasons to prevent async append writes execution, I would rather
> have all of them processed with async BIOs execution, similar to regular write BIOs.

Thanks for the explanation, I was not aware about the async interface.
I'll make it async.

_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme