From: Chaitanya Kulkarni <Chaitanya.Kulkarni@wdc.com>
To: Damien Le Moal <Damien.LeMoal@wdc.com>,
	"linux-nvme@lists.infradead.org" <linux-nvme@lists.infradead.org>
Cc: "hch@lst.de" <hch@lst.de>,
	"kbusch@kernel.org" <kbusch@kernel.org>,
	"sagi@grimberg.me" <sagi@grimberg.me>
Subject: Re: [PATCH V12 2/3] nvmet: add ZBD over ZNS backend support
Date: Sat, 13 Mar 2021 02:40:59 +0000
Message-ID: <BYAPR04MB4965BE436BDFBA0BAEC43C61866E9@BYAPR04MB4965.namprd04.prod.outlook.com>
In-Reply-To: BL0PR04MB6514F4EC40A62AB4763560E8E76F9@BL0PR04MB6514.namprd04.prod.outlook.com

On 3/11/21 23:26, Damien Le Moal wrote:
> On 2021/03/12 15:29, Chaitanya Kulkarni wrote:
> [...]
>>>> +void nvmet_bdev_execute_zone_mgmt_recv(struct nvmet_req *req)
>>>> +{
>>>> +	sector_t sect = nvmet_lba_to_sect(req->ns, req->cmd->zmr.slba);
>>>> +	u32 bufsize = (le32_to_cpu(req->cmd->zmr.numd) + 1) << 2;
>>>> +	struct nvmet_report_zone_data data = { .ns = req->ns };
>>>> +	unsigned int nr_zones;
>>>> +	int reported_zones;
>>>> +	u16 status;
>>>> +
>>>> +	status = nvmet_bdev_zns_checks(req);
>>>> +	if (status)
>>>> +		goto out;
>>>> +
>>>> +	data.rz = __vmalloc(bufsize, GFP_KERNEL | __GFP_NORETRY | __GFP_ZERO);
>>>> +	if (!data.rz) {
>>>> +		status = NVME_SC_INTERNAL;
>>>> +		goto out;
>>>> +	}
>>>> +
>>>> +	nr_zones = (bufsize - sizeof(struct nvme_zone_report)) /
>>>> +			sizeof(struct nvme_zone_descriptor);
>>>> +	if (!nr_zones) {
>>>> +		status = NVME_SC_INVALID_FIELD | NVME_SC_DNR;
>>>> +		goto out_free_report_zones;
>>>> +	}
>>>> +
>>>> +	reported_zones = blkdev_report_zones(req->ns->bdev, sect, nr_zones,
>>>> +					     nvmet_bdev_report_zone_cb, &data);
>>>> +	if (reported_zones < 0) {
>>>> +		status = NVME_SC_INTERNAL;
>>>> +		goto out_free_report_zones;
>>>> +	}
>>> There is a problem here: the code as is ignores the request reporting option
>>> field which can lead to an invalid zone report being returned. I think you need
>>> to modify nvmet_bdev_report_zone_cb() to look at the reporting option field
>>> passed by the initiator and filter the zone report since blkdev_report_zones()
>>> does not handle that argument.
>> The reporting options are set statically by the host in
>> nvme_ns_report_zones():
>>          c.zmr.zra = NVME_ZRA_ZONE_REPORT;
>>          c.zmr.zrasf = NVME_ZRASF_ZONE_REPORT_ALL;
>>          c.zmr.pr = NVME_REPORT_ZONE_PARTIAL;
>>
>> All the above values are validated in the nvmet_bdev_zns_checks() helper
>> called from nvmet_bdev_execute_zone_mgmt_recv() before we allocate the
>> report zone buffer.
>>
>> 1. c.zmr.zra indicates the action, which reports zone descriptor entries
>>    through the Report Zones data structure.
>>
>>    We validate that this value has been set to NVME_ZRA_ZONE_REPORT in
>>    nvmet_bdev_zns_checks(). We call report zones only after checking that
>>    the zone receive action is NVME_ZRA_ZONE_REPORT, so no filtering is
>>    needed in nvmet_bdev_report_zone_cb().
>>
>> 2. c.zmr.zrasf indicates the action specific field, which is set to
>>    NVME_ZRASF_ZONE_REPORT_ALL.
>>
>>    We validate that this value has been set to NVME_ZRASF_ZONE_REPORT_ALL
>>    in nvmet_bdev_zns_checks(). Since the host wants all the zones, we don't
>>    need to filter any zone states in nvmet_bdev_report_zone_cb().
>>
>> 3. c.zmr.pr is set to NVME_REPORT_ZONE_PARTIAL, whose value is 1, i.e. the
>>    Number of Zones field in the Report Zones data structure indicates the
>>    number of fully transferred zone descriptors in the data buffer, which
>>    we set from the return value of blkdev_report_zones():
>>   
>>
>>    reported_zones = blkdev_report_zones(req->ns->bdev, sect, nr_zones,
>> 					     nvmet_bdev_report_zone_cb, &data);
>>    <snip>
>>    data.rz->nr_zones = cpu_to_le64(reported_zones);
>>
>>    So no filtering is needed in nvmet_bdev_report_zone_cb() for c.zmr.pr.
>>
>> Can you please explain what filtering is missing in the current code ?
>>
>> Maybe I'm looking into an old spec.
> report zones command has the reporting options (ro) field (bits 15:08 of dword
> 13) where the user can specify the following values:
>
> Value Description
> 0h List all zones.
> 1h List the zones in the ZSE:Empty state.
> 2h List the zones in the ZSIO:Implicitly Opened state.
> 3h List the zones in the ZSEO:Explicitly Opened state.
> 4h List the zones in the ZSC:Closed state.
> 5h List the zones in the ZSF:Full state.
> 6h List the zones in the ZSRO:Read Only state.
> 7h List the zones in the ZSO:Offline state.
>
> to filter the zone report based on zone condition. blkdev_report_zones() will
> always do a "list all zones", that is, ro == 0h.
>
> But on the initiator side, if the client issues a report zones command through
> an ioctl (passthrough/direct access, not using the block layer BLKREPORTZONES
> ioctl), it may specify a different value for the ro field. Processing that
> command using blkdev_report_zones() like you are doing here without any
> additional filtering will give an incorrect report. Filtering based on the
> user-specified ro field needs to be added in nvmet_bdev_report_zone_cb().
>
> The current code here is fine if the initiator/client side uses the block layer
> and executes all report zones through blkdev_report_zones(). But things will
> break if the client starts doing passthrough commands using nvme ioctl. No ?

Okay, so I'm using the right spec. Given your explanation, filtering is
needed; I will add it in the next version.
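
A minimal standalone sketch of what such filtering could look like, assuming
the reporting option values from the table above and zone condition values
that mirror the block layer's BLK_ZONE_COND_* constants. All names here
(zone_cond, zone_ro, zone_matches_ro(), count_reported()) are hypothetical
userspace stand-ins, not the nvmet code; in the target the same test would
sit in nvmet_bdev_report_zone_cb() before a descriptor is copied out.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical zone conditions; values assumed to mirror BLK_ZONE_COND_*. */
enum zone_cond {
	ZONE_COND_NOT_WP	= 0x0,
	ZONE_COND_EMPTY		= 0x1,
	ZONE_COND_IMP_OPEN	= 0x2,
	ZONE_COND_EXP_OPEN	= 0x3,
	ZONE_COND_CLOSED	= 0x4,
	ZONE_COND_READONLY	= 0xD,
	ZONE_COND_FULL		= 0xE,
	ZONE_COND_OFFLINE	= 0xF,
};

/* Reporting options (ro), taken from the table quoted above. */
enum zone_ro {
	RO_ALL		= 0x0,
	RO_EMPTY	= 0x1,
	RO_IMP_OPEN	= 0x2,
	RO_EXP_OPEN	= 0x3,
	RO_CLOSED	= 0x4,
	RO_FULL		= 0x5,
	RO_READONLY	= 0x6,
	RO_OFFLINE	= 0x7,
};

/* Return true if a zone in condition @cond belongs in a report for @ro. */
bool zone_matches_ro(uint8_t ro, enum zone_cond cond)
{
	static const uint8_t ro_to_cond[] = {
		[RO_EMPTY]	= ZONE_COND_EMPTY,
		[RO_IMP_OPEN]	= ZONE_COND_IMP_OPEN,
		[RO_EXP_OPEN]	= ZONE_COND_EXP_OPEN,
		[RO_CLOSED]	= ZONE_COND_CLOSED,
		[RO_FULL]	= ZONE_COND_FULL,
		[RO_READONLY]	= ZONE_COND_READONLY,
		[RO_OFFLINE]	= ZONE_COND_OFFLINE,
	};

	if (ro == RO_ALL)
		return true;
	/* Assumption: unknown ro values match nothing in this sketch. */
	if (ro >= sizeof(ro_to_cond) / sizeof(ro_to_cond[0]))
		return false;
	return ro_to_cond[ro] == (uint8_t)cond;
}

/* Count how many of @nr zones would be reported for @ro. */
size_t count_reported(uint8_t ro, const enum zone_cond *conds, size_t nr)
{
	size_t i, n = 0;

	for (i = 0; i < nr; i++)
		if (zone_matches_ro(ro, conds[i]))
			n++;
	return n;
}
```

With filtering in place, the Number of Zones field would presumably have to
reflect the count of matching zones rather than the raw return value of
blkdev_report_zones().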

>>>> +
>>>> +	data.rz->nr_zones = cpu_to_le64(reported_zones);
>>>> +
>>>> +	status = nvmet_copy_to_sgl(req, 0, data.rz, bufsize);
>>>> +
>>>> +out_free_report_zones:
>>>> +	kvfree(data.rz);
>>>> +out:
>>>> +	nvmet_req_complete(req, status);
>>>> +}
>>>> +
>>>> +void nvmet_bdev_execute_zone_mgmt_send(struct nvmet_req *req)
>>>> +{
>>>> +	sector_t sect = nvmet_lba_to_sect(req->ns, req->cmd->zms.slba);
>>>> +	sector_t nr_sect = bdev_zone_sectors(req->ns->bdev);
>>>> +	u16 status = NVME_SC_SUCCESS;
>>>> +	u8 zsa = req->cmd->zms.zsa;
>>>> +	enum req_opf op;
>>>> +	int ret;
>>>> +	const unsigned int zsa_to_op[] = {
>>>> +		[NVME_ZONE_OPEN]	= REQ_OP_ZONE_OPEN,
>>>> +		[NVME_ZONE_CLOSE]	= REQ_OP_ZONE_CLOSE,
>>>> +		[NVME_ZONE_FINISH]	= REQ_OP_ZONE_FINISH,
>>>> +		[NVME_ZONE_RESET]	= REQ_OP_ZONE_RESET,
>>>> +	};
>>>> +
>>>> +	if (zsa > ARRAY_SIZE(zsa_to_op) || !zsa_to_op[zsa]) {
>>> What is the point of the "!zsa_to_op[zsa]" here ? All the REQ_OP_ZONE_XXX are
>>> non 0, always...
>> Well, this just makes sure that we receive the right action: the sparse
>> array returns 0 for any value other than those listed above, so with the
>> !zsa_to_op[zsa] check we can return an error.
> But zsa is unsigned and you check it against the array size. So it can only be
> within 0 and array size - 1. That is enough... I really do not see the point of
> cluttering the condition with something that is always true...

Okay.
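
For reference, a small standalone sketch of a sparse lookup table of this
kind. The action and operation values below are illustrative assumptions,
and zsa_lookup() is a hypothetical helper, not the patch code. Note that
keeping the index within 0 and array size - 1 requires a >= comparison
against the array size, and that any slot without a designated initializer
(index 0 with these values) reads back as 0.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

/* Hypothetical zone send action values (assumed, starting at 1). */
enum { ZSA_CLOSE = 0x1, ZSA_FINISH = 0x2, ZSA_OPEN = 0x3, ZSA_RESET = 0x4 };

/* Hypothetical block operation codes; all of them are non-zero. */
enum { OP_ZONE_OPEN = 10, OP_ZONE_CLOSE = 11, OP_ZONE_FINISH = 12,
       OP_ZONE_RESET = 15 };

/*
 * Translate @zsa into an operation code stored in @op.
 * Returns false when @zsa is out of range or has no table entry.
 */
bool zsa_lookup(uint8_t zsa, unsigned int *op)
{
	static const unsigned int zsa_to_op[] = {
		[ZSA_OPEN]	= OP_ZONE_OPEN,
		[ZSA_CLOSE]	= OP_ZONE_CLOSE,
		[ZSA_FINISH]	= OP_ZONE_FINISH,
		[ZSA_RESET]	= OP_ZONE_RESET,
	};

	/* Valid indexes are 0 .. ARRAY_SIZE - 1, hence ">=" and not ">". */
	if (zsa >= ARRAY_SIZE(zsa_to_op))
		return false;
	/* Slots without an initializer (index 0 here) are zero-filled. */
	if (!zsa_to_op[zsa])
		return false;

	*op = zsa_to_op[zsa];
	return true;
}
```

Whether the second test can be dropped depends on whether an index with no
initializer can actually reach the lookup; with the assumed values above,
only index 0 is affected.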

>
> [...]
>>>> +
>>>> +	ret = blkdev_zone_mgmt(req->ns->bdev, op, sect, nr_sect, GFP_KERNEL);
>>>> +	if (ret)
>>>> +		status = NVME_SC_INTERNAL;
>>>> +out:
>>>> +	nvmet_req_complete(req, status);
>>>> +}
>>>> +
>>>> +void nvmet_bdev_execute_zone_append(struct nvmet_req *req)
>>>> +{
>>>> +	sector_t sect = nvmet_lba_to_sect(req->ns, req->cmd->rw.slba);
>>>> +	u16 status = NVME_SC_SUCCESS;
>>>> +	unsigned int total_len = 0;
>>>> +	struct scatterlist *sg;
>>>> +	int ret = 0, sg_cnt;
>>>> +	struct bio *bio;
>>>> +
>>>> +	if (!nvmet_check_transfer_len(req, nvmet_rw_data_len(req)))
>>>> +		return;
>>> No nvmet_req_complete() call ? Is that done in nvmet_check_transfer_len() ?
>> Yes it does. You had the same comment on an earlier version; it can be
>> confusing, which is why I proposed a helper for the transfer length and
>> !req->sg_cnt checks, but we don't want that helper.
> Just add a comment saying that nvmet_check_transfer_len() calls
> nvmet_req_complete(). That will help people like me with a short memory span :)

Okay.

>>>> +
>>>> +	if (!req->sg_cnt) {
>>>> +		nvmet_req_complete(req, 0);
>>>> +		return;
>>>> +	}
>>>> +
>>>> +	if (req->transfer_len <= NVMET_MAX_INLINE_DATA_LEN) {
>>>> +		bio = &req->b.inline_bio;
>>>> +		bio_init(bio, req->inline_bvec, ARRAY_SIZE(req->inline_bvec));
>>>> +	} else {
>>>> +		bio = bio_alloc(GFP_KERNEL, req->sg_cnt);
>>>> +	}
>>>> +
>>>> +	bio_set_dev(bio, req->ns->bdev);
>>>> +	bio->bi_iter.bi_sector = sect;
>>>> +	bio->bi_opf = REQ_OP_ZONE_APPEND | REQ_SYNC | REQ_IDLE;
>>>> +	if (req->cmd->rw.control & cpu_to_le16(NVME_RW_FUA))
>>>> +		bio->bi_opf |= REQ_FUA;
>>>> +
>>>> +	for_each_sg(req->sg, sg, req->sg_cnt, sg_cnt) {
>>>> +		struct page *p = sg_page(sg);
>>>> +		unsigned int l = sg->length;
>>>> +		unsigned int o = sg->offset;
>>>> +
>>>> +		ret = bio_add_zone_append_page(bio, p, l, o);
>>>> +		if (ret != sg->length) {
>>>> +			status = NVME_SC_INTERNAL;
>>>> +			goto out_bio_put;
>>>> +		}
>>>> +
>>>> +		total_len += sg->length;
>>>> +	}
>>>> +
>>>> +	if (total_len != nvmet_rw_data_len(req)) {
>>>> +		status = NVME_SC_INTERNAL | NVME_SC_DNR;
>>>> +		goto out_bio_put;
>>>> +	}
>>>> +
>>>> +	ret = submit_bio_wait(bio);
>>> submit_bio_wait() ? Why blocking here ? That would be bad for performance. Is it
>>> mandatory to block here ? The result handling could be done in the bio_end
>>> callback no ?
>> I did initially, but zonefs uses sync I/O. I'm not sure about btrfs; if it
>> does async I/O, please let me know and I'll make it async.
>>
>> If there is no async caller in the kernel for REQ_OP_ZONE_APPEND,
>> should we make this async ?
> This should not depend on what the user does, at all.
> Yes, for now zonefs only uses zone append for sync write() calls. But I intend
> to have zone append used for async writes too. And in btrfs, append writes are
> used for all data IOs, sync or async. That is on the initiator side anyway. The
> target side should not assume any special behavior of the initiator. So if there
> are no technical reasons to prevent async append write execution, I would rather
> have all of them processed with async BIO execution, similar to regular write BIOs.

Thanks for the explanation, I was not aware of the async interface.
I'll make it async.
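
A rough userspace sketch of the completion-callback pattern, assuming mock
types only: mock_bio, mock_req, mock_append_prep(), mock_append_done() and
mock_submit_bio() are all hypothetical and merely imitate the shape of a bio
with an end_io handler. The point is that status and resulting sector are
handled in the callback instead of after a blocking wait. The mock "device"
completes immediately, whereas a real bio would complete later from
completion context.

```c
#include <stddef.h>
#include <stdint.h>

#define MOCK_SC_SUCCESS		0x0
#define MOCK_SC_INTERNAL	0x6	/* assumed internal error status */

struct mock_bio;
typedef void (*mock_end_io_t)(struct mock_bio *bio);

/* Hypothetical stand-in for a bio. */
struct mock_bio {
	int		status;		/* 0 on success, non-zero on error */
	uint64_t	sector;		/* where the append actually landed */
	mock_end_io_t	end_io;		/* completion callback */
	void		*private;	/* owner of this bio */
};

/* Hypothetical stand-in for a target request. */
struct mock_req {
	uint16_t	status;
	uint64_t	result_sector;
	int		completed;
};

/* Completion handler: translate the bio result and complete the request. */
void mock_append_done(struct mock_bio *bio)
{
	struct mock_req *req = bio->private;

	if (bio->status) {
		req->status = MOCK_SC_INTERNAL;
	} else {
		req->status = MOCK_SC_SUCCESS;
		req->result_sector = bio->sector;
	}
	req->completed = 1;
}

/* Attach the request and the callback before submission. */
void mock_append_prep(struct mock_bio *bio, struct mock_req *req)
{
	bio->status = 0;
	bio->sector = 0;
	bio->private = req;
	bio->end_io = mock_append_done;
}

/* Mock device: record the outcome, then invoke the callback right away. */
void mock_submit_bio(struct mock_bio *bio, uint64_t landed, int status)
{
	bio->status = status;
	bio->sector = landed;
	bio->end_io(bio);
}
```

The submitter returns right after submission and never touches the request
again; everything that followed the blocking wait moves into the callback.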


