* [PATCH 0/2] nvme: Support for fused NVME_IOCTL_SUBMIT_IO
From: klayph @ 2021-01-05 22:49 UTC
  To: linux-nvme; +Cc: Keith Busch, Jens Axboe, Christoph Hellwig, Sagi Grimberg

From: Clay Mayers <clay.mayers@kioxia.com>

Enables fused compare/write to be issued from user mode.

The first patch allows a pair of passthrough requests
to be linked and submitted as a fused pair to the nvme
device queue.  The second patch exposes this through an
nvme ioctl so a compare/write can be issued from user mode.

There was a question regarding Linux support for this
functionality in nvme-cli.  With these patches, that
would be possible.

https://github.com/linux-nvme/nvme-cli/issues/318

Clay Mayers (2):
  nvme: support fused nvme requests
  nvme: support fused NVME_IOCTL_SUBMIT_IO

 drivers/nvme/host/core.c | 261 ++++++++++++++++++++++++++++++---------
 drivers/nvme/host/nvme.h |   2 +
 drivers/nvme/host/pci.c  |  32 ++++-
 3 files changed, 234 insertions(+), 61 deletions(-)

-- 
2.27.0



* [PATCH 1/2] nvme: support fused nvme requests
From: klayph @ 2021-01-05 22:49 UTC
  To: linux-nvme; +Cc: Keith Busch, Jens Axboe, Christoph Hellwig, Sagi Grimberg

From: Clay Mayers <mayerc@kioxia.com>

Adds support for fused nvme commands to be tunneled through a blk_mq
queue and submitted atomically to an nvme device queue.

If the nvme cmnd has the first fused flag set, nvme_request.nrq2 points
to the nvme_request for the second fused command. Instead of submitting
the first request, its cmnd is saved in the second command's
nvme_request to be submitted as a pair once the second request is
processed by nvme_queue_rq().

Signed-off-by: Clay Mayers <clay.mayers@kioxia.com>
---
 drivers/nvme/host/core.c |  1 +
 drivers/nvme/host/nvme.h |  2 ++
 drivers/nvme/host/pci.c  | 32 +++++++++++++++++++++++++++++++-
 3 files changed, 34 insertions(+), 1 deletion(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 9a270e49df17..a498cf6a9eaf 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -514,6 +514,7 @@ static inline void nvme_clear_nvme_request(struct request *req)
 	if (!(req->rq_flags & RQF_DONTPREP)) {
 		nvme_req(req)->retries = 0;
 		nvme_req(req)->flags = 0;
+		nvme_req(req)->nrq2 = NULL;
 		req->rq_flags |= RQF_DONTPREP;
 	}
 }
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 567f7ad18a91..187dde1f11fe 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -157,6 +157,8 @@ struct nvme_request {
 	u8			flags;
 	u16			status;
 	struct nvme_ctrl	*ctrl;
+	struct nvme_request	*nrq2;	/* Points to second fused request */
+	struct nvme_command	cmnd;	/* Saved fused first cmnd */
 };
 
 /*
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 3be352403839..c24729e100bc 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -491,6 +491,30 @@ static inline void nvme_write_sq_db(struct nvme_queue *nvmeq, bool write_sq)
 	nvmeq->last_sq_tail = nvmeq->sq_tail;
 }
 
+/**
+ * nvme_submit_cmd2() - Copy fused commands into a queue and ring the doorbell
+ * @nvmeq: The queue to use
+ * @cmd: The first command to send
+ * @cmd2: the second command to send
+ * @write_sq: whether to write to the SQ doorbell
+ */
+static void nvme_submit_cmd2(struct nvme_queue *nvmeq, struct nvme_command *cmd,
+			     struct nvme_command *cmd2, bool write_sq)
+{
+	spin_lock(&nvmeq->sq_lock);
+	memcpy(nvmeq->sq_cmds + (nvmeq->sq_tail << nvmeq->sqes),
+		cmd, sizeof(*cmd));
+	if (++nvmeq->sq_tail == nvmeq->q_depth)
+		nvmeq->sq_tail = 0;
+	memcpy(nvmeq->sq_cmds + (nvmeq->sq_tail << nvmeq->sqes),
+		cmd2, sizeof(*cmd2));
+	if (++nvmeq->sq_tail == nvmeq->q_depth)
+		nvmeq->sq_tail = 0;
+	nvme_write_sq_db(nvmeq, write_sq);
+	spin_unlock(&nvmeq->sq_lock);
+}
+
+
 /**
  * nvme_submit_cmd() - Copy a command into a queue and ring the doorbell
  * @nvmeq: The queue to use
@@ -918,7 +942,13 @@ static blk_status_t nvme_queue_rq(struct blk_mq_hw_ctx *hctx,
 	}
 
 	blk_mq_start_request(req);
-	nvme_submit_cmd(nvmeq, &cmnd, bd->last);
+
+	if (cmnd.common.flags & NVME_CMD_FUSE_FIRST)
+		memcpy(&nvme_req(req)->nrq2->cmnd, &cmnd, sizeof(cmnd));
+	else if (cmnd.common.flags & NVME_CMD_FUSE_SECOND)
+		nvme_submit_cmd2(nvmeq, &nvme_req(req)->cmnd, &cmnd, bd->last);
+	else
+		nvme_submit_cmd(nvmeq, &cmnd, bd->last);
 	return BLK_STS_OK;
 out_unmap_data:
 	nvme_unmap_data(dev, req);
-- 
2.27.0



* [PATCH 2/2] nvme: support fused NVME_IOCTL_SUBMIT_IO
From: klayph @ 2021-01-05 22:49 UTC
  To: linux-nvme; +Cc: Keith Busch, Jens Axboe, Christoph Hellwig, Sagi Grimberg

From: Clay Mayers <mayerc@kioxia.com>

Extends the functionality of the NVME_IOCTL_SUBMIT_IO ioctl to support
a pair of fused nvme_user_io requests.

When submitting a fused pair, an array of two nvme_user_io structs is
supplied to the NVME_IOCTL_SUBMIT_IO ioctl.  Rather than introduce a
new ioctl code, the presence of a fused pair is indicated by the first
nvme_user_io.flags having the value NVME_CMD_FUSE_FIRST.  This
indicates that a second nvme_user_io struct follows the first, with its
nvme_user_io.flags set to NVME_CMD_FUSE_SECOND.

A fused pair may fail to submit with -EWOULDBLOCK.  This indicates the
device queue selected for the first command didn't have a tag available
when the request for the second command was created.
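
To make the calling convention concrete, here is a minimal, untested
user-space sketch under the semantics above (fused_compare_write() is
an illustrative helper, not part of this series; the opcode values are
nvme_cmd_compare/nvme_cmd_write and the FUSE flag values match the
spec's FUSE field, which the uapi header does not define):

	#include <stdint.h>
	#include <string.h>
	#include <sys/ioctl.h>
	#include <linux/nvme_ioctl.h>

	#define NVME_CMD_FUSE_FIRST	(1 << 0)
	#define NVME_CMD_FUSE_SECOND	(1 << 1)

	/* Write one block at slba iff it still contains *expect. */
	static int fused_compare_write(int fd, uint64_t slba,
				       void *expect, void *data)
	{
		struct nvme_user_io io[2];

		memset(io, 0, sizeof(io));
		io[0].opcode  = 0x05;		/* nvme_cmd_compare */
		io[0].flags   = NVME_CMD_FUSE_FIRST;
		io[0].slba    = slba;
		io[0].nblocks = 0;		/* 0's based: one block */
		io[0].addr    = (uintptr_t)expect;

		io[1].opcode  = 0x01;		/* nvme_cmd_write */
		io[1].flags   = NVME_CMD_FUSE_SECOND;
		io[1].slba    = slba;
		io[1].nblocks = 0;
		io[1].addr    = (uintptr_t)data;

		/* kernel reads io[1] because io[0].flags == FUSE_FIRST */
		return ioctl(fd, NVME_IOCTL_SUBMIT_IO, io);
	}

Per the usual passthrough ioctl convention, a positive return value is
the NVMe completion status (e.g. a compare failure).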

Signed-off-by: Clay Mayers <clay.mayers@kioxia.com>
---
 drivers/nvme/host/core.c | 260 ++++++++++++++++++++++++++++++---------
 1 file changed, 200 insertions(+), 60 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index a498cf6a9eaf..ce5d2a9a08a8 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -1468,16 +1468,40 @@ static void __user *nvme_to_user_ptr(uintptr_t ptrval)
 	return (void __user *)ptrval;
 }
 
-static int nvme_submit_io(struct nvme_ns *ns, struct nvme_user_io __user *uio)
+struct nvme_user_io_req {
+	struct nvme_command cmd;
+	struct request *rq;
+	struct bio *bio;	/* bio in rq at the time of allocation */
+	void *meta;
+	void __user *udata;
+	void __user *umeta;
+	unsigned int len;
+	unsigned int mlen;
+	u32 mseed;
+};
+
+static void nvme_free_io(struct nvme_user_io_req *nrq)
+{
+	if (!nrq)
+		return;
+	kfree(nrq->meta);
+	if (nrq->bio)
+		blk_rq_unmap_user(nrq->bio);
+	if (nrq->rq)
+		blk_mq_free_request(nrq->rq);
+	nrq->meta = NULL;
+	nrq->bio = NULL;
+	nrq->rq = NULL;
+}
+
+static int nvme_prep_io(struct nvme_ns *ns, struct nvme_user_io_req *nrq,
+			struct nvme_user_io __user *uio, int size)
 {
 	struct nvme_user_io io;
-	struct nvme_command c;
-	unsigned length, meta_len;
-	void __user *metadata;
 
-	if (copy_from_user(&io, uio, sizeof(io)))
+	if (unlikely(copy_from_user(&io, uio, size)))
 		return -EFAULT;
-	if (io.flags)
+	if (unlikely(io.flags & ~(NVME_CMD_FUSE_FIRST|NVME_CMD_FUSE_SECOND)))
 		return -EINVAL;
 
 	switch (io.opcode) {
@@ -1489,33 +1513,160 @@ static int nvme_submit_io(struct nvme_ns *ns, struct nvme_user_io __user *uio)
 		return -EINVAL;
 	}
 
-	length = (io.nblocks + 1) << ns->lba_shift;
-	meta_len = (io.nblocks + 1) * ns->ms;
-	metadata = nvme_to_user_ptr(io.metadata);
+	nrq->udata = nvme_to_user_ptr(io.addr);
+	nrq->len = (io.nblocks + 1) << ns->lba_shift;
+	nrq->umeta = nvme_to_user_ptr(io.metadata);
+	nrq->mlen = (io.nblocks + 1) * ns->ms;
+	nrq->mseed = lower_32_bits(io.slba);
+	nrq->bio = nrq->meta = NULL;
 
 	if (ns->features & NVME_NS_EXT_LBAS) {
-		length += meta_len;
-		meta_len = 0;
-	} else if (meta_len) {
+		nrq->len += nrq->mlen;
+		nrq->mlen = 0;
+	} else if (nrq->mlen) {
 		if ((io.metadata & 3) || !io.metadata)
 			return -EINVAL;
 	}
 
-	memset(&c, 0, sizeof(c));
-	c.rw.opcode = io.opcode;
-	c.rw.flags = io.flags;
-	c.rw.nsid = cpu_to_le32(ns->head->ns_id);
-	c.rw.slba = cpu_to_le64(io.slba);
-	c.rw.length = cpu_to_le16(io.nblocks);
-	c.rw.control = cpu_to_le16(io.control);
-	c.rw.dsmgmt = cpu_to_le32(io.dsmgmt);
-	c.rw.reftag = cpu_to_le32(io.reftag);
-	c.rw.apptag = cpu_to_le16(io.apptag);
-	c.rw.appmask = cpu_to_le16(io.appmask);
-
-	return nvme_submit_user_cmd(ns->queue, &c,
-			nvme_to_user_ptr(io.addr), length,
-			metadata, meta_len, lower_32_bits(io.slba), NULL, 0);
+	memset(&nrq->cmd, 0, sizeof(nrq->cmd));
+	nrq->cmd.rw.opcode = io.opcode;
+	nrq->cmd.rw.flags = io.flags;
+	nrq->cmd.rw.nsid = cpu_to_le32(ns->head->ns_id);
+	nrq->cmd.rw.slba = cpu_to_le64(io.slba);
+	nrq->cmd.rw.length = cpu_to_le16(io.nblocks);
+	nrq->cmd.rw.control = cpu_to_le16(io.control);
+	nrq->cmd.rw.dsmgmt = cpu_to_le32(io.dsmgmt);
+	nrq->cmd.rw.reftag = cpu_to_le32(io.reftag);
+	nrq->cmd.rw.apptag = cpu_to_le16(io.apptag);
+	nrq->cmd.rw.appmask = cpu_to_le16(io.appmask);
+
+	return 0;
+}
+
+static struct request *nvme_mk_req_io(struct nvme_ns *ns,
+				      struct nvme_user_io_req *nrq,
+				      blk_mq_req_flags_t flags, int qid,
+				      int timeout)
+{
+	bool write = nvme_is_write(&nrq->cmd);
+	struct request_queue *q = ns->queue;
+	struct gendisk *disk = ns->disk;
+	struct request *rq;
+	struct bio *bio = NULL;
+	void *meta = NULL;
+	int ret;
+
+	rq = nvme_alloc_request(q, &nrq->cmd, flags, qid);
+	if (unlikely(IS_ERR(rq)))
+		return rq;
+
+	rq->timeout = timeout ? timeout : ADMIN_TIMEOUT;
+	nvme_req(rq)->flags |= NVME_REQ_USERCMD;
+
+	if (nrq->udata && nrq->len) {
+		ret = blk_rq_map_user(q, rq, NULL, nrq->udata, nrq->len,
+				GFP_KERNEL);
+		if (ret)
+			goto out;
+		bio = rq->bio;
+		bio->bi_disk = disk;
+		if (disk && nrq->umeta && nrq->mlen) {
+			meta = nvme_add_user_metadata(bio, nrq->umeta, nrq->mlen,
+					nrq->mseed, write);
+			if (IS_ERR(meta)) {
+				ret = PTR_ERR(meta);
+				goto out_unmap;
+			}
+			nrq->meta = meta;
+		}
+	}
+	nrq->bio = bio;
+	return rq;
+out_unmap:
+	if (bio)
+		blk_rq_unmap_user(bio);
+ out:
+	blk_mq_free_request(rq);
+	return ERR_PTR(ret);
+}
+
+static int nvme_unprep_io(struct nvme_user_io_req *nrq,
+			  u64 *result)
+{
+	struct request *rq = nrq->rq;
+	int write = nvme_is_write(&nrq->cmd);
+	int ret;
+
+	if (unlikely(nvme_req(rq)->flags & NVME_REQ_CANCELLED))
+		ret = -EINTR;
+	else
+		ret = nvme_req(rq)->status;
+	if (result)
+		*result = le64_to_cpu(nvme_req(rq)->result.u64);
+	if (nrq->meta && !ret && !write) {
+		if (copy_to_user(nrq->umeta, nrq->meta, nrq->mlen))
+			ret = -EFAULT;
+	}
+	nvme_free_io(nrq);
+	return ret;
+}
+
+/* support both NVME_IOCTL_SUBMIT_IO and NVME_IOCTL_SUBMIT_IO32 */
+static int nvme_submit_io(struct nvme_ns *ns, void __user *uio,
+			  int size)
+{
+	struct nvme_user_io_req nrq, nrq2;
+	struct request *rq, *rq2;
+	int ret, fused;
+
+	ret = nvme_prep_io(ns, &nrq, uio, size);
+	if (unlikely(ret))
+		return ret;
+	fused = (nrq.cmd.common.flags == NVME_CMD_FUSE_FIRST);
+	if (fused) {
+		ret = nvme_prep_io(ns, &nrq2, uio+size, size);
+		if (unlikely(ret))
+			return ret;
+		if (unlikely(nrq2.cmd.common.flags != NVME_CMD_FUSE_SECOND))
+			return -EINVAL;
+	} else if (unlikely(nrq.cmd.common.flags)) {
+		return -EINVAL;
+	}
+	rq = nvme_mk_req_io(ns, &nrq, 0, NVME_QID_ANY, 0);
+	if (unlikely(IS_ERR(rq)))
+		return PTR_ERR(rq);
+	nrq.rq = rq;
+	if (fused) {
+		DECLARE_COMPLETION_ONSTACK(wait);
+
+		rq2 = nvme_mk_req_io(ns, &nrq2, BLK_MQ_REQ_NOWAIT,
+				     nvme_req_qid(rq), 0);
+		if (unlikely(IS_ERR(rq2))) {
+			nvme_free_io(&nrq);
+			return PTR_ERR(rq2);
+		}
+		nvme_req(rq)->nrq2 = nvme_req(rq2);
+		nrq2.rq = rq2;
+
+		rq->cmd_flags |= REQ_NOMERGE;
+		rq2->cmd_flags |= REQ_NOMERGE;
+		rq->end_io_data = &wait;
+		blk_execute_rq_nowait(rq->q, ns->disk, rq, false, nvme_end_sync_rq);
+		nvme_execute_passthru_rq(rq2);
+
+		/*
+		 * both will be complete at this point, but nvme spec doesn't
+		 * specify cqe ordering for fused operations so wait for the
+		 * first to complete as well
+		 */
+		wait_for_completion_io(&wait);
+		nvme_unprep_io(&nrq, NULL);
+		ret = nvme_unprep_io(&nrq2, NULL);
+	} else {
+		nvme_execute_passthru_rq(rq);
+		ret = nvme_unprep_io(&nrq, NULL);
+	}
+	return ret;
 }
 
 static int nvme_user_cmd(struct nvme_ctrl *ctrl, struct nvme_ns *ns,
@@ -1672,6 +1823,23 @@ static int nvme_handle_ctrl_ioctl(struct nvme_ns *ns, unsigned int cmd,
 	return ret;
 }
 
+struct nvme_user_io32 {
+	__u8	opcode;
+	__u8	flags;
+	__u16	control;
+	__u16	nblocks;
+	__u16	rsvd;
+	__u64	metadata;
+	__u64	addr;
+	__u64	slba;
+	__u32	dsmgmt;
+	__u32	reftag;
+	__u16	apptag;
+	__u16	appmask;
+} __attribute__((__packed__));
+
+#define NVME_IOCTL_SUBMIT_IO32	_IOW('N', 0x42, struct nvme_user_io32)
+
 static int nvme_ioctl(struct block_device *bdev, fmode_t mode,
 		unsigned int cmd, unsigned long arg)
 {
@@ -1700,8 +1868,10 @@ static int nvme_ioctl(struct block_device *bdev, fmode_t mode,
 	case NVME_IOCTL_IO_CMD:
 		ret = nvme_user_cmd(ns->ctrl, ns, argp);
 		break;
+	case NVME_IOCTL_SUBMIT_IO32:
+		fallthrough;	/* structures are identical except size */
 	case NVME_IOCTL_SUBMIT_IO:
-		ret = nvme_submit_io(ns, argp);
+		ret = nvme_submit_io(ns, argp, _IOC_SIZE(cmd));
 		break;
 	case NVME_IOCTL_IO64_CMD:
 		ret = nvme_user_cmd64(ns->ctrl, ns, argp);
@@ -1717,41 +1887,11 @@ static int nvme_ioctl(struct block_device *bdev, fmode_t mode,
 	return ret;
 }
 
-#ifdef CONFIG_COMPAT
-struct nvme_user_io32 {
-	__u8	opcode;
-	__u8	flags;
-	__u16	control;
-	__u16	nblocks;
-	__u16	rsvd;
-	__u64	metadata;
-	__u64	addr;
-	__u64	slba;
-	__u32	dsmgmt;
-	__u32	reftag;
-	__u16	apptag;
-	__u16	appmask;
-} __attribute__((__packed__));
-
-#define NVME_IOCTL_SUBMIT_IO32	_IOW('N', 0x42, struct nvme_user_io32)
 
+#ifdef CONFIG_COMPAT
 static int nvme_compat_ioctl(struct block_device *bdev, fmode_t mode,
 		unsigned int cmd, unsigned long arg)
 {
-	/*
-	 * Corresponds to the difference of NVME_IOCTL_SUBMIT_IO
-	 * between 32 bit programs and 64 bit kernel.
-	 * The cause is that the results of sizeof(struct nvme_user_io),
-	 * which is used to define NVME_IOCTL_SUBMIT_IO,
-	 * are not same between 32 bit compiler and 64 bit compiler.
-	 * NVME_IOCTL_SUBMIT_IO32 is for 64 bit kernel handling
-	 * NVME_IOCTL_SUBMIT_IO issued from 32 bit programs.
-	 * Other IOCTL numbers are same between 32 bit and 64 bit.
-	 * So there is nothing to do regarding to other IOCTL numbers.
-	 */
-	if (cmd == NVME_IOCTL_SUBMIT_IO32)
-		return nvme_ioctl(bdev, mode, NVME_IOCTL_SUBMIT_IO, arg);
-
 	return nvme_ioctl(bdev, mode, cmd, arg);
 }
 #else
@@ -3118,7 +3258,7 @@ int nvme_init_identify(struct nvme_ctrl *ctrl)
 	ret = nvme_configure_apst(ctrl);
 	if (ret < 0)
 		return ret;
-	
+
 	ret = nvme_configure_timestamp(ctrl);
 	if (ret < 0)
 		return ret;
-- 
2.27.0



* Re: [PATCH 0/2] nvme: Support for fused NVME_IOCTL_SUBMIT_IO
From: James Smart @ 2021-01-05 23:04 UTC
  To: klayph, linux-nvme
  Cc: Keith Busch, Jens Axboe, Christoph Hellwig, Sagi Grimberg


On 1/5/2021 2:49 PM, klayph@gmail.com wrote:
> From: Clay Mayers <clay.mayers@kioxia.com>
>
> Enables fused compare/write to be issued from user mode.
>
> The first patch allows a pair of passthrough requests
> to be linked and submitted as a fused pair to the nvme
> device queue.  The second patch exposes this through an
> nvme ioctl so a compare/write can be issued from user mode.
>
> There was a question regarding Linux support for this
> functionality in nvme-cli.  With these patches, that
> would be possible.
>
> https://github.com/linux-nvme/nvme-cli/issues/318
>
> Clay Mayers (2):
>    nvme: support fused nvme requests
>    nvme: support fused NVME_IOCTL_SUBMIT_IO
>
>   drivers/nvme/host/core.c | 261 ++++++++++++++++++++++++++++++---------
>   drivers/nvme/host/nvme.h |   2 +
>   drivers/nvme/host/pci.c  |  32 ++++-
>   3 files changed, 234 insertions(+), 61 deletions(-)
>

Hmmm... I don't expect fused support to be simple. There's the whole
notion of submitting the two commands such that back-to-back queue_rq()
calls are made to the transport for the same queue, and those
back-to-back calls (different request contexts) just happen to be the
fused commands. The same holds under error: if queues are frozen, then
replayed, the same back-to-back calling sequence must occur (and I hope
a freeze occurring after one command was sent but the other was not
would recover).

And... the transports better be able to support fused commands as well.
nvme-fc does not, as fused commands were explicitly deemed unsupported.

But - I'll look to see what you have in the patches.

-- james

* Re: [PATCH 1/2] nvme: support fused nvme requests
From: Keith Busch @ 2021-01-05 23:52 UTC
  To: klayph; +Cc: Jens Axboe, Christoph Hellwig, linux-nvme, Sagi Grimberg

On Tue, Jan 05, 2021 at 02:49:38PM -0800, klayph@gmail.com wrote:
> @@ -157,6 +157,8 @@ struct nvme_request {
>  	u8			flags;
>  	u16			status;
>  	struct nvme_ctrl	*ctrl;
> +	struct nvme_request	*nrq2;	/* Points to second fused request */
> +	struct nvme_command	cmnd;	/* Saved fused first cmnd */

This struct is getting a bit large, and we allocate 1000's of them. An
extra 72B (3x its current size) for something that likely never gets
used (and can't be used for fabrics) seems a bit much.
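
For reference, a sketch of where the 72 bytes comes from (assuming a
64-bit kernel, where a pointer is 8 bytes and an SQE is the spec-fixed
64 bytes):

	struct nvme_request	*nrq2;	/*  8 bytes */
	struct nvme_command	cmnd;	/* 64 bytes */
					/* => 72 bytes per request */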

> @@ -918,7 +942,13 @@ static blk_status_t nvme_queue_rq(struct blk_mq_hw_ctx *hctx,
>  	}
>  
>  	blk_mq_start_request(req);
> -	nvme_submit_cmd(nvmeq, &cmnd, bd->last);
> +
> +	if (cmnd.common.flags & NVME_CMD_FUSE_FIRST)
> +		memcpy(&nvme_req(req)->nrq2->cmnd, &cmnd, sizeof(cmnd));
> +	else if (cmnd.common.flags & NVME_CMD_FUSE_SECOND)
> +		nvme_submit_cmd2(nvmeq, &nvme_req(req)->cmnd, &cmnd, bd->last);
> +	else
> +		nvme_submit_cmd(nvmeq, &cmnd, bd->last);

There's an unlikely scenario where this breaks down, but it'd look
something like this: the FUSE_FIRST is submitted to the driver, then
before the FUSE_SECOND reaches the driver, the controller is reset.
Because you're only using this through the passthrough interface where
the FAILFAST flag is set, the FUSE_FIRST command is ended in failure,
which releases the command ID for reuse. Once the controller is
restarted, the FUSE_SECOND is dispatched to the driver using an invalid
command ID stored in its 'cmnd' field.
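
Roughly, the sequence of concern (a sketch of the scenario above):

	/*
	 * T0: queue_rq(FUSE_FIRST)  -> cmnd saved in nrq2->cmnd; no SQE
	 *                              is written to the device yet
	 * T1: controller reset      -> passthrough requests are FAILFAST,
	 *                              so the FUSE_FIRST request ends in
	 *                              error and its command ID is freed
	 * T2: controller restarts
	 * T3: queue_rq(FUSE_SECOND) -> submits the stale saved cmnd,
	 *                              which carries a now-invalid
	 *                              command ID
	 */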


* Re: [PATCH 1/2] nvme: support fused nvme requests
From: James Smart @ 2021-01-06  0:35 UTC
  To: klayph, linux-nvme
  Cc: Keith Busch, Jens Axboe, Christoph Hellwig, Sagi Grimberg


On 1/5/2021 2:49 PM, klayph@gmail.com wrote:
> From

>   
>   /*
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index 3be352403839..c24729e100bc 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -491,6 +491,30 @@ static inline void nvme_write_sq_db(struct nvme_queue *nvmeq, bool write_sq)
>   	nvmeq->last_sq_tail = nvmeq->sq_tail;
>   }
>   
> +/**
> + * nvme_submit_cmd2() - Copy fused commands into a queue and ring the doorbell
> + * @nvmeq: The queue to use
> + * @cmd: The first command to send
> + * @cmd2: the second command to send
> + * @write_sq: whether to write to the SQ doorbell
> + */
> +static void nvme_submit_cmd2(struct nvme_queue *nvmeq, struct nvme_command *cmd,
> +			     struct nvme_command *cmd2, bool write_sq)
> +{
> +	spin_lock(&nvmeq->sq_lock);
> +	memcpy(nvmeq->sq_cmds + (nvmeq->sq_tail << nvmeq->sqes),
> +		cmd, sizeof(*cmd));
> +	if (++nvmeq->sq_tail == nvmeq->q_depth)
> +		nvmeq->sq_tail = 0;
> +	memcpy(nvmeq->sq_cmds + (nvmeq->sq_tail << nvmeq->sqes),
> +		cmd2, sizeof(*cmd2));
> +	if (++nvmeq->sq_tail == nvmeq->q_depth)
> +		nvmeq->sq_tail = 0;
> +	nvme_write_sq_db(nvmeq, write_sq);
> +	spin_unlock(&nvmeq->sq_lock);
> +}
> +
> +
>   /**
>    * nvme_submit_cmd() - Copy a command into a queue and ring the doorbell
>    * @nvmeq: The queue to use
> @@ -918,7 +942,13 @@ static blk_status_t nvme_queue_rq(struct blk_mq_hw_ctx *hctx,
>   	}
>   
>   	blk_mq_start_request(req);
> -	nvme_submit_cmd(nvmeq, &cmnd, bd->last);
> +
> +	if (cmnd.common.flags & NVME_CMD_FUSE_FIRST)
> +		memcpy(&nvme_req(req)->nrq2->cmnd, &cmnd, sizeof(cmnd));
> +	else if (cmnd.common.flags & NVME_CMD_FUSE_SECOND)
> +		nvme_submit_cmd2(nvmeq, &nvme_req(req)->cmnd, &cmnd, bd->last);
> +	else
> +		nvme_submit_cmd(nvmeq, &cmnd, bd->last);
>   	return BLK_STS_OK;
>   out_unmap_data:
>   	nvme_unmap_data(dev, req);

I think Keith caught the most concerning issue. I've been thinking of 
others but they mostly seem to work.  I assume recopies of the first 
fused cmd to the data structure don't hurt.  I would expect there to be 
a couple of other odd issues if the two requests start working 
independently.

This is the pci transport only. You should either put something in to 
not allow the ioctl on a non-pci transport, or add the support (or 
rejection) to the fabric transports as well.

-- james

* Re: [PATCH 1/2] nvme: support fused nvme requests
From: Christoph Hellwig @ 2021-01-06  7:59 UTC
  To: klayph
  Cc: Keith Busch, Jens Axboe, Christoph Hellwig, linux-nvme, Sagi Grimberg

I really don't like this.  There is no good use case and it really
affects the I/O submission fast path.


* Re: [PATCH 1/2] nvme: support fused nvme requests
From: Clay Mayers @ 2021-01-06 14:55 UTC
  To: Keith Busch; +Cc: Jens Axboe, Christoph Hellwig, linux-nvme, Sagi Grimberg

Thanks for the very fast feedback.  I will certainly reduce the memory
hit for non-fused cmds and look into dealing with a controller reset.

On Tue, Jan 5, 2021 at 3:52 PM Keith Busch <kbusch@kernel.org> wrote:
>
> On Tue, Jan 05, 2021 at 02:49:38PM -0800, klayph@gmail.com wrote:
> > @@ -157,6 +157,8 @@ struct nvme_request {
> >       u8                      flags;
> >       u16                     status;
> >       struct nvme_ctrl        *ctrl;
> > +     struct nvme_request     *nrq2;  /* Points to second fused request */
> > +     struct nvme_command     cmnd;   /* Saved fused first cmnd */
>
> This struct is getting a bit large, and we allocate 1000's of them. An
> extra 72B (3x its current size) for something that likely never gets
> used (and can't be used for fabrics) seems a bit much.
>
> > @@ -918,7 +942,13 @@ static blk_status_t nvme_queue_rq(struct blk_mq_hw_ctx *hctx,
> >       }
> >
> >       blk_mq_start_request(req);
> > -     nvme_submit_cmd(nvmeq, &cmnd, bd->last);
> > +
> > +     if (cmnd.common.flags & NVME_CMD_FUSE_FIRST)
> > +             memcpy(&nvme_req(req)->nrq2->cmnd, &cmnd, sizeof(cmnd));
> > +     else if (cmnd.common.flags & NVME_CMD_FUSE_SECOND)
> > +             nvme_submit_cmd2(nvmeq, &nvme_req(req)->cmnd, &cmnd, bd->last);
> > +     else
> > +             nvme_submit_cmd(nvmeq, &cmnd, bd->last);
>
> There's an unlikely scenario where this breaks down, but it'd look
> something like this: the FUSE_FIRST is submitted to the driver, then
> before the FUSE_SECOND reaches the driver, the controller is reset.
> Because you're only using this through the passthrough interface where
> the FAILFAST flag is set, the FUSE_FIRST command is ended in failure,
> which releases the command ID for reuse. Once the controller is
> restarted, the FUSE_SECOND is dispatched to the driver using an invalid
> command ID stored in its 'cmnd' field.


* Re: [PATCH 1/2] nvme: support fused nvme requests
From: Clay Mayers @ 2021-01-06 15:01 UTC
  To: James Smart
  Cc: Keith Busch, Jens Axboe, Christoph Hellwig, linux-nvme, Sagi Grimberg

Thanks for the very fast feedback.  I will ensure this is pci only for now.

On Tue, Jan 5, 2021 at 4:35 PM James Smart <james.smart@broadcom.com> wrote:
>
>
>
> On 1/5/2021 2:49 PM, klayph@gmail.com wrote:
> > From
>
> >
> >   /*
> > diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> > index 3be352403839..c24729e100bc 100644
> > --- a/drivers/nvme/host/pci.c
> > +++ b/drivers/nvme/host/pci.c
> > @@ -491,6 +491,30 @@ static inline void nvme_write_sq_db(struct nvme_queue *nvmeq, bool write_sq)
> >       nvmeq->last_sq_tail = nvmeq->sq_tail;
> >   }
> >
> > +/**
> > + * nvme_submit_cmd2() - Copy fused commands into a queue and ring the doorbell
> > + * @nvmeq: The queue to use
> > + * @cmd: The first command to send
> > + * @cmd2: the second command to send
> > + * @write_sq: whether to write to the SQ doorbell
> > + */
> > +static void nvme_submit_cmd2(struct nvme_queue *nvmeq, struct nvme_command *cmd,
> > +                          struct nvme_command *cmd2, bool write_sq)
> > +{
> > +     spin_lock(&nvmeq->sq_lock);
> > +     memcpy(nvmeq->sq_cmds + (nvmeq->sq_tail << nvmeq->sqes),
> > +             cmd, sizeof(*cmd));
> > +     if (++nvmeq->sq_tail == nvmeq->q_depth)
> > +             nvmeq->sq_tail = 0;
> > +     memcpy(nvmeq->sq_cmds + (nvmeq->sq_tail << nvmeq->sqes),
> > +             cmd2, sizeof(*cmd2));
> > +     if (++nvmeq->sq_tail == nvmeq->q_depth)
> > +             nvmeq->sq_tail = 0;
> > +     nvme_write_sq_db(nvmeq, write_sq);
> > +     spin_unlock(&nvmeq->sq_lock);
> > +}
> > +
> > +
> >   /**
> >    * nvme_submit_cmd() - Copy a command into a queue and ring the doorbell
> >    * @nvmeq: The queue to use
> > @@ -918,7 +942,13 @@ static blk_status_t nvme_queue_rq(struct blk_mq_hw_ctx *hctx,
> >       }
> >
> >       blk_mq_start_request(req);
> > -     nvme_submit_cmd(nvmeq, &cmnd, bd->last);
> > +
> > +     if (cmnd.common.flags & NVME_CMD_FUSE_FIRST)
> > +             memcpy(&nvme_req(req)->nrq2->cmnd, &cmnd, sizeof(cmnd));
> > +     else if (cmnd.common.flags & NVME_CMD_FUSE_SECOND)
> > +             nvme_submit_cmd2(nvmeq, &nvme_req(req)->cmnd, &cmnd, bd->last);
> > +     else
> > +             nvme_submit_cmd(nvmeq, &cmnd, bd->last);
> >       return BLK_STS_OK;
> >   out_unmap_data:
> >       nvme_unmap_data(dev, req);
>
> I think Keith caught the most concerning issue. I've been thinking of
> others but they mostly seem to work.  I assume recopies of the first
> fused cmd to the data structure don't hurt.  I would expect there to be
> a couple of other odd issues if the two requests start working
> independently.
>
> This is the pci transport only. You should either put something in to
> not allow the ioctl on a non-pci transport, or add the support (or
> rejection) to the fabric transports as well.
>
> -- james


* [PATCH V2 0/2] nvme: Support for fused NVME_IOCTL_SUBMIT_IO
From: clay.mayers @ 2021-01-25 19:58 UTC
  To: linux-nvme; +Cc: Keith Busch, Jens Axboe, Christoph Hellwig, Sagi Grimberg

From: Clay Mayers <clay.mayers@kioxia.com>

Enables fused compare/write to be issued from user mode.

The first patch allows a pair of passthrough requests
to be linked and submitted as a fused pair to the nvme
pci device queue.  The second patch exposes this through
an nvme ioctl so a fused pair can be issued from user
mode.

There was a question regarding Linux support for this
functionality in nvme-cli.  With these patches, that
would be possible.

https://github.com/linux-nvme/nvme-cli/issues/318

Local pci device fused support is also necessary for
NVMeOF targets to support fused operation.

Clay Mayers (2):
  nvme: support fused pci nvme requests
  nvme: support fused NVME_IOCTL_SUBMIT_IO

 drivers/nvme/host/core.c | 267 ++++++++++++++++++++++++++++++---------
 drivers/nvme/host/nvme.h |  12 ++
 drivers/nvme/host/pci.c  |  85 ++++++++++++-
 3 files changed, 302 insertions(+), 62 deletions(-)

-- 
2.27.0



* [PATCH V2 1/2] nvme: support fused pci nvme requests
From: clay.mayers @ 2021-01-25 19:58 UTC
  To: linux-nvme; +Cc: Keith Busch, Jens Axboe, Christoph Hellwig, Sagi Grimberg

From: Clay Mayers <clay.mayers@kioxia.com>

Adds support for fused nvme commands to be tunneled through a blk_mq
queue and submitted atomically to a pci nvme device queue.

In nvme_queue_rq(), when an nvme cmnd has the first fused flag set,
the nvme cmnd is saved in nvme_request.fctx and the command is not
queued to the device.  Once the nvme cmnd with the second fused flag
set is queued, nvme_request.nrq is used to get back to the first
fused request so both cmnds can be queued to the device atomically.

v2:
Reduced size of nvme_request by pointing to saved cmnd instead so
only fused commands require the extra 64 bytes.  The saved cmnd is
now with the first request to ease clean up.  Flipped second req
to point to the first so a union with the saved cmnd pointer can
be used to limit the fused impact to 8 bytes.

Fixed an issue where an aborted first fused cmnd could be submitted
by an unaborted second fused cmnd.

Signed-off-by: Clay Mayers <clay.mayers@kioxia.com>
---
 drivers/nvme/host/nvme.h | 12 ++++++
 drivers/nvme/host/pci.c  | 85 ++++++++++++++++++++++++++++++++++++++--
 2 files changed, 94 insertions(+), 3 deletions(-)

diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 567f7ad18a91..b41ce7cd4f49 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -157,6 +157,18 @@ struct nvme_request {
 	u8			flags;
 	u16			status;
 	struct nvme_ctrl	*ctrl;
+	union {
+		struct nvme_request *nrq;	/* other fused request */
+		/*
+		 * Between the time the first fused is queued with nvme_queue_rq
+		 * and the second command is queued, the first fused command
+		 * uses fctx instead of nrq.
+		 */
+		struct nvme_fused_ctx {
+			struct nvme_request *nrq2;	/* copy of nrq */
+			struct nvme_command cmnd;	/* copy of 1st fused */
+		} *fctx;
+	};
 };
 
 /*
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 3be352403839..ba4798685811 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -491,6 +491,30 @@ static inline void nvme_write_sq_db(struct nvme_queue *nvmeq, bool write_sq)
 	nvmeq->last_sq_tail = nvmeq->sq_tail;
 }
 
+/**
+ * nvme_submit_cmd2() - Copy fused commands into a queue and ring the doorbell
+ * @nvmeq: The queue to use
+ * @cmd: The first command to send
+ * @cmd2: the second command to send
+ * @write_sq: whether to write to the SQ doorbell
+ */
+static void nvme_submit_cmd2(struct nvme_queue *nvmeq, struct nvme_command *cmd,
+			     struct nvme_command *cmd2, bool write_sq)
+{
+	spin_lock(&nvmeq->sq_lock);
+	memcpy(nvmeq->sq_cmds + (nvmeq->sq_tail << nvmeq->sqes),
+		cmd, sizeof(*cmd));
+	if (++nvmeq->sq_tail == nvmeq->q_depth)
+		nvmeq->sq_tail = 0;
+	memcpy(nvmeq->sq_cmds + (nvmeq->sq_tail << nvmeq->sqes),
+		cmd2, sizeof(*cmd2));
+	if (++nvmeq->sq_tail == nvmeq->q_depth)
+		nvmeq->sq_tail = 0;
+	nvme_write_sq_db(nvmeq, write_sq);
+	spin_unlock(&nvmeq->sq_lock);
+}
+
+
 /**
  * nvme_submit_cmd() - Copy a command into a queue and ring the doorbell
  * @nvmeq: The queue to use
@@ -876,6 +900,37 @@ static blk_status_t nvme_map_metadata(struct nvme_dev *dev, struct request *req,
 	return BLK_STS_OK;
 }
 
+static blk_status_t nvme_queue_frq(struct nvme_queue *nvmeq,
+			struct request *req, struct nvme_command *cmnd,
+			bool write_sq)
+{
+	struct nvme_fused_ctx *fctx;
+
+	if (cmnd->common.flags & NVME_CMD_FUSE_FIRST) {
+		/* Save cmnd to submit with 2nd fused */
+		fctx = kmalloc(sizeof(*fctx), GFP_KERNEL);
+		if (!fctx)
+			return BLK_STS_RESOURCE;
+		fctx->nrq2 = nvme_req(req)->nrq;
+		memcpy(&fctx->cmnd, cmnd, sizeof(*cmnd));
+		nvme_req(req)->fctx = fctx;
+		blk_mq_start_request(req);
+		return BLK_STS_OK;
+	}
+	/* handle NVME_CMD_FUSED_SECOND */
+	if (!nvme_req(req)->nrq) {
+		nvme_req(req)->status = NVME_SC_FUSED_FAIL;
+		return BLK_STS_IOERR;	/* First i/o has been canceled */
+	}
+
+	fctx = nvme_req(req)->nrq->fctx;
+	nvme_req(req)->nrq->fctx = NULL;
+	blk_mq_start_request(req);
+	nvme_submit_cmd2(nvmeq, &fctx->cmnd, cmnd, write_sq);
+	kfree(fctx);
+	return BLK_STS_OK;
+}
+
 /*
  * NOTE: ns is NULL when called on the admin queue.
  */
@@ -889,6 +944,7 @@ static blk_status_t nvme_queue_rq(struct blk_mq_hw_ctx *hctx,
 	struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
 	struct nvme_command cmnd;
 	blk_status_t ret;
+	int fused;
 
 	iod->aborted = 0;
 	iod->npages = -1;
@@ -917,8 +973,15 @@ static blk_status_t nvme_queue_rq(struct blk_mq_hw_ctx *hctx,
 			goto out_unmap_data;
 	}
 
-	blk_mq_start_request(req);
-	nvme_submit_cmd(nvmeq, &cmnd, bd->last);
+	fused = cmnd.common.flags & (NVME_CMD_FUSE_FIRST|NVME_CMD_FUSE_SECOND);
+	if (likely(!fused)) {
+		blk_mq_start_request(req);
+		nvme_submit_cmd(nvmeq, &cmnd, bd->last);
+	} else {
+		ret = nvme_queue_frq(nvmeq, req, &cmnd, bd->last);
+		if (ret)
+			goto out_unmap_data;
+	}
 	return BLK_STS_OK;
 out_unmap_data:
 	nvme_unmap_data(dev, req);
@@ -2423,6 +2486,22 @@ static void nvme_pci_disable(struct nvme_dev *dev)
 	}
 }
 
+static bool nvme_pci_cancel_rq(struct request *req, void *data, bool reserved)
+{
+	if (unlikely(nvme_req(req)->cmd->common.flags & NVME_CMD_FUSE_FIRST)) {
+		struct nvme_fused_ctx *fctx = nvme_req(req)->fctx;
+
+		/* this will only be set if 2nd fused isn't queued yet */
+		if (unlikely(fctx)) {
+			fctx->nrq2->nrq = NULL;	/* break link of 2nd fused */
+			nvme_req(req)->fctx = NULL;
+			kfree(fctx);
+		}
+	}
+	return nvme_cancel_request(req, data, reserved);
+}
+
+
 static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown)
 {
 	bool dead = true, freeze = false;
@@ -2459,7 +2538,7 @@ static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown)
 	nvme_pci_disable(dev);
 	nvme_reap_pending_cqes(dev);
 
-	blk_mq_tagset_busy_iter(&dev->tagset, nvme_cancel_request, &dev->ctrl);
+	blk_mq_tagset_busy_iter(&dev->tagset, nvme_pci_cancel_rq, &dev->ctrl);
 	blk_mq_tagset_busy_iter(&dev->admin_tagset, nvme_cancel_request, &dev->ctrl);
 	blk_mq_tagset_wait_completed_request(&dev->tagset);
 	blk_mq_tagset_wait_completed_request(&dev->admin_tagset);
-- 
2.27.0



* [PATCH V2 2/2] nvme: support fused NVME_IOCTL_SUBMIT_IO
From: clay.mayers @ 2021-01-25 19:58 UTC
  To: linux-nvme; +Cc: Keith Busch, Jens Axboe, Christoph Hellwig, Sagi Grimberg

From: Clay Mayers <clay.mayers@kioxia.com>

Extends the functionality of the NVME_IOCTL_SUBMIT_IO ioctl to support
a pair of fused nvme_user_io requests.

When submitting a fused pair, an array of two nvme_user_io structs is
supplied to the NVME_IOCTL_SUBMIT_IO ioctl.  Rather than introduce a
new ioctl code, the presence of a fused pair is indicated by the first
nvme_user_io.flags having the value NVME_CMD_FUSE_FIRST.  This
indicates that a second nvme_user_io struct follows the first, with its
nvme_user_io.flags set to NVME_CMD_FUSE_SECOND.

A fused pair may fail to submit with -EWOULDBLOCK.  This indicates the
device queue selected for the first command didn't have a tag available
when the request for the second command was created.  The caller should
retry the request.
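
A sketch of the retry a caller might use (assuming the failure reaches
user space as errno == EWOULDBLOCK, and reusing the hypothetical
fused_compare_write() helper sketched earlier in this thread):

	#include <errno.h>
	#include <stdint.h>

	/* Retry the fused submission until a tag is available. */
	static int fused_cw_retry(int fd, uint64_t slba,
				  void *expect, void *data)
	{
		int ret;

		do {
			ret = fused_compare_write(fd, slba, expect, data);
		} while (ret < 0 && errno == EWOULDBLOCK);
		return ret;
	}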

v2:
Restricted fused support to just the pci transport
Restored the #ifdef CONFIG_COMPAT around nvme_user_io32 defines.

Signed-off-by: Clay Mayers <clay.mayers@kioxia.com>
---
 drivers/nvme/host/core.c | 267 ++++++++++++++++++++++++++++++---------
 1 file changed, 208 insertions(+), 59 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 9a270e49df17..29e9129f6bf3 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -1467,16 +1467,40 @@ static void __user *nvme_to_user_ptr(uintptr_t ptrval)
 	return (void __user *)ptrval;
 }
 
-static int nvme_submit_io(struct nvme_ns *ns, struct nvme_user_io __user *uio)
+struct nvme_user_io_req {
+	struct nvme_command cmd;
+	struct request *rq;
+	struct bio *bio;	/* bio in rq at the time of allocation */
+	void *meta;
+	void __user *udata;
+	void __user *umeta;
+	unsigned int len;
+	unsigned int mlen;
+	u32 mseed;
+};
+
+static void nvme_free_io(struct nvme_user_io_req *nrq)
+{
+	if (!nrq)
+		return;
+	kfree(nrq->meta);
+	if (nrq->bio)
+		blk_rq_unmap_user(nrq->bio);
+	if (nrq->rq)
+		blk_mq_free_request(nrq->rq);
+	nrq->meta = NULL;
+	nrq->bio = NULL;
+	nrq->rq = NULL;
+}
+
+static int nvme_prep_io(struct nvme_ns *ns, struct nvme_user_io_req *nrq,
+			struct nvme_user_io __user *uio, int size)
 {
 	struct nvme_user_io io;
-	struct nvme_command c;
-	unsigned length, meta_len;
-	void __user *metadata;
 
-	if (copy_from_user(&io, uio, sizeof(io)))
+	if (unlikely(copy_from_user(&io, uio, size)))
 		return -EFAULT;
-	if (io.flags)
+	if (unlikely(io.flags & ~(NVME_CMD_FUSE_FIRST|NVME_CMD_FUSE_SECOND)))
 		return -EINVAL;
 
 	switch (io.opcode) {
@@ -1488,33 +1512,165 @@ static int nvme_submit_io(struct nvme_ns *ns, struct nvme_user_io __user *uio)
 		return -EINVAL;
 	}
 
-	length = (io.nblocks + 1) << ns->lba_shift;
-	meta_len = (io.nblocks + 1) * ns->ms;
-	metadata = nvme_to_user_ptr(io.metadata);
+	nrq->udata = nvme_to_user_ptr(io.addr);
+	nrq->len = (io.nblocks + 1) << ns->lba_shift;
+	nrq->umeta = nvme_to_user_ptr(io.metadata);
+	nrq->mlen = (io.nblocks + 1) * ns->ms;
+	nrq->mseed = lower_32_bits(io.slba);
+	nrq->bio = nrq->meta = NULL;
 
 	if (ns->features & NVME_NS_EXT_LBAS) {
-		length += meta_len;
-		meta_len = 0;
-	} else if (meta_len) {
+		nrq->len += nrq->mlen;
+		nrq->mlen = 0;
+	} else if (nrq->mlen) {
 		if ((io.metadata & 3) || !io.metadata)
 			return -EINVAL;
 	}
 
-	memset(&c, 0, sizeof(c));
-	c.rw.opcode = io.opcode;
-	c.rw.flags = io.flags;
-	c.rw.nsid = cpu_to_le32(ns->head->ns_id);
-	c.rw.slba = cpu_to_le64(io.slba);
-	c.rw.length = cpu_to_le16(io.nblocks);
-	c.rw.control = cpu_to_le16(io.control);
-	c.rw.dsmgmt = cpu_to_le32(io.dsmgmt);
-	c.rw.reftag = cpu_to_le32(io.reftag);
-	c.rw.apptag = cpu_to_le16(io.apptag);
-	c.rw.appmask = cpu_to_le16(io.appmask);
-
-	return nvme_submit_user_cmd(ns->queue, &c,
-			nvme_to_user_ptr(io.addr), length,
-			metadata, meta_len, lower_32_bits(io.slba), NULL, 0);
+	memset(&nrq->cmd, 0, sizeof(nrq->cmd));
+	nrq->cmd.rw.opcode = io.opcode;
+	nrq->cmd.rw.flags = io.flags;
+	nrq->cmd.rw.nsid = cpu_to_le32(ns->head->ns_id);
+	nrq->cmd.rw.slba = cpu_to_le64(io.slba);
+	nrq->cmd.rw.length = cpu_to_le16(io.nblocks);
+	nrq->cmd.rw.control = cpu_to_le16(io.control);
+	nrq->cmd.rw.dsmgmt = cpu_to_le32(io.dsmgmt);
+	nrq->cmd.rw.reftag = cpu_to_le32(io.reftag);
+	nrq->cmd.rw.apptag = cpu_to_le16(io.apptag);
+	nrq->cmd.rw.appmask = cpu_to_le16(io.appmask);
+
+	return 0;
+}
+
+static struct request *nvme_mk_req_io(struct nvme_ns *ns,
+				      struct nvme_user_io_req *nrq,
+				      blk_mq_req_flags_t flags, int qid,
+				      int timeout)
+{
+	bool write = nvme_is_write(&nrq->cmd);
+	struct request_queue *q = ns->queue;
+	struct gendisk *disk = ns->disk;
+	struct request *rq;
+	struct bio *bio = NULL;
+	void *meta = NULL;
+	int ret;
+
+	rq = nvme_alloc_request(q, &nrq->cmd, flags, qid);
+	if (IS_ERR(rq))
+		return rq;
+
+	rq->timeout = timeout ? timeout : ADMIN_TIMEOUT;
+	nvme_req(rq)->flags |= NVME_REQ_USERCMD;
+
+	if (nrq->udata && nrq->len) {
+		ret = blk_rq_map_user(q, rq, NULL, nrq->udata, nrq->len,
+				GFP_KERNEL);
+		if (ret)
+			goto out;
+		bio = rq->bio;
+		bio->bi_disk = disk;
+		if (disk && nrq->umeta && nrq->mlen) {
+			meta = nvme_add_user_metadata(bio, nrq->umeta, nrq->mlen,
+					nrq->mseed, write);
+			if (IS_ERR(meta)) {
+				ret = PTR_ERR(meta);
+				goto out_unmap;
+			}
+			nrq->meta = meta;
+		}
+	}
+	nrq->bio = bio;
+	return rq;
+out_unmap:
+	if (bio)
+		blk_rq_unmap_user(bio);
+ out:
+	blk_mq_free_request(rq);
+	return ERR_PTR(ret);
+}
+
+static int nvme_unprep_io(struct nvme_user_io_req *nrq,
+			  u64 *result)
+{
+	struct request *rq = nrq->rq;
+	int write = nvme_is_write(&nrq->cmd);
+	int ret;
+
+	if (unlikely(nvme_req(rq)->flags & NVME_REQ_CANCELLED))
+		ret = -EINTR;
+	else
+		ret = nvme_req(rq)->status;
+	if (result)
+		*result = le64_to_cpu(nvme_req(rq)->result.u64);
+	if (nrq->meta && !ret && !write) {
+		if (copy_to_user(nrq->umeta, nrq->meta, nrq->mlen))
+			ret = -EFAULT;
+	}
+	nvme_free_io(nrq);
+	return ret;
+}
+
+/* support both NVME_IOCTL_SUBMIT_IO and NVME_IOCTL_SUBMIT_IO32 */
+static int nvme_submit_io(struct nvme_ns *ns, void __user *uio,
+			  int size)
+{
+	struct nvme_ctrl *ctrl = ns->ctrl;
+	struct nvme_user_io_req nrq, nrq2;
+	struct request *rq, *rq2;
+	int ret, fused;
+
+	ret = nvme_prep_io(ns, &nrq, uio, size);
+	if (unlikely(ret))
+		return ret;
+	fused = (nrq.cmd.common.flags == NVME_CMD_FUSE_FIRST);
+	if (fused) {
+		if (!(ctrl->ops->flags & NVME_F_PCI_P2PDMA))
+			return -EINVAL;
+		ret = nvme_prep_io(ns, &nrq2, uio+size, size);
+		if (unlikely(ret))
+			return ret;
+		if (unlikely(nrq2.cmd.common.flags != NVME_CMD_FUSE_SECOND))
+			return -EINVAL;
+	} else if (unlikely(nrq.cmd.common.flags)) {
+		return -EINVAL;
+	}
+	rq = nvme_mk_req_io(ns, &nrq, 0, NVME_QID_ANY, 0);
+	if (IS_ERR(rq))
+		return PTR_ERR(rq);
+	nrq.rq = rq;
+	if (fused) {
+		DECLARE_COMPLETION_ONSTACK(wait);
+
+		rq2 = nvme_mk_req_io(ns, &nrq2, BLK_MQ_REQ_NOWAIT,
+				     nvme_req_qid(rq), 0);
+		if (IS_ERR(rq2)) {
+			nvme_free_io(&nrq);
+			return PTR_ERR(rq2);
+		}
+		/* point requests at each other */
+		nvme_req(rq)->nrq = nvme_req(rq2);
+		nvme_req(rq2)->nrq = nvme_req(rq);
+		nrq2.rq = rq2;
+
+		rq->cmd_flags |= REQ_NOMERGE;
+		rq2->cmd_flags |= REQ_NOMERGE;
+		rq->end_io_data = &wait;
+		blk_execute_rq_nowait(rq->q, ns->disk, rq, false, nvme_end_sync_rq);
+		nvme_execute_passthru_rq(rq2);
+
+		/*
+		 * both will be complete at this point, but nvme spec doesn't
+		 * specify cqe ordering for fused operations so wait for the
+		 * first to complete as well
+		 */
+		wait_for_completion_io(&wait);
+		nvme_unprep_io(&nrq, NULL);
+		ret = nvme_unprep_io(&nrq2, NULL);
+	} else {
+		nvme_execute_passthru_rq(rq);
+		ret = nvme_unprep_io(&nrq, NULL);
+	}
+	return ret;
 }
 
 static int nvme_user_cmd(struct nvme_ctrl *ctrl, struct nvme_ns *ns,
@@ -1671,6 +1827,25 @@ static int nvme_handle_ctrl_ioctl(struct nvme_ns *ns, unsigned int cmd,
 	return ret;
 }
 
+#ifdef CONFIG_COMPAT
+struct nvme_user_io32 {
+	__u8	opcode;
+	__u8	flags;
+	__u16	control;
+	__u16	nblocks;
+	__u16	rsvd;
+	__u64	metadata;
+	__u64	addr;
+	__u64	slba;
+	__u32	dsmgmt;
+	__u32	reftag;
+	__u16	apptag;
+	__u16	appmask;
+} __attribute__((__packed__));
+
+#define NVME_IOCTL_SUBMIT_IO32	_IOW('N', 0x42, struct nvme_user_io32)
+#endif
+
 static int nvme_ioctl(struct block_device *bdev, fmode_t mode,
 		unsigned int cmd, unsigned long arg)
 {
@@ -1699,8 +1874,12 @@ static int nvme_ioctl(struct block_device *bdev, fmode_t mode,
 	case NVME_IOCTL_IO_CMD:
 		ret = nvme_user_cmd(ns->ctrl, ns, argp);
 		break;
+#ifdef CONFIG_COMPAT
+	case NVME_IOCTL_SUBMIT_IO32:
+		fallthrough;	/* structures are identical except size */
+#endif
 	case NVME_IOCTL_SUBMIT_IO:
-		ret = nvme_submit_io(ns, argp);
+		ret = nvme_submit_io(ns, argp, _IOC_SIZE(cmd));
 		break;
 	case NVME_IOCTL_IO64_CMD:
 		ret = nvme_user_cmd64(ns->ctrl, ns, argp);
@@ -1716,41 +1895,11 @@ static int nvme_ioctl(struct block_device *bdev, fmode_t mode,
 	return ret;
 }
 
-#ifdef CONFIG_COMPAT
-struct nvme_user_io32 {
-	__u8	opcode;
-	__u8	flags;
-	__u16	control;
-	__u16	nblocks;
-	__u16	rsvd;
-	__u64	metadata;
-	__u64	addr;
-	__u64	slba;
-	__u32	dsmgmt;
-	__u32	reftag;
-	__u16	apptag;
-	__u16	appmask;
-} __attribute__((__packed__));
-
-#define NVME_IOCTL_SUBMIT_IO32	_IOW('N', 0x42, struct nvme_user_io32)
 
+#ifdef CONFIG_COMPAT
 static int nvme_compat_ioctl(struct block_device *bdev, fmode_t mode,
 		unsigned int cmd, unsigned long arg)
 {
-	/*
-	 * Corresponds to the difference of NVME_IOCTL_SUBMIT_IO
-	 * between 32 bit programs and 64 bit kernel.
-	 * The cause is that the results of sizeof(struct nvme_user_io),
-	 * which is used to define NVME_IOCTL_SUBMIT_IO,
-	 * are not same between 32 bit compiler and 64 bit compiler.
-	 * NVME_IOCTL_SUBMIT_IO32 is for 64 bit kernel handling
-	 * NVME_IOCTL_SUBMIT_IO issued from 32 bit programs.
-	 * Other IOCTL numbers are same between 32 bit and 64 bit.
-	 * So there is nothing to do regarding to other IOCTL numbers.
-	 */
-	if (cmd == NVME_IOCTL_SUBMIT_IO32)
-		return nvme_ioctl(bdev, mode, NVME_IOCTL_SUBMIT_IO, arg);
-
 	return nvme_ioctl(bdev, mode, cmd, arg);
 }
 #else
-- 
2.27.0



* Re: [PATCH V2 0/2] nvme: Support for fused NVME_IOCTL_SUBMIT_IO
From: Chaitanya Kulkarni @ 2021-01-26  1:43 UTC
  To: clay.mayers, linux-nvme
  Cc: Keith Busch, Jens Axboe, Christoph Hellwig, Sagi Grimberg

On 1/25/21 12:03, clay.mayers@kioxia.com wrote:
> Local pci device fused support is also necessary for
> NVMeOF targets to support fused operation.
Please explain the use case and the application of the NVMeOF
fuse command feature.


* RE: [PATCH V2 0/2] nvme: Support for fused NVME_IOCTL_SUBMIT_IO
From: Clay Mayers @ 2021-01-26 18:17 UTC
  To: Chaitanya Kulkarni, linux-nvme
  Cc: Keith Busch, Jens Axboe, Christoph Hellwig, Sagi Grimberg

> From: Chaitanya Kulkarni <Chaitanya.Kulkarni@wdc.com>
> Sent: Monday, January 25, 2021 5:43 PM
> To: Clay Mayers <Clay.Mayers@kioxia.com>; linux-nvme@lists.infradead.org
> Cc: Keith Busch <kbusch@kernel.org>; Jens Axboe <axboe@fb.com>;
> Christoph Hellwig <hch@lst.de>; Sagi Grimberg <sagi@grimberg.me>
> Subject: Re: [PATCH V2 0/2] nvme: Support for fused
> NVME_IOCTL_SUBMIT_IO
> 
> On 1/25/21 12:03, clay.mayers@kioxia.com wrote:
> > Local pci device fused support is also necessary for NVMeOF targets to
> > support fused operation.
> Please explain the use case and the application of the NVMeOF fuse
> command feature.

NVMeOF devices are used to create disaggregated storage systems
where compute and storage are connected over a fabric.  Fused
compare/write can be used to arbitrate shared access to NVMeOF devices
w/o a central authority.

A specific example of how fused compare/write is used is the clustered
file system VMFS.  It uses the SCSI version of compare/write to manage
metadata on shared SAN systems.  File system metadata is updated
using locks stored on the storage media.  Those locks are grabbed using
fused compare/write operations as an atomic test & set.  VMFS originally
used device reserve, which is a coarser-grained locking mechanism that
doesn't scale as well as an atomic test & set.
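
A schematic sketch of that pattern (illustrative only: the all-zeroes
UNLOCKED convention, my_node_id, and the fused_compare_write() helper
from earlier in this thread are assumptions, not VMFS's actual on-disk
format; reuses the earlier includes plus <stddef.h>):

	/*
	 * Grab a one-block on-disk lock record at lock_lba: the compare
	 * half checks the block still reads UNLOCKED (all zeroes); the
	 * fused write half stamps our node id.  If another node won the
	 * race, the compare fails and the pair completes without writing.
	 */
	static int try_lock(int fd, uint64_t lock_lba, uint64_t my_node_id,
			    void *expect, void *mine, size_t blksz)
	{
		memset(expect, 0, blksz);	/* UNLOCKED == all zeroes */
		memset(mine, 0, blksz);
		memcpy(mine, &my_node_id, sizeof(my_node_id));

		/* 0 == lock acquired; nonzero == lost the race or error */
		return fused_compare_write(fd, lock_lba, expect, mine);
	}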



* Re: [PATCH V2 0/2] nvme: Support for fused NVME_IOCTL_SUBMIT_IO
From: Chaitanya Kulkarni @ 2021-01-26 19:00 UTC
  To: Clay Mayers, linux-nvme
  Cc: Keith Busch, Jens Axboe, Christoph Hellwig, Sagi Grimberg

On 1/26/21 10:17 AM, Clay Mayers wrote:
>> From: Chaitanya Kulkarni <Chaitanya.Kulkarni@wdc.com>
>> Sent: Monday, January 25, 2021 5:43 PM
>> To: Clay Mayers <Clay.Mayers@kioxia.com>; linux-nvme@lists.infradead.org
>> Cc: Keith Busch <kbusch@kernel.org>; Jens Axboe <axboe@fb.com>;
>> Christoph Hellwig <hch@lst.de>; Sagi Grimberg <sagi@grimberg.me>
>> Subject: Re: [PATCH V2 0/2] nvme: Support for fused
>> NVME_IOCTL_SUBMIT_IO
>>
>> On 1/25/21 12:03, clay.mayers@kioxia.com wrote:
>>> Local pci device fused support is also necessary for NVMeOF targets to
>>> support fused operation.
>> Please explain the use case and the application of the NVMeOF fuse
>> command feature.
> NVMeOF devices are used to create disaggregated storage systems
> where compute and storage are connected over a fabric.  Fused
> compare/write can be used to arbitrate shared access to NVMeOF devices
> w/o a central authority.
>
> A specific example of how fused compare/write is used is the clustered
> file system VMFS.  It uses the SCSI version of compare/write to manage
> metadata on shared SAN systems.  File system metadata is updated
> using locks stored on the storage media.  Those locks are grabbed using
> fused compare/write operations as an atomic test & set.  VMFS originally
> used device reserve, which is a coarser-grained locking mechanism that
> doesn't scale as well as an atomic test & set.
If I understand correctly, VMFS is an out-of-tree filesystem, is it not?
Can you please explain the setup in detail? What kind of interface is the
file system using to issue the command?


Based on your description it looks like the target is connected to a
VMware-based system and the host is a VMware-based host, not the Linux
host which is present in this series.
Also, what are the other applications, or is this the only application?
>


_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 22+ messages in thread

* RE: [PATCH V2 0/2] nvme: Support for fused NVME_IOCTL_SUBMIT_IO
  2021-01-26 19:00         ` Chaitanya Kulkarni
@ 2021-01-26 21:14           ` Clay Mayers
  2021-02-09  0:53           ` Clay Mayers
  1 sibling, 0 replies; 22+ messages in thread
From: Clay Mayers @ 2021-01-26 21:14 UTC (permalink / raw)
  To: Chaitanya Kulkarni, linux-nvme
  Cc: Keith Busch, Jens Axboe, Christoph Hellwig, Sagi Grimberg

> From: Chaitanya Kulkarni <Chaitanya.Kulkarni@wdc.com>
> Sent: Tuesday, January 26, 2021 11:01 AM
> 
> On 1/26/21 10:17 AM, Clay Mayers wrote:
> >>
> >> On 1/25/21 12:03, clay.mayers@kioxia.com wrote:
> >>> Local pci device fused support is also necessary for NVMeOF targets
> >>> to support fused operation.
> >> Please explain the use case and the application of the NVMeOF fused
> >> command feature.
> > NVMeOF devices are used to create disaggregated storage systems where
> > compute and storage are connected over a fabric.  Fused compare/write
> > can be used to arbitrate shared access to NVMeOF devices w/o a central
> > authority.
> >
> > A specific example of how fused compare/write is used is the clustered
> > file system VMFS.  It uses the SCSI version of compare/write to manage
> > metadata on shared SAN systems.  File system metadata is updated
> > using locks stored on the storage media.  Those locks are grabbed
> > using fused compare/write operations as an atomic test & set.  VMFS
> > originally used device reserve, which is a coarser-grained locking
> > mechanism, but it doesn't scale as well as an atomic test & set.
> If I understand correctly, VMFS is an out-of-tree filesystem, is it not?
I seem to have misunderstood your request for a use case.  As a patch series,
this is not about NVMeOF.  This is about pci support for the fused command.
NVMeOF is the use case for pci fused support.

But how strong of a use case is NVMeOF?  I offered clustered file systems
and the public example of VMware's VMFS to illustrate the usefulness.
Here VMware is the target and Linux is the host serving up storage over
NVMeOF.  That requires fused support in the target/host and in pci.  At a
past company I worked for, we used SPDK to get this functionality for
disaggregated storage.  That's right for some solutions but not all.

Our actual goal is to have something like direct device access without
needing something like SPDK.  We think io_uring is the correct solution.
Jens, just before his winter PTO, tweeted about adding ioctl support to
io_uring.  We hope to extend that to support fused operations as well.
Exposing it through an ioctl makes the pci patch useful now.  The one
example I have is for nvme-cli, as requested on GitHub.

https://github.com/linux-nvme/nvme-cli/issues/318

I thought this was better than folding an nvme change in with an io_uring
patch series.  I'm trying to find the balance between a small, isolated
unit of change and something compelling.
> Can you please explain the setup in detail? What kind of interface is the
> file system using to issue the command?
> Based on your description it looks like the target is connected to a
> VMware-based system and the host is a VMware-based host, not the Linux
> host which is present in this series.
No - the idea is to be standards-based and use NVMeOF for target and host
data exchange.  In one example, the target would be running vSphere.  The
host, as a Linux machine, would expose its attached devices over NVMeOF.
vSphere would expect fused command support from the Linux machine.
> Also, what are the other applications, or is this the only application?
The application is disaggregated storage on NVMeOF, both consuming it
and publishing it.  I don't have any specific set of applications to offer.


_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 22+ messages in thread

* RE: [PATCH V2 0/2] nvme: Support for fused NVME_IOCTL_SUBMIT_IO
  2021-01-26 19:00         ` Chaitanya Kulkarni
  2021-01-26 21:14           ` Clay Mayers
@ 2021-02-09  0:53           ` Clay Mayers
  2021-02-09  3:12             ` Keith Busch
  2021-02-09  7:54             ` Christoph Hellwig
  1 sibling, 2 replies; 22+ messages in thread
From: Clay Mayers @ 2021-02-09  0:53 UTC (permalink / raw)
  To: linux-nvme
  Cc: Keith Busch, Jens Axboe, Christoph Hellwig, Chaitanya Kulkarni,
	Sagi Grimberg

> From: Clay Mayers
> Sent: Tuesday, January 26, 2021 1:14 PM
> To: linux-nvme@lists.infradead.org
> Cc: Keith Busch <kbusch@kernel.org>; Jens Axboe <axboe@fb.com>;
> 'Chaitanya Kulkarni' <Chaitanya.Kulkarni@wdc.com>;
> Christoph Hellwig <hch@lst.de>; Sagi Grimberg <sagi@grimberg.me>
> Subject: RE: [PATCH V2 0/2] nvme: Support for fused
> NVME_IOCTL_SUBMIT_IO
>
Is there any other feedback on V2?

The main concern I have about my implementation is how fused requests
are tunneled through the mq request layer.  The 1st request is marked as
started, but it won't be in the device until the 2nd command is queued.  As
Keith pointed out, a device reset can split the two, so care must be taken
to handle this case correctly.  Despite this, I thought this was a better
approach than modifying mq requests to be fused, especially given
Christoph's concern of cost vs. value.  This is the lightest touch I could
come up with.
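
Concretely, the tunneling amounts to something like the following (a
simplified sketch of the approach using the nvme_request fields this
series adds, not the literal patch code; sq_copy_cmd(), sq_ring_doorbell()
and queue_rq_normal() are made-up stand-ins for the driver's
submission-queue helpers):

static blk_status_t queue_rq_fused(struct nvme_queue *nvmeq,
				   struct request *req,
				   struct nvme_command *cmd)
{
	struct nvme_request *nrq = nvme_req(req);

	if (cmd->common.flags & NVME_CMD_FUSE_FIRST) {
		/* 1st request: blk-mq has started it, but rather than
		 * going to the device its command is parked in the
		 * 2nd request until that one is queued. */
		nrq->nrq2->cmnd = *cmd;
		return BLK_STS_OK;
	}
	if (cmd->common.flags & NVME_CMD_FUSE_SECOND) {
		/* 2nd request: copy both SQEs adjacently and ring the
		 * doorbell once so the pair reaches the device together. */
		sq_copy_cmd(nvmeq, &nrq->cmnd);	/* the saved 1st command */
		sq_copy_cmd(nvmeq, cmd);	/* the 2nd command */
		sq_ring_doorbell(nvmeq);
		return BLK_STS_OK;
	}
	return queue_rq_normal(nvmeq, req, cmd);	/* non-fused path */
}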

Further consideration of this patch may need a more compelling use case.
I've worked on a proprietary storage system that relied on fused NVMeOF
support, so it seems compelling to me.  There's a comment in target/core.c
that there is "no support for fused commands yet", implying it's been
considered.  Is pci-only support for fused commands too soon or too little?
What would make it more compelling?






_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH V2 0/2] nvme: Support for fused NVME_IOCTL_SUBMIT_IO
  2021-02-09  0:53           ` Clay Mayers
@ 2021-02-09  3:12             ` Keith Busch
  2021-02-09 15:24               ` Bart Van Assche
  2021-02-09 15:38               ` Clay Mayers
  2021-02-09  7:54             ` Christoph Hellwig
  1 sibling, 2 replies; 22+ messages in thread
From: Keith Busch @ 2021-02-09  3:12 UTC (permalink / raw)
  To: Clay Mayers
  Cc: Jens Axboe, Sagi Grimberg, Chaitanya Kulkarni, linux-nvme,
	Christoph Hellwig

On Tue, Feb 09, 2021 at 12:53:17AM +0000, Clay Mayers wrote:
> Is there any other feedback on V2?
> 
> The main concern I have about my implementation is how fused requests
> are tunneled through the mq request layer.  The 1st request is marked as
> started, but it won't be in the device until the 2nd command is queued.  As
> Keith pointed out, a device reset can split the two, so care must be taken
> to handle this case correctly.  Despite this, I thought this was a better
> approach than modifying mq requests to be fused, especially given
> Christoph's concern of cost vs. value.  This is the lightest touch I could
> come up with.
> 
> Further consideration of this patch may need a more compelling use case.
> I've worked on a proprietary storage system that relied on fused NVMeOF
> support, so it seems compelling to me.  There's a comment in target/core.c
> that there is "no support for fused commands yet", implying it's been
> considered.  Is pci-only support for fused commands too soon or too little?
> What would make it more compelling?

The complications it introduces to the IO path and error handling for an
archaic feature have me on the "Nak" side. NVMeOF was introduced well after
the spec defined Reservations, and the kernel has supported that capability
for many years. I'm not aware of any other use case for fused commands, so
it appears to be dead weight in the spec.

_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH V2 0/2] nvme: Support for fused NVME_IOCTL_SUBMIT_IO
  2021-02-09  0:53           ` Clay Mayers
  2021-02-09  3:12             ` Keith Busch
@ 2021-02-09  7:54             ` Christoph Hellwig
  2021-02-09 15:53               ` Clay Mayers
  1 sibling, 1 reply; 22+ messages in thread
From: Christoph Hellwig @ 2021-02-09  7:54 UTC (permalink / raw)
  To: Clay Mayers
  Cc: Sagi Grimberg, Chaitanya Kulkarni, linux-nvme, Jens Axboe,
	Keith Busch, Christoph Hellwig

On Tue, Feb 09, 2021 at 12:53:17AM +0000, Clay Mayers wrote:
> > From: Clay Mayers
> > Sent: Tuesday, January 26, 2021 1:14 PM
> > To: linux-nvme@lists.infradead.org
> > Cc: Keith Busch <kbusch@kernel.org>; Jens Axboe <axboe@fb.com>;
> > 'Chaitanya Kulkarni' <Chaitanya.Kulkarni@wdc.com>;
> > Christoph Hellwig <hch@lst.de>; Sagi Grimberg <sagi@grimberg.me>
> > Subject: RE: [PATCH V2 0/2] nvme: Support for fused
> > NVME_IOCTL_SUBMIT_IO
> >
> Is there any other feedback on V2?

My feedback is that we should not add code to the kernel driver and
our fast path for fused commands unless we have mainstream in-kernel
users, and that is not at all the case for the foreseeable future.

_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH V2 0/2] nvme: Support for fused NVME_IOCTL_SUBMIT_IO
  2021-02-09  3:12             ` Keith Busch
@ 2021-02-09 15:24               ` Bart Van Assche
  2021-02-09 15:38               ` Clay Mayers
  1 sibling, 0 replies; 22+ messages in thread
From: Bart Van Assche @ 2021-02-09 15:24 UTC (permalink / raw)
  To: Keith Busch, Clay Mayers
  Cc: Jens Axboe, linux-nvme, Sagi Grimberg, Chaitanya Kulkarni,
	Christoph Hellwig

On 2/8/21 7:12 PM, Keith Busch wrote:
> On Tue, Feb 09, 2021 at 12:53:17AM +0000, Clay Mayers wrote:
>> Is there any other feedback on V2?
>>
>> The main concern I have about my implementation is how fused requests
>> are tunneled through the mq request layer.  The 1st request is marked as
>> started, but it won't be in the device until the 2nd command is queued.  As
>> Keith pointed out, a device reset can split the two, so care must be taken
>> to handle this case correctly.  Despite this, I thought this was a better
>> approach than modifying mq requests to be fused, especially given
>> Christoph's concern of cost vs. value.  This is the lightest touch I could
>> come up with.
>>
>> Further consideration of this patch may need a more compelling use case.
>> I've worked on a proprietary storage system that relied on fused NVMeOF
>> support, so it seems compelling to me.  There's a comment in target/core.c
>> that there is "no support for fused commands yet", implying it's been
>> considered.  Is pci-only support for fused commands too soon or too little?
>> What would make it more compelling?
> 
> The complications it introduces to the IO path and error handling for an
> archaic feature have me on the "Nak" side. NVMeOF was introduced well after
> the spec defined Reservations, and the kernel has supported that capability
> for many years. I'm not aware of any other use case for fused commands, so
> it appears to be dead weight in the spec.

Hi Keith,

Do you agree that NVMe persistent reservation commands apply to an NVMe
namespace in its entirety, while fused compare-and-write commands make it
possible to implement locking for subsets of the LBA range of a namespace?
In other words, I think there is a valid use case for fused commands.
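
To make the granularity difference concrete, here is a minimal sketch
using the existing block-layer persistent reservation ioctls from
linux/pr.h; the reservation necessarily covers every LBA in the
namespace:

#include <sys/ioctl.h>
#include <linux/pr.h>

/* Register a key, then take a write-exclusive reservation.  The
 * reservation applies to the entire namespace, so two nodes cannot use
 * this to lock disjoint LBA ranges concurrently. */
static int reserve_whole_namespace(int fd, __u64 key)
{
	struct pr_registration reg = { .old_key = 0, .new_key = key };
	struct pr_reservation  res = { .key = key,
				       .type = PR_WRITE_EXCLUSIVE };

	if (ioctl(fd, IOC_PR_REGISTER, &reg) < 0)
		return -1;
	return ioctl(fd, IOC_PR_RESERVE, &res);
}

A fused compare/write, by contrast, acts as an atomic test & set on only
the LBAs it touches, so independent ranges can be locked and released
independently.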

Thanks,

Bart.

_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 22+ messages in thread

* RE: [PATCH V2 0/2] nvme: Support for fused NVME_IOCTL_SUBMIT_IO
  2021-02-09  3:12             ` Keith Busch
  2021-02-09 15:24               ` Bart Van Assche
@ 2021-02-09 15:38               ` Clay Mayers
  1 sibling, 0 replies; 22+ messages in thread
From: Clay Mayers @ 2021-02-09 15:38 UTC (permalink / raw)
  To: Keith Busch
  Cc: Jens Axboe, Sagi Grimberg, Chaitanya Kulkarni, linux-nvme,
	Christoph Hellwig

> From: Keith Busch <kbusch@kernel.org>
> Sent: Monday, February 8, 2021 7:13 PM
> To: Clay Mayers <Clay.Mayers@kioxia.com>
> Cc: linux-nvme@lists.infradead.org; Jens Axboe <axboe@fb.com>; Chaitanya
> Kulkarni <Chaitanya.Kulkarni@wdc.com>; Christoph Hellwig <hch@lst.de>;
> Sagi Grimberg <sagi@grimberg.me>
> Subject: Re: [PATCH V2 0/2] nvme: Support for fused
> NVME_IOCTL_SUBMIT_IO
> 
> On Tue, Feb 09, 2021 at 12:53:17AM +0000, Clay Mayers wrote:
> > Is there any other feedback on V2?
> >
> > The main concern I have about my implementation is how fused requests
> > are tunneled through the mq request layer.  The 1st request is marked
> > as started, but it won't be in the device until the 2nd command is
> > queued.  As Keith pointed out, a device reset can split the two, so
> > care must be taken to handle this case correctly.  Despite this, I
> > thought this was a better approach than modifying mq requests to be
> > fused, especially given Christoph's concern of cost vs. value.  This is
> > the lightest touch I could come up with.
> >
> > Further consideration of this patch may need a more compelling use case.
> > I've worked on a proprietary storage system that relied on fused
> > NVMeOF support, so it seems compelling to me.  There's a comment in
> > target/core.c that there is "no support for fused commands yet",
> > implying it's been considered.  Is pci-only support for fused commands
> > too soon or too little?  What would make it more compelling?
> 
> The complications it introduces to the IO path and error handling for an
> archaic feature have me on the "Nak" side. NVMeOF was introduced well after
> the spec defined Reservations, and the kernel has supported that capability
> for many years. I'm not aware of any other use case for fused commands, so
> it appears to be dead weight in the spec.

Thanks for sharing your perspective.  I can see the point that reservations
are good enough for many.  For VMware, moving from reservations to cmp/write
on SCSI led to an increase in performance (I couldn't find an actual number,
but heard 30% anecdotally).  That's a specialized use case (storage for VMs)
on a much slower medium that may not translate to more general storage
systems or much faster storage.


_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 22+ messages in thread

* RE: [PATCH V2 0/2] nvme: Support for fused NVME_IOCTL_SUBMIT_IO
  2021-02-09  7:54             ` Christoph Hellwig
@ 2021-02-09 15:53               ` Clay Mayers
  0 siblings, 0 replies; 22+ messages in thread
From: Clay Mayers @ 2021-02-09 15:53 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Keith Busch, Jens Axboe, Chaitanya Kulkarni, linux-nvme, Sagi Grimberg

> From: Christoph Hellwig <hch@lst.de>
> Sent: Monday, February 8, 2021 11:55 PM
> To: Clay Mayers <Clay.Mayers@kioxia.com>
> Cc: linux-nvme@lists.infradead.org; Keith Busch <kbusch@kernel.org>; Jens
> Axboe <axboe@fb.com>; Chaitanya Kulkarni
> <Chaitanya.Kulkarni@wdc.com>; Christoph Hellwig <hch@lst.de>; Sagi
> Grimberg <sagi@grimberg.me>
> Subject: Re: [PATCH V2 0/2] nvme: Support for fused
> NVME_IOCTL_SUBMIT_IO
> 
> On Tue, Feb 09, 2021 at 12:53:17AM +0000, Clay Mayers wrote:
> > > From: Clay Mayers
> > > Sent: Tuesday, January 26, 2021 1:14 PM
> > > To: linux-nvme@lists.infradead.org
> > > Cc: Keith Busch <kbusch@kernel.org>; Jens Axboe <axboe@fb.com>;
> > > 'Chaitanya Kulkarni' <Chaitanya.Kulkarni@wdc.com>; Christoph Hellwig
> > > <hch@lst.de>; Sagi Grimberg <sagi@grimberg.me>
> > > Subject: RE: [PATCH V2 0/2] nvme: Support for fused
> > > NVME_IOCTL_SUBMIT_IO
> > >
> > Is there any other feedback on V2?
> 
> My feedback is that we should not add code to the kernel driver and our fast
> path for fused commands unless we have mainstream in-kernel users, and that
> is not at all the case for the foreseeable future.

Thanks - that makes sense.  I know the patch isn't enough to be very useful on
its own.  I shared it at this stage because of my concern over the implementation.
Thanks to the feedback from the list, it is better than it was.

_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2021-02-09 15:53 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-01-05 22:49 [PATCH 0/2] nvme: Support for fused NVME_IOCTL_SUBMIT_IO klayph
2021-01-05 22:49 ` [PATCH 1/2] nvme: support fused nvme requests klayph
2021-01-05 23:52   ` Keith Busch
2021-01-06 14:55     ` Clay Mayers
2021-01-06  0:35   ` James Smart
2021-01-06 15:01     ` Clay Mayers
2021-01-06  7:59   ` Christoph Hellwig
2021-01-25 19:58   ` [PATCH V2 0/2] nvme: Support for fused NVME_IOCTL_SUBMIT_IO clay.mayers
2021-01-26  1:43     ` Chaitanya Kulkarni
2021-01-26 18:17       ` Clay Mayers
2021-01-26 19:00         ` Chaitanya Kulkarni
2021-01-26 21:14           ` Clay Mayers
2021-02-09  0:53           ` Clay Mayers
2021-02-09  3:12             ` Keith Busch
2021-02-09 15:24               ` Bart Van Assche
2021-02-09 15:38               ` Clay Mayers
2021-02-09  7:54             ` Christoph Hellwig
2021-02-09 15:53               ` Clay Mayers
2021-01-25 19:58   ` [PATCH V2 1/2] nvme: support fused pci nvme requests clay.mayers
2021-01-25 19:58   ` [PATCH V2 2/2] nvme: support fused NVME_IOCTL_SUBMIT_IO clay.mayers
2021-01-05 22:49 ` [PATCH " klayph
2021-01-05 23:04 ` [PATCH 0/2] nvme: Support for " James Smart
