* [PATCH 0/3] nvme: protect against possible request reference after completion
@ 2021-05-17 17:59 Sagi Grimberg
  2021-05-17 17:59 ` [PATCH 1/3] nvme-tcp: don't check blk_mq_tag_to_rq when receiving pdu data Sagi Grimberg
  ` (3 more replies)
  0 siblings, 4 replies; 16+ messages in thread
From: Sagi Grimberg @ 2021-05-17 17:59 UTC (permalink / raw)
To: linux-nvme, Christoph Hellwig, Keith Busch; +Cc: Daniel Wagner

Nothing in nvme protects against referencing a request after it has been
completed. For example, if a buggy controller sends a completion twice
for the same request, the host can access and modify a request that was
already completed. At best this will cause a panic; in the worst case it
can cause silent data corruption if the request was already reused and
executed by the time we reference it.

The nvme command_id is an opaque value in which we have simply placed
the request tag thus far. To protect against access after completion, we
introduce a generation counter in the upper 4 bits of the command_id
that is incremented on every invocation and validated upon reception of
a completion. This limits the effective maximum queue depth to 4095, but
we hardly ever use such long queues (in fabrics the maximum is already
1024).

Feedback is welcome.
Sagi Grimberg (3):
  nvme-tcp: don't check blk_mq_tag_to_rq when receiving pdu data
  nvme-pci: limit maximum queue depth to 4095
  nvme: code command_id with a genctr for use-after-free validation

 drivers/nvme/host/core.c   |  3 ++-
 drivers/nvme/host/nvme.h   | 47 +++++++++++++++++++++++++++++++++++++-
 drivers/nvme/host/pci.c    |  7 +++---
 drivers/nvme/host/rdma.c   |  4 ++--
 drivers/nvme/host/tcp.c    | 32 +++++++++++---------------
 drivers/nvme/target/loop.c |  4 ++--
 6 files changed, 69 insertions(+), 28 deletions(-)

-- 
2.27.0

_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 16+ messages in thread
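The command_id encoding described in the cover letter can be modeled in plain userspace C. This is an illustration only — the macro names mirror the patch, but it is not the kernel code itself:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Userspace model of the proposed command_id layout:
 * bits 15..12 carry a 4-bit generation counter, bits 11..0 the blk-mq tag.
 */
#define GENCTR_MASK(gen)    ((gen) & 0xf)
#define CID_FROM(gen, tag)  ((uint16_t)((GENCTR_MASK(gen) << 12) | ((tag) & 0xfff)))
#define GEN_FROM_CID(cid)   (((cid) & 0xf000) >> 12)
#define TAG_FROM_CID(cid)   ((cid) & 0xfff)
```

Decoding recovers both fields from a completion's command_id; a completion whose recovered generation disagrees with the request's current generation can then be rejected instead of dereferenced.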
* [PATCH 1/3] nvme-tcp: don't check blk_mq_tag_to_rq when receiving pdu data
  2021-05-17 17:59 [PATCH 0/3] nvme: protect against possible request reference after completion Sagi Grimberg
@ 2021-05-17 17:59 ` Sagi Grimberg
  2021-05-18  6:59   ` Christoph Hellwig
  3 siblings, 1 reply; 16+ messages in thread
From: Sagi Grimberg @ 2021-05-17 17:59 UTC (permalink / raw)
To: linux-nvme, Christoph Hellwig, Keith Busch; +Cc: Daniel Wagner

We already validate the command_id when receiving the c2hdata pdu
header, and it does not change afterwards, so this check is redundant.

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
---
 drivers/nvme/host/tcp.c | 6 ------
 1 file changed, 6 deletions(-)

diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index 6bd5b281c818..1d184c06c575 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -699,12 +699,6 @@ static int nvme_tcp_recv_data(struct nvme_tcp_queue *queue, struct sk_buff *skb,
 	struct request *rq;
 
 	rq = blk_mq_tag_to_rq(nvme_tcp_tagset(queue), pdu->command_id);
-	if (!rq) {
-		dev_err(queue->ctrl->ctrl.device,
-			"queue %d tag %#x not found\n",
-			nvme_tcp_queue_id(queue), pdu->command_id);
-		return -ENOENT;
-	}
 	req = blk_mq_rq_to_pdu(rq);
 
 	while (true) {
-- 
2.27.0
* Re: [PATCH 1/3] nvme-tcp: don't check blk_mq_tag_to_rq when receiving pdu data
  2021-05-17 17:59 ` [PATCH 1/3] nvme-tcp: don't check blk_mq_tag_to_rq when receiving pdu data Sagi Grimberg
@ 2021-05-18  6:59   ` Christoph Hellwig
  0 siblings, 0 replies; 16+ messages in thread
From: Christoph Hellwig @ 2021-05-18 6:59 UTC (permalink / raw)
To: Sagi Grimberg; +Cc: linux-nvme, Christoph Hellwig, Keith Busch, Daniel Wagner

> +++ b/drivers/nvme/host/tcp.c
> @@ -699,12 +699,6 @@ static int nvme_tcp_recv_data(struct nvme_tcp_queue *queue, struct sk_buff *skb,
>  	struct request *rq;
> 
>  	rq = blk_mq_tag_to_rq(nvme_tcp_tagset(queue), pdu->command_id);
> -	if (!rq) {
> -		dev_err(queue->ctrl->ctrl.device,
> -			"queue %d tag %#x not found\n",
> -			nvme_tcp_queue_id(queue), pdu->command_id);
> -		return -ENOENT;
> -	}
>  	req = blk_mq_rq_to_pdu(rq);

We could simplify the initialization a bit now:

	struct request *rq =
		blk_mq_tag_to_rq(nvme_tcp_tagset(queue), pdu->command_id);
	struct nvme_tcp_request *req = blk_mq_rq_to_pdu(rq);
* [PATCH 2/3] nvme-pci: limit maximum queue depth to 4095
  2021-05-17 17:59 [PATCH 0/3] nvme: protect against possible request reference after completion Sagi Grimberg
@ 2021-05-17 17:59 ` Sagi Grimberg
  2021-05-18  7:01   ` Christoph Hellwig
  2 subsequent siblings, 1 reply; 16+ messages in thread
From: Sagi Grimberg @ 2021-05-17 17:59 UTC (permalink / raw)
To: linux-nvme, Christoph Hellwig, Keith Busch; +Cc: Daniel Wagner

We are going to use the upper 4 bits of the command_id for a generation
counter, so enforce the new queue depth upper limit.

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
---
 drivers/nvme/host/pci.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index a29b170701fc..bc64282b2301 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -60,6 +60,7 @@ MODULE_PARM_DESC(sgl_threshold,
 		"Use SGLs when average request segment size is larger or equal to "
 		"this size. Use 0 to disable SGLs.");
 
+#define NVME_PCI_MAX_QUEUE_SIZE 4095
 static int io_queue_depth_set(const char *val, const struct kernel_param *kp);
 static const struct kernel_param_ops io_queue_depth_ops = {
 	.set = io_queue_depth_set,
@@ -68,7 +69,7 @@ static const struct kernel_param_ops io_queue_depth_ops = {
 
 static unsigned int io_queue_depth = 1024;
 module_param_cb(io_queue_depth, &io_queue_depth_ops, &io_queue_depth, 0644);
-MODULE_PARM_DESC(io_queue_depth, "set io queue depth, should >= 2");
+MODULE_PARM_DESC(io_queue_depth, "set io queue depth, should >= 2 and < 4096");
 
 static int io_queue_count_set(const char *val, const struct kernel_param *kp)
 {
@@ -161,7 +162,7 @@ static int io_queue_depth_set(const char *val, const struct kernel_param *kp)
 	u32 n;
 
 	ret = kstrtou32(val, 10, &n);
-	if (ret != 0 || n < 2)
+	if (ret != 0 || n < 2 || n > NVME_PCI_MAX_QUEUE_SIZE)
 		return -EINVAL;
 
 	return param_set_uint(val, kp);
-- 
2.27.0
* Re: [PATCH 2/3] nvme-pci: limit maximum queue depth to 4095
  2021-05-17 17:59 ` [PATCH 2/3] nvme-pci: limit maximum queue depth to 4095 Sagi Grimberg
@ 2021-05-18  7:01   ` Christoph Hellwig
  0 siblings, 0 replies; 16+ messages in thread
From: Christoph Hellwig @ 2021-05-18 7:01 UTC (permalink / raw)
To: Sagi Grimberg; +Cc: linux-nvme, Christoph Hellwig, Keith Busch, Daniel Wagner

Can we lift param_set_uint_minmax from sunrpc to common code and use
that?
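At this point param_set_uint_minmax lives in sunrpc; its semantics — parse an unsigned integer and reject anything outside [min, max] — can be sketched in userspace C. This is a hypothetical model for illustration, not the kernel helper itself (the function name and signature here are made up):

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/*
 * Userspace sketch of param_set_uint_minmax-style range checking:
 * parse a decimal string, fail with -EINVAL unless min <= n <= max.
 */
static int parse_uint_minmax(const char *val, unsigned int min,
			     unsigned int max, unsigned int *out)
{
	char *end;
	unsigned long n;

	if (!val)
		return -EINVAL;
	errno = 0;
	n = strtoul(val, &end, 10);
	if (errno || end == val || *end || n > UINT_MAX)
		return -EINVAL;
	if (n < min || n > max)
		return -EINVAL;
	*out = (unsigned int)n;
	return 0;
}
```

With such a helper, the open-coded `n < 2 || n > NVME_PCI_MAX_QUEUE_SIZE` check in io_queue_depth_set collapses into a single call with min=2 and max=4095.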
* [PATCH 3/3] nvme: code command_id with a genctr for use-after-free validation
  2021-05-17 17:59 [PATCH 0/3] nvme: protect against possible request reference after completion Sagi Grimberg
@ 2021-05-17 17:59 ` Sagi Grimberg
  2021-05-17 19:04   ` Keith Busch
  2021-05-17 19:09   ` Bart Van Assche
  3 siblings, 2 replies; 16+ messages in thread
From: Sagi Grimberg @ 2021-05-17 17:59 UTC (permalink / raw)
To: linux-nvme, Christoph Hellwig, Keith Busch; +Cc: Daniel Wagner

We cannot detect a (perhaps buggy) controller that is sending us a
completion for a request that was already completed (for example,
sending a completion twice); this phenomenon has been seen in the wild
a few times.

To protect against this, we use the upper 4 bits of the nvme sqe
command_id as a generation counter and verify that it matches the
request's current generation, which is incremented on every execution.

The 16-bit command_id is now constructed as:

	| xxxx | xxxxxxxxxxxx |
	  gen    request tag

This means we give up some possible queue depth, as 12 bits allow a
maximum queue depth of 4095 instead of 65536; however, we never create
such long queues anyway, so no real harm is done.
Suggested-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
---
 drivers/nvme/host/core.c   |  3 ++-
 drivers/nvme/host/nvme.h   | 47 +++++++++++++++++++++++++++++++++++++-
 drivers/nvme/host/pci.c    |  2 +-
 drivers/nvme/host/rdma.c   |  4 ++--
 drivers/nvme/host/tcp.c    | 26 ++++++++++-----------
 drivers/nvme/target/loop.c |  4 ++--
 6 files changed, 66 insertions(+), 20 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index e6612971f4eb..8308dd1e9e3f 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -1006,7 +1006,8 @@ blk_status_t nvme_setup_cmd(struct nvme_ns *ns, struct request *req)
 		return BLK_STS_IOERR;
 	}
 
-	cmd->common.command_id = req->tag;
+	nvme_req(req)->genctr++;
+	cmd->common.command_id = nvme_cid(req);
 	trace_nvme_setup_cmd(req, cmd);
 	return ret;
 }
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 05f31a2c64bb..e6f72c82582d 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -158,6 +158,7 @@ enum nvme_quirks {
 struct nvme_request {
 	struct nvme_command	*cmd;
 	union nvme_result	result;
+	u8			genctr;
 	u8			retries;
 	u8			flags;
 	u16			status;
@@ -497,6 +498,49 @@ struct nvme_ctrl_ops {
 	int (*get_address)(struct nvme_ctrl *ctrl, char *buf, int size);
 };
 
+/*
+ * nvme command_id is constructed as such:
+ * | xxxx | xxxxxxxxxxxx |
+ *   gen    request tag
+ */
+#define nvme_genctr_mask(gen)		(gen & 0xf)
+#define nvme_cid_install_genctr(gen)	(nvme_genctr_mask(gen) << 12)
+#define nvme_genctr_from_cid(cid)	((cid & 0xf000) >> 12)
+#define nvme_tag_from_cid(cid)		(cid & 0xfff)
+
+static inline u16 nvme_cid(struct request *rq)
+{
+	return nvme_cid_install_genctr(nvme_req(rq)->genctr) | rq->tag;
+}
+
+static inline struct request *nvme_find_rq(struct blk_mq_tags *tags,
+		u16 command_id)
+{
+	u8 genctr = nvme_genctr_from_cid(command_id);
+	u16 tag = nvme_tag_from_cid(command_id);
+	struct request *rq;
+
+	rq = blk_mq_tag_to_rq(tags, tag);
+	if (unlikely(!rq)) {
+		pr_err("could not locate request for tag %#x\n",
+			tag);
+		return NULL;
+	}
+	if (unlikely(nvme_genctr_mask(nvme_req(rq)->genctr) != genctr)) {
+		dev_err(nvme_req(rq)->ctrl->device,
+			"request %#x genctr mismatch (got %#x expected %#x)\n",
+			tag, genctr, nvme_req(rq)->genctr);
+		return NULL;
+	}
+	return rq;
+}
+
+static inline struct request *nvme_cid_to_rq(struct blk_mq_tags *tags,
+		u16 command_id)
+{
+	return blk_mq_tag_to_rq(tags, nvme_tag_from_cid(command_id));
+}
+
 #ifdef CONFIG_FAULT_INJECTION_DEBUG_FS
 void nvme_fault_inject_init(struct nvme_fault_inject *fault_inj,
 			    const char *dev_name);
@@ -594,7 +638,8 @@ static inline void nvme_put_ctrl(struct nvme_ctrl *ctrl)
 
 static inline bool nvme_is_aen_req(u16 qid, __u16 command_id)
 {
-	return !qid && command_id >= NVME_AQ_BLK_MQ_DEPTH;
+	return !qid &&
+		nvme_tag_from_cid(command_id) >= NVME_AQ_BLK_MQ_DEPTH;
 }
 
 void nvme_complete_rq(struct request *req);
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index bc64282b2301..0c1ea6e521d1 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -1018,7 +1018,7 @@ static inline void nvme_handle_cqe(struct nvme_queue *nvmeq, u16 idx)
 		return;
 	}
 
-	req = blk_mq_tag_to_rq(nvme_queue_tagset(nvmeq), command_id);
+	req = nvme_find_rq(nvme_queue_tagset(nvmeq), command_id);
 	if (unlikely(!req)) {
 		dev_warn(nvmeq->dev->ctrl.device,
 			"invalid id %d completed on queue %d\n",
diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index 203b47a8ec92..ab5b7d175488 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -1727,10 +1727,10 @@ static void nvme_rdma_process_nvme_rsp(struct nvme_rdma_queue *queue,
 	struct request *rq;
 	struct nvme_rdma_request *req;
 
-	rq = blk_mq_tag_to_rq(nvme_rdma_tagset(queue), cqe->command_id);
+	rq = nvme_find_rq(nvme_rdma_tagset(queue), cqe->command_id);
 	if (!rq) {
 		dev_err(queue->ctrl->ctrl.device,
-			"tag 0x%x on QP %#x not found\n",
+			"got bad command_id %#x on QP %#x\n",
 			cqe->command_id, queue->qp->qp_num);
 		nvme_rdma_error_recovery(queue->ctrl);
 		return;
diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index 1d184c06c575..e6bd74cabc6d 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -487,11 +487,11 @@ static int nvme_tcp_process_nvme_cqe(struct nvme_tcp_queue *queue,
 {
 	struct request *rq;
 
-	rq = blk_mq_tag_to_rq(nvme_tcp_tagset(queue), cqe->command_id);
+	rq = nvme_find_rq(nvme_tcp_tagset(queue), cqe->command_id);
 	if (!rq) {
 		dev_err(queue->ctrl->ctrl.device,
-			"queue %d tag 0x%x not found\n",
-			nvme_tcp_queue_id(queue), cqe->command_id);
+			"got bad cqe.command_id %#x on queue %d\n",
+			cqe->command_id, nvme_tcp_queue_id(queue));
 		nvme_tcp_error_recovery(&queue->ctrl->ctrl);
 		return -EINVAL;
 	}
@@ -508,11 +508,11 @@ static int nvme_tcp_handle_c2h_data(struct nvme_tcp_queue *queue,
 {
 	struct request *rq;
 
-	rq = blk_mq_tag_to_rq(nvme_tcp_tagset(queue), pdu->command_id);
+	rq = nvme_find_rq(nvme_tcp_tagset(queue), pdu->command_id);
 	if (!rq) {
 		dev_err(queue->ctrl->ctrl.device,
-			"queue %d tag %#x not found\n",
-			nvme_tcp_queue_id(queue), pdu->command_id);
+			"got bad c2hdata.command_id %#x on queue %d\n",
+			pdu->command_id, nvme_tcp_queue_id(queue));
 		return -ENOENT;
 	}
 
@@ -599,7 +599,7 @@ static int nvme_tcp_setup_h2c_data_pdu(struct nvme_tcp_request *req,
 	data->hdr.plen =
 		cpu_to_le32(data->hdr.hlen + hdgst + req->pdu_len + ddgst);
 	data->ttag = pdu->ttag;
-	data->command_id = rq->tag;
+	data->command_id = nvme_cid(rq);
 	data->data_offset = cpu_to_le32(req->data_sent);
 	data->data_length = cpu_to_le32(req->pdu_len);
 	return 0;
@@ -612,11 +612,11 @@ static int nvme_tcp_handle_r2t(struct nvme_tcp_queue *queue,
 	struct request *rq;
 	int ret;
 
-	rq = blk_mq_tag_to_rq(nvme_tcp_tagset(queue), pdu->command_id);
+	rq = nvme_find_rq(nvme_tcp_tagset(queue), pdu->command_id);
 	if (!rq) {
 		dev_err(queue->ctrl->ctrl.device,
-			"queue %d tag %#x not found\n",
-			nvme_tcp_queue_id(queue), pdu->command_id);
+			"got bad r2t.command_id %#x on queue %d\n",
+			pdu->command_id, nvme_tcp_queue_id(queue));
 		return -ENOENT;
 	}
 	req = blk_mq_rq_to_pdu(rq);
@@ -698,7 +698,7 @@ static int nvme_tcp_recv_data(struct nvme_tcp_queue *queue, struct sk_buff *skb,
 	struct nvme_tcp_request *req;
 	struct request *rq;
 
-	rq = blk_mq_tag_to_rq(nvme_tcp_tagset(queue), pdu->command_id);
+	rq = nvme_cid_to_rq(nvme_tcp_tagset(queue), pdu->command_id);
 	req = blk_mq_rq_to_pdu(rq);
 
 	while (true) {
@@ -791,8 +791,8 @@ static int nvme_tcp_recv_ddgst(struct nvme_tcp_queue *queue,
 	}
 
 	if (pdu->hdr.flags & NVME_TCP_F_DATA_SUCCESS) {
-		struct request *rq = blk_mq_tag_to_rq(nvme_tcp_tagset(queue),
-				pdu->command_id);
+		struct request *rq = nvme_cid_to_rq(nvme_tcp_tagset(queue),
+				pdu->command_id);
 
 		nvme_tcp_end_request(rq, NVME_SC_SUCCESS);
 		queue->nr_cqe++;
diff --git a/drivers/nvme/target/loop.c b/drivers/nvme/target/loop.c
index 1b89a6bb819a..9f1f5d572960 100644
--- a/drivers/nvme/target/loop.c
+++ b/drivers/nvme/target/loop.c
@@ -107,10 +107,10 @@ static void nvme_loop_queue_response(struct nvmet_req *req)
 	} else {
 		struct request *rq;
 
-		rq = blk_mq_tag_to_rq(nvme_loop_tagset(queue), cqe->command_id);
+		rq = nvme_find_rq(nvme_loop_tagset(queue), cqe->command_id);
 		if (!rq) {
 			dev_err(queue->ctrl->ctrl.device,
-				"tag 0x%x on queue %d not found\n",
+				"got bad command_id %#x on queue %d\n",
 				cqe->command_id, nvme_loop_queue_idx(queue));
 			return;
 		}
-- 
2.27.0
* Re: [PATCH 3/3] nvme: code command_id with a genctr for use-after-free validation
  2021-05-17 17:59 ` [PATCH 3/3] nvme: code command_id with a genctr for use-after-free validation Sagi Grimberg
@ 2021-05-17 19:04   ` Keith Busch
  2021-05-17 20:23     ` Sagi Grimberg
  1 sibling, 1 reply; 16+ messages in thread
From: Keith Busch @ 2021-05-17 19:04 UTC (permalink / raw)
To: Sagi Grimberg; +Cc: linux-nvme, Christoph Hellwig, Daniel Wagner

On Mon, May 17, 2021 at 10:59:55AM -0700, Sagi Grimberg wrote:
> +static inline struct request *nvme_find_rq(struct blk_mq_tags *tags,
> +		u16 command_id)
> +{
> +	u8 genctr = nvme_genctr_from_cid(command_id);
> +	u16 tag = nvme_tag_from_cid(command_id);
> +	struct request *rq;
> +
> +	rq = blk_mq_tag_to_rq(tags, tag);
> +	if (unlikely(!rq)) {
> +		pr_err("could not locate request for tag %#x\n",
> +			tag);
> +		return NULL;
> +	}
> +	if (unlikely(nvme_genctr_mask(nvme_req(rq)->genctr) != genctr)) {
> +		dev_err(nvme_req(rq)->ctrl->device,
> +			"request %#x genctr mismatch (got %#x expected %#x)\n",
> +			tag, genctr, nvme_req(rq)->genctr);
> +		return NULL;
> +	}

Should we also check for blk_mq_request_started(rq) here? It doesn't look
like the genctr is sufficient to detect a double completion of the same
command_id if the driver hasn't had a chance to advance the genctr for a
new request.

> +	return rq;
> +}
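Keith's concern can be illustrated with a small userspace model (hypothetical, for illustration only): the generation only advances at setup time, so a duplicate completion that arrives before the tag is reused still carries a matching genctr and passes the check — only a separate "started" state (what blk_mq_request_started() reports in the kernel) exposes it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Minimal model of one tag slot: the generation advances at setup only. */
struct slot { uint8_t genctr; bool started; };

static uint16_t submit(struct slot *s, uint16_t tag)
{
	s->genctr++;	/* as in nvme_setup_cmd() */
	s->started = true;
	return (uint16_t)(((s->genctr & 0xf) << 12) | (tag & 0xfff));
}

/* Returns true if the completion passes the genctr check alone. */
static bool complete(struct slot *s, uint16_t cid)
{
	if ((s->genctr & 0xf) != ((cid & 0xf000) >> 12))
		return false;	/* stale generation: rejected */
	s->started = false;	/* request done, tag free for reuse */
	return true;
}
```

A duplicate completion with the same cid, delivered before the tag is reissued, passes `complete()` a second time; checking the started flag in addition would catch exactly that window.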
* Re: [PATCH 3/3] nvme: code command_id with a genctr for use-after-free validation
  2021-05-17 19:04 ` Keith Busch
@ 2021-05-17 20:23   ` Sagi Grimberg
  0 siblings, 0 replies; 16+ messages in thread
From: Sagi Grimberg @ 2021-05-17 20:23 UTC (permalink / raw)
To: Keith Busch; +Cc: linux-nvme, Christoph Hellwig, Daniel Wagner

>> +static inline struct request *nvme_find_rq(struct blk_mq_tags *tags,
>> +		u16 command_id)
>> +{
>> +	u8 genctr = nvme_genctr_from_cid(command_id);
>> +	u16 tag = nvme_tag_from_cid(command_id);
>> +	struct request *rq;
>> +
>> +	rq = blk_mq_tag_to_rq(tags, tag);
>> +	if (unlikely(!rq)) {
>> +		pr_err("could not locate request for tag %#x\n",
>> +			tag);
>> +		return NULL;
>> +	}
>> +	if (unlikely(nvme_genctr_mask(nvme_req(rq)->genctr) != genctr)) {
>> +		dev_err(nvme_req(rq)->ctrl->device,
>> +			"request %#x genctr mismatch (got %#x expected %#x)\n",
>> +			tag, genctr, nvme_req(rq)->genctr);
>> +		return NULL;
>> +	}
> 
> Should we also check for blk_mq_request_started(rq) here? It doesn't look
> like the genctr is sufficient to detect a double completion of the same
> command_id if the driver hasn't had a chance to advance the genctr for a
> new request.

We could. Maybe incrementing the counter again on completion would be
better?
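The variant Sagi suggests — bumping the counter again at completion time — can be sketched by extending the same toy model (again hypothetical, not the kernel code): the duplicate completion then fails the generation check.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct slot2 { uint8_t genctr; };

static uint16_t submit2(struct slot2 *s, uint16_t tag)
{
	s->genctr++;	/* advance at setup, as before */
	return (uint16_t)(((s->genctr & 0xf) << 12) | (tag & 0xfff));
}

static bool complete2(struct slot2 *s, uint16_t cid)
{
	if ((s->genctr & 0xf) != ((cid & 0xf000) >> 12))
		return false;	/* generation mismatch: rejected */
	s->genctr++;		/* advance again on completion */
	return true;
}
```

With the extra increment, the slot's generation no longer matches the in-flight cid once the request completes, so a replayed completion for the same command_id is rejected even before the tag is reissued.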
* Re: [PATCH 3/3] nvme: code command_id with a genctr for use-after-free validation
  2021-05-17 17:59 ` [PATCH 3/3] nvme: code command_id with a genctr for use-after-free validation Sagi Grimberg
  2021-05-17 19:04   ` Keith Busch
@ 2021-05-17 19:09   ` Bart Van Assche
  2021-05-17 19:46     ` Keith Busch
  1 sibling, 1 reply; 16+ messages in thread
From: Bart Van Assche @ 2021-05-17 19:09 UTC (permalink / raw)
To: Sagi Grimberg, linux-nvme, Christoph Hellwig, Keith Busch; +Cc: Daniel Wagner

On 5/17/21 10:59 AM, Sagi Grimberg wrote:
> We cannot detect a (perhaps buggy) controller that is sending us
> a completion for a request that was already completed (for example
> sending a completion twice), this phenomenon was seen in the wild
> a few times.
> 
> So to protect against this, we use the upper 4 msbits of the nvme sqe
> command_id to use as a 4-bit generation counter and verify it matches
> the existing request generation that is incrementing on every execution.
> 
> The 16-bit command_id structure now is constructed by:
> | xxxx | xxxxxxxxxxxx |
>   gen    request tag
> 
> This means that we are giving up some possible queue depth as 12 bits
> allow for a maximum queue depth of 4095 instead of 65536, however we
> never create such long queues anyways so no real harm done.

Is a four bit generation counter sufficient? Shouldn't such a counter be
at least 32 bits to provide a reasonable protection against e.g.
duplicate packets generated by retry mechanisms in networking stacks?

Additionally, I do not agree with the statement "we never create such
long queues anyways". I have already done this myself.

Thanks,

Bart.
* Re: [PATCH 3/3] nvme: code command_id with a genctr for use-after-free validation
  2021-05-17 19:09 ` Bart Van Assche
@ 2021-05-17 19:46   ` Keith Busch
  2021-05-17 20:27     ` Sagi Grimberg
  2021-05-17 20:28     ` Bart Van Assche
  1 sibling, 2 replies; 16+ messages in thread
From: Keith Busch @ 2021-05-17 19:46 UTC (permalink / raw)
To: Bart Van Assche
Cc: Sagi Grimberg, linux-nvme, Christoph Hellwig, Daniel Wagner

On Mon, May 17, 2021 at 12:09:46PM -0700, Bart Van Assche wrote:
> Is a four bit generation counter sufficient? Shouldn't such a counter be
> at least 32 bits to provide a reasonable protection against e.g.
> duplicate packets generated by retry mechanisms in networking stacks?

It has to fit in the protocol's command identifier, and that's only 16
bits. Most of the bits go to the tag, so the implementation uses the
most available. More could be better, but that would require a spec
change.

> Additionally, I do not agree with the statement "we never create such
> long queues anyways". I have already done this myself.

Why? That won't improve bandwidth, and will increase latency. We already
have timeout problems with the current default 1k qdepth on some
devices.
* Re: [PATCH 3/3] nvme: code command_id with a genctr for use-after-free validation
  2021-05-17 19:46 ` Keith Busch
@ 2021-05-17 20:27   ` Sagi Grimberg
  0 siblings, 0 replies; 16+ messages in thread
From: Sagi Grimberg @ 2021-05-17 20:27 UTC (permalink / raw)
To: Keith Busch, Bart Van Assche; +Cc: linux-nvme, Christoph Hellwig, Daniel Wagner

>> Is a four bit generation counter sufficient? Shouldn't such a counter be
>> at least 32 bits to provide a reasonable protection against e.g.
>> duplicate packets generated by retry mechanisms in networking stacks?
> 
> It has to fit in the protocol's command identifier, and that's only 16
> bits. Most of the bits for the tag, so the implementation uses the most
> available. More could be better, but that would require a spec change.

Yes, and even if we could expand the genctr I don't think we should; we
may want to leave some bits for potential future usages. No matter how
many bits we allocate, we can't protect 100% against everything here.

>> Additionally, I do not agree with the statement "we never create such
>> long queues anyways". I have already done this myself.
> 
> Why? That won't improve bandwidth, and will increase latency. We already
> have timeout problems with the current default 1k qdepth on some
> devices.

Well, if there is a use-case that requires queues deeper than 4095
entries, we have no space in the command to protect against this... I
also find it surprising that there is a real benefit to such deep nvme
queues...
* Re: [PATCH 3/3] nvme: code command_id with a genctr for use-after-free validation
  2021-05-17 19:46 ` Keith Busch
  2021-05-17 20:27   ` Sagi Grimberg
@ 2021-05-17 20:28   ` Bart Van Assche
  2021-05-17 21:50     ` Sagi Grimberg
  1 sibling, 1 reply; 16+ messages in thread
From: Bart Van Assche @ 2021-05-17 20:28 UTC (permalink / raw)
To: Keith Busch; +Cc: Sagi Grimberg, linux-nvme, Christoph Hellwig, Daniel Wagner

On 5/17/21 12:46 PM, Keith Busch wrote:
> On Mon, May 17, 2021 at 12:09:46PM -0700, Bart Van Assche wrote:
>> Additionally, I do not agree with the statement "we never create such
>> long queues anyways". I have already done this myself.
> 
> Why? That won't improve bandwidth, and will increase latency. We already
> have timeout problems with the current default 1k qdepth on some
> devices.

For testing FPGA or ASIC solutions that support offloading NVMe it is
more convenient to use a single queue pair with a high queue depth than
creating multiple queue pairs that each have a lower queue depth.

Bart.
* Re: [PATCH 3/3] nvme: code command_id with a genctr for use-after-free validation
  2021-05-17 20:28 ` Bart Van Assche
@ 2021-05-17 21:50   ` Sagi Grimberg
  2021-05-17 22:06     ` Bart Van Assche
  0 siblings, 1 reply; 16+ messages in thread
From: Sagi Grimberg @ 2021-05-17 21:50 UTC (permalink / raw)
To: Bart Van Assche, Keith Busch; +Cc: linux-nvme, Christoph Hellwig, Daniel Wagner

>> On Mon, May 17, 2021 at 12:09:46PM -0700, Bart Van Assche wrote:
>>> Additionally, I do not agree with the statement "we never create such
>>> long queues anyways". I have already done this myself.
>> 
>> Why? That won't improve bandwidth, and will increase latency. We already
>> have timeout problems with the current default 1k qdepth on some
>> devices.
> 
> For testing FPGA or ASIC solutions that support offloading NVMe it is
> more convenient to use a single queue pair with a high queue depth than
> creating multiple queue pairs that each have a lower queue depth.

And you actually see a benefit for using queues that are >= 4096 in
depth? That is surprising to me...
* Re: [PATCH 3/3] nvme: code command_id with a genctr for use-after-free validation
  2021-05-17 21:50 ` Sagi Grimberg
@ 2021-05-17 22:06   ` Bart Van Assche
  2021-05-17 22:15     ` Sagi Grimberg
  0 siblings, 1 reply; 16+ messages in thread
From: Bart Van Assche @ 2021-05-17 22:06 UTC (permalink / raw)
To: Sagi Grimberg, Keith Busch; +Cc: linux-nvme, Christoph Hellwig, Daniel Wagner

On 5/17/21 2:50 PM, Sagi Grimberg wrote:
>>> For testing FPGA or ASIC solutions that support offloading NVMe it is
>>> more convenient to use a single queue pair with a high queue depth than
>>> creating multiple queue pairs that each have a lower queue depth.
> 
> And you actually see a benefit for using queues that are >= 4096 in
> depth? That is surprising to me...

Hi Sagi,

It seems like there is a misunderstanding. I'm not aware of any use case
where very high queue depths provide a performance benefit. Such high
queue depths are necessary to verify an implementation of an NVMe
controller that maintains state per NVMe command and to verify whether
the NVMe controller pauses fetching new NVMe commands if the internal
NVMe command buffer is full.

Bart.
* Re: [PATCH 3/3] nvme: code command_id with a genctr for use-after-free validation
  2021-05-17 22:06 ` Bart Van Assche
@ 2021-05-17 22:15   ` Sagi Grimberg
  0 siblings, 0 replies; 16+ messages in thread
From: Sagi Grimberg @ 2021-05-17 22:15 UTC (permalink / raw)
To: Bart Van Assche, Keith Busch; +Cc: linux-nvme, Christoph Hellwig, Daniel Wagner

> It seems like there is a misunderstanding. I'm not aware of any use case
> where very high queue depths provide a performance benefit. Such high
> queue depths are necessary to verify an implementation of an NVMe
> controller that maintains state per NVMe command and to verify whether
> the NVMe controller pauses fetching new NVMe commands if the internal
> NVMe command buffer is full.

I see, thanks for the clarification. However, I think a host may choose
not to support a full-blown 65535 queue depth, just like devices can
choose not to support it. As for testing your device, this will move
into the same category as all the other things Linux doesn't support:
it will need different host software to test.
* Re: [PATCH 0/3] nvme: protect against possible request reference after completion
  2021-05-17 17:59 [PATCH 0/3] nvme: protect against possible request reference after completion Sagi Grimberg
  ` (2 preceding siblings ...)
  2021-05-17 17:59 ` [PATCH 3/3] nvme: code command_id with a genctr for use-after-free validation Sagi Grimberg
@ 2021-05-17 18:47 ` Keith Busch
  3 siblings, 0 replies; 16+ messages in thread
From: Keith Busch @ 2021-05-17 18:47 UTC (permalink / raw)
To: Sagi Grimberg; +Cc: linux-nvme, Christoph Hellwig, Daniel Wagner

On Mon, May 17, 2021 at 10:59:52AM -0700, Sagi Grimberg wrote:
> Nothing in nvme protects against referencing a request after it was completed.
> For example, in case a buggy controller sends a completion twice for the same
> request, the host can access and modify a request that was already completed.
>
> At best, this will cause a panic, but in the worst case, it can cause silent
> data corruption if the request was already reused and executed by the time
> we reference it.
>
> The nvme command_id is an opaque value in which we have simply placed the
> request tag thus far. To protect against an access after completion, we
> introduce a generation counter in the upper 4 bits of the command_id that
> is incremented on every invocation and validated upon the reception of a
> completion. This limits the effective maximum queue depth to 4095, but we
> hardly ever use such long queues (in fabrics the maximum is already 1024).

This is a neat safeguard, even though we haven't seen much indication of
this type of controller bug occurring on PCIe. It looks pretty
light-weight, but I would like to see if this has a performance impact.
I'm still 3 weeks away from physical access to my site to set up a
performance test with my low-latency devices, though.

On patch 2, I think it's safe to just cap the queue depth to the new max
rather than return an error if the user requests more. 4k actually seems
like quite a lot there, too. 1k should be plenty, just like the fabrics
transports, and a 1k limit provides 2 more bits for the gen sequence.

^ permalink raw reply	[flat|nested] 16+ messages in thread
end of thread, other threads:[~2021-05-18  7:02 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-05-17 17:59 [PATCH 0/3] nvme: protect against possible request reference after completion Sagi Grimberg
2021-05-17 17:59 ` [PATCH 1/3] nvme-tcp: don't check blk_mq_tag_to_rq when receiving pdu data Sagi Grimberg
2021-05-18  6:59   ` Christoph Hellwig
2021-05-17 17:59 ` [PATCH 2/3] nvme-pci: limit maximum queue depth to 4095 Sagi Grimberg
2021-05-18  7:01   ` Christoph Hellwig
2021-05-17 17:59 ` [PATCH 3/3] nvme: code command_id with a genctr for use-after-free validation Sagi Grimberg
2021-05-17 19:04   ` Keith Busch
2021-05-17 20:23     ` Sagi Grimberg
2021-05-17 19:09   ` Bart Van Assche
2021-05-17 19:46     ` Keith Busch
2021-05-17 20:27       ` Sagi Grimberg
2021-05-17 20:28     ` Bart Van Assche
2021-05-17 21:50       ` Sagi Grimberg
2021-05-17 22:06         ` Bart Van Assche
2021-05-17 22:15           ` Sagi Grimberg
2021-05-17 18:47 ` [PATCH 0/3] nvme: protect against possible request reference after completion Keith Busch