* [PATCH 0/3] nvme: protect against possible request reference after completion
@ 2021-05-17 17:59 Sagi Grimberg
  2021-05-17 17:59 ` [PATCH 1/3] nvme-tcp: don't check blk_mq_tag_to_rq when receiving pdu data Sagi Grimberg
  ` (3 more replies)
  0 siblings, 4 replies; 16+ messages in thread
From: Sagi Grimberg @ 2021-05-17 17:59 UTC (permalink / raw)
To: linux-nvme, Christoph Hellwig, Keith Busch; +Cc: Daniel Wagner

Nothing in nvme protects against referencing a request after it has been
completed. For example, if a buggy controller sends a completion twice
for the same request, the host can access and modify a request that was
already completed. At best this will cause a panic; in the worst case it
can cause silent data corruption if the request was already reused and
executed by the time we reference it.

The nvme command_id is an opaque value in which we have simply placed
the request tag thus far. To protect against access after completion, we
introduce a generation counter in the upper 4 bits of the command_id
that is incremented on every invocation and validated upon reception of
a completion. This limits the effective maximum queue depth to 4095, but
we hardly ever use such long queues (in fabrics the maximum is already
1024).

Feedback is welcome.
Sagi Grimberg (3):
  nvme-tcp: don't check blk_mq_tag_to_rq when receiving pdu data
  nvme-pci: limit maximum queue depth to 4095
  nvme: code command_id with a genctr for use-after-free validation

 drivers/nvme/host/core.c   |  3 ++-
 drivers/nvme/host/nvme.h   | 47 +++++++++++++++++++++++++++++++++++++-
 drivers/nvme/host/pci.c    |  7 +++---
 drivers/nvme/host/rdma.c   |  4 ++--
 drivers/nvme/host/tcp.c    | 32 +++++++++++---------------
 drivers/nvme/target/loop.c |  4 ++--
 6 files changed, 69 insertions(+), 28 deletions(-)

-- 
2.27.0

_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 16+ messages in thread
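The command_id encoding described in the cover letter can be modeled in plain userspace C. This is an illustration only — the macro names mirror the patch, but it is not the kernel code itself:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Userspace model of the proposed command_id layout:
 * bits 15..12 carry a 4-bit generation counter, bits 11..0 the blk-mq tag.
 */
#define GENCTR_MASK(gen)    ((gen) & 0xf)
#define CID_FROM(gen, tag)  ((uint16_t)((GENCTR_MASK(gen) << 12) | ((tag) & 0xfff)))
#define GEN_FROM_CID(cid)   (((cid) & 0xf000) >> 12)
#define TAG_FROM_CID(cid)   ((cid) & 0xfff)
```

Decoding recovers both fields from a completion's command_id; a completion whose recovered generation disagrees with the request's current generation can then be rejected instead of dereferenced.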
* [PATCH 1/3] nvme-tcp: don't check blk_mq_tag_to_rq when receiving pdu data
  2021-05-17 17:59 [PATCH 0/3] nvme: protect against possible request reference after completion Sagi Grimberg
@ 2021-05-17 17:59 ` Sagi Grimberg
  2021-05-18  6:59   ` Christoph Hellwig
  3 siblings, 1 reply; 16+ messages in thread
From: Sagi Grimberg @ 2021-05-17 17:59 UTC (permalink / raw)
To: linux-nvme, Christoph Hellwig, Keith Busch; +Cc: Daniel Wagner

We already validate the command_id when receiving the c2hdata pdu
header, and it does not change afterwards, so this check is redundant.

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
---
 drivers/nvme/host/tcp.c | 6 ------
 1 file changed, 6 deletions(-)

diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index 6bd5b281c818..1d184c06c575 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -699,12 +699,6 @@ static int nvme_tcp_recv_data(struct nvme_tcp_queue *queue, struct sk_buff *skb,
 	struct request *rq;
 
 	rq = blk_mq_tag_to_rq(nvme_tcp_tagset(queue), pdu->command_id);
-	if (!rq) {
-		dev_err(queue->ctrl->ctrl.device,
-			"queue %d tag %#x not found\n",
-			nvme_tcp_queue_id(queue), pdu->command_id);
-		return -ENOENT;
-	}
 	req = blk_mq_rq_to_pdu(rq);
 
 	while (true) {
-- 
2.27.0
* Re: [PATCH 1/3] nvme-tcp: don't check blk_mq_tag_to_rq when receiving pdu data
  2021-05-17 17:59 ` [PATCH 1/3] nvme-tcp: don't check blk_mq_tag_to_rq when receiving pdu data Sagi Grimberg
@ 2021-05-18  6:59   ` Christoph Hellwig
  0 siblings, 0 replies; 16+ messages in thread
From: Christoph Hellwig @ 2021-05-18 6:59 UTC (permalink / raw)
To: Sagi Grimberg; +Cc: linux-nvme, Christoph Hellwig, Keith Busch, Daniel Wagner

> +++ b/drivers/nvme/host/tcp.c
> @@ -699,12 +699,6 @@ static int nvme_tcp_recv_data(struct nvme_tcp_queue *queue, struct sk_buff *skb,
>  	struct request *rq;
> 
>  	rq = blk_mq_tag_to_rq(nvme_tcp_tagset(queue), pdu->command_id);
> -	if (!rq) {
> -		dev_err(queue->ctrl->ctrl.device,
> -			"queue %d tag %#x not found\n",
> -			nvme_tcp_queue_id(queue), pdu->command_id);
> -		return -ENOENT;
> -	}
>  	req = blk_mq_rq_to_pdu(rq);

We could simplify the initialization a bit now:

	struct request *rq =
		blk_mq_tag_to_rq(nvme_tcp_tagset(queue), pdu->command_id);
	struct nvme_tcp_request *req = blk_mq_rq_to_pdu(rq);
* [PATCH 2/3] nvme-pci: limit maximum queue depth to 4095
  2021-05-17 17:59 [PATCH 0/3] nvme: protect against possible request reference after completion Sagi Grimberg
@ 2021-05-17 17:59 ` Sagi Grimberg
  2021-05-18  7:01   ` Christoph Hellwig
  2 subsequent siblings, 1 reply; 16+ messages in thread
From: Sagi Grimberg @ 2021-05-17 17:59 UTC (permalink / raw)
To: linux-nvme, Christoph Hellwig, Keith Busch; +Cc: Daniel Wagner

We are going to use the upper 4 bits of the command_id for a generation
counter, so enforce the new queue depth upper limit.

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
---
 drivers/nvme/host/pci.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index a29b170701fc..bc64282b2301 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -60,6 +60,7 @@ MODULE_PARM_DESC(sgl_threshold,
 		"Use SGLs when average request segment size is larger or equal to "
 		"this size. Use 0 to disable SGLs.");
 
+#define NVME_PCI_MAX_QUEUE_SIZE 4095
 static int io_queue_depth_set(const char *val, const struct kernel_param *kp);
 static const struct kernel_param_ops io_queue_depth_ops = {
 	.set = io_queue_depth_set,
@@ -68,7 +69,7 @@ static const struct kernel_param_ops io_queue_depth_ops = {
 
 static unsigned int io_queue_depth = 1024;
 module_param_cb(io_queue_depth, &io_queue_depth_ops, &io_queue_depth, 0644);
-MODULE_PARM_DESC(io_queue_depth, "set io queue depth, should >= 2");
+MODULE_PARM_DESC(io_queue_depth, "set io queue depth, should >= 2 and < 4096");
 
 static int io_queue_count_set(const char *val, const struct kernel_param *kp)
 {
@@ -161,7 +162,7 @@ static int io_queue_depth_set(const char *val, const struct kernel_param *kp)
 	u32 n;
 
 	ret = kstrtou32(val, 10, &n);
-	if (ret != 0 || n < 2)
+	if (ret != 0 || n < 2 || n > NVME_PCI_MAX_QUEUE_SIZE)
 		return -EINVAL;
 
 	return param_set_uint(val, kp);
-- 
2.27.0
* Re: [PATCH 2/3] nvme-pci: limit maximum queue depth to 4095
  2021-05-17 17:59 ` [PATCH 2/3] nvme-pci: limit maximum queue depth to 4095 Sagi Grimberg
@ 2021-05-18  7:01   ` Christoph Hellwig
  0 siblings, 0 replies; 16+ messages in thread
From: Christoph Hellwig @ 2021-05-18 7:01 UTC (permalink / raw)
To: Sagi Grimberg; +Cc: linux-nvme, Christoph Hellwig, Keith Busch, Daniel Wagner

Can we lift param_set_uint_minmax from sunrpc to common code and use
that?
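At this point param_set_uint_minmax lives in sunrpc; its semantics — parse an unsigned integer and reject anything outside [min, max] — can be sketched in userspace C. This is a hypothetical model for illustration, not the kernel helper itself (the function name and signature here are made up):

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/*
 * Userspace sketch of param_set_uint_minmax-style range checking:
 * parse a decimal string, fail with -EINVAL unless min <= n <= max.
 */
static int parse_uint_minmax(const char *val, unsigned int min,
			     unsigned int max, unsigned int *out)
{
	char *end;
	unsigned long n;

	if (!val)
		return -EINVAL;
	errno = 0;
	n = strtoul(val, &end, 10);
	if (errno || end == val || *end || n > UINT_MAX)
		return -EINVAL;
	if (n < min || n > max)
		return -EINVAL;
	*out = (unsigned int)n;
	return 0;
}
```

With such a helper, the open-coded `n < 2 || n > NVME_PCI_MAX_QUEUE_SIZE` check in io_queue_depth_set collapses into a single call with min=2 and max=4095.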
* [PATCH 3/3] nvme: code command_id with a genctr for use-after-free validation
  2021-05-17 17:59 [PATCH 0/3] nvme: protect against possible request reference after completion Sagi Grimberg
@ 2021-05-17 17:59 ` Sagi Grimberg
  2021-05-17 19:04   ` Keith Busch
  2021-05-17 19:09   ` Bart Van Assche
  3 siblings, 2 replies; 16+ messages in thread
From: Sagi Grimberg @ 2021-05-17 17:59 UTC (permalink / raw)
To: linux-nvme, Christoph Hellwig, Keith Busch; +Cc: Daniel Wagner

We cannot detect a (perhaps buggy) controller that is sending us a
completion for a request that was already completed (for example,
sending a completion twice); this phenomenon has been seen in the wild
a few times.

To protect against this, we use the upper 4 bits of the nvme sqe
command_id as a generation counter and verify that it matches the
request's current generation, which is incremented on every execution.

The 16-bit command_id is now constructed as:

	| xxxx | xxxxxxxxxxxx |
	  gen    request tag

This means we give up some possible queue depth, as 12 bits allow a
maximum queue depth of 4095 instead of 65536; however, we never create
such long queues anyway, so no real harm is done.
Suggested-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
---
 drivers/nvme/host/core.c   |  3 ++-
 drivers/nvme/host/nvme.h   | 47 +++++++++++++++++++++++++++++++++++++-
 drivers/nvme/host/pci.c    |  2 +-
 drivers/nvme/host/rdma.c   |  4 ++--
 drivers/nvme/host/tcp.c    | 26 ++++++++++-----------
 drivers/nvme/target/loop.c |  4 ++--
 6 files changed, 66 insertions(+), 20 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index e6612971f4eb..8308dd1e9e3f 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -1006,7 +1006,8 @@ blk_status_t nvme_setup_cmd(struct nvme_ns *ns, struct request *req)
 		return BLK_STS_IOERR;
 	}
 
-	cmd->common.command_id = req->tag;
+	nvme_req(req)->genctr++;
+	cmd->common.command_id = nvme_cid(req);
 	trace_nvme_setup_cmd(req, cmd);
 	return ret;
 }
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 05f31a2c64bb..e6f72c82582d 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -158,6 +158,7 @@ enum nvme_quirks {
 struct nvme_request {
 	struct nvme_command	*cmd;
 	union nvme_result	result;
+	u8			genctr;
 	u8			retries;
 	u8			flags;
 	u16			status;
@@ -497,6 +498,49 @@ struct nvme_ctrl_ops {
 	int (*get_address)(struct nvme_ctrl *ctrl, char *buf, int size);
 };
 
+/*
+ * nvme command_id is constructed as such:
+ * | xxxx | xxxxxxxxxxxx |
+ *   gen    request tag
+ */
+#define nvme_genctr_mask(gen)		(gen & 0xf)
+#define nvme_cid_install_genctr(gen)	(nvme_genctr_mask(gen) << 12)
+#define nvme_genctr_from_cid(cid)	((cid & 0xf000) >> 12)
+#define nvme_tag_from_cid(cid)		(cid & 0xfff)
+
+static inline u16 nvme_cid(struct request *rq)
+{
+	return nvme_cid_install_genctr(nvme_req(rq)->genctr) | rq->tag;
+}
+
+static inline struct request *nvme_find_rq(struct blk_mq_tags *tags,
+		u16 command_id)
+{
+	u8 genctr = nvme_genctr_from_cid(command_id);
+	u16 tag = nvme_tag_from_cid(command_id);
+	struct request *rq;
+
+	rq = blk_mq_tag_to_rq(tags, tag);
+	if (unlikely(!rq)) {
+		pr_err("could not locate request for tag %#x\n",
+			tag);
+		return NULL;
+	}
+	if (unlikely(nvme_genctr_mask(nvme_req(rq)->genctr) != genctr)) {
+		dev_err(nvme_req(rq)->ctrl->device,
+			"request %#x genctr mismatch (got %#x expected %#x)\n",
+			tag, genctr, nvme_req(rq)->genctr);
+		return NULL;
+	}
+	return rq;
+}
+
+static inline struct request *nvme_cid_to_rq(struct blk_mq_tags *tags,
+		u16 command_id)
+{
+	return blk_mq_tag_to_rq(tags, nvme_tag_from_cid(command_id));
+}
+
 #ifdef CONFIG_FAULT_INJECTION_DEBUG_FS
 void nvme_fault_inject_init(struct nvme_fault_inject *fault_inj,
 			    const char *dev_name);
@@ -594,7 +638,8 @@ static inline void nvme_put_ctrl(struct nvme_ctrl *ctrl)
 
 static inline bool nvme_is_aen_req(u16 qid, __u16 command_id)
 {
-	return !qid && command_id >= NVME_AQ_BLK_MQ_DEPTH;
+	return !qid &&
+		nvme_tag_from_cid(command_id) >= NVME_AQ_BLK_MQ_DEPTH;
 }
 
 void nvme_complete_rq(struct request *req);
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index bc64282b2301..0c1ea6e521d1 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -1018,7 +1018,7 @@ static inline void nvme_handle_cqe(struct nvme_queue *nvmeq, u16 idx)
 		return;
 	}
 
-	req = blk_mq_tag_to_rq(nvme_queue_tagset(nvmeq), command_id);
+	req = nvme_find_rq(nvme_queue_tagset(nvmeq), command_id);
 	if (unlikely(!req)) {
 		dev_warn(nvmeq->dev->ctrl.device,
 			"invalid id %d completed on queue %d\n",
diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index 203b47a8ec92..ab5b7d175488 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -1727,10 +1727,10 @@ static void nvme_rdma_process_nvme_rsp(struct nvme_rdma_queue *queue,
 	struct request *rq;
 	struct nvme_rdma_request *req;
 
-	rq = blk_mq_tag_to_rq(nvme_rdma_tagset(queue), cqe->command_id);
+	rq = nvme_find_rq(nvme_rdma_tagset(queue), cqe->command_id);
 	if (!rq) {
 		dev_err(queue->ctrl->ctrl.device,
-			"tag 0x%x on QP %#x not found\n",
+			"got bad command_id %#x on QP %#x\n",
 			cqe->command_id, queue->qp->qp_num);
 		nvme_rdma_error_recovery(queue->ctrl);
 		return;
diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index 1d184c06c575..e6bd74cabc6d 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -487,11 +487,11 @@ static int nvme_tcp_process_nvme_cqe(struct nvme_tcp_queue *queue,
 {
 	struct request *rq;
 
-	rq = blk_mq_tag_to_rq(nvme_tcp_tagset(queue), cqe->command_id);
+	rq = nvme_find_rq(nvme_tcp_tagset(queue), cqe->command_id);
 	if (!rq) {
 		dev_err(queue->ctrl->ctrl.device,
-			"queue %d tag 0x%x not found\n",
-			nvme_tcp_queue_id(queue), cqe->command_id);
+			"got bad cqe.command_id %#x on queue %d\n",
+			cqe->command_id, nvme_tcp_queue_id(queue));
 		nvme_tcp_error_recovery(&queue->ctrl->ctrl);
 		return -EINVAL;
 	}
@@ -508,11 +508,11 @@ static int nvme_tcp_handle_c2h_data(struct nvme_tcp_queue *queue,
 {
 	struct request *rq;
 
-	rq = blk_mq_tag_to_rq(nvme_tcp_tagset(queue), pdu->command_id);
+	rq = nvme_find_rq(nvme_tcp_tagset(queue), pdu->command_id);
 	if (!rq) {
 		dev_err(queue->ctrl->ctrl.device,
-			"queue %d tag %#x not found\n",
-			nvme_tcp_queue_id(queue), pdu->command_id);
+			"got bad c2hdata.command_id %#x on queue %d\n",
+			pdu->command_id, nvme_tcp_queue_id(queue));
 		return -ENOENT;
 	}
 
@@ -599,7 +599,7 @@ static int nvme_tcp_setup_h2c_data_pdu(struct nvme_tcp_request *req,
 	data->hdr.plen =
 		cpu_to_le32(data->hdr.hlen + hdgst + req->pdu_len + ddgst);
 	data->ttag = pdu->ttag;
-	data->command_id = rq->tag;
+	data->command_id = nvme_cid(rq);
 	data->data_offset = cpu_to_le32(req->data_sent);
 	data->data_length = cpu_to_le32(req->pdu_len);
 	return 0;
@@ -612,11 +612,11 @@ static int nvme_tcp_handle_r2t(struct nvme_tcp_queue *queue,
 	struct request *rq;
 	int ret;
 
-	rq = blk_mq_tag_to_rq(nvme_tcp_tagset(queue), pdu->command_id);
+	rq = nvme_find_rq(nvme_tcp_tagset(queue), pdu->command_id);
 	if (!rq) {
 		dev_err(queue->ctrl->ctrl.device,
-			"queue %d tag %#x not found\n",
-			nvme_tcp_queue_id(queue), pdu->command_id);
+			"got bad r2t.command_id %#x on queue %d\n",
+			pdu->command_id, nvme_tcp_queue_id(queue));
 		return -ENOENT;
 	}
 	req = blk_mq_rq_to_pdu(rq);
@@ -698,7 +698,7 @@ static int nvme_tcp_recv_data(struct nvme_tcp_queue *queue, struct sk_buff *skb,
 	struct nvme_tcp_request *req;
 	struct request *rq;
 
-	rq = blk_mq_tag_to_rq(nvme_tcp_tagset(queue), pdu->command_id);
+	rq = nvme_cid_to_rq(nvme_tcp_tagset(queue), pdu->command_id);
 	req = blk_mq_rq_to_pdu(rq);
 
 	while (true) {
@@ -791,8 +791,8 @@ static int nvme_tcp_recv_ddgst(struct nvme_tcp_queue *queue,
 	}
 
 	if (pdu->hdr.flags & NVME_TCP_F_DATA_SUCCESS) {
-		struct request *rq = blk_mq_tag_to_rq(nvme_tcp_tagset(queue),
-				pdu->command_id);
+		struct request *rq = nvme_cid_to_rq(nvme_tcp_tagset(queue),
+				pdu->command_id);
 
 		nvme_tcp_end_request(rq, NVME_SC_SUCCESS);
 		queue->nr_cqe++;
diff --git a/drivers/nvme/target/loop.c b/drivers/nvme/target/loop.c
index 1b89a6bb819a..9f1f5d572960 100644
--- a/drivers/nvme/target/loop.c
+++ b/drivers/nvme/target/loop.c
@@ -107,10 +107,10 @@ static void nvme_loop_queue_response(struct nvmet_req *req)
 	} else {
 		struct request *rq;
 
-		rq = blk_mq_tag_to_rq(nvme_loop_tagset(queue), cqe->command_id);
+		rq = nvme_find_rq(nvme_loop_tagset(queue), cqe->command_id);
 		if (!rq) {
 			dev_err(queue->ctrl->ctrl.device,
-				"tag 0x%x on queue %d not found\n",
+				"got bad command_id %#x on queue %d\n",
 				cqe->command_id, nvme_loop_queue_idx(queue));
 			return;
 		}
-- 
2.27.0
* Re: [PATCH 3/3] nvme: code command_id with a genctr for use-after-free validation
  2021-05-17 17:59 ` [PATCH 3/3] nvme: code command_id with a genctr for use-after-free validation Sagi Grimberg
@ 2021-05-17 19:04   ` Keith Busch
  2021-05-17 20:23     ` Sagi Grimberg
  1 sibling, 1 reply; 16+ messages in thread
From: Keith Busch @ 2021-05-17 19:04 UTC (permalink / raw)
To: Sagi Grimberg; +Cc: linux-nvme, Christoph Hellwig, Daniel Wagner

On Mon, May 17, 2021 at 10:59:55AM -0700, Sagi Grimberg wrote:
> +static inline struct request *nvme_find_rq(struct blk_mq_tags *tags,
> +		u16 command_id)
> +{
> +	u8 genctr = nvme_genctr_from_cid(command_id);
> +	u16 tag = nvme_tag_from_cid(command_id);
> +	struct request *rq;
> +
> +	rq = blk_mq_tag_to_rq(tags, tag);
> +	if (unlikely(!rq)) {
> +		pr_err("could not locate request for tag %#x\n",
> +			tag);
> +		return NULL;
> +	}
> +	if (unlikely(nvme_genctr_mask(nvme_req(rq)->genctr) != genctr)) {
> +		dev_err(nvme_req(rq)->ctrl->device,
> +			"request %#x genctr mismatch (got %#x expected %#x)\n",
> +			tag, genctr, nvme_req(rq)->genctr);
> +		return NULL;
> +	}

Should we also check for blk_mq_request_started(rq) here? It doesn't look
like the genctr is sufficient to detect a double completion of the same
command_id if the driver hasn't had a chance to advance the genctr for a
new request.

> +	return rq;
> +}
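Keith's concern can be illustrated with a small userspace model (hypothetical, for illustration only): the generation only advances at setup time, so a duplicate completion that arrives before the tag is reused still carries a matching genctr and passes the check — only a separate "started" state (what blk_mq_request_started() reports in the kernel) exposes it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Minimal model of one tag slot: the generation advances at setup only. */
struct slot { uint8_t genctr; bool started; };

static uint16_t submit(struct slot *s, uint16_t tag)
{
	s->genctr++;	/* as in nvme_setup_cmd() */
	s->started = true;
	return (uint16_t)(((s->genctr & 0xf) << 12) | (tag & 0xfff));
}

/* Returns true if the completion passes the genctr check alone. */
static bool complete(struct slot *s, uint16_t cid)
{
	if ((s->genctr & 0xf) != ((cid & 0xf000) >> 12))
		return false;	/* stale generation: rejected */
	s->started = false;	/* request done, tag free for reuse */
	return true;
}
```

A duplicate completion with the same cid, delivered before the tag is reissued, passes `complete()` a second time; checking the started flag in addition would catch exactly that window.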
* Re: [PATCH 3/3] nvme: code command_id with a genctr for use-after-free validation
  2021-05-17 19:04 ` Keith Busch
@ 2021-05-17 20:23   ` Sagi Grimberg
  0 siblings, 0 replies; 16+ messages in thread
From: Sagi Grimberg @ 2021-05-17 20:23 UTC (permalink / raw)
To: Keith Busch; +Cc: linux-nvme, Christoph Hellwig, Daniel Wagner

>> +static inline struct request *nvme_find_rq(struct blk_mq_tags *tags,
>> +		u16 command_id)
>> +{
>> +	u8 genctr = nvme_genctr_from_cid(command_id);
>> +	u16 tag = nvme_tag_from_cid(command_id);
>> +	struct request *rq;
>> +
>> +	rq = blk_mq_tag_to_rq(tags, tag);
>> +	if (unlikely(!rq)) {
>> +		pr_err("could not locate request for tag %#x\n",
>> +			tag);
>> +		return NULL;
>> +	}
>> +	if (unlikely(nvme_genctr_mask(nvme_req(rq)->genctr) != genctr)) {
>> +		dev_err(nvme_req(rq)->ctrl->device,
>> +			"request %#x genctr mismatch (got %#x expected %#x)\n",
>> +			tag, genctr, nvme_req(rq)->genctr);
>> +		return NULL;
>> +	}
> 
> Should we also check for blk_mq_request_started(rq) here? It doesn't look
> like the genctr is sufficient to detect a double completion of the same
> command_id if the driver hasn't had a chance to advance the genctr for a
> new request.

We could. Maybe incrementing the counter again on completion would be
better?
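The variant Sagi suggests — bumping the counter again at completion time — can be sketched by extending the same toy model (again hypothetical, not the kernel code): the duplicate completion then fails the generation check.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct slot2 { uint8_t genctr; };

static uint16_t submit2(struct slot2 *s, uint16_t tag)
{
	s->genctr++;	/* advance at setup, as before */
	return (uint16_t)(((s->genctr & 0xf) << 12) | (tag & 0xfff));
}

static bool complete2(struct slot2 *s, uint16_t cid)
{
	if ((s->genctr & 0xf) != ((cid & 0xf000) >> 12))
		return false;	/* generation mismatch: rejected */
	s->genctr++;		/* advance again on completion */
	return true;
}
```

With the extra increment, the slot's generation no longer matches the in-flight cid once the request completes, so a replayed completion for the same command_id is rejected even before the tag is reissued.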
* Re: [PATCH 3/3] nvme: code command_id with a genctr for use-after-free validation
  2021-05-17 17:59 ` [PATCH 3/3] nvme: code command_id with a genctr for use-after-free validation Sagi Grimberg
  2021-05-17 19:04   ` Keith Busch
@ 2021-05-17 19:09   ` Bart Van Assche
  2021-05-17 19:46     ` Keith Busch
  1 sibling, 1 reply; 16+ messages in thread
From: Bart Van Assche @ 2021-05-17 19:09 UTC (permalink / raw)
To: Sagi Grimberg, linux-nvme, Christoph Hellwig, Keith Busch; +Cc: Daniel Wagner

On 5/17/21 10:59 AM, Sagi Grimberg wrote:
> We cannot detect a (perhaps buggy) controller that is sending us
> a completion for a request that was already completed (for example
> sending a completion twice), this phenomenon was seen in the wild
> a few times.
> 
> So to protect against this, we use the upper 4 msbits of the nvme sqe
> command_id to use as a 4-bit generation counter and verify it matches
> the existing request generation that is incrementing on every execution.
> 
> The 16-bit command_id structure now is constructed by:
> | xxxx | xxxxxxxxxxxx |
>   gen    request tag
> 
> This means that we are giving up some possible queue depth as 12 bits
> allow for a maximum queue depth of 4095 instead of 65536, however we
> never create such long queues anyways so no real harm done.

Is a four bit generation counter sufficient? Shouldn't such a counter be
at least 32 bits to provide a reasonable protection against e.g.
duplicate packets generated by retry mechanisms in networking stacks?

Additionally, I do not agree with the statement "we never create such
long queues anyways". I have already done this myself.

Thanks,

Bart.
* Re: [PATCH 3/3] nvme: code command_id with a genctr for use-after-free validation
  2021-05-17 19:09 ` Bart Van Assche
@ 2021-05-17 19:46   ` Keith Busch
  2021-05-17 20:27     ` Sagi Grimberg
  2021-05-17 20:28     ` Bart Van Assche
  1 sibling, 2 replies; 16+ messages in thread
From: Keith Busch @ 2021-05-17 19:46 UTC (permalink / raw)
To: Bart Van Assche
Cc: Sagi Grimberg, linux-nvme, Christoph Hellwig, Daniel Wagner

On Mon, May 17, 2021 at 12:09:46PM -0700, Bart Van Assche wrote:
> Is a four bit generation counter sufficient? Shouldn't such a counter be
> at least 32 bits to provide a reasonable protection against e.g.
> duplicate packets generated by retry mechanisms in networking stacks?

It has to fit in the protocol's command identifier, and that's only 16
bits. Most of the bits go to the tag, so the implementation uses the
most available. More could be better, but that would require a spec
change.

> Additionally, I do not agree with the statement "we never create such
> long queues anyways". I have already done this myself.

Why? That won't improve bandwidth, and will increase latency. We already
have timeout problems with the current default 1k qdepth on some
devices.
* Re: [PATCH 3/3] nvme: code command_id with a genctr for use-after-free validation
  2021-05-17 19:46 ` Keith Busch
@ 2021-05-17 20:27   ` Sagi Grimberg
  0 siblings, 0 replies; 16+ messages in thread
From: Sagi Grimberg @ 2021-05-17 20:27 UTC (permalink / raw)
To: Keith Busch, Bart Van Assche; +Cc: linux-nvme, Christoph Hellwig, Daniel Wagner

>> Is a four bit generation counter sufficient? Shouldn't such a counter be
>> at least 32 bits to provide a reasonable protection against e.g.
>> duplicate packets generated by retry mechanisms in networking stacks?
> 
> It has to fit in the protocol's command identifier, and that's only 16
> bits. Most of the bits for the tag, so the implementation uses the most
> available. More could be better, but that would require a spec change.

Yes, and even if we could expand the genctr I don't think we should; we
may want to leave some bits for potential future usages. No matter how
many bits we allocate, we can't protect 100% against everything here.

>> Additionally, I do not agree with the statement "we never create such
>> long queues anyways". I have already done this myself.
> 
> Why? That won't improve bandwidth, and will increase latency. We already
> have timeout problems with the current default 1k qdepth on some
> devices.

Well, if there is a use-case that requires queues deeper than 4095
entries, we have no space in the command to protect against this... I
also find it surprising that there is a real benefit to such deep nvme
queues...
* Re: [PATCH 3/3] nvme: code command_id with a genctr for use-after-free validation
  2021-05-17 19:46 ` Keith Busch
  2021-05-17 20:27   ` Sagi Grimberg
@ 2021-05-17 20:28   ` Bart Van Assche
  2021-05-17 21:50     ` Sagi Grimberg
  1 sibling, 1 reply; 16+ messages in thread
From: Bart Van Assche @ 2021-05-17 20:28 UTC (permalink / raw)
To: Keith Busch; +Cc: Sagi Grimberg, linux-nvme, Christoph Hellwig, Daniel Wagner

On 5/17/21 12:46 PM, Keith Busch wrote:
> On Mon, May 17, 2021 at 12:09:46PM -0700, Bart Van Assche wrote:
>> Additionally, I do not agree with the statement "we never create such
>> long queues anyways". I have already done this myself.
> 
> Why? That won't improve bandwidth, and will increase latency. We already
> have timeout problems with the current default 1k qdepth on some
> devices.

For testing FPGA or ASIC solutions that support offloading NVMe it is
more convenient to use a single queue pair with a high queue depth than
creating multiple queue pairs that each have a lower queue depth.

Bart.
* Re: [PATCH 3/3] nvme: code command_id with a genctr for use-after-free validation
  2021-05-17 20:28 ` Bart Van Assche
@ 2021-05-17 21:50   ` Sagi Grimberg
  2021-05-17 22:06     ` Bart Van Assche
  0 siblings, 1 reply; 16+ messages in thread
From: Sagi Grimberg @ 2021-05-17 21:50 UTC (permalink / raw)
To: Bart Van Assche, Keith Busch; +Cc: linux-nvme, Christoph Hellwig, Daniel Wagner

>> On Mon, May 17, 2021 at 12:09:46PM -0700, Bart Van Assche wrote:
>>> Additionally, I do not agree with the statement "we never create such
>>> long queues anyways". I have already done this myself.
>> 
>> Why? That won't improve bandwidth, and will increase latency. We already
>> have timeout problems with the current default 1k qdepth on some
>> devices.
> 
> For testing FPGA or ASIC solutions that support offloading NVMe it is
> more convenient to use a single queue pair with a high queue depth than
> creating multiple queue pairs that each have a lower queue depth.

And you actually see a benefit for using queues that are >= 4096 in
depth? That is surprising to me...
* Re: [PATCH 3/3] nvme: code command_id with a genctr for use-after-free validation
  2021-05-17 21:50 ` Sagi Grimberg
@ 2021-05-17 22:06   ` Bart Van Assche
  2021-05-17 22:15     ` Sagi Grimberg
  0 siblings, 1 reply; 16+ messages in thread
From: Bart Van Assche @ 2021-05-17 22:06 UTC (permalink / raw)
To: Sagi Grimberg, Keith Busch; +Cc: linux-nvme, Christoph Hellwig, Daniel Wagner

On 5/17/21 2:50 PM, Sagi Grimberg wrote:
>>> For testing FPGA or ASIC solutions that support offloading NVMe it is
>>> more convenient to use a single queue pair with a high queue depth than
>>> creating multiple queue pairs that each have a lower queue depth.
> 
> And you actually see a benefit for using queues that are >= 4096 in
> depth? That is surprising to me...

Hi Sagi,

It seems like there is a misunderstanding. I'm not aware of any use case
where very high queue depths provide a performance benefit. Such high
queue depths are necessary to verify an implementation of an NVMe
controller that maintains state per NVMe command and to verify whether
the NVMe controller pauses fetching new NVMe commands if the internal
NVMe command buffer is full.

Bart.
* Re: [PATCH 3/3] nvme: code command_id with a genctr for use-after-free validation
  2021-05-17 22:06 ` Bart Van Assche
@ 2021-05-17 22:15   ` Sagi Grimberg
  0 siblings, 0 replies; 16+ messages in thread
From: Sagi Grimberg @ 2021-05-17 22:15 UTC (permalink / raw)
To: Bart Van Assche, Keith Busch; +Cc: linux-nvme, Christoph Hellwig, Daniel Wagner

> It seems like there is a misunderstanding. I'm not aware of any use case
> where very high queue depths provide a performance benefit. Such high
> queue depths are necessary to verify an implementation of an NVMe
> controller that maintains state per NVMe command and to verify whether
> the NVMe controller pauses fetching new NVMe commands if the internal
> NVMe command buffer is full.

I see, thanks for the clarification. However, I think a host may choose
not to support a full-blown 65535 queue depth, just like devices can
choose not to support it. As for testing your device, this will move
into the same category as all the other things Linux doesn't support:
it will need different host software to test.
* Re: [PATCH 0/3] nvme: protect against possible request reference after completion
  2021-05-17 17:59 [PATCH 0/3] nvme: protect against possible request reference after completion Sagi Grimberg
  ` (2 preceding siblings ...)
  2021-05-17 17:59 ` [PATCH 3/3] nvme: code command_id with a genctr for use-after-free validation Sagi Grimberg
@ 2021-05-17 18:47 ` Keith Busch
  3 siblings, 0 replies; 16+ messages in thread
From: Keith Busch @ 2021-05-17 18:47 UTC (permalink / raw)
To: Sagi Grimberg; +Cc: linux-nvme, Christoph Hellwig, Daniel Wagner

On Mon, May 17, 2021 at 10:59:52AM -0700, Sagi Grimberg wrote:
> Nothing in nvme protects against referencing a request after it was completed.
> For example, in case a buggy controller sends a completion twice for the same
> request, the host can access and modify a request that was already completed.
>
> At best, this will cause a panic, but in the worst case, it can cause silent
> data corruption if the request was already reused and executed by the time
> we reference it.
>
> The nvme command_id is an opaque value in which we have simply placed the
> request tag thus far. To protect against an access after completion, we
> introduce a generation counter in the upper 4 bits of the command_id that
> is incremented on every invocation and validated upon the reception of a
> completion. This limits the effective maximum queue depth to 4095, but we
> hardly ever use such long queues (in fabrics the maximum is already 1024).

This is a neat safeguard, even though we haven't seen much indication of
this type of controller bug occurring on PCIe. It looks pretty
light-weight, but I would like to see if this has a performance impact.
I'm still 3 weeks away from physical access to my site to set up a
performance test with my low-latency devices, though.

On patch 2, I think it's safe to just cap the queue depth to the new max
rather than return an error if the user requests more. 4k actually seems
like quite a lot there, too. 1k should be plenty, just like the fabrics
transports, and a 1k limit provides 2 more bits for the gen sequence.

^ permalink raw reply	[flat|nested] 16+ messages in thread
end of thread, other threads:[~2021-05-18  7:02 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-05-17 17:59 [PATCH 0/3] nvme: protect against possible request reference after completion Sagi Grimberg
2021-05-17 17:59 ` [PATCH 1/3] nvme-tcp: don't check blk_mq_tag_to_rq when receiving pdu data Sagi Grimberg
2021-05-18  6:59   ` Christoph Hellwig
2021-05-17 17:59 ` [PATCH 2/3] nvme-pci: limit maximum queue depth to 4095 Sagi Grimberg
2021-05-18  7:01   ` Christoph Hellwig
2021-05-17 17:59 ` [PATCH 3/3] nvme: code command_id with a genctr for use-after-free validation Sagi Grimberg
2021-05-17 19:04   ` Keith Busch
2021-05-17 20:23     ` Sagi Grimberg
2021-05-17 19:09   ` Bart Van Assche
2021-05-17 19:46     ` Keith Busch
2021-05-17 20:27       ` Sagi Grimberg
2021-05-17 20:28     ` Bart Van Assche
2021-05-17 21:50       ` Sagi Grimberg
2021-05-17 22:06         ` Bart Van Assche
2021-05-17 22:15           ` Sagi Grimberg
2021-05-17 18:47 ` [PATCH 0/3] nvme: protect against possible request reference after completion Keith Busch