All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v4 0/3] keepalive bugfixes
@ 2023-05-24 19:38 Uday Shankar
  2023-05-24 19:38 ` [PATCH v4 1/3] nvme: double KA polling frequency to avoid KATO with TBKAS on Uday Shankar
                   ` (2 more replies)
  0 siblings, 3 replies; 5+ messages in thread
From: Uday Shankar @ 2023-05-24 19:38 UTC (permalink / raw)
  To: Costa Sapuntzakis, Randy Jennings, Hannes Reinecke,
	Sagi Grimberg, Keith Busch, Christoph Hellwig, Jens Axboe
  Cc: Uday Shankar, linux-nvme

While reviewing the Linux KATO implementation in an attempt to better
understand the current NVMe Keep Alive specification, we found a
few issues in the host implementation that could contribute to spurious
Keep Alive timeouts being detected by controllers.

Changes v3-v4 (https://lore.kernel.org/linux-nvme/20230518183311.3224326-1-ushankar@purestorage.com/):
- Patch 1: fix style, improve readability, include helper from patch 3
- Patch 2: refactor to remove need for WRITE_ONCE/barriers
- Patch 3: fix comment style, move helper to patch 1

Changes v2-v3 (https://lore.kernel.org/linux-nvme/20230424232225.1975793-1-ushankar@purestorage.com/):
- Patch 3: add a warning log line for long keepalive RTT

Changes v1-v2 (https://lore.kernel.org/linux-nvme/20230417225558.2890062-1-ushankar@purestorage.com/):
- Patch 2: fix indentation, set start_time even when stats disabled

Uday Shankar (3):
  nvme: double KA polling frequency to avoid KATO with TBKAS on
  nvme: check IO start time when deciding to defer KA
  nvme: improve handling of long keep alives

 drivers/nvme/host/core.c | 46 +++++++++++++++++++++++++++++++++++++---
 drivers/nvme/host/nvme.h |  4 ++--
 2 files changed, 45 insertions(+), 5 deletions(-)

-- 
2.25.1



^ permalink raw reply	[flat|nested] 5+ messages in thread

* [PATCH v4 1/3] nvme: double KA polling frequency to avoid KATO with TBKAS on
  2023-05-24 19:38 [PATCH v4 0/3] keepalive bugfixes Uday Shankar
@ 2023-05-24 19:38 ` Uday Shankar
  2023-05-24 19:38 ` [PATCH v4 2/3] nvme: check IO start time when deciding to defer KA Uday Shankar
  2023-05-24 19:38 ` [PATCH v4 3/3] nvme: improve handling of long keep alives Uday Shankar
  2 siblings, 0 replies; 5+ messages in thread
From: Uday Shankar @ 2023-05-24 19:38 UTC (permalink / raw)
  To: Costa Sapuntzakis, Randy Jennings, Hannes Reinecke,
	Sagi Grimberg, Keith Busch, Christoph Hellwig, Jens Axboe
  Cc: Uday Shankar, linux-nvme

With TBKAS on, the completion of one command can defer sending a
keep alive for up to twice the delay between successive runs of
nvme_keep_alive_work. The current delay of KATO / 2 thus makes it
possible for one command to defer sending a keep alive for up to
KATO, which can result in the controller detecting a KATO. The following
trace demonstrates the issue, taking KATO = 8 for simplicity:

1. t = 0: run nvme_keep_alive_work, no keep-alive sent
2. t = ε: I/O completion seen, set comp_seen = true
3. t = 4: run nvme_keep_alive_work, see comp_seen == true,
          skip sending keep-alive, set comp_seen = false
4. t = 8: run nvme_keep_alive_work, see comp_seen == false,
          send a keep-alive command.

Here, there is a delay of 8 - ε between receiving a command completion
and sending the next command. With ε small, the controller is likely to
detect a keep alive timeout.

Fix this by running nvme_keep_alive_work with a delay of KATO / 4
whenever TBKAS is on. Going through the above trace now gives us a
worst-case delay of 4 - ε, which is in line with the recommendation of
sending a command every KATO / 2 in the NVMe specification.

Reported-by: Costa Sapuntzakis <costa@purestorage.com>
Reported-by: Randy Jennings <randyj@purestorage.com>
Signed-off-by: Uday Shankar <ushankar@purestorage.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
---
 drivers/nvme/host/core.c | 18 +++++++++++++++++-
 1 file changed, 17 insertions(+), 1 deletion(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index f586a4808e6e..5ad2b9b0d3b9 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -1161,9 +1161,25 @@ EXPORT_SYMBOL_NS_GPL(nvme_passthru_end, NVME_TARGET_PASSTHRU);
  *   The host should send Keep Alive commands at half of the Keep Alive Timeout
  *   accounting for transport roundtrip times [..].
  */
+static unsigned long nvme_keep_alive_work_period(struct nvme_ctrl *ctrl)
+{
+	unsigned long delay = ctrl->kato * HZ / 2;
+
+	/*
+	 * When using Traffic Based Keep Alive, we need to run
+	 * nvme_keep_alive_work at twice the normal frequency, as one
+	 * command completion can postpone sending a keep alive command
+	 * by up to twice the delay between runs.
+	 */
+	if (ctrl->ctratt & NVME_CTRL_ATTR_TBKAS)
+		delay /= 2;
+	return delay;
+}
+
 static void nvme_queue_keep_alive_work(struct nvme_ctrl *ctrl)
 {
-	queue_delayed_work(nvme_wq, &ctrl->ka_work, ctrl->kato * HZ / 2);
+	queue_delayed_work(nvme_wq, &ctrl->ka_work,
+			   nvme_keep_alive_work_period(ctrl));
 }
 
 static enum rq_end_io_ret nvme_keep_alive_end_io(struct request *rq,
-- 
2.25.1



^ permalink raw reply related	[flat|nested] 5+ messages in thread

* [PATCH v4 2/3] nvme: check IO start time when deciding to defer KA
  2023-05-24 19:38 [PATCH v4 0/3] keepalive bugfixes Uday Shankar
  2023-05-24 19:38 ` [PATCH v4 1/3] nvme: double KA polling frequency to avoid KATO with TBKAS on Uday Shankar
@ 2023-05-24 19:38 ` Uday Shankar
  2023-05-24 20:39   ` Keith Busch
  2023-05-24 19:38 ` [PATCH v4 3/3] nvme: improve handling of long keep alives Uday Shankar
  2 siblings, 1 reply; 5+ messages in thread
From: Uday Shankar @ 2023-05-24 19:38 UTC (permalink / raw)
  To: Costa Sapuntzakis, Randy Jennings, Hannes Reinecke,
	Sagi Grimberg, Keith Busch, Christoph Hellwig, Jens Axboe
  Cc: Uday Shankar, linux-nvme

When a command completes, we set a flag which will skip sending a
keep alive at the next run of nvme_keep_alive_work when TBKAS is on.
However, if the command was submitted long ago, it's possible that
the controller may have also restarted its keep alive timer (as a
result of receiving the command) long ago. The following trace
demonstrates the issue, assuming TBKAS is on and KATO = 8 for
simplicity:

1. t = 0: submit I/O commands A, B, C, D, E
2. t = 0.5: commands A, B, C, D, E reach controller, restart its keep
            alive timer
3. t = 1: A completes
4. t = 2: run nvme_keep_alive_work, see recent completion, do nothing
5. t = 3: B completes
6. t = 4: run nvme_keep_alive_work, see recent completion, do nothing
7. t = 5: C completes
8. t = 6: run nvme_keep_alive_work, see recent completion, do nothing
9. t = 7: D completes
10. t = 8: run nvme_keep_alive_work, see recent completion, do nothing
11. t = 9: E completes

At this point, 8.5 seconds have passed without restarting the
controller's keep alive timer, so the controller will detect a keep
alive timeout.

Fix this by checking the IO start time when deciding to defer sending a
keep alive command. Only set comp_seen if the command started after the
most recent run of nvme_keep_alive_work. With this change, the
completions of B, C, and D will not set comp_seen and the run of
nvme_keep_alive_work at t = 4 will send a keep alive.

Reported-by: Costa Sapuntzakis <costa@purestorage.com>
Reported-by: Randy Jennings <randyj@purestorage.com>
Signed-off-by: Uday Shankar <ushankar@purestorage.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
---
 drivers/nvme/host/core.c | 12 +++++++++++-
 drivers/nvme/host/nvme.h |  4 ++--
 2 files changed, 13 insertions(+), 3 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 5ad2b9b0d3b9..54f5440a0b83 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -397,7 +397,13 @@ void nvme_complete_rq(struct request *req)
 	trace_nvme_complete_rq(req);
 	nvme_cleanup_cmd(req);
 
-	if (ctrl->kas)
+	/*
+	 * Completions of long-running commands should not be able to
+	 * defer sending of periodic keep alives, since the controller
+	 * may have completed processing such commands a long time ago
+	 * (arbitrarily close to command submission time).
+	 */
+	if (ctrl->kas && nvme_req(req)->start_time >= ctrl->ka_last_check_time)
 		ctrl->comp_seen = true;
 
 	switch (nvme_decide_disposition(req)) {
@@ -1173,6 +1179,7 @@ static unsigned long nvme_keep_alive_work_period(struct nvme_ctrl *ctrl)
 	 */
 	if (ctrl->ctratt & NVME_CTRL_ATTR_TBKAS)
 		delay /= 2;
+
 	return delay;
 }
 
@@ -1198,6 +1205,7 @@ static enum rq_end_io_ret nvme_keep_alive_end_io(struct request *rq,
 		return RQ_END_IO_NONE;
 	}
 
+	ctrl->ka_last_check_time = jiffies;
 	ctrl->comp_seen = false;
 	spin_lock_irqsave(&ctrl->lock, flags);
 	if (ctrl->state == NVME_CTRL_LIVE ||
@@ -1216,6 +1224,8 @@ static void nvme_keep_alive_work(struct work_struct *work)
 	bool comp_seen = ctrl->comp_seen;
 	struct request *rq;
 
+	ctrl->ka_last_check_time = jiffies;
+
 	if ((ctrl->ctratt & NVME_CTRL_ATTR_TBKAS) && comp_seen) {
 		dev_dbg(ctrl->device,
 			"reschedule traffic based keep-alive timer\n");
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index f0a84e390a55..07ed39ea43e3 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -162,9 +162,7 @@ struct nvme_request {
 	u8			retries;
 	u8			flags;
 	u16			status;
-#ifdef CONFIG_NVME_MULTIPATH
 	unsigned long		start_time;
-#endif
 	struct nvme_ctrl	*ctrl;
 };
 
@@ -323,6 +321,7 @@ struct nvme_ctrl {
 	struct delayed_work ka_work;
 	struct delayed_work failfast_work;
 	struct nvme_command ka_cmd;
+	unsigned long ka_last_check_time;
 	struct work_struct fw_act_work;
 	unsigned long events;
 
@@ -1032,6 +1031,7 @@ static inline void nvme_start_request(struct request *rq)
 {
 	if (rq->cmd_flags & REQ_NVME_MPATH)
 		nvme_mpath_start_request(rq);
+	nvme_req(rq)->start_time = jiffies;
 	blk_mq_start_request(rq);
 }
 
-- 
2.25.1



^ permalink raw reply related	[flat|nested] 5+ messages in thread

* [PATCH v4 3/3] nvme: improve handling of long keep alives
  2023-05-24 19:38 [PATCH v4 0/3] keepalive bugfixes Uday Shankar
  2023-05-24 19:38 ` [PATCH v4 1/3] nvme: double KA polling frequency to avoid KATO with TBKAS on Uday Shankar
  2023-05-24 19:38 ` [PATCH v4 2/3] nvme: check IO start time when deciding to defer KA Uday Shankar
@ 2023-05-24 19:38 ` Uday Shankar
  2 siblings, 0 replies; 5+ messages in thread
From: Uday Shankar @ 2023-05-24 19:38 UTC (permalink / raw)
  To: Costa Sapuntzakis, Randy Jennings, Hannes Reinecke,
	Sagi Grimberg, Keith Busch, Christoph Hellwig, Jens Axboe
  Cc: Uday Shankar, linux-nvme

Upon keep alive completion, nvme_keep_alive_work is scheduled with the
same delay every time. If keep alive commands are completing slowly,
this may cause a keep alive timeout. The following trace illustrates the
issue, taking KATO = 8 and TBKAS off for simplicity:

1. t = 0: run nvme_keep_alive_work, send keep alive
2. t = ε: keep alive reaches controller, controller restarts its keep
          alive timer
3. t = 4: host receives keep alive completion, schedules
          nvme_keep_alive_work with delay 4
4. t = 8: run nvme_keep_alive_work, send keep alive

Here, a keep alive having RTT of 4 causes a delay of at least 8 - ε
between the controller receiving successive keep alives. With ε small,
the controller is likely to detect a keep alive timeout.

Fix this by calculating the RTT of the keep alive command, and adjusting
the scheduling delay of the next keep alive work accordingly.

Reported-by: Costa Sapuntzakis <costa@purestorage.com>
Reported-by: Randy Jennings <randyj@purestorage.com>
Signed-off-by: Uday Shankar <ushankar@purestorage.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
---
 drivers/nvme/host/core.c | 16 +++++++++++++++-
 1 file changed, 15 insertions(+), 1 deletion(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 54f5440a0b83..a4287ae967b2 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -1195,6 +1195,20 @@ static enum rq_end_io_ret nvme_keep_alive_end_io(struct request *rq,
 	struct nvme_ctrl *ctrl = rq->end_io_data;
 	unsigned long flags;
 	bool startka = false;
+	unsigned long rtt = jiffies - nvme_req(rq)->start_time;
+	unsigned long delay = nvme_keep_alive_work_period(ctrl);
+
+	/*
+	 * Subtract off the keepalive RTT so nvme_keep_alive_work runs
+	 * at the desired frequency.
+	 */
+	if (rtt <= delay) {
+		delay -= rtt;
+	} else {
+		dev_warn(ctrl->device, "long keepalive RTT (%u ms)\n",
+			 jiffies_to_msecs(rtt));
+		delay = 0;
+	}
 
 	blk_mq_free_request(rq);
 
@@ -1213,7 +1227,7 @@ static enum rq_end_io_ret nvme_keep_alive_end_io(struct request *rq,
 		startka = true;
 	spin_unlock_irqrestore(&ctrl->lock, flags);
 	if (startka)
-		nvme_queue_keep_alive_work(ctrl);
+		queue_delayed_work(nvme_wq, &ctrl->ka_work, delay);
 	return RQ_END_IO_NONE;
 }
 
-- 
2.25.1



^ permalink raw reply related	[flat|nested] 5+ messages in thread

* Re: [PATCH v4 2/3] nvme: check IO start time when deciding to defer KA
  2023-05-24 19:38 ` [PATCH v4 2/3] nvme: check IO start time when deciding to defer KA Uday Shankar
@ 2023-05-24 20:39   ` Keith Busch
  0 siblings, 0 replies; 5+ messages in thread
From: Keith Busch @ 2023-05-24 20:39 UTC (permalink / raw)
  To: Uday Shankar
  Cc: Costa Sapuntzakis, Randy Jennings, Hannes Reinecke,
	Sagi Grimberg, Christoph Hellwig, Jens Axboe, linux-nvme

On Wed, May 24, 2023 at 01:38:08PM -0600, Uday Shankar wrote:
> +	/*
> +	 * Completions of long-running commands should not be able to
> +	 * defer sending of periodic keep alives, since the controller
> +	 * may have completed processing such commands a long time ago
> +	 * (arbitrarily close to command submission time).
> +	 */
> +	if (ctrl->kas && nvme_req(req)->start_time >= ctrl->ka_last_check_time)
>  		ctrl->comp_seen = true;

Instead of saving start_time, 'req->deadline - req->timeout' should get
the same result. And you won't have to set nvme's start_time twice for
the mpath case.


^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2023-05-24 20:39 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-05-24 19:38 [PATCH v4 0/3] keepalive bugfixes Uday Shankar
2023-05-24 19:38 ` [PATCH v4 1/3] nvme: double KA polling frequency to avoid KATO with TBKAS on Uday Shankar
2023-05-24 19:38 ` [PATCH v4 2/3] nvme: check IO start time when deciding to defer KA Uday Shankar
2023-05-24 20:39   ` Keith Busch
2023-05-24 19:38 ` [PATCH v4 3/3] nvme: improve handling of long keep alives Uday Shankar

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.