* [PATCH 1/2] nvme: fixup kato deadlock
2021-02-23 12:07 [PATCH 0/2] nvme: sanitize KATO handling Hannes Reinecke
@ 2021-02-23 12:07 ` Hannes Reinecke
2021-02-24 16:22 ` Christoph Hellwig
2021-02-23 12:07 ` [PATCH 2/2] nvme: sanitize KATO setting Hannes Reinecke
2021-02-24 6:42 ` [PATCH 0/2] nvme: sanitize KATO handling Chao Leng
2 siblings, 1 reply; 10+ messages in thread
From: Hannes Reinecke @ 2021-02-23 12:07 UTC (permalink / raw)
To: Christoph Hellwig
Cc: linux-nvme, Daniel Wagner, Sagi Grimberg, Keith Busch, Hannes Reinecke
A customer of ours has run into this deadlock with RDMA:
- The ka_work workqueue item is executed
- A new ka_work workqueue item is scheduled just after that
- Now both the kato request timeout _and_ the workqueue delay
  will fire at roughly the same time
- If the timing is right, the workqueue item executes _before_
  the kato request timeout triggers
- The kato request timeout triggers and starts error recovery
- Error recovery deadlocks, as it needs to flush the kato
  workqueue item; this is stuck in nvme_alloc_request() as all
  reserved tags are in use
The reserved tags would have been freed up later when cancelling all
outstanding requests in the queue:
nvme_stop_keep_alive(&ctrl->ctrl);
nvme_rdma_teardown_io_queues(ctrl, false);
nvme_start_queues(&ctrl->ctrl);
nvme_rdma_teardown_admin_queue(ctrl, false);
blk_mq_unquiesce_queue(ctrl->ctrl.admin_q);
but as we're stuck in nvme_stop_keep_alive() we'll never get this far.
To fix this, add a new controller flag 'NVME_CTRL_KATO_RUNNING'
which short-circuits nvme_keep_alive() if a keep-alive
command is already running.
Cc: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Hannes Reinecke <hare@suse.de>
---
drivers/nvme/host/core.c | 8 +++++++-
drivers/nvme/host/nvme.h | 1 +
2 files changed, 8 insertions(+), 1 deletion(-)
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index ea40a3c511da..9b8596eb4047 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -1211,6 +1211,7 @@ static void nvme_keep_alive_end_io(struct request *rq, blk_status_t status)
 	bool startka = false;
 
 	blk_mq_free_request(rq);
+	clear_bit(NVME_CTRL_KATO_RUNNING, &ctrl->flags);
 
 	if (status) {
 		dev_err(ctrl->device,
@@ -1233,10 +1234,15 @@ static int nvme_keep_alive(struct nvme_ctrl *ctrl)
 {
 	struct request *rq;
 
+	if (test_and_set_bit(NVME_CTRL_KATO_RUNNING, &ctrl->flags))
+		return 0;
+
 	rq = nvme_alloc_request(ctrl->admin_q, &ctrl->ka_cmd,
 			BLK_MQ_REQ_RESERVED);
-	if (IS_ERR(rq))
+	if (IS_ERR(rq)) {
+		clear_bit(NVME_CTRL_KATO_RUNNING, &ctrl->flags);
 		return PTR_ERR(rq);
+	}
 
 	rq->timeout = ctrl->kato * HZ;
 	rq->end_io_data = ctrl;
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index e6efa085f08a..e00e3400c8b6 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -344,6 +344,7 @@ struct nvme_ctrl {
 	int nr_reconnects;
 	unsigned long flags;
 #define NVME_CTRL_FAILFAST_EXPIRED	0
+#define NVME_CTRL_KATO_RUNNING		1
 	struct nvmf_ctrl_options *opts;
 
 	struct page *discard_page;
--
2.29.2
_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme
* [PATCH 2/2] nvme: sanitize KATO setting
2021-02-23 12:07 [PATCH 0/2] nvme: sanitize KATO handling Hannes Reinecke
2021-02-23 12:07 ` [PATCH 1/2] nvme: fixup kato deadlock Hannes Reinecke
@ 2021-02-23 12:07 ` Hannes Reinecke
2021-02-24 16:23 ` Christoph Hellwig
2021-02-24 6:42 ` [PATCH 0/2] nvme: sanitize KATO handling Chao Leng
2 siblings, 1 reply; 10+ messages in thread
From: Hannes Reinecke @ 2021-02-23 12:07 UTC (permalink / raw)
To: Christoph Hellwig
Cc: linux-nvme, Daniel Wagner, Sagi Grimberg, Keith Busch, Hannes Reinecke
According to the NVMe base spec, keep-alive commands should be sent
at half the KATO interval to properly account for round-trip
times.
As we now only ever send one KATO command per connection at a time,
we can easily use the recommended value.
This also fixes a potential issue where the request timeout for
the KATO command did not match the value sent in the connect command,
which might have been causing spurious connection drops by the target.
Signed-off-by: Hannes Reinecke <hare@suse.de>
---
drivers/nvme/host/core.c | 14 ++++++++++----
drivers/nvme/host/fabrics.c | 2 +-
drivers/nvme/host/nvme.h | 1 -
3 files changed, 11 insertions(+), 6 deletions(-)
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 9b8596eb4047..eedbee80b7b9 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -1226,8 +1226,11 @@ static void nvme_keep_alive_end_io(struct request *rq, blk_status_t status)
 	    ctrl->state == NVME_CTRL_CONNECTING)
 		startka = true;
 	spin_unlock_irqrestore(&ctrl->lock, flags);
-	if (startka)
-		queue_delayed_work(nvme_wq, &ctrl->ka_work, ctrl->kato * HZ);
+	if (startka) {
+		unsigned long kato_delay = ctrl->kato >> 1;
+
+		queue_delayed_work(nvme_wq, &ctrl->ka_work, kato_delay * HZ);
+	}
 }
 
 static int nvme_keep_alive(struct nvme_ctrl *ctrl)
@@ -1259,10 +1262,12 @@ static void nvme_keep_alive_work(struct work_struct *work)
 	bool comp_seen = ctrl->comp_seen;
 
 	if ((ctrl->ctratt & NVME_CTRL_ATTR_TBKAS) && comp_seen) {
+		unsigned long kato_delay = ctrl->kato >> 1;
+
 		dev_dbg(ctrl->device,
 			"reschedule traffic based keep-alive timer\n");
 		ctrl->comp_seen = false;
-		queue_delayed_work(nvme_wq, &ctrl->ka_work, ctrl->kato * HZ);
+		queue_delayed_work(nvme_wq, &ctrl->ka_work, kato_delay * HZ);
 		return;
 	}
 
@@ -1276,10 +1281,11 @@ static void nvme_keep_alive_work(struct work_struct *work)
 
 static void nvme_start_keep_alive(struct nvme_ctrl *ctrl)
 {
+	unsigned long kato_delay = ctrl->kato >> 1;
 	if (unlikely(ctrl->kato == 0))
 		return;
 
-	queue_delayed_work(nvme_wq, &ctrl->ka_work, ctrl->kato * HZ);
+	queue_delayed_work(nvme_wq, &ctrl->ka_work, kato_delay * HZ);
 }
 
 void nvme_stop_keep_alive(struct nvme_ctrl *ctrl)
diff --git a/drivers/nvme/host/fabrics.c b/drivers/nvme/host/fabrics.c
index 5dfd806fc2d2..dba32e39afbf 100644
--- a/drivers/nvme/host/fabrics.c
+++ b/drivers/nvme/host/fabrics.c
@@ -382,7 +382,7 @@ int nvmf_connect_admin_queue(struct nvme_ctrl *ctrl)
 	 * and add a grace period for controller kato enforcement
 	 */
 	cmd.connect.kato = ctrl->kato ?
-		cpu_to_le32((ctrl->kato + NVME_KATO_GRACE) * 1000) : 0;
+		cpu_to_le32(ctrl->kato * 1000) : 0;
 
 	if (ctrl->opts->disable_sqflow)
 		cmd.connect.cattr |= NVME_CONNECT_DISABLE_SQFLOW;
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index e00e3400c8b6..de0b270f95fc 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -27,7 +27,6 @@ extern unsigned int admin_timeout;
 #define NVME_ADMIN_TIMEOUT	(admin_timeout * HZ)
 
 #define NVME_DEFAULT_KATO	5
-#define NVME_KATO_GRACE		10
 
 #ifdef CONFIG_ARCH_NO_SG_CHAIN
 #define NVME_INLINE_SG_CNT  0
--
2.29.2
* Re: [PATCH 2/2] nvme: sanitize KATO setting
2021-02-23 12:07 ` [PATCH 2/2] nvme: sanitize KATO setting Hannes Reinecke
@ 2021-02-24 16:23 ` Christoph Hellwig
0 siblings, 0 replies; 10+ messages in thread
From: Christoph Hellwig @ 2021-02-24 16:23 UTC (permalink / raw)
To: Hannes Reinecke
Cc: linux-nvme, Daniel Wagner, Christoph Hellwig, Keith Busch, Sagi Grimberg
> +		unsigned long kato_delay = ctrl->kato >> 1;
> +
> +		queue_delayed_work(nvme_wq, &ctrl->ka_work, kato_delay * HZ);

> +		unsigned long kato_delay = ctrl->kato >> 1;
> +
> +		queue_delayed_work(nvme_wq, &ctrl->ka_work, kato_delay * HZ);

> +	unsigned long kato_delay = ctrl->kato >> 1;
> +	queue_delayed_work(nvme_wq, &ctrl->ka_work, kato_delay * HZ);
I think we need a properly documented helper for
(ctrl->kato / 2) * HZ
instead of all this open coded magic.
> cmd.connect.kato = ctrl->kato ?
> - cpu_to_le32((ctrl->kato + NVME_KATO_GRACE) * 1000) : 0;
> + cpu_to_le32(ctrl->kato * 1000) : 0;
cpu_to_le32(0 * 1000) is still 0, so we can remove the branch here now.
* Re: [PATCH 0/2] nvme: sanitize KATO handling
2021-02-23 12:07 [PATCH 0/2] nvme: sanitize KATO handling Hannes Reinecke
2021-02-23 12:07 ` [PATCH 1/2] nvme: fixup kato deadlock Hannes Reinecke
2021-02-23 12:07 ` [PATCH 2/2] nvme: sanitize KATO setting Hannes Reinecke
@ 2021-02-24 6:42 ` Chao Leng
2021-02-24 7:06 ` Hannes Reinecke
2 siblings, 1 reply; 10+ messages in thread
From: Chao Leng @ 2021-02-24 6:42 UTC (permalink / raw)
To: Hannes Reinecke, Christoph Hellwig
Cc: Keith Busch, Daniel Wagner, linux-nvme, Sagi Grimberg
On 2021/2/23 20:07, Hannes Reinecke wrote:
> Hi all,
>
> one of our customers had been running into a deadlock trying to terminate
> outstanding KATO commands during reset.
> Looking closer at it, I found that we never actually _track_ if a KATO
> command is submitted, so we might happily be sending several KATO commands
> to the same controller simultaneously.
Can you explain how KATO commands can be sent simultaneously?
> Also, I found it slightly odd that we signal a different KATO value to the
> controller than what we're using internally; I would have thought that both
> sides should agree on the same KATO value. And even that wouldn't be so
> bad, but we really should be using the KATO value we announced to the
> controller when setting the request timeout.
>
> With these patches I attempt to resolve the situation; the first patch
> ensures that only one KATO command to a given controller is outstanding.
> With that the delay between sending KATO commands and the KATO timeout
> are decoupled, and we can follow the recommendation from the base spec
> to send the KATO commands at half the KATO timeout intervals.
>
> As usual, comments and reviews are welcome.
>
> Hannes Reinecke (2):
> nvme: fixup kato deadlock
> nvme: sanitize KATO setting
>
> drivers/nvme/host/core.c | 22 +++++++++++++++++-----
> drivers/nvme/host/fabrics.c | 2 +-
> drivers/nvme/host/nvme.h | 2 +-
> 3 files changed, 19 insertions(+), 7 deletions(-)
>
* Re: [PATCH 0/2] nvme: sanitize KATO handling
2021-02-24 6:42 ` [PATCH 0/2] nvme: sanitize KATO handling Chao Leng
@ 2021-02-24 7:06 ` Hannes Reinecke
2021-02-24 7:20 ` Chao Leng
0 siblings, 1 reply; 10+ messages in thread
From: Hannes Reinecke @ 2021-02-24 7:06 UTC (permalink / raw)
To: Chao Leng, Christoph Hellwig
Cc: Keith Busch, Daniel Wagner, linux-nvme, Sagi Grimberg
On 2/24/21 7:42 AM, Chao Leng wrote:
>
>
> On 2021/2/23 20:07, Hannes Reinecke wrote:
>> Hi all,
>>
>> one of our customers had been running into a deadlock trying to terminate
>> outstanding KATO commands during reset.
>> Looking closer at it, I found that we never actually _track_ if a KATO
>> command is submitted, so we might happily be sending several KATO
>> commands
>> to the same controller simultaneously.
> Can you explain how KATO commands can be sent simultaneously?
Sure.
Call nvme_start_keep_alive() on a dead connection.
Just _after_ the KATO request has been sent,
call nvme_start_keep_alive() again.
You now have an expired KATO command and a new KATO command; both are
active and sent to the controller.
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare@suse.de +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer
* Re: [PATCH 0/2] nvme: sanitize KATO handling
2021-02-24 7:06 ` Hannes Reinecke
@ 2021-02-24 7:20 ` Chao Leng
2021-02-24 7:27 ` Chao Leng
2021-02-24 7:59 ` Hannes Reinecke
0 siblings, 2 replies; 10+ messages in thread
From: Chao Leng @ 2021-02-24 7:20 UTC (permalink / raw)
To: Hannes Reinecke, Christoph Hellwig
Cc: Keith Busch, Daniel Wagner, linux-nvme, Sagi Grimberg
On 2021/2/24 15:06, Hannes Reinecke wrote:
> On 2/24/21 7:42 AM, Chao Leng wrote:
>>
>>
>> On 2021/2/23 20:07, Hannes Reinecke wrote:
>>> Hi all,
>>>
>>> one of our customers had been running into a deadlock trying to terminate
>>> outstanding KATO commands during reset.
>>> Looking closer at it, I found that we never actually _track_ if a KATO
>>> command is submitted, so we might happily be sending several KATO commands
>>> to the same controller simultaneously.
>> Can you explain how KATO commands can be sent simultaneously?
>
> Sure.
> Call nvme_start_keep_alive() on a dead connection.
> Just _after_ the KATO request has been sent,
> call nvme_start_keep_alive() again.
Call nvme_start_keep_alive() again? Why?
Right now only nvme_start_ctrl() calls nvme_start_keep_alive().
The ka_work item is canceled synchronously before reconnection starts.
Did I miss something?
>
> You now have an expired KATO command and a new KATO command; both are active and sent to the controller.
>
> Cheers,
>
> Hannes
* Re: [PATCH 0/2] nvme: sanitize KATO handling
2021-02-24 7:20 ` Chao Leng
@ 2021-02-24 7:27 ` Chao Leng
2021-02-24 7:59 ` Hannes Reinecke
1 sibling, 0 replies; 10+ messages in thread
From: Chao Leng @ 2021-02-24 7:27 UTC (permalink / raw)
To: Hannes Reinecke, Christoph Hellwig
Cc: linux-nvme, Daniel Wagner, Keith Busch, Sagi Grimberg
On 2021/2/24 15:20, Chao Leng wrote:
>
>
> On 2021/2/24 15:06, Hannes Reinecke wrote:
>> On 2/24/21 7:42 AM, Chao Leng wrote:
>>>
>>>
>>> On 2021/2/23 20:07, Hannes Reinecke wrote:
>>>> Hi all,
>>>>
>>>> one of our customers had been running into a deadlock trying to terminate
>>>> outstanding KATO commands during reset.
>>>> Looking closer at it, I found that we never actually _track_ if a KATO
>>>> command is submitted, so we might happily be sending several KATO commands
>>>> to the same controller simultaneously.
>>> Can you explain how KATO commands can be sent simultaneously?
>>
>> Sure.
>> Call nvme_start_keep_alive() on a dead connection.
>> Just _after_ the KATO request has been sent,
>> call nvme_start_keep_alive() again.
> Call nvme_start_keep_alive() again? Why?
> Right now only nvme_start_ctrl() calls nvme_start_keep_alive().
> The ka_work item is canceled synchronously before reconnection starts.
> Did I miss something?
And all in-flight requests will be canceled before reconnection starts.
Is the request stuck in the block layer then?
>>
>> You now have an expired KATO command and a new KATO command; both are active and sent to the controller.
>>
>> Cheers,
>>
>> Hannes
>
* Re: [PATCH 0/2] nvme: sanitize KATO handling
2021-02-24 7:20 ` Chao Leng
2021-02-24 7:27 ` Chao Leng
@ 2021-02-24 7:59 ` Hannes Reinecke
1 sibling, 0 replies; 10+ messages in thread
From: Hannes Reinecke @ 2021-02-24 7:59 UTC (permalink / raw)
To: Chao Leng, Christoph Hellwig
Cc: Keith Busch, Daniel Wagner, linux-nvme, Sagi Grimberg
On 2/24/21 8:20 AM, Chao Leng wrote:
>
>
> On 2021/2/24 15:06, Hannes Reinecke wrote:
>> On 2/24/21 7:42 AM, Chao Leng wrote:
>>>
>>>
>>> On 2021/2/23 20:07, Hannes Reinecke wrote:
>>>> Hi all,
>>>>
>>>> one of our customers had been running into a deadlock trying to
>>>> terminate
>>>> outstanding KATO commands during reset.
>>>> Looking closer at it, I found that we never actually _track_ if a KATO
>>>> command is submitted, so we might happily be sending several KATO
>>>> commands
>>>> to the same controller simultaneously.
>>> Can you explain how KATO commands can be sent simultaneously?
>>
>> Sure.
>> Call nvme_start_keep_alive() on a dead connection.
>> Just _after_ the KATO request has been sent,
>> call nvme_start_keep_alive() again.
> Call nvme_start_keep_alive() again? Why?
> Right now only nvme_start_ctrl() calls nvme_start_keep_alive().
> The ka_work item is canceled synchronously before reconnection starts.
> Did I miss something?
My point was that there _can_ be a pending ka_work entry even when a KATO
command is running.
And yes, the ka_work entry will be cancelled, but _before_ the
outstanding commands are cancelled.
And cancelling the ka_work entry might cause the function to be
executed, which leads to a deadlock if blk_mq_get_request() is blocked
(e.g. if the queue is already stopped due to recovery).
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare@suse.de +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer