linux-nvme.lists.infradead.org archive mirror
* [PATCH] nvmet-rdma: Release connections synchronously
@ 2023-05-11 15:03 Bart Van Assche
  2023-05-14 10:38 ` Shinichiro Kawasaki
  2023-05-17  7:42 ` Sagi Grimberg
  0 siblings, 2 replies; 8+ messages in thread
From: Bart Van Assche @ 2023-05-11 15:03 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Keith Busch, linux-nvme, Bart Van Assche, Sagi Grimberg,
	Hannes Reinecke, Shin'ichiro Kawasaki

Lockdep complains about flushing queue->release_work in the NVMe RDMA target
driver because all instances of queue->release_work have the same lock class
key. Assigning a different lock class key to each release_work instance is
difficult because dynamically registered lock class keys must be unregistered
and the last use of the work_struct lock class key happens after the work
callback has returned. Hence rework the code for releasing connections as
follows:
* Add an argument to __nvmet_rdma_queue_disconnect() and also to
  nvmet_rdma_queue_disconnect() that indicates whether or not these functions
  are called from inside a CM handler callback.
* Do not call rdma_destroy_id() if called from inside a CM handler callback.
* Let these functions return a negative value if the caller should free the
  CM ID.
* Let the CM handler return a negative value if its caller should free the
  CM ID.
* Remove queue->release_work since releasing connections now happens
  synchronously.
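
The rework relies on the rdma_cm event handler contract: a handler must not
call rdma_destroy_id() on the CM ID that was passed into it, but returning a
non-zero value makes the rdma_cm core destroy that ID after the handler has
returned. A minimal sketch of that contract (illustrative only; the function
name below is hypothetical and not part of this patch):

#include <rdma/rdma_cm.h>

static int example_cm_handler(struct rdma_cm_id *cm_id,
			      struct rdma_cm_event *event)
{
	switch (event->event) {
	case RDMA_CM_EVENT_DISCONNECTED:
		/*
		 * Tear down driver state here, but leave cm_id alone:
		 * a non-zero return value asks the rdma_cm core to
		 * destroy it once this callback has returned.
		 */
		return -ENODEV;
	default:
		/* Keep the CM ID alive. */
		return 0;
	}
}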

This patch suppresses the following lockdep complaint:

======================================================
WARNING: possible circular locking dependency detected
6.3.0 #62 Not tainted
------------------------------------------------------
kworker/1:3/455 is trying to acquire lock:
ffff88813bfd9420 (&id_priv->handler_mutex){+.+.}-{3:3}, at: rdma_destroy_id+0x17/0x20 [rdma_cm]

but task is already holding lock:
ffff888131327db0 ((work_completion)(&queue->release_work)){+.+.}-{0:0}, at: process_one_work+0x793/0x1350

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #3 ((work_completion)(&queue->release_work)){+.+.}-{0:0}:
       process_one_work+0x7dc/0x1350
       worker_thread+0xfc/0x1260
       kthread+0x29e/0x340
       ret_from_fork+0x2c/0x50

-> #2 ((wq_completion)nvmet-wq){+.+.}-{0:0}:
       __flush_workqueue+0x130/0x12f0
       nvmet_rdma_cm_handler+0x961/0x39c0 [nvmet_rdma]
       cma_cm_event_handler+0xb2/0x2f0 [rdma_cm]
       cma_ib_req_handler+0x1096/0x4a50 [rdma_cm]
       cm_process_work+0x46/0x3b0 [ib_cm]
       cm_req_handler+0x20e2/0x61d0 [ib_cm]
       cm_work_handler+0xc15/0x6ce0 [ib_cm]
       process_one_work+0x843/0x1350
       worker_thread+0xfc/0x1260
       kthread+0x29e/0x340
       ret_from_fork+0x2c/0x50

-> #1 (&id_priv->handler_mutex/1){+.+.}-{3:3}:
       __mutex_lock+0x186/0x18f0
       cma_ib_req_handler+0xc3c/0x4a50 [rdma_cm]
       cm_process_work+0x46/0x3b0 [ib_cm]
       cm_req_handler+0x20e2/0x61d0 [ib_cm]
       cm_work_handler+0xc15/0x6ce0 [ib_cm]
       process_one_work+0x843/0x1350
       worker_thread+0xfc/0x1260
       kthread+0x29e/0x340
       ret_from_fork+0x2c/0x50

-> #0 (&id_priv->handler_mutex){+.+.}-{3:3}:
       __lock_acquire+0x2fc0/0x60f0
       lock_acquire+0x1a7/0x4e0
       __mutex_lock+0x186/0x18f0
       rdma_destroy_id+0x17/0x20 [rdma_cm]
       nvmet_rdma_free_queue+0x7a/0x380 [nvmet_rdma]
       nvmet_rdma_release_queue_work+0x3e/0x90 [nvmet_rdma]
       process_one_work+0x843/0x1350
       worker_thread+0xfc/0x1260
       kthread+0x29e/0x340
       ret_from_fork+0x2c/0x50

other info that might help us debug this:

Chain exists of:
  &id_priv->handler_mutex --> (wq_completion)nvmet-wq --> (work_completion)(&queue->release_work)

 Possible unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock((work_completion)(&queue->release_work));
                               lock((wq_completion)nvmet-wq);
                               lock((work_completion)(&queue->release_work));
  lock(&id_priv->handler_mutex);

 *** DEADLOCK ***

2 locks held by kworker/1:3/455:
 #0: ffff888127b28538 ((wq_completion)nvmet-wq){+.+.}-{0:0}, at: process_one_work+0x766/0x1350
 #1: ffff888131327db0 ((work_completion)(&queue->release_work)){+.+.}-{0:0}, at: process_one_work+0x793/0x1350

stack backtrace:
CPU: 1 PID: 455 Comm: kworker/1:3 Not tainted 6.3.0 #62
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-1.fc38 04/01/2014
Workqueue: nvmet-wq nvmet_rdma_release_queue_work [nvmet_rdma]
Call Trace:
 <TASK>
 dump_stack_lvl+0x57/0x90
 check_noncircular+0x27b/0x310
 __lock_acquire+0x2fc0/0x60f0
 lock_acquire+0x1a7/0x4e0
 __mutex_lock+0x186/0x18f0
 rdma_destroy_id+0x17/0x20 [rdma_cm]
 nvmet_rdma_free_queue+0x7a/0x380 [nvmet_rdma]
 nvmet_rdma_release_queue_work+0x3e/0x90 [nvmet_rdma]
 process_one_work+0x843/0x1350
 worker_thread+0xfc/0x1260
 kthread+0x29e/0x340
 ret_from_fork+0x2c/0x50
 </TASK>

See also https://lore.kernel.org/all/rsmmxrchy6voi5qhl4irss5sprna3f5owkqtvybxglcv2pnylm@xmrnpfu3tfpe/

Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
 drivers/nvme/target/rdma.c | 70 ++++++++++++++++++++------------------
 1 file changed, 37 insertions(+), 33 deletions(-)

diff --git a/drivers/nvme/target/rdma.c b/drivers/nvme/target/rdma.c
index 4597bca43a6d..cfd9d4368248 100644
--- a/drivers/nvme/target/rdma.c
+++ b/drivers/nvme/target/rdma.c
@@ -101,7 +101,6 @@ struct nvmet_rdma_queue {
 	spinlock_t		rsps_lock;
 	struct nvmet_rdma_cmd	*cmds;
 
-	struct work_struct	release_work;
 	struct list_head	rsp_wait_list;
 	struct list_head	rsp_wr_wait_list;
 	spinlock_t		rsp_wr_wait_lock;
@@ -167,7 +166,8 @@ static void nvmet_rdma_recv_done(struct ib_cq *cq, struct ib_wc *wc);
 static void nvmet_rdma_read_data_done(struct ib_cq *cq, struct ib_wc *wc);
 static void nvmet_rdma_write_data_done(struct ib_cq *cq, struct ib_wc *wc);
 static void nvmet_rdma_qp_event(struct ib_event *event, void *priv);
-static void nvmet_rdma_queue_disconnect(struct nvmet_rdma_queue *queue);
+static int nvmet_rdma_queue_disconnect(struct nvmet_rdma_queue *queue,
+				       bool from_cm_handler);
 static void nvmet_rdma_free_rsp(struct nvmet_rdma_device *ndev,
 				struct nvmet_rdma_rsp *r);
 static int nvmet_rdma_alloc_rsp(struct nvmet_rdma_device *ndev,
@@ -693,7 +693,7 @@ static void nvmet_rdma_error_comp(struct nvmet_rdma_queue *queue)
 		 * of admin connect error, just disconnect and
 		 * cleanup the queue
 		 */
-		nvmet_rdma_queue_disconnect(queue);
+		nvmet_rdma_queue_disconnect(queue, false);
 	}
 }
 
@@ -1360,17 +1360,6 @@ static void nvmet_rdma_free_queue(struct nvmet_rdma_queue *queue)
 	kfree(queue);
 }
 
-static void nvmet_rdma_release_queue_work(struct work_struct *w)
-{
-	struct nvmet_rdma_queue *queue =
-		container_of(w, struct nvmet_rdma_queue, release_work);
-	struct nvmet_rdma_device *dev = queue->dev;
-
-	nvmet_rdma_free_queue(queue);
-
-	kref_put(&dev->ref, nvmet_rdma_free_dev);
-}
-
 static int
 nvmet_rdma_parse_cm_connect_req(struct rdma_conn_param *conn,
 				struct nvmet_rdma_queue *queue)
@@ -1441,11 +1430,6 @@ nvmet_rdma_alloc_queue(struct nvmet_rdma_device *ndev,
 	if (ret)
 		goto out_destroy_sq;
 
-	/*
-	 * Schedules the actual release because calling rdma_destroy_id from
-	 * inside a CM callback would trigger a deadlock. (great API design..)
-	 */
-	INIT_WORK(&queue->release_work, nvmet_rdma_release_queue_work);
 	queue->dev = ndev;
 	queue->cm_id = cm_id;
 	queue->port = port->nport;
@@ -1582,11 +1566,6 @@ static int nvmet_rdma_queue_connect(struct rdma_cm_id *cm_id,
 		goto put_device;
 	}
 
-	if (queue->host_qid == 0) {
-		/* Let inflight controller teardown complete */
-		flush_workqueue(nvmet_wq);
-	}
-
 	ret = nvmet_rdma_cm_accept(cm_id, queue, &event->param.conn);
 	if (ret) {
 		/*
@@ -1638,10 +1617,17 @@ static void nvmet_rdma_queue_established(struct nvmet_rdma_queue *queue)
 	spin_unlock_irqrestore(&queue->state_lock, flags);
 }
 
-static void __nvmet_rdma_queue_disconnect(struct nvmet_rdma_queue *queue)
+/*
+ * If called from inside a CM handler callback, the CM ID must not be
+ * destroyed. Instead, a negative value is returned if the caller should
+ * destroy the CM ID.
+ */
+static int __nvmet_rdma_queue_disconnect(struct nvmet_rdma_queue *queue,
+					 bool from_cm_handler)
 {
 	bool disconnect = false;
 	unsigned long flags;
+	int ret = 0;
 
 	pr_debug("cm_id= %p queue->state= %d\n", queue->cm_id, queue->state);
 
@@ -1668,12 +1654,22 @@ static void __nvmet_rdma_queue_disconnect(struct nvmet_rdma_queue *queue)
 	spin_unlock_irqrestore(&queue->state_lock, flags);
 
 	if (disconnect) {
+		struct nvmet_rdma_device *dev = queue->dev;
+
 		rdma_disconnect(queue->cm_id);
-		queue_work(nvmet_wq, &queue->release_work);
+		if (from_cm_handler)
+			ret = -ENODEV;
+		else
+			rdma_destroy_id(queue->cm_id);
+		nvmet_rdma_free_queue(queue);
+		kref_put(&dev->ref, nvmet_rdma_free_dev);
 	}
+
+	return ret;
 }
 
-static void nvmet_rdma_queue_disconnect(struct nvmet_rdma_queue *queue)
+static int nvmet_rdma_queue_disconnect(struct nvmet_rdma_queue *queue,
+				       bool from_cm_handler)
 {
 	bool disconnect = false;
 
@@ -1685,12 +1681,16 @@ static void nvmet_rdma_queue_disconnect(struct nvmet_rdma_queue *queue)
 	mutex_unlock(&nvmet_rdma_queue_mutex);
 
 	if (disconnect)
-		__nvmet_rdma_queue_disconnect(queue);
+		return __nvmet_rdma_queue_disconnect(queue, from_cm_handler);
+
+	return 0;
 }
 
 static void nvmet_rdma_queue_connect_fail(struct rdma_cm_id *cm_id,
 		struct nvmet_rdma_queue *queue)
 {
+	struct nvmet_rdma_device *dev = queue->dev;
+
 	WARN_ON_ONCE(queue->state != NVMET_RDMA_Q_CONNECTING);
 
 	mutex_lock(&nvmet_rdma_queue_mutex);
@@ -1699,7 +1699,9 @@ static void nvmet_rdma_queue_connect_fail(struct rdma_cm_id *cm_id,
 	mutex_unlock(&nvmet_rdma_queue_mutex);
 
 	pr_err("failed to connect queue %d\n", queue->idx);
-	queue_work(nvmet_wq, &queue->release_work);
+
+	nvmet_rdma_free_queue(queue);
+	kref_put(&dev->ref, nvmet_rdma_free_dev);
 }
 
 /**
@@ -1749,6 +1751,7 @@ static int nvmet_rdma_device_removal(struct rdma_cm_id *cm_id,
 	return 1;
 }
 
+/* The caller destroys @cm_id if this function does not return zero. */
 static int nvmet_rdma_cm_handler(struct rdma_cm_id *cm_id,
 		struct rdma_cm_event *event)
 {
@@ -1779,7 +1782,7 @@ static int nvmet_rdma_cm_handler(struct rdma_cm_id *cm_id,
 		fallthrough;
 	case RDMA_CM_EVENT_DISCONNECTED:
 	case RDMA_CM_EVENT_TIMEWAIT_EXIT:
-		nvmet_rdma_queue_disconnect(queue);
+		ret = nvmet_rdma_queue_disconnect(queue, true);
 		break;
 	case RDMA_CM_EVENT_DEVICE_REMOVAL:
 		ret = nvmet_rdma_device_removal(cm_id, queue);
@@ -1791,6 +1794,7 @@ static int nvmet_rdma_cm_handler(struct rdma_cm_id *cm_id,
 	case RDMA_CM_EVENT_UNREACHABLE:
 	case RDMA_CM_EVENT_CONNECT_ERROR:
 		nvmet_rdma_queue_connect_fail(cm_id, queue);
+		ret = -ENODEV;
 		break;
 	default:
 		pr_err("received unrecognized RDMA CM event %d\n",
@@ -1812,7 +1816,7 @@ static void nvmet_rdma_delete_ctrl(struct nvmet_ctrl *ctrl)
 			list_del_init(&queue->queue_list);
 			mutex_unlock(&nvmet_rdma_queue_mutex);
 
-			__nvmet_rdma_queue_disconnect(queue);
+			__nvmet_rdma_queue_disconnect(queue, false);
 			goto restart;
 		}
 	}
@@ -1831,7 +1835,7 @@ static void nvmet_rdma_destroy_port_queues(struct nvmet_rdma_port *port)
 			continue;
 
 		list_del_init(&queue->queue_list);
-		__nvmet_rdma_queue_disconnect(queue);
+		__nvmet_rdma_queue_disconnect(queue, false);
 	}
 	mutex_unlock(&nvmet_rdma_queue_mutex);
 }
@@ -2049,7 +2053,7 @@ static void nvmet_rdma_remove_one(struct ib_device *ib_device, void *client_data
 
 		pr_info("Removing queue %d\n", queue->idx);
 		list_del_init(&queue->queue_list);
-		__nvmet_rdma_queue_disconnect(queue);
+		__nvmet_rdma_queue_disconnect(queue, false);
 	}
 	mutex_unlock(&nvmet_rdma_queue_mutex);
 



* Re: [PATCH] nvmet-rdma: Release connections synchronously
  2023-05-11 15:03 [PATCH] nvmet-rdma: Release connections synchronously Bart Van Assche
@ 2023-05-14 10:38 ` Shinichiro Kawasaki
  2023-05-15 17:29   ` Bart Van Assche
  2023-05-17  7:42 ` Sagi Grimberg
  1 sibling, 1 reply; 8+ messages in thread
From: Shinichiro Kawasaki @ 2023-05-14 10:38 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Christoph Hellwig, Keith Busch, linux-nvme, Sagi Grimberg,
	Hannes Reinecke

On May 11, 2023 / 08:03, Bart Van Assche wrote:
> Lockdep complains about flushing queue->release_work in the NVMe RDMA target
> driver because all instances of queue->release_work have the same lock class
> key. Assigning a different lock class key to each release_work instance is
> difficult because dynamically registered lock class keys must be unregistered
> and the last use of the work_struct lock class key happens after the work
> callback has returned. Hence rework the code for releasing connections as
> follows:
> * Add an argument to __nvmet_rdma_queue_disconnect() and also to
>   nvmet_rdma_queue_disconnect() that indicates whether or not these functions
>   are called from inside a CM handler callback.
> * Do not call rdma_destroy_id() if called from inside a CM handler callback.
> * Let these functions return a negative value if the caller should free the
>   CM ID.
> * Let the CM handler return a negative value if its caller should free the
>   CM ID.
> * Remove queue->release_work since releasing connections now happens
>   synchronously.
> 
> This patch suppresses the following lockdep complaint:

Bart, thanks for the v2 patch. I applied it to kernel v6.3 and ran the blktests
nvme test group with the siw transport. Unfortunately, I observed a test
process hang at nvme/003 [1]. The kernel reported a lockdep WARN about
id_priv->handler_mutex. The patch adds the "from_cm_handler" argument to
__nvmet_rdma_queue_disconnect() to skip rdma_destroy_id() if the call is from a
CM handler. However, rdma_destroy_id() is still called in the call path below:

__nvmet_rdma_queue_disconnect()
  nvmet_rdma_free_queue()
    nvmet_rdma_destroy_queue_ib()
      rdma_destroy_id()

So I think the from_cm_handler argument should also be added to
nvmet_rdma_free_queue() and nvmet_rdma_destroy_queue_ib(), so that
from_cm_handler is propagated and rdma_destroy_id() is not called in
nvmet_rdma_destroy_queue_ib(). I created a quick patch with this approach [2]
and observed that it avoids the process hang.

However, when I did further testing using the v2 patch and the fix [2], I
observed another process hang at nvme/009 [3]. This time, a deadlock on
subsys->lock happened in nvmet_ctrl_free(). I created a dirty patch to avoid
it [4], but it caused a lockdep WARN for a possible deadlock between
subsys->lock and id_priv->handler_mutex. I also observed a hang at nvme/044
related to nvmet_fatal_error_handler.

It looks like the removal of queue->release_work triggers some more
lock-related issues. Do you think this approach is the only way to address the
original issue? If so, it will need multiple follow-up fixes, I think.


[1]

[   44.824261] run blktests nvme/003 at 2023-05-14 16:32:23
[   44.925161] SoftiWARP attached
[   45.013065] nvmet: adding nsid 1 to subsystem blktests-subsystem-1
[   45.022749] iwpm_register_pid: Unable to send a nlmsg (client = 2)
[   45.025434] nvmet_rdma: enabling port 0 (10.0.2.15:4420)
[   45.072814] nvmet: creating discovery controller 1 for subsystem nqn.2014-08.org.nvmexpress.discovery for NQN nqn.2014-08.org.nvmexpress:uuid:00000000-0000-0000-0000-000000000000.
[   45.078662] nvme nvme1: new ctrl: NQN "nqn.2014-08.org.nvmexpress.discovery", addr 10.0.2.15:4420
[   55.141297] nvme nvme1: Removing ctrl: NQN "nqn.2014-08.org.nvmexpress.discovery"

[   55.172082] ============================================
[   55.172563] WARNING: possible recursive locking detected
[   55.173018] 6.3.0+ #5 Tainted: G            E     
[   55.173426] --------------------------------------------
[   55.173876] kworker/u8:4/88 is trying to acquire lock:
[   55.174312] ffff88811ecd2420 (&id_priv->handler_mutex){+.+.}-{3:3}, at: rdma_destroy_id+0x17/0x20 [rdma_cm]
[   55.175099] 
               but task is already holding lock:
[   55.175667] ffff88811ecd2420 (&id_priv->handler_mutex){+.+.}-{3:3}, at: cma_iw_handler+0xac/0x500 [rdma_cm]
[   55.176445] 
               other info that might help us debug this:
[   55.177027]  Possible unsafe locking scenario:

[   55.177567]        CPU0
[   55.177873]        ----
[   55.178113]   lock(&id_priv->handler_mutex);
[   55.178488]   lock(&id_priv->handler_mutex);
[   55.178864] 
                *** DEADLOCK ***

[   55.179482]  May be due to missing lock nesting notation

[   55.180102] 3 locks held by kworker/u8:4/88:
[   55.180516]  #0: ffff888129888138 ((wq_completion)iw_cm_wq){+.+.}-{0:0}, at: process_one_work+0x706/0x1390
[   55.181318]  #1: ffff88811cadfdb0 ((work_completion)(&work->work)#2){+.+.}-{0:0}, at: process_one_work+0x733/0x1390
[   55.182152]  #2: ffff88811ecd2420 (&id_priv->handler_mutex){+.+.}-{3:3}, at: cma_iw_handler+0xac/0x500 [rdma_cm]
[   55.183007] 
               stack backtrace:
[   55.183458] CPU: 1 PID: 88 Comm: kworker/u8:4 Tainted: G            E      6.3.0+ #5
[   55.184099] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-1.fc38 04/01/2014
[   55.185570] Workqueue: iw_cm_wq cm_work_handler [iw_cm]
[   55.186823] Call Trace:
[   55.187832]  <TASK>
[   55.188813]  dump_stack_lvl+0x57/0x90
[   55.189866]  __lock_acquire+0x12f5/0x5eb0
[   55.190934]  ? __pfx___lock_acquire+0x10/0x10
[   55.192038]  ? static_obj+0x89/0xc0
[   55.193062]  ? lockdep_init_map_type+0x169/0x7c0
[   55.194175]  lock_acquire+0x1a7/0x4e0
[   55.195262]  ? rdma_destroy_id+0x17/0x20 [rdma_cm]
[   55.196394]  ? __pfx_lock_acquire+0x10/0x10
[   55.197470]  ? lock_is_held_type+0xce/0x120
[   55.198541]  ? __pfx_ib_drain_qp_done+0x10/0x10 [ib_core]
[   55.199755]  ? __pfx___might_resched+0x10/0x10
[   55.200846]  __mutex_lock+0x185/0x18b0
[   55.201857]  ? rdma_destroy_id+0x17/0x20 [rdma_cm]
[   55.202922]  ? rdma_destroy_id+0x17/0x20 [rdma_cm]
[   55.203996]  ? __pfx___mutex_lock+0x10/0x10
[   55.204980]  ? rcu_is_watching+0x11/0xb0
[   55.205936]  ? rdma_destroy_id+0x17/0x20 [rdma_cm]
[   55.206988]  rdma_destroy_id+0x17/0x20 [rdma_cm]
[   55.207994]  nvmet_rdma_free_queue+0x7a/0x380 [nvmet_rdma]
[   55.209093]  __nvmet_rdma_queue_disconnect+0x3c0/0x7b0 [nvmet_rdma]
[   55.210215]  nvmet_rdma_cm_handler+0x765/0x3a10 [nvmet_rdma]
[   55.211304]  ? mark_lock+0x101/0x16d0
[   55.212201]  ? __pfx___might_resched+0x10/0x10
[   55.213109]  ? rcu_is_watching+0x11/0xb0
[   55.214040]  ? __pfx_nvmet_rdma_cm_handler+0x10/0x10 [nvmet_rdma]
[   55.215117]  ? cma_iw_handler+0xac/0x500 [rdma_cm]
[   55.216098]  ? __pfx___mutex_lock+0x10/0x10
[   55.216999]  ? mark_lock+0x101/0x16d0
[   55.217923]  ? __pfx_mark_lock+0x10/0x10
[   55.218837]  cma_cm_event_handler+0xb2/0x2f0 [rdma_cm]
[   55.219850]  cma_iw_handler+0x344/0x500 [rdma_cm]
[   55.220841]  ? __pfx_cma_iw_handler+0x10/0x10 [rdma_cm]
[   55.221874]  ? _raw_spin_unlock_irqrestore+0x4c/0x60
[   55.222889]  cm_work_handler+0xb57/0x1c50 [iw_cm]
[   55.223881]  ? __pfx_cm_work_handler+0x10/0x10 [iw_cm]
[   55.224916]  ? lock_release+0x378/0x650
[   55.225869]  ? __schedule+0x10c8/0x5c80
[   55.226816]  process_one_work+0x7e4/0x1390
[   55.227747]  ? __pfx_lock_acquire+0x10/0x10
[   55.228714]  ? __pfx_process_one_work+0x10/0x10
[   55.229689]  ? __pfx_do_raw_spin_lock+0x10/0x10
[   55.230614]  worker_thread+0xf7/0x12b0
[   55.231476]  ? __kthread_parkme+0xc1/0x1f0
[   55.232378]  ? __pfx_worker_thread+0x10/0x10
[   55.233261]  kthread+0x29e/0x340
[   55.234071]  ? __pfx_kthread+0x10/0x10
[   55.234938]  ret_from_fork+0x2c/0x50
[   55.235772]  </TASK>


[2]

diff --git a/drivers/nvme/target/rdma.c b/drivers/nvme/target/rdma.c
index cfd9d4368248..7d909a88d856 100644
--- a/drivers/nvme/target/rdma.c
+++ b/drivers/nvme/target/rdma.c
@@ -1333,23 +1333,25 @@ static int nvmet_rdma_create_queue_ib(struct nvmet_rdma_queue *queue)
 	goto out;
 }
 
-static void nvmet_rdma_destroy_queue_ib(struct nvmet_rdma_queue *queue)
+static void nvmet_rdma_destroy_queue_ib(struct nvmet_rdma_queue *queue,
+	bool from_cm_handler)
 {
 	ib_drain_qp(queue->qp);
-	if (queue->cm_id)
+	if (!from_cm_handler && queue->cm_id)
 		rdma_destroy_id(queue->cm_id);
 	ib_destroy_qp(queue->qp);
 	ib_cq_pool_put(queue->cq, queue->recv_queue_size + 2 *
 		       queue->send_queue_size + 1);
 }
 
-static void nvmet_rdma_free_queue(struct nvmet_rdma_queue *queue)
+static void nvmet_rdma_free_queue(struct nvmet_rdma_queue *queue,
+				  bool from_cm_handler)
 {
 	pr_debug("freeing queue %d\n", queue->idx);
 
 	nvmet_sq_destroy(&queue->nvme_sq);
 
-	nvmet_rdma_destroy_queue_ib(queue);
+	nvmet_rdma_destroy_queue_ib(queue, from_cm_handler);
 	if (!queue->nsrq) {
 		nvmet_rdma_free_cmds(queue->dev, queue->cmds,
 				queue->recv_queue_size,
@@ -1583,7 +1585,7 @@ static int nvmet_rdma_queue_connect(struct rdma_cm_id *cm_id,
 	return 0;
 
 free_queue:
-	nvmet_rdma_free_queue(queue);
+	nvmet_rdma_free_queue(queue, true);
 put_device:
 	kref_put(&ndev->ref, nvmet_rdma_free_dev);
 
@@ -1659,9 +1661,7 @@ static int __nvmet_rdma_queue_disconnect(struct nvmet_rdma_queue *queue,
 		rdma_disconnect(queue->cm_id);
 		if (from_cm_handler)
 			ret = -ENODEV;
-		else
-			rdma_destroy_id(queue->cm_id);
-		nvmet_rdma_free_queue(queue);
+		nvmet_rdma_free_queue(queue, from_cm_handler);
 		kref_put(&dev->ref, nvmet_rdma_free_dev);
 	}
 
@@ -1700,7 +1700,7 @@ static void nvmet_rdma_queue_connect_fail(struct rdma_cm_id *cm_id,
 
 	pr_err("failed to connect queue %d\n", queue->idx);
 
-	nvmet_rdma_free_queue(queue);
+	nvmet_rdma_free_queue(queue, true);
 	kref_put(&dev->ref, nvmet_rdma_free_dev);
 }
 


[3]

[   83.653727] run blktests nvme/009 at 2023-05-14 16:23:38
[   83.745101] SoftiWARP attached
[   83.818301] nvmet: adding nsid 1 to subsystem blktests-subsystem-1
[   83.839954] nvmet_rdma: enabling port 0 (10.0.2.15:4420)
[   83.878445] nvmet: creating nvm controller 1 for subsystem blktests-subsystem-1 for NQN nqn.2014-08.org.nvmexpress:uuid:00000000-0000-0000-0000-000000000000.
[   83.885342] nvme nvme1: Please enable CONFIG_NVME_MULTIPATH for full support of multi-port devices.
[   83.888614] nvme nvme1: creating 4 I/O queues.
[   83.922458] nvme nvme1: mapped 4/0/0 default/read/poll queues.
[   83.927734] nvme nvme1: new ctrl: NQN "blktests-subsystem-1", addr 10.0.2.15:4420
[   85.333528] nvme nvme1: Removing ctrl: NQN "blktests-subsystem-1"

[   85.429371] ============================================
[   85.430868] WARNING: possible recursive locking detected
[   85.432353] 6.3.0+ #5 Tainted: G            E     
[   85.433785] --------------------------------------------
[   85.435283] rm/1789 is trying to acquire lock:
[   85.436688] ffff8881215cb870 (&subsys->lock#2){+.+.}-{3:3}, at: nvmet_ctrl_free+0xb4/0x800 [nvmet]
[   85.438606] 
               but task is already holding lock:
[   85.441070] ffff8881215cb870 (&subsys->lock#2){+.+.}-{3:3}, at: nvmet_port_del_ctrls+0x2a/0x130 [nvmet]
[   85.443037] 
               other info that might help us debug this:
[   85.445618]  Possible unsafe locking scenario:

[   85.448068]        CPU0
[   85.449320]        ----
[   85.450545]   lock(&subsys->lock#2);
[   85.451891]   lock(&subsys->lock#2);
[   85.453211] 
                *** DEADLOCK ***

[   85.456225]  May be due to missing lock nesting notation

[   85.458498] 5 locks held by rm/1789:
[   85.459675]  #0: ffff8881218b0460 (sb_writers#15){.+.+}-{0:0}, at: do_unlinkat+0x182/0x590
[   85.461326]  #1: ffff8881089ac2b8 (&default_group_class[depth - 1]#4/1){+.+.}-{3:3}, at: do_unlinkat+0x202/0x590
[   85.463099]  #2: ffff88810913d1d0 (&sb->s_type->i_mutex_key#22){++++}-{3:3}, at: vfs_unlink+0xb7/0x810
[   85.464829]  #3: ffffffffc166fb90 (nvmet_config_sem){++++}-{3:3}, at: nvmet_port_subsys_drop_link+0x4c/0x2e0 [nvmet]
[   85.466702]  #4: ffff8881215cb870 (&subsys->lock#2){+.+.}-{3:3}, at: nvmet_port_del_ctrls+0x2a/0x130 [nvmet]
[   85.468507] 
               stack backtrace:
[   85.470497] CPU: 1 PID: 1789 Comm: rm Tainted: G            E      6.3.0+ #5
[   85.471962] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-1.fc38 04/01/2014
[   85.473666] Call Trace:
[   85.474722]  <TASK>
[   85.475753]  dump_stack_lvl+0x57/0x90
[   85.476912]  __lock_acquire+0x12f5/0x5eb0
[   85.478095]  ? __pfx_lock_release+0x10/0x10
[   85.479333]  ? __pfx___lock_acquire+0x10/0x10
[   85.480549]  ? mark_held_locks+0x96/0xe0
[   85.481722]  lock_acquire+0x1a7/0x4e0
[   85.482855]  ? nvmet_ctrl_free+0xb4/0x800 [nvmet]
[   85.484097]  ? __pfx_lock_acquire+0x10/0x10
[   85.485290]  ? lock_is_held_type+0xce/0x120
[   85.486421]  ? kasan_set_track+0x21/0x30
[   85.487507]  ? kasan_save_free_info+0x2a/0x50
[   85.488599]  ? __pfx___might_resched+0x10/0x10
[   85.489727]  ? nvmet_sq_destroy+0xbd/0x3f0 [nvmet]
[   85.490876]  ? nvmet_rdma_free_queue+0x30/0x390 [nvmet_rdma]
[   85.492123]  ? __nvmet_rdma_queue_disconnect+0x3cd/0x780 [nvmet_rdma]
[   85.493471]  __mutex_lock+0x185/0x18b0
[   85.494536]  ? nvmet_ctrl_free+0xb4/0x800 [nvmet]
[   85.495680]  ? nvmet_ctrl_free+0xb4/0x800 [nvmet]
[   85.496774]  ? __pfx_mark_lock+0x10/0x10
[   85.497794]  ? __pfx___mutex_lock+0x10/0x10
[   85.498892]  ? __pfx_timer_delete+0x10/0x10
[   85.499955]  ? nvmet_ctrl_free+0xb4/0x800 [nvmet]
[   85.501058]  nvmet_ctrl_free+0xb4/0x800 [nvmet]
[   85.502144]  ? __cancel_work+0x13d/0x220
[   85.503184]  ? lockdep_hardirqs_on+0x7d/0x100
[   85.504248]  ? __pfx_nvmet_ctrl_free+0x10/0x10 [nvmet]
[   85.505385]  ? __pfx___cancel_work+0x10/0x10
[   85.506418]  ? nvmet_sq_destroy+0xbd/0x3f0 [nvmet]
[   85.507510]  ? rcu_is_watching+0x11/0xb0
[   85.508499]  nvmet_sq_destroy+0x238/0x3f0 [nvmet]
[   85.509598]  nvmet_rdma_free_queue+0x30/0x390 [nvmet_rdma]
[   85.510753]  __nvmet_rdma_queue_disconnect+0x3cd/0x780 [nvmet_rdma]
[   85.512023]  nvmet_rdma_delete_ctrl+0x1a0/0x250 [nvmet_rdma]
[   85.513266]  nvmet_port_del_ctrls+0xe8/0x130 [nvmet]
[   85.514422]  nvmet_port_subsys_drop_link+0x1be/0x2e0 [nvmet]
[   85.515657]  ? __pfx_nvmet_port_subsys_drop_link+0x10/0x10 [nvmet]
[   85.516935]  configfs_unlink+0x382/0x7a0
[   85.518006]  vfs_unlink+0x29e/0x810
[   85.519019]  do_unlinkat+0x30c/0x590
[   85.520014]  ? __pfx_do_unlinkat+0x10/0x10
[   85.521072]  ? strncpy_from_user+0x66/0x2b0
[   85.522119]  ? getname_flags.part.0+0x8d/0x430
[   85.523184]  __x64_sys_unlinkat+0x92/0xe0
[   85.524200]  do_syscall_64+0x5c/0x90
[   85.525105]  ? asm_exc_page_fault+0x22/0x30
[   85.526099]  ? lockdep_hardirqs_on+0x7d/0x100
[   85.527084]  entry_SYSCALL_64_after_hwframe+0x72/0xdc
[   85.528157] RIP: 0033:0x7ffb731f9adb
[   85.529079] Code: 73 01 c3 48 8b 0d 55 63 0d 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa b8 07 01 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 25 63 0d 00 f7 d8 64 89 01 48
[   85.531986] RSP: 002b:00007ffdf5cca688 EFLAGS: 00000246 ORIG_RAX: 0000000000000107
[   85.533397] RAX: ffffffffffffffda RBX: 0000561b08d0d810 RCX: 00007ffb731f9adb
[   85.534706] RDX: 0000000000000000 RSI: 0000561b08d0c5f0 RDI: 00000000ffffff9c
[   85.536022] RBP: 0000561b08d0c560 R08: 0000000000000003 R09: 0000000000000000
[   85.537384] R10: 0000000000000100 R11: 0000000000000246 R12: 0000000000000000
[   85.538740] R13: 00007ffdf5cca7a0 R14: 0000000000000000 R15: 00007ffdf5cca7a0
[   85.540079]  </TASK>
[   94.320147] nvmet: ctrl 1 keep-alive timer (5 seconds) expired!
[   94.326018] nvmet: ctrl 1 fatal error occurred!


[4]

diff --git a/drivers/nvme/target/core.c b/drivers/nvme/target/core.c
index 3935165048e7..4589412c28c8 100644
--- a/drivers/nvme/target/core.c
+++ b/drivers/nvme/target/core.c
@@ -306,13 +306,15 @@ EXPORT_SYMBOL_GPL(nvmet_unregister_transport);
 
 void nvmet_port_del_ctrls(struct nvmet_port *port, struct nvmet_subsys *subsys)
 {
-	struct nvmet_ctrl *ctrl;
+	struct nvmet_ctrl *ctrl, *tmp;
 
 	mutex_lock(&subsys->lock);
-	list_for_each_entry(ctrl, &subsys->ctrls, subsys_entry) {
+	subsys->ctrls_owner = current;
+	list_for_each_entry_safe(ctrl, tmp, &subsys->ctrls, subsys_entry) {
 		if (ctrl->port == port)
 			ctrl->ops->delete_ctrl(ctrl);
 	}
+	subsys->ctrls_owner = NULL;
 	mutex_unlock(&subsys->lock);
 }
 
@@ -1476,11 +1478,18 @@ static void nvmet_ctrl_free(struct kref *ref)
 {
 	struct nvmet_ctrl *ctrl = container_of(ref, struct nvmet_ctrl, ref);
 	struct nvmet_subsys *subsys = ctrl->subsys;
+	bool recursive_lock = false;
 
-	mutex_lock(&subsys->lock);
+	if (!mutex_trylock(&subsys->lock)) {
+		if (subsys->ctrls_owner == current)
+			recursive_lock = true;
+		else
+			mutex_lock(&subsys->lock);
+	}
 	nvmet_release_p2p_ns_map(ctrl);
 	list_del(&ctrl->subsys_entry);
-	mutex_unlock(&subsys->lock);
+	if (!recursive_lock)
+		mutex_unlock(&subsys->lock);
 
 	nvmet_stop_keep_alive_timer(ctrl);
 
diff --git a/drivers/nvme/target/nvmet.h b/drivers/nvme/target/nvmet.h
index 89bedfcd974c..03569c712ae1 100644
--- a/drivers/nvme/target/nvmet.h
+++ b/drivers/nvme/target/nvmet.h
@@ -246,6 +246,7 @@ struct nvmet_subsys {
 	u16			cntlid_max;
 
 	struct list_head	ctrls;
+	struct task_struct	*ctrls_owner;
 
 	struct list_head	hosts;
 	bool			allow_any_host;


* Re: [PATCH] nvmet-rdma: Release connections synchronously
  2023-05-14 10:38 ` Shinichiro Kawasaki
@ 2023-05-15 17:29   ` Bart Van Assche
  2023-05-29 11:49     ` Pawel Baldysiak
  0 siblings, 1 reply; 8+ messages in thread
From: Bart Van Assche @ 2023-05-15 17:29 UTC (permalink / raw)
  To: Shinichiro Kawasaki
  Cc: Christoph Hellwig, Keith Busch, linux-nvme, Sagi Grimberg,
	Hannes Reinecke

On 5/14/23 03:38, Shinichiro Kawasaki wrote:
> Bart, thanks for the v2 patch. I applied it to kernel v6.3 and ran the
> blktests nvme test group with the siw transport. Unfortunately, I observed a
> test process hang at nvme/003 [1]. The kernel reported a lockdep WARN about
> id_priv->handler_mutex. The patch adds the "from_cm_handler" argument to
> __nvmet_rdma_queue_disconnect() to skip rdma_destroy_id() if the call is
> from a CM handler. However, rdma_destroy_id() is still called in the call
> path below:
> 
> __nvmet_rdma_queue_disconnect()
>    nvmet_rdma_free_queue()
>      nvmet_rdma_destroy_queue_ib()
>        rdma_destroy_id()
> 
> So I think the from_cm_handler argument should also be added to
> nvmet_rdma_free_queue() and nvmet_rdma_destroy_queue_ib(), so that
> from_cm_handler is propagated and rdma_destroy_id() is not called in
> nvmet_rdma_destroy_queue_ib(). I created a quick patch with this approach
> [2] and observed that it avoids the process hang.
> 
> However, when I did further testing using the v2 patch and the fix [2], I
> observed another process hang at nvme/009 [3]. This time, a deadlock on
> subsys->lock happened in nvmet_ctrl_free(). I created a dirty patch to avoid
> it [4], but it caused a lockdep WARN for a possible deadlock between
> subsys->lock and id_priv->handler_mutex. I also observed a hang at nvme/044
> related to nvmet_fatal_error_handler.

Hi Shinichiro,

We must be running different tests since the issues discovered by your
tests did not show up in my tests. Anyway, I think this is a good time
to ask for feedback from the NVMe maintainers. The only approach I see
to suppress the lockdep complaints triggered by test nvme/003 against the
NVMe RDMA transport driver is to do something like patch [4]: track the
owner of subsys->lock, find all code paths that could trigger recursive
locking, and make sure that recursive locking is not triggered. Does
everyone agree with this approach?

Thanks,

Bart.



* Re: [PATCH] nvmet-rdma: Release connections synchronously
  2023-05-11 15:03 [PATCH] nvmet-rdma: Release connections synchronously Bart Van Assche
  2023-05-14 10:38 ` Shinichiro Kawasaki
@ 2023-05-17  7:42 ` Sagi Grimberg
  2023-05-18 20:32   ` Bart Van Assche
  1 sibling, 1 reply; 8+ messages in thread
From: Sagi Grimberg @ 2023-05-17  7:42 UTC (permalink / raw)
  To: Bart Van Assche, Christoph Hellwig
  Cc: Keith Busch, linux-nvme, Hannes Reinecke, Shin'ichiro Kawasaki



On 5/11/23 18:03, Bart Van Assche wrote:
> Lockdep complains about flushing queue->release_work in the NVMe RDMA target
> driver because all instances of queue->release_work have the same lock class
> key. Assigning a different lock class key to each release_work instance is
> difficult because dynamically registered lock class keys must be unregistered
> and the last use of the work_struct lock class key happens after the work
> callback has returned. Hence rework the code for releasing connections as
> follows:
> * Add an argument to __nvmet_rdma_queue_disconnect() and also to
>    nvmet_rdma_queue_disconnect() that indicates whether or not these functions
>    are called from inside a CM handler callback.
> * Do not call rdma_destroy_id() if called from inside a CM handler callback.
> * Let these functions return a negative value if the caller should free the
>    CM ID.
> * Let the CM handler return a negative value if its caller should free the
>    CM ID.
> * Remove queue->release_work since releasing connections now happens
>    synchronously.
> 
> [...]
> 
> @@ -1582,11 +1566,6 @@ static int nvmet_rdma_queue_connect(struct rdma_cm_id *cm_id,
>   		goto put_device;
>   	}
>   
> -	if (queue->host_qid == 0) {
> -		/* Let inflight controller teardown complete */
> -		flush_workqueue(nvmet_wq);
> -	}
> -
>   	ret = nvmet_rdma_cm_accept(cm_id, queue, &event->param.conn);
>   	if (ret) {
>   		/*

You could have simply removed this hunk alone to make lockdep quiet on
this, without the need to rework the async queue removal.

The flush here was added to prevent a reset/connect/disconnect storm
causing the target to run out of resources (which we have seen reports
about in the distant past). What prevents it now?

And you both reworked the teardown, and still removed the flush, I don't
get why both are needed.



* Re: [PATCH] nvmet-rdma: Release connections synchronously
  2023-05-17  7:42 ` Sagi Grimberg
@ 2023-05-18 20:32   ` Bart Van Assche
  2023-05-22  6:41     ` Sagi Grimberg
  0 siblings, 1 reply; 8+ messages in thread
From: Bart Van Assche @ 2023-05-18 20:32 UTC (permalink / raw)
  To: Sagi Grimberg, Christoph Hellwig
  Cc: Keith Busch, linux-nvme, Hannes Reinecke, Shin'ichiro Kawasaki

On 5/17/23 00:42, Sagi Grimberg wrote:
> On 5/11/23 18:03, Bart Van Assche wrote:
>> @@ -1582,11 +1566,6 @@ static int nvmet_rdma_queue_connect(struct 
>> rdma_cm_id *cm_id,
>>           goto put_device;
>>       }
>> -    if (queue->host_qid == 0) {
>> -        /* Let inflight controller teardown complete */
>> -        flush_workqueue(nvmet_wq);
>> -    }
>> -
>>       ret = nvmet_rdma_cm_accept(cm_id, queue, &event->param.conn);
>>       if (ret) {
>>           /*
> 
> You could have simply removed this hunk alone to make lockdep quiet on
> this, without the need to rework the async queue removal.
> 
> The flush here was added to prevent a reset/connect/disconnect storm
> causing the target to run out of resources (which we have seen reports
> about in the distant past). What prevents it now?
> 
> And you both reworked the teardown, and still removed the flush, I don't
> get why both are needed.

Hi Sagi,

My understanding is that the above flush_workqueue() call waits for 
prior release_work to complete. If the release_work instance is removed, 
I don't think that the above flush_workqueue() call is still necessary.

This patch prevents the target from running out of resources by releasing
connections synchronously instead of asynchronously.

Does the above answer your questions?

Bart.



* Re: [PATCH] nvmet-rdma: Release connections synchronously
  2023-05-18 20:32   ` Bart Van Assche
@ 2023-05-22  6:41     ` Sagi Grimberg
  2023-05-22  7:07       ` Sagi Grimberg
  0 siblings, 1 reply; 8+ messages in thread
From: Sagi Grimberg @ 2023-05-22  6:41 UTC (permalink / raw)
  To: Bart Van Assche, Christoph Hellwig
  Cc: Keith Busch, linux-nvme, Hannes Reinecke, Shin'ichiro Kawasaki


>>> @@ -1582,11 +1566,6 @@ static int nvmet_rdma_queue_connect(struct 
>>> rdma_cm_id *cm_id,
>>>           goto put_device;
>>>       }
>>> -    if (queue->host_qid == 0) {
>>> -        /* Let inflight controller teardown complete */
>>> -        flush_workqueue(nvmet_wq);
>>> -    }
>>> -
>>>       ret = nvmet_rdma_cm_accept(cm_id, queue, &event->param.conn);
>>>       if (ret) {
>>>           /*
>>
>> You could have simply removed this hunk alone to make lockdep quiet on
>> this, without the need to rework the async queue removal.
>>
>> The flush here was added to prevent a reset/connect/disconnect storm
>> causing the target to run out of resources (which we have seen reports
>> about in the distant past). What prevents it now?
>>
>> And you both reworked the teardown, and still removed the flush, I don't
>> get why both are needed.
> 
> Hi Sagi,
> 
> My understanding is that the above flush_workqueue() call waits for 
> prior release_work to complete. If the release_work instance is removed, 
> I don't think that the above flush_workqueue() call is still necessary.

I'm wondering if making delete_ctrl synchronous may be a problem
in some cases; nvmet_fatal_error can be triggered from the
rdma completion path (rdma core workqueue context). Wondering if there
may be some inter-dependency there...

Also, nvmet-tcp has a similar lockdep complaint, where the teardown goes
via socket shutdown, which has to be async because we cannot release the
socket from the state_change callback.
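
For reference, a rough sketch of that nvmet-tcp pattern (paraphrased, not the
verbatim driver code; the example_* names are illustrative):

#include <linux/workqueue.h>
#include <net/sock.h>
#include <net/tcp_states.h>

extern struct workqueue_struct *nvmet_wq;	/* from nvmet.h */

struct example_tcp_queue {
	struct socket		*sock;
	struct work_struct	release_work;
};

/* Runs from the network stack, typically in softirq context. */
static void example_sk_state_change(struct sock *sk)
{
	struct example_tcp_queue *queue = sk->sk_user_data;

	if (!queue)
		return;

	switch (sk->sk_state) {
	case TCP_FIN_WAIT1:
	case TCP_CLOSE_WAIT:
	case TCP_CLOSE:
		/*
		 * The socket cannot be torn down from its own callback,
		 * so defer the release to a workqueue.
		 */
		queue_work(nvmet_wq, &queue->release_work);
		break;
	default:
		break;
	}
}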



* Re: [PATCH] nvmet-rdma: Release connections synchronously
  2023-05-22  6:41     ` Sagi Grimberg
@ 2023-05-22  7:07       ` Sagi Grimberg
  0 siblings, 0 replies; 8+ messages in thread
From: Sagi Grimberg @ 2023-05-22  7:07 UTC (permalink / raw)
  To: Bart Van Assche, Christoph Hellwig
  Cc: Keith Busch, linux-nvme, Hannes Reinecke, Shin'ichiro Kawasaki


>>>> @@ -1582,11 +1566,6 @@ static int nvmet_rdma_queue_connect(struct 
>>>> rdma_cm_id *cm_id,
>>>>           goto put_device;
>>>>       }
>>>> -    if (queue->host_qid == 0) {
>>>> -        /* Let inflight controller teardown complete */
>>>> -        flush_workqueue(nvmet_wq);
>>>> -    }
>>>> -
>>>>       ret = nvmet_rdma_cm_accept(cm_id, queue, &event->param.conn);
>>>>       if (ret) {
>>>>           /*
>>>
>>> You could have simply removed this hunk alone to make lockdep quiet on
>>> this, without the need to rework the async queue removal.
>>>
>>> The flush here was added to prevent a reset/connect/disconnect storm
>>> causing the target to run out of resources (which we have seen reports
>>> about in the distant past). What prevents it now?
>>>
>>> And you both reworked the teardown, and still removed the flush, I don't
>>> get why both are needed.
>>
>> Hi Sagi,
>>
>> My understanding is that the above flush_workqueue() call waits for 
>> prior release_work to complete. If the release_work instance is 
>> removed, I don't think that the above flush_workqueue() call is still 
>> necessary.
> 
> I'm wondering if making delete_ctrl synchronous may be a problem
> in some cases; nvmet_fatal_error can be triggered from the
> rdma completion path (rdma core workqueue context). Wondering if there
> may be some inter-dependency there...

For example, nvmet_rdma_comp_error -> nvmet_fatal_error -> delete_ctrl
-> __nvmet_rdma_queue_disconnect -> nvmet_rdma_free_queue ->
nvmet_rdma_destroy_queue_ib -> ib_drain_qp

ib_drain_qp cannot complete because the completion handler is
synchronously deleting the queue, hence cannot reap the drain
completion...
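
To make the inter-dependency concrete, here is a minimal sketch loosely
modeled on the ib_core drain helpers (simplified, not the actual
implementation; the example_* names are illustrative). The drain marker's
completion is delivered through the CQ that the caller is currently handling,
so the wait below can never finish:

#include <linux/completion.h>
#include <rdma/ib_verbs.h>

struct example_drain_cqe {
	struct ib_cqe		cqe;
	struct completion	done;
};

/* Runs from the CQ reaping context (e.g. an ib_core workqueue). */
static void example_drain_done(struct ib_cq *cq, struct ib_wc *wc)
{
	struct example_drain_cqe *d =
		container_of(wc->wr_cqe, struct example_drain_cqe, cqe);

	complete(&d->done);
}

static void example_drain_sq(struct ib_qp *qp)
{
	struct example_drain_cqe sdrain = {
		.cqe.done = example_drain_done,
	};
	struct ib_rdma_wr swr = {
		.wr = {
			.wr_cqe	= &sdrain.cqe,
			.opcode	= IB_WR_RDMA_WRITE,
		},
	};

	/* The real helper also moves the QP to the error state first. */
	init_completion(&sdrain.done);
	if (ib_post_send(qp, &swr.wr, NULL))
		return;
	/*
	 * Blocks until example_drain_done() runs, i.e. deadlocks if the
	 * caller is the context that reaps this CQ, as in the call chain
	 * above.
	 */
	wait_for_completion(&sdrain.done);
}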



* Re: [PATCH] nvmet-rdma: Release connections synchronously
  2023-05-15 17:29   ` Bart Van Assche
@ 2023-05-29 11:49     ` Pawel Baldysiak
  0 siblings, 0 replies; 8+ messages in thread
From: Pawel Baldysiak @ 2023-05-29 11:49 UTC (permalink / raw)
  To: bvanassche; +Cc: hare, hch, kbusch, linux-nvme, sagi, shinichiro.kawasaki

We hit the original issue with the rdma_destroy_id() hang and are still
looking for a fix. Once the proposed fixes are applied, we still run into the
mentioned hang related to nvmet_fatal_error_handler.
We can help with testing.

Thanks
Pawel


