* [PATCH 4.15-rc 0/3] few fabrics fixes targeted to 4.15
@ 2017-12-21 10:07 Sagi Grimberg
  2017-12-21 10:07 ` [PATCH 4.15-rc 1/3] nvme-core: Don't set nvme_wq as MEM_RECLAIM Sagi Grimberg
                   ` (2 more replies)
  0 siblings, 3 replies; 19+ messages in thread
From: Sagi Grimberg @ 2017-12-21 10:07 UTC (permalink / raw)


The first two patches address possible hangs when deleting controllers
under heavy workqueue activity such as in-flight namespace scans,
reconnects, and resets. Lockdep complains about circular locking, and
rightfully so, as hangs on in-workqueue flushes with exactly this lock
dependency were observed.

The third patch protects against fabrics controller creation racing
with transport driver module unload.

Let's consider these for 4.15-rc.

Roy Shterman (3):
  nvme-core: Don't set nvme_wq as MEM_RECLAIM
  nvme-core/loop/rdma: Host delete_work and reset_work on system
    workqueues
  nvme-fabrics: Protect against module unload during create_ctrl

 drivers/nvme/host/core.c    |  6 +++---
 drivers/nvme/host/fabrics.c | 16 ++++++++++++----
 drivers/nvme/host/fabrics.h |  2 ++
 drivers/nvme/host/fc.c      |  1 +
 drivers/nvme/host/rdma.c    |  3 ++-
 drivers/nvme/target/loop.c  |  3 ++-
 6 files changed, 22 insertions(+), 9 deletions(-)

-- 
2.14.1


* [PATCH 4.15-rc 1/3] nvme-core: Don't set nvme_wq as MEM_RECLAIM
  2017-12-21 10:07 [PATCH 4.15-rc 0/3] few fabrics fixes targeted to 4.15 Sagi Grimberg
@ 2017-12-21 10:07 ` Sagi Grimberg
  2017-12-21 10:17   ` Christoph Hellwig
  2017-12-21 10:07 ` [PATCH 4.15-rc 2/3] nvme-core/loop/rdma: Host delete_work and reset_work on system workqueues Sagi Grimberg
  2017-12-21 10:07 ` [PATCH 4.15-rc 3/3] nvme-fabrics: Protect against module unload during create_ctrl Sagi Grimberg
  2 siblings, 1 reply; 19+ messages in thread
From: Sagi Grimberg @ 2017-12-21 10:07 UTC (permalink / raw)


From: Roy Shterman <roys@lightbitslabs.com>

nvme_wq must not be a MEM_RECLAIM workqueue because some of the
works it executes can allocate memory.

Signed-off-by: Roy Shterman <roys at lightbitslabs.com>
Signed-off-by: Sagi Grimberg <sagi at grimberg.me>
---
 drivers/nvme/host/core.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index c33f848ab49d..221c97d57562 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -3490,7 +3490,7 @@ int __init nvme_core_init(void)
 	int result;
 
 	nvme_wq = alloc_workqueue("nvme-wq",
-			WQ_UNBOUND | WQ_MEM_RECLAIM | WQ_SYSFS, 0);
+			WQ_UNBOUND | WQ_SYSFS, 0);
 	if (!nvme_wq)
 		return -ENOMEM;
 
-- 
2.14.1


* [PATCH 4.15-rc 2/3] nvme-core/loop/rdma: Host delete_work and reset_work on system workqueues
  2017-12-21 10:07 [PATCH 4.15-rc 0/3] few fabrics fixes targeted to 4.15 Sagi Grimberg
  2017-12-21 10:07 ` [PATCH 4.15-rc 1/3] nvme-core: Don't set nvme_wq as MEM_RECLAIM Sagi Grimberg
@ 2017-12-21 10:07 ` Sagi Grimberg
  2017-12-21 10:25   ` Christoph Hellwig
  2017-12-22 17:39   ` James Smart
  2017-12-21 10:07 ` [PATCH 4.15-rc 3/3] nvme-fabrics: Protect against module unload during create_ctrl Sagi Grimberg
  2 siblings, 2 replies; 19+ messages in thread
From: Sagi Grimberg @ 2017-12-21 10:07 UTC (permalink / raw)


From: Roy Shterman <roys@lightbitslabs.com>

We need to ensure that delete_work will be hosted on a different
workqueue than all the works we flush or cancel from it.
Otherwise we may hit a circular dependency warning [1].

Also, given that delete_work flushes reset_work, host reset_work
on system_wq and delete_work on system_long_wq. In addition,
fix the flushing in the individual drivers to flush system_long_wq
when draining queued deletes.

[1]:
[  178.491942] =============================================
[  178.492718] [ INFO: possible recursive locking detected ]
[  178.493495] 4.9.0-rc4-c844263313a8-lb #3 Tainted: G           OE
[  178.494382] ---------------------------------------------
[  178.495160] kworker/5:1/135 is trying to acquire lock:
[  178.495894]  ("nvme-wq"){++++.+}, at: [<ffffffffa70ac206>] flush_work+0x1a6/0x2d0
[  178.497670]
               but task is already holding lock:
[  178.498499]  ("nvme-wq"){++++.+}, at: [<ffffffffa70ad6c2>] process_one_work+0x162/0x6a0
[  178.500343]
               other info that might help us debug this:
[  178.501269]  Possible unsafe locking scenario:

[  178.502113]        CPU0
[  178.502472]        ----
[  178.502829]   lock("nvme-wq");
[  178.503716]   lock("nvme-wq");
[  178.504601]
                *** DEADLOCK ***

[  178.505441]  May be due to missing lock nesting notation

[  178.506453] 2 locks held by kworker/5:1/135:
[  178.507068]  #0:  ("nvme-wq"){++++.+}, at: [<ffffffffa70ad6c2>] process_one_work+0x162/0x6a0
[  178.509004]  #1:  ((&ctrl->delete_work)){+.+.+.}, at: [<ffffffffa70ad6c2>] process_one_work+0x162/0x6a0
[  178.511070]
               stack backtrace:
[  178.511693] CPU: 5 PID: 135 Comm: kworker/5:1 Tainted: G           OE   4.9.0-rc4-c844263313a8-lb #3
[  178.512974] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.1-1ubuntu1 04/01/2014
[  178.514247] Workqueue: nvme-wq nvme_del_ctrl_work [nvme_tcp]
[  178.515071]  ffffc2668175bae0 ffffffffa7450823 ffffffffa88abd80 ffffffffa88abd80
[  178.516195]  ffffc2668175bb98 ffffffffa70eb012 ffffffffa8d8d90d ffff9c472e9ea700
[  178.517318]  ffff9c472e9ea700 ffff9c4700000000 ffff9c4700007200 ab83be61bec0d50e
[  178.518443] Call Trace:
[  178.518807]  [<ffffffffa7450823>] dump_stack+0x85/0xc2
[  178.519542]  [<ffffffffa70eb012>] __lock_acquire+0x17d2/0x18f0
[  178.520377]  [<ffffffffa75839a7>] ? serial8250_console_putchar+0x27/0x30
[  178.521330]  [<ffffffffa7583980>] ? wait_for_xmitr+0xa0/0xa0
[  178.522174]  [<ffffffffa70ac1eb>] ? flush_work+0x18b/0x2d0
[  178.522975]  [<ffffffffa70eb7cb>] lock_acquire+0x11b/0x220
[  178.523753]  [<ffffffffa70ac206>] ? flush_work+0x1a6/0x2d0
[  178.524535]  [<ffffffffa70ac229>] flush_work+0x1c9/0x2d0
[  178.525291]  [<ffffffffa70ac206>] ? flush_work+0x1a6/0x2d0
[  178.526077]  [<ffffffffa70a9cf0>] ? flush_workqueue_prep_pwqs+0x220/0x220
[  178.527040]  [<ffffffffa70ae7cf>] __cancel_work_timer+0x10f/0x1d0
[  178.527907]  [<ffffffffa70fecb9>] ? vprintk_default+0x29/0x40
[  178.528726]  [<ffffffffa71cb507>] ? printk+0x48/0x50
[  178.529434]  [<ffffffffa70ae8c3>] cancel_delayed_work_sync+0x13/0x20
[  178.530381]  [<ffffffffc042100b>] nvme_stop_ctrl+0x5b/0x70 [nvme_core]
[  178.531314]  [<ffffffffc0403dcc>] nvme_del_ctrl_work+0x2c/0x50 [nvme_tcp]
[  178.532271]  [<ffffffffa70ad741>] process_one_work+0x1e1/0x6a0
[  178.533101]  [<ffffffffa70ad6c2>] ? process_one_work+0x162/0x6a0
[  178.533954]  [<ffffffffa70adc4e>] worker_thread+0x4e/0x490
[  178.534735]  [<ffffffffa70adc00>] ? process_one_work+0x6a0/0x6a0
[  178.535588]  [<ffffffffa70adc00>] ? process_one_work+0x6a0/0x6a0
[  178.536441]  [<ffffffffa70b48cf>] kthread+0xff/0x120
[  178.537149]  [<ffffffffa70b47d0>] ? kthread_park+0x60/0x60
[  178.538094]  [<ffffffffa70b47d0>] ? kthread_park+0x60/0x60
[  178.538900]  [<ffffffffa78e332a>] ret_from_fork+0x2a/0x40

Signed-off-by: Roy Shterman <roys at lightbitslabs.com>
Signed-off-by: Sagi Grimberg <sagi at grimberg.me>
---
 drivers/nvme/host/core.c   | 4 ++--
 drivers/nvme/host/rdma.c   | 2 +-
 drivers/nvme/target/loop.c | 2 +-
 3 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 221c97d57562..12fe1704437b 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -89,7 +89,7 @@ int nvme_reset_ctrl(struct nvme_ctrl *ctrl)
 {
 	if (!nvme_change_ctrl_state(ctrl, NVME_CTRL_RESETTING))
 		return -EBUSY;
-	if (!queue_work(nvme_wq, &ctrl->reset_work))
+	if (!queue_work(system_wq, &ctrl->reset_work))
 		return -EBUSY;
 	return 0;
 }
@@ -122,7 +122,7 @@ int nvme_delete_ctrl(struct nvme_ctrl *ctrl)
 {
 	if (!nvme_change_ctrl_state(ctrl, NVME_CTRL_DELETING))
 		return -EBUSY;
-	if (!queue_work(nvme_wq, &ctrl->delete_work))
+	if (!queue_work(system_long_wq, &ctrl->delete_work))
 		return -EBUSY;
 	return 0;
 }
diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index 0676a09c6eba..ff56d86467de 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -2028,7 +2028,7 @@ static void nvme_rdma_remove_one(struct ib_device *ib_device, void *client_data)
 	}
 	mutex_unlock(&nvme_rdma_ctrl_mutex);
 
-	flush_workqueue(nvme_wq);
+	flush_workqueue(system_long_wq);
 }
 
 static struct ib_client nvme_rdma_ib_client = {
diff --git a/drivers/nvme/target/loop.c b/drivers/nvme/target/loop.c
index 8652ed5c86e1..cab6ded4dca6 100644
--- a/drivers/nvme/target/loop.c
+++ b/drivers/nvme/target/loop.c
@@ -716,7 +716,7 @@ static void __exit nvme_loop_cleanup_module(void)
 		nvme_delete_ctrl(&ctrl->ctrl);
 	mutex_unlock(&nvme_loop_ctrl_mutex);
 
-	flush_workqueue(nvme_wq);
+	flush_workqueue(system_long_wq);
 }
 
 module_init(nvme_loop_init_module);
-- 
2.14.1


* [PATCH 4.15-rc 3/3] nvme-fabrics: Protect against module unload during create_ctrl
  2017-12-21 10:07 [PATCH 4.15-rc 0/3] few fabrics fixes targeted to 4.15 Sagi Grimberg
  2017-12-21 10:07 ` [PATCH 4.15-rc 1/3] nvme-core: Don't set nvme_wq as MEM_RECLAIM Sagi Grimberg
  2017-12-21 10:07 ` [PATCH 4.15-rc 2/3] nvme-core/loop/rdma: Host delete_work and reset_work on system workqueues Sagi Grimberg
@ 2017-12-21 10:07 ` Sagi Grimberg
  2017-12-21 10:25   ` Christoph Hellwig
  2 siblings, 1 reply; 19+ messages in thread
From: Sagi Grimberg @ 2017-12-21 10:07 UTC (permalink / raw)


From: Roy Shterman <roys@lightbitslabs.com>

nvme transport driver module unload may (and usually does) trigger iteration
over the active controllers and delete them all (sometimes under a mutex).
However, a controller can be created concurrently with module unload, which
can lead to leakage of resources (most importantly, leakage of the char
device node) in case the controller creation occurred after the unload's
delete-and-drain sequence. To protect against this, we take a module
reference to guarantee that the nvme transport driver is not unloaded while
creating a controller.

Signed-off-by: Roy Shterman <roys at lightbitslabs.com>
Signed-off-by: Sagi Grimberg <sagi at grimberg.me>
---
 drivers/nvme/host/fabrics.c | 16 ++++++++++++----
 drivers/nvme/host/fabrics.h |  2 ++
 drivers/nvme/host/fc.c      |  1 +
 drivers/nvme/host/rdma.c    |  1 +
 drivers/nvme/target/loop.c  |  1 +
 5 files changed, 17 insertions(+), 4 deletions(-)

diff --git a/drivers/nvme/host/fabrics.c b/drivers/nvme/host/fabrics.c
index 76b4fe6816a0..11bf2d70570f 100644
--- a/drivers/nvme/host/fabrics.c
+++ b/drivers/nvme/host/fabrics.c
@@ -492,7 +492,7 @@ EXPORT_SYMBOL_GPL(nvmf_should_reconnect);
  */
 int nvmf_register_transport(struct nvmf_transport_ops *ops)
 {
-	if (!ops->create_ctrl)
+	if (!ops->create_ctrl || !ops->module)
 		return -EINVAL;
 
 	down_write(&nvmf_transports_rwsem);
@@ -868,24 +868,30 @@ nvmf_create_ctrl(struct device *dev, const char *buf, size_t count)
 		goto out_unlock;
 	}
 
+	if (!try_module_get(ops->module)) {
+		ret = -EBUSY;
+		goto out_unlock;
+	}
+
 	ret = nvmf_check_required_opts(opts, ops->required_opts);
 	if (ret)
-		goto out_unlock;
+		goto out_module_put;
 	ret = nvmf_check_allowed_opts(opts, NVMF_ALLOWED_OPTS |
 				ops->allowed_opts | ops->required_opts);
 	if (ret)
-		goto out_unlock;
+		goto out_module_put;
 
 	ctrl = ops->create_ctrl(dev, opts);
 	if (IS_ERR(ctrl)) {
 		ret = PTR_ERR(ctrl);
-		goto out_unlock;
+		goto out_module_put;
 	}
 
 	if (strcmp(ctrl->subsys->subnqn, opts->subsysnqn)) {
 		dev_warn(ctrl->device,
 			"controller returned incorrect NQN: \"%s\".\n",
 			ctrl->subsys->subnqn);
+		module_put(ops->module);
 		up_read(&nvmf_transports_rwsem);
 		nvme_delete_ctrl_sync(ctrl);
 		return ERR_PTR(-EINVAL);
@@ -894,6 +900,8 @@ nvmf_create_ctrl(struct device *dev, const char *buf, size_t count)
 	up_read(&nvmf_transports_rwsem);
 	return ctrl;
 
+out_module_put:
+	module_put(ops->module);
 out_unlock:
 	up_read(&nvmf_transports_rwsem);
 out_free_opts:
diff --git a/drivers/nvme/host/fabrics.h b/drivers/nvme/host/fabrics.h
index 9ba614953607..25b19f722f5b 100644
--- a/drivers/nvme/host/fabrics.h
+++ b/drivers/nvme/host/fabrics.h
@@ -108,6 +108,7 @@ struct nvmf_ctrl_options {
  *			       fabric implementation of NVMe fabrics.
  * @entry:		Used by the fabrics library to add the new
  *			registration entry to its linked-list internal tree.
+ * @module:             Transport module reference
  * @name:		Name of the NVMe fabric driver implementation.
  * @required_opts:	sysfs command-line options that must be specified
  *			when adding a new NVMe controller.
@@ -126,6 +127,7 @@ struct nvmf_ctrl_options {
  */
 struct nvmf_transport_ops {
 	struct list_head	entry;
+	struct module		*module;
 	const char		*name;
 	int			required_opts;
 	int			allowed_opts;
diff --git a/drivers/nvme/host/fc.c b/drivers/nvme/host/fc.c
index e2ade0237ffe..e68c6b84c177 100644
--- a/drivers/nvme/host/fc.c
+++ b/drivers/nvme/host/fc.c
@@ -3380,6 +3380,7 @@ nvme_fc_create_ctrl(struct device *dev, struct nvmf_ctrl_options *opts)
 
 static struct nvmf_transport_ops nvme_fc_transport = {
 	.name		= "fc",
+	.module		= THIS_MODULE,
 	.required_opts	= NVMF_OPT_TRADDR | NVMF_OPT_HOST_TRADDR,
 	.allowed_opts	= NVMF_OPT_RECONNECT_DELAY | NVMF_OPT_CTRL_LOSS_TMO,
 	.create_ctrl	= nvme_fc_create_ctrl,
diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index ff56d86467de..596a3dde47fd 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -2006,6 +2006,7 @@ static struct nvme_ctrl *nvme_rdma_create_ctrl(struct device *dev,
 
 static struct nvmf_transport_ops nvme_rdma_transport = {
 	.name		= "rdma",
+	.module		= THIS_MODULE,
 	.required_opts	= NVMF_OPT_TRADDR,
 	.allowed_opts	= NVMF_OPT_TRSVCID | NVMF_OPT_RECONNECT_DELAY |
 			  NVMF_OPT_HOST_TRADDR | NVMF_OPT_CTRL_LOSS_TMO,
diff --git a/drivers/nvme/target/loop.c b/drivers/nvme/target/loop.c
index cab6ded4dca6..098d5b978691 100644
--- a/drivers/nvme/target/loop.c
+++ b/drivers/nvme/target/loop.c
@@ -686,6 +686,7 @@ static struct nvmet_fabrics_ops nvme_loop_ops = {
 
 static struct nvmf_transport_ops nvme_loop_transport = {
 	.name		= "loop",
+	.module		= THIS_MODULE,
 	.create_ctrl	= nvme_loop_create_ctrl,
 };
 
-- 
2.14.1


* [PATCH 4.15-rc 1/3] nvme-core: Don't set nvme_wq as MEM_RECLAIM
  2017-12-21 10:07 ` [PATCH 4.15-rc 1/3] nvme-core: Don't set nvme_wq as MEM_RECLAIM Sagi Grimberg
@ 2017-12-21 10:17   ` Christoph Hellwig
  2017-12-21 10:41     ` Sagi Grimberg
  0 siblings, 1 reply; 19+ messages in thread
From: Christoph Hellwig @ 2017-12-21 10:17 UTC (permalink / raw)


On Thu, Dec 21, 2017 at 12:07:50PM +0200, Sagi Grimberg wrote:
> From: Roy Shterman <roys at lightbitslabs.com>
> 
> nvme_wq must not be a MEM_RECLAIM workqueue because some of the
> works it executes can allocate memory.

But we need reset to work while in memory reclaim.

So instead we'll need to make sure whatever memory allocations are
required (which ones, btw?) are marked GFP_NOIO.
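
For illustration only (nothing in this series does this, and the helper
below is made up): the scoped-NOIO API is one way to get that effect
without annotating every call site:

#include <linux/sched/mm.h>
#include <linux/slab.h>

/*
 * Force GFP_NOIO semantics around an allocation-heavy section
 * (namespace scanning is the usual suspect), so allocations cannot
 * recurse into the I/O path of the controller we may be reclaiming
 * through.
 */
static void *nvme_alloc_noio(size_t size)
{
	unsigned int noio_flags = memalloc_noio_save();
	void *p = kzalloc(size, GFP_KERNEL);	/* acts as GFP_NOIO here */

	memalloc_noio_restore(noio_flags);
	return p;
}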


* [PATCH 4.15-rc 2/3] nvme-core/loop/rdma: Host delete_work and reset_work on system workqueues
  2017-12-21 10:07 ` [PATCH 4.15-rc 2/3] nvme-core/loop/rdma: Host delete_work and reset_work on system workqueues Sagi Grimberg
@ 2017-12-21 10:25   ` Christoph Hellwig
  2017-12-21 10:43     ` Sagi Grimberg
  2017-12-22 17:39   ` James Smart
  1 sibling, 1 reply; 19+ messages in thread
From: Christoph Hellwig @ 2017-12-21 10:25 UTC (permalink / raw)


On Thu, Dec 21, 2017 at 12:07:51PM +0200, Sagi Grimberg wrote:
> From: Roy Shterman <roys at lightbitslabs.com>
> 
> We need to ensure that delete_work will be hosted on a different
> workqueue than all the works we flush or cancel from it.
> Otherwise we may hit a circular dependency warning [1].
> 
> Also, given that delete_work flushes reset_work, host reset_work
> on system_wq and delete_work on system_long_wq. In addition,
> fix the flushing in the individual drivers to flush system_long_wq
> when draining queued deletes.

I vaguely remember pointing something like this out :)

I think we really need our own, separate WQ for this, e.g. an
nvme_reset_wq, as reset progress under a swap / paging load is
essential and we need it isolated from the system.

Once we have that, patch 1 is probably ok, but the patch adding
nvme_reset_wq should go before the current patch 1.  Also please
add good comments on the workqueue usage.
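
Roughly along these lines (sketch only: nvme_reset_wq does not exist in
this series, and the flag choice here is just a starting point):

/*
 * nvme_reset_wq hosts only ctrl->reset_work.  It keeps a rescuer
 * (WQ_MEM_RECLAIM) because resets must make forward progress while the
 * system may be swapping to the very controller being reset, and no
 * work running on it ever flushes it.
 */
extern struct workqueue_struct *nvme_reset_wq;

int nvme_reset_ctrl(struct nvme_ctrl *ctrl)
{
	if (!nvme_change_ctrl_state(ctrl, NVME_CTRL_RESETTING))
		return -EBUSY;
	if (!queue_work(nvme_reset_wq, &ctrl->reset_work))
		return -EBUSY;
	return 0;
}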


* [PATCH 4.15-rc 3/3] nvme-fabrics: Protect against module unload during create_ctrl
  2017-12-21 10:07 ` [PATCH 4.15-rc 3/3] nvme-fabrics: Protect against module unload during create_ctrl Sagi Grimberg
@ 2017-12-21 10:25   ` Christoph Hellwig
  2017-12-21 18:31     ` Sagi Grimberg
  0 siblings, 1 reply; 19+ messages in thread
From: Christoph Hellwig @ 2017-12-21 10:25 UTC (permalink / raw)


Looks good,

Reviewed-by: Christoph Hellwig <hch at lst.de>


* [PATCH 4.15-rc 1/3] nvme-core: Don't set nvme_wq as MEM_RECLAIM
  2017-12-21 10:17   ` Christoph Hellwig
@ 2017-12-21 10:41     ` Sagi Grimberg
  2017-12-21 13:00       ` Christoph Hellwig
  0 siblings, 1 reply; 19+ messages in thread
From: Sagi Grimberg @ 2017-12-21 10:41 UTC (permalink / raw)



>> From: Roy Shterman <roys at lightbitslabs.com>
>>
>> nvme_wq must not be a MEM_RECLAIM workqueue because some of the
>> works it executes can allocate memory.
> 
> But we need reset to work while in memory reclaim.

AFAIK, WQ_MEM_RECLAIM means that this workqueue can be drained
for memory reclaim, which means that a workqueue that hosts works
that are allocating memory cannot be such a workqueue.

How does this patch make reset not work in memory reclaim? memory
reclaim will drain workqueues that *are* reclaimable workqueues.

> So instead we'll need to make sure whatever memory allocations are
> required (which ones, btw?) are marked GFP_NOIO.

namespace scanning can allocate new namespaces.


* [PATCH 4.15-rc 2/3] nvme-core/loop/rdma: Host delete_work and reset_work on system workqueues
  2017-12-21 10:25   ` Christoph Hellwig
@ 2017-12-21 10:43     ` Sagi Grimberg
  2017-12-21 13:00       ` Christoph Hellwig
  0 siblings, 1 reply; 19+ messages in thread
From: Sagi Grimberg @ 2017-12-21 10:43 UTC (permalink / raw)


>> From: Roy Shterman <roys at lightbitslabs.com>
>>
>> We need to ensure that delete_work will be hosted on a different
>> workqueue than all the works we flush or cancel from it.
>> Otherwise we may hit a circular dependency warning [1].
>>
>> Also, given that delete_work flushes reset_work, host reset_work
>> on system_wq and delete_work on system_long_wq. In addition,
>> fix the flushing in the individual drivers to flush system_long_wq
>> when draining queued deletes.
> 
> I vaguely remember pointing something like this out :)

You did :)

> I think we really need our own, separate WQ for this, e.g. an
> nvme_reset_wq, as reset progress under a swap / paging load is
> essential and we need it isolated from the system.

I agree, this was the thought when introducing nvme_wq.

So you are fine with keeping controller delete in system_wq
(or system_long_wq)?

> Once we have that, patch 1 is probably ok, but the patch adding
> nvme_reset_wq should go before the current patch 1.  Also please
> add good comments on the workqueue usage.

OK, I can do that.


* [PATCH 4.15-rc 1/3] nvme-core: Don't set nvme_wq as MEM_RECLAIM
  2017-12-21 10:41     ` Sagi Grimberg
@ 2017-12-21 13:00       ` Christoph Hellwig
  2017-12-21 13:17         ` Sagi Grimberg
  0 siblings, 1 reply; 19+ messages in thread
From: Christoph Hellwig @ 2017-12-21 13:00 UTC (permalink / raw)


On Thu, Dec 21, 2017 at 12:41:30PM +0200, Sagi Grimberg wrote:
> AFAIK, WQ_MEM_RECLAIM means that this workqueue can be drained
> for memory reclaim, which means that a workqueue that hosts works
> that are allocating memory cannot be such a workqueue.

No.  WQ_MEM_RECLAIM means it has a dedicated rescuer execution
thread and is guaranteed to make forward progress even under grave
memory pressure.

> How does this patch make reset not work in memory reclaim? memory
> reclaim will drain workqueues that *are* reclaimable workqueues.

Without WQ_MEM_RECLAIM we might not be able to execute the reset
due to the overhead of starting a new helper thread to execute it.

> > So instead we'll need to make sure whatever memory allocations are
> > required (which ones, btw?) are marked GFP_NOIO.
> 
> namespace scanning can allocate new namespaces.

Ok, makes sense.


* [PATCH 4.15-rc 2/3] nvme-core/loop/rdma: Host delete_work and reset_work on system workqueues
  2017-12-21 10:43     ` Sagi Grimberg
@ 2017-12-21 13:00       ` Christoph Hellwig
  0 siblings, 0 replies; 19+ messages in thread
From: Christoph Hellwig @ 2017-12-21 13:00 UTC (permalink / raw)


On Thu, Dec 21, 2017 at 12:43:53PM +0200, Sagi Grimberg wrote:
> > I think we really need our own, separate WQ for this, e.g. an
> > nvme_reset_wq, as reset progress under a swap / paging load is
> > essential and we need it isolated from the system.
> 
> I agree, this was the thought when introducing nvme_wq.
> 
> So you are fine with keeping controller delete in system_wq
> (or system_long_wq)?

I'd keep delete on the reset workqueue.  If we can avoid depending
on too many system workqueues we're better off I think.


* [PATCH 4.15-rc 1/3] nvme-core: Don't set nvme_wq as MEM_RECLAIM
  2017-12-21 13:00       ` Christoph Hellwig
@ 2017-12-21 13:17         ` Sagi Grimberg
  2017-12-21 13:54           ` Christoph Hellwig
  0 siblings, 1 reply; 19+ messages in thread
From: Sagi Grimberg @ 2017-12-21 13:17 UTC (permalink / raw)



>> AFAIK, WQ_MEM_RECLAIM means that this workqueue can be drained
>> for memory reclaim, which means that a workqueue that hosts works
>> that are allocating memory cannot be such a workqueue.
> 
> No.  WQ_MEM_RECLAIM means it has a dedicated rescuer execution
> thread and is guaranteed to make forward progress even under grave
> memory pressure.
> 
>> How does this patch make reset not work in memory reclaim? memory
>> reclaim will drain workqueues that *are* reclaimable workqueues.
> 
> Without WQ_MEM_RECLAIM we might not be able to execute the reset
> due to the overhead of starting a new helper thread to execute it.

OK, thanks for the education :)

Note that we need to make sure not to flush a !MEM_RECLAIM workqueue
from a workqueue that is MEM_RECLAIM and vice-versa (if we do, we can
trigger deadlocks under severe memory pressure).

We cannot place the delete_work on the same workqueue as the reset_work
because we flush reset_work from nvme_delete_ctrl (this is what this
patch is trying to prevent).

We need a separate workqueue for it, or we can use system_long_wq
(however this would require making nvme_wq and nvme_reset_wq without
WQ_MEM_RECLAIM). Or, we introduce a new system_memreclaim_wq?

delete_ctrl does not need MEM_RECLAIM, but it flushes workqueues that
do need MEM_RECLAIM.
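
To make that rule concrete, a toy module (demo code only, not driver
code) showing the forbidden combination:

#include <linux/init.h>
#include <linux/module.h>
#include <linux/workqueue.h>

static struct workqueue_struct *reclaim_wq;	/* WQ_MEM_RECLAIM */
static struct workqueue_struct *plain_wq;	/* no WQ_MEM_RECLAIM */
static struct work_struct outer_work;

static void outer_fn(struct work_struct *work)
{
	/*
	 * A work running on a MEM_RECLAIM queue waits on a queue with no
	 * rescuer; under memory pressure plain_wq may never get a worker,
	 * so reclaim can wait forever.  check_flush_dependency() in the
	 * workqueue code warns about exactly this pattern.
	 */
	flush_workqueue(plain_wq);
}

static int __init flush_demo_init(void)
{
	reclaim_wq = alloc_workqueue("demo-reclaim", WQ_MEM_RECLAIM, 0);
	plain_wq = alloc_workqueue("demo-plain", 0, 0);
	if (!reclaim_wq || !plain_wq)
		goto err;

	INIT_WORK(&outer_work, outer_fn);
	queue_work(reclaim_wq, &outer_work);
	return 0;
err:
	if (plain_wq)
		destroy_workqueue(plain_wq);
	if (reclaim_wq)
		destroy_workqueue(reclaim_wq);
	return -ENOMEM;
}

static void __exit flush_demo_exit(void)
{
	flush_workqueue(reclaim_wq);
	destroy_workqueue(plain_wq);
	destroy_workqueue(reclaim_wq);
}

module_init(flush_demo_init);
module_exit(flush_demo_exit);
MODULE_LICENSE("GPL");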


* [PATCH 4.15-rc 1/3] nvme-core: Don't set nvme_wq as MEM_RECLAIM
  2017-12-21 13:17         ` Sagi Grimberg
@ 2017-12-21 13:54           ` Christoph Hellwig
  2017-12-21 14:17             ` Sagi Grimberg
  0 siblings, 1 reply; 19+ messages in thread
From: Christoph Hellwig @ 2017-12-21 13:54 UTC (permalink / raw)


On Thu, Dec 21, 2017 at 03:17:10PM +0200, Sagi Grimberg wrote:
> Note that we need to make sure not to flush a !MEM_RECLAIM workqueue
> from a workqueue that is MEM_RECLAIM and vice-versa (if we do, we can
> trigger deadlocks under severe memory pressure).

Yes.

> We cannot place the delete_work on the same workqueue as the reset_work
> because we flush reset_work from nvme_delete_ctrl (this is what this
> patch is trying to prevent).

Ok..

Seems like we should instead have a single-thread MEM_RECLAIM
workqueue per nvme controller for reset and remove as that would
implicitly serialize remove and delete.

Alternatively we could use the reset_work for removal as well.
In fact it already has the removal and we'd just need to add
a goto for that case if we are in deleting state, e.g. something
like the patch below, just for rdma without the core and other
transport bits:

diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index 37af56596be6..ac09d5c4465f 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -1753,6 +1753,9 @@ static void nvme_rdma_reset_ctrl_work(struct work_struct *work)
 	nvme_stop_ctrl(&ctrl->ctrl);
 	nvme_rdma_shutdown_ctrl(ctrl, false);
 
+	if (ctrl->state == NVME_CTRL_DELETING)
+		goto out_remove;
+
 	ret = nvme_rdma_configure_admin_queue(ctrl, false);
 	if (ret)
 		goto out_fail;
@@ -1760,7 +1763,7 @@ static void nvme_rdma_reset_ctrl_work(struct work_struct *work)
 	if (ctrl->ctrl.queue_count > 1) {
 		ret = nvme_rdma_configure_io_queues(ctrl, false);
 		if (ret)
-			goto out_fail;
+			goto out_remove;
 	}
 
 	changed = nvme_change_ctrl_state(&ctrl->ctrl, NVME_CTRL_LIVE);
@@ -1774,7 +1777,7 @@ static void nvme_rdma_reset_ctrl_work(struct work_struct *work)
 
 	return;
 
-out_fail:
+out_remove:
 	dev_warn(ctrl->ctrl.device, "Removing after reset failure\n");
 	nvme_remove_namespaces(&ctrl->ctrl);
 	nvme_rdma_shutdown_ctrl(ctrl, true);


* [PATCH 4.15-rc 1/3] nvme-core: Don't set nvme_wq as MEM_RECLAIM
  2017-12-21 13:54           ` Christoph Hellwig
@ 2017-12-21 14:17             ` Sagi Grimberg
  2017-12-29  9:35               ` Christoph Hellwig
  0 siblings, 1 reply; 19+ messages in thread
From: Sagi Grimberg @ 2017-12-21 14:17 UTC (permalink / raw)



>> We cannot place the delete_work on the same workqueue as the reset_work
>> because we flush reset_work from nvme_delete_ctrl (this is what this
>> patch is trying to prevent).
> 
> Ok..
> 
> Seems like we should instead have a single-thread MEM_RECLAIM
> workqueue per nvme controller for reset and remove as that would
> implicitly serialize remove and delete.
> 
> Alternatively we could use the reset_work for removal as well.
> In fact it already has the removal and we'd just need to add
> a goto for that case if we are in deleting state, e.g. something
> like the patch below, just for rdma without the core and other
> transport bits:

The problem I have with these two suggestions is that in my mind,
delete_ctrl should never fail just because there is an inflight reset.

Also consider what we should do in module unload, where we trigger
deletion of all active controllers; we'd need some retry mechanism
to clean them up.

We have three work types:
1. async stuff like namespace scan, aen etc (which can allocate memory)
2. reset which flushes (1) and allocates memory on its own (nvme queues
etc...)
3. delete which flushes (2)

Maybe it won't be so bad to have three global workqueues to host these
works?
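
Concretely, the split could look something like this (sketch only; the
names are placeholders and none of this is in the posted series):

struct workqueue_struct *nvme_wq;		/* scan, AEN, keep-alive, ... */
struct workqueue_struct *nvme_reset_wq;		/* hosts only reset_work */
struct workqueue_struct *nvme_delete_wq;	/* hosts only delete_work */

/*
 * Flushing only ever goes one way: delete_work flushes reset_work,
 * reset_work flushes the scan/AEN works, and no work flushes the queue
 * it runs on.  Which of these really need WQ_MEM_RECLAIM is part of the
 * discussion above; the flags below are just one option.
 */
static int nvme_create_workqueues(void)
{
	nvme_wq = alloc_workqueue("nvme-wq",
			WQ_UNBOUND | WQ_MEM_RECLAIM | WQ_SYSFS, 0);
	if (!nvme_wq)
		return -ENOMEM;

	nvme_reset_wq = alloc_workqueue("nvme-reset-wq",
			WQ_UNBOUND | WQ_MEM_RECLAIM | WQ_SYSFS, 0);
	if (!nvme_reset_wq)
		goto destroy_wq;

	nvme_delete_wq = alloc_workqueue("nvme-delete-wq",
			WQ_UNBOUND | WQ_MEM_RECLAIM | WQ_SYSFS, 0);
	if (!nvme_delete_wq)
		goto destroy_reset_wq;

	return 0;

destroy_reset_wq:
	destroy_workqueue(nvme_reset_wq);
destroy_wq:
	destroy_workqueue(nvme_wq);
	return -ENOMEM;
}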


* [PATCH 4.15-rc 3/3] nvme-fabrics: Protect against module unload during create_ctrl
  2017-12-21 10:25   ` Christoph Hellwig
@ 2017-12-21 18:31     ` Sagi Grimberg
  0 siblings, 0 replies; 19+ messages in thread
From: Sagi Grimberg @ 2017-12-21 18:31 UTC (permalink / raw)


> Looks good,
> 
> Reviewed-by: Christoph Hellwig <hch at lst.de>

Meh, it's missing a ref put... let's hold off on it for now.


* [PATCH 4.15-rc 2/3] nvme-core/loop/rdma: Host delete_work and reset_work on system workqueues
  2017-12-21 10:07 ` [PATCH 4.15-rc 2/3] nvme-core/loop/rdma: Host delete_work and reset_work on system workqueues Sagi Grimberg
  2017-12-21 10:25   ` Christoph Hellwig
@ 2017-12-22 17:39   ` James Smart
  2017-12-24  8:55     ` Sagi Grimberg
  1 sibling, 1 reply; 19+ messages in thread
From: James Smart @ 2017-12-22 17:39 UTC (permalink / raw)


On 12/21/2017 2:07 AM, Sagi Grimberg wrote:
> From: Roy Shterman <roys at lightbitslabs.com>
>
> We need to ensure that delete_work will be hosted on a different
> workqueue than all the works we flush or cancel from it.
> Otherwise we may hit a circular dependency warning [1].
>
> Also, given that delete_work flushes reset_work, host reset_work
> on system_wq and delete_work on system_long_wq. In addition,
> fix the flushing in the individual drivers to flush system_long_wq
> when draining queued deletes.
>
>
Any particular reason you didn't do similar mods for the fc transport?

-- james


* [PATCH 4.15-rc 2/3] nvme-core/loop/rdma: Host delete_work and reset_work on system workqueues
  2017-12-22 17:39   ` James Smart
@ 2017-12-24  8:55     ` Sagi Grimberg
  0 siblings, 0 replies; 19+ messages in thread
From: Sagi Grimberg @ 2017-12-24  8:55 UTC (permalink / raw)



> Any particular reason you didn't do similar mods for the fc transport?

AFAICT, fc never flushes the nvme_wq.


* [PATCH 4.15-rc 1/3] nvme-core: Don't set nvme_wq as MEM_RECLAIM
  2017-12-21 14:17             ` Sagi Grimberg
@ 2017-12-29  9:35               ` Christoph Hellwig
  2017-12-31  9:51                 ` Sagi Grimberg
  0 siblings, 1 reply; 19+ messages in thread
From: Christoph Hellwig @ 2017-12-29  9:35 UTC (permalink / raw)


On Thu, Dec 21, 2017 at 04:17:29PM +0200, Sagi Grimberg wrote:
> We have three work types:
> 1. async stuff like namespace scan, aen etc (which can allocate memory)
> 2. reset which flushes (1) and allocates memory on its own (nvme queues
> etc...)
> 3. delete which flushes (2)
>
> Maybe it won't be so bad to have three global workqueues to host these
> works?

Yes.  Or at least two, given that we don't really need a private
workqueue for 1, I think.


* [PATCH 4.15-rc 1/3] nvme-core: Don't set nvme_wq as MEM_RECLAIM
  2017-12-29  9:35               ` Christoph Hellwig
@ 2017-12-31  9:51                 ` Sagi Grimberg
  0 siblings, 0 replies; 19+ messages in thread
From: Sagi Grimberg @ 2017-12-31  9:51 UTC (permalink / raw)



>> We have three work types:
>> 1. async stuff like namespace scan, aen etc (which can allocate memory)
>> 2. reset which flushes (1) and allocates memory on its own (nvme queues
>> etc...)
>> 3. delete which flushes (2)
>>
>> Maybe it won't be so bad to have three global workqueues to host these
>> works?
> 
> Yes.  Or at least two, given that we don't really need a private
> workqueue for 1, I think.

We still need it with WQ_MEM_RECLAIM because we flush it from the other
workqueues. So we don't need a private workqueue, but we need it with
WQ_MEM_RECLAIM. So either we keep it or introduce a system_memreclaim_wq...

Thoughts?


