* [PATCH 0/4] nvme: fix several bugs in nvme-fabric
@ 2023-01-03 10:03 Taehee Yoo
  2023-01-03 10:03 ` [PATCH 1/4] nvme: fix delete uninitialized controller Taehee Yoo
                   ` (3 more replies)
  0 siblings, 4 replies; 14+ messages in thread
From: Taehee Yoo @ 2023-01-03 10:03 UTC (permalink / raw)
  To: linux-nvme, kbusch, axboe, hch, sagi, kch
  Cc: james.p.freyensee, ming.l, larrystevenwise, anthony.j.knapp,
	pizhenwei, ap420073

This series fixes several race conditions in the nvme-fabrics code, as
well as a memory leak.

The first patch fixes a kernel panic in the delete-controller logic
caused by dereferencing an uninitialized controller.
Controllers can currently be deleted via sysfs at any time, but this
should be disallowed before controller initialization has finished.
Because it is always allowed, the delete-controller logic can access
uninitialized controller resources, which results in a kernel panic.

The second patch fixes a kernel panic in the reset-controller logic,
likewise caused by dereferencing an uninitialized controller.
This issue is very similar to the first one, and so is the solution:
it prevents resetting a controller before the nvme controller has been
initialized.

The third patch fixes a race condition between nvmet_ns_disable()
and nvmet_ns_enable().
These two functions should not run concurrently, but currently they
can, which makes nvmet_ns_disable() hang.

The last patch fixes a memory leak when a tcp target is released.
When a host sends a reset command to the target, the target calls
nvmet_tcp_free_cmd_data_in_buffers() to free the resources held by
each command.
Due to unnecessary condition checks, it may skip freeing some of those
resources, which leaks memory.

Taehee Yoo (4):
  nvme: fix delete uninitialized controller
  nvme: fix reset uninitialized controller
  nvmet: fix hang in nvmet_ns_disable()
  nvmet-tcp: fix memory leak in nvmet_tcp_free_cmd_data_in_buffers()

 drivers/nvme/host/core.c       | 31 +++++++++++++++++++++----------
 drivers/nvme/target/configfs.c | 14 +++++++-------
 drivers/nvme/target/core.c     | 10 ++++++----
 drivers/nvme/target/nvmet.h    |  8 +++++++-
 drivers/nvme/target/tcp.c      |  9 +++------
 5 files changed, 44 insertions(+), 28 deletions(-)

-- 
2.34.1




* [PATCH 1/4] nvme: fix delete uninitialized controller
  2023-01-03 10:03 [PATCH 0/4] nvme: fix several bugs in nvme-fabric Taehee Yoo
@ 2023-01-03 10:03 ` Taehee Yoo
  2023-01-03 10:30   ` Sagi Grimberg
  2023-01-03 10:03 ` [PATCH 2/4] nvme: fix reset " Taehee Yoo
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 14+ messages in thread
From: Taehee Yoo @ 2023-01-03 10:03 UTC (permalink / raw)
  To: linux-nvme, kbusch, axboe, hch, sagi, kch
  Cc: james.p.freyensee, ming.l, larrystevenwise, anthony.j.knapp,
	pizhenwei, ap420073

nvme-fabrics controllers can be deleted via sysfs:
echo 1 > /sys/class/nvme/nvme<NS>/delete_controller
The above command will call nvme_delete_ctrl_sync().
This function internally tries to change ctrl->state to NVME_CTRL_DELETING.
The NVME_CTRL_LIVE, NVME_CTRL_RESETTING, and NVME_CTRL_CONNECTING states
can all transition to NVME_CTRL_DELETING.
If the state change succeeds, nvme_do_delete_ctrl() is called, which
contains the actual controller delete logic.

The controller initialization logic transitions ctrl->state through
NEW -> CONNECTING -> LIVE.
The NVME_CTRL_CONNECTING state does not guarantee that initialization
has finished.

So the delete logic can run before controller initialization completes,
and a kernel panic occurs because nvme_do_delete_ctrl() dereferences
uninitialized values.
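
As an illustration only, here is a tiny userspace model of the state
check (a simplified sketch, not the driver code; the enum and helper
stand in for nvme_change_ctrl_state()):

  #include <stdbool.h>
  #include <stdio.h>

  enum ctrl_state { ST_NEW, ST_CONNECTING, ST_LIVE, ST_DELETING };

  /* Mirrors the transitions into DELETING described above: LIVE,
   * RESETTING (omitted here) and CONNECTING are all accepted, even
   * though CONNECTING means initialization may still be in flight. */
  static bool delete_allowed(enum ctrl_state s)
  {
          return s == ST_LIVE || s == ST_CONNECTING;
  }

  int main(void)
  {
          enum ctrl_state s = ST_CONNECTING; /* init not finished yet */

          if (delete_allowed(s))
                  printf("delete accepted while still CONNECTING\n");
          return 0;
  }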

BUG: KASAN: null-ptr-deref in do_raw_spin_trylock+0x67/0x180
Read of size 4 at addr 00000000000000c0 by task bash/928

CPU: 7 PID: 928 Comm: bash Not tainted 6.1.0 #35
nvme nvme0: Connect command failed: host path error
Call Trace:
 <TASK>
 dump_stack_lvl+0x57/0x81
 ? do_raw_spin_trylock+0x67/0x180
 kasan_report+0xba/0x1f0
nvme nvme0: failed to connect queue: 0 ret=880
 ? do_raw_spin_trylock+0x67/0x180
 ? sysfs_file_ops+0x170/0x170
 kasan_check_range+0x14a/0x1a0
 do_raw_spin_trylock+0x67/0x180
 ? do_raw_spin_lock+0x270/0x270
 ? nvme_remove_namespaces+0x1bc/0x3d0
 _raw_spin_lock_irqsave+0x4b/0x90
 ? blk_mq_quiesce_queue+0x1b/0x160
 blk_mq_quiesce_queue+0x1b/0x160
 nvme_tcp_delete_ctrl+0x4b/0x70
 nvme_do_delete_ctrl+0x135/0x141
 nvme_sysfs_delete.cold+0x8/0xd
 kernfs_fop_write_iter+0x34b/0x520
 vfs_write+0x83a/0xd20
 ? kernel_write+0x630/0x630
 ? rcu_read_lock_sched_held+0x12/0x80
 ? lock_acquire+0x4f4/0x630
 ? __fget_light+0x51/0x230
 ksys_write+0xf9/0x1d0
 ? __ia32_sys_read+0xa0/0xa0
 ? syscall_enter_from_user_mode+0x1d/0x50
 do_syscall_64+0x3b/0x90
 entry_SYSCALL_64_after_hwframe+0x63/0xcd
RIP: 0033:0x7fc955d10104

Fixes: 1a353d85b02d ("nvme: add fabrics sysfs attributes")
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
---
 drivers/nvme/host/core.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index d307ae4d8a57..cd4c80ca66d4 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -243,7 +243,8 @@ static void nvme_delete_ctrl_sync(struct nvme_ctrl *ctrl)
 	 * since ->delete_ctrl can free the controller.
 	 */
 	nvme_get_ctrl(ctrl);
-	if (nvme_change_ctrl_state(ctrl, NVME_CTRL_DELETING))
+	if (test_bit(NVME_CTRL_STARTED_ONCE, &ctrl->flags) &&
+	    nvme_change_ctrl_state(ctrl, NVME_CTRL_DELETING))
 		nvme_do_delete_ctrl(ctrl);
 	nvme_put_ctrl(ctrl);
 }
-- 
2.34.1




* [PATCH 2/4] nvme: fix reset uninitialized controller
  2023-01-03 10:03 [PATCH 0/4] nvme: fix several bugs in nvme-fabric Taehee Yoo
  2023-01-03 10:03 ` [PATCH 1/4] nvme: fix delete uninitialized controller Taehee Yoo
@ 2023-01-03 10:03 ` Taehee Yoo
  2023-01-03 10:32   ` Sagi Grimberg
  2023-01-03 10:03 ` [PATCH 3/4] nvmet: fix hang in nvmet_ns_disable() Taehee Yoo
  2023-01-03 10:03 ` [PATCH 4/4] nvmet-tcp: fix memory leak in nvmet_tcp_free_cmd_data_in_buffers() Taehee Yoo
  3 siblings, 1 reply; 14+ messages in thread
From: Taehee Yoo @ 2023-01-03 10:03 UTC (permalink / raw)
  To: linux-nvme, kbusch, axboe, hch, sagi, kch
  Cc: james.p.freyensee, ming.l, larrystevenwise, anthony.j.knapp,
	pizhenwei, ap420073

nvme-fabrics controllers can be reset via sysfs:
echo 1 > /sys/class/nvme/nvme#/reset_controller
The above command will call nvme_sysfs_reset().

This function internally runs ctrl->reset_work, synchronously or
asynchronously.
At this point, there is no guarantee that the controller has finished
initializing before it is reset.

So a kernel panic would occur because ctrl->reset_work dereferences
uninitialized values.

To avoid this, nvme_sysfs_reset() checks the NVME_CTRL_STARTED_ONCE
flag, which indicates that the controller has been fully initialized,
so the reset logic can run safely.
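
In outline, the guard behaves like this (a simplified userspace model
for illustration, not the actual driver code; started_once stands in
for the flag):

  #include <stdbool.h>
  #include <stdio.h>

  static bool started_once; /* stands in for NVME_CTRL_STARTED_ONCE */

  static void sysfs_reset(void)
  {
          if (!started_once) {
                  /* Init never completed: ignore the write instead
                   * of flushing a reset_work that is unsafe to run. */
                  printf("reset ignored\n");
                  return;
          }
          printf("reset scheduled\n");
  }

  int main(void)
  {
          sysfs_reset();       /* early write: safely ignored */
          started_once = true; /* set at the end of nvme_start_ctrl() */
          sysfs_reset();       /* now the reset may proceed */
          return 0;
  }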

WARNING: CPU: 1 PID: 462 at kernel/workqueue.c:3066 __flush_work+0x74f/0x960
Modules linked in: nvme_tcp xt_conntrack xt_MASQUERADE nf_conntrack_netlink xfrm_user xt_addrtype iptable_filter iptable_nat br_netfilter bridge stp llc crct10dif_pclmul crc32_generic crc32_pclmul crc32c_intel ghash_clmulni_intel sha512_ssse3 overlay openvswitch nsh nf_conncount nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 mlx5_ib ib_uverbs ib_core xts cts ecb mlx5_core aesni_intel crypto_simd cryptd mlxfw ptp sch_fq_codel nf_tables nfnetlink ip_tables x_tables unix
CPU: 1 PID: 462 Comm: kworker/u16:5 Not tainted 6.1.0+ #52 1d16bdc3867491ba5cf2147d49bd76d7eacb8fd9
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
Workqueue: nvme-reset-wq nvme_reset_ctrl_work [nvme_tcp]
RIP: 0010:__flush_work+0x74f/0x960
Code: c0 74 6c e8 53 97 17 00 48 c7 c6 c8 4a 1b 84 48 c7 c7 80 70 b9 8e 45 31 f6 e8 cd 53 0f 00 e9 5d fd ff ff 0f 0b e9 56 fd ff ff <0f> 0b 45 31 f6 e9 4c fd ff ff 4c 89 ef e8 4f 81 2a 02 e8 ea 75 16
RSP: 0018:ffff888116507a50 EFLAGS: 00010246
RAX: ffff88800646b490 RBX: 0000000000000011 RCX: 1ffffffff1e59f99
RDX: dffffc0000000000 RSI: 0000000000000001 RDI: ffff88800646b490
RBP: ffff888116507be8 R08: 0000000000000000 R09: 0000000000000000
R10: ffffed1000c8d692 R11: 0000000000000001 R12: 1ffff11022ca0f50
R13: 1ffff11022ca0f80 R14: 0000000000000001 R15: ffff88800646b4a8
FS:  0000000000000000(0000) GS:ffff888117a00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000055953af11fd0 CR3: 0000000106f62001 CR4: 00000000003706e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 <TASK>
 ? _raw_spin_unlock_irqrestore+0x59/0x70
 ? queue_delayed_work_on+0xa0/0xa0
 ? lock_release+0x631/0xe80
 ? __up_read+0x192/0x730
 ? up_write+0x520/0x520
 ? rcu_read_lock_sched_held+0x12/0x80
 ? lock_release+0x631/0xe80
 ? rcu_read_lock_sched_held+0x12/0x80
 ? try_to_grab_pending.part.0+0x23/0x540
 __cancel_work_timer+0x2cb/0x3f0
 ? cancel_delayed_work+0x10/0x10
 ? rcu_read_lock_sched_held+0x12/0x80
 ? lock_acquire+0x4f4/0x630
 ? lockdep_hardirqs_on_prepare+0x410/0x410
 ? lock_downgrade+0x700/0x700
 ? finish_task_switch.isra.0+0x23b/0x870
 ? trace_hardirqs_on+0x3c/0x190
 nvme_stop_ctrl+0x17/0x150
 nvme_reset_ctrl_work+0x19/0x120 [nvme_tcp aa1d0deebfd175637ed368a54a16dfbb09be290f]
[ ... ]

Fixes: f3ca80fc11c3 ("nvme: move chardev and sysfs interface to common code")
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
---
 drivers/nvme/host/core.c | 28 +++++++++++++++++++---------
 1 file changed, 19 insertions(+), 9 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index cd4c80ca66d4..418bd865c838 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -134,6 +134,14 @@ void nvme_queue_scan(struct nvme_ctrl *ctrl)
 		queue_work(nvme_wq, &ctrl->scan_work);
 }
 
+void nvme_queue_scan_sync(struct nvme_ctrl *ctrl)
+{
+	if (ctrl->state == NVME_CTRL_LIVE && ctrl->tagset) {
+		queue_work(nvme_wq, &ctrl->scan_work);
+		flush_work(&ctrl->scan_work);
+	}
+}
+
 /*
  * Use this function to proceed with scheduling reset_work for a controller
  * that had previously been set to the resetting state. This is intended for
@@ -1150,10 +1158,8 @@ void nvme_passthru_end(struct nvme_ctrl *ctrl, u32 effects,
 		dev_info(ctrl->device,
 "controller capabilities changed, reset may be required to take effect.\n");
 	}
-	if (effects & (NVME_CMD_EFFECTS_NIC | NVME_CMD_EFFECTS_NCC)) {
-		nvme_queue_scan(ctrl);
-		flush_work(&ctrl->scan_work);
-	}
+	if (effects & (NVME_CMD_EFFECTS_NIC | NVME_CMD_EFFECTS_NCC))
+		nvme_queue_scan_sync(ctrl);
 
 	switch (cmd->common.opcode) {
 	case nvme_admin_set_features:
@@ -3350,9 +3356,11 @@ static ssize_t nvme_sysfs_reset(struct device *dev,
 	struct nvme_ctrl *ctrl = dev_get_drvdata(dev);
 	int ret;
 
-	ret = nvme_reset_ctrl_sync(ctrl);
-	if (ret < 0)
-		return ret;
+	if (test_bit(NVME_CTRL_STARTED_ONCE, &ctrl->flags)) {
+		ret = nvme_reset_ctrl_sync(ctrl);
+		if (ret < 0)
+			return ret;
+	}
 	return count;
 }
 static DEVICE_ATTR(reset_controller, S_IWUSR, NULL, nvme_sysfs_reset);
@@ -4994,16 +5002,18 @@ void nvme_start_ctrl(struct nvme_ctrl *ctrl)
 	 * that were missed. We identify persistent discovery controllers by
 	 * checking that they started once before, hence are reconnecting back.
 	 */
-	if (test_and_set_bit(NVME_CTRL_STARTED_ONCE, &ctrl->flags) &&
+	if (test_bit(NVME_CTRL_STARTED_ONCE, &ctrl->flags) &&
 	    nvme_discovery_ctrl(ctrl))
 		nvme_change_uevent(ctrl, "NVME_EVENT=rediscover");
 
 	if (ctrl->queue_count > 1) {
-		nvme_queue_scan(ctrl);
+		nvme_queue_scan_sync(ctrl);
 		nvme_unquiesce_io_queues(ctrl);
 		nvme_mpath_update(ctrl);
 	}
 
+	set_bit(NVME_CTRL_STARTED_ONCE, &ctrl->flags);
+
 	nvme_change_uevent(ctrl, "NVME_EVENT=connected");
 }
 EXPORT_SYMBOL_GPL(nvme_start_ctrl);
-- 
2.34.1




* [PATCH 3/4] nvmet: fix hang in nvmet_ns_disable()
  2023-01-03 10:03 [PATCH 0/4] nvme: fix several bugs in nvme-fabric Taehee Yoo
  2023-01-03 10:03 ` [PATCH 1/4] nvme: fix delete uninitialized controller Taehee Yoo
  2023-01-03 10:03 ` [PATCH 2/4] nvme: fix reset " Taehee Yoo
@ 2023-01-03 10:03 ` Taehee Yoo
  2023-01-03 10:58   ` Sagi Grimberg
  2023-01-04  0:32   ` Chaitanya Kulkarni
  2023-01-03 10:03 ` [PATCH 4/4] nvmet-tcp: fix memory leak in nvmet_tcp_free_cmd_data_in_buffers() Taehee Yoo
  3 siblings, 2 replies; 14+ messages in thread
From: Taehee Yoo @ 2023-01-03 10:03 UTC (permalink / raw)
  To: linux-nvme, kbusch, axboe, hch, sagi, kch
  Cc: james.p.freyensee, ming.l, larrystevenwise, anthony.j.knapp,
	pizhenwei, ap420073

An nvme target namespace is enabled or disabled by nvmet_ns_enable()
or nvmet_ns_disable().
The subsys->lock prevents namespace data from being used while
nvmet_ns_enable() or nvmet_ns_disable() is running.
The ns->enabled boolean prevents namespace data from being used in a
wrong state, such as an uninitialized state.

nvmet_ns_disable() acquires subsys->lock and sets ns->enabled to false.
Then it releases subsys->lock for a while to wait for the
ns->disable_done completion.
At this point, nvmet_ns_enable() can run concurrently and call
percpu_ref_init().
So ns->disable_done will never complete, and nvmet_ns_disable() hangs.

   CPU0                                     CPU1
   nvmet_ns_disable();
   mutex_lock(&subsys->lock);               nvmet_ns_enable();
                                            mutex_lock(&subsys->lock);
   ns->enabled = false;
   mutex_unlock(&subsys->lock);
                                            percpu_ref_init();
   wait_for_completion(&ns->disable_done);  <-- infinite wait

   mutex_lock(&subsys->lock);
   mutex_unlock(&subsys->lock);
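
The tri-state fix below can be modeled in a few lines (an illustrative
userspace sketch only; in the real patch every transition happens
under subsys->lock):

  #include <stdio.h>

  enum ns_state { NS_ENABLED, NS_DISABLING, NS_DISABLED };

  static enum ns_state state = NS_ENABLED;

  /* Enable now also backs off while the state is DISABLING, so it
   * can no longer re-init the percpu ref that disable is waiting on. */
  static int ns_enable(void)
  {
          if (state != NS_DISABLED)
                  return -1;
          state = NS_ENABLED;
          return 0;
  }

  int main(void)
  {
          state = NS_DISABLING; /* disable dropped the lock to wait */
          if (ns_enable() < 0)
                  printf("enable rejected during DISABLING\n");
          return 0;
  }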

INFO: task bash:926 blocked for more than 30 seconds.
      Tainted: G        W          6.1.0+ #17
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:bash            state:D stack:27200 pid:926   ppid:911
flags:0x00004000
Call Trace:
 <TASK>
 __schedule+0xafc/0x2930
 ? io_schedule_timeout+0x160/0x160
 ? _raw_spin_unlock_irq+0x24/0x50
 ? __wait_for_common+0x39b/0x5c0
 ? usleep_range_state+0x190/0x190
 schedule+0x130/0x230
 schedule_timeout+0x18a/0x240
 ? usleep_range_state+0x190/0x190
 ? rcu_read_lock_sched_held+0x12/0x80
 ? lock_downgrade+0x700/0x700
 ? do_raw_spin_trylock+0xb5/0x180
 ? lock_contended+0xdf0/0xdf0
 ? _raw_spin_unlock_irq+0x24/0x50
 ? trace_hardirqs_on+0x3c/0x190
 __wait_for_common+0x1ca/0x5c0
 ? usleep_range_state+0x190/0x190
 ? bit_wait_io+0xf0/0xf0
 ? _raw_spin_unlock_irqrestore+0x59/0x70
 nvmet_ns_disable+0x288/0x490
 ? nvmet_ns_enable+0x970/0x970
 ? lockdep_hardirqs_on_prepare+0x410/0x410
 ? rcu_read_lock_sched_held+0x12/0x80
 ? configfs_write_iter+0x1df/0x480
 ? nvmet_ns_revalidate_size_store+0x220/0x220
 nvmet_ns_enable_store+0x85/0xe0
[ ... ]

Fixes: a07b4970f464 ("nvmet: add a generic NVMe target")
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
---
 drivers/nvme/target/configfs.c | 14 +++++++-------
 drivers/nvme/target/core.c     | 10 ++++++----
 drivers/nvme/target/nvmet.h    |  8 +++++++-
 3 files changed, 20 insertions(+), 12 deletions(-)

diff --git a/drivers/nvme/target/configfs.c b/drivers/nvme/target/configfs.c
index 907143870da5..d878c4231d65 100644
--- a/drivers/nvme/target/configfs.c
+++ b/drivers/nvme/target/configfs.c
@@ -348,7 +348,7 @@ static ssize_t nvmet_ns_device_path_store(struct config_item *item,
 
 	mutex_lock(&subsys->lock);
 	ret = -EBUSY;
-	if (ns->enabled)
+	if (ns->state != NVMET_NS_DISABLED)
 		goto out_unlock;
 
 	ret = -EINVAL;
@@ -390,7 +390,7 @@ static ssize_t nvmet_ns_p2pmem_store(struct config_item *item,
 	int error;
 
 	mutex_lock(&ns->subsys->lock);
-	if (ns->enabled) {
+	if (ns->state != NVMET_NS_DISABLED) {
 		ret = -EBUSY;
 		goto out_unlock;
 	}
@@ -427,7 +427,7 @@ static ssize_t nvmet_ns_device_uuid_store(struct config_item *item,
 	int ret = 0;
 
 	mutex_lock(&subsys->lock);
-	if (ns->enabled) {
+	if (ns->state != NVMET_NS_DISABLED) {
 		ret = -EBUSY;
 		goto out_unlock;
 	}
@@ -458,7 +458,7 @@ static ssize_t nvmet_ns_device_nguid_store(struct config_item *item,
 	int ret = 0;
 
 	mutex_lock(&subsys->lock);
-	if (ns->enabled) {
+	if (ns->state != NVMET_NS_DISABLED) {
 		ret = -EBUSY;
 		goto out_unlock;
 	}
@@ -523,7 +523,7 @@ CONFIGFS_ATTR(nvmet_ns_, ana_grpid);
 
 static ssize_t nvmet_ns_enable_show(struct config_item *item, char *page)
 {
-	return sprintf(page, "%d\n", to_nvmet_ns(item)->enabled);
+	return sprintf(page, "%d\n", !!to_nvmet_ns(item)->state);
 }
 
 static ssize_t nvmet_ns_enable_store(struct config_item *item,
@@ -561,7 +561,7 @@ static ssize_t nvmet_ns_buffered_io_store(struct config_item *item,
 		return -EINVAL;
 
 	mutex_lock(&ns->subsys->lock);
-	if (ns->enabled) {
+	if (ns->state != NVMET_NS_DISABLED) {
 		pr_err("disable ns before setting buffered_io value.\n");
 		mutex_unlock(&ns->subsys->lock);
 		return -EINVAL;
@@ -587,7 +587,7 @@ static ssize_t nvmet_ns_revalidate_size_store(struct config_item *item,
 		return -EINVAL;
 
 	mutex_lock(&ns->subsys->lock);
-	if (!ns->enabled) {
+	if (ns->state != NVMET_NS_ENABLED) {
 		pr_err("enable ns before revalidate.\n");
 		mutex_unlock(&ns->subsys->lock);
 		return -EINVAL;
diff --git a/drivers/nvme/target/core.c b/drivers/nvme/target/core.c
index f66ed13d7c11..58a91fb9c2f7 100644
--- a/drivers/nvme/target/core.c
+++ b/drivers/nvme/target/core.c
@@ -563,7 +563,7 @@ int nvmet_ns_enable(struct nvmet_ns *ns)
 		goto out_unlock;
 	}
 
-	if (ns->enabled)
+	if (ns->state != NVMET_NS_DISABLED)
 		goto out_unlock;
 
 	ret = -EMFILE;
@@ -598,7 +598,7 @@ int nvmet_ns_enable(struct nvmet_ns *ns)
 	subsys->nr_namespaces++;
 
 	nvmet_ns_changed(subsys, ns->nsid);
-	ns->enabled = true;
+	ns->state = NVMET_NS_ENABLED;
 	ret = 0;
 out_unlock:
 	mutex_unlock(&subsys->lock);
@@ -621,10 +621,10 @@ void nvmet_ns_disable(struct nvmet_ns *ns)
 	struct nvmet_ctrl *ctrl;
 
 	mutex_lock(&subsys->lock);
-	if (!ns->enabled)
+	if (ns->state != NVMET_NS_ENABLED)
 		goto out_unlock;
 
-	ns->enabled = false;
+	ns->state = NVMET_NS_DISABLING;
 	xa_erase(&ns->subsys->namespaces, ns->nsid);
 	if (ns->nsid == subsys->max_nsid)
 		subsys->max_nsid = nvmet_max_nsid(subsys);
@@ -652,6 +652,7 @@ void nvmet_ns_disable(struct nvmet_ns *ns)
 	subsys->nr_namespaces--;
 	nvmet_ns_changed(subsys, ns->nsid);
 	nvmet_ns_dev_disable(ns);
+	ns->state = NVMET_NS_DISABLED;
 out_unlock:
 	mutex_unlock(&subsys->lock);
 }
@@ -689,6 +690,7 @@ struct nvmet_ns *nvmet_ns_alloc(struct nvmet_subsys *subsys, u32 nsid)
 	uuid_gen(&ns->uuid);
 	ns->buffered_io = false;
 	ns->csi = NVME_CSI_NVM;
+	ns->state = NVMET_NS_DISABLED;
 
 	return ns;
 }
diff --git a/drivers/nvme/target/nvmet.h b/drivers/nvme/target/nvmet.h
index 89bedfcd974c..e609787577c6 100644
--- a/drivers/nvme/target/nvmet.h
+++ b/drivers/nvme/target/nvmet.h
@@ -56,6 +56,12 @@
 #define IPO_IATTR_CONNECT_SQE(x)	\
 	(cpu_to_le32(offsetof(struct nvmf_connect_command, x)))
 
+enum nvmet_ns_state {
+	NVMET_NS_ENABLED,
+	NVMET_NS_DISABLING,
+	NVMET_NS_DISABLED
+};
+
 struct nvmet_ns {
 	struct percpu_ref	ref;
 	struct block_device	*bdev;
@@ -69,7 +75,7 @@ struct nvmet_ns {
 	u32			anagrpid;
 
 	bool			buffered_io;
-	bool			enabled;
+	enum nvmet_ns_state	state;
 	struct nvmet_subsys	*subsys;
 	const char		*device_path;
 
-- 
2.34.1




* [PATCH 4/4] nvmet-tcp: fix memory leak in nvmet_tcp_free_cmd_data_in_buffers()
  2023-01-03 10:03 [PATCH 0/4] nvme: fix several bugs in nvme-fabric Taehee Yoo
                   ` (2 preceding siblings ...)
  2023-01-03 10:03 ` [PATCH 3/4] nvmet: fix hang in nvmet_ns_disable() Taehee Yoo
@ 2023-01-03 10:03 ` Taehee Yoo
  2023-01-03 10:54   ` Sagi Grimberg
  3 siblings, 1 reply; 14+ messages in thread
From: Taehee Yoo @ 2023-01-03 10:03 UTC (permalink / raw)
  To: linux-nvme, kbusch, axboe, hch, sagi, kch
  Cc: james.p.freyensee, ming.l, larrystevenwise, anthony.j.knapp,
	pizhenwei, ap420073

While the tcp socket is being released, nvmet_tcp_release_queue_work()
is called.
It calls nvmet_tcp_free_cmd_data_in_buffers() to free per-command
resources, but that function may skip freeing some of them because of
unnecessary condition checks, so memory leaks.

To fix this problem, remove the unnecessary condition checks in
nvmet_tcp_free_cmd_data_in_buffers().

This memory leak occurs on the target machine when a host sends a
reset command to the target.
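
Removing the checks is safe assuming the free helper tolerates buffers
that were already released, i.e. it NULLs the pointers it frees. In
outline (a userspace sketch of that pattern, with hypothetical names,
not the nvmet code itself):

  #include <stdlib.h>

  struct cmd { void *iov; };

  static void free_cmd_buffers(struct cmd *c)
  {
          free(c->iov); /* free(NULL) is a no-op */
          c->iov = NULL; /* a repeated call stays harmless */
  }

  int main(void)
  {
          struct cmd c = { .iov = malloc(16) };

          free_cmd_buffers(&c); /* normal release path */
          free_cmd_buffers(&c); /* unconditional teardown: still safe */
          return 0;
  }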

Reproducer:
   while :
   do
       echo 1 > /sys/class/nvme/nvme<NS>/reset_controller
   done

unreferenced object 0xffff88814a5c6da0 (size 32):
  comm "kworker/2:1H", pid 176, jiffies 4305953739 (age 72707.743s)
  hex dump (first 32 bytes):
    82 84 c8 04 00 ea ff ff 00 00 00 00 00 04 00 00  ................
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
  backtrace:
    [<ffffffffaa795cf7>] __kmalloc+0x47/0xc0
    [<ffffffffab030a72>] sgl_alloc_order+0x82/0x3a0
    [<ffffffffc086593c>] nvmet_tcp_map_data+0x1bc/0x570 [nvmet_tcp]
    [<ffffffffc086a7d4>] nvmet_tcp_try_recv_pdu+0x7f4/0x1e20 [nvmet_tcp]
    [<ffffffffc086c5a0>] nvmet_tcp_io_work+0x120/0x3272 [nvmet_tcp]
    [<ffffffffaa1b059d>] process_one_work+0x81d/0x1450
    [<ffffffffaa1b177c>] worker_thread+0x5ac/0xed0
    [<ffffffffaa1c8ccf>] kthread+0x29f/0x340
    [<ffffffffaa0034cf>] ret_from_fork+0x1f/0x30
unreferenced object 0xffff888153f3e1c0 (size 16):
  comm "kworker/2:1H", pid 176, jiffies 4305953739 (age 72707.743s)
  hex dump (first 16 bytes):
    80 84 c8 04 00 ea ff ff 00 04 00 00 00 00 00 00  ................
  backtrace:
    [<ffffffffaa795cf7>] __kmalloc+0x47/0xc0
    [<ffffffffc0865a80>] nvmet_tcp_map_data+0x300/0x570 [nvmet_tcp]
    [<ffffffffc086a7d4>] nvmet_tcp_try_recv_pdu+0x7f4/0x1e20 [nvmet_tcp]
    [<ffffffffc086c5a0>] nvmet_tcp_io_work+0x120/0x3272 [nvmet_tcp]
    [<ffffffffaa1b059d>] process_one_work+0x81d/0x1450
    [<ffffffffaa1b177c>] worker_thread+0x5ac/0xed0
    [<ffffffffaa1c8ccf>] kthread+0x29f/0x340
    [<ffffffffaa0034cf>] ret_from_fork+0x1f/0x30

Fixes: db94f240280c ("nvmet-tcp: fix NULL pointer dereference during release")
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
---
 drivers/nvme/target/tcp.c | 9 +++------
 1 file changed, 3 insertions(+), 6 deletions(-)

diff --git a/drivers/nvme/target/tcp.c b/drivers/nvme/target/tcp.c
index cc05c094de22..dac08603fec9 100644
--- a/drivers/nvme/target/tcp.c
+++ b/drivers/nvme/target/tcp.c
@@ -1427,13 +1427,10 @@ static void nvmet_tcp_free_cmd_data_in_buffers(struct nvmet_tcp_queue *queue)
 	struct nvmet_tcp_cmd *cmd = queue->cmds;
 	int i;
 
-	for (i = 0; i < queue->nr_cmds; i++, cmd++) {
-		if (nvmet_tcp_need_data_in(cmd))
-			nvmet_tcp_free_cmd_buffers(cmd);
-	}
+	for (i = 0; i < queue->nr_cmds; i++, cmd++)
+		nvmet_tcp_free_cmd_buffers(cmd);
 
-	if (!queue->nr_cmds && nvmet_tcp_need_data_in(&queue->connect))
-		nvmet_tcp_free_cmd_buffers(&queue->connect);
+	nvmet_tcp_free_cmd_buffers(&queue->connect);
 }
 
 static void nvmet_tcp_release_queue_work(struct work_struct *w)
-- 
2.34.1




* Re: [PATCH 1/4] nvme: fix delete uninitialized controller
  2023-01-03 10:03 ` [PATCH 1/4] nvme: fix delete uninitialized controller Taehee Yoo
@ 2023-01-03 10:30   ` Sagi Grimberg
  2023-01-04  0:24     ` Chaitanya Kulkarni
  0 siblings, 1 reply; 14+ messages in thread
From: Sagi Grimberg @ 2023-01-03 10:30 UTC (permalink / raw)
  To: Taehee Yoo, linux-nvme, kbusch, axboe, hch, kch
  Cc: james.p.freyensee, ming.l, larrystevenwise, anthony.j.knapp, pizhenwei

> nvme-fabrics controllers can be deleted via sysfs:
> echo 1 > /sys/class/nvme/nvme<NS>/delete_controller
> The above command will call nvme_delete_ctrl_sync().
> This function internally tries to change ctrl->state to NVME_CTRL_DELETING.
> The NVME_CTRL_LIVE, NVME_CTRL_RESETTING, and NVME_CTRL_CONNECTING states
> can all transition to NVME_CTRL_DELETING.
> If the state change succeeds, nvme_do_delete_ctrl() is called, which
> contains the actual controller delete logic.
> 
> The controller initialization logic transitions ctrl->state through
> NEW -> CONNECTING -> LIVE.
> The NVME_CTRL_CONNECTING state does not guarantee that initialization
> has finished.
> 
> So the delete logic can run before controller initialization completes,
> and a kernel panic occurs because nvme_do_delete_ctrl() dereferences
> uninitialized values.
> 
> BUG: KASAN: null-ptr-deref in do_raw_spin_trylock+0x67/0x180
> Read of size 4 at addr 00000000000000c0 by task bash/928
> 
> CPU: 7 PID: 928 Comm: bash Not tainted 6.1.0 #35
> nvme nvme0: Connect command failed: host path error
> Call Trace:
>   <TASK>
>   dump_stack_lvl+0x57/0x81
>   ? do_raw_spin_trylock+0x67/0x180
>   kasan_report+0xba/0x1f0
> nvme nvme0: failed to connect queue: 0 ret=880
>   ? do_raw_spin_trylock+0x67/0x180
>   ? sysfs_file_ops+0x170/0x170
>   kasan_check_range+0x14a/0x1a0
>   do_raw_spin_trylock+0x67/0x180
>   ? do_raw_spin_lock+0x270/0x270
>   ? nvme_remove_namespaces+0x1bc/0x3d0
>   _raw_spin_lock_irqsave+0x4b/0x90
>   ? blk_mq_quiesce_queue+0x1b/0x160
>   blk_mq_quiesce_queue+0x1b/0x160
>   nvme_tcp_delete_ctrl+0x4b/0x70
>   nvme_do_delete_ctrl+0x135/0x141
>   nvme_sysfs_delete.cold+0x8/0xd
>   kernfs_fop_write_iter+0x34b/0x520
>   vfs_write+0x83a/0xd20
>   ? kernel_write+0x630/0x630
>   ? rcu_read_lock_sched_held+0x12/0x80
>   ? lock_acquire+0x4f4/0x630
>   ? __fget_light+0x51/0x230
>   ksys_write+0xf9/0x1d0
>   ? __ia32_sys_read+0xa0/0xa0
>   ? syscall_enter_from_user_mode+0x1d/0x50
>   do_syscall_64+0x3b/0x90
>   entry_SYSCALL_64_after_hwframe+0x63/0xcd
> RIP: 0033:0x7fc955d10104
> 
> Fixes: 1a353d85b02d ("nvme: add fabrics sysfs attributes")
> Signed-off-by: Taehee Yoo <ap420073@gmail.com>
> ---
>   drivers/nvme/host/core.c | 3 ++-
>   1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index d307ae4d8a57..cd4c80ca66d4 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -243,7 +243,8 @@ static void nvme_delete_ctrl_sync(struct nvme_ctrl *ctrl)
>   	 * since ->delete_ctrl can free the controller.
>   	 */
>   	nvme_get_ctrl(ctrl);
> -	if (nvme_change_ctrl_state(ctrl, NVME_CTRL_DELETING))
> +	if (test_bit(NVME_CTRL_STARTED_ONCE, &ctrl->flags) &&
> +	    nvme_change_ctrl_state(ctrl, NVME_CTRL_DELETING))
>   		nvme_do_delete_ctrl(ctrl);

So what is the outcome now? Is the controller left dangling? What
triggers the controller deletion?

>   	nvme_put_ctrl(ctrl);
>   }

I don't think this is the correct approach.
The delete should fully fence the initialization and then delete
the controller.

In this case, the transport driver should not quiesce a non-existent
queue.

If further synchronization is needed, then it should be added so that
delete will fully fence the initialization.



* Re: [PATCH 2/4] nvme: fix reset uninitialized controller
  2023-01-03 10:03 ` [PATCH 2/4] nvme: fix reset " Taehee Yoo
@ 2023-01-03 10:32   ` Sagi Grimberg
  0 siblings, 0 replies; 14+ messages in thread
From: Sagi Grimberg @ 2023-01-03 10:32 UTC (permalink / raw)
  To: Taehee Yoo, linux-nvme, kbusch, axboe, hch, kch
  Cc: larrystevenwise, anthony.j.knapp, pizhenwei



On 1/3/23 12:03, Taehee Yoo wrote:
> nvme-fabrics controllers can be reset via sysfs:
> echo 1 > /sys/class/nvme/nvme#/reset_controller
> The above command will call nvme_sysfs_reset().
> 
> This function internally runs ctrl->reset_work, synchronously or
> asynchronously.
> At this point, there is no guarantee that the controller has finished
> initializing before it is reset.
> 
> So a kernel panic would occur because ctrl->reset_work dereferences
> uninitialized values.

This is strange; the reset_work func is assigned earlier than sysfs...

> 
> To avoid this, nvme_sysfs_reset() checks the NVME_CTRL_STARTED_ONCE
> flag, which indicates that the controller has been fully initialized,
> so the reset logic can run safely.

Same comment, the reset should fully fence the controller
initialization.



* Re: [PATCH 4/4] nvmet-tcp: fix memory leak in nvmet_tcp_free_cmd_data_in_buffers()
  2023-01-03 10:03 ` [PATCH 4/4] nvmet-tcp: fix memory leak in nvmet_tcp_free_cmd_data_in_buffers() Taehee Yoo
@ 2023-01-03 10:54   ` Sagi Grimberg
  2023-01-04  8:44     ` Taehee Yoo
  0 siblings, 1 reply; 14+ messages in thread
From: Sagi Grimberg @ 2023-01-03 10:54 UTC (permalink / raw)
  To: Taehee Yoo, linux-nvme, kbusch, axboe, hch, kch
  Cc: james.p.freyensee, ming.l, larrystevenwise, anthony.j.knapp, pizhenwei


> While the tcp socket is being released, nvmet_tcp_release_queue_work()
> is called.
> It calls nvmet_tcp_free_cmd_data_in_buffers() to free per-command
> resources, but that function may skip freeing some of them because of
> unnecessary condition checks, so memory leaks.
> 
> To fix this problem, remove the unnecessary condition checks in
> nvmet_tcp_free_cmd_data_in_buffers().
> 
> This memory leak occurs on the target machine when a host sends a
> reset command to the target.
> 
> Reproducer:
>     while :
>     do
>         echo 1 > /sys/class/nvme/nvme<NS>/reset_controller
>     done
> 
> unreferenced object 0xffff88814a5c6da0 (size 32):
>    comm "kworker/2:1H", pid 176, jiffies 4305953739 (age 72707.743s)
>    hex dump (first 32 bytes):
>      82 84 c8 04 00 ea ff ff 00 00 00 00 00 04 00 00  ................
>      00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
>    backtrace:
>      [<ffffffffaa795cf7>] __kmalloc+0x47/0xc0
>      [<ffffffffab030a72>] sgl_alloc_order+0x82/0x3a0
>      [<ffffffffc086593c>] nvmet_tcp_map_data+0x1bc/0x570 [nvmet_tcp]
>      [<ffffffffc086a7d4>] nvmet_tcp_try_recv_pdu+0x7f4/0x1e20 [nvmet_tcp]
>      [<ffffffffc086c5a0>] nvmet_tcp_io_work+0x120/0x3272 [nvmet_tcp]
>      [<ffffffffaa1b059d>] process_one_work+0x81d/0x1450
>      [<ffffffffaa1b177c>] worker_thread+0x5ac/0xed0
>      [<ffffffffaa1c8ccf>] kthread+0x29f/0x340
>      [<ffffffffaa0034cf>] ret_from_fork+0x1f/0x30
> unreferenced object 0xffff888153f3e1c0 (size 16):
>    comm "kworker/2:1H", pid 176, jiffies 4305953739 (age 72707.743s)
>    hex dump (first 16 bytes):
>      80 84 c8 04 00 ea ff ff 00 04 00 00 00 00 00 00  ................
>    backtrace:
>      [<ffffffffaa795cf7>] __kmalloc+0x47/0xc0
>      [<ffffffffc0865a80>] nvmet_tcp_map_data+0x300/0x570 [nvmet_tcp]
>      [<ffffffffc086a7d4>] nvmet_tcp_try_recv_pdu+0x7f4/0x1e20 [nvmet_tcp]
>      [<ffffffffc086c5a0>] nvmet_tcp_io_work+0x120/0x3272 [nvmet_tcp]
>      [<ffffffffaa1b059d>] process_one_work+0x81d/0x1450
>      [<ffffffffaa1b177c>] worker_thread+0x5ac/0xed0
>      [<ffffffffaa1c8ccf>] kthread+0x29f/0x340
>      [<ffffffffaa0034cf>] ret_from_fork+0x1f/0x30
> 
> Fixes: db94f240280c ("nvmet-tcp: fix NULL pointer dereference during release")
> Signed-off-by: Taehee Yoo <ap420073@gmail.com>
> ---
>   drivers/nvme/target/tcp.c | 9 +++------
>   1 file changed, 3 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/nvme/target/tcp.c b/drivers/nvme/target/tcp.c
> index cc05c094de22..dac08603fec9 100644
> --- a/drivers/nvme/target/tcp.c
> +++ b/drivers/nvme/target/tcp.c
> @@ -1427,13 +1427,10 @@ static void nvmet_tcp_free_cmd_data_in_buffers(struct nvmet_tcp_queue *queue)
>   	struct nvmet_tcp_cmd *cmd = queue->cmds;
>   	int i;
>   
> -	for (i = 0; i < queue->nr_cmds; i++, cmd++) {
> -		if (nvmet_tcp_need_data_in(cmd))
> -			nvmet_tcp_free_cmd_buffers(cmd);
> -	}
> +	for (i = 0; i < queue->nr_cmds; i++, cmd++)
> +		nvmet_tcp_free_cmd_buffers(cmd);
>   
> -	if (!queue->nr_cmds && nvmet_tcp_need_data_in(&queue->connect))
> -		nvmet_tcp_free_cmd_buffers(&queue->connect);
> +	nvmet_tcp_free_cmd_buffers(&queue->connect);
>   }
>   
>   static void nvmet_tcp_release_queue_work(struct work_struct *w)

Would it be possible to understand which commands are leaking here?
Commands that do not wait for data from the host should have a normal
release path; if that is not happening, we may be at risk of a
use-after-free, or still have a leak somewhere.



* Re: [PATCH 3/4] nvmet: fix hang in nvmet_ns_disable()
  2023-01-03 10:03 ` [PATCH 3/4] nvmet: fix hang in nvmet_ns_disable() Taehee Yoo
@ 2023-01-03 10:58   ` Sagi Grimberg
  2023-01-04  0:32   ` Chaitanya Kulkarni
  1 sibling, 0 replies; 14+ messages in thread
From: Sagi Grimberg @ 2023-01-03 10:58 UTC (permalink / raw)
  To: Taehee Yoo, linux-nvme, kbusch, axboe, hch, kch
  Cc: larrystevenwise, anthony.j.knapp, pizhenwei



On 1/3/23 12:03, Taehee Yoo wrote:
> An nvme target namespace is enabled or disabled by nvmet_ns_enable()
> or nvmet_ns_disable().
> The subsys->lock prevents namespace data from being used while
> nvmet_ns_enable() or nvmet_ns_disable() is running.
> The ns->enabled boolean prevents namespace data from being used in a
> wrong state, such as an uninitialized state.
> 
> nvmet_ns_disable() acquires subsys->lock and sets ns->enabled to false.
> Then it releases subsys->lock for a while to wait for the
> ns->disable_done completion.
> At this point, nvmet_ns_enable() can run concurrently and call
> percpu_ref_init().
> So ns->disable_done will never complete, and nvmet_ns_disable() hangs.
> 
>     CPU0                                     CPU1
>     nvmet_ns_disable();
>     mutex_lock(&subsys->lock);               nvmet_ns_enable();
>                                              mutex_lock(&subsys->lock);
>     ns->enabled = false;
>     mutex_unlock(&subsys->lock);
>                                              percpu_ref_init();
>     wait_for_completion(&ns->disable_done);  <-- infinite wait
> 
>     mutex_lock(&subsys->lock);
>     mutex_unlock(&subsys->lock);
> 
> INFO: task bash:926 blocked for more than 30 seconds.
>        Tainted: G        W          6.1.0+ #17
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> task:bash            state:D stack:27200 pid:926   ppid:911
> flags:0x00004000
> Call Trace:
>   <TASK>
>   __schedule+0xafc/0x2930
>   ? io_schedule_timeout+0x160/0x160
>   ? _raw_spin_unlock_irq+0x24/0x50
>   ? __wait_for_common+0x39b/0x5c0
>   ? usleep_range_state+0x190/0x190
>   schedule+0x130/0x230
>   schedule_timeout+0x18a/0x240
>   ? usleep_range_state+0x190/0x190
>   ? rcu_read_lock_sched_held+0x12/0x80
>   ? lock_downgrade+0x700/0x700
>   ? do_raw_spin_trylock+0xb5/0x180
>   ? lock_contended+0xdf0/0xdf0
>   ? _raw_spin_unlock_irq+0x24/0x50
>   ? trace_hardirqs_on+0x3c/0x190
>   __wait_for_common+0x1ca/0x5c0
>   ? usleep_range_state+0x190/0x190
>   ? bit_wait_io+0xf0/0xf0
>   ? _raw_spin_unlock_irqrestore+0x59/0x70
>   nvmet_ns_disable+0x288/0x490
>   ? nvmet_ns_enable+0x970/0x970
>   ? lockdep_hardirqs_on_prepare+0x410/0x410
>   ? rcu_read_lock_sched_held+0x12/0x80
>   ? configfs_write_iter+0x1df/0x480
>   ? nvmet_ns_revalidate_size_store+0x220/0x220
>   nvmet_ns_enable_store+0x85/0xe0
> [ ... ]
> 
> Fixes: a07b4970f464 ("nvmet: add a generic NVMe target")
> Signed-off-by: Taehee Yoo <ap420073@gmail.com>
> ---
>   drivers/nvme/target/configfs.c | 14 +++++++-------
>   drivers/nvme/target/core.c     | 10 ++++++----
>   drivers/nvme/target/nvmet.h    |  8 +++++++-
>   3 files changed, 20 insertions(+), 12 deletions(-)
> 
> diff --git a/drivers/nvme/target/configfs.c b/drivers/nvme/target/configfs.c
> index 907143870da5..d878c4231d65 100644
> --- a/drivers/nvme/target/configfs.c
> +++ b/drivers/nvme/target/configfs.c
> @@ -348,7 +348,7 @@ static ssize_t nvmet_ns_device_path_store(struct config_item *item,
>   
>   	mutex_lock(&subsys->lock);
>   	ret = -EBUSY;
> -	if (ns->enabled)
> +	if (ns->state != NVMET_NS_DISABLED)
>   		goto out_unlock;
>   
>   	ret = -EINVAL;
> @@ -390,7 +390,7 @@ static ssize_t nvmet_ns_p2pmem_store(struct config_item *item,
>   	int error;
>   
>   	mutex_lock(&ns->subsys->lock);
> -	if (ns->enabled) {
> +	if (ns->state != NVMET_NS_DISABLED) {
>   		ret = -EBUSY;
>   		goto out_unlock;
>   	}
> @@ -427,7 +427,7 @@ static ssize_t nvmet_ns_device_uuid_store(struct config_item *item,
>   	int ret = 0;
>   
>   	mutex_lock(&subsys->lock);
> -	if (ns->enabled) {
> +	if (ns->state != NVMET_NS_DISABLED) {
>   		ret = -EBUSY;
>   		goto out_unlock;
>   	}
> @@ -458,7 +458,7 @@ static ssize_t nvmet_ns_device_nguid_store(struct config_item *item,
>   	int ret = 0;
>   
>   	mutex_lock(&subsys->lock);
> -	if (ns->enabled) {
> +	if (ns->state != NVMET_NS_DISABLED) {
>   		ret = -EBUSY;
>   		goto out_unlock;
>   	}
> @@ -523,7 +523,7 @@ CONFIGFS_ATTR(nvmet_ns_, ana_grpid);
>   
>   static ssize_t nvmet_ns_enable_show(struct config_item *item, char *page)
>   {
> -	return sprintf(page, "%d\n", to_nvmet_ns(item)->enabled);
> +	return sprintf(page, "%d\n", !!to_nvmet_ns(item)->state);
>   }
>   
>   static ssize_t nvmet_ns_enable_store(struct config_item *item,
> @@ -561,7 +561,7 @@ static ssize_t nvmet_ns_buffered_io_store(struct config_item *item,
>   		return -EINVAL;
>   
>   	mutex_lock(&ns->subsys->lock);
> -	if (ns->enabled) {
> +	if (ns->state != NVMET_NS_DISABLED) {
>   		pr_err("disable ns before setting buffered_io value.\n");
>   		mutex_unlock(&ns->subsys->lock);
>   		return -EINVAL;
> @@ -587,7 +587,7 @@ static ssize_t nvmet_ns_revalidate_size_store(struct config_item *item,
>   		return -EINVAL;
>   
>   	mutex_lock(&ns->subsys->lock);
> -	if (!ns->enabled) {
> +	if (ns->state != NVMET_NS_ENABLED) {
>   		pr_err("enable ns before revalidate.\n");
>   		mutex_unlock(&ns->subsys->lock);
>   		return -EINVAL;
> diff --git a/drivers/nvme/target/core.c b/drivers/nvme/target/core.c
> index f66ed13d7c11..58a91fb9c2f7 100644
> --- a/drivers/nvme/target/core.c
> +++ b/drivers/nvme/target/core.c
> @@ -563,7 +563,7 @@ int nvmet_ns_enable(struct nvmet_ns *ns)
>   		goto out_unlock;
>   	}
>   
> -	if (ns->enabled)
> +	if (ns->state != NVMET_NS_DISABLED)
>   		goto out_unlock;
>   
>   	ret = -EMFILE;
> @@ -598,7 +598,7 @@ int nvmet_ns_enable(struct nvmet_ns *ns)
>   	subsys->nr_namespaces++;
>   
>   	nvmet_ns_changed(subsys, ns->nsid);
> -	ns->enabled = true;
> +	ns->state = NVMET_NS_ENABLED;
>   	ret = 0;
>   out_unlock:
>   	mutex_unlock(&subsys->lock);
> @@ -621,10 +621,10 @@ void nvmet_ns_disable(struct nvmet_ns *ns)
>   	struct nvmet_ctrl *ctrl;
>   
>   	mutex_lock(&subsys->lock);
> -	if (!ns->enabled)
> +	if (ns->state != NVMET_NS_ENABLED)
>   		goto out_unlock;
>   
> -	ns->enabled = false;
> +	ns->state = NVMET_NS_DISABLING;
>   	xa_erase(&ns->subsys->namespaces, ns->nsid);
>   	if (ns->nsid == subsys->max_nsid)
>   		subsys->max_nsid = nvmet_max_nsid(subsys);
> @@ -652,6 +652,7 @@ void nvmet_ns_disable(struct nvmet_ns *ns)
>   	subsys->nr_namespaces--;
>   	nvmet_ns_changed(subsys, ns->nsid);
>   	nvmet_ns_dev_disable(ns);
> +	ns->state = NVMET_NS_DISABLED;
>   out_unlock:
>   	mutex_unlock(&subsys->lock);
>   }
> @@ -689,6 +690,7 @@ struct nvmet_ns *nvmet_ns_alloc(struct nvmet_subsys *subsys, u32 nsid)
>   	uuid_gen(&ns->uuid);
>   	ns->buffered_io = false;
>   	ns->csi = NVME_CSI_NVM;
> +	ns->state = NVMET_NS_DISABLED;
>   
>   	return ns;
>   }
> diff --git a/drivers/nvme/target/nvmet.h b/drivers/nvme/target/nvmet.h
> index 89bedfcd974c..e609787577c6 100644
> --- a/drivers/nvme/target/nvmet.h
> +++ b/drivers/nvme/target/nvmet.h
> @@ -56,6 +56,12 @@
>   #define IPO_IATTR_CONNECT_SQE(x)	\
>   	(cpu_to_le32(offsetof(struct nvmf_connect_command, x)))
>   
> +enum nvmet_ns_state {
> +	NVMET_NS_ENABLED,
> +	NVMET_NS_DISABLING,
> +	NVMET_NS_DISABLED
> +};
> +
>   struct nvmet_ns {
>   	struct percpu_ref	ref;
>   	struct block_device	*bdev;
> @@ -69,7 +75,7 @@ struct nvmet_ns {
>   	u32			anagrpid;
>   
>   	bool			buffered_io;
> -	bool			enabled;
> +	enum nvmet_ns_state	state;
>   	struct nvmet_subsys	*subsys;
>   	const char		*device_path;
>   

This looks reasonable to me...
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>



* Re: [PATCH 1/4] nvme: fix delete uninitialized controller
  2023-01-03 10:30   ` Sagi Grimberg
@ 2023-01-04  0:24     ` Chaitanya Kulkarni
  2023-01-04  2:42       ` Taehee Yoo
  0 siblings, 1 reply; 14+ messages in thread
From: Chaitanya Kulkarni @ 2023-01-04  0:24 UTC (permalink / raw)
  To: Taehee Yoo
  Cc: james.p.freyensee, hch, kbusch, linux-nvme, ming.l,
	larrystevenwise, anthony.j.knapp, pizhenwei, Sagi Grimberg,
	axboe, Chaitanya Kulkarni

On 1/3/23 02:30, Sagi Grimberg wrote:
>> nvme-fabrics controllers can be deleted via sysfs:
>> echo 1 > /sys/class/nvme/nvme<NS>/delete_controller
>> The above command will call nvme_delete_ctrl_sync().
>> This function internally tries to change ctrl->state to
>> NVME_CTRL_DELETING.
>> The NVME_CTRL_LIVE, NVME_CTRL_RESETTING, and NVME_CTRL_CONNECTING
>> states can all transition to NVME_CTRL_DELETING.
>> If the state change succeeds, nvme_do_delete_ctrl() is called, which
>> contains the actual controller delete logic.
>>
>> The controller initialization logic transitions ctrl->state through
>> NEW -> CONNECTING -> LIVE.
>> The NVME_CTRL_CONNECTING state does not guarantee that initialization
>> has finished.
>>
>> So the delete logic can run before controller initialization completes,
>> and a kernel panic occurs because nvme_do_delete_ctrl() dereferences
>> uninitialized values.

Thanks for discovering this. Do you perhaps have a sequence of commands
to reproduce this?

[...]

>> +++ b/drivers/nvme/host/core.c
>> @@ -243,7 +243,8 @@ static void nvme_delete_ctrl_sync(struct nvme_ctrl 
>> *ctrl)
>>        * since ->delete_ctrl can free the controller.
>>        */
>>       nvme_get_ctrl(ctrl);
>> -    if (nvme_change_ctrl_state(ctrl, NVME_CTRL_DELETING))
>> +    if (test_bit(NVME_CTRL_STARTED_ONCE, &ctrl->flags) &&
>> +        nvme_change_ctrl_state(ctrl, NVME_CTRL_DELETING))
>>           nvme_do_delete_ctrl(ctrl);
> 
> So what is the outcome now? if the controller kept on dangling? what
> triggers the controller deletion?
> 
>>       nvme_put_ctrl(ctrl);
>>   }
> 
> I don't think this is the correct approach.
> the delete should fully fence the initialization and then delete
> the controller.
> 
> In this case, the transport driver should not quiesce a non-existent
> queue.
> 
> If further synchronization is needed, then it should be added so that
> delete will fully fence the initialization.

As stated here, I'd add complete fencing for the initialization and
delete transition.

-ck



* Re: [PATCH 3/4] nvmet: fix hang in nvmet_ns_disable()
  2023-01-03 10:03 ` [PATCH 3/4] nvmet: fix hang in nvmet_ns_disable() Taehee Yoo
  2023-01-03 10:58   ` Sagi Grimberg
@ 2023-01-04  0:32   ` Chaitanya Kulkarni
  2023-01-04  8:56     ` Taehee Yoo
  1 sibling, 1 reply; 14+ messages in thread
From: Chaitanya Kulkarni @ 2023-01-04  0:32 UTC (permalink / raw)
  To: Taehee Yoo
  Cc: james.p.freyensee, axboe, Chaitanya Kulkarni, sagi, kbusch,
	ming.l, larrystevenwise, anthony.j.knapp, pizhenwei, linux-nvme,
	hch

On 1/3/23 02:03, Taehee Yoo wrote:
> An nvme target namespace is enabled or disabled by nvmet_ns_enable()
> or nvmet_ns_disable().
> The subsys->lock prevents namespace data from being used while
> nvmet_ns_enable() or nvmet_ns_disable() is running.
> The ns->enabled boolean prevents namespace data from being used in a
> wrong state, such as an uninitialized state.
> 
> nvmet_ns_disable() acquires subsys->lock and sets ns->enabled to false.
> Then it releases subsys->lock for a while to wait for the
> ns->disable_done completion.
> At this point, nvmet_ns_enable() can run concurrently and call
> percpu_ref_init().
> So ns->disable_done will never complete, and nvmet_ns_disable() hangs.
> 
>     CPU0                                     CPU1
>     nvmet_ns_disable();
>     mutex_lock(&subsys->lock);               nvmet_ns_enable();
>                                              mutex_lock(&subsys->lock);
>     ns->enabled = false;
>     mutex_unlock(&subsys->lock);
>                                              percpu_ref_init();
>     wait_for_completion(&ns->disable_done);  <-- infinite wait
> 
>     mutex_lock(&subsys->lock);
>     mutex_unlock(&subsys->lock);
> 
> INFO: task bash:926 blocked for more than 30 seconds.
>        Tainted: G        W          6.1.0+ #17
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> task:bash            state:D stack:27200 pid:926   ppid:911
> flags:0x00004000
> Call Trace:
>   <TASK>
>   __schedule+0xafc/0x2930
>   ? io_schedule_timeout+0x160/0x160
>   ? _raw_spin_unlock_irq+0x24/0x50
>   ? __wait_for_common+0x39b/0x5c0
>   ? usleep_range_state+0x190/0x190
>   schedule+0x130/0x230
>   schedule_timeout+0x18a/0x240
>   ? usleep_range_state+0x190/0x190
>   ? rcu_read_lock_sched_held+0x12/0x80
>   ? lock_downgrade+0x700/0x700
>   ? do_raw_spin_trylock+0xb5/0x180
>   ? lock_contended+0xdf0/0xdf0
>   ? _raw_spin_unlock_irq+0x24/0x50
>   ? trace_hardirqs_on+0x3c/0x190
>   __wait_for_common+0x1ca/0x5c0
>   ? usleep_range_state+0x190/0x190
>   ? bit_wait_io+0xf0/0xf0
>   ? _raw_spin_unlock_irqrestore+0x59/0x70
>   nvmet_ns_disable+0x288/0x490
>   ? nvmet_ns_enable+0x970/0x970
>   ? lockdep_hardirqs_on_prepare+0x410/0x410
>   ? rcu_read_lock_sched_held+0x12/0x80
>   ? configfs_write_iter+0x1df/0x480
>   ? nvmet_ns_revalidate_size_store+0x220/0x220
>   nvmet_ns_enable_store+0x85/0xe0
> [ ... ]
> 
> Fixes: a07b4970f464 ("nvmet: add a generic NVMe target")
> Signed-off-by: Taehee Yoo <ap420073@gmail.com>

[...]

> diff --git a/drivers/nvme/target/nvmet.h b/drivers/nvme/target/nvmet.h
> index 89bedfcd974c..e609787577c6 100644
> --- a/drivers/nvme/target/nvmet.h
> +++ b/drivers/nvme/target/nvmet.h
> @@ -56,6 +56,12 @@
>   #define IPO_IATTR_CONNECT_SQE(x)	\
>   	(cpu_to_le32(offsetof(struct nvmf_connect_command, x)))
>   
> +enum nvmet_ns_state {
> +	NVMET_NS_ENABLED,
> +	NVMET_NS_DISABLING,
> +	NVMET_NS_DISABLED
> +};
> +

The patch looks good to me, but I'm wondering if there is a way
we can do this without adding new enable/disable states?

-ck



* Re: [PATCH 1/4] nvme: fix delete uninitialized controller
  2023-01-04  0:24     ` Chaitanya Kulkarni
@ 2023-01-04  2:42       ` Taehee Yoo
  0 siblings, 0 replies; 14+ messages in thread
From: Taehee Yoo @ 2023-01-04  2:42 UTC (permalink / raw)
  To: Chaitanya Kulkarni
  Cc: hch, kbusch, linux-nvme, larrystevenwise, anthony.j.knapp,
	pizhenwei, Sagi Grimberg, axboe

Hi Sagi and Chaitanya.
Thank you so much for the reviews!

On 1/4/23 09:24, Chaitanya Kulkarni wrote:
 > On 1/3/23 02:30, Sagi Grimberg wrote:
 >>> nvme-fabrics controllers can be deleted via sysfs:
 >>> echo 1 > /sys/class/nvme/nvme<NS>/delete_controller
 >>> The above command will call nvme_delete_ctrl_sync().
 >>> This function internally tries to change ctrl->state to
 >>> NVME_CTRL_DELETING.
 >>> The NVME_CTRL_LIVE, NVME_CTRL_RESETTING, and NVME_CTRL_CONNECTING
 >>> states can all transition to NVME_CTRL_DELETING.
 >>> If the state change succeeds, nvme_do_delete_ctrl() is called,
 >>> which contains the actual controller delete logic.
 >>>
 >>> The controller initialization logic transitions ctrl->state through
 >>> NEW -> CONNECTING -> LIVE.
 >>> The NVME_CTRL_CONNECTING state does not guarantee that
 >>> initialization has finished.
 >>>
 >>> So the delete logic can run before controller initialization
 >>> completes, and a kernel panic occurs because nvme_do_delete_ctrl()
 >>> dereferences uninitialized values.
 >
 > Thanks for discovering this. Do you perhaps have a sequence of commands
 > to reproduce this?
 >

Yes, the below reproducer would be helpful.

#TARGET
sudo modprobe nvme_tcp
sudo modprobe nvmet
sudo modprobe nvmet-tcp
sudo mkdir /sys/kernel/config/nvmet/subsystems/nvmet-test-2
cd /sys/kernel/config/nvmet/subsystems/nvmet-test-2
echo 1 |sudo tee -a attr_allow_any_host > /dev/null
sudo mkdir namespaces/2
cd namespaces/2/
echo -n /dev/<NVME DEVICE> > device_path
echo 1 > enable
sudo mkdir /sys/kernel/config/nvmet/ports/2
cd /sys/kernel/config/nvmet/ports/2
echo <TARGET IP> |sudo tee -a addr_traddr > /dev/null
echo tcp|sudo tee -a addr_trtype > /dev/null
echo 4002 |sudo tee -a addr_trsvcid > /dev/null
echo ipv4|sudo tee -a addr_adrfam > /dev/null
sudo ln -s /sys/kernel/config/nvmet/subsystems/nvmet-test-2/ \
    /sys/kernel/config/nvmet/ports/2/subsystems/nvmet-test-2

#HOST SHELL1
while :
do
    nvme discover -t tcp -a <TARGET ADDRESS> -s 4002
    nvme connect -t tcp -n nvmet-test-2 -a <TARGET ADDRESS> -s 4002
done

#HOST SHELL2
while :
do
    echo 1 > /sys/class/nvme/nvme0/delete_controller
done

#HOST SHELL3
#This additional test script is to reproduce the second patch issue.
while :
do
    echo 1 > /sys/class/nvme/nvme0/reset_controller
done

 > [...]
 >
 >>> +++ b/drivers/nvme/host/core.c
 >>> @@ -243,7 +243,8 @@ static void nvme_delete_ctrl_sync(struct nvme_ctrl
 >>> *ctrl)
 >>>         * since ->delete_ctrl can free the controller.
 >>>         */
 >>>        nvme_get_ctrl(ctrl);
 >>> -    if (nvme_change_ctrl_state(ctrl, NVME_CTRL_DELETING))
 >>> +    if (test_bit(NVME_CTRL_STARTED_ONCE, &ctrl->flags) &&
 >>> +        nvme_change_ctrl_state(ctrl, NVME_CTRL_DELETING))
 >>>            nvme_do_delete_ctrl(ctrl);
 >>
 >> So what is the outcome now? Is the controller left dangling? What
 >> triggers the controller deletion?
 >>
 >>>        nvme_put_ctrl(ctrl);
 >>>    }
 >>
 >> I don't think this is the correct approach.
 >> The delete should fully fence the initialization and then delete
 >> the controller.
 >>
 >> In this case, the transport driver should not quiesce a non-existent
 >> queue.
 >>
 >> If further synchronization is needed, then it should be added so that
 >> delete will fully fence the initialization.
 >
 > as stated here I'd add complete fencing for the initialization and
 > delete transition ..

My thinking was that delete/reset logic should not be allowed before
initialization finishes; I didn't consider having the delete/reset
logic fence the initialization logic.
That was my approach, but after this review I think it doesn't fit the
nvme subsystem well.
I think Chaitanya will fix this case correctly.
I hope the above reproducer is helpful.

So, I will drop patches 1 and 2 in the v2 patchset.

 >
 > -ck
 >

Thanks a lot!
Taehee Yoo



* Re: [PATCH 4/4] nvmet-tcp: fix memory leak in nvmet_tcp_free_cmd_data_in_buffers()
  2023-01-03 10:54   ` Sagi Grimberg
@ 2023-01-04  8:44     ` Taehee Yoo
  0 siblings, 0 replies; 14+ messages in thread
From: Taehee Yoo @ 2023-01-04  8:44 UTC (permalink / raw)
  To: Sagi Grimberg, linux-nvme, kbusch, axboe, hch, kch
  Cc: larrystevenwise, anthony.j.knapp, pizhenwei

Hi Sagi,
Thank you so much for your review!

On 1/3/23 19:54, Sagi Grimberg wrote:
 >
 >> While the tcp socket is being released, nvmet_tcp_release_queue_work()
 >> is called.
 >> It calls nvmet_tcp_free_cmd_data_in_buffers() to free per-command
 >> resources, but that function may skip freeing some of them because of
 >> unnecessary condition checks, so memory leaks.
 >>
 >> To fix this problem, remove the unnecessary condition checks in
 >> nvmet_tcp_free_cmd_data_in_buffers().
 >>
 >> This memory leak occurs on the target machine when a host sends a
 >> reset command to the target.
 >>
 >> Reproducer:
 >>     while :
 >>     do
 >>         echo 1 > /sys/class/nvme/nvme<NS>/reset_controller
 >>     done
 >>
 >> unreferenced object 0xffff88814a5c6da0 (size 32):
 >>    comm "kworker/2:1H", pid 176, jiffies 4305953739 (age 72707.743s)
 >>    hex dump (first 32 bytes):
 >>      82 84 c8 04 00 ea ff ff 00 00 00 00 00 04 00 00  ................
 >>      00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
 >>    backtrace:
 >>      [<ffffffffaa795cf7>] __kmalloc+0x47/0xc0
 >>      [<ffffffffab030a72>] sgl_alloc_order+0x82/0x3a0
 >>      [<ffffffffc086593c>] nvmet_tcp_map_data+0x1bc/0x570 [nvmet_tcp]
 >>      [<ffffffffc086a7d4>] nvmet_tcp_try_recv_pdu+0x7f4/0x1e20 [nvmet_tcp]
 >>      [<ffffffffc086c5a0>] nvmet_tcp_io_work+0x120/0x3272 [nvmet_tcp]
 >>      [<ffffffffaa1b059d>] process_one_work+0x81d/0x1450
 >>      [<ffffffffaa1b177c>] worker_thread+0x5ac/0xed0
 >>      [<ffffffffaa1c8ccf>] kthread+0x29f/0x340
 >>      [<ffffffffaa0034cf>] ret_from_fork+0x1f/0x30
 >> unreferenced object 0xffff888153f3e1c0 (size 16):
 >>    comm "kworker/2:1H", pid 176, jiffies 4305953739 (age 72707.743s)
 >>    hex dump (first 16 bytes):
 >>      80 84 c8 04 00 ea ff ff 00 04 00 00 00 00 00 00  ................
 >>    backtrace:
 >>      [<ffffffffaa795cf7>] __kmalloc+0x47/0xc0
 >>      [<ffffffffc0865a80>] nvmet_tcp_map_data+0x300/0x570 [nvmet_tcp]
 >>      [<ffffffffc086a7d4>] nvmet_tcp_try_recv_pdu+0x7f4/0x1e20 [nvmet_tcp]
 >>      [<ffffffffc086c5a0>] nvmet_tcp_io_work+0x120/0x3272 [nvmet_tcp]
 >>      [<ffffffffaa1b059d>] process_one_work+0x81d/0x1450
 >>      [<ffffffffaa1b177c>] worker_thread+0x5ac/0xed0
 >>      [<ffffffffaa1c8ccf>] kthread+0x29f/0x340
 >>      [<ffffffffaa0034cf>] ret_from_fork+0x1f/0x30
 >>
 >> Fixes: db94f240280c ("nvmet-tcp: fix NULL pointer dereference during
 >> release")
 >> Signed-off-by: Taehee Yoo <ap420073@gmail.com>
 >> ---
 >>   drivers/nvme/target/tcp.c | 9 +++------
 >>   1 file changed, 3 insertions(+), 6 deletions(-)
 >>
 >> diff --git a/drivers/nvme/target/tcp.c b/drivers/nvme/target/tcp.c
 >> index cc05c094de22..dac08603fec9 100644
 >> --- a/drivers/nvme/target/tcp.c
 >> +++ b/drivers/nvme/target/tcp.c
 >> @@ -1427,13 +1427,10 @@ static void
 >> nvmet_tcp_free_cmd_data_in_buffers(struct nvmet_tcp_queue *queue)
 >>       struct nvmet_tcp_cmd *cmd = queue->cmds;
 >>       int i;
 >> -    for (i = 0; i < queue->nr_cmds; i++, cmd++) {
 >> -        if (nvmet_tcp_need_data_in(cmd))
 >> -            nvmet_tcp_free_cmd_buffers(cmd);
 >> -    }
 >> +    for (i = 0; i < queue->nr_cmds; i++, cmd++)
 >> +        nvmet_tcp_free_cmd_buffers(cmd);
 >> -    if (!queue->nr_cmds && nvmet_tcp_need_data_in(&queue->connect))
 >> -        nvmet_tcp_free_cmd_buffers(&queue->connect);
 >> +    nvmet_tcp_free_cmd_buffers(&queue->connect);
 >>   }
 >>   static void nvmet_tcp_release_queue_work(struct work_struct *w)
 >
 > Would it be possible to understand which commands are leaking here?
 > Commands that do not wait for data from the host should have a normal
 > release path; if that is not happening, we may be at risk of a
 > use-after-free, or still have a leak somewhere.

I tested with a reproducer and a debug patch like below.

#SHELL 1
while :
do
     echo 1 > /sys/class/nvme/nvme0/reset_controller
done

#SHELL 2
mount /dev/nvme0n1p66 ~/mnt
while :
do
    echo "aaaaaaaaaaaaa" > ~/mnt/test.txt
done

diff --git a/drivers/nvme/target/tcp.c b/drivers/nvme/target/tcp.c
index cc05c094de22..0cdf83b636a2 100644
--- a/drivers/nvme/target/tcp.c
+++ b/drivers/nvme/target/tcp.c
@@ -1428,8 +1428,13 @@ static void nvmet_tcp_free_cmd_data_in_buffers(struct nvmet_tcp_queue *queue)
         int i;

         for (i = 0; i < queue->nr_cmds; i++, cmd++) {
-               if (nvmet_tcp_need_data_in(cmd))
+               if (nvmet_tcp_need_data_in(cmd)) {
                         nvmet_tcp_free_cmd_buffers(cmd);
+               } else {
+                       if (cmd->req.sg) {
+                               printk("[TEST]%s %u cmd = %x exec = %pS\n", __func__, __LINE__, cmd->cmd_pdu->cmd.common.opcode, cmd->req.execute);
+                       }
+               }
         }

Results:
[  393.244102] [TEST]nvmet_tcp_free_cmd_data_in_buffers 1435 cmd = 1 exec = nvmet_bdev_execute_rw+0x0/0xc20
[  393.607385] [TEST]nvmet_tcp_free_cmd_data_in_buffers 1435 cmd = 1 exec = nvmet_bdev_execute_rw+0x0/0xc20
[  393.717213] [TEST]nvmet_tcp_free_cmd_data_in_buffers 1435 cmd = 1 exec = nvmet_bdev_execute_rw+0x0/0xc20
[  393.825434] [TEST]nvmet_tcp_free_cmd_data_in_buffers 1435 cmd = 1 exec = nvmet_bdev_execute_rw+0x0/0xc20
[  393.952042] [TEST]nvmet_tcp_free_cmd_data_in_buffers 1435 cmd = 1 exec = nvmet_bdev_execute_rw+0x0/0xc20

I think this is not the only leaking case.
There are possibly more, but I can't test them all.

I'm not sure whether this is the info you really wanted.
If it isn't satisfactory, please let me know.

Thanks,
Taehee Yoo



* Re: [PATCH 3/4] nvmet: fix hang in nvmet_ns_disable()
  2023-01-04  0:32   ` Chaitanya Kulkarni
@ 2023-01-04  8:56     ` Taehee Yoo
  0 siblings, 0 replies; 14+ messages in thread
From: Taehee Yoo @ 2023-01-04  8:56 UTC (permalink / raw)
  To: Chaitanya Kulkarni
  Cc: axboe, sagi, kbusch, larrystevenwise, anthony.j.knapp, pizhenwei,
	linux-nvme, hch

Hi Chaitanya,
Thank you so much for your review!

On 1/4/23 09:32, Chaitanya Kulkarni wrote:
 > On 1/3/23 02:03, Taehee Yoo wrote:
 >> An nvme target namespace is enabled or disabled by nvmet_ns_enable()
 >> or nvmet_ns_disable().
 >> The subsys->lock prevents namespace data from being used while
 >> nvmet_ns_enable() or nvmet_ns_disable() is running.
 >> The ns->enabled boolean prevents namespace data from being used in a
 >> wrong state, such as an uninitialized state.
 >>
 >> nvmet_ns_disable() acquires subsys->lock and sets ns->enabled to false.
 >> Then it releases subsys->lock for a while to wait for the
 >> ns->disable_done completion.
 >> At this point, nvmet_ns_enable() can run concurrently and call
 >> percpu_ref_init().
 >> So ns->disable_done will never complete, and nvmet_ns_disable() hangs.
 >>
 >>      CPU0                                     CPU1
 >>      nvmet_ns_disable();
 >>      mutex_lock(&subsys->lock);               nvmet_ns_enable();
 >>                                               mutex_lock(&subsys->lock);
 >>      ns->enabled = false;
 >>      mutex_unlock(&subsys->lock);
 >>                                               percpu_ref_init();
 >>      wait_for_completion(&ns->disable_done);  <-- infinite wait
 >>
 >>      mutex_lock(&subsys->lock);
 >>      mutex_unlock(&subsys->lock);
 >>
 >> INFO: task bash:926 blocked for more than 30 seconds.
 >>         Tainted: G        W          6.1.0+ #17
 >> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this 
message.
 >> task:bash            state:D stack:27200 pid:926   ppid:911
 >> flags:0x00004000
 >> Call Trace:
 >>    <TASK>
 >>    __schedule+0xafc/0x2930
 >>    ? io_schedule_timeout+0x160/0x160
 >>    ? _raw_spin_unlock_irq+0x24/0x50
 >>    ? __wait_for_common+0x39b/0x5c0
 >>    ? usleep_range_state+0x190/0x190
 >>    schedule+0x130/0x230
 >>    schedule_timeout+0x18a/0x240
 >>    ? usleep_range_state+0x190/0x190
 >>    ? rcu_read_lock_sched_held+0x12/0x80
 >>    ? lock_downgrade+0x700/0x700
 >>    ? do_raw_spin_trylock+0xb5/0x180
 >>    ? lock_contended+0xdf0/0xdf0
 >>    ? _raw_spin_unlock_irq+0x24/0x50
 >>    ? trace_hardirqs_on+0x3c/0x190
 >>    __wait_for_common+0x1ca/0x5c0
 >>    ? usleep_range_state+0x190/0x190
 >>    ? bit_wait_io+0xf0/0xf0
 >>    ? _raw_spin_unlock_irqrestore+0x59/0x70
 >>    nvmet_ns_disable+0x288/0x490
 >>    ? nvmet_ns_enable+0x970/0x970
 >>    ? lockdep_hardirqs_on_prepare+0x410/0x410
 >>    ? rcu_read_lock_sched_held+0x12/0x80
 >>    ? configfs_write_iter+0x1df/0x480
 >>    ? nvmet_ns_revalidate_size_store+0x220/0x220
 >>    nvmet_ns_enable_store+0x85/0xe0
 >> [ ... ]
 >>
 >> Fixes: a07b4970f464 ("nvmet: add a generic NVMe target")
 >> Signed-off-by: Taehee Yoo <ap420073@gmail.com>
 >
 > [...]
 >
 >> diff --git a/drivers/nvme/target/nvmet.h b/drivers/nvme/target/nvmet.h
 >> index 89bedfcd974c..e609787577c6 100644
 >> --- a/drivers/nvme/target/nvmet.h
 >> +++ b/drivers/nvme/target/nvmet.h
 >> @@ -56,6 +56,12 @@
 >>    #define IPO_IATTR_CONNECT_SQE(x)	\
 >>    	(cpu_to_le32(offsetof(struct nvmf_connect_command, x)))
 >>
 >> +enum nvmet_ns_state {
 >> +	NVMET_NS_ENABLED,
 >> +	NVMET_NS_DISABLING,
 >> +	NVMET_NS_DISABLED
 >> +};
 >> +
 >
 > The patch looks good to me, but I'm wondering if there is a way
 > we can do this without adding new enable/disable states?
 >

I'm not sure, but a workqueue could be one way.
The point is to make enable() and disable() run exclusively (serially),
and a workqueue would execute the enable()/disable() functions
serially, as in the sketch below.
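
For example (an untested, kernel-style sketch only; the names are
hypothetical, and the work items would still need to be allocated and
queued from the configfs store handlers):

  /* An ordered workqueue runs at most one work item at a time, so
   * enable and disable could no longer interleave. Created once,
   * e.g. wq = alloc_ordered_workqueue("nvmet-ns-state", 0); */
  struct nvmet_ns_state_work {
          struct work_struct work;
          struct nvmet_ns *ns;
          bool enable;
  };

  static void nvmet_ns_state_work_fn(struct work_struct *w)
  {
          struct nvmet_ns_state_work *sw =
                  container_of(w, struct nvmet_ns_state_work, work);

          if (sw->enable)
                  nvmet_ns_enable(sw->ns);
          else
                  nvmet_ns_disable(sw->ns);
  }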

Thanks a lot,
Taehee Yoo


