* [PATCH v2] nvme: avoid race in shutdown namespace removal
@ 2021-09-02  9:20 Daniel Wagner
  2021-09-06  8:01 ` Christoph Hellwig
  2021-09-09 14:09 ` Christoph Hellwig
  0 siblings, 2 replies; 5+ messages in thread
From: Daniel Wagner @ 2021-09-02  9:20 UTC (permalink / raw)
  To: linux-nvme; +Cc: linux-kernel, Daniel Wagner, Hannes Reinecke, Keith Busch

When we remove the siblings entry, we update ns->head->list, hence we
can't separate the removal from the test for the list being empty. Both
have to happen in the same critical section to avoid a race.

To avoid reintroducing the refcounting imbalance, add a list empty
check to nvme_find_ns_head().

Fixes: 5396fdac56d8 ("nvme: fix refcounting imbalance when all paths are down")
Cc: Hannes Reinecke <hare@suse.de>
Cc: Keith Busch <kbusch@kernel.org>
Signed-off-by: Daniel Wagner <dwagner@suse.de>
---
v2:
 - added nvme_find_ns_head fix as suggested by hch
v1:
 - https://lore.kernel.org/linux-nvme/20210830093618.97657-1-dwagner@suse.de/

 drivers/nvme/host/core.c | 15 +++++++--------
 1 file changed, 7 insertions(+), 8 deletions(-)
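
For illustration only, a minimal userspace sketch of the pattern the
change enforces, using a stand-in list and lock rather than the driver
code: the element removal and the "is the list now empty?" test share
one critical section, so nothing can slip in between them.

/*
 * Illustration only: stand-in doubly-linked list and lock in plain
 * userspace C, not the nvme code. The removal and the emptiness test
 * are taken under one lock acquisition, so the "last path" decision
 * cannot race with a concurrent lookup or insertion.
 */
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

struct node { struct node *prev, *next; };

static struct node siblings = { &siblings, &siblings }; /* empty list */
static pthread_mutex_t subsys_lock = PTHREAD_MUTEX_INITIALIZER;

static bool list_empty(const struct node *h) { return h->next == h; }

static void list_add(struct node *n, struct node *h)
{
	n->next = h->next;
	n->prev = h;
	h->next->prev = n;
	h->next = n;
}

static void list_del(struct node *n)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
	n->prev = n->next = n;
}

/* Remove a sibling and decide "last path" atomically. */
static bool remove_sibling(struct node *n)
{
	bool last_path;

	pthread_mutex_lock(&subsys_lock);
	list_del(n);
	last_path = list_empty(&siblings);
	pthread_mutex_unlock(&subsys_lock);
	return last_path;
}

int main(void)
{
	struct node a, b;

	list_add(&a, &siblings);
	list_add(&b, &siblings);
	printf("removed a, last path: %d\n", remove_sibling(&a));
	printf("removed b, last path: %d\n", remove_sibling(&b));
	return 0;
}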

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 4a3a33f5f11c..ac9a61d1d011 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -3524,7 +3524,9 @@ static struct nvme_ns_head *nvme_find_ns_head(struct nvme_subsystem *subsys,
 	lockdep_assert_held(&subsys->lock);
 
 	list_for_each_entry(h, &subsys->nsheads, entry) {
-		if (h->ns_id == nsid && nvme_tryget_ns_head(h))
+		if (h->ns_id != nsid)
+			continue;
+		if (!list_empty(&h->list) && nvme_tryget_ns_head(h))
 			return h;
 	}
 
@@ -3836,6 +3838,10 @@ static void nvme_ns_remove(struct nvme_ns *ns)
 
 	mutex_lock(&ns->ctrl->subsys->lock);
 	list_del_rcu(&ns->siblings);
+	if (list_empty(&ns->head->list)) {
+		list_del_init(&ns->head->entry);
+		last_path = true;
+	}
 	mutex_unlock(&ns->ctrl->subsys->lock);
 
 	/* guarantee not available in head->list */
@@ -3856,13 +3862,6 @@ static void nvme_ns_remove(struct nvme_ns *ns)
 	list_del_init(&ns->list);
 	up_write(&ns->ctrl->namespaces_rwsem);
 
-	/* Synchronize with nvme_init_ns_head() */
-	mutex_lock(&ns->head->subsys->lock);
-	if (list_empty(&ns->head->list)) {
-		list_del_init(&ns->head->entry);
-		last_path = true;
-	}
-	mutex_unlock(&ns->head->subsys->lock);
 	if (last_path)
 		nvme_mpath_shutdown_disk(ns->head);
 	nvme_put_ns(ns);
-- 
2.29.2



* Re: [PATCH v2] nvme: avoid race in shutdown namespace removal
  2021-09-02  9:20 [PATCH v2] nvme: avoid race in shutdown namespace removal Daniel Wagner
@ 2021-09-06  8:01 ` Christoph Hellwig
  2021-09-06  8:40   ` Hannes Reinecke
  2021-09-09 14:09 ` Christoph Hellwig
  1 sibling, 1 reply; 5+ messages in thread
From: Christoph Hellwig @ 2021-09-06  8:01 UTC (permalink / raw)
  To: Daniel Wagner, Hannes Reinecke; +Cc: linux-nvme, linux-kernel, Keith Busch

On Thu, Sep 02, 2021 at 11:20:02AM +0200, Daniel Wagner wrote:
> When we remove the siblings entry, we update ns->head->list, hence we
> can't separate the removal from the test for the list being empty. Both
> have to happen in the same critical section to avoid a race.
> 
> To avoid reintroducing the refcounting imbalance, add a list empty
> check to nvme_find_ns_head().

Hannes, can you look over this and run your tests on it?


* Re: [PATCH v2] nvme: avoid race in shutdown namespace removal
  2021-09-06  8:01 ` Christoph Hellwig
@ 2021-09-06  8:40   ` Hannes Reinecke
  2021-09-09 10:06     ` Hannes Reinecke
  0 siblings, 1 reply; 5+ messages in thread
From: Hannes Reinecke @ 2021-09-06  8:40 UTC (permalink / raw)
  To: Christoph Hellwig, Daniel Wagner; +Cc: linux-nvme, linux-kernel, Keith Busch

On 9/6/21 10:01 AM, Christoph Hellwig wrote:
> On Thu, Sep 02, 2021 at 11:20:02AM +0200, Daniel Wagner wrote:
>> When we remove the siblings entry, we update ns->head->list, hence we
>> can't separate the removal from the test for the list being empty. Both
>> have to happen in the same critical section to avoid a race.
>>
>> To avoid reintroducing the refcounting imbalance, add a list empty
>> check to nvme_find_ns_head().
> 
> Hannes, can you look over this and run your tests on it?
> 
I'm at it.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Kernel Storage Architect
hare@suse.de                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer


* Re: [PATCH v2] nvme: avoid race in shutdown namespace removal
  2021-09-06  8:40   ` Hannes Reinecke
@ 2021-09-09 10:06     ` Hannes Reinecke
  0 siblings, 0 replies; 5+ messages in thread
From: Hannes Reinecke @ 2021-09-09 10:06 UTC (permalink / raw)
  To: Christoph Hellwig, Daniel Wagner; +Cc: linux-nvme, linux-kernel, Keith Busch

On 9/6/21 10:40 AM, Hannes Reinecke wrote:
> On 9/6/21 10:01 AM, Christoph Hellwig wrote:
>> On Thu, Sep 02, 2021 at 11:20:02AM +0200, Daniel Wagner wrote:
>>> When we remove the siblings entry, we update ns->head->list, hence we
>>> can't separate the removal from the test for the list being empty. Both
>>> have to happen in the same critical section to avoid a race.
>>>
>>> To avoid reintroducing the refcounting imbalance, add a list empty
>>> check to nvme_find_ns_head().
>>
>> Hannes, can you look over this and run your tests on it?
>>
> I'm at it.
> 
Finally. qemu being its usual bitchy self.

But managed to test the patch, and all looks good.

For reference, the testcase is:

- Create qemu instance with two NVMe namespaces:
  -device pcie-root-port,id=rp80,bus=pcie.0,chassis=2,addr=8.0,\
   multifunction=on,pref64-reserve=32M \
  -device pcie-root-port,id=rp90,bus=pcie.0,chassis=3,addr=9.0,\
   multifunction=on,pref64-reserve=32M \
  -device nvme-subsys,id=nvme-subsys1,nqn=slesnvmesubsys-1 \
  -device nvme-subsys,id=nvme-subsys2,nqn=slesnvmesubsys-2 \
  -device nvme,bus=rp80,id=nvme-rp80,serial=SLESNVME2,\
   subsys=nvme-subsys1 \
  -device nvme-ns,id=nvme-ns-2,bus=nvme-rp80,drive=nvme-2 \
  -device nvme,bus=rp90,id=nvme-rp90,serial=SLESNVME3,\
   subsys=nvme-subsys2 \
  -device nvme-ns,id=nvme-ns-3,bus=nvme-rp90,drive=nvme-3

- Install the system, and create an MD RAID 1 on those namespaces
  (an example mdadm invocation is shown after this procedure).
- Enter qemu monitor, and detach one controller:
  device_del nvme-rp90
- Check in the OS that the device has been removed, and MD has
  registered the device failure:

# dmesg
[ 1801.076236] pcieport 0000:00:09.0: pciehp: Slot(0-2): Attention
button pressed
[ 1801.076251] pcieport 0000:00:09.0: pciehp: Slot(0-2): Powering off
due to button press
[ 1806.250017] block nvme2n1: no available path - failing I/O
[ 1806.250030] md: super_written gets error=-5
[ 1806.250036] md/raid1:md1: Disk failure on nvme2n1, disabling device.
               md/raid1:md1: Operation continuing on 1 devices.

- Enter qemu monitor, and re-attach the controller:
  device_add nvme,bus=rp90,id=nvme-rp90,serial=SLESNVME3,subsys=nvme-subsys2

- Check in the OS that the device has been reattached:
# dmesg
[ 1845.634613] pcieport 0000:00:09.0: pciehp: Slot(0-2): Attention
button pressed
[ 1845.634626] pcieport 0000:00:09.0: pciehp: Slot(0-2): Powering on due
to button press
[ 1845.634821] pcieport 0000:00:09.0: pciehp: Slot(0-2): Card present
[ 1845.634826] pcieport 0000:00:09.0: pciehp: Slot(0-2): Link Up
[ 1845.770969] pci 0000:03:00.0: [1b36:0010] type 00 class 0x010802
[ 1845.771307] pci 0000:03:00.0: reg 0x10: [mem 0x00000000-0x00003fff 64bit]
[ 1845.773503] pci 0000:03:00.0: BAR 0: assigned [mem
0xc1200000-0xc1203fff 64bit]
[ 1845.773646] pcieport 0000:00:09.0: PCI bridge to [bus 03]
[ 1845.773671] pcieport 0000:00:09.0:   bridge window [io  0x7000-0x7fff]
[ 1845.776926] pcieport 0000:00:09.0:   bridge window [mem
0xc1200000-0xc13fffff]
[ 1845.778816] pcieport 0000:00:09.0:   bridge window [mem
0x804000000-0x805ffffff 64bit pref]
[ 1845.783970] nvme nvme2: pci function 0000:03:00.0
[ 1845.784227] nvme 0000:03:00.0: enabling device (0000 -> 0002)
[ 1845.798918] nvme nvme2: 1/0/0 default/read/poll queues

- Reattach the namespace to the MD RAID:
# mdadm --manage /dev/md1 --re-add /dev/nvme2n1

- Check that everything worked:
# cat /proc/mdstat
Personalities : [raid1]
md1 : active raid1 nvme2n1[1] nvme1n1[0]
      4189184 blocks super 1.2 [2/2] [UU]
      bitmap: 0/1 pages [0KB], 65536KB chunk

unused devices: <none>
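
An MD RAID 1 like the one above can be created with, for example
(device names as in the mdstat output; adjust to the actual namespaces):

# mdadm --create /dev/md1 --level=1 --raid-devices=2 \
    /dev/nvme1n1 /dev/nvme2n1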

So you can add:

Reviewed-by: Hannes Reinecke <hare@suse.de>
Tested-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		           Kernel Storage Architect
hare@suse.de			                  +49 911 74053 688
SUSE Software Solutions Germany GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), GF: Felix Imendörffer


* Re: [PATCH v2] nvme: avoid race in shutdown namespace removal
  2021-09-02  9:20 [PATCH v2] nvme: avoid race in shutdown namespace removal Daniel Wagner
  2021-09-06  8:01 ` Christoph Hellwig
@ 2021-09-09 14:09 ` Christoph Hellwig
  1 sibling, 0 replies; 5+ messages in thread
From: Christoph Hellwig @ 2021-09-09 14:09 UTC (permalink / raw)
  To: Daniel Wagner; +Cc: linux-nvme, linux-kernel, Hannes Reinecke, Keith Busch

Thanks,

applied to nvme-5.15.

