* nvme-multipath: round-robin infinite looping
@ 2022-03-21 22:43 Chris Leech
  2022-03-21 22:43 ` kdump details (Re: nvme-multipath: round-robin infinite looping) Chris Leech
                   ` (2 more replies)
  0 siblings, 3 replies; 16+ messages in thread
From: Chris Leech @ 2022-03-21 22:43 UTC (permalink / raw)
  To: linux-nvme, sagi, hch; +Cc: lengchao, dwagner, hare, mlombard, jmeneghi

I've been looking at a lockup reported by a partner doing nvme-tcp
testing, and I believe there's an issue between nvme_ns_remove and
nvme_round_robin_path that can result in infinite looping.

It seems like the same concern that was raised by Chao Leng in this
thread: https://lore.kernel.org/all/bd37abd5-759d-efe2-fdcd-8b004a41c75a@huawei.com/

The ordering of nvme_ns_remove is constructed to prevent races that
would re-assign a namespace being removed to the current_path cache.
That leaves a period where a namespace in current_path is not in the
path sibling list.  But nvme_round_robin_path makes the assumption that
the "old" ns taken from current_path is always on the list, and the odd
list traversal with nvme_next_ns isn't safe with an RCU list that can
change while it's being read.

I'm not convinced that there is a way to meet all of these assumptions
by looking only at the list and current_path.  I think it can be done if
the NVME_NS_READY flag is taken into account, though possibly with an
additional synchronization point.

I'm following this email with details from a kdump analysis that shows
this happening, with a current_path entry partially removed from the
list (pointing into the list, but not on it, as list_del_rcu does) and a
CPU stuck in the inner loop of nvme_round_robin_path.

And then a couple of suggestions: an attempt to fix this in
nvme_ns_remove, as well as an easy backstop for nvme_round_robin_path
that would prevent endless looping without fixing the race.

I'm just looking for discussion on these right now; we're working on
getting them tested.

- Chris



^ permalink raw reply	[flat|nested] 16+ messages in thread

* kdump details (Re: nvme-multipath: round-robin infinite looping)
  2022-03-21 22:43 nvme-multipath: round-robin infinite looping Chris Leech
@ 2022-03-21 22:43 ` Chris Leech
  2022-03-22 11:16   ` Christoph Hellwig
  2022-03-21 22:43 ` [RFC PATCH] nvme-multipath: break endless loop in nvme_round_robin_path Chris Leech
  2022-03-21 22:43 ` [RFC PATCH] nvme: fix RCU hole that allowed for endless looping in multipath round robin Chris Leech
  2 siblings, 1 reply; 16+ messages in thread
From: Chris Leech @ 2022-03-21 22:43 UTC (permalink / raw)
  To: linux-nvme, sagi, hch; +Cc: lengchao, dwagner, hare, mlombard, jmeneghi

Backtrace of a soft-lockup while executing in nvme_ns_head_make_request

crash> bt
PID: 6      TASK: ffff90550742c000  CPU: 0   COMMAND: "kworker/0:0H"
 #0 [ffffb40a40003d48] machine_kexec at ffffffff93864e9e
 #1 [ffffb40a40003da0] __crash_kexec at ffffffff939a4c4d
 #2 [ffffb40a40003e68] panic at ffffffff938ed587
 #3 [ffffb40a40003ee8] watchdog_timer_fn.cold.10 at ffffffff939db197
 #4 [ffffb40a40003f18] __hrtimer_run_queues at ffffffff93983c20
 #5 [ffffb40a40003f78] hrtimer_interrupt at ffffffff93984480
 #6 [ffffb40a40003fd8] smp_apic_timer_interrupt at ffffffff942026ba
 #7 [ffffb40a40003ff0] apic_timer_interrupt at ffffffff94201c4f
--- <IRQ stack> ---
 #8 [ffffb40a403d3d18] apic_timer_interrupt at ffffffff94201c4f
    [exception RIP: nvme_ns_head_make_request+592]
    RIP: ffffffffc0280810  RSP: ffffb40a403d3dc0  RFLAGS: 00000206
    RAX: 0000000000000003  RBX: ffff9059517dd800  RCX: 0000000000000001
    RDX: 000043822f41b300  RSI: 0000000000000000  RDI: ffff9058da8d0010
    RBP: ffff9058da8d0010   R8: 0000000000000001   R9: 0000000000000000
    R10: 0000000000000003  R11: 0000000000000600  R12: 0000000000000001
    R13: ffff90724096e000  R14: ffff9058da8dca50  R15: ffff9058da8d0000
    ORIG_RAX: ffffffffffffff13  CS: 0010  SS: 0018
 #9 [ffffb40a403d3e18] generic_make_request at ffffffff93c6e8bb
#10 [ffffb40a403d3e80] nvme_requeue_work at ffffffffc02805aa [nvme_core]
#11 [ffffb40a403d3e98] process_one_work at ffffffff9390b237
#12 [ffffb40a403d3ed8] worker_thread at ffffffff9390b8f0
#13 [ffffb40a403d3f10] kthread at ffffffff9391269a
#14 [ffffb40a403d3f50] ret_from_fork at ffffffff94200255

The RIP here is in the nvme_round_robin_path loop.  I've seen a report
without softlockup_panic enabled where multiple CPUs locked up, with a
variety of reported addresses all within the loop.

Disassembling and looking for the loop test expression, we can find the
instructions for "ns && ns != old":

236             for (ns = nvme_next_ns(head, old);
237                  ns && ns != old;
   0x000000000000982f <+575>:   test   %rbx,%rbx
   0x0000000000009832 <+578>:   je     0x98a8 <nvme_ns_head_make_request+696>
   0x0000000000009834 <+580>:   cmp    %rbx,%r13
   0x0000000000009837 <+583>:   je     0x98a8 <nvme_ns_head_make_request+696>

At this point ns is in rbx and old is in r13, which doesn't appear to
change while in this loop.

Similarly, head is in r15, used here in list_first_or_null_rcu:

./include/linux/compiler.h:
276             __READ_ONCE_SIZE;
   0x0000000000009961 <+881>:   mov    (%r15),%rax

drivers/nvme/host/multipath.c:
222             return list_first_or_null_rcu(&head->list, struct nvme_ns, siblings);
   0x0000000000009964 <+884>:   mov    %rax,0x18(%rsp)
   0x0000000000009969 <+889>:   cmp    %rax,%r15
   0x000000000000996c <+892>:   je     0x99d4 <nvme_ns_head_make_request+996>

So, from the saved register values in the backtrace:
old = ffff90724096e000
head = ffff9058da8d0000

Checking that old.head == head

crash> struct nvme_ns.head ffff90724096e000
  head = 0xffff9058da8d0000

Dumping the nvme_ns structs on head.list

crash> list nvme_ns.siblings -H 0xffff9058da8d0000
ffff907084504000
ffff9059517dd800

Only two, and neither of them matches "old"

What do the list pointers in old look like?

crash> struct nvme_ns.siblings ffff90724096e000
  siblings = {
    next = 0xffff9059517dd830, 
    prev = 0xdead000000000200
  }

old.siblings->next points into a valid nvme_ns on the list, but prev has
been poisoned.

This loop can still exit if any ns on the list is not disabled and is
an ANA-optimized path.

crash> list nvme_ns.siblings -H 0xffff9058da8d0000 -s nvme_ns.ctrl
ffff907084504000
  ctrl = 0xffff90555c13c338
ffff9059517dd800
  ctrl = 0xffff90555c138338
crash> struct nvme_ctrl.state 0xffff90555c13c338
  state = NVME_CTRL_CONNECTING
crash> struct nvme_ctrl.state 0xffff90555c138338
  state = NVME_CTRL_CONNECTING

No good, both are disabled while connecting.

Dumping head.current_path[], and there's "old":

crash> struct nvme_ns_head ffff9058da8d0000 -o
struct nvme_ns_head {
  [ffff9058da8d0000] struct list_head list;
  [ffff9058da8d0010] struct srcu_struct srcu;
  [ffff9058da8dc510] struct nvme_subsystem *subsys;
  [ffff9058da8dc518] unsigned int ns_id;
  [ffff9058da8dc51c] struct nvme_ns_ids ids;
  [ffff9058da8dc548] struct list_head entry;
  [ffff9058da8dc558] struct kref ref;
  [ffff9058da8dc55c] bool shared;
  [ffff9058da8dc560] int instance;
  [ffff9058da8dc568] struct nvme_effects_log *effects;
  [ffff9058da8dc570] struct cdev cdev;
  [ffff9058da8dc5f8] struct device cdev_device;
  [ffff9058da8dc9c8] struct gendisk *disk;
  [ffff9058da8dc9d0] struct bio_list requeue_list;
  [ffff9058da8dc9e0] spinlock_t requeue_lock;
  [ffff9058da8dc9e8] struct work_struct requeue_work;
  [ffff9058da8dca28] struct mutex lock;
  [ffff9058da8dca48] unsigned long flags;
  [ffff9058da8dca50] struct nvme_ns *current_path[];
}
SIZE: 51792

crash> p *(struct nvme_ns **) 0xffff9058da8dca50 @ 4
$18 = 
 {0xffff90724096e000, 0x0, 0x0, 0x0}

So, we end up in an infinite loop because:

The nvme_ns_head list contains two nvme_ns entries.  Both have
ctrl->state as NVME_CTRL_CONNECTING, so they are considered disabled and
we execute the continue statement in the loop.  "old" has been removed
from the siblings list, so it can never trigger the ns != old exit
condition.




* [RFC PATCH] nvme-multipath: break endless loop in nvme_round_robin_path
  2022-03-21 22:43 nvme-multipath: round-robin infinite looping Chris Leech
  2022-03-21 22:43 ` kdump details (Re: nvme-multipath: round-robin infinite looping) Chris Leech
@ 2022-03-21 22:43 ` Chris Leech
  2022-03-22 11:17   ` Christoph Hellwig
  2022-03-21 22:43 ` [RFC PATCH] nvme: fix RCU hole that allowed for endless looping in multipath round robin Chris Leech
  2 siblings, 1 reply; 16+ messages in thread
From: Chris Leech @ 2022-03-21 22:43 UTC (permalink / raw)
  To: linux-nvme, sagi, hch; +Cc: lengchao, dwagner, hare, mlombard, jmeneghi

This is a backstop for the odd loop construct in nvme_round_robin_path.
It counts how many times the head pointer has been passed, as that's the
only thing guaranteed to stay on the list.  Once is needed to start from
a different place and check the entire list; twice is excessive looping.
---
 drivers/nvme/host/multipath.c | 11 ++++++++---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
index ff775235534c..c2039746ca15 100644
--- a/drivers/nvme/host/multipath.c
+++ b/drivers/nvme/host/multipath.c
@@ -261,12 +261,13 @@ static struct nvme_ns *__nvme_find_path(struct nvme_ns_head *head, int node)
 }
 
 static struct nvme_ns *nvme_next_ns(struct nvme_ns_head *head,
-		struct nvme_ns *ns)
+		struct nvme_ns *ns, int *head_pass_count)
 {
 	ns = list_next_or_null_rcu(&head->list, &ns->siblings, struct nvme_ns,
 			siblings);
 	if (ns)
 		return ns;
+	(*head_pass_count)++;
 	return list_first_or_null_rcu(&head->list, struct nvme_ns, siblings);
 }
 
@@ -274,6 +275,7 @@ static struct nvme_ns *nvme_round_robin_path(struct nvme_ns_head *head,
 		int node, struct nvme_ns *old)
 {
 	struct nvme_ns *ns, *found = NULL;
+	int head_pass_count = 0;
 
 	if (list_is_singular(&head->list)) {
 		if (nvme_path_is_disabled(old))
@@ -281,9 +283,12 @@ static struct nvme_ns *nvme_round_robin_path(struct nvme_ns_head *head,
 		return old;
 	}
 
-	for (ns = nvme_next_ns(head, old);
+	for (ns = nvme_next_ns(head, old, &head_pass_count);
 	     ns && ns != old;
-	     ns = nvme_next_ns(head, ns)) {
+	     ns = nvme_next_ns(head, ns, &head_pass_count)) {
+		if (head_pass_count > 1)
+			break;
+
 		if (nvme_path_is_disabled(ns))
 			continue;
 
-- 
2.35.1




* [RFC PATCH] nvme: fix RCU hole that allowed for endless looping in multipath round robin
  2022-03-21 22:43 nvme-multipath: round-robin infinite looping Chris Leech
  2022-03-21 22:43 ` kdump details (Re: nvme-multipath: round-robin infinite looping) Chris Leech
  2022-03-21 22:43 ` [RFC PATCH] nvme-multipath: break endless loop in nvme_round_robin_path Chris Leech
@ 2022-03-21 22:43 ` Chris Leech
  2022-03-23 14:54   ` Sagi Grimberg
                     ` (2 more replies)
  2 siblings, 3 replies; 16+ messages in thread
From: Chris Leech @ 2022-03-21 22:43 UTC (permalink / raw)
  To: linux-nvme, sagi, hch; +Cc: lengchao, dwagner, hare, mlombard, jmeneghi

Make nvme_ns_remove match the assumptions elsewhere.

1) !NVME_NS_READY needs to be srcu synchronized to make sure nothing is
   running in __nvme_find_path or nvme_round_robin_path that will
   re-assign this ns to current_path.

2) Any matching current_path entries need to be cleared before removing
   from the siblings list, to prevent calling nvme_round_robin_path with
   an "old" ns that's off list.

3) Finally the list_del_rcu can happen, and then synchronize again
   before releasing any reference counts.
---
 drivers/nvme/host/core.c | 13 +++++++++----
 1 file changed, 9 insertions(+), 4 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index fd4720d37cc0..20778dc9224c 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -3917,6 +3917,15 @@ static void nvme_ns_remove(struct nvme_ns *ns)
 	set_capacity(ns->disk, 0);
 	nvme_fault_inject_fini(&ns->fault_inject);
 
+	/* ensure that !NVME_NS_READY is seen
+	 * to prevent this ns going back in current_path
+	 */
+	synchronize_srcu(&ns->head->srcu);
+
+	/* wait for concurrent submissions */
+	if (nvme_mpath_clear_current_path(ns))
+		synchronize_srcu(&ns->head->srcu);
+
 	mutex_lock(&ns->ctrl->subsys->lock);
 	list_del_rcu(&ns->siblings);
 	if (list_empty(&ns->head->list)) {
@@ -3928,10 +3937,6 @@ static void nvme_ns_remove(struct nvme_ns *ns)
 	/* guarantee not available in head->list */
 	synchronize_rcu();
 
-	/* wait for concurrent submissions */
-	if (nvme_mpath_clear_current_path(ns))
-		synchronize_srcu(&ns->head->srcu);
-
 	if (!nvme_ns_head_multipath(ns->head))
 		nvme_cdev_del(&ns->cdev, &ns->cdev_device);
 	del_gendisk(ns->disk);
-- 
2.35.1




* Re: kdump details (Re: nvme-multipath: round-robin infinite looping)
  2022-03-21 22:43 ` kdump details (Re: nvme-multipath: round-robin infinite looping) Chris Leech
@ 2022-03-22 11:16   ` Christoph Hellwig
  2022-03-22 15:36     ` Chris Leech
  0 siblings, 1 reply; 16+ messages in thread
From: Christoph Hellwig @ 2022-03-22 11:16 UTC (permalink / raw)
  To: Chris Leech
  Cc: linux-nvme, sagi, hch, lengchao, dwagner, hare, mlombard, jmeneghi

What kernel version is this?  generic_make_request is long gone, so this
looks pretty old.



* Re: [RFC PATCH] nvme-multipath: break endless loop in nvme_round_robin_path
  2022-03-21 22:43 ` [RFC PATCH] nvme-multipath: break endless loop in nvme_round_robin_path Chris Leech
@ 2022-03-22 11:17   ` Christoph Hellwig
  2022-03-22 12:07     ` Daniel Wagner
  2022-03-22 15:42     ` Chris Leech
  0 siblings, 2 replies; 16+ messages in thread
From: Christoph Hellwig @ 2022-03-22 11:17 UTC (permalink / raw)
  To: Chris Leech
  Cc: linux-nvme, sagi, hch, lengchao, dwagner, hare, mlombard, jmeneghi

On Mon, Mar 21, 2022 at 03:43:03PM -0700, Chris Leech wrote:
> This is a backstop for the odd loop construct in nvme_round_robin_path.
> It counts how many times the head pointer has been passed, as that's the
> only thing guaranteed to stay on the list.  Once is needed to start from
> a different place and check the entire list; twice is excessive looping.

Shouldn't

d1bcf006a9d3d63c1bcb65a993cb13756954cd9c
Author: Daniel Wagner <dwagner@suse.de>
Date:   Wed Jan 27 11:30:33 2021 +0100

    nvme-multipath: Early exit if no path is available

take care of this?



* Re: [RFC PATCH] nvme-multipath: break endless loop in nvme_round_robin_path
  2022-03-22 11:17   ` Christoph Hellwig
@ 2022-03-22 12:07     ` Daniel Wagner
  2022-03-22 15:42     ` Chris Leech
  1 sibling, 0 replies; 16+ messages in thread
From: Daniel Wagner @ 2022-03-22 12:07 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Chris Leech, linux-nvme, sagi, lengchao, hare, mlombard, jmeneghi

On Tue, Mar 22, 2022 at 12:17:43PM +0100, Christoph Hellwig wrote:
> On Mon, Mar 21, 2022 at 03:43:03PM -0700, Chris Leech wrote:
> > This is a backstop for the odd loop construct in nvme_round_robin_path.
> > It counts how many times the head pointer has been passed, as that's the
> > only thing guaranteed to stay on the list.  Once is needed to start from
> > a different place and check the entire list; twice is excessive looping.
> 
> Shouldn't

d1bcf006a9d3 ("nvme-multipath: Early exit if no path is available")

> take care of this?

IIRC, this addressed the issue that we started to dereference a NULL
pointer if there was no path left.

Chris' problem is that the iterator always returns the same
pointer. Personally, I prefer the solution to introduce some sort of
sentinel over the one with the locking.



* Re: kdump details (Re: nvme-multipath: round-robin infinite looping)
  2022-03-22 11:16   ` Christoph Hellwig
@ 2022-03-22 15:36     ` Chris Leech
  0 siblings, 0 replies; 16+ messages in thread
From: Chris Leech @ 2022-03-22 15:36 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-nvme, Sagi Grimberg, lengchao, dwagner, hare,
	Maurizio Lombardi, John Meneghini

On Tue, Mar 22, 2022 at 4:17 AM Christoph Hellwig <hch@lst.de> wrote:
>
> What kernel version is this?  generic_make_request is long gone, so this
> looks pretty old.

Yes, this was found on a RHEL 8 backport; we're working with the
reporter to move the test setup to something newer.  So far I haven't
been able to create my own reproducer for this.  I went ahead and
brought this to the list because I don't see any reason why upstream
would be different in this case.

- Chris




* Re: [RFC PATCH] nvme-multipath: break endless loop in nvme_round_robin_path
  2022-03-22 11:17   ` Christoph Hellwig
  2022-03-22 12:07     ` Daniel Wagner
@ 2022-03-22 15:42     ` Chris Leech
  1 sibling, 0 replies; 16+ messages in thread
From: Chris Leech @ 2022-03-22 15:42 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-nvme, Sagi Grimberg, lengchao, dwagner, hare,
	Maurizio Lombardi, John Meneghini

On Tue, Mar 22, 2022 at 4:17 AM Christoph Hellwig <hch@lst.de> wrote:
> Shouldn't
>
> 1bcf006a9d3d63c1bcb65a993cb13756954cd9c
> Author: Daniel Wagner <dwagner@suse.de>
> Date:   Wed Jan 27 11:30:33 2021 +0100
>
>     nvme-multipath: Early exit if no path is available
>
> take care of this?

No, in this case nvme_next_ns never returns NULL, but it also never
returns old.  In the kdump I have, there are two NVME_CTRL_CONNECTING
paths that are returned over and over.




* Re: [RFC PATCH] nvme: fix RCU hole that allowed for endless looping in multipath round robin
  2022-03-21 22:43 ` [RFC PATCH] nvme: fix RCU hole that allowed for endless looping in multipath round robin Chris Leech
@ 2022-03-23 14:54   ` Sagi Grimberg
  2022-03-23 15:34     ` Christoph Hellwig
  2022-03-25  6:36   ` Christoph Hellwig
  2022-03-25 12:42   ` Sagi Grimberg
  2 siblings, 1 reply; 16+ messages in thread
From: Sagi Grimberg @ 2022-03-23 14:54 UTC (permalink / raw)
  To: Chris Leech, linux-nvme, hch; +Cc: lengchao, dwagner, hare, mlombard, jmeneghi



On 3/22/22 00:43, Chris Leech wrote:
> Make nvme_ns_remove match the assumptions elsewhere.
> 
> 1) !NVME_NS_READY needs to be srcu synchronized to make sure nothing is
>     running in __nvme_find_path or nvme_round_robin_path that will
>     re-assign this ns to current_path.
> 
> 2) Any matching current_path entries need to be cleared before removing
>     from the siblings list, to prevent calling nvme_round_robin_path with
>     an "old" ns that's off list.
> 
> 3) Finally the list_del_rcu can happen, and then synchronize again
>     before releasing any reference counts.
> ---
>   drivers/nvme/host/core.c | 13 +++++++++----
>   1 file changed, 9 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index fd4720d37cc0..20778dc9224c 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -3917,6 +3917,15 @@ static void nvme_ns_remove(struct nvme_ns *ns)
>   	set_capacity(ns->disk, 0);
>   	nvme_fault_inject_fini(&ns->fault_inject);
>   
> +	/* ensure that !NVME_NS_READY is seen
> +	 * to prevent this ns going back in current_path
> +	 */
> +	synchronize_srcu(&ns->head->srcu);
> +
> +	/* wait for concurrent submissions */
> +	if (nvme_mpath_clear_current_path(ns))
> +		synchronize_srcu(&ns->head->srcu);

Nothing prevents it from being reselected again.
This is what drove the placement of this after the
ns is removed from the head->list. But that was before
the selection looked into NVME_NS_READY flag...

This looks legit to me...



* Re: [RFC PATCH] nvme: fix RCU hole that allowed for endless looping in multipath round robin
  2022-03-23 14:54   ` Sagi Grimberg
@ 2022-03-23 15:34     ` Christoph Hellwig
  2022-03-23 19:07       ` John Meneghini
  0 siblings, 1 reply; 16+ messages in thread
From: Christoph Hellwig @ 2022-03-23 15:34 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Chris Leech, linux-nvme, hch, lengchao, dwagner, hare, mlombard,
	jmeneghi

On Wed, Mar 23, 2022 at 04:54:26PM +0200, Sagi Grimberg wrote:
>
>
> On 3/22/22 00:43, Chris Leech wrote:
>> Make nvme_ns_remove match the assumptions elsewhere.
>>
>> 1) !NVME_NS_READY needs to be srcu synchronized to make sure nothing is
>>     running in __nvme_find_path or nvme_round_robin_path that will
>>     re-assign this ns to current_path.
>>
>> 2) Any matching current_path entries need to be cleared before removing
>>     from the siblings list, to prevent calling nvme_round_robin_path with
>>     an "old" ns that's off list.
>>
>> 3) Finally the list_del_rcu can happen, and then synchronize again
>>     before releasing any reference counts.
>> ---
>>   drivers/nvme/host/core.c | 13 +++++++++----
>>   1 file changed, 9 insertions(+), 4 deletions(-)
>>
>> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
>> index fd4720d37cc0..20778dc9224c 100644
>> --- a/drivers/nvme/host/core.c
>> +++ b/drivers/nvme/host/core.c
>> @@ -3917,6 +3917,15 @@ static void nvme_ns_remove(struct nvme_ns *ns)
>>   	set_capacity(ns->disk, 0);
>>   	nvme_fault_inject_fini(&ns->fault_inject);
>>   +	/* ensure that !NVME_NS_READY is seen
>> +	 * to prevent this ns going back in current_path
>> +	 */
>> +	synchronize_srcu(&ns->head->srcu);
>> +
>> +	/* wait for concurrent submissions */
>> +	if (nvme_mpath_clear_current_path(ns))
>> +		synchronize_srcu(&ns->head->srcu);
>
> Nothing prevents it from being reselected again.
> This is what drove the placement of this after the
> ns is removed from the head->list. But that was before
> the selection looked into NVME_NS_READY flag...
>
> This looks legit to me...

Yes, this looks pretty sensible.  I'm tempted to just queue it up
for 5.18.



* Re: [RFC PATCH] nvme: fix RCU hole that allowed for endless looping in multipath round robin
  2022-03-23 15:34     ` Christoph Hellwig
@ 2022-03-23 19:07       ` John Meneghini
  2022-04-05 13:14         ` John Meneghini
  0 siblings, 1 reply; 16+ messages in thread
From: John Meneghini @ 2022-03-23 19:07 UTC (permalink / raw)
  To: Chris Leech
  Cc: linux-nvme, lengchao, dwagner, hare, mlombard, Christoph Hellwig,
	Sagi Grimberg

This is good. I'm glad everyone agrees.

Chris, please ask our partner to test this patch in their testbed and verify that
this fixes the soft-lockup problem they are seeing. I believe they have the ability
to reproduce this problem on their 8.6 testbed.

If that works then please re-post a patch so people can consider this for v5.18.

/John


On 3/23/22 11:34, Christoph Hellwig wrote:
> On Wed, Mar 23, 2022 at 04:54:26PM +0200, Sagi Grimberg wrote:
>>
>>
>> On 3/22/22 00:43, Chris Leech wrote:
>>> Make nvme_ns_remove match the assumptions elsewhere.
>>>
>>> 1) !NVME_NS_READY needs to be srcu synchronized to make sure nothing is
>>>      running in __nvme_find_path or nvme_round_robin_path that will
>>>      re-assign this ns to current_path.
>>>
>>> 2) Any matching current_path entries need to be cleared before removing
>>>      from the siblings list, to prevent calling nvme_round_robin_path with
>>>      an "old" ns that's off list.
>>>
>>> 3) Finally the list_del_rcu can happen, and then synchronize again
>>>      before releasing any reference counts.
>>> ---
>>>    drivers/nvme/host/core.c | 13 +++++++++----
>>>    1 file changed, 9 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
>>> index fd4720d37cc0..20778dc9224c 100644
>>> --- a/drivers/nvme/host/core.c
>>> +++ b/drivers/nvme/host/core.c
>>> @@ -3917,6 +3917,15 @@ static void nvme_ns_remove(struct nvme_ns *ns)
>>>    	set_capacity(ns->disk, 0);
>>>    	nvme_fault_inject_fini(&ns->fault_inject);
>>>    +	/* ensure that !NVME_NS_READY is seen
>>> +	 * to prevent this ns going back in current_path
>>> +	 */
>>> +	synchronize_srcu(&ns->head->srcu);
>>> +
>>> +	/* wait for concurrent submissions */
>>> +	if (nvme_mpath_clear_current_path(ns))
>>> +		synchronize_srcu(&ns->head->srcu);
>>
>> Nothing prevents it from being reselected again.
>> This is what drove the placement of this after the
>> ns is removed from the head->list. But that was before
>> the selection looked into NVME_NS_READY flag...
>>
>> This looks legit to me...
> 
> Yes, this looks pretty sensible.  I'm tempted to just queue it up
> for 5.18.
> 




* Re: [RFC PATCH] nvme: fix RCU hole that allowed for endless looping in multipath round robin
  2022-03-21 22:43 ` [RFC PATCH] nvme: fix RCU hole that allowed for endless looping in multipath round robin Chris Leech
  2022-03-23 14:54   ` Sagi Grimberg
@ 2022-03-25  6:36   ` Christoph Hellwig
  2022-03-25 12:42   ` Sagi Grimberg
  2 siblings, 0 replies; 16+ messages in thread
From: Christoph Hellwig @ 2022-03-25  6:36 UTC (permalink / raw)
  To: Chris Leech
  Cc: linux-nvme, sagi, hch, lengchao, dwagner, hare, mlombard, jmeneghi

Tentatively applied to nvme-5.18.  Sagi, can you give this a formal
reviewed-by?



* Re: [RFC PATCH] nvme: fix RCU hole that allowed for endless looping in multipath round robin
  2022-03-21 22:43 ` [RFC PATCH] nvme: fix RCU hole that allowed for endless looping in multipath round robin Chris Leech
  2022-03-23 14:54   ` Sagi Grimberg
  2022-03-25  6:36   ` Christoph Hellwig
@ 2022-03-25 12:42   ` Sagi Grimberg
  2022-04-05 17:25     ` Chris Leech
  2 siblings, 1 reply; 16+ messages in thread
From: Sagi Grimberg @ 2022-03-25 12:42 UTC (permalink / raw)
  To: Chris Leech, linux-nvme, hch; +Cc: lengchao, dwagner, hare, mlombard, jmeneghi



On 3/22/22 00:43, Chris Leech wrote:
> Make nvme_ns_remove match the assumptions elsewhere.
> 
> 1) !NVME_NS_READY needs to be srcu synchronized to make sure nothing is
>     running in __nvme_find_path or nvme_round_robin_path that will
>     re-assign this ns to current_path.
> 
> 2) Any matching current_path entries need to be cleared before removing
>     from the siblings list, to prevent calling nvme_round_robin_path with
>     an "old" ns that's off list.
> 
> 3) Finally the list_del_rcu can happen, and then synchronize again
>     before releasing any reference counts.
> ---
>   drivers/nvme/host/core.c | 13 +++++++++----
>   1 file changed, 9 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index fd4720d37cc0..20778dc9224c 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -3917,6 +3917,15 @@ static void nvme_ns_remove(struct nvme_ns *ns)
>   	set_capacity(ns->disk, 0);
>   	nvme_fault_inject_fini(&ns->fault_inject);
>   
> +	/* ensure that !NVME_NS_READY is seen
> +	 * to prevent this ns going back in current_path
> +	 */
> +	synchronize_srcu(&ns->head->srcu);
> +
> +	/* wait for concurrent submissions */
> +	if (nvme_mpath_clear_current_path(ns))
> +		synchronize_srcu(&ns->head->srcu);

Can you describe what the unconditional srcu is protecting
that is not already protected by the conditional one taken when
clearing the current_path?



* Re: [RFC PATCH] nvme: fix RCU hole that allowed for endless looping in multipath round robin
  2022-03-23 19:07       ` John Meneghini
@ 2022-04-05 13:14         ` John Meneghini
  0 siblings, 0 replies; 16+ messages in thread
From: John Meneghini @ 2022-04-05 13:14 UTC (permalink / raw)
  To: Chris Leech
  Cc: linux-nvme, lengchao, dwagner, hare, mlombard, Christoph Hellwig,
	Sagi Grimberg

I just want to report back that our partner has tested this patch and verified that it fixes the soft-lockup problems.

They are unable to reproduce the problem when running the same tests on the same testbed when using this patch.

/John

On 3/23/22 15:07, John Meneghini wrote:
> This is good. I'm glad everyone agrees.
> 
> Chris, please ask our partner to test this patch in their testbed and verify that
> this fixes the soft-lockup problem they are seeing. I believe they have the ability
> to reproduce this problem on their 8.6 testbed.
> 
> If that works then please re-post a patch so people can consider this for v5.18.
> 
> /John
> 
> 
> On 3/23/22 11:34, Christoph Hellwig wrote:
>> On Wed, Mar 23, 2022 at 04:54:26PM +0200, Sagi Grimberg wrote:
>>>
>>>
>>> On 3/22/22 00:43, Chris Leech wrote:
>>>> Make nvme_ns_remove match the assumptions elsewhere.
>>>>
>>>> 1) !NVME_NS_READY needs to be srcu synchronized to make sure nothing is
>>>>      running in __nvme_find_path or nvme_round_robin_path that will
>>>>      re-assign this ns to current_path.
>>>>
>>>> 2) Any matching current_path entries need to be cleared before removing
>>>>      from the siblings list, to prevent calling nvme_round_robin_path with
>>>>      an "old" ns that's off list.
>>>>
>>>> 3) Finally the list_del_rcu can happen, and then synchronize again
>>>>      before releasing any reference counts.
>>>> ---
>>>>    drivers/nvme/host/core.c | 13 +++++++++----
>>>>    1 file changed, 9 insertions(+), 4 deletions(-)
>>>>
>>>> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
>>>> index fd4720d37cc0..20778dc9224c 100644
>>>> --- a/drivers/nvme/host/core.c
>>>> +++ b/drivers/nvme/host/core.c
>>>> @@ -3917,6 +3917,15 @@ static void nvme_ns_remove(struct nvme_ns *ns)
>>>>        set_capacity(ns->disk, 0);
>>>>        nvme_fault_inject_fini(&ns->fault_inject);
>>>>    +    /* ensure that !NVME_NS_READY is seen
>>>> +     * to prevent this ns going back in current_path
>>>> +     */
>>>> +    synchronize_srcu(&ns->head->srcu);
>>>> +
>>>> +    /* wait for concurrent submissions */
>>>> +    if (nvme_mpath_clear_current_path(ns))
>>>> +        synchronize_srcu(&ns->head->srcu);
>>>
>>> Nothing prevents it from being reselected again.
>>> This is what drove the placement of this after the
>>> ns is removed from the head->list. But that was before
>>> the selection looked into NVME_NS_READY flag...
>>>
>>> This looks legit to me...
>>
>> Yes, this looks pretty sensible.  I'm tempted to just queue it up
>> for 5.18.
>>
> 




* Re: [RFC PATCH] nvme: fix RCU hole that allowed for endless looping in multipath round robin
  2022-03-25 12:42   ` Sagi Grimberg
@ 2022-04-05 17:25     ` Chris Leech
  0 siblings, 0 replies; 16+ messages in thread
From: Chris Leech @ 2022-04-05 17:25 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: linux-nvme, hch, lengchao, dwagner, hare, mlombard, jmeneghi

On Fri, Mar 25, 2022 at 5:42 AM Sagi Grimberg <sagi@grimberg.me> wrote:
>
> On 3/22/22 00:43, Chris Leech wrote:
> > Make nvme_ns_remove match the assumptions elsewhere.
> >
> > 1) !NVME_NS_READY needs to be srcu synchronized to make sure nothing is
> >     running in __nvme_find_path or nvme_round_robin_path that will
> >     re-assign this ns to current_path.
> >
> > 2) Any matching current_path entries need to be cleared before removing
> >     from the siblings list, to prevent calling nvme_round_robin_path with
> >     an "old" ns that's off list.
> >
> > 3) Finally the list_del_rcu can happen, and then synchronize again
> >     before releasing any reference counts.
> > ---
> >   drivers/nvme/host/core.c | 13 +++++++++----
> >   1 file changed, 9 insertions(+), 4 deletions(-)
> >
> > diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> > index fd4720d37cc0..20778dc9224c 100644
> > --- a/drivers/nvme/host/core.c
> > +++ b/drivers/nvme/host/core.c
> > @@ -3917,6 +3917,15 @@ static void nvme_ns_remove(struct nvme_ns *ns)
> >       set_capacity(ns->disk, 0);
> >       nvme_fault_inject_fini(&ns->fault_inject);
> >
> > +     /* ensure that !NVME_NS_READY is seen
> > +      * to prevent this ns going back in current_path
> > +      */
> > +     synchronize_srcu(&ns->head->srcu);
> > +
> > +     /* wait for concurrent submissions */
> > +     if (nvme_mpath_clear_current_path(ns))
> > +             synchronize_srcu(&ns->head->srcu);
>
> Can you describe how the unconditional srcu is protecting
> something that is not protected with the one under clearing
> the current_path?

My concern there was that if nvme_mpath_clear_current_path ran while a
different CPU was in __nvme_find_path or nvme_round_robin_path, past an
nvme_path_is_disabled check but before the rcu_assign_pointer, then the
ns being removed could be re-assigned to current_path.

Apologies for not noticing this question before now.

- Chris




end of thread, other threads:[~2022-04-05 17:26 UTC | newest]

Thread overview: 16+ messages
2022-03-21 22:43 nvme-multipath: round-robin infinite looping Chris Leech
2022-03-21 22:43 ` kdump details (Re: nvme-multipath: round-robin infinite looping) Chris Leech
2022-03-22 11:16   ` Christoph Hellwig
2022-03-22 15:36     ` Chris Leech
2022-03-21 22:43 ` [RFC PATCH] nvme-multipath: break endless loop in nvme_round_robin_path Chris Leech
2022-03-22 11:17   ` Christoph Hellwig
2022-03-22 12:07     ` Daniel Wagner
2022-03-22 15:42     ` Chris Leech
2022-03-21 22:43 ` [RFC PATCH] nvme: fix RCU hole that allowed for endless looping in multipath round robin Chris Leech
2022-03-23 14:54   ` Sagi Grimberg
2022-03-23 15:34     ` Christoph Hellwig
2022-03-23 19:07       ` John Meneghini
2022-04-05 13:14         ` John Meneghini
2022-03-25  6:36   ` Christoph Hellwig
2022-03-25 12:42   ` Sagi Grimberg
2022-04-05 17:25     ` Chris Leech
