* nvme-multipath: round-robin infinite looping
@ 2022-03-21 22:43 Chris Leech
2022-03-21 22:43 ` kdump details (Re: nvme-multipath: round-robin infinite looping) Chris Leech
` (2 more replies)
0 siblings, 3 replies; 16+ messages in thread
From: Chris Leech @ 2022-03-21 22:43 UTC (permalink / raw)
To: linux-nvme, sagi, hch; +Cc: lengchao, dwagner, hare, mlombard, jmeneghi
I've been looking at a lockup reported by a partner doing nvme-tcp
testing, and I believe there's an issue between nvme_ns_remove and
nvme_round_robin_path that can result in infinite looping.
It seems like the same concern that was raised by Chao Leng in this
thread: https://lore.kernel.org/all/bd37abd5-759d-efe2-fdcd-8b004a41c75a@huawei.com/
The ordering of nvme_ns_remove is constructed to prevent races that
would re-assign a namespace being removed to the current_path cache.
That leaves a period where a namespace in current_path is not in the
path sibling list. But nvme_round_robin_path makes the assumption that
the "old" ns taken from current_path is always on the list, and the odd
list traversal with nvme_next_ns isn't safe with an RCU list that can
change while it's being read.
I'm not convinced that there is a way to meet all of these assumptions
only looking at the list and current_path. I think it can be done if the
NVME_NS_READY flag is taken into account, but possibly needing an
additional synchronization point.
I'm following this email with details from a kdump analysis that shows
this happening, with a current_path entry partially removed from the
list (pointing into the list, but not on it, as list_del_rcu does) and a
CPU stuck in the inner loop of nvme_round_robin_path.
And then a couple of suggestions, for trying to fix this in
nvme_ns_remove as well as an easy backstop for nvme_round_robin_path
that would prevent endless looping without fixing the race.
Just looking for discussion on these right now, we're working on getting
them tested.
- Chris
^ permalink raw reply [flat|nested] 16+ messages in thread
* kdump details (Re: nvme-multipath: round-robin infinite looping)
2022-03-21 22:43 nvme-multipath: round-robin infinite looping Chris Leech
@ 2022-03-21 22:43 ` Chris Leech
2022-03-22 11:16 ` Christoph Hellwig
2022-03-21 22:43 ` [RFC PATCH] nvme-multipath: break endless loop in nvme_round_robin_path Chris Leech
2022-03-21 22:43 ` [RFC PATCH] nvme: fix RCU hole that allowed for endless looping in multipath round robin Chris Leech
2 siblings, 1 reply; 16+ messages in thread
From: Chris Leech @ 2022-03-21 22:43 UTC (permalink / raw)
To: linux-nvme, sagi, hch; +Cc: lengchao, dwagner, hare, mlombard, jmeneghi
Backtrace of a soft-lockup while executing in nvme_ns_head_make_request
crash> bt
PID: 6 TASK: ffff90550742c000 CPU: 0 COMMAND: "kworker/0:0H"
#0 [ffffb40a40003d48] machine_kexec at ffffffff93864e9e
#1 [ffffb40a40003da0] __crash_kexec at ffffffff939a4c4d
#2 [ffffb40a40003e68] panic at ffffffff938ed587
#3 [ffffb40a40003ee8] watchdog_timer_fn.cold.10 at ffffffff939db197
#4 [ffffb40a40003f18] __hrtimer_run_queues at ffffffff93983c20
#5 [ffffb40a40003f78] hrtimer_interrupt at ffffffff93984480
#6 [ffffb40a40003fd8] smp_apic_timer_interrupt at ffffffff942026ba
#7 [ffffb40a40003ff0] apic_timer_interrupt at ffffffff94201c4f
--- <IRQ stack> ---
#8 [ffffb40a403d3d18] apic_timer_interrupt at ffffffff94201c4f
[exception RIP: nvme_ns_head_make_request+592]
RIP: ffffffffc0280810 RSP: ffffb40a403d3dc0 RFLAGS: 00000206
RAX: 0000000000000003 RBX: ffff9059517dd800 RCX: 0000000000000001
RDX: 000043822f41b300 RSI: 0000000000000000 RDI: ffff9058da8d0010
RBP: ffff9058da8d0010 R8: 0000000000000001 R9: 0000000000000000
R10: 0000000000000003 R11: 0000000000000600 R12: 0000000000000001
R13: ffff90724096e000 R14: ffff9058da8dca50 R15: ffff9058da8d0000
ORIG_RAX: ffffffffffffff13 CS: 0010 SS: 0018
#9 [ffffb40a403d3e18] generic_make_request at ffffffff93c6e8bb
#10 [ffffb40a403d3e80] nvme_requeue_work at ffffffffc02805aa [nvme_core]
#11 [ffffb40a403d3e98] process_one_work at ffffffff9390b237
#12 [ffffb40a403d3ed8] worker_thread at ffffffff9390b8f0
#13 [ffffb40a403d3f10] kthread at ffffffff9391269a
#14 [ffffb40a403d3f50] ret_from_fork at ffffffff94200255
RIP here is in the nvme_round_robin_path loop. I've seen a report
without softlockup_panic enabled where multiple CPUs locked with a
variety of addresses reported all within the loop.
Disassembling and looking for the loop test expression, we can find the
instructions for ns && ns != old
236 for (ns = nvme_next_ns(head, old);
237 ns && ns != old;
0x000000000000982f <+575>: test %rbx,%rbx
0x0000000000009832 <+578>: je 0x98a8 <nvme_ns_head_make_request+696>
0x0000000000009834 <+580>: cmp %rbx,%r13
0x0000000000009837 <+583>: je 0x98a8 <nvme_ns_head_make_request+696>
At this point ns is in rbx, and old is in r13, that doesn’t appear to
change while in this loop.
Similarly head is in r15, being used here in list_first_or_null_rcu
./include/linux/compiler.h:
276 __READ_ONCE_SIZE;
0x0000000000009961 <+881>: mov (%r15),%rax
drivers/nvme/host/multipath.c:
222 return list_first_or_null_rcu(&head->list, struct nvme_ns, siblings);
0x0000000000009964 <+884>: mov %rax,0x18(%rsp)
0x0000000000009969 <+889>: cmp %rax,%r15
0x000000000000996c <+892>: je 0x99d4 <nvme_ns_head_make_request+996>
So, from the saved register values in the backtrace:
old = ffff90724096e000
head = ffff9058da8d0000
Checking that old.head == head
crash> struct nvme_ns.head ffff90724096e000
head = 0xffff9058da8d0000
Dumping the nvme_ns structs on head.list
crash> list nvme_ns.siblings -H 0xffff9058da8d0000
ffff907084504000
ffff9059517dd800
Only 2, and neither of them match “old”
What do the list pointers in old look like?
crash> struct nvme_ns.siblings ffff90724096e000
siblings = {
next = 0xffff9059517dd830,
prev = 0xdead000000000200
}
old.siblings->next points into a valid nvme_ns on the list, but prev has
been poisoned.
This loop can still exit if any ns on the list is not disabled and an
ANA optimized path.
crash> list nvme_ns.siblings -H 0xffff9058da8d0000 -s nvme_ns.ctrl
ffff907084504000
ctrl = 0xffff90555c13c338
ffff9059517dd800
ctrl = 0xffff90555c138338
crash> struct nvme_ctrl.state 0xffff90555c13c338
state = NVME_CTRL_CONNECTING
crash> struct nvme_ctrl.state 0xffff90555c138338
state = NVME_CTRL_CONNECTING
No good, both are disabled while connecting.
Dump head.current_path[], and there’s “old”
crash> struct nvme_ns_head ffff9058da8d0000 -o
struct nvme_ns_head {
[ffff9058da8d0000] struct list_head list;
[ffff9058da8d0010] struct srcu_struct srcu;
[ffff9058da8dc510] struct nvme_subsystem *subsys;
[ffff9058da8dc518] unsigned int ns_id;
[ffff9058da8dc51c] struct nvme_ns_ids ids;
[ffff9058da8dc548] struct list_head entry;
[ffff9058da8dc558] struct kref ref;
[ffff9058da8dc55c] bool shared;
[ffff9058da8dc560] int instance;
[ffff9058da8dc568] struct nvme_effects_log *effects;
[ffff9058da8dc570] struct cdev cdev;
[ffff9058da8dc5f8] struct device cdev_device;
[ffff9058da8dc9c8] struct gendisk *disk;
[ffff9058da8dc9d0] struct bio_list requeue_list;
[ffff9058da8dc9e0] spinlock_t requeue_lock;
[ffff9058da8dc9e8] struct work_struct requeue_work;
[ffff9058da8dca28] struct mutex lock;
[ffff9058da8dca48] unsigned long flags;
[ffff9058da8dca50] struct nvme_ns *current_path[];
}
SIZE: 51792
crash> p *(struct nvme_ns **) 0xffff9058da8dca50 @ 4
$18 =
{0xffff90724096e000, 0x0, 0x0, 0x0}
So, we end up in an infinite loop because:
The nvme_ns_head list contains 2 nvme_ns. Both of them have ctrl->state
as NVME_CTRL_CONNECTING, so they are considered as disabled and we will
execute the continue statement in the loop. “old” has been removed from
the siblings list, so it will never trigger the ns != old condition
^ permalink raw reply [flat|nested] 16+ messages in thread
* [RFC PATCH] nvme-multipath: break endless loop in nvme_round_robin_path
2022-03-21 22:43 nvme-multipath: round-robin infinite looping Chris Leech
2022-03-21 22:43 ` kdump details (Re: nvme-multipath: round-robin infinite looping) Chris Leech
@ 2022-03-21 22:43 ` Chris Leech
2022-03-22 11:17 ` Christoph Hellwig
2022-03-21 22:43 ` [RFC PATCH] nvme: fix RCU hole that allowed for endless looping in multipath round robin Chris Leech
2 siblings, 1 reply; 16+ messages in thread
From: Chris Leech @ 2022-03-21 22:43 UTC (permalink / raw)
To: linux-nvme, sagi, hch; +Cc: lengchao, dwagner, hare, mlombard, jmeneghi
This is a backstop for the odd loop construct in nvme_round_robin_path.
It counts how many times the head pointer has been passed, as that's the
only thing guarenteed to stay on the list. Once is needed to start from
a different place and check the entire list, twice is excessive looping.
---
drivers/nvme/host/multipath.c | 11 ++++++++---
1 file changed, 8 insertions(+), 3 deletions(-)
diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
index ff775235534c..c2039746ca15 100644
--- a/drivers/nvme/host/multipath.c
+++ b/drivers/nvme/host/multipath.c
@@ -261,12 +261,13 @@ static struct nvme_ns *__nvme_find_path(struct nvme_ns_head *head, int node)
}
static struct nvme_ns *nvme_next_ns(struct nvme_ns_head *head,
- struct nvme_ns *ns)
+ struct nvme_ns *ns, int *head_pass_count)
{
ns = list_next_or_null_rcu(&head->list, &ns->siblings, struct nvme_ns,
siblings);
if (ns)
return ns;
+ *head_pass_count++;
return list_first_or_null_rcu(&head->list, struct nvme_ns, siblings);
}
@@ -274,6 +275,7 @@ static struct nvme_ns *nvme_round_robin_path(struct nvme_ns_head *head,
int node, struct nvme_ns *old)
{
struct nvme_ns *ns, *found = NULL;
+ int head_pass_count = 0;
if (list_is_singular(&head->list)) {
if (nvme_path_is_disabled(old))
@@ -281,9 +283,12 @@ static struct nvme_ns *nvme_round_robin_path(struct nvme_ns_head *head,
return old;
}
- for (ns = nvme_next_ns(head, old);
+ for (ns = nvme_next_ns(head, old, &head_pass_count);
ns && ns != old;
- ns = nvme_next_ns(head, ns)) {
+ ns = nvme_next_ns(head, ns, &head_pass_count)) {
+ if (head_pass_count > 1)
+ break;
+
if (nvme_path_is_disabled(ns))
continue;
--
2.35.1
^ permalink raw reply related [flat|nested] 16+ messages in thread
* [RFC PATCH] nvme: fix RCU hole that allowed for endless looping in multipath round robin
2022-03-21 22:43 nvme-multipath: round-robin infinite looping Chris Leech
2022-03-21 22:43 ` kdump details (Re: nvme-multipath: round-robin infinite looping) Chris Leech
2022-03-21 22:43 ` [RFC PATCH] nvme-multipath: break endless loop in nvme_round_robin_path Chris Leech
@ 2022-03-21 22:43 ` Chris Leech
2022-03-23 14:54 ` Sagi Grimberg
` (2 more replies)
2 siblings, 3 replies; 16+ messages in thread
From: Chris Leech @ 2022-03-21 22:43 UTC (permalink / raw)
To: linux-nvme, sagi, hch; +Cc: lengchao, dwagner, hare, mlombard, jmeneghi
Make nvme_ns_remove match the assumptions elsewhere.
1) !NVME_NS_READY needs to be srcu synchronized to make sure nothing is
running in __nvme_find_path or nvme_round_robin_path that will
re-assign this ns to current_path.
2) Any matching current_path entries need to be cleared before removing
from the siblings list, to prevent calling nvme_round_robin_path with
an "old" ns that's off list.
3) Finally the list_del_rcu can happen, and then synchronize again
before releasing any reference counts.
---
drivers/nvme/host/core.c | 13 +++++++++----
1 file changed, 9 insertions(+), 4 deletions(-)
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index fd4720d37cc0..20778dc9224c 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -3917,6 +3917,15 @@ static void nvme_ns_remove(struct nvme_ns *ns)
set_capacity(ns->disk, 0);
nvme_fault_inject_fini(&ns->fault_inject);
+ /* ensure that !NVME_NS_READY is seen
+ * to prevent this ns going back in current_path
+ */
+ synchronize_srcu(&ns->head->srcu);
+
+ /* wait for concurrent submissions */
+ if (nvme_mpath_clear_current_path(ns))
+ synchronize_srcu(&ns->head->srcu);
+
mutex_lock(&ns->ctrl->subsys->lock);
list_del_rcu(&ns->siblings);
if (list_empty(&ns->head->list)) {
@@ -3928,10 +3937,6 @@ static void nvme_ns_remove(struct nvme_ns *ns)
/* guarantee not available in head->list */
synchronize_rcu();
- /* wait for concurrent submissions */
- if (nvme_mpath_clear_current_path(ns))
- synchronize_srcu(&ns->head->srcu);
-
if (!nvme_ns_head_multipath(ns->head))
nvme_cdev_del(&ns->cdev, &ns->cdev_device);
del_gendisk(ns->disk);
--
2.35.1
^ permalink raw reply related [flat|nested] 16+ messages in thread
* Re: kdump details (Re: nvme-multipath: round-robin infinite looping)
2022-03-21 22:43 ` kdump details (Re: nvme-multipath: round-robin infinite looping) Chris Leech
@ 2022-03-22 11:16 ` Christoph Hellwig
2022-03-22 15:36 ` Chris Leech
0 siblings, 1 reply; 16+ messages in thread
From: Christoph Hellwig @ 2022-03-22 11:16 UTC (permalink / raw)
To: Chris Leech
Cc: linux-nvme, sagi, hch, lengchao, dwagner, hare, mlombard, jmeneghi
What kernel version is this? generic_make_request is long gone, so this
looks pretty old.
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [RFC PATCH] nvme-multipath: break endless loop in nvme_round_robin_path
2022-03-21 22:43 ` [RFC PATCH] nvme-multipath: break endless loop in nvme_round_robin_path Chris Leech
@ 2022-03-22 11:17 ` Christoph Hellwig
2022-03-22 12:07 ` Daniel Wagner
2022-03-22 15:42 ` Chris Leech
0 siblings, 2 replies; 16+ messages in thread
From: Christoph Hellwig @ 2022-03-22 11:17 UTC (permalink / raw)
To: Chris Leech
Cc: linux-nvme, sagi, hch, lengchao, dwagner, hare, mlombard, jmeneghi
On Mon, Mar 21, 2022 at 03:43:03PM -0700, Chris Leech wrote:
> This is a backstop for the odd loop construct in nvme_round_robin_path.
> It counts how many times the head pointer has been passed, as that's the
> only thing guarenteed to stay on the list. Once is needed to start from
> a different place and check the entire list, twice is excessive looping.
Shouldn't
1bcf006a9d3d63c1bcb65a993cb13756954cd9c
Author: Daniel Wagner <dwagner@suse.de>
Date: Wed Jan 27 11:30:33 2021 +0100
nvme-multipath: Early exit if no path is available
take care of this?
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [RFC PATCH] nvme-multipath: break endless loop in nvme_round_robin_path
2022-03-22 11:17 ` Christoph Hellwig
@ 2022-03-22 12:07 ` Daniel Wagner
2022-03-22 15:42 ` Chris Leech
1 sibling, 0 replies; 16+ messages in thread
From: Daniel Wagner @ 2022-03-22 12:07 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Chris Leech, linux-nvme, sagi, lengchao, hare, mlombard, jmeneghi
On Tue, Mar 22, 2022 at 12:17:43PM +0100, Christoph Hellwig wrote:
> On Mon, Mar 21, 2022 at 03:43:03PM -0700, Chris Leech wrote:
> > This is a backstop for the odd loop construct in nvme_round_robin_path.
> > It counts how many times the head pointer has been passed, as that's the
> > only thing guarenteed to stay on the list. Once is needed to start from
> > a different place and check the entire list, twice is excessive looping.
>
> Shouldn't
d1bcf006a9d3 ("nvme-multipath: Early exit if no path is available")
> take care of this?
IIRC, this addressed the issue that we started to dereference a NULL
pointer if there was no path left.
Chris' problem is that the iterator always returns the same
pointer. Personally, I prefer the solution to introduce some sort of
sentinel over the one with the locking.
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: kdump details (Re: nvme-multipath: round-robin infinite looping)
2022-03-22 11:16 ` Christoph Hellwig
@ 2022-03-22 15:36 ` Chris Leech
0 siblings, 0 replies; 16+ messages in thread
From: Chris Leech @ 2022-03-22 15:36 UTC (permalink / raw)
To: Christoph Hellwig
Cc: linux-nvme, Sagi Grimberg, lengchao, dwagner, hare,
Maurizio Lombardi, John Meneghini
On Tue, Mar 22, 2022 at 4:17 AM Christoph Hellwig <hch@lst.de> wrote:
>
> What kernel version is this? generic_make_request is long gone, so this
> looks pretty old.
Yes, this was found on a RHEL 8 backport, we're working with the
reporter to move the test setup to something newer. So far I haven't
been able to create my own reproducer for this. I went ahead and
brought this to the list because I don't see any reason why upstream
would be different in this case.
- Chris
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [RFC PATCH] nvme-multipath: break endless loop in nvme_round_robin_path
2022-03-22 11:17 ` Christoph Hellwig
2022-03-22 12:07 ` Daniel Wagner
@ 2022-03-22 15:42 ` Chris Leech
1 sibling, 0 replies; 16+ messages in thread
From: Chris Leech @ 2022-03-22 15:42 UTC (permalink / raw)
To: Christoph Hellwig
Cc: linux-nvme, Sagi Grimberg, lengchao, dwagner, hare,
Maurizio Lombardi, John Meneghini
On Tue, Mar 22, 2022 at 4:17 AM Christoph Hellwig <hch@lst.de> wrote:
> Shouldn't
>
> 1bcf006a9d3d63c1bcb65a993cb13756954cd9c
> Author: Daniel Wagner <dwagner@suse.de>
> Date: Wed Jan 27 11:30:33 2021 +0100
>
> nvme-multipath: Early exit if no path is available
>
> take care of this?
No, in this case nvme_next_ns never returns NULL, but it also never
returns old. In the kdump I have, there are two NVME_CTRL_CONNECTING
paths that are returned over-and-over.
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [RFC PATCH] nvme: fix RCU hole that allowed for endless looping in multipath round robin
2022-03-21 22:43 ` [RFC PATCH] nvme: fix RCU hole that allowed for endless looping in multipath round robin Chris Leech
@ 2022-03-23 14:54 ` Sagi Grimberg
2022-03-23 15:34 ` Christoph Hellwig
2022-03-25 6:36 ` Christoph Hellwig
2022-03-25 12:42 ` Sagi Grimberg
2 siblings, 1 reply; 16+ messages in thread
From: Sagi Grimberg @ 2022-03-23 14:54 UTC (permalink / raw)
To: Chris Leech, linux-nvme, hch; +Cc: lengchao, dwagner, hare, mlombard, jmeneghi
On 3/22/22 00:43, Chris Leech wrote:
> Make nvme_ns_remove match the assumptions elsewhere.
>
> 1) !NVME_NS_READY needs to be srcu synchronized to make sure nothing is
> running in __nvme_find_path or nvme_round_robin_path that will
> re-assign this ns to current_path.
>
> 2) Any matching current_path entries need to be cleared before removing
> from the siblings list, to prevent calling nvme_round_robin_path with
> an "old" ns that's off list.
>
> 3) Finally the list_del_rcu can happen, and then synchronize again
> before releasing any reference counts.
> ---
> drivers/nvme/host/core.c | 13 +++++++++----
> 1 file changed, 9 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index fd4720d37cc0..20778dc9224c 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -3917,6 +3917,15 @@ static void nvme_ns_remove(struct nvme_ns *ns)
> set_capacity(ns->disk, 0);
> nvme_fault_inject_fini(&ns->fault_inject);
>
> + /* ensure that !NVME_NS_READY is seen
> + * to prevent this ns going back in current_path
> + */
> + synchronize_srcu(&ns->head->srcu);
> +
> + /* wait for concurrent submissions */
> + if (nvme_mpath_clear_current_path(ns))
> + synchronize_srcu(&ns->head->srcu);
Nothing prevents it from being reselected again.
This is what drove the placement of this after the
ns is removed from the head->list. But that was before
the selection looked into NVME_NS_READY flag...
This looks legit to me...
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [RFC PATCH] nvme: fix RCU hole that allowed for endless looping in multipath round robin
2022-03-23 14:54 ` Sagi Grimberg
@ 2022-03-23 15:34 ` Christoph Hellwig
2022-03-23 19:07 ` John Meneghini
0 siblings, 1 reply; 16+ messages in thread
From: Christoph Hellwig @ 2022-03-23 15:34 UTC (permalink / raw)
To: Sagi Grimberg
Cc: Chris Leech, linux-nvme, hch, lengchao, dwagner, hare, mlombard,
jmeneghi
On Wed, Mar 23, 2022 at 04:54:26PM +0200, Sagi Grimberg wrote:
>
>
> On 3/22/22 00:43, Chris Leech wrote:
>> Make nvme_ns_remove match the assumptions elsewhere.
>>
>> 1) !NVME_NS_READY needs to be srcu synchronized to make sure nothing is
>> running in __nvme_find_path or nvme_round_robin_path that will
>> re-assign this ns to current_path.
>>
>> 2) Any matching current_path entries need to be cleared before removing
>> from the siblings list, to prevent calling nvme_round_robin_path with
>> an "old" ns that's off list.
>>
>> 3) Finally the list_del_rcu can happen, and then synchronize again
>> before releasing any reference counts.
>> ---
>> drivers/nvme/host/core.c | 13 +++++++++----
>> 1 file changed, 9 insertions(+), 4 deletions(-)
>>
>> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
>> index fd4720d37cc0..20778dc9224c 100644
>> --- a/drivers/nvme/host/core.c
>> +++ b/drivers/nvme/host/core.c
>> @@ -3917,6 +3917,15 @@ static void nvme_ns_remove(struct nvme_ns *ns)
>> set_capacity(ns->disk, 0);
>> nvme_fault_inject_fini(&ns->fault_inject);
>> + /* ensure that !NVME_NS_READY is seen
>> + * to prevent this ns going back in current_path
>> + */
>> + synchronize_srcu(&ns->head->srcu);
>> +
>> + /* wait for concurrent submissions */
>> + if (nvme_mpath_clear_current_path(ns))
>> + synchronize_srcu(&ns->head->srcu);
>
> Nothing prevents it from being reselected again.
> This is what drove the placement of this after the
> ns is removed from the head->list. But that was before
> the selection looked into NVME_NS_READY flag...
>
> This looks legit to me...
Yes, this looks pretty sensible. I'm tempted to just queue it up
for 5.18.
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [RFC PATCH] nvme: fix RCU hole that allowed for endless looping in multipath round robin
2022-03-23 15:34 ` Christoph Hellwig
@ 2022-03-23 19:07 ` John Meneghini
2022-04-05 13:14 ` John Meneghini
0 siblings, 1 reply; 16+ messages in thread
From: John Meneghini @ 2022-03-23 19:07 UTC (permalink / raw)
To: Chris Leech
Cc: linux-nvme, lengchao, dwagner, hare, mlombard, Christoph Hellwig,
Sagi Grimberg
This is good. I'm glad everyone agrees.
Chris, please ask our partner to test this patch in their testbed and verify that
this fixes the soft-lockup problem they are seeing. I believe they have the ability
to reproduce this problem on their 8.6 testbed.
If that works then please re-post a patch so people can consider this for v5.18.
/John
On 3/23/22 11:34, Christoph Hellwig wrote:
> On Wed, Mar 23, 2022 at 04:54:26PM +0200, Sagi Grimberg wrote:
>>
>>
>> On 3/22/22 00:43, Chris Leech wrote:
>>> Make nvme_ns_remove match the assumptions elsewhere.
>>>
>>> 1) !NVME_NS_READY needs to be srcu synchronized to make sure nothing is
>>> running in __nvme_find_path or nvme_round_robin_path that will
>>> re-assign this ns to current_path.
>>>
>>> 2) Any matching current_path entries need to be cleared before removing
>>> from the siblings list, to prevent calling nvme_round_robin_path with
>>> an "old" ns that's off list.
>>>
>>> 3) Finally the list_del_rcu can happen, and then synchronize again
>>> before releasing any reference counts.
>>> ---
>>> drivers/nvme/host/core.c | 13 +++++++++----
>>> 1 file changed, 9 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
>>> index fd4720d37cc0..20778dc9224c 100644
>>> --- a/drivers/nvme/host/core.c
>>> +++ b/drivers/nvme/host/core.c
>>> @@ -3917,6 +3917,15 @@ static void nvme_ns_remove(struct nvme_ns *ns)
>>> set_capacity(ns->disk, 0);
>>> nvme_fault_inject_fini(&ns->fault_inject);
>>> + /* ensure that !NVME_NS_READY is seen
>>> + * to prevent this ns going back in current_path
>>> + */
>>> + synchronize_srcu(&ns->head->srcu);
>>> +
>>> + /* wait for concurrent submissions */
>>> + if (nvme_mpath_clear_current_path(ns))
>>> + synchronize_srcu(&ns->head->srcu);
>>
>> Nothing prevents it from being reselected again.
>> This is what drove the placement of this after the
>> ns is removed from the head->list. But that was before
>> the selection looked into NVME_NS_READY flag...
>>
>> This looks legit to me...
>
> Yes, this looks pretty sensible. I'm tempted to just queue it up
> for 5.18.
>
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [RFC PATCH] nvme: fix RCU hole that allowed for endless looping in multipath round robin
2022-03-21 22:43 ` [RFC PATCH] nvme: fix RCU hole that allowed for endless looping in multipath round robin Chris Leech
2022-03-23 14:54 ` Sagi Grimberg
@ 2022-03-25 6:36 ` Christoph Hellwig
2022-03-25 12:42 ` Sagi Grimberg
2 siblings, 0 replies; 16+ messages in thread
From: Christoph Hellwig @ 2022-03-25 6:36 UTC (permalink / raw)
To: Chris Leech
Cc: linux-nvme, sagi, hch, lengchao, dwagner, hare, mlombard, jmeneghi
Tentantively applied to nvme-5.18. Sagi, can you give this a formal
reviewed-by?
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [RFC PATCH] nvme: fix RCU hole that allowed for endless looping in multipath round robin
2022-03-21 22:43 ` [RFC PATCH] nvme: fix RCU hole that allowed for endless looping in multipath round robin Chris Leech
2022-03-23 14:54 ` Sagi Grimberg
2022-03-25 6:36 ` Christoph Hellwig
@ 2022-03-25 12:42 ` Sagi Grimberg
2022-04-05 17:25 ` Chris Leech
2 siblings, 1 reply; 16+ messages in thread
From: Sagi Grimberg @ 2022-03-25 12:42 UTC (permalink / raw)
To: Chris Leech, linux-nvme, hch; +Cc: lengchao, dwagner, hare, mlombard, jmeneghi
On 3/22/22 00:43, Chris Leech wrote:
> Make nvme_ns_remove match the assumptions elsewhere.
>
> 1) !NVME_NS_READY needs to be srcu synchronized to make sure nothing is
> running in __nvme_find_path or nvme_round_robin_path that will
> re-assign this ns to current_path.
>
> 2) Any matching current_path entries need to be cleared before removing
> from the siblings list, to prevent calling nvme_round_robin_path with
> an "old" ns that's off list.
>
> 3) Finally the list_del_rcu can happen, and then synchronize again
> before releasing any reference counts.
> ---
> drivers/nvme/host/core.c | 13 +++++++++----
> 1 file changed, 9 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index fd4720d37cc0..20778dc9224c 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -3917,6 +3917,15 @@ static void nvme_ns_remove(struct nvme_ns *ns)
> set_capacity(ns->disk, 0);
> nvme_fault_inject_fini(&ns->fault_inject);
>
> + /* ensure that !NVME_NS_READY is seen
> + * to prevent this ns going back in current_path
> + */
> + synchronize_srcu(&ns->head->srcu);
> +
> + /* wait for concurrent submissions */
> + if (nvme_mpath_clear_current_path(ns))
> + synchronize_srcu(&ns->head->srcu);
Can you describe how the unconditional srcu is protecting
something that is not protected with the one under clearing
the current_path?
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [RFC PATCH] nvme: fix RCU hole that allowed for endless looping in multipath round robin
2022-03-23 19:07 ` John Meneghini
@ 2022-04-05 13:14 ` John Meneghini
0 siblings, 0 replies; 16+ messages in thread
From: John Meneghini @ 2022-04-05 13:14 UTC (permalink / raw)
To: Chris Leech
Cc: linux-nvme, lengchao, dwagner, hare, mlombard, Christoph Hellwig,
Sagi Grimberg
I just want to report back that our partner has tested this patch and verified that it fixes the soft-lockup problems.
They are unable to reproduce the problem when running the same tests on the same testbed when using this patch.
/John
On 3/23/22 15:07, John Meneghini wrote:
> This is good. I'm glad everyone agrees.
>
> Chris, please ask our partner to test this patch in their testbed and verify that
> this fixes the soft-lockup problem they are seeing. I believe they have the ability
> to reproduce this problem on their 8.6 testbed.
>
> If that works then please re-post a patch so people can consider this for v5.18.
>
> /John
>
>
> On 3/23/22 11:34, Christoph Hellwig wrote:
>> On Wed, Mar 23, 2022 at 04:54:26PM +0200, Sagi Grimberg wrote:
>>>
>>>
>>> On 3/22/22 00:43, Chris Leech wrote:
>>>> Make nvme_ns_remove match the assumptions elsewhere.
>>>>
>>>> 1) !NVME_NS_READY needs to be srcu synchronized to make sure nothing is
>>>> running in __nvme_find_path or nvme_round_robin_path that will
>>>> re-assign this ns to current_path.
>>>>
>>>> 2) Any matching current_path entries need to be cleared before removing
>>>> from the siblings list, to prevent calling nvme_round_robin_path with
>>>> an "old" ns that's off list.
>>>>
>>>> 3) Finally the list_del_rcu can happen, and then synchronize again
>>>> before releasing any reference counts.
>>>> ---
>>>> drivers/nvme/host/core.c | 13 +++++++++----
>>>> 1 file changed, 9 insertions(+), 4 deletions(-)
>>>>
>>>> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
>>>> index fd4720d37cc0..20778dc9224c 100644
>>>> --- a/drivers/nvme/host/core.c
>>>> +++ b/drivers/nvme/host/core.c
>>>> @@ -3917,6 +3917,15 @@ static void nvme_ns_remove(struct nvme_ns *ns)
>>>> set_capacity(ns->disk, 0);
>>>> nvme_fault_inject_fini(&ns->fault_inject);
>>>> + /* ensure that !NVME_NS_READY is seen
>>>> + * to prevent this ns going back in current_path
>>>> + */
>>>> + synchronize_srcu(&ns->head->srcu);
>>>> +
>>>> + /* wait for concurrent submissions */
>>>> + if (nvme_mpath_clear_current_path(ns))
>>>> + synchronize_srcu(&ns->head->srcu);
>>>
>>> Nothing prevents it from being reselected again.
>>> This is what drove the placement of this after the
>>> ns is removed from the head->list. But that was before
>>> the selection looked into NVME_NS_READY flag...
>>>
>>> This looks legit to me...
>>
>> Yes, this looks pretty sensible. I'm tempted to just queue it up
>> for 5.18.
>>
>
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [RFC PATCH] nvme: fix RCU hole that allowed for endless looping in multipath round robin
2022-03-25 12:42 ` Sagi Grimberg
@ 2022-04-05 17:25 ` Chris Leech
0 siblings, 0 replies; 16+ messages in thread
From: Chris Leech @ 2022-04-05 17:25 UTC (permalink / raw)
To: Sagi Grimberg
Cc: linux-nvme, hch, lengchao, dwagner, hare, mlombard, jmeneghi
On Fri, Mar 25, 2022 at 5:42 AM Sagi Grimberg <sagi@grimberg.me> wrote:
>
> On 3/22/22 00:43, Chris Leech wrote:
> > Make nvme_ns_remove match the assumptions elsewhere.
> >
> > 1) !NVME_NS_READY needs to be srcu synchronized to make sure nothing is
> > running in __nvme_find_path or nvme_round_robin_path that will
> > re-assign this ns to current_path.
> >
> > 2) Any matching current_path entries need to be cleared before removing
> > from the siblings list, to prevent calling nvme_round_robin_path with
> > an "old" ns that's off list.
> >
> > 3) Finally the list_del_rcu can happen, and then synchronize again
> > before releasing any reference counts.
> > ---
> > drivers/nvme/host/core.c | 13 +++++++++----
> > 1 file changed, 9 insertions(+), 4 deletions(-)
> >
> > diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> > index fd4720d37cc0..20778dc9224c 100644
> > --- a/drivers/nvme/host/core.c
> > +++ b/drivers/nvme/host/core.c
> > @@ -3917,6 +3917,15 @@ static void nvme_ns_remove(struct nvme_ns *ns)
> > set_capacity(ns->disk, 0);
> > nvme_fault_inject_fini(&ns->fault_inject);
> >
> > + /* ensure that !NVME_NS_READY is seen
> > + * to prevent this ns going back in current_path
> > + */
> > + synchronize_srcu(&ns->head->srcu);
> > +
> > + /* wait for concurrent submissions */
> > + if (nvme_mpath_clear_current_path(ns))
> > + synchronize_srcu(&ns->head->srcu);
>
> Can you describe how the unconditional srcu is protecting
> something that is not protected with the one under clearing
> the current_path?
My concern there was if nvme_mpath_clear_current_path was run while a
different CPU was in __nvme_find_path or nvme_round_robin_path, beyond
an nvme_path_is_disabled check but before rcu_assign_pointer, then the
ns being removed could be re-assigned to current_path.
Apologies for not noticing this question before now.
- Chris
^ permalink raw reply [flat|nested] 16+ messages in thread
end of thread, other threads:[~2022-04-05 17:26 UTC | newest]
Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-03-21 22:43 nvme-multipath: round-robin infinite looping Chris Leech
2022-03-21 22:43 ` kdump details (Re: nvme-multipath: round-robin infinite looping) Chris Leech
2022-03-22 11:16 ` Christoph Hellwig
2022-03-22 15:36 ` Chris Leech
2022-03-21 22:43 ` [RFC PATCH] nvme-multipath: break endless loop in nvme_round_robin_path Chris Leech
2022-03-22 11:17 ` Christoph Hellwig
2022-03-22 12:07 ` Daniel Wagner
2022-03-22 15:42 ` Chris Leech
2022-03-21 22:43 ` [RFC PATCH] nvme: fix RCU hole that allowed for endless looping in multipath round robin Chris Leech
2022-03-23 14:54 ` Sagi Grimberg
2022-03-23 15:34 ` Christoph Hellwig
2022-03-23 19:07 ` John Meneghini
2022-04-05 13:14 ` John Meneghini
2022-03-25 6:36 ` Christoph Hellwig
2022-03-25 12:42 ` Sagi Grimberg
2022-04-05 17:25 ` Chris Leech
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.