* [PATCH] nvme: fix SRCU protection of nvme_ns_head list
@ 2022-11-18 23:27 Caleb Sander
  2022-11-20 11:24 ` Sagi Grimberg
  2022-11-24  0:24 ` [PATCH] nvme: fix (S)RCU protection of nvme_ns_head list (alternate) Caleb Sander
  0 siblings, 2 replies; 25+ messages in thread
From: Caleb Sander @ 2022-11-18 23:27 UTC (permalink / raw)
  To: Keith Busch, Jens Axboe, Christoph Hellwig, Sagi Grimberg, linux-nvme
  Cc: Uday Shankar, Caleb Sander

Walking the nvme_ns_head siblings list is protected by the head's srcu
in nvme_ns_head_submit_bio() but not nvme_mpath_revalidate_paths().
Removing namespaces from the list also fails to synchronize the srcu.
Concurrent scan work can therefore cause use-after-frees.

Hold the head's srcu lock in nvme_mpath_revalidate_paths() and
synchronize with the srcu, not the global RCU, in nvme_ns_remove().

Observed the following panic when making NVMe/RDMA connections
with native multipath on the Rocky Linux 8.6 kernel
(it seems the upstream kernel has the same race condition).
Disassembly shows the faulting instruction is cmp 0x50(%rdx),%rcx;
computing capacity != get_capacity(ns->disk).
Address 0x50 is dereferenced because ns->disk is NULL.
The NULL disk appears to be the result of concurrent scan work
freeing the namespace (note the log line in the middle of the panic).

[37314.206036] BUG: unable to handle kernel NULL pointer dereference at 0000000000000050
[37314.206036] nvme0n3: detected capacity change from 0 to 11811160064
[37314.299753] PGD 0 P4D 0
[37314.299756] Oops: 0000 [#1] SMP PTI
[37314.299759] CPU: 29 PID: 322046 Comm: kworker/u98:3 Kdump: loaded Tainted: G        W      X --------- -  - 4.18.0-372.32.1.el8test86.x86_64 #1
[37314.299762] Hardware name: Dell Inc. PowerEdge R720/0JP31P, BIOS 2.7.0 05/23/2018
[37314.299763] Workqueue: nvme-wq nvme_scan_work [nvme_core]
[37314.299783] RIP: 0010:nvme_mpath_revalidate_paths+0x26/0xb0 [nvme_core]
[37314.299790] Code: 1f 44 00 00 66 66 66 66 90 55 53 48 8b 5f 50 48 8b 83 c8 c9 00 00 48 8b 13 48 8b 48 50 48 39 d3 74 20 48 8d 42 d0 48 8b 50 20 <48> 3b 4a 50 74 05 f0 80 60 70 ef 48 8b 50 30 48 8d 42 d0 48 39 d3
[37315.058803] RSP: 0018:ffffabe28f913d10 EFLAGS: 00010202
[37315.121316] RAX: ffff927a077da800 RBX: ffff92991dd70000 RCX: 0000000001600000
[37315.206704] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff92991b719800
[37315.292106] RBP: ffff929a6b70c000 R08: 000000010234cd4a R09: c0000000ffff7fff
[37315.377501] R10: 0000000000000001 R11: ffffabe28f913a30 R12: 0000000000000000
[37315.462889] R13: ffff92992716600c R14: ffff929964e6e030 R15: ffff92991dd70000
[37315.548286] FS:  0000000000000000(0000) GS:ffff92b87fb80000(0000) knlGS:0000000000000000
[37315.645111] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[37315.713871] CR2: 0000000000000050 CR3: 0000002208810006 CR4: 00000000000606e0
[37315.799267] Call Trace:
[37315.828515]  nvme_update_ns_info+0x1ac/0x250 [nvme_core]
[37315.892075]  nvme_validate_or_alloc_ns+0x2ff/0xa00 [nvme_core]
[37315.961871]  ? __blk_mq_free_request+0x6b/0x90
[37316.015021]  nvme_scan_work+0x151/0x240 [nvme_core]
[37316.073371]  process_one_work+0x1a7/0x360
[37316.121318]  ? create_worker+0x1a0/0x1a0
[37316.168227]  worker_thread+0x30/0x390
[37316.212024]  ? create_worker+0x1a0/0x1a0
[37316.258939]  kthread+0x10a/0x120
[37316.297557]  ? set_kthread_struct+0x50/0x50
[37316.347590]  ret_from_fork+0x35/0x40
[37316.390360] Modules linked in: nvme_rdma nvme_tcp(X) nvme_fabrics nvme_core netconsole iscsi_tcp libiscsi_tcp dm_queue_length dm_service_time nf_conntrack_netlink br_netfilter bridge stp llc overlay nft_chain_nat ipt_MASQUERADE nf_nat xt_addrtype xt_CT nft_counter xt_state xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 xt_comment xt_multiport nft_compat nf_tables libcrc32c nfnetlink dm_multipath tg3 rpcrdma sunrpc rdma_ucm ib_srpt ib_isert iscsi_target_mod target_core_mod ib_iser libiscsi scsi_transport_iscsi ib_umad rdma_cm ib_ipoib iw_cm ib_cm intel_rapl_msr iTCO_wdt iTCO_vendor_support dcdbas intel_rapl_common sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel ipmi_ssif kvm irqbypass crct10dif_pclmul crc32_pclmul mlx5_ib ghash_clmulni_intel ib_uverbs rapl intel_cstate intel_uncore ib_core ipmi_si joydev mei_me pcspkr ipmi_devintf mei lpc_ich wmi ipmi_msghandler acpi_power_meter ext4 mbcache jbd2 sd_mod t10_pi sg mgag200 mlx5_core drm_kms_helper syscopyarea
[37316.390419]  sysfillrect ahci sysimgblt fb_sys_fops libahci drm crc32c_intel libata mlxfw pci_hyperv_intf tls i2c_algo_bit psample dm_mirror dm_region_hash dm_log dm_mod fuse [last unloaded: nvme_core]
[37317.645908] CR2: 0000000000000050

Fixes: e7d65803e2bb ("nvme-multipath: revalidate paths during rescan")
Signed-off-by: Caleb Sander <csander@purestorage.com>
---
 drivers/nvme/host/core.c      | 2 +-
 drivers/nvme/host/multipath.c | 3 +++
 2 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index da55ce45ac70..69e333922bea 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -4304,7 +4304,7 @@ static void nvme_ns_remove(struct nvme_ns *ns)
 	mutex_unlock(&ns->ctrl->subsys->lock);
 
 	/* guarantee not available in head->list */
-	synchronize_rcu();
+	synchronize_srcu(&ns->head->srcu);
 
 	if (!nvme_ns_head_multipath(ns->head))
 		nvme_cdev_del(&ns->cdev, &ns->cdev_device);
diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
index 93e2138a8b42..7e025b8948cb 100644
--- a/drivers/nvme/host/multipath.c
+++ b/drivers/nvme/host/multipath.c
@@ -174,11 +174,14 @@ void nvme_mpath_revalidate_paths(struct nvme_ns *ns)
 	struct nvme_ns_head *head = ns->head;
 	sector_t capacity = get_capacity(head->disk);
 	int node;
+	int srcu_idx;
 
+	srcu_idx = srcu_read_lock(&head->srcu);
 	list_for_each_entry_rcu(ns, &head->list, siblings) {
 		if (capacity != get_capacity(ns->disk))
 			clear_bit(NVME_NS_READY, &ns->flags);
 	}
+	srcu_read_unlock(&head->srcu, srcu_idx);
 
 	for_each_node(node)
 		rcu_assign_pointer(head->current_path[node], NULL);
-- 
2.25.1




* Re: [PATCH] nvme: fix SRCU protection of nvme_ns_head list
  2022-11-18 23:27 [PATCH] nvme: fix SRCU protection of nvme_ns_head list Caleb Sander
@ 2022-11-20 11:24 ` Sagi Grimberg
  2022-11-21  7:40   ` Christoph Hellwig
  2022-11-24  0:24 ` [PATCH] nvme: fix (S)RCU protection of nvme_ns_head list (alternate) Caleb Sander
  1 sibling, 1 reply; 25+ messages in thread
From: Sagi Grimberg @ 2022-11-20 11:24 UTC (permalink / raw)
  To: Caleb Sander, Keith Busch, Jens Axboe, Christoph Hellwig, linux-nvme
  Cc: Uday Shankar



On 11/19/22 01:27, Caleb Sander wrote:
> Walking the nvme_ns_head siblings list is protected by the head's srcu
> in nvme_ns_head_submit_bio() but not nvme_mpath_revalidate_paths().
> Removing namespaces from the list also fails to synchronize the srcu.
> Concurrent scan work can therefore cause use-after-frees.
> 
> Hold the head's srcu lock in nvme_mpath_revalidate_paths() and
> synchronize with the srcu, not the global RCU, in nvme_ns_remove().
> 
> Observed the following panic when making NVMe/RDMA connections
> with native multipath on the Rocky Linux 8.6 kernel
> (it seems the upstream kernel has the same race condition).
> Disassembly shows the faulting instruction is cmp 0x50(%rdx),%rcx;
> computing capacity != get_capacity(ns->disk).
> Address 0x50 is dereferenced because ns->disk is NULL.
> The NULL disk appears to be the result of concurrent scan work
> freeing the namespace (note the log line in the middle of the panic).
> 
> [37314.206036] BUG: unable to handle kernel NULL pointer dereference at 0000000000000050
> [37314.206036] nvme0n3: detected capacity change from 0 to 11811160064
> [37314.299753] PGD 0 P4D 0
> [37314.299756] Oops: 0000 [#1] SMP PTI
> [37314.299759] CPU: 29 PID: 322046 Comm: kworker/u98:3 Kdump: loaded Tainted: G        W      X --------- -  - 4.18.0-372.32.1.el8test86.x86_64 #1
> [37314.299762] Hardware name: Dell Inc. PowerEdge R720/0JP31P, BIOS 2.7.0 05/23/2018
> [37314.299763] Workqueue: nvme-wq nvme_scan_work [nvme_core]
> [37314.299783] RIP: 0010:nvme_mpath_revalidate_paths+0x26/0xb0 [nvme_core]
> [37314.299790] Code: 1f 44 00 00 66 66 66 66 90 55 53 48 8b 5f 50 48 8b 83 c8 c9 00 00 48 8b 13 48 8b 48 50 48 39 d3 74 20 48 8d 42 d0 48 8b 50 20 <48> 3b 4a 50 74 05 f0 80 60 70 ef 48 8b 50 30 48 8d 42 d0 48 39 d3
> [37315.058803] RSP: 0018:ffffabe28f913d10 EFLAGS: 00010202
> [37315.121316] RAX: ffff927a077da800 RBX: ffff92991dd70000 RCX: 0000000001600000
> [37315.206704] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff92991b719800
> [37315.292106] RBP: ffff929a6b70c000 R08: 000000010234cd4a R09: c0000000ffff7fff
> [37315.377501] R10: 0000000000000001 R11: ffffabe28f913a30 R12: 0000000000000000
> [37315.462889] R13: ffff92992716600c R14: ffff929964e6e030 R15: ffff92991dd70000
> [37315.548286] FS:  0000000000000000(0000) GS:ffff92b87fb80000(0000) knlGS:0000000000000000
> [37315.645111] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [37315.713871] CR2: 0000000000000050 CR3: 0000002208810006 CR4: 00000000000606e0
> [37315.799267] Call Trace:
> [37315.828515]  nvme_update_ns_info+0x1ac/0x250 [nvme_core]
> [37315.892075]  nvme_validate_or_alloc_ns+0x2ff/0xa00 [nvme_core]
> [37315.961871]  ? __blk_mq_free_request+0x6b/0x90
> [37316.015021]  nvme_scan_work+0x151/0x240 [nvme_core]
> [37316.073371]  process_one_work+0x1a7/0x360
> [37316.121318]  ? create_worker+0x1a0/0x1a0
> [37316.168227]  worker_thread+0x30/0x390
> [37316.212024]  ? create_worker+0x1a0/0x1a0
> [37316.258939]  kthread+0x10a/0x120
> [37316.297557]  ? set_kthread_struct+0x50/0x50
> [37316.347590]  ret_from_fork+0x35/0x40
> [37316.390360] Modules linked in: nvme_rdma nvme_tcp(X) nvme_fabrics nvme_core netconsole iscsi_tcp libiscsi_tcp dm_queue_length dm_service_time nf_conntrack_netlink br_netfilter bridge stp llc overlay nft_chain_nat ipt_MASQUERADE nf_nat xt_addrtype xt_CT nft_counter xt_state xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 xt_comment xt_multiport nft_compat nf_tables libcrc32c nfnetlink dm_multipath tg3 rpcrdma sunrpc rdma_ucm ib_srpt ib_isert iscsi_target_mod target_core_mod ib_iser libiscsi scsi_transport_iscsi ib_umad rdma_cm ib_ipoib iw_cm ib_cm intel_rapl_msr iTCO_wdt iTCO_vendor_support dcdbas intel_rapl_common sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel ipmi_ssif kvm irqbypass crct10dif_pclmul crc32_pclmul mlx5_ib ghash_clmulni_intel ib_uverbs rapl intel_cstate intel_uncore ib_core ipmi_si joydev mei_me pcspkr ipmi_devintf mei lpc_ich wmi ipmi_msghandler acpi_power_meter ext4 mbcache jbd2 sd_mod t10_pi sg mgag200 mlx5_core drm_kms_helper syscopyarea
> [37316.390419]  sysfillrect ahci sysimgblt fb_sys_fops libahci drm crc32c_intel libata mlxfw pci_hyperv_intf tls i2c_algo_bit psample dm_mirror dm_region_hash dm_log dm_mod fuse [last unloaded: nvme_core]
> [37317.645908] CR2: 0000000000000050
> 
> Fixes: e7d65803e2bb ("nvme-multipath: revalidate paths during rescan")
> Signed-off-by: Caleb Sander <csander@purestorage.com>
> ---
>   drivers/nvme/host/core.c      | 2 +-
>   drivers/nvme/host/multipath.c | 3 +++
>   2 files changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index da55ce45ac70..69e333922bea 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -4304,7 +4304,7 @@ static void nvme_ns_remove(struct nvme_ns *ns)
>   	mutex_unlock(&ns->ctrl->subsys->lock);
>   
>   	/* guarantee not available in head->list */
> -	synchronize_rcu();
> +	synchronize_srcu(&ns->head->srcu);
>   
>   	if (!nvme_ns_head_multipath(ns->head))
>   		nvme_cdev_del(&ns->cdev, &ns->cdev_device);
> diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
> index 93e2138a8b42..7e025b8948cb 100644
> --- a/drivers/nvme/host/multipath.c
> +++ b/drivers/nvme/host/multipath.c
> @@ -174,11 +174,14 @@ void nvme_mpath_revalidate_paths(struct nvme_ns *ns)
>   	struct nvme_ns_head *head = ns->head;
>   	sector_t capacity = get_capacity(head->disk);
>   	int node;
> +	int srcu_idx;
>   
> +	srcu_idx = srcu_read_lock(&head->srcu);
>   	list_for_each_entry_rcu(ns, &head->list, siblings) {
>   		if (capacity != get_capacity(ns->disk))
>   			clear_bit(NVME_NS_READY, &ns->flags);
>   	}
> +	srcu_read_unlock(&head->srcu, srcu_idx);

I don't think you need srcu here, rcu_read_lock/unlock is sufficient.

>   
>   	for_each_node(node)
>   		rcu_assign_pointer(head->current_path[node], NULL);


* Re: [PATCH] nvme: fix SRCU protection of nvme_ns_head list
  2022-11-20 11:24 ` Sagi Grimberg
@ 2022-11-21  7:40   ` Christoph Hellwig
  2022-11-21  9:43     ` Sagi Grimberg
  2022-11-21 17:48     ` Caleb Sander
  0 siblings, 2 replies; 25+ messages in thread
From: Christoph Hellwig @ 2022-11-21  7:40 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Caleb Sander, Keith Busch, Jens Axboe, Christoph Hellwig,
	linux-nvme, Uday Shankar, Paul E. McKenney

On Sun, Nov 20, 2022 at 01:24:51PM +0200, Sagi Grimberg wrote:
>>   	sector_t capacity = get_capacity(head->disk);
>>   	int node;
>> +	int srcu_idx;
>>   +	srcu_idx = srcu_read_lock(&head->srcu);
>>   	list_for_each_entry_rcu(ns, &head->list, siblings) {
>>   		if (capacity != get_capacity(ns->disk))
>>   			clear_bit(NVME_NS_READY, &ns->flags);
>>   	}
>> +	srcu_read_unlock(&head->srcu, srcu_idx);
>
> I don't think you need srcu here, rcu_read_lock/unlock is sufficient.

So the code obviously does not sleep.  But I wonder if anything speaks
against mixing SRCU and RCU protection for the same list?
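
For illustration only, here is a minimal sketch (hypothetical, not the nvme
code) of a list with both RCU and SRCU readers; the updater then has to wait
for both kinds of readers before freeing:

	struct entry {
		struct list_head node;
		int data;
	};

	static LIST_HEAD(my_list);
	static DEFINE_SPINLOCK(my_list_lock);
	DEFINE_SRCU(my_srcu);

	/* Non-sleeping reader, protected by plain RCU. */
	static int sum_entries(void)
	{
		struct entry *e;
		int sum = 0;

		rcu_read_lock();
		list_for_each_entry_rcu(e, &my_list, node)
			sum += e->data;
		rcu_read_unlock();
		return sum;
	}

	/* Sleeping reader, protected by SRCU. */
	static void wait_on_entries(void)
	{
		struct entry *e;
		int idx = srcu_read_lock(&my_srcu);

		list_for_each_entry_rcu(e, &my_list, node,
					srcu_read_lock_held(&my_srcu))
			msleep(1);	/* sleeping is legal under SRCU, not under plain RCU */
		srcu_read_unlock(&my_srcu, idx);
	}

	/* Updater: must wait for both reader flavors before freeing. */
	static void remove_entry(struct entry *e)
	{
		spin_lock(&my_list_lock);
		list_del_rcu(&e->node);
		spin_unlock(&my_list_lock);

		synchronize_rcu();		/* waits for rcu_read_lock() readers */
		synchronize_srcu(&my_srcu);	/* waits for srcu_read_lock() readers */
		kfree(e);
	}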



* Re: [PATCH] nvme: fix SRCU protection of nvme_ns_head list
  2022-11-21  7:40   ` Christoph Hellwig
@ 2022-11-21  9:43     ` Sagi Grimberg
  2022-11-21 14:57       ` Paul E. McKenney
  2022-11-21 17:48     ` Caleb Sander
  1 sibling, 1 reply; 25+ messages in thread
From: Sagi Grimberg @ 2022-11-21  9:43 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Caleb Sander, Keith Busch, Jens Axboe, linux-nvme, Uday Shankar,
	Paul E. McKenney


>>>    	sector_t capacity = get_capacity(head->disk);
>>>    	int node;
>>> +	int srcu_idx;
>>>    +	srcu_idx = srcu_read_lock(&head->srcu);
>>>    	list_for_each_entry_rcu(ns, &head->list, siblings) {
>>>    		if (capacity != get_capacity(ns->disk))
>>>    			clear_bit(NVME_NS_READY, &ns->flags);
>>>    	}
>>> +	srcu_read_unlock(&head->srcu, srcu_idx);
>>
>> I don't think you need srcu here, rcu_read_lock/unlock is sufficient.
> 
> So the code obviously does not sleep.  But I wonder if anything speaks
> against mixing SRCU and RCU protection for the same list?

I am not an expert, but the point of synchronize_(s)rcu() is to fence the
reader critical sections, isn't it? So if the reader doesn't sleep, the
existing synchronize_rcu() should be sufficient.



* Re: [PATCH] nvme: fix SRCU protection of nvme_ns_head list
  2022-11-21  9:43     ` Sagi Grimberg
@ 2022-11-21 14:57       ` Paul E. McKenney
  0 siblings, 0 replies; 25+ messages in thread
From: Paul E. McKenney @ 2022-11-21 14:57 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Christoph Hellwig, Caleb Sander, Keith Busch, Jens Axboe,
	linux-nvme, Uday Shankar

On Mon, Nov 21, 2022 at 11:43:33AM +0200, Sagi Grimberg wrote:
> 
> > > >    	sector_t capacity = get_capacity(head->disk);
> > > >    	int node;
> > > > +	int srcu_idx;
> > > >    +	srcu_idx = srcu_read_lock(&head->srcu);
> > > >    	list_for_each_entry_rcu(ns, &head->list, siblings) {
> > > >    		if (capacity != get_capacity(ns->disk))
> > > >    			clear_bit(NVME_NS_READY, &ns->flags);
> > > >    	}
> > > > +	srcu_read_unlock(&head->srcu, srcu_idx);
> > > 
> > > I don't think you need srcu here, rcu_read_lock/unlock is sufficient.
> > 
> > So the code obviously does not sleep.  But I wonder if anything speaks
> > against mixing SRCU and RCU protection for the same list?
> 
> I am not an expert, but the point of (s)rcu_synchronize to fence the
> reader critical section isn't it? so if the reader doesn't sleep, the
> existing rcu_synchronize should be sufficient.

Just in case you need to mix RCU and SRCU readers, so that you need
to wait for both types of readers, but need to avoid the latency of
back-to-back waits, you can overlap the waits like this:

	void call_my_srrcu(struct rcu_head *head, rcu_callback_t func)
	{
		call_srcu(&my_srcu, head, func);
	}

	...

	synchronize_rcu_mult(call_rcu, call_my_srrcu);


							Thanx, Paul



* Re: [PATCH] nvme: fix SRCU protection of nvme_ns_head list
  2022-11-21  7:40   ` Christoph Hellwig
  2022-11-21  9:43     ` Sagi Grimberg
@ 2022-11-21 17:48     ` Caleb Sander
  2022-11-21 17:59       ` Paul E. McKenney
  1 sibling, 1 reply; 25+ messages in thread
From: Caleb Sander @ 2022-11-21 17:48 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Sagi Grimberg, Keith Busch, Jens Axboe, linux-nvme, Uday Shankar,
	Paul E. McKenney

On Sun, Nov 20, 2022 at 11:40 PM Christoph Hellwig <hch@lst.de> wrote:
>
> On Sun, Nov 20, 2022 at 01:24:51PM +0200, Sagi Grimberg wrote:
> >>      sector_t capacity = get_capacity(head->disk);
> >>      int node;
> >> +    int srcu_idx;
> >>   +  srcu_idx = srcu_read_lock(&head->srcu);
> >>      list_for_each_entry_rcu(ns, &head->list, siblings) {
> >>              if (capacity != get_capacity(ns->disk))
> >>                      clear_bit(NVME_NS_READY, &ns->flags);
> >>      }
> >> +    srcu_read_unlock(&head->srcu, srcu_idx);
> >
> > I don't think you need srcu here, rcu_read_lock/unlock is sufficient.
>
> So the code obviously does not sleep.  But I wonder if anything speaks
> against mixing SRCU and RCU protection for the same list?

As I understand, synchronize_rcu() only synchronizes with RCU critical sections,
and likewise synchronize_srcu() only synchronizes with SRCU critical sections.
Currently nvme_ns_head_submit_bio() is using srcu_read_lock()
but nvme_ns_remove() is using synchronize_rcu(),
so there is no synchronization at all.
Since nvme_ns_head_submit_bio() sleeps during the critical section,
we definitely need to keep using SRCU there,
and therefore nvme_ns_remove() needs to use synchronize_srcu().
I agree RCU would work in nvme_mpath_revalidate_paths(), but as Paul points out,
nvme_ns_remove() would then need to synchronize with both SRCU and RCU.
This seems like it would complicate nvme_ns_remove()
and require waiting on readers of unrelated RCU-protected structures.
Is there some cost to using SRCU over RCU that I am missing?
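
For reference, the reader side in nvme_ns_head_submit_bio() looks roughly
like this (paraphrased, not the exact upstream code), which is why the
critical section can sleep and needs SRCU rather than plain RCU:

	srcu_idx = srcu_read_lock(&head->srcu);
	ns = nvme_find_path(head);	/* consults head->list / current_path */
	if (likely(ns)) {
		bio_set_dev(bio, ns->disk->part0);
		submit_bio_noacct(bio);	/* may block, so rcu_read_lock() won't do */
	} else {
		/* requeue to head->requeue_list or fail the bio */
	}
	srcu_read_unlock(&head->srcu, srcu_idx);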

Thanks,
Caleb



* Re: [PATCH] nvme: fix SRCU protection of nvme_ns_head list
  2022-11-21 17:48     ` Caleb Sander
@ 2022-11-21 17:59       ` Paul E. McKenney
  2022-11-21 19:58         ` Caleb Sander
  0 siblings, 1 reply; 25+ messages in thread
From: Paul E. McKenney @ 2022-11-21 17:59 UTC (permalink / raw)
  To: Caleb Sander
  Cc: Christoph Hellwig, Sagi Grimberg, Keith Busch, Jens Axboe,
	linux-nvme, Uday Shankar

On Mon, Nov 21, 2022 at 09:48:31AM -0800, Caleb Sander wrote:
> On Sun, Nov 20, 2022 at 11:40 PM Christoph Hellwig <hch@lst.de> wrote:
> >
> > On Sun, Nov 20, 2022 at 01:24:51PM +0200, Sagi Grimberg wrote:
> > >>      sector_t capacity = get_capacity(head->disk);
> > >>      int node;
> > >> +    int srcu_idx;
> > >>   +  srcu_idx = srcu_read_lock(&head->srcu);
> > >>      list_for_each_entry_rcu(ns, &head->list, siblings) {
> > >>              if (capacity != get_capacity(ns->disk))
> > >>                      clear_bit(NVME_NS_READY, &ns->flags);
> > >>      }
> > >> +    srcu_read_unlock(&head->srcu, srcu_idx);
> > >
> > > I don't think you need srcu here, rcu_read_lock/unlock is sufficient.
> >
> > So the code obviously does not sleep.  But I wonder if anything speaks
> > against mixing SRCU and RCU protection for the same list?
> 
> As I understand, synchronize_rcu() only synchronizes with RCU critical sections,
> and likewise synchronize_srcu() only synchronizes with SRCU critical sections.
> Currently nvme_ns_head_submit_bio() is using srcu_read_lock()
> but nvme_ns_remove() is using synchronize_rcu(),
> so there is no synchronization at all.
> Since nvme_ns_head_submit_bio() sleeps during the critical section,
> we definitely need to keep using SRCU there,
> and therefore nvme_ns_remove() needs to use synchronize_srcu().
> I agree RCU would work in nvme_mpath_revalidate_paths(), but as Paul points out,
> nvme_ns_remove() would then need to synchronize with both SRCU and RCU.
> This seems like it would complicate nvme_ns_remove()
> and require waiting on readers of unrelated RCU-protected structures.
> Is there some cost to using SRCU over RCU that I am missing?

The cost is that srcu_read_lock() and srcu_read_unlock() both invoke
smp_mb().  This might or might not be a problem for you, depending on
how fast your fastpaths need to be.  Another cost is the need to pass
the return value of srcu_read_lock() to srcu_read_unlock(), which is
usually more of a software-engineering cost than it is a performance cost.

Of course, the easiest thing is to try it both ways and see if you can
measure the difference.  If you are unsure whether or not it is safe
to replace rcu_read_lock() with srcu_read_lock(), you can just add
smp_mb() after each rcu_read_lock() and before each rcu_read_unlock().
That won't be exact, but it will get you quite close to the cost of
srcu_read_lock() and srcu_read_unlock().
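
Applied to the loop in the patch, that measurement hack would look something
like this (a rough, throwaway sketch, not something to merge):

	rcu_read_lock();
	smp_mb();	/* approximate the entry cost of srcu_read_lock() */
	list_for_each_entry_rcu(ns, &head->list, siblings) {
		if (capacity != get_capacity(ns->disk))
			clear_bit(NVME_NS_READY, &ns->flags);
	}
	smp_mb();	/* approximate the exit cost of srcu_read_unlock() */
	rcu_read_unlock();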

							Thanx, Paul



* Re: [PATCH] nvme: fix SRCU protection of nvme_ns_head list
  2022-11-21 17:59       ` Paul E. McKenney
@ 2022-11-21 19:58         ` Caleb Sander
  2022-11-22  0:25           ` Paul E. McKenney
  2022-11-22 10:06           ` Sagi Grimberg
  0 siblings, 2 replies; 25+ messages in thread
From: Caleb Sander @ 2022-11-21 19:58 UTC (permalink / raw)
  To: paulmck
  Cc: Christoph Hellwig, Sagi Grimberg, Keith Busch, Jens Axboe,
	linux-nvme, Uday Shankar

On Mon, Nov 21, 2022 at 9:59 AM Paul E. McKenney <paulmck@kernel.org> wrote:
>
> On Mon, Nov 21, 2022 at 09:48:31AM -0800, Caleb Sander wrote:
> > On Sun, Nov 20, 2022 at 11:40 PM Christoph Hellwig <hch@lst.de> wrote:
> > >
> > > On Sun, Nov 20, 2022 at 01:24:51PM +0200, Sagi Grimberg wrote:
> > > >>      sector_t capacity = get_capacity(head->disk);
> > > >>      int node;
> > > >> +    int srcu_idx;
> > > >>   +  srcu_idx = srcu_read_lock(&head->srcu);
> > > >>      list_for_each_entry_rcu(ns, &head->list, siblings) {
> > > >>              if (capacity != get_capacity(ns->disk))
> > > >>                      clear_bit(NVME_NS_READY, &ns->flags);
> > > >>      }
> > > >> +    srcu_read_unlock(&head->srcu, srcu_idx);
> > > >
> > > > I don't think you need srcu here, rcu_read_lock/unlock is sufficient.
> > >
> > > So the code obviously does not sleep.  But I wonder if anything speaks
> > > against mixing SRCU and RCU protection for the same list?
> >
> > As I understand, synchronize_rcu() only synchronizes with RCU critical sections,
> > and likewise synchronize_srcu() only synchronizes with SRCU critical sections.
> > Currently nvme_ns_head_submit_bio() is using srcu_read_lock()
> > but nvme_ns_remove() is using synchronize_rcu(),
> > so there is no synchronization at all.
> > Since nvme_ns_head_submit_bio() sleeps during the critical section,
> > we definitely need to keep using SRCU there,
> > and therefore nvme_ns_remove() needs to use synchronize_srcu().
> > I agree RCU would work in nvme_mpath_revalidate_paths(), but as Paul points out,
> > nvme_ns_remove() would then need to synchronize with both SRCU and RCU.
> > This seems like it would complicate nvme_ns_remove()
> > and require waiting on readers of unrelated RCU-protected structures.
> > Is there some cost to using SRCU over RCU that I am missing?
>
> The cost is that srcu_read_lock() and srcu_read_unlock() both invoke
> smp_mb().  This might or might not be a problem for you, depending on
> how fast your fastpaths need to be.  Another cost is the need to pass
> the return value of srcu_read_lock() to srcu_read_unlock(), which is
> usually more of a software-engineering cost than it is a performance cost.

Thanks, that's great to know! I think this is probably not performance-critical,
since it's called only for changes in the connected NVMe namespaces.

> Of course, the easiest thing is to try it both ways and see if you can
> measure the difference.  If you are unsure whether or not it is safe
> to replace rcu_read_lock() with srcu_read_lock(), you can just add
> smp_mb() after each rcu_read_lock() and before each rcu_read_unlock().
> That won't be exactly, but it will get you quite close to the cost of
> srcu_read_lock() and srcu_read_unlock().
>
>                                                         Thanx, Paul

Regarding your suggestion to use synchronize_rcu_mult() to wait concurrently,
I'm not sure it's possible to write an equivalent to call_my_srrcu() here,
since the srcu_struct is per-nvme_ns_head rather than global.
I think we would probably need to synchronize with RCU and SRCU sequentially.

Thanks,
Caleb



* Re: [PATCH] nvme: fix SRCU protection of nvme_ns_head list
  2022-11-21 19:58         ` Caleb Sander
@ 2022-11-22  0:25           ` Paul E. McKenney
  2022-11-22 10:06           ` Sagi Grimberg
  1 sibling, 0 replies; 25+ messages in thread
From: Paul E. McKenney @ 2022-11-22  0:25 UTC (permalink / raw)
  To: Caleb Sander
  Cc: Christoph Hellwig, Sagi Grimberg, Keith Busch, Jens Axboe,
	linux-nvme, Uday Shankar

On Mon, Nov 21, 2022 at 11:58:53AM -0800, Caleb Sander wrote:
> On Mon, Nov 21, 2022 at 9:59 AM Paul E. McKenney <paulmck@kernel.org> wrote:
> >
> > On Mon, Nov 21, 2022 at 09:48:31AM -0800, Caleb Sander wrote:
> > > On Sun, Nov 20, 2022 at 11:40 PM Christoph Hellwig <hch@lst.de> wrote:
> > > >
> > > > On Sun, Nov 20, 2022 at 01:24:51PM +0200, Sagi Grimberg wrote:
> > > > >>      sector_t capacity = get_capacity(head->disk);
> > > > >>      int node;
> > > > >> +    int srcu_idx;
> > > > >>   +  srcu_idx = srcu_read_lock(&head->srcu);
> > > > >>      list_for_each_entry_rcu(ns, &head->list, siblings) {
> > > > >>              if (capacity != get_capacity(ns->disk))
> > > > >>                      clear_bit(NVME_NS_READY, &ns->flags);
> > > > >>      }
> > > > >> +    srcu_read_unlock(&head->srcu, srcu_idx);
> > > > >
> > > > > I don't think you need srcu here, rcu_read_lock/unlock is sufficient.
> > > >
> > > > So the code obviously does not sleep.  But I wonder if anything speaks
> > > > against mixing SRCU and RCU protection for the same list?
> > >
> > > As I understand, synchronize_rcu() only synchronizes with RCU critical sections,
> > > and likewise synchronize_srcu() only synchronizes with SRCU critical sections.
> > > Currently nvme_ns_head_submit_bio() is using srcu_read_lock()
> > > but nvme_ns_remove() is using synchronize_rcu(),
> > > so there is no synchronization at all.
> > > Since nvme_ns_head_submit_bio() sleeps during the critical section,
> > > we definitely need to keep using SRCU there,
> > > and therefore nvme_ns_remove() needs to use synchronize_srcu().
> > > I agree RCU would work in nvme_mpath_revalidate_paths(), but as Paul points out,
> > > nvme_ns_remove() would then need to synchronize with both SRCU and RCU.
> > > This seems like it would complicate nvme_ns_remove()
> > > and require waiting on readers of unrelated RCU-protected structures.
> > > Is there some cost to using SRCU over RCU that I am missing?
> >
> > The cost is that srcu_read_lock() and srcu_read_unlock() both invoke
> > smp_mb().  This might or might not be a problem for you, depending on
> > how fast your fastpaths need to be.  Another cost is the need to pass
> > the return value of srcu_read_lock() to srcu_read_unlock(), which is
> > usually more of a software-engineering cost than it is a performance cost.
> 
> Thanks, that's great to know! I think this is probably not performance-critical,
> since it's called only for changes in the connected NVMe namespaces.
> 
> > Of course, the easiest thing is to try it both ways and see if you can
> > measure the difference.  If you are unsure whether or not it is safe
> > to replace rcu_read_lock() with srcu_read_lock(), you can just add
> > smp_mb() after each rcu_read_lock() and before each rcu_read_unlock().
> > That won't be exactly, but it will get you quite close to the cost of
> > srcu_read_lock() and srcu_read_unlock().
> 
> Regarding your suggestion to use synchronize_rcu_mult() to wait concurrently,
> I'm not sure it's possible to write an equivalent to call_my_srrcu() here,
> since the srcu_struct is per-nvme_ns_head rather than global.
> I think we would probably need to synchronize with RCU and SRCU sequentially.

That might be a reason to use a global srcu_struct.  Though there might
be good reasons not to, for all I know.

							Thanx, Paul



* Re: [PATCH] nvme: fix SRCU protection of nvme_ns_head list
  2022-11-21 19:58         ` Caleb Sander
  2022-11-22  0:25           ` Paul E. McKenney
@ 2022-11-22 10:06           ` Sagi Grimberg
  2022-11-22 12:14             ` Christoph Hellwig
  1 sibling, 1 reply; 25+ messages in thread
From: Sagi Grimberg @ 2022-11-22 10:06 UTC (permalink / raw)
  To: Caleb Sander, paulmck
  Cc: Christoph Hellwig, Keith Busch, Jens Axboe, linux-nvme, Uday Shankar


>> Of course, the easiest thing is to try it both ways and see if you can
>> measure the difference.  If you are unsure whether or not it is safe
>> to replace rcu_read_lock() with srcu_read_lock(), you can just add
>> smp_mb() after each rcu_read_lock() and before each rcu_read_unlock().
>> That won't be exactly, but it will get you quite close to the cost of
>> srcu_read_lock() and srcu_read_unlock().
>>
>>                                                          Thanx, Paul
> 
> Regarding your suggestion to use synchronize_rcu_mult() to wait concurrently,
> I'm not sure it's possible to write an equivalent to call_my_srrcu() here,
> since the srcu_struct is per-nvme_ns_head rather than global.
> I think we would probably need to synchronize with RCU and SRCU sequentially.

I don't think rcu_mult is necessary.

Let's look at nvme_ns_remove.

1. clears NVME_NS_READY + synchronize_srcu()
    -> guarantees that nshead->current_path is not re-selecting this ns

2. nvme_mpath_clear_current_path + synchronize_srcu()
    -> if this ns happens to be the current_path -> clear it and fence
       concurrent bio submissions

3. removes ns from head sibling list + synchronize rcu
    -> this should fence non-sleeping traversals (like revalidate_paths)
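
In code, that ordering is roughly the following (a paraphrased outline of
the current nvme_ns_remove(), not the literal upstream function):

	clear_bit(NVME_NS_READY, &ns->flags);
	synchronize_srcu(&ns->head->srcu);	/* 1. no new selection of this path */

	if (nvme_mpath_clear_current_path(ns))
		synchronize_srcu(&ns->head->srcu);	/* 2. fence in-flight submitters */

	mutex_lock(&ns->ctrl->subsys->lock);
	list_del_rcu(&ns->siblings);		/* 3. unlink from head->list */
	mutex_unlock(&ns->ctrl->subsys->lock);
	synchronize_rcu();	/* only waits for rcu_read_lock() readers */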

Maybe it is OK to have it also srcu locked and just accept that the
nshead sibling list is srcu protected. In that case, your patch needs
to extend the srcu protection to also cover the clearing of the
current_path pointer.

But looking again at your bug report, you mention that there are
concurrent scans, one removing the ns and another accessing it.
That cannot happen due to the scan_lock held around this section, afaict.

I guess that in general ns removal can compete with a scan if some
controller behavior caused an identify command to fail transiently in a
prior scan and a subsequent scan finds the namespace again? It is worth
pinning down exactly what happened in the race you hit, because maybe we
have a different issue that may manifest in other ways.



* Re: [PATCH] nvme: fix SRCU protection of nvme_ns_head list
  2022-11-22 10:06           ` Sagi Grimberg
@ 2022-11-22 12:14             ` Christoph Hellwig
  2022-11-22 15:08               ` Sagi Grimberg
  0 siblings, 1 reply; 25+ messages in thread
From: Christoph Hellwig @ 2022-11-22 12:14 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Caleb Sander, paulmck, Christoph Hellwig, Keith Busch,
	Jens Axboe, linux-nvme, Uday Shankar

On Tue, Nov 22, 2022 at 12:06:57PM +0200, Sagi Grimberg wrote:
> I don't think rcu_mult is necessary.

It would also be really cumbersome to use.

> Lets look at nvme_ns_remove.
>
> 1. clears NVME_NS_READY + synchronize_srcu()
>    -> guarantees that nshead->current_path is not re-selecting this ns

Yes.

> 2. nvme_mpath_clear_current_path + synchronize_srcu()
>    -> if this ns happens to be the current_path -> clear it and fence
>       concurrent bio submissions

Yes.

> 3. removes ns from head sibling list + synchronize rcu
>    -> this should fence non-sleeping traversals (like revalidate_paths)

Well, non-sleeping would only matter if those non-sleeping traversals
are under rcu_read_lock(), but they are not.  They are either part of
a longer srcu critical section because other code can sleep, or in
case of revalidate_paths unprotected at all (which this patch fixes).

> Maybe it is OK to have it also srcu locked and just accept that
> nshead sibling list is srcu protected. In that case, your patch
> needs to extend the srcu also the clearing of current_head pointer.

I don't see how nvme_mpath_clear_current_path needs (S)RCU protection.
It never dereferences the current_path, it just checks it for pointer
equality and, if they match, clears it to NULL.  (I wonder if it should
use cmpxchg though).
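
Roughly, a cmpxchg variant would look like this (illustration only; the
__rcu annotation on current_path would need the appropriate casts or
helpers):

	for_each_node(node) {
		/* atomically clear the pointer only if it still points at this ns */
		if (cmpxchg(&head->current_path[node], ns, NULL) == ns)
			changed = true;
	}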

> But looking again at your bug report, you mention that there are
> concurrent scans, one removing the ns and another accessing it.
> That cannot happen due to the scan_lock held around this section afaict.
>
> I guess it can be that in general ns removal can compete with a scan
> if due to some controller behavior that failed an identify command
> transiently in a prior scan, and a subsequent scan finds it? worth
> pinning down exactly what happened in the race you got because maybe we
> have a different issue that may manifest in other issues.

So scanning itself should be single-threaded as it only happens from
the workqueue.  But nvme_ns_remove can be called from
nvme_remove_namespaces and, in 6.1 and earlier, from the passthrough
handler.



* Re: [PATCH] nvme: fix SRCU protection of nvme_ns_head list
  2022-11-22 12:14             ` Christoph Hellwig
@ 2022-11-22 15:08               ` Sagi Grimberg
  2022-11-24  0:12                 ` Caleb Sander
  2022-12-01 21:27                 ` Caleb Sander
  0 siblings, 2 replies; 25+ messages in thread
From: Sagi Grimberg @ 2022-11-22 15:08 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Caleb Sander, paulmck, Keith Busch, Jens Axboe, linux-nvme, Uday Shankar


>> 3. removes ns from head sibling list + synchronize rcu
>>     -> this should fence non-sleeping traversals (like revalidate_paths)
> 
> Well, non-sleeping would only matter if those non-sleeping travesals
> are under rcu_read_lock(), but they are not.  They are either part of
> a longer srcu critical section because other code can sleep, or in
> case of revalidate_paths unprotected at all (which this patch fixes).

My original comment on the patch was that rcu_read_lock/unlock would be
sufficient and that we wouldn't need to touch nvme_ns_remove()

> 
>> Maybe it is OK to have it also srcu locked and just accept that
>> nshead sibling list is srcu protected. In that case, your patch
>> needs to extend the srcu also the clearing of current_head pointer.
> 
> I don't see how nvme_mpath_clear_current_path needs (S)RCU protection.
> It never dereferences the current_path, it just checks is for pointer
> equality and if they match clears it to NULL.  (I wonder if it should
> use cmpxchg though).

Agree, it can stay out, because at this point it does not compete with
concurrent submissions due to the prior synchronizations. The list traversal
needs to be under the rcu lock.


> 
>> But looking again at your bug report, you mention that there are
>> concurrent scans, one removing the ns and another accessing it.
>> That cannot happen due to the scan_lock held around this section afaict.
>>
>> I guess it can be that in general ns removal can compete with a scan
>> if due to some controller behavior that failed an identify command
>> transiently in a prior scan, and a subsequent scan finds it? worth
>> pinning down exactly what happened in the race you got because maybe we
>> have a different issue that may manifest in other issues.
> 
> So scanning itself should be single threaded as it only happens from
> the workqueue.  But nvme_ns_remove can be called from
> nvme_remove_namespaces in in 6.1 and earlier from the passthrough
> handler.

The original patch report did not include any sequence that removes all
namespaces, and given that it came from a Rocky Linux 8.6 kernel, it is not
6.1... Hence I think that we need to understand how a namespace removal
happened at the same time that the namespace was being scanned. Maybe
something else is broken.



* Re: [PATCH] nvme: fix SRCU protection of nvme_ns_head list
  2022-11-22 15:08               ` Sagi Grimberg
@ 2022-11-24  0:12                 ` Caleb Sander
  2022-11-24  3:08                   ` Chao Leng
  2022-11-24 14:17                   ` Sagi Grimberg
  2022-12-01 21:27                 ` Caleb Sander
  1 sibling, 2 replies; 25+ messages in thread
From: Caleb Sander @ 2022-11-24  0:12 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Christoph Hellwig, paulmck, Keith Busch, Jens Axboe, linux-nvme,
	Uday Shankar

On Tue, Nov 22, 2022 at 7:08 AM Sagi Grimberg <sagi@grimberg.me> wrote:
>
>
> >> 3. removes ns from head sibling list + synchronize rcu
> >>     -> this should fence non-sleeping traversals (like revalidate_paths)
> >
> > Well, non-sleeping would only matter if those non-sleeping travesals
> > are under rcu_read_lock(), but they are not.  They are either part of
> > a longer srcu critical section because other code can sleep, or in
> > case of revalidate_paths unprotected at all (which this patch fixes).
>
> The original patch comment was that rcu_read_lock/unlock would be
> sufficient and we don't need to touch nvme_ns_remove()
>
> >
> >> Maybe it is OK to have it also srcu locked and just accept that
> >> nshead sibling list is srcu protected. In that case, your patch
> >> needs to extend the srcu also the clearing of current_head pointer.
> >
> > I don't see how nvme_mpath_clear_current_path needs (S)RCU protection.
> > It never dereferences the current_path, it just checks is for pointer
> > equality and if they match clears it to NULL.  (I wonder if it should
> > use cmpxchg though).
>
> Agree. it can stay out. because at this point it does not compete with
> concurrent submissions due to prior synchronizations. The list traversal
> needs to be under rcu lock.
>
>
> >
> >> But looking again at your bug report, you mention that there are
> >> concurrent scans, one removing the ns and another accessing it.
> >> That cannot happen due to the scan_lock held around this section afaict.
> >>
> >> I guess it can be that in general ns removal can compete with a scan
> >> if due to some controller behavior that failed an identify command
> >> transiently in a prior scan, and a subsequent scan finds it? worth
> >> pinning down exactly what happened in the race you got because maybe we
> >> have a different issue that may manifest in other issues.
> >
> > So scanning itself should be single threaded as it only happens from
> > the workqueue.  But nvme_ns_remove can be called from
> > nvme_remove_namespaces in in 6.1 and earlier from the passthrough
> > handler.
>
> The original patch report did not include any sequence that removes all
> namespaces, and given that it came from RockyLinux 8.6 kernel, it is not
> 6.1... Hence I think that we need to understand how a namespace removal
> happened at the same time that the namespace is being scanned. Maybe
> something else is broken.

We can reliably cause the panic by sending a "NS Changed" AEN
from multiple controllers at the same time, resulting in multiple scan works
running concurrently for the same namespace heads.
In our test, we have 4 controllers for the same subsystem, with 9 namespaces
that are added one at a time, resulting in many AENs for each controller.
We can see from the dmesg logs that the controllers' scan works overlap:
[37311.530367] nvme nvme0: queue_size 128 > ctrl sqsize 32, clamping down
[37311.530398] nvme nvme0: creating 32 I/O queues.
[37311.819883] nvme nvme0: mapped 32/0/0 default/read/poll queues.
[37311.828129] nvme nvme0: new ctrl: NQN
"nqn.2010-06.com.purestorage:flasharray.7896f4eca47b4dea", addr
192.168.1.110:4420
[37311.924908] nvme nvme1: queue_size 128 > ctrl sqsize 32, clamping down
[37311.924935] nvme nvme1: creating 32 I/O queues.
[37312.298561] nvme nvme1: mapped 32/0/0 default/read/poll queues.
[37312.306296] nvme nvme1: new ctrl: NQN
"nqn.2010-06.com.purestorage:flasharray.7896f4eca47b4dea", addr
192.168.3.110:4420
[37312.400143] nvme nvme2: queue_size 128 > ctrl sqsize 32, clamping down
[37312.400180] nvme nvme2: creating 32 I/O queues.
[37312.671861] nvme nvme2: mapped 32/0/0 default/read/poll queues.
[37312.678318] nvme nvme2: new ctrl: NQN
"nqn.2010-06.com.purestorage:flasharray.7896f4eca47b4dea", addr
192.168.2.110:4420
[37312.760833] nvme nvme3: queue_size 128 > ctrl sqsize 32, clamping down
[37312.760860] nvme nvme3: creating 32 I/O queues.
[37313.123490] nvme nvme3: mapped 32/0/0 default/read/poll queues.
[37313.130407] nvme nvme3: new ctrl: NQN
"nqn.2010-06.com.purestorage:flasharray.7896f4eca47b4dea", addr
192.168.4.110:4420
[37313.654120] nvme nvme3: rescanning namespaces.
[37313.654152] nvme nvme2: rescanning namespaces.
[37313.654867] nvme0n1: detected capacity change from 0 to 11811160064
[37313.654876] nvme0n1: detected capacity change from 0 to 11811160064
[37313.655573] nvme nvme0: rescanning namespaces.
[37313.655694] nvme nvme1: rescanning namespaces.
[37313.656405] nvme0n1: detected capacity change from 0 to 11811160064
[37313.656445] nvme0n1: detected capacity change from 0 to 11811160064
[37313.897745] nvme nvme3: rescanning namespaces.
[37313.897748] nvme nvme2: rescanning namespaces.
[37313.898614] nvme0n2: detected capacity change from 0 to 11811160064
[37313.907348] nvme nvme0: rescanning namespaces.
[37313.907409] nvme nvme1: rescanning namespaces.
[37314.191586] nvme nvme2: rescanning namespaces.
[37314.191589] nvme nvme3: rescanning namespaces.
[37314.193241] nvme nvme0: rescanning namespaces.
[37314.193303] nvme nvme1: rescanning namespaces.
[37314.205965] nvme0n3: detected capacity change from 0 to 11811160064
[37314.206026] nvme0n3: detected capacity change from 0 to 11811160064
[37314.206036] BUG: unable to handle kernel NULL pointer dereference
at 0000000000000050
[37314.206036] nvme0n3: detected capacity change from 0 to 11811160064

I don't see any mechanism to prevent scan work from running concurrently.
scan_lock is per-controller, but there are 4 controllers in our test.
I'm not very familiar with work queues in Linux, but it looks like they can run
multiple pieces of work concurrently.
The nvme-wq workqueue appears to be unbound (not per-CPU) and has a high max concurrency:
$ cat /sys/bus/workqueue/devices/nvme-wq/per_cpu
0
$ cat /sys/bus/workqueue/devices/nvme-wq/max_active
256

At the end of the scan, nvme_remove_invalid_namespaces() is called,
which I think explains how we end up with namespace removals during the scans.

Thanks,
Caleb



* [PATCH] nvme: fix (S)RCU protection of nvme_ns_head list (alternate)
  2022-11-18 23:27 [PATCH] nvme: fix SRCU protection of nvme_ns_head list Caleb Sander
  2022-11-20 11:24 ` Sagi Grimberg
@ 2022-11-24  0:24 ` Caleb Sander
  2022-11-24 14:19   ` Sagi Grimberg
  2022-11-29  8:39   ` Christoph Hellwig
  1 sibling, 2 replies; 25+ messages in thread
From: Caleb Sander @ 2022-11-24  0:24 UTC (permalink / raw)
  To: Keith Busch, Jens Axboe, Christoph Hellwig, Sagi Grimberg, linux-nvme
  Cc: Uday Shankar, Caleb Sander

This is an alternate fix, protecting the nvme_ns_head siblings list
with both RCU and SRCU.
Use RCU in nvme_mpath_revalidate_paths(), which doesn't sleep.
And add synchronize_srcu() in nvme_ns_remove(),
since nvme_ns_head_submit_bio() protects the list with SRCU.

Signed-off-by: Caleb Sander <csander@purestorage.com>
---
 drivers/nvme/host/core.c      | 1 +
 drivers/nvme/host/multipath.c | 2 ++
 2 files changed, 3 insertions(+)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index da55ce45ac70..9b26d4781b32 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -4305,6 +4305,7 @@ static void nvme_ns_remove(struct nvme_ns *ns)
 
 	/* guarantee not available in head->list */
 	synchronize_rcu();
+	synchronize_srcu(&ns->head->srcu);
 
 	if (!nvme_ns_head_multipath(ns->head))
 		nvme_cdev_del(&ns->cdev, &ns->cdev_device);
diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
index 93e2138a8b42..6d0aea477253 100644
--- a/drivers/nvme/host/multipath.c
+++ b/drivers/nvme/host/multipath.c
@@ -175,10 +175,12 @@ void nvme_mpath_revalidate_paths(struct nvme_ns *ns)
 	sector_t capacity = get_capacity(head->disk);
 	int node;
 
+	rcu_read_lock();
 	list_for_each_entry_rcu(ns, &head->list, siblings) {
 		if (capacity != get_capacity(ns->disk))
 			clear_bit(NVME_NS_READY, &ns->flags);
 	}
+	rcu_read_unlock();
 
 	for_each_node(node)
 		rcu_assign_pointer(head->current_path[node], NULL);
-- 
2.25.1




* Re: [PATCH] nvme: fix SRCU protection of nvme_ns_head list
  2022-11-24  0:12                 ` Caleb Sander
@ 2022-11-24  3:08                   ` Chao Leng
  2022-11-24 14:17                   ` Sagi Grimberg
  1 sibling, 0 replies; 25+ messages in thread
From: Chao Leng @ 2022-11-24  3:08 UTC (permalink / raw)
  To: Caleb Sander, Sagi Grimberg
  Cc: Christoph Hellwig, paulmck, Keith Busch, Jens Axboe, linux-nvme,
	Uday Shankar



On 2022/11/24 8:12, Caleb Sander wrote:
> On Tue, Nov 22, 2022 at 7:08 AM Sagi Grimberg <sagi@grimberg.me> wrote:
>>
>>
>>>> 3. removes ns from head sibling list + synchronize rcu
>>>>      -> this should fence non-sleeping traversals (like revalidate_paths)
>>>
>>> Well, non-sleeping would only matter if those non-sleeping travesals
>>> are under rcu_read_lock(), but they are not.  They are either part of
>>> a longer srcu critical section because other code can sleep, or in
>>> case of revalidate_paths unprotected at all (which this patch fixes).
>>
>> The original patch comment was that rcu_read_lock/unlock would be
>> sufficient and we don't need to touch nvme_ns_remove()
>>
>>>
>>>> Maybe it is OK to have it also srcu locked and just accept that
>>>> nshead sibling list is srcu protected. In that case, your patch
>>>> needs to extend the srcu also the clearing of current_head pointer.
>>>
>>> I don't see how nvme_mpath_clear_current_path needs (S)RCU protection.
>>> It never dereferences the current_path, it just checks is for pointer
>>> equality and if they match clears it to NULL.  (I wonder if it should
>>> use cmpxchg though).
>>
>> Agree. it can stay out. because at this point it does not compete with
>> concurrent submissions due to prior synchronizations. The list traversal
>> needs to be under rcu lock.
>>
>>
>>>
>>>> But looking again at your bug report, you mention that there are
>>>> concurrent scans, one removing the ns and another accessing it.
>>>> That cannot happen due to the scan_lock held around this section afaict.
>>>>
>>>> I guess it can be that in general ns removal can compete with a scan
>>>> if due to some controller behavior that failed an identify command
>>>> transiently in a prior scan, and a subsequent scan finds it? worth
>>>> pinning down exactly what happened in the race you got because maybe we
>>>> have a different issue that may manifest in other issues.
>>>
>>> So scanning itself should be single threaded as it only happens from
>>> the workqueue.  But nvme_ns_remove can be called from
>>> nvme_remove_namespaces in in 6.1 and earlier from the passthrough
>>> handler.
>>
>> The original patch report did not include any sequence that removes all
>> namespaces, and given that it came from RockyLinux 8.6 kernel, it is not
>> 6.1... Hence I think that we need to understand how a namespace removal
>> happened at the same time that the namespace is being scanned. Maybe
>> something else is broken.
> 
> We can reliably cause the panic by sending a "NS Changed" AEN
> from multiple controllers at the same time, resulting in multiple scan works
> running concurrently for the same namespace heads.
> In our test, we have 4 controllers for the same subsystem, with 9 namespaces
> that are added one at a time, resulting in many AENs for each controller.
> We can see from the dmesg logs that the controllers' scan works overlap:
> [37311.530367] nvme nvme0: queue_size 128 > ctrl sqsize 32, clamping down
> [37311.530398] nvme nvme0: creating 32 I/O queues.
> [37311.819883] nvme nvme0: mapped 32/0/0 default/read/poll queues.
> [37311.828129] nvme nvme0: new ctrl: NQN
> "nqn.2010-06.com.purestorage:flasharray.7896f4eca47b4dea", addr
> 192.168.1.110:4420
> [37311.924908] nvme nvme1: queue_size 128 > ctrl sqsize 32, clamping down
> [37311.924935] nvme nvme1: creating 32 I/O queues.
> [37312.298561] nvme nvme1: mapped 32/0/0 default/read/poll queues.
> [37312.306296] nvme nvme1: new ctrl: NQN
> "nqn.2010-06.com.purestorage:flasharray.7896f4eca47b4dea", addr
> 192.168.3.110:4420
> [37312.400143] nvme nvme2: queue_size 128 > ctrl sqsize 32, clamping down
> [37312.400180] nvme nvme2: creating 32 I/O queues.
> [37312.671861] nvme nvme2: mapped 32/0/0 default/read/poll queues.
> [37312.678318] nvme nvme2: new ctrl: NQN
> "nqn.2010-06.com.purestorage:flasharray.7896f4eca47b4dea", addr
> 192.168.2.110:4420
> [37312.760833] nvme nvme3: queue_size 128 > ctrl sqsize 32, clamping down
> [37312.760860] nvme nvme3: creating 32 I/O queues.
> [37313.123490] nvme nvme3: mapped 32/0/0 default/read/poll queues.
> [37313.130407] nvme nvme3: new ctrl: NQN
> "nqn.2010-06.com.purestorage:flasharray.7896f4eca47b4dea", addr
> 192.168.4.110:4420
> [37313.654120] nvme nvme3: rescanning namespaces.
> [37313.654152] nvme nvme2: rescanning namespaces.
> [37313.654867] nvme0n1: detected capacity change from 0 to 11811160064
> [37313.654876] nvme0n1: detected capacity change from 0 to 11811160064
> [37313.655573] nvme nvme0: rescanning namespaces.
> [37313.655694] nvme nvme1: rescanning namespaces.
> [37313.656405] nvme0n1: detected capacity change from 0 to 11811160064
> [37313.656445] nvme0n1: detected capacity change from 0 to 11811160064
> [37313.897745] nvme nvme3: rescanning namespaces.
> [37313.897748] nvme nvme2: rescanning namespaces.
> [37313.898614] nvme0n2: detected capacity change from 0 to 11811160064
> [37313.907348] nvme nvme0: rescanning namespaces.
> [37313.907409] nvme nvme1: rescanning namespaces.
> [37314.191586] nvme nvme2: rescanning namespaces.
> [37314.191589] nvme nvme3: rescanning namespaces.
> [37314.193241] nvme nvme0: rescanning namespaces.
> [37314.193303] nvme nvme1: rescanning namespaces.
> [37314.205965] nvme0n3: detected capacity change from 0 to 11811160064
> [37314.206026] nvme0n3: detected capacity change from 0 to 11811160064
> [37314.206036] BUG: unable to handle kernel NULL pointer dereference
> at 0000000000000050
> [37314.206036] nvme0n3: detected capacity change from 0 to 11811160064
> 
> I don't see any mechanism to prevent scan work from running concurrently.
> scan_lock is per-controller, but there are 4 controllers in our test.
> I'm not very familiar with work queues in Linux, but it looks like they can run
> multiple pieces of work concurrently.
> The nvme-wq appears not CPU bound and has a high max concurrency:
> $ cat /sys/bus/workqueue/devices/nvme-wq/per_cpu
> 0
> $ cat /sys/bus/workqueue/devices/nvme-wq/max_active
> 256
> 
> At the end of the scan, nvme_remove_invalid_namespaces() is called,
> which I think explains how we end up with namespace removals during the scans.
The scan work should not run concurrently.
ctrl->scan_lock also should not be ineffective.
I guess the reason is that the NSIDs in the ns_list reported by the target are not in ascending order.
Of course, there is also a problem with the robustness of the code on the host side.
You can try this patch:
  drivers/nvme/host/core.c | 2 ++
  1 file changed, 2 insertions(+)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index c78a5825a057..ebf751cc3855 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -4467,6 +4467,8 @@ static int nvme_scan_ns_list(struct nvme_ctrl *ctrl)

                         if (!nsid)      /* end of the list? */
                                 goto out;
+                       if (nsid <= prev)
+                               continue;
                         nvme_scan_ns(ctrl, nsid);
                         while (++prev < nsid)
                                 nvme_ns_remove_by_nsid(ctrl, prev);




* Re: [PATCH] nvme: fix SRCU protection of nvme_ns_head list
  2022-11-24  0:12                 ` Caleb Sander
  2022-11-24  3:08                   ` Chao Leng
@ 2022-11-24 14:17                   ` Sagi Grimberg
  1 sibling, 0 replies; 25+ messages in thread
From: Sagi Grimberg @ 2022-11-24 14:17 UTC (permalink / raw)
  To: Caleb Sander
  Cc: Christoph Hellwig, paulmck, Keith Busch, Jens Axboe, linux-nvme,
	Uday Shankar



On 11/24/22 02:12, Caleb Sander wrote:
> On Tue, Nov 22, 2022 at 7:08 AM Sagi Grimberg <sagi@grimberg.me> wrote:
>>
>>
>>>> 3. removes ns from head sibling list + synchronize rcu
>>>>      -> this should fence non-sleeping traversals (like revalidate_paths)
>>>
>>> Well, non-sleeping would only matter if those non-sleeping travesals
>>> are under rcu_read_lock(), but they are not.  They are either part of
>>> a longer srcu critical section because other code can sleep, or in
>>> case of revalidate_paths unprotected at all (which this patch fixes).
>>
>> The original patch comment was that rcu_read_lock/unlock would be
>> sufficient and we don't need to touch nvme_ns_remove()
>>
>>>
>>>> Maybe it is OK to have it also srcu locked and just accept that
>>>> nshead sibling list is srcu protected. In that case, your patch
>>>> needs to extend the srcu also the clearing of current_head pointer.
>>>
>>> I don't see how nvme_mpath_clear_current_path needs (S)RCU protection.
>>> It never dereferences the current_path, it just checks is for pointer
>>> equality and if they match clears it to NULL.  (I wonder if it should
>>> use cmpxchg though).
>>
>> Agree. it can stay out. because at this point it does not compete with
>> concurrent submissions due to prior synchronizations. The list traversal
>> needs to be under rcu lock.
>>
>>
>>>
>>>> But looking again at your bug report, you mention that there are
>>>> concurrent scans, one removing the ns and another accessing it.
>>>> That cannot happen due to the scan_lock held around this section afaict.
>>>>
>>>> I guess it can be that in general ns removal can compete with a scan
>>>> if due to some controller behavior that failed an identify command
>>>> transiently in a prior scan, and a subsequent scan finds it? worth
>>>> pinning down exactly what happened in the race you got because maybe we
>>>> have a different issue that may manifest in other issues.
>>>
>>> So scanning itself should be single threaded as it only happens from
>>> the workqueue.  But nvme_ns_remove can be called from
>>> nvme_remove_namespaces in in 6.1 and earlier from the passthrough
>>> handler.
>>
>> The original patch report did not include any sequence that removes all
>> namespaces, and given that it came from RockyLinux 8.6 kernel, it is not
>> 6.1... Hence I think that we need to understand how a namespace removal
>> happened at the same time that the namespace is being scanned. Maybe
>> something else is broken.
> 
> We can reliably cause the panic by sending a "NS Changed" AEN
> from multiple controllers at the same time, resulting in multiple scan works
> running concurrently for the same namespace heads.

That is fine.

> In our test, we have 4 controllers for the same subsystem, with 9 namespaces
> that are added one at a time, resulting in many AENs for each controller.
> We can see from the dmesg logs that the controllers' scan works overlap:
> [37311.530367] nvme nvme0: queue_size 128 > ctrl sqsize 32, clamping down
> [37311.530398] nvme nvme0: creating 32 I/O queues.
> [37311.819883] nvme nvme0: mapped 32/0/0 default/read/poll queues.
> [37311.828129] nvme nvme0: new ctrl: NQN
> "nqn.2010-06.com.purestorage:flasharray.7896f4eca47b4dea", addr
> 192.168.1.110:4420
> [37311.924908] nvme nvme1: queue_size 128 > ctrl sqsize 32, clamping down
> [37311.924935] nvme nvme1: creating 32 I/O queues.
> [37312.298561] nvme nvme1: mapped 32/0/0 default/read/poll queues.
> [37312.306296] nvme nvme1: new ctrl: NQN
> "nqn.2010-06.com.purestorage:flasharray.7896f4eca47b4dea", addr
> 192.168.3.110:4420
> [37312.400143] nvme nvme2: queue_size 128 > ctrl sqsize 32, clamping down
> [37312.400180] nvme nvme2: creating 32 I/O queues.
> [37312.671861] nvme nvme2: mapped 32/0/0 default/read/poll queues.
> [37312.678318] nvme nvme2: new ctrl: NQN
> "nqn.2010-06.com.purestorage:flasharray.7896f4eca47b4dea", addr
> 192.168.2.110:4420
> [37312.760833] nvme nvme3: queue_size 128 > ctrl sqsize 32, clamping down
> [37312.760860] nvme nvme3: creating 32 I/O queues.
> [37313.123490] nvme nvme3: mapped 32/0/0 default/read/poll queues.
> [37313.130407] nvme nvme3: new ctrl: NQN
> "nqn.2010-06.com.purestorage:flasharray.7896f4eca47b4dea", addr
> 192.168.4.110:4420
> [37313.654120] nvme nvme3: rescanning namespaces.
> [37313.654152] nvme nvme2: rescanning namespaces.
> [37313.654867] nvme0n1: detected capacity change from 0 to 11811160064
> [37313.654876] nvme0n1: detected capacity change from 0 to 11811160064
> [37313.655573] nvme nvme0: rescanning namespaces.
> [37313.655694] nvme nvme1: rescanning namespaces.
> [37313.656405] nvme0n1: detected capacity change from 0 to 11811160064
> [37313.656445] nvme0n1: detected capacity change from 0 to 11811160064
> [37313.897745] nvme nvme3: rescanning namespaces.
> [37313.897748] nvme nvme2: rescanning namespaces.
> [37313.898614] nvme0n2: detected capacity change from 0 to 11811160064
> [37313.907348] nvme nvme0: rescanning namespaces.
> [37313.907409] nvme nvme1: rescanning namespaces.
> [37314.191586] nvme nvme2: rescanning namespaces.
> [37314.191589] nvme nvme3: rescanning namespaces.
> [37314.193241] nvme nvme0: rescanning namespaces.
> [37314.193303] nvme nvme1: rescanning namespaces.
> [37314.205965] nvme0n3: detected capacity change from 0 to 11811160064
> [37314.206026] nvme0n3: detected capacity change from 0 to 11811160064
> [37314.206036] BUG: unable to handle kernel NULL pointer dereference
> at 0000000000000050
> [37314.206036] nvme0n3: detected capacity change from 0 to 11811160064
> 
> I don't see any mechanism to prevent scan work from running concurrently.
> scan_lock is per-controller, but there are 4 controllers in our test.
> I'm not very familiar with work queues in Linux, but it looks like they can run
> multiple pieces of work concurrently.

The scan runs sequentially per controller; nothing prevents
different controllers from scanning at the same time.
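To make that concrete: each controller owns its scan_work, and all of them
are queued onto the shared nvme-wq, so a single controller's scan is
serialized with itself while scans for different controllers can overlap.
A rough sketch of the queueing (paraphrased, not the literal driver code):

static void queue_scan_sketch(struct nvme_ctrl *ctrl)
{
	/*
	 * One work_struct per controller: the workqueue never runs the
	 * same work item concurrently with itself, but two different
	 * controllers have two different work items, so both scans can
	 * be running on nvme-wq at the same time.
	 */
	queue_work(nvme_wq, &ctrl->scan_work);
}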

> The nvme-wq appears not to be CPU-bound and has a high max concurrency:
> $ cat /sys/bus/workqueue/devices/nvme-wq/per_cpu
> 0
> $ cat /sys/bus/workqueue/devices/nvme-wq/max_active
> 256
> 
> At the end of the scan, nvme_remove_invalid_namespaces() is called,
> which I think explains how we end up with namespace removals during the scans.

I see. That makes sense: at least revalidate_paths can run in parallel
with the ns removal that you indicated is happening.


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH] nvme: fix (S)RCU protection of nvme_ns_head list (alternate)
  2022-11-24  0:24 ` [PATCH] nvme: fix (S)RCU protection of nvme_ns_head list (alternate) Caleb Sander
@ 2022-11-24 14:19   ` Sagi Grimberg
  2022-11-29  8:39   ` Christoph Hellwig
  1 sibling, 0 replies; 25+ messages in thread
From: Sagi Grimberg @ 2022-11-24 14:19 UTC (permalink / raw)
  To: Caleb Sander, Keith Busch, Jens Axboe, Christoph Hellwig, linux-nvme
  Cc: Uday Shankar


> This is an alternate fix, protecting the nvme_ns_head siblings list
> with both RCU and SRCU.

There is no need for additional srcu protection.

> Use RCU in nvme_mpath_revalidate_paths(), which doesn't sleep.
> And add synchronize_srcu() in nvme_ns_remove(),
> since nvme_ns_head_submit_bio() protects the list with SRCU.

submit_bio is already fenced by the earlier steps, so I don't see
why srcu protection is needed here.

> Signed-off-by: Caleb Sander <csander@purestorage.com>
> ---
>   drivers/nvme/host/core.c      | 1 +
>   drivers/nvme/host/multipath.c | 2 ++
>   2 files changed, 3 insertions(+)
> 
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index da55ce45ac70..9b26d4781b32 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -4305,6 +4305,7 @@ static void nvme_ns_remove(struct nvme_ns *ns)
>   
>   	/* guarantee not available in head->list */
>   	synchronize_rcu();
> +	synchronize_srcu(&ns->head->srcu);

This can be removed.

>   
>   	if (!nvme_ns_head_multipath(ns->head))
>   		nvme_cdev_del(&ns->cdev, &ns->cdev_device);
> diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
> index 93e2138a8b42..6d0aea477253 100644
> --- a/drivers/nvme/host/multipath.c
> +++ b/drivers/nvme/host/multipath.c
> @@ -175,10 +175,12 @@ void nvme_mpath_revalidate_paths(struct nvme_ns *ns)
>   	sector_t capacity = get_capacity(head->disk);
>   	int node;
>   
> +	rcu_read_lock();
>   	list_for_each_entry_rcu(ns, &head->list, siblings) {
>   		if (capacity != get_capacity(ns->disk))
>   			clear_bit(NVME_NS_READY, &ns->flags);
>   	}
> +	rcu_read_unlock();
>   

This looks fine.

>   	for_each_node(node)
>   		rcu_assign_pointer(head->current_path[node], NULL);


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH] nvme: fix (S)RCU protection of nvme_ns_head list (alternate)
  2022-11-24  0:24 ` [PATCH] nvme: fix (S)RCU protection of nvme_ns_head list (alternate) Caleb Sander
  2022-11-24 14:19   ` Sagi Grimberg
@ 2022-11-29  8:39   ` Christoph Hellwig
  2022-11-30  8:25     ` Sagi Grimberg
  1 sibling, 1 reply; 25+ messages in thread
From: Christoph Hellwig @ 2022-11-29  8:39 UTC (permalink / raw)
  To: Caleb Sander
  Cc: Keith Busch, Jens Axboe, Christoph Hellwig, Sagi Grimberg,
	linux-nvme, Uday Shankar

On Wed, Nov 23, 2022 at 05:24:32PM -0700, Caleb Sander wrote:
> +++ b/drivers/nvme/host/multipath.c
> @@ -175,10 +175,12 @@ void nvme_mpath_revalidate_paths(struct nvme_ns *ns)
>  	sector_t capacity = get_capacity(head->disk);
>  	int node;
>  
> +	rcu_read_lock();
>  	list_for_each_entry_rcu(ns, &head->list, siblings) {
>  		if (capacity != get_capacity(ns->disk))
>  			clear_bit(NVME_NS_READY, &ns->flags);
>  	}
> +	rcu_read_unlock();

I can't see how protecting iteration of the same list once with
SRCU and once with RCU can work in practice.


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH] nvme: fix (S)RCU protection of nvme_ns_head list (alternate)
  2022-11-29  8:39   ` Christoph Hellwig
@ 2022-11-30  8:25     ` Sagi Grimberg
  2022-11-30  8:35       ` Christoph Hellwig
  0 siblings, 1 reply; 25+ messages in thread
From: Sagi Grimberg @ 2022-11-30  8:25 UTC (permalink / raw)
  To: Christoph Hellwig, Caleb Sander
  Cc: Keith Busch, Jens Axboe, linux-nvme, Uday Shankar


>> +++ b/drivers/nvme/host/multipath.c
>> @@ -175,10 +175,12 @@ void nvme_mpath_revalidate_paths(struct nvme_ns *ns)
>>   	sector_t capacity = get_capacity(head->disk);
>>   	int node;
>>   
>> +	rcu_read_lock();
>>   	list_for_each_entry_rcu(ns, &head->list, siblings) {
>>   		if (capacity != get_capacity(ns->disk))
>>   			clear_bit(NVME_NS_READY, &ns->flags);
>>   	}
>> +	rcu_read_unlock();
> 
> I can't see how protecting iteration of the same list once with
> SRCU and once with RCU can work in practice.

I understand what you mean in general, but in this particular case
I don't understand what is not working.

But, if this is too confusing we can make everything srcu protected
and be done with it.


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH] nvme: fix (S)RCU protection of nvme_ns_head list (alternate)
  2022-11-30  8:25     ` Sagi Grimberg
@ 2022-11-30  8:35       ` Christoph Hellwig
  2022-11-30  8:40         ` Sagi Grimberg
  0 siblings, 1 reply; 25+ messages in thread
From: Christoph Hellwig @ 2022-11-30  8:35 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Christoph Hellwig, Caleb Sander, Keith Busch, Jens Axboe,
	linux-nvme, Uday Shankar

On Wed, Nov 30, 2022 at 10:25:19AM +0200, Sagi Grimberg wrote:
> I understand what you mean in general, but in this particular case
> I don't understand what is not working.

How does this work?

> But, if this is too confusing we can make everything srcu protected
> and be done with it.

To me the original patch looks correct and reasonable, so I'd
prefer to go with that.  We can then look into optimizing the
SRCU usage later if needed.


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH] nvme: fix (S)RCU protection of nvme_ns_head list (alternate)
  2022-11-30  8:35       ` Christoph Hellwig
@ 2022-11-30  8:40         ` Sagi Grimberg
  2022-12-01 21:17           ` Caleb Sander
  0 siblings, 1 reply; 25+ messages in thread
From: Sagi Grimberg @ 2022-11-30  8:40 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Caleb Sander, Keith Busch, Jens Axboe, linux-nvme, Uday Shankar


>> I understand what you mean in general, but in this particular case
>> I don't understand what is not working.
> 
> How does this work?

Particularly, the sleeping contexts are guaranteed not to dereference
this NS after the two previous srcu synchronization steps so at this
point, the only protection of nvme_ns_remove is against this
non-sleepable traversal, which should be enough to protect with rcu.

>> But, if this is too confusing we can make everything srcu protected
>> and be done with it.
> 
> To me the original patch looks correct and reasonable, so I'd
> prefer to go with that.  We can then look into optimizing the
> SRCU usage later if needed.

That is fine with me.


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH] nvme: fix (S)RCU protection of nvme_ns_head list (alternate)
  2022-11-30  8:40         ` Sagi Grimberg
@ 2022-12-01 21:17           ` Caleb Sander
  2022-12-02  1:21             ` Chao Leng
  0 siblings, 1 reply; 25+ messages in thread
From: Caleb Sander @ 2022-12-01 21:17 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Christoph Hellwig, Keith Busch, Jens Axboe, linux-nvme, Uday Shankar

On Wed, Nov 30, 2022 at 12:40 AM Sagi Grimberg <sagi@grimberg.me> wrote:
>
>
> >> I understand what you mean in general, but in this particular case
> >> I don't understand what is not working.
> >
> > How does this work?
>
> Particularly, the sleeping contexts are guaranteed not to dereference
> this NS after the two previous srcu synchronization steps so at this
> point, the only protection of nvme_ns_remove is against this
> non-sleepable traversal, which should be enough to protect with rcu.

Can you help me understand the safety here?
The namespace will be dereferenced when traversing the siblings list,
which is protected by SRCU.
But nvme_ns_remove() only synchronizes with RCU between removing the namespace
from the siblings list and freeing the namespace.
So it seems like there's a race here:
Thread A:                               Thread B:
nvme_ns_remove() executes
past the last synchronize_srcu()
                                        nvme_ns_head_submit_bio()
                                        calls srcu_read_lock(),
                                        starts traversing siblings list,
                                        holds pointer to ns
Removes ns from siblings list
Calls synchronize_rcu()
(does not block for the SRCU reader)
nvme_put_ns() reaches 0 references,
frees ns
                                        Dereferences ns to continue traversal
                                        => USE AFTER FREE


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH] nvme: fix SRCU protection of nvme_ns_head list
  2022-11-22 15:08               ` Sagi Grimberg
  2022-11-24  0:12                 ` Caleb Sander
@ 2022-12-01 21:27                 ` Caleb Sander
  2022-12-01 23:18                   ` Paul E. McKenney
  1 sibling, 1 reply; 25+ messages in thread
From: Caleb Sander @ 2022-12-01 21:27 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Christoph Hellwig, paulmck, Keith Busch, Jens Axboe, linux-nvme,
	Uday Shankar

On Tue, Nov 22, 2022 at 7:08 AM Sagi Grimberg <sagi@grimberg.me> wrote:
> ...
> The original patch report did not include any sequence that removes all
> namespaces, and given that it came from the Rocky Linux 8.6 kernel, it is not
> 6.1... Hence I think that we need to understand how a namespace removal
> happened at the same time that the namespace is being scanned. Maybe
> something else is broken.

After some more debugging, I found the main cause:
ns->disk isn't set until after ns is linked into the ns_head list.
The ns is linked in nvme_init_ns_head(),
which nvme_alloc_ns() calls before creating the disk:
static void nvme_alloc_ns(struct nvme_ctrl *ctrl, unsigned nsid,
		struct nvme_ns_ids *ids)
{
	// ...
	if (nvme_init_ns_head(ns, nsid, ids, id->nmic & NVME_NS_NMIC_SHARED))
		goto out_free_queue;

	disk = alloc_disk_node(0, node);
	if (!disk)
		goto out_unlink_ns;
	// ...
	ns->disk = disk;
	// ...
}

It looks like this has never been a problem upstream:
5f432cceb3e9 ("nvme: use blk_mq_alloc_disk") fixed the order before
e7d65803e2bb ("nvme-multipath: revalidate paths during rescan") was committed.
Strange that the Red Hat 8.6 kernel has pulled in only the later commit.
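For illustration, the safe ordering looks roughly like this sketch (not the
upstream code; ns_disk_alloc_sketch() is a hypothetical stand-in for however
the disk is allocated on a given kernel):

static int alloc_ns_order_sketch(struct nvme_ctrl *ctrl, struct nvme_ns *ns,
				 unsigned nsid, struct nvme_ns_ids *ids,
				 bool shared)
{
	struct gendisk *disk;

	disk = ns_disk_alloc_sketch(ctrl, ns);	/* hypothetical helper */
	if (!disk)
		return -ENOMEM;
	ns->disk = disk;	/* disk valid before the ns is published */

	/* only now link the ns into head->list (done inside this call) */
	return nvme_init_ns_head(ns, nsid, ids, shared);
}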

The list traversal here still needs to be protected by (S)RCU,
so I would still push for this patch.
Sagi had suggested using RCU here instead, but discussing with Chris,
it sounds like we reached a consensus to consistently protect this list
with SRCU. We can evaluate switching to RCU later if needed,
but since this is not in the I/O path, I don't see a big performance concern.


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH] nvme: fix SRCU protection of nvme_ns_head list
  2022-12-01 21:27                 ` Caleb Sander
@ 2022-12-01 23:18                   ` Paul E. McKenney
  0 siblings, 0 replies; 25+ messages in thread
From: Paul E. McKenney @ 2022-12-01 23:18 UTC (permalink / raw)
  To: Caleb Sander
  Cc: Sagi Grimberg, Christoph Hellwig, Keith Busch, Jens Axboe,
	linux-nvme, Uday Shankar

On Thu, Dec 01, 2022 at 01:27:16PM -0800, Caleb Sander wrote:
> On Tue, Nov 22, 2022 at 7:08 AM Sagi Grimberg <sagi@grimberg.me> wrote:
> > ...
> > The original patch report did not include any sequence that removes all
> > namespaces, and given that it came from the Rocky Linux 8.6 kernel, it is not
> > 6.1... Hence I think that we need to understand how a namespace removal
> > happened at the same time that the namespace is being scanned. Maybe
> > something else is broken.
> 
> After some more debugging, I found the main cause:
> ns->disk isn't set until after ns is linked into the ns_head list.
> The ns is linked in nvme_init_ns_head(),
> which nvme_alloc_ns() calls before creating the disk:
> static void nvme_alloc_ns(struct nvme_ctrl *ctrl, unsigned nsid,
> 		struct nvme_ns_ids *ids)
> {
> 	// ...
> 	if (nvme_init_ns_head(ns, nsid, ids, id->nmic & NVME_NS_NMIC_SHARED))
> 		goto out_free_queue;
> 
> 	disk = alloc_disk_node(0, node);
> 	if (!disk)
> 		goto out_unlink_ns;
> 	// ...
> 	ns->disk = disk;
> 	// ...
> }
> 
> It looks like this has never been a problem upstream:
> 5f432cceb3e9 ("nvme: use blk_mq_alloc_disk") fixed the order before
> e7d65803e2bb ("nvme-multipath: revalidate paths during rescan") was committed.
> Strange that the Red Hat 8.6 kernel has pulled in only the later commit.

Sadly, not strange at all.  Keeping upstream sane is hard enough, and
also keeping N -stable releases sane is not any easier.  ;-)

> The list traversal here still needs to be protected by (S)RCU,
> so I would still push for this patch.
> Sagi had suggested using RCU here instead, but discussing with Chris,
> it sounds like we reached a consensus to consistently protect this list
> with SRCU. We can evaluate switching to RCU later if needed,
> but since this is not in the I/O path, I don't see a big performance concern.

For whatever it is worth, makes sense to me!

							Thanx, Paul


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH] nvme: fix (S)RCU protection of nvme_ns_head list (alternate)
  2022-12-01 21:17           ` Caleb Sander
@ 2022-12-02  1:21             ` Chao Leng
  0 siblings, 0 replies; 25+ messages in thread
From: Chao Leng @ 2022-12-02  1:21 UTC (permalink / raw)
  To: linux-nvme



On 2022/12/2 5:17, Caleb Sander wrote:
> On Wed, Nov 30, 2022 at 12:40 AM Sagi Grimberg <sagi@grimberg.me> wrote:
>>
>>
>>>> I understand what you mean in general, but in this particular case
>>>> I don't understand what is not working.
>>>
>>> How does this work?
>>
>> Particularly, the sleeping contexts are guaranteed not to dereference
>> this NS after the two previous srcu synchronization steps so at this
>> point, the only protection of nvme_ns_remove is against this
>> non-sleepable traversal, which should be enough to protect with rcu.
> 
> Can you help me understand the safety here?
> The namespace will be dereferenced when traversing the siblings list,
> which is protected by SRCU.
> But nvme_ns_remove() only synchronizes with RCU between removing the namespace
> from the siblings list and freeing the namespace.
> So it seems like there's a race here:
> Thread A:                               Thread B:
> nvme_ns_remove() executes
> past the last synchronize_srcu()
>                                          nvme_ns_head_submit_bio()
>                                          calls srcu_read_lock(),
>                                          starts traversing siblings list,
>                                          holds pointer to ns
> Removes ns from siblings list
> Calls synchronize_rcu()
> (does not block for the SRCU reader)
del_gendisk will wait for all outstanding requests to complete, so the "use after free" will not happen.
> nvme_put_ns() reaches 0 references,
> frees ns
>                                          Dereferences ns to continue traversal
>                                          => USE AFTER FREE
> 
> .
> 


^ permalink raw reply	[flat|nested] 25+ messages in thread

end of thread, other threads:[~2022-12-02  1:21 UTC | newest]

Thread overview: 25+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-11-18 23:27 [PATCH] nvme: fix SRCU protection of nvme_ns_head list Caleb Sander
2022-11-20 11:24 ` Sagi Grimberg
2022-11-21  7:40   ` Christoph Hellwig
2022-11-21  9:43     ` Sagi Grimberg
2022-11-21 14:57       ` Paul E. McKenney
2022-11-21 17:48     ` Caleb Sander
2022-11-21 17:59       ` Paul E. McKenney
2022-11-21 19:58         ` Caleb Sander
2022-11-22  0:25           ` Paul E. McKenney
2022-11-22 10:06           ` Sagi Grimberg
2022-11-22 12:14             ` Christoph Hellwig
2022-11-22 15:08               ` Sagi Grimberg
2022-11-24  0:12                 ` Caleb Sander
2022-11-24  3:08                   ` Chao Leng
2022-11-24 14:17                   ` Sagi Grimberg
2022-12-01 21:27                 ` Caleb Sander
2022-12-01 23:18                   ` Paul E. McKenney
2022-11-24  0:24 ` [PATCH] nvme: fix (S)RCU protection of nvme_ns_head list (alternate) Caleb Sander
2022-11-24 14:19   ` Sagi Grimberg
2022-11-29  8:39   ` Christoph Hellwig
2022-11-30  8:25     ` Sagi Grimberg
2022-11-30  8:35       ` Christoph Hellwig
2022-11-30  8:40         ` Sagi Grimberg
2022-12-01 21:17           ` Caleb Sander
2022-12-02  1:21             ` Chao Leng
