linux-rdma.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* Re: [PATCH] rpcrdma: don't decref EP if a ESTABLISHED did not happen
       [not found] <20240505124910.1877325-1-dan.aloni@vastdata.com>
@ 2024-05-05 14:35 ` Chuck Lever
  2024-05-05 18:36   ` Dan Aloni
  0 siblings, 1 reply; 4+ messages in thread
From: Chuck Lever @ 2024-05-05 14:35 UTC (permalink / raw)
  To: Dan Aloni; +Cc: linux-nfs, linux-rdma

On Sun, May 05, 2024 at 03:49:10PM +0300, Dan Aloni wrote:
> We found a case where `RDMA_CM_EVENT_DEVICE_REMOVAL` causes a refcount
> underflow.
> 
> The specific scenario that caused this to happen is IB device bonding,
> when bringing down one of the ports, or all ports. The situation is not
> just a print - it also causes a non-recoverable state it is not even
> possible to complete the disconnect and shut it down the mount,
> requiring a reboot, suggesting that tear-down is also incomplete in this
> state.
> 
> The trivial fix seems to work as such - if we did not receive a
> `RDMA_CM_EVENT_ESTABLISHED`, we should not decref the EP, otherwise
> `rpcrdma_xprt_drain` kills the EP prematurely in from the context of
> `rpcrdma_xprt_disconnect`.
> 
> Fixes: 2acc5cae2923 ('xprtrdma: Prevent dereferencing r_xprt->rx_ep after it is freed')

Hi Dan, thanks for the bug report!

In the future, note that the Fixes: tag needs to go down by the
Signed-off-by: tag. But I wonder if 2acc5 introduced this issue --
ep reference counting seems to have been added before this commit.

Second, I wonder if, when bonding interfaces, there's an opportunity
for the verbs consumer to take an additional transport reference.
Cc'ing linux-rdma for input on that issue. Can you show a brief
diagram of when the ep reference count changes when bonding?

Also, I note that the WARNING below happened on a RHEL9 kernel:

   5.14.0-284.11.1.el9_2.x86_64

Can you confirm that this issue reproduces on v6.8 and newer ?


> Example crash:
> 
> rpcrdma: removing device mlx5_3 for 172.21.208.2:20049
> ------------[ cut here ]------------
> refcount_t: underflow; use-after-free.
> WARNING: CPU: 60 PID: 19700 at lib/refcount.c:28 refcount_warn_saturate+0xba/0x110
> Modules linked in: mst_pciconf(OE) nfsv3(OE) nfs_acl(OE) rpcsec_gss_krb5(OE) auth_rpcgss(OE) nfsv4(OE) dns_resolver rpcrdma(OE) nfs(OE) lockd(OE) grace compat_nfs_ssc(OE) snd_seq_dummy snd_hrtimer snd_seq snd_timer snd_seq_device snd soundcore uio_pci_generic uio vfio_pci vfio_pci_core vfio_virqfd vfio_iommu_>
> isst_if_mmio mei isst_if_common i2c_smbus intel_pch_thermal intel_vsec ipmi_msghandler acpi_power_meter xfs libcrc32c mlx5_ib(OE) ib_uverbs(OE) ib_core(OE) sd_mod t10_pi sg mgag200 i2c_algo_bit drm_shmem_helper drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ahci libahci drm crct10dif_pclmul mlx>
> CPU: 60 PID: 19700 Comm: kworker/u132:4 Kdump: loaded Tainted: G        W  OE    --------  ---  5.14.0-284.11.1.el9_2.x86_64 #1
> Hardware name: Dell Inc. PowerEdge C6520/0TY3YW, BIOS 1.8.2 09/14/2022
> Workqueue: xprtiod xprt_rdma_connect_worker [rpcrdma]
> RIP: 0010:refcount_warn_saturate+0xba/0x110
> Code: 01 01 e8 27 e1 56 00 0f 0b c3 cc cc cc cc 80 3d b8 29 9b 01 00 75 85 48 c7 c7 38 ec 04 93 c6 05 a8 29 9b 01 01 e8 04 e1 56 00 <0f> 0b c3 cc cc cc cc 80 3d 93 29 9b 01 00 0f 85 5e ff ff ff 48 c7
> RSP: 0018:ff34fa4968cafe10 EFLAGS: 00010282
> RAX: 0000000000000000 RBX: ff1210404a15e000 RCX: 0000000000000027
> RDX: ff12103f803998a8 RSI: 0000000000000001 RDI: ff12103f803998a0
> RBP: ff1210404a15e648 R08: 0000000000000000 R09: 00000000ffff7fff
> R10: ff34fa4968cafcb0 R11: ffffffff939e9608 R12: 0000000000000000
> R13: dead000000000122 R14: dead000000000100 R15: ff1210404a15e928
> FS:  0000000000000000(0000) GS:ff12103f80380000(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 00007f170f8a5000 CR3: 00000001c3adc002 CR4: 0000000000771ee0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> PKRU: 55555554
> Call Trace:
> <TASK>
> rpcrdma_ep_put+0x42/0x50 [rpcrdma]
> rpcrdma_xprt_disconnect+0x303/0x3b0 [rpcrdma]
> xprt_rdma_connect_worker+0xc8/0xd0 [rpcrdma]
> process_one_work+0x1e5/0x3c0
> ? rescuer_thread+0x3a0/0x3a0
> worker_thread+0x50/0x3b0
> ? rescuer_thread+0x3a0/0x3a0
> kthread+0xd6/0x100
> ? kthread_complete_and_exit+0x20/0x20
> ret_from_fork+0x1f/0x30
> </TASK>
> 
> Signed-off-by: Dan Aloni <dan.aloni@vastdata.com>
> ---
>  net/sunrpc/xprtrdma/verbs.c     | 5 ++++-
>  net/sunrpc/xprtrdma/xprt_rdma.h | 1 +
>  2 files changed, 5 insertions(+), 1 deletion(-)
> 
> diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
> index 4f8d7efa469f..19996515da94 100644
> --- a/net/sunrpc/xprtrdma/verbs.c
> +++ b/net/sunrpc/xprtrdma/verbs.c
> @@ -250,6 +250,7 @@ rpcrdma_cm_event_handler(struct rdma_cm_id *id, struct rdma_cm_event *event)
>  		goto disconnected;
>  	case RDMA_CM_EVENT_ESTABLISHED:
>  		rpcrdma_ep_get(ep);
> +               ep->re_connect_ref = true;
>  		ep->re_connect_status = 1;
>  		rpcrdma_update_cm_private(ep, &event->param.conn);
>  		trace_xprtrdma_inline_thresh(ep);
> @@ -272,7 +273,9 @@ rpcrdma_cm_event_handler(struct rdma_cm_id *id, struct rdma_cm_event *event)
>  		ep->re_connect_status = -ECONNABORTED;
>  disconnected:
>  		rpcrdma_force_disconnect(ep);
> -		return rpcrdma_ep_put(ep);
> +		if (ep->re_connect_ref)
> +			return rpcrdma_ep_put(ep);
> +		return 0;
>  	default:
>  		break;
>  	}
> diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h
> index da409450dfc0..1553ef69a844 100644
> --- a/net/sunrpc/xprtrdma/xprt_rdma.h
> +++ b/net/sunrpc/xprtrdma/xprt_rdma.h
> @@ -84,6 +84,7 @@ struct rpcrdma_ep {
>  	unsigned int		re_max_inline_recv;
>  	int			re_async_rc;
>  	int			re_connect_status;
> +	bool			re_connect_ref;
>  	atomic_t		re_receiving;
>  	atomic_t		re_force_disconnect;
>  	struct ib_qp_init_attr	re_attr;
> -- 
> 2.39.3
> 

-- 
Chuck Lever

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [PATCH] rpcrdma: don't decref EP if a ESTABLISHED did not happen
  2024-05-05 14:35 ` [PATCH] rpcrdma: don't decref EP if a ESTABLISHED did not happen Chuck Lever
@ 2024-05-05 18:36   ` Dan Aloni
  2024-05-05 19:00     ` Chuck Lever
  0 siblings, 1 reply; 4+ messages in thread
From: Dan Aloni @ 2024-05-05 18:36 UTC (permalink / raw)
  To: Chuck Lever; +Cc: linux-nfs, linux-rdma

On 2024-05-05 10:35:46, Chuck Lever wrote:
> On Sun, May 05, 2024 at 03:49:10PM +0300, Dan Aloni wrote:
> > We found a case where `RDMA_CM_EVENT_DEVICE_REMOVAL` causes a refcount
> > underflow.
> > 
> > The specific scenario that caused this to happen is IB device bonding,
> > when bringing down one of the ports, or all ports. The situation is not
> > just a print - it also causes a non-recoverable state it is not even
> > possible to complete the disconnect and shut it down the mount,
> > requiring a reboot, suggesting that tear-down is also incomplete in this
> > state.
> > 
> > The trivial fix seems to work as such - if we did not receive a
> > `RDMA_CM_EVENT_ESTABLISHED`, we should not decref the EP, otherwise
> > `rpcrdma_xprt_drain` kills the EP prematurely in from the context of
> > `rpcrdma_xprt_disconnect`.
> > 
> > Fixes: 2acc5cae2923 ('xprtrdma: Prevent dereferencing r_xprt->rx_ep after it is freed')
> 
> Hi Dan, thanks for the bug report!
> 
[..]
> Second, I wonder if, when bonding interfaces, there's an opportunity
> for the verbs consumer to take an additional transport reference.
> Cc'ing linux-rdma for input on that issue. Can you show a brief
> diagram of when the ep reference count changes when bonding?

Not sure we need an additional reference here. I understand regarding rpcrdma
managing its internal EP refcount, that it is having three in total: one for
init, another one for ESTABLISHED mode, and a third for 'outstanding receives',
so the mishandling of RDMA_CM_EVENT_DEVICE_REMOVAL seem only internal to me,
and I found some more about it, please read on.

> Also, I note that the WARNING below happened on a RHEL9 kernel:
> 
>    5.14.0-284.11.1.el9_2.x86_64
> 
> Can you confirm that this issue reproduces on v6.8 and newer ?

Some context: I originally tested with a version that also has implemented a
timeout in wait_event of `rpcrdma_xprt_connect`. You may say it is somewhat
'culprit' in the negative decref, but there is the other issue and second order
effect of likely not handling RDMA_CM_EVENT_DEVICE_REMOVAL properly which causes
the unkillable transport, so we are not reaching teardown at all. With both of
these changes applied both problems are gone.

As for upstream version testing, with the elrepo build of 6.8.9 which matches
vanilla, I don't see the negative kref but I do see the other effect of
`XPRT_CLOSE_WAIT` transport state and provider stuck on teardown (like below),
which does not release even after ports are back up online.

Testing 6.8.9 with both the patch and `wake_up_all(&ep->re_connect_wait);`
for `RDMA_CM_EVENT_DEVICE_REMOVAL` works for me, showing proper recovery
on bonding, I'll post in a reply.

```
INFO: task kworker/u128:6:1688 blocked for more than 122 seconds.
      Tainted: G            E      6.8.9-1.el9.elrepo.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:kworker/u128:6  state:D stack:0     pid:1688  tgid:1688  ppid:2      flags:0x00004000
Workqueue: mlx5_lag mlx5_do_bond_work [mlx5_core]
Call Trace:
 <TASK>
 __schedule+0x21c/0x680
 schedule+0x31/0xd0
 schedule_timeout+0x148/0x160
 ? mutex_lock+0xe/0x30
 ? cma_process_remove+0x218/0x260 [rdma_cm]
 ? preempt_count_add+0x70/0xa0
 __wait_for_common+0x90/0x1e0
 ? __pfx_schedule_timeout+0x10/0x10
 cma_remove_one+0x5c/0xd0 [rdma_cm]
 remove_client_context+0x88/0xd0 [ib_core]
 disable_device+0x8a/0x160 [ib_core]
 ? _raw_spin_unlock+0x15/0x30
 __ib_unregister_device+0x42/0xb0 [ib_core]
 ib_unregister_device+0x22/0x30 [ib_core]
 mlx5r_remove+0x36/0x60 [mlx5_ib]
 auxiliary_bus_remove+0x18/0x30
 device_release_driver_internal+0x193/0x200
 bus_remove_device+0xbf/0x130
 device_del+0x157/0x3e0
 ? devl_param_driverinit_value_get+0x29/0xa0
 mlx5_rescan_drivers_locked.part.0+0x7e/0x1b0 [mlx5_core]
 mlx5_lag_remove_devices+0x3c/0x60 [mlx5_core]
 mlx5_do_bond+0x2cb/0x390 [mlx5_core]
 mlx5_do_bond_work+0x96/0xf0 [mlx5_core]
 process_one_work+0x174/0x340
 worker_thread+0x27e/0x390
 ? __pfx_worker_thread+0x10/0x10
 kthread+0xee/0x120
 ? __pfx_kthread+0x10/0x10
 ret_from_fork+0x2d/0x50
 ? __pfx_kthread+0x10/0x10
 ret_from_fork_asm+0x1b/0x30
 </TASK>
INFO: task fio:10584 blocked for more than 122 seconds.
      Tainted: G            E      6.8.9-1.el9.elrepo.x86_64 #1
```

-- 
Dan Aloni

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [PATCH] rpcrdma: don't decref EP if a ESTABLISHED did not happen
  2024-05-05 18:36   ` Dan Aloni
@ 2024-05-05 19:00     ` Chuck Lever
  2024-05-05 20:10       ` Dan Aloni
  0 siblings, 1 reply; 4+ messages in thread
From: Chuck Lever @ 2024-05-05 19:00 UTC (permalink / raw)
  To: Dan Aloni; +Cc: linux-nfs, linux-rdma

On Sun, May 05, 2024 at 09:36:28PM +0300, Dan Aloni wrote:
> On 2024-05-05 10:35:46, Chuck Lever wrote:
> > On Sun, May 05, 2024 at 03:49:10PM +0300, Dan Aloni wrote:
> > > We found a case where `RDMA_CM_EVENT_DEVICE_REMOVAL` causes a refcount
> > > underflow.
> > > 
> > > The specific scenario that caused this to happen is IB device bonding,
> > > when bringing down one of the ports, or all ports. The situation is not
> > > just a print - it also causes a non-recoverable state it is not even
> > > possible to complete the disconnect and shut it down the mount,
> > > requiring a reboot, suggesting that tear-down is also incomplete in this
> > > state.
> > > 
> > > The trivial fix seems to work as such - if we did not receive a
> > > `RDMA_CM_EVENT_ESTABLISHED`, we should not decref the EP, otherwise
> > > `rpcrdma_xprt_drain` kills the EP prematurely in from the context of
> > > `rpcrdma_xprt_disconnect`.
> > > 
> > > Fixes: 2acc5cae2923 ('xprtrdma: Prevent dereferencing r_xprt->rx_ep after it is freed')
> > 
> > Hi Dan, thanks for the bug report!
> > 
> [..]
> > Second, I wonder if, when bonding interfaces, there's an opportunity
> > for the verbs consumer to take an additional transport reference.
> > Cc'ing linux-rdma for input on that issue. Can you show a brief
> > diagram of when the ep reference count changes when bonding?
> 
> Not sure we need an additional reference here.

We already have two mechanisms in play:

- the ep reference count
- the re_connect_status value

I would prefer not to add a boolean here. Seems like
re_connect_status could do that job if we need it to.

I seem to recall that when a device is removed, the verbs consumer
doesn't get a DISCONNECT first. Or does it? Having a sequence of
CM events that you see in the bonded case would help a lot.


> I understand regarding rpcrdma
> managing its internal EP refcount, that it is having three in total: one for
> init, another one for ESTABLISHED mode, and a third for 'outstanding receives',
> so the mishandling of RDMA_CM_EVENT_DEVICE_REMOVAL seem only internal to me,
> and I found some more about it, please read on.
> 
> > Also, I note that the WARNING below happened on a RHEL9 kernel:
> > 
> >    5.14.0-284.11.1.el9_2.x86_64
> > 
> > Can you confirm that this issue reproduces on v6.8 and newer ?
> 
> Some context: I originally tested with a version that also has implemented a
> timeout in wait_event of `rpcrdma_xprt_connect`. You may say it is somewhat
> 'culprit' in the negative decref, but there is the other issue and second order
> effect of likely not handling RDMA_CM_EVENT_DEVICE_REMOVAL properly which causes
> the unkillable transport, so we are not reaching teardown at all. With both of
> these changes applied both problems are gone.
> 
> As for upstream version testing, with the elrepo build of 6.8.9 which matches
> vanilla, I don't see the negative kref but I do see the other effect of
> `XPRT_CLOSE_WAIT` transport state and provider stuck on teardown (like below),
> which does not release even after ports are back up online.
> 
> Testing 6.8.9 with both the patch and `wake_up_all(&ep->re_connect_wait);`
> for `RDMA_CM_EVENT_DEVICE_REMOVAL` works for me, showing proper recovery
> on bonding, I'll post in a reply.

It looks like you are trying to fix several issues in one patch. So
I need you to separate these issues and the fixes, and let's focus
on the upstream kernel (v6.9-rc6) because there's nothing I can do
about the RHEL9 kernel, which is clearly a different source base
than the one I work on.

If we need a "wake_up_all(&ep->re_connect_wait);" during
DEVICE_REMOVAL, that should be a separate patch. And you need to
figure out if ADDR_CHANGE needs the same treatment: the v2 you just
sent changes the behavior of ADDR_CHANGE too but does not mention
whether it intended that change.

Without a reproducer or a detailed explanation of how the ep
reference count changes in step with CM events, I can't properly
assess your proposed fix.


> ```
> INFO: task kworker/u128:6:1688 blocked for more than 122 seconds.
>       Tainted: G            E      6.8.9-1.el9.elrepo.x86_64 #1
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> task:kworker/u128:6  state:D stack:0     pid:1688  tgid:1688  ppid:2      flags:0x00004000
> Workqueue: mlx5_lag mlx5_do_bond_work [mlx5_core]
> Call Trace:
>  <TASK>
>  __schedule+0x21c/0x680
>  schedule+0x31/0xd0
>  schedule_timeout+0x148/0x160
>  ? mutex_lock+0xe/0x30
>  ? cma_process_remove+0x218/0x260 [rdma_cm]
>  ? preempt_count_add+0x70/0xa0
>  __wait_for_common+0x90/0x1e0
>  ? __pfx_schedule_timeout+0x10/0x10
>  cma_remove_one+0x5c/0xd0 [rdma_cm]
>  remove_client_context+0x88/0xd0 [ib_core]
>  disable_device+0x8a/0x160 [ib_core]
>  ? _raw_spin_unlock+0x15/0x30
>  __ib_unregister_device+0x42/0xb0 [ib_core]
>  ib_unregister_device+0x22/0x30 [ib_core]
>  mlx5r_remove+0x36/0x60 [mlx5_ib]
>  auxiliary_bus_remove+0x18/0x30
>  device_release_driver_internal+0x193/0x200
>  bus_remove_device+0xbf/0x130
>  device_del+0x157/0x3e0
>  ? devl_param_driverinit_value_get+0x29/0xa0
>  mlx5_rescan_drivers_locked.part.0+0x7e/0x1b0 [mlx5_core]
>  mlx5_lag_remove_devices+0x3c/0x60 [mlx5_core]
>  mlx5_do_bond+0x2cb/0x390 [mlx5_core]
>  mlx5_do_bond_work+0x96/0xf0 [mlx5_core]
>  process_one_work+0x174/0x340
>  worker_thread+0x27e/0x390
>  ? __pfx_worker_thread+0x10/0x10
>  kthread+0xee/0x120
>  ? __pfx_kthread+0x10/0x10
>  ret_from_fork+0x2d/0x50
>  ? __pfx_kthread+0x10/0x10
>  ret_from_fork_asm+0x1b/0x30
>  </TASK>
> INFO: task fio:10584 blocked for more than 122 seconds.
>       Tainted: G            E      6.8.9-1.el9.elrepo.x86_64 #1
> ```
> 
> -- 
> Dan Aloni

-- 
Chuck Lever

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [PATCH] rpcrdma: don't decref EP if a ESTABLISHED did not happen
  2024-05-05 19:00     ` Chuck Lever
@ 2024-05-05 20:10       ` Dan Aloni
  0 siblings, 0 replies; 4+ messages in thread
From: Dan Aloni @ 2024-05-05 20:10 UTC (permalink / raw)
  To: Chuck Lever; +Cc: linux-nfs, linux-rdma

On 2024-05-05 15:00:10, Chuck Lever wrote:
> On Sun, May 05, 2024 at 09:36:28PM +0300, Dan Aloni wrote:
> > On 2024-05-05 10:35:46, Chuck Lever wrote:
> > [..]
> > > Second, I wonder if, when bonding interfaces, there's an opportunity
> > > for the verbs consumer to take an additional transport reference.
> > > Cc'ing linux-rdma for input on that issue. Can you show a brief
> > > diagram of when the ep reference count changes when bonding?
> > 
> > Not sure we need an additional reference here.
> 
> We already have two mechanisms in play:
> 
> - the ep reference count
> - the re_connect_status value
> 
> I would prefer not to add a boolean here. Seems like
> re_connect_status could do that job if we need it to.

That's possible. So before overrides e.g. `ep->re_connect_status =
-ENODEV` we would need to remember whether it was `1` and across the
call to `rpcrdma_force_disconnect()`, to do the putref() after it.
Anyway, the patch will be a big larger I guess.

>[..]
> > Testing 6.8.9 with both the patch and `wake_up_all(&ep->re_connect_wait);`
> > for `RDMA_CM_EVENT_DEVICE_REMOVAL` works for me, showing proper recovery
> > on bonding, I'll post in a reply.
> 
> It looks like you are trying to fix several issues in one patch. So
> I need you to separate these issues and the fixes, and let's focus
> on the upstream kernel (v6.9-rc6) because there's nothing I can do
> about the RHEL9 kernel, which is clearly a different source base
> than the one I work on.
> 
> If we need a "wake_up_all(&ep->re_connect_wait);" during
> DEVICE_REMOVAL, that should be a separate patch. And you need to
> figure out if ADDR_CHANGE needs the same treatment: the v2 you just
> sent changes the behavior of ADDR_CHANGE too but does not mention
> whether it intended that change.

Was it just the `rpcrdma_ep_put` change in this case? Will feel less
comfortable changing it without a repro of the `ADDR_CHANGE` case. Going
to isolate it.

> Without a reproducer or a detailed explanation of how the ep
> reference count changes in step with CM events, I can't properly
> assess your proposed fix.

With 6.8 I only see a single `RDMA_CM_EVENT_DEVICE_REMOVAL` event
arriving, driving a single `rpcrdma_ep_put` call to put the reference
back at 1, and it waits forever. Getting late here, I'll check on 6.9-rc
tomorrow.

Maybe for a repro, you can generate a single interface bond? I haven't
tried, but possibly it would occur on a single one too if the underlying
port is down.

-- 
Dan Aloni

^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2024-05-05 20:10 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <20240505124910.1877325-1-dan.aloni@vastdata.com>
2024-05-05 14:35 ` [PATCH] rpcrdma: don't decref EP if a ESTABLISHED did not happen Chuck Lever
2024-05-05 18:36   ` Dan Aloni
2024-05-05 19:00     ` Chuck Lever
2024-05-05 20:10       ` Dan Aloni

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).