* [PATCH v1] SUNRPC: Replace the use of the xprtiod WQ in rpcrdma
@ 2022-09-06 15:55 Chuck Lever
  2022-09-15 15:56 ` Olga Kornievskaia
  0 siblings, 1 reply; 5+ messages in thread
From: Chuck Lever @ 2022-09-06 15:55 UTC (permalink / raw)
  To: anna.schumaker; +Cc: linux-nfs, linux-rdma

While setting up a new lab, I accidentally misconfigured the
Ethernet port for a system that tried an NFS mount using RoCE.
This made the NFS server unreachable. The following WARNING
popped on the NFS client while waiting for the mount attempt to
time out:

kernel: workqueue: WQ_MEM_RECLAIM xprtiod:xprt_rdma_connect_worker [rpcrdma] is flushing !WQ_MEM_RECLAI>
kernel: WARNING: CPU: 0 PID: 100 at kernel/workqueue.c:2628 check_flush_dependency+0xbf/0xca
kernel: Modules linked in: rpcsec_gss_krb5 nfsv4 dns_resolver nfs 8021q garp stp mrp llc rfkill rpcrdma>
kernel: CPU: 0 PID: 100 Comm: kworker/u8:8 Not tainted 6.0.0-rc1-00002-g6229f8c054e5 #13
kernel: Hardware name: Supermicro X10SRA-F/X10SRA-F, BIOS 2.0b 06/12/2017
kernel: Workqueue: xprtiod xprt_rdma_connect_worker [rpcrdma]
kernel: RIP: 0010:check_flush_dependency+0xbf/0xca
kernel: Code: 75 2a 48 8b 55 18 48 8d 8b b0 00 00 00 4d 89 e0 48 81 c6 b0 00 00 00 48 c7 c7 65 33 2e be>
kernel: RSP: 0018:ffffb562806cfcf8 EFLAGS: 00010092
kernel: RAX: 0000000000000082 RBX: ffff97894f8c3c00 RCX: 0000000000000027
kernel: RDX: 0000000000000002 RSI: ffffffffbe3447d1 RDI: 00000000ffffffff
kernel: RBP: ffff978941315840 R08: 0000000000000000 R09: 0000000000000000
kernel: R10: 00000000000008b0 R11: 0000000000000001 R12: ffffffffc0ce3731
kernel: R13: ffff978950c00500 R14: ffff97894341f0c0 R15: ffff978951112eb0
kernel: FS:  0000000000000000(0000) GS:ffff97987fc00000(0000) knlGS:0000000000000000
kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
kernel: CR2: 00007f807535eae8 CR3: 000000010b8e4002 CR4: 00000000003706f0
kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
kernel: Call Trace:
kernel:  <TASK>
kernel:  __flush_work.isra.0+0xaf/0x188
kernel:  ? _raw_spin_lock_irqsave+0x2c/0x37
kernel:  ? lock_timer_base+0x38/0x5f
kernel:  __cancel_work_timer+0xea/0x13d
kernel:  ? preempt_latency_start+0x2b/0x46
kernel:  rdma_addr_cancel+0x70/0x81 [ib_core]
kernel:  _destroy_id+0x1a/0x246 [rdma_cm]
kernel:  rpcrdma_xprt_connect+0x115/0x5ae [rpcrdma]
kernel:  ? _raw_spin_unlock+0x14/0x29
kernel:  ? raw_spin_rq_unlock_irq+0x5/0x10
kernel:  ? finish_task_switch.isra.0+0x171/0x249
kernel:  xprt_rdma_connect_worker+0x3b/0xc7 [rpcrdma]
kernel:  process_one_work+0x1d8/0x2d4
kernel:  worker_thread+0x18b/0x24f
kernel:  ? rescuer_thread+0x280/0x280
kernel:  kthread+0xf4/0xfc
kernel:  ? kthread_complete_and_exit+0x1b/0x1b
kernel:  ret_from_fork+0x22/0x30
kernel:  </TASK>

SUNRPC's xprtiod workqueue is WQ_MEM_RECLAIM, so any workqueue that
one of its work items tries to cancel has to be WQ_MEM_RECLAIM to
prevent a priority inversion. The internal workqueues in the
RDMA/core are currently non-MEM_RECLAIM.

Jason Gunthorpe says this about the current state of RDMA/core:
> If you attempt to do a reconnection/etc from within a RECLAIM
> context it will deadlock on one of the many allocations that are
> made to support opening the connection.
>
> The general idea of reclaim is that the entire task context
> working under the reclaim is marked with an override of the gfp
> flags to make all allocations under that call chain reclaim safe.
>
> But rdmacm does allocations outside this, eg in the WQs processing
> the CM packets. So this doesn't work and we will deadlock.
>
> Fixing it is a big deal and needs more than poking WQ_MEM_RECLAIM
> here and there.

So we will change the ULP in this case to avoid the use of
WQ_MEM_RECLAIM where possible. Deadlocks that were possible before
are not fixed, but at least we no longer have a false sense of
confidence that the stack won't allocate memory during memory
reclaim.
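
For context, a rough sketch of the flag mismatch that
check_flush_dependency() reports above. The alloc_workqueue() calls here
are approximations of the existing SUNRPC and RDMA/core call sites and are
not part of this change:

	/* A WQ_MEM_RECLAIM workqueue gets a rescuer thread and may be
	 * flushed from memory-reclaim context; flushing or cancelling
	 * work on a queue without that flag from such a context is what
	 * triggers the warning.
	 */
	struct workqueue_struct *xprtiod_wq, *addr_wq;

	xprtiod_wq = alloc_workqueue("xprtiod", WQ_UNBOUND | WQ_MEM_RECLAIM, 0);
	addr_wq    = alloc_ordered_workqueue("ib_addr", 0);  /* no WQ_MEM_RECLAIM */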

While we're adjusting these queue_* call sites, ensure the work
requests always run on the local CPU so the worker allocates RDMA
resources that are local to the CPU that queued the work request.

Suggested-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 net/sunrpc/xprtrdma/transport.c |    4 ++--
 net/sunrpc/xprtrdma/verbs.c     |   11 ++++-------
 2 files changed, 6 insertions(+), 9 deletions(-)

Hi Anna-

I've had this applied to my test client for a while. I think it's
ready to apply.


diff --git a/net/sunrpc/xprtrdma/transport.c b/net/sunrpc/xprtrdma/transport.c
index bcb37b51adf6..9581641bb8cb 100644
--- a/net/sunrpc/xprtrdma/transport.c
+++ b/net/sunrpc/xprtrdma/transport.c
@@ -494,8 +494,8 @@ xprt_rdma_connect(struct rpc_xprt *xprt, struct rpc_task *task)
 		xprt_reconnect_backoff(xprt, RPCRDMA_INIT_REEST_TO);
 	}
 	trace_xprtrdma_op_connect(r_xprt, delay);
-	queue_delayed_work(xprtiod_workqueue, &r_xprt->rx_connect_worker,
-			   delay);
+	queue_delayed_work_on(smp_processor_id(), system_long_wq,
+			      &r_xprt->rx_connect_worker, delay);
 }
 
 /**
diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index 2fbe9aaeec34..691afc96bcbc 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -791,13 +791,10 @@ void rpcrdma_mrs_refresh(struct rpcrdma_xprt *r_xprt)
 	/* If there is no underlying connection, it's no use
 	 * to wake the refresh worker.
 	 */
-	if (ep->re_connect_status == 1) {
-		/* The work is scheduled on a WQ_MEM_RECLAIM
-		 * workqueue in order to prevent MR allocation
-		 * from recursing into NFS during direct reclaim.
-		 */
-		queue_work(xprtiod_workqueue, &buf->rb_refresh_worker);
-	}
+	if (ep->re_connect_status != 1)
+		return;
+	queue_work_on(smp_processor_id(), system_highpri_wq,
+		      &buf->rb_refresh_worker);
 }
 
 /**




* Re: [PATCH v1] SUNRPC: Replace the use of the xprtiod WQ in rpcrdma
  2022-09-06 15:55 [PATCH v1] SUNRPC: Replace the use of the xprtiod WQ in rpcrdma Chuck Lever
@ 2022-09-15 15:56 ` Olga Kornievskaia
  2022-09-16 18:28   ` Olga Kornievskaia
  0 siblings, 1 reply; 5+ messages in thread
From: Olga Kornievskaia @ 2022-09-15 15:56 UTC (permalink / raw)
  To: Chuck Lever; +Cc: Anna Schumaker, linux-nfs, linux-rdma

On Tue, Sep 6, 2022 at 12:25 PM Chuck Lever <chuck.lever@oracle.com> wrote:
>
> While setting up a new lab, I accidentally misconfigured the
> Ethernet port for a system that tried an NFS mount using RoCE.
> This made the NFS server unreachable. The following WARNING
> popped on the NFS client while waiting for the mount attempt to
> time out:

I also hit this today (on the 5.18 kernel) while running xfstest
generic/460 using soft iWARP. In my case the port was properly
configured and the test was running; I'm not sure exactly what
happened. I know I also crashed the server I was running against. But
the point I would like to make is that this condition can be reached
on a properly configured system.

> kernel: workqueue: WQ_MEM_RECLAIM xprtiod:xprt_rdma_connect_worker [rpcrdma] is flushing !WQ_MEM_RECLAI>
> kernel: WARNING: CPU: 0 PID: 100 at kernel/workqueue.c:2628 check_flush_dependency+0xbf/0xca
> kernel: Modules linked in: rpcsec_gss_krb5 nfsv4 dns_resolver nfs 8021q garp stp mrp llc rfkill rpcrdma>
> kernel: CPU: 0 PID: 100 Comm: kworker/u8:8 Not tainted 6.0.0-rc1-00002-g6229f8c054e5 #13
> kernel: Hardware name: Supermicro X10SRA-F/X10SRA-F, BIOS 2.0b 06/12/2017
> kernel: Workqueue: xprtiod xprt_rdma_connect_worker [rpcrdma]
> kernel: RIP: 0010:check_flush_dependency+0xbf/0xca
> kernel: Code: 75 2a 48 8b 55 18 48 8d 8b b0 00 00 00 4d 89 e0 48 81 c6 b0 00 00 00 48 c7 c7 65 33 2e be>
> kernel: RSP: 0018:ffffb562806cfcf8 EFLAGS: 00010092
> kernel: RAX: 0000000000000082 RBX: ffff97894f8c3c00 RCX: 0000000000000027
> kernel: RDX: 0000000000000002 RSI: ffffffffbe3447d1 RDI: 00000000ffffffff
> kernel: RBP: ffff978941315840 R08: 0000000000000000 R09: 0000000000000000
> kernel: R10: 00000000000008b0 R11: 0000000000000001 R12: ffffffffc0ce3731
> kernel: R13: ffff978950c00500 R14: ffff97894341f0c0 R15: ffff978951112eb0
> kernel: FS:  0000000000000000(0000) GS:ffff97987fc00000(0000) knlGS:0000000000000000
> kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> kernel: CR2: 00007f807535eae8 CR3: 000000010b8e4002 CR4: 00000000003706f0
> kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> kernel: Call Trace:
> kernel:  <TASK>
> kernel:  __flush_work.isra.0+0xaf/0x188
> kernel:  ? _raw_spin_lock_irqsave+0x2c/0x37
> kernel:  ? lock_timer_base+0x38/0x5f
> kernel:  __cancel_work_timer+0xea/0x13d
> kernel:  ? preempt_latency_start+0x2b/0x46
> kernel:  rdma_addr_cancel+0x70/0x81 [ib_core]
> kernel:  _destroy_id+0x1a/0x246 [rdma_cm]
> kernel:  rpcrdma_xprt_connect+0x115/0x5ae [rpcrdma]
> kernel:  ? _raw_spin_unlock+0x14/0x29
> kernel:  ? raw_spin_rq_unlock_irq+0x5/0x10
> kernel:  ? finish_task_switch.isra.0+0x171/0x249
> kernel:  xprt_rdma_connect_worker+0x3b/0xc7 [rpcrdma]
> kernel:  process_one_work+0x1d8/0x2d4
> kernel:  worker_thread+0x18b/0x24f
> kernel:  ? rescuer_thread+0x280/0x280
> kernel:  kthread+0xf4/0xfc
> kernel:  ? kthread_complete_and_exit+0x1b/0x1b
> kernel:  ret_from_fork+0x22/0x30
> kernel:  </TASK>
>
> SUNRPC's xprtiod workqueue is WQ_MEM_RECLAIM, so any workqueue that
> one of its work items tries to cancel has to be WQ_MEM_RECLAIM to
> prevent a priority inversion. The internal workqueues in the
> RDMA/core are currently non-MEM_RECLAIM.
>
> Jason Gunthorpe says this about the current state of RDMA/core:
> > If you attempt to do a reconnection/etc from within a RECLAIM
> > context it will deadlock on one of the many allocations that are
> > made to support opening the connection.
> >
> > The general idea of reclaim is that the entire task context
> > working under the reclaim is marked with an override of the gfp
> > flags to make all allocations under that call chain reclaim safe.
> >
> > But rdmacm does allocations outside this, eg in the WQs processing
> > the CM packets. So this doesn't work and we will deadlock.
> >
> > Fixing it is a big deal and needs more than poking WQ_MEM_RECLAIM
> > here and there.
>
> So we will change the ULP in this case to avoid the use of
> WQ_MEM_RECLAIM where possible. Deadlocks that were possible before
> are not fixed, but at least we no longer have a false sense of
> confidence that the stack won't allocate memory during memory
> reclaim.
>
> While we're adjusting these queue_* call sites, ensure the work
> requests always run on the local CPU so the worker allocates RDMA
> resources that are local to the CPU that queued the work request.
>
> Suggested-by: Leon Romanovsky <leon@kernel.org>
> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
> ---
>  net/sunrpc/xprtrdma/transport.c |    4 ++--
>  net/sunrpc/xprtrdma/verbs.c     |   11 ++++-------
>  2 files changed, 6 insertions(+), 9 deletions(-)
>
> Hi Anna-
>
> I've had this applied to my test client for a while. I think it's
> ready to apply.
>
>
> diff --git a/net/sunrpc/xprtrdma/transport.c b/net/sunrpc/xprtrdma/transport.c
> index bcb37b51adf6..9581641bb8cb 100644
> --- a/net/sunrpc/xprtrdma/transport.c
> +++ b/net/sunrpc/xprtrdma/transport.c
> @@ -494,8 +494,8 @@ xprt_rdma_connect(struct rpc_xprt *xprt, struct rpc_task *task)
>                 xprt_reconnect_backoff(xprt, RPCRDMA_INIT_REEST_TO);
>         }
>         trace_xprtrdma_op_connect(r_xprt, delay);
> -       queue_delayed_work(xprtiod_workqueue, &r_xprt->rx_connect_worker,
> -                          delay);
> +       queue_delayed_work_on(smp_processor_id(), system_long_wq,
> +                             &r_xprt->rx_connect_worker, delay);
>  }
>
>  /**
> diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
> index 2fbe9aaeec34..691afc96bcbc 100644
> --- a/net/sunrpc/xprtrdma/verbs.c
> +++ b/net/sunrpc/xprtrdma/verbs.c
> @@ -791,13 +791,10 @@ void rpcrdma_mrs_refresh(struct rpcrdma_xprt *r_xprt)
>         /* If there is no underlying connection, it's no use
>          * to wake the refresh worker.
>          */
> -       if (ep->re_connect_status == 1) {
> -               /* The work is scheduled on a WQ_MEM_RECLAIM
> -                * workqueue in order to prevent MR allocation
> -                * from recursing into NFS during direct reclaim.
> -                */
> -               queue_work(xprtiod_workqueue, &buf->rb_refresh_worker);
> -       }
> +       if (ep->re_connect_status != 1)
> +               return;
> +       queue_work_on(smp_processor_id(), system_highpri_wq,
> +                     &buf->rb_refresh_worker);
>  }
>
>  /**
>
>


* Re: [PATCH v1] SUNRPC: Replace the use of the xprtiod WQ in rpcrdma
  2022-09-15 15:56 ` Olga Kornievskaia
@ 2022-09-16 18:28   ` Olga Kornievskaia
  2022-09-16 22:40     ` Trond Myklebust
  0 siblings, 1 reply; 5+ messages in thread
From: Olga Kornievskaia @ 2022-09-16 18:28 UTC (permalink / raw)
  To: Chuck Lever; +Cc: Anna Schumaker, linux-nfs, linux-rdma

On Thu, Sep 15, 2022 at 11:56 AM Olga Kornievskaia <aglo@umich.edu> wrote:
>
> On Tue, Sep 6, 2022 at 12:25 PM Chuck Lever <chuck.lever@oracle.com> wrote:
> >
> > While setting up a new lab, I accidentally misconfigured the
> > Ethernet port for a system that tried an NFS mount using RoCE.
> > This made the NFS server unreachable. The following WARNING
> > popped on the NFS client while waiting for the mount attempt to
> > time out:
>
> I also hit this today (on the 5.18 kernel) while running xfstest
> generic/460 using soft iWarp. In my case the port was properly
> configured. The test was going. I'm not sure exactly what happened. I
> know I also crashed the server that I was running against. But the
> point I would like to make is that this condition is possible to get
> to on a properly configured system.

But I think with this patch applied, I'm hitting this instead (of
course it could be something else):

[ 3222.712335] BUG: using smp_processor_id() in preemptible [00000000]
code: 192.168.1.124-m/3814
[ 3222.714428] caller is xprt_rdma_connect+0x6a/0x120 [rpcrdma]
[ 3222.716047] CPU: 0 PID: 3814 Comm: 192.168.1.124-m Not tainted
6.0.0-rc5+ #123
[ 3222.717706] Hardware name: VMware, Inc. VMware Virtual
Platform/440BX Desktop Reference Platform, BIOS 6.00 11/12/2020
[ 3222.720310] Call Trace:
[ 3222.721032]  <TASK>
[ 3222.721587]  dump_stack_lvl+0x33/0x46
[ 3222.722501]  check_preemption_disabled+0xc3/0xf0
[ 3222.723754]  xprt_rdma_connect+0x6a/0x120 [rpcrdma]
[ 3222.725594]  xprt_connect+0x300/0x370 [sunrpc]
[ 3222.727369]  ? call_reserveresult+0xa0/0xa0 [sunrpc]
[ 3222.729272]  __rpc_execute+0x162/0x870 [sunrpc]
[ 3222.731101]  ? rpc_exit+0x40/0x40 [sunrpc]
[ 3222.732841]  ? __wake_up+0x10/0x10
[ 3222.733657]  rpc_execute+0x148/0x1b0 [sunrpc]
[ 3222.735326]  rpc_run_task+0x270/0x2d0 [sunrpc]
[ 3222.737182]  nfs4_proc_bind_one_conn_to_session+0x1cc/0x3a0 [nfsv4]
[ 3222.740472]  ? _nfs4_do_set_security_label+0x2d0/0x2d0 [nfsv4]
[ 3222.745034]  ? xprt_get+0xa0/0x120 [sunrpc]
[ 3222.747150]  ? nfs4_proc_bind_one_conn_to_session+0x3a0/0x3a0 [nfsv4]
[ 3222.749299]  ? __rcu_read_unlock+0x4e/0x250
[ 3222.750429]  ? nfs4_proc_bind_one_conn_to_session+0x3a0/0x3a0 [nfsv4]
[ 3222.752586]  rpc_clnt_iterate_for_each_xprt+0xc6/0x140 [sunrpc]
[ 3222.754900]  ? rpc_clnt_xprt_switch_add_xprt+0xa0/0xa0 [sunrpc]
[ 3222.757041]  ? preempt_count_sub+0x14/0xc0
[ 3222.758097]  nfs4_proc_bind_conn_to_session+0x87/0xb0 [nfsv4]
[ 3222.760341]  ? nfs4_proc_secinfo+0x250/0x250 [nfsv4]
[ 3222.762257]  nfs4_state_manager+0x34e/0xf60 [nfsv4]
[ 3222.764095]  nfs4_run_state_manager+0x1a6/0x2e0 [nfsv4]
[ 3222.766778]  ? nfs4_state_manager+0xf60/0xf60 [nfsv4]
[ 3222.768811]  ? _raw_spin_lock_irqsave+0x8d/0xf0
[ 3222.770029]  ? _raw_spin_unlock_irqrestore+0x40/0x40
[ 3222.771740]  ? __list_del_entry_valid+0x77/0xa0
[ 3222.773400]  ? nfs4_state_manager+0xf60/0xf60 [nfsv4]
[ 3222.775653]  kthread+0x160/0x190
[ 3222.776729]  ? kthread_complete_and_exit+0x20/0x20
[ 3222.777865]  ret_from_fork+0x1f/0x30
[ 3222.778972]  </TASK>


>
> > kernel: workqueue: WQ_MEM_RECLAIM xprtiod:xprt_rdma_connect_worker [rpcrdma] is flushing !WQ_MEM_RECLAI>
> > kernel: WARNING: CPU: 0 PID: 100 at kernel/workqueue.c:2628 check_flush_dependency+0xbf/0xca
> > kernel: Modules linked in: rpcsec_gss_krb5 nfsv4 dns_resolver nfs 8021q garp stp mrp llc rfkill rpcrdma>
> > kernel: CPU: 0 PID: 100 Comm: kworker/u8:8 Not tainted 6.0.0-rc1-00002-g6229f8c054e5 #13
> > kernel: Hardware name: Supermicro X10SRA-F/X10SRA-F, BIOS 2.0b 06/12/2017
> > kernel: Workqueue: xprtiod xprt_rdma_connect_worker [rpcrdma]
> > kernel: RIP: 0010:check_flush_dependency+0xbf/0xca
> > kernel: Code: 75 2a 48 8b 55 18 48 8d 8b b0 00 00 00 4d 89 e0 48 81 c6 b0 00 00 00 48 c7 c7 65 33 2e be>
> > kernel: RSP: 0018:ffffb562806cfcf8 EFLAGS: 00010092
> > kernel: RAX: 0000000000000082 RBX: ffff97894f8c3c00 RCX: 0000000000000027
> > kernel: RDX: 0000000000000002 RSI: ffffffffbe3447d1 RDI: 00000000ffffffff
> > kernel: RBP: ffff978941315840 R08: 0000000000000000 R09: 0000000000000000
> > kernel: R10: 00000000000008b0 R11: 0000000000000001 R12: ffffffffc0ce3731
> > kernel: R13: ffff978950c00500 R14: ffff97894341f0c0 R15: ffff978951112eb0
> > kernel: FS:  0000000000000000(0000) GS:ffff97987fc00000(0000) knlGS:0000000000000000
> > kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > kernel: CR2: 00007f807535eae8 CR3: 000000010b8e4002 CR4: 00000000003706f0
> > kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> > kernel: Call Trace:
> > kernel:  <TASK>
> > kernel:  __flush_work.isra.0+0xaf/0x188
> > kernel:  ? _raw_spin_lock_irqsave+0x2c/0x37
> > kernel:  ? lock_timer_base+0x38/0x5f
> > kernel:  __cancel_work_timer+0xea/0x13d
> > kernel:  ? preempt_latency_start+0x2b/0x46
> > kernel:  rdma_addr_cancel+0x70/0x81 [ib_core]
> > kernel:  _destroy_id+0x1a/0x246 [rdma_cm]
> > kernel:  rpcrdma_xprt_connect+0x115/0x5ae [rpcrdma]
> > kernel:  ? _raw_spin_unlock+0x14/0x29
> > kernel:  ? raw_spin_rq_unlock_irq+0x5/0x10
> > kernel:  ? finish_task_switch.isra.0+0x171/0x249
> > kernel:  xprt_rdma_connect_worker+0x3b/0xc7 [rpcrdma]
> > kernel:  process_one_work+0x1d8/0x2d4
> > kernel:  worker_thread+0x18b/0x24f
> > kernel:  ? rescuer_thread+0x280/0x280
> > kernel:  kthread+0xf4/0xfc
> > kernel:  ? kthread_complete_and_exit+0x1b/0x1b
> > kernel:  ret_from_fork+0x22/0x30
> > kernel:  </TASK>
> >
> > SUNRPC's xprtiod workqueue is WQ_MEM_RECLAIM, so any workqueue that
> > one of its work items tries to cancel has to be WQ_MEM_RECLAIM to
> > prevent a priority inversion. The internal workqueues in the
> > RDMA/core are currently non-MEM_RECLAIM.
> >
> > Jason Gunthorpe says this about the current state of RDMA/core:
> > > If you attempt to do a reconnection/etc from within a RECLAIM
> > > context it will deadlock on one of the many allocations that are
> > > made to support opening the connection.
> > >
> > > The general idea of reclaim is that the entire task context
> > > working under the reclaim is marked with an override of the gfp
> > > flags to make all allocations under that call chain reclaim safe.
> > >
> > > But rdmacm does allocations outside this, eg in the WQs processing
> > > the CM packets. So this doesn't work and we will deadlock.
> > >
> > > Fixing it is a big deal and needs more than poking WQ_MEM_RECLAIM
> > > here and there.
> >
> > So we will change the ULP in this case to avoid the use of
> > WQ_MEM_RECLAIM where possible. Deadlocks that were possible before
> > are not fixed, but at least we no longer have a false sense of
> > confidence that the stack won't allocate memory during memory
> > reclaim.
> >
> > While we're adjusting these queue_* call sites, ensure the work
> > requests always run on the local CPU so the worker allocates RDMA
> > resources that are local to the CPU that queued the work request.
> >
> > Suggested-by: Leon Romanovsky <leon@kernel.org>
> > Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
> > ---
> >  net/sunrpc/xprtrdma/transport.c |    4 ++--
> >  net/sunrpc/xprtrdma/verbs.c     |   11 ++++-------
> >  2 files changed, 6 insertions(+), 9 deletions(-)
> >
> > Hi Anna-
> >
> > I've had this applied to my test client for a while. I think it's
> > ready to apply.
> >
> >
> > diff --git a/net/sunrpc/xprtrdma/transport.c b/net/sunrpc/xprtrdma/transport.c
> > index bcb37b51adf6..9581641bb8cb 100644
> > --- a/net/sunrpc/xprtrdma/transport.c
> > +++ b/net/sunrpc/xprtrdma/transport.c
> > @@ -494,8 +494,8 @@ xprt_rdma_connect(struct rpc_xprt *xprt, struct rpc_task *task)
> >                 xprt_reconnect_backoff(xprt, RPCRDMA_INIT_REEST_TO);
> >         }
> >         trace_xprtrdma_op_connect(r_xprt, delay);
> > -       queue_delayed_work(xprtiod_workqueue, &r_xprt->rx_connect_worker,
> > -                          delay);
> > +       queue_delayed_work_on(smp_processor_id(), system_long_wq,
> > +                             &r_xprt->rx_connect_worker, delay);
> >  }
> >
> >  /**
> > diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
> > index 2fbe9aaeec34..691afc96bcbc 100644
> > --- a/net/sunrpc/xprtrdma/verbs.c
> > +++ b/net/sunrpc/xprtrdma/verbs.c
> > @@ -791,13 +791,10 @@ void rpcrdma_mrs_refresh(struct rpcrdma_xprt *r_xprt)
> >         /* If there is no underlying connection, it's no use
> >          * to wake the refresh worker.
> >          */
> > -       if (ep->re_connect_status == 1) {
> > -               /* The work is scheduled on a WQ_MEM_RECLAIM
> > -                * workqueue in order to prevent MR allocation
> > -                * from recursing into NFS during direct reclaim.
> > -                */
> > -               queue_work(xprtiod_workqueue, &buf->rb_refresh_worker);
> > -       }
> > +       if (ep->re_connect_status != 1)
> > +               return;
> > +       queue_work_on(smp_processor_id(), system_highpri_wq,
> > +                     &buf->rb_refresh_worker);
> >  }
> >
> >  /**
> >
> >


* Re: [PATCH v1] SUNRPC: Replace the use of the xprtiod WQ in rpcrdma
  2022-09-16 18:28   ` Olga Kornievskaia
@ 2022-09-16 22:40     ` Trond Myklebust
  2022-09-17  2:14       ` Chuck Lever III
  0 siblings, 1 reply; 5+ messages in thread
From: Trond Myklebust @ 2022-09-16 22:40 UTC (permalink / raw)
  To: aglo, chuck.lever; +Cc: linux-nfs, anna.schumaker, linux-rdma

On Fri, 2022-09-16 at 14:28 -0400, Olga Kornievskaia wrote:
> On Thu, Sep 15, 2022 at 11:56 AM Olga Kornievskaia <aglo@umich.edu>
> wrote:
> > 
> > On Tue, Sep 6, 2022 at 12:25 PM Chuck Lever
> > <chuck.lever@oracle.com> wrote:
> > > 
> > > While setting up a new lab, I accidentally misconfigured the
> > > Ethernet port for a system that tried an NFS mount using RoCE.
> > > This made the NFS server unreachable. The following WARNING
> > > popped on the NFS client while waiting for the mount attempt to
> > > time out:
> > 
> > I also hit this today (on the 5.18 kernel) while running xfstest
> > generic/460 using soft iWarp. In my case the port was properly
> > configured. The test was going. I'm not sure exactly what happened.
> > I
> > know I also crashed the server that I was running against. But the
> > point I would like to make is that this condition is possible to
> > get
> > to on a properly configured system.
> 
> But I think with this patch. I'm hitting this instead (of course
> could
> be something else):
> 
> [ 3222.712335] BUG: using smp_processor_id() in preemptible
> [00000000]
> code: 192.168.1.124-m/3814
> [ 3222.714428] caller is xprt_rdma_connect+0x6a/0x120 [rpcrdma]
> [ 3222.716047] CPU: 0 PID: 3814 Comm: 192.168.1.124-m Not tainted
> 6.0.0-rc5+ #123
> [ 3222.717706] Hardware name: VMware, Inc. VMware Virtual
> Platform/440BX Desktop Reference Platform, BIOS 6.00 11/12/2020
> [ 3222.720310] Call Trace:
> [ 3222.721032]  <TASK>
> [ 3222.721587]  dump_stack_lvl+0x33/0x46
> [ 3222.722501]  check_preemption_disabled+0xc3/0xf0
> [ 3222.723754]  xprt_rdma_connect+0x6a/0x120 [rpcrdma]
> [ 3222.725594]  xprt_connect+0x300/0x370 [sunrpc]
> [ 3222.727369]  ? call_reserveresult+0xa0/0xa0 [sunrpc]
> [ 3222.729272]  __rpc_execute+0x162/0x870 [sunrpc]
> [ 3222.731101]  ? rpc_exit+0x40/0x40 [sunrpc]
> [ 3222.732841]  ? __wake_up+0x10/0x10
> [ 3222.733657]  rpc_execute+0x148/0x1b0 [sunrpc]
> [ 3222.735326]  rpc_run_task+0x270/0x2d0 [sunrpc]
> [ 3222.737182]  nfs4_proc_bind_one_conn_to_session+0x1cc/0x3a0
> [nfsv4]
> [ 3222.740472]  ? _nfs4_do_set_security_label+0x2d0/0x2d0 [nfsv4]
> [ 3222.745034]  ? xprt_get+0xa0/0x120 [sunrpc]
> [ 3222.747150]  ? nfs4_proc_bind_one_conn_to_session+0x3a0/0x3a0
> [nfsv4]
> [ 3222.749299]  ? __rcu_read_unlock+0x4e/0x250
> [ 3222.750429]  ? nfs4_proc_bind_one_conn_to_session+0x3a0/0x3a0
> [nfsv4]
> [ 3222.752586]  rpc_clnt_iterate_for_each_xprt+0xc6/0x140 [sunrpc]
> [ 3222.754900]  ? rpc_clnt_xprt_switch_add_xprt+0xa0/0xa0 [sunrpc]
> [ 3222.757041]  ? preempt_count_sub+0x14/0xc0
> [ 3222.758097]  nfs4_proc_bind_conn_to_session+0x87/0xb0 [nfsv4]
> [ 3222.760341]  ? nfs4_proc_secinfo+0x250/0x250 [nfsv4]
> [ 3222.762257]  nfs4_state_manager+0x34e/0xf60 [nfsv4]
> [ 3222.764095]  nfs4_run_state_manager+0x1a6/0x2e0 [nfsv4]
> [ 3222.766778]  ? nfs4_state_manager+0xf60/0xf60 [nfsv4]
> [ 3222.768811]  ? _raw_spin_lock_irqsave+0x8d/0xf0
> [ 3222.770029]  ? _raw_spin_unlock_irqrestore+0x40/0x40
> [ 3222.771740]  ? __list_del_entry_valid+0x77/0xa0
> [ 3222.773400]  ? nfs4_state_manager+0xf60/0xf60 [nfsv4]
> [ 3222.775653]  kthread+0x160/0x190
> [ 3222.776729]  ? kthread_complete_and_exit+0x20/0x20
> [ 3222.777865]  ret_from_fork+0x1f/0x30
> [ 3222.778972]  </TASK>
> 
> 
> > 
> > > kernel: workqueue: WQ_MEM_RECLAIM
> > > xprtiod:xprt_rdma_connect_worker [rpcrdma] is flushing
> > > !WQ_MEM_RECLAI>
> > > kernel: WARNING: CPU: 0 PID: 100 at kernel/workqueue.c:2628
> > > check_flush_dependency+0xbf/0xca
> > > kernel: Modules linked in: rpcsec_gss_krb5 nfsv4 dns_resolver nfs
> > > 8021q garp stp mrp llc rfkill rpcrdma>
> > > kernel: CPU: 0 PID: 100 Comm: kworker/u8:8 Not tainted 6.0.0-rc1-
> > > 00002-g6229f8c054e5 #13
> > > kernel: Hardware name: Supermicro X10SRA-F/X10SRA-F, BIOS 2.0b
> > > 06/12/2017
> > > kernel: Workqueue: xprtiod xprt_rdma_connect_worker [rpcrdma]
> > > kernel: RIP: 0010:check_flush_dependency+0xbf/0xca
> > > kernel: Code: 75 2a 48 8b 55 18 48 8d 8b b0 00 00 00 4d 89 e0 48
> > > 81 c6 b0 00 00 00 48 c7 c7 65 33 2e be>
> > > kernel: RSP: 0018:ffffb562806cfcf8 EFLAGS: 00010092
> > > kernel: RAX: 0000000000000082 RBX: ffff97894f8c3c00 RCX:
> > > 0000000000000027
> > > kernel: RDX: 0000000000000002 RSI: ffffffffbe3447d1 RDI:
> > > 00000000ffffffff
> > > kernel: RBP: ffff978941315840 R08: 0000000000000000 R09:
> > > 0000000000000000
> > > kernel: R10: 00000000000008b0 R11: 0000000000000001 R12:
> > > ffffffffc0ce3731
> > > kernel: R13: ffff978950c00500 R14: ffff97894341f0c0 R15:
> > > ffff978951112eb0
> > > kernel: FS:  0000000000000000(0000) GS:ffff97987fc00000(0000)
> > > knlGS:0000000000000000
> > > kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > kernel: CR2: 00007f807535eae8 CR3: 000000010b8e4002 CR4:
> > > 00000000003706f0
> > > kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2:
> > > 0000000000000000
> > > kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
> > > 0000000000000400
> > > kernel: Call Trace:
> > > kernel:  <TASK>
> > > kernel:  __flush_work.isra.0+0xaf/0x188
> > > kernel:  ? _raw_spin_lock_irqsave+0x2c/0x37
> > > kernel:  ? lock_timer_base+0x38/0x5f
> > > kernel:  __cancel_work_timer+0xea/0x13d
> > > kernel:  ? preempt_latency_start+0x2b/0x46
> > > kernel:  rdma_addr_cancel+0x70/0x81 [ib_core]
> > > kernel:  _destroy_id+0x1a/0x246 [rdma_cm]
> > > kernel:  rpcrdma_xprt_connect+0x115/0x5ae [rpcrdma]
> > > kernel:  ? _raw_spin_unlock+0x14/0x29
> > > kernel:  ? raw_spin_rq_unlock_irq+0x5/0x10
> > > kernel:  ? finish_task_switch.isra.0+0x171/0x249
> > > kernel:  xprt_rdma_connect_worker+0x3b/0xc7 [rpcrdma]
> > > kernel:  process_one_work+0x1d8/0x2d4
> > > kernel:  worker_thread+0x18b/0x24f
> > > kernel:  ? rescuer_thread+0x280/0x280
> > > kernel:  kthread+0xf4/0xfc
> > > kernel:  ? kthread_complete_and_exit+0x1b/0x1b
> > > kernel:  ret_from_fork+0x22/0x30
> > > kernel:  </TASK>
> > > 
> > > SUNRPC's xprtiod workqueue is WQ_MEM_RECLAIM, so any workqueue
> > > that
> > > one of its work items tries to cancel has to be WQ_MEM_RECLAIM to
> > > prevent a priority inversion. The internal workqueues in the
> > > RDMA/core are currently non-MEM_RECLAIM.
> > > 
> > > Jason Gunthorpe says this about the current state of RDMA/core:
> > > > If you attempt to do a reconnection/etc from within a RECLAIM
> > > > context it will deadlock on one of the many allocations that
> > > > are
> > > > made to support opening the connection.
> > > > 
> > > > The general idea of reclaim is that the entire task context
> > > > working under the reclaim is marked with an override of the gfp
> > > > flags to make all allocations under that call chain reclaim
> > > > safe.
> > > > 
> > > > But rdmacm does allocations outside this, eg in the WQs
> > > > processing
> > > > the CM packets. So this doesn't work and we will deadlock.
> > > > 
> > > > Fixing it is a big deal and needs more than poking
> > > > WQ_MEM_RECLAIM
> > > > here and there.
> > > 
> > > So we will change the ULP in this case to avoid the use of
> > > WQ_MEM_RECLAIM where possible. Deadlocks that were possible
> > > before
> > > are not fixed, but at least we no longer have a false sense of
> > > confidence that the stack won't allocate memory during memory
> > > reclaim.
> > > 
> > > While we're adjusting these queue_* call sites, ensure the work
> > > requests always run on the local CPU so the worker allocates RDMA
> > > resources that are local to the CPU that queued the work request.
> > > 
> > > Suggested-by: Leon Romanovsky <leon@kernel.org>
> > > Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
> > > ---
> > >  net/sunrpc/xprtrdma/transport.c |    4 ++--
> > >  net/sunrpc/xprtrdma/verbs.c     |   11 ++++-------
> > >  2 files changed, 6 insertions(+), 9 deletions(-)
> > > 
> > > Hi Anna-
> > > 
> > > I've had this applied to my test client for a while. I think it's
> > > ready to apply.
> > > 
> > > 
> > > diff --git a/net/sunrpc/xprtrdma/transport.c
> > > b/net/sunrpc/xprtrdma/transport.c
> > > index bcb37b51adf6..9581641bb8cb 100644
> > > --- a/net/sunrpc/xprtrdma/transport.c
> > > +++ b/net/sunrpc/xprtrdma/transport.c
> > > @@ -494,8 +494,8 @@ xprt_rdma_connect(struct rpc_xprt *xprt,
> > > struct rpc_task *task)
> > >                 xprt_reconnect_backoff(xprt,
> > > RPCRDMA_INIT_REEST_TO);
> > >         }
> > >         trace_xprtrdma_op_connect(r_xprt, delay);
> > > -       queue_delayed_work(xprtiod_workqueue, &r_xprt-
> > > >rx_connect_worker,
> > > -                          delay);
> > > +       queue_delayed_work_on(smp_processor_id(), system_long_wq,
> > > +                             &r_xprt->rx_connect_worker, delay);
> > >  }
> > > 
> > >  /**
> > > diff --git a/net/sunrpc/xprtrdma/verbs.c
> > > b/net/sunrpc/xprtrdma/verbs.c
> > > index 2fbe9aaeec34..691afc96bcbc 100644
> > > --- a/net/sunrpc/xprtrdma/verbs.c
> > > +++ b/net/sunrpc/xprtrdma/verbs.c
> > > @@ -791,13 +791,10 @@ void rpcrdma_mrs_refresh(struct
> > > rpcrdma_xprt *r_xprt)
> > >         /* If there is no underlying connection, it's no use
> > >          * to wake the refresh worker.
> > >          */
> > > -       if (ep->re_connect_status == 1) {
> > > -               /* The work is scheduled on a WQ_MEM_RECLAIM
> > > -                * workqueue in order to prevent MR allocation
> > > -                * from recursing into NFS during direct reclaim.
> > > -                */
> > > -               queue_work(xprtiod_workqueue, &buf-
> > > >rb_refresh_worker);
> > > -       }
> > > +       if (ep->re_connect_status != 1)
> > > +               return;
> > > +       queue_work_on(smp_processor_id(), system_highpri_wq,
> > > +                     &buf->rb_refresh_worker);
> > >  }
> > > 
> > >  /**
> > > 
> > > 

Right. smp_processor_id() is only allowed to be called when preemption
has been disabled. See Documentation/kernel-hacking/hacking.rst and
Documentation/locking/preempt-locking.rst.

Why not just use queue_work(), Chuck? That achieves the exact same
thing without requiring any extra locking.

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@hammerspace.com




* Re: [PATCH v1] SUNRPC: Replace the use of the xprtiod WQ in rpcrdma
  2022-09-16 22:40     ` Trond Myklebust
@ 2022-09-17  2:14       ` Chuck Lever III
  0 siblings, 0 replies; 5+ messages in thread
From: Chuck Lever III @ 2022-09-17  2:14 UTC (permalink / raw)
  To: Trond Myklebust
  Cc: Olga Kornievskaia, Linux NFS Mailing List, Anna Schumaker, linux-rdma



> On Sep 16, 2022, at 6:40 PM, Trond Myklebust <trondmy@hammerspace.com> wrote:
> 
> On Fri, 2022-09-16 at 14:28 -0400, Olga Kornievskaia wrote:
>> On Thu, Sep 15, 2022 at 11:56 AM Olga Kornievskaia <aglo@umich.edu>
>> wrote:
>>> 
>>> On Tue, Sep 6, 2022 at 12:25 PM Chuck Lever
>>> <chuck.lever@oracle.com> wrote:
>>>> 
>>>> While setting up a new lab, I accidentally misconfigured the
>>>> Ethernet port for a system that tried an NFS mount using RoCE.
>>>> This made the NFS server unreachable. The following WARNING
>>>> popped on the NFS client while waiting for the mount attempt to
>>>> time out:
>>> 
>>> I also hit this today (on the 5.18 kernel) while running xfstest
>>> generic/460 using soft iWarp. In my case the port was properly
>>> configured. The test was going. I'm not sure exactly what happened.
>>> I
>>> know I also crashed the server that I was running against. But the
>>> point I would like to make is that this condition is possible to
>>> get
>>> to on a properly configured system.
>> 
>> But I think with this patch. I'm hitting this instead (of course
>> could
>> be something else):
>> 
>> [ 3222.712335] BUG: using smp_processor_id() in preemptible
>> [00000000]
>> code: 192.168.1.124-m/3814
>> [ 3222.714428] caller is xprt_rdma_connect+0x6a/0x120 [rpcrdma]
>> [ 3222.716047] CPU: 0 PID: 3814 Comm: 192.168.1.124-m Not tainted
>> 6.0.0-rc5+ #123
>> [ 3222.717706] Hardware name: VMware, Inc. VMware Virtual
>> Platform/440BX Desktop Reference Platform, BIOS 6.00 11/12/2020
>> [ 3222.720310] Call Trace:
>> [ 3222.721032]  <TASK>
>> [ 3222.721587]  dump_stack_lvl+0x33/0x46
>> [ 3222.722501]  check_preemption_disabled+0xc3/0xf0
>> [ 3222.723754]  xprt_rdma_connect+0x6a/0x120 [rpcrdma]
>> [ 3222.725594]  xprt_connect+0x300/0x370 [sunrpc]
>> [ 3222.727369]  ? call_reserveresult+0xa0/0xa0 [sunrpc]
>> [ 3222.729272]  __rpc_execute+0x162/0x870 [sunrpc]
>> [ 3222.731101]  ? rpc_exit+0x40/0x40 [sunrpc]
>> [ 3222.732841]  ? __wake_up+0x10/0x10
>> [ 3222.733657]  rpc_execute+0x148/0x1b0 [sunrpc]
>> [ 3222.735326]  rpc_run_task+0x270/0x2d0 [sunrpc]
>> [ 3222.737182]  nfs4_proc_bind_one_conn_to_session+0x1cc/0x3a0
>> [nfsv4]
>> [ 3222.740472]  ? _nfs4_do_set_security_label+0x2d0/0x2d0 [nfsv4]
>> [ 3222.745034]  ? xprt_get+0xa0/0x120 [sunrpc]
>> [ 3222.747150]  ? nfs4_proc_bind_one_conn_to_session+0x3a0/0x3a0
>> [nfsv4]
>> [ 3222.749299]  ? __rcu_read_unlock+0x4e/0x250
>> [ 3222.750429]  ? nfs4_proc_bind_one_conn_to_session+0x3a0/0x3a0
>> [nfsv4]
>> [ 3222.752586]  rpc_clnt_iterate_for_each_xprt+0xc6/0x140 [sunrpc]
>> [ 3222.754900]  ? rpc_clnt_xprt_switch_add_xprt+0xa0/0xa0 [sunrpc]
>> [ 3222.757041]  ? preempt_count_sub+0x14/0xc0
>> [ 3222.758097]  nfs4_proc_bind_conn_to_session+0x87/0xb0 [nfsv4]
>> [ 3222.760341]  ? nfs4_proc_secinfo+0x250/0x250 [nfsv4]
>> [ 3222.762257]  nfs4_state_manager+0x34e/0xf60 [nfsv4]
>> [ 3222.764095]  nfs4_run_state_manager+0x1a6/0x2e0 [nfsv4]
>> [ 3222.766778]  ? nfs4_state_manager+0xf60/0xf60 [nfsv4]
>> [ 3222.768811]  ? _raw_spin_lock_irqsave+0x8d/0xf0
>> [ 3222.770029]  ? _raw_spin_unlock_irqrestore+0x40/0x40
>> [ 3222.771740]  ? __list_del_entry_valid+0x77/0xa0
>> [ 3222.773400]  ? nfs4_state_manager+0xf60/0xf60 [nfsv4]
>> [ 3222.775653]  kthread+0x160/0x190
>> [ 3222.776729]  ? kthread_complete_and_exit+0x20/0x20
>> [ 3222.777865]  ret_from_fork+0x1f/0x30
>> [ 3222.778972]  </TASK>
>> 
>> 
>>> 
>>>> kernel: workqueue: WQ_MEM_RECLAIM
>>>> xprtiod:xprt_rdma_connect_worker [rpcrdma] is flushing
>>>> !WQ_MEM_RECLAI>
>>>> kernel: WARNING: CPU: 0 PID: 100 at kernel/workqueue.c:2628
>>>> check_flush_dependency+0xbf/0xca
>>>> kernel: Modules linked in: rpcsec_gss_krb5 nfsv4 dns_resolver nfs
>>>> 8021q garp stp mrp llc rfkill rpcrdma>
>>>> kernel: CPU: 0 PID: 100 Comm: kworker/u8:8 Not tainted 6.0.0-rc1-
>>>> 00002-g6229f8c054e5 #13
>>>> kernel: Hardware name: Supermicro X10SRA-F/X10SRA-F, BIOS 2.0b
>>>> 06/12/2017
>>>> kernel: Workqueue: xprtiod xprt_rdma_connect_worker [rpcrdma]
>>>> kernel: RIP: 0010:check_flush_dependency+0xbf/0xca
>>>> kernel: Code: 75 2a 48 8b 55 18 48 8d 8b b0 00 00 00 4d 89 e0 48
>>>> 81 c6 b0 00 00 00 48 c7 c7 65 33 2e be>
>>>> kernel: RSP: 0018:ffffb562806cfcf8 EFLAGS: 00010092
>>>> kernel: RAX: 0000000000000082 RBX: ffff97894f8c3c00 RCX:
>>>> 0000000000000027
>>>> kernel: RDX: 0000000000000002 RSI: ffffffffbe3447d1 RDI:
>>>> 00000000ffffffff
>>>> kernel: RBP: ffff978941315840 R08: 0000000000000000 R09:
>>>> 0000000000000000
>>>> kernel: R10: 00000000000008b0 R11: 0000000000000001 R12:
>>>> ffffffffc0ce3731
>>>> kernel: R13: ffff978950c00500 R14: ffff97894341f0c0 R15:
>>>> ffff978951112eb0
>>>> kernel: FS:  0000000000000000(0000) GS:ffff97987fc00000(0000)
>>>> knlGS:0000000000000000
>>>> kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>> kernel: CR2: 00007f807535eae8 CR3: 000000010b8e4002 CR4:
>>>> 00000000003706f0
>>>> kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2:
>>>> 0000000000000000
>>>> kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
>>>> 0000000000000400
>>>> kernel: Call Trace:
>>>> kernel:  <TASK>
>>>> kernel:  __flush_work.isra.0+0xaf/0x188
>>>> kernel:  ? _raw_spin_lock_irqsave+0x2c/0x37
>>>> kernel:  ? lock_timer_base+0x38/0x5f
>>>> kernel:  __cancel_work_timer+0xea/0x13d
>>>> kernel:  ? preempt_latency_start+0x2b/0x46
>>>> kernel:  rdma_addr_cancel+0x70/0x81 [ib_core]
>>>> kernel:  _destroy_id+0x1a/0x246 [rdma_cm]
>>>> kernel:  rpcrdma_xprt_connect+0x115/0x5ae [rpcrdma]
>>>> kernel:  ? _raw_spin_unlock+0x14/0x29
>>>> kernel:  ? raw_spin_rq_unlock_irq+0x5/0x10
>>>> kernel:  ? finish_task_switch.isra.0+0x171/0x249
>>>> kernel:  xprt_rdma_connect_worker+0x3b/0xc7 [rpcrdma]
>>>> kernel:  process_one_work+0x1d8/0x2d4
>>>> kernel:  worker_thread+0x18b/0x24f
>>>> kernel:  ? rescuer_thread+0x280/0x280
>>>> kernel:  kthread+0xf4/0xfc
>>>> kernel:  ? kthread_complete_and_exit+0x1b/0x1b
>>>> kernel:  ret_from_fork+0x22/0x30
>>>> kernel:  </TASK>
>>>> 
>>>> SUNRPC's xprtiod workqueue is WQ_MEM_RECLAIM, so any workqueue
>>>> that
>>>> one of its work items tries to cancel has to be WQ_MEM_RECLAIM to
>>>> prevent a priority inversion. The internal workqueues in the
>>>> RDMA/core are currently non-MEM_RECLAIM.
>>>> 
>>>> Jason Gunthorpe says this about the current state of RDMA/core:
>>>>> If you attempt to do a reconnection/etc from within a RECLAIM
>>>>> context it will deadlock on one of the many allocations that
>>>>> are
>>>>> made to support opening the connection.
>>>>> 
>>>>> The general idea of reclaim is that the entire task context
>>>>> working under the reclaim is marked with an override of the gfp
>>>>> flags to make all allocations under that call chain reclaim
>>>>> safe.
>>>>> 
>>>>> But rdmacm does allocations outside this, eg in the WQs
>>>>> processing
>>>>> the CM packets. So this doesn't work and we will deadlock.
>>>>> 
>>>>> Fixing it is a big deal and needs more than poking
>>>>> WQ_MEM_RECLAIM
>>>>> here and there.
>>>> 
>>>> So we will change the ULP in this case to avoid the use of
>>>> WQ_MEM_RECLAIM where possible. Deadlocks that were possible
>>>> before
>>>> are not fixed, but at least we no longer have a false sense of
>>>> confidence that the stack won't allocate memory during memory
>>>> reclaim.
>>>> 
>>>> While we're adjusting these queue_* call sites, ensure the work
>>>> requests always run on the local CPU so the worker allocates RDMA
>>>> resources that are local to the CPU that queued the work request.
>>>> 
>>>> Suggested-by: Leon Romanovsky <leon@kernel.org>
>>>> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
>>>> ---
>>>>  net/sunrpc/xprtrdma/transport.c |    4 ++--
>>>>  net/sunrpc/xprtrdma/verbs.c     |   11 ++++-------
>>>>  2 files changed, 6 insertions(+), 9 deletions(-)
>>>> 
>>>> Hi Anna-
>>>> 
>>>> I've had this applied to my test client for a while. I think it's
>>>> ready to apply.
>>>> 
>>>> 
>>>> diff --git a/net/sunrpc/xprtrdma/transport.c
>>>> b/net/sunrpc/xprtrdma/transport.c
>>>> index bcb37b51adf6..9581641bb8cb 100644
>>>> --- a/net/sunrpc/xprtrdma/transport.c
>>>> +++ b/net/sunrpc/xprtrdma/transport.c
>>>> @@ -494,8 +494,8 @@ xprt_rdma_connect(struct rpc_xprt *xprt,
>>>> struct rpc_task *task)
>>>>                 xprt_reconnect_backoff(xprt,
>>>> RPCRDMA_INIT_REEST_TO);
>>>>         }
>>>>         trace_xprtrdma_op_connect(r_xprt, delay);
>>>> -       queue_delayed_work(xprtiod_workqueue, &r_xprt-
>>>>> rx_connect_worker,
>>>> -                          delay);
>>>> +       queue_delayed_work_on(smp_processor_id(), system_long_wq,
>>>> +                             &r_xprt->rx_connect_worker, delay);
>>>>  }
>>>> 
>>>>  /**
>>>> diff --git a/net/sunrpc/xprtrdma/verbs.c
>>>> b/net/sunrpc/xprtrdma/verbs.c
>>>> index 2fbe9aaeec34..691afc96bcbc 100644
>>>> --- a/net/sunrpc/xprtrdma/verbs.c
>>>> +++ b/net/sunrpc/xprtrdma/verbs.c
>>>> @@ -791,13 +791,10 @@ void rpcrdma_mrs_refresh(struct
>>>> rpcrdma_xprt *r_xprt)
>>>>         /* If there is no underlying connection, it's no use
>>>>          * to wake the refresh worker.
>>>>          */
>>>> -       if (ep->re_connect_status == 1) {
>>>> -               /* The work is scheduled on a WQ_MEM_RECLAIM
>>>> -                * workqueue in order to prevent MR allocation
>>>> -                * from recursing into NFS during direct reclaim.
>>>> -                */
>>>> -               queue_work(xprtiod_workqueue, &buf-
>>>>> rb_refresh_worker);
>>>> -       }
>>>> +       if (ep->re_connect_status != 1)
>>>> +               return;
>>>> +       queue_work_on(smp_processor_id(), system_highpri_wq,
>>>> +                     &buf->rb_refresh_worker);
>>>>  }
>>>> 
>>>>  /**
>>>> 
>>>> 
> 
> Right. smp_processor_id() is only allowed to be called when preemption
> has been disabled. See Documentation/kernel-hacking/hacking.rst and
> Documentation/locking/preempt-locking.rst.
> 
> Why not just use queue_work(), Chuck? That achieves the exact same
> thing without requiring any extra locking.

The intent was to allocate resources on the NUMA node that owns
the device. The code doesn't do that, I can see now.

I will send you a v2 that does the obvious thing.
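
(A purely hypothetical sketch of queueing near the device's NUMA node, for
illustration only; it is not what v2 does, and at connect time the endpoint
and device may not exist yet. ibdev_to_node(), cpumask_of_node() and
cpumask_any_and() are existing kernel helpers; error and NUMA_NO_NODE
handling is omitted.)

	int node = ibdev_to_node(r_xprt->rx_ep->re_id->device);
	int cpu  = cpumask_any_and(cpumask_of_node(node), cpu_online_mask);

	queue_delayed_work_on(cpu, system_long_wq,
			      &r_xprt->rx_connect_worker, delay);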


--
Chuck Lever



