* Re: Task hung while using RTRS with rxe and using "ip" utility to down the interface
2021-11-14 6:12 ` Leon Romanovsky
@ 2021-11-17 11:21 ` Haris Iqbal
2021-11-23 9:06 ` Leon Romanovsky
From: Haris Iqbal @ 2021-11-17 11:21 UTC (permalink / raw)
To: Leon Romanovsky
Cc: RDMA mailing list, Jason Gunthorpe, Jinpu Wang, Danil Kipnis,
Aleksei Marov
On Sun, Nov 14, 2021 at 7:12 AM Leon Romanovsky <leon@kernel.org> wrote:
>
> On Thu, Nov 11, 2021 at 05:25:11PM +0100, Haris Iqbal wrote:
> > Hi,
> >
> > We are experiencing a hang while using RTRS with softROCE when
> > bringing the network interface down.
> >
> > Steps.
> > 1) Map an RNBD/RTRS device through softROCE port.
> > 2) Once mapped, transition the eth interface to down (on which the
> > softROCE interface was created) using command "ifconfig <ethx> down",
> > or "ip link set <ethx> down".
> > 3) The device errors out, and one can see RTRS connection errors in
> > dmesg. So far so good.
> > 4) After a while, we see task hung traces in dmesg.
> >
> > [ 550.866462] INFO: task kworker/1:2:170 blocked for more than 184 seconds.
> > [ 550.868820] Tainted: G O 5.10.42-pserver+ #84
> > [ 550.869337] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> > disables this message.
> > [ 550.869963] task:kworker/1:2 state:D stack: 0 pid: 170
> > ppid: 2 flags:0x00004000
> > [ 550.870619] Workqueue: rtrs_server_wq rtrs_srv_close_work [rtrs_server]
> > [ 550.871134] Call Trace:
> > [ 550.871375] __schedule+0x421/0x810
> > [ 550.871683] schedule+0x46/0xb0
> > [ 550.871964] schedule_timeout+0x20e/0x2a0
> > [ 550.872300] ? internal_add_timer+0x44/0x70
> > [ 550.872650] wait_for_completion+0x86/0xe0
> > [ 550.872994] cm_destroy_id+0x18c/0x5a0 [ib_cm]
> > [ 550.873357] ? _cond_resched+0x15/0x30
> > [ 550.873680] ? wait_for_completion+0x33/0xe0
> > [ 550.874036] _destroy_id+0x57/0x210 [rdma_cm]
> > [ 550.874395] rtrs_srv_close_work+0xcc/0x250 [rtrs_server]
> > [ 550.874819] process_one_work+0x1d4/0x370
> > [ 550.875156] worker_thread+0x4a/0x3b0
> > [ 550.875471] ? process_one_work+0x370/0x370
> > [ 550.875817] kthread+0xfe/0x140
> > [ 550.876098] ? kthread_park+0x90/0x90
> > [ 550.876453] ret_from_fork+0x1f/0x30
> >
> >
> > Our observations till now.
> >
> > 1) Does not occur if we use "ifdown <ethx>" instead. There is a
> > difference between the commands, but we are not sure if the above one
> > should lead to a task hang.
> > https://access.redhat.com/solutions/27166
> > 2) We have verified v5.10 and v5.15.1 kernels, and both have this issue.
> > 3) We tried the same test with an NVMe-oF target and host over softROCE.
> > We get the same task hang after doing "ifconfig .. down".
> >
> > [Tue Nov 9 14:28:51 2021] INFO: task kworker/1:1:34 blocked for more
> > than 184 seconds.
> > [Tue Nov 9 14:28:51 2021] Tainted: G O
> > 5.10.42-pserver+ #84
> > [Tue Nov 9 14:28:51 2021] "echo 0 >
> > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > [Tue Nov 9 14:28:51 2021] task:kworker/1:1 state:D stack: 0
> > pid: 34 ppid: 2 flags:0x00004000
> > [Tue Nov 9 14:28:51 2021] Workqueue: events
> > nvmet_rdma_release_queue_work [nvmet_rdma]
> > [Tue Nov 9 14:28:51 2021] Call Trace:
> > [Tue Nov 9 14:28:51 2021] __schedule+0x421/0x810
> > [Tue Nov 9 14:28:51 2021] schedule+0x46/0xb0
> > [Tue Nov 9 14:28:51 2021] schedule_timeout+0x20e/0x2a0
> > [Tue Nov 9 14:28:51 2021] ? internal_add_timer+0x44/0x70
> > [Tue Nov 9 14:28:51 2021] wait_for_completion+0x86/0xe0
> > [Tue Nov 9 14:28:51 2021] cm_destroy_id+0x18c/0x5a0 [ib_cm]
> > [Tue Nov 9 14:28:51 2021] ? _cond_resched+0x15/0x30
> > [Tue Nov 9 14:28:51 2021] ? wait_for_completion+0x33/0xe0
> > [Tue Nov 9 14:28:51 2021] _destroy_id+0x57/0x210 [rdma_cm]
> > [Tue Nov 9 14:28:51 2021] nvmet_rdma_free_queue+0x2e/0xc0 [nvmet_rdma]
> > [Tue Nov 9 14:28:51 2021] nvmet_rdma_release_queue_work+0x19/0x50 [nvmet_rdma]
> > [Tue Nov 9 14:28:51 2021] process_one_work+0x1d4/0x370
> > [Tue Nov 9 14:28:51 2021] worker_thread+0x4a/0x3b0
> > [Tue Nov 9 14:28:51 2021] ? process_one_work+0x370/0x370
> > [Tue Nov 9 14:28:51 2021] kthread+0xfe/0x140
> > [Tue Nov 9 14:28:51 2021] ? kthread_park+0x90/0x90
> > [Tue Nov 9 14:28:51 2021] ret_from_fork+0x1f/0x30
> >
> > Is this a known issue with ifconfig or the rxe driver? Thoughts?
>
> It doesn't look related to RXE, and judging by the call trace the
> issue is one of two: a missing call to cm_deref_id() or an extra
> call to refcount_inc(&cm_id_priv->refcount). Both can be related to
> the callers of the rdma-cm interface, for example by not releasing
> the cm_id properly.
Hi,
@Aleksei did some more investigation and found the cause of the hang.
1. We enabled tracing for rdma_cma and ib_cma events.
2. When we use ifdown (the case that does not hang), the following
trace can be observed:
kworker/1:2-100 [001] .... 98.191532: cm_disconnect: cm.id=2
src=10.154.0.1:1234 dst=10.154.0.3:53749 tos=0
kworker/1:2-100 [001] dN.. 98.208935: icm_send_drep_err:
local_id=3791663578 remote_id=2340456478 state=TIMEWAIT
lap_state=LAP_UNINIT
kworker/1:2-100 [001] .N.. 98.213924: cm_disconnect: cm.id=3
src=10.154.0.1:1234 dst=10.154.0.3:53749 tos=0
kworker/1:2-100 [001] dN.. 98.218595: icm_send_drep_err:
local_id=3808440794 remote_id=2323679262 state=TIMEWAIT
lap_state=LAP_UNINIT
kworker/1:2-100 [001] .N.. 98.219583: cm_disconnect: cm.id=4
src=10.154.0.1:1234 dst=10.154.0.3:53749 tos=0
kworker/1:2-100 [001] dN.. 98.228200: icm_send_drep_err:
local_id=3758109146 remote_id=2306902046 state=TIMEWAIT
lap_state=LAP_UNINIT
kworker/1:2-100 [001] .... 98.322883: cm_qp_destroy: cm.id=2
src=10.154.0.1:1234 dst=10.154.0.3:53749 tos=0 qp_num=17
kworker/1:2-100 [001] .... 98.322913: cm_id_destroy: cm.id=2
src=10.154.0.1:1234 dst=10.154.0.3:53749 tos=0
kworker/1:2-100 [001] .... 98.332065: cm_qp_destroy: cm.id=3
src=10.154.0.1:1234 dst=10.154.0.3:53749 tos=0 qp_num=18
kworker/1:2-100 [001] .... 98.332115: cm_id_destroy: cm.id=3
src=10.154.0.1:1234 dst=10.154.0.3:53749 tos=0
kworker/1:2-100 [001] .... 98.340362: cm_qp_destroy: cm.id=4
src=10.154.0.1:1234 dst=10.154.0.3:53749 tos=0 qp_num=19
kworker/1:2-100 [001] .... 98.340400: cm_id_destroy: cm.id=4
src=10.154.0.1:1234 dst=10.154.0.3:53749 tos=0
Explanation:
In this example we have 3 cm.ids (2, 3, 4). For each of them we call
rdma_disconnect(), which calls ib_send_cm_dreq(), which calls
cm_send_dreq_locked(). There we hit

    ret = cm_alloc_msg(cm_id_priv, &msg);
    if (ret) {
        cm_enter_timewait(cm_id_priv);
        return ret;
    }

and cm_alloc_msg() returns with an error (coming from rdma_create_ah())
before it ever reaches refcount_inc(&cm_id_priv->refcount).
So the refcount of each cm.id is still 1 after rdma_disconnect().
Hence, when cm_destroy_id() calls cm_deref_id() one more time, the
refcount drops to 0 and we do NOT hang on
wait_for_completion(&cm_id_priv->comp);
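
For reference, here is a simplified sketch of the ordering inside
cm_alloc_msg(), based on our reading of drivers/infiniband/core/cm.c
in v5.10 (locking and MAD buffer setup omitted; "pd" below just stands
for the MAD agent's PD):

    static int cm_alloc_msg(struct cm_id_private *cm_id_priv,
                            struct ib_mad_send_buf **msg)
    {
            struct ib_ah *ah;

            ah = rdma_create_ah(pd, &cm_id_priv->av.ah_attr, 0);
            if (IS_ERR(ah))
                    return PTR_ERR(ah);   /* ifdown case: fails here, so
                                           * the cm_id refcount stays 1 */

            /* ... build the DREQ MAD send buffer ... */

            refcount_inc(&cm_id_priv->refcount);  /* "ifconfig down" case:
                                                   * reached, refcount -> 2 */
            return 0;
    }
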
3. In the case of "ifconfig down" (the one that hangs), the trace is
the following:
kworker/1:0-19 [001] .... 289.211230: cm_disconnect: cm.id=5
src=10.154.0.1:1234 dst=10.154.0.3:60177 tos=0
kworker/1:0-19 [001] d... 289.222918: icm_send_dreq:
local_id=3636223036 remote_id=2994969923 state=ESTABLISHED
lap_state=LAP_UNINIT
kworker/1:0-19 [001] .N.. 289.224789: cm_sent_dreq: cm.id=5
src=10.154.0.1:1234 dst=10.154.0.3:60177 tos=0
kworker/u5:0-63 [001] d... 289.224820: icm_mad_send_err:
state=DREQ_SENT completion status=LOC_PROT_ERR
kworker/u5:0-63 [001] .... 289.224821: cm_event_handler:
cm.id=5 src=10.154.0.1:1234 dst=10.154.0.3:60177 tos=0 DISCONNECTED
(10/-110)
kworker/u5:0-63 [001] .... 289.224822: cm_event_done: cm.id=5
src=10.154.0.1:1234 dst=10.154.0.3:60177 tos=0 DISCONNECTED consumer
returns 0
kworker/1:0-19 [001] .N.. 289.226797: cm_disconnect: cm.id=6
src=10.154.0.1:1234 dst=10.154.0.3:60177 tos=0
kworker/1:0-19 [001] dN.. 289.238556: icm_send_dreq:
local_id=3619445820 remote_id=3045301571 state=ESTABLISHED
lap_state=LAP_UNINIT
kworker/1:0-19 [001] .N.. 289.240126: cm_sent_dreq: cm.id=6
src=10.154.0.1:1234 dst=10.154.0.3:60177 tos=0
kworker/1:0-19 [001] .N.. 289.241162: cm_disconnect: cm.id=7
src=10.154.0.1:1234 dst=10.154.0.3:60177 tos=0
kworker/1:0-19 [001] dN.. 289.251148: icm_send_dreq:
local_id=3602668604 remote_id=3028524355 state=ESTABLISHED
lap_state=LAP_UNINIT
kworker/1:0-19 [001] .N.. 289.252624: cm_sent_dreq: cm.id=7
src=10.154.0.1:1234 dst=10.154.0.3:60177 tos=0
kworker/1:0-19 [001] .... 289.340055: cm_qp_destroy: cm.id=5
src=10.154.0.1:1234 dst=10.154.0.3:60177 tos=0 qp_num=19
kworker/1:0-19 [001] .... 289.340159: cm_id_destroy: cm.id=5
src=10.154.0.1:1234 dst=10.154.0.3:60177 tos=0
kworker/1:0-19 [001] .... 289.367393: cm_qp_destroy: cm.id=6
src=10.154.0.1:1234 dst=10.154.0.3:60177 tos=0 qp_num=20
kworker/1:0-19 [001] .... 289.367445: cm_id_destroy: cm.id=6
src=10.154.0.1:1234 dst=10.154.0.3:60177 tos=0
Explanation:
In this example we have 3 cm.ids (5, 6, 7). For each of them we call
rdma_disconnect(), which calls ib_send_cm_dreq(), which calls
cm_send_dreq_locked(), which calls cm_alloc_msg(). In this case
cm_alloc_msg() does not fail, so it increases the refcount of each
cm.id to 2, and the DREQ is sent.
Later, the send handler is called with LOC_PROT_ERR for cm.id=5 only:
icm_mad_send_err: state=DREQ_SENT completion status=LOC_PROT_ERR
which triggers cm_process_send_error -> cm_free_msg -> cm_deref_id and
drops the refcount of cm.id=5 back to 1.
For the other cm.ids (6, 7) the refcount stays at 2, since their DREQs
were sent but neither completed nor timed out; nothing further happens
with those DREQs.
So later, when cm_destroy_id() is called for cm.id=5 (refcount=1), the
final cm_deref_id() brings the refcount to 0 and we do not hang on
wait_for_completion(&cm_id_priv->comp);
But for cm.id=6 and 7, cm_deref_id() only drops the refcount from 2 to
1, and we wait in wait_for_completion() forever, because the references
held by those in-flight DREQs are never released.
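
The hang itself is the usual deref/wait pattern at the end of
cm_destroy_id(), paraphrased here from drivers/infiniband/core/cm.c:

    static void cm_deref_id(struct cm_id_private *cm_id_priv)
    {
            if (refcount_dec_and_test(&cm_id_priv->refcount))
                    complete(&cm_id_priv->comp);
    }

    /* ...and at the very end of cm_destroy_id(): */
    cm_deref_id(cm_id_priv);
    wait_for_completion(&cm_id_priv->comp);  /* never completes while an
                                              * in-flight DREQ still holds
                                              * a reference */

so as long as one of those DREQ references is never dropped, the close
work is stuck forever.
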
4. The reason why cm_alloc_msg() in one case (ifdown <interface>)
returns an error without increasing the refcount, while in the other
case (ifconfig <interface> down) it takes the reference and returns 0,
is the gid entry state in the ib_gid_table.
Explanation:
cm_alloc_msg() goes through the following call chain:
cm_alloc_msg -> rdma_create_ah -> rdma_lag_get_ah_roce_slave ->
rdma_read_gid_attr_ndev_rcu -> is_gid_entry_valid
which does
    return entry && entry->state == GID_TABLE_ENTRY_VALID;
In the ifdown case the entry is either NULL or in state
GID_TABLE_ENTRY_PENDING_DEL, so the function returns false and
cm_alloc_msg() fails without taking the reference.
In the "ifconfig down" case the entry is still in state
GID_TABLE_ENTRY_VALID, so the function returns true and cm_alloc_msg()
increases the refcount of the cm.id.
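
For completeness, the check that decides between the two cases, as we
read it in drivers/infiniband/core/cache.c (v5.10):

    enum gid_table_entry_state {
            GID_TABLE_ENTRY_INVALID     = 1,
            GID_TABLE_ENTRY_VALID       = 2,
            GID_TABLE_ENTRY_PENDING_DEL = 3,
    };

    static bool is_gid_entry_valid(const struct ib_gid_table_entry *entry)
    {
            /* ifdown:          entry is NULL or PENDING_DEL -> false,
             *                  rdma_create_ah()/cm_alloc_msg() fail early
             * "ifconfig down": entry is still VALID -> true, the DREQ is
             *                  built and the extra cm_id ref is taken */
            return entry && entry->state == GID_TABLE_ENTRY_VALID;
    }
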
We think the problem lies in one of the questions below:
1. Why was cm_process_send_error triggered for only 1 of the 3 sent
DREQs? Why is the send handler not called for the others?
2. When RTRS receives an error and starts closing the connection,
during which rdma_disconnect() is called, shouldn't cm_alloc_msg() fail
in such a case?
The same behavior is observed for RTRS and for NVMe-oF targets.
>
> Thanks
>
> >
> > Regards
> > -Haris