From: Leon Romanovsky <leon@kernel.org>
To: Haris Iqbal <haris.iqbal@ionos.com>
Cc: RDMA mailing list <linux-rdma@vger.kernel.org>,
	Jason Gunthorpe <jgg@ziepe.ca>, Jinpu Wang <jinpu.wang@ionos.com>,
	Danil Kipnis <danil.kipnis@ionos.com>,
	Aleksei Marov <aleksei.marov@ionos.com>
Subject: Re: Task hung while using RTRS with rxe and using "ip" utility to down the interface
Date: Sun, 14 Nov 2021 08:12:52 +0200
Message-ID: <YZCo5BwbTgKtbImi@unreal>
In-Reply-To: <CAJpMwyjTggq-q8YQd7iPyp_TA29z5mWcEFAKe_Zg0=Z3a843qA@mail.gmail.com>

On Thu, Nov 11, 2021 at 05:25:11PM +0100, Haris Iqbal wrote:
> Hi,
> 
> We are experiencing a task hang while using RTRS over softROCE and
> taking the network interface down with the "ifconfig" or "ip" utilities.
> 
> Steps to reproduce:
> 1) Map an RNBD/RTRS device through a softROCE port.
> 2) Once mapped, bring down the eth interface on which the softROCE
> device was created, using "ifconfig <ethx> down" or
> "ip link set <ethx> down".
> 3) The device errors out, and one can see RTRS connection errors in
> dmesg. So far so good.
> 4) After a while, we see task hung traces in dmesg.
> 
> [  550.866462] INFO: task kworker/1:2:170 blocked for more than 184 seconds.
> [  550.868820]       Tainted: G           O      5.10.42-pserver+ #84
> [  550.869337] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [  550.869963] task:kworker/1:2     state:D stack:    0 pid:  170 ppid:     2 flags:0x00004000
> [  550.870619] Workqueue: rtrs_server_wq rtrs_srv_close_work [rtrs_server]
> [  550.871134] Call Trace:
> [  550.871375]  __schedule+0x421/0x810
> [  550.871683]  schedule+0x46/0xb0
> [  550.871964]  schedule_timeout+0x20e/0x2a0
> [  550.872300]  ? internal_add_timer+0x44/0x70
> [  550.872650]  wait_for_completion+0x86/0xe0
> [  550.872994]  cm_destroy_id+0x18c/0x5a0 [ib_cm]
> [  550.873357]  ? _cond_resched+0x15/0x30
> [  550.873680]  ? wait_for_completion+0x33/0xe0
> [  550.874036]  _destroy_id+0x57/0x210 [rdma_cm]
> [  550.874395]  rtrs_srv_close_work+0xcc/0x250 [rtrs_server]
> [  550.874819]  process_one_work+0x1d4/0x370
> [  550.875156]  worker_thread+0x4a/0x3b0
> [  550.875471]  ? process_one_work+0x370/0x370
> [  550.875817]  kthread+0xfe/0x140
> [  550.876098]  ? kthread_park+0x90/0x90
> [  550.876453]  ret_from_fork+0x1f/0x30
> 
> 
> Our observations so far:
> 
> 1) The hang does not occur if we use "ifdown <ethx>" instead. The
> commands differ (see the link below), but we are not sure that the
> difference should lead to a task hang.
> https://access.redhat.com/solutions/27166
> 2) We have verified kernels v5.10 and v5.15.1; both have this issue.
> 3) We tried the same test with an NVMe-oF target and host over
> softROCE, and we get the same task hang after "ifconfig .. down".
> 
> [Tue Nov  9 14:28:51 2021] INFO: task kworker/1:1:34 blocked for more than 184 seconds.
> [Tue Nov  9 14:28:51 2021]       Tainted: G           O      5.10.42-pserver+ #84
> [Tue Nov  9 14:28:51 2021] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [Tue Nov  9 14:28:51 2021] task:kworker/1:1     state:D stack:    0 pid:   34 ppid:     2 flags:0x00004000
> [Tue Nov  9 14:28:51 2021] Workqueue: events nvmet_rdma_release_queue_work [nvmet_rdma]
> [Tue Nov  9 14:28:51 2021] Call Trace:
> [Tue Nov  9 14:28:51 2021]  __schedule+0x421/0x810
> [Tue Nov  9 14:28:51 2021]  schedule+0x46/0xb0
> [Tue Nov  9 14:28:51 2021]  schedule_timeout+0x20e/0x2a0
> [Tue Nov  9 14:28:51 2021]  ? internal_add_timer+0x44/0x70
> [Tue Nov  9 14:28:51 2021]  wait_for_completion+0x86/0xe0
> [Tue Nov  9 14:28:51 2021]  cm_destroy_id+0x18c/0x5a0 [ib_cm]
> [Tue Nov  9 14:28:51 2021]  ? _cond_resched+0x15/0x30
> [Tue Nov  9 14:28:51 2021]  ? wait_for_completion+0x33/0xe0
> [Tue Nov  9 14:28:51 2021]  _destroy_id+0x57/0x210 [rdma_cm]
> [Tue Nov  9 14:28:51 2021]  nvmet_rdma_free_queue+0x2e/0xc0 [nvmet_rdma]
> [Tue Nov  9 14:28:51 2021]  nvmet_rdma_release_queue_work+0x19/0x50 [nvmet_rdma]
> [Tue Nov  9 14:28:51 2021]  process_one_work+0x1d4/0x370
> [Tue Nov  9 14:28:51 2021]  worker_thread+0x4a/0x3b0
> [Tue Nov  9 14:28:51 2021]  ? process_one_work+0x370/0x370
> [Tue Nov  9 14:28:51 2021]  kthread+0xfe/0x140
> [Tue Nov  9 14:28:51 2021]  ? kthread_park+0x90/0x90
> [Tue Nov  9 14:28:51 2021]  ret_from_fork+0x1f/0x30
> 
> Is this a known issue with ifconfig or the rxe driver? Thoughts?

It doesn't look related to RXE. Judging by the call trace, the issue
is one of two things: a missing call to cm_deref_id() or an extra
call to refcount_inc(&cm_id_priv->refcount). Both can be caused by
callers of the rdma-cm interface, for example by not releasing the
cm_id properly.
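
To illustrate, here is a minimal sketch of the refcount/completion
pattern the trace points at; it is not the actual
drivers/infiniband/core/cm.c code, and the *_sketch names are
hypothetical. cm_destroy_id() drops its own reference and then waits
on a completion that fires only when the last reference goes away, so
one leaked reference leaves the worker stuck in wait_for_completion(),
matching the D-state traces above.

/*
 * Sketch only: illustrates the refcount/completion teardown pattern,
 * not the real ib_cm implementation. All *_sketch names are made up.
 */
#include <linux/refcount.h>
#include <linux/completion.h>

struct cm_id_priv_sketch {
	refcount_t refcount;		/* one ref per user of the cm_id */
	struct completion comp;		/* fired when refcount hits zero */
};

static void cm_deref_id_sketch(struct cm_id_priv_sketch *priv)
{
	/* Last reference gone: wake up whoever is destroying the id. */
	if (refcount_dec_and_test(&priv->refcount))
		complete(&priv->comp);
}

static void cm_destroy_id_sketch(struct cm_id_priv_sketch *priv)
{
	/* Drop the caller's reference ... */
	cm_deref_id_sketch(priv);
	/*
	 * ... and wait for everyone else's. An extra refcount_inc() or
	 * a missing cm_deref_id() blocks here forever, producing the
	 * hung-task trace reported above.
	 */
	wait_for_completion(&priv->comp);
}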

Thanks

> 
> Regards
> -Haris
