From: Jinpu Wang <jinpu.wang@cloud.ionos.com>
To: Leon Romanovsky <leon@kernel.org>
Cc: Jason Gunthorpe <jgg@ziepe.ca>, Jack Wang <xjtuwjp@gmail.com>,
	Bart Van Assche <bvanassche@acm.org>,
	Danil Kipnis <danil.kipnis@cloud.ionos.com>,
	Doug Ledford <dledford@redhat.com>,
	Guoqing Jiang <guoqing.jiang@cloud.ionos.com>,
	linux-rdma@vger.kernel.org
Subject: Re: [PATCH for-next 02/18] RMDA/rtrs-srv: Occasionally flush ongoing session closing
Date: Mon, 4 Jan 2021 09:06:13 +0100	[thread overview]
Message-ID: <CAMGffEm8R8uU=-qKN62gKiUJ1s+obhR6rA+_AWjSP=N55_STTw@mail.gmail.com> (raw)
In-Reply-To: <20201227090118.GG4457@unreal>

On Sun, Dec 27, 2020 at 10:01 AM Leon Romanovsky <leon@kernel.org> wrote:
>
> On Wed, Dec 16, 2020 at 05:42:17PM +0100, Jinpu Wang wrote:
> > On Fri, Dec 11, 2020 at 5:29 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> > >
> > > On Fri, Dec 11, 2020 at 05:17:36PM +0100, Jack Wang wrote:
> > > >    The lockdep was complaining about the new conn_id; I will
> > > >    post the full log if needed next week. Let's skip this patch for
> > > >    now, I will double check!
> > >
> > > That is even more worrisome as the new conn_id already has a different
> > > lock class.
> > >
> > > Jason
> > This is the dmesg output of the lockdep warning. It is from kernel 5.4.77, but
> > the latest 5.10 behaves the same.
> >
> > [  500.071552] ======================================================
> > [  500.071648] WARNING: possible circular locking dependency detected
> > [  500.071869] 5.4.77-storage+ #35 Tainted: G           O
> > [  500.071959] ------------------------------------------------------
> > [  500.072054] kworker/1:1/28 is trying to acquire lock:
> > [  500.072200] ffff99653a624390 (&id_priv->handler_mutex){+.+.}, at: rdma_destroy_id+0x55/0x230 [rdma_cm]
> > [  500.072837]
> >                but task is already holding lock:
> > [  500.072938] ffff9d18800f7e80 ((work_completion)(&sess->close_work)){+.+.}, at: process_one_work+0x223/0x600
> > [  500.075642]
> >                which lock already depends on the new lock.
> >
> > [  500.075759]
> >                the existing dependency chain (in reverse order) is:
> > [  500.075880]
> >                -> #3 ((work_completion)(&sess->close_work)){+.+.}:
> > [  500.076062]        process_one_work+0x278/0x600
> > [  500.076154]        worker_thread+0x2d/0x3d0
> > [  500.076225]        kthread+0x111/0x130
> > [  500.076290]        ret_from_fork+0x24/0x30
> > [  500.076370]
> >                -> #2 ((wq_completion)rtrs_server_wq){+.+.}:
> > [  500.076482]        flush_workqueue+0xab/0x4b0
> > [  500.076565]        rtrs_srv_rdma_cm_handler+0x71d/0x1500 [rtrs_server]
> > [  500.076674]        cma_ib_req_handler+0x8c4/0x14f0 [rdma_cm]
> > [  500.076770]        cm_process_work+0x22/0x140 [ib_cm]
> > [  500.076857]        cm_req_handler+0x900/0xde0 [ib_cm]
> > [  500.076944]        cm_work_handler+0x136/0x1af2 [ib_cm]
> > [  500.077025]        process_one_work+0x29f/0x600
> > [  500.077097]        worker_thread+0x2d/0x3d0
> > [  500.077164]        kthread+0x111/0x130
> > [  500.077224]        ret_from_fork+0x24/0x30
> > [  500.077294]
> >                -> #1 (&id_priv->handler_mutex/1){+.+.}:
> > [  500.077409]        __mutex_lock+0x7e/0x950
> > [  500.077488]        cma_ib_req_handler+0x787/0x14f0 [rdma_cm]
> > [  500.077582]        cm_process_work+0x22/0x140 [ib_cm]
> > [  500.077669]        cm_req_handler+0x900/0xde0 [ib_cm]
> > [  500.077755]        cm_work_handler+0x136/0x1af2 [ib_cm]
> > [  500.077835]        process_one_work+0x29f/0x600
> > [  500.077907]        worker_thread+0x2d/0x3d0
> > [  500.077973]        kthread+0x111/0x130
> > [  500.078034]        ret_from_fork+0x24/0x30
> > [  500.078095]
> >                -> #0 (&id_priv->handler_mutex){+.+.}:
> > [  500.078196]        __lock_acquire+0x1166/0x1440
> > [  500.078267]        lock_acquire+0x90/0x170
> > [  500.078335]        __mutex_lock+0x7e/0x950
> > [  500.078410]        rdma_destroy_id+0x55/0x230 [rdma_cm]
> > [  500.078498]        rtrs_srv_close_work+0xf2/0x2d0 [rtrs_server]
> > [  500.078586]        process_one_work+0x29f/0x600
> > [  500.078662]        worker_thread+0x2d/0x3d0
> > [  500.078732]        kthread+0x111/0x130
> > [  500.078793]        ret_from_fork+0x24/0x30
> > [  500.078859]
> >                other info that might help us debug this:
> >
> > [  500.078984] Chain exists of:
> >                  &id_priv->handler_mutex --> (wq_completion)rtrs_server_wq --> (work_completion)(&sess->close_work)
> >
> > [  500.079207]  Possible unsafe locking scenario:
> >
> > [  500.079293]        CPU0                    CPU1
> > [  500.079358]        ----                    ----
> > [  500.079358]   lock((work_completion)(&sess->close_work));
> > [  500.079358]                                lock((wq_completion)rtrs_server_wq);
> > [  500.079358]                                lock((work_completion)(&sess->close_work));
> > [  500.079358]   lock(&id_priv->handler_mutex);
> > [  500.079358]
> >                 *** DEADLOCK ***
> >
> > [  500.079358] 2 locks held by kworker/1:1/28:
> > [  500.079358]  #0: ffff99652d281f28 ((wq_completion)rtrs_server_wq){+.+.}, at: process_one_work+0x223/0x600
> > [  500.079358]  #1: ffff9d18800f7e80 ((work_completion)(&sess->close_work)){+.+.}, at: process_one_work+0x223/0x600
> > [  500.079358]
> >                stack backtrace:
> > [  500.079358] CPU: 1 PID: 28 Comm: kworker/1:1 Tainted: G           O      5.4.77-storage+ #35
> > [  500.079358] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
> > [  500.079358] Workqueue: rtrs_server_wq rtrs_srv_close_work [rtrs_server]
> > [  500.079358] Call Trace:
> > [  500.079358]  dump_stack+0x71/0x9b
> > [  500.079358]  check_noncircular+0x17d/0x1a0
> > [  500.079358]  ? __lock_acquire+0x1166/0x1440
> > [  500.079358]  __lock_acquire+0x1166/0x1440
> > [  500.079358]  lock_acquire+0x90/0x170
> > [  500.079358]  ? rdma_destroy_id+0x55/0x230 [rdma_cm]
> > [  500.079358]  ? rdma_destroy_id+0x55/0x230 [rdma_cm]
> > [  500.079358]  __mutex_lock+0x7e/0x950
> > [  500.079358]  ? rdma_destroy_id+0x55/0x230 [rdma_cm]
> > [  500.079358]  ? find_held_lock+0x2d/0x90
> > [  500.079358]  ? mark_held_locks+0x49/0x70
> > [  500.079358]  ? rdma_destroy_id+0x55/0x230 [rdma_cm]
> > [  500.079358]  rdma_destroy_id+0x55/0x230 [rdma_cm]
> > [  500.079358]  rtrs_srv_close_work+0xf2/0x2d0 [rtrs_server]
> > [  500.079358]  process_one_work+0x29f/0x600
> > [  500.079358]  worker_thread+0x2d/0x3d0
> > [  500.079358]  ? process_one_work+0x600/0x600
> > [  500.079358]  kthread+0x111/0x130
> > [  500.079358]  ? kthread_park+0x90/0x90
> > [  500.079358]  ret_from_fork+0x24/0x30
> >
> > According to my understanding, in cma_ib_req_handler the conn_id is
> > newly created in
> > https://elixir.bootlin.com/linux/latest/source/drivers/infiniband/core/cma.c#L2222,
> > and the rdma_cm_id associated with conn_id is passed to
> > rtrs_srv_rdma_cm_handler->rtrs_rdma_connect.
> >
> > In rtrs_rdma_connect, the flush_workqueue call only flushes close_work
> > for other cm_ids, not for the newly created conn_id, which has not been
> > associated with anything yet.
>
> How did you come to the conclusion that the rtrs handler was called before
> cma_cm_event_handler()? I'm not so sure about that, and it would explain
> the lockdep.
>
> Thanks
Hi Leon,

I never said that. The call chain here is:
cma_ib_req_handler->cma_cm_event_handler->rtrs_srv_rdma_cm_handler->rtrs_rdma_connect.

To repeat what I wrote in my last email: in cma_ib_req_handler, the
conn_id is newly created in
https://elixir.bootlin.com/linux/latest/source/drivers/infiniband/core/cma.c#L2222,
and the rdma_cm_id associated with conn_id is passed to
rtrs_rdma_connect.

In rtrs_rdma_connect, the flush_workqueue call only flushes close_work
for other cm_ids, not for the newly created conn_id; the rdma_cm_id
passed into rtrs_rdma_connect has not been associated with anything yet.
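
To make the shape of the report easier to follow, here is a rough sketch
of the two paths lockdep chains together. This is illustrative only, not
the actual rtrs-srv code: the stand-in struct rtrs_srv_sess, its fields,
and the trimmed argument list of rtrs_rdma_connect are assumptions for
the sake of the example; only the function names and the workqueue name
are taken from the trace above.

#include <linux/workqueue.h>
#include <rdma/rdma_cm.h>

/* Simplified stand-in for the real session struct */
struct rtrs_srv_sess {
	struct rdma_cm_id	*cm_id;
	struct work_struct	close_work;
};

static struct workqueue_struct *rtrs_server_wq;

/* Path A: queued on rtrs_server_wq when a session is torn down */
static void rtrs_srv_close_work(struct work_struct *work)
{
	struct rtrs_srv_sess *sess =
		container_of(work, struct rtrs_srv_sess, close_work);

	/* rdma_destroy_id() takes id_priv->handler_mutex internally */
	rdma_destroy_id(sess->cm_id);
}

/* Path B: called from cma_ib_req_handler(), which already holds the
 * handler_mutex of the freshly created conn_id.
 */
static int rtrs_rdma_connect(struct rdma_cm_id *cm_id /* , ... */)
{
	/*
	 * Waits for queued close_work items to finish.  Those works belong
	 * to other, already established cm_ids, never to this freshly
	 * created conn_id, so the flush never waits on work tied to the
	 * cm_id whose handler_mutex is currently held.  Lockdep, however,
	 * records the dependency chain shown in the report:
	 *
	 *   &id_priv->handler_mutex --> (wq_completion)rtrs_server_wq
	 *                           --> (work_completion)(&sess->close_work)
	 *
	 * and close_work itself takes handler_mutex via rdma_destroy_id(),
	 * which closes the circle it warns about.
	 */
	flush_workqueue(rtrs_server_wq);

	/* ... accept the connection ... */
	return 0;
}

That is the core of the argument above: at runtime the flush never waits
on work belonging to the new conn_id, but lockdep reasons about lock
classes rather than individual cm_ids, so it still reports the cycle.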

Hope this is now clear.

Happy New Year!


Thread overview: 34+ messages
2020-12-09 16:45 [PATCH for-next 00/18] Misc update for rtrs Jack Wang
2020-12-09 16:45 ` [PATCH for-next 01/18] RDMA/rtrs: Extend ibtrs_cq_qp_create Jack Wang
2020-12-09 16:45 ` [PATCH for-next 02/18] RMDA/rtrs-srv: Occasionally flush ongoing session closing Jack Wang
2020-12-10 14:56   ` Jinpu Wang
2020-12-11  2:33     ` Guoqing Jiang
2020-12-11  6:50       ` Jinpu Wang
2020-12-11  7:26         ` Leon Romanovsky
2020-12-11  7:53           ` Jinpu Wang
2020-12-11  7:58             ` Jinpu Wang
2020-12-11 13:45               ` Jason Gunthorpe
     [not found]                 ` <CAD+HZHXso=S5c=MqgrmDMZpWi10FbPTinWPfLMTkMCCiosihCQ@mail.gmail.com>
2020-12-11 16:29                   ` Jason Gunthorpe
2020-12-16 16:42                     ` Jinpu Wang
2020-12-27  9:01                       ` Leon Romanovsky
2021-01-04  8:06                         ` Jinpu Wang [this message]
2021-01-04  8:25                           ` Leon Romanovsky
2021-01-04 11:04                             ` Jinpu Wang
2020-12-11 20:49               ` Leon Romanovsky
2020-12-09 16:45 ` [PATCH for-next 03/18] RDMA/rtrs-srv: Release lock before call into close_sess Jack Wang
2020-12-09 16:45 ` [PATCH for-next 04/18] RDMA/rtrs-srv: Use sysfs_remove_file_self for disconnect Jack Wang
2020-12-09 16:45 ` [PATCH for-next 05/18] RDMA/rtrs-clt: Set mininum limit when create QP Jack Wang
2020-12-09 16:45 ` [PATCH for-next 06/18] RDMA/rtrs-srv: Jump to dereg_mr label if allocate iu fails Jack Wang
2020-12-09 16:45 ` [PATCH for-next 07/18] RDMA/rtrs: Call kobject_put in the failure path Jack Wang
2020-12-09 16:45 ` [PATCH for-next 08/18] RDMA/rtrs-clt: Consolidate rtrs_clt_destroy_sysfs_root_{folder,files} Jack Wang
2020-12-09 16:45 ` [PATCH for-next 09/18] RDMA/rtrs-clt: Kill wait_for_inflight_permits Jack Wang
2020-12-09 16:45 ` [PATCH for-next 10/18] RDMA/rtrs-clt: Remove unnecessary 'goto out' Jack Wang
2020-12-09 16:45 ` [PATCH for-next 11/18] RDMA/rtrs-clt: Kill rtrs_clt_change_state Jack Wang
2020-12-09 16:45 ` [PATCH for-next 12/18] RDMA/rtrs-clt: Rename __rtrs_clt_change_state to rtrs_clt_change_state Jack Wang
2020-12-09 16:45 ` [PATCH for-next 13/18] RDMA/rtrs-srv: Fix missing wr_cqe Jack Wang
2020-12-09 16:45 ` [PATCH for-next 14/18] RDMA/rtrs-clt: Refactor the failure cases in alloc_clt Jack Wang
2020-12-09 16:45 ` [PATCH for-next 15/18] RDMA/rtrs: Do not signal for heatbeat Jack Wang
2020-12-09 16:45 ` [PATCH for-next 16/18] RDMA/rtrs-clt: Use bitmask to check sess->flags Jack Wang
2020-12-09 16:45 ` [PATCH for-next 17/18] RDMA/rtrs-srv: Do not signal REG_MR Jack Wang
2020-12-09 16:45 ` [PATCH for-next 18/18] RDMA/rtrs-srv: Init wr_cnt as 1 Jack Wang
2020-12-11 19:48 ` [PATCH for-next 00/18] Misc update for rtrs Jason Gunthorpe
