linux-rdma.vger.kernel.org archive mirror
* [PATCH rdma-rc v2] RDMA/mlx5: Use xa_lock_irq when access to SRQ table
@ 2020-07-12 10:26 Leon Romanovsky
  2020-07-12 14:33 ` Matthew Wilcox
  2020-07-16 12:50 ` Jason Gunthorpe
  0 siblings, 2 replies; 4+ messages in thread
From: Leon Romanovsky @ 2020-07-12 10:26 UTC (permalink / raw)
  To: Doug Ledford, Jason Gunthorpe; +Cc: Maor Gottlieb, linux-rdma, Matthew Wilcox

From: Maor Gottlieb <maorg@mellanox.com>

The SRQ table is accessed from both interrupt and process context;
therefore we must use xa_lock_irq.
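
As a generic illustration (a sketch, not the driver code): a lock that is
also taken in hard-IRQ context must be taken with interrupts disabled
everywhere else on the same CPU, otherwise the IRQ can preempt the
process-context holder and spin on the lock forever. This is the pattern
lockdep flags below.

    #include <linux/interrupt.h>
    #include <linux/xarray.h>

    static DEFINE_XARRAY(demo_xa);  /* the xarray embeds the spinlock (xa_lock) */

    /* Hard-IRQ context: interrupts are already off, plain xa_lock() is enough. */
    static irqreturn_t demo_irq_handler(int irq, void *data)
    {
            void *entry;

            xa_lock(&demo_xa);
            entry = xa_load(&demo_xa, 0);
            xa_unlock(&demo_xa);
            return entry ? IRQ_HANDLED : IRQ_NONE;
    }

    /* Process context: must use xa_lock_irq(), otherwise the handler above
     * can interrupt us while we hold the lock and deadlock on this CPU. */
    static void *demo_lookup(unsigned long index)
    {
            void *entry;

            xa_lock_irq(&demo_xa);
            entry = xa_load(&demo_xa, index);
            xa_unlock_irq(&demo_xa);
            return entry;
    }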

 [ 9878.321379] --------------------------------
 [ 9878.322349] inconsistent {IN-HARDIRQ-W} -> {HARDIRQ-ON-W} usage.
 [ 9878.323667] kworker/u17:9/8573 [HC0[0]:SC0[0]:HE1:SE1] takes:
 [ 9878.324894] ffff8883e3503d30 (&xa->xa_lock#13){?...}-{2:2}, at: mlx5_cmd_get_srq+0x18/0x70 [mlx5_ib]
 [ 9878.326816] {IN-HARDIRQ-W} state was registered at:
 [ 9878.327905]   lock_acquire+0xb9/0x3a0
 [ 9878.328720]   _raw_spin_lock+0x25/0x30
 [ 9878.329475]   srq_event_notifier+0x2b/0xc0 [mlx5_ib]
 [ 9878.330433]   notifier_call_chain+0x45/0x70
 [ 9878.331393]   __atomic_notifier_call_chain+0x69/0x100
 [ 9878.332530]   forward_event+0x36/0xc0 [mlx5_core]
 [ 9878.333558]   notifier_call_chain+0x45/0x70
 [ 9878.334418]   __atomic_notifier_call_chain+0x69/0x100
 [ 9878.335498]   mlx5_eq_async_int+0xc5/0x160 [mlx5_core]
 [ 9878.336543]   notifier_call_chain+0x45/0x70
 [ 9878.337354]   __atomic_notifier_call_chain+0x69/0x100
 [ 9878.338337]   mlx5_irq_int_handler+0x19/0x30 [mlx5_core]
 [ 9878.339369]   __handle_irq_event_percpu+0x43/0x2a0
 [ 9878.340382]   handle_irq_event_percpu+0x30/0x70
 [ 9878.341252]   handle_irq_event+0x34/0x60
 [ 9878.342020]   handle_edge_irq+0x7c/0x1b0
 [ 9878.342788]   do_IRQ+0x60/0x110
 [ 9878.343482]   ret_from_intr+0x0/0x2a
 [ 9878.344251]   default_idle+0x34/0x160
 [ 9878.344996]   do_idle+0x1ec/0x220
 [ 9878.345682]   cpu_startup_entry+0x19/0x20
 [ 9878.346511]   start_secondary+0x153/0x1a0
 [ 9878.347328]   secondary_startup_64+0xa4/0xb0
 [ 9878.348226] irq event stamp: 20907
 [ 9878.348953] hardirqs last  enabled at (20907): [<ffffffff819f0eb4>] _raw_spin_unlock_irq+0x24/0x30
 [ 9878.350599] hardirqs last disabled at (20906): [<ffffffff819f0cbf>] _raw_spin_lock_irq+0xf/0x40
 [ 9878.352300] softirqs last  enabled at (20746): [<ffffffff81c002c9>] __do_softirq+0x2c9/0x436
 [ 9878.353859] softirqs last disabled at (20681): [<ffffffff81139543>] irq_exit+0xb3/0xc0
 [ 9878.355365]
 [ 9878.355365] other info that might help us debug this:
 [ 9878.356703]  Possible unsafe locking scenario:
 [ 9878.356703]
 [ 9878.357941]        CPU0
 [ 9878.358522]        ----
 [ 9878.359109]   lock(&xa->xa_lock#13);
 [ 9878.359875]   <Interrupt>
 [ 9878.360504]     lock(&xa->xa_lock#13);
 [ 9878.361315]
 [ 9878.361315]  *** DEADLOCK ***
 [ 9878.361315]
 [ 9878.362632] 2 locks held by kworker/u17:9/8573:
 [ 9878.374883]  #0: ffff888295218d38 ((wq_completion)mlx5_ib_page_fault){+.+.}-{0:0}, at: process_one_work+0x1f1/0x5f0
 [ 9878.376728]  #1: ffff888401647e78 ((work_completion)(&pfault->work)){+.+.}-{0:0}, at: process_one_work+0x1f1/0x5f0
 [ 9878.378550]
 [ 9878.378550] stack backtrace:
 [ 9878.379489] CPU: 0 PID: 8573 Comm: kworker/u17:9 Tainted: G O      5.7.0_for_upstream_min_debug_2020_06_14_11_31_46_41 #1
 [ 9878.381730] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.org 04/01/2014
 [ 9878.383940] Workqueue: mlx5_ib_page_fault mlx5_ib_eqe_pf_action [mlx5_ib]
 [ 9878.385239] Call Trace:
 [ 9878.385822]  dump_stack+0x71/0x9b
 [ 9878.386519]  mark_lock+0x4f2/0x590
 [ 9878.387263]  ? print_shortest_lock_dependencies+0x200/0x200
 [ 9878.388362]  __lock_acquire+0xa00/0x1eb0
 [ 9878.389133]  lock_acquire+0xb9/0x3a0
 [ 9878.389854]  ? mlx5_cmd_get_srq+0x18/0x70 [mlx5_ib]
 [ 9878.390796]  _raw_spin_lock+0x25/0x30
 [ 9878.391533]  ? mlx5_cmd_get_srq+0x18/0x70 [mlx5_ib]
 [ 9878.392455]  mlx5_cmd_get_srq+0x18/0x70 [mlx5_ib]
 [ 9878.393351]  mlx5_ib_eqe_pf_action+0x257/0xa30 [mlx5_ib]
 [ 9878.394337]  ? process_one_work+0x209/0x5f0
 [ 9878.395150]  process_one_work+0x27b/0x5f0
 [ 9878.395939]  ? __schedule+0x280/0x7e0
 [ 9878.396683]  worker_thread+0x2d/0x3c0
 [ 9878.397424]  ? process_one_work+0x5f0/0x5f0
 [ 9878.398249]  kthread+0x111/0x130
 [ 9878.398926]  ? kthread_park+0x90/0x90
 [ 9878.399709]  ret_from_fork+0x24/0x30

Fixes: b02a29eb8841 ("mlx5: Convert mlx5_srq_table to XArray")
Signed-off-by: Maor Gottlieb <maorg@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
---
v2:
 * Dropped wrong hunk
v1:
https://lore.kernel.org/linux-rdma/20200707131551.1153207-1-leon@kernel.org
 * I left Fixes as before to make sure that it is taken properly to stable@.
 * xa_lock_irqsave -> xa_lock_irq (see the sketch after the changelog)
v0:
https://lore.kernel.org/linux-rdma/20200707110612.882962-2-leon@kernel.org
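
In short, the difference between the two variants considered in v1 and v2,
using the names from the hunk below (a sketch; it assumes, as the final
patch does, that mlx5_cmd_get_srq() only runs in process context with
interrupts enabled):

    unsigned long flags;

    /* v1: saves and restores the caller's interrupt state; needed only
     * when the caller may already have interrupts disabled. */
    xa_lock_irqsave(&table->array, flags);
    srq = xa_load(&table->array, srqn);
    xa_unlock_irqrestore(&table->array, flags);

    /* v2: unconditionally disables interrupts and re-enables them on
     * unlock; fine because this path runs with interrupts enabled. */
    xa_lock_irq(&table->array);
    srq = xa_load(&table->array, srqn);
    xa_unlock_irq(&table->array);
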
---
 drivers/infiniband/hw/mlx5/srq_cmd.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/infiniband/hw/mlx5/srq_cmd.c b/drivers/infiniband/hw/mlx5/srq_cmd.c
index 6f5eadc4d183..37aaacebd3f2 100644
--- a/drivers/infiniband/hw/mlx5/srq_cmd.c
+++ b/drivers/infiniband/hw/mlx5/srq_cmd.c
@@ -83,11 +83,11 @@ struct mlx5_core_srq *mlx5_cmd_get_srq(struct mlx5_ib_dev *dev, u32 srqn)
 	struct mlx5_srq_table *table = &dev->srq_table;
 	struct mlx5_core_srq *srq;

-	xa_lock(&table->array);
+	xa_lock_irq(&table->array);
 	srq = xa_load(&table->array, srqn);
 	if (srq)
 		refcount_inc(&srq->common.refcount);
-	xa_unlock(&table->array);
+	xa_unlock_irq(&table->array);

 	return srq;
 }
--
2.26.2



* Re: [PATCH rdma-rc v2] RDMA/mlx5: Use xa_lock_irq when access to SRQ table
  2020-07-12 10:26 [PATCH rdma-rc v2] RDMA/mlx5: Use xa_lock_irq when access to SRQ table Leon Romanovsky
@ 2020-07-12 14:33 ` Matthew Wilcox
  2020-07-12 17:41   ` Leon Romanovsky
  2020-07-16 12:50 ` Jason Gunthorpe
  1 sibling, 1 reply; 4+ messages in thread
From: Matthew Wilcox @ 2020-07-12 14:33 UTC (permalink / raw)
  To: Leon Romanovsky; +Cc: Doug Ledford, Jason Gunthorpe, Maor Gottlieb, linux-rdma

On Sun, Jul 12, 2020 at 01:26:41PM +0300, Leon Romanovsky wrote:
> Fixes: b02a29eb8841 ("mlx5: Convert mlx5_srq_table to XArray")
> Signed-off-by: Maor Gottlieb <maorg@mellanox.com>
> Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
> ---
> v2:
>  * Dropped wrong hunk
> v1:
> https://lore.kernel.org/linux-rdma/20200707131551.1153207-1-leon@kernel.org
>  * I left Fixes as before to make sure that it is taken properly to stable@.
>  * xa_lock_irqsave -> xa_lock_irq

But it won't be taken properly to stable.  It's true that that's the
earliest version that this applies to, but as far as I can tell the
underlying bug is there a lot earlier.  It doesn't appear to be there
in e126ba97dba9edeb6fafa3665b5f8497fc9cdf8c as the cq->lock is taken
irqsafe in mlx5_ib_poll_cq() before calling mlx5_poll_one() which calls
handle_responder() which calls mlx5_core_get_srq(), but it was there
before b02a29eb8841.  I think it's up to you to figure out which version
this bug was introduced in, because stable still cares about trees before
5.2 (which is when b02a29eb8841 was introduced).  You'll also have to
produce a patch for those earlier kernels.
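
A possible shape for such a pre-5.2 backport, where the table was still a
radix tree, might be the sketch below. The struct and field names are
assumed from the old driver layout and would need to be verified against
each stable tree; the irqsave variant is used because callers like the CQ
polling path mentioned above may already hold an irqsave lock with
interrupts disabled:

    struct mlx5_core_srq *mlx5_core_get_srq(struct mlx5_core_dev *dev, u32 srqn)
    {
            struct mlx5_srq_table *table = &dev->priv.srq_table;
            struct mlx5_core_srq *srq;
            unsigned long flags;

            spin_lock_irqsave(&table->lock, flags);         /* assumed original: plain spin_lock() */
            srq = radix_tree_lookup(&table->tree, srqn);
            if (srq)
                    atomic_inc(&srq->refcount);
            spin_unlock_irqrestore(&table->lock, flags);    /* assumed original: plain spin_unlock() */

            return srq;
    }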



* Re: [PATCH rdma-rc v2] RDMA/mlx5: Use xa_lock_irq when access to SRQ table
  2020-07-12 14:33 ` Matthew Wilcox
@ 2020-07-12 17:41   ` Leon Romanovsky
  0 siblings, 0 replies; 4+ messages in thread
From: Leon Romanovsky @ 2020-07-12 17:41 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: Doug Ledford, Jason Gunthorpe, Maor Gottlieb, linux-rdma

On Sun, Jul 12, 2020 at 03:33:06PM +0100, Matthew Wilcox wrote:
> On Sun, Jul 12, 2020 at 01:26:41PM +0300, Leon Romanovsky wrote:
> > Fixes: b02a29eb8841 ("mlx5: Convert mlx5_srq_table to XArray")
> > Signed-off-by: Maor Gottlieb <maorg@mellanox.com>
> > Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
> > ---
> > v2:
> >  * Dropped wrong hunk
> > v1:
> > https://lore.kernel.org/linux-rdma/20200707131551.1153207-1-leon@kernel.org
> >  * I left Fixes as before to make sure that it is taken properly to stable@.
> >  * xa_lock_irqsave -> xa_lock_irq
>
> But it won't be taken properly to stable.  It's true that that's the
> earliest version that this applies to, but as far as I can tell the
> underlying bug is there a lot earlier.  It doesn't appear to be there
> in e126ba97dba9edeb6fafa3665b5f8497fc9cdf8c as the cq->lock is taken
> irqsafe in mlx5_ib_poll_cq() before calling mlx5_poll_one() which calls
> handle_responder() which calls mlx5_core_get_srq(), but it was there
> before b02a29eb8841.  I think it's up to you to figure out which version
> this bug was introduced in, because stable still cares about trees before
> 5.2 (which is when b02a29eb8841 was introduced).  You'll also have to
> produce a patch for those earlier kernels.

Of course, but first I would like to be sure that Sasha's autosel will
work out of the box.

Thanks


* Re: [PATCH rdma-rc v2] RDMA/mlx5: Use xa_lock_irq when access to SRQ table
  2020-07-12 10:26 [PATCH rdma-rc v2] RDMA/mlx5: Use xa_lock_irq when access to SRQ table Leon Romanovsky
  2020-07-12 14:33 ` Matthew Wilcox
@ 2020-07-16 12:50 ` Jason Gunthorpe
  1 sibling, 0 replies; 4+ messages in thread
From: Jason Gunthorpe @ 2020-07-16 12:50 UTC (permalink / raw)
  To: Leon Romanovsky; +Cc: Doug Ledford, Maor Gottlieb, linux-rdma, Matthew Wilcox

On Sun, Jul 12, 2020 at 01:26:41PM +0300, Leon Romanovsky wrote:
> From: Maor Gottlieb <maorg@mellanox.com>
> 
> The SRQ table is accessed from both interrupt and process context;
> therefore we must use xa_lock_irq.
> 
>  [... lockdep splat trimmed ...]
>
> Fixes: b02a29eb8841 ("mlx5: Convert mlx5_srq_table to XArray")
> Signed-off-by: Maor Gottlieb <maorg@mellanox.com>
> Signed-off-by: Leon Romanovsky <leonro@mellanox.com>

Applied to for-rc. I updated the Fixes line; this bug has apparently been
present since day 1.

Jason


