[PATCH rdma-next v1 0/3] Handle FW failures to destroy QP/RQ objects


* [PATCH rdma-next v1 0/3] Handle FW failures to destroy QP/RQ objects
@ 2023-03-16 13:39 Leon Romanovsky
  2023-03-16 13:39 ` [PATCH mlx5-next v1 1/3] net/mlx5: Nullify qp->dbg pointer post destruction Leon Romanovsky
                   ` (2 more replies)
  0 siblings, 3 replies; 10+ messages in thread
From: Leon Romanovsky @ 2023-03-16 13:39 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Leon Romanovsky, Eric Dumazet, Jakub Kicinski, linux-kernel,
	linux-rdma, netdev, Paolo Abeni, Patrisious Haddad,
	Saeed Mahameed

From: Leon Romanovsky <leonro@nvidia.com>

Changelog:
v1: 
 * Dropped EQ changes
v0: https://lore.kernel.org/all/cover.1649139915.git.leonro@nvidia.com
-----------------------------------------------------------------------

Hi,

This series from Patrisious extends the mlx5 driver to convey FW failures
back to the upper layers, allowing them to retry deleting these hardware
resources.

Thanks

Patrisious Haddad (3):
  net/mlx5: Nullify qp->dbg pointer post destruction
  RDMA/mlx5: Handling dct common resource destruction upon firmware
    failure
  RDMA/mlx5: Return the firmware result upon destroying QP/RQ

 drivers/infiniband/hw/mlx5/qpc.c                  | 13 +++++++------
 drivers/net/ethernet/mellanox/mlx5/core/debugfs.c |  6 +++---
 2 files changed, 10 insertions(+), 9 deletions(-)

-- 
2.39.2



* [PATCH mlx5-next v1 1/3] net/mlx5: Nullify qp->dbg pointer post destruction
  2023-03-16 13:39 [PATCH rdma-next v1 0/3] Handle FW failures to destroy QP/RQ objects Leon Romanovsky
@ 2023-03-16 13:39 ` Leon Romanovsky
  2023-03-16 13:39 ` [PATCH rdma-next v1 2/3] RDMA/mlx5: Handling dct common resource destruction upon firmware failure Leon Romanovsky
  2023-03-16 13:39 ` [PATCH rdma-next v1 3/3] RDMA/mlx5: Return the firmware result upon destroying QP/RQ Leon Romanovsky
  2 siblings, 0 replies; 10+ messages in thread
From: Leon Romanovsky @ 2023-03-16 13:39 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Patrisious Haddad, Eric Dumazet, Jakub Kicinski, linux-rdma,
	netdev, Paolo Abeni, Saeed Mahameed

From: Patrisious Haddad <phaddad@nvidia.com>

Nullify qp->dbg as preparation for the next patches in the series, in
which mlx5_core_destroy_qp() can actually fail and then be called again.
Without this change the second call causes a kernel crash, since
qp->dbg was not nullified by the previous call.

Signed-off-by: Patrisious Haddad <phaddad@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/debugfs.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/debugfs.c b/drivers/net/ethernet/mellanox/mlx5/core/debugfs.c
index bb95b40d25eb..b08b5695ee45 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/debugfs.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/debugfs.c
@@ -513,11 +513,11 @@ EXPORT_SYMBOL(mlx5_debug_qp_add);
 
 void mlx5_debug_qp_remove(struct mlx5_core_dev *dev, struct mlx5_core_qp *qp)
 {
-	if (!mlx5_debugfs_root)
+	if (!mlx5_debugfs_root || !qp->dbg)
 		return;
 
-	if (qp->dbg)
-		rem_res_tree(qp->dbg);
+	rem_res_tree(qp->dbg);
+	qp->dbg = NULL;
 }
 EXPORT_SYMBOL(mlx5_debug_qp_remove);
 
-- 
2.39.2
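
For readers following along, the idea behind this patch can be modelled in
user space. This is only a minimal sketch, not the driver code: struct
fake_qp, fake_debug_qp_add(), fake_debug_qp_remove() and remove_calls are
hypothetical stand-ins invented for illustration, and malloc()/free() take
the place of the debugfs resource tree.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct mlx5_core_qp; only the debug handle matters. */
struct fake_qp {
	int *dbg;	/* stand-in for the debugfs resource pointer */
};

/* Counts how many times the debug resource was really released. */
int remove_calls;

/* Allocate the debug resource; returns 0 on success, -1 on failure. */
int fake_debug_qp_add(struct fake_qp *qp)
{
	qp->dbg = malloc(sizeof(*qp->dbg));
	return qp->dbg ? 0 : -1;
}

/*
 * Same shape as the patched function: return early when there is nothing
 * to remove, otherwise release the resource and clear the pointer so that
 * a repeated call becomes a no-op instead of a double free.
 */
void fake_debug_qp_remove(struct fake_qp *qp)
{
	if (!qp->dbg)
		return;

	free(qp->dbg);
	qp->dbg = NULL;
	remove_calls++;
}
```

Calling fake_debug_qp_remove() twice on the same object releases the
resource once; the second call only sees the NULL pointer and returns.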



* [PATCH rdma-next v1 2/3] RDMA/mlx5: Handling dct common resource destruction upon firmware failure
  2023-03-16 13:39 [PATCH rdma-next v1 0/3] Handle FW failures to destroy QP/RQ objects Leon Romanovsky
  2023-03-16 13:39 ` [PATCH mlx5-next v1 1/3] net/mlx5: Nullify qp->dbg pointer post destruction Leon Romanovsky
@ 2023-03-16 13:39 ` Leon Romanovsky
  2023-03-20 19:18   ` Jason Gunthorpe
  2023-03-16 13:39 ` [PATCH rdma-next v1 3/3] RDMA/mlx5: Return the firmware result upon destroying QP/RQ Leon Romanovsky
  2 siblings, 1 reply; 10+ messages in thread
From: Leon Romanovsky @ 2023-03-16 13:39 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Patrisious Haddad, David S. Miller, Eric Dumazet, Jakub Kicinski,
	linux-rdma, netdev, Paolo Abeni, Saeed Mahameed

From: Patrisious Haddad <phaddad@nvidia.com>

Previously, when destroying a DCT, the common resource was destroyed
before the firmware object, so it was gone even if the firmware
destruction command failed.
A second destruction attempt then tried to destroy the common resource
again and executed refcount_dec_and_test(&common->refcount), which
triggers the kernel warning "refcount_t: underflow" and indicates a
possible use-after-free.

Now, before destroying the common resource, check its refcount and
continue with the destruction only if it isn't zero.

refcount_t: underflow; use-after-free.
WARNING: CPU: 8 PID: 1002 at lib/refcount.c:28 refcount_warn_saturate+0xd8/0xe0
Modules linked in: xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink xt_addrtype iptable_nat nf_nat br_netfilter rpcrdma rdma_ucm ib_iser libiscsi scsi_transport_iscsi ib_umad rdma_cm ib_ipoib iw_cm ib_cm mlx5_ib ib_uverbs ib_core overlay mlx5_core fuse
CPU: 8 PID: 1002 Comm: python3 Not tainted 5.16.0-rc5+ #1
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
RIP: 0010:refcount_warn_saturate+0xd8/0xe0
Code: ff 48 c7 c7 18 f5 23 82 c6 05 60 70 ff 00 01 e8 d0 0a 45 00 0f 0b c3 48 c7 c7 c0 f4 23 82 c6 05 4c 70 ff 00 01 e8 ba 0a 45 00 <0f> 0b c3 0f 1f 44 00 00 8b 07 3d 00 00 00 c0 74 12 83 f8 01 74 13
RSP: 0018:ffff8881221d3aa8 EFLAGS: 00010286
RAX: 0000000000000000 RBX: ffff8881313e8d40 RCX: ffff88852cc1b5c8
RDX: 00000000ffffffd8 RSI: 0000000000000027 RDI: ffff88852cc1b5c0
RBP: ffff888100f70000 R08: ffff88853ffd1ba8 R09: 0000000000000003
R10: 00000000fffff000 R11: 3fffffffffffffff R12: 0000000000000246
R13: ffff888100f71fa0 R14: ffff8881221d3c68 R15: 0000000000000020
FS:  00007efebbb13740(0000) GS:ffff88852cc00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00005611aac29f80 CR3: 00000001313de004 CR4: 0000000000370ea0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 <TASK>
 destroy_resource_common+0x6e/0x95 [mlx5_ib]
 mlx5_core_destroy_rq_tracked+0x38/0xbe [mlx5_ib]
 mlx5_ib_destroy_wq+0x22/0x80 [mlx5_ib]
 ib_destroy_wq_user+0x1f/0x40 [ib_core]
 uverbs_free_wq+0x19/0x40 [ib_uverbs]
 destroy_hw_idr_uobject+0x18/0x50 [ib_uverbs]
 uverbs_destroy_uobject+0x2f/0x190 [ib_uverbs]
 uobj_destroy+0x3c/0x80 [ib_uverbs]
 ib_uverbs_cmd_verbs+0x3e4/0xb80 [ib_uverbs]
 ? uverbs_free_wq+0x40/0x40 [ib_uverbs]
 ? ip_list_rcv+0xf7/0x120
 ? netif_receive_skb_list_internal+0x1b6/0x2d0
 ? task_tick_fair+0xbf/0x450
 ? __handle_mm_fault+0x11fc/0x1450
 ib_uverbs_ioctl+0xa4/0x110 [ib_uverbs]
 __x64_sys_ioctl+0x3e4/0x8e0
 ? handle_mm_fault+0xb9/0x210
 do_syscall_64+0x3d/0x90
 entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x7efebc0be17b
Code: 0f 1e fa 48 8b 05 1d ad 0c 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 0f 1f 44 00 00 f3 0f 1e fa b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d ed ac 0c 00 f7 d8 64 89 01 48
RSP: 002b:00007ffe71813e78 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
RAX: ffffffffffffffda RBX: 00007ffe71813fb8 RCX: 00007efebc0be17b
RDX: 00007ffe71813fa0 RSI: 00000000c0181b01 RDI: 0000000000000005
RBP: 00007ffe71813f80 R08: 00005611aae96020 R09: 000000000000004f
R10: 00007efebbf9ffa0 R11: 0000000000000246 R12: 00007ffe71813f80
R13: 00007ffe71813f4c R14: 00005611aae2eca0 R15: 00007efeae6c89d0
 </TASK>

Signed-off-by: Patrisious Haddad <phaddad@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
 drivers/infiniband/hw/mlx5/qpc.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/infiniband/hw/mlx5/qpc.c b/drivers/infiniband/hw/mlx5/qpc.c
index bae0334d6e7f..43d87bdcaf9c 100644
--- a/drivers/infiniband/hw/mlx5/qpc.c
+++ b/drivers/infiniband/hw/mlx5/qpc.c
@@ -179,6 +179,9 @@ static void destroy_resource_common(struct mlx5_ib_dev *dev,
 	struct mlx5_qp_table *table = &dev->qp_table;
 	unsigned long flags;
 
+	if (refcount_read(&qp->common.refcount) == 0)
+		return;
+
 	spin_lock_irqsave(&table->lock, flags);
 	radix_tree_delete(&table->tree,
 			  qp->qpn | (qp->common.res << MLX5_USER_INDEX_LEN));
-- 
2.39.2



* [PATCH rdma-next v1 3/3] RDMA/mlx5: Return the firmware result upon destroying QP/RQ
  2023-03-16 13:39 [PATCH rdma-next v1 0/3] Handle FW failures to destroy QP/RQ objects Leon Romanovsky
  2023-03-16 13:39 ` [PATCH mlx5-next v1 1/3] net/mlx5: Nullify qp->dbg pointer post destruction Leon Romanovsky
  2023-03-16 13:39 ` [PATCH rdma-next v1 2/3] RDMA/mlx5: Handling dct common resource destruction upon firmware failure Leon Romanovsky
@ 2023-03-16 13:39 ` Leon Romanovsky
  2 siblings, 0 replies; 10+ messages in thread
From: Leon Romanovsky @ 2023-03-16 13:39 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Patrisious Haddad, David S. Miller, Eric Dumazet, Jakub Kicinski,
	linux-rdma, netdev, Paolo Abeni, Saeed Mahameed

From: Patrisious Haddad <phaddad@nvidia.com>

Previously, when destroying a QP/RQ, the result of the firmware
destruction function was ignored and upper layers weren't informed
about the failure.
An upper layer unaware of the failure continued its operation as if the
related QP/RQ had been successfully destroyed when it actually hadn't,
which could lead to various problems, including the kernel WARN below.

Now the firmware destruction status is returned to the upper layers.
For an RQ that is mlx5_ib_destroy_wq(), which was already capable of
handling RQ destruction failure. For a QP it is destroy_qp_common(),
which now actually warns upon QP destruction failure.

WARNING: CPU: 3 PID: 995 at drivers/infiniband/core/rdma_core.c:940 uverbs_destroy_ufile_hw+0xcb/0xe0 [ib_uverbs]
Modules linked in: xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink xt_addrtype iptable_nat nf_nat br_netfilter rpcrdma rdma_ucm ib_iser libiscsi scsi_transport_iscsi rdma_cm ib_umad ib_ipoib iw_cm ib_cm mlx5_ib ib_uverbs ib_core overlay mlx5_core fuse
CPU: 3 PID: 995 Comm: python3 Not tainted 5.16.0-rc5+ #1
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
RIP: 0010:uverbs_destroy_ufile_hw+0xcb/0xe0 [ib_uverbs]
Code: 41 5c 41 5d 41 5e e9 44 34 f0 e0 48 89 df e8 4c 77 ff ff 49 8b 86 10 01 00 00 48 85 c0 74 a1 4c 89 e7 ff d0 eb 9a 0f 0b eb c1 <0f> 0b be 04 00 00 00 48 89 df e8 b6 f6 ff ff e9 75 ff ff ff 90 0f
RSP: 0018:ffff8881533e3e78 EFLAGS: 00010287
RAX: ffff88811b2cf3e0 RBX: ffff888106209700 RCX: 0000000000000000
RDX: ffff888106209780 RSI: ffff8881533e3d30 RDI: ffff888109b101a0
RBP: 0000000000000001 R08: ffff888127cb381c R09: 0de9890000000009
R10: ffff888127cb3800 R11: 0000000000000000 R12: ffff888106209780
R13: ffff888106209750 R14: ffff888100f20660 R15: 0000000000000000
FS:  00007f8be353b740(0000) GS:ffff88852c980000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f8bd5b117c0 CR3: 000000012cd8a004 CR4: 0000000000370ea0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 <TASK>
 ib_uverbs_close+0x1a/0x90 [ib_uverbs]
 __fput+0x82/0x230
 task_work_run+0x59/0x90
 exit_to_user_mode_prepare+0x138/0x140
 syscall_exit_to_user_mode+0x1d/0x50
 ? __x64_sys_close+0xe/0x40
 do_syscall_64+0x4a/0x90
 entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x7f8be3ae0abb
Code: 03 00 00 00 0f 05 48 3d 00 f0 ff ff 77 41 c3 48 83 ec 18 89 7c 24 0c e8 83 43 f9 ff 8b 7c 24 0c 41 89 c0 b8 03 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 35 44 89 c7 89 44 24 0c e8 c1 43 f9 ff 8b 44
RSP: 002b:00007ffdb51909c0 EFLAGS: 00000293 ORIG_RAX: 0000000000000003
RAX: 0000000000000000 RBX: 0000557bb7f7c020 RCX: 00007f8be3ae0abb
RDX: 0000557bb7c74010 RSI: 0000557bb7f14ca0 RDI: 0000000000000005
RBP: 0000557bb7fbd598 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000293 R12: 0000557bb7fbd5b8
R13: 0000557bb7fbd5a8 R14: 0000000000001000 R15: 0000557bb7f7c020
 </TASK>

Signed-off-by: Patrisious Haddad <phaddad@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
 drivers/infiniband/hw/mlx5/qpc.c | 10 ++++------
 1 file changed, 4 insertions(+), 6 deletions(-)

diff --git a/drivers/infiniband/hw/mlx5/qpc.c b/drivers/infiniband/hw/mlx5/qpc.c
index 43d87bdcaf9c..eeee18af36ed 100644
--- a/drivers/infiniband/hw/mlx5/qpc.c
+++ b/drivers/infiniband/hw/mlx5/qpc.c
@@ -301,8 +301,7 @@ int mlx5_core_destroy_qp(struct mlx5_ib_dev *dev, struct mlx5_core_qp *qp)
 	MLX5_SET(destroy_qp_in, in, opcode, MLX5_CMD_OP_DESTROY_QP);
 	MLX5_SET(destroy_qp_in, in, qpn, qp->qpn);
 	MLX5_SET(destroy_qp_in, in, uid, qp->uid);
-	mlx5_cmd_exec_in(dev->mdev, destroy_qp, in);
-	return 0;
+	return mlx5_cmd_exec_in(dev->mdev, destroy_qp, in);
 }
 
 int mlx5_core_set_delay_drop(struct mlx5_ib_dev *dev,
@@ -554,14 +553,14 @@ int mlx5_core_xrcd_dealloc(struct mlx5_ib_dev *dev, u32 xrcdn)
 	return mlx5_cmd_exec_in(dev->mdev, dealloc_xrcd, in);
 }
 
-static void destroy_rq_tracked(struct mlx5_ib_dev *dev, u32 rqn, u16 uid)
+static int destroy_rq_tracked(struct mlx5_ib_dev *dev, u32 rqn, u16 uid)
 {
 	u32 in[MLX5_ST_SZ_DW(destroy_rq_in)] = {};
 
 	MLX5_SET(destroy_rq_in, in, opcode, MLX5_CMD_OP_DESTROY_RQ);
 	MLX5_SET(destroy_rq_in, in, rqn, rqn);
 	MLX5_SET(destroy_rq_in, in, uid, uid);
-	mlx5_cmd_exec_in(dev->mdev, destroy_rq, in);
+	return mlx5_cmd_exec_in(dev->mdev, destroy_rq, in);
 }
 
 int mlx5_core_create_rq_tracked(struct mlx5_ib_dev *dev, u32 *in, int inlen,
@@ -592,8 +591,7 @@ int mlx5_core_destroy_rq_tracked(struct mlx5_ib_dev *dev,
 				 struct mlx5_core_qp *rq)
 {
 	destroy_resource_common(dev, rq);
-	destroy_rq_tracked(dev, rq->qpn, rq->uid);
-	return 0;
+	return destroy_rq_tracked(dev, rq->qpn, rq->uid);
 }
 
 static void destroy_sq_tracked(struct mlx5_ib_dev *dev, u32 sqn, u16 uid)
-- 
2.39.2



* Re: [PATCH rdma-next v1 2/3] RDMA/mlx5: Handling dct common resource destruction upon firmware failure
  2023-03-16 13:39 ` [PATCH rdma-next v1 2/3] RDMA/mlx5: Handling dct common resource destruction upon firmware failure Leon Romanovsky
@ 2023-03-20 19:18   ` Jason Gunthorpe
  2023-03-21  7:54     ` Leon Romanovsky
  0 siblings, 1 reply; 10+ messages in thread
From: Jason Gunthorpe @ 2023-03-20 19:18 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Patrisious Haddad, David S. Miller, Eric Dumazet, Jakub Kicinski,
	linux-rdma, netdev, Paolo Abeni, Saeed Mahameed

On Thu, Mar 16, 2023 at 03:39:27PM +0200, Leon Romanovsky wrote:
> From: Patrisious Haddad <phaddad@nvidia.com>
> 
> Previously when destroying a DCT, if the firmware function for the
> destruction failed, the common resource would have been destroyed
> either way, since it was destroyed before the firmware object.
> Which leads to kernel warning "refcount_t: underflow" which indicates
> possible use-after-free.
> Which is triggered when we try to destroy the common resource for the
> second time and execute refcount_dec_and_test(&common->refcount).
> 
> So, currently before destroying the common resource we check its
> refcount and continue with the destruction only if it isn't zero.

This seems super sketchy

If the destruction fails why not set the refcount back to 1?

Jason


* Re: [PATCH rdma-next v1 2/3] RDMA/mlx5: Handling dct common resource destruction upon firmware failure
  2023-03-20 19:18   ` Jason Gunthorpe
@ 2023-03-21  7:54     ` Leon Romanovsky
  2023-03-21 11:53       ` Jason Gunthorpe
  0 siblings, 1 reply; 10+ messages in thread
From: Leon Romanovsky @ 2023-03-21  7:54 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Patrisious Haddad, David S. Miller, Eric Dumazet, Jakub Kicinski,
	linux-rdma, netdev, Paolo Abeni, Saeed Mahameed

On Mon, Mar 20, 2023 at 04:18:14PM -0300, Jason Gunthorpe wrote:
> On Thu, Mar 16, 2023 at 03:39:27PM +0200, Leon Romanovsky wrote:
> > From: Patrisious Haddad <phaddad@nvidia.com>
> > 
> > Previously when destroying a DCT, if the firmware function for the
> > destruction failed, the common resource would have been destroyed
> > either way, since it was destroyed before the firmware object.
> > Which leads to kernel warning "refcount_t: underflow" which indicates
> > possible use-after-free.
> > Which is triggered when we try to destroy the common resource for the
> > second time and execute refcount_dec_and_test(&common->refcount).
> > 
> > So, currently before destroying the common resource we check its
> > refcount and continue with the destruction only if it isn't zero.
> 
> This seems super sketchy
> 
> If the destruction fails why not set the refcount back to 1?

Because destruction will fail in destroy_rq_tracked(), which is called after
destroy_resource_common().

In the first destruction attempt, we delete the qp from the radix tree and wait
for all references to drop. Rather than undo all this logic (setting the
refcount to 1 alone is not enough), it is much safer to simply skip
destroy_resource_common() in the reentry case.

Failure to delete means that something external to the kernel holds a reference
to that QP, but it is safe to delete it from the kernel side, as nothing in the
kernel can use it after the call to destroy_resource_common().

Thanks

> 
> Jason
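
The reentry argument above can be sketched in user space. This is a
simplified model under stated assumptions, not the mlx5 code: struct
fake_rsc, fake_destroy_resource_common(), fake_fw_destroy(),
fake_destroy_rq(), teardown_runs and fw_failures_left are all hypothetical
names, a plain counter replaces refcount_t, and a boolean replaces the
radix tree entry.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for the common resource embedded in a QP/RQ. */
struct fake_rsc {
	unsigned int refcount;	/* plain counter; the kernel uses refcount_t */
	bool in_table;		/* stand-in for the radix tree entry */
};

int teardown_runs;	/* how many times the software teardown really ran */
int fw_failures_left;	/* the firmware stub fails this many times first */

/*
 * Software teardown. The first call removes the table entry and drops the
 * last reference. A reentry sees refcount == 0 and returns immediately,
 * which is the guard the patch adds.
 */
void fake_destroy_resource_common(struct fake_rsc *rsc)
{
	if (rsc->refcount == 0)
		return;

	rsc->in_table = false;
	rsc->refcount--;
	teardown_runs++;
}

/* Firmware stub: returns -EIO until fw_failures_left reaches zero. */
int fake_fw_destroy(void)
{
	if (fw_failures_left > 0) {
		fw_failures_left--;
		return -EIO;
	}
	return 0;
}

/* Teardown first, firmware command second, firmware status returned. */
int fake_destroy_rq(struct fake_rsc *rsc)
{
	fake_destroy_resource_common(rsc);
	return fake_fw_destroy();
}
```

With one injected firmware failure, the first fake_destroy_rq() call
returns -EIO, the retry returns 0, and the software teardown runs only
once. The model also shows the cost of this ordering: after the failed
first attempt the entry is already out of the table while the firmware
object still exists.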


* Re: [PATCH rdma-next v1 2/3] RDMA/mlx5: Handling dct common resource destruction upon firmware failure
  2023-03-21  7:54     ` Leon Romanovsky
@ 2023-03-21 11:53       ` Jason Gunthorpe
  2023-03-21 12:02         ` Leon Romanovsky
  0 siblings, 1 reply; 10+ messages in thread
From: Jason Gunthorpe @ 2023-03-21 11:53 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Patrisious Haddad, David S. Miller, Eric Dumazet, Jakub Kicinski,
	linux-rdma, netdev, Paolo Abeni, Saeed Mahameed

On Tue, Mar 21, 2023 at 09:54:58AM +0200, Leon Romanovsky wrote:
> On Mon, Mar 20, 2023 at 04:18:14PM -0300, Jason Gunthorpe wrote:
> > On Thu, Mar 16, 2023 at 03:39:27PM +0200, Leon Romanovsky wrote:
> > > From: Patrisious Haddad <phaddad@nvidia.com>
> > > 
> > > Previously when destroying a DCT, if the firmware function for the
> > > destruction failed, the common resource would have been destroyed
> > > either way, since it was destroyed before the firmware object.
> > > Which leads to kernel warning "refcount_t: underflow" which indicates
> > > possible use-after-free.
> > > Which is triggered when we try to destroy the common resource for the
> > > second time and execute refcount_dec_and_test(&common->refcount).
> > > 
> > > So, currently before destroying the common resource we check its
> > > refcount and continue with the destruction only if it isn't zero.
> > 
> > This seems super sketchy
> > 
> > If the destruction fails why not set the refcount back to 1?
> 
> Because destruction will fail in destroy_rq_tracked() which is after
> destroy_resource_common().
> 
> In first destruction attempt, we delete qp from radix tree and wait for all
> reference to drop. In order do not undo all this logic (setting 1 alone is
> not enough), it is much safer simply skip destroy_resource_common() in reentry
> case.

This is the bug I pointed out a long time ago: it is the wrong order to
remove restrack before destruction is assured.

Jason


* Re: [PATCH rdma-next v1 2/3] RDMA/mlx5: Handling dct common resource destruction upon firmware failure
  2023-03-21 11:53       ` Jason Gunthorpe
@ 2023-03-21 12:02         ` Leon Romanovsky
  2023-03-21 12:37           ` Jason Gunthorpe
  0 siblings, 1 reply; 10+ messages in thread
From: Leon Romanovsky @ 2023-03-21 12:02 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Patrisious Haddad, David S. Miller, Eric Dumazet, Jakub Kicinski,
	linux-rdma, netdev, Paolo Abeni, Saeed Mahameed

On Tue, Mar 21, 2023 at 08:53:35AM -0300, Jason Gunthorpe wrote:
> On Tue, Mar 21, 2023 at 09:54:58AM +0200, Leon Romanovsky wrote:
> > On Mon, Mar 20, 2023 at 04:18:14PM -0300, Jason Gunthorpe wrote:
> > > On Thu, Mar 16, 2023 at 03:39:27PM +0200, Leon Romanovsky wrote:
> > > > From: Patrisious Haddad <phaddad@nvidia.com>
> > > > 
> > > > Previously when destroying a DCT, if the firmware function for the
> > > > destruction failed, the common resource would have been destroyed
> > > > either way, since it was destroyed before the firmware object.
> > > > Which leads to kernel warning "refcount_t: underflow" which indicates
> > > > possible use-after-free.
> > > > Which is triggered when we try to destroy the common resource for the
> > > > second time and execute refcount_dec_and_test(&common->refcount).
> > > > 
> > > > So, currently before destroying the common resource we check its
> > > > refcount and continue with the destruction only if it isn't zero.
> > > 
> > > This seems super sketchy
> > > 
> > > If the destruction fails why not set the refcount back to 1?
> > 
> > Because destruction will fail in destroy_rq_tracked() which is after
> > destroy_resource_common().
> > 
> > In first destruction attempt, we delete qp from radix tree and wait for all
> > reference to drop. In order do not undo all this logic (setting 1 alone is
> > not enough), it is much safer simply skip destroy_resource_common() in reentry
> > case.
> 
> This is the bug I pointed a long time ago, it is ordered wrong to
> remove restrack before destruction is assured

It is not restrack, but a structure internal to mlx5_core.

static void destroy_resource_common(struct mlx5_ib_dev *dev,
                                    struct mlx5_core_qp *qp)
{
        struct mlx5_qp_table *table = &dev->qp_table;
        unsigned long flags;

....

        spin_lock_irqsave(&table->lock, flags);
        radix_tree_delete(&table->tree,
                          qp->qpn | (qp->common.res << MLX5_USER_INDEX_LEN));
        spin_unlock_irqrestore(&table->lock, flags);
        mlx5_core_put_rsc((struct mlx5_core_rsc_common *)qp);
        wait_for_completion(&qp->common.free);
}


> 
> Jason


* Re: [PATCH rdma-next v1 2/3] RDMA/mlx5: Handling dct common resource destruction upon firmware failure
  2023-03-21 12:02         ` Leon Romanovsky
@ 2023-03-21 12:37           ` Jason Gunthorpe
  2023-03-21 12:43             ` Leon Romanovsky
  0 siblings, 1 reply; 10+ messages in thread
From: Jason Gunthorpe @ 2023-03-21 12:37 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Patrisious Haddad, David S. Miller, Eric Dumazet, Jakub Kicinski,
	linux-rdma, netdev, Paolo Abeni, Saeed Mahameed

On Tue, Mar 21, 2023 at 02:02:59PM +0200, Leon Romanovsky wrote:
> On Tue, Mar 21, 2023 at 08:53:35AM -0300, Jason Gunthorpe wrote:
> > On Tue, Mar 21, 2023 at 09:54:58AM +0200, Leon Romanovsky wrote:
> > > On Mon, Mar 20, 2023 at 04:18:14PM -0300, Jason Gunthorpe wrote:
> > > > On Thu, Mar 16, 2023 at 03:39:27PM +0200, Leon Romanovsky wrote:
> > > > > From: Patrisious Haddad <phaddad@nvidia.com>
> > > > > 
> > > > > Previously when destroying a DCT, if the firmware function for the
> > > > > destruction failed, the common resource would have been destroyed
> > > > > either way, since it was destroyed before the firmware object.
> > > > > Which leads to kernel warning "refcount_t: underflow" which indicates
> > > > > possible use-after-free.
> > > > > Which is triggered when we try to destroy the common resource for the
> > > > > second time and execute refcount_dec_and_test(&common->refcount).
> > > > > 
> > > > > So, currently before destroying the common resource we check its
> > > > > refcount and continue with the destruction only if it isn't zero.
> > > > 
> > > > This seems super sketchy
> > > > 
> > > > If the destruction fails why not set the refcount back to 1?
> > > 
> > > Because destruction will fail in destroy_rq_tracked() which is after
> > > destroy_resource_common().
> > > 
> > > In first destruction attempt, we delete qp from radix tree and wait for all
> > > reference to drop. In order do not undo all this logic (setting 1 alone is
> > > not enough), it is much safer simply skip destroy_resource_common() in reentry
> > > case.
> > 
> > This is the bug I pointed a long time ago, it is ordered wrong to
> > remove restrack before destruction is assured
> 
> It is not restrack, but internal to mlx5_core structure.
> 
>   176 static void destroy_resource_common(struct mlx5_ib_dev *dev,
>   177                                     struct mlx5_core_qp *qp)
>   178 {
>   179         struct mlx5_qp_table *table = &dev->qp_table;
>   180         unsigned long flags;
>   181
> 
> ....
> 
>   185         spin_lock_irqsave(&table->lock, flags);
>   186         radix_tree_delete(&table->tree,
>   187                           qp->qpn | (qp->common.res << MLX5_USER_INDEX_LEN));
>   188         spin_unlock_irqrestore(&table->lock, flags);
>   189         mlx5_core_put_rsc((struct mlx5_core_rsc_common *)qp);
>   190         wait_for_completion(&qp->common.free);
>   191 }

Same basic issue.

"RSC"'s refcount stuff is really only for ODP to use, and the silly
pseudo locking should really just be rwsem not a refcount.

Get DCT out of that particular mess and the scheme is quite simple and
doesn't need hacky stuff.

Please make a patch to remove radix tree from this code too...

diff --git a/drivers/infiniband/hw/mlx5/qpc.c b/drivers/infiniband/hw/mlx5/qpc.c
index bae0334d6e7f18..68009bff4bd544 100644
--- a/drivers/infiniband/hw/mlx5/qpc.c
+++ b/drivers/infiniband/hw/mlx5/qpc.c
@@ -88,23 +88,34 @@ static bool is_event_type_allowed(int rsc_type, int event_type)
 	}
 }
 
+static int dct_event_notifier(struct mlx5_ib_dev *dev, struct mlx5_eqe *eqe)
+{
+	struct mlx5_core_dct *dct;
+	u32 qpn;
+
+	qpn = be32_to_cpu(eqe->data.dct.dctn) & 0xffffff;
+	xa_lock(&dev->qp_table.dct_xa);
+	dct = xa_load(&dev->qp_table.dct_xa, qpn);
+	if (dct)
+		complete(&dct->drained);
+	xa_unlock(&dev->qp_table.dct_xa);
+	return NOTIFY_OK;
+}
+
 static int rsc_event_notifier(struct notifier_block *nb,
 			      unsigned long type, void *data)
 {
+	struct mlx5_ib_dev *dev =
+		container_of(nb, struct mlx5_ib_dev, qp_table.nb);
 	struct mlx5_core_rsc_common *common;
-	struct mlx5_qp_table *table;
-	struct mlx5_core_dct *dct;
+	struct mlx5_eqe *eqe = data;
 	u8 event_type = (u8)type;
 	struct mlx5_core_qp *qp;
-	struct mlx5_eqe *eqe;
 	u32 rsn;
 
 	switch (event_type) {
 	case MLX5_EVENT_TYPE_DCT_DRAINED:
-		eqe = data;
-		rsn = be32_to_cpu(eqe->data.dct.dctn) & 0xffffff;
-		rsn |= (MLX5_RES_DCT << MLX5_USER_INDEX_LEN);
-		break;
+		return dct_event_notifier(dev, eqe);
 	case MLX5_EVENT_TYPE_PATH_MIG:
 	case MLX5_EVENT_TYPE_COMM_EST:
 	case MLX5_EVENT_TYPE_SQ_DRAINED:
@@ -113,7 +124,6 @@ static int rsc_event_notifier(struct notifier_block *nb,
 	case MLX5_EVENT_TYPE_PATH_MIG_FAILED:
 	case MLX5_EVENT_TYPE_WQ_INVAL_REQ_ERROR:
 	case MLX5_EVENT_TYPE_WQ_ACCESS_ERROR:
-		eqe = data;
 		rsn = be32_to_cpu(eqe->data.qp_srq.qp_srq_n) & 0xffffff;
 		rsn |= (eqe->data.qp_srq.type << MLX5_USER_INDEX_LEN);
 		break;
@@ -121,8 +131,7 @@ static int rsc_event_notifier(struct notifier_block *nb,
 		return NOTIFY_DONE;
 	}
 
-	table = container_of(nb, struct mlx5_qp_table, nb);
-	common = mlx5_get_rsc(table, rsn);
+	common = mlx5_get_rsc(&dev->qp_table, rsn);
 	if (!common)
 		return NOTIFY_OK;
 
@@ -137,11 +146,6 @@ static int rsc_event_notifier(struct notifier_block *nb,
 		qp->event(qp, event_type);
 		/* Need to put resource in event handler */
 		return NOTIFY_OK;
-	case MLX5_RES_DCT:
-		dct = (struct mlx5_core_dct *)common;
-		if (event_type == MLX5_EVENT_TYPE_DCT_DRAINED)
-			complete(&dct->drained);
-		break;
 	default:
 		break;
 	}
@@ -188,7 +192,7 @@ static void destroy_resource_common(struct mlx5_ib_dev *dev,
 }
 
 static int _mlx5_core_destroy_dct(struct mlx5_ib_dev *dev,
-				  struct mlx5_core_dct *dct, bool need_cleanup)
+				  struct mlx5_core_dct *dct)
 {
 	u32 in[MLX5_ST_SZ_DW(destroy_dct_in)] = {};
 	struct mlx5_core_qp *qp = &dct->mqp;
@@ -203,13 +207,14 @@ static int _mlx5_core_destroy_dct(struct mlx5_ib_dev *dev,
 	}
 	wait_for_completion(&dct->drained);
 destroy:
-	if (need_cleanup)
-		destroy_resource_common(dev, &dct->mqp);
 	MLX5_SET(destroy_dct_in, in, opcode, MLX5_CMD_OP_DESTROY_DCT);
 	MLX5_SET(destroy_dct_in, in, dctn, qp->qpn);
 	MLX5_SET(destroy_dct_in, in, uid, qp->uid);
 	err = mlx5_cmd_exec_in(dev->mdev, destroy_dct, in);
-	return err;
+	if (err)
+		return err;
+	xa_cmpxchg(&dev->qp_table.dct_xa, dct->mqp.qpn, dct, NULL, GFP_KERNEL);
+	return 0;
 }
 
 int mlx5_core_create_dct(struct mlx5_ib_dev *dev, struct mlx5_core_dct *dct,
@@ -227,13 +232,13 @@ int mlx5_core_create_dct(struct mlx5_ib_dev *dev, struct mlx5_core_dct *dct,
 
 	qp->qpn = MLX5_GET(create_dct_out, out, dctn);
 	qp->uid = MLX5_GET(create_dct_in, in, uid);
-	err = create_resource_common(dev, qp, MLX5_RES_DCT);
+	err = xa_err(xa_store(&dev->qp_table.dct_xa, qp->qpn, dct, GFP_KERNEL));
 	if (err)
 		goto err_cmd;
 
 	return 0;
 err_cmd:
-	_mlx5_core_destroy_dct(dev, dct, false);
+	_mlx5_core_destroy_dct(dev, dct);
 	return err;
 }
 
@@ -284,7 +289,7 @@ static int mlx5_core_drain_dct(struct mlx5_ib_dev *dev,
 int mlx5_core_destroy_dct(struct mlx5_ib_dev *dev,
 			  struct mlx5_core_dct *dct)
 {
-	return _mlx5_core_destroy_dct(dev, dct, true);
+	return _mlx5_core_destroy_dct(dev, dct);
 }
 
 int mlx5_core_destroy_qp(struct mlx5_ib_dev *dev, struct mlx5_core_qp *qp)
@@ -488,6 +493,7 @@ int mlx5_init_qp_table(struct mlx5_ib_dev *dev)
 
 	spin_lock_init(&table->lock);
 	INIT_RADIX_TREE(&table->tree, GFP_ATOMIC);
+	xa_init(&table->dct_xa);
 	mlx5_qp_debugfs_init(dev->mdev);
 
 	table->nb.notifier_call = rsc_event_notifier;
diff --git a/include/linux/mlx5/driver.h b/include/linux/mlx5/driver.h
index f33389b42209e4..87e19e6d07a94a 100644
--- a/include/linux/mlx5/driver.h
+++ b/include/linux/mlx5/driver.h
@@ -381,7 +381,6 @@ enum mlx5_res_type {
 	MLX5_RES_SRQ	= 3,
 	MLX5_RES_XSRQ	= 4,
 	MLX5_RES_XRQ	= 5,
-	MLX5_RES_DCT	= MLX5_EVENT_QUEUE_TYPE_DCT,
 };
 
 struct mlx5_core_rsc_common {
@@ -443,6 +442,7 @@ struct mlx5_core_health {
 
 struct mlx5_qp_table {
 	struct notifier_block   nb;
+	struct xarray dct_xa;
 
 	/* protect radix tree
 	 */
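
The ordering the diff above aims for (issue the firmware destroy command
first, unpublish the object from the lookup table only on success) can be
modelled in user space. This is a rough sketch under assumptions, not a
port of the diff: dct_table, fake_create_dct(), fake_destroy_dct(),
fake_fw_destroy_dct() and fw_failures_left are hypothetical names, and a
fixed-size array stands in for the xarray.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define TABLE_SIZE 8

/* Hypothetical stand-in for struct mlx5_core_dct. */
struct fake_dct {
	unsigned int dctn;
};

/* Fixed-size lookup table; the diff uses an xarray indexed by DCT number. */
struct fake_dct *dct_table[TABLE_SIZE];

int fw_failures_left;	/* the firmware stub fails this many times first */

/* Firmware stub: returns -EIO until fw_failures_left reaches zero. */
int fake_fw_destroy_dct(void)
{
	if (fw_failures_left > 0) {
		fw_failures_left--;
		return -EIO;
	}
	return 0;
}

/* Publish the DCT so event handling could find it by number. */
int fake_create_dct(struct fake_dct *dct)
{
	if (dct->dctn >= TABLE_SIZE)
		return -EINVAL;
	if (dct_table[dct->dctn])
		return -EBUSY;

	dct_table[dct->dctn] = dct;
	return 0;
}

/*
 * Destroy in the safe order. If firmware refuses, nothing has been torn
 * down yet, so the entry stays published and the caller can simply retry.
 */
int fake_destroy_dct(struct fake_dct *dct)
{
	int err = fake_fw_destroy_dct();

	if (err)
		return err;

	/* Clear the slot only if it still holds this DCT (compare-and-clear). */
	if (dct_table[dct->dctn] == dct)
		dct_table[dct->dctn] = NULL;
	return 0;
}
```

Because no software state changes before the firmware command succeeds, a
failed attempt needs no special-case guard on the retry path.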


* Re: [PATCH rdma-next v1 2/3] RDMA/mlx5: Handling dct common resource destruction upon firmware failure
  2023-03-21 12:37           ` Jason Gunthorpe
@ 2023-03-21 12:43             ` Leon Romanovsky
  0 siblings, 0 replies; 10+ messages in thread
From: Leon Romanovsky @ 2023-03-21 12:43 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Patrisious Haddad, David S. Miller, Eric Dumazet, Jakub Kicinski,
	linux-rdma, netdev, Paolo Abeni, Saeed Mahameed

On Tue, Mar 21, 2023 at 09:37:24AM -0300, Jason Gunthorpe wrote:
> On Tue, Mar 21, 2023 at 02:02:59PM +0200, Leon Romanovsky wrote:
> > On Tue, Mar 21, 2023 at 08:53:35AM -0300, Jason Gunthorpe wrote:
> > > On Tue, Mar 21, 2023 at 09:54:58AM +0200, Leon Romanovsky wrote:
> > > > On Mon, Mar 20, 2023 at 04:18:14PM -0300, Jason Gunthorpe wrote:
> > > > > On Thu, Mar 16, 2023 at 03:39:27PM +0200, Leon Romanovsky wrote:
> > > > > > From: Patrisious Haddad <phaddad@nvidia.com>
> > > > > > 
> > > > > > Previously when destroying a DCT, if the firmware function for the
> > > > > > destruction failed, the common resource would have been destroyed
> > > > > > either way, since it was destroyed before the firmware object.
> > > > > > Which leads to kernel warning "refcount_t: underflow" which indicates
> > > > > > possible use-after-free.
> > > > > > Which is triggered when we try to destroy the common resource for the
> > > > > > second time and execute refcount_dec_and_test(&common->refcount).
> > > > > > 
> > > > > > So, currently before destroying the common resource we check its
> > > > > > refcount and continue with the destruction only if it isn't zero.
> > > > > 
> > > > > This seems super sketchy
> > > > > 
> > > > > If the destruction fails why not set the refcount back to 1?
> > > > 
> > > > Because destruction will fail in destroy_rq_tracked() which is after
> > > > destroy_resource_common().
> > > > 
> > > > In first destruction attempt, we delete qp from radix tree and wait for all
> > > > reference to drop. In order do not undo all this logic (setting 1 alone is
> > > > not enough), it is much safer simply skip destroy_resource_common() in reentry
> > > > case.
> > > 
> > > This is the bug I pointed a long time ago, it is ordered wrong to
> > > remove restrack before destruction is assured
> > 
> > It is not restrack, but internal to mlx5_core structure.
> > 
> >   176 static void destroy_resource_common(struct mlx5_ib_dev *dev,
> >   177                                     struct mlx5_core_qp *qp)
> >   178 {
> >   179         struct mlx5_qp_table *table = &dev->qp_table;
> >   180         unsigned long flags;
> >   181
> > 
> > ....
> > 
> >   185         spin_lock_irqsave(&table->lock, flags);
> >   186         radix_tree_delete(&table->tree,
> >   187                           qp->qpn | (qp->common.res << MLX5_USER_INDEX_LEN));
> >   188         spin_unlock_irqrestore(&table->lock, flags);
> >   189         mlx5_core_put_rsc((struct mlx5_core_rsc_common *)qp);
> >   190         wait_for_completion(&qp->common.free);
> >   191 }
> 
> Same basic issue.
> 
> "RSC"'s refcount stuff is really only for ODP to use, and the silly
> pseudo locking should really just be rwsem not a refcount.
> 
> > Get DCT out of that particular mess and the scheme is quite simple and
> > doesn't need hacky stuff.
> 
> Please make a patch to remove radix tree from this code too...

ok, I'll take a look.

Thanks
