* [PATCH for-next v13 0/6] Move two object pools to rxe_mcast.c
@ 2022-02-23 23:07 Bob Pearson
  2022-02-23 23:07 ` [PATCH for-next v13 1/6] RDMA/rxe: Warn if mcast memory is not freed Bob Pearson
                   ` (6 more replies)
  0 siblings, 7 replies; 11+ messages in thread
From: Bob Pearson @ 2022-02-23 23:07 UTC (permalink / raw)
  To: jgg, zyjzyj2000, linux-rdma; +Cc: Bob Pearson

This patch series completes the separation of the mc_grp and mc_elem
object pools from rxe_pool.c and the move of their code to rxe_mcast.c.
This makes sense because these two pools differ from the other pools:
they are the only ones that do not share objects with rdma-core and
they use keys instead of indices to look up objects. This change will
enable a significant simplification of the normal object pools.

Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
---
v13
  Addressed issues raised by Jason Gunthorpe
  Replaced rxe_mcast_cleanup() by a warning.
  Dropped _rcu from __rxe_init/cleanup_mca_rcu
  Restructured rxe_detach_mcg to simplify
v12
  Addressed issues raised by Jason Gunthorpe.
  Print a warning if rdma-core fails to clean up mcast memory.
  Split the 'cleanup patch' into more manageable patches.
  Fixed an error in the rcu patch by moving the kfree and
  rxe_drop_ref(qp) to a call_rcu() routine.
v11
  Restructured the patch series to make each patch simpler.
  Addressed additional issues raised by Jason Gunthorpe in his
  review of the previous version.
v10
  Corrected issues reported by Jason Gunthorpe
  Separated patches 01-17 from the remaining patches;
  they will be submitted later.
v9
  Corrected issues reported by Jason Gunthorpe,
  Converted locking in rxe_mcast.c and rxe_pool.c to use RCU
  Split up the patches into smaller changes
v8
  Fixed an additional race in 3/8 which was not handled correctly.
v7
  Corrected issues reported by Jason Gunthorpe
Link: https://lore.kernel.org/linux-rdma/20211207190947.GH6385@nvidia.com/
Link: https://lore.kernel.org/linux-rdma/20211207191857.GI6385@nvidia.com/
Link: https://lore.kernel.org/linux-rdma/20211207192824.GJ6385@nvidia.com/
v6
  Fixed a kzalloc flags bug.
  Fixed comment bug reported by 'Kernel Test Robot'.
  Changed type of rxe_pool.c in __rxe_fini().
v5
  Removed patches already accepted into for-next and addressed comments
  from Jason Gunthorpe.
v4
  Restructured patch series to change to xarray earlier which
  greatly simplified the changes.
  Rebased to current for-next
v3
  Changed rxe_alloc to use GFP_KERNEL
  Addressed other comments by Jason Gunthorpe
  Merged the previous 06/10 and 07/10 patches into one since they overlapped
  Added some minor cleanups as 10/10
v2
  Rebased to current for-next.
  Added 4 additional patches

Bob Pearson (6):
  RDMA/rxe: Warn if mcast memory is not freed
  RDMA/rxe: Collect mca init code in a subroutine
  RDMA/rxe: Collect cleanup mca code in a subroutine
  RDMA/rxe: Cleanup rxe_mcast.c
  RDMA/rxe: For mcast copy qp list to temp array
  RDMA/rxe: Convert mca read locking to RCU

 drivers/infiniband/sw/rxe/rxe.c       |   2 +
 drivers/infiniband/sw/rxe/rxe_mcast.c | 237 ++++++++++++++++++++------
 drivers/infiniband/sw/rxe/rxe_recv.c  | 107 +++++++-----
 drivers/infiniband/sw/rxe/rxe_verbs.h |   4 +
 4 files changed, 256 insertions(+), 94 deletions(-)

-- 
2.32.0



* [PATCH for-next v13 1/6] RDMA/rxe: Warn if mcast memory is not freed
  2022-02-23 23:07 [PATCH for-next v13 0/6] Move two object pools to rxe_mcast.c Bob Pearson
@ 2022-02-23 23:07 ` Bob Pearson
  2022-02-23 23:07 ` [PATCH v13 for-next 2/6] RDMA/rxe: Collect mca init code in a subroutine Bob Pearson
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 11+ messages in thread
From: Bob Pearson @ 2022-02-23 23:07 UTC (permalink / raw)
  To: jgg, zyjzyj2000, linux-rdma; +Cc: Bob Pearson

Print a warning if memory allocated for mcast objects has not
been freed when the rxe driver is unloaded.

Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
---
 drivers/infiniband/sw/rxe/rxe.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/infiniband/sw/rxe/rxe.c b/drivers/infiniband/sw/rxe/rxe.c
index 3520eb2db685..fce3994d8f7a 100644
--- a/drivers/infiniband/sw/rxe/rxe.c
+++ b/drivers/infiniband/sw/rxe/rxe.c
@@ -29,6 +29,8 @@ void rxe_dealloc(struct ib_device *ib_dev)
 	rxe_pool_cleanup(&rxe->mr_pool);
 	rxe_pool_cleanup(&rxe->mw_pool);
 
+	WARN_ON(!RB_EMPTY_ROOT(&rxe->mcg_tree));
+
 	if (rxe->tfm)
 		crypto_free_shash(rxe->tfm);
 }
-- 
2.32.0



* [PATCH v13 for-next 2/6] RDMA/rxe: Collect mca init code in a subroutine
  2022-02-23 23:07 [PATCH for-next v13 0/6] Move two object pools to rxe_mcast.c Bob Pearson
  2022-02-23 23:07 ` [PATCH for-next v13 1/6] RDMA/rxe: Warn if mcast memory is not freed Bob Pearson
@ 2022-02-23 23:07 ` Bob Pearson
  2022-02-23 23:07 ` [PATCH for-next v13 3/6] RDMA/rxe: Collect cleanup mca " Bob Pearson
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 11+ messages in thread
From: Bob Pearson @ 2022-02-23 23:07 UTC (permalink / raw)
  To: jgg, zyjzyj2000, linux-rdma; +Cc: Bob Pearson

Collect the initialization code for struct rxe_mca into a subroutine,
__rxe_init_mca(), to clean up rxe_attach_mcg() in rxe_mcast.c. Also
check the limit on the total number of attached qp's.

Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
---
 drivers/infiniband/sw/rxe/rxe_mcast.c | 58 ++++++++++++++++++++-------
 drivers/infiniband/sw/rxe/rxe_verbs.h |  1 +
 2 files changed, 44 insertions(+), 15 deletions(-)

diff --git a/drivers/infiniband/sw/rxe/rxe_mcast.c b/drivers/infiniband/sw/rxe/rxe_mcast.c
index 4935fe5c5868..a0a7f8720f95 100644
--- a/drivers/infiniband/sw/rxe/rxe_mcast.c
+++ b/drivers/infiniband/sw/rxe/rxe_mcast.c
@@ -259,6 +259,46 @@ static void rxe_destroy_mcg(struct rxe_mcg *mcg)
 	spin_unlock_irqrestore(&mcg->rxe->mcg_lock, flags);
 }
 
+/**
+ * __rxe_init_mca - initialize a new mca holding lock
+ * @qp: qp object
+ * @mcg: mcg object
+ * @mca: empty space for new mca
+ *
+ * Context: caller must hold references on qp and mcg, rxe->mcg_lock
+ * and pass memory for new mca
+ *
+ * Returns: 0 on success else an error
+ */
+static int __rxe_init_mca(struct rxe_qp *qp, struct rxe_mcg *mcg,
+			  struct rxe_mca *mca)
+{
+	struct rxe_dev *rxe = to_rdev(qp->ibqp.device);
+	int n;
+
+	n = atomic_inc_return(&rxe->mcg_attach);
+	if (n > rxe->attr.max_total_mcast_qp_attach) {
+		atomic_dec(&rxe->mcg_attach);
+		return -ENOMEM;
+	}
+
+	n = atomic_inc_return(&mcg->qp_num);
+	if (n > rxe->attr.max_mcast_qp_attach) {
+		atomic_dec(&mcg->qp_num);
+		atomic_dec(&rxe->mcg_attach);
+		return -ENOMEM;
+	}
+
+	atomic_inc(&qp->mcg_num);
+
+	rxe_add_ref(qp);
+	mca->qp = qp;
+
+	list_add_tail(&mca->qp_list, &mcg->qp_list);
+
+	return 0;
+}
+
 static int rxe_attach_mcg(struct rxe_dev *rxe, struct rxe_qp *qp,
 				  struct rxe_mcg *mcg)
 {
@@ -291,22 +331,9 @@ static int rxe_attach_mcg(struct rxe_dev *rxe, struct rxe_qp *qp,
 		}
 	}
 
-	/* check limits after checking if already attached */
-	if (atomic_inc_return(&mcg->qp_num) > rxe->attr.max_mcast_qp_attach) {
-		atomic_dec(&mcg->qp_num);
+	err = __rxe_init_mca(qp, mcg, mca);
+	if (err)
 		kfree(mca);
-		err = -ENOMEM;
-		goto out;
-	}
-
-	/* protect pointer to qp in mca */
-	rxe_add_ref(qp);
-	mca->qp = qp;
-
-	atomic_inc(&qp->mcg_num);
-	list_add(&mca->qp_list, &mcg->qp_list);
-
-	err = 0;
 out:
 	spin_unlock_irqrestore(&rxe->mcg_lock, flags);
 	return err;
@@ -329,6 +356,7 @@ static int rxe_detach_mcg(struct rxe_dev *rxe, struct rxe_qp *qp,
 		if (mca->qp == qp) {
 			list_del(&mca->qp_list);
 			atomic_dec(&qp->mcg_num);
+			atomic_dec(&rxe->mcg_attach);
 			rxe_drop_ref(qp);
 
 			/* if the number of qp's attached to the
diff --git a/drivers/infiniband/sw/rxe/rxe_verbs.h b/drivers/infiniband/sw/rxe/rxe_verbs.h
index 20fe3ee6589d..6b15251ff67a 100644
--- a/drivers/infiniband/sw/rxe/rxe_verbs.h
+++ b/drivers/infiniband/sw/rxe/rxe_verbs.h
@@ -401,6 +401,7 @@ struct rxe_dev {
 	spinlock_t		mcg_lock;
 	struct rb_root		mcg_tree;
 	atomic_t		mcg_num;
+	atomic_t		mcg_attach;
 
 	spinlock_t		pending_lock; /* guard pending_mmaps */
 	struct list_head	pending_mmaps;
-- 
2.32.0



* [PATCH for-next v13 3/6] RDMA/rxe: Collect cleanup mca code in a subroutine
  2022-02-23 23:07 [PATCH for-next v13 0/6] Move two object pools to rxe_mcast.c Bob Pearson
  2022-02-23 23:07 ` [PATCH for-next v13 1/6] RDMA/rxe: Warn if mcast memory is not freed Bob Pearson
  2022-02-23 23:07 ` [PATCH v13 for-next 2/6] RDMA/rxe: Collect mca init code in a subroutine Bob Pearson
@ 2022-02-23 23:07 ` Bob Pearson
  2022-02-23 23:07 ` [PATCH v13 for-next 4/6] RDMA/rxe: Cleanup rxe_mcast.c Bob Pearson
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 11+ messages in thread
From: Bob Pearson @ 2022-02-23 23:07 UTC (permalink / raw)
  To: jgg, zyjzyj2000, linux-rdma; +Cc: Bob Pearson

Collect the cleanup code for struct rxe_mca into a subroutine,
__rxe_cleanup_mca(), called from rxe_detach_mcg() in rxe_mcast.c.

Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
---
 drivers/infiniband/sw/rxe/rxe_mcast.c | 41 +++++++++++++++++----------
 1 file changed, 26 insertions(+), 15 deletions(-)

diff --git a/drivers/infiniband/sw/rxe/rxe_mcast.c b/drivers/infiniband/sw/rxe/rxe_mcast.c
index a0a7f8720f95..66c1ae703976 100644
--- a/drivers/infiniband/sw/rxe/rxe_mcast.c
+++ b/drivers/infiniband/sw/rxe/rxe_mcast.c
@@ -339,13 +339,31 @@ static int rxe_attach_mcg(struct rxe_dev *rxe, struct rxe_qp *qp,
 	return err;
 }
 
+/**
+ * __rxe_cleanup_mca - cleanup mca object holding lock
+ * @mca: mca object
+ * @mcg: mcg object
+ *
+ * Context: caller must hold a reference to mcg and rxe->mcg_lock
+ */
+static void __rxe_cleanup_mca(struct rxe_mca *mca, struct rxe_mcg *mcg)
+{
+	list_del(&mca->qp_list);
+
+	atomic_dec(&mcg->qp_num);
+	atomic_dec(&mcg->rxe->mcg_attach);
+	atomic_dec(&mca->qp->mcg_num);
+	rxe_drop_ref(mca->qp);
+
+	kfree(mca);
+}
+
 static int rxe_detach_mcg(struct rxe_dev *rxe, struct rxe_qp *qp,
 				   union ib_gid *mgid)
 {
 	struct rxe_mcg *mcg;
 	struct rxe_mca *mca, *tmp;
 	unsigned long flags;
-	int err;
 
 	mcg = rxe_lookup_mcg(rxe, mgid);
 	if (!mcg)
@@ -354,37 +372,30 @@ static int rxe_detach_mcg(struct rxe_dev *rxe, struct rxe_qp *qp,
 	spin_lock_irqsave(&rxe->mcg_lock, flags);
 	list_for_each_entry_safe(mca, tmp, &mcg->qp_list, qp_list) {
 		if (mca->qp == qp) {
-			list_del(&mca->qp_list);
-			atomic_dec(&qp->mcg_num);
-			atomic_dec(&rxe->mcg_attach);
-			rxe_drop_ref(qp);
+			__rxe_cleanup_mca(mca, mcg);
 
 			/* if the number of qp's attached to the
 			 * mcast group falls to zero go ahead and
 			 * tear it down. This will not free the
 			 * object since we are still holding a ref
-			 * from the get key above.
+			 * from the get key above
 			 */
-			if (atomic_dec_return(&mcg->qp_num) <= 0)
+			if (atomic_read(&mcg->qp_num) <= 0)
 				__rxe_destroy_mcg(mcg);
 
 			/* drop the ref from get key. This will free the
 			 * object if qp_num is zero.
 			 */
 			kref_put(&mcg->ref_cnt, rxe_cleanup_mcg);
-			kfree(mca);
-			err = 0;
-			goto out_unlock;
+
+			spin_unlock_irqrestore(&rxe->mcg_lock, flags);
+			return 0;
 		}
 	}
 
 	/* we didn't find the qp on the list */
-	kref_put(&mcg->ref_cnt, rxe_cleanup_mcg);
-	err = -EINVAL;
-
-out_unlock:
 	spin_unlock_irqrestore(&rxe->mcg_lock, flags);
-	return err;
+	return -EINVAL;
 }
 
 int rxe_attach_mcast(struct ib_qp *ibqp, union ib_gid *mgid, u16 mlid)
-- 
2.32.0



* [PATCH v13 for-next 4/6] RDMA/rxe: Cleanup rxe_mcast.c
  2022-02-23 23:07 [PATCH for-next v13 0/6] Move two object pools to rxe_mcast.c Bob Pearson
                   ` (2 preceding siblings ...)
  2022-02-23 23:07 ` [PATCH for-next v13 3/6] RDMA/rxe: Collect cleanup mca " Bob Pearson
@ 2022-02-23 23:07 ` Bob Pearson
  2022-02-23 23:07 ` [PATCH v13 for-next 5/6] RDMA/rxe: For mcast copy qp list to temp array Bob Pearson
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 11+ messages in thread
From: Bob Pearson @ 2022-02-23 23:07 UTC (permalink / raw)
  To: jgg, zyjzyj2000, linux-rdma; +Cc: Bob Pearson

Finish adding kernel-doc comment headers to the subroutines in
rxe_mcast.c. Make minor API changes and cleanups.

Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
---
 drivers/infiniband/sw/rxe/rxe_mcast.c | 97 +++++++++++++++++++++------
 1 file changed, 78 insertions(+), 19 deletions(-)

diff --git a/drivers/infiniband/sw/rxe/rxe_mcast.c b/drivers/infiniband/sw/rxe/rxe_mcast.c
index 66c1ae703976..c399a29b648b 100644
--- a/drivers/infiniband/sw/rxe/rxe_mcast.c
+++ b/drivers/infiniband/sw/rxe/rxe_mcast.c
@@ -1,12 +1,33 @@
 // SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB
 /*
+ * Copyright (c) 2022 Hewlett Packard Enterprise, Inc. All rights reserved.
  * Copyright (c) 2016 Mellanox Technologies Ltd. All rights reserved.
  * Copyright (c) 2015 System Fabric Works, Inc. All rights reserved.
  */
 
+/*
+ * rxe_mcast.c implements driver support for multicast transport.
+ * It is based on two data structures struct rxe_mcg ('mcg') and
+ * struct rxe_mca ('mca'). An mcg is allocated each time a qp is
+ * attached to a new mgid for the first time. These are indexed by
+ * a red-black tree using the mgid. This data structure is searched
+ * for the mcg when a multicast packet is received and when another
+ * qp is attached to the same mgid. It is cleaned up when the last qp
+ * is detached from the mcg. Each time a qp is attached to an mcg an
+ * mca is created. It holds a pointer to the qp and is added to a list
+ * of qp's that are attached to the mcg. The qp_list is used to replicate
+ * mcast packets in the rxe receive path.
+ */
+
 #include "rxe.h"
-#include "rxe_loc.h"
 
+/**
+ * rxe_mcast_add - add multicast address to rxe device
+ * @rxe: rxe device object
+ * @mgid: multicast address as a gid
+ *
+ * Returns 0 on success else an error
+ */
 static int rxe_mcast_add(struct rxe_dev *rxe, union ib_gid *mgid)
 {
 	unsigned char ll_addr[ETH_ALEN];
@@ -16,6 +37,13 @@ static int rxe_mcast_add(struct rxe_dev *rxe, union ib_gid *mgid)
 	return dev_mc_add(rxe->ndev, ll_addr);
 }
 
+/**
+ * rxe_mcast_delete - delete multicast address from rxe device
+ * @rxe: rxe device object
+ * @mgid: multicast address as a gid
+ *
+ * Returns 0 on success else an error
+ */
 static int rxe_mcast_delete(struct rxe_dev *rxe, union ib_gid *mgid)
 {
 	unsigned char ll_addr[ETH_ALEN];
@@ -216,7 +244,7 @@ static struct rxe_mcg *rxe_get_mcg(struct rxe_dev *rxe, union ib_gid *mgid)
 
 /**
  * rxe_cleanup_mcg - cleanup mcg for kref_put
- * @kref:
+ * @kref: struct kref embedded in mcg
  */
 void rxe_cleanup_mcg(struct kref *kref)
 {
@@ -299,9 +327,17 @@ static int __rxe_init_mca(struct rxe_qp *qp, struct rxe_mcg *mcg,
 	return 0;
 }
 
-static int rxe_attach_mcg(struct rxe_dev *rxe, struct rxe_qp *qp,
-				  struct rxe_mcg *mcg)
+/**
+ * rxe_attach_mcg - attach qp to mcg if not already attached
+ * @qp: qp object
+ * @mcg: mcg object
+ *
+ * Context: caller must hold reference on qp and mcg.
+ * Returns: 0 on success else an error
+ */
+static int rxe_attach_mcg(struct rxe_mcg *mcg, struct rxe_qp *qp)
 {
+	struct rxe_dev *rxe = mcg->rxe;
 	struct rxe_mca *mca, *tmp;
 	unsigned long flags;
 	int err;
@@ -358,17 +394,19 @@ static void __rxe_cleanup_mca(struct rxe_mca *mca, struct rxe_mcg *mcg)
 	kfree(mca);
 }
 
-static int rxe_detach_mcg(struct rxe_dev *rxe, struct rxe_qp *qp,
-				   union ib_gid *mgid)
+/**
+ * rxe_detach_mcg - detach qp from mcg
+ * @mcg: mcg object
+ * @qp: qp object
+ *
+ * Returns: 0 on success else an error if qp is not attached.
+ */
+static int rxe_detach_mcg(struct rxe_mcg *mcg, struct rxe_qp *qp)
 {
-	struct rxe_mcg *mcg;
+	struct rxe_dev *rxe = mcg->rxe;
 	struct rxe_mca *mca, *tmp;
 	unsigned long flags;
 
-	mcg = rxe_lookup_mcg(rxe, mgid);
-	if (!mcg)
-		return -EINVAL;
-
 	spin_lock_irqsave(&rxe->mcg_lock, flags);
 	list_for_each_entry_safe(mca, tmp, &mcg->qp_list, qp_list) {
 		if (mca->qp == qp) {
@@ -378,16 +416,11 @@ static int rxe_detach_mcg(struct rxe_dev *rxe, struct rxe_qp *qp,
 			 * mcast group falls to zero go ahead and
 			 * tear it down. This will not free the
 			 * object since we are still holding a ref
-			 * from the get key above
+			 * from the caller
 			 */
 			if (atomic_read(&mcg->qp_num) <= 0)
 				__rxe_destroy_mcg(mcg);
 
-			/* drop the ref from get key. This will free the
-			 * object if qp_num is zero.
-			 */
-			kref_put(&mcg->ref_cnt, rxe_cleanup_mcg);
-
 			spin_unlock_irqrestore(&rxe->mcg_lock, flags);
 			return 0;
 		}
@@ -398,6 +431,14 @@ static int rxe_detach_mcg(struct rxe_dev *rxe, struct rxe_qp *qp,
 	return -EINVAL;
 }
 
+/**
+ * rxe_attach_mcast - attach qp to multicast group (see IBA-11.3.1)
+ * @ibqp: (IB) qp object
+ * @mgid: multicast IP address
+ * @mlid: multicast LID, ignored for RoCEv2 (see IBA-A17.5.6)
+ *
+ * Returns: 0 on success else an errno
+ */
 int rxe_attach_mcast(struct ib_qp *ibqp, union ib_gid *mgid, u16 mlid)
 {
 	int err;
@@ -410,20 +451,38 @@ int rxe_attach_mcast(struct ib_qp *ibqp, union ib_gid *mgid, u16 mlid)
 	if (IS_ERR(mcg))
 		return PTR_ERR(mcg);
 
-	err = rxe_attach_mcg(rxe, qp, mcg);
+	err = rxe_attach_mcg(mcg, qp);
 
 	/* if we failed to attach the first qp to mcg tear it down */
 	if (atomic_read(&mcg->qp_num) == 0)
 		rxe_destroy_mcg(mcg);
 
 	kref_put(&mcg->ref_cnt, rxe_cleanup_mcg);
+
 	return err;
 }
 
+/**
+ * rxe_detach_mcast - detach qp from multicast group (see IBA-11.3.2)
+ * @ibqp: address of (IB) qp object
+ * @mgid: multicast IP address
+ * @mlid: multicast LID, ignored for RoCEv2 (see IBA-A17.5.6)
+ *
+ * Returns: 0 on success else an errno
+ */
 int rxe_detach_mcast(struct ib_qp *ibqp, union ib_gid *mgid, u16 mlid)
 {
 	struct rxe_dev *rxe = to_rdev(ibqp->device);
 	struct rxe_qp *qp = to_rqp(ibqp);
+	struct rxe_mcg *mcg;
+	int err;
+
+	mcg = rxe_lookup_mcg(rxe, mgid);
+	if (!mcg)
+		return -EINVAL;
 
-	return rxe_detach_mcg(rxe, qp, mgid);
+	err = rxe_detach_mcg(mcg, qp);
+	kref_put(&mcg->ref_cnt, rxe_cleanup_mcg);
+
+	return err;
 }
-- 
2.32.0



* [PATCH v13 for-next 5/6] RDMA/rxe: For mcast copy qp list to temp array
  2022-02-23 23:07 [PATCH for-next v13 0/6] Move two object pools to rxe_mcast.c Bob Pearson
                   ` (3 preceding siblings ...)
  2022-02-23 23:07 ` [PATCH v13 for-next 4/6] RDMA/rxe: Cleanup rxe_mcast.c Bob Pearson
@ 2022-02-23 23:07 ` Bob Pearson
  2022-02-24  0:26   ` Jason Gunthorpe
  2022-02-23 23:07 ` [PATCH for-next v13 6/6] RDMA/rxe: Convert mca read locking to RCU Bob Pearson
  2022-02-24  0:32 ` [PATCH for-next v13 0/6] Move two object pools to rxe_mcast.c Jason Gunthorpe
  6 siblings, 1 reply; 11+ messages in thread
From: Bob Pearson @ 2022-02-23 23:07 UTC (permalink / raw)
  To: jgg, zyjzyj2000, linux-rdma; +Cc: Bob Pearson

Currently rxe_rcv_mcast_pkt() performs most of its work under the
rxe->mcg_lock and calls into rxe_rcv, which queues the packets to
the responder and completer tasklets while still holding the lock.
This is a very bad idea. This patch walks the qp_list in the mcg and
copies the qp addresses to a temporary array under the lock, but does
the rest of the work without holding the lock. The critical section
is now very small.

Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
---
 drivers/infiniband/sw/rxe/rxe_recv.c | 103 +++++++++++++++++----------
 1 file changed, 64 insertions(+), 39 deletions(-)

diff --git a/drivers/infiniband/sw/rxe/rxe_recv.c b/drivers/infiniband/sw/rxe/rxe_recv.c
index 53924453abef..9b21cbb22602 100644
--- a/drivers/infiniband/sw/rxe/rxe_recv.c
+++ b/drivers/infiniband/sw/rxe/rxe_recv.c
@@ -232,11 +232,15 @@ static inline void rxe_rcv_pkt(struct rxe_pkt_info *pkt, struct sk_buff *skb)
 
 static void rxe_rcv_mcast_pkt(struct rxe_dev *rxe, struct sk_buff *skb)
 {
+	struct sk_buff *skb_copy;
 	struct rxe_pkt_info *pkt = SKB_TO_PKT(skb);
+	struct rxe_pkt_info *pkt_copy;
 	struct rxe_mcg *mcg;
 	struct rxe_mca *mca;
 	struct rxe_qp *qp;
+	struct rxe_qp **qp_array;
 	union ib_gid dgid;
+	int n, nmax;
 	int err;
 
 	if (skb->protocol == htons(ETH_P_IP))
@@ -248,68 +252,89 @@ static void rxe_rcv_mcast_pkt(struct rxe_dev *rxe, struct sk_buff *skb)
 	/* lookup mcast group corresponding to mgid, takes a ref */
 	mcg = rxe_lookup_mcg(rxe, &dgid);
 	if (!mcg)
-		goto drop;	/* mcast group not registered */
+		goto err_drop;	/* mcast group not registered */
+
+	/* this is the current number of qp's attached to mcg plus a
+	 * little room in case new qp's are attached between here
+	 * and when we finish walking the qp list. If someone can
+	 * attach more than 4 new qp's we will miss forwarding
+	 * packets to those qp's. This is actually OK since UD is
+	 * a unreliable service.
+	 * an unreliable service.
+	nmax = atomic_read(&mcg->qp_num) + 4;
+	qp_array = kmalloc_array(nmax, sizeof(qp), GFP_KERNEL);
+	n = 0;
 
 	spin_lock_bh(&rxe->mcg_lock);
-
-	/* this is unreliable datagram service so we let
-	 * failures to deliver a multicast packet to a
-	 * single QP happen and just move on and try
-	 * the rest of them on the list
-	 */
 	list_for_each_entry(mca, &mcg->qp_list, qp_list) {
-		qp = mca->qp;
+		/* protect the qp pointers in the list */
+		rxe_add_ref(mca->qp);
+		qp_array[n++] = mca->qp;
+		if (n == nmax)
+			break;
+	}
+	spin_unlock_bh(&rxe->mcg_lock);
+	nmax = n;
+	kref_put(&mcg->ref_cnt, rxe_cleanup_mcg);
 
-		/* validate qp for incoming packet */
+	for (n = 0; n < nmax; n++) {
+		qp = qp_array[n];
+
+		/* since this is an unreliable transport if
+		 * one of the qp's fails to pass these checks
+		 * just don't forward a packet and continue
+		 * on to the other qp's. If there aren't any
+		 * drop the skb
+		 */
 		err = check_type_state(rxe, pkt, qp);
-		if (err)
+		if (err) {
+			rxe_drop_ref(qp);
+			if (n == nmax - 1)
+				goto err_free;
 			continue;
+		}
 
 		err = check_keys(rxe, pkt, bth_qpn(pkt), qp);
-		if (err)
+		if (err) {
+			rxe_drop_ref(qp);
+			if (n == nmax - 1)
+				goto err_free;
 			continue;
+		}
 
-		/* for all but the last QP create a new clone of the
-		 * skb and pass to the QP. Pass the original skb to
-		 * the last QP in the list.
+		/* for all but the last qp create a new copy(clone)
+		 * of the skb and pass to the qp. Pass the original
+		 * skb to the last qp in the list unless it failed
+		 * checks above
 		 */
-		if (mca->qp_list.next != &mcg->qp_list) {
-			struct sk_buff *cskb;
-			struct rxe_pkt_info *cpkt;
-
-			cskb = skb_clone(skb, GFP_ATOMIC);
-			if (unlikely(!cskb))
+		if (n < nmax - 1) {
+			skb_copy = skb_clone(skb, GFP_KERNEL);
+			if (unlikely(!skb_copy)) {
+				rxe_drop_ref(qp);
 				continue;
+			}
 
 			if (WARN_ON(!ib_device_try_get(&rxe->ib_dev))) {
-				kfree_skb(cskb);
-				break;
+				kfree_skb(skb_copy);
+				rxe_drop_ref(qp);
+				continue;
 			}
 
-			cpkt = SKB_TO_PKT(cskb);
-			cpkt->qp = qp;
-			rxe_add_ref(qp);
-			rxe_rcv_pkt(cpkt, cskb);
+			pkt_copy = SKB_TO_PKT(skb_copy);
+			pkt_copy->qp = qp;
+			rxe_rcv_pkt(pkt_copy, skb_copy);
 		} else {
 			pkt->qp = qp;
-			rxe_add_ref(qp);
 			rxe_rcv_pkt(pkt, skb);
-			skb = NULL;	/* mark consumed */
 		}
 	}
 
-	spin_unlock_bh(&rxe->mcg_lock);
-
-	kref_put(&mcg->ref_cnt, rxe_cleanup_mcg);
-
-	if (likely(!skb))
-		return;
-
-	/* This only occurs if one of the checks fails on the last
-	 * QP in the list above
-	 */
+	kfree(qp_array);
+	return;
 
-drop:
+err_free:
+	kfree(qp_array);
+err_drop:
 	kfree_skb(skb);
 	ib_device_put(&rxe->ib_dev);
 }
-- 
2.32.0



* [PATCH for-next v13 6/6] RDMA/rxe: Convert mca read locking to RCU
  2022-02-23 23:07 [PATCH for-next v13 0/6] Move two object pools to rxe_mcast.c Bob Pearson
                   ` (4 preceding siblings ...)
  2022-02-23 23:07 ` [PATCH v13 for-next 5/6] RDMA/rxe: For mcast copy qp list to temp array Bob Pearson
@ 2022-02-23 23:07 ` Bob Pearson
  2022-02-24  0:28   ` Jason Gunthorpe
  2022-02-24  0:32 ` [PATCH for-next v13 0/6] Move two object pools to rxe_mcast.c Jason Gunthorpe
  6 siblings, 1 reply; 11+ messages in thread
From: Bob Pearson @ 2022-02-23 23:07 UTC (permalink / raw)
  To: jgg, zyjzyj2000, linux-rdma; +Cc: Bob Pearson

Replace the spinlock with RCU read locks for read-side operations
on mca in rxe_recv.c and rxe_mcast.c.

Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
---
 drivers/infiniband/sw/rxe/rxe_mcast.c | 67 ++++++++++++++++++++-------
 drivers/infiniband/sw/rxe/rxe_recv.c  |  6 +--
 drivers/infiniband/sw/rxe/rxe_verbs.h |  3 ++
 3 files changed, 56 insertions(+), 20 deletions(-)

diff --git a/drivers/infiniband/sw/rxe/rxe_mcast.c b/drivers/infiniband/sw/rxe/rxe_mcast.c
index c399a29b648b..b2ca4bf5658f 100644
--- a/drivers/infiniband/sw/rxe/rxe_mcast.c
+++ b/drivers/infiniband/sw/rxe/rxe_mcast.c
@@ -17,6 +17,12 @@
  * mca is created. It holds a pointer to the qp and is added to a list
  * of qp's that are attached to the mcg. The qp_list is used to replicate
  * mcast packets in the rxe receive path.
+ *
+ * The highest performance operation is mca list traversal when
+ * processing incoming multicast packets which need to be fanned out
+ * to the attached qp's. This list is protected by RCU locking for read
+ * operations and a spinlock in the rxe_dev struct for write operations.
+ * The red-black tree is protected by the same spinlock.
  */
 
 #include "rxe.h"
@@ -299,7 +305,7 @@ static void rxe_destroy_mcg(struct rxe_mcg *mcg)
  * Returns: 0 on success else an error
  */
 static int __rxe_init_mca(struct rxe_qp *qp, struct rxe_mcg *mcg,
-			  struct rxe_mca *mca)
+			      struct rxe_mca *mca)
 {
 	struct rxe_dev *rxe = to_rdev(qp->ibqp.device);
 	int n;
@@ -322,7 +328,12 @@ static int __rxe_init_mca(struct rxe_qp *qp, struct rxe_mcg *mcg,
 	rxe_add_ref(qp);
 	mca->qp = qp;
 
-	list_add_tail(&mca->qp_list, &mcg->qp_list);
+	kref_get(&mcg->ref_cnt);
+	mca->mcg = mcg;
+
+	init_completion(&mca->complete);
+
+	list_add_tail_rcu(&mca->qp_list, &mcg->qp_list);
 
 	return 0;
 }
@@ -343,14 +354,14 @@ static int rxe_attach_mcg(struct rxe_mcg *mcg, struct rxe_qp *qp)
 	int err;
 
 	/* check to see if the qp is already a member of the group */
-	spin_lock_irqsave(&rxe->mcg_lock, flags);
-	list_for_each_entry(mca, &mcg->qp_list, qp_list) {
+	rcu_read_lock();
+	list_for_each_entry_rcu(mca, &mcg->qp_list, qp_list) {
 		if (mca->qp == qp) {
-			spin_unlock_irqrestore(&rxe->mcg_lock, flags);
+			rcu_read_unlock();
 			return 0;
 		}
 	}
-	spin_unlock_irqrestore(&rxe->mcg_lock, flags);
+	rcu_read_unlock();
 
 	/* speculative alloc new mca without using GFP_ATOMIC */
 	mca = kzalloc(sizeof(*mca), GFP_KERNEL);
@@ -375,6 +386,20 @@ static int rxe_attach_mcg(struct rxe_mcg *mcg, struct rxe_qp *qp)
 	return err;
 }
 
+/**
+ * rxe_destroy_mca - free mca resources
+ * @head: rcu_head embedded in mca
+ */
+static void rxe_destroy_mca(struct rcu_head *head)
+{
+	struct rxe_mca *mca = container_of(head, typeof(*mca), rcu);
+
+	atomic_dec(&mca->qp->mcg_num);
+	rxe_drop_ref(mca->qp);
+
+	complete(&mca->complete);
+}
+
 /**
  * __rxe_cleanup_mca - cleanup mca object holding lock
  * @mca: mca object
@@ -384,14 +409,12 @@ static int rxe_attach_mcg(struct rxe_mcg *mcg, struct rxe_qp *qp)
  */
 static void __rxe_cleanup_mca(struct rxe_mca *mca, struct rxe_mcg *mcg)
 {
-	list_del(&mca->qp_list);
-
+	mca->mcg = NULL;
+	kref_put(&mcg->ref_cnt, rxe_cleanup_mcg);
 	atomic_dec(&mcg->qp_num);
 	atomic_dec(&mcg->rxe->mcg_attach);
-	atomic_dec(&mca->qp->mcg_num);
-	rxe_drop_ref(mca->qp);
 
-	kfree(mca);
+	list_del_rcu(&mca->qp_list);
 }
 
 /**
@@ -404,11 +427,10 @@ static void __rxe_cleanup_mca(struct rxe_mca *mca, struct rxe_mcg *mcg)
 static int rxe_detach_mcg(struct rxe_mcg *mcg, struct rxe_qp *qp)
 {
 	struct rxe_dev *rxe = mcg->rxe;
-	struct rxe_mca *mca, *tmp;
-	unsigned long flags;
+	struct rxe_mca *mca;
 
-	spin_lock_irqsave(&rxe->mcg_lock, flags);
-	list_for_each_entry_safe(mca, tmp, &mcg->qp_list, qp_list) {
+	spin_lock_bh(&rxe->mcg_lock);
+	list_for_each_entry_rcu(mca, &mcg->qp_list, qp_list) {
 		if (mca->qp == qp) {
 			__rxe_cleanup_mca(mca, mcg);
 
@@ -420,14 +442,25 @@ static int rxe_detach_mcg(struct rxe_mcg *mcg, struct rxe_qp *qp)
 			 */
 			if (atomic_read(&mcg->qp_num) <= 0)
 				__rxe_destroy_mcg(mcg);
+			spin_unlock_bh(&rxe->mcg_lock);
+
+			/* schedule rxe_destroy_mca and then wait for
+			 * completion before returning to rdma-core.
+			 * Having an outstanding call_rcu() causes
+			 * rdma-core to fail. It may be simpler to
+			 * just call synchronize_rcu() and then
+			 * rxe_destroy_mca(), but this works OK.
+			 */
+			call_rcu(&mca->rcu, rxe_destroy_mca);
+			wait_for_completion(&mca->complete);
+			kfree(mca);
 
-			spin_unlock_irqrestore(&rxe->mcg_lock, flags);
 			return 0;
 		}
 	}
 
 	/* we didn't find the qp on the list */
-	spin_unlock_irqrestore(&rxe->mcg_lock, flags);
+	spin_unlock_bh(&rxe->mcg_lock);
 	return -EINVAL;
 }
 
diff --git a/drivers/infiniband/sw/rxe/rxe_recv.c b/drivers/infiniband/sw/rxe/rxe_recv.c
index 9b21cbb22602..c2cab85c6576 100644
--- a/drivers/infiniband/sw/rxe/rxe_recv.c
+++ b/drivers/infiniband/sw/rxe/rxe_recv.c
@@ -265,15 +265,15 @@ static void rxe_rcv_mcast_pkt(struct rxe_dev *rxe, struct sk_buff *skb)
 	qp_array = kmalloc_array(nmax, sizeof(qp), GFP_KERNEL);
 	n = 0;
 
-	spin_lock_bh(&rxe->mcg_lock);
-	list_for_each_entry(mca, &mcg->qp_list, qp_list) {
+	rcu_read_lock();
+	list_for_each_entry_rcu(mca, &mcg->qp_list, qp_list) {
 		/* protect the qp pointers in the list */
 		rxe_add_ref(mca->qp);
 		qp_array[n++] = mca->qp;
 		if (n == nmax)
 			break;
 	}
-	spin_unlock_bh(&rxe->mcg_lock);
+	rcu_read_unlock();
 	nmax = n;
 	kref_put(&mcg->ref_cnt, rxe_cleanup_mcg);
 
diff --git a/drivers/infiniband/sw/rxe/rxe_verbs.h b/drivers/infiniband/sw/rxe/rxe_verbs.h
index 6b15251ff67a..14a574e6140e 100644
--- a/drivers/infiniband/sw/rxe/rxe_verbs.h
+++ b/drivers/infiniband/sw/rxe/rxe_verbs.h
@@ -364,7 +364,10 @@ struct rxe_mcg {
 
 struct rxe_mca {
 	struct list_head	qp_list;
+	struct rcu_head		rcu;
 	struct rxe_qp		*qp;
+	struct rxe_mcg		*mcg;
+	struct completion	complete;
 };
 
 struct rxe_port {
-- 
2.32.0



* Re: [PATCH v13 for-next 5/6] RDMA/rxe: For mcast copy qp list to temp array
  2022-02-23 23:07 ` [PATCH v13 for-next 5/6] RDMA/rxe: For mcast copy qp list to temp array Bob Pearson
@ 2022-02-24  0:26   ` Jason Gunthorpe
  2022-02-24 17:29     ` Bob Pearson
  0 siblings, 1 reply; 11+ messages in thread
From: Jason Gunthorpe @ 2022-02-24  0:26 UTC (permalink / raw)
  To: Bob Pearson; +Cc: zyjzyj2000, linux-rdma

On Wed, Feb 23, 2022 at 05:07:07PM -0600, Bob Pearson wrote:

> +	/* this is the current number of qp's attached to mcg plus a
> +	 * little room in case new qp's are attached between here
> +	 * and when we finish walking the qp list. If someone can
> +	 * attach more than 4 new qp's we will miss forwarding
> +	 * packets to those qp's. This is actually OK since UD is
> +	 * a unreliable service.
> +	 */
> +	nmax = atomic_read(&mcg->qp_num) + 4;
> +	qp_array = kmalloc_array(nmax, sizeof(qp), GFP_KERNEL);

Check for allocation failure?
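
For example, a minimal sketch of such a check (the error handling here is
an assumption; the mcg reference taken by rxe_lookup_mcg() presumably also
has to be dropped before jumping to err_drop):

	qp_array = kmalloc_array(nmax, sizeof(qp), GFP_KERNEL);
	if (unlikely(!qp_array)) {
		kref_put(&mcg->ref_cnt, rxe_cleanup_mcg);
		goto err_drop;
	}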

> +	n = 0;
>  	spin_lock_bh(&rxe->mcg_lock);
> -
> -	/* this is unreliable datagram service so we let
> -	 * failures to deliver a multicast packet to a
> -	 * single QP happen and just move on and try
> -	 * the rest of them on the list
> -	 */
>  	list_for_each_entry(mca, &mcg->qp_list, qp_list) {
> -		qp = mca->qp;
> +		/* protect the qp pointers in the list */
> +		rxe_add_ref(mca->qp);
> +		qp_array[n++] = mca->qp;
> +		if (n == nmax)
> +			break;
> +	}
> +	spin_unlock_bh(&rxe->mcg_lock);

So the issue here is this needs to iterate over the list, but doesn't
want to hold a spinlock while it does the actions?

Perhaps the better pattern is to switch the qp list to an xarray (and
delete the mca entirely), then you can use xa_for_each here and avoid
the allocation. The advantage of xarray is that iteration is
safely restartable, so the locks can be dropped.

xa_lock()
xa_for_each(...) {
   qp = mca->qp;
   rxe_add_ref(qp);
   xa_unlock();

   [.. stuff without lock ..]

   rxe_put_ref(qp)
   xa_lock();
}

This would eliminate the rxe_mca entirely as the qp can be stored
directly in the xarray.

In most cases I suppose a mcg will have a small number of QP so this
should be much faster than the linked list, and especially than the
allocation.

And when I write it like the above you can see the RCU failure you
mentioned before is just a symptom of a larger bug, i.e. if RXE is
doing the above and replicating packets at the same instant that close
happens it will hit that WARN_ON as well since the qp ref is
temporarily elevated.

The only way to fully fix that bug is to properly wait for all
transient users of the QP to be finished before allowing the QP to be
destroyed.

But at least this version would make it not happen so reliably and
things can move on.

Jason


* Re: [PATCH for-next v13 6/6] RDMA/rxe: Convert mca read locking to RCU
  2022-02-23 23:07 ` [PATCH for-next v13 6/6] RDMA/rxe: Convert mca read locking to RCU Bob Pearson
@ 2022-02-24  0:28   ` Jason Gunthorpe
  0 siblings, 0 replies; 11+ messages in thread
From: Jason Gunthorpe @ 2022-02-24  0:28 UTC (permalink / raw)
  To: Bob Pearson; +Cc: zyjzyj2000, linux-rdma

On Wed, Feb 23, 2022 at 05:07:08PM -0600, Bob Pearson wrote:
> +			/* schedule rxe_destroy_mca and then wait for
> +			 * completion before returning to rdma-core.
> +			 * Having an outstanding call_rcu() causes
> +			 * rdma-core to fail. It may be simpler to
> +			 * just call synchronize_rcu() and then
> +			 * rxe_destroy_rcu(), but this works OK.
> +			 */
> +			call_rcu(&mca->rcu, rxe_destroy_mca);
> +			wait_for_completion(&mca->complete);

What you've done here is just open-coded synchronize_rcu().

The trouble with synchronize_rcu() is that rcu grace periods can take
huge amounts of time (>1s) on busy servers, so you really don't want
to write synchronize_rcu(); it might take minutes to close a verbs FD
if the user created lots of MCAs.

It is the dark side of RCU locking: you have to be able to work in the
call_rcu style of async cleanup, or it doesn't really work.
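
For illustration, a sketch of that fully asynchronous style applied to the
patch above (an assumption, not the posted code: it moves the final
kfree() into the callback, so the completion in struct rxe_mca and the
wait go away entirely):

	static void rxe_destroy_mca(struct rcu_head *head)
	{
		struct rxe_mca *mca = container_of(head, typeof(*mca), rcu);

		atomic_dec(&mca->qp->mcg_num);
		rxe_drop_ref(mca->qp);
		kfree(mca);
	}

	/* in rxe_detach_mcg(), after __rxe_cleanup_mca() and dropping
	 * the lock, with no wait_for_completion():
	 */
	call_rcu(&mca->rcu, rxe_destroy_mca);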

Jason


* Re: [PATCH for-next v13 0/6] Move two object pools to rxe_mcast.c
  2022-02-23 23:07 [PATCH for-next v13 0/6] Move two object pools to rxe_mcast.c Bob Pearson
                   ` (5 preceding siblings ...)
  2022-02-23 23:07 ` [PATCH for-next v13 6/6] RDMA/rxe: Convert mca read locking to RCU Bob Pearson
@ 2022-02-24  0:32 ` Jason Gunthorpe
  6 siblings, 0 replies; 11+ messages in thread
From: Jason Gunthorpe @ 2022-02-24  0:32 UTC (permalink / raw)
  To: Bob Pearson; +Cc: zyjzyj2000, linux-rdma

On Wed, Feb 23, 2022 at 05:07:01PM -0600, Bob Pearson wrote:

> Bob Pearson (6):
>   RDMA/rxe: Warn if mcast memory is not freed
>   RDMA/rxe: Collect mca init code in a subroutine
>   RDMA/rxe: Collect cleanup mca code in a subroutine
>   RDMA/rxe: Cleanup rxe_mcast.c

I took these to for-next

Thanks,
Jason


* Re: [PATCH v13 for-next 5/6] RDMA/rxe: For mcast copy qp list to temp array
  2022-02-24  0:26   ` Jason Gunthorpe
@ 2022-02-24 17:29     ` Bob Pearson
  0 siblings, 0 replies; 11+ messages in thread
From: Bob Pearson @ 2022-02-24 17:29 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: zyjzyj2000, linux-rdma

On 2/23/22 18:26, Jason Gunthorpe wrote:
> On Wed, Feb 23, 2022 at 05:07:07PM -0600, Bob Pearson wrote:
> 
>> +	/* this is the current number of qp's attached to mcg plus a
>> +	 * little room in case new qp's are attached between here
>> +	 * and when we finish walking the qp list. If someone can
>> +	 * attach more than 4 new qp's we will miss forwarding
>> +	 * packets to those qp's. This is actually OK since UD is
>> +	 * a unreliable service.
>> +	 */
>> +	nmax = atomic_read(&mcg->qp_num) + 4;
>> +	qp_array = kmalloc_array(nmax, sizeof(qp), GFP_KERNEL);
> 
> Check for allocation failure?
> 
>> +	n = 0;
>>  	spin_lock_bh(&rxe->mcg_lock);
>> -
>> -	/* this is unreliable datagram service so we let
>> -	 * failures to deliver a multicast packet to a
>> -	 * single QP happen and just move on and try
>> -	 * the rest of them on the list
>> -	 */
>>  	list_for_each_entry(mca, &mcg->qp_list, qp_list) {
>> -		qp = mca->qp;
>> +		/* protect the qp pointers in the list */
>> +		rxe_add_ref(mca->qp);
>> +		qp_array[n++] = mca->qp;
>> +		if (n == nmax)
>> +			break;
>> +	}
>> +	spin_unlock_bh(&rxe->mcg_lock);
> 
> So the issue here is this needs to iterate over the list, but doesn't
> want to hold a spinlock while it does the actions?
It's not walking the linked list that I am worried about but the call to
rxe_rcv_pkt(), which can take a long time as it has to copy the payload to
user space and then generate a work completion and possibly a completion event.
> 
> Perhaps the better pattern is switch the qp list to an xarray (and
> delete the mca entirely), then you can use xa_for_each here and avoid
> the allocation. The advantage of xarray is that iteration is
> safely restartable, so the locks can be dropped.
> 
> xa_lock()
> xa_for_each(...) {
>    qp = mca->qp;
>    rxe_add_ref(qp);
>    xa_unlock();
> 
>    [.. stuff without lock ..]
> 
>    rxe_put_ref(qp)
>    xa_lock();
> }
> 
> This would eliminate the rxe_mca entirely as the qp can be stored
> directly in the xarray.
> 
> In most cases I suppose a mcg will have a small number of QP so this
> should be much faster than the linked list, and especially than the
> allocation.
The way the spec is written seems to anticipate a fixed-size table of
qp's for each mcast address. The way this is now written is sort of a
compromise: I am guessing that the table kmalloc is smaller than the
skb clones, so I just left the linked list around to make changes to
the list easier and to keep it separate from the linear array.

An alternative would be to include the array in the mcg struct and
scan the array to attach/detach qp's while holding a lock, as sketched
below. The scan here in rxe_rcv_mcast_pkt would not need to be locked
as long as the qp's are stored and cleared atomically; just check
whether they are non-zero at the moment the packet arrives.
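
A rough sketch of that alternative (not code from this series; reusing the
existing RXE_MAX_MCAST_QP_ATTACH limit for the array size and using
READ_ONCE()/WRITE_ONCE() for the atomic store and clear are assumptions):

	struct rxe_mcg {
		...
		struct rxe_qp		*qp[RXE_MAX_MCAST_QP_ATTACH];
	};

	/* attach/detach under rxe->mcg_lock: store or clear a slot */
	WRITE_ONCE(mcg->qp[i], qp);	/* attach */
	WRITE_ONCE(mcg->qp[i], NULL);	/* detach */

	/* receive path: no lock, just skip empty slots */
	for (i = 0; i < RXE_MAX_MCAST_QP_ATTACH; i++) {
		qp = READ_ONCE(mcg->qp[i]);
		if (!qp)
			continue;
		rxe_add_ref(qp);
		/* clone the skb and pass it to this qp */
		rxe_drop_ref(qp);
	}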
> 
> And when I write it like the above you can see the RCU failure you
> mentioned before is just symptom of a larger bug, ie if RXE is doing
> the above and replicating packets at the same instant that close
> happens it will hit that WARN_ON as well since the qp ref is
> temporarily elevated.
> 
> The only way to fully fix that bug is to properly wait for all
> transient users of the QP to be finished before allowing the QP to be
> destroyed.
> 
> But at least this version would make it not happen so reliably and
> things can move on.
> 
> Jason


