All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH rdma-next v6 0/8] RDMA resource tracking
@ 2018-01-25 15:12 Leon Romanovsky
       [not found] ` <20180125151227.28202-1-leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
  0 siblings, 1 reply; 20+ messages in thread
From: Leon Romanovsky @ 2018-01-25 15:12 UTC (permalink / raw)
  To: Doug Ledford, Jason Gunthorpe
  Cc: Leon Romanovsky, RDMA mailing list, Mark Bloch, Steve Wise

Hi Doug and Jason,

For this specific version, results from our verification will be on Sunday.

However their absence doesn't mean that it is not stable enough for inclusion.
I tested it on various mlx5 devices with RC/UD/XRCD pingpongs, boot, shutdowns
with and without traffic during shutdowns. It also survives syzkaller.

Steve tests it with krping and nvmeof on mlx4 and cxgb3/4.

Thanks

-------
Changelog:
v5->v6:
  * Removed SRCU
  * Added kref logic to protect resources
  * Dropped support of XRC QPs for now
v4->v5:
  * Rewrote logic to acquire kernel owners of PD and CQ objects
  * Simplified interface.
  * Added logic to return summary for all PID namespaces and not only
    init PID namespace.
  * Changed DB implementation from struct with arrays to be struct with
    hash table.
  * Moved to use SRCU per-hash table.
  * Dropped RW semaphores in favor of mutex to protect linked lists.
  * Dropped QP named annotation patch.
  * Used SPDX licenses.
v3->v4:
  * Added Steve ROB tags.
  * Updated restrack.h to be compatible with kernel-doc format
  * Fixed bug when not all QPs (more than one netlink page) were
    returned by netlink interface
  * Removed "MAX" values from object summary. In followup patches,
    I will eturn object summary per-namespace and not for init PID
    namespace only. Max values can be seen with ibv_devinfo tool.
  * Renamed "comm" to be kern_name.
  * Fix spelling errors.
v2->v3:
  * Added support of PID namespaces.
  * Rewrote rdma_restraack_add function to ensure that it won't
    appear in lists before it is valid.
  * Replace pid/task name caching logic to use task_struct instead.
  * Removed report of name of task's for user space objects. Users
    are expected to read it through /proc/PID/comm.
v1->v2:
 * Rebased on latest rdma/for-next
 * Replaced mutex lock which protect linked lists to be RW semaphore.
   It has no impact on current implementation, because there is only one
   reader (nldev) and it is serialized. However better to be prepared
   for multiple readers from the beginning.
 * Added reinitialization next QP entry logic to ensure that it exists
   and didn't vanish during fill_req_qp() work.
v0->v1:
 * Dropped RFC
 * Separated to thre series, one for for-rc, and two for-next.

-------
The original goal of this series was to allow ability to view connection
(QP) information about running processes, however I used this opportunity and
created common infrastructure to track and report various resources. The report
part is implemented in netlink (nldev), but smart ULPs can now create
advanced usage models based on device utilization.

The current implementation relies on one lock per-object per-device, so
creation/destroying of various objects (CQ, PD, e.t.c) on various or the
same devices doesn't interfere each with another.

The data protection is performed with SRCU and its reader-writer model
ensures that resource won't be destroyed till readers will finish their
work.

Possible future work will include:
 * Reducing number of locks in RDMA, because of SRCU.
 * Converting CMA to be based completely on resource tracking.
 * Addition of other objects and extending current to give full
   and detailed state of the RDMA kernel stack.
 * Replacing synchronize_srcu with call_srcu to make destroy flow
   non-blocking.
 * Provide reliable device reset flow, preserving resource creation ordering.

	Thanks
---------------------------------------

Leon Romanovsky (8):
  RDMA/core: Print caller name instead of function name
  RDMA/core: Save kernel caller name in PD and CQ objects
  RDMA/restrack: Add general infrastructure to track RDMA resources
  RDMA/core: Add resource tracking for create and destroy QPs
  RDMA/core: Add resource tracking for create and destroy CQs
  RDMA/core: Add resource tracking for create and destroy PDs
  RDMA/nldev: Provide global resource utilization
  RDMA/nldev: Provide detailed QP information

 drivers/infiniband/core/Makefile           |   2 +-
 drivers/infiniband/core/core_priv.h        |  26 ++
 drivers/infiniband/core/cq.c               |  16 +-
 drivers/infiniband/core/device.c           |   7 +
 drivers/infiniband/core/nldev.c            | 373 +++++++++++++++++++++++++++++
 drivers/infiniband/core/restrack.c         | 159 ++++++++++++
 drivers/infiniband/core/uverbs_cmd.c       |   7 +-
 drivers/infiniband/core/uverbs_std_types.c |   3 +
 drivers/infiniband/core/verbs.c            |  18 +-
 include/rdma/ib_verbs.h                    |  34 ++-
 include/rdma/restrack.h                    | 152 ++++++++++++
 include/uapi/rdma/rdma_netlink.h           |  55 +++++
 12 files changed, 835 insertions(+), 17 deletions(-)
 create mode 100644 drivers/infiniband/core/restrack.c
 create mode 100644 include/rdma/restrack.h

--
2.15.1

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 20+ messages in thread

* [PATCH rdma-next v6 1/8] RDMA/core: Print caller name instead of function name
       [not found] ` <20180125151227.28202-1-leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
@ 2018-01-25 15:12   ` Leon Romanovsky
  2018-01-25 15:12   ` [PATCH rdma-next v6 2/8] RDMA/core: Save kernel caller name in PD and CQ objects Leon Romanovsky
                     ` (6 subsequent siblings)
  7 siblings, 0 replies; 20+ messages in thread
From: Leon Romanovsky @ 2018-01-25 15:12 UTC (permalink / raw)
  To: Doug Ledford, Jason Gunthorpe
  Cc: Leon Romanovsky, RDMA mailing list, Mark Bloch, Steve Wise,
	Leon Romanovsky

From: Leon Romanovsky <leonro-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>

Reuse the name from the Kconfig to mark the caller.

Signed-off-by: Leon Romanovsky <leonro-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
---
 include/rdma/ib_verbs.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index 5e32fe781ca3..1c6e9f52f127 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -2858,7 +2858,7 @@ enum ib_pd_flags {
 struct ib_pd *__ib_alloc_pd(struct ib_device *device, unsigned int flags,
 		const char *caller);
 #define ib_alloc_pd(device, flags) \
-	__ib_alloc_pd((device), (flags), __func__)
+	__ib_alloc_pd((device), (flags), KBUILD_MODNAME)
 void ib_dealloc_pd(struct ib_pd *pd);
 
 /**
-- 
2.15.1

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH rdma-next v6 2/8] RDMA/core: Save kernel caller name in PD and CQ objects
       [not found] ` <20180125151227.28202-1-leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
  2018-01-25 15:12   ` [PATCH rdma-next v6 1/8] RDMA/core: Print caller name instead of function name Leon Romanovsky
@ 2018-01-25 15:12   ` Leon Romanovsky
  2018-01-25 15:12   ` [PATCH rdma-next v6 3/8] RDMA/restrack: Add general infrastructure to track RDMA resources Leon Romanovsky
                     ` (5 subsequent siblings)
  7 siblings, 0 replies; 20+ messages in thread
From: Leon Romanovsky @ 2018-01-25 15:12 UTC (permalink / raw)
  To: Doug Ledford, Jason Gunthorpe
  Cc: Leon Romanovsky, RDMA mailing list, Mark Bloch, Steve Wise,
	Leon Romanovsky

From: Leon Romanovsky <leonro-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>

The KBUILD_MODNAME variable contains the module name
and it is known for kernel users during compilation,
so let's reuse it to track the owners.

Signed-off-by: Leon Romanovsky <leonro-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
---
 drivers/infiniband/core/cq.c    | 10 ++++++----
 drivers/infiniband/core/verbs.c |  4 ++--
 include/rdma/ib_verbs.h         | 13 ++++++++++---
 3 files changed, 18 insertions(+), 9 deletions(-)

diff --git a/drivers/infiniband/core/cq.c b/drivers/infiniband/core/cq.c
index c8c5a5a7f433..d99565dbd12f 100644
--- a/drivers/infiniband/core/cq.c
+++ b/drivers/infiniband/core/cq.c
@@ -120,20 +120,22 @@ static void ib_cq_completion_workqueue(struct ib_cq *cq, void *private)
 }
 
 /**
- * ib_alloc_cq - allocate a completion queue
+ * __ib_alloc_cq - allocate a completion queue
  * @dev:		device to allocate the CQ for
  * @private:		driver private data, accessible from cq->cq_context
  * @nr_cqe:		number of CQEs to allocate
  * @comp_vector:	HCA completion vectors for this CQ
  * @poll_ctx:		context to poll the CQ from.
+ * @caller:		module owner name.
  *
  * This is the proper interface to allocate a CQ for in-kernel users. A
  * CQ allocated with this interface will automatically be polled from the
  * specified context. The ULP must use wr->wr_cqe instead of wr->wr_id
  * to use this CQ abstraction.
  */
-struct ib_cq *ib_alloc_cq(struct ib_device *dev, void *private,
-		int nr_cqe, int comp_vector, enum ib_poll_context poll_ctx)
+struct ib_cq *__ib_alloc_cq(struct ib_device *dev, void *private,
+			    int nr_cqe, int comp_vector,
+			    enum ib_poll_context poll_ctx, const char *caller)
 {
 	struct ib_cq_init_attr cq_attr = {
 		.cqe		= nr_cqe,
@@ -185,7 +187,7 @@ struct ib_cq *ib_alloc_cq(struct ib_device *dev, void *private,
 	cq->device->destroy_cq(cq);
 	return ERR_PTR(ret);
 }
-EXPORT_SYMBOL(ib_alloc_cq);
+EXPORT_SYMBOL(__ib_alloc_cq);
 
 /**
  * ib_free_cq - free a completion queue
diff --git a/drivers/infiniband/core/verbs.c b/drivers/infiniband/core/verbs.c
index e9c3991a93ff..c2b347f6e8a2 100644
--- a/drivers/infiniband/core/verbs.c
+++ b/drivers/infiniband/core/verbs.c
@@ -1782,7 +1782,7 @@ int ib_detach_mcast(struct ib_qp *qp, union ib_gid *gid, u16 lid)
 }
 EXPORT_SYMBOL(ib_detach_mcast);
 
-struct ib_xrcd *ib_alloc_xrcd(struct ib_device *device)
+struct ib_xrcd *__ib_alloc_xrcd(struct ib_device *device, const char *caller)
 {
 	struct ib_xrcd *xrcd;
 
@@ -1800,7 +1800,7 @@ struct ib_xrcd *ib_alloc_xrcd(struct ib_device *device)
 
 	return xrcd;
 }
-EXPORT_SYMBOL(ib_alloc_xrcd);
+EXPORT_SYMBOL(__ib_alloc_xrcd);
 
 int ib_dealloc_xrcd(struct ib_xrcd *xrcd)
 {
diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index 1c6e9f52f127..f9cabb1b670e 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -3135,8 +3135,12 @@ static inline int ib_post_recv(struct ib_qp *qp,
 	return qp->device->post_recv(qp, recv_wr, bad_recv_wr);
 }
 
-struct ib_cq *ib_alloc_cq(struct ib_device *dev, void *private,
-		int nr_cqe, int comp_vector, enum ib_poll_context poll_ctx);
+struct ib_cq *__ib_alloc_cq(struct ib_device *dev, void *private,
+			    int nr_cqe, int comp_vector,
+			    enum ib_poll_context poll_ctx, const char *caller);
+#define ib_alloc_cq(device, priv, nr_cqe, comp_vect, poll_ctx) \
+	__ib_alloc_cq((device), (priv), (nr_cqe), (comp_vect), (poll_ctx), KBUILD_MODNAME)
+
 void ib_free_cq(struct ib_cq *cq);
 int ib_process_cq_direct(struct ib_cq *cq, int budget);
 
@@ -3560,8 +3564,11 @@ int ib_detach_mcast(struct ib_qp *qp, union ib_gid *gid, u16 lid);
 /**
  * ib_alloc_xrcd - Allocates an XRC domain.
  * @device: The device on which to allocate the XRC domain.
+ * @caller: Module name for kernel consumers
  */
-struct ib_xrcd *ib_alloc_xrcd(struct ib_device *device);
+struct ib_xrcd *__ib_alloc_xrcd(struct ib_device *device, const char *caller);
+#define ib_alloc_xrcd(device) \
+	__ib_alloc_xrcd((device), KBUILD_MODNAME)
 
 /**
  * ib_dealloc_xrcd - Deallocates an XRC domain.
-- 
2.15.1

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH rdma-next v6 3/8] RDMA/restrack: Add general infrastructure to track RDMA resources
       [not found] ` <20180125151227.28202-1-leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
  2018-01-25 15:12   ` [PATCH rdma-next v6 1/8] RDMA/core: Print caller name instead of function name Leon Romanovsky
  2018-01-25 15:12   ` [PATCH rdma-next v6 2/8] RDMA/core: Save kernel caller name in PD and CQ objects Leon Romanovsky
@ 2018-01-25 15:12   ` Leon Romanovsky
       [not found]     ` <20180125151227.28202-4-leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
  2018-01-25 15:12   ` [PATCH rdma-next v6 4/8] RDMA/core: Add resource tracking for create and destroy QPs Leon Romanovsky
                     ` (4 subsequent siblings)
  7 siblings, 1 reply; 20+ messages in thread
From: Leon Romanovsky @ 2018-01-25 15:12 UTC (permalink / raw)
  To: Doug Ledford, Jason Gunthorpe
  Cc: Leon Romanovsky, RDMA mailing list, Mark Bloch, Steve Wise,
	Leon Romanovsky

From: Leon Romanovsky <leonro-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>

The RDMA subsystem has very strict set of objects to work on it,
but it completely lacks tracking facilities and no visibility of
resource utilization.

The following patch adds such infrastructure to keep track of RDMA
resources to help with debugging of user space applications. The primary
user of this infrastructure is RDMA nldev netlink (following patches),
but it is not limited too.

At this stage, the main three objects (PD, CQ and QP) are added,
and more will be added later.

Reviewed-by: Mark Bloch <markb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
Signed-off-by: Leon Romanovsky <leonro-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
Reviewed-by: Steve Wise <swise-7bPotxP6k4+P2YhJcF5u+vpXobYPEAuW@public.gmane.org>
---
 drivers/infiniband/core/Makefile    |   2 +-
 drivers/infiniband/core/core_priv.h |   1 +
 drivers/infiniband/core/device.c    |   7 ++
 drivers/infiniband/core/restrack.c  | 159 ++++++++++++++++++++++++++++++++++++
 include/rdma/ib_verbs.h             |  19 +++++
 include/rdma/restrack.h             | 152 ++++++++++++++++++++++++++++++++++
 6 files changed, 339 insertions(+), 1 deletion(-)
 create mode 100644 drivers/infiniband/core/restrack.c
 create mode 100644 include/rdma/restrack.h

diff --git a/drivers/infiniband/core/Makefile b/drivers/infiniband/core/Makefile
index 504b926552c6..f69833db0a32 100644
--- a/drivers/infiniband/core/Makefile
+++ b/drivers/infiniband/core/Makefile
@@ -12,7 +12,7 @@ ib_core-y :=			packer.o ud_header.o verbs.o cq.o rw.o sysfs.o \
 				device.o fmr_pool.o cache.o netlink.o \
 				roce_gid_mgmt.o mr_pool.o addr.o sa_query.o \
 				multicast.o mad.o smi.o agent.o mad_rmpp.o \
-				security.o nldev.o
+				security.o nldev.o restrack.o
 
 ib_core-$(CONFIG_INFINIBAND_USER_MEM) += umem.o
 ib_core-$(CONFIG_INFINIBAND_ON_DEMAND_PAGING) += umem_odp.o
diff --git a/drivers/infiniband/core/core_priv.h b/drivers/infiniband/core/core_priv.h
index aef9aa0ac0e6..2b1372da708a 100644
--- a/drivers/infiniband/core/core_priv.h
+++ b/drivers/infiniband/core/core_priv.h
@@ -40,6 +40,7 @@
 #include <rdma/ib_verbs.h>
 #include <rdma/opa_addr.h>
 #include <rdma/ib_mad.h>
+#include <rdma/restrack.h>
 #include "mad_priv.h"
 
 /* Total number of ports combined across all struct ib_devices's */
diff --git a/drivers/infiniband/core/device.c b/drivers/infiniband/core/device.c
index 2826e06311a5..c3e389f8c99a 100644
--- a/drivers/infiniband/core/device.c
+++ b/drivers/infiniband/core/device.c
@@ -263,6 +263,11 @@ struct ib_device *ib_alloc_device(size_t size)
 	if (!device)
 		return NULL;
 
+	if (rdma_restrack_init(&device->res)) {
+		kfree(device);
+		return NULL;
+	}
+
 	device->dev.class = &ib_class;
 	device_initialize(&device->dev);
 
@@ -596,6 +601,8 @@ void ib_unregister_device(struct ib_device *device)
 	}
 	up_read(&lists_rwsem);
 
+	rdma_restrack_clean(&device->res);
+
 	ib_device_unregister_rdmacg(device);
 	ib_device_unregister_sysfs(device);
 
diff --git a/drivers/infiniband/core/restrack.c b/drivers/infiniband/core/restrack.c
new file mode 100644
index 000000000000..ab1fca2d98e3
--- /dev/null
+++ b/drivers/infiniband/core/restrack.c
@@ -0,0 +1,159 @@
+/* SPDX-License-Identifier: (GPL-2.0+ OR BSD-3-Clause) */
+/*
+ * Copyright (c) 2017-2018 Mellanox Technologies. All rights reserved.
+ */
+
+#include <rdma/ib_verbs.h>
+#include <rdma/restrack.h>
+#include <linux/mutex.h>
+#include <linux/sched/task.h>
+#include <linux/uaccess.h>
+#include <linux/pid_namespace.h>
+
+int rdma_restrack_init(struct rdma_restrack_root *res)
+{
+	init_rwsem(&res->rwsem);
+	return 0;
+}
+
+void rdma_restrack_clean(struct rdma_restrack_root *res)
+{
+	WARN_ON_ONCE(!hash_empty(res->hash));
+}
+
+int rdma_restrack_count(struct rdma_restrack_root *res,
+			enum rdma_restrack_type type,
+			struct pid_namespace *ns)
+{
+	struct rdma_restrack_entry *e;
+	u32 cnt = 0;
+
+	down_read(&res->rwsem);
+	hash_for_each_possible(res->hash, e, node, type) {
+		if (ns == &init_pid_ns ||
+		    (!rdma_is_kernel_res(e) &&
+		     ns == task_active_pid_ns(e->task)))
+			cnt++;
+	}
+	up_read(&res->rwsem);
+	return cnt;
+}
+EXPORT_SYMBOL(rdma_restrack_count);
+
+static void set_kern_name(struct rdma_restrack_entry *res)
+{
+	enum rdma_restrack_type type = res->type;
+	struct ib_qp *qp;
+
+	if (type != RDMA_RESTRACK_QP)
+		/* PD and CQ types already have this name embedded in */
+		return;
+
+	qp = container_of(res, struct ib_qp, res);
+	if (!qp->pd) {
+		WARN_ONCE(true, "XRC QPs are not supported\n");
+		/* Survive, despite the programmer's error */
+		res->kern_name = " ";
+		return;
+	}
+
+	res->kern_name = qp->pd->res.kern_name;
+}
+
+static struct ib_device *res_to_dev(struct rdma_restrack_entry *res)
+{
+	enum rdma_restrack_type type = res->type;
+	struct ib_device *dev;
+	struct ib_xrcd *xrcd;
+	struct ib_pd *pd;
+	struct ib_cq *cq;
+	struct ib_qp *qp;
+
+	switch (type) {
+	case RDMA_RESTRACK_PD:
+		pd = container_of(res, struct ib_pd, res);
+		dev = pd->device;
+		break;
+	case RDMA_RESTRACK_CQ:
+		cq = container_of(res, struct ib_cq, res);
+		dev = cq->device;
+		break;
+	case RDMA_RESTRACK_QP:
+		qp = container_of(res, struct ib_qp, res);
+		dev = qp->device;
+		break;
+	case RDMA_RESTRACK_XRCD:
+		xrcd = container_of(res, struct ib_xrcd, res);
+		dev = xrcd->device;
+		break;
+	default:
+		WARN_ONCE(true, "Wrong resource tracking type %u\n", type);
+		return NULL;
+	}
+
+	return dev;
+}
+
+void rdma_restrack_add(struct rdma_restrack_entry *res)
+{
+	struct ib_device *dev = res_to_dev(res);
+
+	if (!dev)
+		return;
+
+	down_write(&dev->res.rwsem);
+	if (!uaccess_kernel()) {
+		get_task_struct(current);
+		res->task = current;
+		res->kern_name = NULL;
+	} else {
+		set_kern_name(res);
+		res->task = NULL;
+	}
+
+	kref_init(&res->kref);
+	res->valid = true;
+	hash_add(dev->res.hash, &res->node, res->type);
+	up_write(&dev->res.rwsem);
+}
+EXPORT_SYMBOL(rdma_restrack_add);
+
+void rdma_restrack_get(struct rdma_restrack_entry *res)
+{
+	kref_get(&res->kref);
+}
+EXPORT_SYMBOL(rdma_restrack_get);
+
+static void restrack_release(struct kref *kref)
+{
+	struct rdma_restrack_entry *res;
+	struct ib_device *dev;
+
+	res = container_of(kref, struct rdma_restrack_entry, kref);
+	dev = res_to_dev(res);
+	if (!dev)
+		return;
+
+	down_write(&dev->res.rwsem);
+	hash_del(&res->node);
+	res->valid = false;
+	if (res->task)
+		put_task_struct(res->task);
+	up_write(&dev->res.rwsem);
+
+}
+
+int rdma_restrack_put(struct rdma_restrack_entry *res)
+{
+	if (!res->valid)
+		return 1;
+
+	return kref_put(&res->kref, restrack_release);
+}
+EXPORT_SYMBOL(rdma_restrack_put);
+
+void rdma_restrack_del(struct rdma_restrack_entry *res)
+{
+	rdma_restrack_put(res);
+}
+EXPORT_SYMBOL(rdma_restrack_del);
diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index f9cabb1b670e..65af66b477cb 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -63,6 +63,7 @@
 #include <linux/uaccess.h>
 #include <linux/cgroup_rdma.h>
 #include <uapi/rdma/ib_user_verbs.h>
+#include <rdma/restrack.h>
 
 #define IB_FW_VERSION_NAME_MAX	ETHTOOL_FWVERS_LEN
 
@@ -1530,6 +1531,7 @@ struct ib_pd {
 	 * Implementation details of the RDMA core, don't use in drivers:
 	 */
 	struct ib_mr	       *__internal_mr;
+	struct rdma_restrack_entry res;
 };
 
 struct ib_xrcd {
@@ -1539,6 +1541,10 @@ struct ib_xrcd {
 
 	struct mutex		tgt_qp_mutex;
 	struct list_head	tgt_qp_list;
+	/*
+	 * Implementation details of the RDMA core, don't use in drivers:
+	 */
+	struct rdma_restrack_entry res;
 };
 
 struct ib_ah {
@@ -1570,6 +1576,10 @@ struct ib_cq {
 		struct irq_poll		iop;
 		struct work_struct	work;
 	};
+	/*
+	 * Implementation details of the RDMA core, don't use in drivers:
+	 */
+	struct rdma_restrack_entry res;
 };
 
 struct ib_srq {
@@ -1746,6 +1756,11 @@ struct ib_qp {
 	struct ib_rwq_ind_table *rwq_ind_tbl;
 	struct ib_qp_security  *qp_sec;
 	u8			port;
+
+	/*
+	 * Implementation details of the RDMA core, don't use in drivers:
+	 */
+	struct rdma_restrack_entry     res;
 };
 
 struct ib_mr {
@@ -2352,6 +2367,10 @@ struct ib_device {
 #endif
 
 	u32                          index;
+	/*
+	 * Implementation details of the RDMA core, don't use in drivers
+	 */
+	struct rdma_restrack_root     res;
 
 	/**
 	 * The following mandatory functions are used only at device
diff --git a/include/rdma/restrack.h b/include/rdma/restrack.h
new file mode 100644
index 000000000000..255d92c43f6a
--- /dev/null
+++ b/include/rdma/restrack.h
@@ -0,0 +1,152 @@
+/* SPDX-License-Identifier: (GPL-2.0+ OR BSD-3-Clause) */
+/*
+ * Copyright (c) 2017-2018 Mellanox Technologies. All rights reserved.
+ */
+
+#ifndef _RDMA_RESTRACK_H_
+#define _RDMA_RESTRACK_H_
+
+#include <linux/typecheck.h>
+#include <linux/rwsem.h>
+#include <linux/sched.h>
+#include <linux/kref.h>
+
+/**
+ * enum rdma_restrack_type - HW objects to track
+ */
+enum rdma_restrack_type {
+	/**
+	 * @RDMA_RESTRACK_PD: Protection domain (PD)
+	 */
+	RDMA_RESTRACK_PD,
+	/**
+	 * @RDMA_RESTRACK_CQ: Completion queue (CQ)
+	 */
+	RDMA_RESTRACK_CQ,
+	/**
+	 * @RDMA_RESTRACK_QP: Queue pair (QP)
+	 */
+	RDMA_RESTRACK_QP,
+	/**
+	 * @RDMA_RESTRACK_XRCD: XRC domain (XRCD)
+	 */
+	RDMA_RESTRACK_XRCD,
+	/**
+	 * @RDMA_RESTRACK_MAX: Last entry, used for array dclarations
+	 */
+	RDMA_RESTRACK_MAX
+};
+
+#define RDMA_RESTRACK_HASH_BITS	8
+/**
+ * struct rdma_restrack_root - main resource tracking management
+ * entity, per-device
+ */
+struct rdma_restrack_root {
+	/*
+	 * @rwsem: Read/write lock to protect lists
+	 */
+	struct rw_semaphore	rwsem;
+	/**
+	 * @hash: global database for all resources per-device
+	 */
+	DECLARE_HASHTABLE(hash, RDMA_RESTRACK_HASH_BITS);
+};
+
+/**
+ * struct rdma_restrack_entry - metadata per-entry
+ */
+struct rdma_restrack_entry {
+	/**
+	 * @valid: validity indicator
+	 *
+	 * The entries are filled during rdma_restrack_add,
+	 * can be attempted to be free during rdma_restrack_del.
+	 *
+	 * As an example for that, see mlx5 QPs with type MLX5_IB_QPT_HW_GSI
+	 */
+	bool			valid;
+	/*
+	 * @kref: Protect destroy of the resource
+	 */
+	struct kref		kref;
+	/**
+	 * @task: owner of resource tracking entity
+	 *
+	 * There are two types of entities: created by user and created
+	 * by kernel.
+	 *
+	 * This is relevant for the entities created by users.
+	 * For the entities created by kernel, this pointer will be NULL.
+	 */
+	struct task_struct	*task;
+	/**
+	 * @kern_name: name of owner for the kernel created entities.
+	 */
+	const char		*kern_name;
+	/**
+	 * @node: hash table entry
+	 */
+	struct hlist_node	node;
+	/**
+	 * @type: various objects in restrack database
+	 */
+	enum rdma_restrack_type	type;
+};
+
+/**
+ * rdma_restrack_init() - initialize resource tracking
+ * @res:  resource tracking root
+ */
+int rdma_restrack_init(struct rdma_restrack_root *res);
+
+/**
+ * rdma_restrack_clean() - clean resource tracking
+ * @res:  resource tracking root
+ */
+void rdma_restrack_clean(struct rdma_restrack_root *res);
+
+/**
+ * rdma_restrack_count() - the current usage of specific object
+ * @res:  resource entry
+ * @type: actual type of object to operate
+ * @ns:   PID namespace
+ */
+int rdma_restrack_count(struct rdma_restrack_root *res,
+			enum rdma_restrack_type type,
+			struct pid_namespace *ns);
+
+/**
+ * rdma_restrack_add() - add object to the reource tracking database
+ * @res:  resource entry
+ */
+void rdma_restrack_add(struct rdma_restrack_entry *res);
+
+/**
+ * rdma_restrack_del() - delete object from the reource tracking database
+ * @res:  resource entry
+ * @type: actual type of object to operate
+ */
+void rdma_restrack_del(struct rdma_restrack_entry *res);
+
+/**
+ * rdma_is_kernel_res() - check the owner of resource
+ * @res:  resource entry
+ */
+static inline bool rdma_is_kernel_res(struct rdma_restrack_entry *res)
+{
+	return !res->task;
+}
+
+/**
+ * rdma_restrack_get() - grab to protect resource from release
+ * @res:  resource entry
+ */
+void rdma_restrack_get(struct rdma_restrack_entry *res);
+
+/**
+ * rdma_restrack_put() - relase resource
+ * @res:  resource entry
+ */
+int rdma_restrack_put(struct rdma_restrack_entry *res);
+#endif /* _RDMA_RESTRACK_H_ */
-- 
2.15.1

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH rdma-next v6 4/8] RDMA/core: Add resource tracking for create and destroy QPs
       [not found] ` <20180125151227.28202-1-leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
                     ` (2 preceding siblings ...)
  2018-01-25 15:12   ` [PATCH rdma-next v6 3/8] RDMA/restrack: Add general infrastructure to track RDMA resources Leon Romanovsky
@ 2018-01-25 15:12   ` Leon Romanovsky
  2018-01-25 15:12   ` [PATCH rdma-next v6 5/8] RDMA/core: Add resource tracking for create and destroy CQs Leon Romanovsky
                     ` (3 subsequent siblings)
  7 siblings, 0 replies; 20+ messages in thread
From: Leon Romanovsky @ 2018-01-25 15:12 UTC (permalink / raw)
  To: Doug Ledford, Jason Gunthorpe
  Cc: Leon Romanovsky, RDMA mailing list, Mark Bloch, Steve Wise,
	Leon Romanovsky

From: Leon Romanovsky <leonro-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>

Track create and destroy operations of QP objects.

Reviewed-by: Mark Bloch <markb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
Signed-off-by: Leon Romanovsky <leonro-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
Reviewed-by: Steve Wise <swise-7bPotxP6k4+P2YhJcF5u+vpXobYPEAuW@public.gmane.org>
---
 drivers/infiniband/core/core_priv.h  | 25 +++++++++++++++++++++++++
 drivers/infiniband/core/uverbs_cmd.c |  3 +--
 drivers/infiniband/core/verbs.c      |  6 ++----
 3 files changed, 28 insertions(+), 6 deletions(-)

diff --git a/drivers/infiniband/core/core_priv.h b/drivers/infiniband/core/core_priv.h
index 2b1372da708a..2067f52e1da0 100644
--- a/drivers/infiniband/core/core_priv.h
+++ b/drivers/infiniband/core/core_priv.h
@@ -301,4 +301,29 @@ struct ib_device *ib_device_get_by_index(u32 ifindex);
 /* RDMA device netlink */
 void nldev_init(void);
 void nldev_exit(void);
+
+static inline struct ib_qp *_ib_create_qp(struct ib_device *dev,
+					  struct ib_pd *pd,
+					  struct ib_qp_init_attr *attr,
+					  struct ib_udata *udata)
+{
+	struct ib_qp *qp;
+
+	qp = dev->create_qp(pd, attr, udata);
+	if (!IS_ERR(qp)) {
+		qp->device = dev;
+		qp->pd	   = pd;
+		/*
+		 * We don't track XRC QPs for now, because they don't have PD
+		 * and more importantly they are created internaly by driver,
+		 * see mlx5 create_dev_resources() as an example.
+		 */
+		if (attr->qp_type < IB_QPT_XRC_INI) {
+			qp->res.type = RDMA_RESTRACK_QP;
+			rdma_restrack_add(&qp->res);
+		}
+	}
+
+	return qp;
+}
 #endif /* _CORE_PRIV_H */
diff --git a/drivers/infiniband/core/uverbs_cmd.c b/drivers/infiniband/core/uverbs_cmd.c
index 4ddd61d90507..825325c764a1 100644
--- a/drivers/infiniband/core/uverbs_cmd.c
+++ b/drivers/infiniband/core/uverbs_cmd.c
@@ -1514,7 +1514,7 @@ static int create_qp(struct ib_uverbs_file *file,
 	if (cmd->qp_type == IB_QPT_XRC_TGT)
 		qp = ib_create_qp(pd, &attr);
 	else
-		qp = device->create_qp(pd, &attr, uhw);
+		qp = _ib_create_qp(device, pd, &attr, uhw);
 
 	if (IS_ERR(qp)) {
 		ret = PTR_ERR(qp);
@@ -1527,7 +1527,6 @@ static int create_qp(struct ib_uverbs_file *file,
 			goto err_cb;
 
 		qp->real_qp	  = qp;
-		qp->device	  = device;
 		qp->pd		  = pd;
 		qp->send_cq	  = attr.send_cq;
 		qp->recv_cq	  = attr.recv_cq;
diff --git a/drivers/infiniband/core/verbs.c b/drivers/infiniband/core/verbs.c
index c2b347f6e8a2..c3628a437403 100644
--- a/drivers/infiniband/core/verbs.c
+++ b/drivers/infiniband/core/verbs.c
@@ -844,7 +844,6 @@ static struct ib_qp *ib_create_xrc_qp(struct ib_qp *qp,
 
 	qp->event_handler = __ib_shared_qp_event_handler;
 	qp->qp_context = qp;
-	qp->pd = NULL;
 	qp->send_cq = qp->recv_cq = NULL;
 	qp->srq = NULL;
 	qp->xrcd = qp_init_attr->xrcd;
@@ -882,7 +881,7 @@ struct ib_qp *ib_create_qp(struct ib_pd *pd,
 	if (qp_init_attr->cap.max_rdma_ctxs)
 		rdma_rw_init_qp(device, qp_init_attr);
 
-	qp = device->create_qp(pd, qp_init_attr, NULL);
+	qp = _ib_create_qp(device, pd, qp_init_attr, NULL);
 	if (IS_ERR(qp))
 		return qp;
 
@@ -892,7 +891,6 @@ struct ib_qp *ib_create_qp(struct ib_pd *pd,
 		return ERR_PTR(ret);
 	}
 
-	qp->device     = device;
 	qp->real_qp    = qp;
 	qp->uobject    = NULL;
 	qp->qp_type    = qp_init_attr->qp_type;
@@ -922,7 +920,6 @@ struct ib_qp *ib_create_qp(struct ib_pd *pd,
 			atomic_inc(&qp_init_attr->srq->usecnt);
 	}
 
-	qp->pd	    = pd;
 	qp->send_cq = qp_init_attr->send_cq;
 	qp->xrcd    = NULL;
 
@@ -1538,6 +1535,7 @@ int ib_destroy_qp(struct ib_qp *qp)
 	if (!qp->uobject)
 		rdma_rw_cleanup_mrs(qp);
 
+	rdma_restrack_del(&qp->res);
 	ret = qp->device->destroy_qp(qp);
 	if (!ret) {
 		if (pd)
-- 
2.15.1

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH rdma-next v6 5/8] RDMA/core: Add resource tracking for create and destroy CQs
       [not found] ` <20180125151227.28202-1-leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
                     ` (3 preceding siblings ...)
  2018-01-25 15:12   ` [PATCH rdma-next v6 4/8] RDMA/core: Add resource tracking for create and destroy QPs Leon Romanovsky
@ 2018-01-25 15:12   ` Leon Romanovsky
  2018-01-25 15:12   ` [PATCH rdma-next v6 6/8] RDMA/core: Add resource tracking for create and destroy PDs Leon Romanovsky
                     ` (2 subsequent siblings)
  7 siblings, 0 replies; 20+ messages in thread
From: Leon Romanovsky @ 2018-01-25 15:12 UTC (permalink / raw)
  To: Doug Ledford, Jason Gunthorpe
  Cc: Leon Romanovsky, RDMA mailing list, Mark Bloch, Steve Wise,
	Leon Romanovsky

From: Leon Romanovsky <leonro-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>

Track create and destroy operations of CQ objects.

Reviewed-by: Mark Bloch <markb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
Signed-off-by: Leon Romanovsky <leonro-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
Reviewed-by: Steve Wise <swise-7bPotxP6k4+P2YhJcF5u+vpXobYPEAuW@public.gmane.org>
---
 drivers/infiniband/core/cq.c               | 6 ++++++
 drivers/infiniband/core/uverbs_cmd.c       | 2 ++
 drivers/infiniband/core/uverbs_std_types.c | 3 +++
 drivers/infiniband/core/verbs.c            | 3 +++
 4 files changed, 14 insertions(+)

diff --git a/drivers/infiniband/core/cq.c b/drivers/infiniband/core/cq.c
index d99565dbd12f..bc79ca8215d7 100644
--- a/drivers/infiniband/core/cq.c
+++ b/drivers/infiniband/core/cq.c
@@ -159,6 +159,10 @@ struct ib_cq *__ib_alloc_cq(struct ib_device *dev, void *private,
 	if (!cq->wc)
 		goto out_destroy_cq;
 
+	cq->res.type = RDMA_RESTRACK_CQ;
+	cq->res.kern_name = caller;
+	rdma_restrack_add(&cq->res);
+
 	switch (cq->poll_ctx) {
 	case IB_POLL_DIRECT:
 		cq->comp_handler = ib_cq_completion_direct;
@@ -183,6 +187,7 @@ struct ib_cq *__ib_alloc_cq(struct ib_device *dev, void *private,
 
 out_free_wc:
 	kfree(cq->wc);
+	rdma_restrack_del(&cq->res);
 out_destroy_cq:
 	cq->device->destroy_cq(cq);
 	return ERR_PTR(ret);
@@ -214,6 +219,7 @@ void ib_free_cq(struct ib_cq *cq)
 	}
 
 	kfree(cq->wc);
+	rdma_restrack_del(&cq->res);
 	ret = cq->device->destroy_cq(cq);
 	WARN_ON_ONCE(ret);
 }
diff --git a/drivers/infiniband/core/uverbs_cmd.c b/drivers/infiniband/core/uverbs_cmd.c
index 825325c764a1..3e95acd29de7 100644
--- a/drivers/infiniband/core/uverbs_cmd.c
+++ b/drivers/infiniband/core/uverbs_cmd.c
@@ -1033,6 +1033,8 @@ static struct ib_ucq_object *create_cq(struct ib_uverbs_file *file,
 		goto err_cb;
 
 	uobj_alloc_commit(&obj->uobject);
+	cq->res.type = RDMA_RESTRACK_CQ;
+	rdma_restrack_add(&cq->res);
 
 	return obj;
 
diff --git a/drivers/infiniband/core/uverbs_std_types.c b/drivers/infiniband/core/uverbs_std_types.c
index c3ee5d9b336d..b571176babbe 100644
--- a/drivers/infiniband/core/uverbs_std_types.c
+++ b/drivers/infiniband/core/uverbs_std_types.c
@@ -35,6 +35,7 @@
 #include <rdma/ib_verbs.h>
 #include <linux/bug.h>
 #include <linux/file.h>
+#include <rdma/restrack.h>
 #include "rdma_core.h"
 #include "uverbs.h"
 
@@ -319,6 +320,8 @@ static int uverbs_create_cq_handler(struct ib_device *ib_dev,
 	obj->uobject.object = cq;
 	obj->uobject.user_handle = user_handle;
 	atomic_set(&cq->usecnt, 0);
+	cq->res.type = RDMA_RESTRACK_CQ;
+	rdma_restrack_add(&cq->res);
 
 	ret = uverbs_copy_to(attrs, CREATE_CQ_RESP_CQE, &cq->cqe);
 	if (ret)
diff --git a/drivers/infiniband/core/verbs.c b/drivers/infiniband/core/verbs.c
index c3628a437403..983b49ffb8d5 100644
--- a/drivers/infiniband/core/verbs.c
+++ b/drivers/infiniband/core/verbs.c
@@ -1578,6 +1578,8 @@ struct ib_cq *ib_create_cq(struct ib_device *device,
 		cq->event_handler = event_handler;
 		cq->cq_context    = cq_context;
 		atomic_set(&cq->usecnt, 0);
+		cq->res.type = RDMA_RESTRACK_CQ;
+		rdma_restrack_add(&cq->res);
 	}
 
 	return cq;
@@ -1596,6 +1598,7 @@ int ib_destroy_cq(struct ib_cq *cq)
 	if (atomic_read(&cq->usecnt))
 		return -EBUSY;
 
+	rdma_restrack_del(&cq->res);
 	return cq->device->destroy_cq(cq);
 }
 EXPORT_SYMBOL(ib_destroy_cq);
-- 
2.15.1

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH rdma-next v6 6/8] RDMA/core: Add resource tracking for create and destroy PDs
       [not found] ` <20180125151227.28202-1-leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
                     ` (4 preceding siblings ...)
  2018-01-25 15:12   ` [PATCH rdma-next v6 5/8] RDMA/core: Add resource tracking for create and destroy CQs Leon Romanovsky
@ 2018-01-25 15:12   ` Leon Romanovsky
  2018-01-25 15:12   ` [PATCH rdma-next v6 7/8] RDMA/nldev: Provide global resource utilization Leon Romanovsky
  2018-01-25 15:12   ` [PATCH rdma-next v6 8/8] RDMA/nldev: Provide detailed QP information Leon Romanovsky
  7 siblings, 0 replies; 20+ messages in thread
From: Leon Romanovsky @ 2018-01-25 15:12 UTC (permalink / raw)
  To: Doug Ledford, Jason Gunthorpe
  Cc: Leon Romanovsky, RDMA mailing list, Mark Bloch, Steve Wise,
	Leon Romanovsky

From: Leon Romanovsky <leonro-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>

Track create and destroy operations of PD objects.

Reviewed-by: Mark Bloch <markb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
Signed-off-by: Leon Romanovsky <leonro-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
Reviewed-by: Steve Wise <swise-7bPotxP6k4+P2YhJcF5u+vpXobYPEAuW@public.gmane.org>
---
 drivers/infiniband/core/uverbs_cmd.c | 2 ++
 drivers/infiniband/core/verbs.c      | 5 +++++
 2 files changed, 7 insertions(+)

diff --git a/drivers/infiniband/core/uverbs_cmd.c b/drivers/infiniband/core/uverbs_cmd.c
index 3e95acd29de7..256934d1f64f 100644
--- a/drivers/infiniband/core/uverbs_cmd.c
+++ b/drivers/infiniband/core/uverbs_cmd.c
@@ -340,6 +340,8 @@ ssize_t ib_uverbs_alloc_pd(struct ib_uverbs_file *file,
 	uobj->object = pd;
 	memset(&resp, 0, sizeof resp);
 	resp.pd_handle = uobj->id;
+	pd->res.type = RDMA_RESTRACK_PD;
+	rdma_restrack_add(&pd->res);
 
 	if (copy_to_user(u64_to_user_ptr(cmd.response), &resp, sizeof resp)) {
 		ret = -EFAULT;
diff --git a/drivers/infiniband/core/verbs.c b/drivers/infiniband/core/verbs.c
index 983b49ffb8d5..a98a3e8412f8 100644
--- a/drivers/infiniband/core/verbs.c
+++ b/drivers/infiniband/core/verbs.c
@@ -263,6 +263,10 @@ struct ib_pd *__ib_alloc_pd(struct ib_device *device, unsigned int flags,
 		mr_access_flags |= IB_ACCESS_REMOTE_READ | IB_ACCESS_REMOTE_WRITE;
 	}
 
+	pd->res.type = RDMA_RESTRACK_PD;
+	pd->res.kern_name = caller;
+	rdma_restrack_add(&pd->res);
+
 	if (mr_access_flags) {
 		struct ib_mr *mr;
 
@@ -312,6 +316,7 @@ void ib_dealloc_pd(struct ib_pd *pd)
 	   requires the caller to guarantee we can't race here. */
 	WARN_ON(atomic_read(&pd->usecnt));
 
+	rdma_restrack_del(&pd->res);
 	/* Making delalloc_pd a void return is a WIP, no driver should return
 	   an error here. */
 	ret = pd->device->dealloc_pd(pd);
-- 
2.15.1

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH rdma-next v6 7/8] RDMA/nldev: Provide global resource utilization
       [not found] ` <20180125151227.28202-1-leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
                     ` (5 preceding siblings ...)
  2018-01-25 15:12   ` [PATCH rdma-next v6 6/8] RDMA/core: Add resource tracking for create and destroy PDs Leon Romanovsky
@ 2018-01-25 15:12   ` Leon Romanovsky
  2018-01-25 15:12   ` [PATCH rdma-next v6 8/8] RDMA/nldev: Provide detailed QP information Leon Romanovsky
  7 siblings, 0 replies; 20+ messages in thread
From: Leon Romanovsky @ 2018-01-25 15:12 UTC (permalink / raw)
  To: Doug Ledford, Jason Gunthorpe
  Cc: Leon Romanovsky, RDMA mailing list, Mark Bloch, Steve Wise,
	Leon Romanovsky

From: Leon Romanovsky <leonro-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>

Export through netlink interface, the global device utilization
for the rdmatool as the main user of RDMA nldev interface.

Provide both dumpit and doit callbacks.

As an example of possible output from rdmatool for system with 5
Mellanox's card

$ rdma res
1: mlx5_0: qp 4 cq 5 pd 3
2: mlx5_1: qp 4 cq 5 pd 3
3: mlx5_2: qp 4 cq 5 pd 3
4: mlx5_3: qp 2 cq 3 pd 2
5: mlx5_4: qp 4 cq 5 pd 3

Reviewed-by: Mark Bloch <markb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
Signed-off-by: Leon Romanovsky <leonro-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
Reviewed-by: Steve Wise <swise-7bPotxP6k4+P2YhJcF5u+vpXobYPEAuW@public.gmane.org>
---
 drivers/infiniband/core/nldev.c  | 151 +++++++++++++++++++++++++++++++++++++++
 include/uapi/rdma/rdma_netlink.h |  10 +++
 2 files changed, 161 insertions(+)

diff --git a/drivers/infiniband/core/nldev.c b/drivers/infiniband/core/nldev.c
index 5d790c507c7e..7bbd88a6b6a0 100644
--- a/drivers/infiniband/core/nldev.c
+++ b/drivers/infiniband/core/nldev.c
@@ -31,6 +31,8 @@
  */
 
 #include <linux/module.h>
+#include <linux/pid.h>
+#include <linux/pid_namespace.h>
 #include <net/netlink.h>
 #include <rdma/rdma_netlink.h>
 
@@ -52,6 +54,11 @@ static const struct nla_policy nldev_policy[RDMA_NLDEV_ATTR_MAX] = {
 	[RDMA_NLDEV_ATTR_PORT_STATE]	= { .type = NLA_U8 },
 	[RDMA_NLDEV_ATTR_PORT_PHYS_STATE] = { .type = NLA_U8 },
 	[RDMA_NLDEV_ATTR_DEV_NODE_TYPE] = { .type = NLA_U8 },
+	[RDMA_NLDEV_ATTR_RES_SUMMARY]	= { .type = NLA_NESTED },
+	[RDMA_NLDEV_ATTR_RES_SUMMARY_ENTRY]	= { .type = NLA_NESTED },
+	[RDMA_NLDEV_ATTR_RES_SUMMARY_ENTRY_NAME] = { .type = NLA_NUL_STRING,
+					     .len = 16 },
+	[RDMA_NLDEV_ATTR_RES_SUMMARY_ENTRY_CURR] = { .type = NLA_U64 },
 };
 
 static int fill_nldev_handle(struct sk_buff *msg, struct ib_device *device)
@@ -134,6 +141,65 @@ static int fill_port_info(struct sk_buff *msg,
 	return 0;
 }
 
+static int fill_res_info_entry(struct sk_buff *msg,
+			       const char *name, u64 curr)
+{
+	struct nlattr *entry_attr;
+
+	entry_attr = nla_nest_start(msg, RDMA_NLDEV_ATTR_RES_SUMMARY_ENTRY);
+	if (!entry_attr)
+		return -EMSGSIZE;
+
+	if (nla_put_string(msg, RDMA_NLDEV_ATTR_RES_SUMMARY_ENTRY_NAME, name))
+		goto err;
+	if (nla_put_u64_64bit(msg,
+			      RDMA_NLDEV_ATTR_RES_SUMMARY_ENTRY_CURR, curr, 0))
+		goto err;
+
+	nla_nest_end(msg, entry_attr);
+	return 0;
+
+err:
+	nla_nest_cancel(msg, entry_attr);
+	return -EMSGSIZE;
+}
+
+static int fill_res_info(struct sk_buff *msg, struct ib_device *device)
+{
+	static const char *names[RDMA_RESTRACK_MAX] = {
+		[RDMA_RESTRACK_PD] = "pd",
+		[RDMA_RESTRACK_CQ] = "cq",
+		[RDMA_RESTRACK_QP] = "qp",
+	};
+
+	struct rdma_restrack_root *res = &device->res;
+	struct nlattr *table_attr;
+	int ret, i, curr;
+
+	if (fill_nldev_handle(msg, device))
+		return -EMSGSIZE;
+
+	table_attr = nla_nest_start(msg, RDMA_NLDEV_ATTR_RES_SUMMARY);
+	if (!table_attr)
+		return -EMSGSIZE;
+
+	for (i = 0; i < RDMA_RESTRACK_MAX; i++) {
+		if (!names[i])
+			continue;
+		curr = rdma_restrack_count(res, i, task_active_pid_ns(current));
+		ret = fill_res_info_entry(msg, names[i], curr);
+		if (ret)
+			goto err;
+	}
+
+	nla_nest_end(msg, table_attr);
+	return 0;
+
+err:
+	nla_nest_cancel(msg, table_attr);
+	return ret;
+}
+
 static int nldev_get_doit(struct sk_buff *skb, struct nlmsghdr *nlh,
 			  struct netlink_ext_ack *extack)
 {
@@ -329,6 +395,87 @@ static int nldev_port_get_dumpit(struct sk_buff *skb,
 	return skb->len;
 }
 
+static int nldev_res_get_doit(struct sk_buff *skb, struct nlmsghdr *nlh,
+			      struct netlink_ext_ack *extack)
+{
+	struct nlattr *tb[RDMA_NLDEV_ATTR_MAX];
+	struct ib_device *device;
+	struct sk_buff *msg;
+	u32 index;
+	int ret;
+
+	ret = nlmsg_parse(nlh, 0, tb, RDMA_NLDEV_ATTR_MAX - 1,
+			  nldev_policy, extack);
+	if (ret || !tb[RDMA_NLDEV_ATTR_DEV_INDEX])
+		return -EINVAL;
+
+	index = nla_get_u32(tb[RDMA_NLDEV_ATTR_DEV_INDEX]);
+	device = ib_device_get_by_index(index);
+	if (!device)
+		return -EINVAL;
+
+	msg = nlmsg_new(NLMSG_DEFAULT_SIZE, GFP_KERNEL);
+	if (!msg)
+		goto err;
+
+	nlh = nlmsg_put(msg, NETLINK_CB(skb).portid, nlh->nlmsg_seq,
+			RDMA_NL_GET_TYPE(RDMA_NL_NLDEV, RDMA_NLDEV_CMD_RES_GET),
+			0, 0);
+
+	ret = fill_res_info(msg, device);
+	if (ret)
+		goto err_free;
+
+	nlmsg_end(msg, nlh);
+	put_device(&device->dev);
+	return rdma_nl_unicast(msg, NETLINK_CB(skb).portid);
+
+err_free:
+	nlmsg_free(msg);
+err:
+	put_device(&device->dev);
+	return ret;
+}
+
+static int _nldev_res_get_dumpit(struct ib_device *device,
+				 struct sk_buff *skb,
+				 struct netlink_callback *cb,
+				 unsigned int idx)
+{
+	int start = cb->args[0];
+	struct nlmsghdr *nlh;
+
+	if (idx < start)
+		return 0;
+
+	nlh = nlmsg_put(skb, NETLINK_CB(cb->skb).portid, cb->nlh->nlmsg_seq,
+			RDMA_NL_GET_TYPE(RDMA_NL_NLDEV, RDMA_NLDEV_CMD_RES_GET),
+			0, NLM_F_MULTI);
+
+	if (fill_res_info(skb, device)) {
+		nlmsg_cancel(skb, nlh);
+		goto out;
+	}
+
+	nlmsg_end(skb, nlh);
+
+	idx++;
+
+out:
+	cb->args[0] = idx;
+	return skb->len;
+}
+
+static int nldev_res_get_dumpit(struct sk_buff *skb,
+				struct netlink_callback *cb)
+{
+	/*
+	 * There is no need to take lock, because
+	 * we are relying on ib_core's lists_rwsem
+	 */
+	return ib_enum_all_devs(_nldev_res_get_dumpit, skb, cb);
+}
+
 static const struct rdma_nl_cbs nldev_cb_table[RDMA_NLDEV_NUM_OPS] = {
 	[RDMA_NLDEV_CMD_GET] = {
 		.doit = nldev_get_doit,
@@ -338,6 +485,10 @@ static const struct rdma_nl_cbs nldev_cb_table[RDMA_NLDEV_NUM_OPS] = {
 		.doit = nldev_port_get_doit,
 		.dump = nldev_port_get_dumpit,
 	},
+	[RDMA_NLDEV_CMD_RES_GET] = {
+		.doit = nldev_res_get_doit,
+		.dump = nldev_res_get_dumpit,
+	},
 };
 
 void __init nldev_init(void)
diff --git a/include/uapi/rdma/rdma_netlink.h b/include/uapi/rdma/rdma_netlink.h
index cc002e316d09..e0f5cdc81541 100644
--- a/include/uapi/rdma/rdma_netlink.h
+++ b/include/uapi/rdma/rdma_netlink.h
@@ -236,6 +236,11 @@ enum rdma_nldev_command {
 	RDMA_NLDEV_CMD_PORT_NEW,
 	RDMA_NLDEV_CMD_PORT_DEL,
 
+	RDMA_NLDEV_CMD_RES_GET, /* can dump */
+	RDMA_NLDEV_CMD_RES_SET,
+	RDMA_NLDEV_CMD_RES_NEW,
+	RDMA_NLDEV_CMD_RES_DEL,
+
 	RDMA_NLDEV_NUM_OPS
 };
 
@@ -303,6 +308,11 @@ enum rdma_nldev_attr {
 
 	RDMA_NLDEV_ATTR_DEV_NODE_TYPE,		/* u8 */
 
+	RDMA_NLDEV_ATTR_RES_SUMMARY,		/* nested table */
+	RDMA_NLDEV_ATTR_RES_SUMMARY_ENTRY,	/* nested table */
+	RDMA_NLDEV_ATTR_RES_SUMMARY_ENTRY_NAME,	/* string */
+	RDMA_NLDEV_ATTR_RES_SUMMARY_ENTRY_CURR,	/* u64 */
+
 	RDMA_NLDEV_ATTR_MAX
 };
 #endif /* _UAPI_RDMA_NETLINK_H */
-- 
2.15.1

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH rdma-next v6 8/8] RDMA/nldev: Provide detailed QP information
       [not found] ` <20180125151227.28202-1-leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
                     ` (6 preceding siblings ...)
  2018-01-25 15:12   ` [PATCH rdma-next v6 7/8] RDMA/nldev: Provide global resource utilization Leon Romanovsky
@ 2018-01-25 15:12   ` Leon Romanovsky
  7 siblings, 0 replies; 20+ messages in thread
From: Leon Romanovsky @ 2018-01-25 15:12 UTC (permalink / raw)
  To: Doug Ledford, Jason Gunthorpe
  Cc: Leon Romanovsky, RDMA mailing list, Mark Bloch, Steve Wise,
	Leon Romanovsky

From: Leon Romanovsky <leonro-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>

Implement RDMA nldev netlink interface to get detailed
QP information.

Currently only dumpit variant is implemented.

Reviewed-by: Mark Bloch <markb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
Signed-off-by: Leon Romanovsky <leonro-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
Reviewed-by: Steve Wise <swise-7bPotxP6k4+P2YhJcF5u+vpXobYPEAuW@public.gmane.org>
---
 drivers/infiniband/core/nldev.c  | 222 +++++++++++++++++++++++++++++++++++++++
 include/uapi/rdma/rdma_netlink.h |  45 ++++++++
 2 files changed, 267 insertions(+)

diff --git a/drivers/infiniband/core/nldev.c b/drivers/infiniband/core/nldev.c
index 7bbd88a6b6a0..c840fbc58a23 100644
--- a/drivers/infiniband/core/nldev.c
+++ b/drivers/infiniband/core/nldev.c
@@ -59,6 +59,18 @@ static const struct nla_policy nldev_policy[RDMA_NLDEV_ATTR_MAX] = {
 	[RDMA_NLDEV_ATTR_RES_SUMMARY_ENTRY_NAME] = { .type = NLA_NUL_STRING,
 					     .len = 16 },
 	[RDMA_NLDEV_ATTR_RES_SUMMARY_ENTRY_CURR] = { .type = NLA_U64 },
+	[RDMA_NLDEV_ATTR_RES_QP]		= { .type = NLA_NESTED },
+	[RDMA_NLDEV_ATTR_RES_QP_ENTRY]		= { .type = NLA_NESTED },
+	[RDMA_NLDEV_ATTR_RES_LQPN]		= { .type = NLA_U32 },
+	[RDMA_NLDEV_ATTR_RES_RQPN]		= { .type = NLA_U32 },
+	[RDMA_NLDEV_ATTR_RES_RQ_PSN]		= { .type = NLA_U32 },
+	[RDMA_NLDEV_ATTR_RES_SQ_PSN]		= { .type = NLA_U32 },
+	[RDMA_NLDEV_ATTR_RES_PATH_MIG_STATE] = { .type = NLA_U8 },
+	[RDMA_NLDEV_ATTR_RES_TYPE]		= { .type = NLA_U8 },
+	[RDMA_NLDEV_ATTR_RES_STATE]		= { .type = NLA_U8 },
+	[RDMA_NLDEV_ATTR_RES_PID]		= { .type = NLA_U32 },
+	[RDMA_NLDEV_ATTR_RES_KERN_NAME]		= { .type = NLA_NUL_STRING,
+						    .len = TASK_COMM_LEN },
 };
 
 static int fill_nldev_handle(struct sk_buff *msg, struct ib_device *device)
@@ -200,6 +212,78 @@ static int fill_res_info(struct sk_buff *msg, struct ib_device *device)
 	return ret;
 }
 
+static int fill_res_qp_entry(struct sk_buff *msg,
+			     struct ib_qp *qp, uint32_t port)
+{
+	struct rdma_restrack_entry *res = &qp->res;
+	struct ib_qp_init_attr qp_init_attr;
+	struct nlattr *entry_attr;
+	struct ib_qp_attr qp_attr;
+	int ret;
+
+	ret = ib_query_qp(qp, &qp_attr, 0, &qp_init_attr);
+	if (ret)
+		return ret;
+
+	if (port && port != qp_attr.port_num)
+		return 0;
+
+	entry_attr = nla_nest_start(msg, RDMA_NLDEV_ATTR_RES_QP_ENTRY);
+	if (!entry_attr)
+		goto out;
+
+	/* In create_qp() port is not set yet */
+	if (qp_attr.port_num &&
+	    nla_put_u32(msg, RDMA_NLDEV_ATTR_PORT_INDEX, qp_attr.port_num))
+		goto err;
+
+	if (nla_put_u32(msg, RDMA_NLDEV_ATTR_RES_LQPN, qp->qp_num))
+		goto err;
+	if (qp->qp_type == IB_QPT_RC || qp->qp_type == IB_QPT_UC) {
+		if (nla_put_u32(msg, RDMA_NLDEV_ATTR_RES_RQPN,
+				qp_attr.dest_qp_num))
+			goto err;
+		if (nla_put_u32(msg, RDMA_NLDEV_ATTR_RES_RQ_PSN,
+				qp_attr.rq_psn))
+			goto err;
+	}
+
+	if (nla_put_u32(msg, RDMA_NLDEV_ATTR_RES_SQ_PSN, qp_attr.sq_psn))
+		goto err;
+
+	if (qp->qp_type == IB_QPT_RC || qp->qp_type == IB_QPT_UC ||
+	    qp->qp_type == IB_QPT_XRC_INI || qp->qp_type == IB_QPT_XRC_TGT) {
+		if (nla_put_u8(msg, RDMA_NLDEV_ATTR_RES_PATH_MIG_STATE,
+			       qp_attr.path_mig_state))
+			goto err;
+	}
+	if (nla_put_u8(msg, RDMA_NLDEV_ATTR_RES_TYPE, qp->qp_type))
+		goto err;
+	if (nla_put_u8(msg, RDMA_NLDEV_ATTR_RES_STATE, qp_attr.qp_state))
+		goto err;
+
+	/*
+	 * Existence of task means that it is user QP and netlink
+	 * user is invited to go and read /proc/PID/comm to get name
+	 * of the task file and res->task_com should be NULL.
+	 */
+	if (rdma_is_kernel_res(res)) {
+		if (nla_put_string(msg, RDMA_NLDEV_ATTR_RES_KERN_NAME, res->kern_name))
+			goto err;
+	} else {
+		if (nla_put_u32(msg, RDMA_NLDEV_ATTR_RES_PID, task_pid_vnr(res->task)))
+			goto err;
+	}
+
+	nla_nest_end(msg, entry_attr);
+	return 0;
+
+err:
+	nla_nest_cancel(msg, entry_attr);
+out:
+	return -EMSGSIZE;
+}
+
 static int nldev_get_doit(struct sk_buff *skb, struct nlmsghdr *nlh,
 			  struct netlink_ext_ack *extack)
 {
@@ -476,6 +560,131 @@ static int nldev_res_get_dumpit(struct sk_buff *skb,
 	return ib_enum_all_devs(_nldev_res_get_dumpit, skb, cb);
 }
 
+static int nldev_res_get_qp_dumpit(struct sk_buff *skb,
+				   struct netlink_callback *cb)
+{
+	struct nlattr *tb[RDMA_NLDEV_ATTR_MAX];
+	struct rdma_restrack_entry *res;
+	int err, ret = 0, idx = 0;
+	struct nlattr *table_attr;
+	struct ib_device *device;
+	int start = cb->args[0];
+	struct ib_qp *qp = NULL;
+	struct nlmsghdr *nlh;
+	u32 index, port = 0;
+
+	err = nlmsg_parse(cb->nlh, 0, tb, RDMA_NLDEV_ATTR_MAX - 1,
+			  nldev_policy, NULL);
+	/*
+	 * Right now, we are expecting the device index to get QP information,
+	 * but it is possible to extend this code to return all devices in
+	 * one shot by checking the existence of RDMA_NLDEV_ATTR_DEV_INDEX.
+	 * if it doesn't exist, we will iterate over all devices.
+	 *
+	 * But it is not needed for now.
+	 */
+	if (err || !tb[RDMA_NLDEV_ATTR_DEV_INDEX])
+		return -EINVAL;
+
+	index = nla_get_u32(tb[RDMA_NLDEV_ATTR_DEV_INDEX]);
+	device = ib_device_get_by_index(index);
+	if (!device)
+		return -EINVAL;
+
+	/*
+	 * If no PORT_INDEX is supplied, we will return all QPs from that device
+	 */
+	if (tb[RDMA_NLDEV_ATTR_PORT_INDEX]) {
+		port = nla_get_u32(tb[RDMA_NLDEV_ATTR_PORT_INDEX]);
+		if (!rdma_is_port_valid(device, port)) {
+			ret = -EINVAL;
+			goto err_index;
+		}
+	}
+
+	nlh = nlmsg_put(skb, NETLINK_CB(cb->skb).portid, cb->nlh->nlmsg_seq,
+			RDMA_NL_GET_TYPE(RDMA_NL_NLDEV, RDMA_NLDEV_CMD_RES_QP_GET),
+			0, NLM_F_MULTI);
+
+	if (fill_nldev_handle(skb, device)) {
+		ret = -EMSGSIZE;
+		goto err;
+	}
+
+	table_attr = nla_nest_start(skb, RDMA_NLDEV_ATTR_RES_QP);
+	if (!table_attr) {
+		ret = -EMSGSIZE;
+		goto err;
+	}
+
+	down_read(&device->res.rwsem);
+	hash_for_each_possible(device->res.hash, res, node, RDMA_RESTRACK_QP) {
+		if (idx < start) {
+			idx++;
+			continue;
+		}
+
+		if ((rdma_is_kernel_res(res) &&
+		     task_active_pid_ns(current) != &init_pid_ns) ||
+		    (!rdma_is_kernel_res(res) &&
+		     task_active_pid_ns(current) != task_active_pid_ns(res->task)))
+			/*
+			 * 1. Kernel QPs should be visible in init namsapce only
+			 * 2. Preent only QPs visible in the current namespace
+			 */
+			continue;
+
+		rdma_restrack_get(res);
+		qp = container_of(res, struct ib_qp, res);
+
+		up_read(&device->res.rwsem);
+		ret = fill_res_qp_entry(skb, qp, port);
+		down_read(&device->res.rwsem);
+		/*
+		 * Return resource back, but it won't be released till
+		 * the &device->res.rwsem will be released for write.
+		 */
+		rdma_restrack_put(res);
+
+		if (ret == -EMSGSIZE)
+			/*
+			 * There is a chance to optimize here.
+			 * It can be done by using list_prepare_entry
+			 * and list_for_each_entry_continue afterwards.
+			 */
+			break;
+		if (ret)
+			goto res_err;
+		idx++;
+	}
+	up_read(&device->res.rwsem);
+
+	nla_nest_end(skb, table_attr);
+	nlmsg_end(skb, nlh);
+	cb->args[0] = idx;
+
+	/*
+	 * No more QPs to fill, cancel the message and
+	 * return 0 to mark end of dumpit.
+	 */
+	if (!qp)
+		goto err;
+
+	put_device(&device->dev);
+	return skb->len;
+
+res_err:
+	nla_nest_cancel(skb, table_attr);
+	up_read(&device->res.rwsem);
+
+err:
+	nlmsg_cancel(skb, nlh);
+
+err_index:
+	put_device(&device->dev);
+	return ret;
+}
+
 static const struct rdma_nl_cbs nldev_cb_table[RDMA_NLDEV_NUM_OPS] = {
 	[RDMA_NLDEV_CMD_GET] = {
 		.doit = nldev_get_doit,
@@ -489,6 +698,19 @@ static const struct rdma_nl_cbs nldev_cb_table[RDMA_NLDEV_NUM_OPS] = {
 		.doit = nldev_res_get_doit,
 		.dump = nldev_res_get_dumpit,
 	},
+	[RDMA_NLDEV_CMD_RES_QP_GET] = {
+		.dump = nldev_res_get_qp_dumpit,
+		/*
+		 * .doit is not implemented yet for two reasons:
+		 * 1. It is not needed yet.
+		 * 2. There is a need to provide identifier, while it is easy
+		 * for the QPs (device index + port index + LQPN), it is not
+		 * the case for the rest of resources (PD and CQ). Because it
+		 * is better to provide similar interface for all resources,
+		 * let's wait till we will have other resources implemented
+		 * too.
+		 */
+	},
 };
 
 void __init nldev_init(void)
diff --git a/include/uapi/rdma/rdma_netlink.h b/include/uapi/rdma/rdma_netlink.h
index e0f5cdc81541..23bef4015982 100644
--- a/include/uapi/rdma/rdma_netlink.h
+++ b/include/uapi/rdma/rdma_netlink.h
@@ -241,6 +241,11 @@ enum rdma_nldev_command {
 	RDMA_NLDEV_CMD_RES_NEW,
 	RDMA_NLDEV_CMD_RES_DEL,
 
+	RDMA_NLDEV_CMD_RES_QP_GET, /* can dump */
+	RDMA_NLDEV_CMD_RES_QP_SET,
+	RDMA_NLDEV_CMD_RES_QP_NEW,
+	RDMA_NLDEV_CMD_RES_QP_DEL,
+
 	RDMA_NLDEV_NUM_OPS
 };
 
@@ -313,6 +318,46 @@ enum rdma_nldev_attr {
 	RDMA_NLDEV_ATTR_RES_SUMMARY_ENTRY_NAME,	/* string */
 	RDMA_NLDEV_ATTR_RES_SUMMARY_ENTRY_CURR,	/* u64 */
 
+	RDMA_NLDEV_ATTR_RES_QP,			/* nested table */
+	RDMA_NLDEV_ATTR_RES_QP_ENTRY,		/* nested table */
+	/*
+	 * Local QPN
+	 */
+	RDMA_NLDEV_ATTR_RES_LQPN,		/* u32 */
+	/*
+	 * Remote QPN,
+	 * Applicable for RC and UC only IBTA 11.2.5.3 QUERY QUEUE PAIR
+	 */
+	RDMA_NLDEV_ATTR_RES_RQPN,		/* u32 */
+	/*
+	 * Receive Queue PSN,
+	 * Applicable for RC and UC only 11.2.5.3 QUERY QUEUE PAIR
+	 */
+	RDMA_NLDEV_ATTR_RES_RQ_PSN,		/* u32 */
+	/*
+	 * Send Queue PSN
+	 */
+	RDMA_NLDEV_ATTR_RES_SQ_PSN,		/* u32 */
+	RDMA_NLDEV_ATTR_RES_PATH_MIG_STATE,	/* u8 */
+	/*
+	 * QP types as visible to RDMA/core, the reserved QPT
+	 * are not exported through this interface.
+	 */
+	RDMA_NLDEV_ATTR_RES_TYPE,		/* u8 */
+	RDMA_NLDEV_ATTR_RES_STATE,		/* u8 */
+	/*
+	 * Process ID which created object,
+	 * in case of kernel origin, PID won't exist.
+	 */
+	RDMA_NLDEV_ATTR_RES_PID,		/* u32 */
+	/*
+	 * The name of process created following resource.
+	 * It will exist only for kernel objects.
+	 * For user created objects, the user is supposed
+	 * to read /proc/PID/comm file.
+	 */
+	RDMA_NLDEV_ATTR_RES_KERN_NAME,		/* string */
+
 	RDMA_NLDEV_ATTR_MAX
 };
 #endif /* _UAPI_RDMA_NETLINK_H */
-- 
2.15.1

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [PATCH rdma-next v6 3/8] RDMA/restrack: Add general infrastructure to track RDMA resources
       [not found]     ` <20180125151227.28202-4-leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
@ 2018-01-25 16:39       ` Doug Ledford
       [not found]         ` <1516898399.27592.139.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  2018-01-25 17:45       ` Jason Gunthorpe
  1 sibling, 1 reply; 20+ messages in thread
From: Doug Ledford @ 2018-01-25 16:39 UTC (permalink / raw)
  To: Leon Romanovsky, Jason Gunthorpe
  Cc: RDMA mailing list, Mark Bloch, Steve Wise, Leon Romanovsky

[-- Attachment #1: Type: text/plain, Size: 1050 bytes --]

On Thu, 2018-01-25 at 17:12 +0200, Leon Romanovsky wrote:
> +       if (rdma_restrack_init(&device->res)) {
> +               kfree(device);
> +               return NULL;
> +       }
> +

[ snip ]

> +int rdma_restrack_init(struct rdma_restrack_root *res)
> +{
> +       init_rwsem(&res->rwsem);
> +       return 0;
> +}
> +

Is there a reason you are doing it this way?  Right now, the init can't
fail and the test for failure is a no-op.  I looked through the
remaining patches and nothing in this patchset changes that (at least
that I caught).  Is there a future intent here?  If so, I can see
leaving this as is (although if there is a respin, fixing it and then
changing it back when you add the actual failure opportunity would make
sense).  But, if not, the init function should be void and there should
be no failure test when calling it.

-- 
Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
    GPG KeyID: B826A3330E572FDD
    Key fingerprint = AE6B 1BDA 122B 23B4 265B  1274 B826 A333 0E57 2FDD

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH rdma-next v6 3/8] RDMA/restrack: Add general infrastructure to track RDMA resources
       [not found]     ` <20180125151227.28202-4-leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
  2018-01-25 16:39       ` Doug Ledford
@ 2018-01-25 17:45       ` Jason Gunthorpe
       [not found]         ` <20180125174529.GB3739-uk2M96/98Pc@public.gmane.org>
  1 sibling, 1 reply; 20+ messages in thread
From: Jason Gunthorpe @ 2018-01-25 17:45 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Doug Ledford, RDMA mailing list, Mark Bloch, Steve Wise, Leon Romanovsky

On Thu, Jan 25, 2018 at 05:12:22PM +0200, Leon Romanovsky wrote:
> From: Leon Romanovsky <leonro-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
> +
> +/**
> + * struct rdma_restrack_entry - metadata per-entry
> + */
> +struct rdma_restrack_entry {
> +	/**
> +	 * @valid: validity indicator
> +	 *
> +	 * The entries are filled during rdma_restrack_add,
> +	 * can be attempted to be free during rdma_restrack_del.
> +	 *
> +	 * As an example for that, see mlx5 QPs with type MLX5_IB_QPT_HW_GSI
> +	 */
> +	bool			valid;
> +	/*
> +	 * @kref: Protect destroy of the resource
> +	 */
> +	struct kref		kref;

Sticking a kref in a random structure that is itself not malloc'd is a
big red flag:

@@ -1746,6 +1756,11 @@ struct ib_qp {
+       struct rdma_restrack_root     res;


Not seeing how this works at all.

qp does this:

+       rdma_restrack_del(&qp->res);
        ret = qp->device->destroy_qp(qp);

And for instance the mlx5 implementation of destroy_qp() does:

int mlx5_ib_destroy_qp(struct ib_qp *qp)
{
        struct mlx5_ib_qp *mqp = to_mqp(qp);
[..]

        kfree(mqp);

So we kfree the kref containing memory outside the kref's release
function. Super big red flag.

rdma_restrack_del is just:

+       return kref_put(&res->kref, restrack_release);

So we don't do anything to block progress to the kfree unless
the kref drefs to 0. But the reader holds a get/put across its
critical section so we never hit the 0 case when there is another
user.

Not even remotely right.

Adding a kref means it has to be added to struct ib_qp and then kfree's
in the drivers need to change to kref_put. That is alot of work.

The best idea I've had to make this locking work is what I emailed you
a few days ago..

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH rdma-next v6 3/8] RDMA/restrack: Add general infrastructure to track RDMA resources
       [not found]         ` <1516898399.27592.139.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2018-01-25 19:35           ` Leon Romanovsky
  0 siblings, 0 replies; 20+ messages in thread
From: Leon Romanovsky @ 2018-01-25 19:35 UTC (permalink / raw)
  To: Doug Ledford; +Cc: Jason Gunthorpe, RDMA mailing list, Mark Bloch, Steve Wise

[-- Attachment #1: Type: text/plain, Size: 1190 bytes --]

On Thu, Jan 25, 2018 at 11:39:59AM -0500, Doug Ledford wrote:
> On Thu, 2018-01-25 at 17:12 +0200, Leon Romanovsky wrote:
> > +       if (rdma_restrack_init(&device->res)) {
> > +               kfree(device);
> > +               return NULL;
> > +       }
> > +
>
> [ snip ]
>
> > +int rdma_restrack_init(struct rdma_restrack_root *res)
> > +{
> > +       init_rwsem(&res->rwsem);
> > +       return 0;
> > +}
> > +
>
> Is there a reason you are doing it this way?  Right now, the init can't
> fail and the test for failure is a no-op.  I looked through the
> remaining patches and nothing in this patchset changes that (at least
> that I caught).  Is there a future intent here?  If so, I can see
> leaving this as is (although if there is a respin, fixing it and then
> changing it back when you add the actual failure opportunity would make
> sense).  But, if not, the init function should be void and there should
> be no failure test when calling it.

I don't use the returned value, it is leftover.

>
> --
> Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
>     GPG KeyID: B826A3330E572FDD
>     Key fingerprint = AE6B 1BDA 122B 23B4 265B  1274 B826 A333 0E57 2FDD



[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH rdma-next v6 3/8] RDMA/restrack: Add general infrastructure to track RDMA resources
       [not found]         ` <20180125174529.GB3739-uk2M96/98Pc@public.gmane.org>
@ 2018-01-25 19:44           ` Leon Romanovsky
       [not found]             ` <20180125194436.GR1393-U/DQcQFIOTAAJjI8aNfphQ@public.gmane.org>
  0 siblings, 1 reply; 20+ messages in thread
From: Leon Romanovsky @ 2018-01-25 19:44 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: Doug Ledford, RDMA mailing list, Mark Bloch, Steve Wise

[-- Attachment #1: Type: text/plain, Size: 2347 bytes --]

On Thu, Jan 25, 2018 at 10:45:29AM -0700, Jason Gunthorpe wrote:
> On Thu, Jan 25, 2018 at 05:12:22PM +0200, Leon Romanovsky wrote:
> > From: Leon Romanovsky <leonro-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
> > +
> > +/**
> > + * struct rdma_restrack_entry - metadata per-entry
> > + */
> > +struct rdma_restrack_entry {
> > +	/**
> > +	 * @valid: validity indicator
> > +	 *
> > +	 * The entries are filled during rdma_restrack_add,
> > +	 * can be attempted to be free during rdma_restrack_del.
> > +	 *
> > +	 * As an example for that, see mlx5 QPs with type MLX5_IB_QPT_HW_GSI
> > +	 */
> > +	bool			valid;
> > +	/*
> > +	 * @kref: Protect destroy of the resource
> > +	 */
> > +	struct kref		kref;
>
> Sticking a kref in a random structure that is itself not malloc'd is a
> big red flag:

It is not "rand", but embedded part of ib_qp,ib_pd and ib_cq. It is
malloced as part of their creation.

>
> @@ -1746,6 +1756,11 @@ struct ib_qp {
> +       struct rdma_restrack_root     res;
>
>
> Not seeing how this works at all.
>
> qp does this:
>
> +       rdma_restrack_del(&qp->res);
>         ret = qp->device->destroy_qp(qp);
>
> And for instance the mlx5 implementation of destroy_qp() does:
>
> int mlx5_ib_destroy_qp(struct ib_qp *qp)
> {
>         struct mlx5_ib_qp *mqp = to_mqp(qp);
> [..]
>
>         kfree(mqp);
>
> So we kfree the kref containing memory outside the kref's release
> function. Super big red flag.

It is worth to look on the actual implementation, the
rdma_restrack_del() doesn't do a lot: removes from the list and marks
the resource as not valid.

>
> rdma_restrack_del is just:
>
> +       return kref_put(&res->kref, restrack_release);
>
> So we don't do anything to block progress to the kfree unless
> the kref drefs to 0. But the reader holds a get/put across its
> critical section so we never hit the 0 case when there is another
> user.
>
> Not even remotely right.
>
> Adding a kref means it has to be added to struct ib_qp and then kfree's
> in the drivers need to change to kref_put. That is alot of work.
>
> The best idea I've had to make this locking work is what I emailed you
> a few days ago..

It is more or less the same idea, but needs to move
down_write(&dev->res.rwsem) to be part of rdma_restrack_del and not
release routine. It will cause to block the progress of release.

Thanks

>
> Jason

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH rdma-next v6 3/8] RDMA/restrack: Add general infrastructure to track RDMA resources
       [not found]             ` <20180125194436.GR1393-U/DQcQFIOTAAJjI8aNfphQ@public.gmane.org>
@ 2018-01-25 20:10               ` Jason Gunthorpe
       [not found]                 ` <20180125201023.GA9162-uk2M96/98Pc@public.gmane.org>
  0 siblings, 1 reply; 20+ messages in thread
From: Jason Gunthorpe @ 2018-01-25 20:10 UTC (permalink / raw)
  To: Leon Romanovsky; +Cc: Doug Ledford, RDMA mailing list, Mark Bloch, Steve Wise

On Thu, Jan 25, 2018 at 09:44:36PM +0200, Leon Romanovsky wrote:
> > > +	/*
> > > +	 * @kref: Protect destroy of the resource
> > > +	 */
> > > +	struct kref		kref;
> >
> > Sticking a kref in a random structure that is itself not malloc'd is a
> > big red flag:
> 
> It is not "rand", but embedded part of ib_qp,ib_pd and ib_cq. It is
> malloced as part of their creation.

I mean the kref isn't in any way linked the lifetime of the malloc
that contains it. So it isn't really a kref, it is just a refcount.

> > @@ -1746,6 +1756,11 @@ struct ib_qp {
> > +       struct rdma_restrack_root     res;
> >
> >
> > Not seeing how this works at all.
> >
> > qp does this:
> >
> > +       rdma_restrack_del(&qp->res);
> >         ret = qp->device->destroy_qp(qp);
> >
> > And for instance the mlx5 implementation of destroy_qp() does:
> >
> > int mlx5_ib_destroy_qp(struct ib_qp *qp)
> > {
> >         struct mlx5_ib_qp *mqp = to_mqp(qp);
> > [..]
> >
> >         kfree(mqp);
> >
> > So we kfree the kref containing memory outside the kref's release
> > function. Super big red flag.
> 
> It is worth to look on the actual implementation, the
> rdma_restrack_del() doesn't do a lot: removes from the list and marks
> the resource as not valid.

It doesn't even do that if the kref count is elevated.

Which is the whole point, the rdma_restrack_del *doesn't* prevent the
use-after-free on the read side.

> > The best idea I've had to make this locking work is what I emailed you
> > a few days ago..
> 
> It is more or less the same idea, but needs to move
> down_write(&dev->res.rwsem) to be part of rdma_restrack_del and not
> release routine. It will cause to block the progress of release.

That doesn't help - the kref is still totally non-functional as a kref:

 CPU0                                CPU1

                                 down_read()
                                 rdma_restrack_get()
				 up_read()
down_write()
restrack_release()
up_write()
kfree(qp)
                                 fill_res_qp_entry(qp)
				 // boom, use after free

A kref is not a lock, and a kref that *doesn't* put kfree in the
release function is probably not a kref but a refcount.

What I sent you wasn't remotely like this, it had two nested locks,
the outer mutex and a very special open coded inner read/write lock
that doesn't deadlock when nested..

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH rdma-next v6 3/8] RDMA/restrack: Add general infrastructure to track RDMA resources
       [not found]                 ` <20180125201023.GA9162-uk2M96/98Pc@public.gmane.org>
@ 2018-01-25 20:27                   ` Leon Romanovsky
       [not found]                     ` <20180125202710.GS1393-U/DQcQFIOTAAJjI8aNfphQ@public.gmane.org>
  0 siblings, 1 reply; 20+ messages in thread
From: Leon Romanovsky @ 2018-01-25 20:27 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: Doug Ledford, RDMA mailing list, Mark Bloch, Steve Wise

[-- Attachment #1: Type: text/plain, Size: 3338 bytes --]

On Thu, Jan 25, 2018 at 01:10:23PM -0700, Jason Gunthorpe wrote:
> On Thu, Jan 25, 2018 at 09:44:36PM +0200, Leon Romanovsky wrote:
> > > > +	/*
> > > > +	 * @kref: Protect destroy of the resource
> > > > +	 */
> > > > +	struct kref		kref;
> > >
> > > Sticking a kref in a random structure that is itself not malloc'd is a
> > > big red flag:
> >
> > It is not "rand", but embedded part of ib_qp,ib_pd and ib_cq. It is
> > malloced as part of their creation.
>
> I mean the kref isn't in any way linked the lifetime of the malloc
> that contains it. So it isn't really a kref, it is just a refcount.

For now, yes, in the future no. IMHO it is the direction to manage RDMA
objects.

>
> > > @@ -1746,6 +1756,11 @@ struct ib_qp {
> > > +       struct rdma_restrack_root     res;
> > >
> > >
> > > Not seeing how this works at all.
> > >
> > > qp does this:
> > >
> > > +       rdma_restrack_del(&qp->res);
> > >         ret = qp->device->destroy_qp(qp);
> > >
> > > And for instance the mlx5 implementation of destroy_qp() does:
> > >
> > > int mlx5_ib_destroy_qp(struct ib_qp *qp)
> > > {
> > >         struct mlx5_ib_qp *mqp = to_mqp(qp);
> > > [..]
> > >
> > >         kfree(mqp);
> > >
> > > So we kfree the kref containing memory outside the kref's release
> > > function. Super big red flag.
> >
> > It is worth to look on the actual implementation, the
> > rdma_restrack_del() doesn't do a lot: removes from the list and marks
> > the resource as not valid.
>
> It doesn't even do that if the kref count is elevated.
>
> Which is the whole point, the rdma_restrack_del *doesn't* prevent the
> use-after-free on the read side.

I understand it and agree that it is the bug, which is very easy to fix.

>
> > > The best idea I've had to make this locking work is what I emailed you
> > > a few days ago..
> >
> > It is more or less the same idea, but needs to move
> > down_write(&dev->res.rwsem) to be part of rdma_restrack_del and not
> > release routine. It will cause to block the progress of release.
>
> That doesn't help - the kref is still totally non-functional as a kref:
>
>  CPU0                                CPU1
>
>                                  down_read()
>                                  rdma_restrack_get()
> 				 up_read()
> down_write()
> restrack_release()
> up_write()
> kfree(qp)
>                                  fill_res_qp_entry(qp)
> 				 // boom, use after free
>
> A kref is not a lock, and a kref that *doesn't* put kfree in the
> release function is probably not a kref but a refcount.
>
> What I sent you wasn't remotely like this, it had two nested locks,
> the outer mutex and a very special open coded inner read/write lock
> that doesn't deadlock when nested..

I still didn't like your approach, because it has three words which i don't want
to see here: "special open-coded variant". I believe that the way to go is to add
completion structure and convert _del to be something like that:

void rdma_restrack_del(struct rdma_restrack_entry *res)
{
	if (!rdma_restrack_put(res))
		/* it is pseudo-code */
		wait_for_completion(!res->valid);
	return;
}


>
> Jason
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH rdma-next v6 3/8] RDMA/restrack: Add general infrastructure to track RDMA resources
       [not found]                     ` <20180125202710.GS1393-U/DQcQFIOTAAJjI8aNfphQ@public.gmane.org>
@ 2018-01-25 21:00                       ` Jason Gunthorpe
       [not found]                         ` <20180125210026.GA10719-uk2M96/98Pc@public.gmane.org>
  0 siblings, 1 reply; 20+ messages in thread
From: Jason Gunthorpe @ 2018-01-25 21:00 UTC (permalink / raw)
  To: Leon Romanovsky; +Cc: Doug Ledford, RDMA mailing list, Mark Bloch, Steve Wise

On Thu, Jan 25, 2018 at 10:27:10PM +0200, Leon Romanovsky wrote:
> On Thu, Jan 25, 2018 at 01:10:23PM -0700, Jason Gunthorpe wrote:
> > On Thu, Jan 25, 2018 at 09:44:36PM +0200, Leon Romanovsky wrote:
> > > > > +	/*
> > > > > +	 * @kref: Protect destroy of the resource
> > > > > +	 */
> > > > > +	struct kref		kref;
> > > >
> > > > Sticking a kref in a random structure that is itself not malloc'd is a
> > > > big red flag:
> > >
> > > It is not "rand", but embedded part of ib_qp,ib_pd and ib_cq. It is
> > > malloced as part of their creation.
> >
> > I mean the kref isn't in any way linked the lifetime of the malloc
> > that contains it. So it isn't really a kref, it is just a refcount.
> 
> For now, yes, in the future no. IMHO it is the direction to manage RDMA
> objects.

Maybe but until we do that this doesn't have struct kref semantics at
all and shouldn't be called kref..

> > What I sent you wasn't remotely like this, it had two nested locks,
> > the outer mutex and a very special open coded inner read/write lock
> > that doesn't deadlock when nested..
> 
> I still didn't like your approach, because it has three words which i don't want
> to see here: "special open-coded variant". I believe that the way to go is to add
> completion structure and convert _del to be something like that:

Well, we can code the same idea with a completion, it is a little
clearer but uses more memory for the per-object completion struct.

Read side:
  mutex_lock(list_lock)
  for_each_list(...,obj...) { // Must NOT use _safe here
     // The object is in process of being deleted
     if (refcount_inc_not_zero(&obj->refcount))
          continue;
     mutex_unlock(list_lock);
      ...
     mutex_lock(list_lock);
     if (refcount_dec_and_test(&obj->recount))
         complete(&res->completion);
  }

Destroy side:
   if (refcount_dec_and_test(&obj->ref))
         complete(&obj->completion);
   wait_for_completion(obj->free);

   mutex_lock(list_lock)
   list_del(obj);


The refcount starts at 1 during init. Destroy triggers the freeing
process by decr. Once the refcount reaches 0 it latches at 0 as 'to be
destroyed' and the pointer can never leave the list_lock critical
section.

It is still tricky..

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH rdma-next v6 3/8] RDMA/restrack: Add general infrastructure to track RDMA resources
       [not found]                         ` <20180125210026.GA10719-uk2M96/98Pc@public.gmane.org>
@ 2018-01-25 21:18                           ` Leon Romanovsky
       [not found]                             ` <20180125211831.GT1393-U/DQcQFIOTAAJjI8aNfphQ@public.gmane.org>
  0 siblings, 1 reply; 20+ messages in thread
From: Leon Romanovsky @ 2018-01-25 21:18 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: Doug Ledford, RDMA mailing list, Mark Bloch, Steve Wise

[-- Attachment #1: Type: text/plain, Size: 2891 bytes --]

On Thu, Jan 25, 2018 at 02:00:26PM -0700, Jason Gunthorpe wrote:
> On Thu, Jan 25, 2018 at 10:27:10PM +0200, Leon Romanovsky wrote:
> > On Thu, Jan 25, 2018 at 01:10:23PM -0700, Jason Gunthorpe wrote:
> > > On Thu, Jan 25, 2018 at 09:44:36PM +0200, Leon Romanovsky wrote:
> > > > > > +	/*
> > > > > > +	 * @kref: Protect destroy of the resource
> > > > > > +	 */
> > > > > > +	struct kref		kref;
> > > > >
> > > > > Sticking a kref in a random structure that is itself not malloc'd is a
> > > > > big red flag:
> > > >
> > > > It is not "rand", but embedded part of ib_qp,ib_pd and ib_cq. It is
> > > > malloced as part of their creation.
> > >
> > > I mean the kref isn't in any way linked the lifetime of the malloc
> > > that contains it. So it isn't really a kref, it is just a refcount.
> >
> > For now, yes, in the future no. IMHO it is the direction to manage RDMA
> > objects.
>
> Maybe but until we do that this doesn't have struct kref semantics at
> all and shouldn't be called kref..

You are actually open-coded kref semantics with addition of completions.

>
> > > What I sent you wasn't remotely like this, it had two nested locks,
> > > the outer mutex and a very special open coded inner read/write lock
> > > that doesn't deadlock when nested..
> >
> > I still didn't like your approach, because it has three words which i don't want
> > to see here: "special open-coded variant". I believe that the way to go is to add
> > completion structure and convert _del to be something like that:
>
> Well, we can code the same idea with a completion, it is a little
> clearer but uses more memory for the per-object completion struct.
>
> Read side:
>   mutex_lock(list_lock)
>   for_each_list(...,obj...) { // Must NOT use _safe here
>      // The object is in process of being deleted
>      if (refcount_inc_not_zero(&obj->refcount))
>           continue;
>      mutex_unlock(list_lock);
>       ...
>      mutex_lock(list_lock);
>      if (refcount_dec_and_test(&obj->recount))
>          complete(&res->completion);
>   }
>
> Destroy side:
>    if (refcount_dec_and_test(&obj->ref))
>          complete(&obj->completion);
>    wait_for_completion(obj->free);
>
>    mutex_lock(list_lock)
>    list_del(obj);
>
>
> The refcount starts at 1 during init. Destroy triggers the freeing
> process by decr. Once the refcount reaches 0 it latches at 0 as 'to be
> destroyed' and the pointer can never leave the list_lock critical
> section.
>
> It is still tricky..

This is more or less how nldev does reads, with exception that counters
(kref) incremented and decremented under r/w semaphore, it eliminates the
check "if (refcount_inc_not_zero(&obj->refcount))" and instead complete
it calls to release (kref semantics).

It is worth to read the nldev part too.

The bug is that I didn't block rdma_restrack_del(), but the semantics
and locking scheme is right.

Thanks

>
> Jason

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH rdma-next v6 3/8] RDMA/restrack: Add general infrastructure to track RDMA resources
       [not found]                             ` <20180125211831.GT1393-U/DQcQFIOTAAJjI8aNfphQ@public.gmane.org>
@ 2018-01-25 21:57                               ` Jason Gunthorpe
       [not found]                                 ` <20180125215716.GA11537-uk2M96/98Pc@public.gmane.org>
  0 siblings, 1 reply; 20+ messages in thread
From: Jason Gunthorpe @ 2018-01-25 21:57 UTC (permalink / raw)
  To: Leon Romanovsky; +Cc: Doug Ledford, RDMA mailing list, Mark Bloch, Steve Wise

On Thu, Jan 25, 2018 at 11:18:31PM +0200, Leon Romanovsky wrote:
> On Thu, Jan 25, 2018 at 02:00:26PM -0700, Jason Gunthorpe wrote:
> > On Thu, Jan 25, 2018 at 10:27:10PM +0200, Leon Romanovsky wrote:
> > > On Thu, Jan 25, 2018 at 01:10:23PM -0700, Jason Gunthorpe wrote:
> > > > On Thu, Jan 25, 2018 at 09:44:36PM +0200, Leon Romanovsky wrote:
> > > > > > > +	/*
> > > > > > > +	 * @kref: Protect destroy of the resource
> > > > > > > +	 */
> > > > > > > +	struct kref		kref;
> > > > > >
> > > > > > Sticking a kref in a random structure that is itself not malloc'd is a
> > > > > > big red flag:
> > > > >
> > > > > It is not "rand", but embedded part of ib_qp,ib_pd and ib_cq. It is
> > > > > malloced as part of their creation.
> > > >
> > > > I mean the kref isn't in any way linked the lifetime of the malloc
> > > > that contains it. So it isn't really a kref, it is just a refcount.
> > >
> > > For now, yes, in the future no. IMHO it is the direction to manage RDMA
> > > objects.
> >
> > Maybe but until we do that this doesn't have struct kref semantics at
> > all and shouldn't be called kref..
> 
> You are actually open-coded kref semantics with addition of completions.

I generally don't like seeing 'complete()' as a kref release
function. There have been very subtle problems with code trying that
in the past..

But yes, you can force things to work that way if very, very careful.

I've often thought I'd like a helper type that had the kref+complete
semantics, because that does seem to crop up in a few places and it
would really help clarity.

> > > > What I sent you wasn't remotely like this, it had two nested locks,
> > > > the outer mutex and a very special open coded inner read/write lock
> > > > that doesn't deadlock when nested..
> > >
> > > I still didn't like your approach, because it has three words which i don't want
> > > to see here: "special open-coded variant". I believe that the way to go is to add
> > > completion structure and convert _del to be something like that:
> >
> > Well, we can code the same idea with a completion, it is a little
> > clearer but uses more memory for the per-object completion struct.
> >
> > Read side:
> >   mutex_lock(list_lock)
> >   for_each_list(...,obj...) { // Must NOT use _safe here
> >      // The object is in process of being deleted
> >      if (refcount_inc_not_zero(&obj->refcount))
> >           continue;
> >      mutex_unlock(list_lock);
> >       ...
> >      mutex_lock(list_lock);
> >      if (refcount_dec_and_test(&obj->recount))
> >          complete(&res->completion);
> >   }
> >
> > Destroy side:
> >    if (refcount_dec_and_test(&obj->ref))
> >          complete(&obj->completion);
> >    wait_for_completion(obj->free);
> >
> >    mutex_lock(list_lock)
> >    list_del(obj);
> >
> >
> > The refcount starts at 1 during init. Destroy triggers the freeing
> > process by decr. Once the refcount reaches 0 it latches at 0 as 'to be
> > destroyed' and the pointer can never leave the list_lock critical
> > section.
> >
> > It is still tricky..
> 
> This is more or less how nldev does reads, with exception that counters
> (kref) incremented and decremented under r/w semaphore, it eliminates the
> check "if (refcount_inc_not_zero(&obj->refcount))" and instead complete
> it calls to release (kref semantics).

Well, the read side is close, but no you can't skip the
refcount_inc_not_zero approach. It is subtly avoiding a race.

The v6 maybe tried to solve that race by having the release function
delete from the lists, but that just creates a deadlock:

Read side is doing:

+               down_read(&device->res.rwsem);
+		rdma_restrack_put(res);

Which could eventualy call

+static void restrack_release(struct kref *kref)
+{
+	down_write(&dev->res.rwsem);

Which is obviously an instant deadlock.

If a kref is used then the only thing the release function can do is
call complete, and you still must use kref_get_not_zero.

This is why I hate forcing krefs to do this - krefs evoke certain
expectations regarding how get and put work that this scheme throws
out the window. It makes it very hard to review if krefs don't follow
the usual kref expectations.

And this sort of bug, not using kref_get_not_zero or similar, is
actually the thing I usually find when people try and build these
things out of krefs :(

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH rdma-next v6 3/8] RDMA/restrack: Add general infrastructure to track RDMA resources
       [not found]                                 ` <20180125215716.GA11537-uk2M96/98Pc@public.gmane.org>
@ 2018-01-26  6:24                                   ` Leon Romanovsky
       [not found]                                     ` <20180126062423.GV1393-U/DQcQFIOTAAJjI8aNfphQ@public.gmane.org>
  0 siblings, 1 reply; 20+ messages in thread
From: Leon Romanovsky @ 2018-01-26  6:24 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: Doug Ledford, RDMA mailing list, Mark Bloch, Steve Wise

[-- Attachment #1: Type: text/plain, Size: 1792 bytes --]

On Thu, Jan 25, 2018 at 02:57:16PM -0700, Jason Gunthorpe wrote:
> On Thu, Jan 25, 2018 at 11:18:31PM +0200, Leon Romanovsky wrote:
> > On Thu, Jan 25, 2018 at 02:00:26PM -0700, Jason Gunthorpe wrote:
> > > On Thu, Jan 25, 2018 at 10:27:10PM +0200, Leon Romanovsky wrote:
> > > > On Thu, Jan 25, 2018 at 01:10:23PM -0700, Jason Gunthorpe wrote:
> > > > > On Thu, Jan 25, 2018 at 09:44:36PM +0200, Leon Romanovsky wrote:
> > > > > > > > +	/*
> > > > > > > > +	 * @kref: Protect destroy of the resource
> > > > > > > > +	 */
> > > > > > > > +	struct kref		kref;
> > > > > > >
> > > > > > > Sticking a kref in a random structure that is itself not malloc'd is a
> > > > > > > big red flag:
> > > > > >
> > > > > > It is not "rand", but embedded part of ib_qp,ib_pd and ib_cq. It is
> > > > > > malloced as part of their creation.
> > > > >
> > > > > I mean the kref isn't in any way linked the lifetime of the malloc
> > > > > that contains it. So it isn't really a kref, it is just a refcount.
> > > >
> > > > For now, yes, in the future no. IMHO it is the direction to manage RDMA
> > > > objects.
> > >
> > > Maybe but until we do that this doesn't have struct kref semantics at
> > > all and shouldn't be called kref..
> >
> > You are actually open-coded kref semantics with addition of completions.
>
> I generally don't like seeing 'complete()' as a kref release
> function. There have been very subtle problems with code trying that
> in the past..
>
> But yes, you can force things to work that way if very, very careful.
>
> I've often thought I'd like a helper type that had the kref+complete
> semantics, because that does seem to crop up in a few places and it
> would really help clarity.

Actually, it is used a lot, one of closest examples to us - qib sdma
kref counting.

Thanks

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH rdma-next v6 3/8] RDMA/restrack: Add general infrastructure to track RDMA resources
       [not found]                                     ` <20180126062423.GV1393-U/DQcQFIOTAAJjI8aNfphQ@public.gmane.org>
@ 2018-01-26 16:05                                       ` Jason Gunthorpe
  0 siblings, 0 replies; 20+ messages in thread
From: Jason Gunthorpe @ 2018-01-26 16:05 UTC (permalink / raw)
  To: Leon Romanovsky; +Cc: Doug Ledford, RDMA mailing list, Mark Bloch, Steve Wise

On Fri, Jan 26, 2018 at 08:24:23AM +0200, Leon Romanovsky wrote:

> > I've often thought I'd like a helper type that had the kref+complete
> > semantics, because that does seem to crop up in a few places and it
> > would really help clarity.
> 
> Actually, it is used a lot, one of closest examples to us - qib sdma
> kref counting.

Looks like it has the same race bug between the put and
wait_for_completion compared to the async sdma_sw_clean_up_task. Needs
the 'get if not zero' approach too.

Kinda proves my point right? People using krefs for this get it wrong
because it DOESN'T follow well understood kref semantics, because it
ISN'T a kref - it a usecount.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2018-01-26 16:05 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-01-25 15:12 [PATCH rdma-next v6 0/8] RDMA resource tracking Leon Romanovsky
     [not found] ` <20180125151227.28202-1-leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
2018-01-25 15:12   ` [PATCH rdma-next v6 1/8] RDMA/core: Print caller name instead of function name Leon Romanovsky
2018-01-25 15:12   ` [PATCH rdma-next v6 2/8] RDMA/core: Save kernel caller name in PD and CQ objects Leon Romanovsky
2018-01-25 15:12   ` [PATCH rdma-next v6 3/8] RDMA/restrack: Add general infrastructure to track RDMA resources Leon Romanovsky
     [not found]     ` <20180125151227.28202-4-leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
2018-01-25 16:39       ` Doug Ledford
     [not found]         ` <1516898399.27592.139.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2018-01-25 19:35           ` Leon Romanovsky
2018-01-25 17:45       ` Jason Gunthorpe
     [not found]         ` <20180125174529.GB3739-uk2M96/98Pc@public.gmane.org>
2018-01-25 19:44           ` Leon Romanovsky
     [not found]             ` <20180125194436.GR1393-U/DQcQFIOTAAJjI8aNfphQ@public.gmane.org>
2018-01-25 20:10               ` Jason Gunthorpe
     [not found]                 ` <20180125201023.GA9162-uk2M96/98Pc@public.gmane.org>
2018-01-25 20:27                   ` Leon Romanovsky
     [not found]                     ` <20180125202710.GS1393-U/DQcQFIOTAAJjI8aNfphQ@public.gmane.org>
2018-01-25 21:00                       ` Jason Gunthorpe
     [not found]                         ` <20180125210026.GA10719-uk2M96/98Pc@public.gmane.org>
2018-01-25 21:18                           ` Leon Romanovsky
     [not found]                             ` <20180125211831.GT1393-U/DQcQFIOTAAJjI8aNfphQ@public.gmane.org>
2018-01-25 21:57                               ` Jason Gunthorpe
     [not found]                                 ` <20180125215716.GA11537-uk2M96/98Pc@public.gmane.org>
2018-01-26  6:24                                   ` Leon Romanovsky
     [not found]                                     ` <20180126062423.GV1393-U/DQcQFIOTAAJjI8aNfphQ@public.gmane.org>
2018-01-26 16:05                                       ` Jason Gunthorpe
2018-01-25 15:12   ` [PATCH rdma-next v6 4/8] RDMA/core: Add resource tracking for create and destroy QPs Leon Romanovsky
2018-01-25 15:12   ` [PATCH rdma-next v6 5/8] RDMA/core: Add resource tracking for create and destroy CQs Leon Romanovsky
2018-01-25 15:12   ` [PATCH rdma-next v6 6/8] RDMA/core: Add resource tracking for create and destroy PDs Leon Romanovsky
2018-01-25 15:12   ` [PATCH rdma-next v6 7/8] RDMA/nldev: Provide global resource utilization Leon Romanovsky
2018-01-25 15:12   ` [PATCH rdma-next v6 8/8] RDMA/nldev: Provide detailed QP information Leon Romanovsky

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.