Linux-RDMA Archive on lore.kernel.org
 help / color / Atom feed
* [PATCH rdma-next v2 0/4] ODP information and statistics
@ 2019-10-06 15:51 Leon Romanovsky
  2019-10-06 15:51 ` [PATCH rdma-next v2 1/4] IB/mlx5: Introduce ODP diagnostic counters Leon Romanovsky
                   ` (3 more replies)
  0 siblings, 4 replies; 11+ messages in thread
From: Leon Romanovsky @ 2019-10-06 15:51 UTC (permalink / raw)
  To: Doug Ledford, Jason Gunthorpe
  Cc: Leon Romanovsky, RDMA mailing list, Erez Alfasi

From: Leon Romanovsky <leonro@mellanox.com>

Changelog:
 v2:
 * Since umems can disappear during rereg flow and the fact that we
   are not locking its during our counter usage (uverbs prevents this
   by holding the uobject write lock), expose possible race bugs. Move
   the counters into mlx5_ib_mr to avoid racing bugs (related patches
   - #1, #3 & #4).
 * Fix page invalidation counting (Patch #1).
 * Make the code more elegant by defining fill_function type and use it
   within res_common_{dumpit, doit} (Patch #2).
 * Put an ODP implicit indicator within mlx5 reg MR operation,
   indicating when a given MR is ODP implicit registered and use
   its indication when dumping ODP type.
 * Since the counters has been moved to mlx5_ib_mr, the ODP stats are now
   filled with internal to mlx5 driver function. Remove the fill_odp_stats
   device operation from the reason mentioned above.
 v1: https://lore.kernel.org/linux-rdma/20190830081612.2611-1-leon@kernel.org
 * Dropped umem patch, because it doesn't follow our IB model, where
   UMEM is driver object and ib_core object (Jason).
 * Removed the ODP type indicator from ib_umem_odp not needed after
   commit fd7dbf035edc ("RDMA/odp: Make it clearer when a umem is an implicit ODP umem")
 * Since umems are not part of core MR (from the reason mentioned
   above) there is no way to access the odp type as was previously done via nldev
   (old patch #3). Instead, patch #4 is adding mlx5 implementation for fill_res_entry
   and dumping ODP type as part of the driver table entry, as its driver details.
 * Counter types are now atomic64_t instead of u64.
 v0: https://lore.kernel.org/linux-rdma/20190807103403.8102-1-leon@kernel.org

-----------------------------------------------------------------------------
Hi,

This series from Erez refactors exposes ODP type information (explicit,
implicit) and statistics through netlink interface.

Thanks



Erez Alfasi (4):
  IB/mlx5: Introduce ODP diagnostic counters
  RDMA/nldev: Allow different fill function per resource
  RDMA/mlx5: Return ODP type per MR
  RDMA/nldev: Provide MR statistics

 drivers/infiniband/core/device.c      |  1 +
 drivers/infiniband/core/nldev.c       | 98 ++++++++++++++++++++-------
 drivers/infiniband/hw/mlx5/Makefile   |  2 +-
 drivers/infiniband/hw/mlx5/main.c     |  3 +
 drivers/infiniband/hw/mlx5/mlx5_ib.h  |  9 +++
 drivers/infiniband/hw/mlx5/odp.c      | 20 ++++++
 drivers/infiniband/hw/mlx5/restrack.c | 92 +++++++++++++++++++++++++
 include/rdma/ib_verbs.h               | 13 ++++
 include/rdma/restrack.h               |  6 ++
 9 files changed, 218 insertions(+), 26 deletions(-)
 create mode 100644 drivers/infiniband/hw/mlx5/restrack.c

--
2.20.1


^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH rdma-next v2 1/4] IB/mlx5: Introduce ODP diagnostic counters
  2019-10-06 15:51 [PATCH rdma-next v2 0/4] ODP information and statistics Leon Romanovsky
@ 2019-10-06 15:51 ` Leon Romanovsky
  2019-10-08 19:57   ` Jason Gunthorpe
  2019-10-06 15:51 ` [PATCH rdma-next v2 2/4] RDMA/nldev: Allow different fill function per resource Leon Romanovsky
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 11+ messages in thread
From: Leon Romanovsky @ 2019-10-06 15:51 UTC (permalink / raw)
  To: Doug Ledford, Jason Gunthorpe
  Cc: Leon Romanovsky, RDMA mailing list, Erez Alfasi

From: Erez Alfasi <ereza@mellanox.com>

Introduce ODP diagnostic counters and count the following
per MR within IB/mlx5 driver:
 1) Page faults:
	Total number of faulted pages.
 2) Page invalidations:
	Total number of pages invalidated by the OS during all
	invalidation events. The translations can be no longer
	valid due to either non-present pages or mapping changes.
 3) Prefetched pages:
	When prefetching a page, page fault is generated
	in order to bring the page to the main memory.
	The prefetched pages counter will be updated
	during a page fault flow only if it was derived
	from prefetching operation.

Signed-off-by: Erez Alfasi <ereza@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
---
 drivers/infiniband/hw/mlx5/mlx5_ib.h |  4 ++++
 drivers/infiniband/hw/mlx5/odp.c     | 18 ++++++++++++++++++
 include/rdma/ib_verbs.h              |  6 ++++++
 3 files changed, 28 insertions(+)

diff --git a/drivers/infiniband/hw/mlx5/mlx5_ib.h b/drivers/infiniband/hw/mlx5/mlx5_ib.h
index bf30d53d94dc..5aae05ebf64b 100644
--- a/drivers/infiniband/hw/mlx5/mlx5_ib.h
+++ b/drivers/infiniband/hw/mlx5/mlx5_ib.h
@@ -585,6 +585,9 @@ struct mlx5_ib_dm {
 					  IB_ACCESS_REMOTE_READ   |\
 					  IB_ZERO_BASED)
 
+#define mlx5_update_odp_stats(mr, counter_name, value)		\
+	atomic64_add(value, &((mr)->odp_stats.counter_name))
+
 struct mlx5_ib_mr {
 	struct ib_mr		ibmr;
 	void			*descs;
@@ -622,6 +625,7 @@ struct mlx5_ib_mr {
 	wait_queue_head_t       q_leaf_free;
 	struct mlx5_async_work  cb_work;
 	atomic_t		num_pending_prefetch;
+	struct ib_odp_counters	odp_stats;
 };
 
 static inline bool is_odp_mr(struct mlx5_ib_mr *mr)
diff --git a/drivers/infiniband/hw/mlx5/odp.c b/drivers/infiniband/hw/mlx5/odp.c
index 95cf0249b015..966783bfb557 100644
--- a/drivers/infiniband/hw/mlx5/odp.c
+++ b/drivers/infiniband/hw/mlx5/odp.c
@@ -261,6 +261,10 @@ void mlx5_ib_invalidate_range(struct ib_umem_odp *umem_odp, unsigned long start,
 				blk_start_idx = idx;
 				in_block = 1;
 			}
+
+			/* Count page invalidations */
+			mlx5_update_odp_stats(mr, invalidations,
+					      (idx - blk_start_idx + 1));
 		} else {
 			u64 umr_offset = idx & umr_block_mask;
 
@@ -287,6 +291,7 @@ void mlx5_ib_invalidate_range(struct ib_umem_odp *umem_odp, unsigned long start,
 
 	ib_umem_odp_unmap_dma_pages(umem_odp, start, end);
 
+
 	if (unlikely(!umem_odp->npages && mr->parent &&
 		     !umem_odp->dying)) {
 		WRITE_ONCE(umem_odp->dying, 1);
@@ -801,6 +806,19 @@ static int pagefault_single_data_segment(struct mlx5_ib_dev *dev,
 		if (ret < 0)
 			goto srcu_unlock;
 
+		/*
+		 * When prefetching a page, page fault is generated
+		 * in order to bring the page to the main memory.
+		 * In the current flow, page faults are being counted.
+		 * Prefetched pages counter will be updated as well
+		 * only if the current page fault flow was derived
+		 * from prefetching flow.
+		 */
+		mlx5_update_odp_stats(mr, faults, ret);
+
+		if (prefetch)
+			mlx5_update_odp_stats(mr, prefetched, ret);
+
 		npages += ret;
 		ret = 0;
 		break;
diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index 6a47ba85c54c..3990054656b9 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -2218,6 +2218,12 @@ struct rdma_netdev_alloc_params {
 				      struct net_device *netdev, void *param);
 };
 
+struct ib_odp_counters {
+	atomic64_t faults;
+	atomic64_t invalidations;
+	atomic64_t prefetched;
+};
+
 struct ib_counters {
 	struct ib_device	*device;
 	struct ib_uobject	*uobject;
-- 
2.20.1


^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH rdma-next v2 2/4] RDMA/nldev: Allow different fill function per resource
  2019-10-06 15:51 [PATCH rdma-next v2 0/4] ODP information and statistics Leon Romanovsky
  2019-10-06 15:51 ` [PATCH rdma-next v2 1/4] IB/mlx5: Introduce ODP diagnostic counters Leon Romanovsky
@ 2019-10-06 15:51 ` Leon Romanovsky
  2019-10-08 19:58   ` Jason Gunthorpe
  2019-10-06 15:51 ` [PATCH rdma-next v2 3/4] RDMA/mlx5: Return ODP type per MR Leon Romanovsky
  2019-10-06 15:51 ` [PATCH rdma-next v2 4/4] RDMA/nldev: Provide MR statistics Leon Romanovsky
  3 siblings, 1 reply; 11+ messages in thread
From: Leon Romanovsky @ 2019-10-06 15:51 UTC (permalink / raw)
  To: Doug Ledford, Jason Gunthorpe
  Cc: Leon Romanovsky, RDMA mailing list, Erez Alfasi

From: Erez Alfasi <ereza@mellanox.com>

So far res_get_common_{dumpit, doit} was using the default
resource fill function which was defined as part of the
nldev_fill_res_entry fill_entries.

Add a fill function pointer as an argument allows us to use
different fill function in case we want to dump different
values then 'rdma resource' flow do, but still use the same
existing general resources dumping flow.

Signed-off-by: Erez Alfasi <ereza@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
---
 drivers/infiniband/core/nldev.c | 44 +++++++++++++++++----------------
 1 file changed, 23 insertions(+), 21 deletions(-)

diff --git a/drivers/infiniband/core/nldev.c b/drivers/infiniband/core/nldev.c
index c6fe0c52f6dc..963956eeda3f 100644
--- a/drivers/infiniband/core/nldev.c
+++ b/drivers/infiniband/core/nldev.c
@@ -42,6 +42,9 @@
 #include "cma_priv.h"
 #include "restrack.h"
 
+typedef int (*res_fill_func_t)(struct sk_buff*, bool,
+			       struct rdma_restrack_entry*, uint32_t);
+
 /*
  * Sort array elements by the netlink attribute name
  */
@@ -1129,8 +1132,6 @@ static int nldev_res_get_dumpit(struct sk_buff *skb,
 }
 
 struct nldev_fill_res_entry {
-	int (*fill_res_func)(struct sk_buff *msg, bool has_cap_net_admin,
-			     struct rdma_restrack_entry *res, u32 port);
 	enum rdma_nldev_attr nldev_attr;
 	enum rdma_nldev_command nldev_cmd;
 	u8 flags;
@@ -1144,21 +1145,18 @@ enum nldev_res_flags {
 
 static const struct nldev_fill_res_entry fill_entries[RDMA_RESTRACK_MAX] = {
 	[RDMA_RESTRACK_QP] = {
-		.fill_res_func = fill_res_qp_entry,
 		.nldev_cmd = RDMA_NLDEV_CMD_RES_QP_GET,
 		.nldev_attr = RDMA_NLDEV_ATTR_RES_QP,
 		.entry = RDMA_NLDEV_ATTR_RES_QP_ENTRY,
 		.id = RDMA_NLDEV_ATTR_RES_LQPN,
 	},
 	[RDMA_RESTRACK_CM_ID] = {
-		.fill_res_func = fill_res_cm_id_entry,
 		.nldev_cmd = RDMA_NLDEV_CMD_RES_CM_ID_GET,
 		.nldev_attr = RDMA_NLDEV_ATTR_RES_CM_ID,
 		.entry = RDMA_NLDEV_ATTR_RES_CM_ID_ENTRY,
 		.id = RDMA_NLDEV_ATTR_RES_CM_IDN,
 	},
 	[RDMA_RESTRACK_CQ] = {
-		.fill_res_func = fill_res_cq_entry,
 		.nldev_cmd = RDMA_NLDEV_CMD_RES_CQ_GET,
 		.nldev_attr = RDMA_NLDEV_ATTR_RES_CQ,
 		.flags = NLDEV_PER_DEV,
@@ -1166,7 +1164,6 @@ static const struct nldev_fill_res_entry fill_entries[RDMA_RESTRACK_MAX] = {
 		.id = RDMA_NLDEV_ATTR_RES_CQN,
 	},
 	[RDMA_RESTRACK_MR] = {
-		.fill_res_func = fill_res_mr_entry,
 		.nldev_cmd = RDMA_NLDEV_CMD_RES_MR_GET,
 		.nldev_attr = RDMA_NLDEV_ATTR_RES_MR,
 		.flags = NLDEV_PER_DEV,
@@ -1174,7 +1171,6 @@ static const struct nldev_fill_res_entry fill_entries[RDMA_RESTRACK_MAX] = {
 		.id = RDMA_NLDEV_ATTR_RES_MRN,
 	},
 	[RDMA_RESTRACK_PD] = {
-		.fill_res_func = fill_res_pd_entry,
 		.nldev_cmd = RDMA_NLDEV_CMD_RES_PD_GET,
 		.nldev_attr = RDMA_NLDEV_ATTR_RES_PD,
 		.flags = NLDEV_PER_DEV,
@@ -1182,7 +1178,6 @@ static const struct nldev_fill_res_entry fill_entries[RDMA_RESTRACK_MAX] = {
 		.id = RDMA_NLDEV_ATTR_RES_PDN,
 	},
 	[RDMA_RESTRACK_COUNTER] = {
-		.fill_res_func = fill_res_counter_entry,
 		.nldev_cmd = RDMA_NLDEV_CMD_STAT_GET,
 		.nldev_attr = RDMA_NLDEV_ATTR_STAT_COUNTER,
 		.entry = RDMA_NLDEV_ATTR_STAT_COUNTER_ENTRY,
@@ -1192,7 +1187,8 @@ static const struct nldev_fill_res_entry fill_entries[RDMA_RESTRACK_MAX] = {
 
 static int res_get_common_doit(struct sk_buff *skb, struct nlmsghdr *nlh,
 			       struct netlink_ext_ack *extack,
-			       enum rdma_restrack_type res_type)
+			       enum rdma_restrack_type res_type,
+			       res_fill_func_t fill_func)
 {
 	const struct nldev_fill_res_entry *fe = &fill_entries[res_type];
 	struct nlattr *tb[RDMA_NLDEV_ATTR_MAX];
@@ -1250,7 +1246,9 @@ static int res_get_common_doit(struct sk_buff *skb, struct nlmsghdr *nlh,
 	}
 
 	has_cap_net_admin = netlink_capable(skb, CAP_NET_ADMIN);
-	ret = fe->fill_res_func(msg, has_cap_net_admin, res, port);
+
+	ret = fill_func(msg, has_cap_net_admin, res, port);
+
 	rdma_restrack_put(res);
 	if (ret)
 		goto err_free;
@@ -1270,7 +1268,8 @@ static int res_get_common_doit(struct sk_buff *skb, struct nlmsghdr *nlh,
 
 static int res_get_common_dumpit(struct sk_buff *skb,
 				 struct netlink_callback *cb,
-				 enum rdma_restrack_type res_type)
+				 enum rdma_restrack_type res_type,
+				 res_fill_func_t fill_func)
 {
 	const struct nldev_fill_res_entry *fe = &fill_entries[res_type];
 	struct nlattr *tb[RDMA_NLDEV_ATTR_MAX];
@@ -1355,7 +1354,8 @@ static int res_get_common_dumpit(struct sk_buff *skb,
 			goto msg_full;
 		}
 
-		ret = fe->fill_res_func(skb, has_cap_net_admin, res, port);
+		ret = fill_func(skb, has_cap_net_admin, res, port);
+
 		rdma_restrack_put(res);
 
 		if (ret) {
@@ -1398,17 +1398,19 @@ next:		idx++;
 	return ret;
 }
 
-#define RES_GET_FUNCS(name, type)                                              \
-	static int nldev_res_get_##name##_dumpit(struct sk_buff *skb,          \
+#define RES_GET_FUNCS(name, type)					       \
+	static int nldev_res_get_##name##_dumpit(struct sk_buff *skb,	       \
 						 struct netlink_callback *cb)  \
-	{                                                                      \
-		return res_get_common_dumpit(skb, cb, type);                   \
-	}                                                                      \
-	static int nldev_res_get_##name##_doit(struct sk_buff *skb,            \
-					       struct nlmsghdr *nlh,           \
+	{								       \
+		return res_get_common_dumpit(skb, cb, type,		       \
+					     fill_res_##name##_entry);	       \
+	}								       \
+	static int nldev_res_get_##name##_doit(struct sk_buff *skb,	       \
+					       struct nlmsghdr *nlh,	       \
 					       struct netlink_ext_ack *extack) \
-	{                                                                      \
-		return res_get_common_doit(skb, nlh, extack, type);            \
+	{								       \
+		return res_get_common_doit(skb, nlh, extack, type,	       \
+					   fill_res_##name##_entry);	       \
 	}
 
 RES_GET_FUNCS(qp, RDMA_RESTRACK_QP);
-- 
2.20.1


^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH rdma-next v2 3/4] RDMA/mlx5: Return ODP type per MR
  2019-10-06 15:51 [PATCH rdma-next v2 0/4] ODP information and statistics Leon Romanovsky
  2019-10-06 15:51 ` [PATCH rdma-next v2 1/4] IB/mlx5: Introduce ODP diagnostic counters Leon Romanovsky
  2019-10-06 15:51 ` [PATCH rdma-next v2 2/4] RDMA/nldev: Allow different fill function per resource Leon Romanovsky
@ 2019-10-06 15:51 ` Leon Romanovsky
  2019-10-06 15:51 ` [PATCH rdma-next v2 4/4] RDMA/nldev: Provide MR statistics Leon Romanovsky
  3 siblings, 0 replies; 11+ messages in thread
From: Leon Romanovsky @ 2019-10-06 15:51 UTC (permalink / raw)
  To: Doug Ledford, Jason Gunthorpe
  Cc: Leon Romanovsky, RDMA mailing list, Erez Alfasi

From: Erez Alfasi <ereza@mellanox.com>

Provide an ODP explicit/implicit type as part
of 'rdma -dd resource show mr' dump.

For example:
~$: rdma -dd resource show mr
dev mlx5_0 mrn 1 rkey 0xa99a lkey 0xa99a mrlen 50000000
pdn 9 pid 7372 comm ibv_rc_pingpong drv_odp explicit

For non-ODP MRs, we won't print "drv_odp ..." at all.

Signed-off-by: Erez Alfasi <ereza@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
---
 drivers/infiniband/core/nldev.c       | 13 ++++++++
 drivers/infiniband/hw/mlx5/Makefile   |  2 +-
 drivers/infiniband/hw/mlx5/main.c     |  1 +
 drivers/infiniband/hw/mlx5/mlx5_ib.h  |  3 ++
 drivers/infiniband/hw/mlx5/odp.c      |  2 ++
 drivers/infiniband/hw/mlx5/restrack.c | 48 +++++++++++++++++++++++++++
 include/rdma/restrack.h               |  3 ++
 7 files changed, 71 insertions(+), 1 deletion(-)
 create mode 100644 drivers/infiniband/hw/mlx5/restrack.c

diff --git a/drivers/infiniband/core/nldev.c b/drivers/infiniband/core/nldev.c
index 963956eeda3f..6114465959e1 100644
--- a/drivers/infiniband/core/nldev.c
+++ b/drivers/infiniband/core/nldev.c
@@ -183,6 +183,19 @@ static int _rdma_nl_put_driver_u64(struct sk_buff *msg, const char *name,
 	return 0;
 }
 
+int rdma_nl_put_driver_string(struct sk_buff *msg, const char *name,
+			      const char *str)
+{
+	if (put_driver_name_print_type(msg, name,
+				       RDMA_NLDEV_PRINT_TYPE_UNSPEC))
+		return -EMSGSIZE;
+	if (nla_put_string(msg, RDMA_NLDEV_ATTR_DRIVER_STRING, str))
+		return -EMSGSIZE;
+
+	return 0;
+}
+EXPORT_SYMBOL(rdma_nl_put_driver_string);
+
 int rdma_nl_put_driver_u32(struct sk_buff *msg, const char *name, u32 value)
 {
 	return _rdma_nl_put_driver_u32(msg, name, RDMA_NLDEV_PRINT_TYPE_UNSPEC,
diff --git a/drivers/infiniband/hw/mlx5/Makefile b/drivers/infiniband/hw/mlx5/Makefile
index 9924be8384d8..d0a043ccbe58 100644
--- a/drivers/infiniband/hw/mlx5/Makefile
+++ b/drivers/infiniband/hw/mlx5/Makefile
@@ -3,7 +3,7 @@ obj-$(CONFIG_MLX5_INFINIBAND)	+= mlx5_ib.o
 
 mlx5_ib-y :=	main.o cq.o doorbell.o qp.o mem.o srq_cmd.o \
 		srq.o mr.o ah.o mad.o gsi.o ib_virt.o cmd.o \
-		cong.o
+		cong.o restrack.o
 mlx5_ib-$(CONFIG_INFINIBAND_ON_DEMAND_PAGING) += odp.o
 mlx5_ib-$(CONFIG_MLX5_ESWITCH) += ib_rep.o
 mlx5_ib-$(CONFIG_INFINIBAND_USER_ACCESS) += devx.o
diff --git a/drivers/infiniband/hw/mlx5/main.c b/drivers/infiniband/hw/mlx5/main.c
index b95c2b05f682..3c3c19129cdd 100644
--- a/drivers/infiniband/hw/mlx5/main.c
+++ b/drivers/infiniband/hw/mlx5/main.c
@@ -6269,6 +6269,7 @@ static const struct ib_device_ops mlx5_ib_dev_ops = {
 	.disassociate_ucontext = mlx5_ib_disassociate_ucontext,
 	.drain_rq = mlx5_ib_drain_rq,
 	.drain_sq = mlx5_ib_drain_sq,
+	.fill_res_entry = mlx5_ib_fill_res_entry,
 	.get_dev_fw_str = get_dev_fw_str,
 	.get_dma_mr = mlx5_ib_get_dma_mr,
 	.get_link_layer = mlx5_ib_port_link_layer,
diff --git a/drivers/infiniband/hw/mlx5/mlx5_ib.h b/drivers/infiniband/hw/mlx5/mlx5_ib.h
index 5aae05ebf64b..a0ca1ef16e4e 100644
--- a/drivers/infiniband/hw/mlx5/mlx5_ib.h
+++ b/drivers/infiniband/hw/mlx5/mlx5_ib.h
@@ -626,6 +626,7 @@ struct mlx5_ib_mr {
 	struct mlx5_async_work  cb_work;
 	atomic_t		num_pending_prefetch;
 	struct ib_odp_counters	odp_stats;
+	bool			is_odp_implicit;
 };
 
 static inline bool is_odp_mr(struct mlx5_ib_mr *mr)
@@ -1339,6 +1340,8 @@ struct mlx5_core_dev *mlx5_ib_get_native_port_mdev(struct mlx5_ib_dev *dev,
 						   u8 *native_port_num);
 void mlx5_ib_put_native_port_mdev(struct mlx5_ib_dev *dev,
 				  u8 port_num);
+int mlx5_ib_fill_res_entry(struct sk_buff *msg,
+			   struct rdma_restrack_entry *res);
 
 #if IS_ENABLED(CONFIG_INFINIBAND_USER_ACCESS)
 int mlx5_ib_devx_create(struct mlx5_ib_dev *dev, bool is_user);
diff --git a/drivers/infiniband/hw/mlx5/odp.c b/drivers/infiniband/hw/mlx5/odp.c
index 966783bfb557..3450affd002a 100644
--- a/drivers/infiniband/hw/mlx5/odp.c
+++ b/drivers/infiniband/hw/mlx5/odp.c
@@ -540,6 +540,8 @@ struct mlx5_ib_mr *mlx5_ib_alloc_implicit_mr(struct mlx5_ib_pd *pd,
 	atomic_set(&imr->num_leaf_free, 0);
 	atomic_set(&imr->num_pending_prefetch, 0);
 
+	imr->is_odp_implicit = true;
+
 	return imr;
 }
 
diff --git a/drivers/infiniband/hw/mlx5/restrack.c b/drivers/infiniband/hw/mlx5/restrack.c
new file mode 100644
index 000000000000..065049f52b83
--- /dev/null
+++ b/drivers/infiniband/hw/mlx5/restrack.c
@@ -0,0 +1,48 @@
+// SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB
+/*
+ * Copyright (c) 2019, Mellanox Technologies inc.  All rights reserved.
+ */
+
+#include <uapi/rdma/rdma_netlink.h>
+#include <rdma/ib_umem_odp.h>
+#include <rdma/restrack.h>
+#include "mlx5_ib.h"
+
+static int fill_res_mr_entry(struct sk_buff *msg,
+			     struct rdma_restrack_entry *res)
+{
+	struct ib_mr *ibmr = container_of(res, struct ib_mr, res);
+	struct mlx5_ib_mr *mr = to_mmr(ibmr);
+	struct nlattr *table_attr;
+
+	if (!(mr->access_flags & IB_ACCESS_ON_DEMAND))
+		return 0;
+
+	table_attr = nla_nest_start(msg, RDMA_NLDEV_ATTR_DRIVER);
+	if (!table_attr)
+		goto err;
+
+	if (mr->is_odp_implicit) {
+		if (rdma_nl_put_driver_string(msg, "odp", "implicit"))
+			goto err;
+	} else {
+		if (rdma_nl_put_driver_string(msg, "odp", "explicit"))
+			goto err;
+	}
+
+	nla_nest_end(msg, table_attr);
+	return 0;
+
+err:
+	nla_nest_cancel(msg, table_attr);
+	return -EMSGSIZE;
+}
+
+int mlx5_ib_fill_res_entry(struct sk_buff *msg,
+			   struct rdma_restrack_entry *res)
+{
+	if (res->type == RDMA_RESTRACK_MR)
+		return fill_res_mr_entry(msg, res);
+
+	return 0;
+}
diff --git a/include/rdma/restrack.h b/include/rdma/restrack.h
index 83df1ec6664e..fe9b3c507a9c 100644
--- a/include/rdma/restrack.h
+++ b/include/rdma/restrack.h
@@ -156,6 +156,9 @@ int rdma_nl_put_driver_u32_hex(struct sk_buff *msg, const char *name,
 int rdma_nl_put_driver_u64(struct sk_buff *msg, const char *name, u64 value);
 int rdma_nl_put_driver_u64_hex(struct sk_buff *msg, const char *name,
 			       u64 value);
+int rdma_nl_put_driver_string(struct sk_buff *msg, const char *name,
+			      const char *str);
+
 struct rdma_restrack_entry *rdma_restrack_get_byid(struct ib_device *dev,
 						   enum rdma_restrack_type type,
 						   u32 id);
-- 
2.20.1


^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH rdma-next v2 4/4] RDMA/nldev: Provide MR statistics
  2019-10-06 15:51 [PATCH rdma-next v2 0/4] ODP information and statistics Leon Romanovsky
                   ` (2 preceding siblings ...)
  2019-10-06 15:51 ` [PATCH rdma-next v2 3/4] RDMA/mlx5: Return ODP type per MR Leon Romanovsky
@ 2019-10-06 15:51 ` Leon Romanovsky
  2019-10-09 14:40   ` Jason Gunthorpe
  3 siblings, 1 reply; 11+ messages in thread
From: Leon Romanovsky @ 2019-10-06 15:51 UTC (permalink / raw)
  To: Doug Ledford, Jason Gunthorpe
  Cc: Leon Romanovsky, RDMA mailing list, Erez Alfasi

From: Erez Alfasi <ereza@mellanox.com>

Add RDMA nldev netlink interface for dumping MR
statistics information.

Output example:
ereza@dev~$: ./ibv_rc_pingpong -o -P -s 500000000
  local address:  LID 0x0001, QPN 0x00008a, PSN 0xf81096, GID ::

ereza@dev~$: rdma stat show mr
dev mlx5_0 mrn 2 page_faults 122071 page_invalidations 0
prefetched_pages 122071

Signed-off-by: Erez Alfasi <ereza@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
---
 drivers/infiniband/core/device.c      |  1 +
 drivers/infiniband/core/nldev.c       | 41 ++++++++++++++++++++++---
 drivers/infiniband/hw/mlx5/main.c     |  2 ++
 drivers/infiniband/hw/mlx5/mlx5_ib.h  |  2 ++
 drivers/infiniband/hw/mlx5/restrack.c | 44 +++++++++++++++++++++++++++
 include/rdma/ib_verbs.h               |  7 +++++
 include/rdma/restrack.h               |  3 ++
 7 files changed, 96 insertions(+), 4 deletions(-)

diff --git a/drivers/infiniband/core/device.c b/drivers/infiniband/core/device.c
index a667636f74bf..2e53aa25f0c7 100644
--- a/drivers/infiniband/core/device.c
+++ b/drivers/infiniband/core/device.c
@@ -2606,6 +2606,7 @@ void ib_set_device_ops(struct ib_device *dev, const struct ib_device_ops *ops)
 	SET_DEVICE_OP(dev_ops, drain_sq);
 	SET_DEVICE_OP(dev_ops, enable_driver);
 	SET_DEVICE_OP(dev_ops, fill_res_entry);
+	SET_DEVICE_OP(dev_ops, fill_stat_entry);
 	SET_DEVICE_OP(dev_ops, get_dev_fw_str);
 	SET_DEVICE_OP(dev_ops, get_dma_mr);
 	SET_DEVICE_OP(dev_ops, get_hw_stats);
diff --git a/drivers/infiniband/core/nldev.c b/drivers/infiniband/core/nldev.c
index 6114465959e1..1d9d89fd9ce9 100644
--- a/drivers/infiniband/core/nldev.c
+++ b/drivers/infiniband/core/nldev.c
@@ -454,6 +454,14 @@ static bool fill_res_entry(struct ib_device *dev, struct sk_buff *msg,
 	return dev->ops.fill_res_entry(msg, res);
 }
 
+static bool fill_stat_entry(struct ib_device *dev, struct sk_buff *msg,
+			    struct rdma_restrack_entry *res)
+{
+	if (!dev->ops.fill_stat_entry)
+		return false;
+	return dev->ops.fill_stat_entry(msg, res);
+}
+
 static int fill_res_qp_entry(struct sk_buff *msg, bool has_cap_net_admin,
 			     struct rdma_restrack_entry *res, uint32_t port)
 {
@@ -751,8 +759,8 @@ static int fill_stat_counter_qps(struct sk_buff *msg,
 	return ret;
 }
 
-static int fill_stat_hwcounter_entry(struct sk_buff *msg,
-				     const char *name, u64 value)
+int fill_stat_hwcounter_entry(struct sk_buff *msg,
+			      const char *name, u64 value)
 {
 	struct nlattr *entry_attr;
 
@@ -774,6 +782,25 @@ static int fill_stat_hwcounter_entry(struct sk_buff *msg,
 	nla_nest_cancel(msg, entry_attr);
 	return -EMSGSIZE;
 }
+EXPORT_SYMBOL(fill_stat_hwcounter_entry);
+
+static int fill_stat_mr_entry(struct sk_buff *msg, bool has_cap_net_admin,
+			      struct rdma_restrack_entry *res, uint32_t port)
+{
+	struct ib_mr *mr = container_of(res, struct ib_mr, res);
+	struct ib_device *dev = mr->pd->device;
+
+	if (nla_put_u32(msg, RDMA_NLDEV_ATTR_RES_MRN, res->id))
+		goto err;
+
+	if (fill_stat_entry(dev, msg, res))
+		goto err;
+
+	return 0;
+
+err:
+	return -EMSGSIZE;
+}
 
 static int fill_stat_counter_hwcounters(struct sk_buff *msg,
 					struct rdma_counter *counter)
@@ -2010,7 +2037,10 @@ static int nldev_stat_get_doit(struct sk_buff *skb, struct nlmsghdr *nlh,
 	case RDMA_NLDEV_ATTR_RES_QP:
 		ret = stat_get_doit_qp(skb, nlh, extack, tb);
 		break;
-
+	case RDMA_NLDEV_ATTR_RES_MR:
+		ret = res_get_common_doit(skb, nlh, extack, RDMA_RESTRACK_MR,
+					  fill_stat_mr_entry);
+		break;
 	default:
 		ret = -EINVAL;
 		break;
@@ -2034,7 +2064,10 @@ static int nldev_stat_get_dumpit(struct sk_buff *skb,
 	case RDMA_NLDEV_ATTR_RES_QP:
 		ret = nldev_res_get_counter_dumpit(skb, cb);
 		break;
-
+	case RDMA_NLDEV_ATTR_RES_MR:
+		ret = res_get_common_dumpit(skb, cb, RDMA_RESTRACK_MR,
+					    fill_stat_mr_entry);
+		break;
 	default:
 		ret = -EINVAL;
 		break;
diff --git a/drivers/infiniband/hw/mlx5/main.c b/drivers/infiniband/hw/mlx5/main.c
index 3c3c19129cdd..fa23c8e7043b 100644
--- a/drivers/infiniband/hw/mlx5/main.c
+++ b/drivers/infiniband/hw/mlx5/main.c
@@ -67,6 +67,7 @@
 #include <rdma/uverbs_std_types.h>
 #include <rdma/mlx5_user_ioctl_verbs.h>
 #include <rdma/mlx5_user_ioctl_cmds.h>
+#include <rdma/ib_umem_odp.h>
 
 #define UVERBS_MODULE_NAME mlx5_ib
 #include <rdma/uverbs_named_ioctl.h>
@@ -6270,6 +6271,7 @@ static const struct ib_device_ops mlx5_ib_dev_ops = {
 	.drain_rq = mlx5_ib_drain_rq,
 	.drain_sq = mlx5_ib_drain_sq,
 	.fill_res_entry = mlx5_ib_fill_res_entry,
+	.fill_stat_entry = mlx5_ib_fill_stat_entry,
 	.get_dev_fw_str = get_dev_fw_str,
 	.get_dma_mr = mlx5_ib_get_dma_mr,
 	.get_link_layer = mlx5_ib_port_link_layer,
diff --git a/drivers/infiniband/hw/mlx5/mlx5_ib.h b/drivers/infiniband/hw/mlx5/mlx5_ib.h
index a0ca1ef16e4e..e9bdb48cf1d3 100644
--- a/drivers/infiniband/hw/mlx5/mlx5_ib.h
+++ b/drivers/infiniband/hw/mlx5/mlx5_ib.h
@@ -1342,6 +1342,8 @@ void mlx5_ib_put_native_port_mdev(struct mlx5_ib_dev *dev,
 				  u8 port_num);
 int mlx5_ib_fill_res_entry(struct sk_buff *msg,
 			   struct rdma_restrack_entry *res);
+int mlx5_ib_fill_stat_entry(struct sk_buff *msg,
+			    struct rdma_restrack_entry *res);
 
 #if IS_ENABLED(CONFIG_INFINIBAND_USER_ACCESS)
 int mlx5_ib_devx_create(struct mlx5_ib_dev *dev, bool is_user);
diff --git a/drivers/infiniband/hw/mlx5/restrack.c b/drivers/infiniband/hw/mlx5/restrack.c
index 065049f52b83..bfa33b033a86 100644
--- a/drivers/infiniband/hw/mlx5/restrack.c
+++ b/drivers/infiniband/hw/mlx5/restrack.c
@@ -8,6 +8,41 @@
 #include <rdma/restrack.h>
 #include "mlx5_ib.h"
 
+static int fill_stat_mr_entry(struct sk_buff *msg,
+			      struct rdma_restrack_entry *res)
+{
+	struct ib_mr *ibmr = container_of(res, struct ib_mr, res);
+	struct mlx5_ib_mr *mr = to_mmr(ibmr);
+	struct nlattr *table_attr;
+
+	if (!(mr->access_flags & IB_ACCESS_ON_DEMAND))
+		return 0;
+
+	table_attr = nla_nest_start(msg,
+				    RDMA_NLDEV_ATTR_STAT_HWCOUNTERS);
+
+	if (!table_attr)
+		goto err;
+
+	if (fill_stat_hwcounter_entry(msg, "page_faults",
+				      atomic64_read(&mr->odp_stats.faults)))
+		goto err_table;
+	if (fill_stat_hwcounter_entry(msg, "page_invalidations",
+				      atomic64_read(&mr->odp_stats.invalidations)))
+		goto err_table;
+	if (fill_stat_hwcounter_entry(msg, "prefetched_pages",
+				      atomic64_read(&mr->odp_stats.prefetched)))
+		goto err_table;
+
+	nla_nest_end(msg, table_attr);
+	return 0;
+
+err_table:
+	nla_nest_cancel(msg, table_attr);
+err:
+	return -EMSGSIZE;
+}
+
 static int fill_res_mr_entry(struct sk_buff *msg,
 			     struct rdma_restrack_entry *res)
 {
@@ -46,3 +81,12 @@ int mlx5_ib_fill_res_entry(struct sk_buff *msg,
 
 	return 0;
 }
+
+int mlx5_ib_fill_stat_entry(struct sk_buff *msg,
+			    struct rdma_restrack_entry *res)
+{
+	if (res->type == RDMA_RESTRACK_MR)
+		return fill_stat_mr_entry(msg, res);
+
+	return 0;
+}
diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index 3990054656b9..4f671378dbfc 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -2569,6 +2569,13 @@ struct ib_device_ops {
 	 */
 	int (*counter_update_stats)(struct rdma_counter *counter);
 
+	/**
+	 * Allows rdma drivers to add their own restrack attributes
+	 * dumped via 'rdma stat' iproute2 command.
+	 */
+	int (*fill_stat_entry)(struct sk_buff *msg,
+			       struct rdma_restrack_entry *entry);
+
 	DECLARE_RDMA_OBJ_SIZE(ib_ah);
 	DECLARE_RDMA_OBJ_SIZE(ib_cq);
 	DECLARE_RDMA_OBJ_SIZE(ib_pd);
diff --git a/include/rdma/restrack.h b/include/rdma/restrack.h
index fe9b3c507a9c..90fdbe8a24a6 100644
--- a/include/rdma/restrack.h
+++ b/include/rdma/restrack.h
@@ -162,4 +162,7 @@ int rdma_nl_put_driver_string(struct sk_buff *msg, const char *name,
 struct rdma_restrack_entry *rdma_restrack_get_byid(struct ib_device *dev,
 						   enum rdma_restrack_type type,
 						   u32 id);
+int fill_stat_hwcounter_entry(struct sk_buff *msg,
+			      const char *name, u64 value);
+
 #endif /* _RDMA_RESTRACK_H_ */
-- 
2.20.1


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH rdma-next v2 1/4] IB/mlx5: Introduce ODP diagnostic counters
  2019-10-06 15:51 ` [PATCH rdma-next v2 1/4] IB/mlx5: Introduce ODP diagnostic counters Leon Romanovsky
@ 2019-10-08 19:57   ` Jason Gunthorpe
  2019-10-10  9:02     ` Leon Romanovsky
  0 siblings, 1 reply; 11+ messages in thread
From: Jason Gunthorpe @ 2019-10-08 19:57 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Doug Ledford, Leon Romanovsky, RDMA mailing list, Erez Alfasi

On Sun, Oct 06, 2019 at 06:51:36PM +0300, Leon Romanovsky wrote:
> From: Erez Alfasi <ereza@mellanox.com>
> 
> Introduce ODP diagnostic counters and count the following
> per MR within IB/mlx5 driver:
>  1) Page faults:
> 	Total number of faulted pages.
>  2) Page invalidations:
> 	Total number of pages invalidated by the OS during all
> 	invalidation events. The translations can be no longer
> 	valid due to either non-present pages or mapping changes.
>  3) Prefetched pages:
> 	When prefetching a page, page fault is generated
> 	in order to bring the page to the main memory.
> 	The prefetched pages counter will be updated
> 	during a page fault flow only if it was derived
> 	from prefetching operation.
> 
> Signed-off-by: Erez Alfasi <ereza@mellanox.com>
> Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
>  drivers/infiniband/hw/mlx5/mlx5_ib.h |  4 ++++
>  drivers/infiniband/hw/mlx5/odp.c     | 18 ++++++++++++++++++
>  include/rdma/ib_verbs.h              |  6 ++++++
>  3 files changed, 28 insertions(+)
> 
> diff --git a/drivers/infiniband/hw/mlx5/mlx5_ib.h b/drivers/infiniband/hw/mlx5/mlx5_ib.h
> index bf30d53d94dc..5aae05ebf64b 100644
> +++ b/drivers/infiniband/hw/mlx5/mlx5_ib.h
> @@ -585,6 +585,9 @@ struct mlx5_ib_dm {
>  					  IB_ACCESS_REMOTE_READ   |\
>  					  IB_ZERO_BASED)
>  
> +#define mlx5_update_odp_stats(mr, counter_name, value)		\
> +	atomic64_add(value, &((mr)->odp_stats.counter_name))
> +
>  struct mlx5_ib_mr {
>  	struct ib_mr		ibmr;
>  	void			*descs;
> @@ -622,6 +625,7 @@ struct mlx5_ib_mr {
>  	wait_queue_head_t       q_leaf_free;
>  	struct mlx5_async_work  cb_work;
>  	atomic_t		num_pending_prefetch;
> +	struct ib_odp_counters	odp_stats;
>  };
>  
>  static inline bool is_odp_mr(struct mlx5_ib_mr *mr)
> diff --git a/drivers/infiniband/hw/mlx5/odp.c b/drivers/infiniband/hw/mlx5/odp.c
> index 95cf0249b015..966783bfb557 100644
> +++ b/drivers/infiniband/hw/mlx5/odp.c
> @@ -261,6 +261,10 @@ void mlx5_ib_invalidate_range(struct ib_umem_odp *umem_odp, unsigned long start,
>  				blk_start_idx = idx;
>  				in_block = 1;
>  			}
> +
> +			/* Count page invalidations */
> +			mlx5_update_odp_stats(mr, invalidations,
> +					      (idx - blk_start_idx + 1));

I feel like these should be batched and the atomic done once at the
end of the routine..

>  		} else {
>  			u64 umr_offset = idx & umr_block_mask;
>  
> @@ -287,6 +291,7 @@ void mlx5_ib_invalidate_range(struct ib_umem_odp *umem_odp, unsigned long start,
>  
>  	ib_umem_odp_unmap_dma_pages(umem_odp, start, end);
>  
> +
>  	if (unlikely(!umem_odp->npages && mr->parent &&
>  		     !umem_odp->dying)) {
>  		WRITE_ONCE(umem_odp->dying, 1);
> @@ -801,6 +806,19 @@ static int pagefault_single_data_segment(struct mlx5_ib_dev *dev,
>  		if (ret < 0)
>  			goto srcu_unlock;
>  
> +		/*
> +		 * When prefetching a page, page fault is generated
> +		 * in order to bring the page to the main memory.
> +		 * In the current flow, page faults are being counted.
> +		 * Prefetched pages counter will be updated as well
> +		 * only if the current page fault flow was derived
> +		 * from prefetching flow.
> +		 */
> +		mlx5_update_odp_stats(mr, faults, ret);
> +
> +		if (prefetch)
> +			mlx5_update_odp_stats(mr, prefetched, ret);

Hm, I'm about to post a series that eliminates 'prefetch' here..

This is also not quite right for prefetch as we are doing a form of
prefetching in the mlx5_ib_mr_rdma_pfault_handler() too, although it
is less clear how to count those. Maybe this should be split to SQ/RQ
faults?

Jason

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH rdma-next v2 2/4] RDMA/nldev: Allow different fill function per resource
  2019-10-06 15:51 ` [PATCH rdma-next v2 2/4] RDMA/nldev: Allow different fill function per resource Leon Romanovsky
@ 2019-10-08 19:58   ` Jason Gunthorpe
  0 siblings, 0 replies; 11+ messages in thread
From: Jason Gunthorpe @ 2019-10-08 19:58 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Doug Ledford, Leon Romanovsky, RDMA mailing list, Erez Alfasi

On Sun, Oct 06, 2019 at 06:51:37PM +0300, Leon Romanovsky wrote:
> From: Erez Alfasi <ereza@mellanox.com>
> 
> So far res_get_common_{dumpit, doit} was using the default
> resource fill function which was defined as part of the
> nldev_fill_res_entry fill_entries.
> 
> Add a fill function pointer as an argument allows us to use
> different fill function in case we want to dump different
> values then 'rdma resource' flow do, but still use the same
> existing general resources dumping flow.
> 
> Signed-off-by: Erez Alfasi <ereza@mellanox.com>
> Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
> ---
>  drivers/infiniband/core/nldev.c | 44 +++++++++++++++++----------------
>  1 file changed, 23 insertions(+), 21 deletions(-)

Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH rdma-next v2 4/4] RDMA/nldev: Provide MR statistics
  2019-10-06 15:51 ` [PATCH rdma-next v2 4/4] RDMA/nldev: Provide MR statistics Leon Romanovsky
@ 2019-10-09 14:40   ` Jason Gunthorpe
  2019-10-10  9:02     ` Leon Romanovsky
  0 siblings, 1 reply; 11+ messages in thread
From: Jason Gunthorpe @ 2019-10-09 14:40 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Doug Ledford, Leon Romanovsky, RDMA mailing list, Erez Alfasi

On Sun, Oct 06, 2019 at 06:51:39PM +0300, Leon Romanovsky wrote:
> From: Erez Alfasi <ereza@mellanox.com>
> 
> Add RDMA nldev netlink interface for dumping MR
> statistics information.
> 
> Output example:
> ereza@dev~$: ./ibv_rc_pingpong -o -P -s 500000000
>   local address:  LID 0x0001, QPN 0x00008a, PSN 0xf81096, GID ::
> 
> ereza@dev~$: rdma stat show mr
> dev mlx5_0 mrn 2 page_faults 122071 page_invalidations 0
> prefetched_pages 122071
> 
> Signed-off-by: Erez Alfasi <ereza@mellanox.com>
> Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
>  drivers/infiniband/core/device.c      |  1 +
>  drivers/infiniband/core/nldev.c       | 41 ++++++++++++++++++++++---
>  drivers/infiniband/hw/mlx5/main.c     |  2 ++
>  drivers/infiniband/hw/mlx5/mlx5_ib.h  |  2 ++
>  drivers/infiniband/hw/mlx5/restrack.c | 44 +++++++++++++++++++++++++++
>  include/rdma/ib_verbs.h               |  7 +++++
>  include/rdma/restrack.h               |  3 ++
>  7 files changed, 96 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/infiniband/core/device.c b/drivers/infiniband/core/device.c
> index a667636f74bf..2e53aa25f0c7 100644
> +++ b/drivers/infiniband/core/device.c
> @@ -2606,6 +2606,7 @@ void ib_set_device_ops(struct ib_device *dev, const struct ib_device_ops *ops)
>  	SET_DEVICE_OP(dev_ops, drain_sq);
>  	SET_DEVICE_OP(dev_ops, enable_driver);
>  	SET_DEVICE_OP(dev_ops, fill_res_entry);
> +	SET_DEVICE_OP(dev_ops, fill_stat_entry);
>  	SET_DEVICE_OP(dev_ops, get_dev_fw_str);
>  	SET_DEVICE_OP(dev_ops, get_dma_mr);
>  	SET_DEVICE_OP(dev_ops, get_hw_stats);
> diff --git a/drivers/infiniband/core/nldev.c b/drivers/infiniband/core/nldev.c
> index 6114465959e1..1d9d89fd9ce9 100644
> +++ b/drivers/infiniband/core/nldev.c
> @@ -454,6 +454,14 @@ static bool fill_res_entry(struct ib_device *dev, struct sk_buff *msg,
>  	return dev->ops.fill_res_entry(msg, res);
>  }
>  
> +static bool fill_stat_entry(struct ib_device *dev, struct sk_buff *msg,
> +			    struct rdma_restrack_entry *res)
> +{
> +	if (!dev->ops.fill_stat_entry)
> +		return false;
> +	return dev->ops.fill_stat_entry(msg, res);
> +}

Why do we need this function called in only one place?

Jason

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH rdma-next v2 1/4] IB/mlx5: Introduce ODP diagnostic counters
  2019-10-08 19:57   ` Jason Gunthorpe
@ 2019-10-10  9:02     ` Leon Romanovsky
  2019-10-10 19:49       ` Jason Gunthorpe
  0 siblings, 1 reply; 11+ messages in thread
From: Leon Romanovsky @ 2019-10-10  9:02 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: Doug Ledford, RDMA mailing list, Erez Alfasi

On Tue, Oct 08, 2019 at 07:57:05PM +0000, Jason Gunthorpe wrote:
> On Sun, Oct 06, 2019 at 06:51:36PM +0300, Leon Romanovsky wrote:
> > From: Erez Alfasi <ereza@mellanox.com>
> >
> > Introduce ODP diagnostic counters and count the following
> > per MR within IB/mlx5 driver:
> >  1) Page faults:
> > 	Total number of faulted pages.
> >  2) Page invalidations:
> > 	Total number of pages invalidated by the OS during all
> > 	invalidation events. The translations can be no longer
> > 	valid due to either non-present pages or mapping changes.
> >  3) Prefetched pages:
> > 	When prefetching a page, page fault is generated
> > 	in order to bring the page to the main memory.
> > 	The prefetched pages counter will be updated
> > 	during a page fault flow only if it was derived
> > 	from prefetching operation.
> >
> > Signed-off-by: Erez Alfasi <ereza@mellanox.com>
> > Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
> >  drivers/infiniband/hw/mlx5/mlx5_ib.h |  4 ++++
> >  drivers/infiniband/hw/mlx5/odp.c     | 18 ++++++++++++++++++
> >  include/rdma/ib_verbs.h              |  6 ++++++
> >  3 files changed, 28 insertions(+)
> >
> > diff --git a/drivers/infiniband/hw/mlx5/mlx5_ib.h b/drivers/infiniband/hw/mlx5/mlx5_ib.h
> > index bf30d53d94dc..5aae05ebf64b 100644
> > +++ b/drivers/infiniband/hw/mlx5/mlx5_ib.h
> > @@ -585,6 +585,9 @@ struct mlx5_ib_dm {
> >  					  IB_ACCESS_REMOTE_READ   |\
> >  					  IB_ZERO_BASED)
> >
> > +#define mlx5_update_odp_stats(mr, counter_name, value)		\
> > +	atomic64_add(value, &((mr)->odp_stats.counter_name))
> > +
> >  struct mlx5_ib_mr {
> >  	struct ib_mr		ibmr;
> >  	void			*descs;
> > @@ -622,6 +625,7 @@ struct mlx5_ib_mr {
> >  	wait_queue_head_t       q_leaf_free;
> >  	struct mlx5_async_work  cb_work;
> >  	atomic_t		num_pending_prefetch;
> > +	struct ib_odp_counters	odp_stats;
> >  };
> >
> >  static inline bool is_odp_mr(struct mlx5_ib_mr *mr)
> > diff --git a/drivers/infiniband/hw/mlx5/odp.c b/drivers/infiniband/hw/mlx5/odp.c
> > index 95cf0249b015..966783bfb557 100644
> > +++ b/drivers/infiniband/hw/mlx5/odp.c
> > @@ -261,6 +261,10 @@ void mlx5_ib_invalidate_range(struct ib_umem_odp *umem_odp, unsigned long start,
> >  				blk_start_idx = idx;
> >  				in_block = 1;
> >  			}
> > +
> > +			/* Count page invalidations */
> > +			mlx5_update_odp_stats(mr, invalidations,
> > +					      (idx - blk_start_idx + 1));
>
> I feel like these should be batched and the atomic done once at the
> end of the routine..

We can, but does it worth it?

>
> >  		} else {
> >  			u64 umr_offset = idx & umr_block_mask;
> >
> > @@ -287,6 +291,7 @@ void mlx5_ib_invalidate_range(struct ib_umem_odp *umem_odp, unsigned long start,
> >
> >  	ib_umem_odp_unmap_dma_pages(umem_odp, start, end);
> >
> > +
> >  	if (unlikely(!umem_odp->npages && mr->parent &&
> >  		     !umem_odp->dying)) {
> >  		WRITE_ONCE(umem_odp->dying, 1);
> > @@ -801,6 +806,19 @@ static int pagefault_single_data_segment(struct mlx5_ib_dev *dev,
> >  		if (ret < 0)
> >  			goto srcu_unlock;
> >
> > +		/*
> > +		 * When prefetching a page, page fault is generated
> > +		 * in order to bring the page to the main memory.
> > +		 * In the current flow, page faults are being counted.
> > +		 * Prefetched pages counter will be updated as well
> > +		 * only if the current page fault flow was derived
> > +		 * from prefetching flow.
> > +		 */
> > +		mlx5_update_odp_stats(mr, faults, ret);
> > +
> > +		if (prefetch)
> > +			mlx5_update_odp_stats(mr, prefetched, ret);
>
> Hm, I'm about to post a series that eliminates 'prefetch' here..

Jason,

For various reasons we are delaying this series for months already.
Let's drop "prefetch" counter for now and merge everything without
it.

>
> This is also not quite right for prefetch as we are doing a form of
> prefetching in the mlx5_ib_mr_rdma_pfault_handler() too, although it
> is less clear how to count those. Maybe this should be split to SQ/RQ
> faults?

mlx5_ib_mr_rdma_pfault_handler() calls to pagefault_single_data_segment()
without MLX5_PF_FLAGS_PREFETCH, so I'm unsure that this counter should
count mlx5_ib_mr_rdma_pfault_handler() pagefaults.

However the idea to separate SQ/RQ for everything sounds appealing.

Thanks

>
> Jason

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH rdma-next v2 4/4] RDMA/nldev: Provide MR statistics
  2019-10-09 14:40   ` Jason Gunthorpe
@ 2019-10-10  9:02     ` Leon Romanovsky
  0 siblings, 0 replies; 11+ messages in thread
From: Leon Romanovsky @ 2019-10-10  9:02 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: Doug Ledford, RDMA mailing list, Erez Alfasi

On Wed, Oct 09, 2019 at 02:40:16PM +0000, Jason Gunthorpe wrote:
> On Sun, Oct 06, 2019 at 06:51:39PM +0300, Leon Romanovsky wrote:
> > From: Erez Alfasi <ereza@mellanox.com>
> >
> > Add RDMA nldev netlink interface for dumping MR
> > statistics information.
> >
> > Output example:
> > ereza@dev~$: ./ibv_rc_pingpong -o -P -s 500000000
> >   local address:  LID 0x0001, QPN 0x00008a, PSN 0xf81096, GID ::
> >
> > ereza@dev~$: rdma stat show mr
> > dev mlx5_0 mrn 2 page_faults 122071 page_invalidations 0
> > prefetched_pages 122071
> >
> > Signed-off-by: Erez Alfasi <ereza@mellanox.com>
> > Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
> >  drivers/infiniband/core/device.c      |  1 +
> >  drivers/infiniband/core/nldev.c       | 41 ++++++++++++++++++++++---
> >  drivers/infiniband/hw/mlx5/main.c     |  2 ++
> >  drivers/infiniband/hw/mlx5/mlx5_ib.h  |  2 ++
> >  drivers/infiniband/hw/mlx5/restrack.c | 44 +++++++++++++++++++++++++++
> >  include/rdma/ib_verbs.h               |  7 +++++
> >  include/rdma/restrack.h               |  3 ++
> >  7 files changed, 96 insertions(+), 4 deletions(-)
> >
> > diff --git a/drivers/infiniband/core/device.c b/drivers/infiniband/core/device.c
> > index a667636f74bf..2e53aa25f0c7 100644
> > +++ b/drivers/infiniband/core/device.c
> > @@ -2606,6 +2606,7 @@ void ib_set_device_ops(struct ib_device *dev, const struct ib_device_ops *ops)
> >  	SET_DEVICE_OP(dev_ops, drain_sq);
> >  	SET_DEVICE_OP(dev_ops, enable_driver);
> >  	SET_DEVICE_OP(dev_ops, fill_res_entry);
> > +	SET_DEVICE_OP(dev_ops, fill_stat_entry);
> >  	SET_DEVICE_OP(dev_ops, get_dev_fw_str);
> >  	SET_DEVICE_OP(dev_ops, get_dma_mr);
> >  	SET_DEVICE_OP(dev_ops, get_hw_stats);
> > diff --git a/drivers/infiniband/core/nldev.c b/drivers/infiniband/core/nldev.c
> > index 6114465959e1..1d9d89fd9ce9 100644
> > +++ b/drivers/infiniband/core/nldev.c
> > @@ -454,6 +454,14 @@ static bool fill_res_entry(struct ib_device *dev, struct sk_buff *msg,
> >  	return dev->ops.fill_res_entry(msg, res);
> >  }
> >
> > +static bool fill_stat_entry(struct ib_device *dev, struct sk_buff *msg,
> > +			    struct rdma_restrack_entry *res)
> > +{
> > +	if (!dev->ops.fill_stat_entry)
> > +		return false;
> > +	return dev->ops.fill_stat_entry(msg, res);
> > +}
>
> Why do we need this function called in only one place?

Be consistent with other fill_... functions.

>
> Jason

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH rdma-next v2 1/4] IB/mlx5: Introduce ODP diagnostic counters
  2019-10-10  9:02     ` Leon Romanovsky
@ 2019-10-10 19:49       ` Jason Gunthorpe
  0 siblings, 0 replies; 11+ messages in thread
From: Jason Gunthorpe @ 2019-10-10 19:49 UTC (permalink / raw)
  To: Leon Romanovsky; +Cc: Doug Ledford, RDMA mailing list, Erez Alfasi

On Thu, Oct 10, 2019 at 12:02:03PM +0300, Leon Romanovsky wrote:

> > >  static inline bool is_odp_mr(struct mlx5_ib_mr *mr)
> > > diff --git a/drivers/infiniband/hw/mlx5/odp.c b/drivers/infiniband/hw/mlx5/odp.c
> > > index 95cf0249b015..966783bfb557 100644
> > > +++ b/drivers/infiniband/hw/mlx5/odp.c
> > > @@ -261,6 +261,10 @@ void mlx5_ib_invalidate_range(struct ib_umem_odp *umem_odp, unsigned long start,
> > >  				blk_start_idx = idx;
> > >  				in_block = 1;
> > >  			}
> > > +
> > > +			/* Count page invalidations */
> > > +			mlx5_update_odp_stats(mr, invalidations,
> > > +					      (idx - blk_start_idx + 1));
> >
> > I feel like these should be batched and the atomic done once at the
> > end of the routine..
> 
> We can, but does it worth it?

Probably since it is so simple, atomics are very expensive

> For various reasons we are delaying this series for months already.
> Let's drop "prefetch" counter for now and merge everything without
> it.

OK, I guess the counters are extendible as we go along, however see below:

> > This is also not quite right for prefetch as we are doing a form of
> > prefetching in the mlx5_ib_mr_rdma_pfault_handler() too, although it
> > is less clear how to count those. Maybe this should be split to SQ/RQ
> > faults?
> 
> mlx5_ib_mr_rdma_pfault_handler() calls to pagefault_single_data_segment()
> without MLX5_PF_FLAGS_PREFETCH, so I'm unsure that this counter should
> count mlx5_ib_mr_rdma_pfault_handler() pagefaults.
> 
> However the idea to separate SQ/RQ for everything sounds appealing.

Let's at least have a well defined counter design. SQ/RQ seems like a
good split to me as they have quite different behavior on mlx5
hardware, so splitting the existing counter seems good anyhow

Jason

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, back to index

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-10-06 15:51 [PATCH rdma-next v2 0/4] ODP information and statistics Leon Romanovsky
2019-10-06 15:51 ` [PATCH rdma-next v2 1/4] IB/mlx5: Introduce ODP diagnostic counters Leon Romanovsky
2019-10-08 19:57   ` Jason Gunthorpe
2019-10-10  9:02     ` Leon Romanovsky
2019-10-10 19:49       ` Jason Gunthorpe
2019-10-06 15:51 ` [PATCH rdma-next v2 2/4] RDMA/nldev: Allow different fill function per resource Leon Romanovsky
2019-10-08 19:58   ` Jason Gunthorpe
2019-10-06 15:51 ` [PATCH rdma-next v2 3/4] RDMA/mlx5: Return ODP type per MR Leon Romanovsky
2019-10-06 15:51 ` [PATCH rdma-next v2 4/4] RDMA/nldev: Provide MR statistics Leon Romanovsky
2019-10-09 14:40   ` Jason Gunthorpe
2019-10-10  9:02     ` Leon Romanovsky

Linux-RDMA Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/linux-rdma/0 linux-rdma/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 linux-rdma linux-rdma/ https://lore.kernel.org/linux-rdma \
		linux-rdma@vger.kernel.org linux-rdma@archiver.kernel.org
	public-inbox-index linux-rdma

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.linux-rdma


AGPL code for this site: git clone https://public-inbox.org/ public-inbox