* [dpdk-dev] [PATCH 0/4] refactor multi-process IPC and memory management codes to common driver
@ 2020-04-02 19:21 Vu Pham
  2020-04-02 19:21 ` [dpdk-dev] [PATCH 1/4] common/mlx5: refactor multi-process IPC handling " Vu Pham
                   ` (6 more replies)
  0 siblings, 7 replies; 26+ messages in thread
From: Vu Pham @ 2020-04-02 19:21 UTC (permalink / raw)
  To: dev; +Cc: viacheslavo, orika, matan, rasland, Vu Pham

From: Vu Pham <vuhuong@mellanox.com>

The current mlx5 net PMD and future mlx5 PMDs (regex, ...) that run on
and share the same HCAs need to use a common memory management driver.
The memory management code internally uses multi-process IPC so that
primary and secondary processes can register and synchronize memory
registrations (MRs). That is the main reason to move the multi-process
IPC APIs into the mlx5 common driver and make it the base commit of
this series.
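
For orientation, here is a minimal sketch of how a PMD is expected to
consume the refactored IPC API (names taken from patch 1; MY_MP_NAME
and the handler are hypothetical placeholders, not part of this series):

    #include <rte_eal.h>
    #include <rte_string_fns.h>
    #include <mlx5_common_mp.h>

    #define MY_MP_NAME "net_mlx5_mp" /* hypothetical IPC key */

    /* Hypothetical no-op handler; real PMDs reply via rte_mp_reply(). */
    static int
    my_primary_handle(const struct rte_mp_msg *msg, const void *peer)
    {
        (void)msg;
        (void)peer;
        return 0;
    }

    static int
    setup_ipc(uint16_t port_id, struct mlx5_mp_id *mp_id)
    {
        if (rte_eal_process_type() == RTE_PROC_PRIMARY &&
            mlx5_mp_init_primary(MY_MP_NAME, my_primary_handle))
            return -1;
        /* Every request is keyed by this {name, port_id} tuple. */
        mp_id->port_id = port_id;
        strlcpy(mp_id->name, MY_MP_NAME, RTE_MP_MAX_NAME_LEN);
        return 0;
    }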

Vu Pham (4):
  common/mlx5: refactor multi-process IPC handling codes to common
    driver
  net/mlx5: modify net PMD to use common multi-process APIs
  common/mlx5: refactor memory management codes
  net/mlx5: modify net PMD to use common memory management driver

 drivers/common/mlx5/Makefile                    |    4 +-
 drivers/common/mlx5/meson.build                 |    2 +
 drivers/common/mlx5/mlx5_common_mp.c            |  188 ++++
 drivers/common/mlx5/mlx5_common_mp.h            |   98 ++
 drivers/common/mlx5/mlx5_common_mr.c            | 1106 +++++++++++++++++++++
 drivers/common/mlx5/mlx5_common_mr.h            |  160 ++++
 drivers/common/mlx5/rte_common_mlx5_version.map |   27 +
 drivers/net/mlx5/mlx5.c                         |   19 +-
 drivers/net/mlx5/mlx5.h                         |   55 +-
 drivers/net/mlx5/mlx5_mp.c                      |  242 +----
 drivers/net/mlx5/mlx5_mr.c                      | 1167 +----------------------
 drivers/net/mlx5/mlx5_mr.h                      |   87 +-
 drivers/net/mlx5/mlx5_rxtx.c                    |    4 +-
 drivers/net/mlx5/mlx5_rxtx.h                    |   10 +-
 drivers/net/mlx5/mlx5_rxtx_vec.h                |    2 +
 drivers/net/mlx5/mlx5_trigger.c                 |    1 +
 drivers/net/mlx5/mlx5_txq.c                     |    3 +-
 17 files changed, 1690 insertions(+), 1485 deletions(-)
 create mode 100644 drivers/common/mlx5/mlx5_common_mp.c
 create mode 100644 drivers/common/mlx5/mlx5_common_mp.h
 create mode 100644 drivers/common/mlx5/mlx5_common_mr.c
 create mode 100644 drivers/common/mlx5/mlx5_common_mr.h

-- 
2.16.6



* [dpdk-dev] [PATCH 1/4] common/mlx5: refactor multi-process IPC handling codes to common driver
  2020-04-02 19:21 [dpdk-dev] [PATCH 0/4] refactor multi-process IPC and memory management codes to common driver Vu Pham
@ 2020-04-02 19:21 ` Vu Pham
  2020-04-02 19:21 ` [dpdk-dev] [PATCH 2/4] net/mlx5: modify net PMD to use common multi-process APIs Vu Pham
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 26+ messages in thread
From: Vu Pham @ 2020-04-02 19:21 UTC (permalink / raw)
  To: dev; +Cc: viacheslavo, orika, matan, rasland, Vu Pham

From: Vu Pham <vuhuong@mellanox.com>

Refactor the common multi-process handling code from the net PMD into
the common driver. Use the tuple mp_id{name, port_id} as the standard
input parameter for all multi-process IPC APIs instead of rte_eth_dev.
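
For example, a secondary-process caller now looks like this minimal
sketch (signatures as declared in mlx5_common_mp.h below; port_id and
addr come from the surrounding context, error handling trimmed):

    struct mlx5_mp_id mp_id;

    mp_id.port_id = port_id;
    strlcpy(mp_id.name, "net_mlx5_mp", RTE_MP_MAX_NAME_LEN);
    /* Ask the primary process to register an MR covering addr. */
    if (mlx5_mp_req_mr_create(&mp_id, (uintptr_t)addr) < 0)
        DRV_LOG(ERR, "MR create request failed: %s",
            rte_strerror(rte_errno));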

Signed-off-by: Vu Pham <vuhuong@mellanox.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@mellanox.com>
---
 drivers/common/mlx5/mlx5_common_mp.c            | 188 ++++++++++++++++++++++++
 drivers/common/mlx5/mlx5_common_mp.h            |  98 ++++++++++++
 drivers/common/mlx5/rte_common_mlx5_version.map |  13 ++
 3 files changed, 299 insertions(+)
 create mode 100644 drivers/common/mlx5/mlx5_common_mp.c
 create mode 100644 drivers/common/mlx5/mlx5_common_mp.h

diff --git a/drivers/common/mlx5/mlx5_common_mp.c b/drivers/common/mlx5/mlx5_common_mp.c
new file mode 100644
index 0000000000..da55143bc1
--- /dev/null
+++ b/drivers/common/mlx5/mlx5_common_mp.c
@@ -0,0 +1,188 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright 2019 6WIND S.A.
+ * Copyright 2019 Mellanox Technologies, Ltd
+ */
+
+#include <stdio.h>
+#include <time.h>
+
+#include <rte_eal.h>
+#include <rte_errno.h>
+
+#include "mlx5_common_mp.h"
+#include "mlx5_common_utils.h"
+
+/**
+ * Request Memory Region creation to the primary process.
+ *
+ * @param[in] mp_id
+ *   ID of the MP process.
+ * @param addr
+ *   Target virtual address to register.
+ *
+ * @return
+ *   0 on success, a negative errno value otherwise and rte_errno is set.
+ */
+int
+mlx5_mp_req_mr_create(struct mlx5_mp_id *mp_id, uintptr_t addr)
+{
+	struct rte_mp_msg mp_req;
+	struct rte_mp_msg *mp_res;
+	struct rte_mp_reply mp_rep;
+	struct mlx5_mp_param *req = (struct mlx5_mp_param *)mp_req.param;
+	struct mlx5_mp_param *res;
+	struct timespec ts = {.tv_sec = MLX5_MP_REQ_TIMEOUT_SEC, .tv_nsec = 0};
+	int ret;
+
+	MLX5_ASSERT(rte_eal_process_type() == RTE_PROC_SECONDARY);
+	mp_init_msg(mp_id, &mp_req, MLX5_MP_REQ_CREATE_MR);
+	req->args.addr = addr;
+	ret = rte_mp_request_sync(&mp_req, &mp_rep, &ts);
+	if (ret) {
+		DRV_LOG(ERR, "port %u request to primary process failed",
+			mp_id->port_id);
+		return -rte_errno;
+	}
+	MLX5_ASSERT(mp_rep.nb_received == 1);
+	mp_res = &mp_rep.msgs[0];
+	res = (struct mlx5_mp_param *)mp_res->param;
+	ret = res->result;
+	if (ret)
+		rte_errno = -ret;
+	free(mp_rep.msgs);
+	return ret;
+}
+
+/**
+ * Request Verbs queue state modification to the primary process.
+ *
+ * @param[in] mp_id
+ *   ID of the MP process.
+ * @param sm
+ *   State modify parameters.
+ *
+ * @return
+ *   0 on success, a negative errno value otherwise and rte_errno is set.
+ */
+int
+mlx5_mp_req_queue_state_modify(struct mlx5_mp_id *mp_id,
+			       struct mlx5_mp_arg_queue_state_modify *sm)
+{
+	struct rte_mp_msg mp_req;
+	struct rte_mp_msg *mp_res;
+	struct rte_mp_reply mp_rep;
+	struct mlx5_mp_param *req = (struct mlx5_mp_param *)mp_req.param;
+	struct mlx5_mp_param *res;
+	struct timespec ts = {.tv_sec = MLX5_MP_REQ_TIMEOUT_SEC, .tv_nsec = 0};
+	int ret;
+
+	MLX5_ASSERT(rte_eal_process_type() == RTE_PROC_SECONDARY);
+	mp_init_msg(mp_id, &mp_req, MLX5_MP_REQ_QUEUE_STATE_MODIFY);
+	req->args.state_modify = *sm;
+	ret = rte_mp_request_sync(&mp_req, &mp_rep, &ts);
+	if (ret) {
+		DRV_LOG(ERR, "port %u request to primary process failed",
+			mp_id->port_id);
+		return -rte_errno;
+	}
+	MLX5_ASSERT(mp_rep.nb_received == 1);
+	mp_res = &mp_rep.msgs[0];
+	res = (struct mlx5_mp_param *)mp_res->param;
+	ret = res->result;
+	free(mp_rep.msgs);
+	return ret;
+}
+
+/**
+ * Request Verbs command file descriptor for mmap to the primary process.
+ *
+ * @param[in] mp_id
+ *   ID of the MP process.
+ *
+ * @return
+ *   fd on success, a negative errno value otherwise and rte_errno is set.
+ */
+int
+mlx5_mp_req_verbs_cmd_fd(struct mlx5_mp_id *mp_id)
+{
+	struct rte_mp_msg mp_req;
+	struct rte_mp_msg *mp_res;
+	struct rte_mp_reply mp_rep;
+	struct mlx5_mp_param *res;
+	struct timespec ts = {.tv_sec = MLX5_MP_REQ_TIMEOUT_SEC, .tv_nsec = 0};
+	int ret;
+
+	MLX5_ASSERT(rte_eal_process_type() == RTE_PROC_SECONDARY);
+	mp_init_msg(mp_id, &mp_req, MLX5_MP_REQ_VERBS_CMD_FD);
+	ret = rte_mp_request_sync(&mp_req, &mp_rep, &ts);
+	if (ret) {
+		DRV_LOG(ERR, "port %u request to primary process failed",
+			mp_id->port_id);
+		return -rte_errno;
+	}
+	MLX5_ASSERT(mp_rep.nb_received == 1);
+	mp_res = &mp_rep.msgs[0];
+	res = (struct mlx5_mp_param *)mp_res->param;
+	if (res->result) {
+		rte_errno = -res->result;
+		DRV_LOG(ERR,
+			"port %u failed to get command FD from primary process",
+			mp_id->port_id);
+		ret = -rte_errno;
+		goto exit;
+	}
+	MLX5_ASSERT(mp_res->num_fds == 1);
+	ret = mp_res->fds[0];
+	DRV_LOG(DEBUG, "port %u command FD from primary is %d",
+		mp_id->port_id, ret);
+exit:
+	free(mp_rep.msgs);
+	return ret;
+}
+
+/**
+ * Initialize by primary process.
+ */
+int
+mlx5_mp_init_primary(const char *name, const rte_mp_t primary_action)
+{
+	int ret;
+
+	MLX5_ASSERT(rte_eal_process_type() == RTE_PROC_PRIMARY);
+
+	/* primary is allowed to not support IPC */
+	ret = rte_mp_action_register(name, primary_action);
+	if (ret && rte_errno != ENOTSUP)
+		return -1;
+	return 0;
+}
+
+/**
+ * Un-initialize by primary process.
+ */
+void
+mlx5_mp_uninit_primary(const char *name)
+{
+	MLX5_ASSERT(rte_eal_process_type() == RTE_PROC_PRIMARY);
+	rte_mp_action_unregister(name);
+}
+
+/**
+ * Initialize by secondary process.
+ */
+int
+mlx5_mp_init_secondary(const char *name, const rte_mp_t secondary_action)
+{
+	MLX5_ASSERT(rte_eal_process_type() == RTE_PROC_SECONDARY);
+	return rte_mp_action_register(name, secondary_action);
+}
+
+/**
+ * Un-initialize by secondary process.
+ */
+void
+mlx5_mp_uninit_secondary(const char *name)
+{
+	MLX5_ASSERT(rte_eal_process_type() == RTE_PROC_SECONDARY);
+	rte_mp_action_unregister(name);
+}
diff --git a/drivers/common/mlx5/mlx5_common_mp.h b/drivers/common/mlx5/mlx5_common_mp.h
new file mode 100644
index 0000000000..7aab77acb2
--- /dev/null
+++ b/drivers/common/mlx5/mlx5_common_mp.h
@@ -0,0 +1,98 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright 2018 6WIND S.A.
+ * Copyright 2018 Mellanox Technologies, Ltd
+ */
+
+#ifndef RTE_PMD_MLX5_COMMON_MP_H_
+#define RTE_PMD_MLX5_COMMON_MP_H_
+
+/* Verbs header. */
+/* ISO C doesn't support unnamed structs/unions, disabling -pedantic. */
+#ifdef PEDANTIC
+#pragma GCC diagnostic ignored "-Wpedantic"
+#endif
+#include <infiniband/verbs.h>
+#ifdef PEDANTIC
+#pragma GCC diagnostic error "-Wpedantic"
+#endif
+
+#include <rte_eal.h>
+#include <rte_string_fns.h>
+
+/* Request types for IPC. */
+enum mlx5_mp_req_type {
+	MLX5_MP_REQ_VERBS_CMD_FD = 1,
+	MLX5_MP_REQ_CREATE_MR,
+	MLX5_MP_REQ_START_RXTX,
+	MLX5_MP_REQ_STOP_RXTX,
+	MLX5_MP_REQ_QUEUE_STATE_MODIFY,
+};
+
+struct mlx5_mp_arg_queue_state_modify {
+	uint8_t is_wq; /* Set if WQ. */
+	uint16_t queue_id; /* DPDK queue ID. */
+	enum ibv_wq_state state; /* WQ requested state. */
+};
+
+/* Parameters for IPC. */
+struct mlx5_mp_param {
+	enum mlx5_mp_req_type type;
+	int port_id;
+	int result;
+	RTE_STD_C11
+	union {
+		uintptr_t addr; /* MLX5_MP_REQ_CREATE_MR */
+		struct mlx5_mp_arg_queue_state_modify state_modify;
+		/* MLX5_MP_REQ_QUEUE_STATE_MODIFY */
+	} args;
+};
+
+/* Identifier of an MP process. */
+struct mlx5_mp_id {
+	char name[RTE_MP_MAX_NAME_LEN];
+	uint16_t port_id;
+};
+
+/** Request timeout for IPC. */
+#define MLX5_MP_REQ_TIMEOUT_SEC 5
+
+/**
+ * Initialize IPC message.
+ *
+ * @param[in] mp_id
+ *   ID of the MP process.
+ * @param[out] msg
+ *   Pointer to message to fill in.
+ * @param[in] type
+ *   Message type.
+ */
+static inline void
+mp_init_msg(struct mlx5_mp_id *mp_id, struct rte_mp_msg *msg,
+	    enum mlx5_mp_req_type type)
+{
+	struct mlx5_mp_param *param = (struct mlx5_mp_param *)msg->param;
+
+	memset(msg, 0, sizeof(*msg));
+	strlcpy(msg->name, mp_id->name, sizeof(msg->name));
+	msg->len_param = sizeof(*param);
+	param->type = type;
+	param->port_id = mp_id->port_id;
+}
+
+__rte_experimental
+int mlx5_mp_init_primary(const char *name, const rte_mp_t primary_action);
+__rte_experimental
+void mlx5_mp_uninit_primary(const char *name);
+__rte_experimental
+int mlx5_mp_init_secondary(const char *name, const rte_mp_t secondary_action);
+__rte_experimental
+void mlx5_mp_uninit_secondary(const char *name);
+__rte_experimental
+int mlx5_mp_req_mr_create(struct mlx5_mp_id *mp_id, uintptr_t addr);
+__rte_experimental
+int mlx5_mp_req_queue_state_modify(struct mlx5_mp_id *mp_id,
+				   struct mlx5_mp_arg_queue_state_modify *sm);
+__rte_experimental
+int mlx5_mp_req_verbs_cmd_fd(struct mlx5_mp_id *mp_id);
+
+#endif /* RTE_PMD_MLX5_COMMON_MP_H_ */
diff --git a/drivers/common/mlx5/rte_common_mlx5_version.map b/drivers/common/mlx5/rte_common_mlx5_version.map
index aede2a0a51..265703d1c9 100644
--- a/drivers/common/mlx5/rte_common_mlx5_version.map
+++ b/drivers/common/mlx5/rte_common_mlx5_version.map
@@ -48,4 +48,17 @@ DPDK_20.0.1 {
 	mlx5_nl_vlan_vmwa_delete;
 
 	mlx5_translate_port_name;
+
+};
+
+EXPERIMENTAL {
+	global:
+
+	mlx5_mp_init_primary;
+	mlx5_mp_uninit_primary;
+	mlx5_mp_init_secondary;
+	mlx5_mp_uninit_secondary;
+	mlx5_mp_req_mr_create;
+	mlx5_mp_req_queue_state_modify;
+	mlx5_mp_req_verbs_cmd_fd;
 };
-- 
2.16.6



* [dpdk-dev] [PATCH 2/4] net/mlx5: modify net PMD to use common multi-process APIs
  2020-04-02 19:21 [dpdk-dev] [PATCH 0/4] refactor multi-process IPC and memory management codes to common driver Vu Pham
  2020-04-02 19:21 ` [dpdk-dev] [PATCH 1/4] common/mlx5: refactor multi-process IPC handling " Vu Pham
@ 2020-04-02 19:21 ` Vu Pham
  2020-04-02 19:21 ` [dpdk-dev] [PATCH 3/4] common/mlx5: refactor memory management codes Vu Pham
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 26+ messages in thread
From: Vu Pham @ 2020-04-02 19:21 UTC (permalink / raw)
  To: dev; +Cc: viacheslavo, orika, matan, rasland, Vu Pham

From: Vu Pham <vuhuong@mellanox.com>

Modify net PMD to use multi-process IPC APIs from common driver.

Signed-off-by: Vu Pham <vuhuong@mellanox.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@mellanox.com>
---
 drivers/common/mlx5/Makefile    |   3 +-
 drivers/common/mlx5/meson.build |   1 +
 drivers/net/mlx5/mlx5.c         |  15 ++-
 drivers/net/mlx5/mlx5.h         |  43 +-------
 drivers/net/mlx5/mlx5_mp.c      | 234 +++-------------------------------------
 drivers/net/mlx5/mlx5_mr.c      |   2 +-
 drivers/net/mlx5/mlx5_rxtx.c    |   3 +-
 7 files changed, 37 insertions(+), 264 deletions(-)

diff --git a/drivers/common/mlx5/Makefile b/drivers/common/mlx5/Makefile
index f32933d592..2a88492731 100644
--- a/drivers/common/mlx5/Makefile
+++ b/drivers/common/mlx5/Makefile
@@ -17,6 +17,7 @@ endif
 SRCS-y += mlx5_devx_cmds.c
 SRCS-y += mlx5_common.c
 SRCS-y += mlx5_nl.c
+SRCS-y += mlx5_common_mp.c
 ifeq ($(CONFIG_RTE_IBVERBS_LINK_DLOPEN),y)
 INSTALL-y-lib += $(LIB_GLUE)
 endif
@@ -46,7 +47,7 @@ endif
 LDLIBS += -lrte_eal -lrte_pci -lrte_kvargs -lrte_net
 
 # A few warnings cannot be avoided in external headers.
-CFLAGS += -Wno-error=cast-qual -UPEDANTIC
+CFLAGS += -Wno-error=cast-qual -UPEDANTIC -DALLOW_EXPERIMENTAL_API
 
 EXPORT_MAP := rte_common_mlx5_version.map
 
diff --git a/drivers/common/mlx5/meson.build b/drivers/common/mlx5/meson.build
index f671710714..83671861c9 100644
--- a/drivers/common/mlx5/meson.build
+++ b/drivers/common/mlx5/meson.build
@@ -55,6 +55,7 @@ sources = files(
 	'mlx5_devx_cmds.c',
 	'mlx5_common.c',
 	'mlx5_nl.c',
+	'mlx5_common_mp.c',
 )
 if not dlopen_ibverbs
 	sources += files('mlx5_glue.c')
diff --git a/drivers/net/mlx5/mlx5.c b/drivers/net/mlx5/mlx5.c
index 94aaa60579..f802bcee3d 100644
--- a/drivers/net/mlx5/mlx5.c
+++ b/drivers/net/mlx5/mlx5.c
@@ -38,6 +38,7 @@
 #include <mlx5_glue.h>
 #include <mlx5_devx_cmds.h>
 #include <mlx5_common.h>
+#include <mlx5_common_mp.h>
 
 #include "mlx5_defs.h"
 #include "mlx5.h"
@@ -1694,7 +1695,8 @@ mlx5_init_once(void)
 		rte_rwlock_init(&sd->mem_event_rwlock);
 		rte_mem_event_callback_register("MLX5_MEM_EVENT_CB",
 						mlx5_mr_mem_event_cb, NULL);
-		ret = mlx5_mp_init_primary();
+		ret = mlx5_mp_init_primary(MLX5_MP_NAME,
+					   mlx5_mp_primary_handle);
 		if (ret)
 			goto out;
 		sd->init_done = true;
@@ -1702,7 +1704,8 @@ mlx5_init_once(void)
 	case RTE_PROC_SECONDARY:
 		if (ld->init_done)
 			break;
-		ret = mlx5_mp_init_secondary();
+		ret = mlx5_mp_init_secondary(MLX5_MP_NAME,
+					     mlx5_mp_secondary_handle);
 		if (ret)
 			goto out;
 		++sd->secondary_cnt;
@@ -2177,6 +2180,8 @@ mlx5_dev_spawn(struct rte_device *dpdk_dev,
 	}
 	DRV_LOG(DEBUG, "naming Ethernet device \"%s\"", name);
 	if (rte_eal_process_type() == RTE_PROC_SECONDARY) {
+		struct mlx5_mp_id mp_id;
+
 		eth_dev = rte_eth_dev_attach_secondary(name);
 		if (eth_dev == NULL) {
 			DRV_LOG(ERR, "can not attach rte ethdev");
@@ -2188,8 +2193,10 @@ mlx5_dev_spawn(struct rte_device *dpdk_dev,
 		err = mlx5_proc_priv_init(eth_dev);
 		if (err)
 			return NULL;
+		mp_id.port_id = eth_dev->data->port_id;
+		strlcpy(mp_id.name, MLX5_MP_NAME, RTE_MP_MAX_NAME_LEN);
 		/* Receive command fd from primary process */
-		err = mlx5_mp_req_verbs_cmd_fd(eth_dev);
+		err = mlx5_mp_req_verbs_cmd_fd(&mp_id);
 		if (err < 0)
 			return NULL;
 		/* Remap UAR for Tx queues. */
@@ -2353,6 +2360,8 @@ mlx5_dev_spawn(struct rte_device *dpdk_dev,
 	priv->ibv_port = spawn->ibv_port;
 	priv->pci_dev = spawn->pci_dev;
 	priv->mtu = RTE_ETHER_MTU;
+	priv->mp_id.port_id = port_id;
+	strlcpy(priv->mp_id.name, MLX5_MP_NAME, RTE_MP_MAX_NAME_LEN);
 #ifndef RTE_ARCH_64
 	/* Initialize UAR access locks for 32bit implementations. */
 	rte_spinlock_init(&priv->uar_lock_cq);
diff --git a/drivers/net/mlx5/mlx5.h b/drivers/net/mlx5/mlx5.h
index d7c519bae0..dc02e148c3 100644
--- a/drivers/net/mlx5/mlx5.h
+++ b/drivers/net/mlx5/mlx5.h
@@ -36,43 +36,13 @@
 #include <mlx5_devx_cmds.h>
 #include <mlx5_prm.h>
 #include <mlx5_nl.h>
+#include <mlx5_common_mp.h>
 
 #include "mlx5_defs.h"
 #include "mlx5_utils.h"
 #include "mlx5_mr.h"
 #include "mlx5_autoconf.h"
 
-/* Request types for IPC. */
-enum mlx5_mp_req_type {
-	MLX5_MP_REQ_VERBS_CMD_FD = 1,
-	MLX5_MP_REQ_CREATE_MR,
-	MLX5_MP_REQ_START_RXTX,
-	MLX5_MP_REQ_STOP_RXTX,
-	MLX5_MP_REQ_QUEUE_STATE_MODIFY,
-};
-
-struct mlx5_mp_arg_queue_state_modify {
-	uint8_t is_wq; /* Set if WQ. */
-	uint16_t queue_id; /* DPDK queue ID. */
-	enum ibv_wq_state state; /* WQ requested state. */
-};
-
-/* Pameters for IPC. */
-struct mlx5_mp_param {
-	enum mlx5_mp_req_type type;
-	int port_id;
-	int result;
-	RTE_STD_C11
-	union {
-		uintptr_t addr; /* MLX5_MP_REQ_CREATE_MR */
-		struct mlx5_mp_arg_queue_state_modify state_modify;
-		/* MLX5_MP_REQ_QUEUE_STATE_MODIFY */
-	} args;
-};
-
-/** Request timeout for IPC. */
-#define MLX5_MP_REQ_TIMEOUT_SEC 5
-
 /** Key string for IPC. */
 #define MLX5_MP_NAME "net_mlx5_mp"
 
@@ -555,6 +525,7 @@ struct mlx5_priv {
 #endif
 	uint8_t skip_default_rss_reta; /* Skip configuration of default reta. */
 	uint8_t fdb_def_rule; /* Whether fdb jump to table 1 is configured. */
+	struct mlx5_mp_id mp_id; /* ID for multi-process IPC. */
 };
 
 #define PORT_ID(priv) ((priv)->dev_data->port_id)
@@ -750,16 +721,10 @@ int mlx5_flow_dev_dump(struct rte_eth_dev *dev, FILE *file,
 		       struct rte_flow_error *error);
 
 /* mlx5_mp.c */
+int mlx5_mp_primary_handle(const struct rte_mp_msg *mp_msg, const void *peer);
+int mlx5_mp_secondary_handle(const struct rte_mp_msg *mp_msg, const void *peer);
 void mlx5_mp_req_start_rxtx(struct rte_eth_dev *dev);
 void mlx5_mp_req_stop_rxtx(struct rte_eth_dev *dev);
-int mlx5_mp_req_mr_create(struct rte_eth_dev *dev, uintptr_t addr);
-int mlx5_mp_req_verbs_cmd_fd(struct rte_eth_dev *dev);
-int mlx5_mp_req_queue_state_modify(struct rte_eth_dev *dev,
-				   struct mlx5_mp_arg_queue_state_modify *sm);
-int mlx5_mp_init_primary(void);
-void mlx5_mp_uninit_primary(void);
-int mlx5_mp_init_secondary(void);
-void mlx5_mp_uninit_secondary(void);
 
 /* mlx5_socket.c */
 
diff --git a/drivers/net/mlx5/mlx5_mp.c b/drivers/net/mlx5/mlx5_mp.c
index 55d408fe95..43684dbc3a 100644
--- a/drivers/net/mlx5/mlx5_mp.c
+++ b/drivers/net/mlx5/mlx5_mp.c
@@ -10,46 +10,14 @@
 #include <rte_ethdev_driver.h>
 #include <rte_string_fns.h>
 
+#include <mlx5_common_mp.h>
+
 #include "mlx5.h"
 #include "mlx5_rxtx.h"
 #include "mlx5_utils.h"
 
-/**
- * Initialize IPC message.
- *
- * @param[in] dev
- *   Pointer to Ethernet structure.
- * @param[out] msg
- *   Pointer to message to fill in.
- * @param[in] type
- *   Message type.
- */
-static inline void
-mp_init_msg(struct rte_eth_dev *dev, struct rte_mp_msg *msg,
-	    enum mlx5_mp_req_type type)
-{
-	struct mlx5_mp_param *param = (struct mlx5_mp_param *)msg->param;
-
-	memset(msg, 0, sizeof(*msg));
-	strlcpy(msg->name, MLX5_MP_NAME, sizeof(msg->name));
-	msg->len_param = sizeof(*param);
-	param->type = type;
-	param->port_id = dev->data->port_id;
-}
-
-/**
- * IPC message handler of primary process.
- *
- * @param[in] dev
- *   Pointer to Ethernet structure.
- * @param[in] peer
- *   Pointer to the peer socket path.
- *
- * @return
- *   0 on success, a negative errno value otherwise and rte_errno is set.
- */
-static int
-mp_primary_handle(const struct rte_mp_msg *mp_msg, const void *peer)
+int
+mlx5_mp_primary_handle(const struct rte_mp_msg *mp_msg, const void *peer)
 {
 	struct rte_mp_msg mp_res;
 	struct mlx5_mp_param *res = (struct mlx5_mp_param *)mp_res.param;
@@ -71,21 +39,21 @@ mp_primary_handle(const struct rte_mp_msg *mp_msg, const void *peer)
 	priv = dev->data->dev_private;
 	switch (param->type) {
 	case MLX5_MP_REQ_CREATE_MR:
-		mp_init_msg(dev, &mp_res, param->type);
+		mp_init_msg(&priv->mp_id, &mp_res, param->type);
 		lkey = mlx5_mr_create_primary(dev, &entry, param->args.addr);
 		if (lkey == UINT32_MAX)
 			res->result = -rte_errno;
 		ret = rte_mp_reply(&mp_res, peer);
 		break;
 	case MLX5_MP_REQ_VERBS_CMD_FD:
-		mp_init_msg(dev, &mp_res, param->type);
+		mp_init_msg(&priv->mp_id, &mp_res, param->type);
 		mp_res.num_fds = 1;
 		mp_res.fds[0] = priv->sh->ctx->cmd_fd;
 		res->result = 0;
 		ret = rte_mp_reply(&mp_res, peer);
 		break;
 	case MLX5_MP_REQ_QUEUE_STATE_MODIFY:
-		mp_init_msg(dev, &mp_res, param->type);
+		mp_init_msg(&priv->mp_id, &mp_res, param->type);
 		res->result = mlx5_queue_state_modify_primary
 					(dev, &param->args.state_modify);
 		ret = rte_mp_reply(&mp_res, peer);
@@ -110,14 +78,15 @@ mp_primary_handle(const struct rte_mp_msg *mp_msg, const void *peer)
  * @return
  *   0 on success, a negative errno value otherwise and rte_errno is set.
  */
-static int
-mp_secondary_handle(const struct rte_mp_msg *mp_msg, const void *peer)
+int
+mlx5_mp_secondary_handle(const struct rte_mp_msg *mp_msg, const void *peer)
 {
 	struct rte_mp_msg mp_res;
 	struct mlx5_mp_param *res = (struct mlx5_mp_param *)mp_res.param;
 	const struct mlx5_mp_param *param =
 		(const struct mlx5_mp_param *)mp_msg->param;
 	struct rte_eth_dev *dev;
+	struct mlx5_priv *priv;
 	int ret;
 
 	MLX5_ASSERT(rte_eal_process_type() == RTE_PROC_SECONDARY);
@@ -127,13 +96,14 @@ mp_secondary_handle(const struct rte_mp_msg *mp_msg, const void *peer)
 		return -rte_errno;
 	}
 	dev = &rte_eth_devices[param->port_id];
+	priv = dev->data->dev_private;
 	switch (param->type) {
 	case MLX5_MP_REQ_START_RXTX:
 		DRV_LOG(INFO, "port %u starting datapath", dev->data->port_id);
 		rte_mb();
 		dev->rx_pkt_burst = mlx5_select_rx_function(dev);
 		dev->tx_pkt_burst = mlx5_select_tx_function(dev);
-		mp_init_msg(dev, &mp_res, param->type);
+		mp_init_msg(&priv->mp_id, &mp_res, param->type);
 		res->result = 0;
 		ret = rte_mp_reply(&mp_res, peer);
 		break;
@@ -142,7 +112,7 @@ mp_secondary_handle(const struct rte_mp_msg *mp_msg, const void *peer)
 		dev->rx_pkt_burst = removed_rx_burst;
 		dev->tx_pkt_burst = removed_tx_burst;
 		rte_mb();
-		mp_init_msg(dev, &mp_res, param->type);
+		mp_init_msg(&priv->mp_id, &mp_res, param->type);
 		res->result = 0;
 		ret = rte_mp_reply(&mp_res, peer);
 		break;
@@ -171,6 +141,7 @@ mp_req_on_rxtx(struct rte_eth_dev *dev, enum mlx5_mp_req_type type)
 	struct rte_mp_reply mp_rep;
 	struct mlx5_mp_param *res;
 	struct timespec ts = {.tv_sec = MLX5_MP_REQ_TIMEOUT_SEC, .tv_nsec = 0};
+	struct mlx5_priv *priv = dev->data->dev_private;
 	int ret;
 	int i;
 
@@ -182,7 +153,7 @@ mp_req_on_rxtx(struct rte_eth_dev *dev, enum mlx5_mp_req_type type)
 			dev->data->port_id, type);
 		return;
 	}
-	mp_init_msg(dev, &mp_req, type);
+	mp_init_msg(&priv->mp_id, &mp_req, type);
 	ret = rte_mp_request_sync(&mp_req, &mp_rep, &ts);
 	if (ret) {
 		if (rte_errno != ENOTSUP)
@@ -234,178 +205,3 @@ mlx5_mp_req_stop_rxtx(struct rte_eth_dev *dev)
 {
 	mp_req_on_rxtx(dev, MLX5_MP_REQ_STOP_RXTX);
 }
-
-/**
- * Request Memory Region creation to the primary process.
- *
- * @param[in] dev
- *   Pointer to Ethernet structure.
- * @param addr
- *   Target virtual address to register.
- *
- * @return
- *   0 on success, a negative errno value otherwise and rte_errno is set.
- */
-int
-mlx5_mp_req_mr_create(struct rte_eth_dev *dev, uintptr_t addr)
-{
-	struct rte_mp_msg mp_req;
-	struct rte_mp_msg *mp_res;
-	struct rte_mp_reply mp_rep;
-	struct mlx5_mp_param *req = (struct mlx5_mp_param *)mp_req.param;
-	struct mlx5_mp_param *res;
-	struct timespec ts = {.tv_sec = MLX5_MP_REQ_TIMEOUT_SEC, .tv_nsec = 0};
-	int ret;
-
-	MLX5_ASSERT(rte_eal_process_type() == RTE_PROC_SECONDARY);
-	mp_init_msg(dev, &mp_req, MLX5_MP_REQ_CREATE_MR);
-	req->args.addr = addr;
-	ret = rte_mp_request_sync(&mp_req, &mp_rep, &ts);
-	if (ret) {
-		DRV_LOG(ERR, "port %u request to primary process failed",
-			dev->data->port_id);
-		return -rte_errno;
-	}
-	MLX5_ASSERT(mp_rep.nb_received == 1);
-	mp_res = &mp_rep.msgs[0];
-	res = (struct mlx5_mp_param *)mp_res->param;
-	ret = res->result;
-	if (ret)
-		rte_errno = -ret;
-	free(mp_rep.msgs);
-	return ret;
-}
-
-/**
- * Request Verbs queue state modification to the primary process.
- *
- * @param[in] dev
- *   Pointer to Ethernet structure.
- * @param sm
- *   State modify parameters.
- *
- * @return
- *   0 on success, a negative errno value otherwise and rte_errno is set.
- */
-int
-mlx5_mp_req_queue_state_modify(struct rte_eth_dev *dev,
-			       struct mlx5_mp_arg_queue_state_modify *sm)
-{
-	struct rte_mp_msg mp_req;
-	struct rte_mp_msg *mp_res;
-	struct rte_mp_reply mp_rep;
-	struct mlx5_mp_param *req = (struct mlx5_mp_param *)mp_req.param;
-	struct mlx5_mp_param *res;
-	struct timespec ts = {.tv_sec = MLX5_MP_REQ_TIMEOUT_SEC, .tv_nsec = 0};
-	int ret;
-
-	MLX5_ASSERT(rte_eal_process_type() == RTE_PROC_SECONDARY);
-	mp_init_msg(dev, &mp_req, MLX5_MP_REQ_QUEUE_STATE_MODIFY);
-	req->args.state_modify = *sm;
-	ret = rte_mp_request_sync(&mp_req, &mp_rep, &ts);
-	if (ret) {
-		DRV_LOG(ERR, "port %u request to primary process failed",
-			dev->data->port_id);
-		return -rte_errno;
-	}
-	MLX5_ASSERT(mp_rep.nb_received == 1);
-	mp_res = &mp_rep.msgs[0];
-	res = (struct mlx5_mp_param *)mp_res->param;
-	ret = res->result;
-	free(mp_rep.msgs);
-	return ret;
-}
-
-/**
- * Request Verbs command file descriptor for mmap to the primary process.
- *
- * @param[in] dev
- *   Pointer to Ethernet structure.
- *
- * @return
- *   fd on success, a negative errno value otherwise and rte_errno is set.
- */
-int
-mlx5_mp_req_verbs_cmd_fd(struct rte_eth_dev *dev)
-{
-	struct rte_mp_msg mp_req;
-	struct rte_mp_msg *mp_res;
-	struct rte_mp_reply mp_rep;
-	struct mlx5_mp_param *res;
-	struct timespec ts = {.tv_sec = MLX5_MP_REQ_TIMEOUT_SEC, .tv_nsec = 0};
-	int ret;
-
-	MLX5_ASSERT(rte_eal_process_type() == RTE_PROC_SECONDARY);
-	mp_init_msg(dev, &mp_req, MLX5_MP_REQ_VERBS_CMD_FD);
-	ret = rte_mp_request_sync(&mp_req, &mp_rep, &ts);
-	if (ret) {
-		DRV_LOG(ERR, "port %u request to primary process failed",
-			dev->data->port_id);
-		return -rte_errno;
-	}
-	MLX5_ASSERT(mp_rep.nb_received == 1);
-	mp_res = &mp_rep.msgs[0];
-	res = (struct mlx5_mp_param *)mp_res->param;
-	if (res->result) {
-		rte_errno = -res->result;
-		DRV_LOG(ERR,
-			"port %u failed to get command FD from primary process",
-			dev->data->port_id);
-		ret = -rte_errno;
-		goto exit;
-	}
-	MLX5_ASSERT(mp_res->num_fds == 1);
-	ret = mp_res->fds[0];
-	DRV_LOG(DEBUG, "port %u command FD from primary is %d",
-		dev->data->port_id, ret);
-exit:
-	free(mp_rep.msgs);
-	return ret;
-}
-
-/**
- * Initialize by primary process.
- */
-int
-mlx5_mp_init_primary(void)
-{
-	int ret;
-
-	MLX5_ASSERT(rte_eal_process_type() == RTE_PROC_PRIMARY);
-
-	/* primary is allowed to not support IPC */
-	ret = rte_mp_action_register(MLX5_MP_NAME, mp_primary_handle);
-	if (ret && rte_errno != ENOTSUP)
-		return -1;
-	return 0;
-}
-
-/**
- * Un-initialize by primary process.
- */
-void
-mlx5_mp_uninit_primary(void)
-{
-	MLX5_ASSERT(rte_eal_process_type() == RTE_PROC_PRIMARY);
-	rte_mp_action_unregister(MLX5_MP_NAME);
-}
-
-/**
- * Initialize by secondary process.
- */
-int
-mlx5_mp_init_secondary(void)
-{
-	MLX5_ASSERT(rte_eal_process_type() == RTE_PROC_SECONDARY);
-	return rte_mp_action_register(MLX5_MP_NAME, mp_secondary_handle);
-}
-
-/**
- * Un-initialize by secondary process.
- */
-void
-mlx5_mp_uninit_secondary(void)
-{
-	MLX5_ASSERT(rte_eal_process_type() == RTE_PROC_SECONDARY);
-	rte_mp_action_unregister(MLX5_MP_NAME);
-}
diff --git a/drivers/net/mlx5/mlx5_mr.c b/drivers/net/mlx5/mlx5_mr.c
index 6aa578646f..8097211b55 100644
--- a/drivers/net/mlx5/mlx5_mr.c
+++ b/drivers/net/mlx5/mlx5_mr.c
@@ -540,7 +540,7 @@ mlx5_mr_create_secondary(struct rte_eth_dev *dev, struct mlx5_mr_cache *entry,
 
 	DEBUG("port %u requesting MR creation for address (%p)",
 	      dev->data->port_id, (void *)addr);
-	ret = mlx5_mp_req_mr_create(dev, addr);
+	ret = mlx5_mp_req_mr_create(&priv->mp_id, addr);
 	if (ret) {
 		DEBUG("port %u fail to request MR creation for address (%p)",
 		      dev->data->port_id, (void *)addr);
diff --git a/drivers/net/mlx5/mlx5_rxtx.c b/drivers/net/mlx5/mlx5_rxtx.c
index f3bf763769..fc7591c2b0 100644
--- a/drivers/net/mlx5/mlx5_rxtx.c
+++ b/drivers/net/mlx5/mlx5_rxtx.c
@@ -1000,6 +1000,7 @@ static int
 mlx5_queue_state_modify(struct rte_eth_dev *dev,
 			struct mlx5_mp_arg_queue_state_modify *sm)
 {
+	struct mlx5_priv *priv = dev->data->dev_private;
 	int ret = 0;
 
 	switch (rte_eal_process_type()) {
@@ -1007,7 +1008,7 @@ mlx5_queue_state_modify(struct rte_eth_dev *dev,
 		ret = mlx5_queue_state_modify_primary(dev, sm);
 		break;
 	case RTE_PROC_SECONDARY:
-		ret = mlx5_mp_req_queue_state_modify(dev, sm);
+		ret = mlx5_mp_req_queue_state_modify(&priv->mp_id, sm);
 		break;
 	default:
 		break;
-- 
2.16.6



* [dpdk-dev] [PATCH 3/4] common/mlx5: refactor memory management codes
  2020-04-02 19:21 [dpdk-dev] [PATCH 0/4] refactor multi-process IPC and memory management codes to common driver Vu Pham
  2020-04-02 19:21 ` [dpdk-dev] [PATCH 1/4] common/mlx5: refactor multi-process IPC handling " Vu Pham
  2020-04-02 19:21 ` [dpdk-dev] [PATCH 2/4] net/mlx5: modify net PMD to use common multi-process APIs Vu Pham
@ 2020-04-02 19:21 ` Vu Pham
  2020-04-02 19:21 ` [dpdk-dev] [PATCH 4/4] net/mlx5: modify net PMD to use common memory management driver Vu Pham
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 26+ messages in thread
From: Vu Pham @ 2020-04-02 19:21 UTC (permalink / raw)
  To: dev; +Cc: viacheslavo, orika, matan, rasland, Vu Pham

From: Vu Pham <vuhuong@mellanox.com>

Refactor the common memory B-tree and cache management into the common
driver. Replace some input parameters of the MR APIs with more common
data structures such as PD, port_id, and share_cache, so that multiple
PMDs can use these MR APIs.
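
For example, a datapath LKey lookup under the refactored API reduces to
a minimal sketch like the following (signature from mlx5_common_mr.h in
this patch; the queue structure and its fields are hypothetical
placeholders):

    /* Hypothetical per-queue context owning the MR control structure. */
    struct my_queue {
        struct ibv_pd *pd;
        struct mlx5_mp_id mp_id;
        struct mlx5_mr_share_cache *share_cache;
        struct mlx5_mr_ctrl mr_ctrl;
        unsigned int mr_ext_memseg_en;
    };

    static inline uint32_t
    my_addr2lkey(struct my_queue *q, uintptr_t addr)
    {
        /*
         * Search the per-queue bottom-half cache first; on miss, fall
         * back to the shared cache and finally to MR creation (via IPC
         * when running in a secondary process).
         */
        return mlx5_mr_addr2mr_bh(q->pd, &q->mp_id, q->share_cache,
                                  &q->mr_ctrl, addr,
                                  q->mr_ext_memseg_en);
    }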

Signed-off-by: Vu Pham <vuhuong@mellanox.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@mellanox.com>
---
 drivers/common/mlx5/mlx5_common_mr.c            | 1106 +++++++++++++++++++++++
 drivers/common/mlx5/mlx5_common_mr.h            |  160 ++++
 drivers/common/mlx5/rte_common_mlx5_version.map |   14 +
 3 files changed, 1280 insertions(+)
 create mode 100644 drivers/common/mlx5/mlx5_common_mr.c
 create mode 100644 drivers/common/mlx5/mlx5_common_mr.h

diff --git a/drivers/common/mlx5/mlx5_common_mr.c b/drivers/common/mlx5/mlx5_common_mr.c
new file mode 100644
index 0000000000..46eaf12350
--- /dev/null
+++ b/drivers/common/mlx5/mlx5_common_mr.c
@@ -0,0 +1,1106 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright 2016 6WIND S.A.
+ * Copyright 2020 Mellanox Technologies, Ltd
+ */
+#include <rte_eal_memconfig.h>
+#include <rte_errno.h>
+#include <rte_mempool.h>
+#include <rte_malloc.h>
+#include <rte_rwlock.h>
+
+#include "mlx5_glue.h"
+#include "mlx5_common_mp.h"
+#include "mlx5_common_mr.h"
+#include "mlx5_common_utils.h"
+
+struct mr_find_contig_memsegs_data {
+	uintptr_t addr;
+	uintptr_t start;
+	uintptr_t end;
+	const struct rte_memseg_list *msl;
+};
+
+/**
+ * Expand B-tree table to a given size. Can't be called with holding
+ * memory_hotplug_lock or share_cache.rwlock due to rte_realloc().
+ *
+ * @param bt
+ *   Pointer to B-tree structure.
+ * @param n
+ *   Number of entries for expansion.
+ *
+ * @return
+ *   0 on success, -1 on failure.
+ */
+static int
+mr_btree_expand(struct mlx5_mr_btree *bt, int n)
+{
+	void *mem;
+	int ret = 0;
+
+	if (n <= bt->size)
+		return ret;
+	/*
+	 * The downside of directly using rte_realloc() is that SOCKET_ID_ANY
+	 * is used internally if there is no room to expand. Because this is a
+	 * rare case and part of a very slow path, it is acceptable.
+	 * Initially cache_bh[] is given practically enough space, and once it
+	 * has been expanded, expansion won't be needed again.
+	 */
+	mem = rte_realloc(bt->table, n * sizeof(struct mr_cache_entry), 0);
+	if (mem == NULL) {
+		/* Not an error, B-tree search will be skipped. */
+		DRV_LOG(WARNING, "failed to expand MR B-tree (%p) table",
+			(void *)bt);
+		ret = -1;
+	} else {
+		DRV_LOG(DEBUG, "expanded MR B-tree table (size=%u)", n);
+		bt->table = mem;
+		bt->size = n;
+	}
+	return ret;
+}
+
+/**
+ * Look up LKey from given B-tree lookup table, store the last index and return
+ * searched LKey.
+ *
+ * @param bt
+ *   Pointer to B-tree structure.
+ * @param[out] idx
+ *   Pointer to index. Even on search failure, returns index where it stops
+ *   searching so that index can be used when inserting a new entry.
+ * @param addr
+ *   Search key.
+ *
+ * @return
+ *   Searched LKey on success, UINT32_MAX on no match.
+ */
+static uint32_t
+mr_btree_lookup(struct mlx5_mr_btree *bt, uint16_t *idx, uintptr_t addr)
+{
+	struct mr_cache_entry *lkp_tbl;
+	uint16_t n;
+	uint16_t base = 0;
+
+	MLX5_ASSERT(bt != NULL);
+	lkp_tbl = *bt->table;
+	n = bt->len;
+	/* First entry must be NULL for comparison. */
+	MLX5_ASSERT(bt->len > 0 || (lkp_tbl[0].start == 0 &&
+				    lkp_tbl[0].lkey == UINT32_MAX));
+	/* Binary search. */
+	do {
+		register uint16_t delta = n >> 1;
+
+		if (addr < lkp_tbl[base + delta].start) {
+			n = delta;
+		} else {
+			base += delta;
+			n -= delta;
+		}
+	} while (n > 1);
+	MLX5_ASSERT(addr >= lkp_tbl[base].start);
+	*idx = base;
+	if (addr < lkp_tbl[base].end)
+		return lkp_tbl[base].lkey;
+	/* Not found. */
+	return UINT32_MAX;
+}
+
+/**
+ * Insert an entry to B-tree lookup table.
+ *
+ * @param bt
+ *   Pointer to B-tree structure.
+ * @param entry
+ *   Pointer to new entry to insert.
+ *
+ * @return
+ *   0 on success, -1 on failure.
+ */
+static int
+mr_btree_insert(struct mlx5_mr_btree *bt, struct mr_cache_entry *entry)
+{
+	struct mr_cache_entry *lkp_tbl;
+	uint16_t idx = 0;
+	size_t shift;
+
+	MLX5_ASSERT(bt != NULL);
+	MLX5_ASSERT(bt->len <= bt->size);
+	MLX5_ASSERT(bt->len > 0);
+	lkp_tbl = *bt->table;
+	/* Find out the slot for insertion. */
+	if (mr_btree_lookup(bt, &idx, entry->start) != UINT32_MAX) {
+		DRV_LOG(DEBUG,
+			"abort insertion to B-tree(%p): already exist at"
+			" idx=%u [0x%" PRIxPTR ", 0x%" PRIxPTR ") lkey=0x%x",
+			(void *)bt, idx, entry->start, entry->end, entry->lkey);
+		/* Already exist, return. */
+		return 0;
+	}
+	/* If table is full, return error. */
+	if (unlikely(bt->len == bt->size)) {
+		bt->overflow = 1;
+		return -1;
+	}
+	/* Insert entry. */
+	++idx;
+	shift = (bt->len - idx) * sizeof(struct mr_cache_entry);
+	if (shift)
+		memmove(&lkp_tbl[idx + 1], &lkp_tbl[idx], shift);
+	lkp_tbl[idx] = *entry;
+	bt->len++;
+	DRV_LOG(DEBUG,
+		"inserted B-tree(%p)[%u],"
+		" [0x%" PRIxPTR ", 0x%" PRIxPTR ") lkey=0x%x",
+		(void *)bt, idx, entry->start, entry->end, entry->lkey);
+	return 0;
+}
+
+/**
+ * Initialize B-tree and allocate memory for lookup table.
+ *
+ * @param bt
+ *   Pointer to B-tree structure.
+ * @param n
+ *   Number of entries to allocate.
+ * @param socket
+ *   NUMA socket on which memory must be allocated.
+ *
+ * @return
+ *   0 on success, a negative errno value otherwise and rte_errno is set.
+ */
+int
+mlx5_mr_btree_init(struct mlx5_mr_btree *bt, int n, int socket)
+{
+	if (bt == NULL) {
+		rte_errno = EINVAL;
+		return -rte_errno;
+	}
+	MLX5_ASSERT(!bt->table && !bt->size);
+	memset(bt, 0, sizeof(*bt));
+	bt->table = rte_calloc_socket("B-tree table",
+				      n, sizeof(struct mr_cache_entry),
+				      0, socket);
+	if (bt->table == NULL) {
+		rte_errno = ENOMEM;
+		DEBUG("failed to allocate memory for btree cache on socket %d",
+		      socket);
+		return -rte_errno;
+	}
+	bt->size = n;
+	/* First entry must be NULL for binary search. */
+	(*bt->table)[bt->len++] = (struct mr_cache_entry) {
+		.lkey = UINT32_MAX,
+	};
+	DEBUG("initialized B-tree %p with table %p",
+	      (void *)bt, (void *)bt->table);
+	return 0;
+}
+
+/**
+ * Free B-tree resources.
+ *
+ * @param bt
+ *   Pointer to B-tree structure.
+ */
+void
+mlx5_mr_btree_free(struct mlx5_mr_btree *bt)
+{
+	if (bt == NULL)
+		return;
+	DEBUG("freeing B-tree %p with table %p",
+	      (void *)bt, (void *)bt->table);
+	rte_free(bt->table);
+	memset(bt, 0, sizeof(*bt));
+}
+
+/**
+ * Dump all the entries in a B-tree
+ *
+ * @param bt
+ *   Pointer to B-tree structure.
+ */
+void
+mlx5_mr_btree_dump(struct mlx5_mr_btree *bt __rte_unused)
+{
+#ifdef RTE_LIBRTE_MLX5_DEBUG
+	int idx;
+	struct mr_cache_entry *lkp_tbl;
+
+	if (bt == NULL)
+		return;
+	lkp_tbl = *bt->table;
+	for (idx = 0; idx < bt->len; ++idx) {
+		struct mr_cache_entry *entry = &lkp_tbl[idx];
+
+		DEBUG("B-tree(%p)[%u],"
+		      " [0x%" PRIxPTR ", 0x%" PRIxPTR ") lkey=0x%x",
+		      (void *)bt, idx, entry->start, entry->end, entry->lkey);
+	}
+#endif
+}
+
+/**
+ * Find virtually contiguous memory chunk in a given MR.
+ *
+ * @param mr
+ *   Pointer to MR structure.
+ * @param[out] entry
+ *   Pointer to returning MR cache entry. If not found, this will not be
+ *   updated.
+ * @param base_idx
+ *   Start index of the memseg bitmap.
+ *
+ * @return
+ *   Next index to go on lookup.
+ */
+static int
+mr_find_next_chunk(struct mlx5_mr *mr, struct mr_cache_entry *entry,
+		   int base_idx)
+{
+	uintptr_t start = 0;
+	uintptr_t end = 0;
+	uint32_t idx = 0;
+
+	/* MR for external memory doesn't have memseg list. */
+	if (mr->msl == NULL) {
+		struct ibv_mr *ibv_mr = mr->ibv_mr;
+
+		MLX5_ASSERT(mr->ms_bmp_n == 1);
+		MLX5_ASSERT(mr->ms_n == 1);
+		MLX5_ASSERT(base_idx == 0);
+		/*
+		 * Can't search it from memseg list but get it directly from
+		 * verbs MR as there's only one chunk.
+		 */
+		entry->start = (uintptr_t)ibv_mr->addr;
+		entry->end = (uintptr_t)ibv_mr->addr + mr->ibv_mr->length;
+		entry->lkey = rte_cpu_to_be_32(mr->ibv_mr->lkey);
+		/* Returning 1 ends iteration. */
+		return 1;
+	}
+	for (idx = base_idx; idx < mr->ms_bmp_n; ++idx) {
+		if (rte_bitmap_get(mr->ms_bmp, idx)) {
+			const struct rte_memseg_list *msl;
+			const struct rte_memseg *ms;
+
+			msl = mr->msl;
+			ms = rte_fbarray_get(&msl->memseg_arr,
+					     mr->ms_base_idx + idx);
+			MLX5_ASSERT(msl->page_sz == ms->hugepage_sz);
+			if (!start)
+				start = ms->addr_64;
+			end = ms->addr_64 + ms->hugepage_sz;
+		} else if (start) {
+			/* Passed the end of a fragment. */
+			break;
+		}
+	}
+	if (start) {
+		/* Found one chunk. */
+		entry->start = start;
+		entry->end = end;
+		entry->lkey = rte_cpu_to_be_32(mr->ibv_mr->lkey);
+	}
+	return idx;
+}
+
+/**
+ * Insert a MR to the global B-tree cache. It may fail due to low-on-memory.
+ * Then, this entry will have to be searched by mr_lookup_list() in
+ * mlx5_mr_create() on miss.
+ *
+ * @param share_cache
+ *   Pointer to a global shared MR cache.
+ * @param mr
+ *   Pointer to MR to insert.
+ *
+ * @return
+ *   0 on success, -1 on failure.
+ */
+int
+mlx5_mr_insert_cache(struct mlx5_mr_share_cache *share_cache,
+		     struct mlx5_mr *mr)
+{
+	unsigned int n;
+
+	DRV_LOG(DEBUG, "Inserting MR(%p) to global cache(%p)",
+		(void *)mr, (void *)share_cache);
+	for (n = 0; n < mr->ms_bmp_n; ) {
+		struct mr_cache_entry entry;
+
+		memset(&entry, 0, sizeof(entry));
+		/* Find a contiguous chunk and advance the index. */
+		n = mr_find_next_chunk(mr, &entry, n);
+		if (!entry.end)
+			break;
+		if (mr_btree_insert(&share_cache->cache, &entry) < 0) {
+			/*
+			 * Overflowed, but the global table cannot be expanded
+			 * because of deadlock.
+			 */
+			return -1;
+		}
+	}
+	return 0;
+}
+
+/**
+ * Look up address in the original global MR list.
+ *
+ * @param share_cache
+ *   Pointer to a global shared MR cache.
+ * @param[out] entry
+ *   Pointer to returning MR cache entry. If no match, this will not be updated.
+ * @param addr
+ *   Search key.
+ *
+ * @return
+ *   Found MR on match, NULL otherwise.
+ */
+struct mlx5_mr *
+mlx5_mr_lookup_list(struct mlx5_mr_share_cache *share_cache,
+		    struct mr_cache_entry *entry, uintptr_t addr)
+{
+	struct mlx5_mr *mr;
+
+	/* Iterate all the existing MRs. */
+	LIST_FOREACH(mr, &share_cache->mr_list, mr) {
+		unsigned int n;
+
+		if (mr->ms_n == 0)
+			continue;
+		for (n = 0; n < mr->ms_bmp_n; ) {
+			struct mr_cache_entry ret;
+
+			memset(&ret, 0, sizeof(ret));
+			n = mr_find_next_chunk(mr, &ret, n);
+			if (addr >= ret.start && addr < ret.end) {
+				/* Found. */
+				*entry = ret;
+				return mr;
+			}
+		}
+	}
+	return NULL;
+}
+
+/**
+ * Look up address on global MR cache.
+ *
+ * @param share_cache
+ *   Pointer to a global shared MR cache.
+ * @param[out] entry
+ *   Pointer to returning MR cache entry. If no match, this will not be updated.
+ * @param addr
+ *   Search key.
+ *
+ * @return
+ *   Searched LKey on success, UINT32_MAX on failure and rte_errno is set.
+ */
+uint32_t
+mlx5_mr_lookup_cache(struct mlx5_mr_share_cache *share_cache,
+		     struct mr_cache_entry *entry, uintptr_t addr)
+{
+	uint16_t idx;
+	uint32_t lkey = UINT32_MAX;
+	struct mlx5_mr *mr;
+
+	/*
+	 * If the global cache has overflowed since it failed to expand the
+	 * B-tree table, it can't have all the existing MRs. Then, the address
+	 * has to be searched by traversing the original MR list instead, which
+	 * is very slow path. Otherwise, the global cache is all inclusive.
+	 */
+	if (!unlikely(share_cache->cache.overflow)) {
+		lkey = mr_btree_lookup(&share_cache->cache, &idx, addr);
+		if (lkey != UINT32_MAX)
+			*entry = (*share_cache->cache.table)[idx];
+	} else {
+		/* Falling back to the slowest path. */
+		mr = mlx5_mr_lookup_list(share_cache, entry, addr);
+		if (mr != NULL)
+			lkey = entry->lkey;
+	}
+	MLX5_ASSERT(lkey == UINT32_MAX || (addr >= entry->start &&
+					   addr < entry->end));
+	return lkey;
+}
+
+/**
+ * Free MR resources. MR lock must not be held to avoid a deadlock. rte_free()
+ * can raise memory free event and the callback function will spin on the lock.
+ *
+ * @param mr
+ *   Pointer to MR to free.
+ */
+static void
+mr_free(struct mlx5_mr *mr)
+{
+	if (mr == NULL)
+		return;
+	DRV_LOG(DEBUG, "freeing MR(%p):", (void *)mr);
+	if (mr->ibv_mr != NULL)
+		claim_zero(mlx5_glue->dereg_mr(mr->ibv_mr));
+	if (mr->ms_bmp != NULL)
+		rte_bitmap_free(mr->ms_bmp);
+	rte_free(mr);
+}
+
+void
+mlx5_mr_rebuild_cache(struct mlx5_mr_share_cache *share_cache)
+{
+	struct mlx5_mr *mr;
+
+	DRV_LOG(DEBUG, "Rebuild dev cache[] %p", (void *)share_cache);
+	/* Flush cache to rebuild. */
+	share_cache->cache.len = 1;
+	share_cache->cache.overflow = 0;
+	/* Iterate all the existing MRs. */
+	LIST_FOREACH(mr, &share_cache->mr_list, mr)
+		if (mlx5_mr_insert_cache(share_cache, mr) < 0)
+			return;
+}
+
+/**
+ * Release resources of detached MR having no online entry.
+ *
+ * @param share_cache
+ *   Pointer to a global shared MR cache.
+ */
+static void
+mlx5_mr_garbage_collect(struct mlx5_mr_share_cache *share_cache)
+{
+	struct mlx5_mr *mr_next;
+	struct mlx5_mr_list free_list = LIST_HEAD_INITIALIZER(free_list);
+
+	/* Must be called from the primary process. */
+	MLX5_ASSERT(rte_eal_process_type() == RTE_PROC_PRIMARY);
+	/*
+	 * MR can't be freed with holding the lock because rte_free() could call
+	 * memory free callback function. This will be a deadlock situation.
+	 */
+	rte_rwlock_write_lock(&share_cache->rwlock);
+	/* Detach the whole free list and release it after unlocking. */
+	free_list = share_cache->mr_free_list;
+	LIST_INIT(&share_cache->mr_free_list);
+	rte_rwlock_write_unlock(&share_cache->rwlock);
+	/* Release resources. */
+	mr_next = LIST_FIRST(&free_list);
+	while (mr_next != NULL) {
+		struct mlx5_mr *mr = mr_next;
+
+		mr_next = LIST_NEXT(mr, mr);
+		mr_free(mr);
+	}
+}
+
+/* Called during rte_memseg_contig_walk() by mlx5_mr_create(). */
+static int
+mr_find_contig_memsegs_cb(const struct rte_memseg_list *msl,
+			  const struct rte_memseg *ms, size_t len, void *arg)
+{
+	struct mr_find_contig_memsegs_data *data = arg;
+
+	if (data->addr < ms->addr_64 || data->addr >= ms->addr_64 + len)
+		return 0;
+	/* Found, save it and stop walking. */
+	data->start = ms->addr_64;
+	data->end = ms->addr_64 + len;
+	data->msl = msl;
+	return 1;
+}
+
+/**
+ * Create a new global Memory Region (MR) for a missing virtual address.
+ * This API should be called on a secondary process, then a request is sent to
+ * the primary process in order to create a MR for the address. As the global MR
+ * list is on the shared memory, following LKey lookup should succeed unless the
+ * request fails.
+ *
+ * @param pd
+ *   Pointer to ibv_pd of a device (net, regex, vdpa,...).
+ * @param share_cache
+ *   Pointer to a global shared MR cache.
+ * @param[out] entry
+ *   Pointer to returning MR cache entry, found in the global cache or newly
+ *   created. If failed to create one, this will not be updated.
+ * @param addr
+ *   Target virtual address to register.
+ * @param mr_ext_memseg_en
+ *   Configurable flag about external memory segment enable or not.
+ *
+ * @return
+ *   Searched LKey on success, UINT32_MAX on failure and rte_errno is set.
+ */
+static uint32_t
+mlx5_mr_create_secondary(struct ibv_pd *pd __rte_unused,
+			 struct mlx5_mp_id *mp_id,
+			 struct mlx5_mr_share_cache *share_cache,
+			 struct mr_cache_entry *entry, uintptr_t addr,
+			 unsigned int mr_ext_memseg_en __rte_unused)
+{
+	int ret;
+
+	DEBUG("port %u requesting MR creation for address (%p)",
+	      mp_id->port_id, (void *)addr);
+	ret = mlx5_mp_req_mr_create(mp_id, addr);
+	if (ret) {
+		DEBUG("Fail to request MR creation for address (%p)",
+		      (void *)addr);
+		return UINT32_MAX;
+	}
+	rte_rwlock_read_lock(&share_cache->rwlock);
+	/* Fill in output data. */
+	mlx5_mr_lookup_cache(share_cache, entry, addr);
+	/* Lookup can't fail. */
+	MLX5_ASSERT(entry->lkey != UINT32_MAX);
+	rte_rwlock_read_unlock(&share_cache->rwlock);
+	DEBUG("MR CREATED by primary process for %p:\n"
+	      "  [0x%" PRIxPTR ", 0x%" PRIxPTR "), lkey=0x%x",
+	      (void *)addr, entry->start, entry->end, entry->lkey);
+	return entry->lkey;
+}
+
+/**
+ * Create a new global Memory Region (MR) for a missing virtual address.
+ * Register entire virtually contiguous memory chunk around the address.
+ *
+ * @param pd
+ *   Pointer to ibv_pd of a device (net, regex, vdpa,...).
+ * @param share_cache
+ *   Pointer to a global shared MR cache.
+ * @param[out] entry
+ *   Pointer to returning MR cache entry, found in the global cache or newly
+ *   created. If failed to create one, this will not be updated.
+ * @param addr
+ *   Target virtual address to register.
+ * @param mr_ext_memseg_en
+ *   Configurable flag about external memory segment enable or not.
+ *
+ * @return
+ *   Searched LKey on success, UINT32_MAX on failure and rte_errno is set.
+ */
+uint32_t
+mlx5_mr_create_primary(struct ibv_pd *pd,
+		       struct mlx5_mr_share_cache *share_cache,
+		       struct mr_cache_entry *entry, uintptr_t addr,
+		       unsigned int mr_ext_memseg_en)
+{
+	struct mr_find_contig_memsegs_data data = {.addr = addr, };
+	struct mr_find_contig_memsegs_data data_re;
+	const struct rte_memseg_list *msl;
+	const struct rte_memseg *ms;
+	struct mlx5_mr *mr = NULL;
+	int ms_idx_shift = -1;
+	uint32_t bmp_size;
+	void *bmp_mem;
+	uint32_t ms_n;
+	uint32_t n;
+	size_t len;
+
+	DRV_LOG(DEBUG, "Creating a MR using address (%p)", (void *)addr);
+	/*
+	 * Release detached MRs if any. This can't be called with holding either
+	 * memory_hotplug_lock or share_cache->rwlock. MRs on the free list have
+	 * been detached by the memory free event but it couldn't be released
+	 * inside the callback due to deadlock. As a result, releasing resources
+	 * is quite opportunistic.
+	 */
+	mlx5_mr_garbage_collect(share_cache);
+	/*
+	 * If enabled, find out a contiguous virtual address chunk in use, to
+	 * which the given address belongs, in order to register maximum range.
+	 * In the best case where mempools are not dynamically recreated and
+	 * '--socket-mem' is specified as an EAL option, it is very likely to
+	 * have only one MR(LKey) per a socket and per a hugepage-size even
+	 * though the system memory is highly fragmented. As the whole memory
+	 * chunk will be pinned by kernel, it can't be reused unless entire
+	 * chunk is freed from EAL.
+	 *
+	 * If disabled, just register one memseg (page). Then, memory
+	 * consumption will be minimized but it may drop performance if there
+	 * are many MRs to lookup on the datapath.
+	 */
+	if (!mr_ext_memseg_en) {
+		data.msl = rte_mem_virt2memseg_list((void *)addr);
+		data.start = RTE_ALIGN_FLOOR(addr, data.msl->page_sz);
+		data.end = data.start + data.msl->page_sz;
+	} else if (!rte_memseg_contig_walk(mr_find_contig_memsegs_cb, &data)) {
+		DRV_LOG(WARNING,
+			"Unable to find virtually contiguous"
+			" chunk for address (%p)."
+			" rte_memseg_contig_walk() failed.", (void *)addr);
+		rte_errno = ENXIO;
+		goto err_nolock;
+	}
+alloc_resources:
+	/* Addresses must be page-aligned. */
+	MLX5_ASSERT(data.msl);
+	MLX5_ASSERT(rte_is_aligned((void *)data.start, data.msl->page_sz));
+	MLX5_ASSERT(rte_is_aligned((void *)data.end, data.msl->page_sz));
+	msl = data.msl;
+	ms = rte_mem_virt2memseg((void *)data.start, msl);
+	len = data.end - data.start;
+	MLX5_ASSERT(ms);
+	MLX5_ASSERT(msl->page_sz == ms->hugepage_sz);
+	/* Number of memsegs in the range. */
+	ms_n = len / msl->page_sz;
+	DEBUG("Extending %p to [0x%" PRIxPTR ", 0x%" PRIxPTR "),"
+	      " page_sz=0x%" PRIx64 ", ms_n=%u",
+	      (void *)addr, data.start, data.end, msl->page_sz, ms_n);
+	/* Size of memory for bitmap. */
+	bmp_size = rte_bitmap_get_memory_footprint(ms_n);
+	mr = rte_zmalloc_socket(NULL,
+				RTE_ALIGN_CEIL(sizeof(*mr),
+					       RTE_CACHE_LINE_SIZE) +
+				bmp_size,
+				RTE_CACHE_LINE_SIZE, msl->socket_id);
+	if (mr == NULL) {
+		DEBUG("Unable to allocate memory for a new MR of"
+		      " address (%p).", (void *)addr);
+		rte_errno = ENOMEM;
+		goto err_nolock;
+	}
+	mr->msl = msl;
+	/*
+	 * Save the index of the first memseg and initialize memseg bitmap. To
+	 * see if a memseg of ms_idx in the memseg-list is still valid, check:
+	 *	rte_bitmap_get(mr->bmp, ms_idx - mr->ms_base_idx)
+	 */
+	mr->ms_base_idx = rte_fbarray_find_idx(&msl->memseg_arr, ms);
+	bmp_mem = RTE_PTR_ALIGN_CEIL(mr + 1, RTE_CACHE_LINE_SIZE);
+	mr->ms_bmp = rte_bitmap_init(ms_n, bmp_mem, bmp_size);
+	if (mr->ms_bmp == NULL) {
+		DEBUG("Unable to initialize bitmap for a new MR of"
+		      " address (%p).", (void *)addr);
+		rte_errno = EINVAL;
+		goto err_nolock;
+	}
+	/*
+	 * Should recheck whether the extended contiguous chunk is still valid.
+	 * Because memory_hotplug_lock can't be held if there's any memory
+	 * related calls in a critical path, resource allocation above can't be
+	 * locked. If the memory has been changed at this point, try again with
+	 * just single page. If not, go on with the big chunk atomically from
+	 * here.
+	 */
+	rte_mcfg_mem_read_lock();
+	data_re = data;
+	if (len > msl->page_sz &&
+	    !rte_memseg_contig_walk(mr_find_contig_memsegs_cb, &data_re)) {
+		DEBUG("Unable to find virtually contiguous"
+		      " chunk for address (%p)."
+		      " rte_memseg_contig_walk() failed.", (void *)addr);
+		rte_errno = ENXIO;
+		goto err_memlock;
+	}
+	if (data.start != data_re.start || data.end != data_re.end) {
+		/*
+		 * The extended contiguous chunk has been changed. Try again
+		 * with single memseg instead.
+		 */
+		data.start = RTE_ALIGN_FLOOR(addr, msl->page_sz);
+		data.end = data.start + msl->page_sz;
+		rte_mcfg_mem_read_unlock();
+		mr_free(mr);
+		goto alloc_resources;
+	}
+	MLX5_ASSERT(data.msl == data_re.msl);
+	rte_rwlock_write_lock(&share_cache->rwlock);
+	/*
+	 * Check the address is really missing. If other thread already created
+	 * one or it is not found due to overflow, abort and return.
+	 */
+	if (mlx5_mr_lookup_cache(share_cache, entry, addr) != UINT32_MAX) {
+		/*
+		 * Insert to the global cache table. It may fail due to
+		 * low-on-memory. Then, this entry will have to be searched
+		 * here again.
+		 */
+		mr_btree_insert(&share_cache->cache, entry);
+		DEBUG("Found MR for %p on final lookup, abort", (void *)addr);
+		rte_rwlock_write_unlock(&share_cache->rwlock);
+		rte_mcfg_mem_read_unlock();
+		/*
+		 * Must be unlocked before calling rte_free() because
+		 * mlx5_mr_mem_event_free_cb() can be called inside.
+		 */
+		mr_free(mr);
+		return entry->lkey;
+	}
+	/*
+	 * Trim start and end addresses for verbs MR. Set bits for registering
+	 * memsegs but exclude already registered ones. Bitmap can be
+	 * fragmented.
+	 */
+	for (n = 0; n < ms_n; ++n) {
+		uintptr_t start;
+		struct mr_cache_entry ret;
+
+		memset(&ret, 0, sizeof(ret));
+		start = data_re.start + n * msl->page_sz;
+		/* Exclude memsegs already registered by other MRs. */
+		if (mlx5_mr_lookup_cache(share_cache, &ret, start) ==
+		    UINT32_MAX) {
+			/*
+			 * Start from the first unregistered memseg in the
+			 * extended range.
+			 */
+			if (ms_idx_shift == -1) {
+				mr->ms_base_idx += n;
+				data.start = start;
+				ms_idx_shift = n;
+			}
+			data.end = start + msl->page_sz;
+			rte_bitmap_set(mr->ms_bmp, n - ms_idx_shift);
+			++mr->ms_n;
+		}
+	}
+	len = data.end - data.start;
+	mr->ms_bmp_n = len / msl->page_sz;
+	MLX5_ASSERT(ms_idx_shift + mr->ms_bmp_n <= ms_n);
+	/*
+	 * Finally create a verbs MR for the memory chunk. ibv_reg_mr() can be
+	 * called with holding the memory lock because it doesn't use
+	 * mlx5_alloc_buf_extern() which eventually calls rte_malloc_socket()
+	 * through mlx5_alloc_verbs_buf().
+	 */
+	mr->ibv_mr = mlx5_glue->reg_mr(pd, (void *)data.start, len,
+				       IBV_ACCESS_LOCAL_WRITE);
+	if (mr->ibv_mr == NULL) {
+		DEBUG("Fail to create a verbs MR for address (%p)",
+		      (void *)addr);
+		rte_errno = EINVAL;
+		goto err_mrlock;
+	}
+	MLX5_ASSERT((uintptr_t)mr->ibv_mr->addr == data.start);
+	MLX5_ASSERT(mr->ibv_mr->length == len);
+	LIST_INSERT_HEAD(&share_cache->mr_list, mr, mr);
+	DEBUG("MR CREATED (%p) for %p:\n"
+	      "  [0x%" PRIxPTR ", 0x%" PRIxPTR "),"
+	      " lkey=0x%x base_idx=%u ms_n=%u, ms_bmp_n=%u",
+	      (void *)mr, (void *)addr, data.start, data.end,
+	      rte_cpu_to_be_32(mr->ibv_mr->lkey),
+	      mr->ms_base_idx, mr->ms_n, mr->ms_bmp_n);
+	/* Insert to the global cache table. */
+	mlx5_mr_insert_cache(share_cache, mr);
+	/* Fill in output data. */
+	mlx5_mr_lookup_cache(share_cache, entry, addr);
+	/* Lookup can't fail. */
+	MLX5_ASSERT(entry->lkey != UINT32_MAX);
+	rte_rwlock_write_unlock(&share_cache->rwlock);
+	rte_mcfg_mem_read_unlock();
+	return entry->lkey;
+err_mrlock:
+	rte_rwlock_write_unlock(&share_cache->rwlock);
+err_memlock:
+	rte_mcfg_mem_read_unlock();
+err_nolock:
+	/*
+	 * In case of error, as this can be called in a datapath, a warning
+	 * message per an error is preferable instead. Must be unlocked before
+	 * calling rte_free() because mlx5_mr_mem_event_free_cb() can be called
+	 * inside.
+	 */
+	mr_free(mr);
+	return UINT32_MAX;
+}
+
+/**
+ * Create a new global Memory Region (MR) for a missing virtual address.
+ * This can be called from primary and secondary process.
+ *
+ * @param pd
+ *   Pointer to ibv_pd of a device (net, regex, vdpa,...).
+ * @param share_cache
+ *   Pointer to a global shared MR cache.
+ * @param[out] entry
+ *   Pointer to returning MR cache entry, found in the global cache or newly
+ *   created. If failed to create one, this will not be updated.
+ * @param addr
+ *   Target virtual address to register.
+ *
+ * @return
+ *   Searched LKey on success, UINT32_MAX on failure and rte_errno is set.
+ */
+static uint32_t
+mlx5_mr_create(struct ibv_pd *pd, struct mlx5_mp_id *mp_id,
+	       struct mlx5_mr_share_cache *share_cache,
+	       struct mr_cache_entry *entry, uintptr_t addr,
+	       unsigned int mr_ext_memseg_en)
+{
+	uint32_t ret = 0;
+
+	switch (rte_eal_process_type()) {
+	case RTE_PROC_PRIMARY:
+		ret = mlx5_mr_create_primary(pd, share_cache, entry,
+					     addr, mr_ext_memseg_en);
+		break;
+	case RTE_PROC_SECONDARY:
+		ret = mlx5_mr_create_secondary(pd, mp_id, share_cache, entry,
+					       addr, mr_ext_memseg_en);
+		break;
+	default:
+		break;
+	}
+	return ret;
+}
+
+/**
+ * Look up address in the global MR cache table. If not found, create a new
+ * MR. Insert the found/created entry into the local bottom-half cache table.
+ *
+ * @param pd
+ *   Pointer to ibv_pd of a device (net, regex, vdpa,...).
+ * @param mp_id
+ *   Multi-process identifier (name and port ID) used for IPC with the
+ *   primary process.
+ * @param share_cache
+ *   Pointer to a global shared MR cache.
+ * @param mr_ctrl
+ *   Pointer to per-queue MR control structure.
+ * @param[out] entry
+ *   Pointer to the returned MR cache entry, found in the global cache or
+ *   newly created. If creation fails, this is not written.
+ * @param addr
+ *   Search key.
+ * @param mr_ext_memseg_en
+ *   Whether to extend MR registration to the whole contiguous memseg chunk.
+ *
+ * @return
+ *   Searched LKey on success, UINT32_MAX on no match.
+ */
+static uint32_t
+mr_lookup_caches(struct ibv_pd *pd, struct mlx5_mp_id *mp_id,
+		 struct mlx5_mr_share_cache *share_cache,
+		 struct mlx5_mr_ctrl *mr_ctrl,
+		 struct mr_cache_entry *entry, uintptr_t addr,
+		 unsigned int mr_ext_memseg_en)
+{
+	struct mlx5_mr_btree *bt = &mr_ctrl->cache_bh;
+	uint32_t lkey;
+	uint16_t idx;
+
+	/* If local cache table is full, try to double it. */
+	if (unlikely(bt->len == bt->size))
+		mr_btree_expand(bt, bt->size << 1);
+	/* Look up in the global cache. */
+	rte_rwlock_read_lock(&share_cache->rwlock);
+	lkey = mr_btree_lookup(&share_cache->cache, &idx, addr);
+	if (lkey != UINT32_MAX) {
+		/* Found. */
+		*entry = (*share_cache->cache.table)[idx];
+		rte_rwlock_read_unlock(&share_cache->rwlock);
+		/*
+		 * Update local cache. Even if it fails, return the found entry
+		 * to update top-half cache. Next time, this entry will be found
+		 * in the global cache.
+		 */
+		mr_btree_insert(bt, entry);
+		return lkey;
+	}
+	rte_rwlock_read_unlock(&share_cache->rwlock);
+	/* First time to see the address? Create a new MR. */
+	lkey = mlx5_mr_create(pd, mp_id, share_cache, entry, addr,
+			      mr_ext_memseg_en);
+	/*
+	 * Update the local cache if a new global MR was successfully created.
+	 * Even if creation failed, there's no action to take in this datapath
+	 * code. As the returned LKey is invalid, this will eventually make the
+	 * HW fail.
+	 */
+	if (lkey != UINT32_MAX)
+		mr_btree_insert(bt, entry);
+	return lkey;
+}
+
+/**
+ * Bottom-half of LKey search on the datapath. First search in cache_bh[]; on
+ * a miss, search the global MR cache table and propagate the new entry to the
+ * per-queue local caches.
+ *
+ * @param pd
+ *   Pointer to ibv_pd of a device (net, regex, vdpa,...).
+ * @param mp_id
+ *   Multi-process identifier (name and port ID) used for IPC with the
+ *   primary process.
+ * @param share_cache
+ *   Pointer to a global shared MR cache.
+ * @param mr_ctrl
+ *   Pointer to per-queue MR control structure.
+ * @param addr
+ *   Search key.
+ * @param mr_ext_memseg_en
+ *   Whether to extend MR registration to the whole contiguous memseg chunk.
+ *
+ * @return
+ *   Searched LKey on success, UINT32_MAX on no match.
+ */
+uint32_t mlx5_mr_addr2mr_bh(struct ibv_pd *pd, struct mlx5_mp_id *mp_id,
+			    struct mlx5_mr_share_cache *share_cache,
+			    struct mlx5_mr_ctrl *mr_ctrl,
+			    uintptr_t addr, unsigned int mr_ext_memseg_en)
+{
+	uint32_t lkey;
+	uint16_t bh_idx = 0;
+	/* Victim in top-half cache to replace with new entry. */
+	struct mr_cache_entry *repl = &mr_ctrl->cache[mr_ctrl->head];
+
+	/* Binary-search MR translation table. */
+	lkey = mr_btree_lookup(&mr_ctrl->cache_bh, &bh_idx, addr);
+	/* Update top-half cache. */
+	if (likely(lkey != UINT32_MAX)) {
+		*repl = (*mr_ctrl->cache_bh.table)[bh_idx];
+	} else {
+		/*
+		 * On a miss in the local lookup table, search the global cache;
+		 * the local cache_bh[] is updated inside if possible. The
+		 * top-half cache entry will also be updated.
+		 */
+		lkey = mr_lookup_caches(pd, mp_id, share_cache, mr_ctrl,
+					repl, addr, mr_ext_memseg_en);
+		if (unlikely(lkey == UINT32_MAX))
+			return UINT32_MAX;
+	}
+	/* Update the most recently used entry. */
+	mr_ctrl->mru = mr_ctrl->head;
+	/* Point to the next victim, the oldest. */
+	mr_ctrl->head = (mr_ctrl->head + 1) % MLX5_MR_CACHE_N;
+	return lkey;
+}
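+
+/*
+ * A per-queue wrapper typically supplies the device-wide arguments; a sketch
+ * modeled on the net PMD Rx path (see patch 4/4 of this series):
+ *
+ *	return mlx5_mr_addr2mr_bh(priv->sh->pd, &priv->mp_id,
+ *				  &priv->sh->share_cache, &rxq->mr_ctrl,
+ *				  addr, priv->config.mr_ext_memseg_en);
+ */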
+
+/**
+ * Release all the created MRs and resources of the global MR cache of a
+ * device.
+ *
+ * @param share_cache
+ *   Pointer to a global shared MR cache.
+ */
+void
+mlx5_mr_release_cache(struct mlx5_mr_share_cache *share_cache)
+{
+	struct mlx5_mr *mr_next;
+
+	rte_rwlock_write_lock(&share_cache->rwlock);
+	/* Detach from MR list and move to free list. */
+	mr_next = LIST_FIRST(&share_cache->mr_list);
+	while (mr_next != NULL) {
+		struct mlx5_mr *mr = mr_next;
+
+		mr_next = LIST_NEXT(mr, mr);
+		LIST_REMOVE(mr, mr);
+		LIST_INSERT_HEAD(&share_cache->mr_free_list, mr, mr);
+	}
+	LIST_INIT(&share_cache->mr_list);
+	/* Free global cache. */
+	mlx5_mr_btree_free(&share_cache->cache);
+	rte_rwlock_write_unlock(&share_cache->rwlock);
+	/* Free all remaining MRs. */
+	mlx5_mr_garbage_collect(share_cache);
+}
+
+/**
+ * Flush all of the local cache entries.
+ *
+ * @param mr_ctrl
+ *   Pointer to per-queue MR local cache.
+ */
+void
+mlx5_mr_flush_local_cache(struct mlx5_mr_ctrl *mr_ctrl)
+{
+	/* Reset the most-recently-used index. */
+	mr_ctrl->mru = 0;
+	/* Reset the linear search array. */
+	mr_ctrl->head = 0;
+	memset(mr_ctrl->cache, 0, sizeof(mr_ctrl->cache));
+	/* Reset the B-tree table. */
+	mr_ctrl->cache_bh.len = 1;
+	mr_ctrl->cache_bh.overflow = 0;
+	/* Update the generation number. */
+	mr_ctrl->cur_gen = *mr_ctrl->dev_gen_ptr;
+	DRV_LOG(DEBUG, "mr_ctrl(%p): flushed, cur_gen=%d",
+		(void *)mr_ctrl, mr_ctrl->cur_gen);
+}
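+
+/*
+ * Typical use (sketch only): before the top-half lookup, a queue checks
+ * whether the device generation has moved on and flushes its local caches
+ * if so:
+ *
+ *	if (unlikely(*mr_ctrl->dev_gen_ptr != mr_ctrl->cur_gen))
+ *		mlx5_mr_flush_local_cache(mr_ctrl);
+ */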
+
+/**
+ * Create a memory region for external memory, that is, memory which is not
+ * part of the DPDK memory segments.
+ *
+ * @param pd
+ *   Pointer to ibv_pd of a device (net, regex, vdpa,...).
+ * @param addr
+ *   Starting virtual address of memory.
+ * @param len
+ *   Length of memory segment being mapped.
+ * @param socket_id
+ *   Socket to allocate heap memory for the control structures.
+ *
+ * @return
+ *   Pointer to MR structure on success, NULL otherwise.
+ */
+struct mlx5_mr *
+mlx5_create_mr_ext(struct ibv_pd *pd, uintptr_t addr, size_t len, int socket_id)
+{
+	struct mlx5_mr *mr = NULL;
+
+	mr = rte_zmalloc_socket(NULL,
+				RTE_ALIGN_CEIL(sizeof(*mr),
+					       RTE_CACHE_LINE_SIZE),
+				RTE_CACHE_LINE_SIZE, socket_id);
+	if (mr == NULL)
+		return NULL;
+	mr->ibv_mr = mlx5_glue->reg_mr(pd, (void *)addr, len,
+				       IBV_ACCESS_LOCAL_WRITE);
+	if (mr->ibv_mr == NULL) {
+		DRV_LOG(WARNING,
+			"Fail to create a verbs MR for address (%p)",
+			(void *)addr);
+		rte_free(mr);
+		return NULL;
+	}
+	mr->msl = NULL; /* Mark it is external memory. */
+	mr->ms_bmp = NULL;
+	mr->ms_n = 1;
+	mr->ms_bmp_n = 1;
+	DRV_LOG(DEBUG,
+		"MR CREATED (%p) for external memory %p:\n"
+		"  [0x%" PRIxPTR ", 0x%" PRIxPTR "),"
+		" lkey=0x%x base_idx=%u ms_n=%u, ms_bmp_n=%u",
+		(void *)mr, (void *)addr,
+		addr, addr + len, rte_cpu_to_be_32(mr->ibv_mr->lkey),
+		mr->ms_base_idx, mr->ms_n, mr->ms_bmp_n);
+	return mr;
+}
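+
+/*
+ * Example (sketch, error handling elided): register an externally allocated
+ * buffer and publish it in the shared cache, as the DMA map path does:
+ *
+ *	mr = mlx5_create_mr_ext(pd, (uintptr_t)buf, size, SOCKET_ID_ANY);
+ *	rte_rwlock_write_lock(&share_cache->rwlock);
+ *	LIST_INSERT_HEAD(&share_cache->mr_list, mr, mr);
+ *	mlx5_mr_insert_cache(share_cache, mr);
+ *	rte_rwlock_write_unlock(&share_cache->rwlock);
+ */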
+
+/**
+ * Dump all the created MRs and the global cache entries.
+ *
+ * @param share_cache
+ *   Pointer to a global shared MR cache.
+ */
+void
+mlx5_mr_dump_cache(struct mlx5_mr_share_cache *share_cache __rte_unused)
+{
+#ifdef RTE_LIBRTE_MLX5_DEBUG
+	struct mlx5_mr *mr;
+	int mr_n = 0;
+	int chunk_n = 0;
+
+	rte_rwlock_read_lock(&share_cache->rwlock);
+	/* Iterate all the existing MRs. */
+	LIST_FOREACH(mr, &share_cache->mr_list, mr) {
+		unsigned int n;
+
+		DEBUG("MR[%u], LKey = 0x%x, ms_n = %u, ms_bmp_n = %u",
+		      mr_n++, rte_cpu_to_be_32(mr->ibv_mr->lkey),
+		      mr->ms_n, mr->ms_bmp_n);
+		if (mr->ms_n == 0)
+			continue;
+		for (n = 0; n < mr->ms_bmp_n; ) {
+			struct mr_cache_entry ret = { 0, };
+
+			n = mr_find_next_chunk(mr, &ret, n);
+			if (!ret.end)
+				break;
+			DEBUG("  chunk[%u], [0x%" PRIxPTR ", 0x%" PRIxPTR ")",
+			      chunk_n++, ret.start, ret.end);
+		}
+	}
+	DEBUG("Dumping global cache %p", (void *)share_cache);
+	mlx5_mr_btree_dump(&share_cache->cache);
+	rte_rwlock_read_unlock(&share_cache->rwlock);
+#endif
+}
diff --git a/drivers/common/mlx5/mlx5_common_mr.h b/drivers/common/mlx5/mlx5_common_mr.h
new file mode 100644
index 0000000000..e805f96375
--- /dev/null
+++ b/drivers/common/mlx5/mlx5_common_mr.h
@@ -0,0 +1,160 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright 2018 6WIND S.A.
+ * Copyright 2018 Mellanox Technologies, Ltd
+ */
+
+#ifndef RTE_PMD_MLX5_COMMON_MR_H_
+#define RTE_PMD_MLX5_COMMON_MR_H_
+
+#include <stddef.h>
+#include <stdint.h>
+#include <sys/queue.h>
+
+/* Verbs header. */
+/* ISO C doesn't support unnamed structs/unions, disabling -pedantic. */
+#ifdef PEDANTIC
+#pragma GCC diagnostic ignored "-Wpedantic"
+#endif
+#include <infiniband/verbs.h>
+#include <infiniband/mlx5dv.h>
+#ifdef PEDANTIC
+#pragma GCC diagnostic error "-Wpedantic"
+#endif
+
+#include <rte_rwlock.h>
+#include <rte_bitmap.h>
+#include <rte_memory.h>
+
+#include "mlx5_common_mp.h"
+
+/* Size of per-queue MR cache array for linear search. */
+#define MLX5_MR_CACHE_N 8
+/* Initial size of the per-queue MR B-tree lookup table. */
+#define MLX5_MR_BTREE_CACHE_N 256
+
+/* Memory Region object. */
+struct mlx5_mr {
+	LIST_ENTRY(mlx5_mr) mr; /**< Pointer to the prev/next entry. */
+	struct ibv_mr *ibv_mr; /* Verbs Memory Region. */
+	const struct rte_memseg_list *msl; /* Memseg list; NULL if external. */
+	int ms_base_idx; /* Start index of msl->memseg_arr[]. */
+	int ms_n; /* Number of memsegs in use. */
+	uint32_t ms_bmp_n; /* Number of bits in memsegs bit-mask. */
+	struct rte_bitmap *ms_bmp; /* Bit-mask of memsegs belonging to MR. */
+};
+
+/* Cache entry for Memory Region. */
+struct mr_cache_entry {
+	uintptr_t start; /* Start address of MR. */
+	uintptr_t end; /* End address of MR. */
+	uint32_t lkey; /* rte_cpu_to_be_32(ibv_mr->lkey). */
+} __rte_packed;
+
+/* MR cache table for binary search. */
+struct mlx5_mr_btree {
+	uint16_t len; /* Number of entries. */
+	uint16_t size; /* Total number of entries. */
+	int overflow; /* Mark failure of table expansion. */
+	struct mr_cache_entry (*table)[];
+} __rte_packed;
+
+/* Per-queue MR control descriptor. */
+struct mlx5_mr_ctrl {
+	uint32_t *dev_gen_ptr; /* Generation number of device to poll. */
+	uint32_t cur_gen; /* Generation number saved to flush caches. */
+	uint16_t mru; /* Index of last hit entry in top-half cache. */
+	uint16_t head; /* Index of the oldest entry in top-half cache. */
+	struct mr_cache_entry cache[MLX5_MR_CACHE_N]; /* Cache for top-half. */
+	struct mlx5_mr_btree cache_bh; /* Cache for bottom-half. */
+} __rte_packed;
+
+LIST_HEAD(mlx5_mr_list, mlx5_mr);
+
+/* Global per-device MR cache. */
+struct mlx5_mr_share_cache {
+	uint32_t dev_gen; /* Generation number to flush local caches. */
+	rte_rwlock_t rwlock; /* MR cache Lock. */
+	struct mlx5_mr_btree cache; /* Global MR cache table. */
+	struct mlx5_mr_list mr_list; /* Registered MR list. */
+	struct mlx5_mr_list mr_free_list; /* Freed MR list. */
+} __rte_packed;
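+
+/*
+ * A device embeds one mlx5_mr_share_cache and initializes its global B-tree
+ * at startup; a minimal sketch mirroring the net PMD ("sh" and "numa_node"
+ * are assumed from the caller's context):
+ *
+ *	err = mlx5_mr_btree_init(&sh->share_cache.cache,
+ *				 MLX5_MR_BTREE_CACHE_N * 2, numa_node);
+ */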
+
+/**
+ * Look up the LKey from a given lookup table by linear search. First check
+ * the last-hit entry; on a miss, the entire array is searched. If found,
+ * update the last-hit index and return the LKey.
+ *
+ * @param lkp_tbl
+ *   Pointer to lookup table.
+ * @param[in,out] cached_idx
+ *   Pointer to last-hit index.
+ * @param n
+ *   Size of lookup table.
+ * @param addr
+ *   Search key.
+ *
+ * @return
+ *   Searched LKey on success, UINT32_MAX on no match.
+ */
+static __rte_always_inline uint32_t
+mlx5_mr_lookup_lkey(struct mr_cache_entry *lkp_tbl, uint16_t *cached_idx,
+		    uint16_t n, uintptr_t addr)
+{
+	uint16_t idx;
+
+	if (likely(addr >= lkp_tbl[*cached_idx].start &&
+		   addr < lkp_tbl[*cached_idx].end))
+		return lkp_tbl[*cached_idx].lkey;
+	for (idx = 0; idx < n && lkp_tbl[idx].start != 0; ++idx) {
+		if (addr >= lkp_tbl[idx].start &&
+		    addr < lkp_tbl[idx].end) {
+			/* Found. */
+			*cached_idx = idx;
+			return lkp_tbl[idx].lkey;
+		}
+	}
+	return UINT32_MAX;
+}
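+
+/*
+ * Datapath sketch: try the linear top-half first and fall back to the
+ * bottom-half on a miss ("mr_ctrl" is the per-queue control descriptor; the
+ * remaining arguments come from the device context):
+ *
+ *	lkey = mlx5_mr_lookup_lkey(mr_ctrl->cache, &mr_ctrl->mru,
+ *				   MLX5_MR_CACHE_N, addr);
+ *	if (unlikely(lkey == UINT32_MAX))
+ *		lkey = mlx5_mr_addr2mr_bh(pd, mp_id, share_cache, mr_ctrl,
+ *					  addr, mr_ext_memseg_en);
+ */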
+
+__rte_experimental
+int mlx5_mr_btree_init(struct mlx5_mr_btree *bt, int n, int socket);
+__rte_experimental
+void mlx5_mr_btree_free(struct mlx5_mr_btree *bt);
+__rte_experimental
+void mlx5_mr_btree_dump(struct mlx5_mr_btree *bt __rte_unused);
+__rte_experimental
+uint32_t mlx5_mr_addr2mr_bh(struct ibv_pd *pd, struct mlx5_mp_id *mp_id,
+			    struct mlx5_mr_share_cache *share_cache,
+			    struct mlx5_mr_ctrl *mr_ctrl,
+			    uintptr_t addr, unsigned int mr_ext_memseg_en);
+__rte_experimental
+void mlx5_mr_release_cache(struct mlx5_mr_share_cache *share_cache);
+__rte_experimental
+void mlx5_mr_dump_cache(struct mlx5_mr_share_cache *share_cache __rte_unused);
+__rte_experimental
+void mlx5_mr_rebuild_cache(struct mlx5_mr_share_cache *share_cache);
+__rte_experimental
+void mlx5_mr_flush_local_cache(struct mlx5_mr_ctrl *mr_ctrl);
+__rte_experimental
+int
+mlx5_mr_insert_cache(struct mlx5_mr_share_cache *share_cache,
+		     struct mlx5_mr *mr);
+__rte_experimental
+uint32_t
+mlx5_mr_lookup_cache(struct mlx5_mr_share_cache *share_cache,
+		     struct mr_cache_entry *entry, uintptr_t addr);
+__rte_experimental
+struct mlx5_mr *
+mlx5_mr_lookup_list(struct mlx5_mr_share_cache *share_cache,
+		    struct mr_cache_entry *entry, uintptr_t addr);
+__rte_experimental
+struct mlx5_mr *
+mlx5_create_mr_ext(struct ibv_pd *pd, uintptr_t addr, size_t len,
+		   int socket_id);
+__rte_experimental
+uint32_t
+mlx5_mr_create_primary(struct ibv_pd *pd,
+		       struct mlx5_mr_share_cache *share_cache,
+		       struct mr_cache_entry *entry, uintptr_t addr,
+		       unsigned int mr_ext_memseg_en);
+
+#endif /* RTE_PMD_MLX5_COMMON_MR_H_ */
diff --git a/drivers/common/mlx5/rte_common_mlx5_version.map b/drivers/common/mlx5/rte_common_mlx5_version.map
index 265703d1c9..b58a378278 100644
--- a/drivers/common/mlx5/rte_common_mlx5_version.map
+++ b/drivers/common/mlx5/rte_common_mlx5_version.map
@@ -61,4 +61,18 @@ EXPERIMENTAL {
 	mlx5_mp_req_mr_create;
 	mlx5_mp_req_queue_state_modify;
 	mlx5_mp_req_verbs_cmd_fd;
+
+	mlx5_mr_btree_init;
+	mlx5_mr_btree_free;
+	mlx5_mr_btree_dump;
+	mlx5_mr_addr2mr_bh;
+	mlx5_mr_release_cache;
+	mlx5_mr_dump_cache;
+	mlx5_mr_rebuild_cache;
+	mlx5_mr_insert_cache;
+	mlx5_mr_lookup_cache;
+	mlx5_mr_lookup_list;
+	mlx5_create_mr_ext;
+	mlx5_mr_create_primary;
+	mlx5_mr_flush_local_cache;
 };
-- 
2.16.6


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [dpdk-dev] [PATCH 4/4] net/mlx5: modify net PMD to use common memory management driver
  2020-04-02 19:21 [dpdk-dev] [PATCH 0/4] refactor multi-process IPC and memory management codes to common driver Vu Pham
                   ` (2 preceding siblings ...)
  2020-04-02 19:21 ` [dpdk-dev] [PATCH 3/4] common/mlx5: refactor memory management codes Vu Pham
@ 2020-04-02 19:21 ` Vu Pham
  2020-04-07 16:48 ` [dpdk-dev] [PATCH v2 0/4] refactor multi-process IPC and memory management codes to common driver Vu Pham
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 26+ messages in thread
From: Vu Pham @ 2020-04-02 19:21 UTC (permalink / raw)
  To: dev; +Cc: viacheslavo, orika, matan, rasland, Vu Pham

From: Vu Pham <vuhuong@mellanox.com>

Modify the mlx5 net PMD to use the memory management APIs from the
common driver.
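
In short, the per-device MR state moves from the embedded sh->mr struct
to the common mlx5_mr_share_cache, and the datapath helpers now take the
protection domain, mp_id and shared cache explicitly. A condensed
before/after sketch (illustrative, taken from the Rx bottom-half below):

    /* Before: bound to rte_eth_dev. */
    lkey = mlx5_mr_addr2mr_bh(ETH_DEV(priv), mr_ctrl, addr);

    /* After: device-agnostic common API. */
    lkey = mlx5_mr_addr2mr_bh(priv->sh->pd, &priv->mp_id,
                              &priv->sh->share_cache, mr_ctrl, addr,
                              priv->config.mr_ext_memseg_en);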

Signed-off-by: Vu Pham <vuhuong@mellanox.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@mellanox.com>
---
 drivers/common/mlx5/Makefile     |    1 +
 drivers/common/mlx5/meson.build  |    1 +
 drivers/net/mlx5/mlx5.c          |    4 +-
 drivers/net/mlx5/mlx5.h          |   12 +-
 drivers/net/mlx5/mlx5_mp.c       |    8 +-
 drivers/net/mlx5/mlx5_mr.c       | 1167 ++------------------------------------
 drivers/net/mlx5/mlx5_mr.h       |   87 +--
 drivers/net/mlx5/mlx5_rxtx.c     |    1 +
 drivers/net/mlx5/mlx5_rxtx.h     |   10 +-
 drivers/net/mlx5/mlx5_rxtx_vec.h |    2 +
 drivers/net/mlx5/mlx5_trigger.c  |    1 +
 drivers/net/mlx5/mlx5_txq.c      |    3 +-
 12 files changed, 75 insertions(+), 1222 deletions(-)

diff --git a/drivers/common/mlx5/Makefile b/drivers/common/mlx5/Makefile
index 2a88492731..26267c957a 100644
--- a/drivers/common/mlx5/Makefile
+++ b/drivers/common/mlx5/Makefile
@@ -18,6 +18,7 @@ SRCS-y += mlx5_devx_cmds.c
 SRCS-y += mlx5_common.c
 SRCS-y += mlx5_nl.c
 SRCS-y += mlx5_common_mp.c
+SRCS-y += mlx5_common_mr.c
 ifeq ($(CONFIG_RTE_IBVERBS_LINK_DLOPEN),y)
 INSTALL-y-lib += $(LIB_GLUE)
 endif
diff --git a/drivers/common/mlx5/meson.build b/drivers/common/mlx5/meson.build
index 83671861c9..175251b691 100644
--- a/drivers/common/mlx5/meson.build
+++ b/drivers/common/mlx5/meson.build
@@ -56,6 +56,7 @@ sources = files(
 	'mlx5_common.c',
 	'mlx5_nl.c',
 	'mlx5_common_mp.c',
+	'mlx5_common_mr.c',
 )
 if not dlopen_ibverbs
 	sources += files('mlx5_glue.c')
diff --git a/drivers/net/mlx5/mlx5.c b/drivers/net/mlx5/mlx5.c
index f802bcee3d..183b1c87e8 100644
--- a/drivers/net/mlx5/mlx5.c
+++ b/drivers/net/mlx5/mlx5.c
@@ -612,7 +612,7 @@ mlx5_alloc_shared_ibctx(const struct mlx5_dev_spawn_data *spawn,
 	 * At this point the device is not added to the memory
 	 * event list yet, context is just being created.
 	 */
-	err = mlx5_mr_btree_init(&sh->mr.cache,
+	err = mlx5_mr_btree_init(&sh->share_cache.cache,
 				 MLX5_MR_BTREE_CACHE_N * 2,
 				 spawn->pci_dev->device.numa_node);
 	if (err) {
@@ -684,7 +684,7 @@ mlx5_free_shared_ibctx(struct mlx5_ibv_shared *sh)
 	LIST_REMOVE(sh, mem_event_cb);
 	rte_rwlock_write_unlock(&mlx5_shared_data->mem_event_rwlock);
 	/* Release created Memory Regions. */
-	mlx5_mr_release(sh);
+	mlx5_mr_release_cache(&sh->share_cache);
 	/* Remove context from the global device list. */
 	LIST_REMOVE(sh, next);
 	/*
diff --git a/drivers/net/mlx5/mlx5.h b/drivers/net/mlx5/mlx5.h
index dc02e148c3..2f21c9b898 100644
--- a/drivers/net/mlx5/mlx5.h
+++ b/drivers/net/mlx5/mlx5.h
@@ -37,10 +37,10 @@
 #include <mlx5_prm.h>
 #include <mlx5_nl.h>
 #include <mlx5_common_mp.h>
+#include <mlx5_common_mr.h>
 
 #include "mlx5_defs.h"
 #include "mlx5_utils.h"
-#include "mlx5_mr.h"
 #include "mlx5_autoconf.h"
 
 /** Key string for IPC. */
@@ -194,8 +194,6 @@ struct mlx5_verbs_alloc_ctx {
 	const void *obj; /* Pointer to the DPDK object. */
 };
 
-LIST_HEAD(mlx5_mr_list, mlx5_mr);
-
 /* Flow drop context necessary due to Verbs API. */
 struct mlx5_drop {
 	struct mlx5_hrxq *hrxq; /* Hash Rx queue queue. */
@@ -386,13 +384,7 @@ struct mlx5_ibv_shared {
 	struct ibv_device_attr_ex device_attr; /* Device properties. */
 	LIST_ENTRY(mlx5_ibv_shared) mem_event_cb;
 	/**< Called by memory event callback. */
-	struct {
-		uint32_t dev_gen; /* Generation number to flush local caches. */
-		rte_rwlock_t rwlock; /* MR Lock. */
-		struct mlx5_mr_btree cache; /* Global MR cache table. */
-		struct mlx5_mr_list mr_list; /* Registered MR list. */
-		struct mlx5_mr_list mr_free_list; /* Freed MR list. */
-	} mr;
+	struct mlx5_mr_share_cache share_cache;
 	/* Shared DV/DR flow data section. */
 	pthread_mutex_t dv_mutex; /* DV context mutex. */
 	uint32_t dv_meta_mask; /* flow META metadata supported mask. */
diff --git a/drivers/net/mlx5/mlx5_mp.c b/drivers/net/mlx5/mlx5_mp.c
index 43684dbc3a..7ad322d474 100644
--- a/drivers/net/mlx5/mlx5_mp.c
+++ b/drivers/net/mlx5/mlx5_mp.c
@@ -11,6 +11,7 @@
 #include <rte_string_fns.h>
 
 #include <mlx5_common_mp.h>
+#include <mlx5_common_mr.h>
 
 #include "mlx5.h"
 #include "mlx5_rxtx.h"
@@ -25,7 +26,7 @@ mlx5_mp_primary_handle(const struct rte_mp_msg *mp_msg, const void *peer)
 		(const struct mlx5_mp_param *)mp_msg->param;
 	struct rte_eth_dev *dev;
 	struct mlx5_priv *priv;
-	struct mlx5_mr_cache entry;
+	struct mr_cache_entry entry;
 	uint32_t lkey;
 	int ret;
 
@@ -40,7 +41,10 @@ mlx5_mp_primary_handle(const struct rte_mp_msg *mp_msg, const void *peer)
 	switch (param->type) {
 	case MLX5_MP_REQ_CREATE_MR:
 		mp_init_msg(&priv->mp_id, &mp_res, param->type);
-		lkey = mlx5_mr_create_primary(dev, &entry, param->args.addr);
+		lkey = mlx5_mr_create_primary(priv->sh->pd,
+					      &priv->sh->share_cache,
+					      &entry, param->args.addr,
+					      priv->config.mr_ext_memseg_en);
 		if (lkey == UINT32_MAX)
 			res->result = -rte_errno;
 		ret = rte_mp_reply(&mp_res, peer);
diff --git a/drivers/net/mlx5/mlx5_mr.c b/drivers/net/mlx5/mlx5_mr.c
index 8097211b55..2b4b3e2891 100644
--- a/drivers/net/mlx5/mlx5_mr.c
+++ b/drivers/net/mlx5/mlx5_mr.c
@@ -18,6 +18,8 @@
 #include <rte_bus_pci.h>
 
 #include <mlx5_glue.h>
+#include <mlx5_common_mp.h>
+#include <mlx5_common_mr.h>
 
 #include "mlx5.h"
 #include "mlx5_mr.h"
@@ -36,833 +38,6 @@ struct mr_update_mp_data {
 	int ret;
 };
 
-/**
- * Expand B-tree table to a given size. Can't be called with holding
- * memory_hotplug_lock or sh->mr.rwlock due to rte_realloc().
- *
- * @param bt
- *   Pointer to B-tree structure.
- * @param n
- *   Number of entries for expansion.
- *
- * @return
- *   0 on success, -1 on failure.
- */
-static int
-mr_btree_expand(struct mlx5_mr_btree *bt, int n)
-{
-	void *mem;
-	int ret = 0;
-
-	if (n <= bt->size)
-		return ret;
-	/*
-	 * Downside of directly using rte_realloc() is that SOCKET_ID_ANY is
-	 * used inside if there's no room to expand. Because this is a quite
-	 * rare case and a part of very slow path, it is very acceptable.
-	 * Initially cache_bh[] will be given practically enough space and once
-	 * it is expanded, expansion wouldn't be needed again ever.
-	 */
-	mem = rte_realloc(bt->table, n * sizeof(struct mlx5_mr_cache), 0);
-	if (mem == NULL) {
-		/* Not an error, B-tree search will be skipped. */
-		DRV_LOG(WARNING, "failed to expand MR B-tree (%p) table",
-			(void *)bt);
-		ret = -1;
-	} else {
-		DRV_LOG(DEBUG, "expanded MR B-tree table (size=%u)", n);
-		bt->table = mem;
-		bt->size = n;
-	}
-	return ret;
-}
-
-/**
- * Look up LKey from given B-tree lookup table, store the last index and return
- * searched LKey.
- *
- * @param bt
- *   Pointer to B-tree structure.
- * @param[out] idx
- *   Pointer to index. Even on search failure, returns index where it stops
- *   searching so that index can be used when inserting a new entry.
- * @param addr
- *   Search key.
- *
- * @return
- *   Searched LKey on success, UINT32_MAX on no match.
- */
-static uint32_t
-mr_btree_lookup(struct mlx5_mr_btree *bt, uint16_t *idx, uintptr_t addr)
-{
-	struct mlx5_mr_cache *lkp_tbl;
-	uint16_t n;
-	uint16_t base = 0;
-
-	MLX5_ASSERT(bt != NULL);
-	lkp_tbl = *bt->table;
-	n = bt->len;
-	/* First entry must be NULL for comparison. */
-	MLX5_ASSERT(bt->len > 0 || (lkp_tbl[0].start == 0 &&
-				    lkp_tbl[0].lkey == UINT32_MAX));
-	/* Binary search. */
-	do {
-		register uint16_t delta = n >> 1;
-
-		if (addr < lkp_tbl[base + delta].start) {
-			n = delta;
-		} else {
-			base += delta;
-			n -= delta;
-		}
-	} while (n > 1);
-	MLX5_ASSERT(addr >= lkp_tbl[base].start);
-	*idx = base;
-	if (addr < lkp_tbl[base].end)
-		return lkp_tbl[base].lkey;
-	/* Not found. */
-	return UINT32_MAX;
-}
-
-/**
- * Insert an entry to B-tree lookup table.
- *
- * @param bt
- *   Pointer to B-tree structure.
- * @param entry
- *   Pointer to new entry to insert.
- *
- * @return
- *   0 on success, -1 on failure.
- */
-static int
-mr_btree_insert(struct mlx5_mr_btree *bt, struct mlx5_mr_cache *entry)
-{
-	struct mlx5_mr_cache *lkp_tbl;
-	uint16_t idx = 0;
-	size_t shift;
-
-	MLX5_ASSERT(bt != NULL);
-	MLX5_ASSERT(bt->len <= bt->size);
-	MLX5_ASSERT(bt->len > 0);
-	lkp_tbl = *bt->table;
-	/* Find out the slot for insertion. */
-	if (mr_btree_lookup(bt, &idx, entry->start) != UINT32_MAX) {
-		DRV_LOG(DEBUG,
-			"abort insertion to B-tree(%p): already exist at"
-			" idx=%u [0x%" PRIxPTR ", 0x%" PRIxPTR ") lkey=0x%x",
-			(void *)bt, idx, entry->start, entry->end, entry->lkey);
-		/* Already exist, return. */
-		return 0;
-	}
-	/* If table is full, return error. */
-	if (unlikely(bt->len == bt->size)) {
-		bt->overflow = 1;
-		return -1;
-	}
-	/* Insert entry. */
-	++idx;
-	shift = (bt->len - idx) * sizeof(struct mlx5_mr_cache);
-	if (shift)
-		memmove(&lkp_tbl[idx + 1], &lkp_tbl[idx], shift);
-	lkp_tbl[idx] = *entry;
-	bt->len++;
-	DRV_LOG(DEBUG,
-		"inserted B-tree(%p)[%u],"
-		" [0x%" PRIxPTR ", 0x%" PRIxPTR ") lkey=0x%x",
-		(void *)bt, idx, entry->start, entry->end, entry->lkey);
-	return 0;
-}
-
-/**
- * Initialize B-tree and allocate memory for lookup table.
- *
- * @param bt
- *   Pointer to B-tree structure.
- * @param n
- *   Number of entries to allocate.
- * @param socket
- *   NUMA socket on which memory must be allocated.
- *
- * @return
- *   0 on success, a negative errno value otherwise and rte_errno is set.
- */
-int
-mlx5_mr_btree_init(struct mlx5_mr_btree *bt, int n, int socket)
-{
-	if (bt == NULL) {
-		rte_errno = EINVAL;
-		return -rte_errno;
-	}
-	MLX5_ASSERT(!bt->table && !bt->size);
-	memset(bt, 0, sizeof(*bt));
-	bt->table = rte_calloc_socket("B-tree table",
-				      n, sizeof(struct mlx5_mr_cache),
-				      0, socket);
-	if (bt->table == NULL) {
-		rte_errno = ENOMEM;
-		DEBUG("failed to allocate memory for btree cache on socket %d",
-		      socket);
-		return -rte_errno;
-	}
-	bt->size = n;
-	/* First entry must be NULL for binary search. */
-	(*bt->table)[bt->len++] = (struct mlx5_mr_cache) {
-		.lkey = UINT32_MAX,
-	};
-	DEBUG("initialized B-tree %p with table %p",
-	      (void *)bt, (void *)bt->table);
-	return 0;
-}
-
-/**
- * Free B-tree resources.
- *
- * @param bt
- *   Pointer to B-tree structure.
- */
-void
-mlx5_mr_btree_free(struct mlx5_mr_btree *bt)
-{
-	if (bt == NULL)
-		return;
-	DEBUG("freeing B-tree %p with table %p",
-	      (void *)bt, (void *)bt->table);
-	rte_free(bt->table);
-	memset(bt, 0, sizeof(*bt));
-}
-
-/**
- * Dump all the entries in a B-tree
- *
- * @param bt
- *   Pointer to B-tree structure.
- */
-void
-mlx5_mr_btree_dump(struct mlx5_mr_btree *bt __rte_unused)
-{
-#ifdef RTE_LIBRTE_MLX5_DEBUG
-	int idx;
-	struct mlx5_mr_cache *lkp_tbl;
-
-	if (bt == NULL)
-		return;
-	lkp_tbl = *bt->table;
-	for (idx = 0; idx < bt->len; ++idx) {
-		struct mlx5_mr_cache *entry = &lkp_tbl[idx];
-
-		DEBUG("B-tree(%p)[%u],"
-		      " [0x%" PRIxPTR ", 0x%" PRIxPTR ") lkey=0x%x",
-		      (void *)bt, idx, entry->start, entry->end, entry->lkey);
-	}
-#endif
-}
-
-/**
- * Find virtually contiguous memory chunk in a given MR.
- *
- * @param dev
- *   Pointer to MR structure.
- * @param[out] entry
- *   Pointer to returning MR cache entry. If not found, this will not be
- *   updated.
- * @param start_idx
- *   Start index of the memseg bitmap.
- *
- * @return
- *   Next index to go on lookup.
- */
-static int
-mr_find_next_chunk(struct mlx5_mr *mr, struct mlx5_mr_cache *entry,
-		   int base_idx)
-{
-	uintptr_t start = 0;
-	uintptr_t end = 0;
-	uint32_t idx = 0;
-
-	/* MR for external memory doesn't have memseg list. */
-	if (mr->msl == NULL) {
-		struct ibv_mr *ibv_mr = mr->ibv_mr;
-
-		MLX5_ASSERT(mr->ms_bmp_n == 1);
-		MLX5_ASSERT(mr->ms_n == 1);
-		MLX5_ASSERT(base_idx == 0);
-		/*
-		 * Can't search it from memseg list but get it directly from
-		 * verbs MR as there's only one chunk.
-		 */
-		entry->start = (uintptr_t)ibv_mr->addr;
-		entry->end = (uintptr_t)ibv_mr->addr + mr->ibv_mr->length;
-		entry->lkey = rte_cpu_to_be_32(mr->ibv_mr->lkey);
-		/* Returning 1 ends iteration. */
-		return 1;
-	}
-	for (idx = base_idx; idx < mr->ms_bmp_n; ++idx) {
-		if (rte_bitmap_get(mr->ms_bmp, idx)) {
-			const struct rte_memseg_list *msl;
-			const struct rte_memseg *ms;
-
-			msl = mr->msl;
-			ms = rte_fbarray_get(&msl->memseg_arr,
-					     mr->ms_base_idx + idx);
-			MLX5_ASSERT(msl->page_sz == ms->hugepage_sz);
-			if (!start)
-				start = ms->addr_64;
-			end = ms->addr_64 + ms->hugepage_sz;
-		} else if (start) {
-			/* Passed the end of a fragment. */
-			break;
-		}
-	}
-	if (start) {
-		/* Found one chunk. */
-		entry->start = start;
-		entry->end = end;
-		entry->lkey = rte_cpu_to_be_32(mr->ibv_mr->lkey);
-	}
-	return idx;
-}
-
-/**
- * Insert a MR to the global B-tree cache. It may fail due to low-on-memory.
- * Then, this entry will have to be searched by mr_lookup_dev_list() in
- * mlx5_mr_create() on miss.
- *
- * @param dev
- *   Pointer to Ethernet device shared context.
- * @param mr
- *   Pointer to MR to insert.
- *
- * @return
- *   0 on success, -1 on failure.
- */
-static int
-mr_insert_dev_cache(struct mlx5_ibv_shared *sh, struct mlx5_mr *mr)
-{
-	unsigned int n;
-
-	DRV_LOG(DEBUG, "device %s inserting MR(%p) to global cache",
-		sh->ibdev_name, (void *)mr);
-	for (n = 0; n < mr->ms_bmp_n; ) {
-		struct mlx5_mr_cache entry;
-
-		memset(&entry, 0, sizeof(entry));
-		/* Find a contiguous chunk and advance the index. */
-		n = mr_find_next_chunk(mr, &entry, n);
-		if (!entry.end)
-			break;
-		if (mr_btree_insert(&sh->mr.cache, &entry) < 0) {
-			/*
-			 * Overflowed, but the global table cannot be expanded
-			 * because of deadlock.
-			 */
-			return -1;
-		}
-	}
-	return 0;
-}
-
-/**
- * Look up address in the original global MR list.
- *
- * @param sh
- *   Pointer to Ethernet device shared context.
- * @param[out] entry
- *   Pointer to returning MR cache entry. If no match, this will not be updated.
- * @param addr
- *   Search key.
- *
- * @return
- *   Found MR on match, NULL otherwise.
- */
-static struct mlx5_mr *
-mr_lookup_dev_list(struct mlx5_ibv_shared *sh, struct mlx5_mr_cache *entry,
-		   uintptr_t addr)
-{
-	struct mlx5_mr *mr;
-
-	/* Iterate all the existing MRs. */
-	LIST_FOREACH(mr, &sh->mr.mr_list, mr) {
-		unsigned int n;
-
-		if (mr->ms_n == 0)
-			continue;
-		for (n = 0; n < mr->ms_bmp_n; ) {
-			struct mlx5_mr_cache ret;
-
-			memset(&ret, 0, sizeof(ret));
-			n = mr_find_next_chunk(mr, &ret, n);
-			if (addr >= ret.start && addr < ret.end) {
-				/* Found. */
-				*entry = ret;
-				return mr;
-			}
-		}
-	}
-	return NULL;
-}
-
-/**
- * Look up address on device.
- *
- * @param dev
- *   Pointer to Ethernet device shared context.
- * @param[out] entry
- *   Pointer to returning MR cache entry. If no match, this will not be updated.
- * @param addr
- *   Search key.
- *
- * @return
- *   Searched LKey on success, UINT32_MAX on failure and rte_errno is set.
- */
-static uint32_t
-mr_lookup_dev(struct mlx5_ibv_shared *sh, struct mlx5_mr_cache *entry,
-	      uintptr_t addr)
-{
-	uint16_t idx;
-	uint32_t lkey = UINT32_MAX;
-	struct mlx5_mr *mr;
-
-	/*
-	 * If the global cache has overflowed since it failed to expand the
-	 * B-tree table, it can't have all the existing MRs. Then, the address
-	 * has to be searched by traversing the original MR list instead, which
-	 * is very slow path. Otherwise, the global cache is all inclusive.
-	 */
-	if (!unlikely(sh->mr.cache.overflow)) {
-		lkey = mr_btree_lookup(&sh->mr.cache, &idx, addr);
-		if (lkey != UINT32_MAX)
-			*entry = (*sh->mr.cache.table)[idx];
-	} else {
-		/* Falling back to the slowest path. */
-		mr = mr_lookup_dev_list(sh, entry, addr);
-		if (mr != NULL)
-			lkey = entry->lkey;
-	}
-	MLX5_ASSERT(lkey == UINT32_MAX || (addr >= entry->start &&
-					   addr < entry->end));
-	return lkey;
-}
-
-/**
- * Free MR resources. MR lock must not be held to avoid a deadlock. rte_free()
- * can raise memory free event and the callback function will spin on the lock.
- *
- * @param mr
- *   Pointer to MR to free.
- */
-static void
-mr_free(struct mlx5_mr *mr)
-{
-	if (mr == NULL)
-		return;
-	DRV_LOG(DEBUG, "freeing MR(%p):", (void *)mr);
-	if (mr->ibv_mr != NULL)
-		claim_zero(mlx5_glue->dereg_mr(mr->ibv_mr));
-	if (mr->ms_bmp != NULL)
-		rte_bitmap_free(mr->ms_bmp);
-	rte_free(mr);
-}
-
-/**
- * Release resources of detached MR having no online entry.
- *
- * @param sh
- *   Pointer to Ethernet device shared context.
- */
-static void
-mlx5_mr_garbage_collect(struct mlx5_ibv_shared *sh)
-{
-	struct mlx5_mr *mr_next;
-	struct mlx5_mr_list free_list = LIST_HEAD_INITIALIZER(free_list);
-
-	/* Must be called from the primary process. */
-	MLX5_ASSERT(rte_eal_process_type() == RTE_PROC_PRIMARY);
-	/*
-	 * MR can't be freed with holding the lock because rte_free() could call
-	 * memory free callback function. This will be a deadlock situation.
-	 */
-	rte_rwlock_write_lock(&sh->mr.rwlock);
-	/* Detach the whole free list and release it after unlocking. */
-	free_list = sh->mr.mr_free_list;
-	LIST_INIT(&sh->mr.mr_free_list);
-	rte_rwlock_write_unlock(&sh->mr.rwlock);
-	/* Release resources. */
-	mr_next = LIST_FIRST(&free_list);
-	while (mr_next != NULL) {
-		struct mlx5_mr *mr = mr_next;
-
-		mr_next = LIST_NEXT(mr, mr);
-		mr_free(mr);
-	}
-}
-
-/* Called during rte_memseg_contig_walk() by mlx5_mr_create(). */
-static int
-mr_find_contig_memsegs_cb(const struct rte_memseg_list *msl,
-			  const struct rte_memseg *ms, size_t len, void *arg)
-{
-	struct mr_find_contig_memsegs_data *data = arg;
-
-	if (data->addr < ms->addr_64 || data->addr >= ms->addr_64 + len)
-		return 0;
-	/* Found, save it and stop walking. */
-	data->start = ms->addr_64;
-	data->end = ms->addr_64 + len;
-	data->msl = msl;
-	return 1;
-}
-
-/**
- * Create a new global Memory Region (MR) for a missing virtual address.
- * This API should be called on a secondary process, then a request is sent to
- * the primary process in order to create a MR for the address. As the global MR
- * list is on the shared memory, following LKey lookup should succeed unless the
- * request fails.
- *
- * @param dev
- *   Pointer to Ethernet device.
- * @param[out] entry
- *   Pointer to returning MR cache entry, found in the global cache or newly
- *   created. If failed to create one, this will not be updated.
- * @param addr
- *   Target virtual address to register.
- *
- * @return
- *   Searched LKey on success, UINT32_MAX on failure and rte_errno is set.
- */
-static uint32_t
-mlx5_mr_create_secondary(struct rte_eth_dev *dev, struct mlx5_mr_cache *entry,
-			 uintptr_t addr)
-{
-	struct mlx5_priv *priv = dev->data->dev_private;
-	int ret;
-
-	DEBUG("port %u requesting MR creation for address (%p)",
-	      dev->data->port_id, (void *)addr);
-	ret = mlx5_mp_req_mr_create(&priv->mp_id, addr);
-	if (ret) {
-		DEBUG("port %u fail to request MR creation for address (%p)",
-		      dev->data->port_id, (void *)addr);
-		return UINT32_MAX;
-	}
-	rte_rwlock_read_lock(&priv->sh->mr.rwlock);
-	/* Fill in output data. */
-	mr_lookup_dev(priv->sh, entry, addr);
-	/* Lookup can't fail. */
-	MLX5_ASSERT(entry->lkey != UINT32_MAX);
-	rte_rwlock_read_unlock(&priv->sh->mr.rwlock);
-	DEBUG("port %u MR CREATED by primary process for %p:\n"
-	      "  [0x%" PRIxPTR ", 0x%" PRIxPTR "), lkey=0x%x",
-	      dev->data->port_id, (void *)addr,
-	      entry->start, entry->end, entry->lkey);
-	return entry->lkey;
-}
-
-/**
- * Create a new global Memory Region (MR) for a missing virtual address.
- * Register entire virtually contiguous memory chunk around the address.
- * This must be called from the primary process.
- *
- * @param dev
- *   Pointer to Ethernet device.
- * @param[out] entry
- *   Pointer to returning MR cache entry, found in the global cache or newly
- *   created. If failed to create one, this will not be updated.
- * @param addr
- *   Target virtual address to register.
- *
- * @return
- *   Searched LKey on success, UINT32_MAX on failure and rte_errno is set.
- */
-uint32_t
-mlx5_mr_create_primary(struct rte_eth_dev *dev, struct mlx5_mr_cache *entry,
-		       uintptr_t addr)
-{
-	struct mlx5_priv *priv = dev->data->dev_private;
-	struct mlx5_ibv_shared *sh = priv->sh;
-	struct mlx5_dev_config *config = &priv->config;
-	const struct rte_memseg_list *msl;
-	const struct rte_memseg *ms;
-	struct mlx5_mr *mr = NULL;
-	size_t len;
-	uint32_t ms_n;
-	uint32_t bmp_size;
-	void *bmp_mem;
-	int ms_idx_shift = -1;
-	unsigned int n;
-	struct mr_find_contig_memsegs_data data = {
-		.addr = addr,
-	};
-	struct mr_find_contig_memsegs_data data_re;
-
-	DRV_LOG(DEBUG, "port %u creating a MR using address (%p)",
-		dev->data->port_id, (void *)addr);
-	/*
-	 * Release detached MRs if any. This can't be called with holding either
-	 * memory_hotplug_lock or sh->mr.rwlock. MRs on the free list have
-	 * been detached by the memory free event but it couldn't be released
-	 * inside the callback due to deadlock. As a result, releasing resources
-	 * is quite opportunistic.
-	 */
-	mlx5_mr_garbage_collect(sh);
-	/*
-	 * If enabled, find out a contiguous virtual address chunk in use, to
-	 * which the given address belongs, in order to register maximum range.
-	 * In the best case where mempools are not dynamically recreated and
-	 * '--socket-mem' is specified as an EAL option, it is very likely to
-	 * have only one MR(LKey) per a socket and per a hugepage-size even
-	 * though the system memory is highly fragmented. As the whole memory
-	 * chunk will be pinned by kernel, it can't be reused unless entire
-	 * chunk is freed from EAL.
-	 *
-	 * If disabled, just register one memseg (page). Then, memory
-	 * consumption will be minimized but it may drop performance if there
-	 * are many MRs to lookup on the datapath.
-	 */
-	if (!config->mr_ext_memseg_en) {
-		data.msl = rte_mem_virt2memseg_list((void *)addr);
-		data.start = RTE_ALIGN_FLOOR(addr, data.msl->page_sz);
-		data.end = data.start + data.msl->page_sz;
-	} else if (!rte_memseg_contig_walk(mr_find_contig_memsegs_cb, &data)) {
-		DRV_LOG(WARNING,
-			"port %u unable to find virtually contiguous"
-			" chunk for address (%p)."
-			" rte_memseg_contig_walk() failed.",
-			dev->data->port_id, (void *)addr);
-		rte_errno = ENXIO;
-		goto err_nolock;
-	}
-alloc_resources:
-	/* Addresses must be page-aligned. */
-	MLX5_ASSERT(rte_is_aligned((void *)data.start, data.msl->page_sz));
-	MLX5_ASSERT(rte_is_aligned((void *)data.end, data.msl->page_sz));
-	msl = data.msl;
-	ms = rte_mem_virt2memseg((void *)data.start, msl);
-	len = data.end - data.start;
-	MLX5_ASSERT(msl->page_sz == ms->hugepage_sz);
-	/* Number of memsegs in the range. */
-	ms_n = len / msl->page_sz;
-	DEBUG("port %u extending %p to [0x%" PRIxPTR ", 0x%" PRIxPTR "),"
-	      " page_sz=0x%" PRIx64 ", ms_n=%u",
-	      dev->data->port_id, (void *)addr,
-	      data.start, data.end, msl->page_sz, ms_n);
-	/* Size of memory for bitmap. */
-	bmp_size = rte_bitmap_get_memory_footprint(ms_n);
-	mr = rte_zmalloc_socket(NULL,
-				RTE_ALIGN_CEIL(sizeof(*mr),
-					       RTE_CACHE_LINE_SIZE) +
-				bmp_size,
-				RTE_CACHE_LINE_SIZE, msl->socket_id);
-	if (mr == NULL) {
-		DEBUG("port %u unable to allocate memory for a new MR of"
-		      " address (%p).",
-		      dev->data->port_id, (void *)addr);
-		rte_errno = ENOMEM;
-		goto err_nolock;
-	}
-	mr->msl = msl;
-	/*
-	 * Save the index of the first memseg and initialize memseg bitmap. To
-	 * see if a memseg of ms_idx in the memseg-list is still valid, check:
-	 *	rte_bitmap_get(mr->bmp, ms_idx - mr->ms_base_idx)
-	 */
-	mr->ms_base_idx = rte_fbarray_find_idx(&msl->memseg_arr, ms);
-	bmp_mem = RTE_PTR_ALIGN_CEIL(mr + 1, RTE_CACHE_LINE_SIZE);
-	mr->ms_bmp = rte_bitmap_init(ms_n, bmp_mem, bmp_size);
-	if (mr->ms_bmp == NULL) {
-		DEBUG("port %u unable to initialize bitmap for a new MR of"
-		      " address (%p).",
-		      dev->data->port_id, (void *)addr);
-		rte_errno = EINVAL;
-		goto err_nolock;
-	}
-	/*
-	 * Should recheck whether the extended contiguous chunk is still valid.
-	 * Because memory_hotplug_lock can't be held if there's any memory
-	 * related calls in a critical path, resource allocation above can't be
-	 * locked. If the memory has been changed at this point, try again with
-	 * just single page. If not, go on with the big chunk atomically from
-	 * here.
-	 */
-	rte_mcfg_mem_read_lock();
-	data_re = data;
-	if (len > msl->page_sz &&
-	    !rte_memseg_contig_walk(mr_find_contig_memsegs_cb, &data_re)) {
-		DEBUG("port %u unable to find virtually contiguous"
-		      " chunk for address (%p)."
-		      " rte_memseg_contig_walk() failed.",
-		      dev->data->port_id, (void *)addr);
-		rte_errno = ENXIO;
-		goto err_memlock;
-	}
-	if (data.start != data_re.start || data.end != data_re.end) {
-		/*
-		 * The extended contiguous chunk has been changed. Try again
-		 * with single memseg instead.
-		 */
-		data.start = RTE_ALIGN_FLOOR(addr, msl->page_sz);
-		data.end = data.start + msl->page_sz;
-		rte_mcfg_mem_read_unlock();
-		mr_free(mr);
-		goto alloc_resources;
-	}
-	MLX5_ASSERT(data.msl == data_re.msl);
-	rte_rwlock_write_lock(&sh->mr.rwlock);
-	/*
-	 * Check the address is really missing. If other thread already created
-	 * one or it is not found due to overflow, abort and return.
-	 */
-	if (mr_lookup_dev(sh, entry, addr) != UINT32_MAX) {
-		/*
-		 * Insert to the global cache table. It may fail due to
-		 * low-on-memory. Then, this entry will have to be searched
-		 * here again.
-		 */
-		mr_btree_insert(&sh->mr.cache, entry);
-		DEBUG("port %u found MR for %p on final lookup, abort",
-		      dev->data->port_id, (void *)addr);
-		rte_rwlock_write_unlock(&sh->mr.rwlock);
-		rte_mcfg_mem_read_unlock();
-		/*
-		 * Must be unlocked before calling rte_free() because
-		 * mlx5_mr_mem_event_free_cb() can be called inside.
-		 */
-		mr_free(mr);
-		return entry->lkey;
-	}
-	/*
-	 * Trim start and end addresses for verbs MR. Set bits for registering
-	 * memsegs but exclude already registered ones. Bitmap can be
-	 * fragmented.
-	 */
-	for (n = 0; n < ms_n; ++n) {
-		uintptr_t start;
-		struct mlx5_mr_cache ret;
-
-		memset(&ret, 0, sizeof(ret));
-		start = data_re.start + n * msl->page_sz;
-		/* Exclude memsegs already registered by other MRs. */
-		if (mr_lookup_dev(sh, &ret, start) == UINT32_MAX) {
-			/*
-			 * Start from the first unregistered memseg in the
-			 * extended range.
-			 */
-			if (ms_idx_shift == -1) {
-				mr->ms_base_idx += n;
-				data.start = start;
-				ms_idx_shift = n;
-			}
-			data.end = start + msl->page_sz;
-			rte_bitmap_set(mr->ms_bmp, n - ms_idx_shift);
-			++mr->ms_n;
-		}
-	}
-	len = data.end - data.start;
-	mr->ms_bmp_n = len / msl->page_sz;
-	MLX5_ASSERT(ms_idx_shift + mr->ms_bmp_n <= ms_n);
-	/*
-	 * Finally create a verbs MR for the memory chunk. ibv_reg_mr() can be
-	 * called with holding the memory lock because it doesn't use
-	 * mlx5_alloc_buf_extern() which eventually calls rte_malloc_socket()
-	 * through mlx5_alloc_verbs_buf().
-	 */
-	mr->ibv_mr = mlx5_glue->reg_mr(sh->pd, (void *)data.start, len,
-				       IBV_ACCESS_LOCAL_WRITE);
-	if (mr->ibv_mr == NULL) {
-		DEBUG("port %u fail to create a verbs MR for address (%p)",
-		      dev->data->port_id, (void *)addr);
-		rte_errno = EINVAL;
-		goto err_mrlock;
-	}
-	MLX5_ASSERT((uintptr_t)mr->ibv_mr->addr == data.start);
-	MLX5_ASSERT(mr->ibv_mr->length == len);
-	LIST_INSERT_HEAD(&sh->mr.mr_list, mr, mr);
-	DEBUG("port %u MR CREATED (%p) for %p:\n"
-	      "  [0x%" PRIxPTR ", 0x%" PRIxPTR "),"
-	      " lkey=0x%x base_idx=%u ms_n=%u, ms_bmp_n=%u",
-	      dev->data->port_id, (void *)mr, (void *)addr,
-	      data.start, data.end, rte_cpu_to_be_32(mr->ibv_mr->lkey),
-	      mr->ms_base_idx, mr->ms_n, mr->ms_bmp_n);
-	/* Insert to the global cache table. */
-	mr_insert_dev_cache(sh, mr);
-	/* Fill in output data. */
-	mr_lookup_dev(sh, entry, addr);
-	/* Lookup can't fail. */
-	MLX5_ASSERT(entry->lkey != UINT32_MAX);
-	rte_rwlock_write_unlock(&sh->mr.rwlock);
-	rte_mcfg_mem_read_unlock();
-	return entry->lkey;
-err_mrlock:
-	rte_rwlock_write_unlock(&sh->mr.rwlock);
-err_memlock:
-	rte_mcfg_mem_read_unlock();
-err_nolock:
-	/*
-	 * In case of error, as this can be called in a datapath, a warning
-	 * message per an error is preferable instead. Must be unlocked before
-	 * calling rte_free() because mlx5_mr_mem_event_free_cb() can be called
-	 * inside.
-	 */
-	mr_free(mr);
-	return UINT32_MAX;
-}
-
-/**
- * Create a new global Memory Region (MR) for a missing virtual address.
- * This can be called from primary and secondary process.
- *
- * @param dev
- *   Pointer to Ethernet device.
- * @param[out] entry
- *   Pointer to returning MR cache entry, found in the global cache or newly
- *   created. If failed to create one, this will not be updated.
- * @param addr
- *   Target virtual address to register.
- *
- * @return
- *   Searched LKey on success, UINT32_MAX on failure and rte_errno is set.
- */
-static uint32_t
-mlx5_mr_create(struct rte_eth_dev *dev, struct mlx5_mr_cache *entry,
-	       uintptr_t addr)
-{
-	uint32_t ret = 0;
-
-	switch (rte_eal_process_type()) {
-	case RTE_PROC_PRIMARY:
-		ret = mlx5_mr_create_primary(dev, entry, addr);
-		break;
-	case RTE_PROC_SECONDARY:
-		ret = mlx5_mr_create_secondary(dev, entry, addr);
-		break;
-	default:
-		break;
-	}
-	return ret;
-}
-
-/**
- * Rebuild the global B-tree cache of device from the original MR list.
- *
- * @param sh
- *   Pointer to Ethernet device shared context.
- */
-static void
-mr_rebuild_dev_cache(struct mlx5_ibv_shared *sh)
-{
-	struct mlx5_mr *mr;
-
-	DRV_LOG(DEBUG, "device %s rebuild dev cache[]", sh->ibdev_name);
-	/* Flush cache to rebuild. */
-	sh->mr.cache.len = 1;
-	sh->mr.cache.overflow = 0;
-	/* Iterate all the existing MRs. */
-	LIST_FOREACH(mr, &sh->mr.mr_list, mr)
-		if (mr_insert_dev_cache(sh, mr) < 0)
-			return;
-}
-
 /**
  * Callback for memory free event. Iterate freed memsegs and check whether it
  * belongs to an existing MR. If found, clear the bit from bitmap of MR. As a
@@ -899,18 +74,18 @@ mlx5_mr_mem_event_free_cb(struct mlx5_ibv_shared *sh,
 		    RTE_ALIGN((uintptr_t)addr, msl->page_sz));
 	MLX5_ASSERT(len == RTE_ALIGN(len, msl->page_sz));
 	ms_n = len / msl->page_sz;
-	rte_rwlock_write_lock(&sh->mr.rwlock);
+	rte_rwlock_write_lock(&sh->share_cache.rwlock);
 	/* Clear bits of freed memsegs from MR. */
 	for (i = 0; i < ms_n; ++i) {
 		const struct rte_memseg *ms;
-		struct mlx5_mr_cache entry;
+		struct mr_cache_entry entry;
 		uintptr_t start;
 		int ms_idx;
 		uint32_t pos;
 
 		/* Find MR having this memseg. */
 		start = (uintptr_t)addr + i * msl->page_sz;
-		mr = mr_lookup_dev_list(sh, &entry, start);
+		mr = mlx5_mr_lookup_list(&sh->share_cache, &entry, start);
 		if (mr == NULL)
 			continue;
 		MLX5_ASSERT(mr->msl); /* Can't be external memory. */
@@ -926,7 +101,7 @@ mlx5_mr_mem_event_free_cb(struct mlx5_ibv_shared *sh,
 		rte_bitmap_clear(mr->ms_bmp, pos);
 		if (--mr->ms_n == 0) {
 			LIST_REMOVE(mr, mr);
-			LIST_INSERT_HEAD(&sh->mr.mr_free_list, mr, mr);
+			LIST_INSERT_HEAD(&sh->share_cache.mr_free_list, mr, mr);
 			DEBUG("device %s remove MR(%p) from list",
 			      sh->ibdev_name, (void *)mr);
 		}
@@ -937,7 +112,7 @@ mlx5_mr_mem_event_free_cb(struct mlx5_ibv_shared *sh,
 		rebuild = 1;
 	}
 	if (rebuild) {
-		mr_rebuild_dev_cache(sh);
+		mlx5_mr_rebuild_cache(&sh->share_cache);
 		/*
 		 * Flush local caches by propagating invalidation across cores.
 		 * rte_smp_wmb() is enough to synchronize this event. If one of
@@ -947,12 +122,12 @@ mlx5_mr_mem_event_free_cb(struct mlx5_ibv_shared *sh,
 		 * generation below) will be guaranteed to be seen by other core
 		 * before the core sees the newly allocated memory.
 		 */
-		++sh->mr.dev_gen;
+		++sh->share_cache.dev_gen;
 		DEBUG("broadcasting local cache flush, gen=%d",
-		      sh->mr.dev_gen);
+		      sh->share_cache.dev_gen);
 		rte_smp_wmb();
 	}
-	rte_rwlock_write_unlock(&sh->mr.rwlock);
+	rte_rwlock_write_unlock(&sh->share_cache.rwlock);
 }
 
 /**
@@ -989,111 +164,6 @@ mlx5_mr_mem_event_cb(enum rte_mem_event event_type, const void *addr,
 	}
 }
 
-/**
- * Look up address in the global MR cache table. If not found, create a new MR.
- * Insert the found/created entry to local bottom-half cache table.
- *
- * @param dev
- *   Pointer to Ethernet device.
- * @param mr_ctrl
- *   Pointer to per-queue MR control structure.
- * @param[out] entry
- *   Pointer to returning MR cache entry, found in the global cache or newly
- *   created. If failed to create one, this is not written.
- * @param addr
- *   Search key.
- *
- * @return
- *   Searched LKey on success, UINT32_MAX on no match.
- */
-static uint32_t
-mlx5_mr_lookup_dev(struct rte_eth_dev *dev, struct mlx5_mr_ctrl *mr_ctrl,
-		   struct mlx5_mr_cache *entry, uintptr_t addr)
-{
-	struct mlx5_priv *priv = dev->data->dev_private;
-	struct mlx5_ibv_shared *sh = priv->sh;
-	struct mlx5_mr_btree *bt = &mr_ctrl->cache_bh;
-	uint16_t idx;
-	uint32_t lkey;
-
-	/* If local cache table is full, try to double it. */
-	if (unlikely(bt->len == bt->size))
-		mr_btree_expand(bt, bt->size << 1);
-	/* Look up in the global cache. */
-	rte_rwlock_read_lock(&sh->mr.rwlock);
-	lkey = mr_btree_lookup(&sh->mr.cache, &idx, addr);
-	if (lkey != UINT32_MAX) {
-		/* Found. */
-		*entry = (*sh->mr.cache.table)[idx];
-		rte_rwlock_read_unlock(&sh->mr.rwlock);
-		/*
-		 * Update local cache. Even if it fails, return the found entry
-		 * to update top-half cache. Next time, this entry will be found
-		 * in the global cache.
-		 */
-		mr_btree_insert(bt, entry);
-		return lkey;
-	}
-	rte_rwlock_read_unlock(&sh->mr.rwlock);
-	/* First time to see the address? Create a new MR. */
-	lkey = mlx5_mr_create(dev, entry, addr);
-	/*
-	 * Update the local cache if successfully created a new global MR. Even
-	 * if failed to create one, there's no action to take in this datapath
-	 * code. As returning LKey is invalid, this will eventually make HW
-	 * fail.
-	 */
-	if (lkey != UINT32_MAX)
-		mr_btree_insert(bt, entry);
-	return lkey;
-}
-
-/**
- * Bottom-half of LKey search on datapath. Firstly search in cache_bh[] and if
- * misses, search in the global MR cache table and update the new entry to
- * per-queue local caches.
- *
- * @param dev
- *   Pointer to Ethernet device.
- * @param mr_ctrl
- *   Pointer to per-queue MR control structure.
- * @param addr
- *   Search key.
- *
- * @return
- *   Searched LKey on success, UINT32_MAX on no match.
- */
-static uint32_t
-mlx5_mr_addr2mr_bh(struct rte_eth_dev *dev, struct mlx5_mr_ctrl *mr_ctrl,
-		   uintptr_t addr)
-{
-	uint32_t lkey;
-	uint16_t bh_idx = 0;
-	/* Victim in top-half cache to replace with new entry. */
-	struct mlx5_mr_cache *repl = &mr_ctrl->cache[mr_ctrl->head];
-
-	/* Binary-search MR translation table. */
-	lkey = mr_btree_lookup(&mr_ctrl->cache_bh, &bh_idx, addr);
-	/* Update top-half cache. */
-	if (likely(lkey != UINT32_MAX)) {
-		*repl = (*mr_ctrl->cache_bh.table)[bh_idx];
-	} else {
-		/*
-		 * If missed in local lookup table, search in the global cache
-		 * and local cache_bh[] will be updated inside if possible.
-		 * Top-half cache entry will also be updated.
-		 */
-		lkey = mlx5_mr_lookup_dev(dev, mr_ctrl, repl, addr);
-		if (unlikely(lkey == UINT32_MAX))
-			return UINT32_MAX;
-	}
-	/* Update the most recently used entry. */
-	mr_ctrl->mru = mr_ctrl->head;
-	/* Point to the next victim, the oldest. */
-	mr_ctrl->head = (mr_ctrl->head + 1) % MLX5_MR_CACHE_N;
-	return lkey;
-}
-
 /**
  * Bottom-half of LKey search on Rx.
  *
@@ -1113,7 +183,9 @@ mlx5_rx_addr2mr_bh(struct mlx5_rxq_data *rxq, uintptr_t addr)
 	struct mlx5_mr_ctrl *mr_ctrl = &rxq->mr_ctrl;
 	struct mlx5_priv *priv = rxq_ctrl->priv;
 
-	return mlx5_mr_addr2mr_bh(ETH_DEV(priv), mr_ctrl, addr);
+	return mlx5_mr_addr2mr_bh(priv->sh->pd, &priv->mp_id,
+				  &priv->sh->share_cache, mr_ctrl, addr,
+				  priv->config.mr_ext_memseg_en);
 }
 
 /**
@@ -1135,7 +207,9 @@ mlx5_tx_addr2mr_bh(struct mlx5_txq_data *txq, uintptr_t addr)
 	struct mlx5_mr_ctrl *mr_ctrl = &txq->mr_ctrl;
 	struct mlx5_priv *priv = txq_ctrl->priv;
 
-	return mlx5_mr_addr2mr_bh(ETH_DEV(priv), mr_ctrl, addr);
+	return mlx5_mr_addr2mr_bh(priv->sh->pd, &priv->mp_id,
+				  &priv->sh->share_cache, mr_ctrl, addr,
+				  priv->config.mr_ext_memseg_en);
 }
 
 /**
@@ -1164,81 +238,6 @@ mlx5_tx_mb2mr_bh(struct mlx5_txq_data *txq, struct rte_mbuf *mb)
 	return lkey;
 }
 
-/**
- * Flush all of the local cache entries.
- *
- * @param mr_ctrl
- *   Pointer to per-queue MR control structure.
- */
-void
-mlx5_mr_flush_local_cache(struct mlx5_mr_ctrl *mr_ctrl)
-{
-	/* Reset the most-recently-used index. */
-	mr_ctrl->mru = 0;
-	/* Reset the linear search array. */
-	mr_ctrl->head = 0;
-	memset(mr_ctrl->cache, 0, sizeof(mr_ctrl->cache));
-	/* Reset the B-tree table. */
-	mr_ctrl->cache_bh.len = 1;
-	mr_ctrl->cache_bh.overflow = 0;
-	/* Update the generation number. */
-	mr_ctrl->cur_gen = *mr_ctrl->dev_gen_ptr;
-	DRV_LOG(DEBUG, "mr_ctrl(%p): flushed, cur_gen=%d",
-		(void *)mr_ctrl, mr_ctrl->cur_gen);
-}
-
-/**
- * Creates a memory region for external memory, that is memory which is not
- * part of the DPDK memory segments.
- *
- * @param dev
- *   Pointer to the ethernet device.
- * @param addr
- *   Starting virtual address of memory.
- * @param len
- *   Length of memory segment being mapped.
- * @param socked_id
- *   Socket to allocate heap memory for the control structures.
- *
- * @return
- *   Pointer to MR structure on success, NULL otherwise.
- */
-static struct mlx5_mr *
-mlx5_create_mr_ext(struct rte_eth_dev *dev, uintptr_t addr, size_t len,
-		   int socket_id)
-{
-	struct mlx5_priv *priv = dev->data->dev_private;
-	struct mlx5_mr *mr = NULL;
-
-	mr = rte_zmalloc_socket(NULL,
-				RTE_ALIGN_CEIL(sizeof(*mr),
-					       RTE_CACHE_LINE_SIZE),
-				RTE_CACHE_LINE_SIZE, socket_id);
-	if (mr == NULL)
-		return NULL;
-	mr->ibv_mr = mlx5_glue->reg_mr(priv->sh->pd, (void *)addr, len,
-				       IBV_ACCESS_LOCAL_WRITE);
-	if (mr->ibv_mr == NULL) {
-		DRV_LOG(WARNING,
-			"port %u fail to create a verbs MR for address (%p)",
-			dev->data->port_id, (void *)addr);
-		rte_free(mr);
-		return NULL;
-	}
-	mr->msl = NULL; /* Mark it is external memory. */
-	mr->ms_bmp = NULL;
-	mr->ms_n = 1;
-	mr->ms_bmp_n = 1;
-	DRV_LOG(DEBUG,
-		"port %u MR CREATED (%p) for external memory %p:\n"
-		"  [0x%" PRIxPTR ", 0x%" PRIxPTR "),"
-		" lkey=0x%x base_idx=%u ms_n=%u, ms_bmp_n=%u",
-		dev->data->port_id, (void *)mr, (void *)addr,
-		addr, addr + len, rte_cpu_to_be_32(mr->ibv_mr->lkey),
-		mr->ms_base_idx, mr->ms_n, mr->ms_bmp_n);
-	return mr;
-}
-
 /**
  * Called during rte_mempool_mem_iter() by mlx5_mr_update_ext_mp().
  *
@@ -1265,19 +264,19 @@ mlx5_mr_update_ext_mp_cb(struct rte_mempool *mp, void *opaque,
 	struct mlx5_mr *mr = NULL;
 	uintptr_t addr = (uintptr_t)memhdr->addr;
 	size_t len = memhdr->len;
-	struct mlx5_mr_cache entry;
+	struct mr_cache_entry entry;
 	uint32_t lkey;
 
 	MLX5_ASSERT(rte_eal_process_type() == RTE_PROC_PRIMARY);
 	/* If already registered, it should return. */
-	rte_rwlock_read_lock(&sh->mr.rwlock);
-	lkey = mr_lookup_dev(sh, &entry, addr);
-	rte_rwlock_read_unlock(&sh->mr.rwlock);
+	rte_rwlock_read_lock(&sh->share_cache.rwlock);
+	lkey = mlx5_mr_lookup_cache(&sh->share_cache, &entry, addr);
+	rte_rwlock_read_unlock(&sh->share_cache.rwlock);
 	if (lkey != UINT32_MAX)
 		return;
 	DRV_LOG(DEBUG, "port %u register MR for chunk #%d of mempool (%s)",
 		dev->data->port_id, mem_idx, mp->name);
-	mr = mlx5_create_mr_ext(dev, addr, len, mp->socket_id);
+	mr = mlx5_create_mr_ext(sh->pd, addr, len, mp->socket_id);
 	if (!mr) {
 		DRV_LOG(WARNING,
 			"port %u unable to allocate a new MR of"
@@ -1286,13 +285,14 @@ mlx5_mr_update_ext_mp_cb(struct rte_mempool *mp, void *opaque,
 		data->ret = -1;
 		return;
 	}
-	rte_rwlock_write_lock(&sh->mr.rwlock);
-	LIST_INSERT_HEAD(&sh->mr.mr_list, mr, mr);
+	rte_rwlock_write_lock(&sh->share_cache.rwlock);
+	LIST_INSERT_HEAD(&sh->share_cache.mr_list, mr, mr);
 	/* Insert to the global cache table. */
-	mr_insert_dev_cache(sh, mr);
-	rte_rwlock_write_unlock(&sh->mr.rwlock);
+	mlx5_mr_insert_cache(&sh->share_cache, mr);
+	rte_rwlock_write_unlock(&sh->share_cache.rwlock);
 	/* Insert to the local cache table */
-	mlx5_mr_addr2mr_bh(dev, mr_ctrl, addr);
+	mlx5_mr_addr2mr_bh(sh->pd, &priv->mp_id, &sh->share_cache,
+			   mr_ctrl, addr, priv->config.mr_ext_memseg_en);
 }
 
 /**
@@ -1349,19 +349,19 @@ mlx5_dma_map(struct rte_pci_device *pdev, void *addr,
 		return -1;
 	}
 	priv = dev->data->dev_private;
-	mr = mlx5_create_mr_ext(dev, (uintptr_t)addr, len, SOCKET_ID_ANY);
+	sh = priv->sh;
+	mr = mlx5_create_mr_ext(sh->pd, (uintptr_t)addr, len, SOCKET_ID_ANY);
 	if (!mr) {
 		DRV_LOG(WARNING,
 			"port %u unable to dma map", dev->data->port_id);
 		rte_errno = EINVAL;
 		return -1;
 	}
-	sh = priv->sh;
-	rte_rwlock_write_lock(&sh->mr.rwlock);
-	LIST_INSERT_HEAD(&sh->mr.mr_list, mr, mr);
+	rte_rwlock_write_lock(&sh->share_cache.rwlock);
+	LIST_INSERT_HEAD(&sh->share_cache.mr_list, mr, mr);
 	/* Insert to the global cache table. */
-	mr_insert_dev_cache(sh, mr);
-	rte_rwlock_write_unlock(&sh->mr.rwlock);
+	mlx5_mr_insert_cache(&sh->share_cache, mr);
+	rte_rwlock_write_unlock(&sh->share_cache.rwlock);
 	return 0;
 }
 
@@ -1388,7 +388,7 @@ mlx5_dma_unmap(struct rte_pci_device *pdev, void *addr,
 	struct mlx5_priv *priv;
 	struct mlx5_ibv_shared *sh;
 	struct mlx5_mr *mr;
-	struct mlx5_mr_cache entry;
+	struct mr_cache_entry entry;
 
 	dev = pci_dev_to_eth_dev(pdev);
 	if (!dev) {
@@ -1399,10 +399,10 @@ mlx5_dma_unmap(struct rte_pci_device *pdev, void *addr,
 	}
 	priv = dev->data->dev_private;
 	sh = priv->sh;
-	rte_rwlock_read_lock(&sh->mr.rwlock);
-	mr = mr_lookup_dev_list(sh, &entry, (uintptr_t)addr);
+	rte_rwlock_read_lock(&sh->share_cache.rwlock);
+	mr = mlx5_mr_lookup_list(&sh->share_cache, &entry, (uintptr_t)addr);
 	if (!mr) {
-		rte_rwlock_read_unlock(&sh->mr.rwlock);
+		rte_rwlock_read_unlock(&sh->share_cache.rwlock);
 		DRV_LOG(WARNING, "address 0x%" PRIxPTR " wasn't registered "
 				 "to PCI device %p", (uintptr_t)addr,
 				 (void *)pdev);
@@ -1410,10 +410,10 @@ mlx5_dma_unmap(struct rte_pci_device *pdev, void *addr,
 		return -1;
 	}
 	LIST_REMOVE(mr, mr);
-	LIST_INSERT_HEAD(&sh->mr.mr_free_list, mr, mr);
+	LIST_INSERT_HEAD(&sh->share_cache.mr_free_list, mr, mr);
 	DEBUG("port %u remove MR(%p) from list", dev->data->port_id,
 	      (void *)mr);
-	mr_rebuild_dev_cache(sh);
+	mlx5_mr_rebuild_cache(&sh->share_cache);
 	/*
 	 * Flush local caches by propagating invalidation across cores.
 	 * rte_smp_wmb() is enough to synchronize this event. If one of
@@ -1423,10 +423,11 @@ mlx5_dma_unmap(struct rte_pci_device *pdev, void *addr,
 	 * generation below) will be guaranteed to be seen by other core
 	 * before the core sees the newly allocated memory.
 	 */
-	++sh->mr.dev_gen;
-	DEBUG("broadcasting local cache flush, gen=%d",	sh->mr.dev_gen);
+	++sh->share_cache.dev_gen;
+	DEBUG("broadcasting local cache flush, gen=%d",
+	      sh->share_cache.dev_gen);
 	rte_smp_wmb();
-	rte_rwlock_read_unlock(&sh->mr.rwlock);
+	rte_rwlock_read_unlock(&sh->share_cache.rwlock);
 	return 0;
 }
 
@@ -1501,14 +502,19 @@ mlx5_mr_update_mp_cb(struct rte_mempool *mp __rte_unused, void *opaque,
 		     unsigned mem_idx __rte_unused)
 {
 	struct mr_update_mp_data *data = opaque;
+	struct rte_eth_dev *dev = data->dev;
+	struct mlx5_priv *priv = dev->data->dev_private;
+
 	uint32_t lkey;
 
 	/* Stop iteration if failed in the previous walk. */
 	if (data->ret < 0)
 		return;
 	/* Register address of the chunk and update local caches. */
-	lkey = mlx5_mr_addr2mr_bh(data->dev, data->mr_ctrl,
-				  (uintptr_t)memhdr->addr);
+	lkey = mlx5_mr_addr2mr_bh(priv->sh->pd, &priv->mp_id,
+				  &priv->sh->share_cache, data->mr_ctrl,
+				  (uintptr_t)memhdr->addr,
+				  priv->config.mr_ext_memseg_en);
 	if (lkey == UINT32_MAX)
 		data->ret = -1;
 }
@@ -1543,76 +549,3 @@ mlx5_mr_update_mp(struct rte_eth_dev *dev, struct mlx5_mr_ctrl *mr_ctrl,
 	}
 	return data.ret;
 }
-
-/**
- * Dump all the created MRs and the global cache entries.
- *
- * @param sh
- *   Pointer to Ethernet device shared context.
- */
-void
-mlx5_mr_dump_dev(struct mlx5_ibv_shared *sh __rte_unused)
-{
-#ifdef RTE_LIBRTE_MLX5_DEBUG
-	struct mlx5_mr *mr;
-	int mr_n = 0;
-	int chunk_n = 0;
-
-	rte_rwlock_read_lock(&sh->mr.rwlock);
-	/* Iterate all the existing MRs. */
-	LIST_FOREACH(mr, &sh->mr.mr_list, mr) {
-		unsigned int n;
-
-		DEBUG("device %s MR[%u], LKey = 0x%x, ms_n = %u, ms_bmp_n = %u",
-		      sh->ibdev_name, mr_n++,
-		      rte_cpu_to_be_32(mr->ibv_mr->lkey),
-		      mr->ms_n, mr->ms_bmp_n);
-		if (mr->ms_n == 0)
-			continue;
-		for (n = 0; n < mr->ms_bmp_n; ) {
-			struct mlx5_mr_cache ret = { 0, };
-
-			n = mr_find_next_chunk(mr, &ret, n);
-			if (!ret.end)
-				break;
-			DEBUG("  chunk[%u], [0x%" PRIxPTR ", 0x%" PRIxPTR ")",
-			      chunk_n++, ret.start, ret.end);
-		}
-	}
-	DEBUG("device %s dumping global cache", sh->ibdev_name);
-	mlx5_mr_btree_dump(&sh->mr.cache);
-	rte_rwlock_read_unlock(&sh->mr.rwlock);
-#endif
-}
-
-/**
- * Release all the created MRs and resources for shared device context.
- * list.
- *
- * @param sh
- *   Pointer to Ethernet device shared context.
- */
-void
-mlx5_mr_release(struct mlx5_ibv_shared *sh)
-{
-	struct mlx5_mr *mr_next;
-
-	if (rte_log_can_log(mlx5_logtype, RTE_LOG_DEBUG))
-		mlx5_mr_dump_dev(sh);
-	rte_rwlock_write_lock(&sh->mr.rwlock);
-	/* Detach from MR list and move to free list. */
-	mr_next = LIST_FIRST(&sh->mr.mr_list);
-	while (mr_next != NULL) {
-		struct mlx5_mr *mr = mr_next;
-
-		mr_next = LIST_NEXT(mr, mr);
-		LIST_REMOVE(mr, mr);
-		LIST_INSERT_HEAD(&sh->mr.mr_free_list, mr, mr);
-	}
-	LIST_INIT(&sh->mr.mr_list);
-	/* Free global cache. */
-	mlx5_mr_btree_free(&sh->mr.cache);
-	rte_rwlock_write_unlock(&sh->mr.rwlock);
-	/* Free all remaining MRs. */
-	mlx5_mr_garbage_collect(sh);
-}
diff --git a/drivers/net/mlx5/mlx5_mr.h b/drivers/net/mlx5/mlx5_mr.h
index 48264c8294..0c5877b3d6 100644
--- a/drivers/net/mlx5/mlx5_mr.h
+++ b/drivers/net/mlx5/mlx5_mr.h
@@ -24,99 +24,16 @@
 #include <rte_ethdev.h>
 #include <rte_rwlock.h>
 #include <rte_bitmap.h>
+#include <rte_memory.h>
 
-/* Memory Region object. */
-struct mlx5_mr {
-	LIST_ENTRY(mlx5_mr) mr; /**< Pointer to the prev/next entry. */
-	struct ibv_mr *ibv_mr; /* Verbs Memory Region. */
-	const struct rte_memseg_list *msl;
-	int ms_base_idx; /* Start index of msl->memseg_arr[]. */
-	int ms_n; /* Number of memsegs in use. */
-	uint32_t ms_bmp_n; /* Number of bits in memsegs bit-mask. */
-	struct rte_bitmap *ms_bmp; /* Bit-mask of memsegs belonged to MR. */
-};
-
-/* Cache entry for Memory Region. */
-struct mlx5_mr_cache {
-	uintptr_t start; /* Start address of MR. */
-	uintptr_t end; /* End address of MR. */
-	uint32_t lkey; /* rte_cpu_to_be_32(ibv_mr->lkey). */
-} __rte_packed;
-
-/* MR Cache table for Binary search. */
-struct mlx5_mr_btree {
-	uint16_t len; /* Number of entries. */
-	uint16_t size; /* Total number of entries. */
-	int overflow; /* Mark failure of table expansion. */
-	struct mlx5_mr_cache (*table)[];
-} __rte_packed;
-
-/* Per-queue MR control descriptor. */
-struct mlx5_mr_ctrl {
-	uint32_t *dev_gen_ptr; /* Generation number of device to poll. */
-	uint32_t cur_gen; /* Generation number saved to flush caches. */
-	uint16_t mru; /* Index of last hit entry in top-half cache. */
-	uint16_t head; /* Index of the oldest entry in top-half cache. */
-	struct mlx5_mr_cache cache[MLX5_MR_CACHE_N]; /* Cache for top-half. */
-	struct mlx5_mr_btree cache_bh; /* Cache for bottom-half. */
-} __rte_packed;
-
-struct mlx5_ibv_shared;
-extern struct mlx5_dev_list  mlx5_mem_event_cb_list;
-extern rte_rwlock_t mlx5_mem_event_rwlock;
+#include <mlx5_common_mr.h>
 
 /* First entry must be NULL for comparison. */
 #define mlx5_mr_btree_len(bt) ((bt)->len - 1)
 
-int mlx5_mr_btree_init(struct mlx5_mr_btree *bt, int n, int socket);
-void mlx5_mr_btree_free(struct mlx5_mr_btree *bt);
-uint32_t mlx5_mr_create_primary(struct rte_eth_dev *dev,
-				struct mlx5_mr_cache *entry, uintptr_t addr);
 void mlx5_mr_mem_event_cb(enum rte_mem_event event_type, const void *addr,
 			  size_t len, void *arg);
 int mlx5_mr_update_mp(struct rte_eth_dev *dev, struct mlx5_mr_ctrl *mr_ctrl,
 		      struct rte_mempool *mp);
-void mlx5_mr_release(struct mlx5_ibv_shared *sh);
-
-/* Debug purpose functions. */
-void mlx5_mr_btree_dump(struct mlx5_mr_btree *bt);
-void mlx5_mr_dump_dev(struct mlx5_ibv_shared *sh);
-
-/**
- * Look up LKey from given lookup table by linear search. Firstly look up the
- * last-hit entry. If miss, the entire array is searched. If found, update the
- * last-hit index and return LKey.
- *
- * @param lkp_tbl
- *   Pointer to lookup table.
- * @param[in,out] cached_idx
- *   Pointer to last-hit index.
- * @param n
- *   Size of lookup table.
- * @param addr
- *   Search key.
- *
- * @return
- *   Searched LKey on success, UINT32_MAX on no match.
- */
-static __rte_always_inline uint32_t
-mlx5_mr_lookup_cache(struct mlx5_mr_cache *lkp_tbl, uint16_t *cached_idx,
-		     uint16_t n, uintptr_t addr)
-{
-	uint16_t idx;
-
-	if (likely(addr >= lkp_tbl[*cached_idx].start &&
-		   addr < lkp_tbl[*cached_idx].end))
-		return lkp_tbl[*cached_idx].lkey;
-	for (idx = 0; idx < n && lkp_tbl[idx].start != 0; ++idx) {
-		if (addr >= lkp_tbl[idx].start &&
-		    addr < lkp_tbl[idx].end) {
-			/* Found. */
-			*cached_idx = idx;
-			return lkp_tbl[idx].lkey;
-		}
-	}
-	return UINT32_MAX;
-}
 
 #endif /* RTE_PMD_MLX5_MR_H_ */
diff --git a/drivers/net/mlx5/mlx5_rxtx.c b/drivers/net/mlx5/mlx5_rxtx.c
index fc7591c2b0..5f9b670442 100644
--- a/drivers/net/mlx5/mlx5_rxtx.c
+++ b/drivers/net/mlx5/mlx5_rxtx.c
@@ -33,6 +33,7 @@
 
 #include "mlx5_defs.h"
 #include "mlx5.h"
+#include "mlx5_mr.h"
 #include "mlx5_utils.h"
 #include "mlx5_rxtx.h"
 #include "mlx5_autoconf.h"
diff --git a/drivers/net/mlx5/mlx5_rxtx.h b/drivers/net/mlx5/mlx5_rxtx.h
index 939778aa55..84161ad6af 100644
--- a/drivers/net/mlx5/mlx5_rxtx.h
+++ b/drivers/net/mlx5/mlx5_rxtx.h
@@ -34,11 +34,11 @@
 #include <mlx5_glue.h>
 #include <mlx5_prm.h>
 #include <mlx5_common.h>
+#include <mlx5_common_mr.h>
 
 #include "mlx5_defs.h"
 #include "mlx5_utils.h"
 #include "mlx5.h"
-#include "mlx5_mr.h"
 #include "mlx5_autoconf.h"
 
 /* Support tunnel matching. */
@@ -598,8 +598,8 @@ mlx5_rx_addr2mr(struct mlx5_rxq_data *rxq, uintptr_t addr)
 	uint32_t lkey;
 
 	/* Linear search on MR cache array. */
-	lkey = mlx5_mr_lookup_cache(mr_ctrl->cache, &mr_ctrl->mru,
-				    MLX5_MR_CACHE_N, addr);
+	lkey = mlx5_mr_lookup_lkey(mr_ctrl->cache, &mr_ctrl->mru,
+				   MLX5_MR_CACHE_N, addr);
 	if (likely(lkey != UINT32_MAX))
 		return lkey;
 	/* Take slower bottom-half (Binary Search) on miss. */
@@ -630,8 +630,8 @@ mlx5_tx_mb2mr(struct mlx5_txq_data *txq, struct rte_mbuf *mb)
 	if (unlikely(*mr_ctrl->dev_gen_ptr != mr_ctrl->cur_gen))
 		mlx5_mr_flush_local_cache(mr_ctrl);
 	/* Linear search on MR cache array. */
-	lkey = mlx5_mr_lookup_cache(mr_ctrl->cache, &mr_ctrl->mru,
-				    MLX5_MR_CACHE_N, addr);
+	lkey = mlx5_mr_lookup_lkey(mr_ctrl->cache, &mr_ctrl->mru,
+				   MLX5_MR_CACHE_N, addr);
 	if (likely(lkey != UINT32_MAX))
 		return lkey;
 	/* Take slower bottom-half on miss. */
diff --git a/drivers/net/mlx5/mlx5_rxtx_vec.h b/drivers/net/mlx5/mlx5_rxtx_vec.h
index ea925156f0..6ddcbfb0ad 100644
--- a/drivers/net/mlx5/mlx5_rxtx_vec.h
+++ b/drivers/net/mlx5/mlx5_rxtx_vec.h
@@ -13,6 +13,8 @@
 
 #include "mlx5_autoconf.h"
 
+#include "mlx5_mr.h"
+
 /* HW checksum offload capabilities of vectorized Tx. */
 #define MLX5_VEC_TX_CKSUM_OFFLOAD_CAP \
 	(DEV_TX_OFFLOAD_IPV4_CKSUM | \
diff --git a/drivers/net/mlx5/mlx5_trigger.c b/drivers/net/mlx5/mlx5_trigger.c
index 571b7a003c..a2b634354b 100644
--- a/drivers/net/mlx5/mlx5_trigger.c
+++ b/drivers/net/mlx5/mlx5_trigger.c
@@ -11,6 +11,7 @@
 #include <rte_alarm.h>
 
 #include "mlx5.h"
+#include "mlx5_mr.h"
 #include "mlx5_rxtx.h"
 #include "mlx5_utils.h"
 #include "rte_pmd_mlx5.h"
diff --git a/drivers/net/mlx5/mlx5_txq.c b/drivers/net/mlx5/mlx5_txq.c
index 57bc116450..5901b3bf36 100644
--- a/drivers/net/mlx5/mlx5_txq.c
+++ b/drivers/net/mlx5/mlx5_txq.c
@@ -30,6 +30,7 @@
 #include <mlx5_glue.h>
 #include <mlx5_devx_cmds.h>
 #include <mlx5_common.h>
+#include <mlx5_common_mr.h>
 
 #include "mlx5_defs.h"
 #include "mlx5_utils.h"
@@ -1278,7 +1279,7 @@ mlx5_txq_new(struct rte_eth_dev *dev, uint16_t idx, uint16_t desc,
 		goto error;
 	}
 	/* Save pointer of global generation number to check memory event. */
-	tmpl->txq.mr_ctrl.dev_gen_ptr = &priv->sh->mr.dev_gen;
+	tmpl->txq.mr_ctrl.dev_gen_ptr = &priv->sh->share_cache.dev_gen;
 	MLX5_ASSERT(desc > MLX5_TX_COMP_THRESH);
 	tmpl->txq.offloads = conf->offloads |
 			     dev->data->dev_conf.txmode.offloads;
-- 
2.16.6


^ permalink raw reply related	[flat|nested] 26+ messages in thread
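
The inline lookup helper removed from mlx5_mr.h above survives in the
common driver as mlx5_mr_lookup_lkey(). A self-contained sketch of the
last-hit linear search it performs (the example_ names are illustrative,
not part of the patch):

#include <stdint.h>

struct example_mr_cache {
	uintptr_t start; /* Start address of the MR. */
	uintptr_t end;   /* End address of the MR. */
	uint32_t lkey;   /* Cached big-endian lkey. */
};

static uint32_t
example_lookup_lkey(struct example_mr_cache *tbl, uint16_t *cached_idx,
		    uint16_t n, uintptr_t addr)
{
	uint16_t idx;

	/* Fast path: the last-hit entry usually matches. */
	if (addr >= tbl[*cached_idx].start && addr < tbl[*cached_idx].end)
		return tbl[*cached_idx].lkey;
	/* Slow path: scan until an empty slot (start == 0) terminates. */
	for (idx = 0; idx < n && tbl[idx].start != 0; ++idx) {
		if (addr >= tbl[idx].start && addr < tbl[idx].end) {
			*cached_idx = idx; /* Remember the new last hit. */
			return tbl[idx].lkey;
		}
	}
	return UINT32_MAX; /* Miss: caller takes the bottom-half path. */
}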

* [dpdk-dev] [PATCH v2 0/4] refactor multi-process IPC and memory management codes to common driver
  2020-04-02 19:21 [dpdk-dev] [PATCH 0/4] refactor multi-process IPC and memory management codes to common driver Vu Pham
                   ` (3 preceding siblings ...)
  2020-04-02 19:21 ` [dpdk-dev] [PATCH 4/4] net/mlx5: modify net PMD to use common memory management driver Vu Pham
@ 2020-04-07 16:48 ` Vu Pham
  2020-04-07 16:48   ` [dpdk-dev] [PATCH v2 1/4] common/mlx5: refactor MP IPC handling " Vu Pham
                     ` (3 more replies)
  2020-04-07 17:00 ` [dpdk-dev] [PATCH v3 0/4] refactor multi-process IPC and memory management codes to common driver Vu Pham
  2020-04-13 21:17 ` [dpdk-dev] [PATCH v4 0/2] refactor multi-process IPC and memory management codes to common driver Vu Pham
  6 siblings, 4 replies; 26+ messages in thread
From: Vu Pham @ 2020-04-07 16:48 UTC (permalink / raw)
  To: dev; +Cc: viacheslavo, orika, matan, rasland, Vu Pham

The current mlx5 net PMD and future mlx5 (regex, ...) PMDs that run
on and share the same HCAs need a common memory management driver.
The memory management code internally uses multi-process IPC so that
primary and secondary processes can register and stay in sync on
memory registrations (MRs). That is the main reason to move the
multi-process IPC APIs into the mlx5 common driver first and make
that move the base commit of this series.
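
As a minimal sketch of the intended secondary-process flow (assuming
the mlx5_common_mp.h API introduced in patch 1/4; the example_ names,
the extern handler and the literal IPC name are illustrative only):

#include <stdint.h>
#include <rte_eal.h>
#include <rte_string_fns.h>
#include <mlx5_common_mp.h>

/* Hypothetical handler; the real one lives in the net PMD (patch 2/4). */
extern int example_mp_secondary_handle(const struct rte_mp_msg *msg,
				       const void *peer);

static int
example_secondary_register_addr(uint16_t port_id, uintptr_t addr)
{
	struct mlx5_mp_id mp_id;

	/* One-time registration of the secondary IPC handler. */
	if (mlx5_mp_init_secondary("net_mlx5_mp",
				   example_mp_secondary_handle))
		return -1;
	/* Identify this process/port pair for subsequent requests. */
	mp_id.port_id = port_id;
	strlcpy(mp_id.name, "net_mlx5_mp", RTE_MP_MAX_NAME_LEN);
	/* Ask the primary process to create an MR covering 'addr'. */
	return mlx5_mp_req_mr_create(&mp_id, addr);
}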

Vu Pham (4):
  common/mlx5: refactor MP IPC handling codes to common driver
  net/mlx5: modify net pmd to use common multi-process APIs
  common/mlx5: refactor memory management codes
  net/mlx5: modify net pmd to use common MR driver

 drivers/common/mlx5/Makefile                    |    4 +-
 drivers/common/mlx5/meson.build                 |    2 +
 drivers/common/mlx5/mlx5_common_mp.c            |  188 ++++
 drivers/common/mlx5/mlx5_common_mp.h            |   98 ++
 drivers/common/mlx5/mlx5_common_mr.c            | 1108 +++++++++++++++++++++
 drivers/common/mlx5/mlx5_common_mr.h            |  160 ++++
 drivers/common/mlx5/rte_common_mlx5_version.map |   27 +
 drivers/net/mlx5/mlx5.c                         |   19 +-
 drivers/net/mlx5/mlx5.h                         |   55 +-
 drivers/net/mlx5/mlx5_mp.c                      |  242 +----
 drivers/net/mlx5/mlx5_mr.c                      | 1169 +----------------------
 drivers/net/mlx5/mlx5_mr.h                      |   87 +-
 drivers/net/mlx5/mlx5_rxtx.c                    |    4 +-
 drivers/net/mlx5/mlx5_rxtx.h                    |   10 +-
 drivers/net/mlx5/mlx5_rxtx_vec.h                |    2 +
 drivers/net/mlx5/mlx5_trigger.c                 |    1 +
 drivers/net/mlx5/mlx5_txq.c                     |    3 +-
 17 files changed, 1692 insertions(+), 1487 deletions(-)
 create mode 100644 drivers/common/mlx5/mlx5_common_mp.c
 create mode 100644 drivers/common/mlx5/mlx5_common_mp.h
 create mode 100644 drivers/common/mlx5/mlx5_common_mr.c
 create mode 100644 drivers/common/mlx5/mlx5_common_mr.h

-- 
2.16.6


^ permalink raw reply	[flat|nested] 26+ messages in thread

* [dpdk-dev] [PATCH v2 1/4] common/mlx5: refactor MP IPC handling codes to common driver
  2020-04-07 16:48 ` [dpdk-dev] [PATCH v2 0/4] refactor multi-process IPC and memory management codes to common driver Vu Pham
@ 2020-04-07 16:48   ` Vu Pham
  2020-04-08  9:05     ` Slava Ovsiienko
  2020-04-07 16:48   ` [dpdk-dev] [PATCH v2 2/4] net/mlx5: modify net pmd to use common multi-process APIs Vu Pham
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 26+ messages in thread
From: Vu Pham @ 2020-04-07 16:48 UTC (permalink / raw)
  To: dev; +Cc: viacheslavo, orika, matan, rasland, Vu Pham

Refactor the common multi-process handling code from the net PMD
into the common driver. Use the tuple mp_id{name, port_id} as the
standard input parameter for all MP IPC APIs instead of rte_eth_dev.
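
For illustration, a primary-side handler can be wired through the new
API as below; this is a sketch only (the handler body and example_
names are illustrative, not part of the patch):

#include <rte_eal.h>
#include <rte_string_fns.h>
#include <mlx5_common_mp.h>

static int
example_primary_handle(const struct rte_mp_msg *mp_msg, const void *peer)
{
	struct rte_mp_msg mp_res;
	struct mlx5_mp_param *res = (struct mlx5_mp_param *)mp_res.param;
	const struct mlx5_mp_param *param =
		(const struct mlx5_mp_param *)mp_msg->param;
	struct mlx5_mp_id mp_id;

	mp_id.port_id = param->port_id;
	strlcpy(mp_id.name, "net_mlx5_mp", RTE_MP_MAX_NAME_LEN);
	/* Fill in the reply header from the mp_id tuple and answer. */
	mp_init_msg(&mp_id, &mp_res, param->type);
	res->result = 0;
	return rte_mp_reply(&mp_res, peer);
}

/* Registration, typically done once in the device-probe path. */
static int
example_register_primary(void)
{
	return mlx5_mp_init_primary("net_mlx5_mp", example_primary_handle);
}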

Signed-off-by: Vu Pham <vuhuong@mellanox.com>
---
 drivers/common/mlx5/mlx5_common_mp.c            | 188 ++++++++++++++++++++++++
 drivers/common/mlx5/mlx5_common_mp.h            |  98 ++++++++++++
 drivers/common/mlx5/rte_common_mlx5_version.map |  13 ++
 3 files changed, 299 insertions(+)
 create mode 100644 drivers/common/mlx5/mlx5_common_mp.c
 create mode 100644 drivers/common/mlx5/mlx5_common_mp.h

diff --git a/drivers/common/mlx5/mlx5_common_mp.c b/drivers/common/mlx5/mlx5_common_mp.c
new file mode 100644
index 0000000000..da55143bc1
--- /dev/null
+++ b/drivers/common/mlx5/mlx5_common_mp.c
@@ -0,0 +1,188 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright 2019 6WIND S.A.
+ * Copyright 2019 Mellanox Technologies, Ltd
+ */
+
+#include <stdio.h>
+#include <time.h>
+
+#include <rte_eal.h>
+#include <rte_errno.h>
+
+#include "mlx5_common_mp.h"
+#include "mlx5_common_utils.h"
+
+/**
+ * Request Memory Region creation to the primary process.
+ *
+ * @param[in] mp_id
+ *   ID of the MP process.
+ * @param addr
+ *   Target virtual address to register.
+ *
+ * @return
+ *   0 on success, a negative errno value otherwise and rte_errno is set.
+ */
+int
+mlx5_mp_req_mr_create(struct mlx5_mp_id *mp_id, uintptr_t addr)
+{
+	struct rte_mp_msg mp_req;
+	struct rte_mp_msg *mp_res;
+	struct rte_mp_reply mp_rep;
+	struct mlx5_mp_param *req = (struct mlx5_mp_param *)mp_req.param;
+	struct mlx5_mp_param *res;
+	struct timespec ts = {.tv_sec = MLX5_MP_REQ_TIMEOUT_SEC, .tv_nsec = 0};
+	int ret;
+
+	MLX5_ASSERT(rte_eal_process_type() == RTE_PROC_SECONDARY);
+	mp_init_msg(mp_id, &mp_req, MLX5_MP_REQ_CREATE_MR);
+	req->args.addr = addr;
+	ret = rte_mp_request_sync(&mp_req, &mp_rep, &ts);
+	if (ret) {
+		DRV_LOG(ERR, "port %u request to primary process failed",
+			mp_id->port_id);
+		return -rte_errno;
+	}
+	MLX5_ASSERT(mp_rep.nb_received == 1);
+	mp_res = &mp_rep.msgs[0];
+	res = (struct mlx5_mp_param *)mp_res->param;
+	ret = res->result;
+	if (ret)
+		rte_errno = -ret;
+	free(mp_rep.msgs);
+	return ret;
+}
+
+/**
+ * Request Verbs queue state modification to the primary process.
+ *
+ * @param[in] mp_id
+ *   ID of the MP process.
+ * @param sm
+ *   State modify parameters.
+ *
+ * @return
+ *   0 on success, a negative errno value otherwise and rte_errno is set.
+ */
+int
+mlx5_mp_req_queue_state_modify(struct mlx5_mp_id *mp_id,
+			       struct mlx5_mp_arg_queue_state_modify *sm)
+{
+	struct rte_mp_msg mp_req;
+	struct rte_mp_msg *mp_res;
+	struct rte_mp_reply mp_rep;
+	struct mlx5_mp_param *req = (struct mlx5_mp_param *)mp_req.param;
+	struct mlx5_mp_param *res;
+	struct timespec ts = {.tv_sec = MLX5_MP_REQ_TIMEOUT_SEC, .tv_nsec = 0};
+	int ret;
+
+	MLX5_ASSERT(rte_eal_process_type() == RTE_PROC_SECONDARY);
+	mp_init_msg(mp_id, &mp_req, MLX5_MP_REQ_QUEUE_STATE_MODIFY);
+	req->args.state_modify = *sm;
+	ret = rte_mp_request_sync(&mp_req, &mp_rep, &ts);
+	if (ret) {
+		DRV_LOG(ERR, "port %u request to primary process failed",
+			mp_id->port_id);
+		return -rte_errno;
+	}
+	MLX5_ASSERT(mp_rep.nb_received == 1);
+	mp_res = &mp_rep.msgs[0];
+	res = (struct mlx5_mp_param *)mp_res->param;
+	ret = res->result;
+	free(mp_rep.msgs);
+	return ret;
+}
+
+/**
+ * Request Verbs command file descriptor for mmap to the primary process.
+ *
+ * @param[in] mp_id
+ *   ID of the MP process.
+ *
+ * @return
+ *   fd on success, a negative errno value otherwise and rte_errno is set.
+ */
+int
+mlx5_mp_req_verbs_cmd_fd(struct mlx5_mp_id *mp_id)
+{
+	struct rte_mp_msg mp_req;
+	struct rte_mp_msg *mp_res;
+	struct rte_mp_reply mp_rep;
+	struct mlx5_mp_param *res;
+	struct timespec ts = {.tv_sec = MLX5_MP_REQ_TIMEOUT_SEC, .tv_nsec = 0};
+	int ret;
+
+	MLX5_ASSERT(rte_eal_process_type() == RTE_PROC_SECONDARY);
+	mp_init_msg(mp_id, &mp_req, MLX5_MP_REQ_VERBS_CMD_FD);
+	ret = rte_mp_request_sync(&mp_req, &mp_rep, &ts);
+	if (ret) {
+		DRV_LOG(ERR, "port %u request to primary process failed",
+			mp_id->port_id);
+		return -rte_errno;
+	}
+	MLX5_ASSERT(mp_rep.nb_received == 1);
+	mp_res = &mp_rep.msgs[0];
+	res = (struct mlx5_mp_param *)mp_res->param;
+	if (res->result) {
+		rte_errno = -res->result;
+		DRV_LOG(ERR,
+			"port %u failed to get command FD from primary process",
+			mp_id->port_id);
+		ret = -rte_errno;
+		goto exit;
+	}
+	MLX5_ASSERT(mp_res->num_fds == 1);
+	ret = mp_res->fds[0];
+	DRV_LOG(DEBUG, "port %u command FD from primary is %d",
+		mp_id->port_id, ret);
+exit:
+	free(mp_rep.msgs);
+	return ret;
+}
+
+/**
+ * Initialize by primary process.
+ */
+int
+mlx5_mp_init_primary(const char *name, const rte_mp_t primary_action)
+{
+	int ret;
+
+	MLX5_ASSERT(rte_eal_process_type() == RTE_PROC_PRIMARY);
+
+	/* primary is allowed to not support IPC */
+	ret = rte_mp_action_register(name, primary_action);
+	if (ret && rte_errno != ENOTSUP)
+		return -1;
+	return 0;
+}
+
+/**
+ * Un-initialize by primary process.
+ */
+void
+mlx5_mp_uninit_primary(const char *name)
+{
+	MLX5_ASSERT(rte_eal_process_type() == RTE_PROC_PRIMARY);
+	rte_mp_action_unregister(name);
+}
+
+/**
+ * Initialize by secondary process.
+ */
+int
+mlx5_mp_init_secondary(const char *name, const rte_mp_t secondary_action)
+{
+	MLX5_ASSERT(rte_eal_process_type() == RTE_PROC_SECONDARY);
+	return rte_mp_action_register(name, secondary_action);
+}
+
+/**
+ * Un-initialize by secondary process.
+ */
+void
+mlx5_mp_uninit_secondary(const char *name)
+{
+	MLX5_ASSERT(rte_eal_process_type() == RTE_PROC_SECONDARY);
+	rte_mp_action_unregister(name);
+}
diff --git a/drivers/common/mlx5/mlx5_common_mp.h b/drivers/common/mlx5/mlx5_common_mp.h
new file mode 100644
index 0000000000..7aab77acb2
--- /dev/null
+++ b/drivers/common/mlx5/mlx5_common_mp.h
@@ -0,0 +1,98 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright 2018 6WIND S.A.
+ * Copyright 2018 Mellanox Technologies, Ltd
+ */
+
+#ifndef RTE_PMD_MLX5_COMMON_MP_H_
+#define RTE_PMD_MLX5_COMMON_MP_H_
+
+/* Verbs header. */
+/* ISO C doesn't support unnamed structs/unions, disabling -pedantic. */
+#ifdef PEDANTIC
+#pragma GCC diagnostic ignored "-Wpedantic"
+#endif
+#include <infiniband/verbs.h>
+#ifdef PEDANTIC
+#pragma GCC diagnostic error "-Wpedantic"
+#endif
+
+#include <rte_eal.h>
+#include <rte_string_fns.h>
+
+/* Request types for IPC. */
+enum mlx5_mp_req_type {
+	MLX5_MP_REQ_VERBS_CMD_FD = 1,
+	MLX5_MP_REQ_CREATE_MR,
+	MLX5_MP_REQ_START_RXTX,
+	MLX5_MP_REQ_STOP_RXTX,
+	MLX5_MP_REQ_QUEUE_STATE_MODIFY,
+};
+
+struct mlx5_mp_arg_queue_state_modify {
+	uint8_t is_wq; /* Set if WQ. */
+	uint16_t queue_id; /* DPDK queue ID. */
+	enum ibv_wq_state state; /* WQ requested state. */
+};
+
+/* Parameters for IPC. */
+struct mlx5_mp_param {
+	enum mlx5_mp_req_type type;
+	int port_id;
+	int result;
+	RTE_STD_C11
+	union {
+		uintptr_t addr; /* MLX5_MP_REQ_CREATE_MR */
+		struct mlx5_mp_arg_queue_state_modify state_modify;
+		/* MLX5_MP_REQ_QUEUE_STATE_MODIFY */
+	} args;
+};
+
+/* Identifier of an MP process. */
+struct mlx5_mp_id {
+	char name[RTE_MP_MAX_NAME_LEN];
+	uint16_t port_id;
+};
+
+/** Request timeout for IPC. */
+#define MLX5_MP_REQ_TIMEOUT_SEC 5
+
+/**
+ * Initialize IPC message.
+ *
+ * @param[in] mp_id
+ *   ID of the MP process (IPC name and port ID).
+ * @param[out] msg
+ *   Pointer to message to fill in.
+ * @param[in] type
+ *   Message type.
+ */
+static inline void
+mp_init_msg(struct mlx5_mp_id *mp_id, struct rte_mp_msg *msg,
+	    enum mlx5_mp_req_type type)
+{
+	struct mlx5_mp_param *param = (struct mlx5_mp_param *)msg->param;
+
+	memset(msg, 0, sizeof(*msg));
+	strlcpy(msg->name, mp_id->name, sizeof(msg->name));
+	msg->len_param = sizeof(*param);
+	param->type = type;
+	param->port_id = mp_id->port_id;
+}
+
+__rte_experimental
+int mlx5_mp_init_primary(const char *name, const rte_mp_t primary_action);
+__rte_experimental
+void mlx5_mp_uninit_primary(const char *name);
+__rte_experimental
+int mlx5_mp_init_secondary(const char *name, const rte_mp_t secondary_action);
+__rte_experimental
+void mlx5_mp_uninit_secondary(const char *name);
+__rte_experimental
+int mlx5_mp_req_mr_create(struct mlx5_mp_id *mp_id, uintptr_t addr);
+__rte_experimental
+int mlx5_mp_req_queue_state_modify(struct mlx5_mp_id *mp_id,
+				   struct mlx5_mp_arg_queue_state_modify *sm);
+__rte_experimental
+int mlx5_mp_req_verbs_cmd_fd(struct mlx5_mp_id *mp_id);
+
+#endif /* RTE_PMD_MLX5_COMMON_MP_H_ */
diff --git a/drivers/common/mlx5/rte_common_mlx5_version.map b/drivers/common/mlx5/rte_common_mlx5_version.map
index aede2a0a51..265703d1c9 100644
--- a/drivers/common/mlx5/rte_common_mlx5_version.map
+++ b/drivers/common/mlx5/rte_common_mlx5_version.map
@@ -48,4 +48,17 @@ DPDK_20.0.1 {
 	mlx5_nl_vlan_vmwa_delete;
 
 	mlx5_translate_port_name;
+
+};
+
+EXPERIMENTAL {
+        global:
+
+	mlx5_mp_init_primary;
+	mlx5_mp_uninit_primary;
+	mlx5_mp_init_secondary;
+	mlx5_mp_uninit_secondary;
+	mlx5_mp_req_mr_create;
+	mlx5_mp_req_queue_state_modify;
+	mlx5_mp_req_verbs_cmd_fd;
 };
-- 
2.16.6


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [dpdk-dev] [PATCH v2 2/4] net/mlx5: modify net pmd to use common multi-process APIs
  2020-04-07 16:48 ` [dpdk-dev] [PATCH v2 0/4] refactor multi-process IPC and memory management codes to common driver Vu Pham
  2020-04-07 16:48   ` [dpdk-dev] [PATCH v2 1/4] common/mlx5: refactor MP IPC handling " Vu Pham
@ 2020-04-07 16:48   ` Vu Pham
  2020-04-07 16:48   ` [dpdk-dev] [PATCH v2 3/4] common/mlx5: refactor memory management codes Vu Pham
  2020-04-07 16:48   ` [dpdk-dev] [PATCH v2 4/4] net/mlx5: modify net pmd to use common MR driver Vu Pham
  3 siblings, 0 replies; 26+ messages in thread
From: Vu Pham @ 2020-04-07 16:48 UTC (permalink / raw)
  To: dev; +Cc: viacheslavo, orika, matan, rasland, Vu Pham

Modify the net PMD to use the common multi-process APIs from the
common driver.
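
The per-port change boils down to carrying an mp_id tuple instead of
the ethdev pointer. A sketch of the resulting call pattern (assuming
the common APIs above; the example_ wrapper is illustrative):

#include <stdint.h>
#include <rte_string_fns.h>
#include <mlx5_common_mp.h>

/* Build the mp_id tuple a port now carries and issue a queue-state
 * request through it; a secondary process relays it to the primary. */
static int
example_modify_queue(uint16_t port_id,
		     struct mlx5_mp_arg_queue_state_modify *sm)
{
	struct mlx5_mp_id mp_id;

	mp_id.port_id = port_id;
	strlcpy(mp_id.name, "net_mlx5_mp", RTE_MP_MAX_NAME_LEN);
	return mlx5_mp_req_queue_state_modify(&mp_id, sm);
}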

Signed-off-by: Vu Pham <vuhuong@mellanox.com>
---
 drivers/common/mlx5/Makefile    |   3 +-
 drivers/common/mlx5/meson.build |   1 +
 drivers/net/mlx5/mlx5.c         |  15 ++-
 drivers/net/mlx5/mlx5.h         |  43 +-------
 drivers/net/mlx5/mlx5_mp.c      | 234 +++-------------------------------------
 drivers/net/mlx5/mlx5_mr.c      |   2 +-
 drivers/net/mlx5/mlx5_rxtx.c    |   3 +-
 7 files changed, 37 insertions(+), 264 deletions(-)

diff --git a/drivers/common/mlx5/Makefile b/drivers/common/mlx5/Makefile
index f32933d592..2a88492731 100644
--- a/drivers/common/mlx5/Makefile
+++ b/drivers/common/mlx5/Makefile
@@ -17,6 +17,7 @@ endif
 SRCS-y += mlx5_devx_cmds.c
 SRCS-y += mlx5_common.c
 SRCS-y += mlx5_nl.c
+SRCS-y += mlx5_common_mp.c
 ifeq ($(CONFIG_RTE_IBVERBS_LINK_DLOPEN),y)
 INSTALL-y-lib += $(LIB_GLUE)
 endif
@@ -46,7 +47,7 @@ endif
 LDLIBS += -lrte_eal -lrte_pci -lrte_kvargs -lrte_net
 
 # A few warnings cannot be avoided in external headers.
-CFLAGS += -Wno-error=cast-qual -UPEDANTIC
+CFLAGS += -Wno-error=cast-qual -UPEDANTIC -DALLOW_EXPERIMENTAL_API
 
 EXPORT_MAP := rte_common_mlx5_version.map
 
diff --git a/drivers/common/mlx5/meson.build b/drivers/common/mlx5/meson.build
index f671710714..83671861c9 100644
--- a/drivers/common/mlx5/meson.build
+++ b/drivers/common/mlx5/meson.build
@@ -55,6 +55,7 @@ sources = files(
 	'mlx5_devx_cmds.c',
 	'mlx5_common.c',
 	'mlx5_nl.c',
+	'mlx5_common_mp.c',
 )
 if not dlopen_ibverbs
 	sources += files('mlx5_glue.c')
diff --git a/drivers/net/mlx5/mlx5.c b/drivers/net/mlx5/mlx5.c
index 6a11b141da..9eac8011f3 100644
--- a/drivers/net/mlx5/mlx5.c
+++ b/drivers/net/mlx5/mlx5.c
@@ -38,6 +38,7 @@
 #include <mlx5_glue.h>
 #include <mlx5_devx_cmds.h>
 #include <mlx5_common.h>
+#include <mlx5_common_mp.h>
 
 #include "mlx5_defs.h"
 #include "mlx5.h"
@@ -1714,7 +1715,8 @@ mlx5_init_once(void)
 		rte_rwlock_init(&sd->mem_event_rwlock);
 		rte_mem_event_callback_register("MLX5_MEM_EVENT_CB",
 						mlx5_mr_mem_event_cb, NULL);
-		ret = mlx5_mp_init_primary();
+		ret = mlx5_mp_init_primary(MLX5_MP_NAME,
+					   mlx5_mp_primary_handle);
 		if (ret)
 			goto out;
 		sd->init_done = true;
@@ -1722,7 +1724,8 @@ mlx5_init_once(void)
 	case RTE_PROC_SECONDARY:
 		if (ld->init_done)
 			break;
-		ret = mlx5_mp_init_secondary();
+		ret = mlx5_mp_init_secondary(MLX5_MP_NAME,
+					     mlx5_mp_secondary_handle);
 		if (ret)
 			goto out;
 		++sd->secondary_cnt;
@@ -2197,6 +2200,8 @@ mlx5_dev_spawn(struct rte_device *dpdk_dev,
 	}
 	DRV_LOG(DEBUG, "naming Ethernet device \"%s\"", name);
 	if (rte_eal_process_type() == RTE_PROC_SECONDARY) {
+		struct mlx5_mp_id mp_id;
+
 		eth_dev = rte_eth_dev_attach_secondary(name);
 		if (eth_dev == NULL) {
 			DRV_LOG(ERR, "can not attach rte ethdev");
@@ -2208,8 +2213,10 @@ mlx5_dev_spawn(struct rte_device *dpdk_dev,
 		err = mlx5_proc_priv_init(eth_dev);
 		if (err)
 			return NULL;
+		mp_id.port_id = eth_dev->data->port_id;
+		strlcpy(mp_id.name, MLX5_MP_NAME, RTE_MP_MAX_NAME_LEN);
 		/* Receive command fd from primary process */
-		err = mlx5_mp_req_verbs_cmd_fd(eth_dev);
+		err = mlx5_mp_req_verbs_cmd_fd(&mp_id);
 		if (err < 0)
 			return NULL;
 		/* Remap UAR for Tx queues. */
@@ -2373,6 +2380,8 @@ mlx5_dev_spawn(struct rte_device *dpdk_dev,
 	priv->ibv_port = spawn->ibv_port;
 	priv->pci_dev = spawn->pci_dev;
 	priv->mtu = RTE_ETHER_MTU;
+	priv->mp_id.port_id = port_id;
+	strlcpy(priv->mp_id.name, MLX5_MP_NAME, RTE_MP_MAX_NAME_LEN);
 #ifndef RTE_ARCH_64
 	/* Initialize UAR access locks for 32bit implementations. */
 	rte_spinlock_init(&priv->uar_lock_cq);
diff --git a/drivers/net/mlx5/mlx5.h b/drivers/net/mlx5/mlx5.h
index 34ab4758b1..9e15600afd 100644
--- a/drivers/net/mlx5/mlx5.h
+++ b/drivers/net/mlx5/mlx5.h
@@ -36,43 +36,13 @@
 #include <mlx5_devx_cmds.h>
 #include <mlx5_prm.h>
 #include <mlx5_nl.h>
+#include <mlx5_common_mp.h>
 
 #include "mlx5_defs.h"
 #include "mlx5_utils.h"
 #include "mlx5_mr.h"
 #include "mlx5_autoconf.h"
 
-/* Request types for IPC. */
-enum mlx5_mp_req_type {
-	MLX5_MP_REQ_VERBS_CMD_FD = 1,
-	MLX5_MP_REQ_CREATE_MR,
-	MLX5_MP_REQ_START_RXTX,
-	MLX5_MP_REQ_STOP_RXTX,
-	MLX5_MP_REQ_QUEUE_STATE_MODIFY,
-};
-
-struct mlx5_mp_arg_queue_state_modify {
-	uint8_t is_wq; /* Set if WQ. */
-	uint16_t queue_id; /* DPDK queue ID. */
-	enum ibv_wq_state state; /* WQ requested state. */
-};
-
-/* Pameters for IPC. */
-struct mlx5_mp_param {
-	enum mlx5_mp_req_type type;
-	int port_id;
-	int result;
-	RTE_STD_C11
-	union {
-		uintptr_t addr; /* MLX5_MP_REQ_CREATE_MR */
-		struct mlx5_mp_arg_queue_state_modify state_modify;
-		/* MLX5_MP_REQ_QUEUE_STATE_MODIFY */
-	} args;
-};
-
-/** Request timeout for IPC. */
-#define MLX5_MP_REQ_TIMEOUT_SEC 5
-
 /** Key string for IPC. */
 #define MLX5_MP_NAME "net_mlx5_mp"
 
@@ -561,6 +531,7 @@ struct mlx5_priv {
 #endif
 	uint8_t skip_default_rss_reta; /* Skip configuration of default reta. */
 	uint8_t fdb_def_rule; /* Whether fdb jump to table 1 is configured. */
+	struct mlx5_mp_id mp_id; /* ID of a multi-process process */
 };
 
 #define PORT_ID(priv) ((priv)->dev_data->port_id)
@@ -761,16 +732,10 @@ int mlx5_flow_dev_dump(struct rte_eth_dev *dev, FILE *file,
 		       struct rte_flow_error *error);
 
 /* mlx5_mp.c */
+int mlx5_mp_primary_handle(const struct rte_mp_msg *mp_msg, const void *peer);
+int mlx5_mp_secondary_handle(const struct rte_mp_msg *mp_msg, const void *peer);
 void mlx5_mp_req_start_rxtx(struct rte_eth_dev *dev);
 void mlx5_mp_req_stop_rxtx(struct rte_eth_dev *dev);
-int mlx5_mp_req_mr_create(struct rte_eth_dev *dev, uintptr_t addr);
-int mlx5_mp_req_verbs_cmd_fd(struct rte_eth_dev *dev);
-int mlx5_mp_req_queue_state_modify(struct rte_eth_dev *dev,
-				   struct mlx5_mp_arg_queue_state_modify *sm);
-int mlx5_mp_init_primary(void);
-void mlx5_mp_uninit_primary(void);
-int mlx5_mp_init_secondary(void);
-void mlx5_mp_uninit_secondary(void);
 
 /* mlx5_socket.c */
 
diff --git a/drivers/net/mlx5/mlx5_mp.c b/drivers/net/mlx5/mlx5_mp.c
index 55d408fe95..43684dbc3a 100644
--- a/drivers/net/mlx5/mlx5_mp.c
+++ b/drivers/net/mlx5/mlx5_mp.c
@@ -10,46 +10,14 @@
 #include <rte_ethdev_driver.h>
 #include <rte_string_fns.h>
 
+#include <mlx5_common_mp.h>
+
 #include "mlx5.h"
 #include "mlx5_rxtx.h"
 #include "mlx5_utils.h"
 
-/**
- * Initialize IPC message.
- *
- * @param[in] dev
- *   Pointer to Ethernet structure.
- * @param[out] msg
- *   Pointer to message to fill in.
- * @param[in] type
- *   Message type.
- */
-static inline void
-mp_init_msg(struct rte_eth_dev *dev, struct rte_mp_msg *msg,
-	    enum mlx5_mp_req_type type)
-{
-	struct mlx5_mp_param *param = (struct mlx5_mp_param *)msg->param;
-
-	memset(msg, 0, sizeof(*msg));
-	strlcpy(msg->name, MLX5_MP_NAME, sizeof(msg->name));
-	msg->len_param = sizeof(*param);
-	param->type = type;
-	param->port_id = dev->data->port_id;
-}
-
-/**
- * IPC message handler of primary process.
- *
- * @param[in] dev
- *   Pointer to Ethernet structure.
- * @param[in] peer
- *   Pointer to the peer socket path.
- *
- * @return
- *   0 on success, a negative errno value otherwise and rte_errno is set.
- */
-static int
-mp_primary_handle(const struct rte_mp_msg *mp_msg, const void *peer)
+int
+mlx5_mp_primary_handle(const struct rte_mp_msg *mp_msg, const void *peer)
 {
 	struct rte_mp_msg mp_res;
 	struct mlx5_mp_param *res = (struct mlx5_mp_param *)mp_res.param;
@@ -71,21 +39,21 @@ mp_primary_handle(const struct rte_mp_msg *mp_msg, const void *peer)
 	priv = dev->data->dev_private;
 	switch (param->type) {
 	case MLX5_MP_REQ_CREATE_MR:
-		mp_init_msg(dev, &mp_res, param->type);
+		mp_init_msg(&priv->mp_id, &mp_res, param->type);
 		lkey = mlx5_mr_create_primary(dev, &entry, param->args.addr);
 		if (lkey == UINT32_MAX)
 			res->result = -rte_errno;
 		ret = rte_mp_reply(&mp_res, peer);
 		break;
 	case MLX5_MP_REQ_VERBS_CMD_FD:
-		mp_init_msg(dev, &mp_res, param->type);
+		mp_init_msg(&priv->mp_id, &mp_res, param->type);
 		mp_res.num_fds = 1;
 		mp_res.fds[0] = priv->sh->ctx->cmd_fd;
 		res->result = 0;
 		ret = rte_mp_reply(&mp_res, peer);
 		break;
 	case MLX5_MP_REQ_QUEUE_STATE_MODIFY:
-		mp_init_msg(dev, &mp_res, param->type);
+		mp_init_msg(&priv->mp_id, &mp_res, param->type);
 		res->result = mlx5_queue_state_modify_primary
 					(dev, &param->args.state_modify);
 		ret = rte_mp_reply(&mp_res, peer);
@@ -110,14 +78,15 @@ mp_primary_handle(const struct rte_mp_msg *mp_msg, const void *peer)
  * @return
  *   0 on success, a negative errno value otherwise and rte_errno is set.
  */
-static int
-mp_secondary_handle(const struct rte_mp_msg *mp_msg, const void *peer)
+int
+mlx5_mp_secondary_handle(const struct rte_mp_msg *mp_msg, const void *peer)
 {
 	struct rte_mp_msg mp_res;
 	struct mlx5_mp_param *res = (struct mlx5_mp_param *)mp_res.param;
 	const struct mlx5_mp_param *param =
 		(const struct mlx5_mp_param *)mp_msg->param;
 	struct rte_eth_dev *dev;
+	struct mlx5_priv *priv;
 	int ret;
 
 	MLX5_ASSERT(rte_eal_process_type() == RTE_PROC_SECONDARY);
@@ -127,13 +96,14 @@ mp_secondary_handle(const struct rte_mp_msg *mp_msg, const void *peer)
 		return -rte_errno;
 	}
 	dev = &rte_eth_devices[param->port_id];
+	priv = dev->data->dev_private;
 	switch (param->type) {
 	case MLX5_MP_REQ_START_RXTX:
 		DRV_LOG(INFO, "port %u starting datapath", dev->data->port_id);
 		rte_mb();
 		dev->rx_pkt_burst = mlx5_select_rx_function(dev);
 		dev->tx_pkt_burst = mlx5_select_tx_function(dev);
-		mp_init_msg(dev, &mp_res, param->type);
+		mp_init_msg(&priv->mp_id, &mp_res, param->type);
 		res->result = 0;
 		ret = rte_mp_reply(&mp_res, peer);
 		break;
@@ -142,7 +112,7 @@ mp_secondary_handle(const struct rte_mp_msg *mp_msg, const void *peer)
 		dev->rx_pkt_burst = removed_rx_burst;
 		dev->tx_pkt_burst = removed_tx_burst;
 		rte_mb();
-		mp_init_msg(dev, &mp_res, param->type);
+		mp_init_msg(&priv->mp_id, &mp_res, param->type);
 		res->result = 0;
 		ret = rte_mp_reply(&mp_res, peer);
 		break;
@@ -171,6 +141,7 @@ mp_req_on_rxtx(struct rte_eth_dev *dev, enum mlx5_mp_req_type type)
 	struct rte_mp_reply mp_rep;
 	struct mlx5_mp_param *res;
 	struct timespec ts = {.tv_sec = MLX5_MP_REQ_TIMEOUT_SEC, .tv_nsec = 0};
+	struct mlx5_priv *priv = dev->data->dev_private;
 	int ret;
 	int i;
 
@@ -182,7 +153,7 @@ mp_req_on_rxtx(struct rte_eth_dev *dev, enum mlx5_mp_req_type type)
 			dev->data->port_id, type);
 		return;
 	}
-	mp_init_msg(dev, &mp_req, type);
+	mp_init_msg(&priv->mp_id, &mp_req, type);
 	ret = rte_mp_request_sync(&mp_req, &mp_rep, &ts);
 	if (ret) {
 		if (rte_errno != ENOTSUP)
@@ -234,178 +205,3 @@ mlx5_mp_req_stop_rxtx(struct rte_eth_dev *dev)
 {
 	mp_req_on_rxtx(dev, MLX5_MP_REQ_STOP_RXTX);
 }
-
-/**
- * Request Memory Region creation to the primary process.
- *
- * @param[in] dev
- *   Pointer to Ethernet structure.
- * @param addr
- *   Target virtual address to register.
- *
- * @return
- *   0 on success, a negative errno value otherwise and rte_errno is set.
- */
-int
-mlx5_mp_req_mr_create(struct rte_eth_dev *dev, uintptr_t addr)
-{
-	struct rte_mp_msg mp_req;
-	struct rte_mp_msg *mp_res;
-	struct rte_mp_reply mp_rep;
-	struct mlx5_mp_param *req = (struct mlx5_mp_param *)mp_req.param;
-	struct mlx5_mp_param *res;
-	struct timespec ts = {.tv_sec = MLX5_MP_REQ_TIMEOUT_SEC, .tv_nsec = 0};
-	int ret;
-
-	MLX5_ASSERT(rte_eal_process_type() == RTE_PROC_SECONDARY);
-	mp_init_msg(dev, &mp_req, MLX5_MP_REQ_CREATE_MR);
-	req->args.addr = addr;
-	ret = rte_mp_request_sync(&mp_req, &mp_rep, &ts);
-	if (ret) {
-		DRV_LOG(ERR, "port %u request to primary process failed",
-			dev->data->port_id);
-		return -rte_errno;
-	}
-	MLX5_ASSERT(mp_rep.nb_received == 1);
-	mp_res = &mp_rep.msgs[0];
-	res = (struct mlx5_mp_param *)mp_res->param;
-	ret = res->result;
-	if (ret)
-		rte_errno = -ret;
-	free(mp_rep.msgs);
-	return ret;
-}
-
-/**
- * Request Verbs queue state modification to the primary process.
- *
- * @param[in] dev
- *   Pointer to Ethernet structure.
- * @param sm
- *   State modify parameters.
- *
- * @return
- *   0 on success, a negative errno value otherwise and rte_errno is set.
- */
-int
-mlx5_mp_req_queue_state_modify(struct rte_eth_dev *dev,
-			       struct mlx5_mp_arg_queue_state_modify *sm)
-{
-	struct rte_mp_msg mp_req;
-	struct rte_mp_msg *mp_res;
-	struct rte_mp_reply mp_rep;
-	struct mlx5_mp_param *req = (struct mlx5_mp_param *)mp_req.param;
-	struct mlx5_mp_param *res;
-	struct timespec ts = {.tv_sec = MLX5_MP_REQ_TIMEOUT_SEC, .tv_nsec = 0};
-	int ret;
-
-	MLX5_ASSERT(rte_eal_process_type() == RTE_PROC_SECONDARY);
-	mp_init_msg(dev, &mp_req, MLX5_MP_REQ_QUEUE_STATE_MODIFY);
-	req->args.state_modify = *sm;
-	ret = rte_mp_request_sync(&mp_req, &mp_rep, &ts);
-	if (ret) {
-		DRV_LOG(ERR, "port %u request to primary process failed",
-			dev->data->port_id);
-		return -rte_errno;
-	}
-	MLX5_ASSERT(mp_rep.nb_received == 1);
-	mp_res = &mp_rep.msgs[0];
-	res = (struct mlx5_mp_param *)mp_res->param;
-	ret = res->result;
-	free(mp_rep.msgs);
-	return ret;
-}
-
-/**
- * Request Verbs command file descriptor for mmap to the primary process.
- *
- * @param[in] dev
- *   Pointer to Ethernet structure.
- *
- * @return
- *   fd on success, a negative errno value otherwise and rte_errno is set.
- */
-int
-mlx5_mp_req_verbs_cmd_fd(struct rte_eth_dev *dev)
-{
-	struct rte_mp_msg mp_req;
-	struct rte_mp_msg *mp_res;
-	struct rte_mp_reply mp_rep;
-	struct mlx5_mp_param *res;
-	struct timespec ts = {.tv_sec = MLX5_MP_REQ_TIMEOUT_SEC, .tv_nsec = 0};
-	int ret;
-
-	MLX5_ASSERT(rte_eal_process_type() == RTE_PROC_SECONDARY);
-	mp_init_msg(dev, &mp_req, MLX5_MP_REQ_VERBS_CMD_FD);
-	ret = rte_mp_request_sync(&mp_req, &mp_rep, &ts);
-	if (ret) {
-		DRV_LOG(ERR, "port %u request to primary process failed",
-			dev->data->port_id);
-		return -rte_errno;
-	}
-	MLX5_ASSERT(mp_rep.nb_received == 1);
-	mp_res = &mp_rep.msgs[0];
-	res = (struct mlx5_mp_param *)mp_res->param;
-	if (res->result) {
-		rte_errno = -res->result;
-		DRV_LOG(ERR,
-			"port %u failed to get command FD from primary process",
-			dev->data->port_id);
-		ret = -rte_errno;
-		goto exit;
-	}
-	MLX5_ASSERT(mp_res->num_fds == 1);
-	ret = mp_res->fds[0];
-	DRV_LOG(DEBUG, "port %u command FD from primary is %d",
-		dev->data->port_id, ret);
-exit:
-	free(mp_rep.msgs);
-	return ret;
-}
-
-/**
- * Initialize by primary process.
- */
-int
-mlx5_mp_init_primary(void)
-{
-	int ret;
-
-	MLX5_ASSERT(rte_eal_process_type() == RTE_PROC_PRIMARY);
-
-	/* primary is allowed to not support IPC */
-	ret = rte_mp_action_register(MLX5_MP_NAME, mp_primary_handle);
-	if (ret && rte_errno != ENOTSUP)
-		return -1;
-	return 0;
-}
-
-/**
- * Un-initialize by primary process.
- */
-void
-mlx5_mp_uninit_primary(void)
-{
-	MLX5_ASSERT(rte_eal_process_type() == RTE_PROC_PRIMARY);
-	rte_mp_action_unregister(MLX5_MP_NAME);
-}
-
-/**
- * Initialize by secondary process.
- */
-int
-mlx5_mp_init_secondary(void)
-{
-	MLX5_ASSERT(rte_eal_process_type() == RTE_PROC_SECONDARY);
-	return rte_mp_action_register(MLX5_MP_NAME, mp_secondary_handle);
-}
-
-/**
- * Un-initialize by secondary process.
- */
-void
-mlx5_mp_uninit_secondary(void)
-{
-	MLX5_ASSERT(rte_eal_process_type() == RTE_PROC_SECONDARY);
-	rte_mp_action_unregister(MLX5_MP_NAME);
-}
diff --git a/drivers/net/mlx5/mlx5_mr.c b/drivers/net/mlx5/mlx5_mr.c
index a8f185a208..9151992a72 100644
--- a/drivers/net/mlx5/mlx5_mr.c
+++ b/drivers/net/mlx5/mlx5_mr.c
@@ -540,7 +540,7 @@ mlx5_mr_create_secondary(struct rte_eth_dev *dev, struct mlx5_mr_cache *entry,
 
 	DEBUG("port %u requesting MR creation for address (%p)",
 	      dev->data->port_id, (void *)addr);
-	ret = mlx5_mp_req_mr_create(dev, addr);
+	ret = mlx5_mp_req_mr_create(&priv->mp_id, addr);
 	if (ret) {
 		DEBUG("port %u fail to request MR creation for address (%p)",
 		      dev->data->port_id, (void *)addr);
diff --git a/drivers/net/mlx5/mlx5_rxtx.c b/drivers/net/mlx5/mlx5_rxtx.c
index f3bf763769..fc7591c2b0 100644
--- a/drivers/net/mlx5/mlx5_rxtx.c
+++ b/drivers/net/mlx5/mlx5_rxtx.c
@@ -1000,6 +1000,7 @@ static int
 mlx5_queue_state_modify(struct rte_eth_dev *dev,
 			struct mlx5_mp_arg_queue_state_modify *sm)
 {
+	struct mlx5_priv *priv = dev->data->dev_private;
 	int ret = 0;
 
 	switch (rte_eal_process_type()) {
@@ -1007,7 +1008,7 @@ mlx5_queue_state_modify(struct rte_eth_dev *dev,
 		ret = mlx5_queue_state_modify_primary(dev, sm);
 		break;
 	case RTE_PROC_SECONDARY:
-		ret = mlx5_mp_req_queue_state_modify(dev, sm);
+		ret = mlx5_mp_req_queue_state_modify(&priv->mp_id, sm);
 		break;
 	default:
 		break;
-- 
2.16.6


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [dpdk-dev] [PATCH v2 3/4] common/mlx5: refactor memory management codes
  2020-04-07 16:48 ` [dpdk-dev] [PATCH v2 0/4] refactor multi-process IPC and memory management codes to common driver Vu Pham
  2020-04-07 16:48   ` [dpdk-dev] [PATCH v2 1/4] common/mlx5: refactor MP IPC handling " Vu Pham
  2020-04-07 16:48   ` [dpdk-dev] [PATCH v2 2/4] net/mlx5: modify net pmd to use common multi-process APIs Vu Pham
@ 2020-04-07 16:48   ` Vu Pham
  2020-04-07 16:48   ` [dpdk-dev] [PATCH v2 4/4] net/mlx5: modify net pmd to use common MR driver Vu Pham
  3 siblings, 0 replies; 26+ messages in thread
From: Vu Pham @ 2020-04-07 16:48 UTC (permalink / raw)
  To: dev; +Cc: viacheslavo, orika, matan, rasland, Vu Pham

Refactor the common memory B-tree and MR cache management code into
the common driver. Replace some input parameters of the MR APIs with
more generic data structures such as PD, port_id, and share_cache so
that multiple PMDs can use these MR APIs.
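
A sketch of how a PMD can drive the relocated API (assuming the
mlx5_common_mr.h declarations added below; the example_ wrapper and
its locking mirror the driver's usage but are illustrative):

#include <stdint.h>
#include <rte_rwlock.h>
#include <mlx5_common_mr.h>

/* Look an address up in the shared cache; on miss, register a new MR
 * around it (primary-process path). */
static uint32_t
example_addr2lkey(struct ibv_pd *pd,
		  struct mlx5_mr_share_cache *share_cache,
		  uintptr_t addr, unsigned int mr_ext_memseg_en)
{
	struct mr_cache_entry entry;
	uint32_t lkey;

	rte_rwlock_read_lock(&share_cache->rwlock);
	lkey = mlx5_mr_lookup_cache(share_cache, &entry, addr);
	rte_rwlock_read_unlock(&share_cache->rwlock);
	if (lkey != UINT32_MAX)
		return lkey;
	return mlx5_mr_create_primary(pd, share_cache, &entry, addr,
				      mr_ext_memseg_en);
}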

Signed-off-by: Vu Pham <vuhuong@mellanox.com>
---
 drivers/common/mlx5/mlx5_common_mr.c            | 1108 +++++++++++++++++++++++
 drivers/common/mlx5/mlx5_common_mr.h            |  160 ++++
 drivers/common/mlx5/rte_common_mlx5_version.map |   14 +
 3 files changed, 1282 insertions(+)
 create mode 100644 drivers/common/mlx5/mlx5_common_mr.c
 create mode 100644 drivers/common/mlx5/mlx5_common_mr.h

diff --git a/drivers/common/mlx5/mlx5_common_mr.c b/drivers/common/mlx5/mlx5_common_mr.c
new file mode 100644
index 0000000000..9d4a06dd5b
--- /dev/null
+++ b/drivers/common/mlx5/mlx5_common_mr.c
@@ -0,0 +1,1108 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright 2016 6WIND S.A.
+ * Copyright 2020 Mellanox Technologies, Ltd
+ */
+#include <rte_eal_memconfig.h>
+#include <rte_errno.h>
+#include <rte_mempool.h>
+#include <rte_malloc.h>
+#include <rte_rwlock.h>
+
+#include "mlx5_glue.h"
+#include "mlx5_common_mp.h"
+#include "mlx5_common_mr.h"
+#include "mlx5_common_utils.h"
+
+struct mr_find_contig_memsegs_data {
+	uintptr_t addr;
+	uintptr_t start;
+	uintptr_t end;
+	const struct rte_memseg_list *msl;
+};
+
+/**
+ * Expand B-tree table to a given size. Can't be called while holding
+ * memory_hotplug_lock or share_cache.rwlock due to rte_realloc().
+ *
+ * @param bt
+ *   Pointer to B-tree structure.
+ * @param n
+ *   Number of entries for expansion.
+ *
+ * @return
+ *   0 on success, -1 on failure.
+ */
+static int
+mr_btree_expand(struct mlx5_mr_btree *bt, int n)
+{
+	void *mem;
+	int ret = 0;
+
+	if (n <= bt->size)
+		return ret;
+	/*
+	 * Downside of directly using rte_realloc() is that SOCKET_ID_ANY is
+	 * used inside if there's no room to expand. Because this is a quite
+	 * rare case and part of a very slow path, it is acceptable.
+	 * Initially cache_bh[] will be given practically enough space and once
+	 * it is expanded, expansion wouldn't be needed again ever.
+	 */
+	mem = rte_realloc(bt->table, n * sizeof(struct mr_cache_entry), 0);
+	if (mem == NULL) {
+		/* Not an error, B-tree search will be skipped. */
+		DRV_LOG(WARNING, "failed to expand MR B-tree (%p) table",
+			(void *)bt);
+		ret = -1;
+	} else {
+		DRV_LOG(DEBUG, "expanded MR B-tree table (size=%u)", n);
+		bt->table = mem;
+		bt->size = n;
+	}
+	return ret;
+}
+
+/**
+ * Look up LKey from given B-tree lookup table, store the last index and return
+ * searched LKey.
+ *
+ * @param bt
+ *   Pointer to B-tree structure.
+ * @param[out] idx
+ *   Pointer to index. Even on search failure, returns index where it stops
+ *   searching so that index can be used when inserting a new entry.
+ * @param addr
+ *   Search key.
+ *
+ * @return
+ *   Searched LKey on success, UINT32_MAX on no match.
+ */
+static uint32_t
+mr_btree_lookup(struct mlx5_mr_btree *bt, uint16_t *idx, uintptr_t addr)
+{
+	struct mr_cache_entry *lkp_tbl;
+	uint16_t n;
+	uint16_t base = 0;
+
+	MLX5_ASSERT(bt != NULL);
+	lkp_tbl = *bt->table;
+	n = bt->len;
+	/* First entry must be NULL for comparison. */
+	MLX5_ASSERT(bt->len > 0 || (lkp_tbl[0].start == 0 &&
+				    lkp_tbl[0].lkey == UINT32_MAX));
+	/* Binary search. */
+	do {
+		register uint16_t delta = n >> 1;
+
+		if (addr < lkp_tbl[base + delta].start) {
+			n = delta;
+		} else {
+			base += delta;
+			n -= delta;
+		}
+	} while (n > 1);
+	MLX5_ASSERT(addr >= lkp_tbl[base].start);
+	*idx = base;
+	if (addr < lkp_tbl[base].end)
+		return lkp_tbl[base].lkey;
+	/* Not found. */
+	return UINT32_MAX;
+}
+
+/**
+ * Insert an entry to B-tree lookup table.
+ *
+ * @param bt
+ *   Pointer to B-tree structure.
+ * @param entry
+ *   Pointer to new entry to insert.
+ *
+ * @return
+ *   0 on success, -1 on failure.
+ */
+static int
+mr_btree_insert(struct mlx5_mr_btree *bt, struct mr_cache_entry *entry)
+{
+	struct mr_cache_entry *lkp_tbl;
+	uint16_t idx = 0;
+	size_t shift;
+
+	MLX5_ASSERT(bt != NULL);
+	MLX5_ASSERT(bt->len <= bt->size);
+	MLX5_ASSERT(bt->len > 0);
+	lkp_tbl = *bt->table;
+	/* Find out the slot for insertion. */
+	if (mr_btree_lookup(bt, &idx, entry->start) != UINT32_MAX) {
+		DRV_LOG(DEBUG,
+			"abort insertion to B-tree(%p): already exist at"
+			" idx=%u [0x%" PRIxPTR ", 0x%" PRIxPTR ") lkey=0x%x",
+			(void *)bt, idx, entry->start, entry->end, entry->lkey);
+		/* Already exist, return. */
+		return 0;
+	}
+	/* If table is full, return error. */
+	if (unlikely(bt->len == bt->size)) {
+		bt->overflow = 1;
+		return -1;
+	}
+	/* Insert entry. */
+	++idx;
+	shift = (bt->len - idx) * sizeof(struct mr_cache_entry);
+	if (shift)
+		memmove(&lkp_tbl[idx + 1], &lkp_tbl[idx], shift);
+	lkp_tbl[idx] = *entry;
+	bt->len++;
+	DRV_LOG(DEBUG,
+		"inserted B-tree(%p)[%u],"
+		" [0x%" PRIxPTR ", 0x%" PRIxPTR ") lkey=0x%x",
+		(void *)bt, idx, entry->start, entry->end, entry->lkey);
+	return 0;
+}
+
+/**
+ * Initialize B-tree and allocate memory for lookup table.
+ *
+ * @param bt
+ *   Pointer to B-tree structure.
+ * @param n
+ *   Number of entries to allocate.
+ * @param socket
+ *   NUMA socket on which memory must be allocated.
+ *
+ * @return
+ *   0 on success, a negative errno value otherwise and rte_errno is set.
+ */
+int
+mlx5_mr_btree_init(struct mlx5_mr_btree *bt, int n, int socket)
+{
+	if (bt == NULL) {
+		rte_errno = EINVAL;
+		return -rte_errno;
+	}
+	MLX5_ASSERT(!bt->table && !bt->size);
+	memset(bt, 0, sizeof(*bt));
+	bt->table = rte_calloc_socket("B-tree table",
+				      n, sizeof(struct mr_cache_entry),
+				      0, socket);
+	if (bt->table == NULL) {
+		rte_errno = ENOMEM;
+		DEBUG("failed to allocate memory for btree cache on socket %d",
+		      socket);
+		return -rte_errno;
+	}
+	bt->size = n;
+	/* First entry must be NULL for binary search. */
+	(*bt->table)[bt->len++] = (struct mr_cache_entry) {
+		.lkey = UINT32_MAX,
+	};
+	DEBUG("initialized B-tree %p with table %p",
+	      (void *)bt, (void *)bt->table);
+	return 0;
+}
+
+/**
+ * Free B-tree resources.
+ *
+ * @param bt
+ *   Pointer to B-tree structure.
+ */
+void
+mlx5_mr_btree_free(struct mlx5_mr_btree *bt)
+{
+	if (bt == NULL)
+		return;
+	DEBUG("freeing B-tree %p with table %p",
+	      (void *)bt, (void *)bt->table);
+	rte_free(bt->table);
+	memset(bt, 0, sizeof(*bt));
+}
+
+/**
+ * Dump all the entries in a B-tree
+ *
+ * @param bt
+ *   Pointer to B-tree structure.
+ */
+void
+mlx5_mr_btree_dump(struct mlx5_mr_btree *bt __rte_unused)
+{
+#ifdef RTE_LIBRTE_MLX5_DEBUG
+	int idx;
+	struct mr_cache_entry *lkp_tbl;
+
+	if (bt == NULL)
+		return;
+	lkp_tbl = *bt->table;
+	for (idx = 0; idx < bt->len; ++idx) {
+		struct mr_cache_entry *entry = &lkp_tbl[idx];
+
+		DEBUG("B-tree(%p)[%u],"
+		      " [0x%" PRIxPTR ", 0x%" PRIxPTR ") lkey=0x%x",
+		      (void *)bt, idx, entry->start, entry->end, entry->lkey);
+	}
+#endif
+}
+
+/**
+ * Find virtually contiguous memory chunk in a given MR.
+ *
+ * @param mr
+ *   Pointer to MR structure.
+ * @param[out] entry
+ *   Pointer to returning MR cache entry. If not found, this will not be
+ *   updated.
+ * @param base_idx
+ *   Start index of the memseg bitmap.
+ *
+ * @return
+ *   Next index to go on lookup.
+ */
+static int
+mr_find_next_chunk(struct mlx5_mr *mr, struct mr_cache_entry *entry,
+		   int base_idx)
+{
+	uintptr_t start = 0;
+	uintptr_t end = 0;
+	uint32_t idx = 0;
+
+	/* MR for external memory doesn't have memseg list. */
+	if (mr->msl == NULL) {
+		struct ibv_mr *ibv_mr = mr->ibv_mr;
+
+		MLX5_ASSERT(mr->ms_bmp_n == 1);
+		MLX5_ASSERT(mr->ms_n == 1);
+		MLX5_ASSERT(base_idx == 0);
+		/*
+		 * Can't search it from memseg list but get it directly from
+		 * verbs MR as there's only one chunk.
+		 */
+		entry->start = (uintptr_t)ibv_mr->addr;
+		entry->end = (uintptr_t)ibv_mr->addr + mr->ibv_mr->length;
+		entry->lkey = rte_cpu_to_be_32(mr->ibv_mr->lkey);
+		/* Returning 1 ends iteration. */
+		return 1;
+	}
+	for (idx = base_idx; idx < mr->ms_bmp_n; ++idx) {
+		if (rte_bitmap_get(mr->ms_bmp, idx)) {
+			const struct rte_memseg_list *msl;
+			const struct rte_memseg *ms;
+
+			msl = mr->msl;
+			ms = rte_fbarray_get(&msl->memseg_arr,
+					     mr->ms_base_idx + idx);
+			MLX5_ASSERT(msl->page_sz == ms->hugepage_sz);
+			if (!start)
+				start = ms->addr_64;
+			end = ms->addr_64 + ms->hugepage_sz;
+		} else if (start) {
+			/* Passed the end of a fragment. */
+			break;
+		}
+	}
+	if (start) {
+		/* Found one chunk. */
+		entry->start = start;
+		entry->end = end;
+		entry->lkey = rte_cpu_to_be_32(mr->ibv_mr->lkey);
+	}
+	return idx;
+}
+
+/**
+ * Insert a MR to the global B-tree cache. It may fail due to low-on-memory.
+ * Then, this entry will have to be searched by mlx5_mr_lookup_list() in
+ * mlx5_mr_create() on miss.
+ *
+ * @param share_cache
+ *   Pointer to a global shared MR cache.
+ * @param mr
+ *   Pointer to MR to insert.
+ *
+ * @return
+ *   0 on success, -1 on failure.
+ */
+int
+mlx5_mr_insert_cache(struct mlx5_mr_share_cache *share_cache,
+		     struct mlx5_mr *mr)
+{
+	unsigned int n;
+
+	DRV_LOG(DEBUG, "Inserting MR(%p) to global cache(%p)",
+		(void *)mr, (void *)share_cache);
+	for (n = 0; n < mr->ms_bmp_n; ) {
+		struct mr_cache_entry entry;
+
+		memset(&entry, 0, sizeof(entry));
+		/* Find a contiguous chunk and advance the index. */
+		n = mr_find_next_chunk(mr, &entry, n);
+		if (!entry.end)
+			break;
+		if (mr_btree_insert(&share_cache->cache, &entry) < 0) {
+			/*
+			 * Overflowed, but the global table cannot be expanded
+			 * because of deadlock.
+			 */
+			return -1;
+		}
+	}
+	return 0;
+}
+
+/**
+ * Look up address in the original global MR list.
+ *
+ * @param share_cache
+ *   Pointer to a global shared MR cache.
+ * @param[out] entry
+ *   Pointer to returning MR cache entry. If no match, this will not be updated.
+ * @param addr
+ *   Search key.
+ *
+ * @return
+ *   Found MR on match, NULL otherwise.
+ */
+struct mlx5_mr *
+mlx5_mr_lookup_list(struct mlx5_mr_share_cache *share_cache,
+		    struct mr_cache_entry *entry, uintptr_t addr)
+{
+	struct mlx5_mr *mr;
+
+	/* Iterate all the existing MRs. */
+	LIST_FOREACH(mr, &share_cache->mr_list, mr) {
+		unsigned int n;
+
+		if (mr->ms_n == 0)
+			continue;
+		for (n = 0; n < mr->ms_bmp_n; ) {
+			struct mr_cache_entry ret;
+
+			memset(&ret, 0, sizeof(ret));
+			n = mr_find_next_chunk(mr, &ret, n);
+			if (addr >= ret.start && addr < ret.end) {
+				/* Found. */
+				*entry = ret;
+				return mr;
+			}
+		}
+	}
+	return NULL;
+}
+
+/**
+ * Look up address on global MR cache.
+ *
+ * @param share_cache
+ *   Pointer to a global shared MR cache.
+ * @param[out] entry
+ *   Pointer to returning MR cache entry. If no match, this will not be updated.
+ * @param addr
+ *   Search key.
+ *
+ * @return
+ *   Searched LKey on success, UINT32_MAX on failure and rte_errno is set.
+ */
+uint32_t
+mlx5_mr_lookup_cache(struct mlx5_mr_share_cache *share_cache,
+		     struct mr_cache_entry *entry, uintptr_t addr)
+{
+	uint16_t idx;
+	uint32_t lkey = UINT32_MAX;
+	struct mlx5_mr *mr;
+
+	/*
+	 * If the global cache has overflowed since it failed to expand the
+	 * B-tree table, it can't have all the existing MRs. Then, the address
+	 * has to be searched by traversing the original MR list instead, which
+	 * is very slow path. Otherwise, the global cache is all inclusive.
+	 */
+	if (!unlikely(share_cache->cache.overflow)) {
+		lkey = mr_btree_lookup(&share_cache->cache, &idx, addr);
+		if (lkey != UINT32_MAX)
+			*entry = (*share_cache->cache.table)[idx];
+	} else {
+		/* Falling back to the slowest path. */
+		mr = mlx5_mr_lookup_list(share_cache, entry, addr);
+		if (mr != NULL)
+			lkey = entry->lkey;
+	}
+	MLX5_ASSERT(lkey == UINT32_MAX || (addr >= entry->start &&
+					   addr < entry->end));
+	return lkey;
+}
+
+/**
+ * Free MR resources. MR lock must not be held to avoid a deadlock. rte_free()
+ * can raise memory free event and the callback function will spin on the lock.
+ *
+ * @param mr
+ *   Pointer to MR to free.
+ */
+static void
+mr_free(struct mlx5_mr *mr)
+{
+	if (mr == NULL)
+		return;
+	DRV_LOG(DEBUG, "freeing MR(%p):", (void *)mr);
+	if (mr->ibv_mr != NULL)
+		claim_zero(mlx5_glue->dereg_mr(mr->ibv_mr));
+	if (mr->ms_bmp != NULL)
+		rte_bitmap_free(mr->ms_bmp);
+	rte_free(mr);
+}
+
+void
+mlx5_mr_rebuild_cache(struct mlx5_mr_share_cache *share_cache)
+{
+	struct mlx5_mr *mr;
+
+	DRV_LOG(DEBUG, "Rebuild dev cache[] %p", (void *)share_cache);
+	/* Flush cache to rebuild. */
+	share_cache->cache.len = 1;
+	share_cache->cache.overflow = 0;
+	/* Iterate all the existing MRs. */
+	LIST_FOREACH(mr, &share_cache->mr_list, mr)
+		if (mlx5_mr_insert_cache(share_cache, mr) < 0)
+			return;
+}
+
+/**
+ * Release resources of detached MR having no online entry.
+ *
+ * @param share_cache
+ *   Pointer to a global shared MR cache.
+ */
+static void
+mlx5_mr_garbage_collect(struct mlx5_mr_share_cache *share_cache)
+{
+	struct mlx5_mr *mr_next;
+	struct mlx5_mr_list free_list = LIST_HEAD_INITIALIZER(free_list);
+
+	/* Must be called from the primary process. */
+	MLX5_ASSERT(rte_eal_process_type() == RTE_PROC_PRIMARY);
+	/*
+	 * MR can't be freed while holding the lock because rte_free() could
+	 * call the memory free callback function, which would deadlock.
+	 */
+	rte_rwlock_write_lock(&share_cache->rwlock);
+	/* Detach the whole free list and release it after unlocking. */
+	free_list = share_cache->mr_free_list;
+	LIST_INIT(&share_cache->mr_free_list);
+	rte_rwlock_write_unlock(&share_cache->rwlock);
+	/* Release resources. */
+	mr_next = LIST_FIRST(&free_list);
+	while (mr_next != NULL) {
+		struct mlx5_mr *mr = mr_next;
+
+		mr_next = LIST_NEXT(mr, mr);
+		mr_free(mr);
+	}
+}
+
+/* Called during rte_memseg_contig_walk() by mlx5_mr_create(). */
+static int
+mr_find_contig_memsegs_cb(const struct rte_memseg_list *msl,
+			  const struct rte_memseg *ms, size_t len, void *arg)
+{
+	struct mr_find_contig_memsegs_data *data = arg;
+
+	if (data->addr < ms->addr_64 || data->addr >= ms->addr_64 + len)
+		return 0;
+	/* Found, save it and stop walking. */
+	data->start = ms->addr_64;
+	data->end = ms->addr_64 + len;
+	data->msl = msl;
+	return 1;
+}
+
+/**
+ * Create a new global Memory Region (MR) for a missing virtual address.
+ * This API should be called on a secondary process, then a request is sent to
+ * the primary process in order to create a MR for the address. As the global MR
+ * list is on the shared memory, following LKey lookup should succeed unless the
+ * request fails.
+ *
+ * @param pd
+ *   Pointer to ibv_pd of a device (net, regex, vdpa,...).
+ * @param share_cache
+ *   Pointer to a global shared MR cache.
+ * @param[out] entry
+ *   Pointer to returning MR cache entry, found in the global cache or newly
+ *   created. If failed to create one, this will not be updated.
+ * @param addr
+ *   Target virtual address to register.
+ * @param mr_ext_memseg_en
+ *   Configurable flag about external memory segment enable or not.
+ *
+ * @return
+ *   Searched LKey on success, UINT32_MAX on failure and rte_errno is set.
+ */
+static uint32_t
+mlx5_mr_create_secondary(struct ibv_pd *pd __rte_unused,
+			 struct mlx5_mp_id *mp_id,
+			 struct mlx5_mr_share_cache *share_cache,
+			 struct mr_cache_entry *entry, uintptr_t addr,
+			 unsigned int mr_ext_memseg_en __rte_unused)
+{
+	int ret;
+
+	DEBUG("port %u requesting MR creation for address (%p)",
+	      mp_id->port_id, (void *)addr);
+	ret = mlx5_mp_req_mr_create(mp_id, addr);
+	if (ret) {
+		DEBUG("Failed to request MR creation for address (%p)",
+		      (void *)addr);
+		return UINT32_MAX;
+	}
+	rte_rwlock_read_lock(&share_cache->rwlock);
+	/* Fill in output data. */
+	mlx5_mr_lookup_cache(share_cache, entry, addr);
+	/* Lookup can't fail. */
+	MLX5_ASSERT(entry->lkey != UINT32_MAX);
+	rte_rwlock_read_unlock(&share_cache->rwlock);
+	DEBUG("MR CREATED by primary process for %p:\n"
+	      "  [0x%" PRIxPTR ", 0x%" PRIxPTR "), lkey=0x%x",
+	      (void *)addr, entry->start, entry->end, entry->lkey);
+	return entry->lkey;
+}
+
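
For reference, the primary-side counterpart of this request is the IPC handler
wired up in mlx5_mp.c by patch 4/4 of this series; an abridged sketch:

	case MLX5_MP_REQ_CREATE_MR:
		mp_init_msg(&priv->mp_id, &mp_res, param->type);
		lkey = mlx5_mr_create_primary(priv->sh->pd,
					      &priv->sh->share_cache,
					      &entry, param->args.addr,
					      priv->config.mr_ext_memseg_en);
		if (lkey == UINT32_MAX)
			res->result = -rte_errno;
		ret = rte_mp_reply(&mp_res, peer);
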
+/**
+ * Create a new global Memory Region (MR) for a missing virtual address.
+ * Register the entire virtually contiguous memory chunk around the address.
+ * This must be called from the primary process.
+ *
+ * @param pd
+ *   Pointer to ibv_pd of a device (net, regex, vdpa,...).
+ * @param share_cache
+ *   Pointer to a global shared MR cache.
+ * @param[out] entry
+ *   Pointer to returning MR cache entry, found in the global cache or newly
+ *   created. If failed to create one, this will not be updated.
+ * @param addr
+ *   Target virtual address to register.
+ * @param mr_ext_memseg_en
+ *   Configurable flag controlling whether a MR is extended to cover the
+ *   entire virtually contiguous memseg chunk.
+ *
+ * @return
+ *   Searched LKey on success, UINT32_MAX on failure and rte_errno is set.
+ */
+uint32_t
+mlx5_mr_create_primary(struct ibv_pd *pd,
+		       struct mlx5_mr_share_cache *share_cache,
+		       struct mr_cache_entry *entry, uintptr_t addr,
+		       unsigned int mr_ext_memseg_en)
+{
+	struct mr_find_contig_memsegs_data data = {.addr = addr, };
+	struct mr_find_contig_memsegs_data data_re;
+	const struct rte_memseg_list *msl;
+	const struct rte_memseg *ms;
+	struct mlx5_mr *mr = NULL;
+	int ms_idx_shift = -1;
+	uint32_t bmp_size;
+	void *bmp_mem;
+	uint32_t ms_n;
+	uint32_t n;
+	size_t len;
+
+	DRV_LOG(DEBUG, "Creating a MR using address (%p)", (void *)addr);
+	/*
+	 * Release detached MRs if any. This can't be called while holding either
+	 * memory_hotplug_lock or share_cache->rwlock. MRs on the free list have
+	 * been detached by the memory free event but it couldn't be released
+	 * inside the callback due to deadlock. As a result, releasing resources
+	 * is quite opportunistic.
+	 */
+	mlx5_mr_garbage_collect(share_cache);
+	/*
+	 * If enabled, find out a contiguous virtual address chunk in use, to
+	 * which the given address belongs, in order to register maximum range.
+	 * In the best case where mempools are not dynamically recreated and
+	 * '--socket-mem' is specified as an EAL option, it is very likely to
+	 * have only one MR (LKey) per socket and per hugepage size even
+	 * though the system memory is highly fragmented. As the whole memory
+	 * chunk will be pinned by kernel, it can't be reused unless entire
+	 * chunk is freed from EAL.
+	 *
+	 * If disabled, just register one memseg (page). Then, memory
+	 * consumption will be minimized but it may drop performance if there
+	 * are many MRs to lookup on the datapath.
+	 */
+	if (!mr_ext_memseg_en) {
+		data.msl = rte_mem_virt2memseg_list((void *)addr);
+		data.start = RTE_ALIGN_FLOOR(addr, data.msl->page_sz);
+		data.end = data.start + data.msl->page_sz;
+	} else if (!rte_memseg_contig_walk(mr_find_contig_memsegs_cb, &data)) {
+		DRV_LOG(WARNING,
+			"Unable to find virtually contiguous"
+			" chunk for address (%p)."
+			" rte_memseg_contig_walk() failed.", (void *)addr);
+		rte_errno = ENXIO;
+		goto err_nolock;
+	}
+alloc_resources:
+	/* Addresses must be page-aligned. */
+	MLX5_ASSERT(data.msl);
+	MLX5_ASSERT(rte_is_aligned((void *)data.start, data.msl->page_sz));
+	MLX5_ASSERT(rte_is_aligned((void *)data.end, data.msl->page_sz));
+	msl = data.msl;
+	ms = rte_mem_virt2memseg((void *)data.start, msl);
+	len = data.end - data.start;
+	MLX5_ASSERT(ms);
+	MLX5_ASSERT(msl->page_sz == ms->hugepage_sz);
+	/* Number of memsegs in the range. */
+	ms_n = len / msl->page_sz;
+	DEBUG("Extending %p to [0x%" PRIxPTR ", 0x%" PRIxPTR "),"
+	      " page_sz=0x%" PRIx64 ", ms_n=%u",
+	      (void *)addr, data.start, data.end, msl->page_sz, ms_n);
+	/* Size of memory for bitmap. */
+	bmp_size = rte_bitmap_get_memory_footprint(ms_n);
+	mr = rte_zmalloc_socket(NULL,
+				RTE_ALIGN_CEIL(sizeof(*mr),
+					       RTE_CACHE_LINE_SIZE) +
+				bmp_size,
+				RTE_CACHE_LINE_SIZE, msl->socket_id);
+	if (mr == NULL) {
+		DEBUG("Unable to allocate memory for a new MR of"
+		      " address (%p).", (void *)addr);
+		rte_errno = ENOMEM;
+		goto err_nolock;
+	}
+	mr->msl = msl;
+	/*
+	 * Save the index of the first memseg and initialize memseg bitmap. To
+	 * see if a memseg of ms_idx in the memseg-list is still valid, check:
+	 *	rte_bitmap_get(mr->bmp, ms_idx - mr->ms_base_idx)
+	 */
+	mr->ms_base_idx = rte_fbarray_find_idx(&msl->memseg_arr, ms);
+	bmp_mem = RTE_PTR_ALIGN_CEIL(mr + 1, RTE_CACHE_LINE_SIZE);
+	mr->ms_bmp = rte_bitmap_init(ms_n, bmp_mem, bmp_size);
+	if (mr->ms_bmp == NULL) {
+		DEBUG("Unable to initialize bitmap for a new MR of"
+		      " address (%p).", (void *)addr);
+		rte_errno = EINVAL;
+		goto err_nolock;
+	}
+	/*
+	 * Should recheck whether the extended contiguous chunk is still valid.
+	 * Because memory_hotplug_lock can't be held if there's any memory
+	 * related calls in a critical path, resource allocation above can't be
+	 * locked. If the memory has been changed at this point, try again with
+	 * just single page. If not, go on with the big chunk atomically from
+	 * here.
+	 */
+	rte_mcfg_mem_read_lock();
+	data_re = data;
+	if (len > msl->page_sz &&
+	    !rte_memseg_contig_walk(mr_find_contig_memsegs_cb, &data_re)) {
+		DEBUG("Unable to find virtually contiguous"
+		      " chunk for address (%p)."
+		      " rte_memseg_contig_walk() failed.", (void *)addr);
+		rte_errno = ENXIO;
+		goto err_memlock;
+	}
+	if (data.start != data_re.start || data.end != data_re.end) {
+		/*
+		 * The extended contiguous chunk has been changed. Try again
+		 * with single memseg instead.
+		 */
+		data.start = RTE_ALIGN_FLOOR(addr, msl->page_sz);
+		data.end = data.start + msl->page_sz;
+		rte_mcfg_mem_read_unlock();
+		mr_free(mr);
+		goto alloc_resources;
+	}
+	MLX5_ASSERT(data.msl == data_re.msl);
+	rte_rwlock_write_lock(&share_cache->rwlock);
+	/*
+	 * Check that the address is really missing. If another thread already
+	 * created one, or it is not found due to overflow, abort and return.
+	 */
+	if (mlx5_mr_lookup_cache(share_cache, entry, addr) != UINT32_MAX) {
+		/*
+		 * Insert to the global cache table. It may fail due to
+		 * low-on-memory. Then, this entry will have to be searched
+		 * here again.
+		 */
+		mr_btree_insert(&share_cache->cache, entry);
+		DEBUG("Found MR for %p on final lookup, abort", (void *)addr);
+		rte_rwlock_write_unlock(&share_cache->rwlock);
+		rte_mcfg_mem_read_unlock();
+		/*
+		 * Must be unlocked before calling rte_free() because
+		 * mlx5_mr_mem_event_free_cb() can be called inside.
+		 */
+		mr_free(mr);
+		return entry->lkey;
+	}
+	/*
+	 * Trim start and end addresses for verbs MR. Set bits for registering
+	 * memsegs but exclude already registered ones. Bitmap can be
+	 * fragmented.
+	 */
+	for (n = 0; n < ms_n; ++n) {
+		uintptr_t start;
+		struct mr_cache_entry ret;
+
+		memset(&ret, 0, sizeof(ret));
+		start = data_re.start + n * msl->page_sz;
+		/* Exclude memsegs already registered by other MRs. */
+		if (mlx5_mr_lookup_cache(share_cache, &ret, start) ==
+		    UINT32_MAX) {
+			/*
+			 * Start from the first unregistered memseg in the
+			 * extended range.
+			 */
+			if (ms_idx_shift == -1) {
+				mr->ms_base_idx += n;
+				data.start = start;
+				ms_idx_shift = n;
+			}
+			data.end = start + msl->page_sz;
+			rte_bitmap_set(mr->ms_bmp, n - ms_idx_shift);
+			++mr->ms_n;
+		}
+	}
+	len = data.end - data.start;
+	mr->ms_bmp_n = len / msl->page_sz;
+	MLX5_ASSERT(ms_idx_shift + mr->ms_bmp_n <= ms_n);
+	/*
+	 * Finally create a verbs MR for the memory chunk. ibv_reg_mr() can be
+	 * called while holding the memory lock because it doesn't use
+	 * mlx5_alloc_buf_extern() which eventually calls rte_malloc_socket()
+	 * through mlx5_alloc_verbs_buf().
+	 */
+	mr->ibv_mr = mlx5_glue->reg_mr(pd, (void *)data.start, len,
+				       IBV_ACCESS_LOCAL_WRITE |
+					   IBV_ACCESS_RELAXED_ORDERING);
+	if (mr->ibv_mr == NULL) {
+		DEBUG("Failed to create a verbs MR for address (%p)",
+		      (void *)addr);
+		rte_errno = EINVAL;
+		goto err_mrlock;
+	}
+	MLX5_ASSERT((uintptr_t)mr->ibv_mr->addr == data.start);
+	MLX5_ASSERT(mr->ibv_mr->length == len);
+	LIST_INSERT_HEAD(&share_cache->mr_list, mr, mr);
+	DEBUG("MR CREATED (%p) for %p:\n"
+	      "  [0x%" PRIxPTR ", 0x%" PRIxPTR "),"
+	      " lkey=0x%x base_idx=%u ms_n=%u, ms_bmp_n=%u",
+	      (void *)mr, (void *)addr, data.start, data.end,
+	      rte_cpu_to_be_32(mr->ibv_mr->lkey),
+	      mr->ms_base_idx, mr->ms_n, mr->ms_bmp_n);
+	/* Insert to the global cache table. */
+	mlx5_mr_insert_cache(share_cache, mr);
+	/* Fill in output data. */
+	mlx5_mr_lookup_cache(share_cache, entry, addr);
+	/* Lookup can't fail. */
+	MLX5_ASSERT(entry->lkey != UINT32_MAX);
+	rte_rwlock_write_unlock(&share_cache->rwlock);
+	rte_mcfg_mem_read_unlock();
+	return entry->lkey;
+err_mrlock:
+	rte_rwlock_write_unlock(&share_cache->rwlock);
+err_memlock:
+	rte_mcfg_mem_read_unlock();
+err_nolock:
+	/*
+	 * In case of error, as this can be called in a datapath, a warning
+	 * message per error is preferable instead. Must be unlocked before
+	 * calling rte_free() because mlx5_mr_mem_event_free_cb() can be called
+	 * inside.
+	 */
+	mr_free(mr);
+	return UINT32_MAX;
+}
+
+/**
+ * Create a new global Memory Region (MR) for a missing virtual address.
+ * This can be called from both primary and secondary processes.
+ *
+ * @param pd
+ *   Pointer to ibv_pd of a device (net, regex, vdpa,...).
+ * @param mp_id
+ *   Multi-process identifier, used to request MR creation from the primary
+ *   process when called from a secondary one.
+ * @param share_cache
+ *   Pointer to a global shared MR cache.
+ * @param[out] entry
+ *   Pointer to returning MR cache entry, found in the global cache or newly
+ *   created. If failed to create one, this will not be updated.
+ * @param addr
+ *   Target virtual address to register.
+ * @param mr_ext_memseg_en
+ *   Configurable flag controlling whether a new MR is extended to cover the
+ *   entire virtually contiguous memseg chunk.
+ *
+ * @return
+ *   Searched LKey on success, UINT32_MAX on failure and rte_errno is set.
+ */
+static uint32_t
+mlx5_mr_create(struct ibv_pd *pd, struct mlx5_mp_id *mp_id,
+	       struct mlx5_mr_share_cache *share_cache,
+	       struct mr_cache_entry *entry, uintptr_t addr,
+	       unsigned int mr_ext_memseg_en)
+{
+	uint32_t ret = 0;
+
+	switch (rte_eal_process_type()) {
+	case RTE_PROC_PRIMARY:
+		ret = mlx5_mr_create_primary(pd, share_cache, entry,
+					     addr, mr_ext_memseg_en);
+		break;
+	case RTE_PROC_SECONDARY:
+		ret = mlx5_mr_create_secondary(pd, mp_id, share_cache, entry,
+					       addr, mr_ext_memseg_en);
+		break;
+	default:
+		break;
+	}
+	return ret;
+}
+
+/**
+ * Look up address in the global MR cache table. If not found, create a new MR.
+ * Insert the found/created entry to local bottom-half cache table.
+ *
+ * @param pd
+ *   Pointer to ibv_pd of a device (net, regex, vdpa,...).
+ * @param mp_id
+ *   Multi-process identifier, forwarded to MR creation on a cache miss.
+ * @param share_cache
+ *   Pointer to a global shared MR cache.
+ * @param mr_ctrl
+ *   Pointer to per-queue MR control structure.
+ * @param[out] entry
+ *   Pointer to returning MR cache entry, found in the global cache or newly
+ *   created. If failed to create one, this is not written.
+ * @param addr
+ *   Search key.
+ * @param mr_ext_memseg_en
+ *   Configurable flag controlling whether a new MR is extended to cover the
+ *   entire virtually contiguous memseg chunk.
+ *
+ * @return
+ *   Searched LKey on success, UINT32_MAX on no match.
+ */
+static uint32_t
+mr_lookup_caches(struct ibv_pd *pd, struct mlx5_mp_id *mp_id,
+		 struct mlx5_mr_share_cache *share_cache,
+		 struct mlx5_mr_ctrl *mr_ctrl,
+		 struct mr_cache_entry *entry, uintptr_t addr,
+		 unsigned int mr_ext_memseg_en)
+{
+	struct mlx5_mr_btree *bt = &mr_ctrl->cache_bh;
+	uint32_t lkey;
+	uint16_t idx;
+
+	/* If local cache table is full, try to double it. */
+	if (unlikely(bt->len == bt->size))
+		mr_btree_expand(bt, bt->size << 1);
+	/* Look up in the global cache. */
+	rte_rwlock_read_lock(&share_cache->rwlock);
+	lkey = mr_btree_lookup(&share_cache->cache, &idx, addr);
+	if (lkey != UINT32_MAX) {
+		/* Found. */
+		*entry = (*share_cache->cache.table)[idx];
+		rte_rwlock_read_unlock(&share_cache->rwlock);
+		/*
+		 * Update local cache. Even if it fails, return the found entry
+		 * to update top-half cache. Next time, this entry will be found
+		 * in the global cache.
+		 */
+		mr_btree_insert(bt, entry);
+		return lkey;
+	}
+	rte_rwlock_read_unlock(&share_cache->rwlock);
+	/* First time to see the address? Create a new MR. */
+	lkey = mlx5_mr_create(pd, mp_id, share_cache, entry, addr,
+			      mr_ext_memseg_en);
+	/*
+	 * Update the local cache if successfully created a new global MR. Even
+	 * if failed to create one, there's no action to take in this datapath
+	 * code. As returning LKey is invalid, this will eventually make HW
+	 * fail.
+	 */
+	if (lkey != UINT32_MAX)
+		mr_btree_insert(bt, entry);
+	return lkey;
+}
+
+/**
+ * Bottom-half of LKey search on the datapath. First search in cache_bh[] and,
+ * on a miss, search in the global MR cache table and add the new entry to the
+ * per-queue local caches.
+ *
+ * @param pd
+ *   Pointer to ibv_pd of a device (net, regex, vdpa,...).
+ * @param mp_id
+ *   Multi-process identifier, forwarded to MR creation on a cache miss.
+ * @param share_cache
+ *   Pointer to a global shared MR cache.
+ * @param mr_ctrl
+ *   Pointer to per-queue MR control structure.
+ * @param addr
+ *   Search key.
+ * @param mr_ext_memseg_en
+ *   Configurable flag controlling whether a new MR is extended to cover the
+ *   entire virtually contiguous memseg chunk.
+ *
+ * @return
+ *   Searched LKey on success, UINT32_MAX on no match.
+ */
+uint32_t mlx5_mr_addr2mr_bh(struct ibv_pd *pd, struct mlx5_mp_id *mp_id,
+			    struct mlx5_mr_share_cache *share_cache,
+			    struct mlx5_mr_ctrl *mr_ctrl,
+			    uintptr_t addr, unsigned int mr_ext_memseg_en)
+{
+	uint32_t lkey;
+	uint16_t bh_idx = 0;
+	/* Victim in top-half cache to replace with new entry. */
+	struct mr_cache_entry *repl = &mr_ctrl->cache[mr_ctrl->head];
+
+	/* Binary-search MR translation table. */
+	lkey = mr_btree_lookup(&mr_ctrl->cache_bh, &bh_idx, addr);
+	/* Update top-half cache. */
+	if (likely(lkey != UINT32_MAX)) {
+		*repl = (*mr_ctrl->cache_bh.table)[bh_idx];
+	} else {
+		/*
+		 * If missed in local lookup table, search in the global cache
+		 * and local cache_bh[] will be updated inside if possible.
+		 * Top-half cache entry will also be updated.
+		 */
+		lkey = mr_lookup_caches(pd, mp_id, share_cache, mr_ctrl,
+					repl, addr, mr_ext_memseg_en);
+		if (unlikely(lkey == UINT32_MAX))
+			return UINT32_MAX;
+	}
+	/* Update the most recently used entry. */
+	mr_ctrl->mru = mr_ctrl->head;
+	/* Point to the next victim, the oldest. */
+	mr_ctrl->head = (mr_ctrl->head + 1) % MLX5_MR_CACHE_N;
+	return lkey;
+}
+
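
A minimal sketch (not added by this patch) of how a datapath caller is
expected to pair the linear top-half cache with this bottom-half;
mr_addr2mr() is a hypothetical wrapper name, while mlx5_mr_lookup_lkey() and
MLX5_MR_CACHE_N come from mlx5_common_mr.h below:

	static __rte_always_inline uint32_t
	mr_addr2mr(struct ibv_pd *pd, struct mlx5_mp_id *mp_id,
		   struct mlx5_mr_share_cache *share_cache,
		   struct mlx5_mr_ctrl *mr_ctrl, uintptr_t addr,
		   unsigned int mr_ext_memseg_en)
	{
		uint32_t lkey;

		/* Linear search in the top-half cache, starting from the MRU entry. */
		lkey = mlx5_mr_lookup_lkey(mr_ctrl->cache, &mr_ctrl->mru,
					   MLX5_MR_CACHE_N, addr);
		if (likely(lkey != UINT32_MAX))
			return lkey;
		/* Miss: take the slow path through the B-tree and the global cache. */
		return mlx5_mr_addr2mr_bh(pd, mp_id, share_cache, mr_ctrl, addr,
					  mr_ext_memseg_en);
	}
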
+/**
+ * Release all the created MRs and resources of the global MR cache of a
+ * device.
+ *
+ * @param share_cache
+ *   Pointer to a global shared MR cache.
+ */
+void
+mlx5_mr_release_cache(struct mlx5_mr_share_cache *share_cache)
+{
+	struct mlx5_mr *mr_next;
+
+	rte_rwlock_write_lock(&share_cache->rwlock);
+	/* Detach from MR list and move to free list. */
+	mr_next = LIST_FIRST(&share_cache->mr_list);
+	while (mr_next != NULL) {
+		struct mlx5_mr *mr = mr_next;
+
+		mr_next = LIST_NEXT(mr, mr);
+		LIST_REMOVE(mr, mr);
+		LIST_INSERT_HEAD(&share_cache->mr_free_list, mr, mr);
+	}
+	LIST_INIT(&share_cache->mr_list);
+	/* Free global cache. */
+	mlx5_mr_btree_free(&share_cache->cache);
+	rte_rwlock_write_unlock(&share_cache->rwlock);
+	/* Free all remaining MRs. */
+	mlx5_mr_garbage_collect(share_cache);
+}
+
+/**
+ * Flush all of the local cache entries.
+ *
+ * @param mr_ctrl
+ *   Pointer to per-queue MR local cache.
+ */
+void
+mlx5_mr_flush_local_cache(struct mlx5_mr_ctrl *mr_ctrl)
+{
+	/* Reset the most-recently-used index. */
+	mr_ctrl->mru = 0;
+	/* Reset the linear search array. */
+	mr_ctrl->head = 0;
+	memset(mr_ctrl->cache, 0, sizeof(mr_ctrl->cache));
+	/* Reset the B-tree table. */
+	mr_ctrl->cache_bh.len = 1;
+	mr_ctrl->cache_bh.overflow = 0;
+	/* Update the generation number. */
+	mr_ctrl->cur_gen = *mr_ctrl->dev_gen_ptr;
+	DRV_LOG(DEBUG, "mr_ctrl(%p): flushed, cur_gen=%d",
+		(void *)mr_ctrl, mr_ctrl->cur_gen);
+}
+
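
The flush is driven by the generation number: the primary bumps
share_cache->dev_gen after rebuilding the global cache, and every queue polls
it through dev_gen_ptr. A sketch of the expected per-queue check follows;
mr_ctrl_sync() is a hypothetical name:

	static inline void
	mr_ctrl_sync(struct mlx5_mr_ctrl *mr_ctrl)
	{
		/* Generation changed: the global cache was invalidated. */
		if (unlikely(mr_ctrl->cur_gen != *mr_ctrl->dev_gen_ptr))
			mlx5_mr_flush_local_cache(mr_ctrl);
	}
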
+/**
+ * Creates a memory region for external memory, that is, memory which is not
+ * part of the DPDK memory segments.
+ *
+ * @param pd
+ *   Pointer to ibv_pd of a device (net, regex, vdpa,...).
+ * @param addr
+ *   Starting virtual address of memory.
+ * @param len
+ *   Length of memory segment being mapped.
+ * @param socket_id
+ *   Socket to allocate heap memory for the control structures.
+ *
+ * @return
+ *   Pointer to MR structure on success, NULL otherwise.
+ */
+struct mlx5_mr *
+mlx5_create_mr_ext(struct ibv_pd *pd, uintptr_t addr, size_t len, int socket_id)
+{
+	struct mlx5_mr *mr = NULL;
+
+	mr = rte_zmalloc_socket(NULL,
+				RTE_ALIGN_CEIL(sizeof(*mr),
+					       RTE_CACHE_LINE_SIZE),
+				RTE_CACHE_LINE_SIZE, socket_id);
+	if (mr == NULL)
+		return NULL;
+	mr->ibv_mr = mlx5_glue->reg_mr(pd, (void *)addr, len,
+				       IBV_ACCESS_LOCAL_WRITE |
+					   IBV_ACCESS_RELAXED_ORDERING);
+	if (mr->ibv_mr == NULL) {
+		DRV_LOG(WARNING,
+			"Fail to create a verbs MR for address (%p)",
+			(void *)addr);
+		rte_free(mr);
+		return NULL;
+	}
+	mr->msl = NULL; /* Mark it as external memory. */
+	mr->ms_bmp = NULL;
+	mr->ms_n = 1;
+	mr->ms_bmp_n = 1;
+	DRV_LOG(DEBUG,
+		"MR CREATED (%p) for external memory %p:\n"
+		"  [0x%" PRIxPTR ", 0x%" PRIxPTR "),"
+		" lkey=0x%x base_idx=%u ms_n=%u, ms_bmp_n=%u",
+		(void *)mr, (void *)addr,
+		addr, addr + len, rte_cpu_to_be_32(mr->ibv_mr->lkey),
+		mr->ms_base_idx, mr->ms_n, mr->ms_bmp_n);
+	return mr;
+}
+
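
A sketch of the intended usage, e.g. from a DMA map hook; the caller is
hypothetical and pd, ext_addr, len and share_cache are assumed to be in
scope. The MR is created, linked to the shared list, and published in the
global cache under the write lock:

	struct mlx5_mr *mr;

	mr = mlx5_create_mr_ext(pd, (uintptr_t)ext_addr, len, SOCKET_ID_ANY);
	if (mr == NULL)
		return -1;
	rte_rwlock_write_lock(&share_cache->rwlock);
	LIST_INSERT_HEAD(&share_cache->mr_list, mr, mr);
	/* Insert to the global cache table so that lookups can find it. */
	mlx5_mr_insert_cache(share_cache, mr);
	rte_rwlock_write_unlock(&share_cache->rwlock);
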
+/**
+ * Dump all the created MRs and the global cache entries.
+ *
+ * @param share_cache
+ *   Pointer to a global shared MR cache.
+ */
+void
+mlx5_mr_dump_cache(struct mlx5_mr_share_cache *share_cache __rte_unused)
+{
+#ifdef RTE_LIBRTE_MLX5_DEBUG
+	struct mlx5_mr *mr;
+	int mr_n = 0;
+	int chunk_n = 0;
+
+	rte_rwlock_read_lock(&share_cache->rwlock);
+	/* Iterate all the existing MRs. */
+	LIST_FOREACH(mr, &share_cache->mr_list, mr) {
+		unsigned int n;
+
+		DEBUG("MR[%u], LKey = 0x%x, ms_n = %u, ms_bmp_n = %u",
+		      mr_n++, rte_cpu_to_be_32(mr->ibv_mr->lkey),
+		      mr->ms_n, mr->ms_bmp_n);
+		if (mr->ms_n == 0)
+			continue;
+		for (n = 0; n < mr->ms_bmp_n; ) {
+			struct mr_cache_entry ret = { 0, };
+
+			n = mr_find_next_chunk(mr, &ret, n);
+			if (!ret.end)
+				break;
+			DEBUG("  chunk[%u], [0x%" PRIxPTR ", 0x%" PRIxPTR ")",
+			      chunk_n++, ret.start, ret.end);
+		}
+	}
+	DEBUG("Dumping global cache %p", (void *)share_cache);
+	mlx5_mr_btree_dump(&share_cache->cache);
+	rte_rwlock_read_unlock(&share_cache->rwlock);
+#endif
+}
diff --git a/drivers/common/mlx5/mlx5_common_mr.h b/drivers/common/mlx5/mlx5_common_mr.h
new file mode 100644
index 0000000000..e805f96375
--- /dev/null
+++ b/drivers/common/mlx5/mlx5_common_mr.h
@@ -0,0 +1,160 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright 2018 6WIND S.A.
+ * Copyright 2018 Mellanox Technologies, Ltd
+ */
+
+#ifndef RTE_PMD_MLX5_COMMON_MR_H_
+#define RTE_PMD_MLX5_COMMON_MR_H_
+
+#include <stddef.h>
+#include <stdint.h>
+#include <sys/queue.h>
+
+/* Verbs header. */
+/* ISO C doesn't support unnamed structs/unions, disabling -pedantic. */
+#ifdef PEDANTIC
+#pragma GCC diagnostic ignored "-Wpedantic"
+#endif
+#include <infiniband/verbs.h>
+#include <infiniband/mlx5dv.h>
+#ifdef PEDANTIC
+#pragma GCC diagnostic error "-Wpedantic"
+#endif
+
+#include <rte_rwlock.h>
+#include <rte_bitmap.h>
+#include <rte_memory.h>
+
+#include "mlx5_common_mp.h"
+
+/* Size of per-queue MR cache array for linear search. */
+#define MLX5_MR_CACHE_N 8
+#define MLX5_MR_BTREE_CACHE_N 256
+
+/* Memory Region object. */
+struct mlx5_mr {
+	LIST_ENTRY(mlx5_mr) mr; /**< Pointer to the prev/next entry. */
+	struct ibv_mr *ibv_mr; /* Verbs Memory Region. */
+	const struct rte_memseg_list *msl;
+	int ms_base_idx; /* Start index of msl->memseg_arr[]. */
+	int ms_n; /* Number of memsegs in use. */
+	uint32_t ms_bmp_n; /* Number of bits in memsegs bit-mask. */
+	struct rte_bitmap *ms_bmp; /* Bit-mask of memsegs belonging to the MR. */
+};
+
+/* Cache entry for Memory Region. */
+struct mr_cache_entry {
+	uintptr_t start; /* Start address of MR. */
+	uintptr_t end; /* End address of MR. */
+	uint32_t lkey; /* rte_cpu_to_be_32(ibv_mr->lkey). */
+} __rte_packed;
+
+/* MR Cache table for Binary search. */
+struct mlx5_mr_btree {
+	uint16_t len; /* Number of entries in use. */
+	uint16_t size; /* Total number of allocated entries. */
+	int overflow; /* Mark failure of table expansion. */
+	struct mr_cache_entry (*table)[];
+} __rte_packed;
+
+/* Per-queue MR control descriptor. */
+struct mlx5_mr_ctrl {
+	uint32_t *dev_gen_ptr; /* Generation number of device to poll. */
+	uint32_t cur_gen; /* Generation number saved to flush caches. */
+	uint16_t mru; /* Index of last hit entry in top-half cache. */
+	uint16_t head; /* Index of the oldest entry in top-half cache. */
+	struct mr_cache_entry cache[MLX5_MR_CACHE_N]; /* Cache for top-half. */
+	struct mlx5_mr_btree cache_bh; /* Cache for bottom-half. */
+} __rte_packed;
+
+LIST_HEAD(mlx5_mr_list, mlx5_mr);
+
+/* Global per-device MR cache. */
+struct mlx5_mr_share_cache {
+	uint32_t dev_gen; /* Generation number to flush local caches. */
+	rte_rwlock_t rwlock; /* MR cache Lock. */
+	struct mlx5_mr_btree cache; /* Global MR cache table. */
+	struct mlx5_mr_list mr_list; /* Registered MR list. */
+	struct mlx5_mr_list mr_free_list; /* Freed MR list. */
+} __rte_packed;
+
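
A per-queue mlx5_mr_ctrl is expected to be bound to the shared cache at queue
creation time, roughly as in this sketch; mr_ctrl_init() is a hypothetical
name and mlx5_mr_btree_init() is declared below:

	static int
	mr_ctrl_init(struct mlx5_mr_ctrl *mr_ctrl,
		     struct mlx5_mr_share_cache *share_cache, int socket)
	{
		/* Poll the per-device generation number to detect flushes. */
		mr_ctrl->dev_gen_ptr = &share_cache->dev_gen;
		/* Allocate the bottom-half B-tree cache. */
		return mlx5_mr_btree_init(&mr_ctrl->cache_bh,
					  MLX5_MR_BTREE_CACHE_N, socket);
	}
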
+/**
+ * Look up LKey from the given lookup table by linear search. First look up
+ * the last-hit entry. On a miss, the entire array is searched. If found,
+ * update the last-hit index and return LKey.
+ *
+ * @param lkp_tbl
+ *   Pointer to lookup table.
+ * @param[in,out] cached_idx
+ *   Pointer to last-hit index.
+ * @param n
+ *   Size of lookup table.
+ * @param addr
+ *   Search key.
+ *
+ * @return
+ *   Searched LKey on success, UINT32_MAX on no match.
+ */
+static __rte_always_inline uint32_t
+mlx5_mr_lookup_lkey(struct mr_cache_entry *lkp_tbl, uint16_t *cached_idx,
+		    uint16_t n, uintptr_t addr)
+{
+	uint16_t idx;
+
+	if (likely(addr >= lkp_tbl[*cached_idx].start &&
+		   addr < lkp_tbl[*cached_idx].end))
+		return lkp_tbl[*cached_idx].lkey;
+	for (idx = 0; idx < n && lkp_tbl[idx].start != 0; ++idx) {
+		if (addr >= lkp_tbl[idx].start &&
+		    addr < lkp_tbl[idx].end) {
+			/* Found. */
+			*cached_idx = idx;
+			return lkp_tbl[idx].lkey;
+		}
+	}
+	return UINT32_MAX;
+}
+
+__rte_experimental
+int mlx5_mr_btree_init(struct mlx5_mr_btree *bt, int n, int socket);
+__rte_experimental
+void mlx5_mr_btree_free(struct mlx5_mr_btree *bt);
+__rte_experimental
+void mlx5_mr_btree_dump(struct mlx5_mr_btree *bt __rte_unused);
+__rte_experimental
+uint32_t mlx5_mr_addr2mr_bh(struct ibv_pd *pd, struct mlx5_mp_id *mp_id,
+			    struct mlx5_mr_share_cache *share_cache,
+			    struct mlx5_mr_ctrl *mr_ctrl,
+			    uintptr_t addr, unsigned int mr_ext_memseg_en);
+__rte_experimental
void mlx5_mr_release_cache(struct mlx5_mr_share_cache *share_cache);
+__rte_experimental
+void mlx5_mr_dump_cache(struct mlx5_mr_share_cache *share_cache __rte_unused);
+__rte_experimental
+void mlx5_mr_rebuild_cache(struct mlx5_mr_share_cache *share_cache);
+__rte_experimental
+void mlx5_mr_flush_local_cache(struct mlx5_mr_ctrl *mr_ctrl);
+__rte_experimental
+int
+mlx5_mr_insert_cache(struct mlx5_mr_share_cache *share_cache,
+		     struct mlx5_mr *mr);
+__rte_experimental
+uint32_t
+mlx5_mr_lookup_cache(struct mlx5_mr_share_cache *share_cache,
+		     struct mr_cache_entry *entry, uintptr_t addr);
+__rte_experimental
+struct mlx5_mr *
+mlx5_mr_lookup_list(struct mlx5_mr_share_cache *share_cache,
+		    struct mr_cache_entry *entry, uintptr_t addr);
+__rte_experimental
+struct mlx5_mr *
+mlx5_create_mr_ext(struct ibv_pd *pd, uintptr_t addr, size_t len,
+		   int socket_id);
+__rte_experimental
+uint32_t
+mlx5_mr_create_primary(struct ibv_pd *pd,
+		       struct mlx5_mr_share_cache *share_cache,
+		       struct mr_cache_entry *entry, uintptr_t addr,
+		       unsigned int mr_ext_memseg_en);
+
+#endif /* RTE_PMD_MLX5_COMMON_MR_H_ */
diff --git a/drivers/common/mlx5/rte_common_mlx5_version.map b/drivers/common/mlx5/rte_common_mlx5_version.map
index 265703d1c9..b58a378278 100644
--- a/drivers/common/mlx5/rte_common_mlx5_version.map
+++ b/drivers/common/mlx5/rte_common_mlx5_version.map
@@ -61,4 +61,18 @@ EXPERIMENTAL {
 	mlx5_mp_req_mr_create;
 	mlx5_mp_req_queue_state_modify;
 	mlx5_mp_req_verbs_cmd_fd;
+
+	mlx5_mr_btree_init;
+	mlx5_mr_btree_free;
+	mlx5_mr_btree_dump;
+	mlx5_mr_addr2mr_bh;
+	mlx5_mr_release_cache;
+	mlx5_mr_dump_cache;
+	mlx5_mr_rebuild_cache;
+	mlx5_mr_insert_cache;
+	mlx5_mr_lookup_cache;
+	mlx5_mr_lookup_list;
+	mlx5_create_mr_ext;
+	mlx5_mr_create_primary;
+	mlx5_mr_flush_local_cache;
 };
-- 
2.16.6


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [dpdk-dev] [PATCH v2 4/4] net/mlx5: modify net pmd to use common MR driver
  2020-04-07 16:48 ` [dpdk-dev] [PATCH v2 0/4] refactor multi-process IPC and memory management codes to common driver Vu Pham
                     ` (2 preceding siblings ...)
  2020-04-07 16:48   ` [dpdk-dev] [PATCH v2 3/4] common/mlx5: refactor memory management codes Vu Pham
@ 2020-04-07 16:48   ` Vu Pham
  3 siblings, 0 replies; 26+ messages in thread
From: Vu Pham @ 2020-04-07 16:48 UTC (permalink / raw)
  To: dev; +Cc: viacheslavo, orika, matan, rasland, Vu Pham

Modify the mlx5 net PMD to use the MR management APIs from the common driver.

Signed-off-by: Vu Pham <vuhuong@mellanox.com>
---
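The shape of the call-site change, as a before/after sketch: the bottom-half
lookup no longer takes an rte_eth_dev but the protection domain, MP ID,
shared cache and memseg-extension flag explicitly:

	/* Before: */
	lkey = mlx5_mr_addr2mr_bh(dev, mr_ctrl, addr);
	/* After: */
	lkey = mlx5_mr_addr2mr_bh(priv->sh->pd, &priv->mp_id,
				  &priv->sh->share_cache, mr_ctrl, addr,
				  priv->config.mr_ext_memseg_en);
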
 drivers/common/mlx5/Makefile     |    1 +
 drivers/common/mlx5/meson.build  |    1 +
 drivers/net/mlx5/mlx5.c          |    4 +-
 drivers/net/mlx5/mlx5.h          |   12 +-
 drivers/net/mlx5/mlx5_mp.c       |    8 +-
 drivers/net/mlx5/mlx5_mr.c       | 1169 ++------------------------------------
 drivers/net/mlx5/mlx5_mr.h       |   87 +--
 drivers/net/mlx5/mlx5_rxtx.c     |    1 +
 drivers/net/mlx5/mlx5_rxtx.h     |   10 +-
 drivers/net/mlx5/mlx5_rxtx_vec.h |    2 +
 drivers/net/mlx5/mlx5_trigger.c  |    1 +
 drivers/net/mlx5/mlx5_txq.c      |    3 +-
 12 files changed, 75 insertions(+), 1224 deletions(-)

diff --git a/drivers/common/mlx5/Makefile b/drivers/common/mlx5/Makefile
index 2a88492731..26267c957a 100644
--- a/drivers/common/mlx5/Makefile
+++ b/drivers/common/mlx5/Makefile
@@ -18,6 +18,7 @@ SRCS-y += mlx5_devx_cmds.c
 SRCS-y += mlx5_common.c
 SRCS-y += mlx5_nl.c
 SRCS-y += mlx5_common_mp.c
+SRCS-y += mlx5_common_mr.c
 ifeq ($(CONFIG_RTE_IBVERBS_LINK_DLOPEN),y)
 INSTALL-y-lib += $(LIB_GLUE)
 endif
diff --git a/drivers/common/mlx5/meson.build b/drivers/common/mlx5/meson.build
index 83671861c9..175251b691 100644
--- a/drivers/common/mlx5/meson.build
+++ b/drivers/common/mlx5/meson.build
@@ -56,6 +56,7 @@ sources = files(
 	'mlx5_common.c',
 	'mlx5_nl.c',
 	'mlx5_common_mp.c',
+	'mlx5_common_mr.c',
 )
 if not dlopen_ibverbs
 	sources += files('mlx5_glue.c')
diff --git a/drivers/net/mlx5/mlx5.c b/drivers/net/mlx5/mlx5.c
index 9eac8011f3..f45055d96f 100644
--- a/drivers/net/mlx5/mlx5.c
+++ b/drivers/net/mlx5/mlx5.c
@@ -618,7 +618,7 @@ mlx5_alloc_shared_ibctx(const struct mlx5_dev_spawn_data *spawn,
 	 * At this point the device is not added to the memory
 	 * event list yet, context is just being created.
 	 */
-	err = mlx5_mr_btree_init(&sh->mr.cache,
+	err = mlx5_mr_btree_init(&sh->share_cache.cache,
 				 MLX5_MR_BTREE_CACHE_N * 2,
 				 spawn->pci_dev->device.numa_node);
 	if (err) {
@@ -690,7 +690,7 @@ mlx5_free_shared_ibctx(struct mlx5_ibv_shared *sh)
 	LIST_REMOVE(sh, mem_event_cb);
 	rte_rwlock_write_unlock(&mlx5_shared_data->mem_event_rwlock);
 	/* Release created Memory Regions. */
-	mlx5_mr_release(sh);
+	mlx5_mr_release_cache(&sh->share_cache);
 	/* Remove context from the global device list. */
 	LIST_REMOVE(sh, next);
 	/*
diff --git a/drivers/net/mlx5/mlx5.h b/drivers/net/mlx5/mlx5.h
index 9e15600afd..41b6e78369 100644
--- a/drivers/net/mlx5/mlx5.h
+++ b/drivers/net/mlx5/mlx5.h
@@ -37,10 +37,10 @@
 #include <mlx5_prm.h>
 #include <mlx5_nl.h>
 #include <mlx5_common_mp.h>
+#include <mlx5_common_mr.h>
 
 #include "mlx5_defs.h"
 #include "mlx5_utils.h"
-#include "mlx5_mr.h"
 #include "mlx5_autoconf.h"
 
 /** Key string for IPC. */
@@ -198,8 +198,6 @@ struct mlx5_verbs_alloc_ctx {
 	const void *obj; /* Pointer to the DPDK object. */
 };
 
-LIST_HEAD(mlx5_mr_list, mlx5_mr);
-
 /* Flow drop context necessary due to Verbs API. */
 struct mlx5_drop {
 	struct mlx5_hrxq *hrxq; /* Hash Rx queue queue. */
@@ -390,13 +388,7 @@ struct mlx5_ibv_shared {
 	struct ibv_device_attr_ex device_attr; /* Device properties. */
 	LIST_ENTRY(mlx5_ibv_shared) mem_event_cb;
 	/**< Called by memory event callback. */
-	struct {
-		uint32_t dev_gen; /* Generation number to flush local caches. */
-		rte_rwlock_t rwlock; /* MR Lock. */
-		struct mlx5_mr_btree cache; /* Global MR cache table. */
-		struct mlx5_mr_list mr_list; /* Registered MR list. */
-		struct mlx5_mr_list mr_free_list; /* Freed MR list. */
-	} mr;
+	struct mlx5_mr_share_cache share_cache;
 	/* Shared DV/DR flow data section. */
 	pthread_mutex_t dv_mutex; /* DV context mutex. */
 	uint32_t dv_meta_mask; /* flow META metadata supported mask. */
diff --git a/drivers/net/mlx5/mlx5_mp.c b/drivers/net/mlx5/mlx5_mp.c
index 43684dbc3a..7ad322d474 100644
--- a/drivers/net/mlx5/mlx5_mp.c
+++ b/drivers/net/mlx5/mlx5_mp.c
@@ -11,6 +11,7 @@
 #include <rte_string_fns.h>
 
 #include <mlx5_common_mp.h>
+#include <mlx5_common_mr.h>
 
 #include "mlx5.h"
 #include "mlx5_rxtx.h"
@@ -25,7 +26,7 @@ mlx5_mp_primary_handle(const struct rte_mp_msg *mp_msg, const void *peer)
 		(const struct mlx5_mp_param *)mp_msg->param;
 	struct rte_eth_dev *dev;
 	struct mlx5_priv *priv;
-	struct mlx5_mr_cache entry;
+	struct mr_cache_entry entry;
 	uint32_t lkey;
 	int ret;
 
@@ -40,7 +41,10 @@ mlx5_mp_primary_handle(const struct rte_mp_msg *mp_msg, const void *peer)
 	switch (param->type) {
 	case MLX5_MP_REQ_CREATE_MR:
 		mp_init_msg(&priv->mp_id, &mp_res, param->type);
-		lkey = mlx5_mr_create_primary(dev, &entry, param->args.addr);
+		lkey = mlx5_mr_create_primary(priv->sh->pd,
+					      &priv->sh->share_cache,
+					      &entry, param->args.addr,
+					      priv->config.mr_ext_memseg_en);
 		if (lkey == UINT32_MAX)
 			res->result = -rte_errno;
 		ret = rte_mp_reply(&mp_res, peer);
diff --git a/drivers/net/mlx5/mlx5_mr.c b/drivers/net/mlx5/mlx5_mr.c
index 9151992a72..2b4b3e2891 100644
--- a/drivers/net/mlx5/mlx5_mr.c
+++ b/drivers/net/mlx5/mlx5_mr.c
@@ -18,6 +18,8 @@
 #include <rte_bus_pci.h>
 
 #include <mlx5_glue.h>
+#include <mlx5_common_mp.h>
+#include <mlx5_common_mr.h>
 
 #include "mlx5.h"
 #include "mlx5_mr.h"
@@ -36,834 +38,6 @@ struct mr_update_mp_data {
 	int ret;
 };
 
-/**
- * Expand B-tree table to a given size. Can't be called with holding
- * memory_hotplug_lock or sh->mr.rwlock due to rte_realloc().
- *
- * @param bt
- *   Pointer to B-tree structure.
- * @param n
- *   Number of entries for expansion.
- *
- * @return
- *   0 on success, -1 on failure.
- */
-static int
-mr_btree_expand(struct mlx5_mr_btree *bt, int n)
-{
-	void *mem;
-	int ret = 0;
-
-	if (n <= bt->size)
-		return ret;
-	/*
-	 * Downside of directly using rte_realloc() is that SOCKET_ID_ANY is
-	 * used inside if there's no room to expand. Because this is a quite
-	 * rare case and a part of very slow path, it is very acceptable.
-	 * Initially cache_bh[] will be given practically enough space and once
-	 * it is expanded, expansion wouldn't be needed again ever.
-	 */
-	mem = rte_realloc(bt->table, n * sizeof(struct mlx5_mr_cache), 0);
-	if (mem == NULL) {
-		/* Not an error, B-tree search will be skipped. */
-		DRV_LOG(WARNING, "failed to expand MR B-tree (%p) table",
-			(void *)bt);
-		ret = -1;
-	} else {
-		DRV_LOG(DEBUG, "expanded MR B-tree table (size=%u)", n);
-		bt->table = mem;
-		bt->size = n;
-	}
-	return ret;
-}
-
-/**
- * Look up LKey from given B-tree lookup table, store the last index and return
- * searched LKey.
- *
- * @param bt
- *   Pointer to B-tree structure.
- * @param[out] idx
- *   Pointer to index. Even on search failure, returns index where it stops
- *   searching so that index can be used when inserting a new entry.
- * @param addr
- *   Search key.
- *
- * @return
- *   Searched LKey on success, UINT32_MAX on no match.
- */
-static uint32_t
-mr_btree_lookup(struct mlx5_mr_btree *bt, uint16_t *idx, uintptr_t addr)
-{
-	struct mlx5_mr_cache *lkp_tbl;
-	uint16_t n;
-	uint16_t base = 0;
-
-	MLX5_ASSERT(bt != NULL);
-	lkp_tbl = *bt->table;
-	n = bt->len;
-	/* First entry must be NULL for comparison. */
-	MLX5_ASSERT(bt->len > 0 || (lkp_tbl[0].start == 0 &&
-				    lkp_tbl[0].lkey == UINT32_MAX));
-	/* Binary search. */
-	do {
-		register uint16_t delta = n >> 1;
-
-		if (addr < lkp_tbl[base + delta].start) {
-			n = delta;
-		} else {
-			base += delta;
-			n -= delta;
-		}
-	} while (n > 1);
-	MLX5_ASSERT(addr >= lkp_tbl[base].start);
-	*idx = base;
-	if (addr < lkp_tbl[base].end)
-		return lkp_tbl[base].lkey;
-	/* Not found. */
-	return UINT32_MAX;
-}
-
-/**
- * Insert an entry to B-tree lookup table.
- *
- * @param bt
- *   Pointer to B-tree structure.
- * @param entry
- *   Pointer to new entry to insert.
- *
- * @return
- *   0 on success, -1 on failure.
- */
-static int
-mr_btree_insert(struct mlx5_mr_btree *bt, struct mlx5_mr_cache *entry)
-{
-	struct mlx5_mr_cache *lkp_tbl;
-	uint16_t idx = 0;
-	size_t shift;
-
-	MLX5_ASSERT(bt != NULL);
-	MLX5_ASSERT(bt->len <= bt->size);
-	MLX5_ASSERT(bt->len > 0);
-	lkp_tbl = *bt->table;
-	/* Find out the slot for insertion. */
-	if (mr_btree_lookup(bt, &idx, entry->start) != UINT32_MAX) {
-		DRV_LOG(DEBUG,
-			"abort insertion to B-tree(%p): already exist at"
-			" idx=%u [0x%" PRIxPTR ", 0x%" PRIxPTR ") lkey=0x%x",
-			(void *)bt, idx, entry->start, entry->end, entry->lkey);
-		/* Already exist, return. */
-		return 0;
-	}
-	/* If table is full, return error. */
-	if (unlikely(bt->len == bt->size)) {
-		bt->overflow = 1;
-		return -1;
-	}
-	/* Insert entry. */
-	++idx;
-	shift = (bt->len - idx) * sizeof(struct mlx5_mr_cache);
-	if (shift)
-		memmove(&lkp_tbl[idx + 1], &lkp_tbl[idx], shift);
-	lkp_tbl[idx] = *entry;
-	bt->len++;
-	DRV_LOG(DEBUG,
-		"inserted B-tree(%p)[%u],"
-		" [0x%" PRIxPTR ", 0x%" PRIxPTR ") lkey=0x%x",
-		(void *)bt, idx, entry->start, entry->end, entry->lkey);
-	return 0;
-}
-
-/**
- * Initialize B-tree and allocate memory for lookup table.
- *
- * @param bt
- *   Pointer to B-tree structure.
- * @param n
- *   Number of entries to allocate.
- * @param socket
- *   NUMA socket on which memory must be allocated.
- *
- * @return
- *   0 on success, a negative errno value otherwise and rte_errno is set.
- */
-int
-mlx5_mr_btree_init(struct mlx5_mr_btree *bt, int n, int socket)
-{
-	if (bt == NULL) {
-		rte_errno = EINVAL;
-		return -rte_errno;
-	}
-	MLX5_ASSERT(!bt->table && !bt->size);
-	memset(bt, 0, sizeof(*bt));
-	bt->table = rte_calloc_socket("B-tree table",
-				      n, sizeof(struct mlx5_mr_cache),
-				      0, socket);
-	if (bt->table == NULL) {
-		rte_errno = ENOMEM;
-		DEBUG("failed to allocate memory for btree cache on socket %d",
-		      socket);
-		return -rte_errno;
-	}
-	bt->size = n;
-	/* First entry must be NULL for binary search. */
-	(*bt->table)[bt->len++] = (struct mlx5_mr_cache) {
-		.lkey = UINT32_MAX,
-	};
-	DEBUG("initialized B-tree %p with table %p",
-	      (void *)bt, (void *)bt->table);
-	return 0;
-}
-
-/**
- * Free B-tree resources.
- *
- * @param bt
- *   Pointer to B-tree structure.
- */
-void
-mlx5_mr_btree_free(struct mlx5_mr_btree *bt)
-{
-	if (bt == NULL)
-		return;
-	DEBUG("freeing B-tree %p with table %p",
-	      (void *)bt, (void *)bt->table);
-	rte_free(bt->table);
-	memset(bt, 0, sizeof(*bt));
-}
-
-/**
- * Dump all the entries in a B-tree
- *
- * @param bt
- *   Pointer to B-tree structure.
- */
-void
-mlx5_mr_btree_dump(struct mlx5_mr_btree *bt __rte_unused)
-{
-#ifdef RTE_LIBRTE_MLX5_DEBUG
-	int idx;
-	struct mlx5_mr_cache *lkp_tbl;
-
-	if (bt == NULL)
-		return;
-	lkp_tbl = *bt->table;
-	for (idx = 0; idx < bt->len; ++idx) {
-		struct mlx5_mr_cache *entry = &lkp_tbl[idx];
-
-		DEBUG("B-tree(%p)[%u],"
-		      " [0x%" PRIxPTR ", 0x%" PRIxPTR ") lkey=0x%x",
-		      (void *)bt, idx, entry->start, entry->end, entry->lkey);
-	}
-#endif
-}
-
-/**
- * Find virtually contiguous memory chunk in a given MR.
- *
- * @param dev
- *   Pointer to MR structure.
- * @param[out] entry
- *   Pointer to returning MR cache entry. If not found, this will not be
- *   updated.
- * @param start_idx
- *   Start index of the memseg bitmap.
- *
- * @return
- *   Next index to go on lookup.
- */
-static int
-mr_find_next_chunk(struct mlx5_mr *mr, struct mlx5_mr_cache *entry,
-		   int base_idx)
-{
-	uintptr_t start = 0;
-	uintptr_t end = 0;
-	uint32_t idx = 0;
-
-	/* MR for external memory doesn't have memseg list. */
-	if (mr->msl == NULL) {
-		struct ibv_mr *ibv_mr = mr->ibv_mr;
-
-		MLX5_ASSERT(mr->ms_bmp_n == 1);
-		MLX5_ASSERT(mr->ms_n == 1);
-		MLX5_ASSERT(base_idx == 0);
-		/*
-		 * Can't search it from memseg list but get it directly from
-		 * verbs MR as there's only one chunk.
-		 */
-		entry->start = (uintptr_t)ibv_mr->addr;
-		entry->end = (uintptr_t)ibv_mr->addr + mr->ibv_mr->length;
-		entry->lkey = rte_cpu_to_be_32(mr->ibv_mr->lkey);
-		/* Returning 1 ends iteration. */
-		return 1;
-	}
-	for (idx = base_idx; idx < mr->ms_bmp_n; ++idx) {
-		if (rte_bitmap_get(mr->ms_bmp, idx)) {
-			const struct rte_memseg_list *msl;
-			const struct rte_memseg *ms;
-
-			msl = mr->msl;
-			ms = rte_fbarray_get(&msl->memseg_arr,
-					     mr->ms_base_idx + idx);
-			MLX5_ASSERT(msl->page_sz == ms->hugepage_sz);
-			if (!start)
-				start = ms->addr_64;
-			end = ms->addr_64 + ms->hugepage_sz;
-		} else if (start) {
-			/* Passed the end of a fragment. */
-			break;
-		}
-	}
-	if (start) {
-		/* Found one chunk. */
-		entry->start = start;
-		entry->end = end;
-		entry->lkey = rte_cpu_to_be_32(mr->ibv_mr->lkey);
-	}
-	return idx;
-}
-
-/**
- * Insert a MR to the global B-tree cache. It may fail due to low-on-memory.
- * Then, this entry will have to be searched by mr_lookup_dev_list() in
- * mlx5_mr_create() on miss.
- *
- * @param dev
- *   Pointer to Ethernet device shared context.
- * @param mr
- *   Pointer to MR to insert.
- *
- * @return
- *   0 on success, -1 on failure.
- */
-static int
-mr_insert_dev_cache(struct mlx5_ibv_shared *sh, struct mlx5_mr *mr)
-{
-	unsigned int n;
-
-	DRV_LOG(DEBUG, "device %s inserting MR(%p) to global cache",
-		sh->ibdev_name, (void *)mr);
-	for (n = 0; n < mr->ms_bmp_n; ) {
-		struct mlx5_mr_cache entry;
-
-		memset(&entry, 0, sizeof(entry));
-		/* Find a contiguous chunk and advance the index. */
-		n = mr_find_next_chunk(mr, &entry, n);
-		if (!entry.end)
-			break;
-		if (mr_btree_insert(&sh->mr.cache, &entry) < 0) {
-			/*
-			 * Overflowed, but the global table cannot be expanded
-			 * because of deadlock.
-			 */
-			return -1;
-		}
-	}
-	return 0;
-}
-
-/**
- * Look up address in the original global MR list.
- *
- * @param sh
- *   Pointer to Ethernet device shared context.
- * @param[out] entry
- *   Pointer to returning MR cache entry. If no match, this will not be updated.
- * @param addr
- *   Search key.
- *
- * @return
- *   Found MR on match, NULL otherwise.
- */
-static struct mlx5_mr *
-mr_lookup_dev_list(struct mlx5_ibv_shared *sh, struct mlx5_mr_cache *entry,
-		   uintptr_t addr)
-{
-	struct mlx5_mr *mr;
-
-	/* Iterate all the existing MRs. */
-	LIST_FOREACH(mr, &sh->mr.mr_list, mr) {
-		unsigned int n;
-
-		if (mr->ms_n == 0)
-			continue;
-		for (n = 0; n < mr->ms_bmp_n; ) {
-			struct mlx5_mr_cache ret;
-
-			memset(&ret, 0, sizeof(ret));
-			n = mr_find_next_chunk(mr, &ret, n);
-			if (addr >= ret.start && addr < ret.end) {
-				/* Found. */
-				*entry = ret;
-				return mr;
-			}
-		}
-	}
-	return NULL;
-}
-
-/**
- * Look up address on device.
- *
- * @param dev
- *   Pointer to Ethernet device shared context.
- * @param[out] entry
- *   Pointer to returning MR cache entry. If no match, this will not be updated.
- * @param addr
- *   Search key.
- *
- * @return
- *   Searched LKey on success, UINT32_MAX on failure and rte_errno is set.
- */
-static uint32_t
-mr_lookup_dev(struct mlx5_ibv_shared *sh, struct mlx5_mr_cache *entry,
-	      uintptr_t addr)
-{
-	uint16_t idx;
-	uint32_t lkey = UINT32_MAX;
-	struct mlx5_mr *mr;
-
-	/*
-	 * If the global cache has overflowed since it failed to expand the
-	 * B-tree table, it can't have all the existing MRs. Then, the address
-	 * has to be searched by traversing the original MR list instead, which
-	 * is very slow path. Otherwise, the global cache is all inclusive.
-	 */
-	if (!unlikely(sh->mr.cache.overflow)) {
-		lkey = mr_btree_lookup(&sh->mr.cache, &idx, addr);
-		if (lkey != UINT32_MAX)
-			*entry = (*sh->mr.cache.table)[idx];
-	} else {
-		/* Falling back to the slowest path. */
-		mr = mr_lookup_dev_list(sh, entry, addr);
-		if (mr != NULL)
-			lkey = entry->lkey;
-	}
-	MLX5_ASSERT(lkey == UINT32_MAX || (addr >= entry->start &&
-					   addr < entry->end));
-	return lkey;
-}
-
-/**
- * Free MR resources. MR lock must not be held to avoid a deadlock. rte_free()
- * can raise memory free event and the callback function will spin on the lock.
- *
- * @param mr
- *   Pointer to MR to free.
- */
-static void
-mr_free(struct mlx5_mr *mr)
-{
-	if (mr == NULL)
-		return;
-	DRV_LOG(DEBUG, "freeing MR(%p):", (void *)mr);
-	if (mr->ibv_mr != NULL)
-		claim_zero(mlx5_glue->dereg_mr(mr->ibv_mr));
-	if (mr->ms_bmp != NULL)
-		rte_bitmap_free(mr->ms_bmp);
-	rte_free(mr);
-}
-
-/**
- * Release resources of detached MR having no online entry.
- *
- * @param sh
- *   Pointer to Ethernet device shared context.
- */
-static void
-mlx5_mr_garbage_collect(struct mlx5_ibv_shared *sh)
-{
-	struct mlx5_mr *mr_next;
-	struct mlx5_mr_list free_list = LIST_HEAD_INITIALIZER(free_list);
-
-	/* Must be called from the primary process. */
-	MLX5_ASSERT(rte_eal_process_type() == RTE_PROC_PRIMARY);
-	/*
-	 * MR can't be freed with holding the lock because rte_free() could call
-	 * memory free callback function. This will be a deadlock situation.
-	 */
-	rte_rwlock_write_lock(&sh->mr.rwlock);
-	/* Detach the whole free list and release it after unlocking. */
-	free_list = sh->mr.mr_free_list;
-	LIST_INIT(&sh->mr.mr_free_list);
-	rte_rwlock_write_unlock(&sh->mr.rwlock);
-	/* Release resources. */
-	mr_next = LIST_FIRST(&free_list);
-	while (mr_next != NULL) {
-		struct mlx5_mr *mr = mr_next;
-
-		mr_next = LIST_NEXT(mr, mr);
-		mr_free(mr);
-	}
-}
-
-/* Called during rte_memseg_contig_walk() by mlx5_mr_create(). */
-static int
-mr_find_contig_memsegs_cb(const struct rte_memseg_list *msl,
-			  const struct rte_memseg *ms, size_t len, void *arg)
-{
-	struct mr_find_contig_memsegs_data *data = arg;
-
-	if (data->addr < ms->addr_64 || data->addr >= ms->addr_64 + len)
-		return 0;
-	/* Found, save it and stop walking. */
-	data->start = ms->addr_64;
-	data->end = ms->addr_64 + len;
-	data->msl = msl;
-	return 1;
-}
-
-/**
- * Create a new global Memory Region (MR) for a missing virtual address.
- * This API should be called on a secondary process, then a request is sent to
- * the primary process in order to create a MR for the address. As the global MR
- * list is on the shared memory, following LKey lookup should succeed unless the
- * request fails.
- *
- * @param dev
- *   Pointer to Ethernet device.
- * @param[out] entry
- *   Pointer to returning MR cache entry, found in the global cache or newly
- *   created. If failed to create one, this will not be updated.
- * @param addr
- *   Target virtual address to register.
- *
- * @return
- *   Searched LKey on success, UINT32_MAX on failure and rte_errno is set.
- */
-static uint32_t
-mlx5_mr_create_secondary(struct rte_eth_dev *dev, struct mlx5_mr_cache *entry,
-			 uintptr_t addr)
-{
-	struct mlx5_priv *priv = dev->data->dev_private;
-	int ret;
-
-	DEBUG("port %u requesting MR creation for address (%p)",
-	      dev->data->port_id, (void *)addr);
-	ret = mlx5_mp_req_mr_create(&priv->mp_id, addr);
-	if (ret) {
-		DEBUG("port %u fail to request MR creation for address (%p)",
-		      dev->data->port_id, (void *)addr);
-		return UINT32_MAX;
-	}
-	rte_rwlock_read_lock(&priv->sh->mr.rwlock);
-	/* Fill in output data. */
-	mr_lookup_dev(priv->sh, entry, addr);
-	/* Lookup can't fail. */
-	MLX5_ASSERT(entry->lkey != UINT32_MAX);
-	rte_rwlock_read_unlock(&priv->sh->mr.rwlock);
-	DEBUG("port %u MR CREATED by primary process for %p:\n"
-	      "  [0x%" PRIxPTR ", 0x%" PRIxPTR "), lkey=0x%x",
-	      dev->data->port_id, (void *)addr,
-	      entry->start, entry->end, entry->lkey);
-	return entry->lkey;
-}
-
-/**
- * Create a new global Memory Region (MR) for a missing virtual address.
- * Register entire virtually contiguous memory chunk around the address.
- * This must be called from the primary process.
- *
- * @param dev
- *   Pointer to Ethernet device.
- * @param[out] entry
- *   Pointer to returning MR cache entry, found in the global cache or newly
- *   created. If failed to create one, this will not be updated.
- * @param addr
- *   Target virtual address to register.
- *
- * @return
- *   Searched LKey on success, UINT32_MAX on failure and rte_errno is set.
- */
-uint32_t
-mlx5_mr_create_primary(struct rte_eth_dev *dev, struct mlx5_mr_cache *entry,
-		       uintptr_t addr)
-{
-	struct mlx5_priv *priv = dev->data->dev_private;
-	struct mlx5_ibv_shared *sh = priv->sh;
-	struct mlx5_dev_config *config = &priv->config;
-	const struct rte_memseg_list *msl;
-	const struct rte_memseg *ms;
-	struct mlx5_mr *mr = NULL;
-	size_t len;
-	uint32_t ms_n;
-	uint32_t bmp_size;
-	void *bmp_mem;
-	int ms_idx_shift = -1;
-	unsigned int n;
-	struct mr_find_contig_memsegs_data data = {
-		.addr = addr,
-	};
-	struct mr_find_contig_memsegs_data data_re;
-
-	DRV_LOG(DEBUG, "port %u creating a MR using address (%p)",
-		dev->data->port_id, (void *)addr);
-	/*
-	 * Release detached MRs if any. This can't be called with holding either
-	 * memory_hotplug_lock or sh->mr.rwlock. MRs on the free list have
-	 * been detached by the memory free event but it couldn't be released
-	 * inside the callback due to deadlock. As a result, releasing resources
-	 * is quite opportunistic.
-	 */
-	mlx5_mr_garbage_collect(sh);
-	/*
-	 * If enabled, find out a contiguous virtual address chunk in use, to
-	 * which the given address belongs, in order to register maximum range.
-	 * In the best case where mempools are not dynamically recreated and
-	 * '--socket-mem' is specified as an EAL option, it is very likely to
-	 * have only one MR(LKey) per a socket and per a hugepage-size even
-	 * though the system memory is highly fragmented. As the whole memory
-	 * chunk will be pinned by kernel, it can't be reused unless entire
-	 * chunk is freed from EAL.
-	 *
-	 * If disabled, just register one memseg (page). Then, memory
-	 * consumption will be minimized but it may drop performance if there
-	 * are many MRs to lookup on the datapath.
-	 */
-	if (!config->mr_ext_memseg_en) {
-		data.msl = rte_mem_virt2memseg_list((void *)addr);
-		data.start = RTE_ALIGN_FLOOR(addr, data.msl->page_sz);
-		data.end = data.start + data.msl->page_sz;
-	} else if (!rte_memseg_contig_walk(mr_find_contig_memsegs_cb, &data)) {
-		DRV_LOG(WARNING,
-			"port %u unable to find virtually contiguous"
-			" chunk for address (%p)."
-			" rte_memseg_contig_walk() failed.",
-			dev->data->port_id, (void *)addr);
-		rte_errno = ENXIO;
-		goto err_nolock;
-	}
-alloc_resources:
-	/* Addresses must be page-aligned. */
-	MLX5_ASSERT(rte_is_aligned((void *)data.start, data.msl->page_sz));
-	MLX5_ASSERT(rte_is_aligned((void *)data.end, data.msl->page_sz));
-	msl = data.msl;
-	ms = rte_mem_virt2memseg((void *)data.start, msl);
-	len = data.end - data.start;
-	MLX5_ASSERT(msl->page_sz == ms->hugepage_sz);
-	/* Number of memsegs in the range. */
-	ms_n = len / msl->page_sz;
-	DEBUG("port %u extending %p to [0x%" PRIxPTR ", 0x%" PRIxPTR "),"
-	      " page_sz=0x%" PRIx64 ", ms_n=%u",
-	      dev->data->port_id, (void *)addr,
-	      data.start, data.end, msl->page_sz, ms_n);
-	/* Size of memory for bitmap. */
-	bmp_size = rte_bitmap_get_memory_footprint(ms_n);
-	mr = rte_zmalloc_socket(NULL,
-				RTE_ALIGN_CEIL(sizeof(*mr),
-					       RTE_CACHE_LINE_SIZE) +
-				bmp_size,
-				RTE_CACHE_LINE_SIZE, msl->socket_id);
-	if (mr == NULL) {
-		DEBUG("port %u unable to allocate memory for a new MR of"
-		      " address (%p).",
-		      dev->data->port_id, (void *)addr);
-		rte_errno = ENOMEM;
-		goto err_nolock;
-	}
-	mr->msl = msl;
-	/*
-	 * Save the index of the first memseg and initialize memseg bitmap. To
-	 * see if a memseg of ms_idx in the memseg-list is still valid, check:
-	 *	rte_bitmap_get(mr->bmp, ms_idx - mr->ms_base_idx)
-	 */
-	mr->ms_base_idx = rte_fbarray_find_idx(&msl->memseg_arr, ms);
-	bmp_mem = RTE_PTR_ALIGN_CEIL(mr + 1, RTE_CACHE_LINE_SIZE);
-	mr->ms_bmp = rte_bitmap_init(ms_n, bmp_mem, bmp_size);
-	if (mr->ms_bmp == NULL) {
-		DEBUG("port %u unable to initialize bitmap for a new MR of"
-		      " address (%p).",
-		      dev->data->port_id, (void *)addr);
-		rte_errno = EINVAL;
-		goto err_nolock;
-	}
-	/*
-	 * Should recheck whether the extended contiguous chunk is still valid.
-	 * Because memory_hotplug_lock can't be held if there's any memory
-	 * related calls in a critical path, resource allocation above can't be
-	 * locked. If the memory has been changed at this point, try again with
-	 * just single page. If not, go on with the big chunk atomically from
-	 * here.
-	 */
-	rte_mcfg_mem_read_lock();
-	data_re = data;
-	if (len > msl->page_sz &&
-	    !rte_memseg_contig_walk(mr_find_contig_memsegs_cb, &data_re)) {
-		DEBUG("port %u unable to find virtually contiguous"
-		      " chunk for address (%p)."
-		      " rte_memseg_contig_walk() failed.",
-		      dev->data->port_id, (void *)addr);
-		rte_errno = ENXIO;
-		goto err_memlock;
-	}
-	if (data.start != data_re.start || data.end != data_re.end) {
-		/*
-		 * The extended contiguous chunk has been changed. Try again
-		 * with single memseg instead.
-		 */
-		data.start = RTE_ALIGN_FLOOR(addr, msl->page_sz);
-		data.end = data.start + msl->page_sz;
-		rte_mcfg_mem_read_unlock();
-		mr_free(mr);
-		goto alloc_resources;
-	}
-	MLX5_ASSERT(data.msl == data_re.msl);
-	rte_rwlock_write_lock(&sh->mr.rwlock);
-	/*
-	 * Check the address is really missing. If other thread already created
-	 * one or it is not found due to overflow, abort and return.
-	 */
-	if (mr_lookup_dev(sh, entry, addr) != UINT32_MAX) {
-		/*
-		 * Insert to the global cache table. It may fail due to
-		 * low-on-memory. Then, this entry will have to be searched
-		 * here again.
-		 */
-		mr_btree_insert(&sh->mr.cache, entry);
-		DEBUG("port %u found MR for %p on final lookup, abort",
-		      dev->data->port_id, (void *)addr);
-		rte_rwlock_write_unlock(&sh->mr.rwlock);
-		rte_mcfg_mem_read_unlock();
-		/*
-		 * Must be unlocked before calling rte_free() because
-		 * mlx5_mr_mem_event_free_cb() can be called inside.
-		 */
-		mr_free(mr);
-		return entry->lkey;
-	}
-	/*
-	 * Trim start and end addresses for verbs MR. Set bits for registering
-	 * memsegs but exclude already registered ones. Bitmap can be
-	 * fragmented.
-	 */
-	for (n = 0; n < ms_n; ++n) {
-		uintptr_t start;
-		struct mlx5_mr_cache ret;
-
-		memset(&ret, 0, sizeof(ret));
-		start = data_re.start + n * msl->page_sz;
-		/* Exclude memsegs already registered by other MRs. */
-		if (mr_lookup_dev(sh, &ret, start) == UINT32_MAX) {
-			/*
-			 * Start from the first unregistered memseg in the
-			 * extended range.
-			 */
-			if (ms_idx_shift == -1) {
-				mr->ms_base_idx += n;
-				data.start = start;
-				ms_idx_shift = n;
-			}
-			data.end = start + msl->page_sz;
-			rte_bitmap_set(mr->ms_bmp, n - ms_idx_shift);
-			++mr->ms_n;
-		}
-	}
-	len = data.end - data.start;
-	mr->ms_bmp_n = len / msl->page_sz;
-	MLX5_ASSERT(ms_idx_shift + mr->ms_bmp_n <= ms_n);
-	/*
-	 * Finally create a verbs MR for the memory chunk. ibv_reg_mr() can be
-	 * called with holding the memory lock because it doesn't use
-	 * mlx5_alloc_buf_extern() which eventually calls rte_malloc_socket()
-	 * through mlx5_alloc_verbs_buf().
-	 */
-	mr->ibv_mr = mlx5_glue->reg_mr(sh->pd, (void *)data.start, len,
-				       IBV_ACCESS_LOCAL_WRITE |
-					   IBV_ACCESS_RELAXED_ORDERING);
-	if (mr->ibv_mr == NULL) {
-		DEBUG("port %u fail to create a verbs MR for address (%p)",
-		      dev->data->port_id, (void *)addr);
-		rte_errno = EINVAL;
-		goto err_mrlock;
-	}
-	MLX5_ASSERT((uintptr_t)mr->ibv_mr->addr == data.start);
-	MLX5_ASSERT(mr->ibv_mr->length == len);
-	LIST_INSERT_HEAD(&sh->mr.mr_list, mr, mr);
-	DEBUG("port %u MR CREATED (%p) for %p:\n"
-	      "  [0x%" PRIxPTR ", 0x%" PRIxPTR "),"
-	      " lkey=0x%x base_idx=%u ms_n=%u, ms_bmp_n=%u",
-	      dev->data->port_id, (void *)mr, (void *)addr,
-	      data.start, data.end, rte_cpu_to_be_32(mr->ibv_mr->lkey),
-	      mr->ms_base_idx, mr->ms_n, mr->ms_bmp_n);
-	/* Insert to the global cache table. */
-	mr_insert_dev_cache(sh, mr);
-	/* Fill in output data. */
-	mr_lookup_dev(sh, entry, addr);
-	/* Lookup can't fail. */
-	MLX5_ASSERT(entry->lkey != UINT32_MAX);
-	rte_rwlock_write_unlock(&sh->mr.rwlock);
-	rte_mcfg_mem_read_unlock();
-	return entry->lkey;
-err_mrlock:
-	rte_rwlock_write_unlock(&sh->mr.rwlock);
-err_memlock:
-	rte_mcfg_mem_read_unlock();
-err_nolock:
-	/*
-	 * In case of error, as this can be called in a datapath, a warning
-	 * message per an error is preferable instead. Must be unlocked before
-	 * calling rte_free() because mlx5_mr_mem_event_free_cb() can be called
-	 * inside.
-	 */
-	mr_free(mr);
-	return UINT32_MAX;
-}
-
-/**
- * Create a new global Memory Region (MR) for a missing virtual address.
- * This can be called from primary and secondary process.
- *
- * @param dev
- *   Pointer to Ethernet device.
- * @param[out] entry
- *   Pointer to returning MR cache entry, found in the global cache or newly
- *   created. If failed to create one, this will not be updated.
- * @param addr
- *   Target virtual address to register.
- *
- * @return
- *   Searched LKey on success, UINT32_MAX on failure and rte_errno is set.
- */
-static uint32_t
-mlx5_mr_create(struct rte_eth_dev *dev, struct mlx5_mr_cache *entry,
-	       uintptr_t addr)
-{
-	uint32_t ret = 0;
-
-	switch (rte_eal_process_type()) {
-	case RTE_PROC_PRIMARY:
-		ret = mlx5_mr_create_primary(dev, entry, addr);
-		break;
-	case RTE_PROC_SECONDARY:
-		ret = mlx5_mr_create_secondary(dev, entry, addr);
-		break;
-	default:
-		break;
-	}
-	return ret;
-}
-
-/**
- * Rebuild the global B-tree cache of device from the original MR list.
- *
- * @param sh
- *   Pointer to Ethernet device shared context.
- */
-static void
-mr_rebuild_dev_cache(struct mlx5_ibv_shared *sh)
-{
-	struct mlx5_mr *mr;
-
-	DRV_LOG(DEBUG, "device %s rebuild dev cache[]", sh->ibdev_name);
-	/* Flush cache to rebuild. */
-	sh->mr.cache.len = 1;
-	sh->mr.cache.overflow = 0;
-	/* Iterate all the existing MRs. */
-	LIST_FOREACH(mr, &sh->mr.mr_list, mr)
-		if (mr_insert_dev_cache(sh, mr) < 0)
-			return;
-}
-
 /**
  * Callback for memory free event. Iterate freed memsegs and check whether it
  * belongs to an existing MR. If found, clear the bit from bitmap of MR. As a
@@ -900,18 +74,18 @@ mlx5_mr_mem_event_free_cb(struct mlx5_ibv_shared *sh,
 		    RTE_ALIGN((uintptr_t)addr, msl->page_sz));
 	MLX5_ASSERT(len == RTE_ALIGN(len, msl->page_sz));
 	ms_n = len / msl->page_sz;
-	rte_rwlock_write_lock(&sh->mr.rwlock);
+	rte_rwlock_write_lock(&sh->share_cache.rwlock);
 	/* Clear bits of freed memsegs from MR. */
 	for (i = 0; i < ms_n; ++i) {
 		const struct rte_memseg *ms;
-		struct mlx5_mr_cache entry;
+		struct mr_cache_entry entry;
 		uintptr_t start;
 		int ms_idx;
 		uint32_t pos;
 
 		/* Find MR having this memseg. */
 		start = (uintptr_t)addr + i * msl->page_sz;
-		mr = mr_lookup_dev_list(sh, &entry, start);
+		mr = mlx5_mr_lookup_list(&sh->share_cache, &entry, start);
 		if (mr == NULL)
 			continue;
 		MLX5_ASSERT(mr->msl); /* Can't be external memory. */
@@ -927,7 +101,7 @@ mlx5_mr_mem_event_free_cb(struct mlx5_ibv_shared *sh,
 		rte_bitmap_clear(mr->ms_bmp, pos);
 		if (--mr->ms_n == 0) {
 			LIST_REMOVE(mr, mr);
-			LIST_INSERT_HEAD(&sh->mr.mr_free_list, mr, mr);
+			LIST_INSERT_HEAD(&sh->share_cache.mr_free_list, mr, mr);
 			DEBUG("device %s remove MR(%p) from list",
 			      sh->ibdev_name, (void *)mr);
 		}
@@ -938,7 +112,7 @@ mlx5_mr_mem_event_free_cb(struct mlx5_ibv_shared *sh,
 		rebuild = 1;
 	}
 	if (rebuild) {
-		mr_rebuild_dev_cache(sh);
+		mlx5_mr_rebuild_cache(&sh->share_cache);
 		/*
 		 * Flush local caches by propagating invalidation across cores.
 		 * rte_smp_wmb() is enough to synchronize this event. If one of
@@ -948,12 +122,12 @@ mlx5_mr_mem_event_free_cb(struct mlx5_ibv_shared *sh,
 		 * generation below) will be guaranteed to be seen by other core
 		 * before the core sees the newly allocated memory.
 		 */
-		++sh->mr.dev_gen;
+		++sh->share_cache.dev_gen;
 		DEBUG("broadcasting local cache flush, gen=%d",
-		      sh->mr.dev_gen);
+		      sh->share_cache.dev_gen);
 		rte_smp_wmb();
 	}
-	rte_rwlock_write_unlock(&sh->mr.rwlock);
+	rte_rwlock_write_unlock(&sh->share_cache.rwlock);
 }
 
 /**
@@ -990,111 +164,6 @@ mlx5_mr_mem_event_cb(enum rte_mem_event event_type, const void *addr,
 	}
 }
 
-/**
- * Look up address in the global MR cache table. If not found, create a new MR.
- * Insert the found/created entry to local bottom-half cache table.
- *
- * @param dev
- *   Pointer to Ethernet device.
- * @param mr_ctrl
- *   Pointer to per-queue MR control structure.
- * @param[out] entry
- *   Pointer to returning MR cache entry, found in the global cache or newly
- *   created. If failed to create one, this is not written.
- * @param addr
- *   Search key.
- *
- * @return
- *   Searched LKey on success, UINT32_MAX on no match.
- */
-static uint32_t
-mlx5_mr_lookup_dev(struct rte_eth_dev *dev, struct mlx5_mr_ctrl *mr_ctrl,
-		   struct mlx5_mr_cache *entry, uintptr_t addr)
-{
-	struct mlx5_priv *priv = dev->data->dev_private;
-	struct mlx5_ibv_shared *sh = priv->sh;
-	struct mlx5_mr_btree *bt = &mr_ctrl->cache_bh;
-	uint16_t idx;
-	uint32_t lkey;
-
-	/* If local cache table is full, try to double it. */
-	if (unlikely(bt->len == bt->size))
-		mr_btree_expand(bt, bt->size << 1);
-	/* Look up in the global cache. */
-	rte_rwlock_read_lock(&sh->mr.rwlock);
-	lkey = mr_btree_lookup(&sh->mr.cache, &idx, addr);
-	if (lkey != UINT32_MAX) {
-		/* Found. */
-		*entry = (*sh->mr.cache.table)[idx];
-		rte_rwlock_read_unlock(&sh->mr.rwlock);
-		/*
-		 * Update local cache. Even if it fails, return the found entry
-		 * to update top-half cache. Next time, this entry will be found
-		 * in the global cache.
-		 */
-		mr_btree_insert(bt, entry);
-		return lkey;
-	}
-	rte_rwlock_read_unlock(&sh->mr.rwlock);
-	/* First time to see the address? Create a new MR. */
-	lkey = mlx5_mr_create(dev, entry, addr);
-	/*
-	 * Update the local cache if successfully created a new global MR. Even
-	 * if failed to create one, there's no action to take in this datapath
-	 * code. As returning LKey is invalid, this will eventually make HW
-	 * fail.
-	 */
-	if (lkey != UINT32_MAX)
-		mr_btree_insert(bt, entry);
-	return lkey;
-}
-
-/**
- * Bottom-half of LKey search on datapath. Firstly search in cache_bh[] and if
- * misses, search in the global MR cache table and update the new entry to
- * per-queue local caches.
- *
- * @param dev
- *   Pointer to Ethernet device.
- * @param mr_ctrl
- *   Pointer to per-queue MR control structure.
- * @param addr
- *   Search key.
- *
- * @return
- *   Searched LKey on success, UINT32_MAX on no match.
- */
-static uint32_t
-mlx5_mr_addr2mr_bh(struct rte_eth_dev *dev, struct mlx5_mr_ctrl *mr_ctrl,
-		   uintptr_t addr)
-{
-	uint32_t lkey;
-	uint16_t bh_idx = 0;
-	/* Victim in top-half cache to replace with new entry. */
-	struct mlx5_mr_cache *repl = &mr_ctrl->cache[mr_ctrl->head];
-
-	/* Binary-search MR translation table. */
-	lkey = mr_btree_lookup(&mr_ctrl->cache_bh, &bh_idx, addr);
-	/* Update top-half cache. */
-	if (likely(lkey != UINT32_MAX)) {
-		*repl = (*mr_ctrl->cache_bh.table)[bh_idx];
-	} else {
-		/*
-		 * If missed in local lookup table, search in the global cache
-		 * and local cache_bh[] will be updated inside if possible.
-		 * Top-half cache entry will also be updated.
-		 */
-		lkey = mlx5_mr_lookup_dev(dev, mr_ctrl, repl, addr);
-		if (unlikely(lkey == UINT32_MAX))
-			return UINT32_MAX;
-	}
-	/* Update the most recently used entry. */
-	mr_ctrl->mru = mr_ctrl->head;
-	/* Point to the next victim, the oldest. */
-	mr_ctrl->head = (mr_ctrl->head + 1) % MLX5_MR_CACHE_N;
-	return lkey;
-}
-
 /**
  * Bottom-half of LKey search on Rx.
  *
@@ -1114,7 +183,9 @@ mlx5_rx_addr2mr_bh(struct mlx5_rxq_data *rxq, uintptr_t addr)
 	struct mlx5_mr_ctrl *mr_ctrl = &rxq->mr_ctrl;
 	struct mlx5_priv *priv = rxq_ctrl->priv;
 
-	return mlx5_mr_addr2mr_bh(ETH_DEV(priv), mr_ctrl, addr);
+	return mlx5_mr_addr2mr_bh(priv->sh->pd, &priv->mp_id,
+				  &priv->sh->share_cache, mr_ctrl, addr,
+				  priv->config.mr_ext_memseg_en);
 }
 
 /**
@@ -1136,7 +207,9 @@ mlx5_tx_addr2mr_bh(struct mlx5_txq_data *txq, uintptr_t addr)
 	struct mlx5_mr_ctrl *mr_ctrl = &txq->mr_ctrl;
 	struct mlx5_priv *priv = txq_ctrl->priv;
 
-	return mlx5_mr_addr2mr_bh(ETH_DEV(priv), mr_ctrl, addr);
+	return mlx5_mr_addr2mr_bh(priv->sh->pd, &priv->mp_id,
+				  &priv->sh->share_cache, mr_ctrl, addr,
+				  priv->config.mr_ext_memseg_en);
 }
 
 /**
@@ -1165,82 +238,6 @@ mlx5_tx_mb2mr_bh(struct mlx5_txq_data *txq, struct rte_mbuf *mb)
 	return lkey;
 }
 
-/**
- * Flush all of the local cache entries.
- *
- * @param mr_ctrl
- *   Pointer to per-queue MR control structure.
- */
-void
-mlx5_mr_flush_local_cache(struct mlx5_mr_ctrl *mr_ctrl)
-{
-	/* Reset the most-recently-used index. */
-	mr_ctrl->mru = 0;
-	/* Reset the linear search array. */
-	mr_ctrl->head = 0;
-	memset(mr_ctrl->cache, 0, sizeof(mr_ctrl->cache));
-	/* Reset the B-tree table. */
-	mr_ctrl->cache_bh.len = 1;
-	mr_ctrl->cache_bh.overflow = 0;
-	/* Update the generation number. */
-	mr_ctrl->cur_gen = *mr_ctrl->dev_gen_ptr;
-	DRV_LOG(DEBUG, "mr_ctrl(%p): flushed, cur_gen=%d",
-		(void *)mr_ctrl, mr_ctrl->cur_gen);
-}
-
-/**
- * Creates a memory region for external memory, that is memory which is not
- * part of the DPDK memory segments.
- *
- * @param dev
- *   Pointer to the ethernet device.
- * @param addr
- *   Starting virtual address of memory.
- * @param len
- *   Length of memory segment being mapped.
- * @param socked_id
- *   Socket to allocate heap memory for the control structures.
- *
- * @return
- *   Pointer to MR structure on success, NULL otherwise.
- */
-static struct mlx5_mr *
-mlx5_create_mr_ext(struct rte_eth_dev *dev, uintptr_t addr, size_t len,
-		   int socket_id)
-{
-	struct mlx5_priv *priv = dev->data->dev_private;
-	struct mlx5_mr *mr = NULL;
-
-	mr = rte_zmalloc_socket(NULL,
-				RTE_ALIGN_CEIL(sizeof(*mr),
-					       RTE_CACHE_LINE_SIZE),
-				RTE_CACHE_LINE_SIZE, socket_id);
-	if (mr == NULL)
-		return NULL;
-	mr->ibv_mr = mlx5_glue->reg_mr(priv->sh->pd, (void *)addr, len,
-				       IBV_ACCESS_LOCAL_WRITE |
-					   IBV_ACCESS_RELAXED_ORDERING);
-	if (mr->ibv_mr == NULL) {
-		DRV_LOG(WARNING,
-			"port %u fail to create a verbs MR for address (%p)",
-			dev->data->port_id, (void *)addr);
-		rte_free(mr);
-		return NULL;
-	}
-	mr->msl = NULL; /* Mark it is external memory. */
-	mr->ms_bmp = NULL;
-	mr->ms_n = 1;
-	mr->ms_bmp_n = 1;
-	DRV_LOG(DEBUG,
-		"port %u MR CREATED (%p) for external memory %p:\n"
-		"  [0x%" PRIxPTR ", 0x%" PRIxPTR "),"
-		" lkey=0x%x base_idx=%u ms_n=%u, ms_bmp_n=%u",
-		dev->data->port_id, (void *)mr, (void *)addr,
-		addr, addr + len, rte_cpu_to_be_32(mr->ibv_mr->lkey),
-		mr->ms_base_idx, mr->ms_n, mr->ms_bmp_n);
-	return mr;
-}
-
 /**
  * Called during rte_mempool_mem_iter() by mlx5_mr_update_ext_mp().
  *
@@ -1267,19 +264,19 @@ mlx5_mr_update_ext_mp_cb(struct rte_mempool *mp, void *opaque,
 	struct mlx5_mr *mr = NULL;
 	uintptr_t addr = (uintptr_t)memhdr->addr;
 	size_t len = memhdr->len;
-	struct mlx5_mr_cache entry;
+	struct mr_cache_entry entry;
 	uint32_t lkey;
 
 	MLX5_ASSERT(rte_eal_process_type() == RTE_PROC_PRIMARY);
 	/* If already registered, it should return. */
-	rte_rwlock_read_lock(&sh->mr.rwlock);
-	lkey = mr_lookup_dev(sh, &entry, addr);
-	rte_rwlock_read_unlock(&sh->mr.rwlock);
+	rte_rwlock_read_lock(&sh->share_cache.rwlock);
+	lkey = mlx5_mr_lookup_cache(&sh->share_cache, &entry, addr);
+	rte_rwlock_read_unlock(&sh->share_cache.rwlock);
 	if (lkey != UINT32_MAX)
 		return;
 	DRV_LOG(DEBUG, "port %u register MR for chunk #%d of mempool (%s)",
 		dev->data->port_id, mem_idx, mp->name);
-	mr = mlx5_create_mr_ext(dev, addr, len, mp->socket_id);
+	mr = mlx5_create_mr_ext(sh->pd, addr, len, mp->socket_id);
 	if (!mr) {
 		DRV_LOG(WARNING,
 			"port %u unable to allocate a new MR of"
@@ -1288,13 +285,14 @@ mlx5_mr_update_ext_mp_cb(struct rte_mempool *mp, void *opaque,
 		data->ret = -1;
 		return;
 	}
-	rte_rwlock_write_lock(&sh->mr.rwlock);
-	LIST_INSERT_HEAD(&sh->mr.mr_list, mr, mr);
+	rte_rwlock_write_lock(&sh->share_cache.rwlock);
+	LIST_INSERT_HEAD(&sh->share_cache.mr_list, mr, mr);
 	/* Insert to the global cache table. */
-	mr_insert_dev_cache(sh, mr);
-	rte_rwlock_write_unlock(&sh->mr.rwlock);
+	mlx5_mr_insert_cache(&sh->share_cache, mr);
+	rte_rwlock_write_unlock(&sh->share_cache.rwlock);
 	/* Insert to the local cache table */
-	mlx5_mr_addr2mr_bh(dev, mr_ctrl, addr);
+	mlx5_mr_addr2mr_bh(sh->pd, &priv->mp_id, &sh->share_cache,
+			   mr_ctrl, addr, priv->config.mr_ext_memseg_en);
 }
 
 /**
@@ -1351,19 +349,19 @@ mlx5_dma_map(struct rte_pci_device *pdev, void *addr,
 		return -1;
 	}
 	priv = dev->data->dev_private;
-	mr = mlx5_create_mr_ext(dev, (uintptr_t)addr, len, SOCKET_ID_ANY);
+	sh = priv->sh;
+	mr = mlx5_create_mr_ext(sh->pd, (uintptr_t)addr, len, SOCKET_ID_ANY);
 	if (!mr) {
 		DRV_LOG(WARNING,
 			"port %u unable to dma map", dev->data->port_id);
 		rte_errno = EINVAL;
 		return -1;
 	}
-	sh = priv->sh;
-	rte_rwlock_write_lock(&sh->mr.rwlock);
-	LIST_INSERT_HEAD(&sh->mr.mr_list, mr, mr);
+	rte_rwlock_write_lock(&sh->share_cache.rwlock);
+	LIST_INSERT_HEAD(&sh->share_cache.mr_list, mr, mr);
 	/* Insert to the global cache table. */
-	mr_insert_dev_cache(sh, mr);
-	rte_rwlock_write_unlock(&sh->mr.rwlock);
+	mlx5_mr_insert_cache(&sh->share_cache, mr);
+	rte_rwlock_write_unlock(&sh->share_cache.rwlock);
 	return 0;
 }
 
@@ -1390,7 +388,7 @@ mlx5_dma_unmap(struct rte_pci_device *pdev, void *addr,
 	struct mlx5_priv *priv;
 	struct mlx5_ibv_shared *sh;
 	struct mlx5_mr *mr;
-	struct mlx5_mr_cache entry;
+	struct mr_cache_entry entry;
 
 	dev = pci_dev_to_eth_dev(pdev);
 	if (!dev) {
@@ -1401,10 +399,10 @@ mlx5_dma_unmap(struct rte_pci_device *pdev, void *addr,
 	}
 	priv = dev->data->dev_private;
 	sh = priv->sh;
-	rte_rwlock_read_lock(&sh->mr.rwlock);
-	mr = mr_lookup_dev_list(sh, &entry, (uintptr_t)addr);
+	rte_rwlock_read_lock(&sh->share_cache.rwlock);
+	mr = mlx5_mr_lookup_list(&sh->share_cache, &entry, (uintptr_t)addr);
 	if (!mr) {
-		rte_rwlock_read_unlock(&sh->mr.rwlock);
+		rte_rwlock_read_unlock(&sh->share_cache.rwlock);
 		DRV_LOG(WARNING, "address 0x%" PRIxPTR " wasn't registered "
 				 "to PCI device %p", (uintptr_t)addr,
 				 (void *)pdev);
@@ -1412,10 +410,10 @@ mlx5_dma_unmap(struct rte_pci_device *pdev, void *addr,
 		return -1;
 	}
 	LIST_REMOVE(mr, mr);
-	LIST_INSERT_HEAD(&sh->mr.mr_free_list, mr, mr);
+	LIST_INSERT_HEAD(&sh->share_cache.mr_free_list, mr, mr);
 	DEBUG("port %u remove MR(%p) from list", dev->data->port_id,
 	      (void *)mr);
-	mr_rebuild_dev_cache(sh);
+	mlx5_mr_rebuild_cache(&sh->share_cache);
 	/*
 	 * Flush local caches by propagating invalidation across cores.
 	 * rte_smp_wmb() is enough to synchronize this event. If one of
@@ -1425,10 +423,11 @@ mlx5_dma_unmap(struct rte_pci_device *pdev, void *addr,
 	 * generation below) will be guaranteed to be seen by other core
 	 * before the core sees the newly allocated memory.
 	 */
-	++sh->mr.dev_gen;
-	DEBUG("broadcasting local cache flush, gen=%d",	sh->mr.dev_gen);
+	++sh->share_cache.dev_gen;
+	DEBUG("broadcasting local cache flush, gen=%d",
+	      sh->share_cache.dev_gen);
 	rte_smp_wmb();
-	rte_rwlock_read_unlock(&sh->mr.rwlock);
+	rte_rwlock_read_unlock(&sh->share_cache.rwlock);
 	return 0;
 }
 
@@ -1503,14 +502,19 @@ mlx5_mr_update_mp_cb(struct rte_mempool *mp __rte_unused, void *opaque,
 		     unsigned mem_idx __rte_unused)
 {
 	struct mr_update_mp_data *data = opaque;
+	struct rte_eth_dev *dev = data->dev;
+	struct mlx5_priv *priv = dev->data->dev_private;
+
 	uint32_t lkey;
 
 	/* Stop iteration if failed in the previous walk. */
 	if (data->ret < 0)
 		return;
 	/* Register address of the chunk and update local caches. */
-	lkey = mlx5_mr_addr2mr_bh(data->dev, data->mr_ctrl,
-				  (uintptr_t)memhdr->addr);
+	lkey = mlx5_mr_addr2mr_bh(priv->sh->pd, &priv->mp_id,
+				  &priv->sh->share_cache, data->mr_ctrl,
+				  (uintptr_t)memhdr->addr,
+				  priv->config.mr_ext_memseg_en);
 	if (lkey == UINT32_MAX)
 		data->ret = -1;
 }
@@ -1545,76 +549,3 @@ mlx5_mr_update_mp(struct rte_eth_dev *dev, struct mlx5_mr_ctrl *mr_ctrl,
 	}
 	return data.ret;
 }
-
-/**
- * Dump all the created MRs and the global cache entries.
- *
- * @param sh
- *   Pointer to Ethernet device shared context.
- */
-void
-mlx5_mr_dump_dev(struct mlx5_ibv_shared *sh __rte_unused)
-{
-#ifdef RTE_LIBRTE_MLX5_DEBUG
-	struct mlx5_mr *mr;
-	int mr_n = 0;
-	int chunk_n = 0;
-
-	rte_rwlock_read_lock(&sh->mr.rwlock);
-	/* Iterate all the existing MRs. */
-	LIST_FOREACH(mr, &sh->mr.mr_list, mr) {
-		unsigned int n;
-
-		DEBUG("device %s MR[%u], LKey = 0x%x, ms_n = %u, ms_bmp_n = %u",
-		      sh->ibdev_name, mr_n++,
-		      rte_cpu_to_be_32(mr->ibv_mr->lkey),
-		      mr->ms_n, mr->ms_bmp_n);
-		if (mr->ms_n == 0)
-			continue;
-		for (n = 0; n < mr->ms_bmp_n; ) {
-			struct mlx5_mr_cache ret = { 0, };
-
-			n = mr_find_next_chunk(mr, &ret, n);
-			if (!ret.end)
-				break;
-			DEBUG("  chunk[%u], [0x%" PRIxPTR ", 0x%" PRIxPTR ")",
-			      chunk_n++, ret.start, ret.end);
-		}
-	}
-	DEBUG("device %s dumping global cache", sh->ibdev_name);
-	mlx5_mr_btree_dump(&sh->mr.cache);
-	rte_rwlock_read_unlock(&sh->mr.rwlock);
-#endif
-}
-
-/**
- * Release all the created MRs and resources for shared device context.
- * list.
- *
- * @param sh
- *   Pointer to Ethernet device shared context.
- */
-void
-mlx5_mr_release(struct mlx5_ibv_shared *sh)
-{
-	struct mlx5_mr *mr_next;
-
-	if (rte_log_can_log(mlx5_logtype, RTE_LOG_DEBUG))
-		mlx5_mr_dump_dev(sh);
-	rte_rwlock_write_lock(&sh->mr.rwlock);
-	/* Detach from MR list and move to free list. */
-	mr_next = LIST_FIRST(&sh->mr.mr_list);
-	while (mr_next != NULL) {
-		struct mlx5_mr *mr = mr_next;
-
-		mr_next = LIST_NEXT(mr, mr);
-		LIST_REMOVE(mr, mr);
-		LIST_INSERT_HEAD(&sh->mr.mr_free_list, mr, mr);
-	}
-	LIST_INIT(&sh->mr.mr_list);
-	/* Free global cache. */
-	mlx5_mr_btree_free(&sh->mr.cache);
-	rte_rwlock_write_unlock(&sh->mr.rwlock);
-	/* Free all remaining MRs. */
-	mlx5_mr_garbage_collect(sh);
-}
diff --git a/drivers/net/mlx5/mlx5_mr.h b/drivers/net/mlx5/mlx5_mr.h
index 48264c8294..0c5877b3d6 100644
--- a/drivers/net/mlx5/mlx5_mr.h
+++ b/drivers/net/mlx5/mlx5_mr.h
@@ -24,99 +24,16 @@
 #include <rte_ethdev.h>
 #include <rte_rwlock.h>
 #include <rte_bitmap.h>
+#include <rte_memory.h>
 
-/* Memory Region object. */
-struct mlx5_mr {
-	LIST_ENTRY(mlx5_mr) mr; /**< Pointer to the prev/next entry. */
-	struct ibv_mr *ibv_mr; /* Verbs Memory Region. */
-	const struct rte_memseg_list *msl;
-	int ms_base_idx; /* Start index of msl->memseg_arr[]. */
-	int ms_n; /* Number of memsegs in use. */
-	uint32_t ms_bmp_n; /* Number of bits in memsegs bit-mask. */
-	struct rte_bitmap *ms_bmp; /* Bit-mask of memsegs belonged to MR. */
-};
-
-/* Cache entry for Memory Region. */
-struct mlx5_mr_cache {
-	uintptr_t start; /* Start address of MR. */
-	uintptr_t end; /* End address of MR. */
-	uint32_t lkey; /* rte_cpu_to_be_32(ibv_mr->lkey). */
-} __rte_packed;
-
-/* MR Cache table for Binary search. */
-struct mlx5_mr_btree {
-	uint16_t len; /* Number of entries. */
-	uint16_t size; /* Total number of entries. */
-	int overflow; /* Mark failure of table expansion. */
-	struct mlx5_mr_cache (*table)[];
-} __rte_packed;
-
-/* Per-queue MR control descriptor. */
-struct mlx5_mr_ctrl {
-	uint32_t *dev_gen_ptr; /* Generation number of device to poll. */
-	uint32_t cur_gen; /* Generation number saved to flush caches. */
-	uint16_t mru; /* Index of last hit entry in top-half cache. */
-	uint16_t head; /* Index of the oldest entry in top-half cache. */
-	struct mlx5_mr_cache cache[MLX5_MR_CACHE_N]; /* Cache for top-half. */
-	struct mlx5_mr_btree cache_bh; /* Cache for bottom-half. */
-} __rte_packed;
-
-struct mlx5_ibv_shared;
-extern struct mlx5_dev_list  mlx5_mem_event_cb_list;
-extern rte_rwlock_t mlx5_mem_event_rwlock;
+#include <mlx5_common_mr.h>
 
 /* First entry must be NULL for comparison. */
 #define mlx5_mr_btree_len(bt) ((bt)->len - 1)
 
-int mlx5_mr_btree_init(struct mlx5_mr_btree *bt, int n, int socket);
-void mlx5_mr_btree_free(struct mlx5_mr_btree *bt);
-uint32_t mlx5_mr_create_primary(struct rte_eth_dev *dev,
-				struct mlx5_mr_cache *entry, uintptr_t addr);
 void mlx5_mr_mem_event_cb(enum rte_mem_event event_type, const void *addr,
 			  size_t len, void *arg);
 int mlx5_mr_update_mp(struct rte_eth_dev *dev, struct mlx5_mr_ctrl *mr_ctrl,
 		      struct rte_mempool *mp);
-void mlx5_mr_release(struct mlx5_ibv_shared *sh);
-
-/* Debug purpose functions. */
-void mlx5_mr_btree_dump(struct mlx5_mr_btree *bt);
-void mlx5_mr_dump_dev(struct mlx5_ibv_shared *sh);
-
-/**
- * Look up LKey from given lookup table by linear search. Firstly look up the
- * last-hit entry. If miss, the entire array is searched. If found, update the
- * last-hit index and return LKey.
- *
- * @param lkp_tbl
- *   Pointer to lookup table.
- * @param[in,out] cached_idx
- *   Pointer to last-hit index.
- * @param n
- *   Size of lookup table.
- * @param addr
- *   Search key.
- *
- * @return
- *   Searched LKey on success, UINT32_MAX on no match.
- */
-static __rte_always_inline uint32_t
-mlx5_mr_lookup_cache(struct mlx5_mr_cache *lkp_tbl, uint16_t *cached_idx,
-		     uint16_t n, uintptr_t addr)
-{
-	uint16_t idx;
-
-	if (likely(addr >= lkp_tbl[*cached_idx].start &&
-		   addr < lkp_tbl[*cached_idx].end))
-		return lkp_tbl[*cached_idx].lkey;
-	for (idx = 0; idx < n && lkp_tbl[idx].start != 0; ++idx) {
-		if (addr >= lkp_tbl[idx].start &&
-		    addr < lkp_tbl[idx].end) {
-			/* Found. */
-			*cached_idx = idx;
-			return lkp_tbl[idx].lkey;
-		}
-	}
-	return UINT32_MAX;
-}
 
 #endif /* RTE_PMD_MLX5_MR_H_ */
diff --git a/drivers/net/mlx5/mlx5_rxtx.c b/drivers/net/mlx5/mlx5_rxtx.c
index fc7591c2b0..5f9b670442 100644
--- a/drivers/net/mlx5/mlx5_rxtx.c
+++ b/drivers/net/mlx5/mlx5_rxtx.c
@@ -33,6 +33,7 @@
 
 #include "mlx5_defs.h"
 #include "mlx5.h"
+#include "mlx5_mr.h"
 #include "mlx5_utils.h"
 #include "mlx5_rxtx.h"
 #include "mlx5_autoconf.h"
diff --git a/drivers/net/mlx5/mlx5_rxtx.h b/drivers/net/mlx5/mlx5_rxtx.h
index 939778aa55..84161ad6af 100644
--- a/drivers/net/mlx5/mlx5_rxtx.h
+++ b/drivers/net/mlx5/mlx5_rxtx.h
@@ -34,11 +34,11 @@
 #include <mlx5_glue.h>
 #include <mlx5_prm.h>
 #include <mlx5_common.h>
+#include <mlx5_common_mr.h>
 
 #include "mlx5_defs.h"
 #include "mlx5_utils.h"
 #include "mlx5.h"
-#include "mlx5_mr.h"
 #include "mlx5_autoconf.h"
 
 /* Support tunnel matching. */
@@ -598,8 +598,8 @@ mlx5_rx_addr2mr(struct mlx5_rxq_data *rxq, uintptr_t addr)
 	uint32_t lkey;
 
 	/* Linear search on MR cache array. */
-	lkey = mlx5_mr_lookup_cache(mr_ctrl->cache, &mr_ctrl->mru,
-				    MLX5_MR_CACHE_N, addr);
+	lkey = mlx5_mr_lookup_lkey(mr_ctrl->cache, &mr_ctrl->mru,
+				   MLX5_MR_CACHE_N, addr);
 	if (likely(lkey != UINT32_MAX))
 		return lkey;
 	/* Take slower bottom-half (Binary Search) on miss. */
@@ -630,8 +630,8 @@ mlx5_tx_mb2mr(struct mlx5_txq_data *txq, struct rte_mbuf *mb)
 	if (unlikely(*mr_ctrl->dev_gen_ptr != mr_ctrl->cur_gen))
 		mlx5_mr_flush_local_cache(mr_ctrl);
 	/* Linear search on MR cache array. */
-	lkey = mlx5_mr_lookup_cache(mr_ctrl->cache, &mr_ctrl->mru,
-				    MLX5_MR_CACHE_N, addr);
+	lkey = mlx5_mr_lookup_lkey(mr_ctrl->cache, &mr_ctrl->mru,
+				   MLX5_MR_CACHE_N, addr);
 	if (likely(lkey != UINT32_MAX))
 		return lkey;
 	/* Take slower bottom-half on miss. */
diff --git a/drivers/net/mlx5/mlx5_rxtx_vec.h b/drivers/net/mlx5/mlx5_rxtx_vec.h
index ea925156f0..6ddcbfb0ad 100644
--- a/drivers/net/mlx5/mlx5_rxtx_vec.h
+++ b/drivers/net/mlx5/mlx5_rxtx_vec.h
@@ -13,6 +13,8 @@
 
 #include "mlx5_autoconf.h"
 
+#include "mlx5_mr.h"
+
 /* HW checksum offload capabilities of vectorized Tx. */
 #define MLX5_VEC_TX_CKSUM_OFFLOAD_CAP \
 	(DEV_TX_OFFLOAD_IPV4_CKSUM | \
diff --git a/drivers/net/mlx5/mlx5_trigger.c b/drivers/net/mlx5/mlx5_trigger.c
index 438b705952..759670408b 100644
--- a/drivers/net/mlx5/mlx5_trigger.c
+++ b/drivers/net/mlx5/mlx5_trigger.c
@@ -11,6 +11,7 @@
 #include <rte_alarm.h>
 
 #include "mlx5.h"
+#include "mlx5_mr.h"
 #include "mlx5_rxtx.h"
 #include "mlx5_utils.h"
 #include "rte_pmd_mlx5.h"
diff --git a/drivers/net/mlx5/mlx5_txq.c b/drivers/net/mlx5/mlx5_txq.c
index 0653f4cf30..29e5cabab6 100644
--- a/drivers/net/mlx5/mlx5_txq.c
+++ b/drivers/net/mlx5/mlx5_txq.c
@@ -30,6 +30,7 @@
 #include <mlx5_glue.h>
 #include <mlx5_devx_cmds.h>
 #include <mlx5_common.h>
+#include <mlx5_common_mr.h>
 
 #include "mlx5_defs.h"
 #include "mlx5_utils.h"
@@ -1289,7 +1290,7 @@ mlx5_txq_new(struct rte_eth_dev *dev, uint16_t idx, uint16_t desc,
 		goto error;
 	}
 	/* Save pointer of global generation number to check memory event. */
-	tmpl->txq.mr_ctrl.dev_gen_ptr = &priv->sh->mr.dev_gen;
+	tmpl->txq.mr_ctrl.dev_gen_ptr = &priv->sh->share_cache.dev_gen;
 	MLX5_ASSERT(desc > MLX5_TX_COMP_THRESH);
 	tmpl->txq.offloads = conf->offloads |
 			     dev->data->dev_conf.txmode.offloads;
-- 
2.16.6


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [dpdk-dev] [PATCH v3 0/4] refactor multi-process IPC and memory management codes to common driver
  2020-04-02 19:21 [dpdk-dev] [PATCH 0/4] refactor multi-process IPC and memory management codes to common driver Vu Pham
                   ` (4 preceding siblings ...)
  2020-04-07 16:48 ` [dpdk-dev] [PATCH v2 0/4] refactor multi-process IPC and memory management codes to common driver Vu Pham
@ 2020-04-07 17:00 ` Vu Pham
  2020-04-07 17:00   ` [dpdk-dev] [PATCH v3 1/4] common/mlx5: refactor multi-process IPC handling " Vu Pham
                     ` (3 more replies)
  2020-04-13 21:17 ` [dpdk-dev] [PATCH v4 0/2] refactor multi-process IPC and memory management codes to common driver Vu Pham
  6 siblings, 4 replies; 26+ messages in thread
From: Vu Pham @ 2020-04-07 17:00 UTC (permalink / raw)
  To: dev; +Cc: viacheslavo, orika, matan, rasland, Vu Pham

Current mlx5 net PMD and future mlx5 (regex, ...) PMDs that run on
and share the same HCAs need to use a common memory management
driver. The memory management code internally uses multi-process IPC
for the primary/secondary processes to register and synchronize
memory registrations (MRs). That is the main reason to move the
multi-process IPC APIs into the mlx5 common driver and to make them
the base commit of this series.
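
To illustrate the intended usage (a minimal sketch, not part of the
series itself; the wrapper function name is illustrative), a secondary
process that misses in its local MR cache would ask the primary to
register the range through the common IPC layer:

#include <rte_string_fns.h>
#include <mlx5_common_mp.h>

/* Sketch: ask the primary process to register 'addr' for 'port_id'. */
static int
request_mr_from_primary(uint16_t port_id, void *addr)
{
	struct mlx5_mp_id mp_id;

	/* The {name, port_id} tuple replaces rte_eth_dev as the IPC key. */
	strlcpy(mp_id.name, "net_mlx5_mp", RTE_MP_MAX_NAME_LEN);
	mp_id.port_id = port_id;
	/* Returns 0 on success, a negative errno value otherwise. */
	return mlx5_mp_req_mr_create(&mp_id, (uintptr_t)addr);
}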

Vu Pham (4):
  common/mlx5: refactor multi-process IPC handling codes to common
    driver
  net/mlx5: modify net PMD to use common multi-process APIs
  common/mlx5: refactor memory management codes
  net/mlx5: modify net PMD to use common MR driver

 drivers/common/mlx5/Makefile                    |    4 +-
 drivers/common/mlx5/meson.build                 |    2 +
 drivers/common/mlx5/mlx5_common_mp.c            |  188 ++++
 drivers/common/mlx5/mlx5_common_mp.h            |   98 ++
 drivers/common/mlx5/mlx5_common_mr.c            | 1108 +++++++++++++++++++++
 drivers/common/mlx5/mlx5_common_mr.h            |  160 ++++
 drivers/common/mlx5/rte_common_mlx5_version.map |   27 +
 drivers/net/mlx5/mlx5.c                         |   19 +-
 drivers/net/mlx5/mlx5.h                         |   55 +-
 drivers/net/mlx5/mlx5_mp.c                      |  242 +----
 drivers/net/mlx5/mlx5_mr.c                      | 1169 +----------------------
 drivers/net/mlx5/mlx5_mr.h                      |   87 +-
 drivers/net/mlx5/mlx5_rxtx.c                    |    4 +-
 drivers/net/mlx5/mlx5_rxtx.h                    |   10 +-
 drivers/net/mlx5/mlx5_rxtx_vec.h                |    2 +
 drivers/net/mlx5/mlx5_trigger.c                 |    1 +
 drivers/net/mlx5/mlx5_txq.c                     |    3 +-
 17 files changed, 1692 insertions(+), 1487 deletions(-)
 create mode 100644 drivers/common/mlx5/mlx5_common_mp.c
 create mode 100644 drivers/common/mlx5/mlx5_common_mp.h
 create mode 100644 drivers/common/mlx5/mlx5_common_mr.c
 create mode 100644 drivers/common/mlx5/mlx5_common_mr.h

-- 
2.16.6


^ permalink raw reply	[flat|nested] 26+ messages in thread

* [dpdk-dev] [PATCH v3 1/4] common/mlx5: refactor multi-process IPC handling codes to common driver
  2020-04-07 17:00 ` [dpdk-dev] [PATCH v3 0/4] refactor multi-process IPC and memory management codes to common driver Vu Pham
@ 2020-04-07 17:00   ` Vu Pham
  2020-04-08  9:05     ` Slava Ovsiienko
  2020-04-07 17:00   ` [dpdk-dev] [PATCH v3 2/4] net/mlx5: modify net PMD to use common multi-process APIs Vu Pham
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 26+ messages in thread
From: Vu Pham @ 2020-04-07 17:00 UTC (permalink / raw)
  To: dev; +Cc: viacheslavo, orika, matan, rasland, Vu Pham

Refactor the common multi-process handling code from the net PMD into
the common driver. Use the tuple mp_id{name, port_id} as the standard
input parameter for all multi-process IPC APIs instead of
rte_eth_dev.
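
As a minimal sketch of what this parameterization enables (the driver
name and handler below are hypothetical, not part of this patch),
another mlx5-based PMD could hook its own IPC namespace into the same
helpers:

#include <rte_eal.h>
#include <mlx5_common_mp.h>

/* Hypothetical IPC key for another mlx5-based PMD. */
#define MY_MLX5_MP_NAME "regex_mlx5_mp"

static int
my_primary_handle(const struct rte_mp_msg *mp_msg, const void *peer)
{
	/* A real handler dispatches on mlx5_mp_param->type and replies
	 * to the peer with rte_mp_reply(); elided in this sketch.
	 */
	(void)mp_msg;
	(void)peer;
	return 0;
}

static int
my_mp_init(void)
{
	if (rte_eal_process_type() == RTE_PROC_PRIMARY)
		return mlx5_mp_init_primary(MY_MLX5_MP_NAME,
					    my_primary_handle);
	return 0;
}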

Signed-off-by: Vu Pham <vuhuong@mellanox.com>
---
 drivers/common/mlx5/mlx5_common_mp.c            | 188 ++++++++++++++++++++++++
 drivers/common/mlx5/mlx5_common_mp.h            |  98 ++++++++++++
 drivers/common/mlx5/rte_common_mlx5_version.map |  13 ++
 3 files changed, 299 insertions(+)
 create mode 100644 drivers/common/mlx5/mlx5_common_mp.c
 create mode 100644 drivers/common/mlx5/mlx5_common_mp.h

diff --git a/drivers/common/mlx5/mlx5_common_mp.c b/drivers/common/mlx5/mlx5_common_mp.c
new file mode 100644
index 0000000000..da55143bc1
--- /dev/null
+++ b/drivers/common/mlx5/mlx5_common_mp.c
@@ -0,0 +1,188 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright 2019 6WIND S.A.
+ * Copyright 2019 Mellanox Technologies, Ltd
+ */
+
+#include <stdio.h>
+#include <time.h>
+
+#include <rte_eal.h>
+#include <rte_errno.h>
+
+#include "mlx5_common_mp.h"
+#include "mlx5_common_utils.h"
+
+/**
+ * Request Memory Region creation to the primary process.
+ *
+ * @param[in] mp_id
+ *   ID of the MP process.
+ * @param addr
+ *   Target virtual address to register.
+ *
+ * @return
+ *   0 on success, a negative errno value otherwise and rte_errno is set.
+ */
+int
+mlx5_mp_req_mr_create(struct mlx5_mp_id *mp_id, uintptr_t addr)
+{
+	struct rte_mp_msg mp_req;
+	struct rte_mp_msg *mp_res;
+	struct rte_mp_reply mp_rep;
+	struct mlx5_mp_param *req = (struct mlx5_mp_param *)mp_req.param;
+	struct mlx5_mp_param *res;
+	struct timespec ts = {.tv_sec = MLX5_MP_REQ_TIMEOUT_SEC, .tv_nsec = 0};
+	int ret;
+
+	MLX5_ASSERT(rte_eal_process_type() == RTE_PROC_SECONDARY);
+	mp_init_msg(mp_id, &mp_req, MLX5_MP_REQ_CREATE_MR);
+	req->args.addr = addr;
+	ret = rte_mp_request_sync(&mp_req, &mp_rep, &ts);
+	if (ret) {
+		DRV_LOG(ERR, "port %u request to primary process failed",
+			mp_id->port_id);
+		return -rte_errno;
+	}
+	MLX5_ASSERT(mp_rep.nb_received == 1);
+	mp_res = &mp_rep.msgs[0];
+	res = (struct mlx5_mp_param *)mp_res->param;
+	ret = res->result;
+	if (ret)
+		rte_errno = -ret;
+	free(mp_rep.msgs);
+	return ret;
+}
+
+/**
+ * Request Verbs queue state modification to the primary process.
+ *
+ * @param[in] mp_id
+ *   ID of the MP process.
+ * @param sm
+ *   State modify parameters.
+ *
+ * @return
+ *   0 on success, a negative errno value otherwise and rte_errno is set.
+ */
+int
+mlx5_mp_req_queue_state_modify(struct mlx5_mp_id *mp_id,
+			       struct mlx5_mp_arg_queue_state_modify *sm)
+{
+	struct rte_mp_msg mp_req;
+	struct rte_mp_msg *mp_res;
+	struct rte_mp_reply mp_rep;
+	struct mlx5_mp_param *req = (struct mlx5_mp_param *)mp_req.param;
+	struct mlx5_mp_param *res;
+	struct timespec ts = {.tv_sec = MLX5_MP_REQ_TIMEOUT_SEC, .tv_nsec = 0};
+	int ret;
+
+	MLX5_ASSERT(rte_eal_process_type() == RTE_PROC_SECONDARY);
+	mp_init_msg(mp_id, &mp_req, MLX5_MP_REQ_QUEUE_STATE_MODIFY);
+	req->args.state_modify = *sm;
+	ret = rte_mp_request_sync(&mp_req, &mp_rep, &ts);
+	if (ret) {
+		DRV_LOG(ERR, "port %u request to primary process failed",
+			mp_id->port_id);
+		return -rte_errno;
+	}
+	MLX5_ASSERT(mp_rep.nb_received == 1);
+	mp_res = &mp_rep.msgs[0];
+	res = (struct mlx5_mp_param *)mp_res->param;
+	ret = res->result;
+	free(mp_rep.msgs);
+	return ret;
+}
+
+/**
+ * Request Verbs command file descriptor for mmap to the primary process.
+ *
+ * @param[in] mp_id
+ *   ID of the MP process.
+ *
+ * @return
+ *   fd on success, a negative errno value otherwise and rte_errno is set.
+ */
+int
+mlx5_mp_req_verbs_cmd_fd(struct mlx5_mp_id *mp_id)
+{
+	struct rte_mp_msg mp_req;
+	struct rte_mp_msg *mp_res;
+	struct rte_mp_reply mp_rep;
+	struct mlx5_mp_param *res;
+	struct timespec ts = {.tv_sec = MLX5_MP_REQ_TIMEOUT_SEC, .tv_nsec = 0};
+	int ret;
+
+	MLX5_ASSERT(rte_eal_process_type() == RTE_PROC_SECONDARY);
+	mp_init_msg(mp_id, &mp_req, MLX5_MP_REQ_VERBS_CMD_FD);
+	ret = rte_mp_request_sync(&mp_req, &mp_rep, &ts);
+	if (ret) {
+		DRV_LOG(ERR, "port %u request to primary process failed",
+			mp_id->port_id);
+		return -rte_errno;
+	}
+	MLX5_ASSERT(mp_rep.nb_received == 1);
+	mp_res = &mp_rep.msgs[0];
+	res = (struct mlx5_mp_param *)mp_res->param;
+	if (res->result) {
+		rte_errno = -res->result;
+		DRV_LOG(ERR,
+			"port %u failed to get command FD from primary process",
+			mp_id->port_id);
+		ret = -rte_errno;
+		goto exit;
+	}
+	MLX5_ASSERT(mp_res->num_fds == 1);
+	ret = mp_res->fds[0];
+	DRV_LOG(DEBUG, "port %u command FD from primary is %d",
+		mp_id->port_id, ret);
+exit:
+	free(mp_rep.msgs);
+	return ret;
+}
+
+/**
+ * Initialize by primary process.
+ */
+int
+mlx5_mp_init_primary(const char *name, const rte_mp_t primary_action)
+{
+	int ret;
+
+	MLX5_ASSERT(rte_eal_process_type() == RTE_PROC_PRIMARY);
+
+	/* primary is allowed to not support IPC */
+	ret = rte_mp_action_register(name, primary_action);
+	if (ret && rte_errno != ENOTSUP)
+		return -1;
+	return 0;
+}
+
+/**
+ * Un-initialize by primary process.
+ */
+void
+mlx5_mp_uninit_primary(const char *name)
+{
+	MLX5_ASSERT(rte_eal_process_type() == RTE_PROC_PRIMARY);
+	rte_mp_action_unregister(name);
+}
+
+/**
+ * Initialize by secondary process.
+ */
+int
+mlx5_mp_init_secondary(const char *name, const rte_mp_t secondary_action)
+{
+	MLX5_ASSERT(rte_eal_process_type() == RTE_PROC_SECONDARY);
+	return rte_mp_action_register(name, secondary_action);
+}
+
+/**
+ * Un-initialize by secondary process.
+ */
+void
+mlx5_mp_uninit_secondary(const char *name)
+{
+	MLX5_ASSERT(rte_eal_process_type() == RTE_PROC_SECONDARY);
+	rte_mp_action_unregister(name);
+}
diff --git a/drivers/common/mlx5/mlx5_common_mp.h b/drivers/common/mlx5/mlx5_common_mp.h
new file mode 100644
index 0000000000..7aab77acb2
--- /dev/null
+++ b/drivers/common/mlx5/mlx5_common_mp.h
@@ -0,0 +1,98 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright 2018 6WIND S.A.
+ * Copyright 2018 Mellanox Technologies, Ltd
+ */
+
+#ifndef RTE_PMD_MLX5_COMMON_MP_H_
+#define RTE_PMD_MLX5_COMMON_MP_H_
+
+/* Verbs header. */
+/* ISO C doesn't support unnamed structs/unions, disabling -pedantic. */
+#ifdef PEDANTIC
+#pragma GCC diagnostic ignored "-Wpedantic"
+#endif
+#include <infiniband/verbs.h>
+#ifdef PEDANTIC
+#pragma GCC diagnostic error "-Wpedantic"
+#endif
+
+#include <rte_eal.h>
+#include <rte_string_fns.h>
+
+/* Request types for IPC. */
+enum mlx5_mp_req_type {
+	MLX5_MP_REQ_VERBS_CMD_FD = 1,
+	MLX5_MP_REQ_CREATE_MR,
+	MLX5_MP_REQ_START_RXTX,
+	MLX5_MP_REQ_STOP_RXTX,
+	MLX5_MP_REQ_QUEUE_STATE_MODIFY,
+};
+
+struct mlx5_mp_arg_queue_state_modify {
+	uint8_t is_wq; /* Set if WQ. */
+	uint16_t queue_id; /* DPDK queue ID. */
+	enum ibv_wq_state state; /* WQ requested state. */
+};
+
+/* Parameters for IPC. */
+struct mlx5_mp_param {
+	enum mlx5_mp_req_type type;
+	int port_id;
+	int result;
+	RTE_STD_C11
+	union {
+		uintptr_t addr; /* MLX5_MP_REQ_CREATE_MR */
+		struct mlx5_mp_arg_queue_state_modify state_modify;
+		/* MLX5_MP_REQ_QUEUE_STATE_MODIFY */
+	} args;
+};
+
+/* Identifier of an MP process. */
+struct mlx5_mp_id {
+	char name[RTE_MP_MAX_NAME_LEN];
+	uint16_t port_id;
+};
+
+/** Request timeout for IPC. */
+#define MLX5_MP_REQ_TIMEOUT_SEC 5
+
+/**
+ * Initialize IPC message.
+ *
+ * @param[in] mp_id
+ *   Pointer to the MP process ID (name and port ID).
+ * @param[out] msg
+ *   Pointer to message to fill in.
+ * @param[in] type
+ *   Message type.
+ */
+static inline void
+mp_init_msg(struct mlx5_mp_id *mp_id, struct rte_mp_msg *msg,
+	    enum mlx5_mp_req_type type)
+{
+	struct mlx5_mp_param *param = (struct mlx5_mp_param *)msg->param;
+
+	memset(msg, 0, sizeof(*msg));
+	strlcpy(msg->name, mp_id->name, sizeof(msg->name));
+	msg->len_param = sizeof(*param);
+	param->type = type;
+	param->port_id = mp_id->port_id;
+}
+
+__rte_experimental
+int mlx5_mp_init_primary(const char *name, const rte_mp_t primary_action);
+__rte_experimental
+void mlx5_mp_uninit_primary(const char *name);
+__rte_experimental
+int mlx5_mp_init_secondary(const char *name, const rte_mp_t secondary_action);
+__rte_experimental
+void mlx5_mp_uninit_secondary(const char *name);
+__rte_experimental
+int mlx5_mp_req_mr_create(struct mlx5_mp_id *mp_id, uintptr_t addr);
+__rte_experimental
+int mlx5_mp_req_queue_state_modify(struct mlx5_mp_id *mp_id,
+				   struct mlx5_mp_arg_queue_state_modify *sm);
+__rte_experimental
+int mlx5_mp_req_verbs_cmd_fd(struct mlx5_mp_id *mp_id);
+
+#endif /* RTE_PMD_MLX5_COMMON_MP_H_ */
diff --git a/drivers/common/mlx5/rte_common_mlx5_version.map b/drivers/common/mlx5/rte_common_mlx5_version.map
index aede2a0a51..265703d1c9 100644
--- a/drivers/common/mlx5/rte_common_mlx5_version.map
+++ b/drivers/common/mlx5/rte_common_mlx5_version.map
@@ -48,4 +48,17 @@ DPDK_20.0.1 {
 	mlx5_nl_vlan_vmwa_delete;
 
 	mlx5_translate_port_name;
+
+};
+
+EXPERIMENTAL {
+	global:
+
+	mlx5_mp_init_primary;
+	mlx5_mp_uninit_primary;
+	mlx5_mp_init_secondary;
+	mlx5_mp_uninit_secondary;
+	mlx5_mp_req_mr_create;
+	mlx5_mp_req_queue_state_modify;
+	mlx5_mp_req_verbs_cmd_fd;
 };
-- 
2.16.6


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [dpdk-dev] [PATCH v3 2/4] net/mlx5: modify net PMD to use common multi-process APIs
  2020-04-07 17:00 ` [dpdk-dev] [PATCH v3 0/4] refactor multi-process IPC and memory management codes to common driver Vu Pham
  2020-04-07 17:00   ` [dpdk-dev] [PATCH v3 1/4] common/mlx5: refactor multi-process IPC handling " Vu Pham
@ 2020-04-07 17:00   ` Vu Pham
  2020-04-08  9:05     ` Slava Ovsiienko
  2020-04-07 17:00   ` [dpdk-dev] [PATCH v3 3/4] common/mlx5: refactor memory management codes Vu Pham
  2020-04-07 17:00   ` [dpdk-dev] [PATCH v3 4/4] net/mlx5: modify net PMD to use common MR driver Vu Pham
  3 siblings, 1 reply; 26+ messages in thread
From: Vu Pham @ 2020-04-07 17:00 UTC (permalink / raw)
  To: dev; +Cc: viacheslavo, orika, matan, rasland, Vu Pham

Modify the net PMD to use the multi-process APIs from the common
driver.
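
Condensed from the mlx5_init_once() hunk below (the wrapper function
here is illustrative only), the per-process wiring now amounts to
handing the PMD's IPC name and handlers to the common layer:

#include <mlx5_common_mp.h>
#include "mlx5.h"

/* Illustrative wrapper around the calls made in mlx5_init_once(). */
static int
mp_init_for_process(void)
{
	switch (rte_eal_process_type()) {
	case RTE_PROC_PRIMARY:
		return mlx5_mp_init_primary(MLX5_MP_NAME,
					    mlx5_mp_primary_handle);
	case RTE_PROC_SECONDARY:
		return mlx5_mp_init_secondary(MLX5_MP_NAME,
					      mlx5_mp_secondary_handle);
	default:
		return 0;
	}
}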

Signed-off-by: Vu Pham <vuhuong@mellanox.com>
---
 drivers/common/mlx5/Makefile    |   3 +-
 drivers/common/mlx5/meson.build |   1 +
 drivers/net/mlx5/mlx5.c         |  15 ++-
 drivers/net/mlx5/mlx5.h         |  43 +-------
 drivers/net/mlx5/mlx5_mp.c      | 234 +++-------------------------------------
 drivers/net/mlx5/mlx5_mr.c      |   2 +-
 drivers/net/mlx5/mlx5_rxtx.c    |   3 +-
 7 files changed, 37 insertions(+), 264 deletions(-)

diff --git a/drivers/common/mlx5/Makefile b/drivers/common/mlx5/Makefile
index f32933d592..2a88492731 100644
--- a/drivers/common/mlx5/Makefile
+++ b/drivers/common/mlx5/Makefile
@@ -17,6 +17,7 @@ endif
 SRCS-y += mlx5_devx_cmds.c
 SRCS-y += mlx5_common.c
 SRCS-y += mlx5_nl.c
+SRCS-y += mlx5_common_mp.c
 ifeq ($(CONFIG_RTE_IBVERBS_LINK_DLOPEN),y)
 INSTALL-y-lib += $(LIB_GLUE)
 endif
@@ -46,7 +47,7 @@ endif
 LDLIBS += -lrte_eal -lrte_pci -lrte_kvargs -lrte_net
 
 # A few warnings cannot be avoided in external headers.
-CFLAGS += -Wno-error=cast-qual -UPEDANTIC
+CFLAGS += -Wno-error=cast-qual -UPEDANTIC -DALLOW_EXPERIMENTAL_API
 
 EXPORT_MAP := rte_common_mlx5_version.map
 
diff --git a/drivers/common/mlx5/meson.build b/drivers/common/mlx5/meson.build
index f671710714..83671861c9 100644
--- a/drivers/common/mlx5/meson.build
+++ b/drivers/common/mlx5/meson.build
@@ -55,6 +55,7 @@ sources = files(
 	'mlx5_devx_cmds.c',
 	'mlx5_common.c',
 	'mlx5_nl.c',
+	'mlx5_common_mp.c',
 )
 if not dlopen_ibverbs
 	sources += files('mlx5_glue.c')
diff --git a/drivers/net/mlx5/mlx5.c b/drivers/net/mlx5/mlx5.c
index 6a11b141da..9eac8011f3 100644
--- a/drivers/net/mlx5/mlx5.c
+++ b/drivers/net/mlx5/mlx5.c
@@ -38,6 +38,7 @@
 #include <mlx5_glue.h>
 #include <mlx5_devx_cmds.h>
 #include <mlx5_common.h>
+#include <mlx5_common_mp.h>
 
 #include "mlx5_defs.h"
 #include "mlx5.h"
@@ -1714,7 +1715,8 @@ mlx5_init_once(void)
 		rte_rwlock_init(&sd->mem_event_rwlock);
 		rte_mem_event_callback_register("MLX5_MEM_EVENT_CB",
 						mlx5_mr_mem_event_cb, NULL);
-		ret = mlx5_mp_init_primary();
+		ret = mlx5_mp_init_primary(MLX5_MP_NAME,
+					   mlx5_mp_primary_handle);
 		if (ret)
 			goto out;
 		sd->init_done = true;
@@ -1722,7 +1724,8 @@ mlx5_init_once(void)
 	case RTE_PROC_SECONDARY:
 		if (ld->init_done)
 			break;
-		ret = mlx5_mp_init_secondary();
+		ret = mlx5_mp_init_secondary(MLX5_MP_NAME,
+					     mlx5_mp_secondary_handle);
 		if (ret)
 			goto out;
 		++sd->secondary_cnt;
@@ -2197,6 +2200,8 @@ mlx5_dev_spawn(struct rte_device *dpdk_dev,
 	}
 	DRV_LOG(DEBUG, "naming Ethernet device \"%s\"", name);
 	if (rte_eal_process_type() == RTE_PROC_SECONDARY) {
+		struct mlx5_mp_id mp_id;
+
 		eth_dev = rte_eth_dev_attach_secondary(name);
 		if (eth_dev == NULL) {
 			DRV_LOG(ERR, "can not attach rte ethdev");
@@ -2208,8 +2213,10 @@ mlx5_dev_spawn(struct rte_device *dpdk_dev,
 		err = mlx5_proc_priv_init(eth_dev);
 		if (err)
 			return NULL;
+		mp_id.port_id = eth_dev->data->port_id;
+		strlcpy(mp_id.name, MLX5_MP_NAME, RTE_MP_MAX_NAME_LEN);
 		/* Receive command fd from primary process */
-		err = mlx5_mp_req_verbs_cmd_fd(eth_dev);
+		err = mlx5_mp_req_verbs_cmd_fd(&mp_id);
 		if (err < 0)
 			return NULL;
 		/* Remap UAR for Tx queues. */
@@ -2373,6 +2380,8 @@ mlx5_dev_spawn(struct rte_device *dpdk_dev,
 	priv->ibv_port = spawn->ibv_port;
 	priv->pci_dev = spawn->pci_dev;
 	priv->mtu = RTE_ETHER_MTU;
+	priv->mp_id.port_id = port_id;
+	strlcpy(priv->mp_id.name, MLX5_MP_NAME, RTE_MP_MAX_NAME_LEN);
 #ifndef RTE_ARCH_64
 	/* Initialize UAR access locks for 32bit implementations. */
 	rte_spinlock_init(&priv->uar_lock_cq);
diff --git a/drivers/net/mlx5/mlx5.h b/drivers/net/mlx5/mlx5.h
index 34ab4758b1..9e15600afd 100644
--- a/drivers/net/mlx5/mlx5.h
+++ b/drivers/net/mlx5/mlx5.h
@@ -36,43 +36,13 @@
 #include <mlx5_devx_cmds.h>
 #include <mlx5_prm.h>
 #include <mlx5_nl.h>
+#include <mlx5_common_mp.h>
 
 #include "mlx5_defs.h"
 #include "mlx5_utils.h"
 #include "mlx5_mr.h"
 #include "mlx5_autoconf.h"
 
-/* Request types for IPC. */
-enum mlx5_mp_req_type {
-	MLX5_MP_REQ_VERBS_CMD_FD = 1,
-	MLX5_MP_REQ_CREATE_MR,
-	MLX5_MP_REQ_START_RXTX,
-	MLX5_MP_REQ_STOP_RXTX,
-	MLX5_MP_REQ_QUEUE_STATE_MODIFY,
-};
-
-struct mlx5_mp_arg_queue_state_modify {
-	uint8_t is_wq; /* Set if WQ. */
-	uint16_t queue_id; /* DPDK queue ID. */
-	enum ibv_wq_state state; /* WQ requested state. */
-};
-
-/* Pameters for IPC. */
-struct mlx5_mp_param {
-	enum mlx5_mp_req_type type;
-	int port_id;
-	int result;
-	RTE_STD_C11
-	union {
-		uintptr_t addr; /* MLX5_MP_REQ_CREATE_MR */
-		struct mlx5_mp_arg_queue_state_modify state_modify;
-		/* MLX5_MP_REQ_QUEUE_STATE_MODIFY */
-	} args;
-};
-
-/** Request timeout for IPC. */
-#define MLX5_MP_REQ_TIMEOUT_SEC 5
-
 /** Key string for IPC. */
 #define MLX5_MP_NAME "net_mlx5_mp"
 
@@ -561,6 +531,7 @@ struct mlx5_priv {
 #endif
 	uint8_t skip_default_rss_reta; /* Skip configuration of default reta. */
 	uint8_t fdb_def_rule; /* Whether fdb jump to table 1 is configured. */
+	struct mlx5_mp_id mp_id; /* ID of the MP process. */
 };
 
 #define PORT_ID(priv) ((priv)->dev_data->port_id)
@@ -761,16 +732,10 @@ int mlx5_flow_dev_dump(struct rte_eth_dev *dev, FILE *file,
 		       struct rte_flow_error *error);
 
 /* mlx5_mp.c */
+int mlx5_mp_primary_handle(const struct rte_mp_msg *mp_msg, const void *peer);
+int mlx5_mp_secondary_handle(const struct rte_mp_msg *mp_msg, const void *peer);
 void mlx5_mp_req_start_rxtx(struct rte_eth_dev *dev);
 void mlx5_mp_req_stop_rxtx(struct rte_eth_dev *dev);
-int mlx5_mp_req_mr_create(struct rte_eth_dev *dev, uintptr_t addr);
-int mlx5_mp_req_verbs_cmd_fd(struct rte_eth_dev *dev);
-int mlx5_mp_req_queue_state_modify(struct rte_eth_dev *dev,
-				   struct mlx5_mp_arg_queue_state_modify *sm);
-int mlx5_mp_init_primary(void);
-void mlx5_mp_uninit_primary(void);
-int mlx5_mp_init_secondary(void);
-void mlx5_mp_uninit_secondary(void);
 
 /* mlx5_socket.c */
 
diff --git a/drivers/net/mlx5/mlx5_mp.c b/drivers/net/mlx5/mlx5_mp.c
index 55d408fe95..43684dbc3a 100644
--- a/drivers/net/mlx5/mlx5_mp.c
+++ b/drivers/net/mlx5/mlx5_mp.c
@@ -10,46 +10,14 @@
 #include <rte_ethdev_driver.h>
 #include <rte_string_fns.h>
 
+#include <mlx5_common_mp.h>
+
 #include "mlx5.h"
 #include "mlx5_rxtx.h"
 #include "mlx5_utils.h"
 
-/**
- * Initialize IPC message.
- *
- * @param[in] dev
- *   Pointer to Ethernet structure.
- * @param[out] msg
- *   Pointer to message to fill in.
- * @param[in] type
- *   Message type.
- */
-static inline void
-mp_init_msg(struct rte_eth_dev *dev, struct rte_mp_msg *msg,
-	    enum mlx5_mp_req_type type)
-{
-	struct mlx5_mp_param *param = (struct mlx5_mp_param *)msg->param;
-
-	memset(msg, 0, sizeof(*msg));
-	strlcpy(msg->name, MLX5_MP_NAME, sizeof(msg->name));
-	msg->len_param = sizeof(*param);
-	param->type = type;
-	param->port_id = dev->data->port_id;
-}
-
-/**
- * IPC message handler of primary process.
- *
- * @param[in] dev
- *   Pointer to Ethernet structure.
- * @param[in] peer
- *   Pointer to the peer socket path.
- *
- * @return
- *   0 on success, a negative errno value otherwise and rte_errno is set.
- */
-static int
-mp_primary_handle(const struct rte_mp_msg *mp_msg, const void *peer)
+int
+mlx5_mp_primary_handle(const struct rte_mp_msg *mp_msg, const void *peer)
 {
 	struct rte_mp_msg mp_res;
 	struct mlx5_mp_param *res = (struct mlx5_mp_param *)mp_res.param;
@@ -71,21 +39,21 @@ mp_primary_handle(const struct rte_mp_msg *mp_msg, const void *peer)
 	priv = dev->data->dev_private;
 	switch (param->type) {
 	case MLX5_MP_REQ_CREATE_MR:
-		mp_init_msg(dev, &mp_res, param->type);
+		mp_init_msg(&priv->mp_id, &mp_res, param->type);
 		lkey = mlx5_mr_create_primary(dev, &entry, param->args.addr);
 		if (lkey == UINT32_MAX)
 			res->result = -rte_errno;
 		ret = rte_mp_reply(&mp_res, peer);
 		break;
 	case MLX5_MP_REQ_VERBS_CMD_FD:
-		mp_init_msg(dev, &mp_res, param->type);
+		mp_init_msg(&priv->mp_id, &mp_res, param->type);
 		mp_res.num_fds = 1;
 		mp_res.fds[0] = priv->sh->ctx->cmd_fd;
 		res->result = 0;
 		ret = rte_mp_reply(&mp_res, peer);
 		break;
 	case MLX5_MP_REQ_QUEUE_STATE_MODIFY:
-		mp_init_msg(dev, &mp_res, param->type);
+		mp_init_msg(&priv->mp_id, &mp_res, param->type);
 		res->result = mlx5_queue_state_modify_primary
 					(dev, &param->args.state_modify);
 		ret = rte_mp_reply(&mp_res, peer);
@@ -110,14 +78,15 @@ mp_primary_handle(const struct rte_mp_msg *mp_msg, const void *peer)
  * @return
  *   0 on success, a negative errno value otherwise and rte_errno is set.
  */
-static int
-mp_secondary_handle(const struct rte_mp_msg *mp_msg, const void *peer)
+int
+mlx5_mp_secondary_handle(const struct rte_mp_msg *mp_msg, const void *peer)
 {
 	struct rte_mp_msg mp_res;
 	struct mlx5_mp_param *res = (struct mlx5_mp_param *)mp_res.param;
 	const struct mlx5_mp_param *param =
 		(const struct mlx5_mp_param *)mp_msg->param;
 	struct rte_eth_dev *dev;
+	struct mlx5_priv *priv;
 	int ret;
 
 	MLX5_ASSERT(rte_eal_process_type() == RTE_PROC_SECONDARY);
@@ -127,13 +96,14 @@ mp_secondary_handle(const struct rte_mp_msg *mp_msg, const void *peer)
 		return -rte_errno;
 	}
 	dev = &rte_eth_devices[param->port_id];
+	priv = dev->data->dev_private;
 	switch (param->type) {
 	case MLX5_MP_REQ_START_RXTX:
 		DRV_LOG(INFO, "port %u starting datapath", dev->data->port_id);
 		rte_mb();
 		dev->rx_pkt_burst = mlx5_select_rx_function(dev);
 		dev->tx_pkt_burst = mlx5_select_tx_function(dev);
-		mp_init_msg(dev, &mp_res, param->type);
+		mp_init_msg(&priv->mp_id, &mp_res, param->type);
 		res->result = 0;
 		ret = rte_mp_reply(&mp_res, peer);
 		break;
@@ -142,7 +112,7 @@ mp_secondary_handle(const struct rte_mp_msg *mp_msg, const void *peer)
 		dev->rx_pkt_burst = removed_rx_burst;
 		dev->tx_pkt_burst = removed_tx_burst;
 		rte_mb();
-		mp_init_msg(dev, &mp_res, param->type);
+		mp_init_msg(&priv->mp_id, &mp_res, param->type);
 		res->result = 0;
 		ret = rte_mp_reply(&mp_res, peer);
 		break;
@@ -171,6 +141,7 @@ mp_req_on_rxtx(struct rte_eth_dev *dev, enum mlx5_mp_req_type type)
 	struct rte_mp_reply mp_rep;
 	struct mlx5_mp_param *res;
 	struct timespec ts = {.tv_sec = MLX5_MP_REQ_TIMEOUT_SEC, .tv_nsec = 0};
+	struct mlx5_priv *priv = dev->data->dev_private;
 	int ret;
 	int i;
 
@@ -182,7 +153,7 @@ mp_req_on_rxtx(struct rte_eth_dev *dev, enum mlx5_mp_req_type type)
 			dev->data->port_id, type);
 		return;
 	}
-	mp_init_msg(dev, &mp_req, type);
+	mp_init_msg(&priv->mp_id, &mp_req, type);
 	ret = rte_mp_request_sync(&mp_req, &mp_rep, &ts);
 	if (ret) {
 		if (rte_errno != ENOTSUP)
@@ -234,178 +205,3 @@ mlx5_mp_req_stop_rxtx(struct rte_eth_dev *dev)
 {
 	mp_req_on_rxtx(dev, MLX5_MP_REQ_STOP_RXTX);
 }
-
-/**
- * Request Memory Region creation to the primary process.
- *
- * @param[in] dev
- *   Pointer to Ethernet structure.
- * @param addr
- *   Target virtual address to register.
- *
- * @return
- *   0 on success, a negative errno value otherwise and rte_errno is set.
- */
-int
-mlx5_mp_req_mr_create(struct rte_eth_dev *dev, uintptr_t addr)
-{
-	struct rte_mp_msg mp_req;
-	struct rte_mp_msg *mp_res;
-	struct rte_mp_reply mp_rep;
-	struct mlx5_mp_param *req = (struct mlx5_mp_param *)mp_req.param;
-	struct mlx5_mp_param *res;
-	struct timespec ts = {.tv_sec = MLX5_MP_REQ_TIMEOUT_SEC, .tv_nsec = 0};
-	int ret;
-
-	MLX5_ASSERT(rte_eal_process_type() == RTE_PROC_SECONDARY);
-	mp_init_msg(dev, &mp_req, MLX5_MP_REQ_CREATE_MR);
-	req->args.addr = addr;
-	ret = rte_mp_request_sync(&mp_req, &mp_rep, &ts);
-	if (ret) {
-		DRV_LOG(ERR, "port %u request to primary process failed",
-			dev->data->port_id);
-		return -rte_errno;
-	}
-	MLX5_ASSERT(mp_rep.nb_received == 1);
-	mp_res = &mp_rep.msgs[0];
-	res = (struct mlx5_mp_param *)mp_res->param;
-	ret = res->result;
-	if (ret)
-		rte_errno = -ret;
-	free(mp_rep.msgs);
-	return ret;
-}
-
-/**
- * Request Verbs queue state modification to the primary process.
- *
- * @param[in] dev
- *   Pointer to Ethernet structure.
- * @param sm
- *   State modify parameters.
- *
- * @return
- *   0 on success, a negative errno value otherwise and rte_errno is set.
- */
-int
-mlx5_mp_req_queue_state_modify(struct rte_eth_dev *dev,
-			       struct mlx5_mp_arg_queue_state_modify *sm)
-{
-	struct rte_mp_msg mp_req;
-	struct rte_mp_msg *mp_res;
-	struct rte_mp_reply mp_rep;
-	struct mlx5_mp_param *req = (struct mlx5_mp_param *)mp_req.param;
-	struct mlx5_mp_param *res;
-	struct timespec ts = {.tv_sec = MLX5_MP_REQ_TIMEOUT_SEC, .tv_nsec = 0};
-	int ret;
-
-	MLX5_ASSERT(rte_eal_process_type() == RTE_PROC_SECONDARY);
-	mp_init_msg(dev, &mp_req, MLX5_MP_REQ_QUEUE_STATE_MODIFY);
-	req->args.state_modify = *sm;
-	ret = rte_mp_request_sync(&mp_req, &mp_rep, &ts);
-	if (ret) {
-		DRV_LOG(ERR, "port %u request to primary process failed",
-			dev->data->port_id);
-		return -rte_errno;
-	}
-	MLX5_ASSERT(mp_rep.nb_received == 1);
-	mp_res = &mp_rep.msgs[0];
-	res = (struct mlx5_mp_param *)mp_res->param;
-	ret = res->result;
-	free(mp_rep.msgs);
-	return ret;
-}
-
-/**
- * Request Verbs command file descriptor for mmap to the primary process.
- *
- * @param[in] dev
- *   Pointer to Ethernet structure.
- *
- * @return
- *   fd on success, a negative errno value otherwise and rte_errno is set.
- */
-int
-mlx5_mp_req_verbs_cmd_fd(struct rte_eth_dev *dev)
-{
-	struct rte_mp_msg mp_req;
-	struct rte_mp_msg *mp_res;
-	struct rte_mp_reply mp_rep;
-	struct mlx5_mp_param *res;
-	struct timespec ts = {.tv_sec = MLX5_MP_REQ_TIMEOUT_SEC, .tv_nsec = 0};
-	int ret;
-
-	MLX5_ASSERT(rte_eal_process_type() == RTE_PROC_SECONDARY);
-	mp_init_msg(dev, &mp_req, MLX5_MP_REQ_VERBS_CMD_FD);
-	ret = rte_mp_request_sync(&mp_req, &mp_rep, &ts);
-	if (ret) {
-		DRV_LOG(ERR, "port %u request to primary process failed",
-			dev->data->port_id);
-		return -rte_errno;
-	}
-	MLX5_ASSERT(mp_rep.nb_received == 1);
-	mp_res = &mp_rep.msgs[0];
-	res = (struct mlx5_mp_param *)mp_res->param;
-	if (res->result) {
-		rte_errno = -res->result;
-		DRV_LOG(ERR,
-			"port %u failed to get command FD from primary process",
-			dev->data->port_id);
-		ret = -rte_errno;
-		goto exit;
-	}
-	MLX5_ASSERT(mp_res->num_fds == 1);
-	ret = mp_res->fds[0];
-	DRV_LOG(DEBUG, "port %u command FD from primary is %d",
-		dev->data->port_id, ret);
-exit:
-	free(mp_rep.msgs);
-	return ret;
-}
-
-/**
- * Initialize by primary process.
- */
-int
-mlx5_mp_init_primary(void)
-{
-	int ret;
-
-	MLX5_ASSERT(rte_eal_process_type() == RTE_PROC_PRIMARY);
-
-	/* primary is allowed to not support IPC */
-	ret = rte_mp_action_register(MLX5_MP_NAME, mp_primary_handle);
-	if (ret && rte_errno != ENOTSUP)
-		return -1;
-	return 0;
-}
-
-/**
- * Un-initialize by primary process.
- */
-void
-mlx5_mp_uninit_primary(void)
-{
-	MLX5_ASSERT(rte_eal_process_type() == RTE_PROC_PRIMARY);
-	rte_mp_action_unregister(MLX5_MP_NAME);
-}
-
-/**
- * Initialize by secondary process.
- */
-int
-mlx5_mp_init_secondary(void)
-{
-	MLX5_ASSERT(rte_eal_process_type() == RTE_PROC_SECONDARY);
-	return rte_mp_action_register(MLX5_MP_NAME, mp_secondary_handle);
-}
-
-/**
- * Un-initialize by secondary process.
- */
-void
-mlx5_mp_uninit_secondary(void)
-{
-	MLX5_ASSERT(rte_eal_process_type() == RTE_PROC_SECONDARY);
-	rte_mp_action_unregister(MLX5_MP_NAME);
-}
diff --git a/drivers/net/mlx5/mlx5_mr.c b/drivers/net/mlx5/mlx5_mr.c
index a8f185a208..9151992a72 100644
--- a/drivers/net/mlx5/mlx5_mr.c
+++ b/drivers/net/mlx5/mlx5_mr.c
@@ -540,7 +540,7 @@ mlx5_mr_create_secondary(struct rte_eth_dev *dev, struct mlx5_mr_cache *entry,
 
 	DEBUG("port %u requesting MR creation for address (%p)",
 	      dev->data->port_id, (void *)addr);
-	ret = mlx5_mp_req_mr_create(dev, addr);
+	ret = mlx5_mp_req_mr_create(&priv->mp_id, addr);
 	if (ret) {
 		DEBUG("port %u fail to request MR creation for address (%p)",
 		      dev->data->port_id, (void *)addr);
diff --git a/drivers/net/mlx5/mlx5_rxtx.c b/drivers/net/mlx5/mlx5_rxtx.c
index f3bf763769..fc7591c2b0 100644
--- a/drivers/net/mlx5/mlx5_rxtx.c
+++ b/drivers/net/mlx5/mlx5_rxtx.c
@@ -1000,6 +1000,7 @@ static int
 mlx5_queue_state_modify(struct rte_eth_dev *dev,
 			struct mlx5_mp_arg_queue_state_modify *sm)
 {
+	struct mlx5_priv *priv = dev->data->dev_private;
 	int ret = 0;
 
 	switch (rte_eal_process_type()) {
@@ -1007,7 +1008,7 @@ mlx5_queue_state_modify(struct rte_eth_dev *dev,
 		ret = mlx5_queue_state_modify_primary(dev, sm);
 		break;
 	case RTE_PROC_SECONDARY:
-		ret = mlx5_mp_req_queue_state_modify(dev, sm);
+		ret = mlx5_mp_req_queue_state_modify(&priv->mp_id, sm);
 		break;
 	default:
 		break;
-- 
2.16.6


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [dpdk-dev] [PATCH v3 3/4] common/mlx5: refactor memory management codes
  2020-04-07 17:00 ` [dpdk-dev] [PATCH v3 0/4] refactor multi-process IPC and memory management codes to common driver Vu Pham
  2020-04-07 17:00   ` [dpdk-dev] [PATCH v3 1/4] common/mlx5: refactor multi-process IPC handling " Vu Pham
  2020-04-07 17:00   ` [dpdk-dev] [PATCH v3 2/4] net/mlx5: modify net PMD to use common multi-process APIs Vu Pham
@ 2020-04-07 17:00   ` Vu Pham
  2020-04-08  9:04     ` Slava Ovsiienko
  2020-04-07 17:00   ` [dpdk-dev] [PATCH v3 4/4] net/mlx5: modify net PMD to use common MR driver Vu Pham
  3 siblings, 1 reply; 26+ messages in thread
From: Vu Pham @ 2020-04-07 17:00 UTC (permalink / raw)
  To: dev; +Cc: viacheslavo, orika, matan, rasland, Vu Pham

Refactor the common memory B-tree and cache management code into the
common driver. Replace some input parameters of the MR APIs with more
common data structures such as PD, port_id, share_cache, ... so that
multiple PMD drivers can use those MR APIs.
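
For example (mirroring the call sites updated by patch 4/4 of this
series; the wrapper name is illustrative), a datapath bottom-half
lookup now passes the PD, the MP ID and the shared cache explicitly
instead of an rte_eth_dev:

#include <mlx5_common_mr.h>
#include "mlx5.h"

/* Illustrative wrapper; the 'priv' fields match the net PMD usage. */
static uint32_t
drv_addr2mr_bh(struct mlx5_priv *priv, struct mlx5_mr_ctrl *mr_ctrl,
	       uintptr_t addr)
{
	return mlx5_mr_addr2mr_bh(priv->sh->pd, &priv->mp_id,
				  &priv->sh->share_cache, mr_ctrl, addr,
				  priv->config.mr_ext_memseg_en);
}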

Signed-off-by: Vu Pham <vuhuong@mellanox.com>
---
 drivers/common/mlx5/mlx5_common_mr.c            | 1108 +++++++++++++++++++++++
 drivers/common/mlx5/mlx5_common_mr.h            |  160 ++++
 drivers/common/mlx5/rte_common_mlx5_version.map |   14 +
 3 files changed, 1282 insertions(+)
 create mode 100644 drivers/common/mlx5/mlx5_common_mr.c
 create mode 100644 drivers/common/mlx5/mlx5_common_mr.h

diff --git a/drivers/common/mlx5/mlx5_common_mr.c b/drivers/common/mlx5/mlx5_common_mr.c
new file mode 100644
index 0000000000..9d4a06dd5b
--- /dev/null
+++ b/drivers/common/mlx5/mlx5_common_mr.c
@@ -0,0 +1,1108 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright 2016 6WIND S.A.
+ * Copyright 2020 Mellanox Technologies, Ltd
+ */
+#include <rte_eal_memconfig.h>
+#include <rte_errno.h>
+#include <rte_mempool.h>
+#include <rte_malloc.h>
+#include <rte_rwlock.h>
+
+#include "mlx5_glue.h"
+#include "mlx5_common_mp.h"
+#include "mlx5_common_mr.h"
+#include "mlx5_common_utils.h"
+
+struct mr_find_contig_memsegs_data {
+	uintptr_t addr;
+	uintptr_t start;
+	uintptr_t end;
+	const struct rte_memseg_list *msl;
+};
+
+/**
+ * Expand B-tree table to a given size. Can't be called while holding
+ * memory_hotplug_lock or share_cache.rwlock due to rte_realloc().
+ *
+ * @param bt
+ *   Pointer to B-tree structure.
+ * @param n
+ *   Number of entries for expansion.
+ *
+ * @return
+ *   0 on success, -1 on failure.
+ */
+static int
+mr_btree_expand(struct mlx5_mr_btree *bt, int n)
+{
+	void *mem;
+	int ret = 0;
+
+	if (n <= bt->size)
+		return ret;
+	/*
+	 * Downside of directly using rte_realloc() is that SOCKET_ID_ANY is
+	 * used inside if there's no room to expand. Because this is quite a
+	 * rare case and part of a very slow path, it is acceptable.
+	 * Initially cache_bh[] will be given practically enough space and once
+	 * it is expanded, expansion wouldn't be needed again ever.
+	 */
+	mem = rte_realloc(bt->table, n * sizeof(struct mr_cache_entry), 0);
+	if (mem == NULL) {
+		/* Not an error, B-tree search will be skipped. */
+		DRV_LOG(WARNING, "failed to expand MR B-tree (%p) table",
+			(void *)bt);
+		ret = -1;
+	} else {
+		DRV_LOG(DEBUG, "expanded MR B-tree table (size=%u)", n);
+		bt->table = mem;
+		bt->size = n;
+	}
+	return ret;
+}
+
+/**
+ * Look up LKey from given B-tree lookup table, store the last index and return
+ * searched LKey.
+ *
+ * @param bt
+ *   Pointer to B-tree structure.
+ * @param[out] idx
+ *   Pointer to index. Even on search failure, returns index where it stops
+ *   searching so that index can be used when inserting a new entry.
+ * @param addr
+ *   Search key.
+ *
+ * @return
+ *   Searched LKey on success, UINT32_MAX on no match.
+ */
+static uint32_t
+mr_btree_lookup(struct mlx5_mr_btree *bt, uint16_t *idx, uintptr_t addr)
+{
+	struct mr_cache_entry *lkp_tbl;
+	uint16_t n;
+	uint16_t base = 0;
+
+	MLX5_ASSERT(bt != NULL);
+	lkp_tbl = *bt->table;
+	n = bt->len;
+	/* First entry must be NULL for comparison. */
+	MLX5_ASSERT(bt->len > 0 || (lkp_tbl[0].start == 0 &&
+				    lkp_tbl[0].lkey == UINT32_MAX));
+	/* Binary search. */
+	do {
+		register uint16_t delta = n >> 1;
+
+		if (addr < lkp_tbl[base + delta].start) {
+			n = delta;
+		} else {
+			base += delta;
+			n -= delta;
+		}
+	} while (n > 1);
+	MLX5_ASSERT(addr >= lkp_tbl[base].start);
+	*idx = base;
+	if (addr < lkp_tbl[base].end)
+		return lkp_tbl[base].lkey;
+	/* Not found. */
+	return UINT32_MAX;
+}
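+
+/*
+ * Worked example (illustrative): with the entries {0, 0, UINT32_MAX}
+ * (sentinel), {0x1000, 0x2000, lkey1} and {0x2000, 0x3000, lkey2},
+ * looking up addr=0x2800 converges on base=2 and returns lkey2;
+ * addr=0x3800 also stops at base=2 but fails the (addr < end) check
+ * and returns UINT32_MAX.
+ */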
+
+/**
+ * Insert an entry to B-tree lookup table.
+ *
+ * @param bt
+ *   Pointer to B-tree structure.
+ * @param entry
+ *   Pointer to new entry to insert.
+ *
+ * @return
+ *   0 on success, -1 on failure.
+ */
+static int
+mr_btree_insert(struct mlx5_mr_btree *bt, struct mr_cache_entry *entry)
+{
+	struct mr_cache_entry *lkp_tbl;
+	uint16_t idx = 0;
+	size_t shift;
+
+	MLX5_ASSERT(bt != NULL);
+	MLX5_ASSERT(bt->len <= bt->size);
+	MLX5_ASSERT(bt->len > 0);
+	lkp_tbl = *bt->table;
+	/* Find out the slot for insertion. */
+	if (mr_btree_lookup(bt, &idx, entry->start) != UINT32_MAX) {
+		DRV_LOG(DEBUG,
+			"abort insertion to B-tree(%p): already exist at"
+			" idx=%u [0x%" PRIxPTR ", 0x%" PRIxPTR ") lkey=0x%x",
+			(void *)bt, idx, entry->start, entry->end, entry->lkey);
+		/* Already exist, return. */
+		return 0;
+	}
+	/* If table is full, return error. */
+	if (unlikely(bt->len == bt->size)) {
+		bt->overflow = 1;
+		return -1;
+	}
+	/* Insert entry. */
+	++idx;
+	shift = (bt->len - idx) * sizeof(struct mr_cache_entry);
+	if (shift)
+		memmove(&lkp_tbl[idx + 1], &lkp_tbl[idx], shift);
+	lkp_tbl[idx] = *entry;
+	bt->len++;
+	DRV_LOG(DEBUG,
+		"inserted B-tree(%p)[%u],"
+		" [0x%" PRIxPTR ", 0x%" PRIxPTR ") lkey=0x%x",
+		(void *)bt, idx, entry->start, entry->end, entry->lkey);
+	return 0;
+}
+
+/**
+ * Initialize B-tree and allocate memory for lookup table.
+ *
+ * @param bt
+ *   Pointer to B-tree structure.
+ * @param n
+ *   Number of entries to allocate.
+ * @param socket
+ *   NUMA socket on which memory must be allocated.
+ *
+ * @return
+ *   0 on success, a negative errno value otherwise and rte_errno is set.
+ */
+int
+mlx5_mr_btree_init(struct mlx5_mr_btree *bt, int n, int socket)
+{
+	if (bt == NULL) {
+		rte_errno = EINVAL;
+		return -rte_errno;
+	}
+	MLX5_ASSERT(!bt->table && !bt->size);
+	memset(bt, 0, sizeof(*bt));
+	bt->table = rte_calloc_socket("B-tree table",
+				      n, sizeof(struct mr_cache_entry),
+				      0, socket);
+	if (bt->table == NULL) {
+		rte_errno = ENOMEM;
+		DEBUG("failed to allocate memory for btree cache on socket %d",
+		      socket);
+		return -rte_errno;
+	}
+	bt->size = n;
+	/* First entry must be NULL for binary search. */
+	(*bt->table)[bt->len++] = (struct mr_cache_entry) {
+		.lkey = UINT32_MAX,
+	};
+	DEBUG("initialized B-tree %p with table %p",
+	      (void *)bt, (void *)bt->table);
+	return 0;
+}
+
+/**
+ * Free B-tree resources.
+ *
+ * @param bt
+ *   Pointer to B-tree structure.
+ */
+void
+mlx5_mr_btree_free(struct mlx5_mr_btree *bt)
+{
+	if (bt == NULL)
+		return;
+	DEBUG("freeing B-tree %p with table %p",
+	      (void *)bt, (void *)bt->table);
+	rte_free(bt->table);
+	memset(bt, 0, sizeof(*bt));
+}
+
+/**
+ * Dump all the entries in a B-tree
+ *
+ * @param bt
+ *   Pointer to B-tree structure.
+ */
+void
+mlx5_mr_btree_dump(struct mlx5_mr_btree *bt __rte_unused)
+{
+#ifdef RTE_LIBRTE_MLX5_DEBUG
+	int idx;
+	struct mr_cache_entry *lkp_tbl;
+
+	if (bt == NULL)
+		return;
+	lkp_tbl = *bt->table;
+	for (idx = 0; idx < bt->len; ++idx) {
+		struct mr_cache_entry *entry = &lkp_tbl[idx];
+
+		DEBUG("B-tree(%p)[%u],"
+		      " [0x%" PRIxPTR ", 0x%" PRIxPTR ") lkey=0x%x",
+		      (void *)bt, idx, entry->start, entry->end, entry->lkey);
+	}
+#endif
+}
+
+/**
+ * Find virtually contiguous memory chunk in a given MR.
+ *
+ * @param mr
+ *   Pointer to MR structure.
+ * @param[out] entry
+ *   Pointer to returning MR cache entry. If not found, this will not be
+ *   updated.
+ * @param base_idx
+ *   Start index of the memseg bitmap.
+ *
+ * @return
+ *   Next index to go on lookup.
+ */
+static int
+mr_find_next_chunk(struct mlx5_mr *mr, struct mr_cache_entry *entry,
+		   int base_idx)
+{
+	uintptr_t start = 0;
+	uintptr_t end = 0;
+	uint32_t idx = 0;
+
+	/* MR for external memory doesn't have memseg list. */
+	if (mr->msl == NULL) {
+		struct ibv_mr *ibv_mr = mr->ibv_mr;
+
+		MLX5_ASSERT(mr->ms_bmp_n == 1);
+		MLX5_ASSERT(mr->ms_n == 1);
+		MLX5_ASSERT(base_idx == 0);
+		/*
+		 * Can't search it from memseg list but get it directly from
+		 * verbs MR as there's only one chunk.
+		 */
+		entry->start = (uintptr_t)ibv_mr->addr;
+		entry->end = (uintptr_t)ibv_mr->addr + mr->ibv_mr->length;
+		entry->lkey = rte_cpu_to_be_32(mr->ibv_mr->lkey);
+		/* Returning 1 ends iteration. */
+		return 1;
+	}
+	for (idx = base_idx; idx < mr->ms_bmp_n; ++idx) {
+		if (rte_bitmap_get(mr->ms_bmp, idx)) {
+			const struct rte_memseg_list *msl;
+			const struct rte_memseg *ms;
+
+			msl = mr->msl;
+			ms = rte_fbarray_get(&msl->memseg_arr,
+					     mr->ms_base_idx + idx);
+			MLX5_ASSERT(msl->page_sz == ms->hugepage_sz);
+			if (!start)
+				start = ms->addr_64;
+			end = ms->addr_64 + ms->hugepage_sz;
+		} else if (start) {
+			/* Passed the end of a fragment. */
+			break;
+		}
+	}
+	if (start) {
+		/* Found one chunk. */
+		entry->start = start;
+		entry->end = end;
+		entry->lkey = rte_cpu_to_be_32(mr->ibv_mr->lkey);
+	}
+	return idx;
+}
+
+/**
+ * Insert an MR into the global B-tree cache. It may fail due to low memory.
+ * Then, this entry will have to be searched by mlx5_mr_lookup_list() in
+ * mlx5_mr_create() on miss.
+ *
+ * @param share_cache
+ *   Pointer to a global shared MR cache.
+ * @param mr
+ *   Pointer to MR to insert.
+ *
+ * @return
+ *   0 on success, -1 on failure.
+ */
+int
+mlx5_mr_insert_cache(struct mlx5_mr_share_cache *share_cache,
+		     struct mlx5_mr *mr)
+{
+	unsigned int n;
+
+	DRV_LOG(DEBUG, "Inserting MR(%p) to global cache(%p)",
+		(void *)mr, (void *)share_cache);
+	for (n = 0; n < mr->ms_bmp_n; ) {
+		struct mr_cache_entry entry;
+
+		memset(&entry, 0, sizeof(entry));
+		/* Find a contiguous chunk and advance the index. */
+		n = mr_find_next_chunk(mr, &entry, n);
+		if (!entry.end)
+			break;
+		if (mr_btree_insert(&share_cache->cache, &entry) < 0) {
+			/*
+			 * Overflowed, but the global table cannot be expanded
+			 * because of deadlock.
+			 */
+			return -1;
+		}
+	}
+	return 0;
+}
+
+/**
+ * Look up address in the original global MR list.
+ *
+ * @param share_cache
+ *   Pointer to a global shared MR cache.
+ * @param[out] entry
+ *   Pointer to returning MR cache entry. If no match, this will not be updated.
+ * @param addr
+ *   Search key.
+ *
+ * @return
+ *   Found MR on match, NULL otherwise.
+ */
+struct mlx5_mr *
+mlx5_mr_lookup_list(struct mlx5_mr_share_cache *share_cache,
+		    struct mr_cache_entry *entry, uintptr_t addr)
+{
+	struct mlx5_mr *mr;
+
+	/* Iterate all the existing MRs. */
+	LIST_FOREACH(mr, &share_cache->mr_list, mr) {
+		unsigned int n;
+
+		if (mr->ms_n == 0)
+			continue;
+		for (n = 0; n < mr->ms_bmp_n; ) {
+			struct mr_cache_entry ret;
+
+			memset(&ret, 0, sizeof(ret));
+			n = mr_find_next_chunk(mr, &ret, n);
+			if (addr >= ret.start && addr < ret.end) {
+				/* Found. */
+				*entry = ret;
+				return mr;
+			}
+		}
+	}
+	return NULL;
+}
+
+/**
+ * Look up address on global MR cache.
+ *
+ * @param share_cache
+ *   Pointer to a global shared MR cache.
+ * @param[out] entry
+ *   Pointer to returning MR cache entry. If no match, this will not be updated.
+ * @param addr
+ *   Search key.
+ *
+ * @return
+ *   Searched LKey on success, UINT32_MAX on failure and rte_errno is set.
+ */
+uint32_t
+mlx5_mr_lookup_cache(struct mlx5_mr_share_cache *share_cache,
+		     struct mr_cache_entry *entry, uintptr_t addr)
+{
+	uint16_t idx;
+	uint32_t lkey = UINT32_MAX;
+	struct mlx5_mr *mr;
+
+	/*
+	 * If the global cache has overflowed since it failed to expand the
+	 * B-tree table, it can't have all the existing MRs. Then, the address
+	 * has to be searched by traversing the original MR list instead, which
+	 * is a very slow path. Otherwise, the global cache is all-inclusive.
+	 */
+	if (!unlikely(share_cache->cache.overflow)) {
+		lkey = mr_btree_lookup(&share_cache->cache, &idx, addr);
+		if (lkey != UINT32_MAX)
+			*entry = (*share_cache->cache.table)[idx];
+	} else {
+		/* Falling back to the slowest path. */
+		mr = mlx5_mr_lookup_list(share_cache, entry, addr);
+		if (mr != NULL)
+			lkey = entry->lkey;
+	}
+	MLX5_ASSERT(lkey == UINT32_MAX || (addr >= entry->start &&
+					   addr < entry->end));
+	return lkey;
+}
+
+/**
+ * Free MR resources. MR lock must not be held to avoid a deadlock. rte_free()
+ * can raise memory free event and the callback function will spin on the lock.
+ *
+ * @param mr
+ *   Pointer to MR to free.
+ */
+static void
+mr_free(struct mlx5_mr *mr)
+{
+	if (mr == NULL)
+		return;
+	DRV_LOG(DEBUG, "freeing MR(%p):", (void *)mr);
+	if (mr->ibv_mr != NULL)
+		claim_zero(mlx5_glue->dereg_mr(mr->ibv_mr));
+	if (mr->ms_bmp != NULL)
+		rte_bitmap_free(mr->ms_bmp);
+	rte_free(mr);
+}
+
+/**
+ * Rebuild the global B-tree cache of a device from the original MR list.
+ *
+ * @param share_cache
+ *   Pointer to a global shared MR cache.
+ */
+void
+mlx5_mr_rebuild_cache(struct mlx5_mr_share_cache *share_cache)
+{
+	struct mlx5_mr *mr;
+
+	DRV_LOG(DEBUG, "Rebuild dev cache[] %p", (void *)share_cache);
+	/* Flush cache to rebuild. */
+	share_cache->cache.len = 1;
+	share_cache->cache.overflow = 0;
+	/* Iterate all the existing MRs. */
+	LIST_FOREACH(mr, &share_cache->mr_list, mr)
+		if (mlx5_mr_insert_cache(share_cache, mr) < 0)
+			return;
+}
+
+/**
+ * Release resources of detached MR having no online entry.
+ *
+ * @param share_cache
+ *   Pointer to a global shared MR cache.
+ */
+static void
+mlx5_mr_garbage_collect(struct mlx5_mr_share_cache *share_cache)
+{
+	struct mlx5_mr *mr_next;
+	struct mlx5_mr_list free_list = LIST_HEAD_INITIALIZER(free_list);
+
+	/* Must be called from the primary process. */
+	MLX5_ASSERT(rte_eal_process_type() == RTE_PROC_PRIMARY);
+	/*
+	 * MR can't be freed with holding the lock because rte_free() could call
+	 * memory free callback function. This will be a deadlock situation.
+	 */
+	rte_rwlock_write_lock(&share_cache->rwlock);
+	/* Detach the whole free list and release it after unlocking. */
+	free_list = share_cache->mr_free_list;
+	LIST_INIT(&share_cache->mr_free_list);
+	rte_rwlock_write_unlock(&share_cache->rwlock);
+	/* Release resources. */
+	mr_next = LIST_FIRST(&free_list);
+	while (mr_next != NULL) {
+		struct mlx5_mr *mr = mr_next;
+
+		mr_next = LIST_NEXT(mr, mr);
+		mr_free(mr);
+	}
+}
+
+/* Called during rte_memseg_contig_walk() by mlx5_mr_create(). */
+static int
+mr_find_contig_memsegs_cb(const struct rte_memseg_list *msl,
+			  const struct rte_memseg *ms, size_t len, void *arg)
+{
+	struct mr_find_contig_memsegs_data *data = arg;
+
+	if (data->addr < ms->addr_64 || data->addr >= ms->addr_64 + len)
+		return 0;
+	/* Found, save it and stop walking. */
+	data->start = ms->addr_64;
+	data->end = ms->addr_64 + len;
+	data->msl = msl;
+	return 1;
+}
+
+/**
+ * Create a new global Memory Region (MR) for a missing virtual address.
+ * This API should be called on a secondary process, then a request is sent to
+ * the primary process in order to create an MR for the address. As the global
+ * MR list is in shared memory, the following LKey lookup should succeed unless
+ * the request fails.
+ *
+ * @param pd
+ *   Pointer to ibv_pd of a device (net, regex, vdpa,...).
+ * @param mp_id
+ *   Pointer to multi-process ID {name, port_id} of the device.
+ * @param share_cache
+ *   Pointer to a global shared MR cache.
+ * @param[out] entry
+ *   Pointer to returning MR cache entry, found in the global cache or newly
+ *   created. If failed to create one, this will not be updated.
+ * @param addr
+ *   Target virtual address to register.
+ * @param mr_ext_memseg_en
+ *   Configurable flag: whether MR extension over memsegs is enabled.
+ *
+ * @return
+ *   Searched LKey on success, UINT32_MAX on failure and rte_errno is set.
+ */
+static uint32_t
+mlx5_mr_create_secondary(struct ibv_pd *pd __rte_unused,
+			 struct mlx5_mp_id *mp_id,
+			 struct mlx5_mr_share_cache *share_cache,
+			 struct mr_cache_entry *entry, uintptr_t addr,
+			 unsigned int mr_ext_memseg_en __rte_unused)
+{
+	int ret;
+
+	DEBUG("port %u requesting MR creation for address (%p)",
+	      mp_id->port_id, (void *)addr);
+	ret = mlx5_mp_req_mr_create(mp_id, addr);
+	if (ret) {
+		DEBUG("Fail to request MR creation for address (%p)",
+		      (void *)addr);
+		return UINT32_MAX;
+	}
+	rte_rwlock_read_lock(&share_cache->rwlock);
+	/* Fill in output data. */
+	mlx5_mr_lookup_cache(share_cache, entry, addr);
+	/* Lookup can't fail. */
+	MLX5_ASSERT(entry->lkey != UINT32_MAX);
+	rte_rwlock_read_unlock(&share_cache->rwlock);
+	DEBUG("MR CREATED by primary process for %p:\n"
+	      "  [0x%" PRIxPTR ", 0x%" PRIxPTR "), lkey=0x%x",
+	      (void *)addr, entry->start, entry->end, entry->lkey);
+	return entry->lkey;
+}
+
+/**
+ * Create a new global Memory Region (MR) for a missing virtual address.
+ * Register entire virtually contiguous memory chunk around the address.
+ *
+ * @param pd
+ *   Pointer to ibv_pd of a device (net, regex, vdpa,...).
+ * @param share_cache
+ *   Pointer to a global shared MR cache.
+ * @param[out] entry
+ *   Pointer to returning MR cache entry, found in the global cache or newly
+ *   created. If failed to create one, this will not be updated.
+ * @param addr
+ *   Target virtual address to register.
+ * @param mr_ext_memseg_en
+ *   Configurable flag: whether MR extension over memsegs is enabled.
+ *
+ * @return
+ *   Searched LKey on success, UINT32_MAX on failure and rte_errno is set.
+ */
+uint32_t
+mlx5_mr_create_primary(struct ibv_pd *pd,
+		       struct mlx5_mr_share_cache *share_cache,
+		       struct mr_cache_entry *entry, uintptr_t addr,
+		       unsigned int mr_ext_memseg_en)
+{
+	struct mr_find_contig_memsegs_data data = {.addr = addr, };
+	struct mr_find_contig_memsegs_data data_re;
+	const struct rte_memseg_list *msl;
+	const struct rte_memseg *ms;
+	struct mlx5_mr *mr = NULL;
+	int ms_idx_shift = -1;
+	uint32_t bmp_size;
+	void *bmp_mem;
+	uint32_t ms_n;
+	uint32_t n;
+	size_t len;
+
+	DRV_LOG(DEBUG, "Creating a MR using address (%p)", (void *)addr);
+	/*
+	 * Release detached MRs if any. This can't be called with holding either
+	 * memory_hotplug_lock or share_cache->rwlock. MRs on the free list have
+	 * been detached by the memory free event but it couldn't be released
+	 * inside the callback due to deadlock. As a result, releasing resources
+	 * is quite opportunistic.
+	 */
+	mlx5_mr_garbage_collect(share_cache);
+	/*
+	 * If enabled, find out a contiguous virtual address chunk in use, to
+	 * which the given address belongs, in order to register maximum range.
+	 * In the best case where mempools are not dynamically recreated and
+	 * '--socket-mem' is specified as an EAL option, it is very likely to
+	 * have only one MR(LKey) per socket and per hugepage size even
+	 * though the system memory is highly fragmented. As the whole memory
+	 * chunk will be pinned by kernel, it can't be reused unless entire
+	 * chunk is freed from EAL.
+	 *
+	 * If disabled, just register one memseg (page). Then, memory
+	 * consumption will be minimized but it may drop performance if there
+	 * are many MRs to lookup on the datapath.
+	 */
+	if (!mr_ext_memseg_en) {
+		data.msl = rte_mem_virt2memseg_list((void *)addr);
+		data.start = RTE_ALIGN_FLOOR(addr, data.msl->page_sz);
+		data.end = data.start + data.msl->page_sz;
+	} else if (!rte_memseg_contig_walk(mr_find_contig_memsegs_cb, &data)) {
+		DRV_LOG(WARNING,
+			"Unable to find virtually contiguous"
+			" chunk for address (%p)."
+			" rte_memseg_contig_walk() failed.", (void *)addr);
+		rte_errno = ENXIO;
+		goto err_nolock;
+	}
+alloc_resources:
+	/* Addresses must be page-aligned. */
+	MLX5_ASSERT(data.msl);
+	MLX5_ASSERT(rte_is_aligned((void *)data.start, data.msl->page_sz));
+	MLX5_ASSERT(rte_is_aligned((void *)data.end, data.msl->page_sz));
+	msl = data.msl;
+	ms = rte_mem_virt2memseg((void *)data.start, msl);
+	len = data.end - data.start;
+	MLX5_ASSERT(ms);
+	MLX5_ASSERT(msl->page_sz == ms->hugepage_sz);
+	/* Number of memsegs in the range. */
+	ms_n = len / msl->page_sz;
+	DEBUG("Extending %p to [0x%" PRIxPTR ", 0x%" PRIxPTR "),"
+	      " page_sz=0x%" PRIx64 ", ms_n=%u",
+	      (void *)addr, data.start, data.end, msl->page_sz, ms_n);
+	/* Size of memory for bitmap. */
+	bmp_size = rte_bitmap_get_memory_footprint(ms_n);
+	mr = rte_zmalloc_socket(NULL,
+				RTE_ALIGN_CEIL(sizeof(*mr),
+					       RTE_CACHE_LINE_SIZE) +
+				bmp_size,
+				RTE_CACHE_LINE_SIZE, msl->socket_id);
+	if (mr == NULL) {
+		DEBUG("Unable to allocate memory for a new MR of"
+		      " address (%p).", (void *)addr);
+		rte_errno = ENOMEM;
+		goto err_nolock;
+	}
+	mr->msl = msl;
+	/*
+	 * Save the index of the first memseg and initialize memseg bitmap. To
+	 * see if a memseg of ms_idx in the memseg-list is still valid, check:
+	 *	rte_bitmap_get(mr->bmp, ms_idx - mr->ms_base_idx)
+	 */
+	mr->ms_base_idx = rte_fbarray_find_idx(&msl->memseg_arr, ms);
+	bmp_mem = RTE_PTR_ALIGN_CEIL(mr + 1, RTE_CACHE_LINE_SIZE);
+	mr->ms_bmp = rte_bitmap_init(ms_n, bmp_mem, bmp_size);
+	if (mr->ms_bmp == NULL) {
+		DEBUG("Unable to initialize bitmap for a new MR of"
+		      " address (%p).", (void *)addr);
+		rte_errno = EINVAL;
+		goto err_nolock;
+	}
+	/*
+	 * Should recheck whether the extended contiguous chunk is still valid.
+	 * Because memory_hotplug_lock can't be held if there's any memory
+	 * related calls in a critical path, resource allocation above can't be
+	 * locked. If the memory has been changed at this point, try again with
+	 * just single page. If not, go on with the big chunk atomically from
+	 * here.
+	 */
+	rte_mcfg_mem_read_lock();
+	data_re = data;
+	if (len > msl->page_sz &&
+	    !rte_memseg_contig_walk(mr_find_contig_memsegs_cb, &data_re)) {
+		DEBUG("Unable to find virtually contiguous"
+		      " chunk for address (%p)."
+		      " rte_memseg_contig_walk() failed.", (void *)addr);
+		rte_errno = ENXIO;
+		goto err_memlock;
+	}
+	if (data.start != data_re.start || data.end != data_re.end) {
+		/*
+		 * The extended contiguous chunk has been changed. Try again
+		 * with single memseg instead.
+		 */
+		data.start = RTE_ALIGN_FLOOR(addr, msl->page_sz);
+		data.end = data.start + msl->page_sz;
+		rte_mcfg_mem_read_unlock();
+		mr_free(mr);
+		goto alloc_resources;
+	}
+	MLX5_ASSERT(data.msl == data_re.msl);
+	rte_rwlock_write_lock(&share_cache->rwlock);
+	/*
+	 * Check the address is really missing. If other thread already created
+	 * one or it is not found due to overflow, abort and return.
+	 */
+	if (mlx5_mr_lookup_cache(share_cache, entry, addr) != UINT32_MAX) {
+		/*
+		 * Insert into the global cache table. It may fail due to
+		 * low memory. Then, this entry will have to be searched
+		 * here again.
+		 */
+		mr_btree_insert(&share_cache->cache, entry);
+		DEBUG("Found MR for %p on final lookup, abort", (void *)addr);
+		rte_rwlock_write_unlock(&share_cache->rwlock);
+		rte_mcfg_mem_read_unlock();
+		/*
+		 * Must be unlocked before calling rte_free() because
+		 * mlx5_mr_mem_event_free_cb() can be called inside.
+		 */
+		mr_free(mr);
+		return entry->lkey;
+	}
+	/*
+	 * Trim start and end addresses for verbs MR. Set bits for registering
+	 * memsegs but exclude already registered ones. Bitmap can be
+	 * fragmented.
+	 */
+	for (n = 0; n < ms_n; ++n) {
+		uintptr_t start;
+		struct mr_cache_entry ret;
+
+		memset(&ret, 0, sizeof(ret));
+		start = data_re.start + n * msl->page_sz;
+		/* Exclude memsegs already registered by other MRs. */
+		if (mlx5_mr_lookup_cache(share_cache, &ret, start) ==
+		    UINT32_MAX) {
+			/*
+			 * Start from the first unregistered memseg in the
+			 * extended range.
+			 */
+			if (ms_idx_shift == -1) {
+				mr->ms_base_idx += n;
+				data.start = start;
+				ms_idx_shift = n;
+			}
+			data.end = start + msl->page_sz;
+			rte_bitmap_set(mr->ms_bmp, n - ms_idx_shift);
+			++mr->ms_n;
+		}
+	}
+	len = data.end - data.start;
+	mr->ms_bmp_n = len / msl->page_sz;
+	MLX5_ASSERT(ms_idx_shift + mr->ms_bmp_n <= ms_n);
+	/*
+	 * Finally create a verbs MR for the memory chunk. ibv_reg_mr() can be
+	 * called with holding the memory lock because it doesn't use
+	 * mlx5_alloc_buf_extern() which eventually calls rte_malloc_socket()
+	 * through mlx5_alloc_verbs_buf().
+	 */
+	mr->ibv_mr = mlx5_glue->reg_mr(pd, (void *)data.start, len,
+				       IBV_ACCESS_LOCAL_WRITE |
+				       IBV_ACCESS_RELAXED_ORDERING);
+	if (mr->ibv_mr == NULL) {
+		DEBUG("Fail to create a verbs MR for address (%p)",
+		      (void *)addr);
+		rte_errno = EINVAL;
+		goto err_mrlock;
+	}
+	MLX5_ASSERT((uintptr_t)mr->ibv_mr->addr == data.start);
+	MLX5_ASSERT(mr->ibv_mr->length == len);
+	LIST_INSERT_HEAD(&share_cache->mr_list, mr, mr);
+	DEBUG("MR CREATED (%p) for %p:\n"
+	      "  [0x%" PRIxPTR ", 0x%" PRIxPTR "),"
+	      " lkey=0x%x base_idx=%u ms_n=%u, ms_bmp_n=%u",
+	      (void *)mr, (void *)addr, data.start, data.end,
+	      rte_cpu_to_be_32(mr->ibv_mr->lkey),
+	      mr->ms_base_idx, mr->ms_n, mr->ms_bmp_n);
+	/* Insert to the global cache table. */
+	mlx5_mr_insert_cache(share_cache, mr);
+	/* Fill in output data. */
+	mlx5_mr_lookup_cache(share_cache, entry, addr);
+	/* Lookup can't fail. */
+	MLX5_ASSERT(entry->lkey != UINT32_MAX);
+	rte_rwlock_write_unlock(&share_cache->rwlock);
+	rte_mcfg_mem_read_unlock();
+	return entry->lkey;
+err_mrlock:
+	rte_rwlock_write_unlock(&share_cache->rwlock);
+err_memlock:
+	rte_mcfg_mem_read_unlock();
+err_nolock:
+	/*
+	 * In case of error, as this can be called in a datapath, a warning
+	 * message per error is preferable. Must be unlocked before
+	 * calling rte_free() because mlx5_mr_mem_event_free_cb() can be called
+	 * inside.
+	 */
+	mr_free(mr);
+	return UINT32_MAX;
+}
+
+/**
+ * Create a new global Memory Region (MR) for a missing virtual address.
+ * This can be called from primary and secondary process.
+ *
+ * @param pd
+ *   Pointer to ibv_pd of a device (net, regex, vdpa,...).
+ * @param mp_id
+ *   Pointer to multi-process ID {name, port_id} of the device.
+ * @param share_cache
+ *   Pointer to a global shared MR cache.
+ * @param[out] entry
+ *   Pointer to returning MR cache entry, found in the global cache or newly
+ *   created. If failed to create one, this will not be updated.
+ * @param addr
+ *   Target virtual address to register.
+ * @param mr_ext_memseg_en
+ *   Configurable flag: whether MR extension over memsegs is enabled.
+ *
+ * @return
+ *   Searched LKey on success, UINT32_MAX on failure and rte_errno is set.
+ */
+static uint32_t
+mlx5_mr_create(struct ibv_pd *pd, struct mlx5_mp_id *mp_id,
+	       struct mlx5_mr_share_cache *share_cache,
+	       struct mr_cache_entry *entry, uintptr_t addr,
+	       unsigned int mr_ext_memseg_en)
+{
+	uint32_t ret = 0;
+
+	switch (rte_eal_process_type()) {
+	case RTE_PROC_PRIMARY:
+		ret = mlx5_mr_create_primary(pd, share_cache, entry,
+					     addr, mr_ext_memseg_en);
+		break;
+	case RTE_PROC_SECONDARY:
+		ret = mlx5_mr_create_secondary(pd, mp_id, share_cache, entry,
+					       addr, mr_ext_memseg_en);
+		break;
+	default:
+		break;
+	}
+	return ret;
+}
+
+/**
+ * Look up address in the global MR cache table. If not found, create a new MR.
+ * Insert the found/created entry to local bottom-half cache table.
+ *
+ * @param pd
+ *   Pointer to ibv_pd of a device (net, regex, vdpa,...).
+ * @param mp_id
+ *   Pointer to multi-process ID {name, port_id} of the device.
+ * @param share_cache
+ *   Pointer to a global shared MR cache.
+ * @param mr_ctrl
+ *   Pointer to per-queue MR control structure.
+ * @param[out] entry
+ *   Pointer to returning MR cache entry, found in the global cache or newly
+ *   created. If failed to create one, this is not written.
+ * @param addr
+ *   Search key.
+ * @param mr_ext_memseg_en
+ *   Configurable flag: whether MR extension over memsegs is enabled.
+ *
+ * @return
+ *   Searched LKey on success, UINT32_MAX on no match.
+ */
+static uint32_t
+mr_lookup_caches(struct ibv_pd *pd, struct mlx5_mp_id *mp_id,
+		 struct mlx5_mr_share_cache *share_cache,
+		 struct mlx5_mr_ctrl *mr_ctrl,
+		 struct mr_cache_entry *entry, uintptr_t addr,
+		 unsigned int mr_ext_memseg_en)
+{
+	struct mlx5_mr_btree *bt = &mr_ctrl->cache_bh;
+	uint32_t lkey;
+	uint16_t idx;
+
+	/* If local cache table is full, try to double it. */
+	if (unlikely(bt->len == bt->size))
+		mr_btree_expand(bt, bt->size << 1);
+	/* Look up in the global cache. */
+	rte_rwlock_read_lock(&share_cache->rwlock);
+	lkey = mr_btree_lookup(&share_cache->cache, &idx, addr);
+	if (lkey != UINT32_MAX) {
+		/* Found. */
+		*entry = (*share_cache->cache.table)[idx];
+		rte_rwlock_read_unlock(&share_cache->rwlock);
+		/*
+		 * Update local cache. Even if it fails, return the found entry
+		 * to update top-half cache. Next time, this entry will be found
+		 * in the global cache.
+		 */
+		mr_btree_insert(bt, entry);
+		return lkey;
+	}
+	rte_rwlock_read_unlock(&share_cache->rwlock);
+	/* First time to see the address? Create a new MR. */
+	lkey = mlx5_mr_create(pd, mp_id, share_cache, entry, addr,
+			      mr_ext_memseg_en);
+	/*
+	 * Update the local cache if successfully created a new global MR. Even
+	 * if failed to create one, there's no action to take in this datapath
+	 * code. As returning LKey is invalid, this will eventually make HW
+	 * fail.
+	 */
+	if (lkey != UINT32_MAX)
+		mr_btree_insert(bt, entry);
+	return lkey;
+}
+
+/**
+ * Bottom-half of LKey search on datapath. First search in cache_bh[] and if
+ * misses, search in the global MR cache table and update the new entry to
+ * per-queue local caches.
+ *
+ * @param pd
+ *   Pointer to ibv_pd of a device (net, regex, vdpa,...).
+ * @param mp_id
+ *   Pointer to multi-process ID {name, port_id} of the device.
+ * @param share_cache
+ *   Pointer to a global shared MR cache.
+ * @param mr_ctrl
+ *   Pointer to per-queue MR control structure.
+ * @param addr
+ *   Search key.
+ * @param mr_ext_memseg_en
+ *   Configurable flag: whether MR extension over memsegs is enabled.
+ *
+ * @return
+ *   Searched LKey on success, UINT32_MAX on no match.
+ */
+uint32_t
+mlx5_mr_addr2mr_bh(struct ibv_pd *pd, struct mlx5_mp_id *mp_id,
+		   struct mlx5_mr_share_cache *share_cache,
+		   struct mlx5_mr_ctrl *mr_ctrl,
+		   uintptr_t addr, unsigned int mr_ext_memseg_en)
+{
+	uint32_t lkey;
+	uint16_t bh_idx = 0;
+	/* Victim in top-half cache to replace with new entry. */
+	struct mr_cache_entry *repl = &mr_ctrl->cache[mr_ctrl->head];
+
+	/* Binary-search MR translation table. */
+	lkey = mr_btree_lookup(&mr_ctrl->cache_bh, &bh_idx, addr);
+	/* Update top-half cache. */
+	if (likely(lkey != UINT32_MAX)) {
+		*repl = (*mr_ctrl->cache_bh.table)[bh_idx];
+	} else {
+		/*
+		 * If missed in local lookup table, search in the global cache
+		 * and local cache_bh[] will be updated inside if possible.
+		 * Top-half cache entry will also be updated.
+		 */
+		lkey = mr_lookup_caches(pd, mp_id, share_cache, mr_ctrl,
+					repl, addr, mr_ext_memseg_en);
+		if (unlikely(lkey == UINT32_MAX))
+			return UINT32_MAX;
+	}
+	/* Update the most recently used entry. */
+	mr_ctrl->mru = mr_ctrl->head;
+	/* Point to the next victim, the oldest. */
+	mr_ctrl->head = (mr_ctrl->head + 1) % MLX5_MR_CACHE_N;
+	return lkey;
+}
+
+/**
+ * Release all the created MRs and resources of the global MR cache of a
+ * device.
+ *
+ * @param share_cache
+ *   Pointer to a global shared MR cache.
+ */
+void
+mlx5_mr_release_cache(struct mlx5_mr_share_cache *share_cache)
+{
+	struct mlx5_mr *mr_next;
+
+	rte_rwlock_write_lock(&share_cache->rwlock);
+	/* Detach from MR list and move to free list. */
+	mr_next = LIST_FIRST(&share_cache->mr_list);
+	while (mr_next != NULL) {
+		struct mlx5_mr *mr = mr_next;
+
+		mr_next = LIST_NEXT(mr, mr);
+		LIST_REMOVE(mr, mr);
+		LIST_INSERT_HEAD(&share_cache->mr_free_list, mr, mr);
+	}
+	LIST_INIT(&share_cache->mr_list);
+	/* Free global cache. */
+	mlx5_mr_btree_free(&share_cache->cache);
+	rte_rwlock_write_unlock(&share_cache->rwlock);
+	/* Free all remaining MRs. */
+	mlx5_mr_garbage_collect(share_cache);
+}
+
+/**
+ * Flush all of the local cache entries.
+ *
+ * @param mr_ctrl
+ *   Pointer to per-queue MR local cache.
+ */
+void
+mlx5_mr_flush_local_cache(struct mlx5_mr_ctrl *mr_ctrl)
+{
+	/* Reset the most-recently-used index. */
+	mr_ctrl->mru = 0;
+	/* Reset the linear search array. */
+	mr_ctrl->head = 0;
+	memset(mr_ctrl->cache, 0, sizeof(mr_ctrl->cache));
+	/* Reset the B-tree table. */
+	mr_ctrl->cache_bh.len = 1;
+	mr_ctrl->cache_bh.overflow = 0;
+	/* Update the generation number. */
+	mr_ctrl->cur_gen = *mr_ctrl->dev_gen_ptr;
+	DRV_LOG(DEBUG, "mr_ctrl(%p): flushed, cur_gen=%d",
+		(void *)mr_ctrl, mr_ctrl->cur_gen);
+}
+
+/**
+ * Create a memory region for external memory, that is, memory which is not
+ * part of the DPDK memory segments.
+ *
+ * @param pd
+ *   Pointer to ibv_pd of a device (net, regex, vdpa,...).
+ * @param addr
+ *   Starting virtual address of memory.
+ * @param len
+ *   Length of memory segment being mapped.
+ * @param socket_id
+ *   Socket to allocate heap memory for the control structures.
+ *
+ * @return
+ *   Pointer to MR structure on success, NULL otherwise.
+ */
+struct mlx5_mr *
+mlx5_create_mr_ext(struct ibv_pd *pd, uintptr_t addr, size_t len, int socket_id)
+{
+	struct mlx5_mr *mr = NULL;
+
+	mr = rte_zmalloc_socket(NULL,
+				RTE_ALIGN_CEIL(sizeof(*mr),
+					       RTE_CACHE_LINE_SIZE),
+				RTE_CACHE_LINE_SIZE, socket_id);
+	if (mr == NULL)
+		return NULL;
+	mr->ibv_mr = mlx5_glue->reg_mr(pd, (void *)addr, len,
+				       IBV_ACCESS_LOCAL_WRITE |
+				       IBV_ACCESS_RELAXED_ORDERING);
+	if (mr->ibv_mr == NULL) {
+		DRV_LOG(WARNING,
+			"Fail to create a verbs MR for address (%p)",
+			(void *)addr);
+		rte_free(mr);
+		return NULL;
+	}
+	mr->msl = NULL; /* Mark it as external memory. */
+	mr->ms_bmp = NULL;
+	mr->ms_n = 1;
+	mr->ms_bmp_n = 1;
+	DRV_LOG(DEBUG,
+		"MR CREATED (%p) for external memory %p:\n"
+		"  [0x%" PRIxPTR ", 0x%" PRIxPTR "),"
+		" lkey=0x%x base_idx=%u ms_n=%u, ms_bmp_n=%u",
+		(void *)mr, (void *)addr,
+		addr, addr + len, rte_cpu_to_be_32(mr->ibv_mr->lkey),
+		mr->ms_base_idx, mr->ms_n, mr->ms_bmp_n);
+	return mr;
+}
+
+/**
+ * Dump all the created MRs and the global cache entries.
+ *
+ * @param share_cache
+ *   Pointer to a global shared MR cache.
+ */
+void
+mlx5_mr_dump_cache(struct mlx5_mr_share_cache *share_cache __rte_unused)
+{
+#ifdef RTE_LIBRTE_MLX5_DEBUG
+	struct mlx5_mr *mr;
+	int mr_n = 0;
+	int chunk_n = 0;
+
+	rte_rwlock_read_lock(&share_cache->rwlock);
+	/* Iterate all the existing MRs. */
+	LIST_FOREACH(mr, &share_cache->mr_list, mr) {
+		unsigned int n;
+
+		DEBUG("MR[%u], LKey = 0x%x, ms_n = %u, ms_bmp_n = %u",
+		      mr_n++, rte_cpu_to_be_32(mr->ibv_mr->lkey),
+		      mr->ms_n, mr->ms_bmp_n);
+		if (mr->ms_n == 0)
+			continue;
+		for (n = 0; n < mr->ms_bmp_n; ) {
+			struct mr_cache_entry ret = { 0, };
+
+			n = mr_find_next_chunk(mr, &ret, n);
+			if (!ret.end)
+				break;
+			DEBUG("  chunk[%u], [0x%" PRIxPTR ", 0x%" PRIxPTR ")",
+			      chunk_n++, ret.start, ret.end);
+		}
+	}
+	DEBUG("Dumping global cache %p", (void *)share_cache);
+	mlx5_mr_btree_dump(&share_cache->cache);
+	rte_rwlock_read_unlock(&share_cache->rwlock);
+#endif
+}
diff --git a/drivers/common/mlx5/mlx5_common_mr.h b/drivers/common/mlx5/mlx5_common_mr.h
new file mode 100644
index 0000000000..e805f96375
--- /dev/null
+++ b/drivers/common/mlx5/mlx5_common_mr.h
@@ -0,0 +1,160 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright 2018 6WIND S.A.
+ * Copyright 2018 Mellanox Technologies, Ltd
+ */
+
+#ifndef RTE_PMD_MLX5_COMMON_MR_H_
+#define RTE_PMD_MLX5_COMMON_MR_H_
+
+#include <stddef.h>
+#include <stdint.h>
+#include <sys/queue.h>
+
+/* Verbs header. */
+/* ISO C doesn't support unnamed structs/unions, disabling -pedantic. */
+#ifdef PEDANTIC
+#pragma GCC diagnostic ignored "-Wpedantic"
+#endif
+#include <infiniband/verbs.h>
+#include <infiniband/mlx5dv.h>
+#ifdef PEDANTIC
+#pragma GCC diagnostic error "-Wpedantic"
+#endif
+
+#include <rte_rwlock.h>
+#include <rte_bitmap.h>
+#include <rte_memory.h>
+
+#include "mlx5_common_mp.h"
+
+/* Size of per-queue MR cache array for linear search. */
+#define MLX5_MR_CACHE_N 8
+#define MLX5_MR_BTREE_CACHE_N 256
+
+/* Memory Region object. */
+struct mlx5_mr {
+	LIST_ENTRY(mlx5_mr) mr; /**< Pointer to the prev/next entry. */
+	struct ibv_mr *ibv_mr; /* Verbs Memory Region. */
+	const struct rte_memseg_list *msl;
+	int ms_base_idx; /* Start index of msl->memseg_arr[]. */
+	int ms_n; /* Number of memsegs in use. */
+	uint32_t ms_bmp_n; /* Number of bits in memsegs bit-mask. */
+	struct rte_bitmap *ms_bmp; /* Bit-mask of memsegs belonging to the MR. */
+};
+
+/* Cache entry for Memory Region. */
+struct mr_cache_entry {
+	uintptr_t start; /* Start address of MR. */
+	uintptr_t end; /* End address of MR. */
+	uint32_t lkey; /* rte_cpu_to_be_32(ibv_mr->lkey). */
+} __rte_packed;
+
+/* MR Cache table for Binary search. */
+struct mlx5_mr_btree {
+	uint16_t len; /* Number of entries. */
+	uint16_t size; /* Total number of entries. */
+	int overflow; /* Mark failure of table expansion. */
+	struct mr_cache_entry (*table)[];
+} __rte_packed;
+
+/* Per-queue MR control descriptor. */
+struct mlx5_mr_ctrl {
+	uint32_t *dev_gen_ptr; /* Generation number of device to poll. */
+	uint32_t cur_gen; /* Generation number saved to flush caches. */
+	uint16_t mru; /* Index of last hit entry in top-half cache. */
+	uint16_t head; /* Index of the oldest entry in top-half cache. */
+	struct mr_cache_entry cache[MLX5_MR_CACHE_N]; /* Cache for top-half. */
+	struct mlx5_mr_btree cache_bh; /* Cache for bottom-half. */
+} __rte_packed;
+
+LIST_HEAD(mlx5_mr_list, mlx5_mr);
+
+/* Global per-device MR cache. */
+struct mlx5_mr_share_cache {
+	uint32_t dev_gen; /* Generation number to flush local caches. */
+	rte_rwlock_t rwlock; /* MR cache Lock. */
+	struct mlx5_mr_btree cache; /* Global MR cache table. */
+	struct mlx5_mr_list mr_list; /* Registered MR list. */
+	struct mlx5_mr_list mr_free_list; /* Freed MR list. */
+} __rte_packed;
+
+/**
+ * Look up LKey from the given lookup table by linear search. First check the
+ * last-hit entry. On a miss, the entire array is searched. If found, update
+ * the last-hit index and return the LKey.
+ *
+ * @param lkp_tbl
+ *   Pointer to lookup table.
+ * @param[in,out] cached_idx
+ *   Pointer to last-hit index.
+ * @param n
+ *   Size of lookup table.
+ * @param addr
+ *   Search key.
+ *
+ * @return
+ *   Searched LKey on success, UINT32_MAX on no match.
+ */
+static __rte_always_inline uint32_t
+mlx5_mr_lookup_lkey(struct mr_cache_entry *lkp_tbl, uint16_t *cached_idx,
+		    uint16_t n, uintptr_t addr)
+{
+	uint16_t idx;
+
+	if (likely(addr >= lkp_tbl[*cached_idx].start &&
+		   addr < lkp_tbl[*cached_idx].end))
+		return lkp_tbl[*cached_idx].lkey;
+	for (idx = 0; idx < n && lkp_tbl[idx].start != 0; ++idx) {
+		if (addr >= lkp_tbl[idx].start &&
+		    addr < lkp_tbl[idx].end) {
+			/* Found. */
+			*cached_idx = idx;
+			return lkp_tbl[idx].lkey;
+		}
+	}
+	return UINT32_MAX;
+}
+
+__rte_experimental
+int mlx5_mr_btree_init(struct mlx5_mr_btree *bt, int n, int socket);
+__rte_experimental
+void mlx5_mr_btree_free(struct mlx5_mr_btree *bt);
+__rte_experimental
+void mlx5_mr_btree_dump(struct mlx5_mr_btree *bt __rte_unused);
+__rte_experimental
+uint32_t mlx5_mr_addr2mr_bh(struct ibv_pd *pd, struct mlx5_mp_id *mp_id,
+			    struct mlx5_mr_share_cache *share_cache,
+			    struct mlx5_mr_ctrl *mr_ctrl,
+			    uintptr_t addr, unsigned int mr_ext_memseg_en);
+__rte_experimental
+void mlx5_mr_release_cache(struct mlx5_mr_share_cache *share_cache);
+__rte_experimental
+void mlx5_mr_dump_cache(struct mlx5_mr_share_cache *share_cache __rte_unused);
+__rte_experimental
+void mlx5_mr_rebuild_cache(struct mlx5_mr_share_cache *share_cache);
+__rte_experimental
+void mlx5_mr_flush_local_cache(struct mlx5_mr_ctrl *mr_ctrl);
+__rte_experimental
+int mlx5_mr_insert_cache(struct mlx5_mr_share_cache *share_cache,
+			 struct mlx5_mr *mr);
+__rte_experimental
+uint32_t mlx5_mr_lookup_cache(struct mlx5_mr_share_cache *share_cache,
+			      struct mr_cache_entry *entry, uintptr_t addr);
+__rte_experimental
+struct mlx5_mr *mlx5_mr_lookup_list(struct mlx5_mr_share_cache *share_cache,
+				    struct mr_cache_entry *entry,
+				    uintptr_t addr);
+__rte_experimental
+struct mlx5_mr *mlx5_create_mr_ext(struct ibv_pd *pd, uintptr_t addr,
+				   size_t len, int socket_id);
+__rte_experimental
+uint32_t mlx5_mr_create_primary(struct ibv_pd *pd,
+				struct mlx5_mr_share_cache *share_cache,
+				struct mr_cache_entry *entry, uintptr_t addr,
+				unsigned int mr_ext_memseg_en);
+
+#endif /* RTE_PMD_MLX5_COMMON_MR_H_ */
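
Taken together, the declarations above give each queue a two-level cache: a
tiny linear-search top half (mr_ctrl->cache[], scanned by the inline
mlx5_mr_lookup_lkey() above) and a per-queue B-tree bottom half that falls
back to the shared cache and, ultimately, to MR creation through
mlx5_mr_addr2mr_bh(). A hedged sketch of how a datapath caller is expected
to combine the two; mlx5_datapath_lkey() is an illustrative helper, not
something this patch declares:

#include <rte_branch_prediction.h> /* For likely(). */

#include <mlx5_common_mr.h>

/* Illustrative datapath helper built on the APIs declared above. */
static inline uint32_t
mlx5_datapath_lkey(struct ibv_pd *pd, struct mlx5_mp_id *mp_id,
		   struct mlx5_mr_share_cache *share_cache,
		   struct mlx5_mr_ctrl *mr_ctrl, uintptr_t addr,
		   unsigned int mr_ext_memseg_en)
{
	uint32_t lkey;

	/* Top half: linear search in the per-queue MRU array. */
	lkey = mlx5_mr_lookup_lkey(mr_ctrl->cache, &mr_ctrl->mru,
				   MLX5_MR_CACHE_N, addr);
	if (likely(lkey != UINT32_MAX))
		return lkey;
	/* Bottom half: B-tree, shared cache, then MR creation. */
	return mlx5_mr_addr2mr_bh(pd, mp_id, share_cache, mr_ctrl,
				  addr, mr_ext_memseg_en);
}

This mirrors what the net PMD's inline address-to-MR helpers reduce to once
patch 4/4 below is applied.
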
diff --git a/drivers/common/mlx5/rte_common_mlx5_version.map b/drivers/common/mlx5/rte_common_mlx5_version.map
index 265703d1c9..b58a378278 100644
--- a/drivers/common/mlx5/rte_common_mlx5_version.map
+++ b/drivers/common/mlx5/rte_common_mlx5_version.map
@@ -61,4 +61,18 @@ EXPERIMENTAL {
 	mlx5_mp_req_mr_create;
 	mlx5_mp_req_queue_state_modify;
 	mlx5_mp_req_verbs_cmd_fd;
+
+	mlx5_create_mr_ext;
+	mlx5_mr_addr2mr_bh;
+	mlx5_mr_btree_dump;
+	mlx5_mr_btree_free;
+	mlx5_mr_btree_init;
+	mlx5_mr_create_primary;
+	mlx5_mr_dump_cache;
+	mlx5_mr_flush_local_cache;
+	mlx5_mr_insert_cache;
+	mlx5_mr_lookup_cache;
+	mlx5_mr_lookup_list;
+	mlx5_mr_rebuild_cache;
+	mlx5_mr_release_cache;
 };
-- 
2.16.6


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [dpdk-dev] [PATCH v3 4/4] net/mlx5: modify net PMD to use common MR driver
  2020-04-07 17:00 ` [dpdk-dev] [PATCH v3 0/4] refactor multi-process IPC and memory management codes to common driver Vu Pham
                     ` (2 preceding siblings ...)
  2020-04-07 17:00   ` [dpdk-dev] [PATCH v3 3/4] common/mlx5: refactor memory management codes Vu Pham
@ 2020-04-07 17:00   ` Vu Pham
  2020-04-08  9:06     ` Slava Ovsiienko
  3 siblings, 1 reply; 26+ messages in thread
From: Vu Pham @ 2020-04-07 17:00 UTC (permalink / raw)
  To: dev; +Cc: viacheslavo, orika, matan, rasland, Vu Pham

Modify the mlx5 net PMD to use the MR management APIs from the common driver.

Signed-off-by: Vu Pham <vuhuong@mellanox.com>
---
 drivers/common/mlx5/Makefile     |    1 +
 drivers/common/mlx5/meson.build  |    1 +
 drivers/net/mlx5/mlx5.c          |    4 +-
 drivers/net/mlx5/mlx5.h          |   12 +-
 drivers/net/mlx5/mlx5_mp.c       |    8 +-
 drivers/net/mlx5/mlx5_mr.c       | 1169 ++------------------------------------
 drivers/net/mlx5/mlx5_mr.h       |   87 +--
 drivers/net/mlx5/mlx5_rxtx.c     |    1 +
 drivers/net/mlx5/mlx5_rxtx.h     |   10 +-
 drivers/net/mlx5/mlx5_rxtx_vec.h |    2 +
 drivers/net/mlx5/mlx5_trigger.c  |    1 +
 drivers/net/mlx5/mlx5_txq.c      |    3 +-
 12 files changed, 75 insertions(+), 1224 deletions(-)
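
In short: every field that previously lived in the driver-private sh->mr
block now lives in the common struct mlx5_mr_share_cache, and the net PMD
calls the shared mlx5_mr_* APIs instead of its private copies. Condensed
from the hunks below, the setup and teardown sites become:

	/* Shared IB context creation (primary process, mlx5.c): */
	err = mlx5_mr_btree_init(&sh->share_cache.cache,
				 MLX5_MR_BTREE_CACHE_N * 2,
				 spawn->pci_dev->device.numa_node);

	/* Shared IB context release, replacing mlx5_mr_release(sh): */
	mlx5_mr_release_cache(&sh->share_cache);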

diff --git a/drivers/common/mlx5/Makefile b/drivers/common/mlx5/Makefile
index 2a88492731..26267c957a 100644
--- a/drivers/common/mlx5/Makefile
+++ b/drivers/common/mlx5/Makefile
@@ -18,6 +18,7 @@ SRCS-y += mlx5_devx_cmds.c
 SRCS-y += mlx5_common.c
 SRCS-y += mlx5_nl.c
 SRCS-y += mlx5_common_mp.c
+SRCS-y += mlx5_common_mr.c
 ifeq ($(CONFIG_RTE_IBVERBS_LINK_DLOPEN),y)
 INSTALL-y-lib += $(LIB_GLUE)
 endif
diff --git a/drivers/common/mlx5/meson.build b/drivers/common/mlx5/meson.build
index 83671861c9..175251b691 100644
--- a/drivers/common/mlx5/meson.build
+++ b/drivers/common/mlx5/meson.build
@@ -56,6 +56,7 @@ sources = files(
 	'mlx5_common.c',
 	'mlx5_nl.c',
 	'mlx5_common_mp.c',
+	'mlx5_common_mr.c',
 )
 if not dlopen_ibverbs
 	sources += files('mlx5_glue.c')
diff --git a/drivers/net/mlx5/mlx5.c b/drivers/net/mlx5/mlx5.c
index 9eac8011f3..f45055d96f 100644
--- a/drivers/net/mlx5/mlx5.c
+++ b/drivers/net/mlx5/mlx5.c
@@ -618,7 +618,7 @@ mlx5_alloc_shared_ibctx(const struct mlx5_dev_spawn_data *spawn,
 	 * At this point the device is not added to the memory
 	 * event list yet, context is just being created.
 	 */
-	err = mlx5_mr_btree_init(&sh->mr.cache,
+	err = mlx5_mr_btree_init(&sh->share_cache.cache,
 				 MLX5_MR_BTREE_CACHE_N * 2,
 				 spawn->pci_dev->device.numa_node);
 	if (err) {
@@ -690,7 +690,7 @@ mlx5_free_shared_ibctx(struct mlx5_ibv_shared *sh)
 	LIST_REMOVE(sh, mem_event_cb);
 	rte_rwlock_write_unlock(&mlx5_shared_data->mem_event_rwlock);
 	/* Release created Memory Regions. */
-	mlx5_mr_release(sh);
+	mlx5_mr_release_cache(&sh->share_cache);
 	/* Remove context from the global device list. */
 	LIST_REMOVE(sh, next);
 	/*
diff --git a/drivers/net/mlx5/mlx5.h b/drivers/net/mlx5/mlx5.h
index 9e15600afd..41b6e78369 100644
--- a/drivers/net/mlx5/mlx5.h
+++ b/drivers/net/mlx5/mlx5.h
@@ -37,10 +37,10 @@
 #include <mlx5_prm.h>
 #include <mlx5_nl.h>
 #include <mlx5_common_mp.h>
+#include <mlx5_common_mr.h>
 
 #include "mlx5_defs.h"
 #include "mlx5_utils.h"
-#include "mlx5_mr.h"
 #include "mlx5_autoconf.h"
 
 /** Key string for IPC. */
@@ -198,8 +198,6 @@ struct mlx5_verbs_alloc_ctx {
 	const void *obj; /* Pointer to the DPDK object. */
 };
 
-LIST_HEAD(mlx5_mr_list, mlx5_mr);
-
 /* Flow drop context necessary due to Verbs API. */
 struct mlx5_drop {
 	struct mlx5_hrxq *hrxq; /* Hash Rx queue queue. */
@@ -390,13 +388,7 @@ struct mlx5_ibv_shared {
 	struct ibv_device_attr_ex device_attr; /* Device properties. */
 	LIST_ENTRY(mlx5_ibv_shared) mem_event_cb;
 	/**< Called by memory event callback. */
-	struct {
-		uint32_t dev_gen; /* Generation number to flush local caches. */
-		rte_rwlock_t rwlock; /* MR Lock. */
-		struct mlx5_mr_btree cache; /* Global MR cache table. */
-		struct mlx5_mr_list mr_list; /* Registered MR list. */
-		struct mlx5_mr_list mr_free_list; /* Freed MR list. */
-	} mr;
+	struct mlx5_mr_share_cache share_cache;
 	/* Shared DV/DR flow data section. */
 	pthread_mutex_t dv_mutex; /* DV context mutex. */
 	uint32_t dv_meta_mask; /* flow META metadata supported mask. */
diff --git a/drivers/net/mlx5/mlx5_mp.c b/drivers/net/mlx5/mlx5_mp.c
index 43684dbc3a..7ad322d474 100644
--- a/drivers/net/mlx5/mlx5_mp.c
+++ b/drivers/net/mlx5/mlx5_mp.c
@@ -11,6 +11,7 @@
 #include <rte_string_fns.h>
 
 #include <mlx5_common_mp.h>
+#include <mlx5_common_mr.h>
 
 #include "mlx5.h"
 #include "mlx5_rxtx.h"
@@ -25,7 +26,7 @@ mlx5_mp_primary_handle(const struct rte_mp_msg *mp_msg, const void *peer)
 		(const struct mlx5_mp_param *)mp_msg->param;
 	struct rte_eth_dev *dev;
 	struct mlx5_priv *priv;
-	struct mlx5_mr_cache entry;
+	struct mr_cache_entry entry;
 	uint32_t lkey;
 	int ret;
 
@@ -40,7 +41,10 @@ mlx5_mp_primary_handle(const struct rte_mp_msg *mp_msg, const void *peer)
 	switch (param->type) {
 	case MLX5_MP_REQ_CREATE_MR:
 		mp_init_msg(&priv->mp_id, &mp_res, param->type);
-		lkey = mlx5_mr_create_primary(dev, &entry, param->args.addr);
+		lkey = mlx5_mr_create_primary(priv->sh->pd,
+					      &priv->sh->share_cache,
+					      &entry, param->args.addr,
+					      priv->config.mr_ext_memseg_en);
 		if (lkey == UINT32_MAX)
 			res->result = -rte_errno;
 		ret = rte_mp_reply(&mp_res, peer);
diff --git a/drivers/net/mlx5/mlx5_mr.c b/drivers/net/mlx5/mlx5_mr.c
index 9151992a72..2b4b3e2891 100644
--- a/drivers/net/mlx5/mlx5_mr.c
+++ b/drivers/net/mlx5/mlx5_mr.c
@@ -18,6 +18,8 @@
 #include <rte_bus_pci.h>
 
 #include <mlx5_glue.h>
+#include <mlx5_common_mp.h>
+#include <mlx5_common_mr.h>
 
 #include "mlx5.h"
 #include "mlx5_mr.h"
@@ -36,834 +38,6 @@ struct mr_update_mp_data {
 	int ret;
 };
 
-/**
- * Expand B-tree table to a given size. Can't be called with holding
- * memory_hotplug_lock or sh->mr.rwlock due to rte_realloc().
- *
- * @param bt
- *   Pointer to B-tree structure.
- * @param n
- *   Number of entries for expansion.
- *
- * @return
- *   0 on success, -1 on failure.
- */
-static int
-mr_btree_expand(struct mlx5_mr_btree *bt, int n)
-{
-	void *mem;
-	int ret = 0;
-
-	if (n <= bt->size)
-		return ret;
-	/*
-	 * Downside of directly using rte_realloc() is that SOCKET_ID_ANY is
-	 * used inside if there's no room to expand. Because this is a quite
-	 * rare case and a part of very slow path, it is very acceptable.
-	 * Initially cache_bh[] will be given practically enough space and once
-	 * it is expanded, expansion wouldn't be needed again ever.
-	 */
-	mem = rte_realloc(bt->table, n * sizeof(struct mlx5_mr_cache), 0);
-	if (mem == NULL) {
-		/* Not an error, B-tree search will be skipped. */
-		DRV_LOG(WARNING, "failed to expand MR B-tree (%p) table",
-			(void *)bt);
-		ret = -1;
-	} else {
-		DRV_LOG(DEBUG, "expanded MR B-tree table (size=%u)", n);
-		bt->table = mem;
-		bt->size = n;
-	}
-	return ret;
-}
-
-/**
- * Look up LKey from given B-tree lookup table, store the last index and return
- * searched LKey.
- *
- * @param bt
- *   Pointer to B-tree structure.
- * @param[out] idx
- *   Pointer to index. Even on search failure, returns index where it stops
- *   searching so that index can be used when inserting a new entry.
- * @param addr
- *   Search key.
- *
- * @return
- *   Searched LKey on success, UINT32_MAX on no match.
- */
-static uint32_t
-mr_btree_lookup(struct mlx5_mr_btree *bt, uint16_t *idx, uintptr_t addr)
-{
-	struct mlx5_mr_cache *lkp_tbl;
-	uint16_t n;
-	uint16_t base = 0;
-
-	MLX5_ASSERT(bt != NULL);
-	lkp_tbl = *bt->table;
-	n = bt->len;
-	/* First entry must be NULL for comparison. */
-	MLX5_ASSERT(bt->len > 0 || (lkp_tbl[0].start == 0 &&
-				    lkp_tbl[0].lkey == UINT32_MAX));
-	/* Binary search. */
-	do {
-		register uint16_t delta = n >> 1;
-
-		if (addr < lkp_tbl[base + delta].start) {
-			n = delta;
-		} else {
-			base += delta;
-			n -= delta;
-		}
-	} while (n > 1);
-	MLX5_ASSERT(addr >= lkp_tbl[base].start);
-	*idx = base;
-	if (addr < lkp_tbl[base].end)
-		return lkp_tbl[base].lkey;
-	/* Not found. */
-	return UINT32_MAX;
-}
-
-/**
- * Insert an entry to B-tree lookup table.
- *
- * @param bt
- *   Pointer to B-tree structure.
- * @param entry
- *   Pointer to new entry to insert.
- *
- * @return
- *   0 on success, -1 on failure.
- */
-static int
-mr_btree_insert(struct mlx5_mr_btree *bt, struct mlx5_mr_cache *entry)
-{
-	struct mlx5_mr_cache *lkp_tbl;
-	uint16_t idx = 0;
-	size_t shift;
-
-	MLX5_ASSERT(bt != NULL);
-	MLX5_ASSERT(bt->len <= bt->size);
-	MLX5_ASSERT(bt->len > 0);
-	lkp_tbl = *bt->table;
-	/* Find out the slot for insertion. */
-	if (mr_btree_lookup(bt, &idx, entry->start) != UINT32_MAX) {
-		DRV_LOG(DEBUG,
-			"abort insertion to B-tree(%p): already exist at"
-			" idx=%u [0x%" PRIxPTR ", 0x%" PRIxPTR ") lkey=0x%x",
-			(void *)bt, idx, entry->start, entry->end, entry->lkey);
-		/* Already exist, return. */
-		return 0;
-	}
-	/* If table is full, return error. */
-	if (unlikely(bt->len == bt->size)) {
-		bt->overflow = 1;
-		return -1;
-	}
-	/* Insert entry. */
-	++idx;
-	shift = (bt->len - idx) * sizeof(struct mlx5_mr_cache);
-	if (shift)
-		memmove(&lkp_tbl[idx + 1], &lkp_tbl[idx], shift);
-	lkp_tbl[idx] = *entry;
-	bt->len++;
-	DRV_LOG(DEBUG,
-		"inserted B-tree(%p)[%u],"
-		" [0x%" PRIxPTR ", 0x%" PRIxPTR ") lkey=0x%x",
-		(void *)bt, idx, entry->start, entry->end, entry->lkey);
-	return 0;
-}
-
-/**
- * Initialize B-tree and allocate memory for lookup table.
- *
- * @param bt
- *   Pointer to B-tree structure.
- * @param n
- *   Number of entries to allocate.
- * @param socket
- *   NUMA socket on which memory must be allocated.
- *
- * @return
- *   0 on success, a negative errno value otherwise and rte_errno is set.
- */
-int
-mlx5_mr_btree_init(struct mlx5_mr_btree *bt, int n, int socket)
-{
-	if (bt == NULL) {
-		rte_errno = EINVAL;
-		return -rte_errno;
-	}
-	MLX5_ASSERT(!bt->table && !bt->size);
-	memset(bt, 0, sizeof(*bt));
-	bt->table = rte_calloc_socket("B-tree table",
-				      n, sizeof(struct mlx5_mr_cache),
-				      0, socket);
-	if (bt->table == NULL) {
-		rte_errno = ENOMEM;
-		DEBUG("failed to allocate memory for btree cache on socket %d",
-		      socket);
-		return -rte_errno;
-	}
-	bt->size = n;
-	/* First entry must be NULL for binary search. */
-	(*bt->table)[bt->len++] = (struct mlx5_mr_cache) {
-		.lkey = UINT32_MAX,
-	};
-	DEBUG("initialized B-tree %p with table %p",
-	      (void *)bt, (void *)bt->table);
-	return 0;
-}
-
-/**
- * Free B-tree resources.
- *
- * @param bt
- *   Pointer to B-tree structure.
- */
-void
-mlx5_mr_btree_free(struct mlx5_mr_btree *bt)
-{
-	if (bt == NULL)
-		return;
-	DEBUG("freeing B-tree %p with table %p",
-	      (void *)bt, (void *)bt->table);
-	rte_free(bt->table);
-	memset(bt, 0, sizeof(*bt));
-}
-
-/**
- * Dump all the entries in a B-tree
- *
- * @param bt
- *   Pointer to B-tree structure.
- */
-void
-mlx5_mr_btree_dump(struct mlx5_mr_btree *bt __rte_unused)
-{
-#ifdef RTE_LIBRTE_MLX5_DEBUG
-	int idx;
-	struct mlx5_mr_cache *lkp_tbl;
-
-	if (bt == NULL)
-		return;
-	lkp_tbl = *bt->table;
-	for (idx = 0; idx < bt->len; ++idx) {
-		struct mlx5_mr_cache *entry = &lkp_tbl[idx];
-
-		DEBUG("B-tree(%p)[%u],"
-		      " [0x%" PRIxPTR ", 0x%" PRIxPTR ") lkey=0x%x",
-		      (void *)bt, idx, entry->start, entry->end, entry->lkey);
-	}
-#endif
-}
-
-/**
- * Find virtually contiguous memory chunk in a given MR.
- *
- * @param dev
- *   Pointer to MR structure.
- * @param[out] entry
- *   Pointer to returning MR cache entry. If not found, this will not be
- *   updated.
- * @param start_idx
- *   Start index of the memseg bitmap.
- *
- * @return
- *   Next index to go on lookup.
- */
-static int
-mr_find_next_chunk(struct mlx5_mr *mr, struct mlx5_mr_cache *entry,
-		   int base_idx)
-{
-	uintptr_t start = 0;
-	uintptr_t end = 0;
-	uint32_t idx = 0;
-
-	/* MR for external memory doesn't have memseg list. */
-	if (mr->msl == NULL) {
-		struct ibv_mr *ibv_mr = mr->ibv_mr;
-
-		MLX5_ASSERT(mr->ms_bmp_n == 1);
-		MLX5_ASSERT(mr->ms_n == 1);
-		MLX5_ASSERT(base_idx == 0);
-		/*
-		 * Can't search it from memseg list but get it directly from
-		 * verbs MR as there's only one chunk.
-		 */
-		entry->start = (uintptr_t)ibv_mr->addr;
-		entry->end = (uintptr_t)ibv_mr->addr + mr->ibv_mr->length;
-		entry->lkey = rte_cpu_to_be_32(mr->ibv_mr->lkey);
-		/* Returning 1 ends iteration. */
-		return 1;
-	}
-	for (idx = base_idx; idx < mr->ms_bmp_n; ++idx) {
-		if (rte_bitmap_get(mr->ms_bmp, idx)) {
-			const struct rte_memseg_list *msl;
-			const struct rte_memseg *ms;
-
-			msl = mr->msl;
-			ms = rte_fbarray_get(&msl->memseg_arr,
-					     mr->ms_base_idx + idx);
-			MLX5_ASSERT(msl->page_sz == ms->hugepage_sz);
-			if (!start)
-				start = ms->addr_64;
-			end = ms->addr_64 + ms->hugepage_sz;
-		} else if (start) {
-			/* Passed the end of a fragment. */
-			break;
-		}
-	}
-	if (start) {
-		/* Found one chunk. */
-		entry->start = start;
-		entry->end = end;
-		entry->lkey = rte_cpu_to_be_32(mr->ibv_mr->lkey);
-	}
-	return idx;
-}
-
-/**
- * Insert a MR to the global B-tree cache. It may fail due to low-on-memory.
- * Then, this entry will have to be searched by mr_lookup_dev_list() in
- * mlx5_mr_create() on miss.
- *
- * @param dev
- *   Pointer to Ethernet device shared context.
- * @param mr
- *   Pointer to MR to insert.
- *
- * @return
- *   0 on success, -1 on failure.
- */
-static int
-mr_insert_dev_cache(struct mlx5_ibv_shared *sh, struct mlx5_mr *mr)
-{
-	unsigned int n;
-
-	DRV_LOG(DEBUG, "device %s inserting MR(%p) to global cache",
-		sh->ibdev_name, (void *)mr);
-	for (n = 0; n < mr->ms_bmp_n; ) {
-		struct mlx5_mr_cache entry;
-
-		memset(&entry, 0, sizeof(entry));
-		/* Find a contiguous chunk and advance the index. */
-		n = mr_find_next_chunk(mr, &entry, n);
-		if (!entry.end)
-			break;
-		if (mr_btree_insert(&sh->mr.cache, &entry) < 0) {
-			/*
-			 * Overflowed, but the global table cannot be expanded
-			 * because of deadlock.
-			 */
-			return -1;
-		}
-	}
-	return 0;
-}
-
-/**
- * Look up address in the original global MR list.
- *
- * @param sh
- *   Pointer to Ethernet device shared context.
- * @param[out] entry
- *   Pointer to returning MR cache entry. If no match, this will not be updated.
- * @param addr
- *   Search key.
- *
- * @return
- *   Found MR on match, NULL otherwise.
- */
-static struct mlx5_mr *
-mr_lookup_dev_list(struct mlx5_ibv_shared *sh, struct mlx5_mr_cache *entry,
-		   uintptr_t addr)
-{
-	struct mlx5_mr *mr;
-
-	/* Iterate all the existing MRs. */
-	LIST_FOREACH(mr, &sh->mr.mr_list, mr) {
-		unsigned int n;
-
-		if (mr->ms_n == 0)
-			continue;
-		for (n = 0; n < mr->ms_bmp_n; ) {
-			struct mlx5_mr_cache ret;
-
-			memset(&ret, 0, sizeof(ret));
-			n = mr_find_next_chunk(mr, &ret, n);
-			if (addr >= ret.start && addr < ret.end) {
-				/* Found. */
-				*entry = ret;
-				return mr;
-			}
-		}
-	}
-	return NULL;
-}
-
-/**
- * Look up address on device.
- *
- * @param dev
- *   Pointer to Ethernet device shared context.
- * @param[out] entry
- *   Pointer to returning MR cache entry. If no match, this will not be updated.
- * @param addr
- *   Search key.
- *
- * @return
- *   Searched LKey on success, UINT32_MAX on failure and rte_errno is set.
- */
-static uint32_t
-mr_lookup_dev(struct mlx5_ibv_shared *sh, struct mlx5_mr_cache *entry,
-	      uintptr_t addr)
-{
-	uint16_t idx;
-	uint32_t lkey = UINT32_MAX;
-	struct mlx5_mr *mr;
-
-	/*
-	 * If the global cache has overflowed since it failed to expand the
-	 * B-tree table, it can't have all the existing MRs. Then, the address
-	 * has to be searched by traversing the original MR list instead, which
-	 * is very slow path. Otherwise, the global cache is all inclusive.
-	 */
-	if (!unlikely(sh->mr.cache.overflow)) {
-		lkey = mr_btree_lookup(&sh->mr.cache, &idx, addr);
-		if (lkey != UINT32_MAX)
-			*entry = (*sh->mr.cache.table)[idx];
-	} else {
-		/* Falling back to the slowest path. */
-		mr = mr_lookup_dev_list(sh, entry, addr);
-		if (mr != NULL)
-			lkey = entry->lkey;
-	}
-	MLX5_ASSERT(lkey == UINT32_MAX || (addr >= entry->start &&
-					   addr < entry->end));
-	return lkey;
-}
-
-/**
- * Free MR resources. MR lock must not be held to avoid a deadlock. rte_free()
- * can raise memory free event and the callback function will spin on the lock.
- *
- * @param mr
- *   Pointer to MR to free.
- */
-static void
-mr_free(struct mlx5_mr *mr)
-{
-	if (mr == NULL)
-		return;
-	DRV_LOG(DEBUG, "freeing MR(%p):", (void *)mr);
-	if (mr->ibv_mr != NULL)
-		claim_zero(mlx5_glue->dereg_mr(mr->ibv_mr));
-	if (mr->ms_bmp != NULL)
-		rte_bitmap_free(mr->ms_bmp);
-	rte_free(mr);
-}
-
-/**
- * Release resources of detached MR having no online entry.
- *
- * @param sh
- *   Pointer to Ethernet device shared context.
- */
-static void
-mlx5_mr_garbage_collect(struct mlx5_ibv_shared *sh)
-{
-	struct mlx5_mr *mr_next;
-	struct mlx5_mr_list free_list = LIST_HEAD_INITIALIZER(free_list);
-
-	/* Must be called from the primary process. */
-	MLX5_ASSERT(rte_eal_process_type() == RTE_PROC_PRIMARY);
-	/*
-	 * MR can't be freed with holding the lock because rte_free() could call
-	 * memory free callback function. This will be a deadlock situation.
-	 */
-	rte_rwlock_write_lock(&sh->mr.rwlock);
-	/* Detach the whole free list and release it after unlocking. */
-	free_list = sh->mr.mr_free_list;
-	LIST_INIT(&sh->mr.mr_free_list);
-	rte_rwlock_write_unlock(&sh->mr.rwlock);
-	/* Release resources. */
-	mr_next = LIST_FIRST(&free_list);
-	while (mr_next != NULL) {
-		struct mlx5_mr *mr = mr_next;
-
-		mr_next = LIST_NEXT(mr, mr);
-		mr_free(mr);
-	}
-}
-
-/* Called during rte_memseg_contig_walk() by mlx5_mr_create(). */
-static int
-mr_find_contig_memsegs_cb(const struct rte_memseg_list *msl,
-			  const struct rte_memseg *ms, size_t len, void *arg)
-{
-	struct mr_find_contig_memsegs_data *data = arg;
-
-	if (data->addr < ms->addr_64 || data->addr >= ms->addr_64 + len)
-		return 0;
-	/* Found, save it and stop walking. */
-	data->start = ms->addr_64;
-	data->end = ms->addr_64 + len;
-	data->msl = msl;
-	return 1;
-}
-
-/**
- * Create a new global Memory Region (MR) for a missing virtual address.
- * This API should be called on a secondary process, then a request is sent to
- * the primary process in order to create a MR for the address. As the global MR
- * list is on the shared memory, following LKey lookup should succeed unless the
- * request fails.
- *
- * @param dev
- *   Pointer to Ethernet device.
- * @param[out] entry
- *   Pointer to returning MR cache entry, found in the global cache or newly
- *   created. If failed to create one, this will not be updated.
- * @param addr
- *   Target virtual address to register.
- *
- * @return
- *   Searched LKey on success, UINT32_MAX on failure and rte_errno is set.
- */
-static uint32_t
-mlx5_mr_create_secondary(struct rte_eth_dev *dev, struct mlx5_mr_cache *entry,
-			 uintptr_t addr)
-{
-	struct mlx5_priv *priv = dev->data->dev_private;
-	int ret;
-
-	DEBUG("port %u requesting MR creation for address (%p)",
-	      dev->data->port_id, (void *)addr);
-	ret = mlx5_mp_req_mr_create(&priv->mp_id, addr);
-	if (ret) {
-		DEBUG("port %u fail to request MR creation for address (%p)",
-		      dev->data->port_id, (void *)addr);
-		return UINT32_MAX;
-	}
-	rte_rwlock_read_lock(&priv->sh->mr.rwlock);
-	/* Fill in output data. */
-	mr_lookup_dev(priv->sh, entry, addr);
-	/* Lookup can't fail. */
-	MLX5_ASSERT(entry->lkey != UINT32_MAX);
-	rte_rwlock_read_unlock(&priv->sh->mr.rwlock);
-	DEBUG("port %u MR CREATED by primary process for %p:\n"
-	      "  [0x%" PRIxPTR ", 0x%" PRIxPTR "), lkey=0x%x",
-	      dev->data->port_id, (void *)addr,
-	      entry->start, entry->end, entry->lkey);
-	return entry->lkey;
-}
-
-/**
- * Create a new global Memory Region (MR) for a missing virtual address.
- * Register entire virtually contiguous memory chunk around the address.
- * This must be called from the primary process.
- *
- * @param dev
- *   Pointer to Ethernet device.
- * @param[out] entry
- *   Pointer to returning MR cache entry, found in the global cache or newly
- *   created. If failed to create one, this will not be updated.
- * @param addr
- *   Target virtual address to register.
- *
- * @return
- *   Searched LKey on success, UINT32_MAX on failure and rte_errno is set.
- */
-uint32_t
-mlx5_mr_create_primary(struct rte_eth_dev *dev, struct mlx5_mr_cache *entry,
-		       uintptr_t addr)
-{
-	struct mlx5_priv *priv = dev->data->dev_private;
-	struct mlx5_ibv_shared *sh = priv->sh;
-	struct mlx5_dev_config *config = &priv->config;
-	const struct rte_memseg_list *msl;
-	const struct rte_memseg *ms;
-	struct mlx5_mr *mr = NULL;
-	size_t len;
-	uint32_t ms_n;
-	uint32_t bmp_size;
-	void *bmp_mem;
-	int ms_idx_shift = -1;
-	unsigned int n;
-	struct mr_find_contig_memsegs_data data = {
-		.addr = addr,
-	};
-	struct mr_find_contig_memsegs_data data_re;
-
-	DRV_LOG(DEBUG, "port %u creating a MR using address (%p)",
-		dev->data->port_id, (void *)addr);
-	/*
-	 * Release detached MRs if any. This can't be called with holding either
-	 * memory_hotplug_lock or sh->mr.rwlock. MRs on the free list have
-	 * been detached by the memory free event but it couldn't be released
-	 * inside the callback due to deadlock. As a result, releasing resources
-	 * is quite opportunistic.
-	 */
-	mlx5_mr_garbage_collect(sh);
-	/*
-	 * If enabled, find out a contiguous virtual address chunk in use, to
-	 * which the given address belongs, in order to register maximum range.
-	 * In the best case where mempools are not dynamically recreated and
-	 * '--socket-mem' is specified as an EAL option, it is very likely to
-	 * have only one MR(LKey) per a socket and per a hugepage-size even
-	 * though the system memory is highly fragmented. As the whole memory
-	 * chunk will be pinned by kernel, it can't be reused unless entire
-	 * chunk is freed from EAL.
-	 *
-	 * If disabled, just register one memseg (page). Then, memory
-	 * consumption will be minimized but it may drop performance if there
-	 * are many MRs to lookup on the datapath.
-	 */
-	if (!config->mr_ext_memseg_en) {
-		data.msl = rte_mem_virt2memseg_list((void *)addr);
-		data.start = RTE_ALIGN_FLOOR(addr, data.msl->page_sz);
-		data.end = data.start + data.msl->page_sz;
-	} else if (!rte_memseg_contig_walk(mr_find_contig_memsegs_cb, &data)) {
-		DRV_LOG(WARNING,
-			"port %u unable to find virtually contiguous"
-			" chunk for address (%p)."
-			" rte_memseg_contig_walk() failed.",
-			dev->data->port_id, (void *)addr);
-		rte_errno = ENXIO;
-		goto err_nolock;
-	}
-alloc_resources:
-	/* Addresses must be page-aligned. */
-	MLX5_ASSERT(rte_is_aligned((void *)data.start, data.msl->page_sz));
-	MLX5_ASSERT(rte_is_aligned((void *)data.end, data.msl->page_sz));
-	msl = data.msl;
-	ms = rte_mem_virt2memseg((void *)data.start, msl);
-	len = data.end - data.start;
-	MLX5_ASSERT(msl->page_sz == ms->hugepage_sz);
-	/* Number of memsegs in the range. */
-	ms_n = len / msl->page_sz;
-	DEBUG("port %u extending %p to [0x%" PRIxPTR ", 0x%" PRIxPTR "),"
-	      " page_sz=0x%" PRIx64 ", ms_n=%u",
-	      dev->data->port_id, (void *)addr,
-	      data.start, data.end, msl->page_sz, ms_n);
-	/* Size of memory for bitmap. */
-	bmp_size = rte_bitmap_get_memory_footprint(ms_n);
-	mr = rte_zmalloc_socket(NULL,
-				RTE_ALIGN_CEIL(sizeof(*mr),
-					       RTE_CACHE_LINE_SIZE) +
-				bmp_size,
-				RTE_CACHE_LINE_SIZE, msl->socket_id);
-	if (mr == NULL) {
-		DEBUG("port %u unable to allocate memory for a new MR of"
-		      " address (%p).",
-		      dev->data->port_id, (void *)addr);
-		rte_errno = ENOMEM;
-		goto err_nolock;
-	}
-	mr->msl = msl;
-	/*
-	 * Save the index of the first memseg and initialize memseg bitmap. To
-	 * see if a memseg of ms_idx in the memseg-list is still valid, check:
-	 *	rte_bitmap_get(mr->bmp, ms_idx - mr->ms_base_idx)
-	 */
-	mr->ms_base_idx = rte_fbarray_find_idx(&msl->memseg_arr, ms);
-	bmp_mem = RTE_PTR_ALIGN_CEIL(mr + 1, RTE_CACHE_LINE_SIZE);
-	mr->ms_bmp = rte_bitmap_init(ms_n, bmp_mem, bmp_size);
-	if (mr->ms_bmp == NULL) {
-		DEBUG("port %u unable to initialize bitmap for a new MR of"
-		      " address (%p).",
-		      dev->data->port_id, (void *)addr);
-		rte_errno = EINVAL;
-		goto err_nolock;
-	}
-	/*
-	 * Should recheck whether the extended contiguous chunk is still valid.
-	 * Because memory_hotplug_lock can't be held if there's any memory
-	 * related calls in a critical path, resource allocation above can't be
-	 * locked. If the memory has been changed at this point, try again with
-	 * just single page. If not, go on with the big chunk atomically from
-	 * here.
-	 */
-	rte_mcfg_mem_read_lock();
-	data_re = data;
-	if (len > msl->page_sz &&
-	    !rte_memseg_contig_walk(mr_find_contig_memsegs_cb, &data_re)) {
-		DEBUG("port %u unable to find virtually contiguous"
-		      " chunk for address (%p)."
-		      " rte_memseg_contig_walk() failed.",
-		      dev->data->port_id, (void *)addr);
-		rte_errno = ENXIO;
-		goto err_memlock;
-	}
-	if (data.start != data_re.start || data.end != data_re.end) {
-		/*
-		 * The extended contiguous chunk has been changed. Try again
-		 * with single memseg instead.
-		 */
-		data.start = RTE_ALIGN_FLOOR(addr, msl->page_sz);
-		data.end = data.start + msl->page_sz;
-		rte_mcfg_mem_read_unlock();
-		mr_free(mr);
-		goto alloc_resources;
-	}
-	MLX5_ASSERT(data.msl == data_re.msl);
-	rte_rwlock_write_lock(&sh->mr.rwlock);
-	/*
-	 * Check the address is really missing. If other thread already created
-	 * one or it is not found due to overflow, abort and return.
-	 */
-	if (mr_lookup_dev(sh, entry, addr) != UINT32_MAX) {
-		/*
-		 * Insert to the global cache table. It may fail due to
-		 * low-on-memory. Then, this entry will have to be searched
-		 * here again.
-		 */
-		mr_btree_insert(&sh->mr.cache, entry);
-		DEBUG("port %u found MR for %p on final lookup, abort",
-		      dev->data->port_id, (void *)addr);
-		rte_rwlock_write_unlock(&sh->mr.rwlock);
-		rte_mcfg_mem_read_unlock();
-		/*
-		 * Must be unlocked before calling rte_free() because
-		 * mlx5_mr_mem_event_free_cb() can be called inside.
-		 */
-		mr_free(mr);
-		return entry->lkey;
-	}
-	/*
-	 * Trim start and end addresses for verbs MR. Set bits for registering
-	 * memsegs but exclude already registered ones. Bitmap can be
-	 * fragmented.
-	 */
-	for (n = 0; n < ms_n; ++n) {
-		uintptr_t start;
-		struct mlx5_mr_cache ret;
-
-		memset(&ret, 0, sizeof(ret));
-		start = data_re.start + n * msl->page_sz;
-		/* Exclude memsegs already registered by other MRs. */
-		if (mr_lookup_dev(sh, &ret, start) == UINT32_MAX) {
-			/*
-			 * Start from the first unregistered memseg in the
-			 * extended range.
-			 */
-			if (ms_idx_shift == -1) {
-				mr->ms_base_idx += n;
-				data.start = start;
-				ms_idx_shift = n;
-			}
-			data.end = start + msl->page_sz;
-			rte_bitmap_set(mr->ms_bmp, n - ms_idx_shift);
-			++mr->ms_n;
-		}
-	}
-	len = data.end - data.start;
-	mr->ms_bmp_n = len / msl->page_sz;
-	MLX5_ASSERT(ms_idx_shift + mr->ms_bmp_n <= ms_n);
-	/*
-	 * Finally create a verbs MR for the memory chunk. ibv_reg_mr() can be
-	 * called with holding the memory lock because it doesn't use
-	 * mlx5_alloc_buf_extern() which eventually calls rte_malloc_socket()
-	 * through mlx5_alloc_verbs_buf().
-	 */
-	mr->ibv_mr = mlx5_glue->reg_mr(sh->pd, (void *)data.start, len,
-				       IBV_ACCESS_LOCAL_WRITE |
-					   IBV_ACCESS_RELAXED_ORDERING);
-	if (mr->ibv_mr == NULL) {
-		DEBUG("port %u fail to create a verbs MR for address (%p)",
-		      dev->data->port_id, (void *)addr);
-		rte_errno = EINVAL;
-		goto err_mrlock;
-	}
-	MLX5_ASSERT((uintptr_t)mr->ibv_mr->addr == data.start);
-	MLX5_ASSERT(mr->ibv_mr->length == len);
-	LIST_INSERT_HEAD(&sh->mr.mr_list, mr, mr);
-	DEBUG("port %u MR CREATED (%p) for %p:\n"
-	      "  [0x%" PRIxPTR ", 0x%" PRIxPTR "),"
-	      " lkey=0x%x base_idx=%u ms_n=%u, ms_bmp_n=%u",
-	      dev->data->port_id, (void *)mr, (void *)addr,
-	      data.start, data.end, rte_cpu_to_be_32(mr->ibv_mr->lkey),
-	      mr->ms_base_idx, mr->ms_n, mr->ms_bmp_n);
-	/* Insert to the global cache table. */
-	mr_insert_dev_cache(sh, mr);
-	/* Fill in output data. */
-	mr_lookup_dev(sh, entry, addr);
-	/* Lookup can't fail. */
-	MLX5_ASSERT(entry->lkey != UINT32_MAX);
-	rte_rwlock_write_unlock(&sh->mr.rwlock);
-	rte_mcfg_mem_read_unlock();
-	return entry->lkey;
-err_mrlock:
-	rte_rwlock_write_unlock(&sh->mr.rwlock);
-err_memlock:
-	rte_mcfg_mem_read_unlock();
-err_nolock:
-	/*
-	 * In case of error, as this can be called in a datapath, a warning
-	 * message per an error is preferable instead. Must be unlocked before
-	 * calling rte_free() because mlx5_mr_mem_event_free_cb() can be called
-	 * inside.
-	 */
-	mr_free(mr);
-	return UINT32_MAX;
-}
-
-/**
- * Create a new global Memory Region (MR) for a missing virtual address.
- * This can be called from primary and secondary process.
- *
- * @param dev
- *   Pointer to Ethernet device.
- * @param[out] entry
- *   Pointer to returning MR cache entry, found in the global cache or newly
- *   created. If failed to create one, this will not be updated.
- * @param addr
- *   Target virtual address to register.
- *
- * @return
- *   Searched LKey on success, UINT32_MAX on failure and rte_errno is set.
- */
-static uint32_t
-mlx5_mr_create(struct rte_eth_dev *dev, struct mlx5_mr_cache *entry,
-	       uintptr_t addr)
-{
-	uint32_t ret = 0;
-
-	switch (rte_eal_process_type()) {
-	case RTE_PROC_PRIMARY:
-		ret = mlx5_mr_create_primary(dev, entry, addr);
-		break;
-	case RTE_PROC_SECONDARY:
-		ret = mlx5_mr_create_secondary(dev, entry, addr);
-		break;
-	default:
-		break;
-	}
-	return ret;
-}
-
-/**
- * Rebuild the global B-tree cache of device from the original MR list.
- *
- * @param sh
- *   Pointer to Ethernet device shared context.
- */
-static void
-mr_rebuild_dev_cache(struct mlx5_ibv_shared *sh)
-{
-	struct mlx5_mr *mr;
-
-	DRV_LOG(DEBUG, "device %s rebuild dev cache[]", sh->ibdev_name);
-	/* Flush cache to rebuild. */
-	sh->mr.cache.len = 1;
-	sh->mr.cache.overflow = 0;
-	/* Iterate all the existing MRs. */
-	LIST_FOREACH(mr, &sh->mr.mr_list, mr)
-		if (mr_insert_dev_cache(sh, mr) < 0)
-			return;
-}
-
 /**
  * Callback for memory free event. Iterate freed memsegs and check whether it
  * belongs to an existing MR. If found, clear the bit from bitmap of MR. As a
@@ -900,18 +74,18 @@ mlx5_mr_mem_event_free_cb(struct mlx5_ibv_shared *sh,
 		    RTE_ALIGN((uintptr_t)addr, msl->page_sz));
 	MLX5_ASSERT(len == RTE_ALIGN(len, msl->page_sz));
 	ms_n = len / msl->page_sz;
-	rte_rwlock_write_lock(&sh->mr.rwlock);
+	rte_rwlock_write_lock(&sh->share_cache.rwlock);
 	/* Clear bits of freed memsegs from MR. */
 	for (i = 0; i < ms_n; ++i) {
 		const struct rte_memseg *ms;
-		struct mlx5_mr_cache entry;
+		struct mr_cache_entry entry;
 		uintptr_t start;
 		int ms_idx;
 		uint32_t pos;
 
 		/* Find MR having this memseg. */
 		start = (uintptr_t)addr + i * msl->page_sz;
-		mr = mr_lookup_dev_list(sh, &entry, start);
+		mr = mlx5_mr_lookup_list(&sh->share_cache, &entry, start);
 		if (mr == NULL)
 			continue;
 		MLX5_ASSERT(mr->msl); /* Can't be external memory. */
@@ -927,7 +101,7 @@ mlx5_mr_mem_event_free_cb(struct mlx5_ibv_shared *sh,
 		rte_bitmap_clear(mr->ms_bmp, pos);
 		if (--mr->ms_n == 0) {
 			LIST_REMOVE(mr, mr);
-			LIST_INSERT_HEAD(&sh->mr.mr_free_list, mr, mr);
+			LIST_INSERT_HEAD(&sh->share_cache.mr_free_list, mr, mr);
 			DEBUG("device %s remove MR(%p) from list",
 			      sh->ibdev_name, (void *)mr);
 		}
@@ -938,7 +112,7 @@ mlx5_mr_mem_event_free_cb(struct mlx5_ibv_shared *sh,
 		rebuild = 1;
 	}
 	if (rebuild) {
-		mr_rebuild_dev_cache(sh);
+		mlx5_mr_rebuild_cache(&sh->share_cache);
 		/*
 		 * Flush local caches by propagating invalidation across cores.
 		 * rte_smp_wmb() is enough to synchronize this event. If one of
@@ -948,12 +122,12 @@ mlx5_mr_mem_event_free_cb(struct mlx5_ibv_shared *sh,
 		 * generation below) will be guaranteed to be seen by other core
 		 * before the core sees the newly allocated memory.
 		 */
-		++sh->mr.dev_gen;
+		++sh->share_cache.dev_gen;
 		DEBUG("broadcasting local cache flush, gen=%d",
-		      sh->mr.dev_gen);
+		      sh->share_cache.dev_gen);
 		rte_smp_wmb();
 	}
-	rte_rwlock_write_unlock(&sh->mr.rwlock);
+	rte_rwlock_write_unlock(&sh->share_cache.rwlock);
 }
 
 /**
@@ -990,111 +164,6 @@ mlx5_mr_mem_event_cb(enum rte_mem_event event_type, const void *addr,
 	}
 }
 
-/**
- * Look up address in the global MR cache table. If not found, create a new MR.
- * Insert the found/created entry to local bottom-half cache table.
- *
- * @param dev
- *   Pointer to Ethernet device.
- * @param mr_ctrl
- *   Pointer to per-queue MR control structure.
- * @param[out] entry
- *   Pointer to returning MR cache entry, found in the global cache or newly
- *   created. If failed to create one, this is not written.
- * @param addr
- *   Search key.
- *
- * @return
- *   Searched LKey on success, UINT32_MAX on no match.
- */
-static uint32_t
-mlx5_mr_lookup_dev(struct rte_eth_dev *dev, struct mlx5_mr_ctrl *mr_ctrl,
-		   struct mlx5_mr_cache *entry, uintptr_t addr)
-{
-	struct mlx5_priv *priv = dev->data->dev_private;
-	struct mlx5_ibv_shared *sh = priv->sh;
-	struct mlx5_mr_btree *bt = &mr_ctrl->cache_bh;
-	uint16_t idx;
-	uint32_t lkey;
-
-	/* If local cache table is full, try to double it. */
-	if (unlikely(bt->len == bt->size))
-		mr_btree_expand(bt, bt->size << 1);
-	/* Look up in the global cache. */
-	rte_rwlock_read_lock(&sh->mr.rwlock);
-	lkey = mr_btree_lookup(&sh->mr.cache, &idx, addr);
-	if (lkey != UINT32_MAX) {
-		/* Found. */
-		*entry = (*sh->mr.cache.table)[idx];
-		rte_rwlock_read_unlock(&sh->mr.rwlock);
-		/*
-		 * Update local cache. Even if it fails, return the found entry
-		 * to update top-half cache. Next time, this entry will be found
-		 * in the global cache.
-		 */
-		mr_btree_insert(bt, entry);
-		return lkey;
-	}
-	rte_rwlock_read_unlock(&sh->mr.rwlock);
-	/* First time to see the address? Create a new MR. */
-	lkey = mlx5_mr_create(dev, entry, addr);
-	/*
-	 * Update the local cache if successfully created a new global MR. Even
-	 * if failed to create one, there's no action to take in this datapath
-	 * code. As returning LKey is invalid, this will eventually make HW
-	 * fail.
-	 */
-	if (lkey != UINT32_MAX)
-		mr_btree_insert(bt, entry);
-	return lkey;
-}
-
-/**
- * Bottom-half of LKey search on datapath. Firstly search in cache_bh[] and if
- * misses, search in the global MR cache table and update the new entry to
- * per-queue local caches.
- *
- * @param dev
- *   Pointer to Ethernet device.
- * @param mr_ctrl
- *   Pointer to per-queue MR control structure.
- * @param addr
- *   Search key.
- *
- * @return
- *   Searched LKey on success, UINT32_MAX on no match.
- */
-static uint32_t
-mlx5_mr_addr2mr_bh(struct rte_eth_dev *dev, struct mlx5_mr_ctrl *mr_ctrl,
-		   uintptr_t addr)
-{
-	uint32_t lkey;
-	uint16_t bh_idx = 0;
-	/* Victim in top-half cache to replace with new entry. */
-	struct mlx5_mr_cache *repl = &mr_ctrl->cache[mr_ctrl->head];
-
-	/* Binary-search MR translation table. */
-	lkey = mr_btree_lookup(&mr_ctrl->cache_bh, &bh_idx, addr);
-	/* Update top-half cache. */
-	if (likely(lkey != UINT32_MAX)) {
-		*repl = (*mr_ctrl->cache_bh.table)[bh_idx];
-	} else {
-		/*
-		 * If missed in local lookup table, search in the global cache
-		 * and local cache_bh[] will be updated inside if possible.
-		 * Top-half cache entry will also be updated.
-		 */
-		lkey = mlx5_mr_lookup_dev(dev, mr_ctrl, repl, addr);
-		if (unlikely(lkey == UINT32_MAX))
-			return UINT32_MAX;
-	}
-	/* Update the most recently used entry. */
-	mr_ctrl->mru = mr_ctrl->head;
-	/* Point to the next victim, the oldest. */
-	mr_ctrl->head = (mr_ctrl->head + 1) % MLX5_MR_CACHE_N;
-	return lkey;
-}
-
 /**
  * Bottom-half of LKey search on Rx.
  *
@@ -1114,7 +183,9 @@ mlx5_rx_addr2mr_bh(struct mlx5_rxq_data *rxq, uintptr_t addr)
 	struct mlx5_mr_ctrl *mr_ctrl = &rxq->mr_ctrl;
 	struct mlx5_priv *priv = rxq_ctrl->priv;
 
-	return mlx5_mr_addr2mr_bh(ETH_DEV(priv), mr_ctrl, addr);
+	return mlx5_mr_addr2mr_bh(priv->sh->pd, &priv->mp_id,
+				  &priv->sh->share_cache, mr_ctrl, addr,
+				  priv->config.mr_ext_memseg_en);
 }
 
 /**
@@ -1136,7 +207,9 @@ mlx5_tx_addr2mr_bh(struct mlx5_txq_data *txq, uintptr_t addr)
 	struct mlx5_mr_ctrl *mr_ctrl = &txq->mr_ctrl;
 	struct mlx5_priv *priv = txq_ctrl->priv;
 
-	return mlx5_mr_addr2mr_bh(ETH_DEV(priv), mr_ctrl, addr);
+	return mlx5_mr_addr2mr_bh(priv->sh->pd, &priv->mp_id,
+				  &priv->sh->share_cache, mr_ctrl, addr,
+				  priv->config.mr_ext_memseg_en);
 }
 
 /**
@@ -1165,82 +238,6 @@ mlx5_tx_mb2mr_bh(struct mlx5_txq_data *txq, struct rte_mbuf *mb)
 	return lkey;
 }
 
-/**
- * Flush all of the local cache entries.
- *
- * @param mr_ctrl
- *   Pointer to per-queue MR control structure.
- */
-void
-mlx5_mr_flush_local_cache(struct mlx5_mr_ctrl *mr_ctrl)
-{
-	/* Reset the most-recently-used index. */
-	mr_ctrl->mru = 0;
-	/* Reset the linear search array. */
-	mr_ctrl->head = 0;
-	memset(mr_ctrl->cache, 0, sizeof(mr_ctrl->cache));
-	/* Reset the B-tree table. */
-	mr_ctrl->cache_bh.len = 1;
-	mr_ctrl->cache_bh.overflow = 0;
-	/* Update the generation number. */
-	mr_ctrl->cur_gen = *mr_ctrl->dev_gen_ptr;
-	DRV_LOG(DEBUG, "mr_ctrl(%p): flushed, cur_gen=%d",
-		(void *)mr_ctrl, mr_ctrl->cur_gen);
-}
-
-/**
- * Creates a memory region for external memory, that is memory which is not
- * part of the DPDK memory segments.
- *
- * @param dev
- *   Pointer to the ethernet device.
- * @param addr
- *   Starting virtual address of memory.
- * @param len
- *   Length of memory segment being mapped.
- * @param socked_id
- *   Socket to allocate heap memory for the control structures.
- *
- * @return
- *   Pointer to MR structure on success, NULL otherwise.
- */
-static struct mlx5_mr *
-mlx5_create_mr_ext(struct rte_eth_dev *dev, uintptr_t addr, size_t len,
-		   int socket_id)
-{
-	struct mlx5_priv *priv = dev->data->dev_private;
-	struct mlx5_mr *mr = NULL;
-
-	mr = rte_zmalloc_socket(NULL,
-				RTE_ALIGN_CEIL(sizeof(*mr),
-					       RTE_CACHE_LINE_SIZE),
-				RTE_CACHE_LINE_SIZE, socket_id);
-	if (mr == NULL)
-		return NULL;
-	mr->ibv_mr = mlx5_glue->reg_mr(priv->sh->pd, (void *)addr, len,
-				       IBV_ACCESS_LOCAL_WRITE |
-					   IBV_ACCESS_RELAXED_ORDERING);
-	if (mr->ibv_mr == NULL) {
-		DRV_LOG(WARNING,
-			"port %u fail to create a verbs MR for address (%p)",
-			dev->data->port_id, (void *)addr);
-		rte_free(mr);
-		return NULL;
-	}
-	mr->msl = NULL; /* Mark it is external memory. */
-	mr->ms_bmp = NULL;
-	mr->ms_n = 1;
-	mr->ms_bmp_n = 1;
-	DRV_LOG(DEBUG,
-		"port %u MR CREATED (%p) for external memory %p:\n"
-		"  [0x%" PRIxPTR ", 0x%" PRIxPTR "),"
-		" lkey=0x%x base_idx=%u ms_n=%u, ms_bmp_n=%u",
-		dev->data->port_id, (void *)mr, (void *)addr,
-		addr, addr + len, rte_cpu_to_be_32(mr->ibv_mr->lkey),
-		mr->ms_base_idx, mr->ms_n, mr->ms_bmp_n);
-	return mr;
-}
-
 /**
  * Called during rte_mempool_mem_iter() by mlx5_mr_update_ext_mp().
  *
@@ -1267,19 +264,19 @@ mlx5_mr_update_ext_mp_cb(struct rte_mempool *mp, void *opaque,
 	struct mlx5_mr *mr = NULL;
 	uintptr_t addr = (uintptr_t)memhdr->addr;
 	size_t len = memhdr->len;
-	struct mlx5_mr_cache entry;
+	struct mr_cache_entry entry;
 	uint32_t lkey;
 
 	MLX5_ASSERT(rte_eal_process_type() == RTE_PROC_PRIMARY);
 	/* If already registered, it should return. */
-	rte_rwlock_read_lock(&sh->mr.rwlock);
-	lkey = mr_lookup_dev(sh, &entry, addr);
-	rte_rwlock_read_unlock(&sh->mr.rwlock);
+	rte_rwlock_read_lock(&sh->share_cache.rwlock);
+	lkey = mlx5_mr_lookup_cache(&sh->share_cache, &entry, addr);
+	rte_rwlock_read_unlock(&sh->share_cache.rwlock);
 	if (lkey != UINT32_MAX)
 		return;
 	DRV_LOG(DEBUG, "port %u register MR for chunk #%d of mempool (%s)",
 		dev->data->port_id, mem_idx, mp->name);
-	mr = mlx5_create_mr_ext(dev, addr, len, mp->socket_id);
+	mr = mlx5_create_mr_ext(sh->pd, addr, len, mp->socket_id);
 	if (!mr) {
 		DRV_LOG(WARNING,
 			"port %u unable to allocate a new MR of"
@@ -1288,13 +285,14 @@ mlx5_mr_update_ext_mp_cb(struct rte_mempool *mp, void *opaque,
 		data->ret = -1;
 		return;
 	}
-	rte_rwlock_write_lock(&sh->mr.rwlock);
-	LIST_INSERT_HEAD(&sh->mr.mr_list, mr, mr);
+	rte_rwlock_write_lock(&sh->share_cache.rwlock);
+	LIST_INSERT_HEAD(&sh->share_cache.mr_list, mr, mr);
 	/* Insert to the global cache table. */
-	mr_insert_dev_cache(sh, mr);
-	rte_rwlock_write_unlock(&sh->mr.rwlock);
+	mlx5_mr_insert_cache(&sh->share_cache, mr);
+	rte_rwlock_write_unlock(&sh->share_cache.rwlock);
 	/* Insert to the local cache table */
-	mlx5_mr_addr2mr_bh(dev, mr_ctrl, addr);
+	mlx5_mr_addr2mr_bh(sh->pd, &priv->mp_id, &sh->share_cache,
+			   mr_ctrl, addr, priv->config.mr_ext_memseg_en);
 }
 
 /**
@@ -1351,19 +349,19 @@ mlx5_dma_map(struct rte_pci_device *pdev, void *addr,
 		return -1;
 	}
 	priv = dev->data->dev_private;
-	mr = mlx5_create_mr_ext(dev, (uintptr_t)addr, len, SOCKET_ID_ANY);
+	sh = priv->sh;
+	mr = mlx5_create_mr_ext(sh->pd, (uintptr_t)addr, len, SOCKET_ID_ANY);
 	if (!mr) {
 		DRV_LOG(WARNING,
 			"port %u unable to dma map", dev->data->port_id);
 		rte_errno = EINVAL;
 		return -1;
 	}
-	sh = priv->sh;
-	rte_rwlock_write_lock(&sh->mr.rwlock);
-	LIST_INSERT_HEAD(&sh->mr.mr_list, mr, mr);
+	rte_rwlock_write_lock(&sh->share_cache.rwlock);
+	LIST_INSERT_HEAD(&sh->share_cache.mr_list, mr, mr);
 	/* Insert to the global cache table. */
-	mr_insert_dev_cache(sh, mr);
-	rte_rwlock_write_unlock(&sh->mr.rwlock);
+	mlx5_mr_insert_cache(&sh->share_cache, mr);
+	rte_rwlock_write_unlock(&sh->share_cache.rwlock);
 	return 0;
 }
 
@@ -1390,7 +388,7 @@ mlx5_dma_unmap(struct rte_pci_device *pdev, void *addr,
 	struct mlx5_priv *priv;
 	struct mlx5_ibv_shared *sh;
 	struct mlx5_mr *mr;
-	struct mlx5_mr_cache entry;
+	struct mr_cache_entry entry;
 
 	dev = pci_dev_to_eth_dev(pdev);
 	if (!dev) {
@@ -1401,10 +399,10 @@ mlx5_dma_unmap(struct rte_pci_device *pdev, void *addr,
 	}
 	priv = dev->data->dev_private;
 	sh = priv->sh;
-	rte_rwlock_read_lock(&sh->mr.rwlock);
-	mr = mr_lookup_dev_list(sh, &entry, (uintptr_t)addr);
+	rte_rwlock_read_lock(&sh->share_cache.rwlock);
+	mr = mlx5_mr_lookup_list(&sh->share_cache, &entry, (uintptr_t)addr);
 	if (!mr) {
-		rte_rwlock_read_unlock(&sh->mr.rwlock);
+		rte_rwlock_read_unlock(&sh->share_cache.rwlock);
 		DRV_LOG(WARNING, "address 0x%" PRIxPTR " wasn't registered "
 				 "to PCI device %p", (uintptr_t)addr,
 				 (void *)pdev);
@@ -1412,10 +410,10 @@ mlx5_dma_unmap(struct rte_pci_device *pdev, void *addr,
 		return -1;
 	}
 	LIST_REMOVE(mr, mr);
-	LIST_INSERT_HEAD(&sh->mr.mr_free_list, mr, mr);
+	LIST_INSERT_HEAD(&sh->share_cache.mr_free_list, mr, mr);
 	DEBUG("port %u remove MR(%p) from list", dev->data->port_id,
 	      (void *)mr);
-	mr_rebuild_dev_cache(sh);
+	mlx5_mr_rebuild_cache(&sh->share_cache);
 	/*
 	 * Flush local caches by propagating invalidation across cores.
 	 * rte_smp_wmb() is enough to synchronize this event. If one of
@@ -1425,10 +423,11 @@ mlx5_dma_unmap(struct rte_pci_device *pdev, void *addr,
 	 * generation below) will be guaranteed to be seen by other core
 	 * before the core sees the newly allocated memory.
 	 */
-	++sh->mr.dev_gen;
-	DEBUG("broadcasting local cache flush, gen=%d",	sh->mr.dev_gen);
+	++sh->share_cache.dev_gen;
+	DEBUG("broadcasting local cache flush, gen=%d",
+	      sh->share_cache.dev_gen);
 	rte_smp_wmb();
-	rte_rwlock_read_unlock(&sh->mr.rwlock);
+	rte_rwlock_read_unlock(&sh->share_cache.rwlock);
 	return 0;
 }
 
@@ -1503,14 +502,19 @@ mlx5_mr_update_mp_cb(struct rte_mempool *mp __rte_unused, void *opaque,
 		     unsigned mem_idx __rte_unused)
 {
 	struct mr_update_mp_data *data = opaque;
+	struct rte_eth_dev *dev = data->dev;
+	struct mlx5_priv *priv = dev->data->dev_private;
+
 	uint32_t lkey;
 
 	/* Stop iteration if failed in the previous walk. */
 	if (data->ret < 0)
 		return;
 	/* Register address of the chunk and update local caches. */
-	lkey = mlx5_mr_addr2mr_bh(data->dev, data->mr_ctrl,
-				  (uintptr_t)memhdr->addr);
+	lkey = mlx5_mr_addr2mr_bh(priv->sh->pd, &priv->mp_id,
+				  &priv->sh->share_cache, data->mr_ctrl,
+				  (uintptr_t)memhdr->addr,
+				  priv->config.mr_ext_memseg_en);
 	if (lkey == UINT32_MAX)
 		data->ret = -1;
 }
@@ -1545,76 +549,3 @@ mlx5_mr_update_mp(struct rte_eth_dev *dev, struct mlx5_mr_ctrl *mr_ctrl,
 	}
 	return data.ret;
 }
-
-/**
- * Dump all the created MRs and the global cache entries.
- *
- * @param sh
- *   Pointer to Ethernet device shared context.
- */
-void
-mlx5_mr_dump_dev(struct mlx5_ibv_shared *sh __rte_unused)
-{
-#ifdef RTE_LIBRTE_MLX5_DEBUG
-	struct mlx5_mr *mr;
-	int mr_n = 0;
-	int chunk_n = 0;
-
-	rte_rwlock_read_lock(&sh->mr.rwlock);
-	/* Iterate all the existing MRs. */
-	LIST_FOREACH(mr, &sh->mr.mr_list, mr) {
-		unsigned int n;
-
-		DEBUG("device %s MR[%u], LKey = 0x%x, ms_n = %u, ms_bmp_n = %u",
-		      sh->ibdev_name, mr_n++,
-		      rte_cpu_to_be_32(mr->ibv_mr->lkey),
-		      mr->ms_n, mr->ms_bmp_n);
-		if (mr->ms_n == 0)
-			continue;
-		for (n = 0; n < mr->ms_bmp_n; ) {
-			struct mlx5_mr_cache ret = { 0, };
-
-			n = mr_find_next_chunk(mr, &ret, n);
-			if (!ret.end)
-				break;
-			DEBUG("  chunk[%u], [0x%" PRIxPTR ", 0x%" PRIxPTR ")",
-			      chunk_n++, ret.start, ret.end);
-		}
-	}
-	DEBUG("device %s dumping global cache", sh->ibdev_name);
-	mlx5_mr_btree_dump(&sh->mr.cache);
-	rte_rwlock_read_unlock(&sh->mr.rwlock);
-#endif
-}
-
-/**
- * Release all the created MRs and resources for shared device context.
- * list.
- *
- * @param sh
- *   Pointer to Ethernet device shared context.
- */
-void
-mlx5_mr_release(struct mlx5_ibv_shared *sh)
-{
-	struct mlx5_mr *mr_next;
-
-	if (rte_log_can_log(mlx5_logtype, RTE_LOG_DEBUG))
-		mlx5_mr_dump_dev(sh);
-	rte_rwlock_write_lock(&sh->mr.rwlock);
-	/* Detach from MR list and move to free list. */
-	mr_next = LIST_FIRST(&sh->mr.mr_list);
-	while (mr_next != NULL) {
-		struct mlx5_mr *mr = mr_next;
-
-		mr_next = LIST_NEXT(mr, mr);
-		LIST_REMOVE(mr, mr);
-		LIST_INSERT_HEAD(&sh->mr.mr_free_list, mr, mr);
-	}
-	LIST_INIT(&sh->mr.mr_list);
-	/* Free global cache. */
-	mlx5_mr_btree_free(&sh->mr.cache);
-	rte_rwlock_write_unlock(&sh->mr.rwlock);
-	/* Free all remaining MRs. */
-	mlx5_mr_garbage_collect(sh);
-}
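
All of the per-device MR machinery deleted above moves into mlx5_common_mr.c; what remains in the net PMD is the per-queue side of the cache-invalidation handshake driven by share_cache.dev_gen. A minimal sketch of that handshake, assuming the helpers this series exports (queue_mb2mr() is a hypothetical wrapper condensing what mlx5_tx_mb2mr() does in mlx5_rxtx.h; the real code also tries the linear top-half array before taking the bottom half):

	static inline uint32_t
	queue_mb2mr(struct mlx5_txq_data *txq, struct rte_mbuf *mb)
	{
		struct mlx5_mr_ctrl *mr_ctrl = &txq->mr_ctrl;

		/* The control path bumps share_cache.dev_gen and issues a
		 * write barrier after rebuilding the global cache; a
		 * mismatch here means every local entry may be stale.
		 */
		if (unlikely(*mr_ctrl->dev_gen_ptr != mr_ctrl->cur_gen))
			mlx5_mr_flush_local_cache(mr_ctrl);
		/* Fall through to the slower bottom-half lookup. */
		return mlx5_tx_mb2mr_bh(txq, mb);
	}
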
diff --git a/drivers/net/mlx5/mlx5_mr.h b/drivers/net/mlx5/mlx5_mr.h
index 48264c8294..0c5877b3d6 100644
--- a/drivers/net/mlx5/mlx5_mr.h
+++ b/drivers/net/mlx5/mlx5_mr.h
@@ -24,99 +24,16 @@
 #include <rte_ethdev.h>
 #include <rte_rwlock.h>
 #include <rte_bitmap.h>
+#include <rte_memory.h>
 
-/* Memory Region object. */
-struct mlx5_mr {
-	LIST_ENTRY(mlx5_mr) mr; /**< Pointer to the prev/next entry. */
-	struct ibv_mr *ibv_mr; /* Verbs Memory Region. */
-	const struct rte_memseg_list *msl;
-	int ms_base_idx; /* Start index of msl->memseg_arr[]. */
-	int ms_n; /* Number of memsegs in use. */
-	uint32_t ms_bmp_n; /* Number of bits in memsegs bit-mask. */
-	struct rte_bitmap *ms_bmp; /* Bit-mask of memsegs belonged to MR. */
-};
-
-/* Cache entry for Memory Region. */
-struct mlx5_mr_cache {
-	uintptr_t start; /* Start address of MR. */
-	uintptr_t end; /* End address of MR. */
-	uint32_t lkey; /* rte_cpu_to_be_32(ibv_mr->lkey). */
-} __rte_packed;
-
-/* MR Cache table for Binary search. */
-struct mlx5_mr_btree {
-	uint16_t len; /* Number of entries. */
-	uint16_t size; /* Total number of entries. */
-	int overflow; /* Mark failure of table expansion. */
-	struct mlx5_mr_cache (*table)[];
-} __rte_packed;
-
-/* Per-queue MR control descriptor. */
-struct mlx5_mr_ctrl {
-	uint32_t *dev_gen_ptr; /* Generation number of device to poll. */
-	uint32_t cur_gen; /* Generation number saved to flush caches. */
-	uint16_t mru; /* Index of last hit entry in top-half cache. */
-	uint16_t head; /* Index of the oldest entry in top-half cache. */
-	struct mlx5_mr_cache cache[MLX5_MR_CACHE_N]; /* Cache for top-half. */
-	struct mlx5_mr_btree cache_bh; /* Cache for bottom-half. */
-} __rte_packed;
-
-struct mlx5_ibv_shared;
-extern struct mlx5_dev_list  mlx5_mem_event_cb_list;
-extern rte_rwlock_t mlx5_mem_event_rwlock;
+#include <mlx5_common_mr.h>
 
 /* First entry must be NULL for comparison. */
 #define mlx5_mr_btree_len(bt) ((bt)->len - 1)
 
-int mlx5_mr_btree_init(struct mlx5_mr_btree *bt, int n, int socket);
-void mlx5_mr_btree_free(struct mlx5_mr_btree *bt);
-uint32_t mlx5_mr_create_primary(struct rte_eth_dev *dev,
-				struct mlx5_mr_cache *entry, uintptr_t addr);
 void mlx5_mr_mem_event_cb(enum rte_mem_event event_type, const void *addr,
 			  size_t len, void *arg);
 int mlx5_mr_update_mp(struct rte_eth_dev *dev, struct mlx5_mr_ctrl *mr_ctrl,
 		      struct rte_mempool *mp);
-void mlx5_mr_release(struct mlx5_ibv_shared *sh);
-
-/* Debug purpose functions. */
-void mlx5_mr_btree_dump(struct mlx5_mr_btree *bt);
-void mlx5_mr_dump_dev(struct mlx5_ibv_shared *sh);
-
-/**
- * Look up LKey from given lookup table by linear search. Firstly look up the
- * last-hit entry. If miss, the entire array is searched. If found, update the
- * last-hit index and return LKey.
- *
- * @param lkp_tbl
- *   Pointer to lookup table.
- * @param[in,out] cached_idx
- *   Pointer to last-hit index.
- * @param n
- *   Size of lookup table.
- * @param addr
- *   Search key.
- *
- * @return
- *   Searched LKey on success, UINT32_MAX on no match.
- */
-static __rte_always_inline uint32_t
-mlx5_mr_lookup_cache(struct mlx5_mr_cache *lkp_tbl, uint16_t *cached_idx,
-		     uint16_t n, uintptr_t addr)
-{
-	uint16_t idx;
-
-	if (likely(addr >= lkp_tbl[*cached_idx].start &&
-		   addr < lkp_tbl[*cached_idx].end))
-		return lkp_tbl[*cached_idx].lkey;
-	for (idx = 0; idx < n && lkp_tbl[idx].start != 0; ++idx) {
-		if (addr >= lkp_tbl[idx].start &&
-		    addr < lkp_tbl[idx].end) {
-			/* Found. */
-			*cached_idx = idx;
-			return lkp_tbl[idx].lkey;
-		}
-	}
-	return UINT32_MAX;
-}
 
 #endif /* RTE_PMD_MLX5_MR_H_ */
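
The header now keeps only the thin mlx5_mr_btree_len() macro above; the "- 1" exists because index 0 of every B-tree table is a sentinel entry (lkey == UINT32_MAX) serving as the lower bound of the binary search. A hypothetical illustration, assuming the common-driver API from patch 3/4:

	struct mlx5_mr_btree bt = { 0 };

	/* After init, len == 1: only the sentinel entry is present. */
	mlx5_mr_btree_init(&bt, 256, SOCKET_ID_ANY);
	MLX5_ASSERT(mlx5_mr_btree_len(&bt) == 0); /* no real entries yet */
	mlx5_mr_btree_free(&bt);
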
diff --git a/drivers/net/mlx5/mlx5_rxtx.c b/drivers/net/mlx5/mlx5_rxtx.c
index fc7591c2b0..5f9b670442 100644
--- a/drivers/net/mlx5/mlx5_rxtx.c
+++ b/drivers/net/mlx5/mlx5_rxtx.c
@@ -33,6 +33,7 @@
 
 #include "mlx5_defs.h"
 #include "mlx5.h"
+#include "mlx5_mr.h"
 #include "mlx5_utils.h"
 #include "mlx5_rxtx.h"
 #include "mlx5_autoconf.h"
diff --git a/drivers/net/mlx5/mlx5_rxtx.h b/drivers/net/mlx5/mlx5_rxtx.h
index 939778aa55..84161ad6af 100644
--- a/drivers/net/mlx5/mlx5_rxtx.h
+++ b/drivers/net/mlx5/mlx5_rxtx.h
@@ -34,11 +34,11 @@
 #include <mlx5_glue.h>
 #include <mlx5_prm.h>
 #include <mlx5_common.h>
+#include <mlx5_common_mr.h>
 
 #include "mlx5_defs.h"
 #include "mlx5_utils.h"
 #include "mlx5.h"
-#include "mlx5_mr.h"
 #include "mlx5_autoconf.h"
 
 /* Support tunnel matching. */
@@ -598,8 +598,8 @@ mlx5_rx_addr2mr(struct mlx5_rxq_data *rxq, uintptr_t addr)
 	uint32_t lkey;
 
 	/* Linear search on MR cache array. */
-	lkey = mlx5_mr_lookup_cache(mr_ctrl->cache, &mr_ctrl->mru,
-				    MLX5_MR_CACHE_N, addr);
+	lkey = mlx5_mr_lookup_lkey(mr_ctrl->cache, &mr_ctrl->mru,
+				   MLX5_MR_CACHE_N, addr);
 	if (likely(lkey != UINT32_MAX))
 		return lkey;
 	/* Take slower bottom-half (Binary Search) on miss. */
@@ -630,8 +630,8 @@ mlx5_tx_mb2mr(struct mlx5_txq_data *txq, struct rte_mbuf *mb)
 	if (unlikely(*mr_ctrl->dev_gen_ptr != mr_ctrl->cur_gen))
 		mlx5_mr_flush_local_cache(mr_ctrl);
 	/* Linear search on MR cache array. */
-	lkey = mlx5_mr_lookup_cache(mr_ctrl->cache, &mr_ctrl->mru,
-				    MLX5_MR_CACHE_N, addr);
+	lkey = mlx5_mr_lookup_lkey(mr_ctrl->cache, &mr_ctrl->mru,
+				   MLX5_MR_CACHE_N, addr);
 	if (likely(lkey != UINT32_MAX))
 		return lkey;
 	/* Take slower bottom-half on miss. */
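
The renamed mlx5_mr_lookup_lkey() above is only the first of three lookup levels. As a summary of the flow now implemented by the common code (no new behavior):

	/* 1. Top half: linear scan of the per-queue cache[] array
	 *    (MLX5_MR_CACHE_N entries), starting from the MRU index.
	 * 2. Bottom half: binary search of the per-queue cache_bh B-tree,
	 *    doubled on demand when it fills up.
	 * 3. Global: read-locked lookup of share_cache; on miss, MR
	 *    creation (in the primary process), after which both local
	 *    levels are refreshed on the way out.
	 */
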
diff --git a/drivers/net/mlx5/mlx5_rxtx_vec.h b/drivers/net/mlx5/mlx5_rxtx_vec.h
index ea925156f0..6ddcbfb0ad 100644
--- a/drivers/net/mlx5/mlx5_rxtx_vec.h
+++ b/drivers/net/mlx5/mlx5_rxtx_vec.h
@@ -13,6 +13,8 @@
 
 #include "mlx5_autoconf.h"
 
+#include "mlx5_mr.h"
+
 /* HW checksum offload capabilities of vectorized Tx. */
 #define MLX5_VEC_TX_CKSUM_OFFLOAD_CAP \
 	(DEV_TX_OFFLOAD_IPV4_CKSUM | \
diff --git a/drivers/net/mlx5/mlx5_trigger.c b/drivers/net/mlx5/mlx5_trigger.c
index 438b705952..759670408b 100644
--- a/drivers/net/mlx5/mlx5_trigger.c
+++ b/drivers/net/mlx5/mlx5_trigger.c
@@ -11,6 +11,7 @@
 #include <rte_alarm.h>
 
 #include "mlx5.h"
+#include "mlx5_mr.h"
 #include "mlx5_rxtx.h"
 #include "mlx5_utils.h"
 #include "rte_pmd_mlx5.h"
diff --git a/drivers/net/mlx5/mlx5_txq.c b/drivers/net/mlx5/mlx5_txq.c
index 0653f4cf30..29e5cabab6 100644
--- a/drivers/net/mlx5/mlx5_txq.c
+++ b/drivers/net/mlx5/mlx5_txq.c
@@ -30,6 +30,7 @@
 #include <mlx5_glue.h>
 #include <mlx5_devx_cmds.h>
 #include <mlx5_common.h>
+#include <mlx5_common_mr.h>
 
 #include "mlx5_defs.h"
 #include "mlx5_utils.h"
@@ -1289,7 +1290,7 @@ mlx5_txq_new(struct rte_eth_dev *dev, uint16_t idx, uint16_t desc,
 		goto error;
 	}
 	/* Save pointer of global generation number to check memory event. */
-	tmpl->txq.mr_ctrl.dev_gen_ptr = &priv->sh->mr.dev_gen;
+	tmpl->txq.mr_ctrl.dev_gen_ptr = &priv->sh->share_cache.dev_gen;
 	MLX5_ASSERT(desc > MLX5_TX_COMP_THRESH);
 	tmpl->txq.offloads = conf->offloads |
 			     dev->data->dev_conf.txmode.offloads;
-- 
2.16.6


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [dpdk-dev] [PATCH v3 3/4] common/mlx5: refactor memory management codes
  2020-04-07 17:00   ` [dpdk-dev] [PATCH v3 3/4] common/mlx5: refactor memory management codes Vu Pham
@ 2020-04-08  9:04     ` Slava Ovsiienko
  0 siblings, 0 replies; 26+ messages in thread
From: Slava Ovsiienko @ 2020-04-08  9:04 UTC (permalink / raw)
  To: Vu Pham, dev; +Cc: Ori Kam, Matan Azrad, Raslan Darawsheh, Vu Pham

> -----Original Message-----
> From: Vu Pham <vuhuong@mellanox.com>
> Sent: Tuesday, April 7, 2020 20:01
> To: dev@dpdk.org
> Cc: Slava Ovsiienko <viacheslavo@mellanox.com>; Ori Kam
> <orika@mellanox.com>; Matan Azrad <matan@mellanox.com>; Raslan
> Darawsheh <rasland@mellanox.com>; Vu Pham <vuhuong@mellanox.com>
> Subject: [PATCH v3 3/4] common/mlx5: refactor memory management codes
> 
> Refactor common memory B-tree and cache management to the common driver.
> Replace some input parameters of the MR APIs with more common data
> structures like PD, port_id, share_cache, ... so that multiple PMD
> drivers can use those MR APIs.
> 
> Signed-off-by: Vu Pham <vuhuong@mellanox.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@mellanox.com>

> ---
>  drivers/common/mlx5/mlx5_common_mr.c            | 1108 +++++++++++++++++++++++
>  drivers/common/mlx5/mlx5_common_mr.h            |  160 ++++
>  drivers/common/mlx5/rte_common_mlx5_version.map |   14 +
>  3 files changed, 1282 insertions(+)
>  create mode 100644 drivers/common/mlx5/mlx5_common_mr.c
>  create mode 100644 drivers/common/mlx5/mlx5_common_mr.h
> 
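
The parameter change is easiest to see at a call site: instead of handing the whole rte_eth_dev to the MR layer, a PMD now passes only the pieces the common code needs. A minimal sketch using the names this series introduces (priv and sh are the net PMD's private and shared contexts):

	/* Bottom-half LKey resolution, as the net PMD now invokes it. */
	lkey = mlx5_mr_addr2mr_bh(priv->sh->pd,           /* ibv_pd */
				  &priv->mp_id,           /* IPC identity */
				  &priv->sh->share_cache, /* global MR cache */
				  mr_ctrl, addr,
				  priv->config.mr_ext_memseg_en);

Keeping mlx5_common_mr.c free of rte_eth_dev knowledge is what lets other mlx5-class PMDs (net, regex, vdpa, ...) share the same registration code.
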
> diff --git a/drivers/common/mlx5/mlx5_common_mr.c
> b/drivers/common/mlx5/mlx5_common_mr.c
> new file mode 100644
> index 0000000000..9d4a06dd5b
> --- /dev/null
> +++ b/drivers/common/mlx5/mlx5_common_mr.c
> @@ -0,0 +1,1108 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright 2016 6WIND S.A.
> + * Copyright 2020 Mellanox Technologies, Ltd
> + */
> +
> +#include <rte_eal_memconfig.h>
> +#include <rte_errno.h>
> +#include <rte_mempool.h>
> +#include <rte_malloc.h>
> +#include <rte_rwlock.h>
> +
> +#include "mlx5_glue.h"
> +#include "mlx5_common_mp.h"
> +#include "mlx5_common_mr.h"
> +#include "mlx5_common_utils.h"
> +
> +struct mr_find_contig_memsegs_data {
> +	uintptr_t addr;
> +	uintptr_t start;
> +	uintptr_t end;
> +	const struct rte_memseg_list *msl;
> +};
> +
> +/**
> + * Expand B-tree table to a given size. Can't be called with holding
> + * memory_hotplug_lock or share_cache.rwlock due to rte_realloc().
> + *
> + * @param bt
> + *   Pointer to B-tree structure.
> + * @param n
> + *   Number of entries for expansion.
> + *
> + * @return
> + *   0 on success, -1 on failure.
> + */
> +static int
> +mr_btree_expand(struct mlx5_mr_btree *bt, int n) {
> +	void *mem;
> +	int ret = 0;
> +
> +	if (n <= bt->size)
> +		return ret;
> +	/*
> +	 * Downside of directly using rte_realloc() is that SOCKET_ID_ANY is
> +	 * used inside if there's no room to expand. Because this is a quite
> +	 * rare case and a part of very slow path, it is very acceptable.
> +	 * Initially cache_bh[] will be given practically enough space and once
> +	 * it is expanded, expansion wouldn't be needed again ever.
> +	 */
> +	mem = rte_realloc(bt->table, n * sizeof(struct mr_cache_entry), 0);
> +	if (mem == NULL) {
> +		/* Not an error, B-tree search will be skipped. */
> +		DRV_LOG(WARNING, "failed to expand MR B-tree (%p) table",
> +			(void *)bt);
> +		ret = -1;
> +	} else {
> +		DRV_LOG(DEBUG, "expanded MR B-tree table (size=%u)", n);
> +		bt->table = mem;
> +		bt->size = n;
> +	}
> +	return ret;
> +}
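
The only caller that grows a table is the per-queue bottom half, which doubles it just before a lookup; a failed rte_realloc() is deliberately non-fatal. A caller-side sketch, mirroring what the net PMD did in mlx5_mr_lookup_dev() before this refactor:

	/* If the local cache table is full, try to double it. */
	if (unlikely(bt->len == bt->size))
		mr_btree_expand(bt, bt->size << 1);
	/* If expansion failed, the table just stays small; a later insert
	 * into the full table sets bt->overflow, and lookups fall back to
	 * the slow MR-list walk until the cache is rebuilt.
	 */
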
> +
> +/**
> + * Look up LKey from given B-tree lookup table, store the last index and
> + * return searched LKey.
> + *
> + * @param bt
> + *   Pointer to B-tree structure.
> + * @param[out] idx
> + *   Pointer to index. Even on search failure, returns index where it stops
> + *   searching so that index can be used when inserting a new entry.
> + * @param addr
> + *   Search key.
> + *
> + * @return
> + *   Searched LKey on success, UINT32_MAX on no match.
> + */
> +static uint32_t
> +mr_btree_lookup(struct mlx5_mr_btree *bt, uint16_t *idx, uintptr_t addr)
> +{
> +	struct mr_cache_entry *lkp_tbl;
> +	uint16_t n;
> +	uint16_t base = 0;
> +
> +	MLX5_ASSERT(bt != NULL);
> +	lkp_tbl = *bt->table;
> +	n = bt->len;
> +	/* First entry must be NULL for comparison. */
> +	MLX5_ASSERT(bt->len > 0 || (lkp_tbl[0].start == 0 &&
> +				    lkp_tbl[0].lkey == UINT32_MAX));
> +	/* Binary search. */
> +	do {
> +		register uint16_t delta = n >> 1;
> +
> +		if (addr < lkp_tbl[base + delta].start) {
> +			n = delta;
> +		} else {
> +			base += delta;
> +			n -= delta;
> +		}
> +	} while (n > 1);
> +	MLX5_ASSERT(addr >= lkp_tbl[base].start);
> +	*idx = base;
> +	if (addr < lkp_tbl[base].end)
> +		return lkp_tbl[base].lkey;
> +	/* Not found. */
> +	return UINT32_MAX;
> +}
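
A worked example of the search, with hypothetical addresses (the sentinel at index 0 is what keeps the closing MLX5_ASSERT() trivially true at the left edge):

	/* table[0] = { start = 0,      end = 0,      lkey = UINT32_MAX }
	 * table[1] = { start = 0x1000, end = 0x3000, lkey = A }
	 * table[2] = { start = 0x5000, end = 0x6000, lkey = B },  len = 3
	 *
	 * lookup(0x2500): base settles at 1, 0x2500 <  end -> returns A.
	 * lookup(0x4000): base settles at 1, 0x4000 >= end -> returns
	 * UINT32_MAX with *idx = 1, i.e. the slot after which a new entry
	 * covering 0x4000 would be inserted.
	 */
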
> +
> +/**
> + * Insert an entry to B-tree lookup table.
> + *
> + * @param bt
> + *   Pointer to B-tree structure.
> + * @param entry
> + *   Pointer to new entry to insert.
> + *
> + * @return
> + *   0 on success, -1 on failure.
> + */
> +static int
> +mr_btree_insert(struct mlx5_mr_btree *bt, struct mr_cache_entry *entry)
> +{
> +	struct mr_cache_entry *lkp_tbl;
> +	uint16_t idx = 0;
> +	size_t shift;
> +
> +	MLX5_ASSERT(bt != NULL);
> +	MLX5_ASSERT(bt->len <= bt->size);
> +	MLX5_ASSERT(bt->len > 0);
> +	lkp_tbl = *bt->table;
> +	/* Find out the slot for insertion. */
> +	if (mr_btree_lookup(bt, &idx, entry->start) != UINT32_MAX) {
> +		DRV_LOG(DEBUG,
> +			"abort insertion to B-tree(%p): already exist at"
> +			" idx=%u [0x%" PRIxPTR ", 0x%" PRIxPTR ") lkey=0x%x",
> +			(void *)bt, idx, entry->start, entry->end, entry->lkey);
> +		/* Already exist, return. */
> +		return 0;
> +	}
> +	/* If table is full, return error. */
> +	if (unlikely(bt->len == bt->size)) {
> +		bt->overflow = 1;
> +		return -1;
> +	}
> +	/* Insert entry. */
> +	++idx;
> +	shift = (bt->len - idx) * sizeof(struct mr_cache_entry);
> +	if (shift)
> +		memmove(&lkp_tbl[idx + 1], &lkp_tbl[idx], shift);
> +	lkp_tbl[idx] = *entry;
> +	bt->len++;
> +	DRV_LOG(DEBUG,
> +		"inserted B-tree(%p)[%u],"
> +		" [0x%" PRIxPTR ", 0x%" PRIxPTR ") lkey=0x%x",
> +		(void *)bt, idx, entry->start, entry->end, entry->lkey);
> +	return 0;
> +}
> +
> +/**
> + * Initialize B-tree and allocate memory for lookup table.
> + *
> + * @param bt
> + *   Pointer to B-tree structure.
> + * @param n
> + *   Number of entries to allocate.
> + * @param socket
> + *   NUMA socket on which memory must be allocated.
> + *
> + * @return
> + *   0 on success, a negative errno value otherwise and rte_errno is set.
> + */
> +int
> +mlx5_mr_btree_init(struct mlx5_mr_btree *bt, int n, int socket) {
> +	if (bt == NULL) {
> +		rte_errno = EINVAL;
> +		return -rte_errno;
> +	}
> +	MLX5_ASSERT(!bt->table && !bt->size);
> +	memset(bt, 0, sizeof(*bt));
> +	bt->table = rte_calloc_socket("B-tree table",
> +				      n, sizeof(struct mr_cache_entry),
> +				      0, socket);
> +	if (bt->table == NULL) {
> +		rte_errno = ENOMEM;
> +		DEBUG("failed to allocate memory for btree cache on socket
> %d",
> +		      socket);
> +		return -rte_errno;
> +	}
> +	bt->size = n;
> +	/* First entry must be NULL for binary search. */
> +	(*bt->table)[bt->len++] = (struct mr_cache_entry) {
> +		.lkey = UINT32_MAX,
> +	};
> +	DEBUG("initialized B-tree %p with table %p",
> +	      (void *)bt, (void *)bt->table);
> +	return 0;
> +}
> +
> +/**
> + * Free B-tree resources.
> + *
> + * @param bt
> + *   Pointer to B-tree structure.
> + */
> +void
> +mlx5_mr_btree_free(struct mlx5_mr_btree *bt) {
> +	if (bt == NULL)
> +		return;
> +	DEBUG("freeing B-tree %p with table %p",
> +	      (void *)bt, (void *)bt->table);
> +	rte_free(bt->table);
> +	memset(bt, 0, sizeof(*bt));
> +}
> +
> +/**
> + * Dump all the entries in a B-tree
> + *
> + * @param bt
> + *   Pointer to B-tree structure.
> + */
> +void
> +mlx5_mr_btree_dump(struct mlx5_mr_btree *bt __rte_unused)
> +{
> +#ifdef RTE_LIBRTE_MLX5_DEBUG
> +	int idx;
> +	struct mr_cache_entry *lkp_tbl;
> +
> +	if (bt == NULL)
> +		return;
> +	lkp_tbl = *bt->table;
> +	for (idx = 0; idx < bt->len; ++idx) {
> +		struct mr_cache_entry *entry = &lkp_tbl[idx];
> +
> +		DEBUG("B-tree(%p)[%u],"
> +		      " [0x%" PRIxPTR ", 0x%" PRIxPTR ") lkey=0x%x",
> +		      (void *)bt, idx, entry->start, entry->end, entry->lkey);
> +	}
> +#endif
> +}
> +
> +/**
> + * Find virtually contiguous memory chunk in a given MR.
> + *
> + * @param mr
> + *   Pointer to MR structure.
> + * @param[out] entry
> + *   Pointer to returning MR cache entry. If not found, this will not be
> + *   updated.
> + * @param base_idx
> + *   Start index of the memseg bitmap.
> + *
> + * @return
> + *   Next index to go on lookup.
> + */
> +static int
> +mr_find_next_chunk(struct mlx5_mr *mr, struct mr_cache_entry *entry,
> +		   int base_idx)
> +{
> +	uintptr_t start = 0;
> +	uintptr_t end = 0;
> +	uint32_t idx = 0;
> +
> +	/* MR for external memory doesn't have memseg list. */
> +	if (mr->msl == NULL) {
> +		struct ibv_mr *ibv_mr = mr->ibv_mr;
> +
> +		MLX5_ASSERT(mr->ms_bmp_n == 1);
> +		MLX5_ASSERT(mr->ms_n == 1);
> +		MLX5_ASSERT(base_idx == 0);
> +		/*
> +		 * Can't search it from memseg list but get it directly from
> +		 * verbs MR as there's only one chunk.
> +		 */
> +		entry->start = (uintptr_t)ibv_mr->addr;
> +		entry->end = (uintptr_t)ibv_mr->addr + mr->ibv_mr->length;
> +		entry->lkey = rte_cpu_to_be_32(mr->ibv_mr->lkey);
> +		/* Returning 1 ends iteration. */
> +		return 1;
> +	}
> +	for (idx = base_idx; idx < mr->ms_bmp_n; ++idx) {
> +		if (rte_bitmap_get(mr->ms_bmp, idx)) {
> +			const struct rte_memseg_list *msl;
> +			const struct rte_memseg *ms;
> +
> +			msl = mr->msl;
> +			ms = rte_fbarray_get(&msl->memseg_arr,
> +					     mr->ms_base_idx + idx);
> +			MLX5_ASSERT(msl->page_sz == ms->hugepage_sz);
> +			if (!start)
> +				start = ms->addr_64;
> +			end = ms->addr_64 + ms->hugepage_sz;
> +		} else if (start) {
> +			/* Passed the end of a fragment. */
> +			break;
> +		}
> +	}
> +	if (start) {
> +		/* Found one chunk. */
> +		entry->start = start;
> +		entry->end = end;
> +		entry->lkey = rte_cpu_to_be_32(mr->ibv_mr->lkey);
> +	}
> +	return idx;
> +}
> +
> +/**
> + * Insert a MR to the global B-tree cache. It may fail due to low-on-memory.
> + * Then, this entry will have to be searched by mr_lookup_list() in
> + * mlx5_mr_create() on miss.
> + *
> + * @param share_cache
> + *   Pointer to a global shared MR cache.
> + * @param mr
> + *   Pointer to MR to insert.
> + *
> + * @return
> + *   0 on success, -1 on failure.
> + */
> +int
> +mlx5_mr_insert_cache(struct mlx5_mr_share_cache *share_cache,
> +		     struct mlx5_mr *mr)
> +{
> +	unsigned int n;
> +
> +	DRV_LOG(DEBUG, "Inserting MR(%p) to global cache(%p)",
> +		(void *)mr, (void *)share_cache);
> +	for (n = 0; n < mr->ms_bmp_n; ) {
> +		struct mr_cache_entry entry;
> +
> +		memset(&entry, 0, sizeof(entry));
> +		/* Find a contiguous chunk and advance the index. */
> +		n = mr_find_next_chunk(mr, &entry, n);
> +		if (!entry.end)
> +			break;
> +		if (mr_btree_insert(&share_cache->cache, &entry) < 0) {
> +			/*
> +			 * Overflowed, but the global table cannot be
> +			 * expanded because of deadlock.
> +			 */
> +			return -1;
> +		}
> +	}
> +	return 0;
> +}
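
Because the memseg bitmap can be fragmented, one MR may land in the global cache as several entries that share an lkey. A hypothetical example, assuming 2 MB memsegs starting at base:

	/* ms_bmp = 1,1,0,1  ->  two cache entries, same lkey:
	 *   [base,            base + 0x400000)   covering memsegs 0-1
	 *   [base + 0x600000, base + 0x800000)   covering memseg 3
	 */
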
> +
> +/**
> + * Look up address in the original global MR list.
> + *
> + * @param share_cache
> + *   Pointer to a global shared MR cache.
> + * @param[out] entry
> + *   Pointer to returning MR cache entry. If no match, this will not be
> updated.
> + * @param addr
> + *   Search key.
> + *
> + * @return
> + *   Found MR on match, NULL otherwise.
> + */
> +struct mlx5_mr *
> +mlx5_mr_lookup_list(struct mlx5_mr_share_cache *share_cache,
> +		    struct mr_cache_entry *entry, uintptr_t addr) {
> +	struct mlx5_mr *mr;
> +
> +	/* Iterate all the existing MRs. */
> +	LIST_FOREACH(mr, &share_cache->mr_list, mr) {
> +		unsigned int n;
> +
> +		if (mr->ms_n == 0)
> +			continue;
> +		for (n = 0; n < mr->ms_bmp_n; ) {
> +			struct mr_cache_entry ret;
> +
> +			memset(&ret, 0, sizeof(ret));
> +			n = mr_find_next_chunk(mr, &ret, n);
> +			if (addr >= ret.start && addr < ret.end) {
> +				/* Found. */
> +				*entry = ret;
> +				return mr;
> +			}
> +		}
> +	}
> +	return NULL;
> +}
> +
> +/**
> + * Look up address on global MR cache.
> + *
> + * @param share_cache
> + *   Pointer to a global shared MR cache.
> + * @param[out] entry
> + *   Pointer to returning MR cache entry. If no match, this will not be
> updated.
> + * @param addr
> + *   Search key.
> + *
> + * @return
> + *   Searched LKey on success, UINT32_MAX on failure and rte_errno is set.
> + */
> +uint32_t
> +mlx5_mr_lookup_cache(struct mlx5_mr_share_cache *share_cache,
> +		     struct mr_cache_entry *entry, uintptr_t addr) {
> +	uint16_t idx;
> +	uint32_t lkey = UINT32_MAX;
> +	struct mlx5_mr *mr;
> +
> +	/*
> +	 * If the global cache has overflowed since it failed to expand the
> +	 * B-tree table, it can't have all the existing MRs. Then, the address
> +	 * has to be searched by traversing the original MR list instead, which
> +	 * is very slow path. Otherwise, the global cache is all inclusive.
> +	 */
> +	if (!unlikely(share_cache->cache.overflow)) {
> +		lkey = mr_btree_lookup(&share_cache->cache, &idx, addr);
> +		if (lkey != UINT32_MAX)
> +			*entry = (*share_cache->cache.table)[idx];
> +	} else {
> +		/* Falling back to the slowest path. */
> +		mr = mlx5_mr_lookup_list(share_cache, entry, addr);
> +		if (mr != NULL)
> +			lkey = entry->lkey;
> +	}
> +	MLX5_ASSERT(lkey == UINT32_MAX || (addr >= entry->start &&
> +					   addr < entry->end));
> +	return lkey;
> +}
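
One subtlety worth noting: the overflow flag is sticky. Once mr_btree_insert() fails on a full global table, every lookup takes the slow list walk until mlx5_mr_rebuild_cache() resets cache.len and cache.overflow, which happens on a memory free event or a DMA unmap. That is the cost of never calling rte_realloc() while holding share_cache->rwlock.
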
> +
> +/**
> + * Free MR resources. MR lock must not be held to avoid a deadlock.
> + * rte_free() can raise memory free event and the callback function will
> + * spin on the lock.
> + *
> + * @param mr
> + *   Pointer to MR to free.
> + */
> +static void
> +mr_free(struct mlx5_mr *mr)
> +{
> +	if (mr == NULL)
> +		return;
> +	DRV_LOG(DEBUG, "freeing MR(%p):", (void *)mr);
> +	if (mr->ibv_mr != NULL)
> +		claim_zero(mlx5_glue->dereg_mr(mr->ibv_mr));
> +	if (mr->ms_bmp != NULL)
> +		rte_bitmap_free(mr->ms_bmp);
> +	rte_free(mr);
> +}
> +
> +void
> +mlx5_mr_rebuild_cache(struct mlx5_mr_share_cache *share_cache) {
> +	struct mlx5_mr *mr;
> +
> +	DRV_LOG(DEBUG, "Rebuild dev cache[] %p", (void *)share_cache);
> +	/* Flush cache to rebuild. */
> +	share_cache->cache.len = 1;
> +	share_cache->cache.overflow = 0;
> +	/* Iterate all the existing MRs. */
> +	LIST_FOREACH(mr, &share_cache->mr_list, mr)
> +		if (mlx5_mr_insert_cache(share_cache, mr) < 0)
> +			return;
> +}
> +
> +/**
> + * Release resources of detached MR having no online entry.
> + *
> + * @param share_cache
> + *   Pointer to a global shared MR cache.
> + */
> +static void
> +mlx5_mr_garbage_collect(struct mlx5_mr_share_cache *share_cache) {
> +	struct mlx5_mr *mr_next;
> +	struct mlx5_mr_list free_list = LIST_HEAD_INITIALIZER(free_list);
> +
> +	/* Must be called from the primary process. */
> +	MLX5_ASSERT(rte_eal_process_type() == RTE_PROC_PRIMARY);
> +	/*
> +	 * MR can't be freed with holding the lock because rte_free() could
> +	 * call memory free callback function. This will be a deadlock
> +	 * situation.
> +	 */
> +	rte_rwlock_write_lock(&share_cache->rwlock);
> +	/* Detach the whole free list and release it after unlocking. */
> +	free_list = share_cache->mr_free_list;
> +	LIST_INIT(&share_cache->mr_free_list);
> +	rte_rwlock_write_unlock(&share_cache->rwlock);
> +	/* Release resources. */
> +	mr_next = LIST_FIRST(&free_list);
> +	while (mr_next != NULL) {
> +		struct mlx5_mr *mr = mr_next;
> +
> +		mr_next = LIST_NEXT(mr, mr);
> +		mr_free(mr);
> +	}
> +}
> +
> +/* Called during rte_memseg_contig_walk() by mlx5_mr_create(). */
> +static int
> +mr_find_contig_memsegs_cb(const struct rte_memseg_list *msl,
> +			  const struct rte_memseg *ms, size_t len, void *arg)
> +{
> +	struct mr_find_contig_memsegs_data *data = arg;
> +
> +	if (data->addr < ms->addr_64 || data->addr >= ms->addr_64 + len)
> +		return 0;
> +	/* Found, save it and stop walking. */
> +	data->start = ms->addr_64;
> +	data->end = ms->addr_64 + len;
> +	data->msl = msl;
> +	return 1;
> +}
> +
> +/**
> + * Create a new global Memory Region (MR) for a missing virtual address.
> + * This API should be called on a secondary process, then a request is
> + * sent to the primary process in order to create a MR for the address.
> + * As the global MR list is on the shared memory, following LKey lookup
> + * should succeed unless the request fails.
> + *
> + * @param pd
> + *   Pointer to ibv_pd of a device (net, regex, vdpa,...).
> + * @param share_cache
> + *   Pointer to a global shared MR cache.
> + * @param[out] entry
> + *   Pointer to returning MR cache entry, found in the global cache or newly
> + *   created. If failed to create one, this will not be updated.
> + * @param addr
> + *   Target virtual address to register.
> + * @param mr_ext_memseg_en
> + *   Configurable flag about external memory segment enable or not.
> + *
> + * @return
> + *   Searched LKey on success, UINT32_MAX on failure and rte_errno is set.
> + */
> +static uint32_t
> +mlx5_mr_create_secondary(struct ibv_pd *pd __rte_unused,
> +			 struct mlx5_mp_id *mp_id,
> +			 struct mlx5_mr_share_cache *share_cache,
> +			 struct mr_cache_entry *entry, uintptr_t addr,
> +			 unsigned int mr_ext_memseg_en __rte_unused) {
> +	int ret;
> +
> +	DEBUG("port %u requesting MR creation for address (%p)",
> +	      mp_id->port_id, (void *)addr);
> +	ret = mlx5_mp_req_mr_create(mp_id, addr);
> +	if (ret) {
> +		DEBUG("Fail to request MR creation for address (%p)",
> +		      (void *)addr);
> +		return UINT32_MAX;
> +	}
> +	rte_rwlock_read_lock(&share_cache->rwlock);
> +	/* Fill in output data. */
> +	mlx5_mr_lookup_cache(share_cache, entry, addr);
> +	/* Lookup can't fail. */
> +	MLX5_ASSERT(entry->lkey != UINT32_MAX);
> +	rte_rwlock_read_unlock(&share_cache->rwlock);
> +	DEBUG("MR CREATED by primary process for %p:\n"
> +	      "  [0x%" PRIxPTR ", 0x%" PRIxPTR "), lkey=0x%x",
> +	      (void *)addr, entry->start, entry->end, entry->lkey);
> +	return entry->lkey;
> +}
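
The secondary path is purely a relay; the registration and the cache insertion happen in the primary. A condensed sketch of the round trip, assuming the mp helpers from patch 1/4 of this series:

	/* secondary process              primary process (mp handler)
	 *
	 * mlx5_mp_req_mr_create()  --->  mlx5_mr_create_primary()
	 *                                inserts MR + cache entry under
	 *                                share_cache->rwlock
	 * mlx5_mr_lookup_cache()   <---  reply (return status)
	 * now guaranteed to hit
	 */
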
> +
> +/**
> + * Create a new global Memory Region (MR) for a missing virtual address.
> + * Register entire virtually contiguous memory chunk around the address.
> + *
> + * @param pd
> + *   Pointer to ibv_pd of a device (net, regex, vdpa,...).
> + * @param share_cache
> + *   Pointer to a global shared MR cache.
> + * @param[out] entry
> + *   Pointer to returning MR cache entry, found in the global cache or newly
> + *   created. If failed to create one, this will not be updated.
> + * @param addr
> + *   Target virtual address to register.
> + * @param mr_ext_memseg_en
> + *   Configurable flag about external memory segment enable or not.
> + *
> + * @return
> + *   Searched LKey on success, UINT32_MAX on failure and rte_errno is set.
> + */
> +uint32_t
> +mlx5_mr_create_primary(struct ibv_pd *pd,
> +		       struct mlx5_mr_share_cache *share_cache,
> +		       struct mr_cache_entry *entry, uintptr_t addr,
> +		       unsigned int mr_ext_memseg_en) {
> +	struct mr_find_contig_memsegs_data data = {.addr = addr, };
> +	struct mr_find_contig_memsegs_data data_re;
> +	const struct rte_memseg_list *msl;
> +	const struct rte_memseg *ms;
> +	struct mlx5_mr *mr = NULL;
> +	int ms_idx_shift = -1;
> +	uint32_t bmp_size;
> +	void *bmp_mem;
> +	uint32_t ms_n;
> +	uint32_t n;
> +	size_t len;
> +
> +	DRV_LOG(DEBUG, "Creating a MR using address (%p)", (void *)addr);
> +	/*
> +	 * Release detached MRs if any. This can't be called with holding
> +	 * either memory_hotplug_lock or share_cache->rwlock. MRs on the
> +	 * free list have been detached by the memory free event but it
> +	 * couldn't be released inside the callback due to deadlock. As a
> +	 * result, releasing resources is quite opportunistic.
> +	 */
> +	mlx5_mr_garbage_collect(share_cache);
> +	/*
> +	 * If enabled, find out a contiguous virtual address chunk in use, to
> +	 * which the given address belongs, in order to register maximum
> +	 * range. In the best case where mempools are not dynamically
> +	 * recreated and '--socket-mem' is specified as an EAL option, it is
> +	 * very likely to have only one MR(LKey) per a socket and per a
> +	 * hugepage-size even though the system memory is highly fragmented.
> +	 * As the whole memory chunk will be pinned by kernel, it can't be
> +	 * reused unless entire chunk is freed from EAL.
> +	 *
> +	 * If disabled, just register one memseg (page). Then, memory
> +	 * consumption will be minimized but it may drop performance if
> +	 * there are many MRs to lookup on the datapath.
> +	 */
> +	if (!mr_ext_memseg_en) {
> +		data.msl = rte_mem_virt2memseg_list((void *)addr);
> +		data.start = RTE_ALIGN_FLOOR(addr, data.msl->page_sz);
> +		data.end = data.start + data.msl->page_sz;
> +	} else if (!rte_memseg_contig_walk(mr_find_contig_memsegs_cb,
> +					   &data)) {
> +		DRV_LOG(WARNING,
> +			"Unable to find virtually contiguous"
> +			" chunk for address (%p)."
> +			" rte_memseg_contig_walk() failed.", (void *)addr);
> +		rte_errno = ENXIO;
> +		goto err_nolock;
> +	}
> +alloc_resources:
> +	/* Addresses must be page-aligned. */
> +	MLX5_ASSERT(data.msl);
> +	MLX5_ASSERT(rte_is_aligned((void *)data.start, data.msl->page_sz));
> +	MLX5_ASSERT(rte_is_aligned((void *)data.end, data.msl->page_sz));
> +	msl = data.msl;
> +	ms = rte_mem_virt2memseg((void *)data.start, msl);
> +	len = data.end - data.start;
> +	MLX5_ASSERT(ms);
> +	MLX5_ASSERT(msl->page_sz == ms->hugepage_sz);
> +	/* Number of memsegs in the range. */
> +	ms_n = len / msl->page_sz;
> +	DEBUG("Extending %p to [0x%" PRIxPTR ", 0x%" PRIxPTR "),"
> +	      " page_sz=0x%" PRIx64 ", ms_n=%u",
> +	      (void *)addr, data.start, data.end, msl->page_sz, ms_n);
> +	/* Size of memory for bitmap. */
> +	bmp_size = rte_bitmap_get_memory_footprint(ms_n);
> +	mr = rte_zmalloc_socket(NULL,
> +				RTE_ALIGN_CEIL(sizeof(*mr),
> +					       RTE_CACHE_LINE_SIZE) +
> +				bmp_size,
> +				RTE_CACHE_LINE_SIZE, msl->socket_id);
> +	if (mr == NULL) {
> +		DEBUG("Unable to allocate memory for a new MR of"
> +		      " address (%p).", (void *)addr);
> +		rte_errno = ENOMEM;
> +		goto err_nolock;
> +	}
> +	mr->msl = msl;
> +	/*
> +	 * Save the index of the first memseg and initialize memseg bitmap.
> +	 * To see if a memseg of ms_idx in the memseg-list is still valid,
> +	 * check:
> +	 *	rte_bitmap_get(mr->bmp, ms_idx - mr->ms_base_idx)
> +	 */
> +	mr->ms_base_idx = rte_fbarray_find_idx(&msl->memseg_arr, ms);
> +	bmp_mem = RTE_PTR_ALIGN_CEIL(mr + 1, RTE_CACHE_LINE_SIZE);
> +	mr->ms_bmp = rte_bitmap_init(ms_n, bmp_mem, bmp_size);
> +	if (mr->ms_bmp == NULL) {
> +		DEBUG("Unable to initialize bitmap for a new MR of"
> +		      " address (%p).", (void *)addr);
> +		rte_errno = EINVAL;
> +		goto err_nolock;
> +	}
> +	/*
> +	 * Should recheck whether the extended contiguous chunk is still
> +	 * valid. Because memory_hotplug_lock can't be held if there's any
> +	 * memory related calls in a critical path, resource allocation above
> +	 * can't be locked. If the memory has been changed at this point, try
> +	 * again with just single page. If not, go on with the big chunk
> +	 * atomically from here.
> +	 */
> +	rte_mcfg_mem_read_lock();
> +	data_re = data;
> +	if (len > msl->page_sz &&
> +	    !rte_memseg_contig_walk(mr_find_contig_memsegs_cb, &data_re)) {
> +		DEBUG("Unable to find virtually contiguous"
> +		      " chunk for address (%p)."
> +		      " rte_memseg_contig_walk() failed.", (void *)addr);
> +		rte_errno = ENXIO;
> +		goto err_memlock;
> +	}
> +	if (data.start != data_re.start || data.end != data_re.end) {
> +		/*
> +		 * The extended contiguous chunk has been changed. Try again
> +		 * with single memseg instead.
> +		 */
> +		data.start = RTE_ALIGN_FLOOR(addr, msl->page_sz);
> +		data.end = data.start + msl->page_sz;
> +		rte_mcfg_mem_read_unlock();
> +		mr_free(mr);
> +		goto alloc_resources;
> +	}
> +	MLX5_ASSERT(data.msl == data_re.msl);
> +	rte_rwlock_write_lock(&share_cache->rwlock);
> +	/*
> +	 * Check the address is really missing. If other thread already created
> +	 * one or it is not found due to overflow, abort and return.
> +	 */
> +	if (mlx5_mr_lookup_cache(share_cache, entry, addr) != UINT32_MAX) {
> +		/*
> +		 * Insert to the global cache table. It may fail due to
> +		 * low-on-memory. Then, this entry will have to be searched
> +		 * here again.
> +		 */
> +		mr_btree_insert(&share_cache->cache, entry);
> +		DEBUG("Found MR for %p on final lookup, abort", (void *)addr);
> +		rte_rwlock_write_unlock(&share_cache->rwlock);
> +		rte_mcfg_mem_read_unlock();
> +		/*
> +		 * Must be unlocked before calling rte_free() because
> +		 * mlx5_mr_mem_event_free_cb() can be called inside.
> +		 */
> +		mr_free(mr);
> +		return entry->lkey;
> +	}
> +	/*
> +	 * Trim start and end addresses for verbs MR. Set bits for registering
> +	 * memsegs but exclude already registered ones. Bitmap can be
> +	 * fragmented.
> +	 */
> +	for (n = 0; n < ms_n; ++n) {
> +		uintptr_t start;
> +		struct mr_cache_entry ret;
> +
> +		memset(&ret, 0, sizeof(ret));
> +		start = data_re.start + n * msl->page_sz;
> +		/* Exclude memsegs already registered by other MRs. */
> +		if (mlx5_mr_lookup_cache(share_cache, &ret, start) ==
> +		    UINT32_MAX) {
> +			/*
> +			 * Start from the first unregistered memseg in the
> +			 * extended range.
> +			 */
> +			if (ms_idx_shift == -1) {
> +				mr->ms_base_idx += n;
> +				data.start = start;
> +				ms_idx_shift = n;
> +			}
> +			data.end = start + msl->page_sz;
> +			rte_bitmap_set(mr->ms_bmp, n - ms_idx_shift);
> +			++mr->ms_n;
> +		}
> +	}
> +	len = data.end - data.start;
> +	mr->ms_bmp_n = len / msl->page_sz;
> +	MLX5_ASSERT(ms_idx_shift + mr->ms_bmp_n <= ms_n);
> +	/*
> +	 * Finally create a verbs MR for the memory chunk. ibv_reg_mr() can be
> +	 * called with holding the memory lock because it doesn't use
> +	 * mlx5_alloc_buf_extern() which eventually calls rte_malloc_socket()
> +	 * through mlx5_alloc_verbs_buf().
> +	 */
> +	mr->ibv_mr = mlx5_glue->reg_mr(pd, (void *)data.start, len,
> +				       IBV_ACCESS_LOCAL_WRITE |
> +					   IBV_ACCESS_RELAXED_ORDERING);
> +	if (mr->ibv_mr == NULL) {
> +		DEBUG("Fail to create a verbs MR for address (%p)",
> +		      (void *)addr);
> +		rte_errno = EINVAL;
> +		goto err_mrlock;
> +	}
> +	MLX5_ASSERT((uintptr_t)mr->ibv_mr->addr == data.start);
> +	MLX5_ASSERT(mr->ibv_mr->length == len);
> +	LIST_INSERT_HEAD(&share_cache->mr_list, mr, mr);
> +	DEBUG("MR CREATED (%p) for %p:\n"
> +	      "  [0x%" PRIxPTR ", 0x%" PRIxPTR "),"
> +	      " lkey=0x%x base_idx=%u ms_n=%u, ms_bmp_n=%u",
> +	      (void *)mr, (void *)addr, data.start, data.end,
> +	      rte_cpu_to_be_32(mr->ibv_mr->lkey),
> +	      mr->ms_base_idx, mr->ms_n, mr->ms_bmp_n);
> +	/* Insert to the global cache table. */
> +	mlx5_mr_insert_cache(share_cache, mr);
> +	/* Fill in output data. */
> +	mlx5_mr_lookup_cache(share_cache, entry, addr);
> +	/* Lookup can't fail. */
> +	MLX5_ASSERT(entry->lkey != UINT32_MAX);
> +	rte_rwlock_write_unlock(&share_cache->rwlock);
> +	rte_mcfg_mem_read_unlock();
> +	return entry->lkey;
> +err_mrlock:
> +	rte_rwlock_write_unlock(&share_cache->rwlock);
> +err_memlock:
> +	rte_mcfg_mem_read_unlock();
> +err_nolock:
> +	/*
> +	 * In case of error, as this can be called in a datapath, a warning
> +	 * message per error is preferable instead. Must be unlocked before
> +	 * calling rte_free() because mlx5_mr_mem_event_free_cb() can be
> +	 * called inside.
> +	 */
> +	mr_free(mr);
> +	return UINT32_MAX;
> +}
> +
> +/**
> + * Create a new global Memory Region (MR) for a missing virtual address.
> + * This can be called from primary and secondary process.
> + *
> + * @param pd
> + *   Pointer to ibv_pd of a device (net, regex, vdpa,...).
> + * @param mp_id
> + *   ID of the MP process, used by a secondary process to request the
> + *   registration from the primary.
> + * @param share_cache
> + *   Pointer to a global shared MR cache.
> + * @param[out] entry
> + *   Pointer to returning MR cache entry, found in the global cache or newly
> + *   created. If failed to create one, this will not be updated.
> + * @param addr
> + *   Target virtual address to register.
> + *
> + * @return
> + *   Searched LKey on success, UINT32_MAX on failure and rte_errno is set.
> + */
> +static uint32_t
> +mlx5_mr_create(struct ibv_pd *pd, struct mlx5_mp_id *mp_id,
> +	       struct mlx5_mr_share_cache *share_cache,
> +	       struct mr_cache_entry *entry, uintptr_t addr,
> +	       unsigned int mr_ext_memseg_en)
> +{
> +	uint32_t ret = 0;
> +
> +	switch (rte_eal_process_type()) {
> +	case RTE_PROC_PRIMARY:
> +		ret = mlx5_mr_create_primary(pd, share_cache, entry,
> +					     addr, mr_ext_memseg_en);
> +		break;
> +	case RTE_PROC_SECONDARY:
> +		ret = mlx5_mr_create_secondary(pd, mp_id, share_cache, entry,
> +					       addr, mr_ext_memseg_en);
> +		break;
> +	default:
> +		break;
> +	}
> +	return ret;
> +}
> +
> +/**
> + * Look up address in the global MR cache table. If not found, create a new
> + * MR.
> + * Insert the found/created entry to local bottom-half cache table.
> + *
> + * @param pd
> + *   Pointer to ibv_pd of a device (net, regex, vdpa,...).
> + * @param mp_id
> + *   ID of the MP process.
> + * @param share_cache
> + *   Pointer to a global shared MR cache.
> + * @param mr_ctrl
> + *   Pointer to per-queue MR control structure.
> + * @param[out] entry
> + *   Pointer to returning MR cache entry, found in the global cache or newly
> + *   created. If failed to create one, this is not written.
> + * @param addr
> + *   Search key.
> + *
> + * @return
> + *   Searched LKey on success, UINT32_MAX on no match.
> + */
> +static uint32_t
> +mr_lookup_caches(struct ibv_pd *pd, struct mlx5_mp_id *mp_id,
> +		 struct mlx5_mr_share_cache *share_cache,
> +		 struct mlx5_mr_ctrl *mr_ctrl,
> +		 struct mr_cache_entry *entry, uintptr_t addr,
> +		 unsigned int mr_ext_memseg_en)
> +{
> +	struct mlx5_mr_btree *bt = &mr_ctrl->cache_bh;
> +	uint32_t lkey;
> +	uint16_t idx;
> +
> +	/* If local cache table is full, try to double it. */
> +	if (unlikely(bt->len == bt->size))
> +		mr_btree_expand(bt, bt->size << 1);
> +	/* Look up in the global cache. */
> +	rte_rwlock_read_lock(&share_cache->rwlock);
> +	lkey = mr_btree_lookup(&share_cache->cache, &idx, addr);
> +	if (lkey != UINT32_MAX) {
> +		/* Found. */
> +		*entry = (*share_cache->cache.table)[idx];
> +		rte_rwlock_read_unlock(&share_cache->rwlock);
> +		/*
> +		 * Update local cache. Even if it fails, return the found entry
> +		 * to update top-half cache. Next time, this entry will be found
> +		 * in the global cache.
> +		 */
> +		mr_btree_insert(bt, entry);
> +		return lkey;
> +	}
> +	rte_rwlock_read_unlock(&share_cache->rwlock);
> +	/* First time to see the address? Create a new MR. */
> +	lkey = mlx5_mr_create(pd, mp_id, share_cache, entry, addr,
> +			      mr_ext_memseg_en);
> +	/*
> +	 * Update the local cache if successfully created a new global MR. Even
> +	 * if failed to create one, there's no action to take in this datapath
> +	 * code. As returning LKey is invalid, this will eventually make HW
> +	 * fail.
> +	 */
> +	if (lkey != UINT32_MAX)
> +		mr_btree_insert(bt, entry);
> +	return lkey;
> +}
> +
> +/**
> + * Bottom-half of LKey search on datapath. First search in cache_bh[] and if
> + * misses, search in the global MR cache table and update the new entry to
> + * per-queue local caches.
> + *
> + * @param pd
> + *   Pointer to ibv_pd of a device (net, regex, vdpa,...).
> + * @param mp_id
> + *   ID of the MP process.
> + * @param share_cache
> + *   Pointer to a global shared MR cache.
> + * @param mr_ctrl
> + *   Pointer to per-queue MR control structure.
> + * @param addr
> + *   Search key.
> + *
> + * @return
> + *   Searched LKey on success, UINT32_MAX on no match.
> + */
> +uint32_t mlx5_mr_addr2mr_bh(struct ibv_pd *pd, struct mlx5_mp_id *mp_id,
> +			    struct mlx5_mr_share_cache *share_cache,
> +			    struct mlx5_mr_ctrl *mr_ctrl,
> +			    uintptr_t addr, unsigned int mr_ext_memseg_en)
> +{
> +	uint32_t lkey;
> +	uint16_t bh_idx = 0;
> +	/* Victim in top-half cache to replace with new entry. */
> +	struct mr_cache_entry *repl = &mr_ctrl->cache[mr_ctrl->head];
> +
> +	/* Binary-search MR translation table. */
> +	lkey = mr_btree_lookup(&mr_ctrl->cache_bh, &bh_idx, addr);
> +	/* Update top-half cache. */
> +	if (likely(lkey != UINT32_MAX)) {
> +		*repl = (*mr_ctrl->cache_bh.table)[bh_idx];
> +	} else {
> +		/*
> +		 * If missed in local lookup table, search in the global cache
> +		 * and local cache_bh[] will be updated inside if possible.
> +		 * Top-half cache entry will also be updated.
> +		 */
> +		lkey = mr_lookup_caches(pd, mp_id, share_cache, mr_ctrl,
> +					repl, addr, mr_ext_memseg_en);
> +		if (unlikely(lkey == UINT32_MAX))
> +			return UINT32_MAX;
> +	}
> +	/* Update the most recently used entry. */
> +	mr_ctrl->mru = mr_ctrl->head;
> +	/* Point to the next victim, the oldest. */
> +	mr_ctrl->head = (mr_ctrl->head + 1) % MLX5_MR_CACHE_N;
> +	return lkey;
> +}
> +
> +/**
> + * Release all the created MRs and resources on the global MR cache of a
> + * device.
> + *
> + * @param share_cache
> + *   Pointer to a global shared MR cache.
> + */
> +void
> +mlx5_mr_release_cache(struct mlx5_mr_share_cache *share_cache)
> +{
> +	struct mlx5_mr *mr_next;
> +
> +	rte_rwlock_write_lock(&share_cache->rwlock);
> +	/* Detach from MR list and move to free list. */
> +	mr_next = LIST_FIRST(&share_cache->mr_list);
> +	while (mr_next != NULL) {
> +		struct mlx5_mr *mr = mr_next;
> +
> +		mr_next = LIST_NEXT(mr, mr);
> +		LIST_REMOVE(mr, mr);
> +		LIST_INSERT_HEAD(&share_cache->mr_free_list, mr, mr);
> +	}
> +	LIST_INIT(&share_cache->mr_list);
> +	/* Free global cache. */
> +	mlx5_mr_btree_free(&share_cache->cache);
> +	rte_rwlock_write_unlock(&share_cache->rwlock);
> +	/* Free all remaining MRs. */
> +	mlx5_mr_garbage_collect(share_cache);
> +}
> +
> +/**
> + * Flush all of the local cache entries.
> + *
> + * @param mr_ctrl
> + *   Pointer to per-queue MR local cache.
> + */
> +void
> +mlx5_mr_flush_local_cache(struct mlx5_mr_ctrl *mr_ctrl)
> +{
> +	/* Reset the most-recently-used index. */
> +	mr_ctrl->mru = 0;
> +	/* Reset the linear search array. */
> +	mr_ctrl->head = 0;
> +	memset(mr_ctrl->cache, 0, sizeof(mr_ctrl->cache));
> +	/* Reset the B-tree table. */
> +	mr_ctrl->cache_bh.len = 1;
> +	mr_ctrl->cache_bh.overflow = 0;
> +	/* Update the generation number. */
> +	mr_ctrl->cur_gen = *mr_ctrl->dev_gen_ptr;
> +	DRV_LOG(DEBUG, "mr_ctrl(%p): flushed, cur_gen=%d",
> +		(void *)mr_ctrl, mr_ctrl->cur_gen);
> +}
> +
> +/**
> + * Creates a memory region for external memory, that is memory which is not
> + * part of the DPDK memory segments.
> + *
> + * @param pd
> + *   Pointer to ibv_pd of a device (net, regex, vdpa,...).
> + * @param addr
> + *   Starting virtual address of memory.
> + * @param len
> + *   Length of memory segment being mapped.
> + * @param socket_id
> + *   Socket to allocate heap memory for the control structures.
> + *
> + * @return
> + *   Pointer to MR structure on success, NULL otherwise.
> + */
> +struct mlx5_mr *
> +mlx5_create_mr_ext(struct ibv_pd *pd, uintptr_t addr, size_t len,
> +		   int socket_id)
> +{
> +	struct mlx5_mr *mr = NULL;
> +
> +	mr = rte_zmalloc_socket(NULL,
> +				RTE_ALIGN_CEIL(sizeof(*mr),
> +					       RTE_CACHE_LINE_SIZE),
> +				RTE_CACHE_LINE_SIZE, socket_id);
> +	if (mr == NULL)
> +		return NULL;
> +	mr->ibv_mr = mlx5_glue->reg_mr(pd, (void *)addr, len,
> +				       IBV_ACCESS_LOCAL_WRITE |
> +					   IBV_ACCESS_RELAXED_ORDERING);
> +	if (mr->ibv_mr == NULL) {
> +		DRV_LOG(WARNING,
> +			"Fail to create a verbs MR for address (%p)",
> +			(void *)addr);
> +		rte_free(mr);
> +		return NULL;
> +	}
> +	mr->msl = NULL; /* Mark it is external memory. */
> +	mr->ms_bmp = NULL;
> +	mr->ms_n = 1;
> +	mr->ms_bmp_n = 1;
> +	DRV_LOG(DEBUG,
> +		"MR CREATED (%p) for external memory %p:\n"
> +		"  [0x%" PRIxPTR ", 0x%" PRIxPTR "),"
> +		" lkey=0x%x base_idx=%u ms_n=%u, ms_bmp_n=%u",
> +		(void *)mr, (void *)addr,
> +		addr, addr + len, rte_cpu_to_be_32(mr->ibv_mr->lkey),
> +		mr->ms_base_idx, mr->ms_n, mr->ms_bmp_n);
> +	return mr;
> +}
> +
> +/**
> + * Dump all the created MRs and the global cache entries.
> + *
> + * @param share_cache
> + *   Pointer to a global shared MR cache.
> + */
> +void
> +mlx5_mr_dump_cache(struct mlx5_mr_share_cache *share_cache __rte_unused)
> +{
> +#ifdef RTE_LIBRTE_MLX5_DEBUG
> +	struct mlx5_mr *mr;
> +	int mr_n = 0;
> +	int chunk_n = 0;
> +
> +	rte_rwlock_read_lock(&share_cache->rwlock);
> +	/* Iterate all the existing MRs. */
> +	LIST_FOREACH(mr, &share_cache->mr_list, mr) {
> +		unsigned int n;
> +
> +		DEBUG("MR[%u], LKey = 0x%x, ms_n = %u, ms_bmp_n = %u",
> +		      mr_n++, rte_cpu_to_be_32(mr->ibv_mr->lkey),
> +		      mr->ms_n, mr->ms_bmp_n);
> +		if (mr->ms_n == 0)
> +			continue;
> +		for (n = 0; n < mr->ms_bmp_n; ) {
> +			struct mr_cache_entry ret = { 0, };
> +
> +			n = mr_find_next_chunk(mr, &ret, n);
> +			if (!ret.end)
> +				break;
> +			DEBUG("  chunk[%u], [0x%" PRIxPTR ", 0x%" PRIxPTR ")",
> +			      chunk_n++, ret.start, ret.end);
> +		}
> +	}
> +	DEBUG("Dumping global cache %p", (void *)share_cache);
> +	mlx5_mr_btree_dump(&share_cache->cache);
> +	rte_rwlock_read_unlock(&share_cache->rwlock);
> +#endif
> +}
> diff --git a/drivers/common/mlx5/mlx5_common_mr.h b/drivers/common/mlx5/mlx5_common_mr.h
> new file mode 100644
> index 0000000000..e805f96375
> --- /dev/null
> +++ b/drivers/common/mlx5/mlx5_common_mr.h
> @@ -0,0 +1,160 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright 2018 6WIND S.A.
> + * Copyright 2018 Mellanox Technologies, Ltd
> + */
> +
> +#ifndef RTE_PMD_MLX5_COMMON_MR_H_
> +#define RTE_PMD_MLX5_COMMON_MR_H_
> +
> +#include <stddef.h>
> +#include <stdint.h>
> +#include <sys/queue.h>
> +
> +/* Verbs header. */
> +/* ISO C doesn't support unnamed structs/unions, disabling -pedantic. */
> +#ifdef PEDANTIC
> +#pragma GCC diagnostic ignored "-Wpedantic"
> +#endif
> +#include <infiniband/verbs.h>
> +#include <infiniband/mlx5dv.h>
> +#ifdef PEDANTIC
> +#pragma GCC diagnostic error "-Wpedantic"
> +#endif
> +
> +#include <rte_rwlock.h>
> +#include <rte_bitmap.h>
> +#include <rte_memory.h>
> +
> +#include "mlx5_common_mp.h"
> +
> +/* Size of per-queue MR cache array for linear search. */
> +#define MLX5_MR_CACHE_N 8
> +#define MLX5_MR_BTREE_CACHE_N 256
> +
> +/* Memory Region object. */
> +struct mlx5_mr {
> +	LIST_ENTRY(mlx5_mr) mr; /**< Pointer to the prev/next entry. */
> +	struct ibv_mr *ibv_mr; /* Verbs Memory Region. */
> +	const struct rte_memseg_list *msl;
> +	int ms_base_idx; /* Start index of msl->memseg_arr[]. */
> +	int ms_n; /* Number of memsegs in use. */
> +	uint32_t ms_bmp_n; /* Number of bits in memsegs bit-mask. */
> +	struct rte_bitmap *ms_bmp; /* Bit-mask of memsegs belonged to MR. */
> +};
> +
> +/* Cache entry for Memory Region. */
> +struct mr_cache_entry {
> +	uintptr_t start; /* Start address of MR. */
> +	uintptr_t end; /* End address of MR. */
> +	uint32_t lkey; /* rte_cpu_to_be_32(ibv_mr->lkey). */
> +} __rte_packed;
> +
> +/* MR Cache table for Binary search. */
> +struct mlx5_mr_btree {
> +	uint16_t len; /* Number of entries. */
> +	uint16_t size; /* Total number of entries. */
> +	int overflow; /* Mark failure of table expansion. */
> +	struct mr_cache_entry (*table)[];
> +} __rte_packed;
> +
> +/* Per-queue MR control descriptor. */
> +struct mlx5_mr_ctrl {
> +	uint32_t *dev_gen_ptr; /* Generation number of device to poll. */
> +	uint32_t cur_gen; /* Generation number saved to flush caches. */
> +	uint16_t mru; /* Index of last hit entry in top-half cache. */
> +	uint16_t head; /* Index of the oldest entry in top-half cache. */
> +	struct mr_cache_entry cache[MLX5_MR_CACHE_N]; /* Cache for top-half. */
> +	struct mlx5_mr_btree cache_bh; /* Cache for bottom-half. */
> +} __rte_packed;
> +
> +LIST_HEAD(mlx5_mr_list, mlx5_mr);
> +
> +/* Global per-device MR cache. */
> +struct mlx5_mr_share_cache {
> +	uint32_t dev_gen; /* Generation number to flush local caches. */
> +	rte_rwlock_t rwlock; /* MR cache Lock. */
> +	struct mlx5_mr_btree cache; /* Global MR cache table. */
> +	struct mlx5_mr_list mr_list; /* Registered MR list. */
> +	struct mlx5_mr_list mr_free_list; /* Freed MR list. */
> +} __rte_packed;
> +
> +/**
> + * Look up LKey from given lookup table by linear search. Firstly look up the
> + * last-hit entry. If miss, the entire array is searched. If found, update the
> + * last-hit index and return LKey.
> + *
> + * @param lkp_tbl
> + *   Pointer to lookup table.
> + * @param[in,out] cached_idx
> + *   Pointer to last-hit index.
> + * @param n
> + *   Size of lookup table.
> + * @param addr
> + *   Search key.
> + *
> + * @return
> + *   Searched LKey on success, UINT32_MAX on no match.
> + */
> +static __rte_always_inline uint32_t
> +mlx5_mr_lookup_lkey(struct mr_cache_entry *lkp_tbl, uint16_t *cached_idx,
> +		    uint16_t n, uintptr_t addr)
> +{
> +	uint16_t idx;
> +
> +	if (likely(addr >= lkp_tbl[*cached_idx].start &&
> +		   addr < lkp_tbl[*cached_idx].end))
> +		return lkp_tbl[*cached_idx].lkey;
> +	for (idx = 0; idx < n && lkp_tbl[idx].start != 0; ++idx) {
> +		if (addr >= lkp_tbl[idx].start &&
> +		    addr < lkp_tbl[idx].end) {
> +			/* Found. */
> +			*cached_idx = idx;
> +			return lkp_tbl[idx].lkey;
> +		}
> +	}
> +	return UINT32_MAX;
> +}
> +
> +__rte_experimental
> +int mlx5_mr_btree_init(struct mlx5_mr_btree *bt, int n, int socket);
> +__rte_experimental
> +void mlx5_mr_btree_free(struct mlx5_mr_btree *bt);
> +__rte_experimental
> +void mlx5_mr_btree_dump(struct mlx5_mr_btree *bt __rte_unused);
> +__rte_experimental
> +uint32_t mlx5_mr_addr2mr_bh(struct ibv_pd *pd, struct mlx5_mp_id *mp_id,
> +			    struct mlx5_mr_share_cache *share_cache,
> +			    struct mlx5_mr_ctrl *mr_ctrl,
> +			    uintptr_t addr, unsigned int mr_ext_memseg_en);
> +__rte_experimental
> +void mlx5_mr_release_cache(struct mlx5_mr_share_cache *mr_cache);
> +__rte_experimental
> +void mlx5_mr_dump_cache(struct mlx5_mr_share_cache *share_cache __rte_unused);
> +__rte_experimental
> +void mlx5_mr_rebuild_cache(struct mlx5_mr_share_cache *share_cache);
> +__rte_experimental
> +void mlx5_mr_flush_local_cache(struct mlx5_mr_ctrl *mr_ctrl);
> +__rte_experimental
> +int mlx5_mr_insert_cache(struct mlx5_mr_share_cache *share_cache,
> +			 struct mlx5_mr *mr);
> +__rte_experimental
> +uint32_t mlx5_mr_lookup_cache(struct mlx5_mr_share_cache *share_cache,
> +			      struct mr_cache_entry *entry, uintptr_t addr);
> +__rte_experimental
> +struct mlx5_mr *mlx5_mr_lookup_list(struct mlx5_mr_share_cache *share_cache,
> +				    struct mr_cache_entry *entry, uintptr_t addr);
> +__rte_experimental
> +struct mlx5_mr *mlx5_create_mr_ext(struct ibv_pd *pd, uintptr_t addr,
> +				   size_t len, int socket_id);
> +__rte_experimental
> +uint32_t mlx5_mr_create_primary(struct ibv_pd *pd,
> +				struct mlx5_mr_share_cache *share_cache,
> +				struct mr_cache_entry *entry, uintptr_t addr,
> +				unsigned int mr_ext_memseg_en);
> +
> +#endif /* RTE_PMD_MLX5_COMMON_MR_H_ */
> diff --git a/drivers/common/mlx5/rte_common_mlx5_version.map b/drivers/common/mlx5/rte_common_mlx5_version.map
> index 265703d1c9..b58a378278 100644
> --- a/drivers/common/mlx5/rte_common_mlx5_version.map
> +++ b/drivers/common/mlx5/rte_common_mlx5_version.map
> @@ -61,4 +61,18 @@ EXPERIMENTAL {
>  	mlx5_mp_req_mr_create;
>  	mlx5_mp_req_queue_state_modify;
>  	mlx5_mp_req_verbs_cmd_fd;
> +
> +	mlx5_mr_btree_init;
> +	mlx5_mr_btree_free;
> +	mlx5_mr_btree_dump;
> +	mlx5_mr_addr2mr_bh;
> +	mlx5_mr_release_cache;
> +	mlx5_mr_dump_cache;
> +	mlx5_mr_rebuild_cache;
> +	mlx5_mr_insert_cache;
> +	mlx5_mr_lookup_cache;
> +	mlx5_mr_lookup_list;
> +	mlx5_create_mr_ext;
> +	mlx5_mr_create_primary;
> +	mlx5_mr_flush_local_cache;
>  };
> --
> 2.16.6


^ permalink raw reply	[flat|nested] 26+ messages in thread
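
A note on how the refactored MR cache in the patch above is meant to be consumed: the per-queue control structure keeps a small linear "top-half" cache (MLX5_MR_CACHE_N entries) searched inline on every packet, and only a miss falls through to the bottom-half B-tree and the device-global cache. Below is a minimal C sketch of that flow, assuming a queue embeds a struct mlx5_mr_ctrl and can reach its device's struct mlx5_mr_share_cache; the my_priv/my_txq types and their fields are illustrative only, not part of the patch.

	#include <rte_mbuf.h>
	#include <mlx5_common_mr.h>

	/* Hypothetical containers; field names are illustrative only. */
	struct my_priv {
		struct ibv_pd *pd;
		struct mlx5_mp_id mp_id;
		struct mlx5_mr_share_cache share_cache;
		unsigned int mr_ext_memseg_en;
	};

	struct my_txq {
		struct mlx5_mr_ctrl mr_ctrl; /* per-queue MR cache */
		struct my_priv *priv;
	};

	static __rte_always_inline uint32_t
	my_txq_mb2mr(struct my_txq *txq, struct rte_mbuf *mb)
	{
		struct mlx5_mr_ctrl *mr_ctrl = &txq->mr_ctrl;
		uintptr_t addr = (uintptr_t)mb->buf_addr;
		uint32_t lkey;

		/* Top-half: inline linear search, last-hit entry first. */
		lkey = mlx5_mr_lookup_lkey(mr_ctrl->cache, &mr_ctrl->mru,
					   MLX5_MR_CACHE_N, addr);
		if (likely(lkey != UINT32_MAX))
			return lkey;
		/* Bottom-half: per-queue B-tree, then the global cache;
		 * may create and register a new MR for a missing address.
		 */
		return mlx5_mr_addr2mr_bh(txq->priv->pd, &txq->priv->mp_id,
					  &txq->priv->share_cache, mr_ctrl,
					  addr, txq->priv->mr_ext_memseg_en);
	}

The returned LKey is already byte-swapped (the cache stores rte_cpu_to_be_32(ibv_mr->lkey), per the struct mr_cache_entry comment), so it can be written into a WQE directly.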

* Re: [dpdk-dev] [PATCH v2 1/4] common/mlx5: refactor MP IPC handling codes to common driver
  2020-04-07 16:48   ` [dpdk-dev] [PATCH v2 1/4] common/mlx5: refactor MP IPC handling " Vu Pham
@ 2020-04-08  9:05     ` Slava Ovsiienko
  0 siblings, 0 replies; 26+ messages in thread
From: Slava Ovsiienko @ 2020-04-08  9:05 UTC (permalink / raw)
  To: Vu Pham, dev; +Cc: Ori Kam, Matan Azrad, Raslan Darawsheh, Vu Pham

> -----Original Message-----
> From: Vu Pham <vuhuong@mellanox.com>
> Sent: Tuesday, April 7, 2020 19:48
> To: dev@dpdk.org
> Cc: Slava Ovsiienko <viacheslavo@mellanox.com>; Ori Kam
> <orika@mellanox.com>; Matan Azrad <matan@mellanox.com>; Raslan
> Darawsheh <rasland@mellanox.com>; Vu Pham <vuhuong@mellanox.com>
> Subject: [PATCH v2 1/4] common/mlx5: refactor MP IPC handling codes to
> common driver
> 
> Refactor common mp handling codes from net pmd to common driver.
> Using port_id as standard input parameter for all MP IPC APIs instead of using
> rte_eth_dev.
> 
> Signed-off-by: Vu Pham <vuhuong@mellanox.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@mellanox.com>

> ---
>  drivers/common/mlx5/mlx5_common_mp.c            | 188
> ++++++++++++++++++++++++
>  drivers/common/mlx5/mlx5_common_mp.h            |  98 ++++++++++++
>  drivers/common/mlx5/rte_common_mlx5_version.map |  13 ++
>  3 files changed, 299 insertions(+)
>  create mode 100644 drivers/common/mlx5/mlx5_common_mp.c
>  create mode 100644 drivers/common/mlx5/mlx5_common_mp.h
> 
> diff --git a/drivers/common/mlx5/mlx5_common_mp.c b/drivers/common/mlx5/mlx5_common_mp.c
> new file mode 100644
> index 0000000000..da55143bc1
> --- /dev/null
> +++ b/drivers/common/mlx5/mlx5_common_mp.c
> @@ -0,0 +1,188 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright 2019 6WIND S.A.
> + * Copyright 2019 Mellanox Technologies, Ltd
> + */
> +
> +#include <stdio.h>
> +#include <time.h>
> +
> +#include <rte_eal.h>
> +#include <rte_errno.h>
> +
> +#include "mlx5_common_mp.h"
> +#include "mlx5_common_utils.h"
> +
> +/**
> + * Request Memory Region creation to the primary process.
> + *
> + * @param[in] mp_id
> + *   ID of the MP process.
> + * @param addr
> + *   Target virtual address to register.
> + *
> + * @return
> + *   0 on success, a negative errno value otherwise and rte_errno is set.
> + */
> +int
> +mlx5_mp_req_mr_create(struct mlx5_mp_id *mp_id, uintptr_t addr)
> +{
> +	struct rte_mp_msg mp_req;
> +	struct rte_mp_msg *mp_res;
> +	struct rte_mp_reply mp_rep;
> +	struct mlx5_mp_param *req = (struct mlx5_mp_param *)mp_req.param;
> +	struct mlx5_mp_param *res;
> +	struct timespec ts = {.tv_sec = MLX5_MP_REQ_TIMEOUT_SEC, .tv_nsec = 0};
> +	int ret;
> +
> +	MLX5_ASSERT(rte_eal_process_type() == RTE_PROC_SECONDARY);
> +	mp_init_msg(mp_id, &mp_req, MLX5_MP_REQ_CREATE_MR);
> +	req->args.addr = addr;
> +	ret = rte_mp_request_sync(&mp_req, &mp_rep, &ts);
> +	if (ret) {
> +		DRV_LOG(ERR, "port %u request to primary process failed",
> +			mp_id->port_id);
> +		return -rte_errno;
> +	}
> +	MLX5_ASSERT(mp_rep.nb_received == 1);
> +	mp_res = &mp_rep.msgs[0];
> +	res = (struct mlx5_mp_param *)mp_res->param;
> +	ret = res->result;
> +	if (ret)
> +		rte_errno = -ret;
> +	free(mp_rep.msgs);
> +	return ret;
> +}
> +
> +/**
> + * Request Verbs queue state modification to the primary process.
> + *
> + * @param[in] mp_id
> + *   ID of the MP process.
> + * @param sm
> + *   State modify parameters.
> + *
> + * @return
> + *   0 on success, a negative errno value otherwise and rte_errno is set.
> + */
> +int
> +mlx5_mp_req_queue_state_modify(struct mlx5_mp_id *mp_id,
> +			       struct mlx5_mp_arg_queue_state_modify *sm)
> +{
> +	struct rte_mp_msg mp_req;
> +	struct rte_mp_msg *mp_res;
> +	struct rte_mp_reply mp_rep;
> +	struct mlx5_mp_param *req = (struct mlx5_mp_param *)mp_req.param;
> +	struct mlx5_mp_param *res;
> +	struct timespec ts = {.tv_sec = MLX5_MP_REQ_TIMEOUT_SEC, .tv_nsec = 0};
> +	int ret;
> +
> +	MLX5_ASSERT(rte_eal_process_type() == RTE_PROC_SECONDARY);
> +	mp_init_msg(mp_id, &mp_req, MLX5_MP_REQ_QUEUE_STATE_MODIFY);
> +	req->args.state_modify = *sm;
> +	ret = rte_mp_request_sync(&mp_req, &mp_rep, &ts);
> +	if (ret) {
> +		DRV_LOG(ERR, "port %u request to primary process failed",
> +			mp_id->port_id);
> +		return -rte_errno;
> +	}
> +	MLX5_ASSERT(mp_rep.nb_received == 1);
> +	mp_res = &mp_rep.msgs[0];
> +	res = (struct mlx5_mp_param *)mp_res->param;
> +	ret = res->result;
> +	free(mp_rep.msgs);
> +	return ret;
> +}
> +
> +/**
> + * Request Verbs command file descriptor for mmap to the primary process.
> + *
> + * @param[in] mp_id
> + *   ID of the MP process.
> + *
> + * @return
> + *   fd on success, a negative errno value otherwise and rte_errno is set.
> + */
> +int
> +mlx5_mp_req_verbs_cmd_fd(struct mlx5_mp_id *mp_id)
> +{
> +	struct rte_mp_msg mp_req;
> +	struct rte_mp_msg *mp_res;
> +	struct rte_mp_reply mp_rep;
> +	struct mlx5_mp_param *res;
> +	struct timespec ts = {.tv_sec = MLX5_MP_REQ_TIMEOUT_SEC, .tv_nsec = 0};
> +	int ret;
> +
> +	MLX5_ASSERT(rte_eal_process_type() == RTE_PROC_SECONDARY);
> +	mp_init_msg(mp_id, &mp_req, MLX5_MP_REQ_VERBS_CMD_FD);
> +	ret = rte_mp_request_sync(&mp_req, &mp_rep, &ts);
> +	if (ret) {
> +		DRV_LOG(ERR, "port %u request to primary process failed",
> +			mp_id->port_id);
> +		return -rte_errno;
> +	}
> +	MLX5_ASSERT(mp_rep.nb_received == 1);
> +	mp_res = &mp_rep.msgs[0];
> +	res = (struct mlx5_mp_param *)mp_res->param;
> +	if (res->result) {
> +		rte_errno = -res->result;
> +		DRV_LOG(ERR,
> +			"port %u failed to get command FD from primary process",
> +			mp_id->port_id);
> +		ret = -rte_errno;
> +		goto exit;
> +	}
> +	MLX5_ASSERT(mp_res->num_fds == 1);
> +	ret = mp_res->fds[0];
> +	DRV_LOG(DEBUG, "port %u command FD from primary is %d",
> +		mp_id->port_id, ret);
> +exit:
> +	free(mp_rep.msgs);
> +	return ret;
> +}
> +
> +/**
> + * Initialize by primary process.
> + */
> +int
> +mlx5_mp_init_primary(const char *name, const rte_mp_t primary_action)
> +{
> +	int ret;
> +
> +	MLX5_ASSERT(rte_eal_process_type() == RTE_PROC_PRIMARY);
> +
> +	/* primary is allowed to not support IPC */
> +	ret = rte_mp_action_register(name, primary_action);
> +	if (ret && rte_errno != ENOTSUP)
> +		return -1;
> +	return 0;
> +}
> +
> +/**
> + * Un-initialize by primary process.
> + */
> +void
> +mlx5_mp_uninit_primary(const char *name)
> +{
> +	MLX5_ASSERT(rte_eal_process_type() == RTE_PROC_PRIMARY);
> +	rte_mp_action_unregister(name);
> +}
> +
> +/**
> + * Initialize by secondary process.
> + */
> +int
> +mlx5_mp_init_secondary(const char *name, const rte_mp_t secondary_action)
> +{
> +	MLX5_ASSERT(rte_eal_process_type() == RTE_PROC_SECONDARY);
> +	return rte_mp_action_register(name, secondary_action);
> +}
> +
> +/**
> + * Un-initialize by secondary process.
> + */
> +void
> +mlx5_mp_uninit_secondary(const char *name)
> +{
> +	MLX5_ASSERT(rte_eal_process_type() == RTE_PROC_SECONDARY);
> +	rte_mp_action_unregister(name);
> +}
> diff --git a/drivers/common/mlx5/mlx5_common_mp.h b/drivers/common/mlx5/mlx5_common_mp.h
> new file mode 100644
> index 0000000000..7aab77acb2
> --- /dev/null
> +++ b/drivers/common/mlx5/mlx5_common_mp.h
> @@ -0,0 +1,98 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright 2018 6WIND S.A.
> + * Copyright 2018 Mellanox Technologies, Ltd
> + */
> +
> +#ifndef RTE_PMD_MLX5_COMMON_MP_H_
> +#define RTE_PMD_MLX5_COMMON_MP_H_
> +
> +/* Verbs header. */
> +/* ISO C doesn't support unnamed structs/unions, disabling -pedantic. */
> +#ifdef PEDANTIC
> +#pragma GCC diagnostic ignored "-Wpedantic"
> +#endif
> +#include <infiniband/verbs.h>
> +#ifdef PEDANTIC
> +#pragma GCC diagnostic error "-Wpedantic"
> +#endif
> +
> +#include <rte_eal.h>
> +#include <rte_string_fns.h>
> +
> +/* Request types for IPC. */
> +enum mlx5_mp_req_type {
> +	MLX5_MP_REQ_VERBS_CMD_FD = 1,
> +	MLX5_MP_REQ_CREATE_MR,
> +	MLX5_MP_REQ_START_RXTX,
> +	MLX5_MP_REQ_STOP_RXTX,
> +	MLX5_MP_REQ_QUEUE_STATE_MODIFY,
> +};
> +
> +struct mlx5_mp_arg_queue_state_modify {
> +	uint8_t is_wq; /* Set if WQ. */
> +	uint16_t queue_id; /* DPDK queue ID. */
> +	enum ibv_wq_state state; /* WQ requested state. */
> +};
> +
> +/* Parameters for IPC. */
> +struct mlx5_mp_param {
> +	enum mlx5_mp_req_type type;
> +	int port_id;
> +	int result;
> +	RTE_STD_C11
> +	union {
> +		uintptr_t addr; /* MLX5_MP_REQ_CREATE_MR */
> +		struct mlx5_mp_arg_queue_state_modify state_modify;
> +		/* MLX5_MP_REQ_QUEUE_STATE_MODIFY */
> +	} args;
> +};
> +
> +/* Identifier of an MP process. */
> +struct mlx5_mp_id {
> +	char name[RTE_MP_MAX_NAME_LEN];
> +	uint16_t port_id;
> +};
> +
> +/** Request timeout for IPC. */
> +#define MLX5_MP_REQ_TIMEOUT_SEC 5
> +
> +/**
> + * Initialize IPC message.
> + *
> + * @param[in] mp_id
> + *   ID of the MP process (IPC name and port ID).
> + * @param[out] msg
> + *   Pointer to message to fill in.
> + * @param[in] type
> + *   Message type.
> + */
> +static inline void
> +mp_init_msg(struct mlx5_mp_id *mp_id, struct rte_mp_msg *msg,
> +	    enum mlx5_mp_req_type type)
> +{
> +	struct mlx5_mp_param *param = (struct mlx5_mp_param *)msg->param;
> +
> +	memset(msg, 0, sizeof(*msg));
> +	strlcpy(msg->name, mp_id->name, sizeof(msg->name));
> +	msg->len_param = sizeof(*param);
> +	param->type = type;
> +	param->port_id = mp_id->port_id;
> +}
> +
> +__rte_experimental
> +int mlx5_mp_init_primary(const char *name, const rte_mp_t primary_action);
> +__rte_experimental
> +void mlx5_mp_uninit_primary(const char *name);
> +__rte_experimental
> +int mlx5_mp_init_secondary(const char *name, const rte_mp_t secondary_action);
> +__rte_experimental
> +void mlx5_mp_uninit_secondary(const char *name);
> +__rte_experimental
> +int mlx5_mp_req_mr_create(struct mlx5_mp_id *mp_id, uintptr_t addr);
> +__rte_experimental
> +int mlx5_mp_req_queue_state_modify(struct mlx5_mp_id *mp_id,
> +				   struct mlx5_mp_arg_queue_state_modify *sm);
> +__rte_experimental
> +int mlx5_mp_req_verbs_cmd_fd(struct mlx5_mp_id *mp_id);
> +
> +#endif /* RTE_PMD_MLX5_COMMON_MP_H_ */
> diff --git a/drivers/common/mlx5/rte_common_mlx5_version.map b/drivers/common/mlx5/rte_common_mlx5_version.map
> index aede2a0a51..265703d1c9 100644
> --- a/drivers/common/mlx5/rte_common_mlx5_version.map
> +++ b/drivers/common/mlx5/rte_common_mlx5_version.map
> @@ -48,4 +48,17 @@ DPDK_20.0.1 {
>  	mlx5_nl_vlan_vmwa_delete;
> 
>  	mlx5_translate_port_name;
> +
> +};
> +
> +EXPERIMENTAL {
> +        global:
> +
> +	mlx5_mp_init_primary;
> +	mlx5_mp_uninit_primary;
> +	mlx5_mp_init_secondary;
> +	mlx5_mp_uninit_secondary;
> +	mlx5_mp_req_mr_create;
> +	mlx5_mp_req_queue_state_modify;
> +	mlx5_mp_req_verbs_cmd_fd;
>  };
> --
> 2.16.6


^ permalink raw reply	[flat|nested] 26+ messages in thread
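
For readers wiring a new mlx5-based PMD to these APIs: a secondary process identifies itself with the mp_id tuple, and the request helpers perform a synchronous rte_mp round trip to the primary. A minimal sketch, assuming the net PMD's IPC key MLX5_MP_NAME ("net_mlx5_mp" in mlx5.h) and with error handling trimmed; request_mr_from_primary() is a hypothetical wrapper, not a symbol from the patch.

	#include <rte_string_fns.h>
	#include <mlx5_common_mp.h>

	static int
	request_mr_from_primary(uint16_t port_id, uintptr_t addr)
	{
		struct mlx5_mp_id mp_id;

		/* The {name, port_id} tuple replaces the old rte_eth_dev
		 * argument, decoupling the IPC layer from ethdev.
		 */
		strlcpy(mp_id.name, MLX5_MP_NAME, RTE_MP_MAX_NAME_LEN);
		mp_id.port_id = port_id;
		/* Blocks up to MLX5_MP_REQ_TIMEOUT_SEC seconds;
		 * returns 0 on success, a negative errno on failure.
		 */
		return mlx5_mp_req_mr_create(&mp_id, addr);
	}

In the net PMD the equivalent call sites simply use priv->mp_id, which is filled once at device spawn.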

* Re: [dpdk-dev] [PATCH v3 1/4] common/mlx5: refactor multi-process IPC handling codes to common driver
  2020-04-07 17:00   ` [dpdk-dev] [PATCH v3 1/4] common/mlx5: refactor multi-process IPC handling " Vu Pham
@ 2020-04-08  9:05     ` Slava Ovsiienko
  0 siblings, 0 replies; 26+ messages in thread
From: Slava Ovsiienko @ 2020-04-08  9:05 UTC (permalink / raw)
  To: Vu Pham, dev; +Cc: Ori Kam, Matan Azrad, Raslan Darawsheh, Vu Pham

> -----Original Message-----
> From: Vu Pham <vuhuong@mellanox.com>
> Sent: Tuesday, April 7, 2020 20:01
> To: dev@dpdk.org
> Cc: Slava Ovsiienko <viacheslavo@mellanox.com>; Ori Kam
> <orika@mellanox.com>; Matan Azrad <matan@mellanox.com>; Raslan
> Darawsheh <rasland@mellanox.com>; Vu Pham <vuhuong@mellanox.com>
> Subject: [PATCH v3 1/4] common/mlx5: refactor multi-process IPC handling
> codes to common driver
> 
> Refactor common multi-process handling codes from net PMD to common
> driver. Using tuple mp_id{name, port_id} as standard input parameter for all
> multi-process IPC APIs instead of using rte_eth_dev.
> 
> Signed-off-by: Vu Pham <vuhuong@mellanox.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@mellanox.com>

> ---
>  drivers/common/mlx5/mlx5_common_mp.c            | 188
> ++++++++++++++++++++++++
>  drivers/common/mlx5/mlx5_common_mp.h            |  98 ++++++++++++
>  drivers/common/mlx5/rte_common_mlx5_version.map |  13 ++
>  3 files changed, 299 insertions(+)
>  create mode 100644 drivers/common/mlx5/mlx5_common_mp.c
>  create mode 100644 drivers/common/mlx5/mlx5_common_mp.h
> 
> diff --git a/drivers/common/mlx5/mlx5_common_mp.c b/drivers/common/mlx5/mlx5_common_mp.c
> new file mode 100644
> index 0000000000..da55143bc1
> --- /dev/null
> +++ b/drivers/common/mlx5/mlx5_common_mp.c
> @@ -0,0 +1,188 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright 2019 6WIND S.A.
> + * Copyright 2019 Mellanox Technologies, Ltd
> + */
> +
> +#include <stdio.h>
> +#include <time.h>
> +
> +#include <rte_eal.h>
> +#include <rte_errno.h>
> +
> +#include "mlx5_common_mp.h"
> +#include "mlx5_common_utils.h"
> +
> +/**
> + * Request Memory Region creation to the primary process.
> + *
> + * @param[in] mp_id
> + *   ID of the MP process.
> + * @param addr
> + *   Target virtual address to register.
> + *
> + * @return
> + *   0 on success, a negative errno value otherwise and rte_errno is set.
> + */
> +int
> +mlx5_mp_req_mr_create(struct mlx5_mp_id *mp_id, uintptr_t addr)
> +{
> +	struct rte_mp_msg mp_req;
> +	struct rte_mp_msg *mp_res;
> +	struct rte_mp_reply mp_rep;
> +	struct mlx5_mp_param *req = (struct mlx5_mp_param *)mp_req.param;
> +	struct mlx5_mp_param *res;
> +	struct timespec ts = {.tv_sec = MLX5_MP_REQ_TIMEOUT_SEC, .tv_nsec = 0};
> +	int ret;
> +
> +	MLX5_ASSERT(rte_eal_process_type() == RTE_PROC_SECONDARY);
> +	mp_init_msg(mp_id, &mp_req, MLX5_MP_REQ_CREATE_MR);
> +	req->args.addr = addr;
> +	ret = rte_mp_request_sync(&mp_req, &mp_rep, &ts);
> +	if (ret) {
> +		DRV_LOG(ERR, "port %u request to primary process failed",
> +			mp_id->port_id);
> +		return -rte_errno;
> +	}
> +	MLX5_ASSERT(mp_rep.nb_received == 1);
> +	mp_res = &mp_rep.msgs[0];
> +	res = (struct mlx5_mp_param *)mp_res->param;
> +	ret = res->result;
> +	if (ret)
> +		rte_errno = -ret;
> +	free(mp_rep.msgs);
> +	return ret;
> +}
> +
> +/**
> + * Request Verbs queue state modification to the primary process.
> + *
> + * @param[in] mp_id
> + *   ID of the MP process.
> + * @param sm
> + *   State modify parameters.
> + *
> + * @return
> + *   0 on success, a negative errno value otherwise and rte_errno is set.
> + */
> +int
> +mlx5_mp_req_queue_state_modify(struct mlx5_mp_id *mp_id,
> +			       struct mlx5_mp_arg_queue_state_modify *sm)
> +{
> +	struct rte_mp_msg mp_req;
> +	struct rte_mp_msg *mp_res;
> +	struct rte_mp_reply mp_rep;
> +	struct mlx5_mp_param *req = (struct mlx5_mp_param *)mp_req.param;
> +	struct mlx5_mp_param *res;
> +	struct timespec ts = {.tv_sec = MLX5_MP_REQ_TIMEOUT_SEC, .tv_nsec = 0};
> +	int ret;
> +
> +	MLX5_ASSERT(rte_eal_process_type() == RTE_PROC_SECONDARY);
> +	mp_init_msg(mp_id, &mp_req, MLX5_MP_REQ_QUEUE_STATE_MODIFY);
> +	req->args.state_modify = *sm;
> +	ret = rte_mp_request_sync(&mp_req, &mp_rep, &ts);
> +	if (ret) {
> +		DRV_LOG(ERR, "port %u request to primary process failed",
> +			mp_id->port_id);
> +		return -rte_errno;
> +	}
> +	MLX5_ASSERT(mp_rep.nb_received == 1);
> +	mp_res = &mp_rep.msgs[0];
> +	res = (struct mlx5_mp_param *)mp_res->param;
> +	ret = res->result;
> +	free(mp_rep.msgs);
> +	return ret;
> +}
> +
> +/**
> + * Request Verbs command file descriptor for mmap to the primary process.
> + *
> + * @param[in] mp_id
> + *   ID of the MP process.
> + *
> + * @return
> + *   fd on success, a negative errno value otherwise and rte_errno is set.
> + */
> +int
> +mlx5_mp_req_verbs_cmd_fd(struct mlx5_mp_id *mp_id)
> +{
> +	struct rte_mp_msg mp_req;
> +	struct rte_mp_msg *mp_res;
> +	struct rte_mp_reply mp_rep;
> +	struct mlx5_mp_param *res;
> +	struct timespec ts = {.tv_sec = MLX5_MP_REQ_TIMEOUT_SEC, .tv_nsec = 0};
> +	int ret;
> +
> +	MLX5_ASSERT(rte_eal_process_type() == RTE_PROC_SECONDARY);
> +	mp_init_msg(mp_id, &mp_req, MLX5_MP_REQ_VERBS_CMD_FD);
> +	ret = rte_mp_request_sync(&mp_req, &mp_rep, &ts);
> +	if (ret) {
> +		DRV_LOG(ERR, "port %u request to primary process failed",
> +			mp_id->port_id);
> +		return -rte_errno;
> +	}
> +	MLX5_ASSERT(mp_rep.nb_received == 1);
> +	mp_res = &mp_rep.msgs[0];
> +	res = (struct mlx5_mp_param *)mp_res->param;
> +	if (res->result) {
> +		rte_errno = -res->result;
> +		DRV_LOG(ERR,
> +			"port %u failed to get command FD from primary process",
> +			mp_id->port_id);
> +		ret = -rte_errno;
> +		goto exit;
> +	}
> +	MLX5_ASSERT(mp_res->num_fds == 1);
> +	ret = mp_res->fds[0];
> +	DRV_LOG(DEBUG, "port %u command FD from primary is %d",
> +		mp_id->port_id, ret);
> +exit:
> +	free(mp_rep.msgs);
> +	return ret;
> +}
> +
> +/**
> + * Initialize by primary process.
> + */
> +int
> +mlx5_mp_init_primary(const char *name, const rte_mp_t primary_action)
> +{
> +	int ret;
> +
> +	MLX5_ASSERT(rte_eal_process_type() == RTE_PROC_PRIMARY);
> +
> +	/* primary is allowed to not support IPC */
> +	ret = rte_mp_action_register(name, primary_action);
> +	if (ret && rte_errno != ENOTSUP)
> +		return -1;
> +	return 0;
> +}
> +
> +/**
> + * Un-initialize by primary process.
> + */
> +void
> +mlx5_mp_uninit_primary(const char *name)
> +{
> +	MLX5_ASSERT(rte_eal_process_type() == RTE_PROC_PRIMARY);
> +	rte_mp_action_unregister(name);
> +}
> +
> +/**
> + * Initialize by secondary process.
> + */
> +int
> +mlx5_mp_init_secondary(const char *name, const rte_mp_t secondary_action)
> +{
> +	MLX5_ASSERT(rte_eal_process_type() == RTE_PROC_SECONDARY);
> +	return rte_mp_action_register(name, secondary_action);
> +}
> +
> +/**
> + * Un-initialize by secondary process.
> + */
> +void
> +mlx5_mp_uninit_secondary(const char *name)
> +{
> +	MLX5_ASSERT(rte_eal_process_type() == RTE_PROC_SECONDARY);
> +	rte_mp_action_unregister(name);
> +}
> diff --git a/drivers/common/mlx5/mlx5_common_mp.h b/drivers/common/mlx5/mlx5_common_mp.h
> new file mode 100644
> index 0000000000..7aab77acb2
> --- /dev/null
> +++ b/drivers/common/mlx5/mlx5_common_mp.h
> @@ -0,0 +1,98 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright 2018 6WIND S.A.
> + * Copyright 2018 Mellanox Technologies, Ltd
> + */
> +
> +#ifndef RTE_PMD_MLX5_COMMON_MP_H_
> +#define RTE_PMD_MLX5_COMMON_MP_H_
> +
> +/* Verbs header. */
> +/* ISO C doesn't support unnamed structs/unions, disabling -pedantic. */
> +#ifdef PEDANTIC
> +#pragma GCC diagnostic ignored "-Wpedantic"
> +#endif
> +#include <infiniband/verbs.h>
> +#ifdef PEDANTIC
> +#pragma GCC diagnostic error "-Wpedantic"
> +#endif
> +
> +#include <rte_eal.h>
> +#include <rte_string_fns.h>
> +
> +/* Request types for IPC. */
> +enum mlx5_mp_req_type {
> +	MLX5_MP_REQ_VERBS_CMD_FD = 1,
> +	MLX5_MP_REQ_CREATE_MR,
> +	MLX5_MP_REQ_START_RXTX,
> +	MLX5_MP_REQ_STOP_RXTX,
> +	MLX5_MP_REQ_QUEUE_STATE_MODIFY,
> +};
> +
> +struct mlx5_mp_arg_queue_state_modify {
> +	uint8_t is_wq; /* Set if WQ. */
> +	uint16_t queue_id; /* DPDK queue ID. */
> +	enum ibv_wq_state state; /* WQ requested state. */
> +};
> +
> +/* Parameters for IPC. */
> +struct mlx5_mp_param {
> +	enum mlx5_mp_req_type type;
> +	int port_id;
> +	int result;
> +	RTE_STD_C11
> +	union {
> +		uintptr_t addr; /* MLX5_MP_REQ_CREATE_MR */
> +		struct mlx5_mp_arg_queue_state_modify state_modify;
> +		/* MLX5_MP_REQ_QUEUE_STATE_MODIFY */
> +	} args;
> +};
> +
> +/* Identifier of an MP process. */
> +struct mlx5_mp_id {
> +	char name[RTE_MP_MAX_NAME_LEN];
> +	uint16_t port_id;
> +};
> +
> +/** Request timeout for IPC. */
> +#define MLX5_MP_REQ_TIMEOUT_SEC 5
> +
> +/**
> + * Initialize IPC message.
> + *
> + * @param[in] mp_id
> + *   ID of the MP process (IPC name and port ID).
> + * @param[out] msg
> + *   Pointer to message to fill in.
> + * @param[in] type
> + *   Message type.
> + */
> +static inline void
> +mp_init_msg(struct mlx5_mp_id *mp_id, struct rte_mp_msg *msg,
> +	    enum mlx5_mp_req_type type)
> +{
> +	struct mlx5_mp_param *param = (struct mlx5_mp_param *)msg->param;
> +
> +	memset(msg, 0, sizeof(*msg));
> +	strlcpy(msg->name, mp_id->name, sizeof(msg->name));
> +	msg->len_param = sizeof(*param);
> +	param->type = type;
> +	param->port_id = mp_id->port_id;
> +}
> +
> +__rte_experimental
> +int mlx5_mp_init_primary(const char *name, const rte_mp_t primary_action);
> +__rte_experimental
> +void mlx5_mp_uninit_primary(const char *name);
> +__rte_experimental
> +int mlx5_mp_init_secondary(const char *name, const rte_mp_t secondary_action);
> +__rte_experimental
> +void mlx5_mp_uninit_secondary(const char *name);
> +__rte_experimental
> +int mlx5_mp_req_mr_create(struct mlx5_mp_id *mp_id, uintptr_t addr);
> +__rte_experimental
> +int mlx5_mp_req_queue_state_modify(struct mlx5_mp_id *mp_id,
> +				   struct mlx5_mp_arg_queue_state_modify *sm);
> +__rte_experimental
> +int mlx5_mp_req_verbs_cmd_fd(struct mlx5_mp_id *mp_id);
> +
> +#endif /* RTE_PMD_MLX5_COMMON_MP_H_ */
> diff --git a/drivers/common/mlx5/rte_common_mlx5_version.map b/drivers/common/mlx5/rte_common_mlx5_version.map
> index aede2a0a51..265703d1c9 100644
> --- a/drivers/common/mlx5/rte_common_mlx5_version.map
> +++ b/drivers/common/mlx5/rte_common_mlx5_version.map
> @@ -48,4 +48,17 @@ DPDK_20.0.1 {
>  	mlx5_nl_vlan_vmwa_delete;
> 
>  	mlx5_translate_port_name;
> +
> +};
> +
> +EXPERIMENTAL {
> +        global:
> +
> +	mlx5_mp_init_primary;
> +	mlx5_mp_uninit_primary;
> +	mlx5_mp_init_secondary;
> +	mlx5_mp_uninit_secondary;
> +	mlx5_mp_req_mr_create;
> +	mlx5_mp_req_queue_state_modify;
> +	mlx5_mp_req_verbs_cmd_fd;
>  };
> --
> 2.16.6


^ permalink raw reply	[flat|nested] 26+ messages in thread
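
The registration half of the same API: each PMD registers its own handlers under its own IPC key, which is what lets the net PMD and future regex/vdpa PMDs share one HCA without colliding on rte_mp action names. A sketch under assumed names; MY_PMD_MP_NAME and both handler bodies are illustrative, not part of the patch.

	#include <mlx5_common_mp.h>

	#define MY_PMD_MP_NAME "my_mlx5_pmd_mp" /* hypothetical IPC key */

	/* Real handlers would dispatch on the request type carried in
	 * ((const struct mlx5_mp_param *)mp_msg->param)->type; stubbed here.
	 */
	static int
	my_primary_handle(const struct rte_mp_msg *mp_msg, const void *peer)
	{
		(void)mp_msg;
		(void)peer;
		return 0;
	}

	static int
	my_secondary_handle(const struct rte_mp_msg *mp_msg, const void *peer)
	{
		(void)mp_msg;
		(void)peer;
		return 0;
	}

	static int
	my_pmd_ipc_init(void)
	{
		if (rte_eal_process_type() == RTE_PROC_PRIMARY)
			return mlx5_mp_init_primary(MY_PMD_MP_NAME,
						    my_primary_handle);
		return mlx5_mp_init_secondary(MY_PMD_MP_NAME,
					      my_secondary_handle);
	}

Note that mlx5_mp_init_primary() tolerates rte_errno == ENOTSUP, so a primary process built without IPC support still initializes cleanly.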

* Re: [dpdk-dev] [PATCH v3 2/4] net/mlx5: modify net PMD to use common multi-process APIs
  2020-04-07 17:00   ` [dpdk-dev] [PATCH v3 2/4] net/mlx5: modify net PMD to use common multi-process APIs Vu Pham
@ 2020-04-08  9:05     ` Slava Ovsiienko
  0 siblings, 0 replies; 26+ messages in thread
From: Slava Ovsiienko @ 2020-04-08  9:05 UTC (permalink / raw)
  To: Vu Pham, dev; +Cc: Ori Kam, Matan Azrad, Raslan Darawsheh, Vu Pham

> -----Original Message-----
> From: Vu Pham <vuhuong@mellanox.com>
> Sent: Tuesday, April 7, 2020 20:01
> To: dev@dpdk.org
> Cc: Slava Ovsiienko <viacheslavo@mellanox.com>; Ori Kam
> <orika@mellanox.com>; Matan Azrad <matan@mellanox.com>; Raslan
> Darawsheh <rasland@mellanox.com>; Vu Pham <vuhuong@mellanox.com>
> Subject: [PATCH v3 2/4] net/mlx5: modify net PMD to use common multi-
> process APIs
> 
> Modify net PMD to use multi-process APIs from common driver.
> 
> Signed-off-by: Vu Pham <vuhuong@mellanox.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@mellanox.com>

> ---
>  drivers/common/mlx5/Makefile    |   3 +-
>  drivers/common/mlx5/meson.build |   1 +
>  drivers/net/mlx5/mlx5.c         |  15 ++-
>  drivers/net/mlx5/mlx5.h         |  43 +-------
>  drivers/net/mlx5/mlx5_mp.c      | 234 +++-------------------------------------
>  drivers/net/mlx5/mlx5_mr.c      |   2 +-
>  drivers/net/mlx5/mlx5_rxtx.c    |   3 +-
>  7 files changed, 37 insertions(+), 264 deletions(-)
> 
> diff --git a/drivers/common/mlx5/Makefile b/drivers/common/mlx5/Makefile
> index f32933d592..2a88492731 100644
> --- a/drivers/common/mlx5/Makefile
> +++ b/drivers/common/mlx5/Makefile
> @@ -17,6 +17,7 @@ endif
>  SRCS-y += mlx5_devx_cmds.c
>  SRCS-y += mlx5_common.c
>  SRCS-y += mlx5_nl.c
> +SRCS-y += mlx5_common_mp.c
>  ifeq ($(CONFIG_RTE_IBVERBS_LINK_DLOPEN),y)
>  INSTALL-y-lib += $(LIB_GLUE)
>  endif
> @@ -46,7 +47,7 @@ endif
>  LDLIBS += -lrte_eal -lrte_pci -lrte_kvargs -lrte_net
> 
>  # A few warnings cannot be avoided in external headers.
> -CFLAGS += -Wno-error=cast-qual -UPEDANTIC
> +CFLAGS += -Wno-error=cast-qual  -UPEDANTIC -DALLOW_EXPERIMENTAL_API
> 
>  EXPORT_MAP := rte_common_mlx5_version.map
> 
> diff --git a/drivers/common/mlx5/meson.build b/drivers/common/mlx5/meson.build
> index f671710714..83671861c9 100644
> --- a/drivers/common/mlx5/meson.build
> +++ b/drivers/common/mlx5/meson.build
> @@ -55,6 +55,7 @@ sources = files(
>  	'mlx5_devx_cmds.c',
>  	'mlx5_common.c',
>  	'mlx5_nl.c',
> +	'mlx5_common_mp.c',
>  )
>  if not dlopen_ibverbs
>  	sources += files('mlx5_glue.c')
> diff --git a/drivers/net/mlx5/mlx5.c b/drivers/net/mlx5/mlx5.c
> index 6a11b141da..9eac8011f3 100644
> --- a/drivers/net/mlx5/mlx5.c
> +++ b/drivers/net/mlx5/mlx5.c
> @@ -38,6 +38,7 @@
>  #include <mlx5_glue.h>
>  #include <mlx5_devx_cmds.h>
>  #include <mlx5_common.h>
> +#include <mlx5_common_mp.h>
> 
>  #include "mlx5_defs.h"
>  #include "mlx5.h"
> @@ -1714,7 +1715,8 @@ mlx5_init_once(void)
>  		rte_rwlock_init(&sd->mem_event_rwlock);
>  		rte_mem_event_callback_register("MLX5_MEM_EVENT_CB",
>  						mlx5_mr_mem_event_cb, NULL);
> -		ret = mlx5_mp_init_primary();
> +		ret = mlx5_mp_init_primary(MLX5_MP_NAME,
> +					   mlx5_mp_primary_handle);
>  		if (ret)
>  			goto out;
>  		sd->init_done = true;
> @@ -1722,7 +1724,8 @@ mlx5_init_once(void)
>  	case RTE_PROC_SECONDARY:
>  		if (ld->init_done)
>  			break;
> -		ret = mlx5_mp_init_secondary();
> +		ret = mlx5_mp_init_secondary(MLX5_MP_NAME,
> +					     mlx5_mp_secondary_handle);
>  		if (ret)
>  			goto out;
>  		++sd->secondary_cnt;
> @@ -2197,6 +2200,8 @@ mlx5_dev_spawn(struct rte_device *dpdk_dev,
>  	}
>  	DRV_LOG(DEBUG, "naming Ethernet device \"%s\"", name);
>  	if (rte_eal_process_type() == RTE_PROC_SECONDARY) {
> +		struct mlx5_mp_id mp_id;
> +
>  		eth_dev = rte_eth_dev_attach_secondary(name);
>  		if (eth_dev == NULL) {
>  			DRV_LOG(ERR, "can not attach rte ethdev");
> @@ -2208,8 +2213,10 @@ mlx5_dev_spawn(struct rte_device *dpdk_dev,
>  		err = mlx5_proc_priv_init(eth_dev);
>  		if (err)
>  			return NULL;
> +		mp_id.port_id = eth_dev->data->port_id;
> +		strlcpy(mp_id.name, MLX5_MP_NAME, RTE_MP_MAX_NAME_LEN);
>  		/* Receive command fd from primary process */
> -		err = mlx5_mp_req_verbs_cmd_fd(eth_dev);
> +		err = mlx5_mp_req_verbs_cmd_fd(&mp_id);
>  		if (err < 0)
>  			return NULL;
>  		/* Remap UAR for Tx queues. */
> @@ -2373,6 +2380,8 @@ mlx5_dev_spawn(struct rte_device *dpdk_dev,
>  	priv->ibv_port = spawn->ibv_port;
>  	priv->pci_dev = spawn->pci_dev;
>  	priv->mtu = RTE_ETHER_MTU;
> +	priv->mp_id.port_id = port_id;
> +	strlcpy(priv->mp_id.name, MLX5_MP_NAME, RTE_MP_MAX_NAME_LEN);
>  #ifndef RTE_ARCH_64
>  	/* Initialize UAR access locks for 32bit implementations. */
>  	rte_spinlock_init(&priv->uar_lock_cq);
> diff --git a/drivers/net/mlx5/mlx5.h b/drivers/net/mlx5/mlx5.h
> index 34ab4758b1..9e15600afd 100644
> --- a/drivers/net/mlx5/mlx5.h
> +++ b/drivers/net/mlx5/mlx5.h
> @@ -36,43 +36,13 @@
>  #include <mlx5_devx_cmds.h>
>  #include <mlx5_prm.h>
>  #include <mlx5_nl.h>
> +#include <mlx5_common_mp.h>
> 
>  #include "mlx5_defs.h"
>  #include "mlx5_utils.h"
>  #include "mlx5_mr.h"
>  #include "mlx5_autoconf.h"
> 
> -/* Request types for IPC. */
> -enum mlx5_mp_req_type {
> -	MLX5_MP_REQ_VERBS_CMD_FD = 1,
> -	MLX5_MP_REQ_CREATE_MR,
> -	MLX5_MP_REQ_START_RXTX,
> -	MLX5_MP_REQ_STOP_RXTX,
> -	MLX5_MP_REQ_QUEUE_STATE_MODIFY,
> -};
> -
> -struct mlx5_mp_arg_queue_state_modify {
> -	uint8_t is_wq; /* Set if WQ. */
> -	uint16_t queue_id; /* DPDK queue ID. */
> -	enum ibv_wq_state state; /* WQ requested state. */
> -};
> -
> -/* Pameters for IPC. */
> -struct mlx5_mp_param {
> -	enum mlx5_mp_req_type type;
> -	int port_id;
> -	int result;
> -	RTE_STD_C11
> -	union {
> -		uintptr_t addr; /* MLX5_MP_REQ_CREATE_MR */
> -		struct mlx5_mp_arg_queue_state_modify state_modify;
> -		/* MLX5_MP_REQ_QUEUE_STATE_MODIFY */
> -	} args;
> -};
> -
> -/** Request timeout for IPC. */
> -#define MLX5_MP_REQ_TIMEOUT_SEC 5
> -
>  /** Key string for IPC. */
>  #define MLX5_MP_NAME "net_mlx5_mp"
> 
> @@ -561,6 +531,7 @@ struct mlx5_priv {
>  #endif
>  	uint8_t skip_default_rss_reta; /* Skip configuration of default reta. */
>  	uint8_t fdb_def_rule; /* Whether fdb jump to table 1 is configured. */
> +	struct mlx5_mp_id mp_id; /* ID of a multi-process process */
>  };
> 
>  #define PORT_ID(priv) ((priv)->dev_data->port_id)
> @@ -761,16 +732,10 @@ int mlx5_flow_dev_dump(struct rte_eth_dev *dev, FILE *file,
>  		       struct rte_flow_error *error);
> 
>  /* mlx5_mp.c */
> +int mlx5_mp_primary_handle(const struct rte_mp_msg *mp_msg, const void
> +*peer); int mlx5_mp_secondary_handle(const struct rte_mp_msg *mp_msg,
> +const void *peer);
>  void mlx5_mp_req_start_rxtx(struct rte_eth_dev *dev);  void
> mlx5_mp_req_stop_rxtx(struct rte_eth_dev *dev); -int
> mlx5_mp_req_mr_create(struct rte_eth_dev *dev, uintptr_t addr); -int
> mlx5_mp_req_verbs_cmd_fd(struct rte_eth_dev *dev); -int
> mlx5_mp_req_queue_state_modify(struct rte_eth_dev *dev,
> -				   struct mlx5_mp_arg_queue_state_modify
> *sm);
> -int mlx5_mp_init_primary(void);
> -void mlx5_mp_uninit_primary(void);
> -int mlx5_mp_init_secondary(void);
> -void mlx5_mp_uninit_secondary(void);
> 
>  /* mlx5_socket.c */
> 
> diff --git a/drivers/net/mlx5/mlx5_mp.c b/drivers/net/mlx5/mlx5_mp.c
> index 55d408fe95..43684dbc3a 100644
> --- a/drivers/net/mlx5/mlx5_mp.c
> +++ b/drivers/net/mlx5/mlx5_mp.c
> @@ -10,46 +10,14 @@
>  #include <rte_ethdev_driver.h>
>  #include <rte_string_fns.h>
> 
> +#include <mlx5_common_mp.h>
> +
>  #include "mlx5.h"
>  #include "mlx5_rxtx.h"
>  #include "mlx5_utils.h"
> 
> -/**
> - * Initialize IPC message.
> - *
> - * @param[in] dev
> - *   Pointer to Ethernet structure.
> - * @param[out] msg
> - *   Pointer to message to fill in.
> - * @param[in] type
> - *   Message type.
> - */
> -static inline void
> -mp_init_msg(struct rte_eth_dev *dev, struct rte_mp_msg *msg,
> -	    enum mlx5_mp_req_type type)
> -{
> -	struct mlx5_mp_param *param = (struct mlx5_mp_param *)msg->param;
> -
> -	memset(msg, 0, sizeof(*msg));
> -	strlcpy(msg->name, MLX5_MP_NAME, sizeof(msg->name));
> -	msg->len_param = sizeof(*param);
> -	param->type = type;
> -	param->port_id = dev->data->port_id;
> -}
> -
> -/**
> - * IPC message handler of primary process.
> - *
> - * @param[in] dev
> - *   Pointer to Ethernet structure.
> - * @param[in] peer
> - *   Pointer to the peer socket path.
> - *
> - * @return
> - *   0 on success, a negative errno value otherwise and rte_errno is set.
> - */
> -static int
> -mp_primary_handle(const struct rte_mp_msg *mp_msg, const void *peer)
> +int
> +mlx5_mp_primary_handle(const struct rte_mp_msg *mp_msg, const void *peer)
>  {
>  	struct rte_mp_msg mp_res;
>  	struct mlx5_mp_param *res = (struct mlx5_mp_param *)mp_res.param;
> @@ -71,21 +39,21 @@ mp_primary_handle(const struct rte_mp_msg *mp_msg, const void *peer)
>  	priv = dev->data->dev_private;
>  	switch (param->type) {
>  	case MLX5_MP_REQ_CREATE_MR:
> -		mp_init_msg(dev, &mp_res, param->type);
> +		mp_init_msg(&priv->mp_id, &mp_res, param->type);
>  		lkey = mlx5_mr_create_primary(dev, &entry, param->args.addr);
>  		if (lkey == UINT32_MAX)
>  			res->result = -rte_errno;
>  		ret = rte_mp_reply(&mp_res, peer);
>  		break;
>  	case MLX5_MP_REQ_VERBS_CMD_FD:
> -		mp_init_msg(dev, &mp_res, param->type);
> +		mp_init_msg(&priv->mp_id, &mp_res, param->type);
>  		mp_res.num_fds = 1;
>  		mp_res.fds[0] = priv->sh->ctx->cmd_fd;
>  		res->result = 0;
>  		ret = rte_mp_reply(&mp_res, peer);
>  		break;
>  	case MLX5_MP_REQ_QUEUE_STATE_MODIFY:
> -		mp_init_msg(dev, &mp_res, param->type);
> +		mp_init_msg(&priv->mp_id, &mp_res, param->type);
>  		res->result = mlx5_queue_state_modify_primary
>  					(dev, &param->args.state_modify);
>  		ret = rte_mp_reply(&mp_res, peer);
> @@ -110,14 +78,15 @@ mp_primary_handle(const struct rte_mp_msg *mp_msg, const void *peer)
>   * @return
>   *   0 on success, a negative errno value otherwise and rte_errno is set.
>   */
> -static int
> -mp_secondary_handle(const struct rte_mp_msg *mp_msg, const void *peer)
> +int
> +mlx5_mp_secondary_handle(const struct rte_mp_msg *mp_msg, const void *peer)
>  {
>  	struct rte_mp_msg mp_res;
>  	struct mlx5_mp_param *res = (struct mlx5_mp_param *)mp_res.param;
>  	const struct mlx5_mp_param *param =
>  		(const struct mlx5_mp_param *)mp_msg->param;
>  	struct rte_eth_dev *dev;
> +	struct mlx5_priv *priv;
>  	int ret;
> 
>  	MLX5_ASSERT(rte_eal_process_type() == RTE_PROC_SECONDARY);
> @@ -127,13 +96,14 @@ mp_secondary_handle(const struct rte_mp_msg *mp_msg, const void *peer)
>  		return -rte_errno;
>  	}
>  	dev = &rte_eth_devices[param->port_id];
> +	priv = dev->data->dev_private;
>  	switch (param->type) {
>  	case MLX5_MP_REQ_START_RXTX:
>  		DRV_LOG(INFO, "port %u starting datapath", dev->data->port_id);
>  		rte_mb();
>  		dev->rx_pkt_burst = mlx5_select_rx_function(dev);
>  		dev->tx_pkt_burst = mlx5_select_tx_function(dev);
> -		mp_init_msg(dev, &mp_res, param->type);
> +		mp_init_msg(&priv->mp_id, &mp_res, param->type);
>  		res->result = 0;
>  		ret = rte_mp_reply(&mp_res, peer);
>  		break;
> @@ -142,7 +112,7 @@ mp_secondary_handle(const struct rte_mp_msg *mp_msg, const void *peer)
>  		dev->rx_pkt_burst = removed_rx_burst;
>  		dev->tx_pkt_burst = removed_tx_burst;
>  		rte_mb();
> -		mp_init_msg(dev, &mp_res, param->type);
> +		mp_init_msg(&priv->mp_id, &mp_res, param->type);
>  		res->result = 0;
>  		ret = rte_mp_reply(&mp_res, peer);
>  		break;
> @@ -171,6 +141,7 @@ mp_req_on_rxtx(struct rte_eth_dev *dev, enum mlx5_mp_req_type type)
>  	struct rte_mp_reply mp_rep;
>  	struct mlx5_mp_param *res;
>  	struct timespec ts = {.tv_sec = MLX5_MP_REQ_TIMEOUT_SEC, .tv_nsec = 0};
> +	struct mlx5_priv *priv = dev->data->dev_private;
>  	int ret;
>  	int i;
> 
> @@ -182,7 +153,7 @@ mp_req_on_rxtx(struct rte_eth_dev *dev, enum mlx5_mp_req_type type)
>  			dev->data->port_id, type);
>  		return;
>  	}
> -	mp_init_msg(dev, &mp_req, type);
> +	mp_init_msg(&priv->mp_id, &mp_req, type);
>  	ret = rte_mp_request_sync(&mp_req, &mp_rep, &ts);
>  	if (ret) {
>  		if (rte_errno != ENOTSUP)
> @@ -234,178 +205,3 @@ mlx5_mp_req_stop_rxtx(struct rte_eth_dev *dev)
>  {
>  	mp_req_on_rxtx(dev, MLX5_MP_REQ_STOP_RXTX);
>  }
> -
> -/**
> - * Request Memory Region creation to the primary process.
> - *
> - * @param[in] dev
> - *   Pointer to Ethernet structure.
> - * @param addr
> - *   Target virtual address to register.
> - *
> - * @return
> - *   0 on success, a negative errno value otherwise and rte_errno is set.
> - */
> -int
> -mlx5_mp_req_mr_create(struct rte_eth_dev *dev, uintptr_t addr)
> -{
> -	struct rte_mp_msg mp_req;
> -	struct rte_mp_msg *mp_res;
> -	struct rte_mp_reply mp_rep;
> -	struct mlx5_mp_param *req = (struct mlx5_mp_param *)mp_req.param;
> -	struct mlx5_mp_param *res;
> -	struct timespec ts = {.tv_sec = MLX5_MP_REQ_TIMEOUT_SEC, .tv_nsec = 0};
> -	int ret;
> -
> -	MLX5_ASSERT(rte_eal_process_type() == RTE_PROC_SECONDARY);
> -	mp_init_msg(dev, &mp_req, MLX5_MP_REQ_CREATE_MR);
> -	req->args.addr = addr;
> -	ret = rte_mp_request_sync(&mp_req, &mp_rep, &ts);
> -	if (ret) {
> -		DRV_LOG(ERR, "port %u request to primary process failed",
> -			dev->data->port_id);
> -		return -rte_errno;
> -	}
> -	MLX5_ASSERT(mp_rep.nb_received == 1);
> -	mp_res = &mp_rep.msgs[0];
> -	res = (struct mlx5_mp_param *)mp_res->param;
> -	ret = res->result;
> -	if (ret)
> -		rte_errno = -ret;
> -	free(mp_rep.msgs);
> -	return ret;
> -}
> -
> -/**
> - * Request Verbs queue state modification to the primary process.
> - *
> - * @param[in] dev
> - *   Pointer to Ethernet structure.
> - * @param sm
> - *   State modify parameters.
> - *
> - * @return
> - *   0 on success, a negative errno value otherwise and rte_errno is set.
> - */
> -int
> -mlx5_mp_req_queue_state_modify(struct rte_eth_dev *dev,
> -			       struct mlx5_mp_arg_queue_state_modify *sm)
> -{
> -	struct rte_mp_msg mp_req;
> -	struct rte_mp_msg *mp_res;
> -	struct rte_mp_reply mp_rep;
> -	struct mlx5_mp_param *req = (struct mlx5_mp_param *)mp_req.param;
> -	struct mlx5_mp_param *res;
> -	struct timespec ts = {.tv_sec = MLX5_MP_REQ_TIMEOUT_SEC, .tv_nsec = 0};
> -	int ret;
> -
> -	MLX5_ASSERT(rte_eal_process_type() == RTE_PROC_SECONDARY);
> -	mp_init_msg(dev, &mp_req, MLX5_MP_REQ_QUEUE_STATE_MODIFY);
> -	req->args.state_modify = *sm;
> -	ret = rte_mp_request_sync(&mp_req, &mp_rep, &ts);
> -	if (ret) {
> -		DRV_LOG(ERR, "port %u request to primary process failed",
> -			dev->data->port_id);
> -		return -rte_errno;
> -	}
> -	MLX5_ASSERT(mp_rep.nb_received == 1);
> -	mp_res = &mp_rep.msgs[0];
> -	res = (struct mlx5_mp_param *)mp_res->param;
> -	ret = res->result;
> -	free(mp_rep.msgs);
> -	return ret;
> -}
> -
> -/**
> - * Request Verbs command file descriptor for mmap to the primary process.
> - *
> - * @param[in] dev
> - *   Pointer to Ethernet structure.
> - *
> - * @return
> - *   fd on success, a negative errno value otherwise and rte_errno is set.
> - */
> -int
> -mlx5_mp_req_verbs_cmd_fd(struct rte_eth_dev *dev)
> -{
> -	struct rte_mp_msg mp_req;
> -	struct rte_mp_msg *mp_res;
> -	struct rte_mp_reply mp_rep;
> -	struct mlx5_mp_param *res;
> -	struct timespec ts = {.tv_sec = MLX5_MP_REQ_TIMEOUT_SEC, .tv_nsec = 0};
> -	int ret;
> -
> -	MLX5_ASSERT(rte_eal_process_type() == RTE_PROC_SECONDARY);
> -	mp_init_msg(dev, &mp_req, MLX5_MP_REQ_VERBS_CMD_FD);
> -	ret = rte_mp_request_sync(&mp_req, &mp_rep, &ts);
> -	if (ret) {
> -		DRV_LOG(ERR, "port %u request to primary process failed",
> -			dev->data->port_id);
> -		return -rte_errno;
> -	}
> -	MLX5_ASSERT(mp_rep.nb_received == 1);
> -	mp_res = &mp_rep.msgs[0];
> -	res = (struct mlx5_mp_param *)mp_res->param;
> -	if (res->result) {
> -		rte_errno = -res->result;
> -		DRV_LOG(ERR,
> -			"port %u failed to get command FD from primary
> process",
> -			dev->data->port_id);
> -		ret = -rte_errno;
> -		goto exit;
> -	}
> -	MLX5_ASSERT(mp_res->num_fds == 1);
> -	ret = mp_res->fds[0];
> -	DRV_LOG(DEBUG, "port %u command FD from primary is %d",
> -		dev->data->port_id, ret);
> -exit:
> -	free(mp_rep.msgs);
> -	return ret;
> -}
> -
> -/**
> - * Initialize by primary process.
> - */
> -int
> -mlx5_mp_init_primary(void)
> -{
> -	int ret;
> -
> -	MLX5_ASSERT(rte_eal_process_type() == RTE_PROC_PRIMARY);
> -
> -	/* primary is allowed to not support IPC */
> -	ret = rte_mp_action_register(MLX5_MP_NAME, mp_primary_handle);
> -	if (ret && rte_errno != ENOTSUP)
> -		return -1;
> -	return 0;
> -}
> -
> -/**
> - * Un-initialize by primary process.
> - */
> -void
> -mlx5_mp_uninit_primary(void)
> -{
> -	MLX5_ASSERT(rte_eal_process_type() == RTE_PROC_PRIMARY);
> -	rte_mp_action_unregister(MLX5_MP_NAME);
> -}
> -
> -/**
> - * Initialize by secondary process.
> - */
> -int
> -mlx5_mp_init_secondary(void)
> -{
> -	MLX5_ASSERT(rte_eal_process_type() == RTE_PROC_SECONDARY);
> -	return rte_mp_action_register(MLX5_MP_NAME, mp_secondary_handle);
> -}
> -
> -/**
> - * Un-initialize by secondary process.
> - */
> -void
> -mlx5_mp_uninit_secondary(void)
> -{
> -	MLX5_ASSERT(rte_eal_process_type() == RTE_PROC_SECONDARY);
> -	rte_mp_action_unregister(MLX5_MP_NAME);
> -}
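
All of the requesters removed above share one request/reply shape around
rte_mp_request_sync(); only the message type and payload differ. A minimal
standalone sketch of that pattern follows; the message name, payload and
helper are illustrative, not the driver's actual structures:

	#include <stdint.h>
	#include <stdlib.h>
	#include <string.h>
	#include <time.h>
	#include <rte_eal.h>
	#include <rte_errno.h>
	#include <rte_string_fns.h>

	#define DEMO_MP_NAME "demo_mp"	/* hypothetical IPC action name */
	#define DEMO_MP_TIMEOUT_SEC 5

	/* Secondary-process side: send one request, wait for one reply. */
	static int
	demo_mp_request(uint64_t payload)
	{
		struct rte_mp_msg mp_req;
		struct rte_mp_reply mp_rep;
		struct timespec ts = {.tv_sec = DEMO_MP_TIMEOUT_SEC, .tv_nsec = 0};
		int ret;

		memset(&mp_req, 0, sizeof(mp_req));
		rte_strlcpy(mp_req.name, DEMO_MP_NAME, sizeof(mp_req.name));
		memcpy(mp_req.param, &payload, sizeof(payload));
		mp_req.len_param = sizeof(payload);
		ret = rte_mp_request_sync(&mp_req, &mp_rep, &ts);
		if (ret)
			return -rte_errno;
		/* Exactly one peer (the primary) is expected to answer. */
		ret = (mp_rep.nb_received == 1) ? 0 : -1;
		free(mp_rep.msgs);	/* the reply array is allocated by EAL */
		return ret;
	}
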
> diff --git a/drivers/net/mlx5/mlx5_mr.c b/drivers/net/mlx5/mlx5_mr.c
> index a8f185a208..9151992a72 100644
> --- a/drivers/net/mlx5/mlx5_mr.c
> +++ b/drivers/net/mlx5/mlx5_mr.c
> @@ -540,7 +540,7 @@ mlx5_mr_create_secondary(struct rte_eth_dev *dev, struct mlx5_mr_cache *entry,
> 
>  	DEBUG("port %u requesting MR creation for address (%p)",
>  	      dev->data->port_id, (void *)addr);
> -	ret = mlx5_mp_req_mr_create(dev, addr);
> +	ret = mlx5_mp_req_mr_create(&priv->mp_id, addr);
>  	if (ret) {
>  		DEBUG("port %u fail to request MR creation for address
> (%p)",
>  		      dev->data->port_id, (void *)addr); diff --git
> a/drivers/net/mlx5/mlx5_rxtx.c b/drivers/net/mlx5/mlx5_rxtx.c index
> f3bf763769..fc7591c2b0 100644
> --- a/drivers/net/mlx5/mlx5_rxtx.c
> +++ b/drivers/net/mlx5/mlx5_rxtx.c
> @@ -1000,6 +1000,7 @@ static int
>  mlx5_queue_state_modify(struct rte_eth_dev *dev,
>  			struct mlx5_mp_arg_queue_state_modify *sm)
>  {
> +	struct mlx5_priv *priv = dev->data->dev_private;
>  	int ret = 0;
> 
>  	switch (rte_eal_process_type()) {
> @@ -1007,7 +1008,7 @@ mlx5_queue_state_modify(struct rte_eth_dev *dev,
>  		ret = mlx5_queue_state_modify_primary(dev, sm);
>  		break;
>  	case RTE_PROC_SECONDARY:
> -		ret = mlx5_mp_req_queue_state_modify(dev, sm);
> +		ret = mlx5_mp_req_queue_state_modify(&priv->mp_id, sm);
>  		break;
>  	default:
>  		break;
> --
> 2.16.6
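
The visible API change in this patch is the first argument of the request
helpers: callers stop passing rte_eth_dev and hand over the mp_id tuple
directly, as the mlx5_mr.c hunk above shows:

	/* before (net PMD private API): */
	ret = mlx5_mp_req_mr_create(dev, addr);
	/* after (common driver API): */
	ret = mlx5_mp_req_mr_create(&priv->mp_id, addr);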


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [dpdk-dev] [PATCH v3 4/4] net/mlx5: modify net PMD to use common MR driver
  2020-04-07 17:00   ` [dpdk-dev] [PATCH v3 4/4] net/mlx5: modify net PMD to use common MR driver Vu Pham
@ 2020-04-08  9:06     ` Slava Ovsiienko
  0 siblings, 0 replies; 26+ messages in thread
From: Slava Ovsiienko @ 2020-04-08  9:06 UTC (permalink / raw)
  To: Vu Pham, dev; +Cc: Ori Kam, Matan Azrad, Raslan Darawsheh, Vu Pham


> -----Original Message-----
> From: Vu Pham <vuhuong@mellanox.com>
> Sent: Tuesday, April 7, 2020 20:01
> To: dev@dpdk.org
> Cc: Slava Ovsiienko <viacheslavo@mellanox.com>; Ori Kam
> <orika@mellanox.com>; Matan Azrad <matan@mellanox.com>; Raslan
> Darawsheh <rasland@mellanox.com>; Vu Pham <vuhuong@mellanox.com>
> Subject: [PATCH v3 4/4] net/mlx5: modify net PMD to use common MR driver
> 
> Modify the mlx5 net PMD to use the MR management APIs from the common driver.
> 
> Signed-off-by: Vu Pham <vuhuong@mellanox.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@mellanox.com>

> ---
>  drivers/common/mlx5/Makefile     |    1 +
>  drivers/common/mlx5/meson.build  |    1 +
>  drivers/net/mlx5/mlx5.c          |    4 +-
>  drivers/net/mlx5/mlx5.h          |   12 +-
>  drivers/net/mlx5/mlx5_mp.c       |    8 +-
>  drivers/net/mlx5/mlx5_mr.c       | 1169 ++------------------------------------
>  drivers/net/mlx5/mlx5_mr.h       |   87 +--
>  drivers/net/mlx5/mlx5_rxtx.c     |    1 +
>  drivers/net/mlx5/mlx5_rxtx.h     |   10 +-
>  drivers/net/mlx5/mlx5_rxtx_vec.h |    2 +
>  drivers/net/mlx5/mlx5_trigger.c  |    1 +
>  drivers/net/mlx5/mlx5_txq.c      |    3 +-
>  12 files changed, 75 insertions(+), 1224 deletions(-)
> 
> diff --git a/drivers/common/mlx5/Makefile b/drivers/common/mlx5/Makefile
> index 2a88492731..26267c957a 100644
> --- a/drivers/common/mlx5/Makefile
> +++ b/drivers/common/mlx5/Makefile
> @@ -18,6 +18,7 @@ SRCS-y += mlx5_devx_cmds.c
>  SRCS-y += mlx5_common.c
>  SRCS-y += mlx5_nl.c
>  SRCS-y += mlx5_common_mp.c
> +SRCS-y += mlx5_common_mr.c
>  ifeq ($(CONFIG_RTE_IBVERBS_LINK_DLOPEN),y)
>  INSTALL-y-lib += $(LIB_GLUE)
>  endif
> diff --git a/drivers/common/mlx5/meson.build b/drivers/common/mlx5/meson.build
> index 83671861c9..175251b691 100644
> --- a/drivers/common/mlx5/meson.build
> +++ b/drivers/common/mlx5/meson.build
> @@ -56,6 +56,7 @@ sources = files(
>  	'mlx5_common.c',
>  	'mlx5_nl.c',
>  	'mlx5_common_mp.c',
> +	'mlx5_common_mr.c',
>  )
>  if not dlopen_ibverbs
>  	sources += files('mlx5_glue.c')
> diff --git a/drivers/net/mlx5/mlx5.c b/drivers/net/mlx5/mlx5.c
> index 9eac8011f3..f45055d96f 100644
> --- a/drivers/net/mlx5/mlx5.c
> +++ b/drivers/net/mlx5/mlx5.c
> @@ -618,7 +618,7 @@ mlx5_alloc_shared_ibctx(const struct mlx5_dev_spawn_data *spawn,
>  	 * At this point the device is not added to the memory
>  	 * event list yet, context is just being created.
>  	 */
> -	err = mlx5_mr_btree_init(&sh->mr.cache,
> +	err = mlx5_mr_btree_init(&sh->share_cache.cache,
>  				 MLX5_MR_BTREE_CACHE_N * 2,
>  				 spawn->pci_dev->device.numa_node);
>  	if (err) {
> @@ -690,7 +690,7 @@ mlx5_free_shared_ibctx(struct mlx5_ibv_shared *sh)
>  	LIST_REMOVE(sh, mem_event_cb);
>  	rte_rwlock_write_unlock(&mlx5_shared_data->mem_event_rwlock);
>  	/* Release created Memory Regions. */
> -	mlx5_mr_release(sh);
> +	mlx5_mr_release_cache(&sh->share_cache);
>  	/* Remove context from the global device list. */
>  	LIST_REMOVE(sh, next);
>  	/*
> diff --git a/drivers/net/mlx5/mlx5.h b/drivers/net/mlx5/mlx5.h
> index 9e15600afd..41b6e78369 100644
> --- a/drivers/net/mlx5/mlx5.h
> +++ b/drivers/net/mlx5/mlx5.h
> @@ -37,10 +37,10 @@
>  #include <mlx5_prm.h>
>  #include <mlx5_nl.h>
>  #include <mlx5_common_mp.h>
> +#include <mlx5_common_mr.h>
> 
>  #include "mlx5_defs.h"
>  #include "mlx5_utils.h"
> -#include "mlx5_mr.h"
>  #include "mlx5_autoconf.h"
> 
>  /** Key string for IPC. */
> @@ -198,8 +198,6 @@ struct mlx5_verbs_alloc_ctx {
>  	const void *obj; /* Pointer to the DPDK object. */
>  };
> 
> -LIST_HEAD(mlx5_mr_list, mlx5_mr);
> -
>  /* Flow drop context necessary due to Verbs API. */
>  struct mlx5_drop {
>  	struct mlx5_hrxq *hrxq; /* Hash Rx queue queue. */
> @@ -390,13 +388,7 @@ struct mlx5_ibv_shared {
>  	struct ibv_device_attr_ex device_attr; /* Device properties. */
>  	LIST_ENTRY(mlx5_ibv_shared) mem_event_cb;
>  	/**< Called by memory event callback. */
> -	struct {
> -		uint32_t dev_gen; /* Generation number to flush local caches. */
> -		rte_rwlock_t rwlock; /* MR Lock. */
> -		struct mlx5_mr_btree cache; /* Global MR cache table. */
> -		struct mlx5_mr_list mr_list; /* Registered MR list. */
> -		struct mlx5_mr_list mr_free_list; /* Freed MR list. */
> -	} mr;
> +	struct mlx5_mr_share_cache share_cache;
>  	/* Shared DV/DR flow data section. */
>  	pthread_mutex_t dv_mutex; /* DV context mutex. */
>  	uint32_t dv_meta_mask; /* flow META metadata supported mask. */
> diff --git a/drivers/net/mlx5/mlx5_mp.c b/drivers/net/mlx5/mlx5_mp.c
> index 43684dbc3a..7ad322d474 100644
> --- a/drivers/net/mlx5/mlx5_mp.c
> +++ b/drivers/net/mlx5/mlx5_mp.c
> @@ -11,6 +11,7 @@
>  #include <rte_string_fns.h>
> 
>  #include <mlx5_common_mp.h>
> +#include <mlx5_common_mr.h>
> 
>  #include "mlx5.h"
>  #include "mlx5_rxtx.h"
> @@ -25,7 +26,7 @@ mlx5_mp_primary_handle(const struct rte_mp_msg *mp_msg, const void *peer)
>  		(const struct mlx5_mp_param *)mp_msg->param;
>  	struct rte_eth_dev *dev;
>  	struct mlx5_priv *priv;
> -	struct mlx5_mr_cache entry;
> +	struct mr_cache_entry entry;
>  	uint32_t lkey;
>  	int ret;
> 
> @@ -40,7 +41,10 @@ mlx5_mp_primary_handle(const struct rte_mp_msg *mp_msg, const void *peer)
>  	switch (param->type) {
>  	case MLX5_MP_REQ_CREATE_MR:
>  		mp_init_msg(&priv->mp_id, &mp_res, param->type);
> -		lkey = mlx5_mr_create_primary(dev, &entry, param->args.addr);
> +		lkey = mlx5_mr_create_primary(priv->sh->pd,
> +					      &priv->sh->share_cache,
> +					      &entry, param->args.addr,
> +					      priv->config.mr_ext_memseg_en);
>  		if (lkey == UINT32_MAX)
>  			res->result = -rte_errno;
>  		ret = rte_mp_reply(&mp_res, peer);
> diff --git a/drivers/net/mlx5/mlx5_mr.c b/drivers/net/mlx5/mlx5_mr.c
> index 9151992a72..2b4b3e2891 100644
> --- a/drivers/net/mlx5/mlx5_mr.c
> +++ b/drivers/net/mlx5/mlx5_mr.c
> @@ -18,6 +18,8 @@
>  #include <rte_bus_pci.h>
> 
>  #include <mlx5_glue.h>
> +#include <mlx5_common_mp.h>
> +#include <mlx5_common_mr.h>
> 
>  #include "mlx5.h"
>  #include "mlx5_mr.h"
> @@ -36,834 +38,6 @@ struct mr_update_mp_data {
>  	int ret;
>  };
> 
> -/**
> - * Expand B-tree table to a given size. Can't be called with holding
> - * memory_hotplug_lock or sh->mr.rwlock due to rte_realloc().
> - *
> - * @param bt
> - *   Pointer to B-tree structure.
> - * @param n
> - *   Number of entries for expansion.
> - *
> - * @return
> - *   0 on success, -1 on failure.
> - */
> -static int
> -mr_btree_expand(struct mlx5_mr_btree *bt, int n)
> -{
> -	void *mem;
> -	int ret = 0;
> -
> -	if (n <= bt->size)
> -		return ret;
> -	/*
> -	 * Downside of directly using rte_realloc() is that SOCKET_ID_ANY is
> -	 * used inside if there's no room to expand. Because this is a quite
> -	 * rare case and a part of very slow path, it is very acceptable.
> -	 * Initially cache_bh[] will be given practically enough space and once
> -	 * it is expanded, expansion wouldn't be needed again ever.
> -	 */
> -	mem = rte_realloc(bt->table, n * sizeof(struct mlx5_mr_cache), 0);
> -	if (mem == NULL) {
> -		/* Not an error, B-tree search will be skipped. */
> -		DRV_LOG(WARNING, "failed to expand MR B-tree (%p) table",
> -			(void *)bt);
> -		ret = -1;
> -	} else {
> -		DRV_LOG(DEBUG, "expanded MR B-tree table (size=%u)", n);
> -		bt->table = mem;
> -		bt->size = n;
> -	}
> -	return ret;
> -}
> -
> -/**
> - * Look up LKey from given B-tree lookup table, store the last index and
> - * return searched LKey.
> - *
> - * @param bt
> - *   Pointer to B-tree structure.
> - * @param[out] idx
> - *   Pointer to index. Even on search failure, returns index where it stops
> - *   searching so that index can be used when inserting a new entry.
> - * @param addr
> - *   Search key.
> - *
> - * @return
> - *   Searched LKey on success, UINT32_MAX on no match.
> - */
> -static uint32_t
> -mr_btree_lookup(struct mlx5_mr_btree *bt, uint16_t *idx, uintptr_t addr)
> -{
> -	struct mlx5_mr_cache *lkp_tbl;
> -	uint16_t n;
> -	uint16_t base = 0;
> -
> -	MLX5_ASSERT(bt != NULL);
> -	lkp_tbl = *bt->table;
> -	n = bt->len;
> -	/* First entry must be NULL for comparison. */
> -	MLX5_ASSERT(bt->len > 0 || (lkp_tbl[0].start == 0 &&
> -				    lkp_tbl[0].lkey == UINT32_MAX));
> -	/* Binary search. */
> -	do {
> -		register uint16_t delta = n >> 1;
> -
> -		if (addr < lkp_tbl[base + delta].start) {
> -			n = delta;
> -		} else {
> -			base += delta;
> -			n -= delta;
> -		}
> -	} while (n > 1);
> -	MLX5_ASSERT(addr >= lkp_tbl[base].start);
> -	*idx = base;
> -	if (addr < lkp_tbl[base].end)
> -		return lkp_tbl[base].lkey;
> -	/* Not found. */
> -	return UINT32_MAX;
> -}
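
The lookup above is a plain binary search over a table of half-open address
ranges sorted by start, with entry 0 kept as a sentinel. A self-contained
sketch of the same search over a simplified table type (names are
illustrative, not the driver's):

	#include <stdint.h>

	struct demo_range {
		uintptr_t start;	/* inclusive */
		uintptr_t end;		/* exclusive */
		uint32_t lkey;
	};

	/* tbl[0] must be the {0, 0, UINT32_MAX} sentinel; len >= 1. */
	static uint32_t
	demo_range_lookup(const struct demo_range *tbl, uint16_t len,
			  uintptr_t addr)
	{
		uint16_t n = len;
		uint16_t base = 0;

		do {
			uint16_t delta = n >> 1;

			if (addr < tbl[base + delta].start) {
				n = delta;		/* keep left half */
			} else {
				base += delta;		/* keep right half */
				n -= delta;
			}
		} while (n > 1);
		/* base is the last entry whose start <= addr; verify range. */
		return addr < tbl[base].end ? tbl[base].lkey : UINT32_MAX;
	}
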
> -
> -/**
> - * Insert an entry to B-tree lookup table.
> - *
> - * @param bt
> - *   Pointer to B-tree structure.
> - * @param entry
> - *   Pointer to new entry to insert.
> - *
> - * @return
> - *   0 on success, -1 on failure.
> - */
> -static int
> -mr_btree_insert(struct mlx5_mr_btree *bt, struct mlx5_mr_cache *entry)
> -{
> -	struct mlx5_mr_cache *lkp_tbl;
> -	uint16_t idx = 0;
> -	size_t shift;
> -
> -	MLX5_ASSERT(bt != NULL);
> -	MLX5_ASSERT(bt->len <= bt->size);
> -	MLX5_ASSERT(bt->len > 0);
> -	lkp_tbl = *bt->table;
> -	/* Find out the slot for insertion. */
> -	if (mr_btree_lookup(bt, &idx, entry->start) != UINT32_MAX) {
> -		DRV_LOG(DEBUG,
> -			"abort insertion to B-tree(%p): already exist at"
> -			" idx=%u [0x%" PRIxPTR ", 0x%" PRIxPTR ") lkey=0x%x",
> -			(void *)bt, idx, entry->start, entry->end, entry->lkey);
> -		/* Already exist, return. */
> -		return 0;
> -	}
> -	/* If table is full, return error. */
> -	if (unlikely(bt->len == bt->size)) {
> -		bt->overflow = 1;
> -		return -1;
> -	}
> -	/* Insert entry. */
> -	++idx;
> -	shift = (bt->len - idx) * sizeof(struct mlx5_mr_cache);
> -	if (shift)
> -		memmove(&lkp_tbl[idx + 1], &lkp_tbl[idx], shift);
> -	lkp_tbl[idx] = *entry;
> -	bt->len++;
> -	DRV_LOG(DEBUG,
> -		"inserted B-tree(%p)[%u],"
> -		" [0x%" PRIxPTR ", 0x%" PRIxPTR ") lkey=0x%x",
> -		(void *)bt, idx, entry->start, entry->end, entry->lkey);
> -	return 0;
> -}
> -
> -/**
> - * Initialize B-tree and allocate memory for lookup table.
> - *
> - * @param bt
> - *   Pointer to B-tree structure.
> - * @param n
> - *   Number of entries to allocate.
> - * @param socket
> - *   NUMA socket on which memory must be allocated.
> - *
> - * @return
> - *   0 on success, a negative errno value otherwise and rte_errno is set.
> - */
> -int
> -mlx5_mr_btree_init(struct mlx5_mr_btree *bt, int n, int socket)
> -{
> -	if (bt == NULL) {
> -		rte_errno = EINVAL;
> -		return -rte_errno;
> -	}
> -	MLX5_ASSERT(!bt->table && !bt->size);
> -	memset(bt, 0, sizeof(*bt));
> -	bt->table = rte_calloc_socket("B-tree table",
> -				      n, sizeof(struct mlx5_mr_cache),
> -				      0, socket);
> -	if (bt->table == NULL) {
> -		rte_errno = ENOMEM;
> -		DEBUG("failed to allocate memory for btree cache on socket
> %d",
> -		      socket);
> -		return -rte_errno;
> -	}
> -	bt->size = n;
> -	/* First entry must be NULL for binary search. */
> -	(*bt->table)[bt->len++] = (struct mlx5_mr_cache) {
> -		.lkey = UINT32_MAX,
> -	};
> -	DEBUG("initialized B-tree %p with table %p",
> -	      (void *)bt, (void *)bt->table);
> -	return 0;
> -}
> -
> -/**
> - * Free B-tree resources.
> - *
> - * @param bt
> - *   Pointer to B-tree structure.
> - */
> -void
> -mlx5_mr_btree_free(struct mlx5_mr_btree *bt)
> -{
> -	if (bt == NULL)
> -		return;
> -	DEBUG("freeing B-tree %p with table %p",
> -	      (void *)bt, (void *)bt->table);
> -	rte_free(bt->table);
> -	memset(bt, 0, sizeof(*bt));
> -}
> -
> -/**
> - * Dump all the entries in a B-tree
> - *
> - * @param bt
> - *   Pointer to B-tree structure.
> - */
> -void
> -mlx5_mr_btree_dump(struct mlx5_mr_btree *bt __rte_unused)
> -{
> -#ifdef RTE_LIBRTE_MLX5_DEBUG
> -	int idx;
> -	struct mlx5_mr_cache *lkp_tbl;
> -
> -	if (bt == NULL)
> -		return;
> -	lkp_tbl = *bt->table;
> -	for (idx = 0; idx < bt->len; ++idx) {
> -		struct mlx5_mr_cache *entry = &lkp_tbl[idx];
> -
> -		DEBUG("B-tree(%p)[%u],"
> -		      " [0x%" PRIxPTR ", 0x%" PRIxPTR ") lkey=0x%x",
> -		      (void *)bt, idx, entry->start, entry->end, entry->lkey);
> -	}
> -#endif
> -}
> -
> -/**
> - * Find virtually contiguous memory chunk in a given MR.
> - *
> - * @param dev
> - *   Pointer to MR structure.
> - * @param[out] entry
> - *   Pointer to returning MR cache entry. If not found, this will not be
> - *   updated.
> - * @param start_idx
> - *   Start index of the memseg bitmap.
> - *
> - * @return
> - *   Next index to go on lookup.
> - */
> -static int
> -mr_find_next_chunk(struct mlx5_mr *mr, struct mlx5_mr_cache *entry,
> -		   int base_idx)
> -{
> -	uintptr_t start = 0;
> -	uintptr_t end = 0;
> -	uint32_t idx = 0;
> -
> -	/* MR for external memory doesn't have memseg list. */
> -	if (mr->msl == NULL) {
> -		struct ibv_mr *ibv_mr = mr->ibv_mr;
> -
> -		MLX5_ASSERT(mr->ms_bmp_n == 1);
> -		MLX5_ASSERT(mr->ms_n == 1);
> -		MLX5_ASSERT(base_idx == 0);
> -		/*
> -		 * Can't search it from memseg list but get it directly from
> -		 * verbs MR as there's only one chunk.
> -		 */
> -		entry->start = (uintptr_t)ibv_mr->addr;
> -		entry->end = (uintptr_t)ibv_mr->addr + mr->ibv_mr->length;
> -		entry->lkey = rte_cpu_to_be_32(mr->ibv_mr->lkey);
> -		/* Returning 1 ends iteration. */
> -		return 1;
> -	}
> -	for (idx = base_idx; idx < mr->ms_bmp_n; ++idx) {
> -		if (rte_bitmap_get(mr->ms_bmp, idx)) {
> -			const struct rte_memseg_list *msl;
> -			const struct rte_memseg *ms;
> -
> -			msl = mr->msl;
> -			ms = rte_fbarray_get(&msl->memseg_arr,
> -					     mr->ms_base_idx + idx);
> -			MLX5_ASSERT(msl->page_sz == ms->hugepage_sz);
> -			if (!start)
> -				start = ms->addr_64;
> -			end = ms->addr_64 + ms->hugepage_sz;
> -		} else if (start) {
> -			/* Passed the end of a fragment. */
> -			break;
> -		}
> -	}
> -	if (start) {
> -		/* Found one chunk. */
> -		entry->start = start;
> -		entry->end = end;
> -		entry->lkey = rte_cpu_to_be_32(mr->ibv_mr->lkey);
> -	}
> -	return idx;
> -}
> -
> -/**
> - * Insert a MR to the global B-tree cache. It may fail due to low-on-memory.
> - * Then, this entry will have to be searched by mr_lookup_dev_list() in
> - * mlx5_mr_create() on miss.
> - *
> - * @param dev
> - *   Pointer to Ethernet device shared context.
> - * @param mr
> - *   Pointer to MR to insert.
> - *
> - * @return
> - *   0 on success, -1 on failure.
> - */
> -static int
> -mr_insert_dev_cache(struct mlx5_ibv_shared *sh, struct mlx5_mr *mr)
> -{
> -	unsigned int n;
> -
> -	DRV_LOG(DEBUG, "device %s inserting MR(%p) to global cache",
> -		sh->ibdev_name, (void *)mr);
> -	for (n = 0; n < mr->ms_bmp_n; ) {
> -		struct mlx5_mr_cache entry;
> -
> -		memset(&entry, 0, sizeof(entry));
> -		/* Find a contiguous chunk and advance the index. */
> -		n = mr_find_next_chunk(mr, &entry, n);
> -		if (!entry.end)
> -			break;
> -		if (mr_btree_insert(&sh->mr.cache, &entry) < 0) {
> -			/*
> -			 * Overflowed, but the global table cannot be expanded
> -			 * because of deadlock.
> -			 */
> -			return -1;
> -		}
> -	}
> -	return 0;
> -}
> -
> -/**
> - * Look up address in the original global MR list.
> - *
> - * @param sh
> - *   Pointer to Ethernet device shared context.
> - * @param[out] entry
> - *   Pointer to returning MR cache entry. If no match, this will not be
> - *   updated.
> - * @param addr
> - *   Search key.
> - *
> - * @return
> - *   Found MR on match, NULL otherwise.
> - */
> -static struct mlx5_mr *
> -mr_lookup_dev_list(struct mlx5_ibv_shared *sh, struct mlx5_mr_cache *entry,
> -		   uintptr_t addr)
> -{
> -	struct mlx5_mr *mr;
> -
> -	/* Iterate all the existing MRs. */
> -	LIST_FOREACH(mr, &sh->mr.mr_list, mr) {
> -		unsigned int n;
> -
> -		if (mr->ms_n == 0)
> -			continue;
> -		for (n = 0; n < mr->ms_bmp_n; ) {
> -			struct mlx5_mr_cache ret;
> -
> -			memset(&ret, 0, sizeof(ret));
> -			n = mr_find_next_chunk(mr, &ret, n);
> -			if (addr >= ret.start && addr < ret.end) {
> -				/* Found. */
> -				*entry = ret;
> -				return mr;
> -			}
> -		}
> -	}
> -	return NULL;
> -}
> -
> -/**
> - * Look up address on device.
> - *
> - * @param dev
> - *   Pointer to Ethernet device shared context.
> - * @param[out] entry
> - *   Pointer to returning MR cache entry. If no match, this will not be
> - *   updated.
> - * @param addr
> - *   Search key.
> - *
> - * @return
> - *   Searched LKey on success, UINT32_MAX on failure and rte_errno is set.
> - */
> -static uint32_t
> -mr_lookup_dev(struct mlx5_ibv_shared *sh, struct mlx5_mr_cache *entry,
> -	      uintptr_t addr)
> -{
> -	uint16_t idx;
> -	uint32_t lkey = UINT32_MAX;
> -	struct mlx5_mr *mr;
> -
> -	/*
> -	 * If the global cache has overflowed since it failed to expand the
> -	 * B-tree table, it can't have all the existing MRs. Then, the address
> -	 * has to be searched by traversing the original MR list instead, which
> -	 * is very slow path. Otherwise, the global cache is all inclusive.
> -	 */
> -	if (!unlikely(sh->mr.cache.overflow)) {
> -		lkey = mr_btree_lookup(&sh->mr.cache, &idx, addr);
> -		if (lkey != UINT32_MAX)
> -			*entry = (*sh->mr.cache.table)[idx];
> -	} else {
> -		/* Falling back to the slowest path. */
> -		mr = mr_lookup_dev_list(sh, entry, addr);
> -		if (mr != NULL)
> -			lkey = entry->lkey;
> -	}
> -	MLX5_ASSERT(lkey == UINT32_MAX || (addr >= entry->start &&
> -					   addr < entry->end));
> -	return lkey;
> -}
> -
> -/**
> - * Free MR resources. MR lock must not be held to avoid a deadlock.
> - * rte_free() can raise memory free event and the callback function will
> - * spin on the lock.
> - *
> - * @param mr
> - *   Pointer to MR to free.
> - */
> -static void
> -mr_free(struct mlx5_mr *mr)
> -{
> -	if (mr == NULL)
> -		return;
> -	DRV_LOG(DEBUG, "freeing MR(%p):", (void *)mr);
> -	if (mr->ibv_mr != NULL)
> -		claim_zero(mlx5_glue->dereg_mr(mr->ibv_mr));
> -	if (mr->ms_bmp != NULL)
> -		rte_bitmap_free(mr->ms_bmp);
> -	rte_free(mr);
> -}
> -
> -/**
> - * Release resources of detached MR having no online entry.
> - *
> - * @param sh
> - *   Pointer to Ethernet device shared context.
> - */
> -static void
> -mlx5_mr_garbage_collect(struct mlx5_ibv_shared *sh)
> -{
> -	struct mlx5_mr *mr_next;
> -	struct mlx5_mr_list free_list = LIST_HEAD_INITIALIZER(free_list);
> -
> -	/* Must be called from the primary process. */
> -	MLX5_ASSERT(rte_eal_process_type() == RTE_PROC_PRIMARY);
> -	/*
> -	 * MR can't be freed with holding the lock because rte_free() could call
> -	 * memory free callback function. This will be a deadlock situation.
> -	 */
> -	rte_rwlock_write_lock(&sh->mr.rwlock);
> -	/* Detach the whole free list and release it after unlocking. */
> -	free_list = sh->mr.mr_free_list;
> -	LIST_INIT(&sh->mr.mr_free_list);
> -	rte_rwlock_write_unlock(&sh->mr.rwlock);
> -	/* Release resources. */
> -	mr_next = LIST_FIRST(&free_list);
> -	while (mr_next != NULL) {
> -		struct mlx5_mr *mr = mr_next;
> -
> -		mr_next = LIST_NEXT(mr, mr);
> -		mr_free(mr);
> -	}
> -}
> -
> -/* Called during rte_memseg_contig_walk() by mlx5_mr_create(). */
> -static int
> -mr_find_contig_memsegs_cb(const struct rte_memseg_list *msl,
> -			  const struct rte_memseg *ms, size_t len, void *arg)
> -{
> -	struct mr_find_contig_memsegs_data *data = arg;
> -
> -	if (data->addr < ms->addr_64 || data->addr >= ms->addr_64 + len)
> -		return 0;
> -	/* Found, save it and stop walking. */
> -	data->start = ms->addr_64;
> -	data->end = ms->addr_64 + len;
> -	data->msl = msl;
> -	return 1;
> -}
> -
> -/**
> - * Create a new global Memory Region (MR) for a missing virtual address.
> - * This API should be called on a secondary process, then a request is sent to
> - * the primary process in order to create a MR for the address. As the global
> - * MR list is on the shared memory, following LKey lookup should succeed unless
> - * the request fails.
> - *
> - * @param dev
> - *   Pointer to Ethernet device.
> - * @param[out] entry
> - *   Pointer to returning MR cache entry, found in the global cache or newly
> - *   created. If failed to create one, this will not be updated.
> - * @param addr
> - *   Target virtual address to register.
> - *
> - * @return
> - *   Searched LKey on success, UINT32_MAX on failure and rte_errno is set.
> - */
> -static uint32_t
> -mlx5_mr_create_secondary(struct rte_eth_dev *dev, struct mlx5_mr_cache *entry,
> -			 uintptr_t addr)
> -{
> -	struct mlx5_priv *priv = dev->data->dev_private;
> -	int ret;
> -
> -	DEBUG("port %u requesting MR creation for address (%p)",
> -	      dev->data->port_id, (void *)addr);
> -	ret = mlx5_mp_req_mr_create(&priv->mp_id, addr);
> -	if (ret) {
> -		DEBUG("port %u fail to request MR creation for address
> (%p)",
> -		      dev->data->port_id, (void *)addr);
> -		return UINT32_MAX;
> -	}
> -	rte_rwlock_read_lock(&priv->sh->mr.rwlock);
> -	/* Fill in output data. */
> -	mr_lookup_dev(priv->sh, entry, addr);
> -	/* Lookup can't fail. */
> -	MLX5_ASSERT(entry->lkey != UINT32_MAX);
> -	rte_rwlock_read_unlock(&priv->sh->mr.rwlock);
> -	DEBUG("port %u MR CREATED by primary process for %p:\n"
> -	      "  [0x%" PRIxPTR ", 0x%" PRIxPTR "), lkey=0x%x",
> -	      dev->data->port_id, (void *)addr,
> -	      entry->start, entry->end, entry->lkey);
> -	return entry->lkey;
> -}
> -
> -/**
> - * Create a new global Memory Region (MR) for a missing virtual address.
> - * Register entire virtually contiguous memory chunk around the address.
> - * This must be called from the primary process.
> - *
> - * @param dev
> - *   Pointer to Ethernet device.
> - * @param[out] entry
> - *   Pointer to returning MR cache entry, found in the global cache or newly
> - *   created. If failed to create one, this will not be updated.
> - * @param addr
> - *   Target virtual address to register.
> - *
> - * @return
> - *   Searched LKey on success, UINT32_MAX on failure and rte_errno is set.
> - */
> -uint32_t
> -mlx5_mr_create_primary(struct rte_eth_dev *dev, struct mlx5_mr_cache *entry,
> -		       uintptr_t addr)
> -{
> -	struct mlx5_priv *priv = dev->data->dev_private;
> -	struct mlx5_ibv_shared *sh = priv->sh;
> -	struct mlx5_dev_config *config = &priv->config;
> -	const struct rte_memseg_list *msl;
> -	const struct rte_memseg *ms;
> -	struct mlx5_mr *mr = NULL;
> -	size_t len;
> -	uint32_t ms_n;
> -	uint32_t bmp_size;
> -	void *bmp_mem;
> -	int ms_idx_shift = -1;
> -	unsigned int n;
> -	struct mr_find_contig_memsegs_data data = {
> -		.addr = addr,
> -	};
> -	struct mr_find_contig_memsegs_data data_re;
> -
> -	DRV_LOG(DEBUG, "port %u creating a MR using address (%p)",
> -		dev->data->port_id, (void *)addr);
> -	/*
> -	 * Release detached MRs if any. This can't be called with holding either
> -	 * memory_hotplug_lock or sh->mr.rwlock. MRs on the free list have
> -	 * been detached by the memory free event but it couldn't be released
> -	 * inside the callback due to deadlock. As a result, releasing resources
> -	 * is quite opportunistic.
> -	 */
> -	mlx5_mr_garbage_collect(sh);
> -	/*
> -	 * If enabled, find out a contiguous virtual address chunk in use, to
> -	 * which the given address belongs, in order to register maximum range.
> -	 * In the best case where mempools are not dynamically recreated and
> -	 * '--socket-mem' is specified as an EAL option, it is very likely to
> -	 * have only one MR(LKey) per a socket and per a hugepage-size even
> -	 * though the system memory is highly fragmented. As the whole memory
> -	 * chunk will be pinned by kernel, it can't be reused unless entire
> -	 * chunk is freed from EAL.
> -	 *
> -	 * If disabled, just register one memseg (page). Then, memory
> -	 * consumption will be minimized but it may drop performance if there
> -	 * are many MRs to lookup on the datapath.
> -	 */
> -	if (!config->mr_ext_memseg_en) {
> -		data.msl = rte_mem_virt2memseg_list((void *)addr);
> -		data.start = RTE_ALIGN_FLOOR(addr, data.msl->page_sz);
> -		data.end = data.start + data.msl->page_sz;
> -	} else if (!rte_memseg_contig_walk(mr_find_contig_memsegs_cb, &data)) {
> -		DRV_LOG(WARNING,
> -			"port %u unable to find virtually contiguous"
> -			" chunk for address (%p)."
> -			" rte_memseg_contig_walk() failed.",
> -			dev->data->port_id, (void *)addr);
> -		rte_errno = ENXIO;
> -		goto err_nolock;
> -	}
> -alloc_resources:
> -	/* Addresses must be page-aligned. */
> -	MLX5_ASSERT(rte_is_aligned((void *)data.start, data.msl->page_sz));
> -	MLX5_ASSERT(rte_is_aligned((void *)data.end, data.msl->page_sz));
> -	msl = data.msl;
> -	ms = rte_mem_virt2memseg((void *)data.start, msl);
> -	len = data.end - data.start;
> -	MLX5_ASSERT(msl->page_sz == ms->hugepage_sz);
> -	/* Number of memsegs in the range. */
> -	ms_n = len / msl->page_sz;
> -	DEBUG("port %u extending %p to [0x%" PRIxPTR ", 0x%" PRIxPTR "),"
> -	      " page_sz=0x%" PRIx64 ", ms_n=%u",
> -	      dev->data->port_id, (void *)addr,
> -	      data.start, data.end, msl->page_sz, ms_n);
> -	/* Size of memory for bitmap. */
> -	bmp_size = rte_bitmap_get_memory_footprint(ms_n);
> -	mr = rte_zmalloc_socket(NULL,
> -				RTE_ALIGN_CEIL(sizeof(*mr),
> -					       RTE_CACHE_LINE_SIZE) +
> -				bmp_size,
> -				RTE_CACHE_LINE_SIZE, msl->socket_id);
> -	if (mr == NULL) {
> -		DEBUG("port %u unable to allocate memory for a new MR of"
> -		      " address (%p).",
> -		      dev->data->port_id, (void *)addr);
> -		rte_errno = ENOMEM;
> -		goto err_nolock;
> -	}
> -	mr->msl = msl;
> -	/*
> -	 * Save the index of the first memseg and initialize memseg bitmap. To
> -	 * see if a memseg of ms_idx in the memseg-list is still valid, check:
> -	 *	rte_bitmap_get(mr->bmp, ms_idx - mr->ms_base_idx)
> -	 */
> -	mr->ms_base_idx = rte_fbarray_find_idx(&msl->memseg_arr, ms);
> -	bmp_mem = RTE_PTR_ALIGN_CEIL(mr + 1, RTE_CACHE_LINE_SIZE);
> -	mr->ms_bmp = rte_bitmap_init(ms_n, bmp_mem, bmp_size);
> -	if (mr->ms_bmp == NULL) {
> -		DEBUG("port %u unable to initialize bitmap for a new MR of"
> -		      " address (%p).",
> -		      dev->data->port_id, (void *)addr);
> -		rte_errno = EINVAL;
> -		goto err_nolock;
> -	}
> -	/*
> -	 * Should recheck whether the extended contiguous chunk is still valid.
> -	 * Because memory_hotplug_lock can't be held if there's any memory
> -	 * related calls in a critical path, resource allocation above can't be
> -	 * locked. If the memory has been changed at this point, try again with
> -	 * just single page. If not, go on with the big chunk atomically from
> -	 * here.
> -	 */
> -	rte_mcfg_mem_read_lock();
> -	data_re = data;
> -	if (len > msl->page_sz &&
> -	    !rte_memseg_contig_walk(mr_find_contig_memsegs_cb, &data_re)) {
> -		DEBUG("port %u unable to find virtually contiguous"
> -		      " chunk for address (%p)."
> -		      " rte_memseg_contig_walk() failed.",
> -		      dev->data->port_id, (void *)addr);
> -		rte_errno = ENXIO;
> -		goto err_memlock;
> -	}
> -	if (data.start != data_re.start || data.end != data_re.end) {
> -		/*
> -		 * The extended contiguous chunk has been changed. Try again
> -		 * with single memseg instead.
> -		 */
> -		data.start = RTE_ALIGN_FLOOR(addr, msl->page_sz);
> -		data.end = data.start + msl->page_sz;
> -		rte_mcfg_mem_read_unlock();
> -		mr_free(mr);
> -		goto alloc_resources;
> -	}
> -	MLX5_ASSERT(data.msl == data_re.msl);
> -	rte_rwlock_write_lock(&sh->mr.rwlock);
> -	/*
> -	 * Check the address is really missing. If other thread already created
> -	 * one or it is not found due to overflow, abort and return.
> -	 */
> -	if (mr_lookup_dev(sh, entry, addr) != UINT32_MAX) {
> -		/*
> -		 * Insert to the global cache table. It may fail due to
> -		 * low-on-memory. Then, this entry will have to be searched
> -		 * here again.
> -		 */
> -		mr_btree_insert(&sh->mr.cache, entry);
> -		DEBUG("port %u found MR for %p on final lookup, abort",
> -		      dev->data->port_id, (void *)addr);
> -		rte_rwlock_write_unlock(&sh->mr.rwlock);
> -		rte_mcfg_mem_read_unlock();
> -		/*
> -		 * Must be unlocked before calling rte_free() because
> -		 * mlx5_mr_mem_event_free_cb() can be called inside.
> -		 */
> -		mr_free(mr);
> -		return entry->lkey;
> -	}
> -	/*
> -	 * Trim start and end addresses for verbs MR. Set bits for registering
> -	 * memsegs but exclude already registered ones. Bitmap can be
> -	 * fragmented.
> -	 */
> -	for (n = 0; n < ms_n; ++n) {
> -		uintptr_t start;
> -		struct mlx5_mr_cache ret;
> -
> -		memset(&ret, 0, sizeof(ret));
> -		start = data_re.start + n * msl->page_sz;
> -		/* Exclude memsegs already registered by other MRs. */
> -		if (mr_lookup_dev(sh, &ret, start) == UINT32_MAX) {
> -			/*
> -			 * Start from the first unregistered memseg in the
> -			 * extended range.
> -			 */
> -			if (ms_idx_shift == -1) {
> -				mr->ms_base_idx += n;
> -				data.start = start;
> -				ms_idx_shift = n;
> -			}
> -			data.end = start + msl->page_sz;
> -			rte_bitmap_set(mr->ms_bmp, n - ms_idx_shift);
> -			++mr->ms_n;
> -		}
> -	}
> -	len = data.end - data.start;
> -	mr->ms_bmp_n = len / msl->page_sz;
> -	MLX5_ASSERT(ms_idx_shift + mr->ms_bmp_n <= ms_n);
> -	/*
> -	 * Finally create a verbs MR for the memory chunk. ibv_reg_mr() can be
> -	 * called with holding the memory lock because it doesn't use
> -	 * mlx5_alloc_buf_extern() which eventually calls rte_malloc_socket()
> -	 * through mlx5_alloc_verbs_buf().
> -	 */
> -	mr->ibv_mr = mlx5_glue->reg_mr(sh->pd, (void *)data.start, len,
> -				       IBV_ACCESS_LOCAL_WRITE |
> -					   IBV_ACCESS_RELAXED_ORDERING);
> -	if (mr->ibv_mr == NULL) {
> -		DEBUG("port %u fail to create a verbs MR for address (%p)",
> -		      dev->data->port_id, (void *)addr);
> -		rte_errno = EINVAL;
> -		goto err_mrlock;
> -	}
> -	MLX5_ASSERT((uintptr_t)mr->ibv_mr->addr == data.start);
> -	MLX5_ASSERT(mr->ibv_mr->length == len);
> -	LIST_INSERT_HEAD(&sh->mr.mr_list, mr, mr);
> -	DEBUG("port %u MR CREATED (%p) for %p:\n"
> -	      "  [0x%" PRIxPTR ", 0x%" PRIxPTR "),"
> -	      " lkey=0x%x base_idx=%u ms_n=%u, ms_bmp_n=%u",
> -	      dev->data->port_id, (void *)mr, (void *)addr,
> -	      data.start, data.end, rte_cpu_to_be_32(mr->ibv_mr->lkey),
> -	      mr->ms_base_idx, mr->ms_n, mr->ms_bmp_n);
> -	/* Insert to the global cache table. */
> -	mr_insert_dev_cache(sh, mr);
> -	/* Fill in output data. */
> -	mr_lookup_dev(sh, entry, addr);
> -	/* Lookup can't fail. */
> -	MLX5_ASSERT(entry->lkey != UINT32_MAX);
> -	rte_rwlock_write_unlock(&sh->mr.rwlock);
> -	rte_mcfg_mem_read_unlock();
> -	return entry->lkey;
> -err_mrlock:
> -	rte_rwlock_write_unlock(&sh->mr.rwlock);
> -err_memlock:
> -	rte_mcfg_mem_read_unlock();
> -err_nolock:
> -	/*
> -	 * In case of error, as this can be called in a datapath, a warning
> -	 * message per an error is preferable instead. Must be unlocked before
> -	 * calling rte_free() because mlx5_mr_mem_event_free_cb() can be called
> -	 * inside.
> -	 */
> -	mr_free(mr);
> -	return UINT32_MAX;
> -}
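
Note the pattern mlx5_mr_create_primary() is built around (it moves to the
common driver, it does not disappear): scan and allocate without the memory
hotplug lock, then take the lock and re-run the contiguity walk; if the layout
changed meanwhile, retry with a single page. A compressed standalone outline,
where demo_scan() and demo_register() are stand-ins rather than driver APIs
and the 4 KiB page size is an assumption for brevity:

	#include <stdbool.h>
	#include <stdint.h>
	#include <rte_eal_memconfig.h>

	bool demo_scan(uintptr_t addr, uintptr_t *start, uintptr_t *end);
	int demo_register(uintptr_t start, uintptr_t end);

	static int
	demo_create(uintptr_t addr)
	{
		uintptr_t start, end, s2, e2;

		if (!demo_scan(addr, &start, &end))
			return -1;
	retry:
		/* ...allocate tracking structures for [start, end) here... */
		rte_mcfg_mem_read_lock();
		/* Revalidate only when more than one page is claimed. */
		if (end - start > 4096 &&
		    (!demo_scan(addr, &s2, &e2) || s2 != start || e2 != end)) {
			/* Layout changed under us: fall back to one page. */
			rte_mcfg_mem_read_unlock();
			start = addr & ~((uintptr_t)4096 - 1);
			end = start + 4096;
			goto retry;
		}
		/* Layout is stable while the lock is held: register now. */
		demo_register(start, end);
		rte_mcfg_mem_read_unlock();
		return 0;
	}
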
> -
> -/**
> - * Create a new global Memory Region (MR) for a missing virtual address.
> - * This can be called from primary and secondary process.
> - *
> - * @param dev
> - *   Pointer to Ethernet device.
> - * @param[out] entry
> - *   Pointer to returning MR cache entry, found in the global cache or newly
> - *   created. If failed to create one, this will not be updated.
> - * @param addr
> - *   Target virtual address to register.
> - *
> - * @return
> - *   Searched LKey on success, UINT32_MAX on failure and rte_errno is set.
> - */
> -static uint32_t
> -mlx5_mr_create(struct rte_eth_dev *dev, struct mlx5_mr_cache *entry,
> -	       uintptr_t addr)
> -{
> -	uint32_t ret = 0;
> -
> -	switch (rte_eal_process_type()) {
> -	case RTE_PROC_PRIMARY:
> -		ret = mlx5_mr_create_primary(dev, entry, addr);
> -		break;
> -	case RTE_PROC_SECONDARY:
> -		ret = mlx5_mr_create_secondary(dev, entry, addr);
> -		break;
> -	default:
> -		break;
> -	}
> -	return ret;
> -}
> -
> -/**
> - * Rebuild the global B-tree cache of device from the original MR list.
> - *
> - * @param sh
> - *   Pointer to Ethernet device shared context.
> - */
> -static void
> -mr_rebuild_dev_cache(struct mlx5_ibv_shared *sh)
> -{
> -	struct mlx5_mr *mr;
> -
> -	DRV_LOG(DEBUG, "device %s rebuild dev cache[]", sh->ibdev_name);
> -	/* Flush cache to rebuild. */
> -	sh->mr.cache.len = 1;
> -	sh->mr.cache.overflow = 0;
> -	/* Iterate all the existing MRs. */
> -	LIST_FOREACH(mr, &sh->mr.mr_list, mr)
> -		if (mr_insert_dev_cache(sh, mr) < 0)
> -			return;
> -}
> -
>  /**
>   * Callback for memory free event. Iterate freed memsegs and check whether it
>   * belongs to an existing MR. If found, clear the bit from bitmap of MR. As a
> @@ -900,18 +74,18 @@ mlx5_mr_mem_event_free_cb(struct mlx5_ibv_shared *sh,
>  		    RTE_ALIGN((uintptr_t)addr, msl->page_sz));
>  	MLX5_ASSERT(len == RTE_ALIGN(len, msl->page_sz));
>  	ms_n = len / msl->page_sz;
> -	rte_rwlock_write_lock(&sh->mr.rwlock);
> +	rte_rwlock_write_lock(&sh->share_cache.rwlock);
>  	/* Clear bits of freed memsegs from MR. */
>  	for (i = 0; i < ms_n; ++i) {
>  		const struct rte_memseg *ms;
> -		struct mlx5_mr_cache entry;
> +		struct mr_cache_entry entry;
>  		uintptr_t start;
>  		int ms_idx;
>  		uint32_t pos;
> 
>  		/* Find MR having this memseg. */
>  		start = (uintptr_t)addr + i * msl->page_sz;
> -		mr = mr_lookup_dev_list(sh, &entry, start);
> +		mr = mlx5_mr_lookup_list(&sh->share_cache, &entry, start);
>  		if (mr == NULL)
>  			continue;
>  		MLX5_ASSERT(mr->msl); /* Can't be external memory. */
> @@ -927,7 +101,7 @@ mlx5_mr_mem_event_free_cb(struct mlx5_ibv_shared *sh,
>  		rte_bitmap_clear(mr->ms_bmp, pos);
>  		if (--mr->ms_n == 0) {
>  			LIST_REMOVE(mr, mr);
> -			LIST_INSERT_HEAD(&sh->mr.mr_free_list, mr, mr);
> +			LIST_INSERT_HEAD(&sh->share_cache.mr_free_list, mr, mr);
>  			DEBUG("device %s remove MR(%p) from list",
>  			      sh->ibdev_name, (void *)mr);
>  		}
> @@ -938,7 +112,7 @@ mlx5_mr_mem_event_free_cb(struct mlx5_ibv_shared *sh,
>  		rebuild = 1;
>  	}
>  	if (rebuild) {
> -		mr_rebuild_dev_cache(sh);
> +		mlx5_mr_rebuild_cache(&sh->share_cache);
>  		/*
>  		 * Flush local caches by propagating invalidation across cores.
>  		 * rte_smp_wmb() is enough to synchronize this event. If one of
> @@ -948,12 +122,12 @@ mlx5_mr_mem_event_free_cb(struct mlx5_ibv_shared *sh,
>  		 * generation below) will be guaranteed to be seen by other core
>  		 * before the core sees the newly allocated memory.
>  		 */
> -		++sh->mr.dev_gen;
> +		++sh->share_cache.dev_gen;
>  		DEBUG("broadcasting local cache flush, gen=%d",
> -		      sh->mr.dev_gen);
> +		      sh->share_cache.dev_gen);
>  		rte_smp_wmb();
>  	}
> -	rte_rwlock_write_unlock(&sh->mr.rwlock);
> +	rte_rwlock_write_unlock(&sh->share_cache.rwlock);
>  }
> 
>  /**
> @@ -990,111 +164,6 @@ mlx5_mr_mem_event_cb(enum rte_mem_event event_type, const void *addr,
>  	}
>  }
> 
> -/**
> - * Look up address in the global MR cache table. If not found, create a new MR.
> - * Insert the found/created entry to local bottom-half cache table.
> - *
> - * @param dev
> - *   Pointer to Ethernet device.
> - * @param mr_ctrl
> - *   Pointer to per-queue MR control structure.
> - * @param[out] entry
> - *   Pointer to returning MR cache entry, found in the global cache or newly
> - *   created. If failed to create one, this is not written.
> - * @param addr
> - *   Search key.
> - *
> - * @return
> - *   Searched LKey on success, UINT32_MAX on no match.
> - */
> -static uint32_t
> -mlx5_mr_lookup_dev(struct rte_eth_dev *dev, struct mlx5_mr_ctrl *mr_ctrl,
> -		   struct mlx5_mr_cache *entry, uintptr_t addr)
> -{
> -	struct mlx5_priv *priv = dev->data->dev_private;
> -	struct mlx5_ibv_shared *sh = priv->sh;
> -	struct mlx5_mr_btree *bt = &mr_ctrl->cache_bh;
> -	uint16_t idx;
> -	uint32_t lkey;
> -
> -	/* If local cache table is full, try to double it. */
> -	if (unlikely(bt->len == bt->size))
> -		mr_btree_expand(bt, bt->size << 1);
> -	/* Look up in the global cache. */
> -	rte_rwlock_read_lock(&sh->mr.rwlock);
> -	lkey = mr_btree_lookup(&sh->mr.cache, &idx, addr);
> -	if (lkey != UINT32_MAX) {
> -		/* Found. */
> -		*entry = (*sh->mr.cache.table)[idx];
> -		rte_rwlock_read_unlock(&sh->mr.rwlock);
> -		/*
> -		 * Update local cache. Even if it fails, return the found entry
> -		 * to update top-half cache. Next time, this entry will be found
> -		 * in the global cache.
> -		 */
> -		mr_btree_insert(bt, entry);
> -		return lkey;
> -	}
> -	rte_rwlock_read_unlock(&sh->mr.rwlock);
> -	/* First time to see the address? Create a new MR. */
> -	lkey = mlx5_mr_create(dev, entry, addr);
> -	/*
> -	 * Update the local cache if successfully created a new global MR. Even
> -	 * if failed to create one, there's no action to take in this datapath
> -	 * code. As returning LKey is invalid, this will eventually make HW
> -	 * fail.
> -	 */
> -	if (lkey != UINT32_MAX)
> -		mr_btree_insert(bt, entry);
> -	return lkey;
> -}
> -
> -/**
> - * Bottom-half of LKey search on datapath. Firstly search in cache_bh[] and if
> - * misses, search in the global MR cache table and update the new entry to
> - * per-queue local caches.
> - *
> - * @param dev
> - *   Pointer to Ethernet device.
> - * @param mr_ctrl
> - *   Pointer to per-queue MR control structure.
> - * @param addr
> - *   Search key.
> - *
> - * @return
> - *   Searched LKey on success, UINT32_MAX on no match.
> - */
> -static uint32_t
> -mlx5_mr_addr2mr_bh(struct rte_eth_dev *dev, struct mlx5_mr_ctrl *mr_ctrl,
> -		   uintptr_t addr)
> -{
> -	uint32_t lkey;
> -	uint16_t bh_idx = 0;
> -	/* Victim in top-half cache to replace with new entry. */
> -	struct mlx5_mr_cache *repl = &mr_ctrl->cache[mr_ctrl->head];
> -
> -	/* Binary-search MR translation table. */
> -	lkey = mr_btree_lookup(&mr_ctrl->cache_bh, &bh_idx, addr);
> -	/* Update top-half cache. */
> -	if (likely(lkey != UINT32_MAX)) {
> -		*repl = (*mr_ctrl->cache_bh.table)[bh_idx];
> -	} else {
> -		/*
> -		 * If missed in local lookup table, search in the global cache
> -		 * and local cache_bh[] will be updated inside if possible.
> -		 * Top-half cache entry will also be updated.
> -		 */
> -		lkey = mlx5_mr_lookup_dev(dev, mr_ctrl, repl, addr);
> -		if (unlikely(lkey == UINT32_MAX))
> -			return UINT32_MAX;
> -	}
> -	/* Update the most recently used entry. */
> -	mr_ctrl->mru = mr_ctrl->head;
> -	/* Point to the next victim, the oldest. */
> -	mr_ctrl->head = (mr_ctrl->head + 1) % MLX5_MR_CACHE_N;
> -	return lkey;
> -}
> -
>  /**
>   * Bottom-half of LKey search on Rx.
>   *
> @@ -1114,7 +183,9 @@ mlx5_rx_addr2mr_bh(struct mlx5_rxq_data *rxq, uintptr_t addr)
>  	struct mlx5_mr_ctrl *mr_ctrl = &rxq->mr_ctrl;
>  	struct mlx5_priv *priv = rxq_ctrl->priv;
> 
> -	return mlx5_mr_addr2mr_bh(ETH_DEV(priv), mr_ctrl, addr);
> +	return mlx5_mr_addr2mr_bh(priv->sh->pd, &priv->mp_id,
> +				  &priv->sh->share_cache, mr_ctrl, addr,
> +				  priv->config.mr_ext_memseg_en);
>  }
> 
>  /**
> @@ -1136,7 +207,9 @@ mlx5_tx_addr2mr_bh(struct mlx5_txq_data *txq, uintptr_t addr)
>  	struct mlx5_mr_ctrl *mr_ctrl = &txq->mr_ctrl;
>  	struct mlx5_priv *priv = txq_ctrl->priv;
> 
> -	return mlx5_mr_addr2mr_bh(ETH_DEV(priv), mr_ctrl, addr);
> +	return mlx5_mr_addr2mr_bh(priv->sh->pd, &priv->mp_id,
> +				  &priv->sh->share_cache, mr_ctrl, addr,
> +				  priv->config.mr_ext_memseg_en);
>  }
> 
>  /**
> @@ -1165,82 +238,6 @@ mlx5_tx_mb2mr_bh(struct mlx5_txq_data *txq, struct rte_mbuf *mb)
>  	return lkey;
>  }
> 
> -/**
> - * Flush all of the local cache entries.
> - *
> - * @param mr_ctrl
> - *   Pointer to per-queue MR control structure.
> - */
> -void
> -mlx5_mr_flush_local_cache(struct mlx5_mr_ctrl *mr_ctrl)
> -{
> -	/* Reset the most-recently-used index. */
> -	mr_ctrl->mru = 0;
> -	/* Reset the linear search array. */
> -	mr_ctrl->head = 0;
> -	memset(mr_ctrl->cache, 0, sizeof(mr_ctrl->cache));
> -	/* Reset the B-tree table. */
> -	mr_ctrl->cache_bh.len = 1;
> -	mr_ctrl->cache_bh.overflow = 0;
> -	/* Update the generation number. */
> -	mr_ctrl->cur_gen = *mr_ctrl->dev_gen_ptr;
> -	DRV_LOG(DEBUG, "mr_ctrl(%p): flushed, cur_gen=%d",
> -		(void *)mr_ctrl, mr_ctrl->cur_gen);
> -}
> -
> -/**
> - * Creates a memory region for external memory, that is memory which is not
> - * part of the DPDK memory segments.
> - *
> - * @param dev
> - *   Pointer to the ethernet device.
> - * @param addr
> - *   Starting virtual address of memory.
> - * @param len
> - *   Length of memory segment being mapped.
> - * @param socket_id
> - *   Socket to allocate heap memory for the control structures.
> - *
> - * @return
> - *   Pointer to MR structure on success, NULL otherwise.
> - */
> -static struct mlx5_mr *
> -mlx5_create_mr_ext(struct rte_eth_dev *dev, uintptr_t addr, size_t len,
> -		   int socket_id)
> -{
> -	struct mlx5_priv *priv = dev->data->dev_private;
> -	struct mlx5_mr *mr = NULL;
> -
> -	mr = rte_zmalloc_socket(NULL,
> -				RTE_ALIGN_CEIL(sizeof(*mr),
> -					       RTE_CACHE_LINE_SIZE),
> -				RTE_CACHE_LINE_SIZE, socket_id);
> -	if (mr == NULL)
> -		return NULL;
> -	mr->ibv_mr = mlx5_glue->reg_mr(priv->sh->pd, (void *)addr, len,
> -				       IBV_ACCESS_LOCAL_WRITE |
> -					   IBV_ACCESS_RELAXED_ORDERING);
> -	if (mr->ibv_mr == NULL) {
> -		DRV_LOG(WARNING,
> -			"port %u fail to create a verbs MR for address (%p)",
> -			dev->data->port_id, (void *)addr);
> -		rte_free(mr);
> -		return NULL;
> -	}
> -	mr->msl = NULL; /* Mark it is external memory. */
> -	mr->ms_bmp = NULL;
> -	mr->ms_n = 1;
> -	mr->ms_bmp_n = 1;
> -	DRV_LOG(DEBUG,
> -		"port %u MR CREATED (%p) for external memory %p:\n"
> -		"  [0x%" PRIxPTR ", 0x%" PRIxPTR "),"
> -		" lkey=0x%x base_idx=%u ms_n=%u, ms_bmp_n=%u",
> -		dev->data->port_id, (void *)mr, (void *)addr,
> -		addr, addr + len, rte_cpu_to_be_32(mr->ibv_mr->lkey),
> -		mr->ms_base_idx, mr->ms_n, mr->ms_bmp_n);
> -	return mr;
> -}
> -
>  /**
>   * Called during rte_mempool_mem_iter() by mlx5_mr_update_ext_mp().
>   *
> @@ -1267,19 +264,19 @@ mlx5_mr_update_ext_mp_cb(struct rte_mempool *mp, void *opaque,
>  	struct mlx5_mr *mr = NULL;
>  	uintptr_t addr = (uintptr_t)memhdr->addr;
>  	size_t len = memhdr->len;
> -	struct mlx5_mr_cache entry;
> +	struct mr_cache_entry entry;
>  	uint32_t lkey;
> 
>  	MLX5_ASSERT(rte_eal_process_type() == RTE_PROC_PRIMARY);
>  	/* If already registered, it should return. */
> -	rte_rwlock_read_lock(&sh->mr.rwlock);
> -	lkey = mr_lookup_dev(sh, &entry, addr);
> -	rte_rwlock_read_unlock(&sh->mr.rwlock);
> +	rte_rwlock_read_lock(&sh->share_cache.rwlock);
> +	lkey = mlx5_mr_lookup_cache(&sh->share_cache, &entry, addr);
> +	rte_rwlock_read_unlock(&sh->share_cache.rwlock);
>  	if (lkey != UINT32_MAX)
>  		return;
>  	DRV_LOG(DEBUG, "port %u register MR for chunk #%d of mempool
> (%s)",
>  		dev->data->port_id, mem_idx, mp->name);
> -	mr = mlx5_create_mr_ext(dev, addr, len, mp->socket_id);
> +	mr = mlx5_create_mr_ext(sh->pd, addr, len, mp->socket_id);
>  	if (!mr) {
>  		DRV_LOG(WARNING,
>  			"port %u unable to allocate a new MR of"
> @@ -1288,13 +285,14 @@ mlx5_mr_update_ext_mp_cb(struct rte_mempool *mp, void *opaque,
>  		data->ret = -1;
>  		return;
>  	}
> -	rte_rwlock_write_lock(&sh->mr.rwlock);
> -	LIST_INSERT_HEAD(&sh->mr.mr_list, mr, mr);
> +	rte_rwlock_write_lock(&sh->share_cache.rwlock);
> +	LIST_INSERT_HEAD(&sh->share_cache.mr_list, mr, mr);
>  	/* Insert to the global cache table. */
> -	mr_insert_dev_cache(sh, mr);
> -	rte_rwlock_write_unlock(&sh->mr.rwlock);
> +	mlx5_mr_insert_cache(&sh->share_cache, mr);
> +	rte_rwlock_write_unlock(&sh->share_cache.rwlock);
>  	/* Insert to the local cache table */
> -	mlx5_mr_addr2mr_bh(dev, mr_ctrl, addr);
> +	mlx5_mr_addr2mr_bh(sh->pd, &priv->mp_id, &sh->share_cache,
> +			   mr_ctrl, addr, priv->config.mr_ext_memseg_en);
>  }
> 
>  /**
> @@ -1351,19 +349,19 @@ mlx5_dma_map(struct rte_pci_device *pdev, void *addr,
>  		return -1;
>  	}
>  	priv = dev->data->dev_private;
> -	mr = mlx5_create_mr_ext(dev, (uintptr_t)addr, len, SOCKET_ID_ANY);
> +	sh = priv->sh;
> +	mr = mlx5_create_mr_ext(sh->pd, (uintptr_t)addr, len, SOCKET_ID_ANY);
>  	if (!mr) {
>  		DRV_LOG(WARNING,
>  			"port %u unable to dma map", dev->data->port_id);
>  		rte_errno = EINVAL;
>  		return -1;
>  	}
> -	sh = priv->sh;
> -	rte_rwlock_write_lock(&sh->mr.rwlock);
> -	LIST_INSERT_HEAD(&sh->mr.mr_list, mr, mr);
> +	rte_rwlock_write_lock(&sh->share_cache.rwlock);
> +	LIST_INSERT_HEAD(&sh->share_cache.mr_list, mr, mr);
>  	/* Insert to the global cache table. */
> -	mr_insert_dev_cache(sh, mr);
> -	rte_rwlock_write_unlock(&sh->mr.rwlock);
> +	mlx5_mr_insert_cache(&sh->share_cache, mr);
> +	rte_rwlock_write_unlock(&sh->share_cache.rwlock);
>  	return 0;
>  }
> 
> @@ -1390,7 +388,7 @@ mlx5_dma_unmap(struct rte_pci_device *pdev, void *addr,
>  	struct mlx5_priv *priv;
>  	struct mlx5_ibv_shared *sh;
>  	struct mlx5_mr *mr;
> -	struct mlx5_mr_cache entry;
> +	struct mr_cache_entry entry;
> 
>  	dev = pci_dev_to_eth_dev(pdev);
>  	if (!dev) {
> @@ -1401,10 +399,10 @@ mlx5_dma_unmap(struct rte_pci_device *pdev, void *addr,
>  	}
>  	priv = dev->data->dev_private;
>  	sh = priv->sh;
> -	rte_rwlock_read_lock(&sh->mr.rwlock);
> -	mr = mr_lookup_dev_list(sh, &entry, (uintptr_t)addr);
> +	rte_rwlock_read_lock(&sh->share_cache.rwlock);
> +	mr = mlx5_mr_lookup_list(&sh->share_cache, &entry, (uintptr_t)addr);
>  	if (!mr) {
> -		rte_rwlock_read_unlock(&sh->mr.rwlock);
> +		rte_rwlock_read_unlock(&sh->share_cache.rwlock);
>  		DRV_LOG(WARNING, "address 0x%" PRIxPTR " wasn't
> registered "
>  				 "to PCI device %p", (uintptr_t)addr,
>  				 (void *)pdev);
> @@ -1412,10 +410,10 @@ mlx5_dma_unmap(struct rte_pci_device *pdev, void *addr,
>  		return -1;
>  	}
>  	LIST_REMOVE(mr, mr);
> -	LIST_INSERT_HEAD(&sh->mr.mr_free_list, mr, mr);
> +	LIST_INSERT_HEAD(&sh->share_cache.mr_free_list, mr, mr);
>  	DEBUG("port %u remove MR(%p) from list", dev->data->port_id,
>  	      (void *)mr);
> -	mr_rebuild_dev_cache(sh);
> +	mlx5_mr_rebuild_cache(&sh->share_cache);
>  	/*
>  	 * Flush local caches by propagating invalidation across cores.
>  	 * rte_smp_wmb() is enough to synchronize this event. If one of
> @@ -1425,10 +423,11 @@ mlx5_dma_unmap(struct rte_pci_device *pdev, void *addr,
>  	 * generation below) will be guaranteed to be seen by other core
>  	 * before the core sees the newly allocated memory.
>  	 */
> -	++sh->mr.dev_gen;
> -	DEBUG("broadcasting local cache flush, gen=%d",	sh-
> >mr.dev_gen);
> +	++sh->share_cache.dev_gen;
> +	DEBUG("broadcasting local cache flush, gen=%d",
> +	      sh->share_cache.dev_gen);
>  	rte_smp_wmb();
> -	rte_rwlock_read_unlock(&sh->mr.rwlock);
> +	rte_rwlock_read_unlock(&sh->share_cache.rwlock);
>  	return 0;
>  }
> 
> @@ -1503,14 +502,19 @@ mlx5_mr_update_mp_cb(struct rte_mempool *mp __rte_unused, void *opaque,
>  		     unsigned mem_idx __rte_unused)
>  {
>  	struct mr_update_mp_data *data = opaque;
> +	struct rte_eth_dev *dev = data->dev;
> +	struct mlx5_priv *priv = dev->data->dev_private;
> +
>  	uint32_t lkey;
> 
>  	/* Stop iteration if failed in the previous walk. */
>  	if (data->ret < 0)
>  		return;
>  	/* Register address of the chunk and update local caches. */
> -	lkey = mlx5_mr_addr2mr_bh(data->dev, data->mr_ctrl,
> -				  (uintptr_t)memhdr->addr);
> +	lkey = mlx5_mr_addr2mr_bh(priv->sh->pd, &priv->mp_id,
> +				  &priv->sh->share_cache, data->mr_ctrl,
> +				  (uintptr_t)memhdr->addr,
> +				  priv->config.mr_ext_memseg_en);
>  	if (lkey == UINT32_MAX)
>  		data->ret = -1;
>  }
> @@ -1545,76 +549,3 @@ mlx5_mr_update_mp(struct rte_eth_dev *dev, struct mlx5_mr_ctrl *mr_ctrl,
>  	}
>  	return data.ret;
>  }
> -
> -/**
> - * Dump all the created MRs and the global cache entries.
> - *
> - * @param sh
> - *   Pointer to Ethernet device shared context.
> - */
> -void
> -mlx5_mr_dump_dev(struct mlx5_ibv_shared *sh __rte_unused)
> -{
> -#ifdef RTE_LIBRTE_MLX5_DEBUG
> -	struct mlx5_mr *mr;
> -	int mr_n = 0;
> -	int chunk_n = 0;
> -
> -	rte_rwlock_read_lock(&sh->mr.rwlock);
> -	/* Iterate all the existing MRs. */
> -	LIST_FOREACH(mr, &sh->mr.mr_list, mr) {
> -		unsigned int n;
> -
> -		DEBUG("device %s MR[%u], LKey = 0x%x, ms_n = %u,
> ms_bmp_n = %u",
> -		      sh->ibdev_name, mr_n++,
> -		      rte_cpu_to_be_32(mr->ibv_mr->lkey),
> -		      mr->ms_n, mr->ms_bmp_n);
> -		if (mr->ms_n == 0)
> -			continue;
> -		for (n = 0; n < mr->ms_bmp_n; ) {
> -			struct mlx5_mr_cache ret = { 0, };
> -
> -			n = mr_find_next_chunk(mr, &ret, n);
> -			if (!ret.end)
> -				break;
> -			DEBUG("  chunk[%u], [0x%" PRIxPTR ", 0x%" PRIxPTR
> ")",
> -			      chunk_n++, ret.start, ret.end);
> -		}
> -	}
> -	DEBUG("device %s dumping global cache", sh->ibdev_name);
> -	mlx5_mr_btree_dump(&sh->mr.cache);
> -	rte_rwlock_read_unlock(&sh->mr.rwlock);
> -#endif
> -}
> -
> -/**
> - * Release all the created MRs and resources for shared device context.
> - *
> - * @param sh
> - *   Pointer to Ethernet device shared context.
> - */
> -void
> -mlx5_mr_release(struct mlx5_ibv_shared *sh)
> -{
> -	struct mlx5_mr *mr_next;
> -
> -	if (rte_log_can_log(mlx5_logtype, RTE_LOG_DEBUG))
> -		mlx5_mr_dump_dev(sh);
> -	rte_rwlock_write_lock(&sh->mr.rwlock);
> -	/* Detach from MR list and move to free list. */
> -	mr_next = LIST_FIRST(&sh->mr.mr_list);
> -	while (mr_next != NULL) {
> -		struct mlx5_mr *mr = mr_next;
> -
> -		mr_next = LIST_NEXT(mr, mr);
> -		LIST_REMOVE(mr, mr);
> -		LIST_INSERT_HEAD(&sh->mr.mr_free_list, mr, mr);
> -	}
> -	LIST_INIT(&sh->mr.mr_list);
> -	/* Free global cache. */
> -	mlx5_mr_btree_free(&sh->mr.cache);
> -	rte_rwlock_write_unlock(&sh->mr.rwlock);
> -	/* Free all remaining MRs. */
> -	mlx5_mr_garbage_collect(sh);
> -}
> diff --git a/drivers/net/mlx5/mlx5_mr.h b/drivers/net/mlx5/mlx5_mr.h
> index 48264c8294..0c5877b3d6 100644
> --- a/drivers/net/mlx5/mlx5_mr.h
> +++ b/drivers/net/mlx5/mlx5_mr.h
> @@ -24,99 +24,16 @@
>  #include <rte_ethdev.h>
>  #include <rte_rwlock.h>
>  #include <rte_bitmap.h>
> +#include <rte_memory.h>
> 
> -/* Memory Region object. */
> -struct mlx5_mr {
> -	LIST_ENTRY(mlx5_mr) mr; /**< Pointer to the prev/next entry. */
> -	struct ibv_mr *ibv_mr; /* Verbs Memory Region. */
> -	const struct rte_memseg_list *msl;
> -	int ms_base_idx; /* Start index of msl->memseg_arr[]. */
> -	int ms_n; /* Number of memsegs in use. */
> -	uint32_t ms_bmp_n; /* Number of bits in memsegs bit-mask. */
> -	struct rte_bitmap *ms_bmp; /* Bit-mask of memsegs belonged to MR. */
> -};
> -
> -/* Cache entry for Memory Region. */
> -struct mlx5_mr_cache {
> -	uintptr_t start; /* Start address of MR. */
> -	uintptr_t end; /* End address of MR. */
> -	uint32_t lkey; /* rte_cpu_to_be_32(ibv_mr->lkey). */
> -} __rte_packed;
> -
> -/* MR Cache table for Binary search. */
> -struct mlx5_mr_btree {
> -	uint16_t len; /* Number of entries. */
> -	uint16_t size; /* Total number of entries. */
> -	int overflow; /* Mark failure of table expansion. */
> -	struct mlx5_mr_cache (*table)[];
> -} __rte_packed;
> -
> -/* Per-queue MR control descriptor. */
> -struct mlx5_mr_ctrl {
> -	uint32_t *dev_gen_ptr; /* Generation number of device to poll. */
> -	uint32_t cur_gen; /* Generation number saved to flush caches. */
> -	uint16_t mru; /* Index of last hit entry in top-half cache. */
> -	uint16_t head; /* Index of the oldest entry in top-half cache. */
> -	struct mlx5_mr_cache cache[MLX5_MR_CACHE_N]; /* Cache for top-half. */
> -	struct mlx5_mr_btree cache_bh; /* Cache for bottom-half. */
> -} __rte_packed;
> -
> -struct mlx5_ibv_shared;
> -extern struct mlx5_dev_list  mlx5_mem_event_cb_list;
> -extern rte_rwlock_t mlx5_mem_event_rwlock;
> +#include <mlx5_common_mr.h>
> 
>  /* First entry must be NULL for comparison. */
>  #define mlx5_mr_btree_len(bt) ((bt)->len - 1)
> 
> -int mlx5_mr_btree_init(struct mlx5_mr_btree *bt, int n, int socket);
> -void mlx5_mr_btree_free(struct mlx5_mr_btree *bt);
> -uint32_t mlx5_mr_create_primary(struct rte_eth_dev *dev,
> -				struct mlx5_mr_cache *entry, uintptr_t addr);
>  void mlx5_mr_mem_event_cb(enum rte_mem_event event_type, const void *addr,
>  			  size_t len, void *arg);
>  int mlx5_mr_update_mp(struct rte_eth_dev *dev, struct mlx5_mr_ctrl *mr_ctrl,
>  		      struct rte_mempool *mp);
> -void mlx5_mr_release(struct mlx5_ibv_shared *sh);
> -
> -/* Debug purpose functions. */
> -void mlx5_mr_btree_dump(struct mlx5_mr_btree *bt);
> -void mlx5_mr_dump_dev(struct mlx5_ibv_shared *sh);
> -
> -/**
> - * Look up LKey from given lookup table by linear search. Firstly look up the
> - * last-hit entry. If miss, the entire array is searched. If found, update the
> - * last-hit index and return LKey.
> - *
> - * @param lkp_tbl
> - *   Pointer to lookup table.
> - * @param[in,out] cached_idx
> - *   Pointer to last-hit index.
> - * @param n
> - *   Size of lookup table.
> - * @param addr
> - *   Search key.
> - *
> - * @return
> - *   Searched LKey on success, UINT32_MAX on no match.
> - */
> -static __rte_always_inline uint32_t
> -mlx5_mr_lookup_cache(struct mlx5_mr_cache *lkp_tbl, uint16_t *cached_idx,
> -		     uint16_t n, uintptr_t addr)
> -{
> -	uint16_t idx;
> -
> -	if (likely(addr >= lkp_tbl[*cached_idx].start &&
> -		   addr < lkp_tbl[*cached_idx].end))
> -		return lkp_tbl[*cached_idx].lkey;
> -	for (idx = 0; idx < n && lkp_tbl[idx].start != 0; ++idx) {
> -		if (addr >= lkp_tbl[idx].start &&
> -		    addr < lkp_tbl[idx].end) {
> -			/* Found. */
> -			*cached_idx = idx;
> -			return lkp_tbl[idx].lkey;
> -		}
> -	}
> -	return UINT32_MAX;
> -}
> 
>  #endif /* RTE_PMD_MLX5_MR_H_ */
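
The inline helper removed from this header survives as mlx5_mr_lookup_lkey()
in the common driver (see the mlx5_rxtx.h hunk below): a linear scan with a
most-recently-used shortcut. A standalone sketch of the idea, with simplified
types and illustrative names:

	#include <stdint.h>

	struct demo_entry { uintptr_t start, end; uint32_t lkey; };

	static uint32_t
	demo_lookup_mru(const struct demo_entry *tbl, uint16_t *mru,
			uint16_t n, uintptr_t addr)
	{
		uint16_t i;

		/* Fast path: probe the last-hit entry first. */
		if (addr >= tbl[*mru].start && addr < tbl[*mru].end)
			return tbl[*mru].lkey;
		/* Slow path: scan until the first unused (zeroed) slot. */
		for (i = 0; i < n && tbl[i].start != 0; ++i) {
			if (addr >= tbl[i].start && addr < tbl[i].end) {
				*mru = i;	/* remember the hit */
				return tbl[i].lkey;
			}
		}
		return UINT32_MAX;
	}
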
> diff --git a/drivers/net/mlx5/mlx5_rxtx.c b/drivers/net/mlx5/mlx5_rxtx.c
> index fc7591c2b0..5f9b670442 100644
> --- a/drivers/net/mlx5/mlx5_rxtx.c
> +++ b/drivers/net/mlx5/mlx5_rxtx.c
> @@ -33,6 +33,7 @@
> 
>  #include "mlx5_defs.h"
>  #include "mlx5.h"
> +#include "mlx5_mr.h"
>  #include "mlx5_utils.h"
>  #include "mlx5_rxtx.h"
>  #include "mlx5_autoconf.h"
> diff --git a/drivers/net/mlx5/mlx5_rxtx.h b/drivers/net/mlx5/mlx5_rxtx.h
> index 939778aa55..84161ad6af 100644
> --- a/drivers/net/mlx5/mlx5_rxtx.h
> +++ b/drivers/net/mlx5/mlx5_rxtx.h
> @@ -34,11 +34,11 @@
>  #include <mlx5_glue.h>
>  #include <mlx5_prm.h>
>  #include <mlx5_common.h>
> +#include <mlx5_common_mr.h>
> 
>  #include "mlx5_defs.h"
>  #include "mlx5_utils.h"
>  #include "mlx5.h"
> -#include "mlx5_mr.h"
>  #include "mlx5_autoconf.h"
> 
>  /* Support tunnel matching. */
> @@ -598,8 +598,8 @@ mlx5_rx_addr2mr(struct mlx5_rxq_data *rxq, uintptr_t addr)
>  	uint32_t lkey;
> 
>  	/* Linear search on MR cache array. */
> -	lkey = mlx5_mr_lookup_cache(mr_ctrl->cache, &mr_ctrl->mru,
> -				    MLX5_MR_CACHE_N, addr);
> +	lkey = mlx5_mr_lookup_lkey(mr_ctrl->cache, &mr_ctrl->mru,
> +				   MLX5_MR_CACHE_N, addr);
>  	if (likely(lkey != UINT32_MAX))
>  		return lkey;
>  	/* Take slower bottom-half (Binary Search) on miss. */
> @@ -630,8 +630,8 @@ mlx5_tx_mb2mr(struct mlx5_txq_data *txq, struct rte_mbuf *mb)
>  	if (unlikely(*mr_ctrl->dev_gen_ptr != mr_ctrl->cur_gen))
>  		mlx5_mr_flush_local_cache(mr_ctrl);
>  	/* Linear search on MR cache array. */
> -	lkey = mlx5_mr_lookup_cache(mr_ctrl->cache, &mr_ctrl->mru,
> -				    MLX5_MR_CACHE_N, addr);
> +	lkey = mlx5_mr_lookup_lkey(mr_ctrl->cache, &mr_ctrl->mru,
> +				   MLX5_MR_CACHE_N, addr);
>  	if (likely(lkey != UINT32_MAX))
>  		return lkey;
>  	/* Take slower bottom-half on miss. */
> diff --git a/drivers/net/mlx5/mlx5_rxtx_vec.h b/drivers/net/mlx5/mlx5_rxtx_vec.h
> index ea925156f0..6ddcbfb0ad 100644
> --- a/drivers/net/mlx5/mlx5_rxtx_vec.h
> +++ b/drivers/net/mlx5/mlx5_rxtx_vec.h
> @@ -13,6 +13,8 @@
> 
>  #include "mlx5_autoconf.h"
> 
> +#include "mlx5_mr.h"
> +
>  /* HW checksum offload capabilities of vectorized Tx. */
>  #define MLX5_VEC_TX_CKSUM_OFFLOAD_CAP \
>  	(DEV_TX_OFFLOAD_IPV4_CKSUM | \
> diff --git a/drivers/net/mlx5/mlx5_trigger.c b/drivers/net/mlx5/mlx5_trigger.c
> index 438b705952..759670408b 100644
> --- a/drivers/net/mlx5/mlx5_trigger.c
> +++ b/drivers/net/mlx5/mlx5_trigger.c
> @@ -11,6 +11,7 @@
>  #include <rte_alarm.h>
> 
>  #include "mlx5.h"
> +#include "mlx5_mr.h"
>  #include "mlx5_rxtx.h"
>  #include "mlx5_utils.h"
>  #include "rte_pmd_mlx5.h"
> diff --git a/drivers/net/mlx5/mlx5_txq.c b/drivers/net/mlx5/mlx5_txq.c
> index 0653f4cf30..29e5cabab6 100644
> --- a/drivers/net/mlx5/mlx5_txq.c
> +++ b/drivers/net/mlx5/mlx5_txq.c
> @@ -30,6 +30,7 @@
>  #include <mlx5_glue.h>
>  #include <mlx5_devx_cmds.h>
>  #include <mlx5_common.h>
> +#include <mlx5_common_mr.h>
> 
>  #include "mlx5_defs.h"
>  #include "mlx5_utils.h"
> @@ -1289,7 +1290,7 @@ mlx5_txq_new(struct rte_eth_dev *dev, uint16_t idx, uint16_t desc,
>  		goto error;
>  	}
>  	/* Save pointer of global generation number to check memory event. */
> -	tmpl->txq.mr_ctrl.dev_gen_ptr = &priv->sh->mr.dev_gen;
> +	tmpl->txq.mr_ctrl.dev_gen_ptr = &priv->sh->share_cache.dev_gen;
>  	MLX5_ASSERT(desc > MLX5_TX_COMP_THRESH);
>  	tmpl->txq.offloads = conf->offloads |
>  			     dev->data->dev_conf.txmode.offloads;
> --
> 2.16.6


^ permalink raw reply	[flat|nested] 26+ messages in thread

* [dpdk-dev] [PATCH v4 0/2] refactor multi-process IPC and memory management codes to common driver
  2020-04-02 19:21 [dpdk-dev] [PATCH 0/4] refactor multi-process IPC and memory management codes to common driver Vu Pham
                   ` (5 preceding siblings ...)
  2020-04-07 17:00 ` [dpdk-dev] [PATCH v3 0/4] refactor multi-process IPC and memory management codes to common driver Vu Pham
@ 2020-04-13 21:17 ` Vu Pham
  2020-04-13 21:17   ` [dpdk-dev] [PATCH v4 1/2] common/mlx5: refactor multi-process IPC handling " Vu Pham
                     ` (2 more replies)
  6 siblings, 3 replies; 26+ messages in thread
From: Vu Pham @ 2020-04-13 21:17 UTC (permalink / raw)
  To: dev; +Cc: viacheslavo, orika, matan, rasland, Vu Pham

The current mlx5 net PMD and future mlx5 (regex, ...) PMDs that run on
and share the same HCAs need to use a common memory management driver.
The memory management code internally uses multi-process IPC for the
primary/secondary processes to register and synchronize memory
registrations (MRs). That is the main reason to refactor and move the
multi-process IPC APIs to the mlx5 common driver and make that the
base commit, then refactor and move the common MR code to the common
driver in the subsequent patch.

Vu Pham (2):
  common/mlx5: refactor multi-process IPC handling codes to common
    driver
  common/mlx5: refactor memory management codes

 drivers/common/mlx5/Makefile                    |    4 +-
 drivers/common/mlx5/meson.build                 |    2 +
 drivers/common/mlx5/mlx5_common_mp.c            |  188 ++++
 drivers/common/mlx5/mlx5_common_mp.h            |   98 ++
 drivers/common/mlx5/mlx5_common_mr.c            | 1108 +++++++++++++++++++++
 drivers/common/mlx5/mlx5_common_mr.h            |  160 ++++
 drivers/common/mlx5/rte_common_mlx5_version.map |   27 +
 drivers/net/mlx5/mlx5.c                         |   19 +-
 drivers/net/mlx5/mlx5.h                         |   55 +-
 drivers/net/mlx5/mlx5_mp.c                      |  242 +----
 drivers/net/mlx5/mlx5_mr.c                      | 1169 +----------------------
 drivers/net/mlx5/mlx5_mr.h                      |   87 +-
 drivers/net/mlx5/mlx5_rxtx.c                    |    4 +-
 drivers/net/mlx5/mlx5_rxtx.h                    |   10 +-
 drivers/net/mlx5/mlx5_rxtx_vec.h                |    2 +
 drivers/net/mlx5/mlx5_trigger.c                 |    1 +
 drivers/net/mlx5/mlx5_txq.c                     |    3 +-
 17 files changed, 1692 insertions(+), 1487 deletions(-)
 create mode 100644 drivers/common/mlx5/mlx5_common_mp.c
 create mode 100644 drivers/common/mlx5/mlx5_common_mp.h
 create mode 100644 drivers/common/mlx5/mlx5_common_mr.c
 create mode 100644 drivers/common/mlx5/mlx5_common_mr.h

-- 
2.16.6


^ permalink raw reply	[flat|nested] 26+ messages in thread

* [dpdk-dev] [PATCH v4 1/2] common/mlx5: refactor multi-process IPC handling codes to common driver
  2020-04-13 21:17 ` [dpdk-dev] [PATCH v4 0/2] refactor multi-process IPC and memory management codes to common driver Vu Pham
@ 2020-04-13 21:17   ` Vu Pham
  2020-04-14  7:26     ` Slava Ovsiienko
  2020-04-13 21:17   ` [dpdk-dev] [PATCH v4 2/2] common/mlx5: refactor memory management codes Vu Pham
  2020-04-15  9:30   ` [dpdk-dev] [PATCH v4 0/2] refactor multi-process IPC and memory management codes to common driver Raslan Darawsheh
  2 siblings, 1 reply; 26+ messages in thread
From: Vu Pham @ 2020-04-13 21:17 UTC (permalink / raw)
  To: dev; +Cc: viacheslavo, orika, matan, rasland, Vu Pham

Refactor the common multi-process handling code from the net PMD into
the common driver. Use the tuple mp_id{name, port_id} as the standard
input parameter for all multi-process IPC APIs instead of rte_eth_dev.

Modify the net PMD to use the multi-process APIs from the common
driver.

Signed-off-by: Vu Pham <vuhuong@mellanox.com>
---
 drivers/common/mlx5/Makefile                    |   3 +-
 drivers/common/mlx5/meson.build                 |   1 +
 drivers/common/mlx5/mlx5_common_mp.c            | 188 +++++++++++++++++++
 drivers/common/mlx5/mlx5_common_mp.h            |  98 ++++++++++
 drivers/common/mlx5/rte_common_mlx5_version.map |  13 ++
 drivers/net/mlx5/mlx5.c                         |  15 +-
 drivers/net/mlx5/mlx5.h                         |  43 +----
 drivers/net/mlx5/mlx5_mp.c                      | 234 ++----------------------
 drivers/net/mlx5/mlx5_mr.c                      |   2 +-
 drivers/net/mlx5/mlx5_rxtx.c                    |   3 +-
 10 files changed, 336 insertions(+), 264 deletions(-)
 create mode 100644 drivers/common/mlx5/mlx5_common_mp.c
 create mode 100644 drivers/common/mlx5/mlx5_common_mp.h
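
Note: as a usage illustration of the new API, a secondary process now
identifies itself with the small mp_id tuple instead of passing a full
rte_eth_dev. A minimal sketch, mirroring the mlx5_dev_spawn() hunk
below; attach_get_cmd_fd() is a hypothetical helper name,
MLX5_MP_NAME is the net PMD's IPC key string (kept in mlx5.h), and
the rest comes from this patch:

	#include <rte_ethdev_driver.h>
	#include <rte_string_fns.h>
	#include <mlx5_common_mp.h>
	#include "mlx5.h"

	/* Hypothetical helper: request the Verbs command fd over IPC. */
	static int
	attach_get_cmd_fd(struct rte_eth_dev *eth_dev)
	{
		struct mlx5_mp_id mp_id;

		/* Identify this port to the primary process. */
		mp_id.port_id = eth_dev->data->port_id;
		strlcpy(mp_id.name, MLX5_MP_NAME, RTE_MP_MAX_NAME_LEN);
		/* Returns the fd, or a negative errno value on failure. */
		return mlx5_mp_req_verbs_cmd_fd(&mp_id);
	}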

diff --git a/drivers/common/mlx5/Makefile b/drivers/common/mlx5/Makefile
index f32933d592..2a88492731 100644
--- a/drivers/common/mlx5/Makefile
+++ b/drivers/common/mlx5/Makefile
@@ -17,6 +17,7 @@ endif
 SRCS-y += mlx5_devx_cmds.c
 SRCS-y += mlx5_common.c
 SRCS-y += mlx5_nl.c
+SRCS-y += mlx5_common_mp.c
 ifeq ($(CONFIG_RTE_IBVERBS_LINK_DLOPEN),y)
 INSTALL-y-lib += $(LIB_GLUE)
 endif
@@ -46,7 +47,7 @@ endif
 LDLIBS += -lrte_eal -lrte_pci -lrte_kvargs -lrte_net
 
 # A few warnings cannot be avoided in external headers.
-CFLAGS += -Wno-error=cast-qual -UPEDANTIC
+CFLAGS += -Wno-error=cast-qual -UPEDANTIC -DALLOW_EXPERIMENTAL_API
 
 EXPORT_MAP := rte_common_mlx5_version.map
 
diff --git a/drivers/common/mlx5/meson.build b/drivers/common/mlx5/meson.build
index f671710714..83671861c9 100644
--- a/drivers/common/mlx5/meson.build
+++ b/drivers/common/mlx5/meson.build
@@ -55,6 +55,7 @@ sources = files(
 	'mlx5_devx_cmds.c',
 	'mlx5_common.c',
 	'mlx5_nl.c',
+	'mlx5_common_mp.c',
 )
 if not dlopen_ibverbs
 	sources += files('mlx5_glue.c')
diff --git a/drivers/common/mlx5/mlx5_common_mp.c b/drivers/common/mlx5/mlx5_common_mp.c
new file mode 100644
index 0000000000..da55143bc1
--- /dev/null
+++ b/drivers/common/mlx5/mlx5_common_mp.c
@@ -0,0 +1,188 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright 2019 6WIND S.A.
+ * Copyright 2019 Mellanox Technologies, Ltd
+ */
+
+#include <stdio.h>
+#include <time.h>
+
+#include <rte_eal.h>
+#include <rte_errno.h>
+
+#include "mlx5_common_mp.h"
+#include "mlx5_common_utils.h"
+
+/**
+ * Request Memory Region creation to the primary process.
+ *
+ * @param[in] mp_id
+ *   ID of the MP process.
+ * @param addr
+ *   Target virtual address to register.
+ *
+ * @return
+ *   0 on success, a negative errno value otherwise and rte_errno is set.
+ */
+int
+mlx5_mp_req_mr_create(struct mlx5_mp_id *mp_id, uintptr_t addr)
+{
+	struct rte_mp_msg mp_req;
+	struct rte_mp_msg *mp_res;
+	struct rte_mp_reply mp_rep;
+	struct mlx5_mp_param *req = (struct mlx5_mp_param *)mp_req.param;
+	struct mlx5_mp_param *res;
+	struct timespec ts = {.tv_sec = MLX5_MP_REQ_TIMEOUT_SEC, .tv_nsec = 0};
+	int ret;
+
+	MLX5_ASSERT(rte_eal_process_type() == RTE_PROC_SECONDARY);
+	mp_init_msg(mp_id, &mp_req, MLX5_MP_REQ_CREATE_MR);
+	req->args.addr = addr;
+	ret = rte_mp_request_sync(&mp_req, &mp_rep, &ts);
+	if (ret) {
+		DRV_LOG(ERR, "port %u request to primary process failed",
+			mp_id->port_id);
+		return -rte_errno;
+	}
+	MLX5_ASSERT(mp_rep.nb_received == 1);
+	mp_res = &mp_rep.msgs[0];
+	res = (struct mlx5_mp_param *)mp_res->param;
+	ret = res->result;
+	if (ret)
+		rte_errno = -ret;
+	free(mp_rep.msgs);
+	return ret;
+}
+
+/**
+ * Request Verbs queue state modification to the primary process.
+ *
+ * @param[in] mp_id
+ *   ID of the MP process.
+ * @param sm
+ *   State modify parameters.
+ *
+ * @return
+ *   0 on success, a negative errno value otherwise and rte_errno is set.
+ */
+int
+mlx5_mp_req_queue_state_modify(struct mlx5_mp_id *mp_id,
+			       struct mlx5_mp_arg_queue_state_modify *sm)
+{
+	struct rte_mp_msg mp_req;
+	struct rte_mp_msg *mp_res;
+	struct rte_mp_reply mp_rep;
+	struct mlx5_mp_param *req = (struct mlx5_mp_param *)mp_req.param;
+	struct mlx5_mp_param *res;
+	struct timespec ts = {.tv_sec = MLX5_MP_REQ_TIMEOUT_SEC, .tv_nsec = 0};
+	int ret;
+
+	MLX5_ASSERT(rte_eal_process_type() == RTE_PROC_SECONDARY);
+	mp_init_msg(mp_id, &mp_req, MLX5_MP_REQ_QUEUE_STATE_MODIFY);
+	req->args.state_modify = *sm;
+	ret = rte_mp_request_sync(&mp_req, &mp_rep, &ts);
+	if (ret) {
+		DRV_LOG(ERR, "port %u request to primary process failed",
+			mp_id->port_id);
+		return -rte_errno;
+	}
+	MLX5_ASSERT(mp_rep.nb_received == 1);
+	mp_res = &mp_rep.msgs[0];
+	res = (struct mlx5_mp_param *)mp_res->param;
+	ret = res->result;
+	free(mp_rep.msgs);
+	return ret;
+}
+
+/**
+ * Request Verbs command file descriptor for mmap to the primary process.
+ *
+ * @param[in] mp_id
+ *   ID of the MP process.
+ *
+ * @return
+ *   fd on success, a negative errno value otherwise and rte_errno is set.
+ */
+int
+mlx5_mp_req_verbs_cmd_fd(struct mlx5_mp_id *mp_id)
+{
+	struct rte_mp_msg mp_req;
+	struct rte_mp_msg *mp_res;
+	struct rte_mp_reply mp_rep;
+	struct mlx5_mp_param *res;
+	struct timespec ts = {.tv_sec = MLX5_MP_REQ_TIMEOUT_SEC, .tv_nsec = 0};
+	int ret;
+
+	MLX5_ASSERT(rte_eal_process_type() == RTE_PROC_SECONDARY);
+	mp_init_msg(mp_id, &mp_req, MLX5_MP_REQ_VERBS_CMD_FD);
+	ret = rte_mp_request_sync(&mp_req, &mp_rep, &ts);
+	if (ret) {
+		DRV_LOG(ERR, "port %u request to primary process failed",
+			mp_id->port_id);
+		return -rte_errno;
+	}
+	MLX5_ASSERT(mp_rep.nb_received == 1);
+	mp_res = &mp_rep.msgs[0];
+	res = (struct mlx5_mp_param *)mp_res->param;
+	if (res->result) {
+		rte_errno = -res->result;
+		DRV_LOG(ERR,
+			"port %u failed to get command FD from primary process",
+			mp_id->port_id);
+		ret = -rte_errno;
+		goto exit;
+	}
+	MLX5_ASSERT(mp_res->num_fds == 1);
+	ret = mp_res->fds[0];
+	DRV_LOG(DEBUG, "port %u command FD from primary is %d",
+		mp_id->port_id, ret);
+exit:
+	free(mp_rep.msgs);
+	return ret;
+}
+
+/**
+ * Initialize by primary process.
+ */
+int
+mlx5_mp_init_primary(const char *name, const rte_mp_t primary_action)
+{
+	int ret;
+
+	MLX5_ASSERT(rte_eal_process_type() == RTE_PROC_PRIMARY);
+
+	/* primary is allowed to not support IPC */
+	ret = rte_mp_action_register(name, primary_action);
+	if (ret && rte_errno != ENOTSUP)
+		return -1;
+	return 0;
+}
+
+/**
+ * Un-initialize by primary process.
+ */
+void
+mlx5_mp_uninit_primary(const char *name)
+{
+	MLX5_ASSERT(rte_eal_process_type() == RTE_PROC_PRIMARY);
+	rte_mp_action_unregister(name);
+}
+
+/**
+ * Initialize by secondary process.
+ */
+int
+mlx5_mp_init_secondary(const char *name, const rte_mp_t secondary_action)
+{
+	MLX5_ASSERT(rte_eal_process_type() == RTE_PROC_SECONDARY);
+	return rte_mp_action_register(name, secondary_action);
+}
+
+/**
+ * Un-initialize by secondary process.
+ */
+void
+mlx5_mp_uninit_secondary(const char *name)
+{
+	MLX5_ASSERT(rte_eal_process_type() == RTE_PROC_SECONDARY);
+	rte_mp_action_unregister(name);
+}
diff --git a/drivers/common/mlx5/mlx5_common_mp.h b/drivers/common/mlx5/mlx5_common_mp.h
new file mode 100644
index 0000000000..7aab77acb2
--- /dev/null
+++ b/drivers/common/mlx5/mlx5_common_mp.h
@@ -0,0 +1,98 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright 2018 6WIND S.A.
+ * Copyright 2018 Mellanox Technologies, Ltd
+ */
+
+#ifndef RTE_PMD_MLX5_COMMON_MP_H_
+#define RTE_PMD_MLX5_COMMON_MP_H_
+
+/* Verbs header. */
+/* ISO C doesn't support unnamed structs/unions, disabling -pedantic. */
+#ifdef PEDANTIC
+#pragma GCC diagnostic ignored "-Wpedantic"
+#endif
+#include <infiniband/verbs.h>
+#ifdef PEDANTIC
+#pragma GCC diagnostic error "-Wpedantic"
+#endif
+
+#include <rte_eal.h>
+#include <rte_string_fns.h>
+
+/* Request types for IPC. */
+enum mlx5_mp_req_type {
+	MLX5_MP_REQ_VERBS_CMD_FD = 1,
+	MLX5_MP_REQ_CREATE_MR,
+	MLX5_MP_REQ_START_RXTX,
+	MLX5_MP_REQ_STOP_RXTX,
+	MLX5_MP_REQ_QUEUE_STATE_MODIFY,
+};
+
+struct mlx5_mp_arg_queue_state_modify {
+	uint8_t is_wq; /* Set if WQ. */
+	uint16_t queue_id; /* DPDK queue ID. */
+	enum ibv_wq_state state; /* WQ requested state. */
+};
+
+/* Parameters for IPC. */
+struct mlx5_mp_param {
+	enum mlx5_mp_req_type type;
+	int port_id;
+	int result;
+	RTE_STD_C11
+	union {
+		uintptr_t addr; /* MLX5_MP_REQ_CREATE_MR */
+		struct mlx5_mp_arg_queue_state_modify state_modify;
+		/* MLX5_MP_REQ_QUEUE_STATE_MODIFY */
+	} args;
+};
+
+/* Identifier of an MP process. */
+struct mlx5_mp_id {
+	char name[RTE_MP_MAX_NAME_LEN];
+	uint16_t port_id;
+};
+
+/** Request timeout for IPC. */
+#define MLX5_MP_REQ_TIMEOUT_SEC 5
+
+/**
+ * Initialize IPC message.
+ *
+ * @param[in] mp_id
+ *   ID of the MP process (IPC name and port ID).
+ * @param[out] msg
+ *   Pointer to message to fill in.
+ * @param[in] type
+ *   Message type.
+ */
+static inline void
+mp_init_msg(struct mlx5_mp_id *mp_id, struct rte_mp_msg *msg,
+	    enum mlx5_mp_req_type type)
+{
+	struct mlx5_mp_param *param = (struct mlx5_mp_param *)msg->param;
+
+	memset(msg, 0, sizeof(*msg));
+	strlcpy(msg->name, mp_id->name, sizeof(msg->name));
+	msg->len_param = sizeof(*param);
+	param->type = type;
+	param->port_id = mp_id->port_id;
+}
+
+__rte_experimental
+int mlx5_mp_init_primary(const char *name, const rte_mp_t primary_action);
+__rte_experimental
+void mlx5_mp_uninit_primary(const char *name);
+__rte_experimental
+int mlx5_mp_init_secondary(const char *name, const rte_mp_t secondary_action);
+__rte_experimental
+void mlx5_mp_uninit_secondary(const char *name);
+__rte_experimental
+int mlx5_mp_req_mr_create(struct mlx5_mp_id *mp_id, uintptr_t addr);
+__rte_experimental
+int mlx5_mp_req_queue_state_modify(struct mlx5_mp_id *mp_id,
+				   struct mlx5_mp_arg_queue_state_modify *sm);
+__rte_experimental
+int mlx5_mp_req_verbs_cmd_fd(struct mlx5_mp_id *mp_id);
+
+#endif /* RTE_PMD_MLX5_COMMON_MP_H_ */
diff --git a/drivers/common/mlx5/rte_common_mlx5_version.map b/drivers/common/mlx5/rte_common_mlx5_version.map
index aede2a0a51..265703d1c9 100644
--- a/drivers/common/mlx5/rte_common_mlx5_version.map
+++ b/drivers/common/mlx5/rte_common_mlx5_version.map
@@ -48,4 +48,17 @@ DPDK_20.0.1 {
 	mlx5_nl_vlan_vmwa_delete;
 
 	mlx5_translate_port_name;
+
+};
+
+EXPERIMENTAL {
+        global:
+
+	mlx5_mp_init_primary;
+	mlx5_mp_uninit_primary;
+	mlx5_mp_init_secondary;
+	mlx5_mp_uninit_secondary;
+	mlx5_mp_req_mr_create;
+	mlx5_mp_req_queue_state_modify;
+	mlx5_mp_req_verbs_cmd_fd;
 };
diff --git a/drivers/net/mlx5/mlx5.c b/drivers/net/mlx5/mlx5.c
index 293d316413..d87c384422 100644
--- a/drivers/net/mlx5/mlx5.c
+++ b/drivers/net/mlx5/mlx5.c
@@ -38,6 +38,7 @@
 #include <mlx5_glue.h>
 #include <mlx5_devx_cmds.h>
 #include <mlx5_common.h>
+#include <mlx5_common_mp.h>
 
 #include "mlx5_defs.h"
 #include "mlx5.h"
@@ -1722,7 +1723,8 @@ mlx5_init_once(void)
 		rte_rwlock_init(&sd->mem_event_rwlock);
 		rte_mem_event_callback_register("MLX5_MEM_EVENT_CB",
 						mlx5_mr_mem_event_cb, NULL);
-		ret = mlx5_mp_init_primary();
+		ret = mlx5_mp_init_primary(MLX5_MP_NAME,
+					   mlx5_mp_primary_handle);
 		if (ret)
 			goto out;
 		sd->init_done = true;
@@ -1730,7 +1732,8 @@ mlx5_init_once(void)
 	case RTE_PROC_SECONDARY:
 		if (ld->init_done)
 			break;
-		ret = mlx5_mp_init_secondary();
+		ret = mlx5_mp_init_secondary(MLX5_MP_NAME,
+					     mlx5_mp_secondary_handle);
 		if (ret)
 			goto out;
 		++sd->secondary_cnt;
@@ -2205,6 +2208,8 @@ mlx5_dev_spawn(struct rte_device *dpdk_dev,
 	}
 	DRV_LOG(DEBUG, "naming Ethernet device \"%s\"", name);
 	if (rte_eal_process_type() == RTE_PROC_SECONDARY) {
+		struct mlx5_mp_id mp_id;
+
 		eth_dev = rte_eth_dev_attach_secondary(name);
 		if (eth_dev == NULL) {
 			DRV_LOG(ERR, "can not attach rte ethdev");
@@ -2216,8 +2221,10 @@ mlx5_dev_spawn(struct rte_device *dpdk_dev,
 		err = mlx5_proc_priv_init(eth_dev);
 		if (err)
 			return NULL;
+		mp_id.port_id = eth_dev->data->port_id;
+		strlcpy(mp_id.name, MLX5_MP_NAME, RTE_MP_MAX_NAME_LEN);
 		/* Receive command fd from primary process */
-		err = mlx5_mp_req_verbs_cmd_fd(eth_dev);
+		err = mlx5_mp_req_verbs_cmd_fd(&mp_id);
 		if (err < 0)
 			return NULL;
 		/* Remap UAR for Tx queues. */
@@ -2379,6 +2386,8 @@ mlx5_dev_spawn(struct rte_device *dpdk_dev,
 	priv->ibv_port = spawn->ibv_port;
 	priv->pci_dev = spawn->pci_dev;
 	priv->mtu = RTE_ETHER_MTU;
+	priv->mp_id.port_id = port_id;
+	strlcpy(priv->mp_id.name, MLX5_MP_NAME, RTE_MP_MAX_NAME_LEN);
 #ifndef RTE_ARCH_64
 	/* Initialize UAR access locks for 32bit implementations. */
 	rte_spinlock_init(&priv->uar_lock_cq);
diff --git a/drivers/net/mlx5/mlx5.h b/drivers/net/mlx5/mlx5.h
index fccfe47341..e9d5868883 100644
--- a/drivers/net/mlx5/mlx5.h
+++ b/drivers/net/mlx5/mlx5.h
@@ -36,43 +36,13 @@
 #include <mlx5_devx_cmds.h>
 #include <mlx5_prm.h>
 #include <mlx5_nl.h>
+#include <mlx5_common_mp.h>
 
 #include "mlx5_defs.h"
 #include "mlx5_utils.h"
 #include "mlx5_mr.h"
 #include "mlx5_autoconf.h"
 
-/* Request types for IPC. */
-enum mlx5_mp_req_type {
-	MLX5_MP_REQ_VERBS_CMD_FD = 1,
-	MLX5_MP_REQ_CREATE_MR,
-	MLX5_MP_REQ_START_RXTX,
-	MLX5_MP_REQ_STOP_RXTX,
-	MLX5_MP_REQ_QUEUE_STATE_MODIFY,
-};
-
-struct mlx5_mp_arg_queue_state_modify {
-	uint8_t is_wq; /* Set if WQ. */
-	uint16_t queue_id; /* DPDK queue ID. */
-	enum ibv_wq_state state; /* WQ requested state. */
-};
-
-/* Pameters for IPC. */
-struct mlx5_mp_param {
-	enum mlx5_mp_req_type type;
-	int port_id;
-	int result;
-	RTE_STD_C11
-	union {
-		uintptr_t addr; /* MLX5_MP_REQ_CREATE_MR */
-		struct mlx5_mp_arg_queue_state_modify state_modify;
-		/* MLX5_MP_REQ_QUEUE_STATE_MODIFY */
-	} args;
-};
-
-/** Request timeout for IPC. */
-#define MLX5_MP_REQ_TIMEOUT_SEC 5
-
 /** Key string for IPC. */
 #define MLX5_MP_NAME "net_mlx5_mp"
 
@@ -583,6 +553,7 @@ struct mlx5_priv {
 #endif
 	uint8_t skip_default_rss_reta; /* Skip configuration of default reta. */
 	uint8_t fdb_def_rule; /* Whether fdb jump to table 1 is configured. */
+	struct mlx5_mp_id mp_id; /* ID used for multi-process IPC. */
 };
 
 #define PORT_ID(priv) ((priv)->dev_data->port_id)
@@ -783,16 +754,10 @@ int mlx5_flow_dev_dump(struct rte_eth_dev *dev, FILE *file,
 		       struct rte_flow_error *error);
 
 /* mlx5_mp.c */
+int mlx5_mp_primary_handle(const struct rte_mp_msg *mp_msg, const void *peer);
+int mlx5_mp_secondary_handle(const struct rte_mp_msg *mp_msg, const void *peer);
 void mlx5_mp_req_start_rxtx(struct rte_eth_dev *dev);
 void mlx5_mp_req_stop_rxtx(struct rte_eth_dev *dev);
-int mlx5_mp_req_mr_create(struct rte_eth_dev *dev, uintptr_t addr);
-int mlx5_mp_req_verbs_cmd_fd(struct rte_eth_dev *dev);
-int mlx5_mp_req_queue_state_modify(struct rte_eth_dev *dev,
-				   struct mlx5_mp_arg_queue_state_modify *sm);
-int mlx5_mp_init_primary(void);
-void mlx5_mp_uninit_primary(void);
-int mlx5_mp_init_secondary(void);
-void mlx5_mp_uninit_secondary(void);
 
 /* mlx5_socket.c */
 
diff --git a/drivers/net/mlx5/mlx5_mp.c b/drivers/net/mlx5/mlx5_mp.c
index 55d408fe95..43684dbc3a 100644
--- a/drivers/net/mlx5/mlx5_mp.c
+++ b/drivers/net/mlx5/mlx5_mp.c
@@ -10,46 +10,14 @@
 #include <rte_ethdev_driver.h>
 #include <rte_string_fns.h>
 
+#include <mlx5_common_mp.h>
+
 #include "mlx5.h"
 #include "mlx5_rxtx.h"
 #include "mlx5_utils.h"
 
-/**
- * Initialize IPC message.
- *
- * @param[in] dev
- *   Pointer to Ethernet structure.
- * @param[out] msg
- *   Pointer to message to fill in.
- * @param[in] type
- *   Message type.
- */
-static inline void
-mp_init_msg(struct rte_eth_dev *dev, struct rte_mp_msg *msg,
-	    enum mlx5_mp_req_type type)
-{
-	struct mlx5_mp_param *param = (struct mlx5_mp_param *)msg->param;
-
-	memset(msg, 0, sizeof(*msg));
-	strlcpy(msg->name, MLX5_MP_NAME, sizeof(msg->name));
-	msg->len_param = sizeof(*param);
-	param->type = type;
-	param->port_id = dev->data->port_id;
-}
-
-/**
- * IPC message handler of primary process.
- *
- * @param[in] dev
- *   Pointer to Ethernet structure.
- * @param[in] peer
- *   Pointer to the peer socket path.
- *
- * @return
- *   0 on success, a negative errno value otherwise and rte_errno is set.
- */
-static int
-mp_primary_handle(const struct rte_mp_msg *mp_msg, const void *peer)
+int
+mlx5_mp_primary_handle(const struct rte_mp_msg *mp_msg, const void *peer)
 {
 	struct rte_mp_msg mp_res;
 	struct mlx5_mp_param *res = (struct mlx5_mp_param *)mp_res.param;
@@ -71,21 +39,21 @@ mp_primary_handle(const struct rte_mp_msg *mp_msg, const void *peer)
 	priv = dev->data->dev_private;
 	switch (param->type) {
 	case MLX5_MP_REQ_CREATE_MR:
-		mp_init_msg(dev, &mp_res, param->type);
+		mp_init_msg(&priv->mp_id, &mp_res, param->type);
 		lkey = mlx5_mr_create_primary(dev, &entry, param->args.addr);
 		if (lkey == UINT32_MAX)
 			res->result = -rte_errno;
 		ret = rte_mp_reply(&mp_res, peer);
 		break;
 	case MLX5_MP_REQ_VERBS_CMD_FD:
-		mp_init_msg(dev, &mp_res, param->type);
+		mp_init_msg(&priv->mp_id, &mp_res, param->type);
 		mp_res.num_fds = 1;
 		mp_res.fds[0] = priv->sh->ctx->cmd_fd;
 		res->result = 0;
 		ret = rte_mp_reply(&mp_res, peer);
 		break;
 	case MLX5_MP_REQ_QUEUE_STATE_MODIFY:
-		mp_init_msg(dev, &mp_res, param->type);
+		mp_init_msg(&priv->mp_id, &mp_res, param->type);
 		res->result = mlx5_queue_state_modify_primary
 					(dev, &param->args.state_modify);
 		ret = rte_mp_reply(&mp_res, peer);
@@ -110,14 +78,15 @@ mp_primary_handle(const struct rte_mp_msg *mp_msg, const void *peer)
  * @return
  *   0 on success, a negative errno value otherwise and rte_errno is set.
  */
-static int
-mp_secondary_handle(const struct rte_mp_msg *mp_msg, const void *peer)
+int
+mlx5_mp_secondary_handle(const struct rte_mp_msg *mp_msg, const void *peer)
 {
 	struct rte_mp_msg mp_res;
 	struct mlx5_mp_param *res = (struct mlx5_mp_param *)mp_res.param;
 	const struct mlx5_mp_param *param =
 		(const struct mlx5_mp_param *)mp_msg->param;
 	struct rte_eth_dev *dev;
+	struct mlx5_priv *priv;
 	int ret;
 
 	MLX5_ASSERT(rte_eal_process_type() == RTE_PROC_SECONDARY);
@@ -127,13 +96,14 @@ mp_secondary_handle(const struct rte_mp_msg *mp_msg, const void *peer)
 		return -rte_errno;
 	}
 	dev = &rte_eth_devices[param->port_id];
+	priv = dev->data->dev_private;
 	switch (param->type) {
 	case MLX5_MP_REQ_START_RXTX:
 		DRV_LOG(INFO, "port %u starting datapath", dev->data->port_id);
 		rte_mb();
 		dev->rx_pkt_burst = mlx5_select_rx_function(dev);
 		dev->tx_pkt_burst = mlx5_select_tx_function(dev);
-		mp_init_msg(dev, &mp_res, param->type);
+		mp_init_msg(&priv->mp_id, &mp_res, param->type);
 		res->result = 0;
 		ret = rte_mp_reply(&mp_res, peer);
 		break;
@@ -142,7 +112,7 @@ mp_secondary_handle(const struct rte_mp_msg *mp_msg, const void *peer)
 		dev->rx_pkt_burst = removed_rx_burst;
 		dev->tx_pkt_burst = removed_tx_burst;
 		rte_mb();
-		mp_init_msg(dev, &mp_res, param->type);
+		mp_init_msg(&priv->mp_id, &mp_res, param->type);
 		res->result = 0;
 		ret = rte_mp_reply(&mp_res, peer);
 		break;
@@ -171,6 +141,7 @@ mp_req_on_rxtx(struct rte_eth_dev *dev, enum mlx5_mp_req_type type)
 	struct rte_mp_reply mp_rep;
 	struct mlx5_mp_param *res;
 	struct timespec ts = {.tv_sec = MLX5_MP_REQ_TIMEOUT_SEC, .tv_nsec = 0};
+	struct mlx5_priv *priv = dev->data->dev_private;
 	int ret;
 	int i;
 
@@ -182,7 +153,7 @@ mp_req_on_rxtx(struct rte_eth_dev *dev, enum mlx5_mp_req_type type)
 			dev->data->port_id, type);
 		return;
 	}
-	mp_init_msg(dev, &mp_req, type);
+	mp_init_msg(&priv->mp_id, &mp_req, type);
 	ret = rte_mp_request_sync(&mp_req, &mp_rep, &ts);
 	if (ret) {
 		if (rte_errno != ENOTSUP)
@@ -234,178 +205,3 @@ mlx5_mp_req_stop_rxtx(struct rte_eth_dev *dev)
 {
 	mp_req_on_rxtx(dev, MLX5_MP_REQ_STOP_RXTX);
 }
-
-/**
- * Request Memory Region creation to the primary process.
- *
- * @param[in] dev
- *   Pointer to Ethernet structure.
- * @param addr
- *   Target virtual address to register.
- *
- * @return
- *   0 on success, a negative errno value otherwise and rte_errno is set.
- */
-int
-mlx5_mp_req_mr_create(struct rte_eth_dev *dev, uintptr_t addr)
-{
-	struct rte_mp_msg mp_req;
-	struct rte_mp_msg *mp_res;
-	struct rte_mp_reply mp_rep;
-	struct mlx5_mp_param *req = (struct mlx5_mp_param *)mp_req.param;
-	struct mlx5_mp_param *res;
-	struct timespec ts = {.tv_sec = MLX5_MP_REQ_TIMEOUT_SEC, .tv_nsec = 0};
-	int ret;
-
-	MLX5_ASSERT(rte_eal_process_type() == RTE_PROC_SECONDARY);
-	mp_init_msg(dev, &mp_req, MLX5_MP_REQ_CREATE_MR);
-	req->args.addr = addr;
-	ret = rte_mp_request_sync(&mp_req, &mp_rep, &ts);
-	if (ret) {
-		DRV_LOG(ERR, "port %u request to primary process failed",
-			dev->data->port_id);
-		return -rte_errno;
-	}
-	MLX5_ASSERT(mp_rep.nb_received == 1);
-	mp_res = &mp_rep.msgs[0];
-	res = (struct mlx5_mp_param *)mp_res->param;
-	ret = res->result;
-	if (ret)
-		rte_errno = -ret;
-	free(mp_rep.msgs);
-	return ret;
-}
-
-/**
- * Request Verbs queue state modification to the primary process.
- *
- * @param[in] dev
- *   Pointer to Ethernet structure.
- * @param sm
- *   State modify parameters.
- *
- * @return
- *   0 on success, a negative errno value otherwise and rte_errno is set.
- */
-int
-mlx5_mp_req_queue_state_modify(struct rte_eth_dev *dev,
-			       struct mlx5_mp_arg_queue_state_modify *sm)
-{
-	struct rte_mp_msg mp_req;
-	struct rte_mp_msg *mp_res;
-	struct rte_mp_reply mp_rep;
-	struct mlx5_mp_param *req = (struct mlx5_mp_param *)mp_req.param;
-	struct mlx5_mp_param *res;
-	struct timespec ts = {.tv_sec = MLX5_MP_REQ_TIMEOUT_SEC, .tv_nsec = 0};
-	int ret;
-
-	MLX5_ASSERT(rte_eal_process_type() == RTE_PROC_SECONDARY);
-	mp_init_msg(dev, &mp_req, MLX5_MP_REQ_QUEUE_STATE_MODIFY);
-	req->args.state_modify = *sm;
-	ret = rte_mp_request_sync(&mp_req, &mp_rep, &ts);
-	if (ret) {
-		DRV_LOG(ERR, "port %u request to primary process failed",
-			dev->data->port_id);
-		return -rte_errno;
-	}
-	MLX5_ASSERT(mp_rep.nb_received == 1);
-	mp_res = &mp_rep.msgs[0];
-	res = (struct mlx5_mp_param *)mp_res->param;
-	ret = res->result;
-	free(mp_rep.msgs);
-	return ret;
-}
-
-/**
- * Request Verbs command file descriptor for mmap to the primary process.
- *
- * @param[in] dev
- *   Pointer to Ethernet structure.
- *
- * @return
- *   fd on success, a negative errno value otherwise and rte_errno is set.
- */
-int
-mlx5_mp_req_verbs_cmd_fd(struct rte_eth_dev *dev)
-{
-	struct rte_mp_msg mp_req;
-	struct rte_mp_msg *mp_res;
-	struct rte_mp_reply mp_rep;
-	struct mlx5_mp_param *res;
-	struct timespec ts = {.tv_sec = MLX5_MP_REQ_TIMEOUT_SEC, .tv_nsec = 0};
-	int ret;
-
-	MLX5_ASSERT(rte_eal_process_type() == RTE_PROC_SECONDARY);
-	mp_init_msg(dev, &mp_req, MLX5_MP_REQ_VERBS_CMD_FD);
-	ret = rte_mp_request_sync(&mp_req, &mp_rep, &ts);
-	if (ret) {
-		DRV_LOG(ERR, "port %u request to primary process failed",
-			dev->data->port_id);
-		return -rte_errno;
-	}
-	MLX5_ASSERT(mp_rep.nb_received == 1);
-	mp_res = &mp_rep.msgs[0];
-	res = (struct mlx5_mp_param *)mp_res->param;
-	if (res->result) {
-		rte_errno = -res->result;
-		DRV_LOG(ERR,
-			"port %u failed to get command FD from primary process",
-			dev->data->port_id);
-		ret = -rte_errno;
-		goto exit;
-	}
-	MLX5_ASSERT(mp_res->num_fds == 1);
-	ret = mp_res->fds[0];
-	DRV_LOG(DEBUG, "port %u command FD from primary is %d",
-		dev->data->port_id, ret);
-exit:
-	free(mp_rep.msgs);
-	return ret;
-}
-
-/**
- * Initialize by primary process.
- */
-int
-mlx5_mp_init_primary(void)
-{
-	int ret;
-
-	MLX5_ASSERT(rte_eal_process_type() == RTE_PROC_PRIMARY);
-
-	/* primary is allowed to not support IPC */
-	ret = rte_mp_action_register(MLX5_MP_NAME, mp_primary_handle);
-	if (ret && rte_errno != ENOTSUP)
-		return -1;
-	return 0;
-}
-
-/**
- * Un-initialize by primary process.
- */
-void
-mlx5_mp_uninit_primary(void)
-{
-	MLX5_ASSERT(rte_eal_process_type() == RTE_PROC_PRIMARY);
-	rte_mp_action_unregister(MLX5_MP_NAME);
-}
-
-/**
- * Initialize by secondary process.
- */
-int
-mlx5_mp_init_secondary(void)
-{
-	MLX5_ASSERT(rte_eal_process_type() == RTE_PROC_SECONDARY);
-	return rte_mp_action_register(MLX5_MP_NAME, mp_secondary_handle);
-}
-
-/**
- * Un-initialize by secondary process.
- */
-void
-mlx5_mp_uninit_secondary(void)
-{
-	MLX5_ASSERT(rte_eal_process_type() == RTE_PROC_SECONDARY);
-	rte_mp_action_unregister(MLX5_MP_NAME);
-}
diff --git a/drivers/net/mlx5/mlx5_mr.c b/drivers/net/mlx5/mlx5_mr.c
index a8f185a208..9151992a72 100644
--- a/drivers/net/mlx5/mlx5_mr.c
+++ b/drivers/net/mlx5/mlx5_mr.c
@@ -540,7 +540,7 @@ mlx5_mr_create_secondary(struct rte_eth_dev *dev, struct mlx5_mr_cache *entry,
 
 	DEBUG("port %u requesting MR creation for address (%p)",
 	      dev->data->port_id, (void *)addr);
-	ret = mlx5_mp_req_mr_create(dev, addr);
+	ret = mlx5_mp_req_mr_create(&priv->mp_id, addr);
 	if (ret) {
 		DEBUG("port %u fail to request MR creation for address (%p)",
 		      dev->data->port_id, (void *)addr);
diff --git a/drivers/net/mlx5/mlx5_rxtx.c b/drivers/net/mlx5/mlx5_rxtx.c
index 7ce3732fd3..42d7da8a4b 100644
--- a/drivers/net/mlx5/mlx5_rxtx.c
+++ b/drivers/net/mlx5/mlx5_rxtx.c
@@ -1000,6 +1000,7 @@ static int
 mlx5_queue_state_modify(struct rte_eth_dev *dev,
 			struct mlx5_mp_arg_queue_state_modify *sm)
 {
+	struct mlx5_priv *priv = dev->data->dev_private;
 	int ret = 0;
 
 	switch (rte_eal_process_type()) {
@@ -1007,7 +1008,7 @@ mlx5_queue_state_modify(struct rte_eth_dev *dev,
 		ret = mlx5_queue_state_modify_primary(dev, sm);
 		break;
 	case RTE_PROC_SECONDARY:
-		ret = mlx5_mp_req_queue_state_modify(dev, sm);
+		ret = mlx5_mp_req_queue_state_modify(&priv->mp_id, sm);
 		break;
 	default:
 		break;
-- 
2.16.6


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [dpdk-dev] [PATCH v4 2/2] common/mlx5: refactor memory management codes
  2020-04-13 21:17 ` [dpdk-dev] [PATCH v4 0/2] refactor multi-process IPC and memory management codes to common driver Vu Pham
  2020-04-13 21:17   ` [dpdk-dev] [PATCH v4 1/2] common/mlx5: refactor multi-process IPC handling " Vu Pham
@ 2020-04-13 21:17   ` Vu Pham
  2020-04-14  7:27     ` Slava Ovsiienko
  2020-04-15  9:30   ` [dpdk-dev] [PATCH v4 0/2] refactor multi-process IPC and memory management codes to common driver Raslan Darawsheh
  2 siblings, 1 reply; 26+ messages in thread
From: Vu Pham @ 2020-04-13 21:17 UTC (permalink / raw)
  To: dev; +Cc: viacheslavo, orika, matan, rasland, Vu Pham

Refactor the common memory B-tree and cache management into the common
driver. Replace some input parameters of the MR APIs with more common
data structures such as PD, port_id, and share_cache, so that multiple
PMDs can use those MR APIs.

Modify the mlx5 net PMD to use the MR management APIs from the common
driver.

Signed-off-by: Vu Pham <vuhuong@mellanox.com>
---
 drivers/common/mlx5/Makefile                    |    1 +
 drivers/common/mlx5/meson.build                 |    1 +
 drivers/common/mlx5/mlx5_common_mr.c            | 1108 +++++++++++++++++++++
 drivers/common/mlx5/mlx5_common_mr.h            |  160 ++++
 drivers/common/mlx5/rte_common_mlx5_version.map |   14 +
 drivers/net/mlx5/mlx5.c                         |    4 +-
 drivers/net/mlx5/mlx5.h                         |   12 +-
 drivers/net/mlx5/mlx5_mp.c                      |    8 +-
 drivers/net/mlx5/mlx5_mr.c                      | 1169 +----------------------
 drivers/net/mlx5/mlx5_mr.h                      |   87 +-
 drivers/net/mlx5/mlx5_rxtx.c                    |    1 +
 drivers/net/mlx5/mlx5_rxtx.h                    |   10 +-
 drivers/net/mlx5/mlx5_rxtx_vec.h                |    2 +
 drivers/net/mlx5/mlx5_trigger.c                 |    1 +
 drivers/net/mlx5/mlx5_txq.c                     |    3 +-
 15 files changed, 1357 insertions(+), 1224 deletions(-)
 create mode 100644 drivers/common/mlx5/mlx5_common_mr.c
 create mode 100644 drivers/common/mlx5/mlx5_common_mr.h
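
Note: as a usage illustration, the MR APIs now take a PD and a shared
cache instead of an rte_eth_dev, so any PMD sharing the HCA can call
them. A minimal primary-process sketch; get_lkey() is a hypothetical
helper, and the locking follows mlx5_mr_create_secondary() in this
patch:

	#include <stdint.h>
	#include <infiniband/verbs.h>
	#include <rte_rwlock.h>
	#include <mlx5_common_mr.h>

	static uint32_t
	get_lkey(struct ibv_pd *pd, struct mlx5_mr_share_cache *share_cache,
		 uintptr_t addr)
	{
		struct mr_cache_entry entry;
		uint32_t lkey;

		/* Global cache lookup: B-tree first, MR list on overflow. */
		rte_rwlock_read_lock(&share_cache->rwlock);
		lkey = mlx5_mr_lookup_cache(share_cache, &entry, addr);
		rte_rwlock_read_unlock(&share_cache->rwlock);
		if (lkey != UINT32_MAX)
			return lkey;
		/* Miss: register the contiguous chunk around the address. */
		return mlx5_mr_create_primary(pd, share_cache, &entry, addr,
					      1 /* mr_ext_memseg_en */);
	}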

diff --git a/drivers/common/mlx5/Makefile b/drivers/common/mlx5/Makefile
index 2a88492731..26267c957a 100644
--- a/drivers/common/mlx5/Makefile
+++ b/drivers/common/mlx5/Makefile
@@ -18,6 +18,7 @@ SRCS-y += mlx5_devx_cmds.c
 SRCS-y += mlx5_common.c
 SRCS-y += mlx5_nl.c
 SRCS-y += mlx5_common_mp.c
+SRCS-y += mlx5_common_mr.c
 ifeq ($(CONFIG_RTE_IBVERBS_LINK_DLOPEN),y)
 INSTALL-y-lib += $(LIB_GLUE)
 endif
diff --git a/drivers/common/mlx5/meson.build b/drivers/common/mlx5/meson.build
index 83671861c9..175251b691 100644
--- a/drivers/common/mlx5/meson.build
+++ b/drivers/common/mlx5/meson.build
@@ -56,6 +56,7 @@ sources = files(
 	'mlx5_common.c',
 	'mlx5_nl.c',
 	'mlx5_common_mp.c',
+	'mlx5_common_mr.c',
 )
 if not dlopen_ibverbs
 	sources += files('mlx5_glue.c')
diff --git a/drivers/common/mlx5/mlx5_common_mr.c b/drivers/common/mlx5/mlx5_common_mr.c
new file mode 100644
index 0000000000..9d4a06dd5b
--- /dev/null
+++ b/drivers/common/mlx5/mlx5_common_mr.c
@@ -0,0 +1,1108 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright 2016 6WIND S.A.
+ * Copyright 2020 Mellanox Technologies, Ltd
+ */
+#include <rte_eal_memconfig.h>
+#include <rte_errno.h>
+#include <rte_mempool.h>
+#include <rte_malloc.h>
+#include <rte_rwlock.h>
+
+#include "mlx5_glue.h"
+#include "mlx5_common_mp.h"
+#include "mlx5_common_mr.h"
+#include "mlx5_common_utils.h"
+
+struct mr_find_contig_memsegs_data {
+	uintptr_t addr;
+	uintptr_t start;
+	uintptr_t end;
+	const struct rte_memseg_list *msl;
+};
+
+/**
+ * Expand B-tree table to a given size. Can't be called while holding
+ * memory_hotplug_lock or share_cache.rwlock due to rte_realloc().
+ *
+ * @param bt
+ *   Pointer to B-tree structure.
+ * @param n
+ *   Number of entries for expansion.
+ *
+ * @return
+ *   0 on success, -1 on failure.
+ */
+static int
+mr_btree_expand(struct mlx5_mr_btree *bt, int n)
+{
+	void *mem;
+	int ret = 0;
+
+	if (n <= bt->size)
+		return ret;
+	/*
+	 * Downside of directly using rte_realloc() is that SOCKET_ID_ANY is
+	 * used inside if there's no room to expand. Because this is quite a
+	 * rare case and part of a very slow path, it is acceptable.
+	 * Initially cache_bh[] will be given practically enough space and once
+	 * it is expanded, expansion won't be needed again.
+	 */
+	mem = rte_realloc(bt->table, n * sizeof(struct mr_cache_entry), 0);
+	if (mem == NULL) {
+		/* Not an error, B-tree search will be skipped. */
+		DRV_LOG(WARNING, "failed to expand MR B-tree (%p) table",
+			(void *)bt);
+		ret = -1;
+	} else {
+		DRV_LOG(DEBUG, "expanded MR B-tree table (size=%u)", n);
+		bt->table = mem;
+		bt->size = n;
+	}
+	return ret;
+}
+
+/**
+ * Look up LKey from given B-tree lookup table, store the last index and return
+ * searched LKey.
+ *
+ * @param bt
+ *   Pointer to B-tree structure.
+ * @param[out] idx
+ *   Pointer to index. Even on search failure, returns index where it stops
+ *   searching so that index can be used when inserting a new entry.
+ * @param addr
+ *   Search key.
+ *
+ * @return
+ *   Searched LKey on success, UINT32_MAX on no match.
+ */
+static uint32_t
+mr_btree_lookup(struct mlx5_mr_btree *bt, uint16_t *idx, uintptr_t addr)
+{
+	struct mr_cache_entry *lkp_tbl;
+	uint16_t n;
+	uint16_t base = 0;
+
+	MLX5_ASSERT(bt != NULL);
+	lkp_tbl = *bt->table;
+	n = bt->len;
+	/* First entry must be NULL for comparison. */
+	MLX5_ASSERT(bt->len > 0 || (lkp_tbl[0].start == 0 &&
+				    lkp_tbl[0].lkey == UINT32_MAX));
+	/* Binary search. */
+	do {
+		register uint16_t delta = n >> 1;
+
+		if (addr < lkp_tbl[base + delta].start) {
+			n = delta;
+		} else {
+			base += delta;
+			n -= delta;
+		}
+	} while (n > 1);
+	MLX5_ASSERT(addr >= lkp_tbl[base].start);
+	*idx = base;
+	if (addr < lkp_tbl[base].end)
+		return lkp_tbl[base].lkey;
+	/* Not found. */
+	return UINT32_MAX;
+}
+
+/**
+ * Insert an entry to B-tree lookup table.
+ *
+ * @param bt
+ *   Pointer to B-tree structure.
+ * @param entry
+ *   Pointer to new entry to insert.
+ *
+ * @return
+ *   0 on success, -1 on failure.
+ */
+static int
+mr_btree_insert(struct mlx5_mr_btree *bt, struct mr_cache_entry *entry)
+{
+	struct mr_cache_entry *lkp_tbl;
+	uint16_t idx = 0;
+	size_t shift;
+
+	MLX5_ASSERT(bt != NULL);
+	MLX5_ASSERT(bt->len <= bt->size);
+	MLX5_ASSERT(bt->len > 0);
+	lkp_tbl = *bt->table;
+	/* Find out the slot for insertion. */
+	if (mr_btree_lookup(bt, &idx, entry->start) != UINT32_MAX) {
+		DRV_LOG(DEBUG,
+			"abort insertion to B-tree(%p): already exist at"
+			" idx=%u [0x%" PRIxPTR ", 0x%" PRIxPTR ") lkey=0x%x",
+			(void *)bt, idx, entry->start, entry->end, entry->lkey);
+		/* Already exist, return. */
+		return 0;
+	}
+	/* If table is full, return error. */
+	if (unlikely(bt->len == bt->size)) {
+		bt->overflow = 1;
+		return -1;
+	}
+	/* Insert entry. */
+	++idx;
+	shift = (bt->len - idx) * sizeof(struct mr_cache_entry);
+	if (shift)
+		memmove(&lkp_tbl[idx + 1], &lkp_tbl[idx], shift);
+	lkp_tbl[idx] = *entry;
+	bt->len++;
+	DRV_LOG(DEBUG,
+		"inserted B-tree(%p)[%u],"
+		" [0x%" PRIxPTR ", 0x%" PRIxPTR ") lkey=0x%x",
+		(void *)bt, idx, entry->start, entry->end, entry->lkey);
+	return 0;
+}
+
+/**
+ * Initialize B-tree and allocate memory for lookup table.
+ *
+ * @param bt
+ *   Pointer to B-tree structure.
+ * @param n
+ *   Number of entries to allocate.
+ * @param socket
+ *   NUMA socket on which memory must be allocated.
+ *
+ * @return
+ *   0 on success, a negative errno value otherwise and rte_errno is set.
+ */
+int
+mlx5_mr_btree_init(struct mlx5_mr_btree *bt, int n, int socket)
+{
+	if (bt == NULL) {
+		rte_errno = EINVAL;
+		return -rte_errno;
+	}
+	MLX5_ASSERT(!bt->table && !bt->size);
+	memset(bt, 0, sizeof(*bt));
+	bt->table = rte_calloc_socket("B-tree table",
+				      n, sizeof(struct mr_cache_entry),
+				      0, socket);
+	if (bt->table == NULL) {
+		rte_errno = ENOMEM;
+		DEBUG("failed to allocate memory for btree cache on socket %d",
+		      socket);
+		return -rte_errno;
+	}
+	bt->size = n;
+	/* First entry must be NULL for binary search. */
+	(*bt->table)[bt->len++] = (struct mr_cache_entry) {
+		.lkey = UINT32_MAX,
+	};
+	DEBUG("initialized B-tree %p with table %p",
+	      (void *)bt, (void *)bt->table);
+	return 0;
+}
+
+/**
+ * Free B-tree resources.
+ *
+ * @param bt
+ *   Pointer to B-tree structure.
+ */
+void
+mlx5_mr_btree_free(struct mlx5_mr_btree *bt)
+{
+	if (bt == NULL)
+		return;
+	DEBUG("freeing B-tree %p with table %p",
+	      (void *)bt, (void *)bt->table);
+	rte_free(bt->table);
+	memset(bt, 0, sizeof(*bt));
+}
+
+/**
+ * Dump all the entries in a B-tree
+ *
+ * @param bt
+ *   Pointer to B-tree structure.
+ */
+void
+mlx5_mr_btree_dump(struct mlx5_mr_btree *bt __rte_unused)
+{
+#ifdef RTE_LIBRTE_MLX5_DEBUG
+	int idx;
+	struct mr_cache_entry *lkp_tbl;
+
+	if (bt == NULL)
+		return;
+	lkp_tbl = *bt->table;
+	for (idx = 0; idx < bt->len; ++idx) {
+		struct mr_cache_entry *entry = &lkp_tbl[idx];
+
+		DEBUG("B-tree(%p)[%u],"
+		      " [0x%" PRIxPTR ", 0x%" PRIxPTR ") lkey=0x%x",
+		      (void *)bt, idx, entry->start, entry->end, entry->lkey);
+	}
+#endif
+}
+
+/**
+ * Find virtually contiguous memory chunk in a given MR.
+ *
+ * @param mr
+ *   Pointer to MR structure.
+ * @param[out] entry
+ *   Pointer to returning MR cache entry. If not found, this will not be
+ *   updated.
+ * @param base_idx
+ *   Start index of the memseg bitmap.
+ *
+ * @return
+ *   Next index to go on lookup.
+ */
+static int
+mr_find_next_chunk(struct mlx5_mr *mr, struct mr_cache_entry *entry,
+		   int base_idx)
+{
+	uintptr_t start = 0;
+	uintptr_t end = 0;
+	uint32_t idx = 0;
+
+	/* MR for external memory doesn't have memseg list. */
+	if (mr->msl == NULL) {
+		struct ibv_mr *ibv_mr = mr->ibv_mr;
+
+		MLX5_ASSERT(mr->ms_bmp_n == 1);
+		MLX5_ASSERT(mr->ms_n == 1);
+		MLX5_ASSERT(base_idx == 0);
+		/*
+		 * Can't search it from memseg list but get it directly from
+		 * verbs MR as there's only one chunk.
+		 */
+		entry->start = (uintptr_t)ibv_mr->addr;
+		entry->end = (uintptr_t)ibv_mr->addr + mr->ibv_mr->length;
+		entry->lkey = rte_cpu_to_be_32(mr->ibv_mr->lkey);
+		/* Returning 1 ends iteration. */
+		return 1;
+	}
+	for (idx = base_idx; idx < mr->ms_bmp_n; ++idx) {
+		if (rte_bitmap_get(mr->ms_bmp, idx)) {
+			const struct rte_memseg_list *msl;
+			const struct rte_memseg *ms;
+
+			msl = mr->msl;
+			ms = rte_fbarray_get(&msl->memseg_arr,
+					     mr->ms_base_idx + idx);
+			MLX5_ASSERT(msl->page_sz == ms->hugepage_sz);
+			if (!start)
+				start = ms->addr_64;
+			end = ms->addr_64 + ms->hugepage_sz;
+		} else if (start) {
+			/* Passed the end of a fragment. */
+			break;
+		}
+	}
+	if (start) {
+		/* Found one chunk. */
+		entry->start = start;
+		entry->end = end;
+		entry->lkey = rte_cpu_to_be_32(mr->ibv_mr->lkey);
+	}
+	return idx;
+}
+
+/**
+ * Insert an MR into the global B-tree cache. It may fail due to low-on-memory.
+ * Then, this entry will have to be searched by mr_lookup_list() in
+ * mlx5_mr_create() on miss.
+ *
+ * @param share_cache
+ *   Pointer to a global shared MR cache.
+ * @param mr
+ *   Pointer to MR to insert.
+ *
+ * @return
+ *   0 on success, -1 on failure.
+ */
+int
+mlx5_mr_insert_cache(struct mlx5_mr_share_cache *share_cache,
+		     struct mlx5_mr *mr)
+{
+	unsigned int n;
+
+	DRV_LOG(DEBUG, "Inserting MR(%p) to global cache(%p)",
+		(void *)mr, (void *)share_cache);
+	for (n = 0; n < mr->ms_bmp_n; ) {
+		struct mr_cache_entry entry;
+
+		memset(&entry, 0, sizeof(entry));
+		/* Find a contiguous chunk and advance the index. */
+		n = mr_find_next_chunk(mr, &entry, n);
+		if (!entry.end)
+			break;
+		if (mr_btree_insert(&share_cache->cache, &entry) < 0) {
+			/*
+			 * Overflowed, but the global table cannot be expanded
+			 * because of deadlock.
+			 */
+			return -1;
+		}
+	}
+	return 0;
+}
+
+/**
+ * Look up address in the original global MR list.
+ *
+ * @param share_cache
+ *   Pointer to a global shared MR cache.
+ * @param[out] entry
+ *   Pointer to returning MR cache entry. If no match, this will not be updated.
+ * @param addr
+ *   Search key.
+ *
+ * @return
+ *   Found MR on match, NULL otherwise.
+ */
+struct mlx5_mr *
+mlx5_mr_lookup_list(struct mlx5_mr_share_cache *share_cache,
+		    struct mr_cache_entry *entry, uintptr_t addr)
+{
+	struct mlx5_mr *mr;
+
+	/* Iterate all the existing MRs. */
+	LIST_FOREACH(mr, &share_cache->mr_list, mr) {
+		unsigned int n;
+
+		if (mr->ms_n == 0)
+			continue;
+		for (n = 0; n < mr->ms_bmp_n; ) {
+			struct mr_cache_entry ret;
+
+			memset(&ret, 0, sizeof(ret));
+			n = mr_find_next_chunk(mr, &ret, n);
+			if (addr >= ret.start && addr < ret.end) {
+				/* Found. */
+				*entry = ret;
+				return mr;
+			}
+		}
+	}
+	return NULL;
+}
+
+/**
+ * Look up address on global MR cache.
+ *
+ * @param share_cache
+ *   Pointer to a global shared MR cache.
+ * @param[out] entry
+ *   Pointer to returning MR cache entry. If no match, this will not be updated.
+ * @param addr
+ *   Search key.
+ *
+ * @return
+ *   Searched LKey on success, UINT32_MAX on failure and rte_errno is set.
+ */
+uint32_t
+mlx5_mr_lookup_cache(struct mlx5_mr_share_cache *share_cache,
+		     struct mr_cache_entry *entry, uintptr_t addr)
+{
+	uint16_t idx;
+	uint32_t lkey = UINT32_MAX;
+	struct mlx5_mr *mr;
+
+	/*
+	 * If the global cache has overflowed since it failed to expand the
+	 * B-tree table, it can't have all the existing MRs. Then, the address
+	 * has to be searched by traversing the original MR list instead, which
+	 * is a very slow path. Otherwise, the global cache is all inclusive.
+	 */
+	if (!unlikely(share_cache->cache.overflow)) {
+		lkey = mr_btree_lookup(&share_cache->cache, &idx, addr);
+		if (lkey != UINT32_MAX)
+			*entry = (*share_cache->cache.table)[idx];
+	} else {
+		/* Falling back to the slowest path. */
+		mr = mlx5_mr_lookup_list(share_cache, entry, addr);
+		if (mr != NULL)
+			lkey = entry->lkey;
+	}
+	MLX5_ASSERT(lkey == UINT32_MAX || (addr >= entry->start &&
+					   addr < entry->end));
+	return lkey;
+}
+
+/**
+ * Free MR resources. MR lock must not be held to avoid a deadlock. rte_free()
+ * can raise a memory free event and the callback function will spin on the
+ * lock.
+ *
+ * @param mr
+ *   Pointer to MR to free.
+ */
+static void
+mr_free(struct mlx5_mr *mr)
+{
+	if (mr == NULL)
+		return;
+	DRV_LOG(DEBUG, "freeing MR(%p):", (void *)mr);
+	if (mr->ibv_mr != NULL)
+		claim_zero(mlx5_glue->dereg_mr(mr->ibv_mr));
+	if (mr->ms_bmp != NULL)
+		rte_bitmap_free(mr->ms_bmp);
+	rte_free(mr);
+}
+
+void
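+/**
+ * Rebuild the global B-tree cache from the original MR list.
+ *
+ * @param share_cache
+ *   Pointer to a global shared MR cache.
+ */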
+mlx5_mr_rebuild_cache(struct mlx5_mr_share_cache *share_cache)
+{
+	struct mlx5_mr *mr;
+
+	DRV_LOG(DEBUG, "Rebuild dev cache[] %p", (void *)share_cache);
+	/* Flush cache to rebuild. */
+	share_cache->cache.len = 1;
+	share_cache->cache.overflow = 0;
+	/* Iterate all the existing MRs. */
+	LIST_FOREACH(mr, &share_cache->mr_list, mr)
+		if (mlx5_mr_insert_cache(share_cache, mr) < 0)
+			return;
+}
+
+/**
+ * Release resources of detached MR having no online entry.
+ *
+ * @param share_cache
+ *   Pointer to a global shared MR cache.
+ */
+static void
+mlx5_mr_garbage_collect(struct mlx5_mr_share_cache *share_cache)
+{
+	struct mlx5_mr *mr_next;
+	struct mlx5_mr_list free_list = LIST_HEAD_INITIALIZER(free_list);
+
+	/* Must be called from the primary process. */
+	MLX5_ASSERT(rte_eal_process_type() == RTE_PROC_PRIMARY);
+	/*
+	 * MR can't be freed while holding the lock because rte_free() could
+	 * call the memory free callback function. This would be a deadlock
+	 * situation.
+	 */
+	rte_rwlock_write_lock(&share_cache->rwlock);
+	/* Detach the whole free list and release it after unlocking. */
+	free_list = share_cache->mr_free_list;
+	LIST_INIT(&share_cache->mr_free_list);
+	rte_rwlock_write_unlock(&share_cache->rwlock);
+	/* Release resources. */
+	mr_next = LIST_FIRST(&free_list);
+	while (mr_next != NULL) {
+		struct mlx5_mr *mr = mr_next;
+
+		mr_next = LIST_NEXT(mr, mr);
+		mr_free(mr);
+	}
+}
+
+/* Called during rte_memseg_contig_walk() by mlx5_mr_create(). */
+static int
+mr_find_contig_memsegs_cb(const struct rte_memseg_list *msl,
+			  const struct rte_memseg *ms, size_t len, void *arg)
+{
+	struct mr_find_contig_memsegs_data *data = arg;
+
+	if (data->addr < ms->addr_64 || data->addr >= ms->addr_64 + len)
+		return 0;
+	/* Found, save it and stop walking. */
+	data->start = ms->addr_64;
+	data->end = ms->addr_64 + len;
+	data->msl = msl;
+	return 1;
+}
+
+/**
+ * Create a new global Memory Region (MR) for a missing virtual address.
+ * This API should be called on a secondary process, then a request is sent to
+ * the primary process in order to create an MR for the address. As the global
+ * MR list is in shared memory, the following LKey lookup should succeed unless
+ * the request fails.
+ *
+ * @param pd
+ *   Pointer to ibv_pd of a device (net, regex, vdpa,...).
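+ * @param mp_id
+ *   Multi-process identifier of the port, used to request MR creation from
+ *   the primary process.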
+ * @param share_cache
+ *   Pointer to a global shared MR cache.
+ * @param[out] entry
+ *   Pointer to returning MR cache entry, found in the global cache or newly
+ *   created. If failed to create one, this will not be updated.
+ * @param addr
+ *   Target virtual address to register.
+ * @param mr_ext_memseg_en
+ *   Configurable flag: if set, register the whole virtually contiguous
+ *   chunk around the address; otherwise register a single memseg (page).
+ *
+ * @return
+ *   Searched LKey on success, UINT32_MAX on failure and rte_errno is set.
+ */
+static uint32_t
+mlx5_mr_create_secondary(struct ibv_pd *pd __rte_unused,
+			 struct mlx5_mp_id *mp_id,
+			 struct mlx5_mr_share_cache *share_cache,
+			 struct mr_cache_entry *entry, uintptr_t addr,
+			 unsigned int mr_ext_memseg_en __rte_unused)
+{
+	int ret;
+
+	DEBUG("port %u requesting MR creation for address (%p)",
+	      mp_id->port_id, (void *)addr);
+	ret = mlx5_mp_req_mr_create(mp_id, addr);
+	if (ret) {
+		DEBUG("Fail to request MR creation for address (%p)",
+		      (void *)addr);
+		return UINT32_MAX;
+	}
+	rte_rwlock_read_lock(&share_cache->rwlock);
+	/* Fill in output data. */
+	mlx5_mr_lookup_cache(share_cache, entry, addr);
+	/* Lookup can't fail. */
+	MLX5_ASSERT(entry->lkey != UINT32_MAX);
+	rte_rwlock_read_unlock(&share_cache->rwlock);
+	DEBUG("MR CREATED by primary process for %p:\n"
+	      "  [0x%" PRIxPTR ", 0x%" PRIxPTR "), lkey=0x%x",
+	      (void *)addr, entry->start, entry->end, entry->lkey);
+	return entry->lkey;
+}
+
+/**
+ * Create a new global Memory Region (MR) for a missing virtual address.
+ * Register entire virtually contiguous memory chunk around the address.
+ *
+ * @param pd
+ *   Pointer to ibv_pd of a device (net, regex, vdpa,...).
+ * @param share_cache
+ *   Pointer to a global shared MR cache.
+ * @param[out] entry
+ *   Pointer to returning MR cache entry, found in the global cache or newly
+ *   created. If failed to create one, this will not be updated.
+ * @param addr
+ *   Target virtual address to register.
+ * @param mr_ext_memseg_en
+ *   Configurable flag: if set, register the whole virtually contiguous
+ *   chunk around the address; otherwise register a single memseg (page).
+ *
+ * @return
+ *   Searched LKey on success, UINT32_MAX on failure and rte_errno is set.
+ */
+uint32_t
+mlx5_mr_create_primary(struct ibv_pd *pd,
+		       struct mlx5_mr_share_cache *share_cache,
+		       struct mr_cache_entry *entry, uintptr_t addr,
+		       unsigned int mr_ext_memseg_en)
+{
+	struct mr_find_contig_memsegs_data data = {.addr = addr, };
+	struct mr_find_contig_memsegs_data data_re;
+	const struct rte_memseg_list *msl;
+	const struct rte_memseg *ms;
+	struct mlx5_mr *mr = NULL;
+	int ms_idx_shift = -1;
+	uint32_t bmp_size;
+	void *bmp_mem;
+	uint32_t ms_n;
+	uint32_t n;
+	size_t len;
+
+	DRV_LOG(DEBUG, "Creating a MR using address (%p)", (void *)addr);
+	/*
+	 * Release detached MRs if any. This can't be called while holding
+	 * either memory_hotplug_lock or share_cache->rwlock. MRs on the free
+	 * list have been detached by the memory free event but couldn't be
+	 * released inside the callback due to deadlock. As a result, releasing
+	 * resources is quite opportunistic.
+	 */
+	mlx5_mr_garbage_collect(share_cache);
+	/*
+	 * If enabled, find out a contiguous virtual address chunk in use, to
+	 * which the given address belongs, in order to register maximum range.
+	 * In the best case where mempools are not dynamically recreated and
+	 * '--socket-mem' is specified as an EAL option, it is very likely to
+	 * have only one MR (LKey) per socket and per hugepage size even
+	 * though the system memory is highly fragmented. As the whole memory
+	 * chunk will be pinned by kernel, it can't be reused unless entire
+	 * chunk is freed from EAL.
+	 *
+	 * If disabled, just register one memseg (page). Then, memory
+	 * consumption will be minimized but it may drop performance if there
+	 * are many MRs to lookup on the datapath.
+	 */
+	if (!mr_ext_memseg_en) {
+		data.msl = rte_mem_virt2memseg_list((void *)addr);
+		data.start = RTE_ALIGN_FLOOR(addr, data.msl->page_sz);
+		data.end = data.start + data.msl->page_sz;
+	} else if (!rte_memseg_contig_walk(mr_find_contig_memsegs_cb, &data)) {
+		DRV_LOG(WARNING,
+			"Unable to find virtually contiguous"
+			" chunk for address (%p)."
+			" rte_memseg_contig_walk() failed.", (void *)addr);
+		rte_errno = ENXIO;
+		goto err_nolock;
+	}
+alloc_resources:
+	/* Addresses must be page-aligned. */
+	MLX5_ASSERT(data.msl);
+	MLX5_ASSERT(rte_is_aligned((void *)data.start, data.msl->page_sz));
+	MLX5_ASSERT(rte_is_aligned((void *)data.end, data.msl->page_sz));
+	msl = data.msl;
+	ms = rte_mem_virt2memseg((void *)data.start, msl);
+	len = data.end - data.start;
+	MLX5_ASSERT(ms);
+	MLX5_ASSERT(msl->page_sz == ms->hugepage_sz);
+	/* Number of memsegs in the range. */
+	ms_n = len / msl->page_sz;
+	DEBUG("Extending %p to [0x%" PRIxPTR ", 0x%" PRIxPTR "),"
+	      " page_sz=0x%" PRIx64 ", ms_n=%u",
+	      (void *)addr, data.start, data.end, msl->page_sz, ms_n);
+	/* Size of memory for bitmap. */
+	bmp_size = rte_bitmap_get_memory_footprint(ms_n);
+	mr = rte_zmalloc_socket(NULL,
+				RTE_ALIGN_CEIL(sizeof(*mr),
+					       RTE_CACHE_LINE_SIZE) +
+				bmp_size,
+				RTE_CACHE_LINE_SIZE, msl->socket_id);
+	if (mr == NULL) {
+		DEBUG("Unable to allocate memory for a new MR of"
+		      " address (%p).", (void *)addr);
+		rte_errno = ENOMEM;
+		goto err_nolock;
+	}
+	mr->msl = msl;
+	/*
+	 * Save the index of the first memseg and initialize memseg bitmap. To
+	 * see if a memseg of ms_idx in the memseg-list is still valid, check:
+	 *	rte_bitmap_get(mr->bmp, ms_idx - mr->ms_base_idx)
+	 */
+	mr->ms_base_idx = rte_fbarray_find_idx(&msl->memseg_arr, ms);
+	bmp_mem = RTE_PTR_ALIGN_CEIL(mr + 1, RTE_CACHE_LINE_SIZE);
+	mr->ms_bmp = rte_bitmap_init(ms_n, bmp_mem, bmp_size);
+	if (mr->ms_bmp == NULL) {
+		DEBUG("Unable to initialize bitmap for a new MR of"
+		      " address (%p).", (void *)addr);
+		rte_errno = EINVAL;
+		goto err_nolock;
+	}
+	/*
+	 * Should recheck whether the extended contiguous chunk is still valid.
+	 * Because memory_hotplug_lock can't be held if there's any memory
+	 * related calls in a critical path, resource allocation above can't be
+	 * locked. If the memory has been changed at this point, try again with
+	 * just single page. If not, go on with the big chunk atomically from
+	 * here.
+	 */
+	rte_mcfg_mem_read_lock();
+	data_re = data;
+	if (len > msl->page_sz &&
+	    !rte_memseg_contig_walk(mr_find_contig_memsegs_cb, &data_re)) {
+		DEBUG("Unable to find virtually contiguous"
+		      " chunk for address (%p)."
+		      " rte_memseg_contig_walk() failed.", (void *)addr);
+		rte_errno = ENXIO;
+		goto err_memlock;
+	}
+	if (data.start != data_re.start || data.end != data_re.end) {
+		/*
+		 * The extended contiguous chunk has been changed. Try again
+		 * with single memseg instead.
+		 */
+		data.start = RTE_ALIGN_FLOOR(addr, msl->page_sz);
+		data.end = data.start + msl->page_sz;
+		rte_mcfg_mem_read_unlock();
+		mr_free(mr);
+		goto alloc_resources;
+	}
+	MLX5_ASSERT(data.msl == data_re.msl);
+	rte_rwlock_write_lock(&share_cache->rwlock);
+	/*
+	 * Check the address is really missing. If other thread already created
+	 * one or it is not found due to overflow, abort and return.
+	 */
+	if (mlx5_mr_lookup_cache(share_cache, entry, addr) != UINT32_MAX) {
+		/*
+		 * Insert to the global cache table. It may fail due to
+		 * low-on-memory. Then, this entry will have to be searched
+		 * here again.
+		 */
+		mr_btree_insert(&share_cache->cache, entry);
+		DEBUG("Found MR for %p on final lookup, abort", (void *)addr);
+		rte_rwlock_write_unlock(&share_cache->rwlock);
+		rte_mcfg_mem_read_unlock();
+		/*
+		 * Must be unlocked before calling rte_free() because
+		 * mlx5_mr_mem_event_free_cb() can be called inside.
+		 */
+		mr_free(mr);
+		return entry->lkey;
+	}
+	/*
+	 * Trim start and end addresses for verbs MR. Set bits for registering
+	 * memsegs but exclude already registered ones. Bitmap can be
+	 * fragmented.
+	 */
+	for (n = 0; n < ms_n; ++n) {
+		uintptr_t start;
+		struct mr_cache_entry ret;
+
+		memset(&ret, 0, sizeof(ret));
+		start = data_re.start + n * msl->page_sz;
+		/* Exclude memsegs already registered by other MRs. */
+		if (mlx5_mr_lookup_cache(share_cache, &ret, start) ==
+		    UINT32_MAX) {
+			/*
+			 * Start from the first unregistered memseg in the
+			 * extended range.
+			 */
+			if (ms_idx_shift == -1) {
+				mr->ms_base_idx += n;
+				data.start = start;
+				ms_idx_shift = n;
+			}
+			data.end = start + msl->page_sz;
+			rte_bitmap_set(mr->ms_bmp, n - ms_idx_shift);
+			++mr->ms_n;
+		}
+	}
+	len = data.end - data.start;
+	mr->ms_bmp_n = len / msl->page_sz;
+	MLX5_ASSERT(ms_idx_shift + mr->ms_bmp_n <= ms_n);
+	/*
+	 * Finally create a verbs MR for the memory chunk. ibv_reg_mr() can be
+	 * called while holding the memory lock because it doesn't use
+	 * mlx5_alloc_buf_extern() which eventually calls rte_malloc_socket()
+	 * through mlx5_alloc_verbs_buf().
+	 */
+	mr->ibv_mr = mlx5_glue->reg_mr(pd, (void *)data.start, len,
+				       IBV_ACCESS_LOCAL_WRITE |
+					   IBV_ACCESS_RELAXED_ORDERING);
+	if (mr->ibv_mr == NULL) {
+		DEBUG("Fail to create a verbs MR for address (%p)",
+		      (void *)addr);
+		rte_errno = EINVAL;
+		goto err_mrlock;
+	}
+	MLX5_ASSERT((uintptr_t)mr->ibv_mr->addr == data.start);
+	MLX5_ASSERT(mr->ibv_mr->length == len);
+	LIST_INSERT_HEAD(&share_cache->mr_list, mr, mr);
+	DEBUG("MR CREATED (%p) for %p:\n"
+	      "  [0x%" PRIxPTR ", 0x%" PRIxPTR "),"
+	      " lkey=0x%x base_idx=%u ms_n=%u, ms_bmp_n=%u",
+	      (void *)mr, (void *)addr, data.start, data.end,
+	      rte_cpu_to_be_32(mr->ibv_mr->lkey),
+	      mr->ms_base_idx, mr->ms_n, mr->ms_bmp_n);
+	/* Insert to the global cache table. */
+	mlx5_mr_insert_cache(share_cache, mr);
+	/* Fill in output data. */
+	mlx5_mr_lookup_cache(share_cache, entry, addr);
+	/* Lookup can't fail. */
+	MLX5_ASSERT(entry->lkey != UINT32_MAX);
+	rte_rwlock_write_unlock(&share_cache->rwlock);
+	rte_mcfg_mem_read_unlock();
+	return entry->lkey;
+err_mrlock:
+	rte_rwlock_write_unlock(&share_cache->rwlock);
+err_memlock:
+	rte_mcfg_mem_read_unlock();
+err_nolock:
+	/*
+	 * In case of error, as this can be called in a datapath, a warning
+	 * message per error is preferable. Must be unlocked before calling
+	 * rte_free() because mlx5_mr_mem_event_free_cb() can be called
+	 * inside.
+	 */
+	mr_free(mr);
+	return UINT32_MAX;
+}
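For illustration, a primary-process caller of the function above might look
as follows. This is a sketch only; the helper name, the hard-coded
mr_ext_memseg_en value and the error handling are assumptions, not part of
this patch:

	#include <rte_errno.h>
	#include <mlx5_common_mr.h>

	static uint32_t
	mr_register_primary(struct ibv_pd *pd,
			    struct mlx5_mr_share_cache *share_cache, void *buf)
	{
		struct mr_cache_entry entry;
		uint32_t lkey;

		/* Must run in the primary process. */
		lkey = mlx5_mr_create_primary(pd, share_cache, &entry,
					      (uintptr_t)buf,
					      1 /* mr_ext_memseg_en */);
		if (lkey == UINT32_MAX)
			/* rte_errno is set: ENOMEM, ENXIO or EINVAL. */
			return UINT32_MAX;
		/* entry.start/entry.end bound the registered chunk. */
		return lkey;
	}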
+
+/**
+ * Create a new global Memory Region (MR) for a missing virtual address.
+ * This can be called from both the primary and the secondary process.
+ *
+ * @param pd
+ *   Pointer to ibv_pd of a device (net, regex, vdpa,...).
+ * @param mp_id
+ *   Multi-process identifier, used by the secondary process to request
+ *   MR creation from the primary process.
+ * @param share_cache
+ *   Pointer to a global shared MR cache.
+ * @param[out] entry
+ *   Pointer to returning MR cache entry, found in the global cache or newly
+ *   created. If failed to create one, this will not be updated.
+ * @param addr
+ *   Target virtual address to register.
+ * @param mr_ext_memseg_en
+ *   Configurable flag about external memseg MR extension.
+ *
+ * @return
+ *   Searched LKey on success, UINT32_MAX on failure and rte_errno is set.
+ */
+static uint32_t
+mlx5_mr_create(struct ibv_pd *pd, struct mlx5_mp_id *mp_id,
+	       struct mlx5_mr_share_cache *share_cache,
+	       struct mr_cache_entry *entry, uintptr_t addr,
+	       unsigned int mr_ext_memseg_en)
+{
+	uint32_t ret = 0;
+
+	switch (rte_eal_process_type()) {
+	case RTE_PROC_PRIMARY:
+		ret = mlx5_mr_create_primary(pd, share_cache, entry,
+					     addr, mr_ext_memseg_en);
+		break;
+	case RTE_PROC_SECONDARY:
+		ret = mlx5_mr_create_secondary(pd, mp_id, share_cache, entry,
+					       addr, mr_ext_memseg_en);
+		break;
+	default:
+		break;
+	}
+	return ret;
+}
+
+/**
+ * Look up address in the global MR cache table. If not found, create a new
+ * MR. Insert the found/created entry into the local bottom-half cache table.
+ *
+ * @param pd
+ *   Pointer to ibv_pd of a device (net, regex, vdpa,...).
+ * @param mp_id
+ *   Multi-process identifier, used when MR creation must be requested
+ *   from the primary process.
+ * @param share_cache
+ *   Pointer to a global shared MR cache.
+ * @param mr_ctrl
+ *   Pointer to per-queue MR control structure.
+ * @param[out] entry
+ *   Pointer to returning MR cache entry, found in the global cache or newly
+ *   created. If failed to create one, this is not written.
+ * @param addr
+ *   Search key.
+ * @param mr_ext_memseg_en
+ *   Configurable flag about external memseg MR extension.
+ * @return
+ *   Searched LKey on success, UINT32_MAX on no match.
+ */
+static uint32_t
+mr_lookup_caches(struct ibv_pd *pd, struct mlx5_mp_id *mp_id,
+		 struct mlx5_mr_share_cache *share_cache,
+		 struct mlx5_mr_ctrl *mr_ctrl,
+		 struct mr_cache_entry *entry, uintptr_t addr,
+		 unsigned int mr_ext_memseg_en)
+{
+	struct mlx5_mr_btree *bt = &mr_ctrl->cache_bh;
+	uint32_t lkey;
+	uint16_t idx;
+
+	/* If local cache table is full, try to double it. */
+	if (unlikely(bt->len == bt->size))
+		mr_btree_expand(bt, bt->size << 1);
+	/* Look up in the global cache. */
+	rte_rwlock_read_lock(&share_cache->rwlock);
+	lkey = mr_btree_lookup(&share_cache->cache, &idx, addr);
+	if (lkey != UINT32_MAX) {
+		/* Found. */
+		*entry = (*share_cache->cache.table)[idx];
+		rte_rwlock_read_unlock(&share_cache->rwlock);
+		/*
+		 * Update local cache. Even if it fails, return the found entry
+		 * to update top-half cache. Next time, this entry will be found
+		 * in the global cache.
+		 */
+		mr_btree_insert(bt, entry);
+		return lkey;
+	}
+	rte_rwlock_read_unlock(&share_cache->rwlock);
+	/* First time to see the address? Create a new MR. */
+	lkey = mlx5_mr_create(pd, mp_id, share_cache, entry, addr,
+			      mr_ext_memseg_en);
+	/*
+	 * Update the local cache if a new global MR was successfully created.
+	 * Even if creation failed, there's no action to take in this datapath
+	 * code. As the returned LKey is invalid, this will eventually make
+	 * HW fail.
+	 */
+	if (lkey != UINT32_MAX)
+		mr_btree_insert(bt, entry);
+	return lkey;
+}
+
+/**
+ * Bottom-half of LKey search on datapath. Search in cache_bh[] first and,
+ * on a miss, search the global MR cache table and propagate the new entry
+ * to the per-queue local caches.
+ *
+ * @param pd
+ *   Pointer to ibv_pd of a device (net, regex, vdpa,...).
+ * @param mp_id
+ *   Multi-process identifier, used when MR creation must be requested
+ *   from the primary process.
+ * @param share_cache
+ *   Pointer to a global shared MR cache.
+ * @param mr_ctrl
+ *   Pointer to per-queue MR control structure.
+ * @param addr
+ *   Search key.
+ * @param mr_ext_memseg_en
+ *   Configurable flag about external memseg MR extension.
+ *
+ * @return
+ *   Searched LKey on success, UINT32_MAX on no match.
+ */
+uint32_t mlx5_mr_addr2mr_bh(struct ibv_pd *pd, struct mlx5_mp_id *mp_id,
+			    struct mlx5_mr_share_cache *share_cache,
+			    struct mlx5_mr_ctrl *mr_ctrl,
+			    uintptr_t addr, unsigned int mr_ext_memseg_en)
+{
+	uint32_t lkey;
+	uint16_t bh_idx = 0;
+	/* Victim in top-half cache to replace with new entry. */
+	struct mr_cache_entry *repl = &mr_ctrl->cache[mr_ctrl->head];
+
+	/* Binary-search MR translation table. */
+	lkey = mr_btree_lookup(&mr_ctrl->cache_bh, &bh_idx, addr);
+	/* Update top-half cache. */
+	if (likely(lkey != UINT32_MAX)) {
+		*repl = (*mr_ctrl->cache_bh.table)[bh_idx];
+	} else {
+		/*
+		 * On a miss in the local lookup table, search the global
+		 * cache; the local cache_bh[] is updated inside if possible,
+		 * and so is the top-half cache entry.
+		 */
+		lkey = mr_lookup_caches(pd, mp_id, share_cache, mr_ctrl,
+					repl, addr, mr_ext_memseg_en);
+		if (unlikely(lkey == UINT32_MAX))
+			return UINT32_MAX;
+	}
+	/* Update the most recently used entry. */
+	mr_ctrl->mru = mr_ctrl->head;
+	/* Point to the next victim, the oldest. */
+	mr_ctrl->head = (mr_ctrl->head + 1) % MLX5_MR_CACHE_N;
+	return lkey;
+}
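The per-queue Rx/Tx wrappers in the net PMD, updated later in this patch,
reduce to a call of exactly this shape; a sketch, with only the wrapper name
being ours:

	static uint32_t
	queue_addr2mr_bh(struct mlx5_priv *priv, struct mlx5_mr_ctrl *mr_ctrl,
			 uintptr_t addr)
	{
		return mlx5_mr_addr2mr_bh(priv->sh->pd, &priv->mp_id,
					  &priv->sh->share_cache, mr_ctrl, addr,
					  priv->config.mr_ext_memseg_en);
	}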
+
+/**
+ * Release all the created MRs and resources of the global MR cache of
+ * a device.
+ *
+ * @param share_cache
+ *   Pointer to a global shared MR cache.
+ */
+void
+mlx5_mr_release_cache(struct mlx5_mr_share_cache *share_cache)
+{
+	struct mlx5_mr *mr_next;
+
+	rte_rwlock_write_lock(&share_cache->rwlock);
+	/* Detach from MR list and move to free list. */
+	mr_next = LIST_FIRST(&share_cache->mr_list);
+	while (mr_next != NULL) {
+		struct mlx5_mr *mr = mr_next;
+
+		mr_next = LIST_NEXT(mr, mr);
+		LIST_REMOVE(mr, mr);
+		LIST_INSERT_HEAD(&share_cache->mr_free_list, mr, mr);
+	}
+	LIST_INIT(&share_cache->mr_list);
+	/* Free global cache. */
+	mlx5_mr_btree_free(&share_cache->cache);
+	rte_rwlock_write_unlock(&share_cache->rwlock);
+	/* Free all remaining MRs. */
+	mlx5_mr_garbage_collect(share_cache);
+}
+
+/**
+ * Flush all of the local cache entries.
+ *
+ * @param mr_ctrl
+ *   Pointer to per-queue MR local cache.
+ */
+void
+mlx5_mr_flush_local_cache(struct mlx5_mr_ctrl *mr_ctrl)
+{
+	/* Reset the most-recently-used index. */
+	mr_ctrl->mru = 0;
+	/* Reset the linear search array. */
+	mr_ctrl->head = 0;
+	memset(mr_ctrl->cache, 0, sizeof(mr_ctrl->cache));
+	/* Reset the B-tree table. */
+	mr_ctrl->cache_bh.len = 1;
+	mr_ctrl->cache_bh.overflow = 0;
+	/* Update the generation number. */
+	mr_ctrl->cur_gen = *mr_ctrl->dev_gen_ptr;
+	DRV_LOG(DEBUG, "mr_ctrl(%p): flushed, cur_gen=%d",
+		(void *)mr_ctrl, mr_ctrl->cur_gen);
+}
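The flush is driven by the generation counter: the device side bumps
share_cache.dev_gen (see mlx5_mr_mem_event_free_cb() below) and each queue
compares it against its cached copy before trusting the local arrays. A
minimal consumer-side sketch, with the helper name being ours:

	static __rte_always_inline void
	mr_ctrl_sync_gen(struct mlx5_mr_ctrl *mr_ctrl)
	{
		/* Flush stale per-queue entries when the device moved on. */
		if (unlikely(*mr_ctrl->dev_gen_ptr != mr_ctrl->cur_gen))
			mlx5_mr_flush_local_cache(mr_ctrl);
	}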
+
+/**
+ * Create a memory region for external memory, that is, memory which is not
+ * part of the DPDK memory segments.
+ *
+ * @param pd
+ *   Pointer to ibv_pd of a device (net, regex, vdpa,...).
+ * @param addr
+ *   Starting virtual address of memory.
+ * @param len
+ *   Length of memory segment being mapped.
+ * @param socket_id
+ *   Socket to allocate heap memory for the control structures.
+ *
+ * @return
+ *   Pointer to MR structure on success, NULL otherwise.
+ */
+struct mlx5_mr *
+mlx5_create_mr_ext(struct ibv_pd *pd, uintptr_t addr, size_t len, int socket_id)
+{
+	struct mlx5_mr *mr = NULL;
+
+	mr = rte_zmalloc_socket(NULL,
+				RTE_ALIGN_CEIL(sizeof(*mr),
+					       RTE_CACHE_LINE_SIZE),
+				RTE_CACHE_LINE_SIZE, socket_id);
+	if (mr == NULL)
+		return NULL;
+	mr->ibv_mr = mlx5_glue->reg_mr(pd, (void *)addr, len,
+				       IBV_ACCESS_LOCAL_WRITE |
+					   IBV_ACCESS_RELAXED_ORDERING);
+	if (mr->ibv_mr == NULL) {
+		DRV_LOG(WARNING,
+			"Fail to create a verbs MR for address (%p)",
+			(void *)addr);
+		rte_free(mr);
+		return NULL;
+	}
+	mr->msl = NULL; /* Mark it as external memory. */
+	mr->ms_bmp = NULL;
+	mr->ms_n = 1;
+	mr->ms_bmp_n = 1;
+	DRV_LOG(DEBUG,
+		"MR CREATED (%p) for external memory %p:\n"
+		"  [0x%" PRIxPTR ", 0x%" PRIxPTR "),"
+		" lkey=0x%x base_idx=%u ms_n=%u, ms_bmp_n=%u",
+		(void *)mr, (void *)addr,
+		addr, addr + len, rte_cpu_to_be_32(mr->ibv_mr->lkey),
+		mr->ms_base_idx, mr->ms_n, mr->ms_bmp_n);
+	return mr;
+}
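Callers are expected to link the returned MR into the shared cache
themselves, as mlx5_mr_update_ext_mp_cb() and mlx5_dma_map() do later in
this patch; a sketch under that assumption (the helper name is ours):

	static int
	register_ext_buf(struct ibv_pd *pd,
			 struct mlx5_mr_share_cache *share_cache,
			 void *buf, size_t len, int socket_id)
	{
		struct mlx5_mr *mr;

		mr = mlx5_create_mr_ext(pd, (uintptr_t)buf, len, socket_id);
		if (mr == NULL)
			return -1;
		rte_rwlock_write_lock(&share_cache->rwlock);
		LIST_INSERT_HEAD(&share_cache->mr_list, mr, mr);
		/* Publish the chunk in the global B-tree cache. */
		mlx5_mr_insert_cache(share_cache, mr);
		rte_rwlock_write_unlock(&share_cache->rwlock);
		return 0;
	}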
+
+/**
+ * Dump all the created MRs and the global cache entries.
+ *
+ * @param share_cache
+ *   Pointer to a global shared MR cache.
+ */
+void
+mlx5_mr_dump_cache(struct mlx5_mr_share_cache *share_cache __rte_unused)
+{
+#ifdef RTE_LIBRTE_MLX5_DEBUG
+	struct mlx5_mr *mr;
+	int mr_n = 0;
+	int chunk_n = 0;
+
+	rte_rwlock_read_lock(&share_cache->rwlock);
+	/* Iterate all the existing MRs. */
+	LIST_FOREACH(mr, &share_cache->mr_list, mr) {
+		unsigned int n;
+
+		DEBUG("MR[%u], LKey = 0x%x, ms_n = %u, ms_bmp_n = %u",
+		      mr_n++, rte_cpu_to_be_32(mr->ibv_mr->lkey),
+		      mr->ms_n, mr->ms_bmp_n);
+		if (mr->ms_n == 0)
+			continue;
+		for (n = 0; n < mr->ms_bmp_n; ) {
+			struct mr_cache_entry ret = { 0, };
+
+			n = mr_find_next_chunk(mr, &ret, n);
+			if (!ret.end)
+				break;
+			DEBUG("  chunk[%u], [0x%" PRIxPTR ", 0x%" PRIxPTR ")",
+			      chunk_n++, ret.start, ret.end);
+		}
+	}
+	DEBUG("Dumping global cache %p", (void *)share_cache);
+	mlx5_mr_btree_dump(&share_cache->cache);
+	rte_rwlock_read_unlock(&share_cache->rwlock);
+#endif
+}
diff --git a/drivers/common/mlx5/mlx5_common_mr.h b/drivers/common/mlx5/mlx5_common_mr.h
new file mode 100644
index 0000000000..e805f96375
--- /dev/null
+++ b/drivers/common/mlx5/mlx5_common_mr.h
@@ -0,0 +1,160 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright 2018 6WIND S.A.
+ * Copyright 2018 Mellanox Technologies, Ltd
+ */
+
+#ifndef RTE_PMD_MLX5_COMMON_MR_H_
+#define RTE_PMD_MLX5_COMMON_MR_H_
+
+#include <stddef.h>
+#include <stdint.h>
+#include <sys/queue.h>
+
+/* Verbs header. */
+/* ISO C doesn't support unnamed structs/unions, disabling -pedantic. */
+#ifdef PEDANTIC
+#pragma GCC diagnostic ignored "-Wpedantic"
+#endif
+#include <infiniband/verbs.h>
+#include <infiniband/mlx5dv.h>
+#ifdef PEDANTIC
+#pragma GCC diagnostic error "-Wpedantic"
+#endif
+
+#include <rte_rwlock.h>
+#include <rte_bitmap.h>
+#include <rte_memory.h>
+
+#include "mlx5_common_mp.h"
+
+/* Size of per-queue MR cache array for linear search. */
+#define MLX5_MR_CACHE_N 8
+#define MLX5_MR_BTREE_CACHE_N 256
+
+/* Memory Region object. */
+struct mlx5_mr {
+	LIST_ENTRY(mlx5_mr) mr; /**< Pointer to the prev/next entry. */
+	struct ibv_mr *ibv_mr; /* Verbs Memory Region. */
+	const struct rte_memseg_list *msl;
+	int ms_base_idx; /* Start index of msl->memseg_arr[]. */
+	int ms_n; /* Number of memsegs in use. */
+	uint32_t ms_bmp_n; /* Number of bits in memsegs bit-mask. */
+	struct rte_bitmap *ms_bmp; /* Bit-mask of memsegs in the MR. */
+};
+
+/* Cache entry for Memory Region. */
+struct mr_cache_entry {
+	uintptr_t start; /* Start address of MR. */
+	uintptr_t end; /* End address of MR. */
+	uint32_t lkey; /* rte_cpu_to_be_32(ibv_mr->lkey). */
+} __rte_packed;
+
+/* MR cache table for binary search. */
+struct mlx5_mr_btree {
+	uint16_t len; /* Number of entries in use. */
+	uint16_t size; /* Total number of allocated entries. */
+	int overflow; /* Mark failure of table expansion. */
+	struct mr_cache_entry (*table)[];
+} __rte_packed;
+
+/* Per-queue MR control descriptor. */
+struct mlx5_mr_ctrl {
+	uint32_t *dev_gen_ptr; /* Generation number of device to poll. */
+	uint32_t cur_gen; /* Generation number saved to flush caches. */
+	uint16_t mru; /* Index of last hit entry in top-half cache. */
+	uint16_t head; /* Index of the oldest entry in top-half cache. */
+	struct mr_cache_entry cache[MLX5_MR_CACHE_N]; /* Cache for top-half. */
+	struct mlx5_mr_btree cache_bh; /* Cache for bottom-half. */
+} __rte_packed;
+
+LIST_HEAD(mlx5_mr_list, mlx5_mr);
+
+/* Global per-device MR cache. */
+struct mlx5_mr_share_cache {
+	uint32_t dev_gen; /* Generation number to flush local caches. */
+	rte_rwlock_t rwlock; /* MR cache Lock. */
+	struct mlx5_mr_btree cache; /* Global MR cache table. */
+	struct mlx5_mr_list mr_list; /* Registered MR list. */
+	struct mlx5_mr_list mr_free_list; /* Freed MR list. */
+} __rte_packed;
+
+/**
+ * Look up LKey from the given lookup table by linear search. Check the
+ * last-hit entry first. On a miss, search the entire array; if found,
+ * update the last-hit index and return the LKey.
+ *
+ * @param lkp_tbl
+ *   Pointer to lookup table.
+ * @param[in,out] cached_idx
+ *   Pointer to last-hit index.
+ * @param n
+ *   Size of lookup table.
+ * @param addr
+ *   Search key.
+ *
+ * @return
+ *   Searched LKey on success, UINT32_MAX on no match.
+ */
+static __rte_always_inline uint32_t
+mlx5_mr_lookup_lkey(struct mr_cache_entry *lkp_tbl, uint16_t *cached_idx,
+		    uint16_t n, uintptr_t addr)
+{
+	uint16_t idx;
+
+	if (likely(addr >= lkp_tbl[*cached_idx].start &&
+		   addr < lkp_tbl[*cached_idx].end))
+		return lkp_tbl[*cached_idx].lkey;
+	for (idx = 0; idx < n && lkp_tbl[idx].start != 0; ++idx) {
+		if (addr >= lkp_tbl[idx].start &&
+		    addr < lkp_tbl[idx].end) {
+			/* Found. */
+			*cached_idx = idx;
+			return lkp_tbl[idx].lkey;
+		}
+	}
+	return UINT32_MAX;
+}
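Datapath code is expected to stitch the inline top-half above to the
exported bottom-half; a sketch of the combined lookup (only the wrapper
name is ours):

	static __rte_always_inline uint32_t
	mlx5_mr_addr2lkey(struct ibv_pd *pd, struct mlx5_mp_id *mp_id,
			  struct mlx5_mr_share_cache *share_cache,
			  struct mlx5_mr_ctrl *mr_ctrl, uintptr_t addr,
			  unsigned int mr_ext_memseg_en)
	{
		uint32_t lkey;

		/* Top-half: linear search starting from the last hit. */
		lkey = mlx5_mr_lookup_lkey(mr_ctrl->cache, &mr_ctrl->mru,
					   MLX5_MR_CACHE_N, addr);
		if (likely(lkey != UINT32_MAX))
			return lkey;
		/* Bottom-half: B-tree, global cache, then MR creation. */
		return mlx5_mr_addr2mr_bh(pd, mp_id, share_cache, mr_ctrl,
					  addr, mr_ext_memseg_en);
	}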
+
+__rte_experimental
+int mlx5_mr_btree_init(struct mlx5_mr_btree *bt, int n, int socket);
+__rte_experimental
+void mlx5_mr_btree_free(struct mlx5_mr_btree *bt);
+__rte_experimental
+void mlx5_mr_btree_dump(struct mlx5_mr_btree *bt __rte_unused);
+__rte_experimental
+uint32_t mlx5_mr_addr2mr_bh(struct ibv_pd *pd, struct mlx5_mp_id *mp_id,
+			    struct mlx5_mr_share_cache *share_cache,
+			    struct mlx5_mr_ctrl *mr_ctrl,
+			    uintptr_t addr, unsigned int mr_ext_memseg_en);
+__rte_experimental
+void mlx5_mr_release_cache(struct mlx5_mr_share_cache *share_cache);
+__rte_experimental
+void mlx5_mr_dump_cache(struct mlx5_mr_share_cache *share_cache __rte_unused);
+__rte_experimental
+void mlx5_mr_rebuild_cache(struct mlx5_mr_share_cache *share_cache);
+__rte_experimental
+void mlx5_mr_flush_local_cache(struct mlx5_mr_ctrl *mr_ctrl);
+__rte_experimental
+int
+mlx5_mr_insert_cache(struct mlx5_mr_share_cache *share_cache,
+		     struct mlx5_mr *mr);
+__rte_experimental
+uint32_t
+mlx5_mr_lookup_cache(struct mlx5_mr_share_cache *share_cache,
+		     struct mr_cache_entry *entry, uintptr_t addr);
+__rte_experimental
+struct mlx5_mr *
+mlx5_mr_lookup_list(struct mlx5_mr_share_cache *share_cache,
+		    struct mr_cache_entry *entry, uintptr_t addr);
+__rte_experimental
+struct mlx5_mr *
+mlx5_create_mr_ext(struct ibv_pd *pd, uintptr_t addr, size_t len,
+		   int socket_id);
+__rte_experimental
+uint32_t
+mlx5_mr_create_primary(struct ibv_pd *pd,
+		       struct mlx5_mr_share_cache *share_cache,
+		       struct mr_cache_entry *entry, uintptr_t addr,
+		       unsigned int mr_ext_memseg_en);
+
+#endif /* RTE_PMD_MLX5_COMMON_MR_H_ */
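A new mlx5-class PMD (regex, vdpa, ...) adopting this API would set up and
tear down the shared cache roughly as the net PMD does in mlx5.c below; a
sketch under that assumption, with hypothetical helper names and error
handling elided:

	static int
	dev_mr_init(struct mlx5_mr_share_cache *share_cache, int numa_node)
	{
		memset(share_cache, 0, sizeof(*share_cache));
		rte_rwlock_init(&share_cache->rwlock);
		return mlx5_mr_btree_init(&share_cache->cache,
					  MLX5_MR_BTREE_CACHE_N * 2,
					  numa_node);
	}

	static void
	dev_mr_close(struct mlx5_mr_share_cache *share_cache)
	{
		mlx5_mr_release_cache(share_cache);
	}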
diff --git a/drivers/common/mlx5/rte_common_mlx5_version.map b/drivers/common/mlx5/rte_common_mlx5_version.map
index 265703d1c9..b58a378278 100644
--- a/drivers/common/mlx5/rte_common_mlx5_version.map
+++ b/drivers/common/mlx5/rte_common_mlx5_version.map
@@ -61,4 +61,18 @@ EXPERIMENTAL {
 	mlx5_mp_req_mr_create;
 	mlx5_mp_req_queue_state_modify;
 	mlx5_mp_req_verbs_cmd_fd;
+
+	mlx5_mr_btree_init;
+	mlx5_mr_btree_free;
+	mlx5_mr_btree_dump;
+	mlx5_mr_addr2mr_bh;
+	mlx5_mr_release_cache;
+	mlx5_mr_dump_cache;
+	mlx5_mr_rebuild_cache;
+	mlx5_mr_insert_cache;
+	mlx5_mr_lookup_cache;
+	mlx5_mr_lookup_list;
+	mlx5_create_mr_ext;
+	mlx5_mr_create_primary;
+	mlx5_mr_flush_local_cache;
 };
diff --git a/drivers/net/mlx5/mlx5.c b/drivers/net/mlx5/mlx5.c
index d87c384422..f8b134ca66 100644
--- a/drivers/net/mlx5/mlx5.c
+++ b/drivers/net/mlx5/mlx5.c
@@ -623,7 +623,7 @@ mlx5_alloc_shared_ibctx(const struct mlx5_dev_spawn_data *spawn,
 	 * At this point the device is not added to the memory
 	 * event list yet, context is just being created.
 	 */
-	err = mlx5_mr_btree_init(&sh->mr.cache,
+	err = mlx5_mr_btree_init(&sh->share_cache.cache,
 				 MLX5_MR_BTREE_CACHE_N * 2,
 				 spawn->pci_dev->device.numa_node);
 	if (err) {
@@ -695,7 +695,7 @@ mlx5_free_shared_ibctx(struct mlx5_ibv_shared *sh)
 	LIST_REMOVE(sh, mem_event_cb);
 	rte_rwlock_write_unlock(&mlx5_shared_data->mem_event_rwlock);
 	/* Release created Memory Regions. */
-	mlx5_mr_release(sh);
+	mlx5_mr_release_cache(&sh->share_cache);
 	/* Remove context from the global device list. */
 	LIST_REMOVE(sh, next);
 	/*
diff --git a/drivers/net/mlx5/mlx5.h b/drivers/net/mlx5/mlx5.h
index e9d5868883..c45c01e916 100644
--- a/drivers/net/mlx5/mlx5.h
+++ b/drivers/net/mlx5/mlx5.h
@@ -37,10 +37,10 @@
 #include <mlx5_prm.h>
 #include <mlx5_nl.h>
 #include <mlx5_common_mp.h>
+#include <mlx5_common_mr.h>
 
 #include "mlx5_defs.h"
 #include "mlx5_utils.h"
-#include "mlx5_mr.h"
 #include "mlx5_autoconf.h"
 
 /** Key string for IPC. */
@@ -199,8 +199,6 @@ struct mlx5_verbs_alloc_ctx {
 	const void *obj; /* Pointer to the DPDK object. */
 };
 
-LIST_HEAD(mlx5_mr_list, mlx5_mr);
-
 /* Flow drop context necessary due to Verbs API. */
 struct mlx5_drop {
 	struct mlx5_hrxq *hrxq; /* Hash Rx queue queue. */
@@ -411,13 +409,7 @@ struct mlx5_ibv_shared {
 	struct ibv_device_attr_ex device_attr; /* Device properties. */
 	LIST_ENTRY(mlx5_ibv_shared) mem_event_cb;
 	/**< Called by memory event callback. */
-	struct {
-		uint32_t dev_gen; /* Generation number to flush local caches. */
-		rte_rwlock_t rwlock; /* MR Lock. */
-		struct mlx5_mr_btree cache; /* Global MR cache table. */
-		struct mlx5_mr_list mr_list; /* Registered MR list. */
-		struct mlx5_mr_list mr_free_list; /* Freed MR list. */
-	} mr;
+	struct mlx5_mr_share_cache share_cache;
 	/* Shared DV/DR flow data section. */
 	pthread_mutex_t dv_mutex; /* DV context mutex. */
 	uint32_t dv_meta_mask; /* flow META metadata supported mask. */
diff --git a/drivers/net/mlx5/mlx5_mp.c b/drivers/net/mlx5/mlx5_mp.c
index 43684dbc3a..7ad322d474 100644
--- a/drivers/net/mlx5/mlx5_mp.c
+++ b/drivers/net/mlx5/mlx5_mp.c
@@ -11,6 +11,7 @@
 #include <rte_string_fns.h>
 
 #include <mlx5_common_mp.h>
+#include <mlx5_common_mr.h>
 
 #include "mlx5.h"
 #include "mlx5_rxtx.h"
@@ -25,7 +26,7 @@ mlx5_mp_primary_handle(const struct rte_mp_msg *mp_msg, const void *peer)
 		(const struct mlx5_mp_param *)mp_msg->param;
 	struct rte_eth_dev *dev;
 	struct mlx5_priv *priv;
-	struct mlx5_mr_cache entry;
+	struct mr_cache_entry entry;
 	uint32_t lkey;
 	int ret;
 
@@ -40,7 +41,10 @@ mlx5_mp_primary_handle(const struct rte_mp_msg *mp_msg, const void *peer)
 	switch (param->type) {
 	case MLX5_MP_REQ_CREATE_MR:
 		mp_init_msg(&priv->mp_id, &mp_res, param->type);
-		lkey = mlx5_mr_create_primary(dev, &entry, param->args.addr);
+		lkey = mlx5_mr_create_primary(priv->sh->pd,
+					      &priv->sh->share_cache,
+					      &entry, param->args.addr,
+					      priv->config.mr_ext_memseg_en);
 		if (lkey == UINT32_MAX)
 			res->result = -rte_errno;
 		ret = rte_mp_reply(&mp_res, peer);
diff --git a/drivers/net/mlx5/mlx5_mr.c b/drivers/net/mlx5/mlx5_mr.c
index 9151992a72..2b4b3e2891 100644
--- a/drivers/net/mlx5/mlx5_mr.c
+++ b/drivers/net/mlx5/mlx5_mr.c
@@ -18,6 +18,8 @@
 #include <rte_bus_pci.h>
 
 #include <mlx5_glue.h>
+#include <mlx5_common_mp.h>
+#include <mlx5_common_mr.h>
 
 #include "mlx5.h"
 #include "mlx5_mr.h"
@@ -36,834 +38,6 @@ struct mr_update_mp_data {
 	int ret;
 };
 
-/**
- * Expand B-tree table to a given size. Can't be called with holding
- * memory_hotplug_lock or sh->mr.rwlock due to rte_realloc().
- *
- * @param bt
- *   Pointer to B-tree structure.
- * @param n
- *   Number of entries for expansion.
- *
- * @return
- *   0 on success, -1 on failure.
- */
-static int
-mr_btree_expand(struct mlx5_mr_btree *bt, int n)
-{
-	void *mem;
-	int ret = 0;
-
-	if (n <= bt->size)
-		return ret;
-	/*
-	 * Downside of directly using rte_realloc() is that SOCKET_ID_ANY is
-	 * used inside if there's no room to expand. Because this is a quite
-	 * rare case and a part of very slow path, it is very acceptable.
-	 * Initially cache_bh[] will be given practically enough space and once
-	 * it is expanded, expansion wouldn't be needed again ever.
-	 */
-	mem = rte_realloc(bt->table, n * sizeof(struct mlx5_mr_cache), 0);
-	if (mem == NULL) {
-		/* Not an error, B-tree search will be skipped. */
-		DRV_LOG(WARNING, "failed to expand MR B-tree (%p) table",
-			(void *)bt);
-		ret = -1;
-	} else {
-		DRV_LOG(DEBUG, "expanded MR B-tree table (size=%u)", n);
-		bt->table = mem;
-		bt->size = n;
-	}
-	return ret;
-}
-
-/**
- * Look up LKey from given B-tree lookup table, store the last index and return
- * searched LKey.
- *
- * @param bt
- *   Pointer to B-tree structure.
- * @param[out] idx
- *   Pointer to index. Even on search failure, returns index where it stops
- *   searching so that index can be used when inserting a new entry.
- * @param addr
- *   Search key.
- *
- * @return
- *   Searched LKey on success, UINT32_MAX on no match.
- */
-static uint32_t
-mr_btree_lookup(struct mlx5_mr_btree *bt, uint16_t *idx, uintptr_t addr)
-{
-	struct mlx5_mr_cache *lkp_tbl;
-	uint16_t n;
-	uint16_t base = 0;
-
-	MLX5_ASSERT(bt != NULL);
-	lkp_tbl = *bt->table;
-	n = bt->len;
-	/* First entry must be NULL for comparison. */
-	MLX5_ASSERT(bt->len > 0 || (lkp_tbl[0].start == 0 &&
-				    lkp_tbl[0].lkey == UINT32_MAX));
-	/* Binary search. */
-	do {
-		register uint16_t delta = n >> 1;
-
-		if (addr < lkp_tbl[base + delta].start) {
-			n = delta;
-		} else {
-			base += delta;
-			n -= delta;
-		}
-	} while (n > 1);
-	MLX5_ASSERT(addr >= lkp_tbl[base].start);
-	*idx = base;
-	if (addr < lkp_tbl[base].end)
-		return lkp_tbl[base].lkey;
-	/* Not found. */
-	return UINT32_MAX;
-}
-
-/**
- * Insert an entry to B-tree lookup table.
- *
- * @param bt
- *   Pointer to B-tree structure.
- * @param entry
- *   Pointer to new entry to insert.
- *
- * @return
- *   0 on success, -1 on failure.
- */
-static int
-mr_btree_insert(struct mlx5_mr_btree *bt, struct mlx5_mr_cache *entry)
-{
-	struct mlx5_mr_cache *lkp_tbl;
-	uint16_t idx = 0;
-	size_t shift;
-
-	MLX5_ASSERT(bt != NULL);
-	MLX5_ASSERT(bt->len <= bt->size);
-	MLX5_ASSERT(bt->len > 0);
-	lkp_tbl = *bt->table;
-	/* Find out the slot for insertion. */
-	if (mr_btree_lookup(bt, &idx, entry->start) != UINT32_MAX) {
-		DRV_LOG(DEBUG,
-			"abort insertion to B-tree(%p): already exist at"
-			" idx=%u [0x%" PRIxPTR ", 0x%" PRIxPTR ") lkey=0x%x",
-			(void *)bt, idx, entry->start, entry->end, entry->lkey);
-		/* Already exist, return. */
-		return 0;
-	}
-	/* If table is full, return error. */
-	if (unlikely(bt->len == bt->size)) {
-		bt->overflow = 1;
-		return -1;
-	}
-	/* Insert entry. */
-	++idx;
-	shift = (bt->len - idx) * sizeof(struct mlx5_mr_cache);
-	if (shift)
-		memmove(&lkp_tbl[idx + 1], &lkp_tbl[idx], shift);
-	lkp_tbl[idx] = *entry;
-	bt->len++;
-	DRV_LOG(DEBUG,
-		"inserted B-tree(%p)[%u],"
-		" [0x%" PRIxPTR ", 0x%" PRIxPTR ") lkey=0x%x",
-		(void *)bt, idx, entry->start, entry->end, entry->lkey);
-	return 0;
-}
-
-/**
- * Initialize B-tree and allocate memory for lookup table.
- *
- * @param bt
- *   Pointer to B-tree structure.
- * @param n
- *   Number of entries to allocate.
- * @param socket
- *   NUMA socket on which memory must be allocated.
- *
- * @return
- *   0 on success, a negative errno value otherwise and rte_errno is set.
- */
-int
-mlx5_mr_btree_init(struct mlx5_mr_btree *bt, int n, int socket)
-{
-	if (bt == NULL) {
-		rte_errno = EINVAL;
-		return -rte_errno;
-	}
-	MLX5_ASSERT(!bt->table && !bt->size);
-	memset(bt, 0, sizeof(*bt));
-	bt->table = rte_calloc_socket("B-tree table",
-				      n, sizeof(struct mlx5_mr_cache),
-				      0, socket);
-	if (bt->table == NULL) {
-		rte_errno = ENOMEM;
-		DEBUG("failed to allocate memory for btree cache on socket %d",
-		      socket);
-		return -rte_errno;
-	}
-	bt->size = n;
-	/* First entry must be NULL for binary search. */
-	(*bt->table)[bt->len++] = (struct mlx5_mr_cache) {
-		.lkey = UINT32_MAX,
-	};
-	DEBUG("initialized B-tree %p with table %p",
-	      (void *)bt, (void *)bt->table);
-	return 0;
-}
-
-/**
- * Free B-tree resources.
- *
- * @param bt
- *   Pointer to B-tree structure.
- */
-void
-mlx5_mr_btree_free(struct mlx5_mr_btree *bt)
-{
-	if (bt == NULL)
-		return;
-	DEBUG("freeing B-tree %p with table %p",
-	      (void *)bt, (void *)bt->table);
-	rte_free(bt->table);
-	memset(bt, 0, sizeof(*bt));
-}
-
-/**
- * Dump all the entries in a B-tree
- *
- * @param bt
- *   Pointer to B-tree structure.
- */
-void
-mlx5_mr_btree_dump(struct mlx5_mr_btree *bt __rte_unused)
-{
-#ifdef RTE_LIBRTE_MLX5_DEBUG
-	int idx;
-	struct mlx5_mr_cache *lkp_tbl;
-
-	if (bt == NULL)
-		return;
-	lkp_tbl = *bt->table;
-	for (idx = 0; idx < bt->len; ++idx) {
-		struct mlx5_mr_cache *entry = &lkp_tbl[idx];
-
-		DEBUG("B-tree(%p)[%u],"
-		      " [0x%" PRIxPTR ", 0x%" PRIxPTR ") lkey=0x%x",
-		      (void *)bt, idx, entry->start, entry->end, entry->lkey);
-	}
-#endif
-}
-
-/**
- * Find virtually contiguous memory chunk in a given MR.
- *
- * @param dev
- *   Pointer to MR structure.
- * @param[out] entry
- *   Pointer to returning MR cache entry. If not found, this will not be
- *   updated.
- * @param start_idx
- *   Start index of the memseg bitmap.
- *
- * @return
- *   Next index to go on lookup.
- */
-static int
-mr_find_next_chunk(struct mlx5_mr *mr, struct mlx5_mr_cache *entry,
-		   int base_idx)
-{
-	uintptr_t start = 0;
-	uintptr_t end = 0;
-	uint32_t idx = 0;
-
-	/* MR for external memory doesn't have memseg list. */
-	if (mr->msl == NULL) {
-		struct ibv_mr *ibv_mr = mr->ibv_mr;
-
-		MLX5_ASSERT(mr->ms_bmp_n == 1);
-		MLX5_ASSERT(mr->ms_n == 1);
-		MLX5_ASSERT(base_idx == 0);
-		/*
-		 * Can't search it from memseg list but get it directly from
-		 * verbs MR as there's only one chunk.
-		 */
-		entry->start = (uintptr_t)ibv_mr->addr;
-		entry->end = (uintptr_t)ibv_mr->addr + mr->ibv_mr->length;
-		entry->lkey = rte_cpu_to_be_32(mr->ibv_mr->lkey);
-		/* Returning 1 ends iteration. */
-		return 1;
-	}
-	for (idx = base_idx; idx < mr->ms_bmp_n; ++idx) {
-		if (rte_bitmap_get(mr->ms_bmp, idx)) {
-			const struct rte_memseg_list *msl;
-			const struct rte_memseg *ms;
-
-			msl = mr->msl;
-			ms = rte_fbarray_get(&msl->memseg_arr,
-					     mr->ms_base_idx + idx);
-			MLX5_ASSERT(msl->page_sz == ms->hugepage_sz);
-			if (!start)
-				start = ms->addr_64;
-			end = ms->addr_64 + ms->hugepage_sz;
-		} else if (start) {
-			/* Passed the end of a fragment. */
-			break;
-		}
-	}
-	if (start) {
-		/* Found one chunk. */
-		entry->start = start;
-		entry->end = end;
-		entry->lkey = rte_cpu_to_be_32(mr->ibv_mr->lkey);
-	}
-	return idx;
-}
-
-/**
- * Insert a MR to the global B-tree cache. It may fail due to low-on-memory.
- * Then, this entry will have to be searched by mr_lookup_dev_list() in
- * mlx5_mr_create() on miss.
- *
- * @param dev
- *   Pointer to Ethernet device shared context.
- * @param mr
- *   Pointer to MR to insert.
- *
- * @return
- *   0 on success, -1 on failure.
- */
-static int
-mr_insert_dev_cache(struct mlx5_ibv_shared *sh, struct mlx5_mr *mr)
-{
-	unsigned int n;
-
-	DRV_LOG(DEBUG, "device %s inserting MR(%p) to global cache",
-		sh->ibdev_name, (void *)mr);
-	for (n = 0; n < mr->ms_bmp_n; ) {
-		struct mlx5_mr_cache entry;
-
-		memset(&entry, 0, sizeof(entry));
-		/* Find a contiguous chunk and advance the index. */
-		n = mr_find_next_chunk(mr, &entry, n);
-		if (!entry.end)
-			break;
-		if (mr_btree_insert(&sh->mr.cache, &entry) < 0) {
-			/*
-			 * Overflowed, but the global table cannot be expanded
-			 * because of deadlock.
-			 */
-			return -1;
-		}
-	}
-	return 0;
-}
-
-/**
- * Look up address in the original global MR list.
- *
- * @param sh
- *   Pointer to Ethernet device shared context.
- * @param[out] entry
- *   Pointer to returning MR cache entry. If no match, this will not be updated.
- * @param addr
- *   Search key.
- *
- * @return
- *   Found MR on match, NULL otherwise.
- */
-static struct mlx5_mr *
-mr_lookup_dev_list(struct mlx5_ibv_shared *sh, struct mlx5_mr_cache *entry,
-		   uintptr_t addr)
-{
-	struct mlx5_mr *mr;
-
-	/* Iterate all the existing MRs. */
-	LIST_FOREACH(mr, &sh->mr.mr_list, mr) {
-		unsigned int n;
-
-		if (mr->ms_n == 0)
-			continue;
-		for (n = 0; n < mr->ms_bmp_n; ) {
-			struct mlx5_mr_cache ret;
-
-			memset(&ret, 0, sizeof(ret));
-			n = mr_find_next_chunk(mr, &ret, n);
-			if (addr >= ret.start && addr < ret.end) {
-				/* Found. */
-				*entry = ret;
-				return mr;
-			}
-		}
-	}
-	return NULL;
-}
-
-/**
- * Look up address on device.
- *
- * @param dev
- *   Pointer to Ethernet device shared context.
- * @param[out] entry
- *   Pointer to returning MR cache entry. If no match, this will not be updated.
- * @param addr
- *   Search key.
- *
- * @return
- *   Searched LKey on success, UINT32_MAX on failure and rte_errno is set.
- */
-static uint32_t
-mr_lookup_dev(struct mlx5_ibv_shared *sh, struct mlx5_mr_cache *entry,
-	      uintptr_t addr)
-{
-	uint16_t idx;
-	uint32_t lkey = UINT32_MAX;
-	struct mlx5_mr *mr;
-
-	/*
-	 * If the global cache has overflowed since it failed to expand the
-	 * B-tree table, it can't have all the existing MRs. Then, the address
-	 * has to be searched by traversing the original MR list instead, which
-	 * is very slow path. Otherwise, the global cache is all inclusive.
-	 */
-	if (!unlikely(sh->mr.cache.overflow)) {
-		lkey = mr_btree_lookup(&sh->mr.cache, &idx, addr);
-		if (lkey != UINT32_MAX)
-			*entry = (*sh->mr.cache.table)[idx];
-	} else {
-		/* Falling back to the slowest path. */
-		mr = mr_lookup_dev_list(sh, entry, addr);
-		if (mr != NULL)
-			lkey = entry->lkey;
-	}
-	MLX5_ASSERT(lkey == UINT32_MAX || (addr >= entry->start &&
-					   addr < entry->end));
-	return lkey;
-}
-
-/**
- * Free MR resources. MR lock must not be held to avoid a deadlock. rte_free()
- * can raise memory free event and the callback function will spin on the lock.
- *
- * @param mr
- *   Pointer to MR to free.
- */
-static void
-mr_free(struct mlx5_mr *mr)
-{
-	if (mr == NULL)
-		return;
-	DRV_LOG(DEBUG, "freeing MR(%p):", (void *)mr);
-	if (mr->ibv_mr != NULL)
-		claim_zero(mlx5_glue->dereg_mr(mr->ibv_mr));
-	if (mr->ms_bmp != NULL)
-		rte_bitmap_free(mr->ms_bmp);
-	rte_free(mr);
-}
-
-/**
- * Release resources of detached MR having no online entry.
- *
- * @param sh
- *   Pointer to Ethernet device shared context.
- */
-static void
-mlx5_mr_garbage_collect(struct mlx5_ibv_shared *sh)
-{
-	struct mlx5_mr *mr_next;
-	struct mlx5_mr_list free_list = LIST_HEAD_INITIALIZER(free_list);
-
-	/* Must be called from the primary process. */
-	MLX5_ASSERT(rte_eal_process_type() == RTE_PROC_PRIMARY);
-	/*
-	 * MR can't be freed with holding the lock because rte_free() could call
-	 * memory free callback function. This will be a deadlock situation.
-	 */
-	rte_rwlock_write_lock(&sh->mr.rwlock);
-	/* Detach the whole free list and release it after unlocking. */
-	free_list = sh->mr.mr_free_list;
-	LIST_INIT(&sh->mr.mr_free_list);
-	rte_rwlock_write_unlock(&sh->mr.rwlock);
-	/* Release resources. */
-	mr_next = LIST_FIRST(&free_list);
-	while (mr_next != NULL) {
-		struct mlx5_mr *mr = mr_next;
-
-		mr_next = LIST_NEXT(mr, mr);
-		mr_free(mr);
-	}
-}
-
-/* Called during rte_memseg_contig_walk() by mlx5_mr_create(). */
-static int
-mr_find_contig_memsegs_cb(const struct rte_memseg_list *msl,
-			  const struct rte_memseg *ms, size_t len, void *arg)
-{
-	struct mr_find_contig_memsegs_data *data = arg;
-
-	if (data->addr < ms->addr_64 || data->addr >= ms->addr_64 + len)
-		return 0;
-	/* Found, save it and stop walking. */
-	data->start = ms->addr_64;
-	data->end = ms->addr_64 + len;
-	data->msl = msl;
-	return 1;
-}
-
-/**
- * Create a new global Memory Region (MR) for a missing virtual address.
- * This API should be called on a secondary process, then a request is sent to
- * the primary process in order to create a MR for the address. As the global MR
- * list is on the shared memory, following LKey lookup should succeed unless the
- * request fails.
- *
- * @param dev
- *   Pointer to Ethernet device.
- * @param[out] entry
- *   Pointer to returning MR cache entry, found in the global cache or newly
- *   created. If failed to create one, this will not be updated.
- * @param addr
- *   Target virtual address to register.
- *
- * @return
- *   Searched LKey on success, UINT32_MAX on failure and rte_errno is set.
- */
-static uint32_t
-mlx5_mr_create_secondary(struct rte_eth_dev *dev, struct mlx5_mr_cache *entry,
-			 uintptr_t addr)
-{
-	struct mlx5_priv *priv = dev->data->dev_private;
-	int ret;
-
-	DEBUG("port %u requesting MR creation for address (%p)",
-	      dev->data->port_id, (void *)addr);
-	ret = mlx5_mp_req_mr_create(&priv->mp_id, addr);
-	if (ret) {
-		DEBUG("port %u fail to request MR creation for address (%p)",
-		      dev->data->port_id, (void *)addr);
-		return UINT32_MAX;
-	}
-	rte_rwlock_read_lock(&priv->sh->mr.rwlock);
-	/* Fill in output data. */
-	mr_lookup_dev(priv->sh, entry, addr);
-	/* Lookup can't fail. */
-	MLX5_ASSERT(entry->lkey != UINT32_MAX);
-	rte_rwlock_read_unlock(&priv->sh->mr.rwlock);
-	DEBUG("port %u MR CREATED by primary process for %p:\n"
-	      "  [0x%" PRIxPTR ", 0x%" PRIxPTR "), lkey=0x%x",
-	      dev->data->port_id, (void *)addr,
-	      entry->start, entry->end, entry->lkey);
-	return entry->lkey;
-}
-
-/**
- * Create a new global Memory Region (MR) for a missing virtual address.
- * Register entire virtually contiguous memory chunk around the address.
- * This must be called from the primary process.
- *
- * @param dev
- *   Pointer to Ethernet device.
- * @param[out] entry
- *   Pointer to returning MR cache entry, found in the global cache or newly
- *   created. If failed to create one, this will not be updated.
- * @param addr
- *   Target virtual address to register.
- *
- * @return
- *   Searched LKey on success, UINT32_MAX on failure and rte_errno is set.
- */
-uint32_t
-mlx5_mr_create_primary(struct rte_eth_dev *dev, struct mlx5_mr_cache *entry,
-		       uintptr_t addr)
-{
-	struct mlx5_priv *priv = dev->data->dev_private;
-	struct mlx5_ibv_shared *sh = priv->sh;
-	struct mlx5_dev_config *config = &priv->config;
-	const struct rte_memseg_list *msl;
-	const struct rte_memseg *ms;
-	struct mlx5_mr *mr = NULL;
-	size_t len;
-	uint32_t ms_n;
-	uint32_t bmp_size;
-	void *bmp_mem;
-	int ms_idx_shift = -1;
-	unsigned int n;
-	struct mr_find_contig_memsegs_data data = {
-		.addr = addr,
-	};
-	struct mr_find_contig_memsegs_data data_re;
-
-	DRV_LOG(DEBUG, "port %u creating a MR using address (%p)",
-		dev->data->port_id, (void *)addr);
-	/*
-	 * Release detached MRs if any. This can't be called with holding either
-	 * memory_hotplug_lock or sh->mr.rwlock. MRs on the free list have
-	 * been detached by the memory free event but it couldn't be released
-	 * inside the callback due to deadlock. As a result, releasing resources
-	 * is quite opportunistic.
-	 */
-	mlx5_mr_garbage_collect(sh);
-	/*
-	 * If enabled, find out a contiguous virtual address chunk in use, to
-	 * which the given address belongs, in order to register maximum range.
-	 * In the best case where mempools are not dynamically recreated and
-	 * '--socket-mem' is specified as an EAL option, it is very likely to
-	 * have only one MR(LKey) per a socket and per a hugepage-size even
-	 * though the system memory is highly fragmented. As the whole memory
-	 * chunk will be pinned by kernel, it can't be reused unless entire
-	 * chunk is freed from EAL.
-	 *
-	 * If disabled, just register one memseg (page). Then, memory
-	 * consumption will be minimized but it may drop performance if there
-	 * are many MRs to lookup on the datapath.
-	 */
-	if (!config->mr_ext_memseg_en) {
-		data.msl = rte_mem_virt2memseg_list((void *)addr);
-		data.start = RTE_ALIGN_FLOOR(addr, data.msl->page_sz);
-		data.end = data.start + data.msl->page_sz;
-	} else if (!rte_memseg_contig_walk(mr_find_contig_memsegs_cb, &data)) {
-		DRV_LOG(WARNING,
-			"port %u unable to find virtually contiguous"
-			" chunk for address (%p)."
-			" rte_memseg_contig_walk() failed.",
-			dev->data->port_id, (void *)addr);
-		rte_errno = ENXIO;
-		goto err_nolock;
-	}
-alloc_resources:
-	/* Addresses must be page-aligned. */
-	MLX5_ASSERT(rte_is_aligned((void *)data.start, data.msl->page_sz));
-	MLX5_ASSERT(rte_is_aligned((void *)data.end, data.msl->page_sz));
-	msl = data.msl;
-	ms = rte_mem_virt2memseg((void *)data.start, msl);
-	len = data.end - data.start;
-	MLX5_ASSERT(msl->page_sz == ms->hugepage_sz);
-	/* Number of memsegs in the range. */
-	ms_n = len / msl->page_sz;
-	DEBUG("port %u extending %p to [0x%" PRIxPTR ", 0x%" PRIxPTR "),"
-	      " page_sz=0x%" PRIx64 ", ms_n=%u",
-	      dev->data->port_id, (void *)addr,
-	      data.start, data.end, msl->page_sz, ms_n);
-	/* Size of memory for bitmap. */
-	bmp_size = rte_bitmap_get_memory_footprint(ms_n);
-	mr = rte_zmalloc_socket(NULL,
-				RTE_ALIGN_CEIL(sizeof(*mr),
-					       RTE_CACHE_LINE_SIZE) +
-				bmp_size,
-				RTE_CACHE_LINE_SIZE, msl->socket_id);
-	if (mr == NULL) {
-		DEBUG("port %u unable to allocate memory for a new MR of"
-		      " address (%p).",
-		      dev->data->port_id, (void *)addr);
-		rte_errno = ENOMEM;
-		goto err_nolock;
-	}
-	mr->msl = msl;
-	/*
-	 * Save the index of the first memseg and initialize memseg bitmap. To
-	 * see if a memseg of ms_idx in the memseg-list is still valid, check:
-	 *	rte_bitmap_get(mr->bmp, ms_idx - mr->ms_base_idx)
-	 */
-	mr->ms_base_idx = rte_fbarray_find_idx(&msl->memseg_arr, ms);
-	bmp_mem = RTE_PTR_ALIGN_CEIL(mr + 1, RTE_CACHE_LINE_SIZE);
-	mr->ms_bmp = rte_bitmap_init(ms_n, bmp_mem, bmp_size);
-	if (mr->ms_bmp == NULL) {
-		DEBUG("port %u unable to initialize bitmap for a new MR of"
-		      " address (%p).",
-		      dev->data->port_id, (void *)addr);
-		rte_errno = EINVAL;
-		goto err_nolock;
-	}
-	/*
-	 * Should recheck whether the extended contiguous chunk is still valid.
-	 * Because memory_hotplug_lock can't be held if there's any memory
-	 * related calls in a critical path, resource allocation above can't be
-	 * locked. If the memory has been changed at this point, try again with
-	 * just single page. If not, go on with the big chunk atomically from
-	 * here.
-	 */
-	rte_mcfg_mem_read_lock();
-	data_re = data;
-	if (len > msl->page_sz &&
-	    !rte_memseg_contig_walk(mr_find_contig_memsegs_cb, &data_re)) {
-		DEBUG("port %u unable to find virtually contiguous"
-		      " chunk for address (%p)."
-		      " rte_memseg_contig_walk() failed.",
-		      dev->data->port_id, (void *)addr);
-		rte_errno = ENXIO;
-		goto err_memlock;
-	}
-	if (data.start != data_re.start || data.end != data_re.end) {
-		/*
-		 * The extended contiguous chunk has been changed. Try again
-		 * with single memseg instead.
-		 */
-		data.start = RTE_ALIGN_FLOOR(addr, msl->page_sz);
-		data.end = data.start + msl->page_sz;
-		rte_mcfg_mem_read_unlock();
-		mr_free(mr);
-		goto alloc_resources;
-	}
-	MLX5_ASSERT(data.msl == data_re.msl);
-	rte_rwlock_write_lock(&sh->mr.rwlock);
-	/*
-	 * Check the address is really missing. If other thread already created
-	 * one or it is not found due to overflow, abort and return.
-	 */
-	if (mr_lookup_dev(sh, entry, addr) != UINT32_MAX) {
-		/*
-		 * Insert to the global cache table. It may fail due to
-		 * low-on-memory. Then, this entry will have to be searched
-		 * here again.
-		 */
-		mr_btree_insert(&sh->mr.cache, entry);
-		DEBUG("port %u found MR for %p on final lookup, abort",
-		      dev->data->port_id, (void *)addr);
-		rte_rwlock_write_unlock(&sh->mr.rwlock);
-		rte_mcfg_mem_read_unlock();
-		/*
-		 * Must be unlocked before calling rte_free() because
-		 * mlx5_mr_mem_event_free_cb() can be called inside.
-		 */
-		mr_free(mr);
-		return entry->lkey;
-	}
-	/*
-	 * Trim start and end addresses for verbs MR. Set bits for registering
-	 * memsegs but exclude already registered ones. Bitmap can be
-	 * fragmented.
-	 */
-	for (n = 0; n < ms_n; ++n) {
-		uintptr_t start;
-		struct mlx5_mr_cache ret;
-
-		memset(&ret, 0, sizeof(ret));
-		start = data_re.start + n * msl->page_sz;
-		/* Exclude memsegs already registered by other MRs. */
-		if (mr_lookup_dev(sh, &ret, start) == UINT32_MAX) {
-			/*
-			 * Start from the first unregistered memseg in the
-			 * extended range.
-			 */
-			if (ms_idx_shift == -1) {
-				mr->ms_base_idx += n;
-				data.start = start;
-				ms_idx_shift = n;
-			}
-			data.end = start + msl->page_sz;
-			rte_bitmap_set(mr->ms_bmp, n - ms_idx_shift);
-			++mr->ms_n;
-		}
-	}
-	len = data.end - data.start;
-	mr->ms_bmp_n = len / msl->page_sz;
-	MLX5_ASSERT(ms_idx_shift + mr->ms_bmp_n <= ms_n);
-	/*
-	 * Finally create a verbs MR for the memory chunk. ibv_reg_mr() can be
-	 * called with holding the memory lock because it doesn't use
-	 * mlx5_alloc_buf_extern() which eventually calls rte_malloc_socket()
-	 * through mlx5_alloc_verbs_buf().
-	 */
-	mr->ibv_mr = mlx5_glue->reg_mr(sh->pd, (void *)data.start, len,
-				       IBV_ACCESS_LOCAL_WRITE |
-					   IBV_ACCESS_RELAXED_ORDERING);
-	if (mr->ibv_mr == NULL) {
-		DEBUG("port %u fail to create a verbs MR for address (%p)",
-		      dev->data->port_id, (void *)addr);
-		rte_errno = EINVAL;
-		goto err_mrlock;
-	}
-	MLX5_ASSERT((uintptr_t)mr->ibv_mr->addr == data.start);
-	MLX5_ASSERT(mr->ibv_mr->length == len);
-	LIST_INSERT_HEAD(&sh->mr.mr_list, mr, mr);
-	DEBUG("port %u MR CREATED (%p) for %p:\n"
-	      "  [0x%" PRIxPTR ", 0x%" PRIxPTR "),"
-	      " lkey=0x%x base_idx=%u ms_n=%u, ms_bmp_n=%u",
-	      dev->data->port_id, (void *)mr, (void *)addr,
-	      data.start, data.end, rte_cpu_to_be_32(mr->ibv_mr->lkey),
-	      mr->ms_base_idx, mr->ms_n, mr->ms_bmp_n);
-	/* Insert to the global cache table. */
-	mr_insert_dev_cache(sh, mr);
-	/* Fill in output data. */
-	mr_lookup_dev(sh, entry, addr);
-	/* Lookup can't fail. */
-	MLX5_ASSERT(entry->lkey != UINT32_MAX);
-	rte_rwlock_write_unlock(&sh->mr.rwlock);
-	rte_mcfg_mem_read_unlock();
-	return entry->lkey;
-err_mrlock:
-	rte_rwlock_write_unlock(&sh->mr.rwlock);
-err_memlock:
-	rte_mcfg_mem_read_unlock();
-err_nolock:
-	/*
-	 * In case of error, as this can be called in a datapath, a warning
-	 * message per an error is preferable instead. Must be unlocked before
-	 * calling rte_free() because mlx5_mr_mem_event_free_cb() can be called
-	 * inside.
-	 */
-	mr_free(mr);
-	return UINT32_MAX;
-}
-
-/**
- * Create a new global Memory Region (MR) for a missing virtual address.
- * This can be called from primary and secondary process.
- *
- * @param dev
- *   Pointer to Ethernet device.
- * @param[out] entry
- *   Pointer to returning MR cache entry, found in the global cache or newly
- *   created. If failed to create one, this will not be updated.
- * @param addr
- *   Target virtual address to register.
- *
- * @return
- *   Searched LKey on success, UINT32_MAX on failure and rte_errno is set.
- */
-static uint32_t
-mlx5_mr_create(struct rte_eth_dev *dev, struct mlx5_mr_cache *entry,
-	       uintptr_t addr)
-{
-	uint32_t ret = 0;
-
-	switch (rte_eal_process_type()) {
-	case RTE_PROC_PRIMARY:
-		ret = mlx5_mr_create_primary(dev, entry, addr);
-		break;
-	case RTE_PROC_SECONDARY:
-		ret = mlx5_mr_create_secondary(dev, entry, addr);
-		break;
-	default:
-		break;
-	}
-	return ret;
-}
-
-/**
- * Rebuild the global B-tree cache of device from the original MR list.
- *
- * @param sh
- *   Pointer to Ethernet device shared context.
- */
-static void
-mr_rebuild_dev_cache(struct mlx5_ibv_shared *sh)
-{
-	struct mlx5_mr *mr;
-
-	DRV_LOG(DEBUG, "device %s rebuild dev cache[]", sh->ibdev_name);
-	/* Flush cache to rebuild. */
-	sh->mr.cache.len = 1;
-	sh->mr.cache.overflow = 0;
-	/* Iterate all the existing MRs. */
-	LIST_FOREACH(mr, &sh->mr.mr_list, mr)
-		if (mr_insert_dev_cache(sh, mr) < 0)
-			return;
-}
-
 /**
  * Callback for memory free event. Iterate freed memsegs and check whether it
  * belongs to an existing MR. If found, clear the bit from bitmap of MR. As a
@@ -900,18 +74,18 @@ mlx5_mr_mem_event_free_cb(struct mlx5_ibv_shared *sh,
 		    RTE_ALIGN((uintptr_t)addr, msl->page_sz));
 	MLX5_ASSERT(len == RTE_ALIGN(len, msl->page_sz));
 	ms_n = len / msl->page_sz;
-	rte_rwlock_write_lock(&sh->mr.rwlock);
+	rte_rwlock_write_lock(&sh->share_cache.rwlock);
 	/* Clear bits of freed memsegs from MR. */
 	for (i = 0; i < ms_n; ++i) {
 		const struct rte_memseg *ms;
-		struct mlx5_mr_cache entry;
+		struct mr_cache_entry entry;
 		uintptr_t start;
 		int ms_idx;
 		uint32_t pos;
 
 		/* Find MR having this memseg. */
 		start = (uintptr_t)addr + i * msl->page_sz;
-		mr = mr_lookup_dev_list(sh, &entry, start);
+		mr = mlx5_mr_lookup_list(&sh->share_cache, &entry, start);
 		if (mr == NULL)
 			continue;
 		MLX5_ASSERT(mr->msl); /* Can't be external memory. */
@@ -927,7 +101,7 @@ mlx5_mr_mem_event_free_cb(struct mlx5_ibv_shared *sh,
 		rte_bitmap_clear(mr->ms_bmp, pos);
 		if (--mr->ms_n == 0) {
 			LIST_REMOVE(mr, mr);
-			LIST_INSERT_HEAD(&sh->mr.mr_free_list, mr, mr);
+			LIST_INSERT_HEAD(&sh->share_cache.mr_free_list, mr, mr);
 			DEBUG("device %s remove MR(%p) from list",
 			      sh->ibdev_name, (void *)mr);
 		}
@@ -938,7 +112,7 @@ mlx5_mr_mem_event_free_cb(struct mlx5_ibv_shared *sh,
 		rebuild = 1;
 	}
 	if (rebuild) {
-		mr_rebuild_dev_cache(sh);
+		mlx5_mr_rebuild_cache(&sh->share_cache);
 		/*
 		 * Flush local caches by propagating invalidation across cores.
 		 * rte_smp_wmb() is enough to synchronize this event. If one of
@@ -948,12 +122,12 @@ mlx5_mr_mem_event_free_cb(struct mlx5_ibv_shared *sh,
 		 * generation below) will be guaranteed to be seen by other core
 		 * before the core sees the newly allocated memory.
 		 */
-		++sh->mr.dev_gen;
+		++sh->share_cache.dev_gen;
 		DEBUG("broadcasting local cache flush, gen=%d",
-		      sh->mr.dev_gen);
+		      sh->share_cache.dev_gen);
 		rte_smp_wmb();
 	}
-	rte_rwlock_write_unlock(&sh->mr.rwlock);
+	rte_rwlock_write_unlock(&sh->share_cache.rwlock);
 }
 
 /**
@@ -990,111 +164,6 @@ mlx5_mr_mem_event_cb(enum rte_mem_event event_type, const void *addr,
 	}
 }
 
-/**
- * Look up address in the global MR cache table. If not found, create a new MR.
- * Insert the found/created entry to local bottom-half cache table.
- *
- * @param dev
- *   Pointer to Ethernet device.
- * @param mr_ctrl
- *   Pointer to per-queue MR control structure.
- * @param[out] entry
- *   Pointer to returning MR cache entry, found in the global cache or newly
- *   created. If failed to create one, this is not written.
- * @param addr
- *   Search key.
- *
- * @return
- *   Searched LKey on success, UINT32_MAX on no match.
- */
-static uint32_t
-mlx5_mr_lookup_dev(struct rte_eth_dev *dev, struct mlx5_mr_ctrl *mr_ctrl,
-		   struct mlx5_mr_cache *entry, uintptr_t addr)
-{
-	struct mlx5_priv *priv = dev->data->dev_private;
-	struct mlx5_ibv_shared *sh = priv->sh;
-	struct mlx5_mr_btree *bt = &mr_ctrl->cache_bh;
-	uint16_t idx;
-	uint32_t lkey;
-
-	/* If local cache table is full, try to double it. */
-	if (unlikely(bt->len == bt->size))
-		mr_btree_expand(bt, bt->size << 1);
-	/* Look up in the global cache. */
-	rte_rwlock_read_lock(&sh->mr.rwlock);
-	lkey = mr_btree_lookup(&sh->mr.cache, &idx, addr);
-	if (lkey != UINT32_MAX) {
-		/* Found. */
-		*entry = (*sh->mr.cache.table)[idx];
-		rte_rwlock_read_unlock(&sh->mr.rwlock);
-		/*
-		 * Update local cache. Even if it fails, return the found entry
-		 * to update top-half cache. Next time, this entry will be found
-		 * in the global cache.
-		 */
-		mr_btree_insert(bt, entry);
-		return lkey;
-	}
-	rte_rwlock_read_unlock(&sh->mr.rwlock);
-	/* First time to see the address? Create a new MR. */
-	lkey = mlx5_mr_create(dev, entry, addr);
-	/*
-	 * Update the local cache if successfully created a new global MR. Even
-	 * if failed to create one, there's no action to take in this datapath
-	 * code. As returning LKey is invalid, this will eventually make HW
-	 * fail.
-	 */
-	if (lkey != UINT32_MAX)
-		mr_btree_insert(bt, entry);
-	return lkey;
-}
-
-/**
- * Bottom-half of LKey search on datapath. Firstly search in cache_bh[] and if
- * misses, search in the global MR cache table and update the new entry to
- * per-queue local caches.
- *
- * @param dev
- *   Pointer to Ethernet device.
- * @param mr_ctrl
- *   Pointer to per-queue MR control structure.
- * @param addr
- *   Search key.
- *
- * @return
- *   Searched LKey on success, UINT32_MAX on no match.
- */
-static uint32_t
-mlx5_mr_addr2mr_bh(struct rte_eth_dev *dev, struct mlx5_mr_ctrl *mr_ctrl,
-		   uintptr_t addr)
-{
-	uint32_t lkey;
-	uint16_t bh_idx = 0;
-	/* Victim in top-half cache to replace with new entry. */
-	struct mlx5_mr_cache *repl = &mr_ctrl->cache[mr_ctrl->head];
-
-	/* Binary-search MR translation table. */
-	lkey = mr_btree_lookup(&mr_ctrl->cache_bh, &bh_idx, addr);
-	/* Update top-half cache. */
-	if (likely(lkey != UINT32_MAX)) {
-		*repl = (*mr_ctrl->cache_bh.table)[bh_idx];
-	} else {
-		/*
-		 * If missed in local lookup table, search in the global cache
-		 * and local cache_bh[] will be updated inside if possible.
-		 * Top-half cache entry will also be updated.
-		 */
-		lkey = mlx5_mr_lookup_dev(dev, mr_ctrl, repl, addr);
-		if (unlikely(lkey == UINT32_MAX))
-			return UINT32_MAX;
-	}
-	/* Update the most recently used entry. */
-	mr_ctrl->mru = mr_ctrl->head;
-	/* Point to the next victim, the oldest. */
-	mr_ctrl->head = (mr_ctrl->head + 1) % MLX5_MR_CACHE_N;
-	return lkey;
-}
-
 /**
  * Bottom-half of LKey search on Rx.
  *
@@ -1114,7 +183,9 @@ mlx5_rx_addr2mr_bh(struct mlx5_rxq_data *rxq, uintptr_t addr)
 	struct mlx5_mr_ctrl *mr_ctrl = &rxq->mr_ctrl;
 	struct mlx5_priv *priv = rxq_ctrl->priv;
 
-	return mlx5_mr_addr2mr_bh(ETH_DEV(priv), mr_ctrl, addr);
+	return mlx5_mr_addr2mr_bh(priv->sh->pd, &priv->mp_id,
+				  &priv->sh->share_cache, mr_ctrl, addr,
+				  priv->config.mr_ext_memseg_en);
 }
 
 /**
@@ -1136,7 +207,9 @@ mlx5_tx_addr2mr_bh(struct mlx5_txq_data *txq, uintptr_t addr)
 	struct mlx5_mr_ctrl *mr_ctrl = &txq->mr_ctrl;
 	struct mlx5_priv *priv = txq_ctrl->priv;
 
-	return mlx5_mr_addr2mr_bh(ETH_DEV(priv), mr_ctrl, addr);
+	return mlx5_mr_addr2mr_bh(priv->sh->pd, &priv->mp_id,
+				  &priv->sh->share_cache, mr_ctrl, addr,
+				  priv->config.mr_ext_memseg_en);
 }
 
 /**
@@ -1165,82 +238,6 @@ mlx5_tx_mb2mr_bh(struct mlx5_txq_data *txq, struct rte_mbuf *mb)
 	return lkey;
 }
 
-/**
- * Flush all of the local cache entries.
- *
- * @param mr_ctrl
- *   Pointer to per-queue MR control structure.
- */
-void
-mlx5_mr_flush_local_cache(struct mlx5_mr_ctrl *mr_ctrl)
-{
-	/* Reset the most-recently-used index. */
-	mr_ctrl->mru = 0;
-	/* Reset the linear search array. */
-	mr_ctrl->head = 0;
-	memset(mr_ctrl->cache, 0, sizeof(mr_ctrl->cache));
-	/* Reset the B-tree table. */
-	mr_ctrl->cache_bh.len = 1;
-	mr_ctrl->cache_bh.overflow = 0;
-	/* Update the generation number. */
-	mr_ctrl->cur_gen = *mr_ctrl->dev_gen_ptr;
-	DRV_LOG(DEBUG, "mr_ctrl(%p): flushed, cur_gen=%d",
-		(void *)mr_ctrl, mr_ctrl->cur_gen);
-}
-
-/**
- * Creates a memory region for external memory, that is memory which is not
- * part of the DPDK memory segments.
- *
- * @param dev
- *   Pointer to the ethernet device.
- * @param addr
- *   Starting virtual address of memory.
- * @param len
- *   Length of memory segment being mapped.
- * @param socked_id
- *   Socket to allocate heap memory for the control structures.
- *
- * @return
- *   Pointer to MR structure on success, NULL otherwise.
- */
-static struct mlx5_mr *
-mlx5_create_mr_ext(struct rte_eth_dev *dev, uintptr_t addr, size_t len,
-		   int socket_id)
-{
-	struct mlx5_priv *priv = dev->data->dev_private;
-	struct mlx5_mr *mr = NULL;
-
-	mr = rte_zmalloc_socket(NULL,
-				RTE_ALIGN_CEIL(sizeof(*mr),
-					       RTE_CACHE_LINE_SIZE),
-				RTE_CACHE_LINE_SIZE, socket_id);
-	if (mr == NULL)
-		return NULL;
-	mr->ibv_mr = mlx5_glue->reg_mr(priv->sh->pd, (void *)addr, len,
-				       IBV_ACCESS_LOCAL_WRITE |
-					   IBV_ACCESS_RELAXED_ORDERING);
-	if (mr->ibv_mr == NULL) {
-		DRV_LOG(WARNING,
-			"port %u fail to create a verbs MR for address (%p)",
-			dev->data->port_id, (void *)addr);
-		rte_free(mr);
-		return NULL;
-	}
-	mr->msl = NULL; /* Mark it is external memory. */
-	mr->ms_bmp = NULL;
-	mr->ms_n = 1;
-	mr->ms_bmp_n = 1;
-	DRV_LOG(DEBUG,
-		"port %u MR CREATED (%p) for external memory %p:\n"
-		"  [0x%" PRIxPTR ", 0x%" PRIxPTR "),"
-		" lkey=0x%x base_idx=%u ms_n=%u, ms_bmp_n=%u",
-		dev->data->port_id, (void *)mr, (void *)addr,
-		addr, addr + len, rte_cpu_to_be_32(mr->ibv_mr->lkey),
-		mr->ms_base_idx, mr->ms_n, mr->ms_bmp_n);
-	return mr;
-}
-
 /**
  * Called during rte_mempool_mem_iter() by mlx5_mr_update_ext_mp().
  *
@@ -1267,19 +264,19 @@ mlx5_mr_update_ext_mp_cb(struct rte_mempool *mp, void *opaque,
 	struct mlx5_mr *mr = NULL;
 	uintptr_t addr = (uintptr_t)memhdr->addr;
 	size_t len = memhdr->len;
-	struct mlx5_mr_cache entry;
+	struct mr_cache_entry entry;
 	uint32_t lkey;
 
 	MLX5_ASSERT(rte_eal_process_type() == RTE_PROC_PRIMARY);
 	/* If already registered, it should return. */
-	rte_rwlock_read_lock(&sh->mr.rwlock);
-	lkey = mr_lookup_dev(sh, &entry, addr);
-	rte_rwlock_read_unlock(&sh->mr.rwlock);
+	rte_rwlock_read_lock(&sh->share_cache.rwlock);
+	lkey = mlx5_mr_lookup_cache(&sh->share_cache, &entry, addr);
+	rte_rwlock_read_unlock(&sh->share_cache.rwlock);
 	if (lkey != UINT32_MAX)
 		return;
 	DRV_LOG(DEBUG, "port %u register MR for chunk #%d of mempool (%s)",
 		dev->data->port_id, mem_idx, mp->name);
-	mr = mlx5_create_mr_ext(dev, addr, len, mp->socket_id);
+	mr = mlx5_create_mr_ext(sh->pd, addr, len, mp->socket_id);
 	if (!mr) {
 		DRV_LOG(WARNING,
 			"port %u unable to allocate a new MR of"
@@ -1288,13 +285,14 @@ mlx5_mr_update_ext_mp_cb(struct rte_mempool *mp, void *opaque,
 		data->ret = -1;
 		return;
 	}
-	rte_rwlock_write_lock(&sh->mr.rwlock);
-	LIST_INSERT_HEAD(&sh->mr.mr_list, mr, mr);
+	rte_rwlock_write_lock(&sh->share_cache.rwlock);
+	LIST_INSERT_HEAD(&sh->share_cache.mr_list, mr, mr);
 	/* Insert to the global cache table. */
-	mr_insert_dev_cache(sh, mr);
-	rte_rwlock_write_unlock(&sh->mr.rwlock);
+	mlx5_mr_insert_cache(&sh->share_cache, mr);
+	rte_rwlock_write_unlock(&sh->share_cache.rwlock);
 	/* Insert to the local cache table */
-	mlx5_mr_addr2mr_bh(dev, mr_ctrl, addr);
+	mlx5_mr_addr2mr_bh(sh->pd, &priv->mp_id, &sh->share_cache,
+			   mr_ctrl, addr, priv->config.mr_ext_memseg_en);
 }
 
 /**
@@ -1351,19 +349,19 @@ mlx5_dma_map(struct rte_pci_device *pdev, void *addr,
 		return -1;
 	}
 	priv = dev->data->dev_private;
-	mr = mlx5_create_mr_ext(dev, (uintptr_t)addr, len, SOCKET_ID_ANY);
+	sh = priv->sh;
+	mr = mlx5_create_mr_ext(sh->pd, (uintptr_t)addr, len, SOCKET_ID_ANY);
 	if (!mr) {
 		DRV_LOG(WARNING,
 			"port %u unable to dma map", dev->data->port_id);
 		rte_errno = EINVAL;
 		return -1;
 	}
-	sh = priv->sh;
-	rte_rwlock_write_lock(&sh->mr.rwlock);
-	LIST_INSERT_HEAD(&sh->mr.mr_list, mr, mr);
+	rte_rwlock_write_lock(&sh->share_cache.rwlock);
+	LIST_INSERT_HEAD(&sh->share_cache.mr_list, mr, mr);
 	/* Insert to the global cache table. */
-	mr_insert_dev_cache(sh, mr);
-	rte_rwlock_write_unlock(&sh->mr.rwlock);
+	mlx5_mr_insert_cache(&sh->share_cache, mr);
+	rte_rwlock_write_unlock(&sh->share_cache.rwlock);
 	return 0;
 }
 
@@ -1390,7 +388,7 @@ mlx5_dma_unmap(struct rte_pci_device *pdev, void *addr,
 	struct mlx5_priv *priv;
 	struct mlx5_ibv_shared *sh;
 	struct mlx5_mr *mr;
-	struct mlx5_mr_cache entry;
+	struct mr_cache_entry entry;
 
 	dev = pci_dev_to_eth_dev(pdev);
 	if (!dev) {
@@ -1401,10 +399,10 @@ mlx5_dma_unmap(struct rte_pci_device *pdev, void *addr,
 	}
 	priv = dev->data->dev_private;
 	sh = priv->sh;
-	rte_rwlock_read_lock(&sh->mr.rwlock);
-	mr = mr_lookup_dev_list(sh, &entry, (uintptr_t)addr);
+	rte_rwlock_read_lock(&sh->share_cache.rwlock);
+	mr = mlx5_mr_lookup_list(&sh->share_cache, &entry, (uintptr_t)addr);
 	if (!mr) {
-		rte_rwlock_read_unlock(&sh->mr.rwlock);
+		rte_rwlock_read_unlock(&sh->share_cache.rwlock);
 		DRV_LOG(WARNING, "address 0x%" PRIxPTR " wasn't registered "
 				 "to PCI device %p", (uintptr_t)addr,
 				 (void *)pdev);
@@ -1412,10 +410,10 @@ mlx5_dma_unmap(struct rte_pci_device *pdev, void *addr,
 		return -1;
 	}
 	LIST_REMOVE(mr, mr);
-	LIST_INSERT_HEAD(&sh->mr.mr_free_list, mr, mr);
+	LIST_INSERT_HEAD(&sh->share_cache.mr_free_list, mr, mr);
 	DEBUG("port %u remove MR(%p) from list", dev->data->port_id,
 	      (void *)mr);
-	mr_rebuild_dev_cache(sh);
+	mlx5_mr_rebuild_cache(&sh->share_cache);
 	/*
 	 * Flush local caches by propagating invalidation across cores.
 	 * rte_smp_wmb() is enough to synchronize this event. If one of
@@ -1425,10 +423,11 @@ mlx5_dma_unmap(struct rte_pci_device *pdev, void *addr,
 	 * generation below) will be guaranteed to be seen by other core
 	 * before the core sees the newly allocated memory.
 	 */
-	++sh->mr.dev_gen;
-	DEBUG("broadcasting local cache flush, gen=%d",	sh->mr.dev_gen);
+	++sh->share_cache.dev_gen;
+	DEBUG("broadcasting local cache flush, gen=%d",
+	      sh->share_cache.dev_gen);
 	rte_smp_wmb();
-	rte_rwlock_read_unlock(&sh->mr.rwlock);
+	rte_rwlock_read_unlock(&sh->share_cache.rwlock);
 	return 0;
 }
 
@@ -1503,14 +502,19 @@ mlx5_mr_update_mp_cb(struct rte_mempool *mp __rte_unused, void *opaque,
 		     unsigned mem_idx __rte_unused)
 {
 	struct mr_update_mp_data *data = opaque;
+	struct rte_eth_dev *dev = data->dev;
+	struct mlx5_priv *priv = dev->data->dev_private;
+
 	uint32_t lkey;
 
 	/* Stop iteration if failed in the previous walk. */
 	if (data->ret < 0)
 		return;
 	/* Register address of the chunk and update local caches. */
-	lkey = mlx5_mr_addr2mr_bh(data->dev, data->mr_ctrl,
-				  (uintptr_t)memhdr->addr);
+	lkey = mlx5_mr_addr2mr_bh(priv->sh->pd, &priv->mp_id,
+				  &priv->sh->share_cache, data->mr_ctrl,
+				  (uintptr_t)memhdr->addr,
+				  priv->config.mr_ext_memseg_en);
 	if (lkey == UINT32_MAX)
 		data->ret = -1;
 }
@@ -1545,76 +549,3 @@ mlx5_mr_update_mp(struct rte_eth_dev *dev, struct mlx5_mr_ctrl *mr_ctrl,
 	}
 	return data.ret;
 }
-
-/**
- * Dump all the created MRs and the global cache entries.
- *
- * @param sh
- *   Pointer to Ethernet device shared context.
- */
-void
-mlx5_mr_dump_dev(struct mlx5_ibv_shared *sh __rte_unused)
-{
-#ifdef RTE_LIBRTE_MLX5_DEBUG
-	struct mlx5_mr *mr;
-	int mr_n = 0;
-	int chunk_n = 0;
-
-	rte_rwlock_read_lock(&sh->mr.rwlock);
-	/* Iterate all the existing MRs. */
-	LIST_FOREACH(mr, &sh->mr.mr_list, mr) {
-		unsigned int n;
-
-		DEBUG("device %s MR[%u], LKey = 0x%x, ms_n = %u, ms_bmp_n = %u",
-		      sh->ibdev_name, mr_n++,
-		      rte_cpu_to_be_32(mr->ibv_mr->lkey),
-		      mr->ms_n, mr->ms_bmp_n);
-		if (mr->ms_n == 0)
-			continue;
-		for (n = 0; n < mr->ms_bmp_n; ) {
-			struct mlx5_mr_cache ret = { 0, };
-
-			n = mr_find_next_chunk(mr, &ret, n);
-			if (!ret.end)
-				break;
-			DEBUG("  chunk[%u], [0x%" PRIxPTR ", 0x%" PRIxPTR ")",
-			      chunk_n++, ret.start, ret.end);
-		}
-	}
-	DEBUG("device %s dumping global cache", sh->ibdev_name);
-	mlx5_mr_btree_dump(&sh->mr.cache);
-	rte_rwlock_read_unlock(&sh->mr.rwlock);
-#endif
-}
-
-/**
- * Release all the created MRs and resources for shared device context.
- * list.
- *
- * @param sh
- *   Pointer to Ethernet device shared context.
- */
-void
-mlx5_mr_release(struct mlx5_ibv_shared *sh)
-{
-	struct mlx5_mr *mr_next;
-
-	if (rte_log_can_log(mlx5_logtype, RTE_LOG_DEBUG))
-		mlx5_mr_dump_dev(sh);
-	rte_rwlock_write_lock(&sh->mr.rwlock);
-	/* Detach from MR list and move to free list. */
-	mr_next = LIST_FIRST(&sh->mr.mr_list);
-	while (mr_next != NULL) {
-		struct mlx5_mr *mr = mr_next;
-
-		mr_next = LIST_NEXT(mr, mr);
-		LIST_REMOVE(mr, mr);
-		LIST_INSERT_HEAD(&sh->mr.mr_free_list, mr, mr);
-	}
-	LIST_INIT(&sh->mr.mr_list);
-	/* Free global cache. */
-	mlx5_mr_btree_free(&sh->mr.cache);
-	rte_rwlock_write_unlock(&sh->mr.rwlock);
-	/* Free all remaining MRs. */
-	mlx5_mr_garbage_collect(sh);
-}
diff --git a/drivers/net/mlx5/mlx5_mr.h b/drivers/net/mlx5/mlx5_mr.h
index 48264c8294..0c5877b3d6 100644
--- a/drivers/net/mlx5/mlx5_mr.h
+++ b/drivers/net/mlx5/mlx5_mr.h
@@ -24,99 +24,16 @@
 #include <rte_ethdev.h>
 #include <rte_rwlock.h>
 #include <rte_bitmap.h>
+#include <rte_memory.h>
 
-/* Memory Region object. */
-struct mlx5_mr {
-	LIST_ENTRY(mlx5_mr) mr; /**< Pointer to the prev/next entry. */
-	struct ibv_mr *ibv_mr; /* Verbs Memory Region. */
-	const struct rte_memseg_list *msl;
-	int ms_base_idx; /* Start index of msl->memseg_arr[]. */
-	int ms_n; /* Number of memsegs in use. */
-	uint32_t ms_bmp_n; /* Number of bits in memsegs bit-mask. */
-	struct rte_bitmap *ms_bmp; /* Bit-mask of memsegs belonged to MR. */
-};
-
-/* Cache entry for Memory Region. */
-struct mlx5_mr_cache {
-	uintptr_t start; /* Start address of MR. */
-	uintptr_t end; /* End address of MR. */
-	uint32_t lkey; /* rte_cpu_to_be_32(ibv_mr->lkey). */
-} __rte_packed;
-
-/* MR Cache table for Binary search. */
-struct mlx5_mr_btree {
-	uint16_t len; /* Number of entries. */
-	uint16_t size; /* Total number of entries. */
-	int overflow; /* Mark failure of table expansion. */
-	struct mlx5_mr_cache (*table)[];
-} __rte_packed;
-
-/* Per-queue MR control descriptor. */
-struct mlx5_mr_ctrl {
-	uint32_t *dev_gen_ptr; /* Generation number of device to poll. */
-	uint32_t cur_gen; /* Generation number saved to flush caches. */
-	uint16_t mru; /* Index of last hit entry in top-half cache. */
-	uint16_t head; /* Index of the oldest entry in top-half cache. */
-	struct mlx5_mr_cache cache[MLX5_MR_CACHE_N]; /* Cache for top-half. */
-	struct mlx5_mr_btree cache_bh; /* Cache for bottom-half. */
-} __rte_packed;
-
-struct mlx5_ibv_shared;
-extern struct mlx5_dev_list  mlx5_mem_event_cb_list;
-extern rte_rwlock_t mlx5_mem_event_rwlock;
+#include <mlx5_common_mr.h>
 
 /* First entry must be NULL for comparison. */
 #define mlx5_mr_btree_len(bt) ((bt)->len - 1)
 
-int mlx5_mr_btree_init(struct mlx5_mr_btree *bt, int n, int socket);
-void mlx5_mr_btree_free(struct mlx5_mr_btree *bt);
-uint32_t mlx5_mr_create_primary(struct rte_eth_dev *dev,
-				struct mlx5_mr_cache *entry, uintptr_t addr);
 void mlx5_mr_mem_event_cb(enum rte_mem_event event_type, const void *addr,
 			  size_t len, void *arg);
 int mlx5_mr_update_mp(struct rte_eth_dev *dev, struct mlx5_mr_ctrl *mr_ctrl,
 		      struct rte_mempool *mp);
-void mlx5_mr_release(struct mlx5_ibv_shared *sh);
-
-/* Debug purpose functions. */
-void mlx5_mr_btree_dump(struct mlx5_mr_btree *bt);
-void mlx5_mr_dump_dev(struct mlx5_ibv_shared *sh);
-
-/**
- * Look up LKey from given lookup table by linear search. Firstly look up the
- * last-hit entry. If miss, the entire array is searched. If found, update the
- * last-hit index and return LKey.
- *
- * @param lkp_tbl
- *   Pointer to lookup table.
- * @param[in,out] cached_idx
- *   Pointer to last-hit index.
- * @param n
- *   Size of lookup table.
- * @param addr
- *   Search key.
- *
- * @return
- *   Searched LKey on success, UINT32_MAX on no match.
- */
-static __rte_always_inline uint32_t
-mlx5_mr_lookup_cache(struct mlx5_mr_cache *lkp_tbl, uint16_t *cached_idx,
-		     uint16_t n, uintptr_t addr)
-{
-	uint16_t idx;
-
-	if (likely(addr >= lkp_tbl[*cached_idx].start &&
-		   addr < lkp_tbl[*cached_idx].end))
-		return lkp_tbl[*cached_idx].lkey;
-	for (idx = 0; idx < n && lkp_tbl[idx].start != 0; ++idx) {
-		if (addr >= lkp_tbl[idx].start &&
-		    addr < lkp_tbl[idx].end) {
-			/* Found. */
-			*cached_idx = idx;
-			return lkp_tbl[idx].lkey;
-		}
-	}
-	return UINT32_MAX;
-}
 
 #endif /* RTE_PMD_MLX5_MR_H_ */
diff --git a/drivers/net/mlx5/mlx5_rxtx.c b/drivers/net/mlx5/mlx5_rxtx.c
index 42d7da8a4b..3e583d49a6 100644
--- a/drivers/net/mlx5/mlx5_rxtx.c
+++ b/drivers/net/mlx5/mlx5_rxtx.c
@@ -33,6 +33,7 @@
 
 #include "mlx5_defs.h"
 #include "mlx5.h"
+#include "mlx5_mr.h"
 #include "mlx5_utils.h"
 #include "mlx5_rxtx.h"
 #include "mlx5_autoconf.h"
diff --git a/drivers/net/mlx5/mlx5_rxtx.h b/drivers/net/mlx5/mlx5_rxtx.h
index d155c241eb..537d449c88 100644
--- a/drivers/net/mlx5/mlx5_rxtx.h
+++ b/drivers/net/mlx5/mlx5_rxtx.h
@@ -34,11 +34,11 @@
 #include <mlx5_glue.h>
 #include <mlx5_prm.h>
 #include <mlx5_common.h>
+#include <mlx5_common_mr.h>
 
 #include "mlx5_defs.h"
 #include "mlx5_utils.h"
 #include "mlx5.h"
-#include "mlx5_mr.h"
 #include "mlx5_autoconf.h"
 
 /* Support tunnel matching. */
@@ -598,8 +598,8 @@ mlx5_rx_addr2mr(struct mlx5_rxq_data *rxq, uintptr_t addr)
 	uint32_t lkey;
 
 	/* Linear search on MR cache array. */
-	lkey = mlx5_mr_lookup_cache(mr_ctrl->cache, &mr_ctrl->mru,
-				    MLX5_MR_CACHE_N, addr);
+	lkey = mlx5_mr_lookup_lkey(mr_ctrl->cache, &mr_ctrl->mru,
+				   MLX5_MR_CACHE_N, addr);
 	if (likely(lkey != UINT32_MAX))
 		return lkey;
 	/* Take slower bottom-half (Binary Search) on miss. */
@@ -630,8 +630,8 @@ mlx5_tx_mb2mr(struct mlx5_txq_data *txq, struct rte_mbuf *mb)
 	if (unlikely(*mr_ctrl->dev_gen_ptr != mr_ctrl->cur_gen))
 		mlx5_mr_flush_local_cache(mr_ctrl);
 	/* Linear search on MR cache array. */
-	lkey = mlx5_mr_lookup_cache(mr_ctrl->cache, &mr_ctrl->mru,
-				    MLX5_MR_CACHE_N, addr);
+	lkey = mlx5_mr_lookup_lkey(mr_ctrl->cache, &mr_ctrl->mru,
+				   MLX5_MR_CACHE_N, addr);
 	if (likely(lkey != UINT32_MAX))
 		return lkey;
 	/* Take slower bottom-half on miss. */
diff --git a/drivers/net/mlx5/mlx5_rxtx_vec.h b/drivers/net/mlx5/mlx5_rxtx_vec.h
index ea925156f0..6ddcbfb0ad 100644
--- a/drivers/net/mlx5/mlx5_rxtx_vec.h
+++ b/drivers/net/mlx5/mlx5_rxtx_vec.h
@@ -13,6 +13,8 @@
 
 #include "mlx5_autoconf.h"
 
+#include "mlx5_mr.h"
+
 /* HW checksum offload capabilities of vectorized Tx. */
 #define MLX5_VEC_TX_CKSUM_OFFLOAD_CAP \
 	(DEV_TX_OFFLOAD_IPV4_CKSUM | \
diff --git a/drivers/net/mlx5/mlx5_trigger.c b/drivers/net/mlx5/mlx5_trigger.c
index 438b705952..759670408b 100644
--- a/drivers/net/mlx5/mlx5_trigger.c
+++ b/drivers/net/mlx5/mlx5_trigger.c
@@ -11,6 +11,7 @@
 #include <rte_alarm.h>
 
 #include "mlx5.h"
+#include "mlx5_mr.h"
 #include "mlx5_rxtx.h"
 #include "mlx5_utils.h"
 #include "rte_pmd_mlx5.h"
diff --git a/drivers/net/mlx5/mlx5_txq.c b/drivers/net/mlx5/mlx5_txq.c
index 0653f4cf30..29e5cabab6 100644
--- a/drivers/net/mlx5/mlx5_txq.c
+++ b/drivers/net/mlx5/mlx5_txq.c
@@ -30,6 +30,7 @@
 #include <mlx5_glue.h>
 #include <mlx5_devx_cmds.h>
 #include <mlx5_common.h>
+#include <mlx5_common_mr.h>
 
 #include "mlx5_defs.h"
 #include "mlx5_utils.h"
@@ -1289,7 +1290,7 @@ mlx5_txq_new(struct rte_eth_dev *dev, uint16_t idx, uint16_t desc,
 		goto error;
 	}
 	/* Save pointer of global generation number to check memory event. */
-	tmpl->txq.mr_ctrl.dev_gen_ptr = &priv->sh->mr.dev_gen;
+	tmpl->txq.mr_ctrl.dev_gen_ptr = &priv->sh->share_cache.dev_gen;
 	MLX5_ASSERT(desc > MLX5_TX_COMP_THRESH);
 	tmpl->txq.offloads = conf->offloads |
 			     dev->data->dev_conf.txmode.offloads;
-- 
2.16.6


^ permalink raw reply related	[flat|nested] 26+ messages in thread
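
A note on the cache-invalidation scheme visible in the mlx5_dma_unmap()
hunk above: the control path rebuilds the global cache, increments
share_cache.dev_gen and issues rte_smp_wmb(); each datapath queue
compares its saved generation against the shared counter and flushes its
local cache on mismatch. A minimal, self-contained model of the reader
side follows; all names here are illustrative, not the driver's own:

	#include <stdint.h>
	#include <stdio.h>

	/* Hypothetical per-queue cache guarded by a generation counter,
	 * modeled on mr_ctrl->dev_gen_ptr/cur_gen above. */
	struct gen_cache {
		uint32_t *dev_gen_ptr; /* shared; bumped by control path */
		uint32_t cur_gen;      /* snapshot taken at last flush */
		uint32_t cached_lkey;  /* payload guarded by generation */
	};

	static uint32_t
	gen_cache_lookup(struct gen_cache *c)
	{
		/* The control path bumps the shared counter (after a
		 * write barrier) whenever the global table is rebuilt. */
		if (c->cur_gen != *c->dev_gen_ptr) {
			c->cached_lkey = UINT32_MAX; /* drop stale data */
			c->cur_gen = *c->dev_gen_ptr;
		}
		return c->cached_lkey;
	}

	int
	main(void)
	{
		uint32_t dev_gen = 0;
		struct gen_cache c = { &dev_gen, 0, 0x1234 };

		printf("lkey=0x%x\n", gen_cache_lookup(&c)); /* 0x1234 */
		dev_gen++; /* simulate the mlx5_dma_unmap() bump */
		printf("lkey=0x%x\n", gen_cache_lookup(&c)); /* flushed */
		return 0;
	}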

* Re: [dpdk-dev] [PATCH v4 1/2] common/mlx5: refactor multi-process IPC handling codes to common driver
  2020-04-13 21:17   ` [dpdk-dev] [PATCH v4 1/2] common/mlx5: refactor multi-process IPC handling " Vu Pham
@ 2020-04-14  7:26     ` Slava Ovsiienko
  0 siblings, 0 replies; 26+ messages in thread
From: Slava Ovsiienko @ 2020-04-14  7:26 UTC (permalink / raw)
  To: Vu Pham, dev; +Cc: Ori Kam, Matan Azrad, Raslan Darawsheh, Vu Pham

> -----Original Message-----
> From: Vu Pham <vuhuong@mellanox.com>
> Sent: Tuesday, April 14, 2020 0:18
> To: dev@dpdk.org
> Cc: Slava Ovsiienko <viacheslavo@mellanox.com>; Ori Kam
> <orika@mellanox.com>; Matan Azrad <matan@mellanox.com>; Raslan
> Darawsheh <rasland@mellanox.com>; Vu Pham <vuhuong@mellanox.com>
> Subject: [PATCH v4 1/2] common/mlx5: refactor multi-process IPC handling
> codes to common driver
> 
> Refactor the common multi-process handling code from the net PMD into the
> common driver, using the tuple mp_id{name, port_id} as the standard input
> parameter for all multi-process IPC APIs instead of rte_eth_dev.
> 
> Modify the net PMD to use the multi-process APIs from the common driver.
> 
> Signed-off-by: Vu Pham <vuhuong@mellanox.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@mellanox.com>

> ---
>  drivers/common/mlx5/Makefile                    |   3 +-
>  drivers/common/mlx5/meson.build                 |   1 +
>  drivers/common/mlx5/mlx5_common_mp.c            | 188
> +++++++++++++++++++
>  drivers/common/mlx5/mlx5_common_mp.h            |  98 ++++++++++
>  drivers/common/mlx5/rte_common_mlx5_version.map |  13 ++
>  drivers/net/mlx5/mlx5.c                         |  15 +-
>  drivers/net/mlx5/mlx5.h                         |  43 +----
>  drivers/net/mlx5/mlx5_mp.c                      | 234 ++----------------------
>  drivers/net/mlx5/mlx5_mr.c                      |   2 +-
>  drivers/net/mlx5/mlx5_rxtx.c                    |   3 +-
>  10 files changed, 336 insertions(+), 264 deletions(-)  create mode 100644
> drivers/common/mlx5/mlx5_common_mp.c
>  create mode 100644 drivers/common/mlx5/mlx5_common_mp.h
> 
> diff --git a/drivers/common/mlx5/Makefile b/drivers/common/mlx5/Makefile
> index f32933d592..2a88492731 100644
> --- a/drivers/common/mlx5/Makefile
> +++ b/drivers/common/mlx5/Makefile
> @@ -17,6 +17,7 @@ endif
>  SRCS-y += mlx5_devx_cmds.c
>  SRCS-y += mlx5_common.c
>  SRCS-y += mlx5_nl.c
> +SRCS-y += mlx5_common_mp.c
>  ifeq ($(CONFIG_RTE_IBVERBS_LINK_DLOPEN),y)
>  INSTALL-y-lib += $(LIB_GLUE)
>  endif
> @@ -46,7 +47,7 @@ endif
>  LDLIBS += -lrte_eal -lrte_pci -lrte_kvargs -lrte_net
> 
>  # A few warnings cannot be avoided in external headers.
> -CFLAGS += -Wno-error=cast-qual -UPEDANTIC
> +CFLAGS += -Wno-error=cast-qual -UPEDANTIC -DALLOW_EXPERIMENTAL_API
> 
>  EXPORT_MAP := rte_common_mlx5_version.map
> 
> diff --git a/drivers/common/mlx5/meson.build
> b/drivers/common/mlx5/meson.build index f671710714..83671861c9 100644
> --- a/drivers/common/mlx5/meson.build
> +++ b/drivers/common/mlx5/meson.build
> @@ -55,6 +55,7 @@ sources = files(
>  	'mlx5_devx_cmds.c',
>  	'mlx5_common.c',
>  	'mlx5_nl.c',
> +	'mlx5_common_mp.c',
>  )
>  if not dlopen_ibverbs
>  	sources += files('mlx5_glue.c')
> diff --git a/drivers/common/mlx5/mlx5_common_mp.c
> b/drivers/common/mlx5/mlx5_common_mp.c
> new file mode 100644
> index 0000000000..da55143bc1
> --- /dev/null
> +++ b/drivers/common/mlx5/mlx5_common_mp.c
> @@ -0,0 +1,188 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright 2019 6WIND S.A.
> + * Copyright 2019 Mellanox Technologies, Ltd  */
> +
> +#include <stdio.h>
> +#include <time.h>
> +
> +#include <rte_eal.h>
> +#include <rte_errno.h>
> +
> +#include "mlx5_common_mp.h"
> +#include "mlx5_common_utils.h"
> +
> +/**
> + * Request Memory Region creation to the primary process.
> + *
> + * @param[in] mp_id
> + *   ID of the MP process.
> + * @param addr
> + *   Target virtual address to register.
> + *
> + * @return
> + *   0 on success, a negative errno value otherwise and rte_errno is set.
> + */
> +int
> +mlx5_mp_req_mr_create(struct mlx5_mp_id *mp_id, uintptr_t addr) {
> +	struct rte_mp_msg mp_req;
> +	struct rte_mp_msg *mp_res;
> +	struct rte_mp_reply mp_rep;
> +	struct mlx5_mp_param *req = (struct mlx5_mp_param
> *)mp_req.param;
> +	struct mlx5_mp_param *res;
> +	struct timespec ts = {.tv_sec = MLX5_MP_REQ_TIMEOUT_SEC, .tv_nsec
> = 0};
> +	int ret;
> +
> +	MLX5_ASSERT(rte_eal_process_type() == RTE_PROC_SECONDARY);
> +	mp_init_msg(mp_id, &mp_req, MLX5_MP_REQ_CREATE_MR);
> +	req->args.addr = addr;
> +	ret = rte_mp_request_sync(&mp_req, &mp_rep, &ts);
> +	if (ret) {
> +		DRV_LOG(ERR, "port %u request to primary process failed",
> +			mp_id->port_id);
> +		return -rte_errno;
> +	}
> +	MLX5_ASSERT(mp_rep.nb_received == 1);
> +	mp_res = &mp_rep.msgs[0];
> +	res = (struct mlx5_mp_param *)mp_res->param;
> +	ret = res->result;
> +	if (ret)
> +		rte_errno = -ret;
> +	free(mp_rep.msgs);
> +	return ret;
> +}
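
For reference, the caller side of this request on a secondary process is
what the mlx5_mr.c hunk later in this patch looks like, condensed:

	ret = mlx5_mp_req_mr_create(&priv->mp_id, addr);
	if (ret)
		DEBUG("fail to request MR creation for address (%p)",
		      (void *)addr);

where priv->mp_id is the per-port ID initialized in mlx5_dev_spawn().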
> +
> +/**
> + * Request Verbs queue state modification to the primary process.
> + *
> + * @param[in] mp_id
> + *   ID of the MP process.
> + * @param sm
> + *   State modify parameters.
> + *
> + * @return
> + *   0 on success, a negative errno value otherwise and rte_errno is set.
> + */
> +int
> +mlx5_mp_req_queue_state_modify(struct mlx5_mp_id *mp_id,
> +			       struct mlx5_mp_arg_queue_state_modify *sm) {
> +	struct rte_mp_msg mp_req;
> +	struct rte_mp_msg *mp_res;
> +	struct rte_mp_reply mp_rep;
> +	struct mlx5_mp_param *req = (struct mlx5_mp_param
> *)mp_req.param;
> +	struct mlx5_mp_param *res;
> +	struct timespec ts = {.tv_sec = MLX5_MP_REQ_TIMEOUT_SEC, .tv_nsec
> = 0};
> +	int ret;
> +
> +	MLX5_ASSERT(rte_eal_process_type() == RTE_PROC_SECONDARY);
> +	mp_init_msg(mp_id, &mp_req,
> MLX5_MP_REQ_QUEUE_STATE_MODIFY);
> +	req->args.state_modify = *sm;
> +	ret = rte_mp_request_sync(&mp_req, &mp_rep, &ts);
> +	if (ret) {
> +		DRV_LOG(ERR, "port %u request to primary process failed",
> +			mp_id->port_id);
> +		return -rte_errno;
> +	}
> +	MLX5_ASSERT(mp_rep.nb_received == 1);
> +	mp_res = &mp_rep.msgs[0];
> +	res = (struct mlx5_mp_param *)mp_res->param;
> +	ret = res->result;
> +	free(mp_rep.msgs);
> +	return ret;
> +}
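
The matching dispatch on the net PMD side, condensed from the
mlx5_rxtx.c hunk at the end of this patch: primaries modify the queue
state directly, secondaries forward the request over IPC.

	switch (rte_eal_process_type()) {
	case RTE_PROC_PRIMARY:
		ret = mlx5_queue_state_modify_primary(dev, sm);
		break;
	case RTE_PROC_SECONDARY:
		ret = mlx5_mp_req_queue_state_modify(&priv->mp_id, sm);
		break;
	default:
		break;
	}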
> +
> +/**
> + * Request Verbs command file descriptor for mmap to the primary process.
> + *
> + * @param[in] mp_id
> + *   ID of the MP process.
> + *
> + * @return
> + *   fd on success, a negative errno value otherwise and rte_errno is set.
> + */
> +int
> +mlx5_mp_req_verbs_cmd_fd(struct mlx5_mp_id *mp_id) {
> +	struct rte_mp_msg mp_req;
> +	struct rte_mp_msg *mp_res;
> +	struct rte_mp_reply mp_rep;
> +	struct mlx5_mp_param *res;
> +	struct timespec ts = {.tv_sec = MLX5_MP_REQ_TIMEOUT_SEC, .tv_nsec
> = 0};
> +	int ret;
> +
> +	MLX5_ASSERT(rte_eal_process_type() == RTE_PROC_SECONDARY);
> +	mp_init_msg(mp_id, &mp_req, MLX5_MP_REQ_VERBS_CMD_FD);
> +	ret = rte_mp_request_sync(&mp_req, &mp_rep, &ts);
> +	if (ret) {
> +		DRV_LOG(ERR, "port %u request to primary process failed",
> +			mp_id->port_id);
> +		return -rte_errno;
> +	}
> +	MLX5_ASSERT(mp_rep.nb_received == 1);
> +	mp_res = &mp_rep.msgs[0];
> +	res = (struct mlx5_mp_param *)mp_res->param;
> +	if (res->result) {
> +		rte_errno = -res->result;
> +		DRV_LOG(ERR,
> +			"port %u failed to get command FD from primary
> process",
> +			mp_id->port_id);
> +		ret = -rte_errno;
> +		goto exit;
> +	}
> +	MLX5_ASSERT(mp_res->num_fds == 1);
> +	ret = mp_res->fds[0];
> +	DRV_LOG(DEBUG, "port %u command FD from primary is %d",
> +		mp_id->port_id, ret);
> +exit:
> +	free(mp_rep.msgs);
> +	return ret;
> +}
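
The secondary-side attach path later in this patch consumes this as
follows (condensed from the mlx5.c hunk below):

	struct mlx5_mp_id mp_id;

	mp_id.port_id = eth_dev->data->port_id;
	strlcpy(mp_id.name, MLX5_MP_NAME, RTE_MP_MAX_NAME_LEN);
	/* Receive command fd from primary process. */
	err = mlx5_mp_req_verbs_cmd_fd(&mp_id);
	if (err < 0)
		return NULL;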
> +
> +/**
> + * Initialize by primary process.
> + */
> +int
> +mlx5_mp_init_primary(const char *name, const rte_mp_t primary_action) {
> +	int ret;
> +
> +	MLX5_ASSERT(rte_eal_process_type() == RTE_PROC_PRIMARY);
> +
> +	/* primary is allowed to not support IPC */
> +	ret = rte_mp_action_register(name, primary_action);
> +	if (ret && rte_errno != ENOTSUP)
> +		return -1;
> +	return 0;
> +}
> +
> +/**
> + * Un-initialize by primary process.
> + */
> +void
> +mlx5_mp_uninit_primary(const char *name) {
> +	MLX5_ASSERT(rte_eal_process_type() == RTE_PROC_PRIMARY);
> +	rte_mp_action_unregister(name);
> +}
> +
> +/**
> + * Initialize by secondary process.
> + */
> +int
> +mlx5_mp_init_secondary(const char *name, const rte_mp_t
> +secondary_action) {
> +	MLX5_ASSERT(rte_eal_process_type() == RTE_PROC_SECONDARY);
> +	return rte_mp_action_register(name, secondary_action); }
> +
> +/**
> + * Un-initialize by secondary process.
> + */
> +void
> +mlx5_mp_uninit_secondary(const char *name) {
> +	MLX5_ASSERT(rte_eal_process_type() == RTE_PROC_SECONDARY);
> +	rte_mp_action_unregister(name);
> +}
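
Since the name and handler are now parameters, each PMD sharing the
device can register its own IPC channel. A hypothetical second
mlx5-based driver (names invented purely for illustration) would do:

	#define MLX5_MP_EXAMPLE_NAME "example_mlx5_mp"

	static int
	example_mp_primary_handle(const struct rte_mp_msg *mp_msg,
				  const void *peer)
	{
		/* driver-specific request dispatch would go here */
		(void)mp_msg;
		(void)peer;
		return 0;
	}

	/* In the driver's one-time init, on the primary process: */
	if (mlx5_mp_init_primary(MLX5_MP_EXAMPLE_NAME,
				 example_mp_primary_handle) < 0)
		return -rte_errno;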
> diff --git a/drivers/common/mlx5/mlx5_common_mp.h
> b/drivers/common/mlx5/mlx5_common_mp.h
> new file mode 100644
> index 0000000000..7aab77acb2
> --- /dev/null
> +++ b/drivers/common/mlx5/mlx5_common_mp.h
> @@ -0,0 +1,98 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright 2018 6WIND S.A.
> + * Copyright 2018 Mellanox Technologies, Ltd  */
> +
> +#ifndef RTE_PMD_MLX5_COMMON_MP_H_
> +#define RTE_PMD_MLX5_COMMON_MP_H_
> +
> +/* Verbs header. */
> +/* ISO C doesn't support unnamed structs/unions, disabling -pedantic.
> +*/ #ifdef PEDANTIC #pragma GCC diagnostic ignored "-Wpedantic"
> +#endif
> +#include <infiniband/verbs.h>
> +#ifdef PEDANTIC
> +#pragma GCC diagnostic error "-Wpedantic"
> +#endif
> +
> +#include <rte_eal.h>
> +#include <rte_string_fns.h>
> +
> +/* Request types for IPC. */
> +enum mlx5_mp_req_type {
> +	MLX5_MP_REQ_VERBS_CMD_FD = 1,
> +	MLX5_MP_REQ_CREATE_MR,
> +	MLX5_MP_REQ_START_RXTX,
> +	MLX5_MP_REQ_STOP_RXTX,
> +	MLX5_MP_REQ_QUEUE_STATE_MODIFY,
> +};
> +
> +struct mlx5_mp_arg_queue_state_modify {
> +	uint8_t is_wq; /* Set if WQ. */
> +	uint16_t queue_id; /* DPDK queue ID. */
> +	enum ibv_wq_state state; /* WQ requested state. */ };
> +
> +/* Parameters for IPC. */
> +struct mlx5_mp_param {
> +	enum mlx5_mp_req_type type;
> +	int port_id;
> +	int result;
> +	RTE_STD_C11
> +	union {
> +		uintptr_t addr; /* MLX5_MP_REQ_CREATE_MR */
> +		struct mlx5_mp_arg_queue_state_modify state_modify;
> +		/* MLX5_MP_REQ_QUEUE_STATE_MODIFY */
> +	} args;
> +};
> +
> +/* Identifier of an MP process. */
> +struct mlx5_mp_id {
> +	char name[RTE_MP_MAX_NAME_LEN];
> +	uint16_t port_id;
> +};
> +
> +/** Request timeout for IPC. */
> +#define MLX5_MP_REQ_TIMEOUT_SEC 5
> +
> +/**
> + * Initialize IPC message.
> + *
> + * @param[in] mp_id
> + *   Pointer to ID of the MP process.
> + * @param[out] msg
> + *   Pointer to message to fill in.
> + * @param[in] type
> + *   Message type.
> + */
> +static inline void
> +mp_init_msg(struct mlx5_mp_id *mp_id, struct rte_mp_msg *msg,
> +	    enum mlx5_mp_req_type type)
> +{
> +	struct mlx5_mp_param *param = (struct mlx5_mp_param *)msg-
> >param;
> +
> +	memset(msg, 0, sizeof(*msg));
> +	strlcpy(msg->name, mp_id->name, sizeof(msg->name));
> +	msg->len_param = sizeof(*param);
> +	param->type = type;
> +	param->port_id = mp_id->port_id;
> +}
> +
> +__rte_experimental
> +int mlx5_mp_init_primary(const char *name, const rte_mp_t
> +primary_action); __rte_experimental void mlx5_mp_uninit_primary(const
> +char *name); __rte_experimental int mlx5_mp_init_secondary(const char
> +*name, const rte_mp_t secondary_action); __rte_experimental void
> +mlx5_mp_uninit_secondary(const char *name); __rte_experimental int
> +mlx5_mp_req_mr_create(struct mlx5_mp_id *mp_id, uintptr_t addr);
> +__rte_experimental int mlx5_mp_req_queue_state_modify(struct
> mlx5_mp_id
> +*mp_id,
> +				   struct mlx5_mp_arg_queue_state_modify
> *sm); __rte_experimental
> +int mlx5_mp_req_verbs_cmd_fd(struct mlx5_mp_id *mp_id);
> +
> +#endif /* RTE_PMD_MLX5_COMMON_MP_H_ */
> diff --git a/drivers/common/mlx5/rte_common_mlx5_version.map
> b/drivers/common/mlx5/rte_common_mlx5_version.map
> index aede2a0a51..265703d1c9 100644
> --- a/drivers/common/mlx5/rte_common_mlx5_version.map
> +++ b/drivers/common/mlx5/rte_common_mlx5_version.map
> @@ -48,4 +48,17 @@ DPDK_20.0.1 {
>  	mlx5_nl_vlan_vmwa_delete;
> 
>  	mlx5_translate_port_name;
> +
> +};
> +
> +EXPERIMENTAL {
> +        global:
> +
> +	mlx5_mp_init_primary;
> +	mlx5_mp_uninit_primary;
> +	mlx5_mp_init_secondary;
> +	mlx5_mp_uninit_secondary;
> +	mlx5_mp_req_mr_create;
> +	mlx5_mp_req_queue_state_modify;
> +	mlx5_mp_req_verbs_cmd_fd;
>  };
> diff --git a/drivers/net/mlx5/mlx5.c b/drivers/net/mlx5/mlx5.c index
> 293d316413..d87c384422 100644
> --- a/drivers/net/mlx5/mlx5.c
> +++ b/drivers/net/mlx5/mlx5.c
> @@ -38,6 +38,7 @@
>  #include <mlx5_glue.h>
>  #include <mlx5_devx_cmds.h>
>  #include <mlx5_common.h>
> +#include <mlx5_common_mp.h>
> 
>  #include "mlx5_defs.h"
>  #include "mlx5.h"
> @@ -1722,7 +1723,8 @@ mlx5_init_once(void)
>  		rte_rwlock_init(&sd->mem_event_rwlock);
>  		rte_mem_event_callback_register("MLX5_MEM_EVENT_CB",
>  						mlx5_mr_mem_event_cb,
> NULL);
> -		ret = mlx5_mp_init_primary();
> +		ret = mlx5_mp_init_primary(MLX5_MP_NAME,
> +					   mlx5_mp_primary_handle);
>  		if (ret)
>  			goto out;
>  		sd->init_done = true;
> @@ -1730,7 +1732,8 @@ mlx5_init_once(void)
>  	case RTE_PROC_SECONDARY:
>  		if (ld->init_done)
>  			break;
> -		ret = mlx5_mp_init_secondary();
> +		ret = mlx5_mp_init_secondary(MLX5_MP_NAME,
> +					     mlx5_mp_secondary_handle);
>  		if (ret)
>  			goto out;
>  		++sd->secondary_cnt;
> @@ -2205,6 +2208,8 @@ mlx5_dev_spawn(struct rte_device *dpdk_dev,
>  	}
>  	DRV_LOG(DEBUG, "naming Ethernet device \"%s\"", name);
>  	if (rte_eal_process_type() == RTE_PROC_SECONDARY) {
> +		struct mlx5_mp_id mp_id;
> +
>  		eth_dev = rte_eth_dev_attach_secondary(name);
>  		if (eth_dev == NULL) {
>  			DRV_LOG(ERR, "can not attach rte ethdev"); @@ -
> 2216,8 +2221,10 @@ mlx5_dev_spawn(struct rte_device *dpdk_dev,
>  		err = mlx5_proc_priv_init(eth_dev);
>  		if (err)
>  			return NULL;
> +		mp_id.port_id = eth_dev->data->port_id;
> +		strlcpy(mp_id.name, MLX5_MP_NAME,
> RTE_MP_MAX_NAME_LEN);
>  		/* Receive command fd from primary process */
> -		err = mlx5_mp_req_verbs_cmd_fd(eth_dev);
> +		err = mlx5_mp_req_verbs_cmd_fd(&mp_id);
>  		if (err < 0)
>  			return NULL;
>  		/* Remap UAR for Tx queues. */
> @@ -2379,6 +2386,8 @@ mlx5_dev_spawn(struct rte_device *dpdk_dev,
>  	priv->ibv_port = spawn->ibv_port;
>  	priv->pci_dev = spawn->pci_dev;
>  	priv->mtu = RTE_ETHER_MTU;
> +	priv->mp_id.port_id = port_id;
> +	strlcpy(priv->mp_id.name, MLX5_MP_NAME,
> RTE_MP_MAX_NAME_LEN);
>  #ifndef RTE_ARCH_64
>  	/* Initialize UAR access locks for 32bit implementations. */
>  	rte_spinlock_init(&priv->uar_lock_cq);
> diff --git a/drivers/net/mlx5/mlx5.h b/drivers/net/mlx5/mlx5.h index
> fccfe47341..e9d5868883 100644
> --- a/drivers/net/mlx5/mlx5.h
> +++ b/drivers/net/mlx5/mlx5.h
> @@ -36,43 +36,13 @@
>  #include <mlx5_devx_cmds.h>
>  #include <mlx5_prm.h>
>  #include <mlx5_nl.h>
> +#include <mlx5_common_mp.h>
> 
>  #include "mlx5_defs.h"
>  #include "mlx5_utils.h"
>  #include "mlx5_mr.h"
>  #include "mlx5_autoconf.h"
> 
> -/* Request types for IPC. */
> -enum mlx5_mp_req_type {
> -	MLX5_MP_REQ_VERBS_CMD_FD = 1,
> -	MLX5_MP_REQ_CREATE_MR,
> -	MLX5_MP_REQ_START_RXTX,
> -	MLX5_MP_REQ_STOP_RXTX,
> -	MLX5_MP_REQ_QUEUE_STATE_MODIFY,
> -};
> -
> -struct mlx5_mp_arg_queue_state_modify {
> -	uint8_t is_wq; /* Set if WQ. */
> -	uint16_t queue_id; /* DPDK queue ID. */
> -	enum ibv_wq_state state; /* WQ requested state. */
> -};
> -
> -/* Pameters for IPC. */
> -struct mlx5_mp_param {
> -	enum mlx5_mp_req_type type;
> -	int port_id;
> -	int result;
> -	RTE_STD_C11
> -	union {
> -		uintptr_t addr; /* MLX5_MP_REQ_CREATE_MR */
> -		struct mlx5_mp_arg_queue_state_modify state_modify;
> -		/* MLX5_MP_REQ_QUEUE_STATE_MODIFY */
> -	} args;
> -};
> -
> -/** Request timeout for IPC. */
> -#define MLX5_MP_REQ_TIMEOUT_SEC 5
> -
>  /** Key string for IPC. */
>  #define MLX5_MP_NAME "net_mlx5_mp"
> 
> @@ -583,6 +553,7 @@ struct mlx5_priv {
>  #endif
>  	uint8_t skip_default_rss_reta; /* Skip configuration of default reta. */
>  	uint8_t fdb_def_rule; /* Whether fdb jump to table 1 is configured. */
> +	struct mlx5_mp_id mp_id; /* ID of the MP process. */
>  };
> 
>  #define PORT_ID(priv) ((priv)->dev_data->port_id) @@ -783,16 +754,10 @@
> int mlx5_flow_dev_dump(struct rte_eth_dev *dev, FILE *file,
>  		       struct rte_flow_error *error);
> 
>  /* mlx5_mp.c */
> +int mlx5_mp_primary_handle(const struct rte_mp_msg *mp_msg, const void
> +*peer); int mlx5_mp_secondary_handle(const struct rte_mp_msg *mp_msg,
> +const void *peer);
>  void mlx5_mp_req_start_rxtx(struct rte_eth_dev *dev);  void
> mlx5_mp_req_stop_rxtx(struct rte_eth_dev *dev); -int
> mlx5_mp_req_mr_create(struct rte_eth_dev *dev, uintptr_t addr); -int
> mlx5_mp_req_verbs_cmd_fd(struct rte_eth_dev *dev); -int
> mlx5_mp_req_queue_state_modify(struct rte_eth_dev *dev,
> -				   struct mlx5_mp_arg_queue_state_modify
> *sm);
> -int mlx5_mp_init_primary(void);
> -void mlx5_mp_uninit_primary(void);
> -int mlx5_mp_init_secondary(void);
> -void mlx5_mp_uninit_secondary(void);
> 
>  /* mlx5_socket.c */
> 
> diff --git a/drivers/net/mlx5/mlx5_mp.c b/drivers/net/mlx5/mlx5_mp.c index
> 55d408fe95..43684dbc3a 100644
> --- a/drivers/net/mlx5/mlx5_mp.c
> +++ b/drivers/net/mlx5/mlx5_mp.c
> @@ -10,46 +10,14 @@
>  #include <rte_ethdev_driver.h>
>  #include <rte_string_fns.h>
> 
> +#include <mlx5_common_mp.h>
> +
>  #include "mlx5.h"
>  #include "mlx5_rxtx.h"
>  #include "mlx5_utils.h"
> 
> -/**
> - * Initialize IPC message.
> - *
> - * @param[in] dev
> - *   Pointer to Ethernet structure.
> - * @param[out] msg
> - *   Pointer to message to fill in.
> - * @param[in] type
> - *   Message type.
> - */
> -static inline void
> -mp_init_msg(struct rte_eth_dev *dev, struct rte_mp_msg *msg,
> -	    enum mlx5_mp_req_type type)
> -{
> -	struct mlx5_mp_param *param = (struct mlx5_mp_param *)msg-
> >param;
> -
> -	memset(msg, 0, sizeof(*msg));
> -	strlcpy(msg->name, MLX5_MP_NAME, sizeof(msg->name));
> -	msg->len_param = sizeof(*param);
> -	param->type = type;
> -	param->port_id = dev->data->port_id;
> -}
> -
> -/**
> - * IPC message handler of primary process.
> - *
> - * @param[in] dev
> - *   Pointer to Ethernet structure.
> - * @param[in] peer
> - *   Pointer to the peer socket path.
> - *
> - * @return
> - *   0 on success, a negative errno value otherwise and rte_errno is set.
> - */
> -static int
> -mp_primary_handle(const struct rte_mp_msg *mp_msg, const void *peer)
> +int
> +mlx5_mp_primary_handle(const struct rte_mp_msg *mp_msg, const void
> +*peer)
>  {
>  	struct rte_mp_msg mp_res;
>  	struct mlx5_mp_param *res = (struct mlx5_mp_param
> *)mp_res.param; @@ -71,21 +39,21 @@ mp_primary_handle(const struct
> rte_mp_msg *mp_msg, const void *peer)
>  	priv = dev->data->dev_private;
>  	switch (param->type) {
>  	case MLX5_MP_REQ_CREATE_MR:
> -		mp_init_msg(dev, &mp_res, param->type);
> +		mp_init_msg(&priv->mp_id, &mp_res, param->type);
>  		lkey = mlx5_mr_create_primary(dev, &entry, param-
> >args.addr);
>  		if (lkey == UINT32_MAX)
>  			res->result = -rte_errno;
>  		ret = rte_mp_reply(&mp_res, peer);
>  		break;
>  	case MLX5_MP_REQ_VERBS_CMD_FD:
> -		mp_init_msg(dev, &mp_res, param->type);
> +		mp_init_msg(&priv->mp_id, &mp_res, param->type);
>  		mp_res.num_fds = 1;
>  		mp_res.fds[0] = priv->sh->ctx->cmd_fd;
>  		res->result = 0;
>  		ret = rte_mp_reply(&mp_res, peer);
>  		break;
>  	case MLX5_MP_REQ_QUEUE_STATE_MODIFY:
> -		mp_init_msg(dev, &mp_res, param->type);
> +		mp_init_msg(&priv->mp_id, &mp_res, param->type);
>  		res->result = mlx5_queue_state_modify_primary
>  					(dev, &param->args.state_modify);
>  		ret = rte_mp_reply(&mp_res, peer);
> @@ -110,14 +78,15 @@ mp_primary_handle(const struct rte_mp_msg
> *mp_msg, const void *peer)
>   * @return
>   *   0 on success, a negative errno value otherwise and rte_errno is set.
>   */
> -static int
> -mp_secondary_handle(const struct rte_mp_msg *mp_msg, const void *peer)
> +int
> +mlx5_mp_secondary_handle(const struct rte_mp_msg *mp_msg, const void
> +*peer)
>  {
>  	struct rte_mp_msg mp_res;
>  	struct mlx5_mp_param *res = (struct mlx5_mp_param
> *)mp_res.param;
>  	const struct mlx5_mp_param *param =
>  		(const struct mlx5_mp_param *)mp_msg->param;
>  	struct rte_eth_dev *dev;
> +	struct mlx5_priv *priv;
>  	int ret;
> 
>  	MLX5_ASSERT(rte_eal_process_type() == RTE_PROC_SECONDARY);
> @@ -127,13 +96,14 @@ mp_secondary_handle(const struct rte_mp_msg
> *mp_msg, const void *peer)
>  		return -rte_errno;
>  	}
>  	dev = &rte_eth_devices[param->port_id];
> +	priv = dev->data->dev_private;
>  	switch (param->type) {
>  	case MLX5_MP_REQ_START_RXTX:
>  		DRV_LOG(INFO, "port %u starting datapath", dev->data-
> >port_id);
>  		rte_mb();
>  		dev->rx_pkt_burst = mlx5_select_rx_function(dev);
>  		dev->tx_pkt_burst = mlx5_select_tx_function(dev);
> -		mp_init_msg(dev, &mp_res, param->type);
> +		mp_init_msg(&priv->mp_id, &mp_res, param->type);
>  		res->result = 0;
>  		ret = rte_mp_reply(&mp_res, peer);
>  		break;
> @@ -142,7 +112,7 @@ mp_secondary_handle(const struct rte_mp_msg
> *mp_msg, const void *peer)
>  		dev->rx_pkt_burst = removed_rx_burst;
>  		dev->tx_pkt_burst = removed_tx_burst;
>  		rte_mb();
> -		mp_init_msg(dev, &mp_res, param->type);
> +		mp_init_msg(&priv->mp_id, &mp_res, param->type);
>  		res->result = 0;
>  		ret = rte_mp_reply(&mp_res, peer);
>  		break;
> @@ -171,6 +141,7 @@ mp_req_on_rxtx(struct rte_eth_dev *dev, enum
> mlx5_mp_req_type type)
>  	struct rte_mp_reply mp_rep;
>  	struct mlx5_mp_param *res;
>  	struct timespec ts = {.tv_sec = MLX5_MP_REQ_TIMEOUT_SEC, .tv_nsec
> = 0};
> +	struct mlx5_priv *priv = dev->data->dev_private;
>  	int ret;
>  	int i;
> 
> @@ -182,7 +153,7 @@ mp_req_on_rxtx(struct rte_eth_dev *dev, enum
> mlx5_mp_req_type type)
>  			dev->data->port_id, type);
>  		return;
>  	}
> -	mp_init_msg(dev, &mp_req, type);
> +	mp_init_msg(&priv->mp_id, &mp_req, type);
>  	ret = rte_mp_request_sync(&mp_req, &mp_rep, &ts);
>  	if (ret) {
>  		if (rte_errno != ENOTSUP)
> @@ -234,178 +205,3 @@ mlx5_mp_req_stop_rxtx(struct rte_eth_dev *dev)  {
>  	mp_req_on_rxtx(dev, MLX5_MP_REQ_STOP_RXTX);  }
> -
> -/**
> - * Request Memory Region creation to the primary process.
> - *
> - * @param[in] dev
> - *   Pointer to Ethernet structure.
> - * @param addr
> - *   Target virtual address to register.
> - *
> - * @return
> - *   0 on success, a negative errno value otherwise and rte_errno is set.
> - */
> -int
> -mlx5_mp_req_mr_create(struct rte_eth_dev *dev, uintptr_t addr) -{
> -	struct rte_mp_msg mp_req;
> -	struct rte_mp_msg *mp_res;
> -	struct rte_mp_reply mp_rep;
> -	struct mlx5_mp_param *req = (struct mlx5_mp_param
> *)mp_req.param;
> -	struct mlx5_mp_param *res;
> -	struct timespec ts = {.tv_sec = MLX5_MP_REQ_TIMEOUT_SEC, .tv_nsec
> = 0};
> -	int ret;
> -
> -	MLX5_ASSERT(rte_eal_process_type() == RTE_PROC_SECONDARY);
> -	mp_init_msg(dev, &mp_req, MLX5_MP_REQ_CREATE_MR);
> -	req->args.addr = addr;
> -	ret = rte_mp_request_sync(&mp_req, &mp_rep, &ts);
> -	if (ret) {
> -		DRV_LOG(ERR, "port %u request to primary process failed",
> -			dev->data->port_id);
> -		return -rte_errno;
> -	}
> -	MLX5_ASSERT(mp_rep.nb_received == 1);
> -	mp_res = &mp_rep.msgs[0];
> -	res = (struct mlx5_mp_param *)mp_res->param;
> -	ret = res->result;
> -	if (ret)
> -		rte_errno = -ret;
> -	free(mp_rep.msgs);
> -	return ret;
> -}
> -
> -/**
> - * Request Verbs queue state modification to the primary process.
> - *
> - * @param[in] dev
> - *   Pointer to Ethernet structure.
> - * @param sm
> - *   State modify parameters.
> - *
> - * @return
> - *   0 on success, a negative errno value otherwise and rte_errno is set.
> - */
> -int
> -mlx5_mp_req_queue_state_modify(struct rte_eth_dev *dev,
> -			       struct mlx5_mp_arg_queue_state_modify *sm)
> -{
> -	struct rte_mp_msg mp_req;
> -	struct rte_mp_msg *mp_res;
> -	struct rte_mp_reply mp_rep;
> -	struct mlx5_mp_param *req = (struct mlx5_mp_param
> *)mp_req.param;
> -	struct mlx5_mp_param *res;
> -	struct timespec ts = {.tv_sec = MLX5_MP_REQ_TIMEOUT_SEC, .tv_nsec
> = 0};
> -	int ret;
> -
> -	MLX5_ASSERT(rte_eal_process_type() == RTE_PROC_SECONDARY);
> -	mp_init_msg(dev, &mp_req, MLX5_MP_REQ_QUEUE_STATE_MODIFY);
> -	req->args.state_modify = *sm;
> -	ret = rte_mp_request_sync(&mp_req, &mp_rep, &ts);
> -	if (ret) {
> -		DRV_LOG(ERR, "port %u request to primary process failed",
> -			dev->data->port_id);
> -		return -rte_errno;
> -	}
> -	MLX5_ASSERT(mp_rep.nb_received == 1);
> -	mp_res = &mp_rep.msgs[0];
> -	res = (struct mlx5_mp_param *)mp_res->param;
> -	ret = res->result;
> -	free(mp_rep.msgs);
> -	return ret;
> -}
> -
> -/**
> - * Request Verbs command file descriptor for mmap to the primary process.
> - *
> - * @param[in] dev
> - *   Pointer to Ethernet structure.
> - *
> - * @return
> - *   fd on success, a negative errno value otherwise and rte_errno is set.
> - */
> -int
> -mlx5_mp_req_verbs_cmd_fd(struct rte_eth_dev *dev) -{
> -	struct rte_mp_msg mp_req;
> -	struct rte_mp_msg *mp_res;
> -	struct rte_mp_reply mp_rep;
> -	struct mlx5_mp_param *res;
> -	struct timespec ts = {.tv_sec = MLX5_MP_REQ_TIMEOUT_SEC, .tv_nsec
> = 0};
> -	int ret;
> -
> -	MLX5_ASSERT(rte_eal_process_type() == RTE_PROC_SECONDARY);
> -	mp_init_msg(dev, &mp_req, MLX5_MP_REQ_VERBS_CMD_FD);
> -	ret = rte_mp_request_sync(&mp_req, &mp_rep, &ts);
> -	if (ret) {
> -		DRV_LOG(ERR, "port %u request to primary process failed",
> -			dev->data->port_id);
> -		return -rte_errno;
> -	}
> -	MLX5_ASSERT(mp_rep.nb_received == 1);
> -	mp_res = &mp_rep.msgs[0];
> -	res = (struct mlx5_mp_param *)mp_res->param;
> -	if (res->result) {
> -		rte_errno = -res->result;
> -		DRV_LOG(ERR,
> -			"port %u failed to get command FD from primary
> process",
> -			dev->data->port_id);
> -		ret = -rte_errno;
> -		goto exit;
> -	}
> -	MLX5_ASSERT(mp_res->num_fds == 1);
> -	ret = mp_res->fds[0];
> -	DRV_LOG(DEBUG, "port %u command FD from primary is %d",
> -		dev->data->port_id, ret);
> -exit:
> -	free(mp_rep.msgs);
> -	return ret;
> -}
> -
> -/**
> - * Initialize by primary process.
> - */
> -int
> -mlx5_mp_init_primary(void)
> -{
> -	int ret;
> -
> -	MLX5_ASSERT(rte_eal_process_type() == RTE_PROC_PRIMARY);
> -
> -	/* primary is allowed to not support IPC */
> -	ret = rte_mp_action_register(MLX5_MP_NAME, mp_primary_handle);
> -	if (ret && rte_errno != ENOTSUP)
> -		return -1;
> -	return 0;
> -}
> -
> -/**
> - * Un-initialize by primary process.
> - */
> -void
> -mlx5_mp_uninit_primary(void)
> -{
> -	MLX5_ASSERT(rte_eal_process_type() == RTE_PROC_PRIMARY);
> -	rte_mp_action_unregister(MLX5_MP_NAME);
> -}
> -
> -/**
> - * Initialize by secondary process.
> - */
> -int
> -mlx5_mp_init_secondary(void)
> -{
> -	MLX5_ASSERT(rte_eal_process_type() == RTE_PROC_SECONDARY);
> -	return rte_mp_action_register(MLX5_MP_NAME,
> mp_secondary_handle);
> -}
> -
> -/**
> - * Un-initialize by secondary process.
> - */
> -void
> -mlx5_mp_uninit_secondary(void)
> -{
> -	MLX5_ASSERT(rte_eal_process_type() == RTE_PROC_SECONDARY);
> -	rte_mp_action_unregister(MLX5_MP_NAME);
> -}
> diff --git a/drivers/net/mlx5/mlx5_mr.c b/drivers/net/mlx5/mlx5_mr.c index
> a8f185a208..9151992a72 100644
> --- a/drivers/net/mlx5/mlx5_mr.c
> +++ b/drivers/net/mlx5/mlx5_mr.c
> @@ -540,7 +540,7 @@ mlx5_mr_create_secondary(struct rte_eth_dev *dev,
> struct mlx5_mr_cache *entry,
> 
>  	DEBUG("port %u requesting MR creation for address (%p)",
>  	      dev->data->port_id, (void *)addr);
> -	ret = mlx5_mp_req_mr_create(dev, addr);
> +	ret = mlx5_mp_req_mr_create(&priv->mp_id, addr);
>  	if (ret) {
>  		DEBUG("port %u fail to request MR creation for address
> (%p)",
>  		      dev->data->port_id, (void *)addr); diff --git
> a/drivers/net/mlx5/mlx5_rxtx.c b/drivers/net/mlx5/mlx5_rxtx.c index
> 7ce3732fd3..42d7da8a4b 100644
> --- a/drivers/net/mlx5/mlx5_rxtx.c
> +++ b/drivers/net/mlx5/mlx5_rxtx.c
> @@ -1000,6 +1000,7 @@ static int
>  mlx5_queue_state_modify(struct rte_eth_dev *dev,
>  			struct mlx5_mp_arg_queue_state_modify *sm)  {
> +	struct mlx5_priv *priv = dev->data->dev_private;
>  	int ret = 0;
> 
>  	switch (rte_eal_process_type()) {
> @@ -1007,7 +1008,7 @@ mlx5_queue_state_modify(struct rte_eth_dev *dev,
>  		ret = mlx5_queue_state_modify_primary(dev, sm);
>  		break;
>  	case RTE_PROC_SECONDARY:
> -		ret = mlx5_mp_req_queue_state_modify(dev, sm);
> +		ret = mlx5_mp_req_queue_state_modify(&priv->mp_id, sm);
>  		break;
>  	default:
>  		break;
> --
> 2.16.6


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [dpdk-dev] [PATCH v4 2/2] common/mlx5: refactor memory management codes
  2020-04-13 21:17   ` [dpdk-dev] [PATCH v4 2/2] common/mlx5: refactor memory management codes Vu Pham
@ 2020-04-14  7:27     ` Slava Ovsiienko
  0 siblings, 0 replies; 26+ messages in thread
From: Slava Ovsiienko @ 2020-04-14  7:27 UTC (permalink / raw)
  To: Vu Pham, dev; +Cc: Ori Kam, Matan Azrad, Raslan Darawsheh, Vu Pham

> -----Original Message-----
> From: Vu Pham <vuhuong@mellanox.com>
> Sent: Tuesday, April 14, 2020 0:18
> To: dev@dpdk.org
> Cc: Slava Ovsiienko <viacheslavo@mellanox.com>; Ori Kam
> <orika@mellanox.com>; Matan Azrad <matan@mellanox.com>; Raslan
> Darawsheh <rasland@mellanox.com>; Vu Pham <vuhuong@mellanox.com>
> Subject: [PATCH v4 2/2] common/mlx5: refactor memory management codes
> 
> Refactor the common memory B-tree and cache management into the common
> driver. Replace some input parameters of the MR APIs with more common
> data structures such as PD, port_id, share_cache,... so that multiple
> PMD drivers can use those MR APIs.
> 
> Modify the mlx5 net PMD to use the MR management APIs from the common
> driver.
> 
> Signed-off-by: Vu Pham <vuhuong@mellanox.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@mellanox.com>

> ---
>  drivers/common/mlx5/Makefile                    |    1 +
>  drivers/common/mlx5/meson.build                 |    1 +
>  drivers/common/mlx5/mlx5_common_mr.c            | 1108
> +++++++++++++++++++++
>  drivers/common/mlx5/mlx5_common_mr.h            |  160 ++++
>  drivers/common/mlx5/rte_common_mlx5_version.map |   14 +
>  drivers/net/mlx5/mlx5.c                         |    4 +-
>  drivers/net/mlx5/mlx5.h                         |   12 +-
>  drivers/net/mlx5/mlx5_mp.c                      |    8 +-
>  drivers/net/mlx5/mlx5_mr.c                      | 1169 +----------------------
>  drivers/net/mlx5/mlx5_mr.h                      |   87 +-
>  drivers/net/mlx5/mlx5_rxtx.c                    |    1 +
>  drivers/net/mlx5/mlx5_rxtx.h                    |   10 +-
>  drivers/net/mlx5/mlx5_rxtx_vec.h                |    2 +
>  drivers/net/mlx5/mlx5_trigger.c                 |    1 +
>  drivers/net/mlx5/mlx5_txq.c                     |    3 +-
>  15 files changed, 1357 insertions(+), 1224 deletions(-)
>  create mode 100644 drivers/common/mlx5/mlx5_common_mr.c
>  create mode 100644 drivers/common/mlx5/mlx5_common_mr.h
> 
> diff --git a/drivers/common/mlx5/Makefile b/drivers/common/mlx5/Makefile
> index 2a88492731..26267c957a 100644
> --- a/drivers/common/mlx5/Makefile
> +++ b/drivers/common/mlx5/Makefile
> @@ -18,6 +18,7 @@ SRCS-y += mlx5_devx_cmds.c
>  SRCS-y += mlx5_common.c
>  SRCS-y += mlx5_nl.c
>  SRCS-y += mlx5_common_mp.c
> +SRCS-y += mlx5_common_mr.c
>  ifeq ($(CONFIG_RTE_IBVERBS_LINK_DLOPEN),y)
>  INSTALL-y-lib += $(LIB_GLUE)
>  endif
> diff --git a/drivers/common/mlx5/meson.build
> b/drivers/common/mlx5/meson.build
> index 83671861c9..175251b691 100644
> --- a/drivers/common/mlx5/meson.build
> +++ b/drivers/common/mlx5/meson.build
> @@ -56,6 +56,7 @@ sources = files(
>  	'mlx5_common.c',
>  	'mlx5_nl.c',
>  	'mlx5_common_mp.c',
> +	'mlx5_common_mr.c',
>  )
>  if not dlopen_ibverbs
>  	sources += files('mlx5_glue.c')
> diff --git a/drivers/common/mlx5/mlx5_common_mr.c
> b/drivers/common/mlx5/mlx5_common_mr.c
> new file mode 100644
> index 0000000000..9d4a06dd5b
> --- /dev/null
> +++ b/drivers/common/mlx5/mlx5_common_mr.c
> @@ -0,0 +1,1108 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright 2016 6WIND S.A.
> + * Copyright 2020 Mellanox Technologies, Ltd
> + */
> +#include <rte_eal_memconfig.h>
> +#include <rte_errno.h>
> +#include <rte_mempool.h>
> +#include <rte_malloc.h>
> +#include <rte_rwlock.h>
> +
> +#include "mlx5_glue.h"
> +#include "mlx5_common_mp.h"
> +#include "mlx5_common_mr.h"
> +#include "mlx5_common_utils.h"
> +
> +struct mr_find_contig_memsegs_data {
> +	uintptr_t addr;
> +	uintptr_t start;
> +	uintptr_t end;
> +	const struct rte_memseg_list *msl;
> +};
> +
> +/**
> + * Expand B-tree table to a given size. Can't be called with holding
> + * memory_hotplug_lock or share_cache.rwlock due to rte_realloc().
> + *
> + * @param bt
> + *   Pointer to B-tree structure.
> + * @param n
> + *   Number of entries for expansion.
> + *
> + * @return
> + *   0 on success, -1 on failure.
> + */
> +static int
> +mr_btree_expand(struct mlx5_mr_btree *bt, int n)
> +{
> +	void *mem;
> +	int ret = 0;
> +
> +	if (n <= bt->size)
> +		return ret;
> +	/*
> +	 * Downside of directly using rte_realloc() is that SOCKET_ID_ANY is
> +	 * used inside if there's no room to expand. Because this is quite a
> +	 * rare case and part of the very slow path, it is acceptable.
> +	 * Initially cache_bh[] will be given practically enough space and once
> +	 * it is expanded, expansion wouldn't be needed again ever.
> +	 */
> +	mem = rte_realloc(bt->table, n * sizeof(struct mr_cache_entry), 0);
> +	if (mem == NULL) {
> +		/* Not an error, B-tree search will be skipped. */
> +		DRV_LOG(WARNING, "failed to expand MR B-tree (%p) table",
> +			(void *)bt);
> +		ret = -1;
> +	} else {
> +		DRV_LOG(DEBUG, "expanded MR B-tree table (size=%u)", n);
> +		bt->table = mem;
> +		bt->size = n;
> +	}
> +	return ret;
> +}
> +
> +/**
> + * Look up LKey from given B-tree lookup table, store the last index and
> return
> + * searched LKey.
> + *
> + * @param bt
> + *   Pointer to B-tree structure.
> + * @param[out] idx
> + *   Pointer to index. Even on search failure, returns index where it stops
> + *   searching so that index can be used when inserting a new entry.
> + * @param addr
> + *   Search key.
> + *
> + * @return
> + *   Searched LKey on success, UINT32_MAX on no match.
> + */
> +static uint32_t
> +mr_btree_lookup(struct mlx5_mr_btree *bt, uint16_t *idx, uintptr_t addr)
> +{
> +	struct mr_cache_entry *lkp_tbl;
> +	uint16_t n;
> +	uint16_t base = 0;
> +
> +	MLX5_ASSERT(bt != NULL);
> +	lkp_tbl = *bt->table;
> +	n = bt->len;
> +	/* First entry must be NULL for comparison. */
> +	MLX5_ASSERT(bt->len > 0 || (lkp_tbl[0].start == 0 &&
> +				    lkp_tbl[0].lkey == UINT32_MAX));
> +	/* Binary search. */
> +	do {
> +		register uint16_t delta = n >> 1;
> +
> +		if (addr < lkp_tbl[base + delta].start) {
> +			n = delta;
> +		} else {
> +			base += delta;
> +			n -= delta;
> +		}
> +	} while (n > 1);
> +	MLX5_ASSERT(addr >= lkp_tbl[base].start);
> +	*idx = base;
> +	if (addr < lkp_tbl[base].end)
> +		return lkp_tbl[base].lkey;
> +	/* Not found. */
> +	return UINT32_MAX;
> +}
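
A self-contained model of this lookup, showing why the sentinel entry
installed at index 0 by mlx5_mr_btree_init() below lets the search run
without bounds checks (all names here are illustrative):

	#include <stdint.h>
	#include <stdio.h>

	struct entry { uintptr_t start, end; uint32_t lkey; };

	static uint32_t
	lookup(const struct entry *tbl, uint16_t len, uintptr_t addr)
	{
		uint16_t n = len, base = 0;

		do {
			uint16_t delta = n >> 1;

			if (addr < tbl[base + delta].start)
				n = delta;
			else {
				base += delta;
				n -= delta;
			}
		} while (n > 1);
		/* base is the rightmost entry with start <= addr. */
		return addr < tbl[base].end ? tbl[base].lkey : UINT32_MAX;
	}

	int
	main(void)
	{
		const struct entry tbl[] = {
			{ 0, 0, UINT32_MAX },	/* sentinel */
			{ 0x1000, 0x2000, 0xaa },
			{ 0x4000, 0x8000, 0xbb },
		};

		printf("0x%x\n", lookup(tbl, 3, 0x1800)); /* 0xaa */
		printf("0x%x\n", lookup(tbl, 3, 0x3000)); /* miss */
		return 0;
	}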
> +
> +/**
> + * Insert an entry to B-tree lookup table.
> + *
> + * @param bt
> + *   Pointer to B-tree structure.
> + * @param entry
> + *   Pointer to new entry to insert.
> + *
> + * @return
> + *   0 on success, -1 on failure.
> + */
> +static int
> +mr_btree_insert(struct mlx5_mr_btree *bt, struct mr_cache_entry *entry)
> +{
> +	struct mr_cache_entry *lkp_tbl;
> +	uint16_t idx = 0;
> +	size_t shift;
> +
> +	MLX5_ASSERT(bt != NULL);
> +	MLX5_ASSERT(bt->len <= bt->size);
> +	MLX5_ASSERT(bt->len > 0);
> +	lkp_tbl = *bt->table;
> +	/* Find out the slot for insertion. */
> +	if (mr_btree_lookup(bt, &idx, entry->start) != UINT32_MAX) {
> +		DRV_LOG(DEBUG,
> +			"abort insertion to B-tree(%p): already exist at"
> +			" idx=%u [0x%" PRIxPTR ", 0x%" PRIxPTR ") lkey=0x%x",
> +			(void *)bt, idx, entry->start, entry->end, entry->lkey);
> +		/* Already exist, return. */
> +		return 0;
> +	}
> +	/* If table is full, return error. */
> +	if (unlikely(bt->len == bt->size)) {
> +		bt->overflow = 1;
> +		return -1;
> +	}
> +	/* Insert entry. */
> +	++idx;
> +	shift = (bt->len - idx) * sizeof(struct mr_cache_entry);
> +	if (shift)
> +		memmove(&lkp_tbl[idx + 1], &lkp_tbl[idx], shift);
> +	lkp_tbl[idx] = *entry;
> +	bt->len++;
> +	DRV_LOG(DEBUG,
> +		"inserted B-tree(%p)[%u],"
> +		" [0x%" PRIxPTR ", 0x%" PRIxPTR ") lkey=0x%x",
> +		(void *)bt, idx, entry->start, entry->end, entry->lkey);
> +	return 0;
> +}
> +
> +/**
> + * Initialize B-tree and allocate memory for lookup table.
> + *
> + * @param bt
> + *   Pointer to B-tree structure.
> + * @param n
> + *   Number of entries to allocate.
> + * @param socket
> + *   NUMA socket on which memory must be allocated.
> + *
> + * @return
> + *   0 on success, a negative errno value otherwise and rte_errno is set.
> + */
> +int
> +mlx5_mr_btree_init(struct mlx5_mr_btree *bt, int n, int socket)
> +{
> +	if (bt == NULL) {
> +		rte_errno = EINVAL;
> +		return -rte_errno;
> +	}
> +	MLX5_ASSERT(!bt->table && !bt->size);
> +	memset(bt, 0, sizeof(*bt));
> +	bt->table = rte_calloc_socket("B-tree table",
> +				      n, sizeof(struct mr_cache_entry),
> +				      0, socket);
> +	if (bt->table == NULL) {
> +		rte_errno = ENOMEM;
> +		DEBUG("failed to allocate memory for btree cache on socket
> %d",
> +		      socket);
> +		return -rte_errno;
> +	}
> +	bt->size = n;
> +	/* First entry must be NULL for binary search. */
> +	(*bt->table)[bt->len++] = (struct mr_cache_entry) {
> +		.lkey = UINT32_MAX,
> +	};
> +	DEBUG("initialized B-tree %p with table %p",
> +	      (void *)bt, (void *)bt->table);
> +	return 0;
> +}
> +
> +/**
> + * Free B-tree resources.
> + *
> + * @param bt
> + *   Pointer to B-tree structure.
> + */
> +void
> +mlx5_mr_btree_free(struct mlx5_mr_btree *bt)
> +{
> +	if (bt == NULL)
> +		return;
> +	DEBUG("freeing B-tree %p with table %p",
> +	      (void *)bt, (void *)bt->table);
> +	rte_free(bt->table);
> +	memset(bt, 0, sizeof(*bt));
> +}
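
Typical lifecycle for a per-queue bottom-half cache, roughly as the net
PMD does when setting up each Rx/Tx queue (the size and socket here are
illustrative):

	struct mlx5_mr_btree bt;

	if (mlx5_mr_btree_init(&bt, 256, SOCKET_ID_ANY) < 0)
		return -rte_errno;
	/* ... datapath lookups hit the linear top-half first and fall
	 * back to this B-tree on miss ... */
	mlx5_mr_btree_free(&bt);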
> +
> +/**
> + * Dump all the entries in a B-tree
> + *
> + * @param bt
> + *   Pointer to B-tree structure.
> + */
> +void
> +mlx5_mr_btree_dump(struct mlx5_mr_btree *bt __rte_unused)
> +{
> +#ifdef RTE_LIBRTE_MLX5_DEBUG
> +	int idx;
> +	struct mr_cache_entry *lkp_tbl;
> +
> +	if (bt == NULL)
> +		return;
> +	lkp_tbl = *bt->table;
> +	for (idx = 0; idx < bt->len; ++idx) {
> +		struct mr_cache_entry *entry = &lkp_tbl[idx];
> +
> +		DEBUG("B-tree(%p)[%u],"
> +		      " [0x%" PRIxPTR ", 0x%" PRIxPTR ") lkey=0x%x",
> +		      (void *)bt, idx, entry->start, entry->end, entry->lkey);
> +	}
> +#endif
> +}
> +
> +/**
> + * Find virtually contiguous memory chunk in a given MR.
> + *
> + * @param mr
> + *   Pointer to MR structure.
> + * @param[out] entry
> + *   Pointer to returning MR cache entry. If not found, this will not be
> + *   updated.
> + * @param base_idx
> + *   Start index of the memseg bitmap.
> + *
> + * @return
> + *   Next index to go on lookup.
> + */
> +static int
> +mr_find_next_chunk(struct mlx5_mr *mr, struct mr_cache_entry *entry,
> +		   int base_idx)
> +{
> +	uintptr_t start = 0;
> +	uintptr_t end = 0;
> +	uint32_t idx = 0;
> +
> +	/* MR for external memory doesn't have memseg list. */
> +	if (mr->msl == NULL) {
> +		struct ibv_mr *ibv_mr = mr->ibv_mr;
> +
> +		MLX5_ASSERT(mr->ms_bmp_n == 1);
> +		MLX5_ASSERT(mr->ms_n == 1);
> +		MLX5_ASSERT(base_idx == 0);
> +		/*
> +		 * Can't search it from memseg list but get it directly from
> +		 * verbs MR as there's only one chunk.
> +		 */
> +		entry->start = (uintptr_t)ibv_mr->addr;
> +		entry->end = (uintptr_t)ibv_mr->addr + mr->ibv_mr->length;
> +		entry->lkey = rte_cpu_to_be_32(mr->ibv_mr->lkey);
> +		/* Returning 1 ends iteration. */
> +		return 1;
> +	}
> +	for (idx = base_idx; idx < mr->ms_bmp_n; ++idx) {
> +		if (rte_bitmap_get(mr->ms_bmp, idx)) {
> +			const struct rte_memseg_list *msl;
> +			const struct rte_memseg *ms;
> +
> +			msl = mr->msl;
> +			ms = rte_fbarray_get(&msl->memseg_arr,
> +					     mr->ms_base_idx + idx);
> +			MLX5_ASSERT(msl->page_sz == ms->hugepage_sz);
> +			if (!start)
> +				start = ms->addr_64;
> +			end = ms->addr_64 + ms->hugepage_sz;
> +		} else if (start) {
> +			/* Passed the end of a fragment. */
> +			break;
> +		}
> +	}
> +	if (start) {
> +		/* Found one chunk. */
> +		entry->start = start;
> +		entry->end = end;
> +		entry->lkey = rte_cpu_to_be_32(mr->ibv_mr->lkey);
> +	}
> +	return idx;
> +}
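
The bitmap scan above reports each maximal run of set bits as one
[start, end) chunk. A standalone model of the same walk, with
illustrative names and a plain byte array instead of rte_bitmap:

	#include <stdio.h>

	static int
	find_next_run(const unsigned char *bmp, int nbits, int base,
		      int *run_start, int *run_end)
	{
		int i, start = -1;

		for (i = base; i < nbits; ++i) {
			if (bmp[i]) {
				if (start < 0)
					start = i;
			} else if (start >= 0) {
				break; /* passed the end of a fragment */
			}
		}
		if (start >= 0) {
			*run_start = start;
			*run_end = i;
		}
		return i; /* next index to resume the walk from */
	}

	int
	main(void)
	{
		const unsigned char bmp[] = { 1, 1, 0, 1, 0, 0, 1, 1 };
		int n = 0, s, e;

		while (n < 8) {
			s = e = -1;
			n = find_next_run(bmp, 8, n, &s, &e);
			if (s >= 0)
				printf("chunk [%d, %d)\n", s, e);
		}
		return 0;
	}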
> +
> +/**
> + * Insert a MR to the global B-tree cache. It may fail due to low-on-memory.
> + * Then, this entry will have to be searched by mr_lookup_list() in
> + * mlx5_mr_create() on miss.
> + *
> + * @param share_cache
> + *   Pointer to a global shared MR cache.
> + * @param mr
> + *   Pointer to MR to insert.
> + *
> + * @return
> + *   0 on success, -1 on failure.
> + */
> +int
> +mlx5_mr_insert_cache(struct mlx5_mr_share_cache *share_cache,
> +		     struct mlx5_mr *mr)
> +{
> +	unsigned int n;
> +
> +	DRV_LOG(DEBUG, "Inserting MR(%p) to global cache(%p)",
> +		(void *)mr, (void *)share_cache);
> +	for (n = 0; n < mr->ms_bmp_n; ) {
> +		struct mr_cache_entry entry;
> +
> +		memset(&entry, 0, sizeof(entry));
> +		/* Find a contiguous chunk and advance the index. */
> +		n = mr_find_next_chunk(mr, &entry, n);
> +		if (!entry.end)
> +			break;
> +		if (mr_btree_insert(&share_cache->cache, &entry) < 0) {
> +			/*
> +			 * Overflowed, but the global table cannot be
> expanded
> +			 * because of deadlock.
> +			 */
> +			return -1;
> +		}
> +	}
> +	return 0;
> +}
> +
> +/**
> + * Look up address in the original global MR list.
> + *
> + * @param share_cache
> + *   Pointer to a global shared MR cache.
> + * @param[out] entry
> + *   Pointer to returning MR cache entry. If no match, this will not be
> updated.
> + * @param addr
> + *   Search key.
> + *
> + * @return
> + *   Found MR on match, NULL otherwise.
> + */
> +struct mlx5_mr *
> +mlx5_mr_lookup_list(struct mlx5_mr_share_cache *share_cache,
> +		    struct mr_cache_entry *entry, uintptr_t addr)
> +{
> +	struct mlx5_mr *mr;
> +
> +	/* Iterate all the existing MRs. */
> +	LIST_FOREACH(mr, &share_cache->mr_list, mr) {
> +		unsigned int n;
> +
> +		if (mr->ms_n == 0)
> +			continue;
> +		for (n = 0; n < mr->ms_bmp_n; ) {
> +			struct mr_cache_entry ret;
> +
> +			memset(&ret, 0, sizeof(ret));
> +			n = mr_find_next_chunk(mr, &ret, n);
> +			if (addr >= ret.start && addr < ret.end) {
> +				/* Found. */
> +				*entry = ret;
> +				return mr;
> +			}
> +		}
> +	}
> +	return NULL;
> +}
> +
> +/**
> + * Look up address on global MR cache.
> + *
> + * @param share_cache
> + *   Pointer to a global shared MR cache.
> + * @param[out] entry
> + *   Pointer to returning MR cache entry. If no match, this will not be
> updated.
> + * @param addr
> + *   Search key.
> + *
> + * @return
> + *   Searched LKey on success, UINT32_MAX on failure and rte_errno is set.
> + */
> +uint32_t
> +mlx5_mr_lookup_cache(struct mlx5_mr_share_cache *share_cache,
> +		     struct mr_cache_entry *entry, uintptr_t addr)
> +{
> +	uint16_t idx;
> +	uint32_t lkey = UINT32_MAX;
> +	struct mlx5_mr *mr;
> +
> +	/*
> +	 * If the global cache has overflowed since it failed to expand the
> +	 * B-tree table, it can't have all the existing MRs. Then, the address
> +	 * has to be searched by traversing the original MR list instead, which
> +	 * is very slow path. Otherwise, the global cache is all inclusive.
> +	 */
> +	if (!unlikely(share_cache->cache.overflow)) {
> +		lkey = mr_btree_lookup(&share_cache->cache, &idx, addr);
> +		if (lkey != UINT32_MAX)
> +			*entry = (*share_cache->cache.table)[idx];
> +	} else {
> +		/* Falling back to the slowest path. */
> +		mr = mlx5_mr_lookup_list(share_cache, entry, addr);
> +		if (mr != NULL)
> +			lkey = entry->lkey;
> +	}
> +	MLX5_ASSERT(lkey == UINT32_MAX || (addr >= entry->start &&
> +					   addr < entry->end));
> +	return lkey;
> +}
> +
> +/**
> + * Free MR resources. MR lock must not be held to avoid a deadlock.
> rte_free()
> + * can raise memory free event and the callback function will spin on the
> lock.
> + *
> + * @param mr
> + *   Pointer to MR to free.
> + */
> +static void
> +mr_free(struct mlx5_mr *mr)
> +{
> +	if (mr == NULL)
> +		return;
> +	DRV_LOG(DEBUG, "freeing MR(%p):", (void *)mr);
> +	if (mr->ibv_mr != NULL)
> +		claim_zero(mlx5_glue->dereg_mr(mr->ibv_mr));
> +	if (mr->ms_bmp != NULL)
> +		rte_bitmap_free(mr->ms_bmp);
> +	rte_free(mr);
> +}
> +
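> +/**
> + * Rebuild the global B-tree cache of device from the original MR list.
> + *
> + * @param share_cache
> + *   Pointer to a global shared MR cache.
> + */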
> +void
> +mlx5_mr_rebuild_cache(struct mlx5_mr_share_cache *share_cache)
> +{
> +	struct mlx5_mr *mr;
> +
> +	DRV_LOG(DEBUG, "Rebuild dev cache[] %p", (void *)share_cache);
> +	/* Flush cache to rebuild. */
> +	share_cache->cache.len = 1;
> +	share_cache->cache.overflow = 0;
> +	/* Iterate all the existing MRs. */
> +	LIST_FOREACH(mr, &share_cache->mr_list, mr)
> +		if (mlx5_mr_insert_cache(share_cache, mr) < 0)
> +			return;
> +}
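
For context, the expected caller here is the memory free event callback;
a condensed sketch of that sequence (mirroring the existing net PMD code,
not introduced by this hunk):

	rte_rwlock_write_lock(&share_cache->rwlock);
	/* ... clear freed memsegs from the owning MR bitmaps ... */
	mlx5_mr_rebuild_cache(share_cache);
	/*
	 * Flush per-queue caches lazily: bump the generation number and
	 * publish it with a write barrier so that datapath lookups notice
	 * it before they can see re-allocated memory.
	 */
	++share_cache->dev_gen;
	rte_smp_wmb();
	rte_rwlock_write_unlock(&share_cache->rwlock);
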
> +
> +/**
> + * Release resources of detached MR having no online entry.
> + *
> + * @param share_cache
> + *   Pointer to a global shared MR cache.
> + */
> +static void
> +mlx5_mr_garbage_collect(struct mlx5_mr_share_cache *share_cache)
> +{
> +	struct mlx5_mr *mr_next;
> +	struct mlx5_mr_list free_list = LIST_HEAD_INITIALIZER(free_list);
> +
> +	/* Must be called from the primary process. */
> +	MLX5_ASSERT(rte_eal_process_type() == RTE_PROC_PRIMARY);
> +	/*
> +	 * MR can't be freed with holding the lock because rte_free() could call
> +	 * memory free callback function. This will be a deadlock situation.
> +	 */
> +	rte_rwlock_write_lock(&share_cache->rwlock);
> +	/* Detach the whole free list and release it after unlocking. */
> +	free_list = share_cache->mr_free_list;
> +	LIST_INIT(&share_cache->mr_free_list);
> +	rte_rwlock_write_unlock(&share_cache->rwlock);
> +	/* Release resources. */
> +	mr_next = LIST_FIRST(&free_list);
> +	while (mr_next != NULL) {
> +		struct mlx5_mr *mr = mr_next;
> +
> +		mr_next = LIST_NEXT(mr, mr);
> +		mr_free(mr);
> +	}
> +}
> +
> +/* Called during rte_memseg_contig_walk() by mlx5_mr_create(). */
> +static int
> +mr_find_contig_memsegs_cb(const struct rte_memseg_list *msl,
> +			  const struct rte_memseg *ms, size_t len, void *arg)
> +{
> +	struct mr_find_contig_memsegs_data *data = arg;
> +
> +	if (data->addr < ms->addr_64 || data->addr >= ms->addr_64 + len)
> +		return 0;
> +	/* Found, save it and stop walking. */
> +	data->start = ms->addr_64;
> +	data->end = ms->addr_64 + len;
> +	data->msl = msl;
> +	return 1;
> +}
> +
> +/**
> + * Create a new global Memory Region (MR) for a missing virtual address.
> + * This API should be called on a secondary process, then a request is sent to
> + * the primary process in order to create a MR for the address. As the global MR
> + * list is on the shared memory, following LKey lookup should succeed unless the
> + * request fails.
> + *
> + * @param pd
> + *   Pointer to ibv_pd of a device (net, regex, vdpa,...).
> + * @param mp_id
> + *   Multi-process identifier used to reach the primary process over IPC.
> + * @param share_cache
> + *   Pointer to a global shared MR cache.
> + * @param[out] entry
> + *   Pointer to returning MR cache entry, found in the global cache or newly
> + *   created. If failed to create one, this will not be updated.
> + * @param addr
> + *   Target virtual address to register.
> + * @param mr_ext_memseg_en
> + *   Configurable flag about external memory segment enable or not.
> + *
> + * @return
> + *   Searched LKey on success, UINT32_MAX on failure and rte_errno is set.
> + */
> +static uint32_t
> +mlx5_mr_create_secondary(struct ibv_pd *pd __rte_unused,
> +			 struct mlx5_mp_id *mp_id,
> +			 struct mlx5_mr_share_cache *share_cache,
> +			 struct mr_cache_entry *entry, uintptr_t addr,
> +			 unsigned int mr_ext_memseg_en __rte_unused)
> +{
> +	int ret;
> +
> +	DEBUG("port %u requesting MR creation for address (%p)",
> +	      mp_id->port_id, (void *)addr);
> +	ret = mlx5_mp_req_mr_create(mp_id, addr);
> +	if (ret) {
> +		DEBUG("Fail to request MR creation for address (%p)",
> +		      (void *)addr);
> +		return UINT32_MAX;
> +	}
> +	rte_rwlock_read_lock(&share_cache->rwlock);
> +	/* Fill in output data. */
> +	mlx5_mr_lookup_cache(share_cache, entry, addr);
> +	/* Lookup can't fail. */
> +	MLX5_ASSERT(entry->lkey != UINT32_MAX);
> +	rte_rwlock_read_unlock(&share_cache->rwlock);
> +	DEBUG("MR CREATED by primary process for %p:\n"
> +	      "  [0x%" PRIxPTR ", 0x%" PRIxPTR "), lkey=0x%x",
> +	      (void *)addr, entry->start, entry->end, entry->lkey);
> +	return entry->lkey;
> +}
> +
> +/**
> + * Create a new global Memory Region (MR) for a missing virtual address.
> + * Register entire virtually contiguous memory chunk around the address.
> + *
> + * @param pd
> + *   Pointer to ibv_pd of a device (net, regex, vdpa,...).
> + * @param share_cache
> + *   Pointer to a global shared MR cache.
> + * @param[out] entry
> + *   Pointer to returning MR cache entry, found in the global cache or newly
> + *   created. If failed to create one, this will not be updated.
> + * @param addr
> + *   Target virtual address to register.
> + * @param mr_ext_memseg_en
> + *   Configurable flag about external memory segment enable or not.
> + *
> + * @return
> + *   Searched LKey on success, UINT32_MAX on failure and rte_errno is set.
> + */
> +uint32_t
> +mlx5_mr_create_primary(struct ibv_pd *pd,
> +		       struct mlx5_mr_share_cache *share_cache,
> +		       struct mr_cache_entry *entry, uintptr_t addr,
> +		       unsigned int mr_ext_memseg_en)
> +{
> +	struct mr_find_contig_memsegs_data data = {.addr = addr, };
> +	struct mr_find_contig_memsegs_data data_re;
> +	const struct rte_memseg_list *msl;
> +	const struct rte_memseg *ms;
> +	struct mlx5_mr *mr = NULL;
> +	int ms_idx_shift = -1;
> +	uint32_t bmp_size;
> +	void *bmp_mem;
> +	uint32_t ms_n;
> +	uint32_t n;
> +	size_t len;
> +
> +	DRV_LOG(DEBUG, "Creating a MR using address (%p)", (void *)addr);
> +	/*
> +	 * Release detached MRs if any. This can't be called with holding either
> +	 * memory_hotplug_lock or share_cache->rwlock. MRs on the free list have
> +	 * been detached by the memory free event but it couldn't be released
> +	 * inside the callback due to deadlock. As a result, releasing resources
> +	 * is quite opportunistic.
> +	 */
> +	mlx5_mr_garbage_collect(share_cache);
> +	/*
> +	 * If enabled, find out a contiguous virtual address chunk in use, to
> +	 * which the given address belongs, in order to register maximum range.
> +	 * In the best case where mempools are not dynamically recreated and
> +	 * '--socket-mem' is specified as an EAL option, it is very likely to
> +	 * have only one MR(LKey) per a socket and per a hugepage-size even
> +	 * though the system memory is highly fragmented. As the whole memory
> +	 * chunk will be pinned by kernel, it can't be reused unless entire
> +	 * chunk is freed from EAL.
> +	 *
> +	 * If disabled, just register one memseg (page). Then, memory
> +	 * consumption will be minimized but it may drop performance if there
> +	 * are many MRs to lookup on the datapath.
> +	 */
> +	if (!mr_ext_memseg_en) {
> +		data.msl = rte_mem_virt2memseg_list((void *)addr);
> +		data.start = RTE_ALIGN_FLOOR(addr, data.msl->page_sz);
> +		data.end = data.start + data.msl->page_sz;
> +	} else if (!rte_memseg_contig_walk(mr_find_contig_memsegs_cb, &data)) {
> +		DRV_LOG(WARNING,
> +			"Unable to find virtually contiguous"
> +			" chunk for address (%p)."
> +			" rte_memseg_contig_walk() failed.", (void *)addr);
> +		rte_errno = ENXIO;
> +		goto err_nolock;
> +	}
> +alloc_resources:
> +	/* Addresses must be page-aligned. */
> +	MLX5_ASSERT(data.msl);
> +	MLX5_ASSERT(rte_is_aligned((void *)data.start, data.msl->page_sz));
> +	MLX5_ASSERT(rte_is_aligned((void *)data.end, data.msl->page_sz));
> +	msl = data.msl;
> +	ms = rte_mem_virt2memseg((void *)data.start, msl);
> +	len = data.end - data.start;
> +	MLX5_ASSERT(ms);
> +	MLX5_ASSERT(msl->page_sz == ms->hugepage_sz);
> +	/* Number of memsegs in the range. */
> +	ms_n = len / msl->page_sz;
> +	DEBUG("Extending %p to [0x%" PRIxPTR ", 0x%" PRIxPTR "),"
> +	      " page_sz=0x%" PRIx64 ", ms_n=%u",
> +	      (void *)addr, data.start, data.end, msl->page_sz, ms_n);
> +	/* Size of memory for bitmap. */
> +	bmp_size = rte_bitmap_get_memory_footprint(ms_n);
> +	mr = rte_zmalloc_socket(NULL,
> +				RTE_ALIGN_CEIL(sizeof(*mr),
> +					       RTE_CACHE_LINE_SIZE) +
> +				bmp_size,
> +				RTE_CACHE_LINE_SIZE, msl->socket_id);
> +	if (mr == NULL) {
> +		DEBUG("Unable to allocate memory for a new MR of"
> +		      " address (%p).", (void *)addr);
> +		rte_errno = ENOMEM;
> +		goto err_nolock;
> +	}
> +	mr->msl = msl;
> +	/*
> +	 * Save the index of the first memseg and initialize memseg bitmap. To
> +	 * see if a memseg of ms_idx in the memseg-list is still valid, check:
> +	 *	rte_bitmap_get(mr->bmp, ms_idx - mr->ms_base_idx)
> +	 */
> +	mr->ms_base_idx = rte_fbarray_find_idx(&msl->memseg_arr, ms);
> +	bmp_mem = RTE_PTR_ALIGN_CEIL(mr + 1, RTE_CACHE_LINE_SIZE);
> +	mr->ms_bmp = rte_bitmap_init(ms_n, bmp_mem, bmp_size);
> +	if (mr->ms_bmp == NULL) {
> +		DEBUG("Unable to initialize bitmap for a new MR of"
> +		      " address (%p).", (void *)addr);
> +		rte_errno = EINVAL;
> +		goto err_nolock;
> +	}
> +	/*
> +	 * Should recheck whether the extended contiguous chunk is still valid.
> +	 * Because memory_hotplug_lock can't be held if there's any memory
> +	 * related calls in a critical path, resource allocation above can't be
> +	 * locked. If the memory has been changed at this point, try again with
> +	 * just single page. If not, go on with the big chunk atomically from
> +	 * here.
> +	 */
> +	rte_mcfg_mem_read_lock();
> +	data_re = data;
> +	if (len > msl->page_sz &&
> +	    !rte_memseg_contig_walk(mr_find_contig_memsegs_cb, &data_re)) {
> +		DEBUG("Unable to find virtually contiguous"
> +		      " chunk for address (%p)."
> +		      " rte_memseg_contig_walk() failed.", (void *)addr);
> +		rte_errno = ENXIO;
> +		goto err_memlock;
> +	}
> +	if (data.start != data_re.start || data.end != data_re.end) {
> +		/*
> +		 * The extended contiguous chunk has been changed. Try again
> +		 * with single memseg instead.
> +		 */
> +		data.start = RTE_ALIGN_FLOOR(addr, msl->page_sz);
> +		data.end = data.start + msl->page_sz;
> +		rte_mcfg_mem_read_unlock();
> +		mr_free(mr);
> +		goto alloc_resources;
> +	}
> +	MLX5_ASSERT(data.msl == data_re.msl);
> +	rte_rwlock_write_lock(&share_cache->rwlock);
> +	/*
> +	 * Check the address is really missing. If other thread already created
> +	 * one or it is not found due to overflow, abort and return.
> +	 */
> +	if (mlx5_mr_lookup_cache(share_cache, entry, addr) != UINT32_MAX) {
> +		/*
> +		 * Insert to the global cache table. It may fail due to
> +		 * low-on-memory. Then, this entry will have to be searched
> +		 * here again.
> +		 */
> +		mr_btree_insert(&share_cache->cache, entry);
> +		DEBUG("Found MR for %p on final lookup, abort", (void
> *)addr);
> +		rte_rwlock_write_unlock(&share_cache->rwlock);
> +		rte_mcfg_mem_read_unlock();
> +		/*
> +		 * Must be unlocked before calling rte_free() because
> +		 * mlx5_mr_mem_event_free_cb() can be called inside.
> +		 */
> +		mr_free(mr);
> +		return entry->lkey;
> +	}
> +	/*
> +	 * Trim start and end addresses for verbs MR. Set bits for registering
> +	 * memsegs but exclude already registered ones. Bitmap can be
> +	 * fragmented.
> +	 */
> +	for (n = 0; n < ms_n; ++n) {
> +		uintptr_t start;
> +		struct mr_cache_entry ret;
> +
> +		memset(&ret, 0, sizeof(ret));
> +		start = data_re.start + n * msl->page_sz;
> +		/* Exclude memsegs already registered by other MRs. */
> +		if (mlx5_mr_lookup_cache(share_cache, &ret, start) ==
> +		    UINT32_MAX) {
> +			/*
> +			 * Start from the first unregistered memseg in the
> +			 * extended range.
> +			 */
> +			if (ms_idx_shift == -1) {
> +				mr->ms_base_idx += n;
> +				data.start = start;
> +				ms_idx_shift = n;
> +			}
> +			data.end = start + msl->page_sz;
> +			rte_bitmap_set(mr->ms_bmp, n - ms_idx_shift);
> +			++mr->ms_n;
> +		}
> +	}
> +	len = data.end - data.start;
> +	mr->ms_bmp_n = len / msl->page_sz;
> +	MLX5_ASSERT(ms_idx_shift + mr->ms_bmp_n <= ms_n);
> +	/*
> +	 * Finally create a verbs MR for the memory chunk. ibv_reg_mr() can be
> +	 * called with holding the memory lock because it doesn't use
> +	 * mlx5_alloc_buf_extern() which eventually calls rte_malloc_socket()
> +	 * through mlx5_alloc_verbs_buf().
> +	 */
> +	mr->ibv_mr = mlx5_glue->reg_mr(pd, (void *)data.start, len,
> +				       IBV_ACCESS_LOCAL_WRITE |
> +					   IBV_ACCESS_RELAXED_ORDERING);
> +	if (mr->ibv_mr == NULL) {
> +		DEBUG("Fail to create a verbs MR for address (%p)",
> +		      (void *)addr);
> +		rte_errno = EINVAL;
> +		goto err_mrlock;
> +	}
> +	MLX5_ASSERT((uintptr_t)mr->ibv_mr->addr == data.start);
> +	MLX5_ASSERT(mr->ibv_mr->length == len);
> +	LIST_INSERT_HEAD(&share_cache->mr_list, mr, mr);
> +	DEBUG("MR CREATED (%p) for %p:\n"
> +	      "  [0x%" PRIxPTR ", 0x%" PRIxPTR "),"
> +	      " lkey=0x%x base_idx=%u ms_n=%u, ms_bmp_n=%u",
> +	      (void *)mr, (void *)addr, data.start, data.end,
> +	      rte_cpu_to_be_32(mr->ibv_mr->lkey),
> +	      mr->ms_base_idx, mr->ms_n, mr->ms_bmp_n);
> +	/* Insert to the global cache table. */
> +	mlx5_mr_insert_cache(share_cache, mr);
> +	/* Fill in output data. */
> +	mlx5_mr_lookup_cache(share_cache, entry, addr);
> +	/* Lookup can't fail. */
> +	MLX5_ASSERT(entry->lkey != UINT32_MAX);
> +	rte_rwlock_write_unlock(&share_cache->rwlock);
> +	rte_mcfg_mem_read_unlock();
> +	return entry->lkey;
> +err_mrlock:
> +	rte_rwlock_write_unlock(&share_cache->rwlock);
> +err_memlock:
> +	rte_mcfg_mem_read_unlock();
> +err_nolock:
> +	/*
> +	 * In case of error, as this can be called in a datapath, a warning
> +	 * message per an error is preferable instead. Must be unlocked before
> +	 * calling rte_free() because mlx5_mr_mem_event_free_cb() can be called
> +	 * inside.
> +	 */
> +	mr_free(mr);
> +	return UINT32_MAX;
> +}
> +
> +/**
> + * Create a new global Memory Region (MR) for a missing virtual address.
> + * This can be called from primary and secondary process.
> + *
> + * @param pd
> + *   Pointer to ibv_pd of a device (net, regex, vdpa,...).
> + * @param mp_id
> + *   Multi-process identifier used to reach the primary process over IPC.
> + * @param share_cache
> + *   Pointer to a global shared MR cache.
> + * @param[out] entry
> + *   Pointer to returning MR cache entry, found in the global cache or newly
> + *   created. If failed to create one, this will not be updated.
> + * @param addr
> + *   Target virtual address to register.
> + * @param mr_ext_memseg_en
> + *   Configurable flag about external memory segment enable or not.
> + *
> + * @return
> + *   Searched LKey on success, UINT32_MAX on failure and rte_errno is set.
> + */
> +static uint32_t
> +mlx5_mr_create(struct ibv_pd *pd, struct mlx5_mp_id *mp_id,
> +	       struct mlx5_mr_share_cache *share_cache,
> +	       struct mr_cache_entry *entry, uintptr_t addr,
> +	       unsigned int mr_ext_memseg_en)
> +{
> +	uint32_t ret = 0;
> +
> +	switch (rte_eal_process_type()) {
> +	case RTE_PROC_PRIMARY:
> +		ret = mlx5_mr_create_primary(pd, share_cache, entry,
> +					     addr, mr_ext_memseg_en);
> +		break;
> +	case RTE_PROC_SECONDARY:
> +		ret = mlx5_mr_create_secondary(pd, mp_id, share_cache, entry,
> +					       addr, mr_ext_memseg_en);
> +		break;
> +	default:
> +		break;
> +	}
> +	return ret;
> +}
> +
> +/**
> + * Look up address in the global MR cache table. If not found, create a new
> MR.
> + * Insert the found/created entry to local bottom-half cache table.
> + *
> + * @param pd
> + *   Pointer to ibv_pd of a device (net, regex, vdpa,...).
> + * @param mp_id
> + *   Multi-process identifier used to reach the primary process over IPC.
> + * @param share_cache
> + *   Pointer to a global shared MR cache.
> + * @param mr_ctrl
> + *   Pointer to per-queue MR control structure.
> + * @param[out] entry
> + *   Pointer to returning MR cache entry, found in the global cache or newly
> + *   created. If failed to create one, this is not written.
> + * @param addr
> + *   Search key.
> + * @param mr_ext_memseg_en
> + *   Configurable flag about external memory segment enable or not.
> + *
> + * @return
> + *   Searched LKey on success, UINT32_MAX on no match.
> + */
> +static uint32_t
> +mr_lookup_caches(struct ibv_pd *pd, struct mlx5_mp_id *mp_id,
> +		 struct mlx5_mr_share_cache *share_cache,
> +		 struct mlx5_mr_ctrl *mr_ctrl,
> +		 struct mr_cache_entry *entry, uintptr_t addr,
> +		 unsigned int mr_ext_memseg_en)
> +{
> +	struct mlx5_mr_btree *bt = &mr_ctrl->cache_bh;
> +	uint32_t lkey;
> +	uint16_t idx;
> +
> +	/* If local cache table is full, try to double it. */
> +	if (unlikely(bt->len == bt->size))
> +		mr_btree_expand(bt, bt->size << 1);
> +	/* Look up in the global cache. */
> +	rte_rwlock_read_lock(&share_cache->rwlock);
> +	lkey = mr_btree_lookup(&share_cache->cache, &idx, addr);
> +	if (lkey != UINT32_MAX) {
> +		/* Found. */
> +		*entry = (*share_cache->cache.table)[idx];
> +		rte_rwlock_read_unlock(&share_cache->rwlock);
> +		/*
> +		 * Update local cache. Even if it fails, return the found entry
> +		 * to update top-half cache. Next time, this entry will be found
> +		 * in the global cache.
> +		 */
> +		mr_btree_insert(bt, entry);
> +		return lkey;
> +	}
> +	rte_rwlock_read_unlock(&share_cache->rwlock);
> +	/* First time to see the address? Create a new MR. */
> +	lkey = mlx5_mr_create(pd, mp_id, share_cache, entry, addr,
> +			      mr_ext_memseg_en);
> +	/*
> +	 * Update the local cache if successfully created a new global MR. Even
> +	 * if failed to create one, there's no action to take in this datapath
> +	 * code. As returning LKey is invalid, this will eventually make HW
> +	 * fail.
> +	 */
> +	if (lkey != UINT32_MAX)
> +		mr_btree_insert(bt, entry);
> +	return lkey;
> +}
> +
> +/**
> + * Bottom-half of LKey search on datapath. First search in cache_bh[] and if
> + * misses, search in the global MR cache table and update the new entry to
> + * per-queue local caches.
> + *
> + * @param pd
> + *   Pointer to ibv_pd of a device (net, regex, vdpa,...).
> + * @param mp_id
> + *   Multi-process identifier used to reach the primary process over IPC.
> + * @param share_cache
> + *   Pointer to a global shared MR cache.
> + * @param mr_ctrl
> + *   Pointer to per-queue MR control structure.
> + * @param addr
> + *   Search key.
> + * @param mr_ext_memseg_en
> + *   Configurable flag about external memory segment enable or not.
> + *
> + * @return
> + *   Searched LKey on success, UINT32_MAX on no match.
> + */
> +uint32_t mlx5_mr_addr2mr_bh(struct ibv_pd *pd, struct mlx5_mp_id *mp_id,
> +			    struct mlx5_mr_share_cache *share_cache,
> +			    struct mlx5_mr_ctrl *mr_ctrl,
> +			    uintptr_t addr, unsigned int mr_ext_memseg_en)
> +{
> +	uint32_t lkey;
> +	uint16_t bh_idx = 0;
> +	/* Victim in top-half cache to replace with new entry. */
> +	struct mr_cache_entry *repl = &mr_ctrl->cache[mr_ctrl->head];
> +
> +	/* Binary-search MR translation table. */
> +	lkey = mr_btree_lookup(&mr_ctrl->cache_bh, &bh_idx, addr);
> +	/* Update top-half cache. */
> +	if (likely(lkey != UINT32_MAX)) {
> +		*repl = (*mr_ctrl->cache_bh.table)[bh_idx];
> +	} else {
> +		/*
> +		 * If missed in local lookup table, search in the global cache
> +		 * and local cache_bh[] will be updated inside if possible.
> +		 * Top-half cache entry will also be updated.
> +		 */
> +		lkey = mr_lookup_caches(pd, mp_id, share_cache, mr_ctrl,
> +					repl, addr, mr_ext_memseg_en);
> +		if (unlikely(lkey == UINT32_MAX))
> +			return UINT32_MAX;
> +	}
> +	/* Update the most recently used entry. */
> +	mr_ctrl->mru = mr_ctrl->head;
> +	/* Point to the next victim, the oldest. */
> +	mr_ctrl->head = (mr_ctrl->head + 1) % MLX5_MR_CACHE_N;
> +	return lkey;
> +}
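
Combined with mlx5_mr_lookup_lkey() from the new header further down, a
per-queue datapath lookup is expected to look roughly like the sketch
below; the wrapper name and the hard-coded mr_ext_memseg_en value are
mine, not part of the patch:

static __rte_always_inline uint32_t
queue_addr2mr(struct ibv_pd *pd, struct mlx5_mp_id *mp_id,
	      struct mlx5_mr_share_cache *share_cache,
	      struct mlx5_mr_ctrl *mr_ctrl, uintptr_t addr)
{
	uint32_t lkey;

	/* Flush local caches if the global cache has been invalidated. */
	if (unlikely(*mr_ctrl->dev_gen_ptr != mr_ctrl->cur_gen))
		mlx5_mr_flush_local_cache(mr_ctrl);
	/* Linear search in the per-queue top-half cache first. */
	lkey = mlx5_mr_lookup_lkey(mr_ctrl->cache, &mr_ctrl->mru,
				   MLX5_MR_CACHE_N, addr);
	if (likely(lkey != UINT32_MAX))
		return lkey;
	/* Miss: take the bottom-half slow path, creating a MR if needed. */
	return mlx5_mr_addr2mr_bh(pd, mp_id, share_cache, mr_ctrl, addr, 1);
}
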
> +
> +/**
> + * Release all the created MRs and resources on global MR cache of a device.
> + *
> + * @param share_cache
> + *   Pointer to a global shared MR cache.
> + */
> +void
> +mlx5_mr_release_cache(struct mlx5_mr_share_cache *share_cache)
> +{
> +	struct mlx5_mr *mr_next;
> +
> +	rte_rwlock_write_lock(&share_cache->rwlock);
> +	/* Detach from MR list and move to free list. */
> +	mr_next = LIST_FIRST(&share_cache->mr_list);
> +	while (mr_next != NULL) {
> +		struct mlx5_mr *mr = mr_next;
> +
> +		mr_next = LIST_NEXT(mr, mr);
> +		LIST_REMOVE(mr, mr);
> +		LIST_INSERT_HEAD(&share_cache->mr_free_list, mr, mr);
> +	}
> +	LIST_INIT(&share_cache->mr_list);
> +	/* Free global cache. */
> +	mlx5_mr_btree_free(&share_cache->cache);
> +	rte_rwlock_write_unlock(&share_cache->rwlock);
> +	/* Free all remaining MRs. */
> +	mlx5_mr_garbage_collect(share_cache);
> +}
> +
> +/**
> + * Flush all of the local cache entries.
> + *
> + * @param mr_ctrl
> + *   Pointer to per-queue MR local cache.
> + */
> +void
> +mlx5_mr_flush_local_cache(struct mlx5_mr_ctrl *mr_ctrl)
> +{
> +	/* Reset the most-recently-used index. */
> +	mr_ctrl->mru = 0;
> +	/* Reset the linear search array. */
> +	mr_ctrl->head = 0;
> +	memset(mr_ctrl->cache, 0, sizeof(mr_ctrl->cache));
> +	/* Reset the B-tree table. */
> +	mr_ctrl->cache_bh.len = 1;
> +	mr_ctrl->cache_bh.overflow = 0;
> +	/* Update the generation number. */
> +	mr_ctrl->cur_gen = *mr_ctrl->dev_gen_ptr;
> +	DRV_LOG(DEBUG, "mr_ctrl(%p): flushed, cur_gen=%d",
> +		(void *)mr_ctrl, mr_ctrl->cur_gen);
> +}
> +
> +/**
> + * Creates a memory region for external memory, that is memory which is not
> + * part of the DPDK memory segments.
> + *
> + * @param pd
> + *   Pointer to ibv_pd of a device (net, regex, vdpa,...).
> + * @param addr
> + *   Starting virtual address of memory.
> + * @param len
> + *   Length of memory segment being mapped.
> + * @param socket_id
> + *   Socket to allocate heap memory for the control structures.
> + *
> + * @return
> + *   Pointer to MR structure on success, NULL otherwise.
> + */
> +struct mlx5_mr *
> +mlx5_create_mr_ext(struct ibv_pd *pd, uintptr_t addr, size_t len,
> +		   int socket_id)
> +{
> +	struct mlx5_mr *mr = NULL;
> +
> +	mr = rte_zmalloc_socket(NULL,
> +				RTE_ALIGN_CEIL(sizeof(*mr),
> +					       RTE_CACHE_LINE_SIZE),
> +				RTE_CACHE_LINE_SIZE, socket_id);
> +	if (mr == NULL)
> +		return NULL;
> +	mr->ibv_mr = mlx5_glue->reg_mr(pd, (void *)addr, len,
> +				       IBV_ACCESS_LOCAL_WRITE |
> +					   IBV_ACCESS_RELAXED_ORDERING);
> +	if (mr->ibv_mr == NULL) {
> +		DRV_LOG(WARNING,
> +			"Fail to create a verbs MR for address (%p)",
> +			(void *)addr);
> +		rte_free(mr);
> +		return NULL;
> +	}
> +	mr->msl = NULL; /* Mark it is external memory. */
> +	mr->ms_bmp = NULL;
> +	mr->ms_n = 1;
> +	mr->ms_bmp_n = 1;
> +	DRV_LOG(DEBUG,
> +		"MR CREATED (%p) for external memory %p:\n"
> +		"  [0x%" PRIxPTR ", 0x%" PRIxPTR "),"
> +		" lkey=0x%x base_idx=%u ms_n=%u, ms_bmp_n=%u",
> +		(void *)mr, (void *)addr,
> +		addr, addr + len, rte_cpu_to_be_32(mr->ibv_mr->lkey),
> +		mr->ms_base_idx, mr->ms_n, mr->ms_bmp_n);
> +	return mr;
> +}
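
A usage sketch, modeled on the net PMD DMA map hook (the function name is
mine): the returned MR still has to be linked into the shared cache by the
caller, under the write lock:

static int
register_ext_buf(struct ibv_pd *pd, struct mlx5_mr_share_cache *share_cache,
		 void *buf, size_t len, int socket)
{
	struct mlx5_mr *mr;

	mr = mlx5_create_mr_ext(pd, (uintptr_t)buf, len, socket);
	if (mr == NULL) {
		rte_errno = EINVAL;
		return -1;
	}
	rte_rwlock_write_lock(&share_cache->rwlock);
	LIST_INSERT_HEAD(&share_cache->mr_list, mr, mr);
	/* Insert to the global cache table. */
	mlx5_mr_insert_cache(share_cache, mr);
	rte_rwlock_write_unlock(&share_cache->rwlock);
	return 0;
}
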
> +
> +/**
> + * Dump all the created MRs and the global cache entries.
> + *
> + * @param share_cache
> + *   Pointer to a global shared MR cache.
> + */
> +void
> +mlx5_mr_dump_cache(struct mlx5_mr_share_cache *share_cache __rte_unused)
> +{
> +#ifdef RTE_LIBRTE_MLX5_DEBUG
> +	struct mlx5_mr *mr;
> +	int mr_n = 0;
> +	int chunk_n = 0;
> +
> +	rte_rwlock_read_lock(&share_cache->rwlock);
> +	/* Iterate all the existing MRs. */
> +	LIST_FOREACH(mr, &share_cache->mr_list, mr) {
> +		unsigned int n;
> +
> +		DEBUG("MR[%u], LKey = 0x%x, ms_n = %u, ms_bmp_n = %u",
> +		      mr_n++, rte_cpu_to_be_32(mr->ibv_mr->lkey),
> +		      mr->ms_n, mr->ms_bmp_n);
> +		if (mr->ms_n == 0)
> +			continue;
> +		for (n = 0; n < mr->ms_bmp_n; ) {
> +			struct mr_cache_entry ret = { 0, };
> +
> +			n = mr_find_next_chunk(mr, &ret, n);
> +			if (!ret.end)
> +				break;
> +			DEBUG("  chunk[%u], [0x%" PRIxPTR ", 0x%" PRIxPTR
> ")",
> +			      chunk_n++, ret.start, ret.end);
> +		}
> +	}
> +	DEBUG("Dumping global cache %p", (void *)share_cache);
> +	mlx5_mr_btree_dump(&share_cache->cache);
> +	rte_rwlock_read_unlock(&share_cache->rwlock);
> +#endif
> +}
> diff --git a/drivers/common/mlx5/mlx5_common_mr.h b/drivers/common/mlx5/mlx5_common_mr.h
> new file mode 100644
> index 0000000000..e805f96375
> --- /dev/null
> +++ b/drivers/common/mlx5/mlx5_common_mr.h
> @@ -0,0 +1,160 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright 2018 6WIND S.A.
> + * Copyright 2018 Mellanox Technologies, Ltd
> + */
> +
> +#ifndef RTE_PMD_MLX5_COMMON_MR_H_
> +#define RTE_PMD_MLX5_COMMON_MR_H_
> +
> +#include <stddef.h>
> +#include <stdint.h>
> +#include <sys/queue.h>
> +
> +/* Verbs header. */
> +/* ISO C doesn't support unnamed structs/unions, disabling -pedantic. */
> +#ifdef PEDANTIC
> +#pragma GCC diagnostic ignored "-Wpedantic"
> +#endif
> +#include <infiniband/verbs.h>
> +#include <infiniband/mlx5dv.h>
> +#ifdef PEDANTIC
> +#pragma GCC diagnostic error "-Wpedantic"
> +#endif
> +
> +#include <rte_rwlock.h>
> +#include <rte_bitmap.h>
> +#include <rte_memory.h>
> +
> +#include "mlx5_common_mp.h"
> +
> +/* Size of per-queue MR cache array for linear search. */
> +#define MLX5_MR_CACHE_N 8
> +#define MLX5_MR_BTREE_CACHE_N 256
> +
> +/* Memory Region object. */
> +struct mlx5_mr {
> +	LIST_ENTRY(mlx5_mr) mr; /**< Pointer to the prev/next entry. */
> +	struct ibv_mr *ibv_mr; /* Verbs Memory Region. */
> +	const struct rte_memseg_list *msl;
> +	int ms_base_idx; /* Start index of msl->memseg_arr[]. */
> +	int ms_n; /* Number of memsegs in use. */
> +	uint32_t ms_bmp_n; /* Number of bits in memsegs bit-mask. */
> +	struct rte_bitmap *ms_bmp; /* Bit-mask of memsegs belonged to MR. */
> +};
> +
> +/* Cache entry for Memory Region. */
> +struct mr_cache_entry {
> +	uintptr_t start; /* Start address of MR. */
> +	uintptr_t end; /* End address of MR. */
> +	uint32_t lkey; /* rte_cpu_to_be_32(ibv_mr->lkey). */
> +} __rte_packed;
> +
> +/* MR Cache table for Binary search. */
> +struct mlx5_mr_btree {
> +	uint16_t len; /* Number of entries. */
> +	uint16_t size; /* Total number of entries. */
> +	int overflow; /* Mark failure of table expansion. */
> +	struct mr_cache_entry (*table)[];
> +} __rte_packed;
> +
> +/* Per-queue MR control descriptor. */
> +struct mlx5_mr_ctrl {
> +	uint32_t *dev_gen_ptr; /* Generation number of device to poll. */
> +	uint32_t cur_gen; /* Generation number saved to flush caches. */
> +	uint16_t mru; /* Index of last hit entry in top-half cache. */
> +	uint16_t head; /* Index of the oldest entry in top-half cache. */
> +	struct mr_cache_entry cache[MLX5_MR_CACHE_N]; /* Cache for top-half. */
> +	struct mlx5_mr_btree cache_bh; /* Cache for bottom-half. */
> +} __rte_packed;
> +
> +LIST_HEAD(mlx5_mr_list, mlx5_mr);
> +
> +/* Global per-device MR cache. */
> +struct mlx5_mr_share_cache {
> +	uint32_t dev_gen; /* Generation number to flush local caches. */
> +	rte_rwlock_t rwlock; /* MR cache Lock. */
> +	struct mlx5_mr_btree cache; /* Global MR cache table. */
> +	struct mlx5_mr_list mr_list; /* Registered MR list. */
> +	struct mlx5_mr_list mr_free_list; /* Freed MR list. */
> +} __rte_packed;
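
A sketch of wiring a queue's mr_ctrl to this shared cache at setup time
(assuming the caller knows its NUMA socket; where exactly dev_gen_ptr is
assigned is my assumption, error handling elided):

	/* Bottom-half B-tree, sized for the expected MR population. */
	if (mlx5_mr_btree_init(&mr_ctrl->cache_bh, MLX5_MR_BTREE_CACHE_N,
			       socket) != 0)
		return -rte_errno;
	/* Let the datapath observe global cache invalidations. */
	mr_ctrl->dev_gen_ptr = &share_cache->dev_gen;
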
> +
> +/**
> + * Look up LKey from given lookup table by linear search. Firstly look up the
> + * last-hit entry. If miss, the entire array is searched. If found, update the
> + * last-hit index and return LKey.
> + *
> + * @param lkp_tbl
> + *   Pointer to lookup table.
> + * @param[in,out] cached_idx
> + *   Pointer to last-hit index.
> + * @param n
> + *   Size of lookup table.
> + * @param addr
> + *   Search key.
> + *
> + * @return
> + *   Searched LKey on success, UINT32_MAX on no match.
> + */
> +static __rte_always_inline uint32_t
> +mlx5_mr_lookup_lkey(struct mr_cache_entry *lkp_tbl, uint16_t *cached_idx,
> +		    uint16_t n, uintptr_t addr)
> +{
> +	uint16_t idx;
> +
> +	if (likely(addr >= lkp_tbl[*cached_idx].start &&
> +		   addr < lkp_tbl[*cached_idx].end))
> +		return lkp_tbl[*cached_idx].lkey;
> +	for (idx = 0; idx < n && lkp_tbl[idx].start != 0; ++idx) {
> +		if (addr >= lkp_tbl[idx].start &&
> +		    addr < lkp_tbl[idx].end) {
> +			/* Found. */
> +			*cached_idx = idx;
> +			return lkp_tbl[idx].lkey;
> +		}
> +	}
> +	return UINT32_MAX;
> +}
> +
> +__rte_experimental
> +int mlx5_mr_btree_init(struct mlx5_mr_btree *bt, int n, int socket);
> +__rte_experimental
> +void mlx5_mr_btree_free(struct mlx5_mr_btree *bt);
> +__rte_experimental
> +void mlx5_mr_btree_dump(struct mlx5_mr_btree *bt __rte_unused);
> +__rte_experimental
> +uint32_t mlx5_mr_addr2mr_bh(struct ibv_pd *pd, struct mlx5_mp_id *mp_id,
> +			    struct mlx5_mr_share_cache *share_cache,
> +			    struct mlx5_mr_ctrl *mr_ctrl,
> +			    uintptr_t addr, unsigned int mr_ext_memseg_en);
> +__rte_experimental
> +void mlx5_mr_release_cache(struct mlx5_mr_share_cache *mr_cache);
> +__rte_experimental
> +void mlx5_mr_dump_cache(struct mlx5_mr_share_cache *share_cache __rte_unused);
> +__rte_experimental
> +void mlx5_mr_rebuild_cache(struct mlx5_mr_share_cache *share_cache);
> +__rte_experimental
> +void mlx5_mr_flush_local_cache(struct mlx5_mr_ctrl *mr_ctrl);
> +__rte_experimental
> +int
> +mlx5_mr_insert_cache(struct mlx5_mr_share_cache *share_cache,
> +		     struct mlx5_mr *mr);
> +__rte_experimental
> +uint32_t
> +mlx5_mr_lookup_cache(struct mlx5_mr_share_cache *share_cache,
> +		     struct mr_cache_entry *entry, uintptr_t addr);
> +__rte_experimental
> +struct mlx5_mr *
> +mlx5_mr_lookup_list(struct mlx5_mr_share_cache *share_cache,
> +		    struct mr_cache_entry *entry, uintptr_t addr);
> +__rte_experimental
> +struct mlx5_mr *
> +mlx5_create_mr_ext(struct ibv_pd *pd, uintptr_t addr, size_t len,
> +		   int socket_id);
> +__rte_experimental
> +uint32_t
> +mlx5_mr_create_primary(struct ibv_pd *pd,
> +		       struct mlx5_mr_share_cache *share_cache,
> +		       struct mr_cache_entry *entry, uintptr_t addr,
> +		       unsigned int mr_ext_memseg_en);
> +
> +#endif /* RTE_PMD_MLX5_COMMON_MR_H_ */
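
Since the point of the series is reuse by future mlx5 PMDs (regex, vdpa,
...), here is a sketch of the minimum a new PMD would wire up; everything
except the mlx5_mr_*/mlx5_mp_id names is hypothetical:

#include <mlx5_common_mr.h>

struct my_dev {
	struct ibv_pd *pd;
	struct mlx5_mp_id mp_id;
	struct mlx5_mr_share_cache share_cache;
};

static int
my_dev_mr_init(struct my_dev *dev, int socket)
{
	/* Global per-device cache; per-queue mr_ctrl B-trees come later. */
	return mlx5_mr_btree_init(&dev->share_cache.cache,
				  MLX5_MR_BTREE_CACHE_N * 2, socket);
}

static uint32_t
my_dev_get_lkey(struct my_dev *dev, struct mlx5_mr_ctrl *mr_ctrl,
		uintptr_t addr)
{
	return mlx5_mr_addr2mr_bh(dev->pd, &dev->mp_id, &dev->share_cache,
				  mr_ctrl, addr, 1);
}
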
> diff --git a/drivers/common/mlx5/rte_common_mlx5_version.map b/drivers/common/mlx5/rte_common_mlx5_version.map
> index 265703d1c9..b58a378278 100644
> --- a/drivers/common/mlx5/rte_common_mlx5_version.map
> +++ b/drivers/common/mlx5/rte_common_mlx5_version.map
> @@ -61,4 +61,18 @@ EXPERIMENTAL {
>  	mlx5_mp_req_mr_create;
>  	mlx5_mp_req_queue_state_modify;
>  	mlx5_mp_req_verbs_cmd_fd;
> +
> +	mlx5_mr_btree_init;
> +	mlx5_mr_btree_free;
> +	mlx5_mr_btree_dump;
> +	mlx5_mr_addr2mr_bh;
> +	mlx5_mr_release_cache;
> +	mlx5_mr_dump_cache;
> +	mlx5_mr_rebuild_cache;
> +	mlx5_mr_insert_cache;
> +	mlx5_mr_lookup_cache;
> +	mlx5_mr_lookup_list;
> +	mlx5_create_mr_ext;
> +	mlx5_mr_create_primary;
> +	mlx5_mr_flush_local_cache;
>  };
> diff --git a/drivers/net/mlx5/mlx5.c b/drivers/net/mlx5/mlx5.c
> index d87c384422..f8b134ca66 100644
> --- a/drivers/net/mlx5/mlx5.c
> +++ b/drivers/net/mlx5/mlx5.c
> @@ -623,7 +623,7 @@ mlx5_alloc_shared_ibctx(const struct mlx5_dev_spawn_data *spawn,
>  	 * At this point the device is not added to the memory
>  	 * event list yet, context is just being created.
>  	 */
> -	err = mlx5_mr_btree_init(&sh->mr.cache,
> +	err = mlx5_mr_btree_init(&sh->share_cache.cache,
>  				 MLX5_MR_BTREE_CACHE_N * 2,
>  				 spawn->pci_dev->device.numa_node);
>  	if (err) {
> @@ -695,7 +695,7 @@ mlx5_free_shared_ibctx(struct mlx5_ibv_shared *sh)
>  	LIST_REMOVE(sh, mem_event_cb);
>  	rte_rwlock_write_unlock(&mlx5_shared_data->mem_event_rwlock);
>  	/* Release created Memory Regions. */
> -	mlx5_mr_release(sh);
> +	mlx5_mr_release_cache(&sh->share_cache);
>  	/* Remove context from the global device list. */
>  	LIST_REMOVE(sh, next);
>  	/*
> diff --git a/drivers/net/mlx5/mlx5.h b/drivers/net/mlx5/mlx5.h
> index e9d5868883..c45c01e916 100644
> --- a/drivers/net/mlx5/mlx5.h
> +++ b/drivers/net/mlx5/mlx5.h
> @@ -37,10 +37,10 @@
>  #include <mlx5_prm.h>
>  #include <mlx5_nl.h>
>  #include <mlx5_common_mp.h>
> +#include <mlx5_common_mr.h>
> 
>  #include "mlx5_defs.h"
>  #include "mlx5_utils.h"
> -#include "mlx5_mr.h"
>  #include "mlx5_autoconf.h"
> 
>  /** Key string for IPC. */
> @@ -199,8 +199,6 @@ struct mlx5_verbs_alloc_ctx {
>  	const void *obj; /* Pointer to the DPDK object. */
>  };
> 
> -LIST_HEAD(mlx5_mr_list, mlx5_mr);
> -
>  /* Flow drop context necessary due to Verbs API. */
>  struct mlx5_drop {
>  	struct mlx5_hrxq *hrxq; /* Hash Rx queue queue. */
> @@ -411,13 +409,7 @@ struct mlx5_ibv_shared {
>  	struct ibv_device_attr_ex device_attr; /* Device properties. */
>  	LIST_ENTRY(mlx5_ibv_shared) mem_event_cb;
>  	/**< Called by memory event callback. */
> -	struct {
> -		uint32_t dev_gen; /* Generation number to flush local caches. */
> -		rte_rwlock_t rwlock; /* MR Lock. */
> -		struct mlx5_mr_btree cache; /* Global MR cache table. */
> -		struct mlx5_mr_list mr_list; /* Registered MR list. */
> -		struct mlx5_mr_list mr_free_list; /* Freed MR list. */
> -	} mr;
> +	struct mlx5_mr_share_cache share_cache;
>  	/* Shared DV/DR flow data section. */
>  	pthread_mutex_t dv_mutex; /* DV context mutex. */
>  	uint32_t dv_meta_mask; /* flow META metadata supported mask. */
> diff --git a/drivers/net/mlx5/mlx5_mp.c b/drivers/net/mlx5/mlx5_mp.c
> index 43684dbc3a..7ad322d474 100644
> --- a/drivers/net/mlx5/mlx5_mp.c
> +++ b/drivers/net/mlx5/mlx5_mp.c
> @@ -11,6 +11,7 @@
>  #include <rte_string_fns.h>
> 
>  #include <mlx5_common_mp.h>
> +#include <mlx5_common_mr.h>
> 
>  #include "mlx5.h"
>  #include "mlx5_rxtx.h"
> @@ -25,7 +26,7 @@ mlx5_mp_primary_handle(const struct rte_mp_msg *mp_msg, const void *peer)
>  		(const struct mlx5_mp_param *)mp_msg->param;
>  	struct rte_eth_dev *dev;
>  	struct mlx5_priv *priv;
> -	struct mlx5_mr_cache entry;
> +	struct mr_cache_entry entry;
>  	uint32_t lkey;
>  	int ret;
> 
> @@ -40,7 +41,10 @@ mlx5_mp_primary_handle(const struct rte_mp_msg *mp_msg, const void *peer)
>  	switch (param->type) {
>  	case MLX5_MP_REQ_CREATE_MR:
>  		mp_init_msg(&priv->mp_id, &mp_res, param->type);
> -		lkey = mlx5_mr_create_primary(dev, &entry, param->args.addr);
> +		lkey = mlx5_mr_create_primary(priv->sh->pd,
> +					      &priv->sh->share_cache,
> +					      &entry, param->args.addr,
> +					      priv->config.mr_ext_memseg_en);
>  		if (lkey == UINT32_MAX)
>  			res->result = -rte_errno;
>  		ret = rte_mp_reply(&mp_res, peer);
> diff --git a/drivers/net/mlx5/mlx5_mr.c b/drivers/net/mlx5/mlx5_mr.c
> index 9151992a72..2b4b3e2891 100644
> --- a/drivers/net/mlx5/mlx5_mr.c
> +++ b/drivers/net/mlx5/mlx5_mr.c
> @@ -18,6 +18,8 @@
>  #include <rte_bus_pci.h>
> 
>  #include <mlx5_glue.h>
> +#include <mlx5_common_mp.h>
> +#include <mlx5_common_mr.h>
> 
>  #include "mlx5.h"
>  #include "mlx5_mr.h"
> @@ -36,834 +38,6 @@ struct mr_update_mp_data {
>  	int ret;
>  };
> 
> -/**
> - * Expand B-tree table to a given size. Can't be called with holding
> - * memory_hotplug_lock or sh->mr.rwlock due to rte_realloc().
> - *
> - * @param bt
> - *   Pointer to B-tree structure.
> - * @param n
> - *   Number of entries for expansion.
> - *
> - * @return
> - *   0 on success, -1 on failure.
> - */
> -static int
> -mr_btree_expand(struct mlx5_mr_btree *bt, int n)
> -{
> -	void *mem;
> -	int ret = 0;
> -
> -	if (n <= bt->size)
> -		return ret;
> -	/*
> -	 * Downside of directly using rte_realloc() is that SOCKET_ID_ANY is
> -	 * used inside if there's no room to expand. Because this is a quite
> -	 * rare case and a part of very slow path, it is very acceptable.
> -	 * Initially cache_bh[] will be given practically enough space and once
> -	 * it is expanded, expansion wouldn't be needed again ever.
> -	 */
> -	mem = rte_realloc(bt->table, n * sizeof(struct mlx5_mr_cache), 0);
> -	if (mem == NULL) {
> -		/* Not an error, B-tree search will be skipped. */
> -		DRV_LOG(WARNING, "failed to expand MR B-tree (%p) table",
> -			(void *)bt);
> -		ret = -1;
> -	} else {
> -		DRV_LOG(DEBUG, "expanded MR B-tree table (size=%u)", n);
> -		bt->table = mem;
> -		bt->size = n;
> -	}
> -	return ret;
> -}
> -
> -/**
> - * Look up LKey from given B-tree lookup table, store the last index and return
> - * searched LKey.
> - *
> - * @param bt
> - *   Pointer to B-tree structure.
> - * @param[out] idx
> - *   Pointer to index. Even on search failure, returns index where it stops
> - *   searching so that index can be used when inserting a new entry.
> - * @param addr
> - *   Search key.
> - *
> - * @return
> - *   Searched LKey on success, UINT32_MAX on no match.
> - */
> -static uint32_t
> -mr_btree_lookup(struct mlx5_mr_btree *bt, uint16_t *idx, uintptr_t addr)
> -{
> -	struct mlx5_mr_cache *lkp_tbl;
> -	uint16_t n;
> -	uint16_t base = 0;
> -
> -	MLX5_ASSERT(bt != NULL);
> -	lkp_tbl = *bt->table;
> -	n = bt->len;
> -	/* First entry must be NULL for comparison. */
> -	MLX5_ASSERT(bt->len > 0 || (lkp_tbl[0].start == 0 &&
> -				    lkp_tbl[0].lkey == UINT32_MAX));
> -	/* Binary search. */
> -	do {
> -		register uint16_t delta = n >> 1;
> -
> -		if (addr < lkp_tbl[base + delta].start) {
> -			n = delta;
> -		} else {
> -			base += delta;
> -			n -= delta;
> -		}
> -	} while (n > 1);
> -	MLX5_ASSERT(addr >= lkp_tbl[base].start);
> -	*idx = base;
> -	if (addr < lkp_tbl[base].end)
> -		return lkp_tbl[base].lkey;
> -	/* Not found. */
> -	return UINT32_MAX;
> -}
> -
> -/**
> - * Insert an entry to B-tree lookup table.
> - *
> - * @param bt
> - *   Pointer to B-tree structure.
> - * @param entry
> - *   Pointer to new entry to insert.
> - *
> - * @return
> - *   0 on success, -1 on failure.
> - */
> -static int
> -mr_btree_insert(struct mlx5_mr_btree *bt, struct mlx5_mr_cache *entry)
> -{
> -	struct mlx5_mr_cache *lkp_tbl;
> -	uint16_t idx = 0;
> -	size_t shift;
> -
> -	MLX5_ASSERT(bt != NULL);
> -	MLX5_ASSERT(bt->len <= bt->size);
> -	MLX5_ASSERT(bt->len > 0);
> -	lkp_tbl = *bt->table;
> -	/* Find out the slot for insertion. */
> -	if (mr_btree_lookup(bt, &idx, entry->start) != UINT32_MAX) {
> -		DRV_LOG(DEBUG,
> -			"abort insertion to B-tree(%p): already exist at"
> -			" idx=%u [0x%" PRIxPTR ", 0x%" PRIxPTR ") lkey=0x%x",
> -			(void *)bt, idx, entry->start, entry->end, entry->lkey);
> -		/* Already exist, return. */
> -		return 0;
> -	}
> -	/* If table is full, return error. */
> -	if (unlikely(bt->len == bt->size)) {
> -		bt->overflow = 1;
> -		return -1;
> -	}
> -	/* Insert entry. */
> -	++idx;
> -	shift = (bt->len - idx) * sizeof(struct mlx5_mr_cache);
> -	if (shift)
> -		memmove(&lkp_tbl[idx + 1], &lkp_tbl[idx], shift);
> -	lkp_tbl[idx] = *entry;
> -	bt->len++;
> -	DRV_LOG(DEBUG,
> -		"inserted B-tree(%p)[%u],"
> -		" [0x%" PRIxPTR ", 0x%" PRIxPTR ") lkey=0x%x",
> -		(void *)bt, idx, entry->start, entry->end, entry->lkey);
> -	return 0;
> -}
> -
> -/**
> - * Initialize B-tree and allocate memory for lookup table.
> - *
> - * @param bt
> - *   Pointer to B-tree structure.
> - * @param n
> - *   Number of entries to allocate.
> - * @param socket
> - *   NUMA socket on which memory must be allocated.
> - *
> - * @return
> - *   0 on success, a negative errno value otherwise and rte_errno is set.
> - */
> -int
> -mlx5_mr_btree_init(struct mlx5_mr_btree *bt, int n, int socket)
> -{
> -	if (bt == NULL) {
> -		rte_errno = EINVAL;
> -		return -rte_errno;
> -	}
> -	MLX5_ASSERT(!bt->table && !bt->size);
> -	memset(bt, 0, sizeof(*bt));
> -	bt->table = rte_calloc_socket("B-tree table",
> -				      n, sizeof(struct mlx5_mr_cache),
> -				      0, socket);
> -	if (bt->table == NULL) {
> -		rte_errno = ENOMEM;
> -		DEBUG("failed to allocate memory for btree cache on socket
> %d",
> -		      socket);
> -		return -rte_errno;
> -	}
> -	bt->size = n;
> -	/* First entry must be NULL for binary search. */
> -	(*bt->table)[bt->len++] = (struct mlx5_mr_cache) {
> -		.lkey = UINT32_MAX,
> -	};
> -	DEBUG("initialized B-tree %p with table %p",
> -	      (void *)bt, (void *)bt->table);
> -	return 0;
> -}
> -
> -/**
> - * Free B-tree resources.
> - *
> - * @param bt
> - *   Pointer to B-tree structure.
> - */
> -void
> -mlx5_mr_btree_free(struct mlx5_mr_btree *bt)
> -{
> -	if (bt == NULL)
> -		return;
> -	DEBUG("freeing B-tree %p with table %p",
> -	      (void *)bt, (void *)bt->table);
> -	rte_free(bt->table);
> -	memset(bt, 0, sizeof(*bt));
> -}
> -
> -/**
> - * Dump all the entries in a B-tree
> - *
> - * @param bt
> - *   Pointer to B-tree structure.
> - */
> -void
> -mlx5_mr_btree_dump(struct mlx5_mr_btree *bt __rte_unused)
> -{
> -#ifdef RTE_LIBRTE_MLX5_DEBUG
> -	int idx;
> -	struct mlx5_mr_cache *lkp_tbl;
> -
> -	if (bt == NULL)
> -		return;
> -	lkp_tbl = *bt->table;
> -	for (idx = 0; idx < bt->len; ++idx) {
> -		struct mlx5_mr_cache *entry = &lkp_tbl[idx];
> -
> -		DEBUG("B-tree(%p)[%u],"
> -		      " [0x%" PRIxPTR ", 0x%" PRIxPTR ") lkey=0x%x",
> -		      (void *)bt, idx, entry->start, entry->end, entry->lkey);
> -	}
> -#endif
> -}
> -
> -/**
> - * Find virtually contiguous memory chunk in a given MR.
> - *
> - * @param dev
> - *   Pointer to MR structure.
> - * @param[out] entry
> - *   Pointer to returning MR cache entry. If not found, this will not be
> - *   updated.
> - * @param start_idx
> - *   Start index of the memseg bitmap.
> - *
> - * @return
> - *   Next index to go on lookup.
> - */
> -static int
> -mr_find_next_chunk(struct mlx5_mr *mr, struct mlx5_mr_cache *entry,
> -		   int base_idx)
> -{
> -	uintptr_t start = 0;
> -	uintptr_t end = 0;
> -	uint32_t idx = 0;
> -
> -	/* MR for external memory doesn't have memseg list. */
> -	if (mr->msl == NULL) {
> -		struct ibv_mr *ibv_mr = mr->ibv_mr;
> -
> -		MLX5_ASSERT(mr->ms_bmp_n == 1);
> -		MLX5_ASSERT(mr->ms_n == 1);
> -		MLX5_ASSERT(base_idx == 0);
> -		/*
> -		 * Can't search it from memseg list but get it directly from
> -		 * verbs MR as there's only one chunk.
> -		 */
> -		entry->start = (uintptr_t)ibv_mr->addr;
> -		entry->end = (uintptr_t)ibv_mr->addr + mr->ibv_mr->length;
> -		entry->lkey = rte_cpu_to_be_32(mr->ibv_mr->lkey);
> -		/* Returning 1 ends iteration. */
> -		return 1;
> -	}
> -	for (idx = base_idx; idx < mr->ms_bmp_n; ++idx) {
> -		if (rte_bitmap_get(mr->ms_bmp, idx)) {
> -			const struct rte_memseg_list *msl;
> -			const struct rte_memseg *ms;
> -
> -			msl = mr->msl;
> -			ms = rte_fbarray_get(&msl->memseg_arr,
> -					     mr->ms_base_idx + idx);
> -			MLX5_ASSERT(msl->page_sz == ms->hugepage_sz);
> -			if (!start)
> -				start = ms->addr_64;
> -			end = ms->addr_64 + ms->hugepage_sz;
> -		} else if (start) {
> -			/* Passed the end of a fragment. */
> -			break;
> -		}
> -	}
> -	if (start) {
> -		/* Found one chunk. */
> -		entry->start = start;
> -		entry->end = end;
> -		entry->lkey = rte_cpu_to_be_32(mr->ibv_mr->lkey);
> -	}
> -	return idx;
> -}
> -
> -/**
> - * Insert a MR to the global B-tree cache. It may fail due to low-on-memory.
> - * Then, this entry will have to be searched by mr_lookup_dev_list() in
> - * mlx5_mr_create() on miss.
> - *
> - * @param dev
> - *   Pointer to Ethernet device shared context.
> - * @param mr
> - *   Pointer to MR to insert.
> - *
> - * @return
> - *   0 on success, -1 on failure.
> - */
> -static int
> -mr_insert_dev_cache(struct mlx5_ibv_shared *sh, struct mlx5_mr *mr)
> -{
> -	unsigned int n;
> -
> -	DRV_LOG(DEBUG, "device %s inserting MR(%p) to global cache",
> -		sh->ibdev_name, (void *)mr);
> -	for (n = 0; n < mr->ms_bmp_n; ) {
> -		struct mlx5_mr_cache entry;
> -
> -		memset(&entry, 0, sizeof(entry));
> -		/* Find a contiguous chunk and advance the index. */
> -		n = mr_find_next_chunk(mr, &entry, n);
> -		if (!entry.end)
> -			break;
> -		if (mr_btree_insert(&sh->mr.cache, &entry) < 0) {
> -			/*
> -			 * Overflowed, but the global table cannot be expanded
> -			 * because of deadlock.
> -			 */
> -			return -1;
> -		}
> -	}
> -	return 0;
> -}
> -
> -/**
> - * Look up address in the original global MR list.
> - *
> - * @param sh
> - *   Pointer to Ethernet device shared context.
> - * @param[out] entry
> - *   Pointer to returning MR cache entry. If no match, this will not be updated.
> - * @param addr
> - *   Search key.
> - *
> - * @return
> - *   Found MR on match, NULL otherwise.
> - */
> -static struct mlx5_mr *
> -mr_lookup_dev_list(struct mlx5_ibv_shared *sh, struct mlx5_mr_cache
> *entry,
> -		   uintptr_t addr)
> -{
> -	struct mlx5_mr *mr;
> -
> -	/* Iterate all the existing MRs. */
> -	LIST_FOREACH(mr, &sh->mr.mr_list, mr) {
> -		unsigned int n;
> -
> -		if (mr->ms_n == 0)
> -			continue;
> -		for (n = 0; n < mr->ms_bmp_n; ) {
> -			struct mlx5_mr_cache ret;
> -
> -			memset(&ret, 0, sizeof(ret));
> -			n = mr_find_next_chunk(mr, &ret, n);
> -			if (addr >= ret.start && addr < ret.end) {
> -				/* Found. */
> -				*entry = ret;
> -				return mr;
> -			}
> -		}
> -	}
> -	return NULL;
> -}
> -
> -/**
> - * Look up address on device.
> - *
> - * @param dev
> - *   Pointer to Ethernet device shared context.
> - * @param[out] entry
> - *   Pointer to returning MR cache entry. If no match, this will not be updated.
> - * @param addr
> - *   Search key.
> - *
> - * @return
> - *   Searched LKey on success, UINT32_MAX on failure and rte_errno is set.
> - */
> -static uint32_t
> -mr_lookup_dev(struct mlx5_ibv_shared *sh, struct mlx5_mr_cache *entry,
> -	      uintptr_t addr)
> -{
> -	uint16_t idx;
> -	uint32_t lkey = UINT32_MAX;
> -	struct mlx5_mr *mr;
> -
> -	/*
> -	 * If the global cache has overflowed since it failed to expand the
> -	 * B-tree table, it can't have all the existing MRs. Then, the address
> -	 * has to be searched by traversing the original MR list instead, which
> -	 * is very slow path. Otherwise, the global cache is all inclusive.
> -	 */
> -	if (!unlikely(sh->mr.cache.overflow)) {
> -		lkey = mr_btree_lookup(&sh->mr.cache, &idx, addr);
> -		if (lkey != UINT32_MAX)
> -			*entry = (*sh->mr.cache.table)[idx];
> -	} else {
> -		/* Falling back to the slowest path. */
> -		mr = mr_lookup_dev_list(sh, entry, addr);
> -		if (mr != NULL)
> -			lkey = entry->lkey;
> -	}
> -	MLX5_ASSERT(lkey == UINT32_MAX || (addr >= entry->start &&
> -					   addr < entry->end));
> -	return lkey;
> -}
> -
> -/**
> - * Free MR resources. MR lock must not be held to avoid a deadlock. rte_free()
> - * can raise memory free event and the callback function will spin on the lock.
> - *
> - * @param mr
> - *   Pointer to MR to free.
> - */
> -static void
> -mr_free(struct mlx5_mr *mr)
> -{
> -	if (mr == NULL)
> -		return;
> -	DRV_LOG(DEBUG, "freeing MR(%p):", (void *)mr);
> -	if (mr->ibv_mr != NULL)
> -		claim_zero(mlx5_glue->dereg_mr(mr->ibv_mr));
> -	if (mr->ms_bmp != NULL)
> -		rte_bitmap_free(mr->ms_bmp);
> -	rte_free(mr);
> -}
> -
> -/**
> - * Release resources of detached MR having no online entry.
> - *
> - * @param sh
> - *   Pointer to Ethernet device shared context.
> - */
> -static void
> -mlx5_mr_garbage_collect(struct mlx5_ibv_shared *sh)
> -{
> -	struct mlx5_mr *mr_next;
> -	struct mlx5_mr_list free_list = LIST_HEAD_INITIALIZER(free_list);
> -
> -	/* Must be called from the primary process. */
> -	MLX5_ASSERT(rte_eal_process_type() == RTE_PROC_PRIMARY);
> -	/*
> -	 * MR can't be freed with holding the lock because rte_free() could call
> -	 * memory free callback function. This will be a deadlock situation.
> -	 */
> -	rte_rwlock_write_lock(&sh->mr.rwlock);
> -	/* Detach the whole free list and release it after unlocking. */
> -	free_list = sh->mr.mr_free_list;
> -	LIST_INIT(&sh->mr.mr_free_list);
> -	rte_rwlock_write_unlock(&sh->mr.rwlock);
> -	/* Release resources. */
> -	mr_next = LIST_FIRST(&free_list);
> -	while (mr_next != NULL) {
> -		struct mlx5_mr *mr = mr_next;
> -
> -		mr_next = LIST_NEXT(mr, mr);
> -		mr_free(mr);
> -	}
> -}
> -
> -/* Called during rte_memseg_contig_walk() by mlx5_mr_create(). */
> -static int
> -mr_find_contig_memsegs_cb(const struct rte_memseg_list *msl,
> -			  const struct rte_memseg *ms, size_t len, void *arg)
> -{
> -	struct mr_find_contig_memsegs_data *data = arg;
> -
> -	if (data->addr < ms->addr_64 || data->addr >= ms->addr_64 + len)
> -		return 0;
> -	/* Found, save it and stop walking. */
> -	data->start = ms->addr_64;
> -	data->end = ms->addr_64 + len;
> -	data->msl = msl;
> -	return 1;
> -}
> -
> -/**
> - * Create a new global Memory Region (MR) for a missing virtual address.
> - * This API should be called on a secondary process, then a request is sent to
> - * the primary process in order to create a MR for the address. As the global MR
> - * list is on the shared memory, following LKey lookup should succeed unless the
> - * request fails.
> - *
> - * @param dev
> - *   Pointer to Ethernet device.
> - * @param[out] entry
> - *   Pointer to returning MR cache entry, found in the global cache or newly
> - *   created. If failed to create one, this will not be updated.
> - * @param addr
> - *   Target virtual address to register.
> - *
> - * @return
> - *   Searched LKey on success, UINT32_MAX on failure and rte_errno is set.
> - */
> -static uint32_t
> -mlx5_mr_create_secondary(struct rte_eth_dev *dev, struct mlx5_mr_cache *entry,
> -			 uintptr_t addr)
> -{
> -	struct mlx5_priv *priv = dev->data->dev_private;
> -	int ret;
> -
> -	DEBUG("port %u requesting MR creation for address (%p)",
> -	      dev->data->port_id, (void *)addr);
> -	ret = mlx5_mp_req_mr_create(&priv->mp_id, addr);
> -	if (ret) {
> -		DEBUG("port %u fail to request MR creation for address
> (%p)",
> -		      dev->data->port_id, (void *)addr);
> -		return UINT32_MAX;
> -	}
> -	rte_rwlock_read_lock(&priv->sh->mr.rwlock);
> -	/* Fill in output data. */
> -	mr_lookup_dev(priv->sh, entry, addr);
> -	/* Lookup can't fail. */
> -	MLX5_ASSERT(entry->lkey != UINT32_MAX);
> -	rte_rwlock_read_unlock(&priv->sh->mr.rwlock);
> -	DEBUG("port %u MR CREATED by primary process for %p:\n"
> -	      "  [0x%" PRIxPTR ", 0x%" PRIxPTR "), lkey=0x%x",
> -	      dev->data->port_id, (void *)addr,
> -	      entry->start, entry->end, entry->lkey);
> -	return entry->lkey;
> -}
> -
> -/**
> - * Create a new global Memory Region (MR) for a missing virtual address.
> - * Register entire virtually contiguous memory chunk around the address.
> - * This must be called from the primary process.
> - *
> - * @param dev
> - *   Pointer to Ethernet device.
> - * @param[out] entry
> - *   Pointer to returning MR cache entry, found in the global cache or newly
> - *   created. If failed to create one, this will not be updated.
> - * @param addr
> - *   Target virtual address to register.
> - *
> - * @return
> - *   Searched LKey on success, UINT32_MAX on failure and rte_errno is set.
> - */
> -uint32_t
> -mlx5_mr_create_primary(struct rte_eth_dev *dev, struct mlx5_mr_cache *entry,
> -		       uintptr_t addr)
> -{
> -	struct mlx5_priv *priv = dev->data->dev_private;
> -	struct mlx5_ibv_shared *sh = priv->sh;
> -	struct mlx5_dev_config *config = &priv->config;
> -	const struct rte_memseg_list *msl;
> -	const struct rte_memseg *ms;
> -	struct mlx5_mr *mr = NULL;
> -	size_t len;
> -	uint32_t ms_n;
> -	uint32_t bmp_size;
> -	void *bmp_mem;
> -	int ms_idx_shift = -1;
> -	unsigned int n;
> -	struct mr_find_contig_memsegs_data data = {
> -		.addr = addr,
> -	};
> -	struct mr_find_contig_memsegs_data data_re;
> -
> -	DRV_LOG(DEBUG, "port %u creating a MR using address (%p)",
> -		dev->data->port_id, (void *)addr);
> -	/*
> -	 * Release detached MRs if any. This can't be called with holding either
> -	 * memory_hotplug_lock or sh->mr.rwlock. MRs on the free list have
> -	 * been detached by the memory free event but it couldn't be released
> -	 * inside the callback due to deadlock. As a result, releasing resources
> -	 * is quite opportunistic.
> -	 */
> -	mlx5_mr_garbage_collect(sh);
> -	/*
> -	 * If enabled, find out a contiguous virtual address chunk in use, to
> -	 * which the given address belongs, in order to register maximum range.
> -	 * In the best case where mempools are not dynamically recreated and
> -	 * '--socket-mem' is specified as an EAL option, it is very likely to
> -	 * have only one MR(LKey) per a socket and per a hugepage-size even
> -	 * though the system memory is highly fragmented. As the whole memory
> -	 * chunk will be pinned by kernel, it can't be reused unless entire
> -	 * chunk is freed from EAL.
> -	 *
> -	 * If disabled, just register one memseg (page). Then, memory
> -	 * consumption will be minimized but it may drop performance if there
> -	 * are many MRs to lookup on the datapath.
> -	 */
> -	if (!config->mr_ext_memseg_en) {
> -		data.msl = rte_mem_virt2memseg_list((void *)addr);
> -		data.start = RTE_ALIGN_FLOOR(addr, data.msl->page_sz);
> -		data.end = data.start + data.msl->page_sz;
> -	} else if (!rte_memseg_contig_walk(mr_find_contig_memsegs_cb, &data)) {
> -		DRV_LOG(WARNING,
> -			"port %u unable to find virtually contiguous"
> -			" chunk for address (%p)."
> -			" rte_memseg_contig_walk() failed.",
> -			dev->data->port_id, (void *)addr);
> -		rte_errno = ENXIO;
> -		goto err_nolock;
> -	}
> -alloc_resources:
> -	/* Addresses must be page-aligned. */
> -	MLX5_ASSERT(rte_is_aligned((void *)data.start, data.msl->page_sz));
> -	MLX5_ASSERT(rte_is_aligned((void *)data.end, data.msl->page_sz));
> -	msl = data.msl;
> -	ms = rte_mem_virt2memseg((void *)data.start, msl);
> -	len = data.end - data.start;
> -	MLX5_ASSERT(msl->page_sz == ms->hugepage_sz);
> -	/* Number of memsegs in the range. */
> -	ms_n = len / msl->page_sz;
> -	DEBUG("port %u extending %p to [0x%" PRIxPTR ", 0x%" PRIxPTR "),"
> -	      " page_sz=0x%" PRIx64 ", ms_n=%u",
> -	      dev->data->port_id, (void *)addr,
> -	      data.start, data.end, msl->page_sz, ms_n);
> -	/* Size of memory for bitmap. */
> -	bmp_size = rte_bitmap_get_memory_footprint(ms_n);
> -	mr = rte_zmalloc_socket(NULL,
> -				RTE_ALIGN_CEIL(sizeof(*mr),
> -					       RTE_CACHE_LINE_SIZE) +
> -				bmp_size,
> -				RTE_CACHE_LINE_SIZE, msl->socket_id);
> -	if (mr == NULL) {
> -		DEBUG("port %u unable to allocate memory for a new MR of"
> -		      " address (%p).",
> -		      dev->data->port_id, (void *)addr);
> -		rte_errno = ENOMEM;
> -		goto err_nolock;
> -	}
> -	mr->msl = msl;
> -	/*
> -	 * Save the index of the first memseg and initialize memseg bitmap. To
> -	 * see if a memseg of ms_idx in the memseg-list is still valid, check:
> -	 *	rte_bitmap_get(mr->bmp, ms_idx - mr->ms_base_idx)
> -	 */
> -	mr->ms_base_idx = rte_fbarray_find_idx(&msl->memseg_arr, ms);
> -	bmp_mem = RTE_PTR_ALIGN_CEIL(mr + 1, RTE_CACHE_LINE_SIZE);
> -	mr->ms_bmp = rte_bitmap_init(ms_n, bmp_mem, bmp_size);
> -	if (mr->ms_bmp == NULL) {
> -		DEBUG("port %u unable to initialize bitmap for a new MR of"
> -		      " address (%p).",
> -		      dev->data->port_id, (void *)addr);
> -		rte_errno = EINVAL;
> -		goto err_nolock;
> -	}
> -	/*
> -	 * Should recheck whether the extended contiguous chunk is still valid.
> -	 * Because memory_hotplug_lock can't be held if there's any memory
> -	 * related calls in a critical path, resource allocation above can't be
> -	 * locked. If the memory has been changed at this point, try again with
> -	 * just single page. If not, go on with the big chunk atomically from
> -	 * here.
> -	 */
> -	rte_mcfg_mem_read_lock();
> -	data_re = data;
> -	if (len > msl->page_sz &&
> -	    !rte_memseg_contig_walk(mr_find_contig_memsegs_cb, &data_re)) {
> -		DEBUG("port %u unable to find virtually contiguous"
> -		      " chunk for address (%p)."
> -		      " rte_memseg_contig_walk() failed.",
> -		      dev->data->port_id, (void *)addr);
> -		rte_errno = ENXIO;
> -		goto err_memlock;
> -	}
> -	if (data.start != data_re.start || data.end != data_re.end) {
> -		/*
> -		 * The extended contiguous chunk has been changed. Try again
> -		 * with single memseg instead.
> -		 */
> -		data.start = RTE_ALIGN_FLOOR(addr, msl->page_sz);
> -		data.end = data.start + msl->page_sz;
> -		rte_mcfg_mem_read_unlock();
> -		mr_free(mr);
> -		goto alloc_resources;
> -	}
> -	MLX5_ASSERT(data.msl == data_re.msl);
> -	rte_rwlock_write_lock(&sh->mr.rwlock);
> -	/*
> -	 * Check the address is really missing. If other thread already created
> -	 * one or it is not found due to overflow, abort and return.
> -	 */
> -	if (mr_lookup_dev(sh, entry, addr) != UINT32_MAX) {
> -		/*
> -		 * Insert to the global cache table. It may fail due to
> -		 * low-on-memory. Then, this entry will have to be searched
> -		 * here again.
> -		 */
> -		mr_btree_insert(&sh->mr.cache, entry);
> -		DEBUG("port %u found MR for %p on final lookup, abort",
> -		      dev->data->port_id, (void *)addr);
> -		rte_rwlock_write_unlock(&sh->mr.rwlock);
> -		rte_mcfg_mem_read_unlock();
> -		/*
> -		 * Must be unlocked before calling rte_free() because
> -		 * mlx5_mr_mem_event_free_cb() can be called inside.
> -		 */
> -		mr_free(mr);
> -		return entry->lkey;
> -	}
> -	/*
> -	 * Trim start and end addresses for verbs MR. Set bits for registering
> -	 * memsegs but exclude already registered ones. Bitmap can be
> -	 * fragmented.
> -	 */
> -	for (n = 0; n < ms_n; ++n) {
> -		uintptr_t start;
> -		struct mlx5_mr_cache ret;
> -
> -		memset(&ret, 0, sizeof(ret));
> -		start = data_re.start + n * msl->page_sz;
> -		/* Exclude memsegs already registered by other MRs. */
> -		if (mr_lookup_dev(sh, &ret, start) == UINT32_MAX) {
> -			/*
> -			 * Start from the first unregistered memseg in the
> -			 * extended range.
> -			 */
> -			if (ms_idx_shift == -1) {
> -				mr->ms_base_idx += n;
> -				data.start = start;
> -				ms_idx_shift = n;
> -			}
> -			data.end = start + msl->page_sz;
> -			rte_bitmap_set(mr->ms_bmp, n - ms_idx_shift);
> -			++mr->ms_n;
> -		}
> -	}
> -	len = data.end - data.start;
> -	mr->ms_bmp_n = len / msl->page_sz;
> -	MLX5_ASSERT(ms_idx_shift + mr->ms_bmp_n <= ms_n);
> -	/*
> -	 * Finally create a verbs MR for the memory chunk. ibv_reg_mr() can be
> -	 * called with holding the memory lock because it doesn't use
> -	 * mlx5_alloc_buf_extern() which eventually calls rte_malloc_socket()
> -	 * through mlx5_alloc_verbs_buf().
> -	 */
> -	mr->ibv_mr = mlx5_glue->reg_mr(sh->pd, (void *)data.start, len,
> -				       IBV_ACCESS_LOCAL_WRITE |
> -					   IBV_ACCESS_RELAXED_ORDERING);
> -	if (mr->ibv_mr == NULL) {
> -		DEBUG("port %u fail to create a verbs MR for address (%p)",
> -		      dev->data->port_id, (void *)addr);
> -		rte_errno = EINVAL;
> -		goto err_mrlock;
> -	}
> -	MLX5_ASSERT((uintptr_t)mr->ibv_mr->addr == data.start);
> -	MLX5_ASSERT(mr->ibv_mr->length == len);
> -	LIST_INSERT_HEAD(&sh->mr.mr_list, mr, mr);
> -	DEBUG("port %u MR CREATED (%p) for %p:\n"
> -	      "  [0x%" PRIxPTR ", 0x%" PRIxPTR "),"
> -	      " lkey=0x%x base_idx=%u ms_n=%u, ms_bmp_n=%u",
> -	      dev->data->port_id, (void *)mr, (void *)addr,
> -	      data.start, data.end, rte_cpu_to_be_32(mr->ibv_mr->lkey),
> -	      mr->ms_base_idx, mr->ms_n, mr->ms_bmp_n);
> -	/* Insert to the global cache table. */
> -	mr_insert_dev_cache(sh, mr);
> -	/* Fill in output data. */
> -	mr_lookup_dev(sh, entry, addr);
> -	/* Lookup can't fail. */
> -	MLX5_ASSERT(entry->lkey != UINT32_MAX);
> -	rte_rwlock_write_unlock(&sh->mr.rwlock);
> -	rte_mcfg_mem_read_unlock();
> -	return entry->lkey;
> -err_mrlock:
> -	rte_rwlock_write_unlock(&sh->mr.rwlock);
> -err_memlock:
> -	rte_mcfg_mem_read_unlock();
> -err_nolock:
> -	/*
> -	 * In case of error, as this can be called in a datapath, a warning
> -	 * message per an error is preferable instead. Must be unlocked before
> -	 * calling rte_free() because mlx5_mr_mem_event_free_cb() can be called
> -	 * inside.
> -	 */
> -	mr_free(mr);
> -	return UINT32_MAX;
> -}
> -
> -/**
> - * Create a new global Memory Region (MR) for a missing virtual address.
> - * This can be called from primary and secondary process.
> - *
> - * @param dev
> - *   Pointer to Ethernet device.
> - * @param[out] entry
> - *   Pointer to returning MR cache entry, found in the global cache or newly
> - *   created. If failed to create one, this will not be updated.
> - * @param addr
> - *   Target virtual address to register.
> - *
> - * @return
> - *   Searched LKey on success, UINT32_MAX on failure and rte_errno is set.
> - */
> -static uint32_t
> -mlx5_mr_create(struct rte_eth_dev *dev, struct mlx5_mr_cache *entry,
> -	       uintptr_t addr)
> -{
> -	uint32_t ret = 0;
> -
> -	switch (rte_eal_process_type()) {
> -	case RTE_PROC_PRIMARY:
> -		ret = mlx5_mr_create_primary(dev, entry, addr);
> -		break;
> -	case RTE_PROC_SECONDARY:
> -		ret = mlx5_mr_create_secondary(dev, entry, addr);
> -		break;
> -	default:
> -		break;
> -	}
> -	return ret;
> -}
> -
> -/**
> - * Rebuild the global B-tree cache of device from the original MR list.
> - *
> - * @param sh
> - *   Pointer to Ethernet device shared context.
> - */
> -static void
> -mr_rebuild_dev_cache(struct mlx5_ibv_shared *sh)
> -{
> -	struct mlx5_mr *mr;
> -
> -	DRV_LOG(DEBUG, "device %s rebuild dev cache[]", sh->ibdev_name);
> -	/* Flush cache to rebuild. */
> -	sh->mr.cache.len = 1;
> -	sh->mr.cache.overflow = 0;
> -	/* Iterate all the existing MRs. */
> -	LIST_FOREACH(mr, &sh->mr.mr_list, mr)
> -		if (mr_insert_dev_cache(sh, mr) < 0)
> -			return;
> -}
> -
>  /**
>   * Callback for memory free event. Iterate freed memsegs and check whether it
>   * belongs to an existing MR. If found, clear the bit from bitmap of MR. As a
> @@ -900,18 +74,18 @@ mlx5_mr_mem_event_free_cb(struct mlx5_ibv_shared *sh,
>  		    RTE_ALIGN((uintptr_t)addr, msl->page_sz));
>  	MLX5_ASSERT(len == RTE_ALIGN(len, msl->page_sz));
>  	ms_n = len / msl->page_sz;
> -	rte_rwlock_write_lock(&sh->mr.rwlock);
> +	rte_rwlock_write_lock(&sh->share_cache.rwlock);
>  	/* Clear bits of freed memsegs from MR. */
>  	for (i = 0; i < ms_n; ++i) {
>  		const struct rte_memseg *ms;
> -		struct mlx5_mr_cache entry;
> +		struct mr_cache_entry entry;
>  		uintptr_t start;
>  		int ms_idx;
>  		uint32_t pos;
> 
>  		/* Find MR having this memseg. */
>  		start = (uintptr_t)addr + i * msl->page_sz;
> -		mr = mr_lookup_dev_list(sh, &entry, start);
> +		mr = mlx5_mr_lookup_list(&sh->share_cache, &entry, start);
>  		if (mr == NULL)
>  			continue;
>  		MLX5_ASSERT(mr->msl); /* Can't be external memory. */
> @@ -927,7 +101,7 @@ mlx5_mr_mem_event_free_cb(struct mlx5_ibv_shared *sh,
>  		rte_bitmap_clear(mr->ms_bmp, pos);
>  		if (--mr->ms_n == 0) {
>  			LIST_REMOVE(mr, mr);
> -			LIST_INSERT_HEAD(&sh->mr.mr_free_list, mr, mr);
> +			LIST_INSERT_HEAD(&sh->share_cache.mr_free_list, mr, mr);
>  			DEBUG("device %s remove MR(%p) from list",
>  			      sh->ibdev_name, (void *)mr);
>  		}
> @@ -938,7 +112,7 @@ mlx5_mr_mem_event_free_cb(struct mlx5_ibv_shared *sh,
>  		rebuild = 1;
>  	}
>  	if (rebuild) {
> -		mr_rebuild_dev_cache(sh);
> +		mlx5_mr_rebuild_cache(&sh->share_cache);
>  		/*
>  		 * Flush local caches by propagating invalidation across cores.
>  		 * rte_smp_wmb() is enough to synchronize this event. If one of
> @@ -948,12 +122,12 @@ mlx5_mr_mem_event_free_cb(struct mlx5_ibv_shared *sh,
>  		 * generation below) will be guaranteed to be seen by other core
>  		 * before the core sees the newly allocated memory.
>  		 */
> -		++sh->mr.dev_gen;
> +		++sh->share_cache.dev_gen;
>  		DEBUG("broadcasting local cache flush, gen=%d",
> -		      sh->mr.dev_gen);
> +		      sh->share_cache.dev_gen);
>  		rte_smp_wmb();
>  	}
> -	rte_rwlock_write_unlock(&sh->mr.rwlock);
> +	rte_rwlock_write_unlock(&sh->share_cache.rwlock);
>  }
> 
>  /**
> @@ -990,111 +164,6 @@ mlx5_mr_mem_event_cb(enum rte_mem_event event_type, const void *addr,
>  	}
>  }
> 
> -/**
> - * Look up address in the global MR cache table. If not found, create a new MR.
> - * Insert the found/created entry to local bottom-half cache table.
> - *
> - * @param dev
> - *   Pointer to Ethernet device.
> - * @param mr_ctrl
> - *   Pointer to per-queue MR control structure.
> - * @param[out] entry
> - *   Pointer to returning MR cache entry, found in the global cache or newly
> - *   created. If failed to create one, this is not written.
> - * @param addr
> - *   Search key.
> - *
> - * @return
> - *   Searched LKey on success, UINT32_MAX on no match.
> - */
> -static uint32_t
> -mlx5_mr_lookup_dev(struct rte_eth_dev *dev, struct mlx5_mr_ctrl *mr_ctrl,
> -		   struct mlx5_mr_cache *entry, uintptr_t addr)
> -{
> -	struct mlx5_priv *priv = dev->data->dev_private;
> -	struct mlx5_ibv_shared *sh = priv->sh;
> -	struct mlx5_mr_btree *bt = &mr_ctrl->cache_bh;
> -	uint16_t idx;
> -	uint32_t lkey;
> -
> -	/* If local cache table is full, try to double it. */
> -	if (unlikely(bt->len == bt->size))
> -		mr_btree_expand(bt, bt->size << 1);
> -	/* Look up in the global cache. */
> -	rte_rwlock_read_lock(&sh->mr.rwlock);
> -	lkey = mr_btree_lookup(&sh->mr.cache, &idx, addr);
> -	if (lkey != UINT32_MAX) {
> -		/* Found. */
> -		*entry = (*sh->mr.cache.table)[idx];
> -		rte_rwlock_read_unlock(&sh->mr.rwlock);
> -		/*
> -		 * Update local cache. Even if it fails, return the found entry
> -		 * to update top-half cache. Next time, this entry will be found
> -		 * in the global cache.
> -		 */
> -		mr_btree_insert(bt, entry);
> -		return lkey;
> -	}
> -	rte_rwlock_read_unlock(&sh->mr.rwlock);
> -	/* First time to see the address? Create a new MR. */
> -	lkey = mlx5_mr_create(dev, entry, addr);
> -	/*
> -	 * Update the local cache if successfully created a new global MR. Even
> -	 * if failed to create one, there's no action to take in this datapath
> -	 * code. As returning LKey is invalid, this will eventually make HW
> -	 * fail.
> -	 */
> -	if (lkey != UINT32_MAX)
> -		mr_btree_insert(bt, entry);
> -	return lkey;
> -}
> -
> -/**
> - * Bottom-half of LKey search on datapath. Firstly search in cache_bh[] and if
> - * misses, search in the global MR cache table and update the new entry to
> - * per-queue local caches.
> - *
> - * @param dev
> - *   Pointer to Ethernet device.
> - * @param mr_ctrl
> - *   Pointer to per-queue MR control structure.
> - * @param addr
> - *   Search key.
> - *
> - * @return
> - *   Searched LKey on success, UINT32_MAX on no match.
> - */
> -static uint32_t
> -mlx5_mr_addr2mr_bh(struct rte_eth_dev *dev, struct mlx5_mr_ctrl *mr_ctrl,
> -		   uintptr_t addr)
> -{
> -	uint32_t lkey;
> -	uint16_t bh_idx = 0;
> -	/* Victim in top-half cache to replace with new entry. */
> -	struct mlx5_mr_cache *repl = &mr_ctrl->cache[mr_ctrl->head];
> -
> -	/* Binary-search MR translation table. */
> -	lkey = mr_btree_lookup(&mr_ctrl->cache_bh, &bh_idx, addr);
> -	/* Update top-half cache. */
> -	if (likely(lkey != UINT32_MAX)) {
> -		*repl = (*mr_ctrl->cache_bh.table)[bh_idx];
> -	} else {
> -		/*
> -		 * If missed in local lookup table, search in the global cache
> -		 * and local cache_bh[] will be updated inside if possible.
> -		 * Top-half cache entry will also be updated.
> -		 */
> -		lkey = mlx5_mr_lookup_dev(dev, mr_ctrl, repl, addr);
> -		if (unlikely(lkey == UINT32_MAX))
> -			return UINT32_MAX;
> -	}
> -	/* Update the most recently used entry. */
> -	mr_ctrl->mru = mr_ctrl->head;
> -	/* Point to the next victim, the oldest. */
> -	mr_ctrl->head = (mr_ctrl->head + 1) % MLX5_MR_CACHE_N;
> -	return lkey;
> -}
> -
>  /**
>   * Bottom-half of LKey search on Rx.
>   *
> @@ -1114,7 +183,9 @@ mlx5_rx_addr2mr_bh(struct mlx5_rxq_data *rxq, uintptr_t addr)
>  	struct mlx5_mr_ctrl *mr_ctrl = &rxq->mr_ctrl;
>  	struct mlx5_priv *priv = rxq_ctrl->priv;
> 
> -	return mlx5_mr_addr2mr_bh(ETH_DEV(priv), mr_ctrl, addr);
> +	return mlx5_mr_addr2mr_bh(priv->sh->pd, &priv->mp_id,
> +				  &priv->sh->share_cache, mr_ctrl, addr,
> +				  priv->config.mr_ext_memseg_en);
>  }
> 
>  /**
> @@ -1136,7 +207,9 @@ mlx5_tx_addr2mr_bh(struct mlx5_txq_data *txq, uintptr_t addr)
>  	struct mlx5_mr_ctrl *mr_ctrl = &txq->mr_ctrl;
>  	struct mlx5_priv *priv = txq_ctrl->priv;
> 
> -	return mlx5_mr_addr2mr_bh(ETH_DEV(priv), mr_ctrl, addr);
> +	return mlx5_mr_addr2mr_bh(priv->sh->pd, &priv->mp_id,
> +				  &priv->sh->share_cache, mr_ctrl, addr,
> +				  priv->config.mr_ext_memseg_en);
>  }
> 
>  /**
> @@ -1165,82 +238,6 @@ mlx5_tx_mb2mr_bh(struct mlx5_txq_data *txq, struct rte_mbuf *mb)
>  	return lkey;
>  }
> 
> -/**
> - * Flush all of the local cache entries.
> - *
> - * @param mr_ctrl
> - *   Pointer to per-queue MR control structure.
> - */
> -void
> -mlx5_mr_flush_local_cache(struct mlx5_mr_ctrl *mr_ctrl)
> -{
> -	/* Reset the most-recently-used index. */
> -	mr_ctrl->mru = 0;
> -	/* Reset the linear search array. */
> -	mr_ctrl->head = 0;
> -	memset(mr_ctrl->cache, 0, sizeof(mr_ctrl->cache));
> -	/* Reset the B-tree table. */
> -	mr_ctrl->cache_bh.len = 1;
> -	mr_ctrl->cache_bh.overflow = 0;
> -	/* Update the generation number. */
> -	mr_ctrl->cur_gen = *mr_ctrl->dev_gen_ptr;
> -	DRV_LOG(DEBUG, "mr_ctrl(%p): flushed, cur_gen=%d",
> -		(void *)mr_ctrl, mr_ctrl->cur_gen);
> -}
> -
> -/**
> - * Creates a memory region for external memory, that is memory which is not
> - * part of the DPDK memory segments.
> - *
> - * @param dev
> - *   Pointer to the ethernet device.
> - * @param addr
> - *   Starting virtual address of memory.
> - * @param len
> - *   Length of memory segment being mapped.
> - * @param socket_id
> - *   Socket to allocate heap memory for the control structures.
> - *
> - * @return
> - *   Pointer to MR structure on success, NULL otherwise.
> - */
> -static struct mlx5_mr *
> -mlx5_create_mr_ext(struct rte_eth_dev *dev, uintptr_t addr, size_t len,
> -		   int socket_id)
> -{
> -	struct mlx5_priv *priv = dev->data->dev_private;
> -	struct mlx5_mr *mr = NULL;
> -
> -	mr = rte_zmalloc_socket(NULL,
> -				RTE_ALIGN_CEIL(sizeof(*mr),
> -					       RTE_CACHE_LINE_SIZE),
> -				RTE_CACHE_LINE_SIZE, socket_id);
> -	if (mr == NULL)
> -		return NULL;
> -	mr->ibv_mr = mlx5_glue->reg_mr(priv->sh->pd, (void *)addr, len,
> -				       IBV_ACCESS_LOCAL_WRITE |
> -					   IBV_ACCESS_RELAXED_ORDERING);
> -	if (mr->ibv_mr == NULL) {
> -		DRV_LOG(WARNING,
> -			"port %u fail to create a verbs MR for address (%p)",
> -			dev->data->port_id, (void *)addr);
> -		rte_free(mr);
> -		return NULL;
> -	}
> -	mr->msl = NULL; /* Mark it is external memory. */
> -	mr->ms_bmp = NULL;
> -	mr->ms_n = 1;
> -	mr->ms_bmp_n = 1;
> -	DRV_LOG(DEBUG,
> -		"port %u MR CREATED (%p) for external memory %p:\n"
> -		"  [0x%" PRIxPTR ", 0x%" PRIxPTR "),"
> -		" lkey=0x%x base_idx=%u ms_n=%u, ms_bmp_n=%u",
> -		dev->data->port_id, (void *)mr, (void *)addr,
> -		addr, addr + len, rte_cpu_to_be_32(mr->ibv_mr->lkey),
> -		mr->ms_base_idx, mr->ms_n, mr->ms_bmp_n);
> -	return mr;
> -}
> -
>  /**
>   * Called during rte_mempool_mem_iter() by mlx5_mr_update_ext_mp().
>   *
> @@ -1267,19 +264,19 @@ mlx5_mr_update_ext_mp_cb(struct rte_mempool *mp, void *opaque,
>  	struct mlx5_mr *mr = NULL;
>  	uintptr_t addr = (uintptr_t)memhdr->addr;
>  	size_t len = memhdr->len;
> -	struct mlx5_mr_cache entry;
> +	struct mr_cache_entry entry;
>  	uint32_t lkey;
> 
>  	MLX5_ASSERT(rte_eal_process_type() == RTE_PROC_PRIMARY);
>  	/* If already registered, it should return. */
> -	rte_rwlock_read_lock(&sh->mr.rwlock);
> -	lkey = mr_lookup_dev(sh, &entry, addr);
> -	rte_rwlock_read_unlock(&sh->mr.rwlock);
> +	rte_rwlock_read_lock(&sh->share_cache.rwlock);
> +	lkey = mlx5_mr_lookup_cache(&sh->share_cache, &entry, addr);
> +	rte_rwlock_read_unlock(&sh->share_cache.rwlock);
>  	if (lkey != UINT32_MAX)
>  		return;
>  	DRV_LOG(DEBUG, "port %u register MR for chunk #%d of mempool (%s)",
>  		dev->data->port_id, mem_idx, mp->name);
> -	mr = mlx5_create_mr_ext(dev, addr, len, mp->socket_id);
> +	mr = mlx5_create_mr_ext(sh->pd, addr, len, mp->socket_id);
>  	if (!mr) {
>  		DRV_LOG(WARNING,
>  			"port %u unable to allocate a new MR of"
> @@ -1288,13 +285,14 @@ mlx5_mr_update_ext_mp_cb(struct rte_mempool *mp, void *opaque,
>  		data->ret = -1;
>  		return;
>  	}
> -	rte_rwlock_write_lock(&sh->mr.rwlock);
> -	LIST_INSERT_HEAD(&sh->mr.mr_list, mr, mr);
> +	rte_rwlock_write_lock(&sh->share_cache.rwlock);
> +	LIST_INSERT_HEAD(&sh->share_cache.mr_list, mr, mr);
>  	/* Insert to the global cache table. */
> -	mr_insert_dev_cache(sh, mr);
> -	rte_rwlock_write_unlock(&sh->mr.rwlock);
> +	mlx5_mr_insert_cache(&sh->share_cache, mr);
> +	rte_rwlock_write_unlock(&sh->share_cache.rwlock);
>  	/* Insert to the local cache table */
> -	mlx5_mr_addr2mr_bh(dev, mr_ctrl, addr);
> +	mlx5_mr_addr2mr_bh(sh->pd, &priv->mp_id, &sh->share_cache,
> +			   mr_ctrl, addr, priv->config.mr_ext_memseg_en);
>  }
> 
>  /**
> @@ -1351,19 +349,19 @@ mlx5_dma_map(struct rte_pci_device *pdev, void *addr,
>  		return -1;
>  	}
>  	priv = dev->data->dev_private;
> -	mr = mlx5_create_mr_ext(dev, (uintptr_t)addr, len, SOCKET_ID_ANY);
> +	sh = priv->sh;
> +	mr = mlx5_create_mr_ext(sh->pd, (uintptr_t)addr, len,
> SOCKET_ID_ANY);
>  	if (!mr) {
>  		DRV_LOG(WARNING,
>  			"port %u unable to dma map", dev->data->port_id);
>  		rte_errno = EINVAL;
>  		return -1;
>  	}
> -	sh = priv->sh;
> -	rte_rwlock_write_lock(&sh->mr.rwlock);
> -	LIST_INSERT_HEAD(&sh->mr.mr_list, mr, mr);
> +	rte_rwlock_write_lock(&sh->share_cache.rwlock);
> +	LIST_INSERT_HEAD(&sh->share_cache.mr_list, mr, mr);
>  	/* Insert to the global cache table. */
> -	mr_insert_dev_cache(sh, mr);
> -	rte_rwlock_write_unlock(&sh->mr.rwlock);
> +	mlx5_mr_insert_cache(&sh->share_cache, mr);
> +	rte_rwlock_write_unlock(&sh->share_cache.rwlock);
>  	return 0;
>  }
> 
> @@ -1390,7 +388,7 @@ mlx5_dma_unmap(struct rte_pci_device *pdev, void *addr,
>  	struct mlx5_priv *priv;
>  	struct mlx5_ibv_shared *sh;
>  	struct mlx5_mr *mr;
> -	struct mlx5_mr_cache entry;
> +	struct mr_cache_entry entry;
> 
>  	dev = pci_dev_to_eth_dev(pdev);
>  	if (!dev) {
> @@ -1401,10 +399,10 @@ mlx5_dma_unmap(struct rte_pci_device *pdev, void *addr,
>  	}
>  	priv = dev->data->dev_private;
>  	sh = priv->sh;
> -	rte_rwlock_read_lock(&sh->mr.rwlock);
> -	mr = mr_lookup_dev_list(sh, &entry, (uintptr_t)addr);
> +	rte_rwlock_read_lock(&sh->share_cache.rwlock);
> +	mr = mlx5_mr_lookup_list(&sh->share_cache, &entry, (uintptr_t)addr);
>  	if (!mr) {
> -		rte_rwlock_read_unlock(&sh->mr.rwlock);
> +		rte_rwlock_read_unlock(&sh->share_cache.rwlock);
>  		DRV_LOG(WARNING, "address 0x%" PRIxPTR " wasn't registered "
>  				 "to PCI device %p", (uintptr_t)addr,
>  				 (void *)pdev);
> @@ -1412,10 +410,10 @@ mlx5_dma_unmap(struct rte_pci_device *pdev, void *addr,
>  		return -1;
>  	}
>  	LIST_REMOVE(mr, mr);
> -	LIST_INSERT_HEAD(&sh->mr.mr_free_list, mr, mr);
> +	LIST_INSERT_HEAD(&sh->share_cache.mr_free_list, mr, mr);
>  	DEBUG("port %u remove MR(%p) from list", dev->data->port_id,
>  	      (void *)mr);
> -	mr_rebuild_dev_cache(sh);
> +	mlx5_mr_rebuild_cache(&sh->share_cache);
>  	/*
>  	 * Flush local caches by propagating invalidation across cores.
>  	 * rte_smp_wmb() is enough to synchronize this event. If one of
> @@ -1425,10 +423,11 @@ mlx5_dma_unmap(struct rte_pci_device *pdev, void *addr,
>  	 * generation below) will be guaranteed to be seen by other core
>  	 * before the core sees the newly allocated memory.
>  	 */
> -	++sh->mr.dev_gen;
> -	DEBUG("broadcasting local cache flush, gen=%d",	sh-
> >mr.dev_gen);
> +	++sh->share_cache.dev_gen;
> +	DEBUG("broadcasting local cache flush, gen=%d",
> +	      sh->share_cache.dev_gen);
>  	rte_smp_wmb();
> -	rte_rwlock_read_unlock(&sh->mr.rwlock);
> +	rte_rwlock_read_unlock(&sh->share_cache.rwlock);
>  	return 0;
>  }
> 
> @@ -1503,14 +502,19 @@ mlx5_mr_update_mp_cb(struct rte_mempool *mp __rte_unused, void *opaque,
>  		     unsigned mem_idx __rte_unused)
>  {
>  	struct mr_update_mp_data *data = opaque;
> +	struct rte_eth_dev *dev = data->dev;
> +	struct mlx5_priv *priv = dev->data->dev_private;
> +
>  	uint32_t lkey;
> 
>  	/* Stop iteration if failed in the previous walk. */
>  	if (data->ret < 0)
>  		return;
>  	/* Register address of the chunk and update local caches. */
> -	lkey = mlx5_mr_addr2mr_bh(data->dev, data->mr_ctrl,
> -				  (uintptr_t)memhdr->addr);
> +	lkey = mlx5_mr_addr2mr_bh(priv->sh->pd, &priv->mp_id,
> +				  &priv->sh->share_cache, data->mr_ctrl,
> +				  (uintptr_t)memhdr->addr,
> +				  priv->config.mr_ext_memseg_en);
>  	if (lkey == UINT32_MAX)
>  		data->ret = -1;
>  }
> @@ -1545,76 +549,3 @@ mlx5_mr_update_mp(struct rte_eth_dev *dev, struct mlx5_mr_ctrl *mr_ctrl,
>  	}
>  	return data.ret;
>  }
> -
> -/**
> - * Dump all the created MRs and the global cache entries.
> - *
> - * @param sh
> - *   Pointer to Ethernet device shared context.
> - */
> -void
> -mlx5_mr_dump_dev(struct mlx5_ibv_shared *sh __rte_unused)
> -{
> -#ifdef RTE_LIBRTE_MLX5_DEBUG
> -	struct mlx5_mr *mr;
> -	int mr_n = 0;
> -	int chunk_n = 0;
> -
> -	rte_rwlock_read_lock(&sh->mr.rwlock);
> -	/* Iterate all the existing MRs. */
> -	LIST_FOREACH(mr, &sh->mr.mr_list, mr) {
> -		unsigned int n;
> -
> -		DEBUG("device %s MR[%u], LKey = 0x%x, ms_n = %u,
> ms_bmp_n = %u",
> -		      sh->ibdev_name, mr_n++,
> -		      rte_cpu_to_be_32(mr->ibv_mr->lkey),
> -		      mr->ms_n, mr->ms_bmp_n);
> -		if (mr->ms_n == 0)
> -			continue;
> -		for (n = 0; n < mr->ms_bmp_n; ) {
> -			struct mlx5_mr_cache ret = { 0, };
> -
> -			n = mr_find_next_chunk(mr, &ret, n);
> -			if (!ret.end)
> -				break;
> -			DEBUG("  chunk[%u], [0x%" PRIxPTR ", 0x%" PRIxPTR
> ")",
> -			      chunk_n++, ret.start, ret.end);
> -		}
> -	}
> -	DEBUG("device %s dumping global cache", sh->ibdev_name);
> -	mlx5_mr_btree_dump(&sh->mr.cache);
> -	rte_rwlock_read_unlock(&sh->mr.rwlock);
> -#endif
> -}
> -
> -/**
> - * Release all the created MRs and resources for shared device context.
> - *
> - * @param sh
> - *   Pointer to Ethernet device shared context.
> - */
> -void
> -mlx5_mr_release(struct mlx5_ibv_shared *sh)
> -{
> -	struct mlx5_mr *mr_next;
> -
> -	if (rte_log_can_log(mlx5_logtype, RTE_LOG_DEBUG))
> -		mlx5_mr_dump_dev(sh);
> -	rte_rwlock_write_lock(&sh->mr.rwlock);
> -	/* Detach from MR list and move to free list. */
> -	mr_next = LIST_FIRST(&sh->mr.mr_list);
> -	while (mr_next != NULL) {
> -		struct mlx5_mr *mr = mr_next;
> -
> -		mr_next = LIST_NEXT(mr, mr);
> -		LIST_REMOVE(mr, mr);
> -		LIST_INSERT_HEAD(&sh->mr.mr_free_list, mr, mr);
> -	}
> -	LIST_INIT(&sh->mr.mr_list);
> -	/* Free global cache. */
> -	mlx5_mr_btree_free(&sh->mr.cache);
> -	rte_rwlock_write_unlock(&sh->mr.rwlock);
> -	/* Free all remaining MRs. */
> -	mlx5_mr_garbage_collect(sh);
> -}
> diff --git a/drivers/net/mlx5/mlx5_mr.h b/drivers/net/mlx5/mlx5_mr.h
> index 48264c8294..0c5877b3d6 100644
> --- a/drivers/net/mlx5/mlx5_mr.h
> +++ b/drivers/net/mlx5/mlx5_mr.h
> @@ -24,99 +24,16 @@
>  #include <rte_ethdev.h>
>  #include <rte_rwlock.h>
>  #include <rte_bitmap.h>
> +#include <rte_memory.h>
> 
> -/* Memory Region object. */
> -struct mlx5_mr {
> -	LIST_ENTRY(mlx5_mr) mr; /**< Pointer to the prev/next entry. */
> -	struct ibv_mr *ibv_mr; /* Verbs Memory Region. */
> -	const struct rte_memseg_list *msl;
> -	int ms_base_idx; /* Start index of msl->memseg_arr[]. */
> -	int ms_n; /* Number of memsegs in use. */
> -	uint32_t ms_bmp_n; /* Number of bits in memsegs bit-mask. */
> -	struct rte_bitmap *ms_bmp; /* Bit-mask of memsegs belonged to MR. */
> -};
> -
> -/* Cache entry for Memory Region. */
> -struct mlx5_mr_cache {
> -	uintptr_t start; /* Start address of MR. */
> -	uintptr_t end; /* End address of MR. */
> -	uint32_t lkey; /* rte_cpu_to_be_32(ibv_mr->lkey). */
> -} __rte_packed;
> -
> -/* MR Cache table for Binary search. */
> -struct mlx5_mr_btree {
> -	uint16_t len; /* Number of entries. */
> -	uint16_t size; /* Total number of entries. */
> -	int overflow; /* Mark failure of table expansion. */
> -	struct mlx5_mr_cache (*table)[];
> -} __rte_packed;
> -
> -/* Per-queue MR control descriptor. */
> -struct mlx5_mr_ctrl {
> -	uint32_t *dev_gen_ptr; /* Generation number of device to poll. */
> -	uint32_t cur_gen; /* Generation number saved to flush caches. */
> -	uint16_t mru; /* Index of last hit entry in top-half cache. */
> -	uint16_t head; /* Index of the oldest entry in top-half cache. */
> -	struct mlx5_mr_cache cache[MLX5_MR_CACHE_N]; /* Cache for top-half. */
> -	struct mlx5_mr_btree cache_bh; /* Cache for bottom-half. */
> -} __rte_packed;
> -
> -struct mlx5_ibv_shared;
> -extern struct mlx5_dev_list  mlx5_mem_event_cb_list;
> -extern rte_rwlock_t mlx5_mem_event_rwlock;
> +#include <mlx5_common_mr.h>
> 
>  /* First entry must be NULL for comparison. */
>  #define mlx5_mr_btree_len(bt) ((bt)->len - 1)
> 
> -int mlx5_mr_btree_init(struct mlx5_mr_btree *bt, int n, int socket);
> -void mlx5_mr_btree_free(struct mlx5_mr_btree *bt);
> -uint32_t mlx5_mr_create_primary(struct rte_eth_dev *dev,
> -				struct mlx5_mr_cache *entry, uintptr_t addr);
>  void mlx5_mr_mem_event_cb(enum rte_mem_event event_type, const void *addr,
>  			  size_t len, void *arg);
>  int mlx5_mr_update_mp(struct rte_eth_dev *dev, struct mlx5_mr_ctrl *mr_ctrl,
>  		      struct rte_mempool *mp);
> -void mlx5_mr_release(struct mlx5_ibv_shared *sh);
> -
> -/* Debug purpose functions. */
> -void mlx5_mr_btree_dump(struct mlx5_mr_btree *bt);
> -void mlx5_mr_dump_dev(struct mlx5_ibv_shared *sh);
> -
> -/**
> - * Look up LKey from given lookup table by linear search. Firstly look up the
> - * last-hit entry. If miss, the entire array is searched. If found, update the
> - * last-hit index and return LKey.
> - *
> - * @param lkp_tbl
> - *   Pointer to lookup table.
> - * @param[in,out] cached_idx
> - *   Pointer to last-hit index.
> - * @param n
> - *   Size of lookup table.
> - * @param addr
> - *   Search key.
> - *
> - * @return
> - *   Searched LKey on success, UINT32_MAX on no match.
> - */
> -static __rte_always_inline uint32_t
> -mlx5_mr_lookup_cache(struct mlx5_mr_cache *lkp_tbl, uint16_t *cached_idx,
> -		     uint16_t n, uintptr_t addr)
> -{
> -	uint16_t idx;
> -
> -	if (likely(addr >= lkp_tbl[*cached_idx].start &&
> -		   addr < lkp_tbl[*cached_idx].end))
> -		return lkp_tbl[*cached_idx].lkey;
> -	for (idx = 0; idx < n && lkp_tbl[idx].start != 0; ++idx) {
> -		if (addr >= lkp_tbl[idx].start &&
> -		    addr < lkp_tbl[idx].end) {
> -			/* Found. */
> -			*cached_idx = idx;
> -			return lkp_tbl[idx].lkey;
> -		}
> -	}
> -	return UINT32_MAX;
> -}
> 
>  #endif /* RTE_PMD_MLX5_MR_H_ */
> diff --git a/drivers/net/mlx5/mlx5_rxtx.c b/drivers/net/mlx5/mlx5_rxtx.c
> index 42d7da8a4b..3e583d49a6 100644
> --- a/drivers/net/mlx5/mlx5_rxtx.c
> +++ b/drivers/net/mlx5/mlx5_rxtx.c
> @@ -33,6 +33,7 @@
> 
>  #include "mlx5_defs.h"
>  #include "mlx5.h"
> +#include "mlx5_mr.h"
>  #include "mlx5_utils.h"
>  #include "mlx5_rxtx.h"
>  #include "mlx5_autoconf.h"
> diff --git a/drivers/net/mlx5/mlx5_rxtx.h b/drivers/net/mlx5/mlx5_rxtx.h
> index d155c241eb..537d449c88 100644
> --- a/drivers/net/mlx5/mlx5_rxtx.h
> +++ b/drivers/net/mlx5/mlx5_rxtx.h
> @@ -34,11 +34,11 @@
>  #include <mlx5_glue.h>
>  #include <mlx5_prm.h>
>  #include <mlx5_common.h>
> +#include <mlx5_common_mr.h>
> 
>  #include "mlx5_defs.h"
>  #include "mlx5_utils.h"
>  #include "mlx5.h"
> -#include "mlx5_mr.h"
>  #include "mlx5_autoconf.h"
> 
>  /* Support tunnel matching. */
> @@ -598,8 +598,8 @@ mlx5_rx_addr2mr(struct mlx5_rxq_data *rxq, uintptr_t addr)
>  	uint32_t lkey;
> 
>  	/* Linear search on MR cache array. */
> -	lkey = mlx5_mr_lookup_cache(mr_ctrl->cache, &mr_ctrl->mru,
> -				    MLX5_MR_CACHE_N, addr);
> +	lkey = mlx5_mr_lookup_lkey(mr_ctrl->cache, &mr_ctrl->mru,
> +				   MLX5_MR_CACHE_N, addr);
>  	if (likely(lkey != UINT32_MAX))
>  		return lkey;
>  	/* Take slower bottom-half (Binary Search) on miss. */
> @@ -630,8 +630,8 @@ mlx5_tx_mb2mr(struct mlx5_txq_data *txq, struct rte_mbuf *mb)
>  	if (unlikely(*mr_ctrl->dev_gen_ptr != mr_ctrl->cur_gen))
>  		mlx5_mr_flush_local_cache(mr_ctrl);
>  	/* Linear search on MR cache array. */
> -	lkey = mlx5_mr_lookup_cache(mr_ctrl->cache, &mr_ctrl->mru,
> -				    MLX5_MR_CACHE_N, addr);
> +	lkey = mlx5_mr_lookup_lkey(mr_ctrl->cache, &mr_ctrl->mru,
> +				   MLX5_MR_CACHE_N, addr);
>  	if (likely(lkey != UINT32_MAX))
>  		return lkey;
>  	/* Take slower bottom-half on miss. */
> diff --git a/drivers/net/mlx5/mlx5_rxtx_vec.h b/drivers/net/mlx5/mlx5_rxtx_vec.h
> index ea925156f0..6ddcbfb0ad 100644
> --- a/drivers/net/mlx5/mlx5_rxtx_vec.h
> +++ b/drivers/net/mlx5/mlx5_rxtx_vec.h
> @@ -13,6 +13,8 @@
> 
>  #include "mlx5_autoconf.h"
> 
> +#include "mlx5_mr.h"
> +
>  /* HW checksum offload capabilities of vectorized Tx. */
>  #define MLX5_VEC_TX_CKSUM_OFFLOAD_CAP \
>  	(DEV_TX_OFFLOAD_IPV4_CKSUM | \
> diff --git a/drivers/net/mlx5/mlx5_trigger.c b/drivers/net/mlx5/mlx5_trigger.c
> index 438b705952..759670408b 100644
> --- a/drivers/net/mlx5/mlx5_trigger.c
> +++ b/drivers/net/mlx5/mlx5_trigger.c
> @@ -11,6 +11,7 @@
>  #include <rte_alarm.h>
> 
>  #include "mlx5.h"
> +#include "mlx5_mr.h"
>  #include "mlx5_rxtx.h"
>  #include "mlx5_utils.h"
>  #include "rte_pmd_mlx5.h"
> diff --git a/drivers/net/mlx5/mlx5_txq.c b/drivers/net/mlx5/mlx5_txq.c
> index 0653f4cf30..29e5cabab6 100644
> --- a/drivers/net/mlx5/mlx5_txq.c
> +++ b/drivers/net/mlx5/mlx5_txq.c
> @@ -30,6 +30,7 @@
>  #include <mlx5_glue.h>
>  #include <mlx5_devx_cmds.h>
>  #include <mlx5_common.h>
> +#include <mlx5_common_mr.h>
> 
>  #include "mlx5_defs.h"
>  #include "mlx5_utils.h"
> @@ -1289,7 +1290,7 @@ mlx5_txq_new(struct rte_eth_dev *dev, uint16_t idx, uint16_t desc,
>  		goto error;
>  	}
>  	/* Save pointer of global generation number to check memory event. */
> -	tmpl->txq.mr_ctrl.dev_gen_ptr = &priv->sh->mr.dev_gen;
> +	tmpl->txq.mr_ctrl.dev_gen_ptr = &priv->sh->share_cache.dev_gen;
>  	MLX5_ASSERT(desc > MLX5_TX_COMP_THRESH);
>  	tmpl->txq.offloads = conf->offloads |
>  			     dev->data->dev_conf.txmode.offloads;
> --
> 2.16.6


^ permalink raw reply	[flat|nested] 26+ messages in thread
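
For readers tracking the refactor: the per-queue lookup contract is kept
as-is when it moves to the common driver. The fast path does a linear scan
of a small per-queue array with a most-recently-used hint (the inline
mlx5_mr_lookup_cache() removed above, renamed mlx5_mr_lookup_lkey() in the
common code) and only falls back to the bottom-half B-tree on a miss. Below
is a minimal sketch of that top-half logic, reusing the mr_cache_entry
layout from the diff; the helper name lkey_lookup_sketch is illustrative,
not part of the patch:

	#include <stdint.h>

	struct mr_cache_entry {
		uintptr_t start; /* Start address of MR. */
		uintptr_t end;   /* End address of MR. */
		uint32_t lkey;   /* Big-endian lkey of the MR. */
	};

	/* Check the last-hit slot first, then scan the array linearly. */
	static uint32_t
	lkey_lookup_sketch(const struct mr_cache_entry *tbl, uint16_t *mru,
			   uint16_t n, uintptr_t addr)
	{
		uint16_t idx;

		if (addr >= tbl[*mru].start && addr < tbl[*mru].end)
			return tbl[*mru].lkey;
		for (idx = 0; idx < n && tbl[idx].start != 0; ++idx) {
			if (addr >= tbl[idx].start && addr < tbl[idx].end) {
				*mru = idx; /* Remember the hit slot. */
				return tbl[idx].lkey;
			}
		}
		return UINT32_MAX; /* Miss: caller takes the bottom half. */
	}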

* Re: [dpdk-dev] [PATCH v4 0/2] refactor multi-process IPC and memory management codes to common driver
  2020-04-13 21:17 ` [dpdk-dev] [PATCH v4 0/2] refactor multi-process IPC and memory management codes to common driver Vu Pham
  2020-04-13 21:17   ` [dpdk-dev] [PATCH v4 1/2] common/mlx5: refactor multi-process IPC handling " Vu Pham
  2020-04-13 21:17   ` [dpdk-dev] [PATCH v4 2/2] common/mlx5: refactor memory management codes Vu Pham
@ 2020-04-15  9:30   ` Raslan Darawsheh
  2 siblings, 0 replies; 26+ messages in thread
From: Raslan Darawsheh @ 2020-04-15  9:30 UTC (permalink / raw)
  To: Vu Pham, dev; +Cc: Slava Ovsiienko, Ori Kam, Matan Azrad, Vu Pham

Hi,

> -----Original Message-----
> From: Vu Pham <vuhuong@mellanox.com>
> Sent: Tuesday, April 14, 2020 12:18 AM
> To: dev@dpdk.org
> Cc: Slava Ovsiienko <viacheslavo@mellanox.com>; Ori Kam
> <orika@mellanox.com>; Matan Azrad <matan@mellanox.com>; Raslan
> Darawsheh <rasland@mellanox.com>; Vu Pham <vuhuong@mellanox.com>
> Subject: [PATCH v4 0/2] refactor multi-process IPC and memory
> management codes to common driver
> 
> Current mlx5 net PMD and future mlx5(regex,...) PMDs that run
> and share the same HCAs need to use common memory management
> driver. Memory management codes embeddedly use multi-process IPC
> for primary/secondary processes to register and sync on memory
> registrations MRs. That's the main reason to refactor and move
> multi-process IPC APIs to mlx5 common driver and make it become
> the base commit, then refactor and move common MR codes to
> common driver in subsequent patch.
> 
> Vu Pham (2):
>   common/mlx5: refactor multi-process IPC handling codes to common
>     driver
>   common/mlx5: refactor memory management codes
> 
>  drivers/common/mlx5/Makefile                    |    4 +-
>  drivers/common/mlx5/meson.build                 |    2 +
>  drivers/common/mlx5/mlx5_common_mp.c            |  188 ++++
>  drivers/common/mlx5/mlx5_common_mp.h            |   98 ++
>  drivers/common/mlx5/mlx5_common_mr.c            | 1108 +++++++++++++++++++++
>  drivers/common/mlx5/mlx5_common_mr.h            |  160 ++++
>  drivers/common/mlx5/rte_common_mlx5_version.map |   27 +
>  drivers/net/mlx5/mlx5.c                         |   19 +-
>  drivers/net/mlx5/mlx5.h                         |   55 +-
>  drivers/net/mlx5/mlx5_mp.c                      |  242 +----
>  drivers/net/mlx5/mlx5_mr.c                      | 1169 +----------------------
>  drivers/net/mlx5/mlx5_mr.h                      |   87 +-
>  drivers/net/mlx5/mlx5_rxtx.c                    |    4 +-
>  drivers/net/mlx5/mlx5_rxtx.h                    |   10 +-
>  drivers/net/mlx5/mlx5_rxtx_vec.h                |    2 +
>  drivers/net/mlx5/mlx5_trigger.c                 |    1 +
>  drivers/net/mlx5/mlx5_txq.c                     |    3 +-
>  17 files changed, 1692 insertions(+), 1487 deletions(-)
>  create mode 100644 drivers/common/mlx5/mlx5_common_mp.c
>  create mode 100644 drivers/common/mlx5/mlx5_common_mp.h
>  create mode 100644 drivers/common/mlx5/mlx5_common_mr.c
>  create mode 100644 drivers/common/mlx5/mlx5_common_mr.h
> 
> --
> 2.16.6


Fixed the first commit title, which was too long,

Series applied to next-net-mlx,

Kindest regards,
Raslan Darawsheh

^ permalink raw reply	[flat|nested] 26+ messages in thread
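
One detail worth calling out in the hunks above: the control path never
touches the per-queue caches directly. After rebuilding the global cache,
mlx5_dma_unmap() and the memory free callback bump share_cache.dev_gen and
issue rte_smp_wmb(); each queue compares its saved generation against that
counter at the top of the datapath lookup and flushes its own caches on a
mismatch. A simplified sketch of the two sides, assuming the usual DPDK
headers (rte_atomic.h, rte_branch_prediction.h) and that the writer holds
the share-cache rwlock as in the patch; the helper names are illustrative:

	/* Control path: broadcast invalidation after a global cache rebuild. */
	static void
	mr_broadcast_flush(uint32_t *dev_gen)
	{
		++(*dev_gen);  /* New generation marks per-queue caches stale. */
		rte_smp_wmb(); /* Publish before new/freed memory is observed. */
	}

	/* Datapath: run before the linear lookup, as mlx5_tx_mb2mr() does. */
	static void
	mr_check_generation(struct mlx5_mr_ctrl *mr_ctrl)
	{
		if (unlikely(*mr_ctrl->dev_gen_ptr != mr_ctrl->cur_gen))
			mlx5_mr_flush_local_cache(mr_ctrl); /* Resyncs cur_gen. */
	}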

end of thread, other threads:[~2020-04-15  9:30 UTC | newest]

Thread overview: 26+ messages
2020-04-02 19:21 [dpdk-dev] [PATCH 0/4] refactor multi-process IPC and memory management codes to common driver Vu Pham
2020-04-02 19:21 ` [dpdk-dev] [PATCH 1/4] common/mlx5: refactor multi-process IPC handling " Vu Pham
2020-04-02 19:21 ` [dpdk-dev] [PATCH 2/4] net/mlx5: modify net PMD to use common multi-process APIs Vu Pham
2020-04-02 19:21 ` [dpdk-dev] [PATCH 3/4] common/mlx5: refactor memory management codes Vu Pham
2020-04-02 19:21 ` [dpdk-dev] [PATCH 4/4] net/mlx5: modify net PMD to use common memory management driver Vu Pham
2020-04-07 16:48 ` [dpdk-dev] [PATCH v2 0/4] refactor multi-process IPC and memory management codes to common driver Vu Pham
2020-04-07 16:48   ` [dpdk-dev] [PATCH v2 1/4] common/mlx5: refactor MP IPC handling " Vu Pham
2020-04-08  9:05     ` Slava Ovsiienko
2020-04-07 16:48   ` [dpdk-dev] [PATCH v2 2/4] net/mlx5: modify net pmd to use common multi-process APIs Vu Pham
2020-04-07 16:48   ` [dpdk-dev] [PATCH v2 3/4] common/mlx5: refactor memory management codes Vu Pham
2020-04-07 16:48   ` [dpdk-dev] [PATCH v2 4/4] net/mlx5: modify net pmd to use common MR driver Vu Pham
2020-04-07 17:00 ` [dpdk-dev] [PATCH v3 0/4] refactor multi-process IPC and memory management codes to common driver Vu Pham
2020-04-07 17:00   ` [dpdk-dev] [PATCH v3 1/4] common/mlx5: refactor multi-process IPC handling " Vu Pham
2020-04-08  9:05     ` Slava Ovsiienko
2020-04-07 17:00   ` [dpdk-dev] [PATCH v3 2/4] net/mlx5: modify net PMD to use common multi-process APIs Vu Pham
2020-04-08  9:05     ` Slava Ovsiienko
2020-04-07 17:00   ` [dpdk-dev] [PATCH v3 3/4] common/mlx5: refactor memory management codes Vu Pham
2020-04-08  9:04     ` Slava Ovsiienko
2020-04-07 17:00   ` [dpdk-dev] [PATCH v3 4/4] net/mlx5: modify net PMD to use common MR driver Vu Pham
2020-04-08  9:06     ` Slava Ovsiienko
2020-04-13 21:17 ` [dpdk-dev] [PATCH v4 0/2] refactor multi-process IPC and memory management codes to common driver Vu Pham
2020-04-13 21:17   ` [dpdk-dev] [PATCH v4 1/2] common/mlx5: refactor multi-process IPC handling " Vu Pham
2020-04-14  7:26     ` Slava Ovsiienko
2020-04-13 21:17   ` [dpdk-dev] [PATCH v4 2/2] common/mlx5: refactor memory management codes Vu Pham
2020-04-14  7:27     ` Slava Ovsiienko
2020-04-15  9:30   ` [dpdk-dev] [PATCH v4 0/2] refactor multi-process IPC and memory management codes to common driver Raslan Darawsheh
