* [PATCH for-next 0/9] Peer-Direct support
From: Yishai Hadas @ 2014-10-01 15:18 UTC
  To: roland-DgEjT+Ai2ygdnm+yROfE0A
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	raindel-VPRAkNaXOzVWk0Htik3J/w, yishaih-VPRAkNaXOzVWk0Htik3J/w

The following set of patches implements Peer-Direct support over the
RDMA stack.
    
Peer-Direct technology allows RDMA operations to directly target
memory in external hardware devices, such as GPU cards, SSD-based
storage, dedicated ASIC accelerators, etc.
    
This technology allows RDMA-based applications (over InfiniBand/RoCE)
to avoid unneeded data copying when sharing data between peer hardware
devices.
    
To implement this technology, we defined an API to securely expose the
memory of a hardware device (peer memory) to an RDMA hardware device.
    
The API defined for Peer-Direct is described in this cover letter.
The required implementation for a hardware device to expose memory
buffers over Peer-Direct is also detailed in this letter.
    
Finally, the cover letter includes a description of the flow and the
API that IB core and low level IB hardware drivers implement to
support the technology.

Flow:
-----------------
Each peer memory client should register itself with the IB core
(ib_core) module and provide a set of callbacks to manage its basic
memory functionality.

The required functionality includes getting page descriptors based
upon a user space virtual address, DMA mapping these pages, getting
the memory page size, removing the DMA mapping of the pages, and
releasing the page descriptors.
These callbacks are quite similar to the kernel API used to pin normal
host memory and expose it to the hardware.
A description of the API is included later in this cover letter.

The peer direct controller, implemented as part of the IB core
services, provides registry and brokering services between peer memory
providers and low level IB hardware drivers.
This makes the usage of peer-direct almost completely transparent to
the individual hardware drivers. The only change required in the low
level IB hardware drivers is support for an interface for immediate
invalidation of registered memory regions.

The IB hardware driver should call ib_umem_get with an extra flag
signaling that the requested memory may reside on a peer memory. When
a given user space virtual memory address is found to belong to a peer
memory client, an ib_umem is built using the callbacks provided by the
peer memory client. If the IB hardware driver supports invalidation on
that ib_umem, it must signal this as part of ib_umem_get; otherwise,
if the peer memory requires invalidation support, the registration
will be rejected.

After getting the ib_umem, if it resides on a peer memory that
requires invalidation support, the low level IB hardware driver must
register the invalidation callback for this ib_umem.
Once this callback is called, the driver must ensure that no access to
the memory mapped by the umem happens after the callback returns.
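
For illustration, a minimal sketch of the driver-side flow under these
rules (error handling trimmed; ib_umem_get with its new peer_mem_flags
argument and IB_PEER_MEM_ALLOW are taken from patch 3 below):

	/* Opt in to peer memory when building the umem. */
	umem = ib_umem_get(pd->uobject->context, start, length,
			   access_flags, 0 /* dmasync */, IB_PEER_MEM_ALLOW);
	if (IS_ERR(umem))
		return PTR_ERR(umem);

	/* ... program the HCA with umem->sg_head / umem->nmap ... */

	/* Release unpins and unmaps through the owning peer client. */
	ib_umem_release(umem);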

Information and statistics regarding the registered peer memory
clients are exported to the user space at:
/sys/kernel/mm/memory_peers/<peer_name>/.
===============================================================================
Peer memory API
===============================================================================
Peer client structure:
-------------------------------------------------------------------------------
struct peer_memory_client {
       char name[IB_PEER_MEMORY_NAME_MAX];
       char version[IB_PEER_MEMORY_VER_MAX];
       int (*acquire)(unsigned long addr, size_t size,
                      void *peer_mem_private_data, char *peer_mem_name,
                      void **client_context);
       int (*get_pages)(unsigned long addr, size_t size, int write, int force,
                        struct sg_table *sg_head,
                        void *client_context, void *core_context);
       int (*dma_map)(struct sg_table *sg_head, void *client_context,
                      struct device *dma_device, int dmasync, int *nmap);
       int (*dma_unmap)(struct sg_table *sg_head, void *client_context,
                        struct device *dma_device);
       void (*put_pages)(struct sg_table *sg_head, void *client_context);
       unsigned long (*get_page_size)(void *client_context);
       void (*release)(void *client_context);
};

A detailed description of the above callbacks is provided in the
peer_mem.h header file, added by the first patch.
-----------------------------------------------------------------------------------
void *ib_register_peer_memory_client(struct peer_memory_client *peer_client,
                                    invalidate_peer_memory *invalidate_callback);

Description:
Each peer memory client should use this function to register itself
as an available peer memory client during its initialization. The
callbacks provided as part of the peer_client may be used later on by
the IB core when registering and unregistering its memory.
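
For illustration, a minimal registration sketch from a peer client
module's init function (the names here are hypothetical; the pattern
follows the example client added in patch 9):

	/* example_client is a hypothetical struct peer_memory_client
	 * filled in by this module. */
	static invalidate_peer_memory mem_invalidate_callback;
	static void *reg_handle;

	static int __init example_peer_init(void)
	{
		reg_handle = ib_register_peer_memory_client(&example_client,
						&mem_invalidate_callback);
		if (!reg_handle)
			return -EINVAL;
		return 0;
	}
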
----------------------------------------------------------------------------------

void ib_unregister_peer_memory_client(void *reg_handle);

Description:
On unload, the peer memory client must unregister itself to prevent
any additional callbacks into the unloaded module.
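
Continuing the sketch above, the matching module exit would be:

	static void __exit example_peer_cleanup(void)
	{
		ib_unregister_peer_memory_client(reg_handle);
	}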

----------------------------------------------------------------------------------
typedef int (*invalidate_peer_memory)(void *reg_handle,
                                    void *core_context);

Description:
A callback function to be called by the peer driver when an allocation
should be invalidated. When the invalidation callback returns, the user
of the allocation is guaranteed not to access it.
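
For illustration, a peer client that must reclaim pinned memory would
invoke the callback it received at registration time (core_context is
the opaque value the IB core passed to its get_pages callback):

	/* Blocks until the IB core guarantees no further access. */
	if (mem_invalidate_callback)
		mem_invalidate_callback(reg_handle, core_context);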

-------------------------------------------------------------------------------

The structure of the patchset:

The patches apply against the for-next branch of the
roland/infiniband.git tree, based upon commit ID
3bdad2d13fa62bcb59ca2506e74ce467ea436586, subject: "Merge branches
'core', 'ipoib', 'iser', 'mlx4', 'ocrdma' and 'qib' into for-next".

Patches 1-3:
These patches introduce the API and add the required support to the
IB core layer, allowing peers to be registered and take part in the
flow. The first patch introduces the API; the next two patches add the
infrastructure to manage peer clients and use their registration
callbacks.

Patches 4-5:
These patches allow peers to notify the IB core that a specific memory
registration should be invalidated.

Patch 6:
This patch exposes information and statistics for a given peer memory
client via sysfs.

Patches 7-8:
These patches add the functionality needed by mlx4 & mlx5 to work with
peer clients that require invalidation support. Currently, that
support covers MRs only.

Patch 9:
This patch is an example peer memory client which uses host memory;
it can serve as a good reference for peer client writers.

Yishai Hadas (9):
  IB/core: Introduce peer client interface
  IB/core: Get/put peer memory client
  IB/core: Umem tunneling peer memory APIs
  IB/core: Infrastructure to manage peer core context
  IB/core: Invalidation support for peer memory
  IB/core: Sysfs support for peer memory
  IB/mlx4: Invalidation support for MR over peer memory
  IB/mlx5: Invalidation support for MR over peer memory
  Samples: Peer memory client example

 drivers/infiniband/core/Makefile             |    3 +-
 drivers/infiniband/core/peer_mem.c           |  530 ++++++++++++++++++++++++++
 drivers/infiniband/core/umem.c               |  118 ++++++-
 drivers/infiniband/core/uverbs_cmd.c         |    2 +
 drivers/infiniband/hw/amso1100/c2_provider.c |    2 +-
 drivers/infiniband/hw/cxgb3/iwch_provider.c  |    2 +-
 drivers/infiniband/hw/cxgb4/mem.c            |    2 +-
 drivers/infiniband/hw/ehca/ehca_mrmw.c       |    2 +-
 drivers/infiniband/hw/ipath/ipath_mr.c       |    2 +-
 drivers/infiniband/hw/mlx4/cq.c              |    2 +-
 drivers/infiniband/hw/mlx4/doorbell.c        |    2 +-
 drivers/infiniband/hw/mlx4/main.c            |    3 +-
 drivers/infiniband/hw/mlx4/mlx4_ib.h         |    5 +
 drivers/infiniband/hw/mlx4/mr.c              |   90 ++++-
 drivers/infiniband/hw/mlx4/qp.c              |    2 +-
 drivers/infiniband/hw/mlx4/srq.c             |    2 +-
 drivers/infiniband/hw/mlx5/cq.c              |    5 +-
 drivers/infiniband/hw/mlx5/doorbell.c        |    2 +-
 drivers/infiniband/hw/mlx5/main.c            |    3 +-
 drivers/infiniband/hw/mlx5/mlx5_ib.h         |   10 +
 drivers/infiniband/hw/mlx5/mr.c              |   84 ++++-
 drivers/infiniband/hw/mlx5/qp.c              |    2 +-
 drivers/infiniband/hw/mlx5/srq.c             |    2 +-
 drivers/infiniband/hw/mthca/mthca_provider.c |    2 +-
 drivers/infiniband/hw/nes/nes_verbs.c        |    2 +-
 drivers/infiniband/hw/ocrdma/ocrdma_verbs.c  |    2 +-
 drivers/infiniband/hw/qib/qib_mr.c           |    2 +-
 include/rdma/ib_peer_mem.h                   |   58 +++
 include/rdma/ib_umem.h                       |   36 ++-
 include/rdma/ib_verbs.h                      |    5 +-
 include/rdma/peer_mem.h                      |  185 +++++++++
 samples/Kconfig                              |   10 +
 samples/Makefile                             |    3 +-
 samples/peer_memory/Makefile                 |    1 +
 samples/peer_memory/example_peer_mem.c       |  260 +++++++++++++
 35 files changed, 1403 insertions(+), 40 deletions(-)
 create mode 100644 drivers/infiniband/core/peer_mem.c
 create mode 100644 include/rdma/ib_peer_mem.h
 create mode 100644 include/rdma/peer_mem.h
 create mode 100644 samples/peer_memory/Makefile
 create mode 100644 samples/peer_memory/example_peer_mem.c



* [PATCH for-next 1/9] IB/core: Introduce peer client interface
From: Yishai Hadas @ 2014-10-01 15:18 UTC
  To: roland-DgEjT+Ai2ygdnm+yROfE0A
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	raindel-VPRAkNaXOzVWk0Htik3J/w, yishaih-VPRAkNaXOzVWk0Htik3J/w

Introduces an API between the IB core and peer memory clients (e.g.
GPU cards) to provide the HCA with access to read/write GPU memory.

As a result, it allows RDMA-based applications to use GPU computing
power and the RDMA interconnect at the same time, without copying data
between the P2P devices.

Each peer memory client should register with the IB core. In the
registration request, it should supply callbacks for its basic memory
functionality, such as get/put pages, get_page_size and dma map/unmap.

The client can optionally require the ability to invalidate memory it
provided, by requesting the details of an invalidation callback.

Upon successful registration, the IB core provides the client with a
unique registration handle and, if required by the peer, an invalidate
callback function.

The handle should be used when unregistering the client. The callback
function can be used by the client (in later patches) to request that
the IB core immediately release pinned pages.

Each peer must be able to recognize whether it is the owner of a
specific virtual address range. If the answer is yes, further calls
for memory functionality will be tunneled to that peer.

The recognition is done via the 'acquire' call. The call arguments
provide the address and size of the requested memory. If peer-direct
context information is available from the user verbs context, it is
provided as well.
Upon recognition, the acquire call returns a peer-direct client
specific context. This context will be passed by the peer-direct
controller to the peer-direct client callbacks when referring to that
specific address range.
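
For illustration, a minimal acquire implementation following these
rules (example_owns_range and struct example_context are hypothetical):

	static int example_acquire(unsigned long addr, size_t size,
				   void *peer_mem_private_data,
				   char *peer_mem_name, void **client_context)
	{
		struct example_context *ctx;

		if (!example_owns_range(addr, size))
			return 0;	/* not our memory */

		ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);
		if (!ctx)
			return 0;	/* internal errors answer "not ours" too */

		ctx->addr = addr;
		ctx->size = size;
		*client_context = ctx;
		return 1;		/* we own this range */
	}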

Signed-off-by: Yishai Hadas <yishaih-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
Signed-off-by: Shachar Raindel <raindel-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
---
 drivers/infiniband/core/Makefile   |    3 +-
 drivers/infiniband/core/peer_mem.c |  116 ++++++++++++++++++++++
 include/rdma/ib_peer_mem.h         |   12 +++
 include/rdma/peer_mem.h            |  185 ++++++++++++++++++++++++++++++++++++
 4 files changed, 315 insertions(+), 1 deletions(-)
 create mode 100644 drivers/infiniband/core/peer_mem.c
 create mode 100644 include/rdma/ib_peer_mem.h
 create mode 100644 include/rdma/peer_mem.h

diff --git a/drivers/infiniband/core/Makefile b/drivers/infiniband/core/Makefile
index ffd0af6..e541ff0 100644
--- a/drivers/infiniband/core/Makefile
+++ b/drivers/infiniband/core/Makefile
@@ -9,7 +9,8 @@ obj-$(CONFIG_INFINIBAND_USER_ACCESS) +=	ib_uverbs.o ib_ucm.o \
 					$(user_access-y)
 
 ib_core-y :=			packer.o ud_header.o verbs.o sysfs.o \
-				device.o fmr_pool.o cache.o netlink.o
+				device.o fmr_pool.o cache.o netlink.o \
+				peer_mem.o
 ib_core-$(CONFIG_INFINIBAND_USER_MEM) += umem.o
 
 ib_mad-y :=			mad.o smi.o agent.o mad_rmpp.o
diff --git a/drivers/infiniband/core/peer_mem.c b/drivers/infiniband/core/peer_mem.c
new file mode 100644
index 0000000..ed2c9b1
--- /dev/null
+++ b/drivers/infiniband/core/peer_mem.c
@@ -0,0 +1,116 @@
+/*
+ * Copyright (c) 2014,  Mellanox Technologies. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+
+#include <rdma/ib_peer_mem.h>
+#include <rdma/ib_verbs.h>
+#include <rdma/ib_umem.h>
+
+static DEFINE_MUTEX(peer_memory_mutex);
+static LIST_HEAD(peer_memory_list);
+static int num_registered_peers;
+
+static int ib_invalidate_peer_memory(void *reg_handle, void *core_context)
+{
+	return -ENOSYS;
+}
+
+static int ib_memory_peer_check_mandatory(const struct peer_memory_client
+						     *peer_client)
+{
+#define PEER_MEM_MANDATORY_FUNC(x) {\
+	offsetof(struct peer_memory_client, x), #x }
+
	static const struct {
		size_t offset;
		char  *name;
	} mandatory_table[] = {
		PEER_MEM_MANDATORY_FUNC(acquire),
		PEER_MEM_MANDATORY_FUNC(get_pages),
		PEER_MEM_MANDATORY_FUNC(put_pages),
		PEER_MEM_MANDATORY_FUNC(get_page_size),
		PEER_MEM_MANDATORY_FUNC(dma_map),
		PEER_MEM_MANDATORY_FUNC(dma_unmap)
	};
	int i;

	for (i = 0; i < ARRAY_SIZE(mandatory_table); ++i) {
		if (!*(void **)((void *)peer_client + mandatory_table[i].offset)) {
			pr_err("Peer memory %s is missing mandatory function %s\n",
			       peer_client->name, mandatory_table[i].name);
			return -EINVAL;
		}
	}

	return 0;
+}
+
+void *ib_register_peer_memory_client(const struct peer_memory_client *peer_client,
+				     invalidate_peer_memory *invalidate_callback)
+{
+	struct ib_peer_memory_client *ib_peer_client;
+
+	if (ib_memory_peer_check_mandatory(peer_client))
+		return NULL;
+
+	ib_peer_client = kzalloc(sizeof(*ib_peer_client), GFP_KERNEL);
+	if (!ib_peer_client)
+		return NULL;
+
+	ib_peer_client->peer_mem = peer_client;
+	/* A peer that supplies a non-NULL callback pointer indicates that invalidation
+	 * support is required for any memory it owns.
+	 */
+	if (invalidate_callback) {
+		*invalidate_callback = ib_invalidate_peer_memory;
+		ib_peer_client->invalidation_required = 1;
+	}
+	mutex_lock(&peer_memory_mutex);
+	list_add_tail(&ib_peer_client->core_peer_list, &peer_memory_list);
+	num_registered_peers++;
+	mutex_unlock(&peer_memory_mutex);
+	return ib_peer_client;
+}
+EXPORT_SYMBOL(ib_register_peer_memory_client);
+
+void ib_unregister_peer_memory_client(void *reg_handle)
+{
+	struct ib_peer_memory_client *ib_peer_client =
+		(struct ib_peer_memory_client *)reg_handle;
+
+	mutex_lock(&peer_memory_mutex);
+	list_del(&ib_peer_client->core_peer_list);
+	num_registered_peers--;
+	mutex_unlock(&peer_memory_mutex);
+	kfree(ib_peer_client);
+}
+EXPORT_SYMBOL(ib_unregister_peer_memory_client);
diff --git a/include/rdma/ib_peer_mem.h b/include/rdma/ib_peer_mem.h
new file mode 100644
index 0000000..fac37b7
--- /dev/null
+++ b/include/rdma/ib_peer_mem.h
@@ -0,0 +1,12 @@
+#if !defined(IB_PEER_MEM_H)
+#define IB_PEER_MEM_H
+
+#include <rdma/peer_mem.h>
+
+struct ib_peer_memory_client {
+	const struct peer_memory_client *peer_mem;
+	struct list_head	core_peer_list;
+	int invalidation_required;
+};
+
+#endif
diff --git a/include/rdma/peer_mem.h b/include/rdma/peer_mem.h
new file mode 100644
index 0000000..9acb754
--- /dev/null
+++ b/include/rdma/peer_mem.h
@@ -0,0 +1,185 @@
+/*
+ * Copyright (c) 2014,  Mellanox Technologies. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+
+#if !defined(PEER_MEM_H)
+#define PEER_MEM_H
+
+#include <linux/module.h>
+#include <linux/init.h>
+#include <linux/slab.h>
+#include <linux/errno.h>
+#include <linux/export.h>
+#include <linux/scatterlist.h>
+
+#define IB_PEER_MEMORY_NAME_MAX 64
+#define IB_PEER_MEMORY_VER_MAX 16
+
+struct peer_memory_client {
+	char	name[IB_PEER_MEMORY_NAME_MAX];
+	char	version[IB_PEER_MEMORY_VER_MAX];
+	/* The peer-direct controller (IB CORE) uses this callback to detect whether a virtual
+	 * address is under the responsibility of a specific peer-direct client. If the answer
+	 * is positive, further calls for memory management will be directed to the callbacks
+	 * of this peer driver.
+	 * Any peer internal error should result in a zero answer; in that case, even if the
+	 * address range really belongs to the peer, no owner will be found and the
+	 * application will get an error from IB CORE as expected.
+	 * Parameters:
		addr                  [IN]  - virtual address to be checked for ownership.
		size                  [IN]  - size of memory area starting at addr.
		peer_mem_private_data [IN]  - The contents of ib_ucontext->peer_mem_private_data.
+					      This parameter allows usage of the peer-direct
+					      API in implementations where it is impossible
+					      to detect if the memory belongs to the device
+					      based upon the virtual address alone. In such
+					      cases, the peer device can create a special
+					      ib_ucontext, which will be associated with the
+					      relevant peer memory.
		peer_mem_name         [IN]  - The contents of ib_ucontext->peer_mem_name.
+					      Used to identify the peer memory client that
+					      initialized the ib_ucontext.
+					      This parameter is normally used along with
+					      peer_mem_private_data.
		client_context        [OUT] - peer opaque data which holds a peer context for
					      the acquired address range; it will be provided
					      back to the peer memory client in subsequent
					      calls for that given memory.
+
+	* Return value:
+	*	1 - virtual address belongs to the peer device, otherwise 0
+	*/
+	int (*acquire)(unsigned long addr, size_t size, void *peer_mem_private_data,
+		       char *peer_mem_name, void **client_context);
+	/* The peer memory client is expected to pin the physical pages of the given address
+	 * range and to fill the sg_table with the information of the physical pages
+	 * associated with that range. This function is equivalent to the kernel API
+	 * get_user_pages(), but targets peer memory.
+	 * Parameters:
+		addr           [IN] - start virtual address of that given allocation.
+		size           [IN] - size of memory area starting at addr.
		write          [IN] - indicates whether the pages will be written to by the caller.
				      Same meaning as in the kernel API get_user_pages; can be
				      ignored if not relevant.
		force          [IN] - indicates whether to force write access even if the user
				      mapping is read-only. Same meaning as in the kernel API
				      get_user_pages; can be ignored if not relevant.
+		sg_head        [IN/OUT] - pointer to head of struct sg_table.
+					  The peer client should allocate a table big
+					  enough to store all of the required entries. This
+					  function should fill the table with physical addresses
+					  and sizes of the memory segments composing this
+					  memory mapping.
+					  The table allocation can be done using sg_alloc_table.
+					  Filling in the physical memory addresses and size can
+					  be done using sg_set_page.
+		client_context [IN] - peer context for the given allocation, as received from
+				      the acquire call.
+		core_context   [IN] - opaque IB core context. If the peer client wishes to
+				      invalidate any of the pages pinned through this API,
+				      it must provide this context as an argument to the
+				      invalidate callback.
+
+	* Return value:
+	*	0 success, otherwise errno error code.
+	*/
+	int (*get_pages)(unsigned long addr,
+			 size_t size, int write, int force,
+			 struct sg_table *sg_head,
+			 void *client_context, void *core_context);
+	/* The peer-direct controller (IB CORE) calls this function to request that the peer
+	 * driver fill the sg_table with the DMA address mapping for the exposed peer memory.
+	 * The parameters provided match those needed for calling dma_map_sg.
+	 * Parameters:
+		sg_head        [IN/OUT] - pointer to head of struct sg_table. The peer memory
+					  should fill the dma_address & dma_length for
+					  each scatter gather entry in the table.
+		client_context [IN] - peer context for the allocation mapped.
+		dma_device     [IN] - the RDMA capable device which requires access to the
+				      peer memory.
+		dmasync        [IN] - flush in-flight DMA when the memory region is written.
+				      Same meaning as with host memory mapping, can be ignored if not relevant.
+		nmap           [OUT] - number of mapped/set entries.
+
+	* Return value:
+	*		0 success, otherwise errno error code.
+	*/
+	int (*dma_map)(struct sg_table *sg_head, void *client_context,
+		       struct device *dma_device, int dmasync, int *nmap);
+	/* This callback is the opposite of the dma_map API; it should take the relevant
+	 * actions to unmap the memory.
+	* Parameters:
		sg_head        [IN/OUT] - pointer to head of struct sg_table whose entries
					  were DMA mapped by the dma_map callback.
		client_context [IN] - peer context for the allocation mapped.
		dma_device     [IN] - the RDMA capable device which requires access to the
				      peer memory.
+
+	* Return value:
+	*	0 success, otherwise errno error code.
+	*/
+	int (*dma_unmap)(struct sg_table *sg_head, void *client_context,
+			 struct device  *dma_device);
+	/* This callback is the opposite of the get_pages API; it should remove the pinning
+	 * from the pages. It is the peer-direct equivalent of the kernel API put_page.
+	 * Parameters:
+		sg_head        [IN] - pointer to head of struct sg_table.
+		client_context [IN] - peer context for that given allocation.
+	*/
+	void (*put_pages)(struct sg_table *sg_head, void *client_context);
+	/* This callback returns the page size for the given allocation.
+	 * Parameters:
		client_context [IN] - peer context for that given allocation.
+	* Return value:
+	*	Page size in bytes.
+	*/
+	unsigned long (*get_page_size)(void *client_context);
+	/* This callback is the opposite of the acquire call; it lets the peer release all
+	 * resources associated with the acquired context. It will be called only for contexts
+	 * that have been successfully acquired (i.e. acquire returned a non-zero value).
+	 * Parameters:
+	 *	client_context [IN] - peer context for the given allocation.
+	*/
+	void (*release)(void *client_context);
+
+};
+
+typedef int (*invalidate_peer_memory)(void *reg_handle, void *core_context);
+
+void *ib_register_peer_memory_client(const struct peer_memory_client *peer_client,
+				     invalidate_peer_memory *invalidate_callback);
+void ib_unregister_peer_memory_client(void *reg_handle);
+
+#endif
-- 
1.7.1



* [PATCH for-next 2/9] IB/core: Get/put peer memory client
From: Yishai Hadas @ 2014-10-01 15:18 UTC
  To: roland-DgEjT+Ai2ygdnm+yROfE0A
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	raindel-VPRAkNaXOzVWk0Htik3J/w, yishaih-VPRAkNaXOzVWk0Htik3J/w

Supplies an API to get/put a peer memory client. It encapsulates the
details of how to acquire/release a peer client from its callers and
lets them get the required peer client if one exists.

The 'get' call iterates over the registered peer clients, looking for
an owner of the given address range by calling each peer's 'acquire'
call. Once an owner is found, the iteration stops.

The 'put' call does the opposite: it lets the peer release its
resources for that given address range.

A reference counting/completion mechanism is used to prevent a peer
memory client from going down once there are active users for its memory.

In addition:
- ib_ucontext was extended to enable peers to set their private
  context, obtain it via the 'acquire' call, and thereby recognize
  their memory. A given ucontext can be served only by the one peer it
  belongs to.
- an extra device capability named IB_DEVICE_PEER_MEMORY was
  introduced, to be used by low level drivers to mark that they
  support this functionality.
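
For illustration, the way a umem-building path might use this pair
(this is the pattern patch 3 applies inside ib_umem_get; peer_ctx is a
hypothetical void * holding the peer's client context):

	ib_peer_client = ib_get_peer_client(context, addr, size, &peer_ctx);
	if (ib_peer_client) {
		/* ... tunnel get_pages/dma_map through
		 * ib_peer_client->peer_mem ... */

		/* On teardown, release the peer context and drop the
		 * reference taken by the get call. */
		ib_put_peer_client(ib_peer_client, peer_ctx);
	}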

Signed-off-by: Yishai Hadas <yishaih-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
Signed-off-by: Shachar Raindel <raindel-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
---
 drivers/infiniband/core/peer_mem.c   |   50 ++++++++++++++++++++++++++++++++++
 drivers/infiniband/core/uverbs_cmd.c |    2 +
 include/rdma/ib_peer_mem.h           |   10 +++++++
 include/rdma/ib_verbs.h              |    5 +++-
 4 files changed, 66 insertions(+), 1 deletions(-)

diff --git a/drivers/infiniband/core/peer_mem.c b/drivers/infiniband/core/peer_mem.c
index ed2c9b1..3936e13 100644
--- a/drivers/infiniband/core/peer_mem.c
+++ b/drivers/infiniband/core/peer_mem.c
@@ -74,6 +74,14 @@ static int ib_memory_peer_check_mandatory(const struct peer_memory_client
 		return 0;
 }
 
+static void complete_peer(struct kref *kref)
+{
+	struct ib_peer_memory_client *ib_peer_client =
+		container_of(kref, struct ib_peer_memory_client, ref);
+
+	complete(&ib_peer_client->unload_comp);
+}
+
 void *ib_register_peer_memory_client(const struct peer_memory_client *peer_client,
 				     invalidate_peer_memory *invalidate_callback)
 {
@@ -86,6 +94,8 @@ void *ib_register_peer_memory_client(const struct peer_memory_client *peer_clien
 	if (!ib_peer_client)
 		return NULL;
 
+	init_completion(&ib_peer_client->unload_comp);
+	kref_init(&ib_peer_client->ref);
 	ib_peer_client->peer_mem = peer_client;
 	/* A peer that supplies a non-NULL callback pointer indicates that invalidation
 	 * support is required for any memory it owns.
@@ -98,6 +108,7 @@ void *ib_register_peer_memory_client(const struct peer_memory_client *peer_clien
 	list_add_tail(&ib_peer_client->core_peer_list, &peer_memory_list);
 	num_registered_peers++;
 	mutex_unlock(&peer_memory_mutex);
+
 	return ib_peer_client;
 }
 EXPORT_SYMBOL(ib_register_peer_memory_client);
@@ -111,6 +122,45 @@ void ib_unregister_peer_memory_client(void *reg_handle)
 	list_del(&ib_peer_client->core_peer_list);
 	num_registered_peers--;
 	mutex_unlock(&peer_memory_mutex);
+	kref_put(&ib_peer_client->ref, complete_peer);
+	wait_for_completion(&ib_peer_client->unload_comp);
 	kfree(ib_peer_client);
 }
 EXPORT_SYMBOL(ib_unregister_peer_memory_client);
+
+struct ib_peer_memory_client *ib_get_peer_client(struct ib_ucontext *context, unsigned long addr,
+						 size_t size, void **peer_client_context)
+{
+	struct ib_peer_memory_client *ib_peer_client;
+	int ret;
+
+	mutex_lock(&peer_memory_mutex);
+	list_for_each_entry(ib_peer_client, &peer_memory_list, core_peer_list) {
+		ret = ib_peer_client->peer_mem->acquire(addr, size,
+						   context->peer_mem_private_data,
+						   context->peer_mem_name,
+						   peer_client_context);
+		if (ret > 0)
+			goto found;
+	}
+
+	ib_peer_client = NULL;
+
+found:
+	if (ib_peer_client)
+		kref_get(&ib_peer_client->ref);
+
+	mutex_unlock(&peer_memory_mutex);
+	return ib_peer_client;
+}
+EXPORT_SYMBOL(ib_get_peer_client);
+
+void ib_put_peer_client(struct ib_peer_memory_client *ib_peer_client,
+			void *peer_client_context)
+{
+	if (ib_peer_client->peer_mem->release)
+		ib_peer_client->peer_mem->release(peer_client_context);
+
+	kref_put(&ib_peer_client->ref, complete_peer);
+}
+EXPORT_SYMBOL(ib_put_peer_client);
diff --git a/drivers/infiniband/core/uverbs_cmd.c b/drivers/infiniband/core/uverbs_cmd.c
index 0600c50..3f5d754 100644
--- a/drivers/infiniband/core/uverbs_cmd.c
+++ b/drivers/infiniband/core/uverbs_cmd.c
@@ -326,6 +326,8 @@ ssize_t ib_uverbs_get_context(struct ib_uverbs_file *file,
 	INIT_LIST_HEAD(&ucontext->xrcd_list);
 	INIT_LIST_HEAD(&ucontext->rule_list);
 	ucontext->closing = 0;
+	ucontext->peer_mem_private_data = NULL;
+	ucontext->peer_mem_name = NULL;
 
 	resp.num_comp_vectors = file->device->num_comp_vectors;
 
diff --git a/include/rdma/ib_peer_mem.h b/include/rdma/ib_peer_mem.h
index fac37b7..3353ae7 100644
--- a/include/rdma/ib_peer_mem.h
+++ b/include/rdma/ib_peer_mem.h
@@ -3,10 +3,20 @@
 
 #include <rdma/peer_mem.h>
 
+struct ib_ucontext;
+
 struct ib_peer_memory_client {
 	const struct peer_memory_client *peer_mem;
 	struct list_head	core_peer_list;
 	int invalidation_required;
+	struct kref ref;
+	struct completion unload_comp;
 };
 
+struct ib_peer_memory_client *ib_get_peer_client(struct ib_ucontext *context, unsigned long addr,
+						 size_t size, void **peer_client_context);
+
+void ib_put_peer_client(struct ib_peer_memory_client *ib_peer_client,
+			void *peer_client_context);
+
 #endif
diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index ed44cc0..685e0b9 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -123,7 +123,8 @@ enum ib_device_cap_flags {
 	IB_DEVICE_MEM_WINDOW_TYPE_2A	= (1<<23),
 	IB_DEVICE_MEM_WINDOW_TYPE_2B	= (1<<24),
 	IB_DEVICE_MANAGED_FLOW_STEERING = (1<<29),
-	IB_DEVICE_SIGNATURE_HANDOVER	= (1<<30)
+	IB_DEVICE_SIGNATURE_HANDOVER	= (1<<30),
+	IB_DEVICE_PEER_MEMORY		= (1<<31)
 };
 
 enum ib_signature_prot_cap {
@@ -1131,6 +1132,8 @@ struct ib_ucontext {
 	struct list_head	xrcd_list;
 	struct list_head	rule_list;
 	int			closing;
+	void		 *peer_mem_private_data;
+	char		 *peer_mem_name;
 };
 
 struct ib_uobject {
-- 
1.7.1



* [PATCH for-next 3/9] IB/core: Umem tunneling peer memory APIs
From: Yishai Hadas @ 2014-10-01 15:18 UTC
  To: roland-DgEjT+Ai2ygdnm+yROfE0A
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	raindel-VPRAkNaXOzVWk0Htik3J/w, yishaih-VPRAkNaXOzVWk0Htik3J/w

Builds the umem over the peer memory client functionality.
It tries to get a peer client for a given address range; if one is
found, further memory calls are tunneled to that peer client.
ib_umem_get was extended with an indication of whether this umem may
come from a peer client. As a result, all users of ib_umem_get were
updated accordingly.
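
For illustration, a driver path that cannot tunnel through a peer
client can detect a peer-owned umem and bail out; this is the pattern
the mlx4 rereg hunk below uses:

	if (mmr->umem->ib_peer_mem)
		return -ENOTSUPP;	/* peer memory not supported here */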

Signed-off-by: Yishai Hadas <yishaih-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
Signed-off-by: Shachar Raindel <raindel-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
---
 drivers/infiniband/core/umem.c               |   77 +++++++++++++++++++++++++-
 drivers/infiniband/hw/amso1100/c2_provider.c |    2 +-
 drivers/infiniband/hw/cxgb3/iwch_provider.c  |    2 +-
 drivers/infiniband/hw/cxgb4/mem.c            |    2 +-
 drivers/infiniband/hw/ehca/ehca_mrmw.c       |    2 +-
 drivers/infiniband/hw/ipath/ipath_mr.c       |    2 +-
 drivers/infiniband/hw/mlx4/cq.c              |    2 +-
 drivers/infiniband/hw/mlx4/doorbell.c        |    2 +-
 drivers/infiniband/hw/mlx4/mr.c              |   11 +++-
 drivers/infiniband/hw/mlx4/qp.c              |    2 +-
 drivers/infiniband/hw/mlx4/srq.c             |    2 +-
 drivers/infiniband/hw/mlx5/cq.c              |    5 +-
 drivers/infiniband/hw/mlx5/doorbell.c        |    2 +-
 drivers/infiniband/hw/mlx5/mr.c              |    2 +-
 drivers/infiniband/hw/mlx5/qp.c              |    2 +-
 drivers/infiniband/hw/mlx5/srq.c             |    2 +-
 drivers/infiniband/hw/mthca/mthca_provider.c |    2 +-
 drivers/infiniband/hw/nes/nes_verbs.c        |    2 +-
 drivers/infiniband/hw/ocrdma/ocrdma_verbs.c  |    2 +-
 drivers/infiniband/hw/qib/qib_mr.c           |    2 +-
 include/rdma/ib_peer_mem.h                   |    4 +
 include/rdma/ib_umem.h                       |   13 +++-
 22 files changed, 119 insertions(+), 25 deletions(-)

diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
index df0c4f6..0de9916 100644
--- a/drivers/infiniband/core/umem.c
+++ b/drivers/infiniband/core/umem.c
@@ -42,6 +42,66 @@
 
 #include "uverbs.h"
 
+static struct ib_umem *peer_umem_get(struct ib_peer_memory_client *ib_peer_mem,
+				     struct ib_umem *umem, unsigned long addr,
+				     int dmasync)
+{
+	int ret;
+	const struct peer_memory_client *peer_mem = ib_peer_mem->peer_mem;
+
+	umem->ib_peer_mem = ib_peer_mem;
+	/*
+	 * We always request write permissions to the pages, to force breaking of any CoW
+	 * during the registration of the MR. For read-only MRs we use the "force" flag to
+	 * indicate that CoW breaking is required but the registration should not fail if
+	 * referencing read-only areas.
+	 */
+	ret = peer_mem->get_pages(addr, umem->length,
+				  1, !umem->writable,
+				  &umem->sg_head,
+				  umem->peer_mem_client_context,
+				  NULL);
+	if (ret)
+		goto out;
+
+	umem->page_size = peer_mem->get_page_size(umem->peer_mem_client_context);
+	if (!umem->page_size) {
+		ret = -EINVAL;
+		goto put_pages;
+	}
+
+	umem->offset = addr & ((unsigned long)umem->page_size - 1);
+	ret = peer_mem->dma_map(&umem->sg_head,
+				umem->peer_mem_client_context,
+				umem->context->device->dma_device,
+				dmasync,
+				&umem->nmap);
+	if (ret)
+		goto put_pages;
+
+	return umem;
+
+put_pages:
+	peer_mem->put_pages(&umem->sg_head,
+			    umem->peer_mem_client_context);
+out:
+	ib_put_peer_client(ib_peer_mem, umem->peer_mem_client_context);
+	kfree(umem);
+	return ERR_PTR(ret);
+}
+
+static void peer_umem_release(struct ib_umem *umem)
+{
+	const struct peer_memory_client *peer_mem =
+				umem->ib_peer_mem->peer_mem;
+
+	peer_mem->dma_unmap(&umem->sg_head,
+			    umem->peer_mem_client_context,
+			    umem->context->device->dma_device);
+	peer_mem->put_pages(&umem->sg_head,
+			    umem->peer_mem_client_context);
+	ib_put_peer_client(umem->ib_peer_mem, umem->peer_mem_client_context);
+	kfree(umem);
+}
 
 static void __ib_umem_release(struct ib_device *dev, struct ib_umem *umem, int dirty)
 {
@@ -74,9 +134,11 @@ static void __ib_umem_release(struct ib_device *dev, struct ib_umem *umem, int d
  * @size: length of region to pin
  * @access: IB_ACCESS_xxx flags for memory being pinned
  * @dmasync: flush in-flight DMA when the memory region is written
+ * @peer_mem_flags: IB_PEER_MEM_xxx flags for memory being used
  */
 struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr,
-			    size_t size, int access, int dmasync)
+			    size_t size, int access, int dmasync,
+			    unsigned long peer_mem_flags)
 {
 	struct ib_umem *umem;
 	struct page **page_list;
@@ -114,6 +176,15 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr,
 	 * "MW bind" can change permissions by binding a window.
 	 */
 	umem->writable  = !!(access & ~IB_ACCESS_REMOTE_READ);
+	if (peer_mem_flags & IB_PEER_MEM_ALLOW) {
+		struct ib_peer_memory_client *peer_mem_client;
+
+		peer_mem_client = ib_get_peer_client(context, addr, size,
+						     &umem->peer_mem_client_context);
+		if (peer_mem_client)
+			return peer_umem_get(peer_mem_client, umem, addr,
+					     dmasync);
+	}
 
 	/* We assume the memory is from hugetlb until proved otherwise */
 	umem->hugetlb   = 1;
@@ -234,6 +305,10 @@ void ib_umem_release(struct ib_umem *umem)
 	struct mm_struct *mm;
 	struct task_struct *task;
 	unsigned long diff;
+	if (umem->ib_peer_mem) {
+		peer_umem_release(umem);
+		return;
+	}
 
 	__ib_umem_release(umem->context->device, umem, 1);
 
diff --git a/drivers/infiniband/hw/amso1100/c2_provider.c b/drivers/infiniband/hw/amso1100/c2_provider.c
index 2d5cbf4..e88d222 100644
--- a/drivers/infiniband/hw/amso1100/c2_provider.c
+++ b/drivers/infiniband/hw/amso1100/c2_provider.c
@@ -444,7 +444,7 @@ static struct ib_mr *c2_reg_user_mr(struct ib_pd *pd, u64 start, u64 length,
 		return ERR_PTR(-ENOMEM);
 	c2mr->pd = c2pd;
 
-	c2mr->umem = ib_umem_get(pd->uobject->context, start, length, acc, 0);
+	c2mr->umem = ib_umem_get(pd->uobject->context, start, length, acc, 0, 0);
 	if (IS_ERR(c2mr->umem)) {
 		err = PTR_ERR(c2mr->umem);
 		kfree(c2mr);
diff --git a/drivers/infiniband/hw/cxgb3/iwch_provider.c b/drivers/infiniband/hw/cxgb3/iwch_provider.c
index 811b24a..aa9c142 100644
--- a/drivers/infiniband/hw/cxgb3/iwch_provider.c
+++ b/drivers/infiniband/hw/cxgb3/iwch_provider.c
@@ -635,7 +635,7 @@ static struct ib_mr *iwch_reg_user_mr(struct ib_pd *pd, u64 start, u64 length,
 
 	mhp->rhp = rhp;
 
-	mhp->umem = ib_umem_get(pd->uobject->context, start, length, acc, 0);
+	mhp->umem = ib_umem_get(pd->uobject->context, start, length, acc, 0, 0);
 	if (IS_ERR(mhp->umem)) {
 		err = PTR_ERR(mhp->umem);
 		kfree(mhp);
diff --git a/drivers/infiniband/hw/cxgb4/mem.c b/drivers/infiniband/hw/cxgb4/mem.c
index ec7a298..506ddd2 100644
--- a/drivers/infiniband/hw/cxgb4/mem.c
+++ b/drivers/infiniband/hw/cxgb4/mem.c
@@ -705,7 +705,7 @@ struct ib_mr *c4iw_reg_user_mr(struct ib_pd *pd, u64 start, u64 length,
 
 	mhp->rhp = rhp;
 
-	mhp->umem = ib_umem_get(pd->uobject->context, start, length, acc, 0);
+	mhp->umem = ib_umem_get(pd->uobject->context, start, length, acc, 0, 0);
 	if (IS_ERR(mhp->umem)) {
 		err = PTR_ERR(mhp->umem);
 		kfree(mhp);
diff --git a/drivers/infiniband/hw/ehca/ehca_mrmw.c b/drivers/infiniband/hw/ehca/ehca_mrmw.c
index 3488e8c..d5bbbc0 100644
--- a/drivers/infiniband/hw/ehca/ehca_mrmw.c
+++ b/drivers/infiniband/hw/ehca/ehca_mrmw.c
@@ -359,7 +359,7 @@ struct ib_mr *ehca_reg_user_mr(struct ib_pd *pd, u64 start, u64 length,
 	}
 
 	e_mr->umem = ib_umem_get(pd->uobject->context, start, length,
-				 mr_access_flags, 0);
+				 mr_access_flags, 0, 0);
 	if (IS_ERR(e_mr->umem)) {
 		ib_mr = (void *)e_mr->umem;
 		goto reg_user_mr_exit1;
diff --git a/drivers/infiniband/hw/ipath/ipath_mr.c b/drivers/infiniband/hw/ipath/ipath_mr.c
index 5e61e9b..d6641be 100644
--- a/drivers/infiniband/hw/ipath/ipath_mr.c
+++ b/drivers/infiniband/hw/ipath/ipath_mr.c
@@ -198,7 +198,7 @@ struct ib_mr *ipath_reg_user_mr(struct ib_pd *pd, u64 start, u64 length,
 	}
 
 	umem = ib_umem_get(pd->uobject->context, start, length,
-			   mr_access_flags, 0);
+			   mr_access_flags, 0, 0);
 	if (IS_ERR(umem))
 		return (void *) umem;
 
diff --git a/drivers/infiniband/hw/mlx4/cq.c b/drivers/infiniband/hw/mlx4/cq.c
index 1066eec..23aaf77 100644
--- a/drivers/infiniband/hw/mlx4/cq.c
+++ b/drivers/infiniband/hw/mlx4/cq.c
@@ -142,7 +142,7 @@ static int mlx4_ib_get_cq_umem(struct mlx4_ib_dev *dev, struct ib_ucontext *cont
 	int cqe_size = dev->dev->caps.cqe_size;
 
 	*umem = ib_umem_get(context, buf_addr, cqe * cqe_size,
-			    IB_ACCESS_LOCAL_WRITE, 1);
+			    IB_ACCESS_LOCAL_WRITE, 1, IB_PEER_MEM_ALLOW);
 	if (IS_ERR(*umem))
 		return PTR_ERR(*umem);
 
diff --git a/drivers/infiniband/hw/mlx4/doorbell.c b/drivers/infiniband/hw/mlx4/doorbell.c
index c517409..71e7b66 100644
--- a/drivers/infiniband/hw/mlx4/doorbell.c
+++ b/drivers/infiniband/hw/mlx4/doorbell.c
@@ -62,7 +62,7 @@ int mlx4_ib_db_map_user(struct mlx4_ib_ucontext *context, unsigned long virt,
 	page->user_virt = (virt & PAGE_MASK);
 	page->refcnt    = 0;
 	page->umem      = ib_umem_get(&context->ibucontext, virt & PAGE_MASK,
-				      PAGE_SIZE, 0, 0);
+				      PAGE_SIZE, 0, 0, IB_PEER_MEM_ALLOW);
 	if (IS_ERR(page->umem)) {
 		err = PTR_ERR(page->umem);
 		kfree(page);
diff --git a/drivers/infiniband/hw/mlx4/mr.c b/drivers/infiniband/hw/mlx4/mr.c
index 8f9325c..ad4cdfd 100644
--- a/drivers/infiniband/hw/mlx4/mr.c
+++ b/drivers/infiniband/hw/mlx4/mr.c
@@ -147,7 +147,8 @@ struct ib_mr *mlx4_ib_reg_user_mr(struct ib_pd *pd, u64 start, u64 length,
 	/* Force registering the memory as writable. */
 	/* Used for memory re-registeration. HCA protects the access */
 	mr->umem = ib_umem_get(pd->uobject->context, start, length,
-			       access_flags | IB_ACCESS_LOCAL_WRITE, 0);
+			       access_flags | IB_ACCESS_LOCAL_WRITE, 0,
+			       IB_PEER_MEM_ALLOW);
 	if (IS_ERR(mr->umem)) {
 		err = PTR_ERR(mr->umem);
 		goto err_free;
@@ -226,12 +227,18 @@ int mlx4_ib_rereg_user_mr(struct ib_mr *mr, int flags,
 		int err;
 		int n;
 
+		/* Peer memory isn't supported */
+		if (mmr->umem->ib_peer_mem) {
+			err = -ENOTSUPP;
+			goto release_mpt_entry;
+		}
+
 		mlx4_mr_rereg_mem_cleanup(dev->dev, &mmr->mmr);
 		ib_umem_release(mmr->umem);
 		mmr->umem = ib_umem_get(mr->uobject->context, start, length,
 					mr_access_flags |
 					IB_ACCESS_LOCAL_WRITE,
-					0);
+					0, 0);
 		if (IS_ERR(mmr->umem)) {
 			err = PTR_ERR(mmr->umem);
 			/* Prevent mlx4_ib_dereg_mr from free'ing invalid pointer */
diff --git a/drivers/infiniband/hw/mlx4/qp.c b/drivers/infiniband/hw/mlx4/qp.c
index 577b477..15d6430 100644
--- a/drivers/infiniband/hw/mlx4/qp.c
+++ b/drivers/infiniband/hw/mlx4/qp.c
@@ -721,7 +721,7 @@ static int create_qp_common(struct mlx4_ib_dev *dev, struct ib_pd *pd,
 			goto err;
 
 		qp->umem = ib_umem_get(pd->uobject->context, ucmd.buf_addr,
-				       qp->buf_size, 0, 0);
+				       qp->buf_size, 0, 0, IB_PEER_MEM_ALLOW);
 		if (IS_ERR(qp->umem)) {
 			err = PTR_ERR(qp->umem);
 			goto err;
diff --git a/drivers/infiniband/hw/mlx4/srq.c b/drivers/infiniband/hw/mlx4/srq.c
index 62d9285..e05c772 100644
--- a/drivers/infiniband/hw/mlx4/srq.c
+++ b/drivers/infiniband/hw/mlx4/srq.c
@@ -114,7 +114,7 @@ struct ib_srq *mlx4_ib_create_srq(struct ib_pd *pd,
 		}
 
 		srq->umem = ib_umem_get(pd->uobject->context, ucmd.buf_addr,
-					buf_size, 0, 0);
+					buf_size, 0, 0, IB_PEER_MEM_ALLOW);
 		if (IS_ERR(srq->umem)) {
 			err = PTR_ERR(srq->umem);
 			goto err_srq;
diff --git a/drivers/infiniband/hw/mlx5/cq.c b/drivers/infiniband/hw/mlx5/cq.c
index e405627..a968a54 100644
--- a/drivers/infiniband/hw/mlx5/cq.c
+++ b/drivers/infiniband/hw/mlx5/cq.c
@@ -628,7 +628,8 @@ static int create_cq_user(struct mlx5_ib_dev *dev, struct ib_udata *udata,
 
 	cq->buf.umem = ib_umem_get(context, ucmd.buf_addr,
 				   entries * ucmd.cqe_size,
-				   IB_ACCESS_LOCAL_WRITE, 1);
+				   IB_ACCESS_LOCAL_WRITE, 1,
+				   IB_PEER_MEM_ALLOW);
 	if (IS_ERR(cq->buf.umem)) {
 		err = PTR_ERR(cq->buf.umem);
 		return err;
@@ -958,7 +959,7 @@ static int resize_user(struct mlx5_ib_dev *dev, struct mlx5_ib_cq *cq,
 		return -EINVAL;
 
 	umem = ib_umem_get(context, ucmd.buf_addr, entries * ucmd.cqe_size,
-			   IB_ACCESS_LOCAL_WRITE, 1);
+			   IB_ACCESS_LOCAL_WRITE, 1, IB_PEER_MEM_ALLOW);
 	if (IS_ERR(umem)) {
 		err = PTR_ERR(umem);
 		return err;
diff --git a/drivers/infiniband/hw/mlx5/doorbell.c b/drivers/infiniband/hw/mlx5/doorbell.c
index ece028f..5d7f427 100644
--- a/drivers/infiniband/hw/mlx5/doorbell.c
+++ b/drivers/infiniband/hw/mlx5/doorbell.c
@@ -64,7 +64,7 @@ int mlx5_ib_db_map_user(struct mlx5_ib_ucontext *context, unsigned long virt,
 	page->user_virt = (virt & PAGE_MASK);
 	page->refcnt    = 0;
 	page->umem      = ib_umem_get(&context->ibucontext, virt & PAGE_MASK,
-				      PAGE_SIZE, 0, 0);
+				      PAGE_SIZE, 0, 0, IB_PEER_MEM_ALLOW);
 	if (IS_ERR(page->umem)) {
 		err = PTR_ERR(page->umem);
 		kfree(page);
diff --git a/drivers/infiniband/hw/mlx5/mr.c b/drivers/infiniband/hw/mlx5/mr.c
index 80b3c63..55c6649 100644
--- a/drivers/infiniband/hw/mlx5/mr.c
+++ b/drivers/infiniband/hw/mlx5/mr.c
@@ -884,7 +884,7 @@ struct ib_mr *mlx5_ib_reg_user_mr(struct ib_pd *pd, u64 start, u64 length,
 	mlx5_ib_dbg(dev, "start 0x%llx, virt_addr 0x%llx, length 0x%llx\n",
 		    start, virt_addr, length);
 	umem = ib_umem_get(pd->uobject->context, start, length, access_flags,
-			   0);
+			   0, IB_PEER_MEM_ALLOW);
 	if (IS_ERR(umem)) {
 		mlx5_ib_dbg(dev, "umem get failed\n");
 		return (void *)umem;
diff --git a/drivers/infiniband/hw/mlx5/qp.c b/drivers/infiniband/hw/mlx5/qp.c
index 8c574b6..d6856c6 100644
--- a/drivers/infiniband/hw/mlx5/qp.c
+++ b/drivers/infiniband/hw/mlx5/qp.c
@@ -584,7 +584,7 @@ static int create_user_qp(struct mlx5_ib_dev *dev, struct ib_pd *pd,
 
 	if (ucmd.buf_addr && qp->buf_size) {
 		qp->umem = ib_umem_get(pd->uobject->context, ucmd.buf_addr,
-				       qp->buf_size, 0, 0);
+				       qp->buf_size, 0, 0, IB_PEER_MEM_ALLOW);
 		if (IS_ERR(qp->umem)) {
 			mlx5_ib_dbg(dev, "umem_get failed\n");
 			err = PTR_ERR(qp->umem);
diff --git a/drivers/infiniband/hw/mlx5/srq.c b/drivers/infiniband/hw/mlx5/srq.c
index 70bd131..4bca523 100644
--- a/drivers/infiniband/hw/mlx5/srq.c
+++ b/drivers/infiniband/hw/mlx5/srq.c
@@ -103,7 +103,7 @@ static int create_srq_user(struct ib_pd *pd, struct mlx5_ib_srq *srq,
 	srq->wq_sig = !!(ucmd.flags & MLX5_SRQ_FLAG_SIGNATURE);
 
 	srq->umem = ib_umem_get(pd->uobject->context, ucmd.buf_addr, buf_size,
-				0, 0);
+				0, 0, IB_PEER_MEM_ALLOW);
 	if (IS_ERR(srq->umem)) {
 		mlx5_ib_dbg(dev, "failed umem get, size %d\n", buf_size);
 		err = PTR_ERR(srq->umem);
diff --git a/drivers/infiniband/hw/mthca/mthca_provider.c b/drivers/infiniband/hw/mthca/mthca_provider.c
index 415f8e1..599ee1f 100644
--- a/drivers/infiniband/hw/mthca/mthca_provider.c
+++ b/drivers/infiniband/hw/mthca/mthca_provider.c
@@ -1002,7 +1002,7 @@ static struct ib_mr *mthca_reg_user_mr(struct ib_pd *pd, u64 start, u64 length,
 		return ERR_PTR(-ENOMEM);
 
 	mr->umem = ib_umem_get(pd->uobject->context, start, length, acc,
-			       ucmd.mr_attrs & MTHCA_MR_DMASYNC);
+			       ucmd.mr_attrs & MTHCA_MR_DMASYNC, 0);
 
 	if (IS_ERR(mr->umem)) {
 		err = PTR_ERR(mr->umem);
diff --git a/drivers/infiniband/hw/nes/nes_verbs.c b/drivers/infiniband/hw/nes/nes_verbs.c
index fef067c..5b70588 100644
--- a/drivers/infiniband/hw/nes/nes_verbs.c
+++ b/drivers/infiniband/hw/nes/nes_verbs.c
@@ -2333,7 +2333,7 @@ static struct ib_mr *nes_reg_user_mr(struct ib_pd *pd, u64 start, u64 length,
 	u8 stag_key;
 	int first_page = 1;
 
-	region = ib_umem_get(pd->uobject->context, start, length, acc, 0);
+	region = ib_umem_get(pd->uobject->context, start, length, acc, 0, 0);
 	if (IS_ERR(region)) {
 		return (struct ib_mr *)region;
 	}
diff --git a/drivers/infiniband/hw/ocrdma/ocrdma_verbs.c b/drivers/infiniband/hw/ocrdma/ocrdma_verbs.c
index 8f5f257..a90c88b 100644
--- a/drivers/infiniband/hw/ocrdma/ocrdma_verbs.c
+++ b/drivers/infiniband/hw/ocrdma/ocrdma_verbs.c
@@ -794,7 +794,7 @@ struct ib_mr *ocrdma_reg_user_mr(struct ib_pd *ibpd, u64 start, u64 len,
 	mr = kzalloc(sizeof(*mr), GFP_KERNEL);
 	if (!mr)
 		return ERR_PTR(status);
-	mr->umem = ib_umem_get(ibpd->uobject->context, start, len, acc, 0);
+	mr->umem = ib_umem_get(ibpd->uobject->context, start, len, acc, 0, 0);
 	if (IS_ERR(mr->umem)) {
 		status = -EFAULT;
 		goto umem_err;
diff --git a/drivers/infiniband/hw/qib/qib_mr.c b/drivers/infiniband/hw/qib/qib_mr.c
index 9bbb553..aadce11 100644
--- a/drivers/infiniband/hw/qib/qib_mr.c
+++ b/drivers/infiniband/hw/qib/qib_mr.c
@@ -242,7 +242,7 @@ struct ib_mr *qib_reg_user_mr(struct ib_pd *pd, u64 start, u64 length,
 	}
 
 	umem = ib_umem_get(pd->uobject->context, start, length,
-			   mr_access_flags, 0);
+			   mr_access_flags, 0, 0);
 	if (IS_ERR(umem))
 		return (void *) umem;
 
diff --git a/include/rdma/ib_peer_mem.h b/include/rdma/ib_peer_mem.h
index 3353ae7..98056c5 100644
--- a/include/rdma/ib_peer_mem.h
+++ b/include/rdma/ib_peer_mem.h
@@ -13,6 +13,10 @@ struct ib_peer_memory_client {
 	struct completion unload_comp;
 };
 
+enum ib_peer_mem_flags {
+	IB_PEER_MEM_ALLOW	= 1,
+};
+
 struct ib_peer_memory_client *ib_get_peer_client(struct ib_ucontext *context, unsigned long addr,
 						 size_t size, void **peer_client_context);
 
diff --git a/include/rdma/ib_umem.h b/include/rdma/ib_umem.h
index a2bf41e..a22dde0 100644
--- a/include/rdma/ib_umem.h
+++ b/include/rdma/ib_umem.h
@@ -36,6 +36,7 @@
 #include <linux/list.h>
 #include <linux/scatterlist.h>
 #include <linux/workqueue.h>
+#include <rdma/ib_peer_mem.h>
 
 struct ib_ucontext;
 
@@ -53,12 +54,17 @@ struct ib_umem {
 	struct sg_table sg_head;
 	int             nmap;
 	int             npages;
+	/* peer memory that manages this umem */
+	struct ib_peer_memory_client *ib_peer_mem;
+	/* peer memory private context */
+	void *peer_mem_client_context;
 };
 
 #ifdef CONFIG_INFINIBAND_USER_MEM
 
 struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr,
-			    size_t size, int access, int dmasync);
+			    size_t size, int access, int dmasync,
+			    unsigned long peer_mem_flags);
 void ib_umem_release(struct ib_umem *umem);
 int ib_umem_page_count(struct ib_umem *umem);
 
@@ -67,8 +73,9 @@ int ib_umem_page_count(struct ib_umem *umem);
 #include <linux/err.h>
 
 static inline struct ib_umem *ib_umem_get(struct ib_ucontext *context,
-					  unsigned long addr, size_t size,
-					  int access, int dmasync) {
+					  unsigned long addr, size_t size,
+					  int access, int dmasync,
+					  unsigned long peer_mem_flags) {
 	return ERR_PTR(-EINVAL);
 }
 static inline void ib_umem_release(struct ib_umem *umem) { }
-- 
1.7.1



* [PATCH for-next 4/9] IB/core: Infrastructure to manage peer core context
From: Yishai Hadas @ 2014-10-01 15:18 UTC
  To: roland-DgEjT+Ai2ygdnm+yROfE0A
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	raindel-VPRAkNaXOzVWk0Htik3J/w, yishaih-VPRAkNaXOzVWk0Htik3J/w

Adds an infrastructure to manage a core context for a given umem;
it is needed for the invalidation flow.

The core context is supplied to peer clients as opaque data for a
given set of memory pages represented by a umem.

If the peer client needs to invalidate memory it provided through the
peer memory callbacks, it should call the invalidation callback,
supplying the relevant core context. The IB core will use this context
to invalidate the relevant memory.

To prevent cases where invalidation calls are in flight in parallel
with the release of this memory (e.g. by dereg_mr), we must ensure
that the context is valid before accessing it; that is why the core
context pointer could not be used directly. For that reason, we added
a lookup table that maps a ticket id to a core context. The peer
client gets/supplies the ticket id; the core checks whether it exists
before accessing its corresponding context.
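
For reference, the lookup entry managed below ties a ticket id to its
core context; this is a sketch matching the fields the patch uses (the
actual definition lands in ib_peer_mem.h):

	struct core_ticket {
		unsigned long key;	/* ticket id handed out to the peer */
		void *context;		/* the core context it stands for */
		struct list_head ticket_list;	/* entry in core_ticket_list */
	};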

Signed-off-by: Yishai Hadas <yishaih-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
Signed-off-by: Shachar Raindel <raindel-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
---
 drivers/infiniband/core/peer_mem.c |  132 ++++++++++++++++++++++++++++++++++++
 include/rdma/ib_peer_mem.h         |   19 +++++
 include/rdma/ib_umem.h             |    6 ++
 3 files changed, 157 insertions(+), 0 deletions(-)

diff --git a/drivers/infiniband/core/peer_mem.c b/drivers/infiniband/core/peer_mem.c
index 3936e13..ad10672 100644
--- a/drivers/infiniband/core/peer_mem.c
+++ b/drivers/infiniband/core/peer_mem.c
@@ -44,6 +44,136 @@ static int ib_invalidate_peer_memory(void *reg_handle, void *core_context)
 	return -ENOSYS;
 }
 
+static int peer_ticket_exists(struct ib_peer_memory_client *ib_peer_client,
+			      unsigned long ticket)
+{
+	struct core_ticket *core_ticket;
+
+	list_for_each_entry(core_ticket, &ib_peer_client->core_ticket_list,
+			    ticket_list) {
+		if (core_ticket->key == ticket)
+			return 1;
+	}
+
+	return 0;
+}
+
+static int peer_get_free_ticket(struct ib_peer_memory_client *ib_peer_client,
+				unsigned long *new_ticket)
+{
+	unsigned long candidate_ticket = ib_peer_client->last_ticket + 1;
+	static int max_retries = 1000;
+	int i;
+
+	for (i = 0; i < max_retries; i++) {
+		if (peer_ticket_exists(ib_peer_client, candidate_ticket)) {
+			candidate_ticket++;
+			continue;
+		}
+		*new_ticket = candidate_ticket;
+		return 0;
+	}
+
+	return -EINVAL;
+}
+
+static int ib_peer_insert_context(struct ib_peer_memory_client *ib_peer_client,
+				  void *context,
+				  unsigned long *context_ticket)
+{
+	struct core_ticket *core_ticket = kzalloc(sizeof(*core_ticket), GFP_KERNEL);
+	int ret;
+
+	if (!core_ticket)
+		return -ENOMEM;
+
+	mutex_lock(&ib_peer_client->lock);
+	if (ib_peer_client->last_ticket < ib_peer_client->last_ticket + 1 &&
+	    !ib_peer_client->ticket_wrapped) {
+		core_ticket->key = ib_peer_client->last_ticket++;
+	} else {
+		/* special/rare case when wrap around occurred, not expected on 64 bit machines */
+		unsigned long new_ticket;
+
+		ib_peer_client->ticket_wrapped = 1;
+		ret = peer_get_free_ticket(ib_peer_client, &new_ticket);
+		if (ret) {
+			mutex_unlock(&ib_peer_client->lock);
+			kfree(core_ticket);
+			return ret;
+		}
+		ib_peer_client->last_ticket = new_ticket;
+		core_ticket->key = ib_peer_client->last_ticket;
+	}
+	core_ticket->context = context;
+	list_add_tail(&core_ticket->ticket_list,
+		      &ib_peer_client->core_ticket_list);
+	*context_ticket = core_ticket->key;
+	mutex_unlock(&ib_peer_client->lock);
+	return 0;
+}
+
+/* Caller must hold the peer client lock, ib_peer_client->lock */
+static int ib_peer_remove_context(struct ib_peer_memory_client *ib_peer_client,
+				  unsigned long key)
+{
+	struct core_ticket *core_ticket;
+
+	list_for_each_entry(core_ticket, &ib_peer_client->core_ticket_list,
+			    ticket_list) {
+		if (core_ticket->key == key) {
+			list_del(&core_ticket->ticket_list);
+			kfree(core_ticket);
+			return 0;
+		}
+	}
+
+	return 1;
+}
+
+/**
+ * ib_peer_create_invalidation_ctx - creates an invalidation context for a given umem
+ * @ib_peer_mem: peer client to be used
+ * @umem: umem struct that belongs to that context
+ * @invalidation_ctx: output context
+ */
+int ib_peer_create_invalidation_ctx(struct ib_peer_memory_client *ib_peer_mem, struct ib_umem *umem,
+				    struct invalidation_ctx **invalidation_ctx)
+{
+	int ret;
+	struct invalidation_ctx *ctx;
+
+	ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);
+	if (!ctx)
+		return -ENOMEM;
+
+	ret = ib_peer_insert_context(ib_peer_mem, ctx,
+				     &ctx->context_ticket);
+	if (ret) {
+		kfree(ctx);
+		return ret;
+	}
+
+	ctx->umem = umem;
+	umem->invalidation_ctx = ctx;
+	*invalidation_ctx = ctx;
+	return 0;
+}
+
+/**
+ * ib_peer_destroy_invalidation_ctx - destroys a given invalidation context
+ * @ib_peer_mem: peer client to be used
+ * @invalidation_ctx: context to be destroyed
+ */
+void ib_peer_destroy_invalidation_ctx(struct ib_peer_memory_client *ib_peer_mem,
+				      struct invalidation_ctx *invalidation_ctx)
+{
+	mutex_lock(&ib_peer_mem->lock);
+	ib_peer_remove_context(ib_peer_mem, invalidation_ctx->context_ticket);
+	mutex_unlock(&ib_peer_mem->lock);
+
+	kfree(invalidation_ctx);
+}
 static int ib_memory_peer_check_mandatory(const struct peer_memory_client
 						     *peer_client)
 {
@@ -94,6 +224,8 @@ void *ib_register_peer_memory_client(const struct peer_memory_client *peer_clien
 	if (!ib_peer_client)
 		return NULL;
 
+	INIT_LIST_HEAD(&ib_peer_client->core_ticket_list);
+	mutex_init(&ib_peer_client->lock);
 	init_completion(&ib_peer_client->unload_comp);
 	kref_init(&ib_peer_client->ref);
 	ib_peer_client->peer_mem = peer_client;
diff --git a/include/rdma/ib_peer_mem.h b/include/rdma/ib_peer_mem.h
index 98056c5..d3fbb50 100644
--- a/include/rdma/ib_peer_mem.h
+++ b/include/rdma/ib_peer_mem.h
@@ -4,6 +4,8 @@
 #include <rdma/peer_mem.h>
 
 struct ib_ucontext;
+struct ib_umem;
+struct invalidation_ctx;
 
 struct ib_peer_memory_client {
 	const struct peer_memory_client *peer_mem;
@@ -11,16 +13,33 @@ struct ib_peer_memory_client {
 	int invalidation_required;
 	struct kref ref;
 	struct completion unload_comp;
+	/* lock protects core_ticket_list and is used by the invalidation flow */
+	struct mutex lock;
+	struct list_head   core_ticket_list;
+	unsigned long       last_ticket;
+	int ticket_wrapped;
 };
 
 enum ib_peer_mem_flags {
 	IB_PEER_MEM_ALLOW	= 1,
 };
 
+struct core_ticket {
+	unsigned long key;
+	void *context;
+	struct list_head   ticket_list;
+};
+
 struct ib_peer_memory_client *ib_get_peer_client(struct ib_ucontext *context, unsigned long addr,
 						 size_t size, void **peer_client_context);
 
 void ib_put_peer_client(struct ib_peer_memory_client *ib_peer_client,
 			void *peer_client_context);
 
+int ib_peer_create_invalidation_ctx(struct ib_peer_memory_client *ib_peer_mem, struct ib_umem *umem,
+				    struct invalidation_ctx **invalidation_ctx);
+
+void ib_peer_destroy_invalidation_ctx(struct ib_peer_memory_client *ib_peer_mem,
+				      struct invalidation_ctx *invalidation_ctx);
+
 #endif
diff --git a/include/rdma/ib_umem.h b/include/rdma/ib_umem.h
index a22dde0..4b8a042 100644
--- a/include/rdma/ib_umem.h
+++ b/include/rdma/ib_umem.h
@@ -40,6 +40,11 @@
 
 struct ib_ucontext;
 
+struct invalidation_ctx {
+	struct ib_umem *umem;
+	unsigned long context_ticket;
+};
+
 struct ib_umem {
 	struct ib_ucontext     *context;
 	size_t			length;
@@ -56,6 +61,7 @@ struct ib_umem {
 	int             npages;
 	/* peer memory that manages this umem */
 	struct ib_peer_memory_client *ib_peer_mem;
+	struct invalidation_ctx *invalidation_ctx;
 	/* peer memory private context */
 	void *peer_mem_client_context;
 };
-- 
1.7.1

* [PATCH for-next 5/9] IB/core: Invalidation support for peer memory
       [not found] ` <1412176717-11979-1-git-send-email-yishaih-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
                     ` (3 preceding siblings ...)
  2014-10-01 15:18   ` [PATCH for-next 4/9] IB/core: Infrastructure to manage peer core context Yishai Hadas
@ 2014-10-01 15:18   ` Yishai Hadas
       [not found]     ` <1412176717-11979-6-git-send-email-yishaih-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
  2014-10-01 15:18   ` [PATCH for-next 6/9] IB/core: Sysfs " Yishai Hadas
                     ` (3 subsequent siblings)
  8 siblings, 1 reply; 22+ messages in thread
From: Yishai Hadas @ 2014-10-01 15:18 UTC (permalink / raw)
  To: roland-DgEjT+Ai2ygdnm+yROfE0A
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	raindel-VPRAkNaXOzVWk0Htik3J/w, yishaih-VPRAkNaXOzVWk0Htik3J/w

Adds the required functionality to invalidate a given peer
memory represented by some core context.

Each umem that was built over peer memory and supports invalidation
has an invalidation context assigned to it, holding the data required
to manage it. Once the peer calls the invalidation callback, the
following actions are taken:

1) Take the lock on the peer client, to synchronize with an in-flight
dereg_mr on that memory.
2) Once the lock is taken, look up the ticket id to find the matching
core context.
3) If found, call the umem invalidation function; otherwise, return.

Some notes:
1) As the peer invalidate callback is defined to be blocking, it must
return only once the pages are not going to be accessed any more. For
that reason ib_invalidate_peer_memory waits for a completion event in
case another in-flight call is coming as part of dereg_mr (a sketch of
this flow follows these notes).

2) The peer memory API assumes that a lock might be taken by a peer
client to protect its memory operations. Specifically, its invalidate
callback might be called under that lock, which may lead to an AB/BA
deadlock if the IB core called the get/put pages APIs with the IB core
peer's lock taken. For that reason, as part of
ib_umem_activate_invalidation_notifier the lock is taken and an
in-flight invalidation state is checked for before activating the
notifier.

3) Once a peer client admits as part of its registration that it may
require invalidation support, it can't be the owner of a memory range
which doesn't support it.
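
To illustrate note 1, here is a minimal, hypothetical sketch of how a
low-level driver can synchronize its dereg_mr path with the blocking
invalidate callback it registered via
ib_umem_activate_invalidation_notifier(). my_mr and my_hw_release() are
made-up names; the inflight_invalidation/peer_callback fields are the
ones added to struct invalidation_ctx by this patch, and the real
implementations appear in the mlx4/mlx5 patches of this series:

#include <linux/atomic.h>
#include <linux/completion.h>
#include <linux/slab.h>
#include <rdma/ib_umem.h>

struct my_mr {
	struct ib_umem *umem;
	atomic_t invalidated;	/* 0 = live, >0 = teardown has started */
	struct completion comp;	/* init_completion() done at reg time */
};

static void my_hw_release(struct my_mr *mr); /* hypothetical HW teardown */

static void my_invalidate_umem(void *cookie, struct ib_umem *umem,
			       unsigned long addr, size_t size)
{
	struct my_mr *mr = cookie;

	/* Only the first of {callback, dereg_mr} tears the MR down */
	if (atomic_inc_return(&mr->invalidated) > 1) {
		umem->invalidation_ctx->inflight_invalidation = 1;
		return;
	}
	umem->invalidation_ctx->peer_callback = 1;
	my_hw_release(mr);	/* stop all HW access to these pages */
	ib_umem_release(umem);
	complete(&mr->comp);	/* unblock a waiting dereg_mr */
}

static int my_dereg_mr(struct my_mr *mr)
{
	if (atomic_inc_return(&mr->invalidated) > 1) {
		/* the callback is (or was) running; wait until it is done */
		wait_for_completion(&mr->comp);
	} else {
		my_hw_release(mr);
		ib_umem_release(mr->umem);
	}
	kfree(mr);
	return 0;
}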

Signed-off-by: Yishai Hadas <yishaih-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
Signed-off-by: Shachar Raindel <raindel-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
---
 drivers/infiniband/core/peer_mem.c |   86 +++++++++++++++++++++++++++++++++---
 drivers/infiniband/core/umem.c     |   51 ++++++++++++++++++---
 include/rdma/ib_peer_mem.h         |    4 +-
 include/rdma/ib_umem.h             |   17 +++++++
 4 files changed, 143 insertions(+), 15 deletions(-)

diff --git a/drivers/infiniband/core/peer_mem.c b/drivers/infiniband/core/peer_mem.c
index ad10672..d6bd192 100644
--- a/drivers/infiniband/core/peer_mem.c
+++ b/drivers/infiniband/core/peer_mem.c
@@ -38,10 +38,57 @@ static DEFINE_MUTEX(peer_memory_mutex);
 static LIST_HEAD(peer_memory_list);
 static int num_registered_peers;
 
-static int ib_invalidate_peer_memory(void *reg_handle, void *core_context)
+/* Caller should be holding the peer client lock, ib_peer_client->lock */
+static struct core_ticket *ib_peer_search_context(struct ib_peer_memory_client *ib_peer_client,
+						  unsigned long key)
+{
+	struct core_ticket *core_ticket;
+
+	list_for_each_entry(core_ticket, &ib_peer_client->core_ticket_list,
+			    ticket_list) {
+		if (core_ticket->key == key)
+			return core_ticket;
+	}
 
+	return NULL;
+}
+
+static int ib_invalidate_peer_memory(void *reg_handle, void *core_context)
 {
-	return -ENOSYS;
+	struct ib_peer_memory_client *ib_peer_client =
+		(struct ib_peer_memory_client *)reg_handle;
+	struct invalidation_ctx *invalidation_ctx;
+	struct core_ticket *core_ticket;
+	int need_unlock = 1;
+
+	mutex_lock(&ib_peer_client->lock);
+	core_ticket = ib_peer_search_context(ib_peer_client,
+					     (unsigned long)core_context);
+	if (!core_ticket)
+		goto out;
+
+	invalidation_ctx = (struct invalidation_ctx *)core_ticket->context;
+	/* If context is not ready yet, mark it to be invalidated */
+	if (!invalidation_ctx->func) {
+		invalidation_ctx->peer_invalidated = 1;
+		goto out;
+	}
+	invalidation_ctx->func(invalidation_ctx->cookie,
+					invalidation_ctx->umem, 0, 0);
+	if (invalidation_ctx->inflight_invalidation) {
+		/* init the completion to wait on before letting the other thread run */
+		init_completion(&invalidation_ctx->comp);
+		mutex_unlock(&ib_peer_client->lock);
+		need_unlock = 0;
+		wait_for_completion(&invalidation_ctx->comp);
+	}
+
+	kfree(invalidation_ctx);
+out:
+	if (need_unlock)
+		mutex_unlock(&ib_peer_client->lock);
+
+	return 0;
 }
 
 static int peer_ticket_exists(struct ib_peer_memory_client *ib_peer_client,
@@ -168,11 +215,30 @@ int ib_peer_create_invalidation_ctx(struct ib_peer_memory_client *ib_peer_mem, s
 void ib_peer_destroy_invalidation_ctx(struct ib_peer_memory_client *ib_peer_mem,
 				      struct invalidation_ctx *invalidation_ctx)
 {
-	mutex_lock(&ib_peer_mem->lock);
-	ib_peer_remove_context(ib_peer_mem, invalidation_ctx->context_ticket);
-	mutex_unlock(&ib_peer_mem->lock);
+	int peer_callback;
+	int inflight_invalidation;
 
-	kfree(invalidation_ctx);
+	/* If we are under the peer callback, the lock was already taken. */
+	if (!invalidation_ctx->peer_callback)
+		mutex_lock(&ib_peer_mem->lock);
+	ib_peer_remove_context(ib_peer_mem, invalidation_ctx->context_ticket);
+	/* Make sure to check the inflight flag only after taking the lock and
+	 * removing the entry from the list. In addition, from this point on use
+	 * local variables for peer_callback and inflight_invalidation: after the
+	 * complete() the invalidation_ctx can't be accessed any more, as it may
+	 * be freed by the callback.
+	 */
+	peer_callback = invalidation_ctx->peer_callback;
+	inflight_invalidation = invalidation_ctx->inflight_invalidation;
+	if (inflight_invalidation)
+		complete(&invalidation_ctx->comp);
+
+	/* Under the peer callback, the lock is handled externally */
+	if (!peer_callback)
+		mutex_unlock(&ib_peer_mem->lock);
+
+	/* If under the callback context, or a callback is pending, let it free the invalidation context */
+	if (!peer_callback && !inflight_invalidation)
+		kfree(invalidation_ctx);
 }
 static int ib_memory_peer_check_mandatory(const struct peer_memory_client
 						     *peer_client)
@@ -261,13 +327,19 @@ void ib_unregister_peer_memory_client(void *reg_handle)
 EXPORT_SYMBOL(ib_unregister_peer_memory_client);
 
 struct ib_peer_memory_client *ib_get_peer_client(struct ib_ucontext *context, unsigned long addr,
-						 size_t size, void **peer_client_context)
+						 size_t size, unsigned long peer_mem_flags,
+						 void **peer_client_context)
 {
 	struct ib_peer_memory_client *ib_peer_client;
 	int ret;
 
 	mutex_lock(&peer_memory_mutex);
 	list_for_each_entry(ib_peer_client, &peer_memory_list, core_peer_list) {
+		/* A peer which requires invalidation can't own memory which doesn't support it */
+		if (ib_peer_client->invalidation_required &&
+		    (!(peer_mem_flags & IB_PEER_MEM_INVAL_SUPP)))
+			continue;
+
 		ret = ib_peer_client->peer_mem->acquire(addr, size,
 						   context->peer_mem_private_data,
 						   context->peer_mem_name,
diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
index 0de9916..51f32a1 100644
--- a/drivers/infiniband/core/umem.c
+++ b/drivers/infiniband/core/umem.c
@@ -44,12 +44,19 @@
 
 static struct ib_umem *peer_umem_get(struct ib_peer_memory_client *ib_peer_mem,
 				     struct ib_umem *umem, unsigned long addr,
-				     int dmasync)
+				     int dmasync, unsigned long peer_mem_flags)
 {
 	int ret;
 	const struct peer_memory_client *peer_mem = ib_peer_mem->peer_mem;
+	struct invalidation_ctx *invalidation_ctx = NULL;
 
 	umem->ib_peer_mem = ib_peer_mem;
+	if (peer_mem_flags & IB_PEER_MEM_INVAL_SUPP) {
+		ret = ib_peer_create_invalidation_ctx(ib_peer_mem, umem, &invalidation_ctx);
+		if (ret)
+			goto end;
+	}
+
 	/*
 	 * We always request write permissions to the pages, to force breaking of any CoW
 	 * during the registration of the MR. For read-only MRs we use the "force" flag to
@@ -60,7 +67,9 @@ static struct ib_umem *peer_umem_get(struct ib_peer_memory_client *ib_peer_mem,
 				  1, !umem->writable,
 				  &umem->sg_head,
 				  umem->peer_mem_client_context,
-				  NULL);
+				  invalidation_ctx ?
+				  (void *)invalidation_ctx->context_ticket : NULL);
+
 	if (ret)
 		goto out;
 
@@ -84,6 +93,9 @@ put_pages:
 	peer_mem->put_pages(&umem->sg_head,
 					umem->peer_mem_client_context);
 out:
+	if (invalidation_ctx)
+		ib_peer_destroy_invalidation_ctx(ib_peer_mem, invalidation_ctx);
+end:
 	ib_put_peer_client(ib_peer_mem, umem->peer_mem_client_context);
 	kfree(umem);
 	return ERR_PTR(ret);
@@ -91,15 +103,19 @@ out:
 
 static void peer_umem_release(struct ib_umem *umem)
 {
-	const struct peer_memory_client *peer_mem =
-				umem->ib_peer_mem->peer_mem;
+	struct ib_peer_memory_client *ib_peer_mem = umem->ib_peer_mem;
+	const struct peer_memory_client *peer_mem = ib_peer_mem->peer_mem;
+	struct invalidation_ctx *invalidation_ctx = umem->invalidation_ctx;
+
+	if (invalidation_ctx)
+		ib_peer_destroy_invalidation_ctx(ib_peer_mem, invalidation_ctx);
 
 	peer_mem->dma_unmap(&umem->sg_head,
 			    umem->peer_mem_client_context,
 			    umem->context->device->dma_device);
 	peer_mem->put_pages(&umem->sg_head,
 			    umem->peer_mem_client_context);
-	ib_put_peer_client(umem->ib_peer_mem, umem->peer_mem_client_context);
+	ib_put_peer_client(ib_peer_mem, umem->peer_mem_client_context);
 	kfree(umem);
 }
 
@@ -127,6 +143,27 @@ static void __ib_umem_release(struct ib_device *dev, struct ib_umem *umem, int d
 
 }
 
+int ib_umem_activate_invalidation_notifier(struct ib_umem *umem,
+					   umem_invalidate_func_t func,
+					   void *cookie)
+{
+	struct invalidation_ctx *invalidation_ctx = umem->invalidation_ctx;
+	int ret = 0;
+
+	mutex_lock(&umem->ib_peer_mem->lock);
+	if (invalidation_ctx->peer_invalidated) {
+		pr_err("ib_umem_activate_invalidation_notifier: pages were invalidated by peer\n");
+		ret = -EINVAL;
+		goto end;
+	}
+	invalidation_ctx->func = func;
+	invalidation_ctx->cookie = cookie;
+	/* from this point on, any pending invalidation can be delivered */
+end:
+	mutex_unlock(&umem->ib_peer_mem->lock);
+	return ret;
+}
+EXPORT_SYMBOL(ib_umem_activate_invalidation_notifier);
 /**
  * ib_umem_get - Pin and DMA map userspace memory.
  * @context: userspace context to pin memory for
@@ -179,11 +216,11 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr,
 	if (peer_mem_flags & IB_PEER_MEM_ALLOW) {
 		struct ib_peer_memory_client *peer_mem_client;
 
-		peer_mem_client =  ib_get_peer_client(context, addr, size,
+		peer_mem_client =  ib_get_peer_client(context, addr, size, peer_mem_flags,
 						      &umem->peer_mem_client_context);
 		if (peer_mem_client)
 			return peer_umem_get(peer_mem_client, umem, addr,
-					dmasync);
+					dmasync, peer_mem_flags);
 	}
 
 	/* We assume the memory is from hugetlb until proved otherwise */
diff --git a/include/rdma/ib_peer_mem.h b/include/rdma/ib_peer_mem.h
index d3fbb50..8f67aaf 100644
--- a/include/rdma/ib_peer_mem.h
+++ b/include/rdma/ib_peer_mem.h
@@ -22,6 +22,7 @@ struct ib_peer_memory_client {
 
 enum ib_peer_mem_flags {
 	IB_PEER_MEM_ALLOW	= 1,
+	IB_PEER_MEM_INVAL_SUPP = (1<<1),
 };
 
 struct core_ticket {
@@ -31,7 +32,8 @@ struct core_ticket {
 };
 
 struct ib_peer_memory_client *ib_get_peer_client(struct ib_ucontext *context, unsigned long addr,
-						 size_t size, void **peer_client_context);
+						 size_t size, unsigned long peer_mem_flags,
+						 void **peer_client_context);
 
 void ib_put_peer_client(struct ib_peer_memory_client *ib_peer_client,
 			void *peer_client_context);
diff --git a/include/rdma/ib_umem.h b/include/rdma/ib_umem.h
index 4b8a042..83d6059 100644
--- a/include/rdma/ib_umem.h
+++ b/include/rdma/ib_umem.h
@@ -39,10 +39,21 @@
 #include <rdma/ib_peer_mem.h>
 
 struct ib_ucontext;
+struct ib_umem;
+
+typedef void (*umem_invalidate_func_t)(void *invalidation_cookie,
+					    struct ib_umem *umem,
+					    unsigned long addr, size_t size);
 
 struct invalidation_ctx {
 	struct ib_umem *umem;
 	unsigned long context_ticket;
+	umem_invalidate_func_t func;
+	void *cookie;
+	int peer_callback;
+	int inflight_invalidation;
+	int peer_invalidated;
+	struct completion comp;
 };
 
 struct ib_umem {
@@ -73,6 +84,9 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr,
 			       unsigned long peer_mem_flags);
 void ib_umem_release(struct ib_umem *umem);
 int ib_umem_page_count(struct ib_umem *umem);
+int  ib_umem_activate_invalidation_notifier(struct ib_umem *umem,
+					    umem_invalidate_func_t func,
+					    void *cookie);
 
 #else /* CONFIG_INFINIBAND_USER_MEM */
 
@@ -87,6 +101,9 @@ static inline struct ib_umem *ib_umem_get(struct ib_ucontext *context,
 static inline void ib_umem_release(struct ib_umem *umem) { }
 static inline int ib_umem_page_count(struct ib_umem *umem) { return 0; }
 
+static inline int ib_umem_activate_invalidation_notifier(struct ib_umem *umem,
+							 umem_invalidate_func_t func,
+							 void *cookie) {return 0; }
 #endif /* CONFIG_INFINIBAND_USER_MEM */
 
 #endif /* IB_UMEM_H */
-- 
1.7.1

* [PATCH for-next 6/9] IB/core: Sysfs support for peer memory
       [not found] ` <1412176717-11979-1-git-send-email-yishaih-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
                     ` (4 preceding siblings ...)
  2014-10-01 15:18   ` [PATCH for-next 5/9] IB/core: Invalidation support for peer memory Yishai Hadas
@ 2014-10-01 15:18   ` Yishai Hadas
       [not found]     ` <1412176717-11979-7-git-send-email-yishaih-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
  2014-10-01 15:18   ` [PATCH for-next 7/9] IB/mlx4: Invalidation support for MR over " Yishai Hadas
                     ` (2 subsequent siblings)
  8 siblings, 1 reply; 22+ messages in thread
From: Yishai Hadas @ 2014-10-01 15:18 UTC (permalink / raw)
  To: roland-DgEjT+Ai2ygdnm+yROfE0A
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	raindel-VPRAkNaXOzVWk0Htik3J/w, yishaih-VPRAkNaXOzVWk0Htik3J/w

Supplies the required functionality to expose information and
statistics over sysfs for a given peer memory client.

This mechanism enables a userspace application to check which peers
are available (based on name & version) and, based on that, to decide
whether it can run successfully.

The root sysfs directory is /sys/kernel/mm/memory_peers/<peer_name>;
under that directory reside files that expose the version and the
statistics for that peer.
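
For example, with the sample client from the last patch of this series
loaded, the hierarchy would look as follows (illustrative):

/sys/kernel/mm/memory_peers/example_peer_mem/
|-- version
|-- num_alloc_mrs
|-- num_reg_pages
|-- num_dereg_pages
`-- num_free_callbacks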

Signed-off-by: Yishai Hadas <yishaih-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
Signed-off-by: Shachar Raindel <raindel-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
---
 drivers/infiniband/core/peer_mem.c |  162 +++++++++++++++++++++++++++++++++++-
 drivers/infiniband/core/umem.c     |    4 +
 include/rdma/ib_peer_mem.h         |   11 +++
 3 files changed, 176 insertions(+), 1 deletions(-)

diff --git a/drivers/infiniband/core/peer_mem.c b/drivers/infiniband/core/peer_mem.c
index d6bd192..74ec8b4 100644
--- a/drivers/infiniband/core/peer_mem.c
+++ b/drivers/infiniband/core/peer_mem.c
@@ -37,6 +37,159 @@
 static DEFINE_MUTEX(peer_memory_mutex);
 static LIST_HEAD(peer_memory_list);
 static int num_registered_peers;
+static struct kobject *peers_kobj;
+
+static void complete_peer(struct kref *kref);
+static struct ib_peer_memory_client *get_peer_by_kobj(void *kobj);
+static ssize_t version_show(struct kobject *kobj,
+			    struct kobj_attribute *attr, char *buf)
+{
+	struct ib_peer_memory_client *ib_peer_client = get_peer_by_kobj(kobj);
+
+	if (ib_peer_client) {
+		sprintf(buf, "%s\n", ib_peer_client->peer_mem->version);
+		kref_put(&ib_peer_client->ref, complete_peer);
+		return strlen(buf);
+	}
+	/* not found - nothing is returned */
+	return 0;
+}
+
+static ssize_t num_alloc_mrs_show(struct kobject *kobj,
+				  struct kobj_attribute *attr, char *buf)
+{
+	struct ib_peer_memory_client *ib_peer_client = get_peer_by_kobj(kobj);
+
+	if (ib_peer_client) {
+		sprintf(buf, "%llu\n", (u64)atomic64_read(&ib_peer_client->stats.num_alloc_mrs));
+		kref_put(&ib_peer_client->ref, complete_peer);
+		return strlen(buf);
+	}
+	/* not found - nothing is returned */
+	return 0;
+}
+
+static ssize_t num_reg_pages_show(struct kobject *kobj,
+				  struct kobj_attribute *attr, char *buf)
+{
+	struct ib_peer_memory_client *ib_peer_client = get_peer_by_kobj(kobj);
+
+	if (ib_peer_client) {
+		sprintf(buf, "%llu\n", (u64)atomic64_read(&ib_peer_client->stats.num_reg_pages));
+		kref_put(&ib_peer_client->ref, complete_peer);
+		return strlen(buf);
+	}
+	/* not found - nothing is returned */
+	return 0;
+}
+
+static ssize_t num_dereg_pages_show(struct kobject *kobj,
+				    struct kobj_attribute *attr, char *buf)
+{
+	struct ib_peer_memory_client *ib_peer_client = get_peer_by_kobj(kobj);
+
+	if (ib_peer_client) {
+		sprintf(buf, "%llu\n", (u64)atomic64_read(&ib_peer_client->stats.num_dereg_pages));
+		kref_put(&ib_peer_client->ref, complete_peer);
+		return strlen(buf);
+	}
+	/* not found - nothing is returned */
+	return 0;
+}
+
+static ssize_t num_free_callbacks_show(struct kobject *kobj,
+				       struct kobj_attribute *attr, char *buf)
+{
+	struct ib_peer_memory_client *ib_peer_client = get_peer_by_kobj(kobj);
+
+	if (ib_peer_client) {
+		sprintf(buf, "%lu\n", ib_peer_client->stats.num_free_callbacks);
+		kref_put(&ib_peer_client->ref, complete_peer);
+		return strlen(buf);
+	}
+	/* not found - nothing is returned */
+	return 0;
+}
+
+static struct kobj_attribute version_attr = __ATTR_RO(version);
+static struct kobj_attribute num_alloc_mrs = __ATTR_RO(num_alloc_mrs);
+static struct kobj_attribute num_reg_pages = __ATTR_RO(num_reg_pages);
+static struct kobj_attribute num_dereg_pages = __ATTR_RO(num_dereg_pages);
+static struct kobj_attribute num_free_callbacks = __ATTR_RO(num_free_callbacks);
+
+static struct attribute *peer_mem_attrs[] = {
+			&version_attr.attr,
+			&num_alloc_mrs.attr,
+			&num_reg_pages.attr,
+			&num_dereg_pages.attr,
+			&num_free_callbacks.attr,
+			NULL,
+};
+
+static void destroy_peer_sysfs(struct ib_peer_memory_client *ib_peer_client)
+{
+	kobject_put(ib_peer_client->kobj);
+	if (!num_registered_peers)
+		kobject_put(peers_kobj);
+}
+
+static int create_peer_sysfs(struct ib_peer_memory_client *ib_peer_client)
+{
+	int ret;
+
+	if (!num_registered_peers) {
+		/* creating under /sys/kernel/mm */
+		peers_kobj = kobject_create_and_add("memory_peers", mm_kobj);
+		if (!peers_kobj)
+			return -ENOMEM;
+	}
+
+	ib_peer_client->peer_mem_attr_group.attrs = peer_mem_attrs;
+	/* The directory was already created explicitly, to get its kobject for further usage */
+	ib_peer_client->peer_mem_attr_group.name = NULL;
+	ib_peer_client->kobj = kobject_create_and_add(ib_peer_client->peer_mem->name,
+		peers_kobj);
+
+	if (!ib_peer_client->kobj) {
+		ret = -EINVAL;
+		goto free;
+	}
+
+	/* Create the files associated with this kobject */
+	ret = sysfs_create_group(ib_peer_client->kobj,
+				 &ib_peer_client->peer_mem_attr_group);
+	if (ret)
+		goto peer_free;
+
+	return 0;
+
+peer_free:
+	kobject_put(ib_peer_client->kobj);
+
+free:
+	if (!num_registered_peers)
+		kobject_put(peers_kobj);
+
+	return ret;
+}
+
+static struct ib_peer_memory_client *get_peer_by_kobj(void *kobj)
+{
+	struct ib_peer_memory_client *ib_peer_client;
+
+	mutex_lock(&peer_memory_mutex);
+	list_for_each_entry(ib_peer_client, &peer_memory_list, core_peer_list) {
+		if (ib_peer_client->kobj == kobj) {
+			kref_get(&ib_peer_client->ref);
+			goto found;
+		}
+	}
+
+	ib_peer_client = NULL;
+found:
+	mutex_unlock(&peer_memory_mutex);
+	return ib_peer_client;
+}
 
 /* Caller should be holding the peer client lock, ib_peer_client->lock */
 static struct core_ticket *ib_peer_search_context(struct ib_peer_memory_client *ib_peer_client,
@@ -62,6 +215,7 @@ static int ib_invalidate_peer_memory(void *reg_handle, void *core_context)
 	int need_unlock = 1;
 
 	mutex_lock(&ib_peer_client->lock);
+	ib_peer_client->stats.num_free_callbacks += 1;
 	core_ticket = ib_peer_search_context(ib_peer_client,
 					     (unsigned long)core_context);
 	if (!core_ticket)
@@ -303,10 +457,15 @@ void *ib_register_peer_memory_client(const struct peer_memory_client *peer_clien
 		ib_peer_client->invalidation_required = 1;
 	}
 	mutex_lock(&peer_memory_mutex);
+	if (create_peer_sysfs(ib_peer_client)) {
+		kfree(ib_peer_client);
+		ib_peer_client = NULL;
+		goto end;
+	}
 	list_add_tail(&ib_peer_client->core_peer_list, &peer_memory_list);
 	num_registered_peers++;
+end:
 	mutex_unlock(&peer_memory_mutex);
-
 	return ib_peer_client;
 }
 EXPORT_SYMBOL(ib_register_peer_memory_client);
@@ -319,6 +478,7 @@ void ib_unregister_peer_memory_client(void *reg_handle)
 	mutex_lock(&peer_memory_mutex);
 	list_del(&ib_peer_client->core_peer_list);
 	num_registered_peers--;
+	destroy_peer_sysfs(ib_peer_client);
 	mutex_unlock(&peer_memory_mutex);
 	kref_put(&ib_peer_client->ref, complete_peer);
 	wait_for_completion(&ib_peer_client->unload_comp);
diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
index 51f32a1..9df5616 100644
--- a/drivers/infiniband/core/umem.c
+++ b/drivers/infiniband/core/umem.c
@@ -87,6 +87,8 @@ static struct ib_umem *peer_umem_get(struct ib_peer_memory_client *ib_peer_mem,
 	if (ret)
 		goto put_pages;
 
+	atomic64_add(umem->nmap * (umem->page_size >> PAGE_SHIFT), &ib_peer_mem->stats.num_reg_pages);
+	atomic64_inc(&ib_peer_mem->stats.num_alloc_mrs);
 	return umem;
 
 put_pages:
@@ -115,6 +117,8 @@ static void peer_umem_release(struct ib_umem *umem)
 			    umem->context->device->dma_device);
 	peer_mem->put_pages(&umem->sg_head,
 			    umem->peer_mem_client_context);
+	atomic64_add(umem->nmap * (umem->page_size >> PAGE_SHIFT), &ib_peer_mem->stats.num_dereg_pages);
+	atomic64_inc(&ib_peer_mem->stats.num_dealloc_mrs);
 	ib_put_peer_client(ib_peer_mem, umem->peer_mem_client_context);
 	kfree(umem);
 }
diff --git a/include/rdma/ib_peer_mem.h b/include/rdma/ib_peer_mem.h
index 8f67aaf..4b1ae14 100644
--- a/include/rdma/ib_peer_mem.h
+++ b/include/rdma/ib_peer_mem.h
@@ -3,6 +3,14 @@
 
 #include <rdma/peer_mem.h>
 
+struct ib_peer_memory_statistics {
+	atomic64_t num_alloc_mrs;
+	atomic64_t num_dealloc_mrs;
+	atomic64_t num_reg_pages;
+	atomic64_t num_dereg_pages;
+	unsigned long num_free_callbacks;
+};
+
 struct ib_ucontext;
 struct ib_umem;
 struct invalidation_ctx;
@@ -18,6 +26,9 @@ struct ib_peer_memory_client {
 	struct list_head   core_ticket_list;
 	unsigned long       last_ticket;
 	int ticket_wrapped;
+	struct kobject *kobj;
+	struct attribute_group peer_mem_attr_group;
+	struct ib_peer_memory_statistics stats;
 };
 
 enum ib_peer_mem_flags {
-- 
1.7.1

* [PATCH for-next 7/9] IB/mlx4: Invalidation support for MR over peer memory
       [not found] ` <1412176717-11979-1-git-send-email-yishaih-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
                     ` (5 preceding siblings ...)
  2014-10-01 15:18   ` [PATCH for-next 6/9] IB/core: Sysfs " Yishai Hadas
@ 2014-10-01 15:18   ` Yishai Hadas
  2014-10-01 15:18   ` [PATCH for-next 8/9] IB/mlx5: " Yishai Hadas
  2014-10-01 15:18   ` [PATCH for-next 9/9] Samples: Peer memory client example Yishai Hadas
  8 siblings, 0 replies; 22+ messages in thread
From: Yishai Hadas @ 2014-10-01 15:18 UTC (permalink / raw)
  To: roland-DgEjT+Ai2ygdnm+yROfE0A
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	raindel-VPRAkNaXOzVWk0Htik3J/w, yishaih-VPRAkNaXOzVWk0Htik3J/w

Adds the required functionality to work with peer memory
clients which require invalidation support.

It includes:

- umem invalidation callback - once called, it should free any HW
  resources assigned to that umem, then free the peer resources
  corresponding to that umem.
- The MR object related to that umem stays alive until dereg_mr is
  called.
- Synchronization support between dereg_mr and the invalidate callback.
- Advertises the P2P device capability.

Signed-off-by: Yishai Hadas <yishaih-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
Signed-off-by: Shachar Raindel <raindel-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
---
 drivers/infiniband/hw/mlx4/main.c    |    3 +-
 drivers/infiniband/hw/mlx4/mlx4_ib.h |    5 ++
 drivers/infiniband/hw/mlx4/mr.c      |   81 +++++++++++++++++++++++++++++++---
 3 files changed, 81 insertions(+), 8 deletions(-)

diff --git a/drivers/infiniband/hw/mlx4/main.c b/drivers/infiniband/hw/mlx4/main.c
index c7586a1..2f349a2 100644
--- a/drivers/infiniband/hw/mlx4/main.c
+++ b/drivers/infiniband/hw/mlx4/main.c
@@ -162,7 +162,8 @@ static int mlx4_ib_query_device(struct ib_device *ibdev,
 		IB_DEVICE_PORT_ACTIVE_EVENT		|
 		IB_DEVICE_SYS_IMAGE_GUID		|
 		IB_DEVICE_RC_RNR_NAK_GEN		|
-		IB_DEVICE_BLOCK_MULTICAST_LOOPBACK;
+		IB_DEVICE_BLOCK_MULTICAST_LOOPBACK	|
+		IB_DEVICE_PEER_MEMORY;
 	if (dev->dev->caps.flags & MLX4_DEV_CAP_FLAG_BAD_PKEY_CNTR)
 		props->device_cap_flags |= IB_DEVICE_BAD_PKEY_CNTR;
 	if (dev->dev->caps.flags & MLX4_DEV_CAP_FLAG_BAD_QKEY_CNTR)
diff --git a/drivers/infiniband/hw/mlx4/mlx4_ib.h b/drivers/infiniband/hw/mlx4/mlx4_ib.h
index 6eb743f..4b3dc70 100644
--- a/drivers/infiniband/hw/mlx4/mlx4_ib.h
+++ b/drivers/infiniband/hw/mlx4/mlx4_ib.h
@@ -116,6 +116,11 @@ struct mlx4_ib_mr {
 	struct ib_mr		ibmr;
 	struct mlx4_mr		mmr;
 	struct ib_umem	       *umem;
+	atomic_t      invalidated;
+	struct completion invalidation_comp;
+	/* lock protects the live indication */
+	struct mutex lock;
+	int    live;
 };
 
 struct mlx4_ib_mw {
diff --git a/drivers/infiniband/hw/mlx4/mr.c b/drivers/infiniband/hw/mlx4/mr.c
index ad4cdfd..ddc9530 100644
--- a/drivers/infiniband/hw/mlx4/mr.c
+++ b/drivers/infiniband/hw/mlx4/mr.c
@@ -59,7 +59,7 @@ struct ib_mr *mlx4_ib_get_dma_mr(struct ib_pd *pd, int acc)
 	struct mlx4_ib_mr *mr;
 	int err;
 
-	mr = kmalloc(sizeof *mr, GFP_KERNEL);
+	mr = kzalloc(sizeof *mr, GFP_KERNEL);
 	if (!mr)
 		return ERR_PTR(-ENOMEM);
 
@@ -130,6 +130,31 @@ out:
 	return err;
 }
 
+static void mlx4_invalidate_umem(void *invalidation_cookie,
+				 struct ib_umem *umem,
+				 unsigned long addr, size_t size)
+{
+	struct mlx4_ib_mr *mr = (struct mlx4_ib_mr *)invalidation_cookie;
+
+	mutex_lock(&mr->lock);
+	/* This function is called under the peer client's lock, so its resources are race protected */
+	if (atomic_inc_return(&mr->invalidated) > 1) {
+		umem->invalidation_ctx->inflight_invalidation = 1;
+		mutex_unlock(&mr->lock);
+		return;
+	}
+	if (!mr->live) {
+		mutex_unlock(&mr->lock);
+		return;
+	}
+
+	mutex_unlock(&mr->lock);
+	umem->invalidation_ctx->peer_callback = 1;
+	mlx4_mr_free(to_mdev(mr->ibmr.device)->dev, &mr->mmr);
+	ib_umem_release(umem);
+	complete(&mr->invalidation_comp);
+}
+
 struct ib_mr *mlx4_ib_reg_user_mr(struct ib_pd *pd, u64 start, u64 length,
 				  u64 virt_addr, int access_flags,
 				  struct ib_udata *udata)
@@ -139,28 +164,54 @@ struct ib_mr *mlx4_ib_reg_user_mr(struct ib_pd *pd, u64 start, u64 length,
 	int shift;
 	int err;
 	int n;
+	struct ib_peer_memory_client *ib_peer_mem;
 
-	mr = kmalloc(sizeof *mr, GFP_KERNEL);
+	mr = kzalloc(sizeof *mr, GFP_KERNEL);
 	if (!mr)
 		return ERR_PTR(-ENOMEM);
 
+	mutex_init(&mr->lock);
 	/* Force registering the memory as writable. */
 	/* Used for memory re-registeration. HCA protects the access */
 	mr->umem = ib_umem_get(pd->uobject->context, start, length,
 			       access_flags | IB_ACCESS_LOCAL_WRITE, 0,
-			       IB_PEER_MEM_ALLOW);
+			       IB_PEER_MEM_ALLOW | IB_PEER_MEM_INVAL_SUPP);
 	if (IS_ERR(mr->umem)) {
 		err = PTR_ERR(mr->umem);
 		goto err_free;
 	}
 
+	ib_peer_mem = mr->umem->ib_peer_mem;
+	if (ib_peer_mem) {
+		err = ib_umem_activate_invalidation_notifier(mr->umem, mlx4_invalidate_umem, mr);
+		if (err)
+			goto err_umem;
+	}
+
+	mutex_lock(&mr->lock);
+	if (atomic_read(&mr->invalidated))
+		goto err_locked_umem;
+
+	if (ib_peer_mem) {
+		if (access_flags & IB_ACCESS_MW_BIND) {
+			/* Prevent binding an MW over peer clients: mlx4_invalidate_umem
+			 * is a void function and must succeed, however mlx4_mr_free
+			 * might fail when MWs are in use.
+			 */
+			err = -ENOSYS;
+			pr_err("MW is not supported with peer memory client\n");
+			goto err_locked_umem;
+		}
+		init_completion(&mr->invalidation_comp);
+	}
+
 	n = ib_umem_page_count(mr->umem);
 	shift = ilog2(mr->umem->page_size);
 
 	err = mlx4_mr_alloc(dev->dev, to_mpd(pd)->pdn, virt_addr, length,
 			    convert_access(access_flags), n, shift, &mr->mmr);
 	if (err)
-		goto err_umem;
+		goto err_locked_umem;
 
 	err = mlx4_ib_umem_write_mtt(dev, &mr->mmr.mtt, mr->umem);
 	if (err)
@@ -171,12 +222,16 @@ struct ib_mr *mlx4_ib_reg_user_mr(struct ib_pd *pd, u64 start, u64 length,
 		goto err_mr;
 
 	mr->ibmr.rkey = mr->ibmr.lkey = mr->mmr.key;
-
+	mr->live = 1;
+	mutex_unlock(&mr->lock);
 	return &mr->ibmr;
 
 err_mr:
 	(void) mlx4_mr_free(to_mdev(pd->device)->dev, &mr->mmr);
 
+err_locked_umem:
+	mutex_unlock(&mr->lock);
+
 err_umem:
 	ib_umem_release(mr->umem);
 
@@ -284,11 +339,23 @@ int mlx4_ib_dereg_mr(struct ib_mr *ibmr)
 	struct mlx4_ib_mr *mr = to_mmr(ibmr);
 	int ret;
 
+	if (atomic_inc_return(&mr->invalidated) > 1) {
+		wait_for_completion(&mr->invalidation_comp);
+		goto end;
+	}
+
 	ret = mlx4_mr_free(to_mdev(ibmr->device)->dev, &mr->mmr);
-	if (ret)
+	if (ret) {
+		/* An error is not expected here, except when memory windows
+		 * are bound to the MR, which is not supported with
+		 * peer memory clients.
+		 */
+		atomic_set(&mr->invalidated, 0);
 		return ret;
+	}
 	if (mr->umem)
 		ib_umem_release(mr->umem);
+end:
 	kfree(mr);
 
 	return 0;
@@ -365,7 +432,7 @@ struct ib_mr *mlx4_ib_alloc_fast_reg_mr(struct ib_pd *pd,
 	struct mlx4_ib_mr *mr;
 	int err;
 
-	mr = kmalloc(sizeof *mr, GFP_KERNEL);
+	mr = kzalloc(sizeof *mr, GFP_KERNEL);
 	if (!mr)
 		return ERR_PTR(-ENOMEM);
 
-- 
1.7.1

* [PATCH for-next 8/9] IB/mlx5: Invalidation support for MR over peer memory
       [not found] ` <1412176717-11979-1-git-send-email-yishaih-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
                     ` (6 preceding siblings ...)
  2014-10-01 15:18   ` [PATCH for-next 7/9] IB/mlx4: Invalidation support for MR over " Yishai Hadas
@ 2014-10-01 15:18   ` Yishai Hadas
  2014-10-01 15:18   ` [PATCH for-next 9/9] Samples: Peer memory client example Yishai Hadas
  8 siblings, 0 replies; 22+ messages in thread
From: Yishai Hadas @ 2014-10-01 15:18 UTC (permalink / raw)
  To: roland-DgEjT+Ai2ygdnm+yROfE0A
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	raindel-VPRAkNaXOzVWk0Htik3J/w, yishaih-VPRAkNaXOzVWk0Htik3J/w

Adds the required functionality to work with peer memory
clients which require invalidation support.

It includes:

- umem invalidation callback - once called, it should free any HW
  resources assigned to that umem, then free the peer resources
  corresponding to that umem.
- The MR object related to that umem stays alive until dereg_mr is
  called.
- Synchronization support between dereg_mr and the invalidate
  callback; since registration may still be in progress when the peer
  invalidates, the callback first waits on a completion
  (mlx5_ib_peer_id) so that it never observes a half-initialized MR.
- Advertises the P2P device capability.

Signed-off-by: Yishai Hadas <yishaih-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
Signed-off-by: Shachar Raindel <raindel-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
---
 drivers/infiniband/hw/mlx5/main.c    |    3 +-
 drivers/infiniband/hw/mlx5/mlx5_ib.h |   10 ++++
 drivers/infiniband/hw/mlx5/mr.c      |   84 ++++++++++++++++++++++++++++++++--
 3 files changed, 91 insertions(+), 6 deletions(-)

diff --git a/drivers/infiniband/hw/mlx5/main.c b/drivers/infiniband/hw/mlx5/main.c
index d8907b2..4185531 100644
--- a/drivers/infiniband/hw/mlx5/main.c
+++ b/drivers/infiniband/hw/mlx5/main.c
@@ -182,7 +182,8 @@ static int mlx5_ib_query_device(struct ib_device *ibdev,
 	props->device_cap_flags    = IB_DEVICE_CHANGE_PHY_PORT |
 		IB_DEVICE_PORT_ACTIVE_EVENT		|
 		IB_DEVICE_SYS_IMAGE_GUID		|
-		IB_DEVICE_RC_RNR_NAK_GEN;
+		IB_DEVICE_RC_RNR_NAK_GEN		|
+		IB_DEVICE_PEER_MEMORY;
 	flags = dev->mdev->caps.flags;
 	if (flags & MLX5_DEV_CAP_FLAG_BAD_PKEY_CNTR)
 		props->device_cap_flags |= IB_DEVICE_BAD_PKEY_CNTR;
diff --git a/drivers/infiniband/hw/mlx5/mlx5_ib.h b/drivers/infiniband/hw/mlx5/mlx5_ib.h
index 386780f..bae7338 100644
--- a/drivers/infiniband/hw/mlx5/mlx5_ib.h
+++ b/drivers/infiniband/hw/mlx5/mlx5_ib.h
@@ -85,6 +85,8 @@ enum mlx5_ib_mad_ifc_flags {
 	MLX5_MAD_IFC_NET_VIEW		= 4,
 };
 
+struct mlx5_ib_peer_id;
+
 struct mlx5_ib_ucontext {
 	struct ib_ucontext	ibucontext;
 	struct list_head	db_page_list;
@@ -267,6 +269,14 @@ struct mlx5_ib_mr {
 	struct mlx5_ib_dev     *dev;
 	struct mlx5_create_mkey_mbox_out out;
 	struct mlx5_core_sig_ctx    *sig;
+	struct mlx5_ib_peer_id *peer_id;
+	atomic_t      invalidated;
+	struct completion invalidation_comp;
+};
+
+struct mlx5_ib_peer_id {
+	struct completion comp;
+	struct mlx5_ib_mr *mr;
 };
 
 struct mlx5_ib_fast_reg_page_list {
diff --git a/drivers/infiniband/hw/mlx5/mr.c b/drivers/infiniband/hw/mlx5/mr.c
index 55c6649..390b149 100644
--- a/drivers/infiniband/hw/mlx5/mr.c
+++ b/drivers/infiniband/hw/mlx5/mr.c
@@ -38,6 +38,9 @@
 #include <linux/delay.h>
 #include <rdma/ib_umem.h>
 #include "mlx5_ib.h"
+static void mlx5_invalidate_umem(void *invalidation_cookie,
+				 struct ib_umem *umem,
+				 unsigned long addr, size_t size);
 
 enum {
 	MAX_PENDING_REG_MR = 8,
@@ -880,16 +883,32 @@ struct ib_mr *mlx5_ib_reg_user_mr(struct ib_pd *pd, u64 start, u64 length,
 	int ncont;
 	int order;
 	int err;
+	struct ib_peer_memory_client *ib_peer_mem;
+	struct mlx5_ib_peer_id *mlx5_ib_peer_id = NULL;
 
 	mlx5_ib_dbg(dev, "start 0x%llx, virt_addr 0x%llx, length 0x%llx\n",
 		    start, virt_addr, length);
 	umem = ib_umem_get(pd->uobject->context, start, length, access_flags,
-			   0, IB_PEER_MEM_ALLOW);
+			   0, IB_PEER_MEM_ALLOW | IB_PEER_MEM_INVAL_SUPP);
 	if (IS_ERR(umem)) {
 		mlx5_ib_dbg(dev, "umem get failed\n");
 		return (void *)umem;
 	}
 
+	ib_peer_mem = umem->ib_peer_mem;
+	if (ib_peer_mem) {
+		mlx5_ib_peer_id = kzalloc(sizeof(*mlx5_ib_peer_id), GFP_KERNEL);
+		if (!mlx5_ib_peer_id) {
+			err = -ENOMEM;
+			goto error;
+		}
+		init_completion(&mlx5_ib_peer_id->comp);
+		err = ib_umem_activate_invalidation_notifier(umem, mlx5_invalidate_umem,
+							     mlx5_ib_peer_id);
+		if (err)
+			goto error;
+	}
+
 	mlx5_ib_cont_pages(umem, start, &npages, &page_shift, &ncont, &order);
 	if (!npages) {
 		mlx5_ib_warn(dev, "avoid zero region\n");
@@ -927,11 +946,21 @@ struct ib_mr *mlx5_ib_reg_user_mr(struct ib_pd *pd, u64 start, u64 length,
 	spin_unlock(&dev->mr_lock);
 	mr->ibmr.lkey = mr->mmr.key;
 	mr->ibmr.rkey = mr->mmr.key;
+	atomic_set(&mr->invalidated, 0);
+	if (ib_peer_mem) {
+		init_completion(&mr->invalidation_comp);
+		mlx5_ib_peer_id->mr = mr;
+		mr->peer_id = mlx5_ib_peer_id;
+		complete(&mlx5_ib_peer_id->comp);
+	}
 
 	return &mr->ibmr;
 
 error:
+	if (mlx5_ib_peer_id)
+		complete(&mlx5_ib_peer_id->comp);
 	ib_umem_release(umem);
+	kfree(mlx5_ib_peer_id);
 	return ERR_PTR(err);
 }
 
@@ -968,7 +997,7 @@ error:
 	return err;
 }
 
-int mlx5_ib_dereg_mr(struct ib_mr *ibmr)
+static int mlx5_ib_invalidate_mr(struct ib_mr *ibmr)
 {
 	struct mlx5_ib_dev *dev = to_mdev(ibmr->device);
 	struct mlx5_ib_mr *mr = to_mmr(ibmr);
@@ -990,7 +1019,6 @@ int mlx5_ib_dereg_mr(struct ib_mr *ibmr)
 			mlx5_ib_warn(dev, "failed unregister\n");
 			return err;
 		}
-		free_cached_mr(dev, mr);
 	}
 
 	if (umem) {
@@ -1000,9 +1028,32 @@ int mlx5_ib_dereg_mr(struct ib_mr *ibmr)
 		spin_unlock(&dev->mr_lock);
 	}
 
-	if (!umred)
-		kfree(mr);
+	return 0;
+}
+
+int mlx5_ib_dereg_mr(struct ib_mr *ibmr)
+{
+	struct mlx5_ib_dev *dev = to_mdev(ibmr->device);
+	struct mlx5_ib_mr *mr = to_mmr(ibmr);
+	int ret = 0;
+	int umred = mr->umred;
 
+	if (atomic_inc_return(&mr->invalidated) > 1) {
+		/* An in-flight invalidation call is pending; wait for it to terminate */
+		wait_for_completion(&mr->invalidation_comp);
+	} else {
+		ret = mlx5_ib_invalidate_mr(ibmr);
+		if (ret)
+			return ret;
+	}
+	kfree(mr->peer_id);
+	mr->peer_id = NULL;
+	if (umred) {
+		atomic_set(&mr->invalidated, 0);
+		free_cached_mr(dev, mr);
+	} else {
+		kfree(mr);
+	}
 	return 0;
 }
 
@@ -1122,6 +1173,29 @@ int mlx5_ib_destroy_mr(struct ib_mr *ibmr)
 	return err;
 }
 
+static void mlx5_invalidate_umem(void *invalidation_cookie,
+				 struct ib_umem *umem,
+				 unsigned long addr, size_t size)
+{
+	struct mlx5_ib_mr *mr;
+	struct mlx5_ib_peer_id *peer_id = (struct mlx5_ib_peer_id *)invalidation_cookie;
+
+	wait_for_completion(&peer_id->comp);
+	if (peer_id->mr == NULL)
+		return;
+
+	mr = peer_id->mr;
+	/* This function is called under the peer client's lock, so its resources are race protected */
+	if (atomic_inc_return(&mr->invalidated) > 1) {
+		umem->invalidation_ctx->inflight_invalidation = 1;
+		return;
+	}
+
+	umem->invalidation_ctx->peer_callback = 1;
+	mlx5_ib_invalidate_mr(&mr->ibmr);
+	complete(&mr->invalidation_comp);
+}
+
 struct ib_mr *mlx5_ib_alloc_fast_reg_mr(struct ib_pd *pd,
 					int max_page_list_len)
 {
-- 
1.7.1

* [PATCH for-next 9/9] Samples: Peer memory client example
       [not found] ` <1412176717-11979-1-git-send-email-yishaih-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
                     ` (7 preceding siblings ...)
  2014-10-01 15:18   ` [PATCH for-next 8/9] IB/mlx5: " Yishai Hadas
@ 2014-10-01 15:18   ` Yishai Hadas
       [not found]     ` <1412176717-11979-10-git-send-email-yishaih-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
  8 siblings, 1 reply; 22+ messages in thread
From: Yishai Hadas @ 2014-10-01 15:18 UTC (permalink / raw)
  To: roland-DgEjT+Ai2ygdnm+yROfE0A
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	raindel-VPRAkNaXOzVWk0Htik3J/w, yishaih-VPRAkNaXOzVWk0Htik3J/w

Adds an example of a peer memory client which implements the peer memory
API as defined under include/rdma/peer_mem.h.
It uses host memory functionality to implement the APIs and can
serve as a good reference for peer memory client writers.

Usage:
- It's built as a kernel module.
- The sample peer memory client takes ownership of a virtual memory area
  defined using module parameters.
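
For example (the address range below is hypothetical):

  modprobe example_peer_mem example_mem_start_range=0x200000000 \
           example_mem_end_range=0x200100000

Any user space virtual range that falls entirely inside
[example_mem_start_range, example_mem_end_range) is then claimed by the
sample client's acquire() callback instead of the default host memory
pinning path.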

Signed-off-by: Yishai Hadas <yishaih-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
Signed-off-by: Shachar Raindel <raindel-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
---
 samples/Kconfig                        |   10 ++
 samples/Makefile                       |    3 +-
 samples/peer_memory/Makefile           |    1 +
 samples/peer_memory/example_peer_mem.c |  260 ++++++++++++++++++++++++++++++++
 4 files changed, 273 insertions(+), 1 deletions(-)
 create mode 100644 samples/peer_memory/Makefile
 create mode 100644 samples/peer_memory/example_peer_mem.c

diff --git a/samples/Kconfig b/samples/Kconfig
index 6181c2c..b75b771 100644
--- a/samples/Kconfig
+++ b/samples/Kconfig
@@ -21,6 +21,16 @@ config SAMPLE_KOBJECT
 
 	  If in doubt, say "N" here.
 
+config SAMPLE_PEER_MEMORY_CLIENT
+	tristate "Build peer memory sample client -- loadable modules only"
+	depends on INFINIBAND_USER_MEM && m
+	help
+	  This config option will allow you to build a peer memory
+	  example module that can be a very good reference for
+	  peer memory client plugin writers.
+
+	  If in doubt, say "N" here.
+
 config SAMPLE_KPROBES
 	tristate "Build kprobes examples -- loadable modules only"
 	depends on KPROBES && m
diff --git a/samples/Makefile b/samples/Makefile
index 1a60c62..b42117a 100644
--- a/samples/Makefile
+++ b/samples/Makefile
@@ -1,4 +1,5 @@
 # Makefile for Linux samples code
 
 obj-$(CONFIG_SAMPLES)	+= kobject/ kprobes/ trace_events/ \
-			   hw_breakpoint/ kfifo/ kdb/ hidraw/ rpmsg/ seccomp/
+			   hw_breakpoint/ kfifo/ kdb/ hidraw/ rpmsg/ seccomp/ \
+			   peer_memory/
diff --git a/samples/peer_memory/Makefile b/samples/peer_memory/Makefile
new file mode 100644
index 0000000..f498125
--- /dev/null
+++ b/samples/peer_memory/Makefile
@@ -0,0 +1 @@
+obj-$(CONFIG_SAMPLE_PEER_MEMORY_CLIENT) += example_peer_mem.o
diff --git a/samples/peer_memory/example_peer_mem.c b/samples/peer_memory/example_peer_mem.c
new file mode 100644
index 0000000..4febfd1
--- /dev/null
+++ b/samples/peer_memory/example_peer_mem.c
@@ -0,0 +1,260 @@
+/*
+ * Copyright (c) 2014, Mellanox Technologies. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+#include <linux/mm.h>
+#include <linux/dma-mapping.h>
+#include <linux/module.h>
+#include <linux/init.h>
+#include <linux/slab.h>
+#include <linux/errno.h>
+#include <linux/export.h>
+#include <linux/sched.h>
+#include <rdma/peer_mem.h>
+
+#define DRV_NAME	"example_peer_mem"
+#define DRV_VERSION	"1.0"
+#define DRV_RELDATE	__DATE__
+
+MODULE_AUTHOR("Yishai Hadas");
+MODULE_DESCRIPTION("Example peer memory");
+MODULE_LICENSE("Dual BSD/GPL");
+MODULE_VERSION(DRV_VERSION);
+static unsigned long example_mem_start_range;
+static unsigned long example_mem_end_range;
+
+module_param(example_mem_start_range, ulong, 0444);
+MODULE_PARM_DESC(example_mem_start_range, "peer example start memory range");
+module_param(example_mem_end_range, ulong, 0444);
+MODULE_PARM_DESC(example_mem_end_range, "peer example end memory range");
+
+static void *reg_handle;
+
+struct example_mem_context {
+	void *core_context;
+	u64 page_virt_start;
+	u64 page_virt_end;
+	size_t mapped_size;
+	unsigned long npages;
+	int	      nmap;
+	unsigned long page_size;
+	int	      writable;
+	int dirty;
+};
+
+static void example_mem_put_pages(struct sg_table *sg_head, void *context);
+
+/* acquire return code: 1 - mine, 0 - not mine */
+static int example_mem_acquire(unsigned long addr, size_t size, void *peer_mem_private_data,
+			       char *peer_mem_name, void **client_context)
+{
+	struct example_mem_context *example_mem_context;
+
+	if (addr < example_mem_start_range ||
+	    addr + size >= example_mem_end_range)
+		/* peer is not the owner */
+		return 0;
+
+	example_mem_context = kzalloc(sizeof(*example_mem_context), GFP_KERNEL);
+	if (!example_mem_context)
+		/* Error case handled as not mine */
+		return 0;
+
+	example_mem_context->page_virt_start = addr & PAGE_MASK;
+	example_mem_context->page_virt_end   = (addr + size + PAGE_SIZE - 1) & PAGE_MASK;
+	example_mem_context->mapped_size  = example_mem_context->page_virt_end - example_mem_context->page_virt_start;
+
+	/* 1 means mine */
+	*client_context = example_mem_context;
+	__module_get(THIS_MODULE);
+	return 1;
+}
+
+static int example_mem_get_pages(unsigned long addr, size_t size, int write, int force,
+				 struct sg_table *sg_head, void *client_context, void *core_context)
+{
+	int ret;
+	unsigned long npages;
+	unsigned long cur_base;
+	struct page **page_list;
+	struct scatterlist *sg, *sg_list_start;
+	int i;
+	struct example_mem_context *example_mem_context;
+
+	example_mem_context = (struct example_mem_context *)client_context;
+	example_mem_context->core_context = core_context;
+	example_mem_context->page_size = PAGE_SIZE;
+	example_mem_context->writable = write;
+	npages = example_mem_context->mapped_size >> PAGE_SHIFT;
+
+	if (npages == 0)
+		return -EINVAL;
+
+	ret = sg_alloc_table(sg_head, npages, GFP_KERNEL);
+	if (ret)
+		return ret;
+
+	page_list = (struct page **)__get_free_page(GFP_KERNEL);
+	if (!page_list) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	sg_list_start = sg_head->sgl;
+	cur_base = addr & PAGE_MASK;
+
+	while (npages) {
+		ret = get_user_pages(current, current->mm, cur_base,
+				     min_t(unsigned long, npages, PAGE_SIZE / sizeof(struct page *)),
+				     write, force, page_list, NULL);
+
+		if (ret < 0)
+			goto out;
+
+		example_mem_context->npages += ret;
+		cur_base += ret * PAGE_SIZE;
+		npages   -= ret;
+
+		for_each_sg(sg_list_start, sg, ret, i)
+				sg_set_page(sg, page_list[i], PAGE_SIZE, 0);
+
+		/* preparing for next loop */
+		sg_list_start = sg;
+	}
+
+out:
+	if (page_list)
+		free_page((unsigned long)page_list);
+
+	if (ret < 0) {
+		example_mem_put_pages(sg_head, client_context);
+		return ret;
+	}
+	/* mark that pages were exposed from the peer memory */
+	example_mem_context->dirty = 1;
+	return 0;
+}
+
+static int example_mem_dma_map(struct sg_table *sg_head, void *context,
+			       struct device *dma_device, int dmasync,
+			       int *nmap)
+{
+	DEFINE_DMA_ATTRS(attrs);
+	struct example_mem_context *example_mem_context =
+		(struct example_mem_context *)context;
+
+	if (dmasync)
+		dma_set_attr(DMA_ATTR_WRITE_BARRIER, &attrs);
+	example_mem_context->nmap = dma_map_sg_attrs(dma_device, sg_head->sgl,
+						     example_mem_context->npages,
+						     DMA_BIDIRECTIONAL, &attrs);
+	if (example_mem_context->nmap <= 0)
+		return -ENOMEM;
+
+	*nmap = example_mem_context->nmap;
+	return 0;
+}
+
+static int example_mem_dma_unmap(struct sg_table *sg_head, void *context,
+				 struct device  *dma_device)
+{
+	struct example_mem_context *example_mem_context =
+		(struct example_mem_context *)context;
+
+	dma_unmap_sg(dma_device, sg_head->sgl,
+		     example_mem_context->nmap,
+		     DMA_BIDIRECTIONAL);
+	return 0;
+}
+
+static void example_mem_put_pages(struct sg_table *sg_head, void *context)
+{
+	struct scatterlist *sg;
+	struct page *page;
+	int i;
+
+	struct example_mem_context *example_mem_context =
+		(struct example_mem_context *)context;
+
+	for_each_sg(sg_head->sgl, sg, example_mem_context->npages, i) {
+		page = sg_page(sg);
+		if (example_mem_context->writable && example_mem_context->dirty)
+			set_page_dirty_lock(page);
+		put_page(page);
+	}
+
+	sg_free_table(sg_head);
+}
+
+static void example_mem_release(void *context)
+{
+	struct example_mem_context *example_mem_context =
+		(struct example_mem_context *)context;
+
+	kfree(example_mem_context);
+	module_put(THIS_MODULE);
+}
+
+static unsigned long example_mem_get_page_size(void *context)
+{
+	struct example_mem_context *example_mem_context =
+				(struct example_mem_context *)context;
+
+	return example_mem_context->page_size;
+}
+
+static const struct peer_memory_client example_mem_client = {
+	.name		= DRV_NAME,
+	.version	= DRV_VERSION,
+	.acquire	= example_mem_acquire,
+	.get_pages	= example_mem_get_pages,
+	.dma_map	= example_mem_dma_map,
+	.dma_unmap	= example_mem_dma_unmap,
+	.put_pages	= example_mem_put_pages,
+	.get_page_size	= example_mem_get_page_size,
+	.release	= example_mem_release,
+};
+
+static int __init example_mem_client_init(void)
+{
+	reg_handle = ib_register_peer_memory_client(&example_mem_client, NULL);
+	if (!reg_handle)
+		return -EINVAL;
+
+	return 0;
+}
+
+static void __exit example_mem_client_cleanup(void)
+{
+	ib_unregister_peer_memory_client(reg_handle);
+}
+
+module_init(example_mem_client_init);
+module_exit(example_mem_client_cleanup);
-- 
1.7.1

* Re: [PATCH for-next 5/9] IB/core: Invalidation support for peer memory
       [not found]     ` <1412176717-11979-6-git-send-email-yishaih-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
@ 2014-10-01 16:25       ` Yann Droneaud
       [not found]         ` <1412180704.4380.40.camel-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>
  0 siblings, 1 reply; 22+ messages in thread
From: Yann Droneaud @ 2014-10-01 16:25 UTC (permalink / raw)
  To: Yishai Hadas
  Cc: roland-DgEjT+Ai2ygdnm+yROfE0A, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	raindel-VPRAkNaXOzVWk0Htik3J/w

On Wednesday, 1 October 2014 at 18:18 +0300, Yishai Hadas wrote:
> Adds the required functionality to invalidate a given peer
> memory represented by some core context.
> 
> Each umem that was built over peer memory and supports invalidation
> has an invalidation context assigned to it, holding the data required
> to manage it. Once the peer calls the invalidation callback, the
> following actions are taken:
> 
> 1) Take the lock on the peer client, to synchronize with an in-flight
> dereg_mr on that memory.
> 2) Once the lock is taken, look up the ticket id to find the matching
> core context.
> 3) If found, call the umem invalidation function; otherwise, return.
> 
> Some notes:
> 1) As the peer invalidate callback is defined to be blocking, it must
> return only once the pages are not going to be accessed any more. For
> that reason ib_invalidate_peer_memory waits for a completion event in
> case another in-flight call is coming as part of dereg_mr (a sketch of
> this flow follows these notes).
> 
> 2) The peer memory API assumes that a lock might be taken by a peer
> client to protect its memory operations. Specifically, its invalidate
> callback might be called under that lock, which may lead to an AB/BA
> deadlock if the IB core called the get/put pages APIs with the IB core
> peer's lock taken. For that reason, as part of
> ib_umem_activate_invalidation_notifier the lock is taken and an
> in-flight invalidation state is checked for before activating the
> notifier.
> 
> 3) Once a peer client admits as part of its registration that it may
> require invalidation support, it can't be the owner of a memory range
> which doesn't support it.
> 
> Signed-off-by: Yishai Hadas <yishaih-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
> Signed-off-by: Shachar Raindel <raindel-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
> ---
>  drivers/infiniband/core/peer_mem.c |   86 +++++++++++++++++++++++++++++++++---
>  drivers/infiniband/core/umem.c     |   51 ++++++++++++++++++---
>  include/rdma/ib_peer_mem.h         |    4 +-
>  include/rdma/ib_umem.h             |   17 +++++++
>  4 files changed, 143 insertions(+), 15 deletions(-)
> 
> diff --git a/drivers/infiniband/core/peer_mem.c b/drivers/infiniband/core/peer_mem.c
> index ad10672..d6bd192 100644
> --- a/drivers/infiniband/core/peer_mem.c
> +++ b/drivers/infiniband/core/peer_mem.c
> @@ -38,10 +38,57 @@ static DEFINE_MUTEX(peer_memory_mutex);
>  static LIST_HEAD(peer_memory_list);
>  static int num_registered_peers;
>  
> -static int ib_invalidate_peer_memory(void *reg_handle, void *core_context)
> +/* Caller should be holding the peer client lock, ib_peer_client->lock */
> +static struct core_ticket *ib_peer_search_context(struct ib_peer_memory_client *ib_peer_client,
> +						  unsigned long key)
> +{
> +	struct core_ticket *core_ticket;
> +
> +	list_for_each_entry(core_ticket, &ib_peer_client->core_ticket_list,
> +			    ticket_list) {
> +		if (core_ticket->key == key)
> +			return core_ticket;
> +	}
>  
> +	return NULL;
> +}
> +

You now have two functions that look up a key in the ticket list:
see peer_ticket_exists().
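
A possible consolidation, as a sketch (assuming ib_peer_search_context()
is defined before its first user):

  static int peer_ticket_exists(struct ib_peer_memory_client *ib_peer_client,
                                unsigned long ticket)
  {
          return ib_peer_search_context(ib_peer_client, ticket) != NULL;
  }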

> +static int ib_invalidate_peer_memory(void *reg_handle, void *core_context)
>  {
> -	return -ENOSYS;
> +	struct ib_peer_memory_client *ib_peer_client =
> +		(struct ib_peer_memory_client *)reg_handle;
> +	struct invalidation_ctx *invalidation_ctx;
> +	struct core_ticket *core_ticket;
> +	int need_unlock = 1;
> +
> +	mutex_lock(&ib_peer_client->lock);
> +	core_ticket = ib_peer_search_context(ib_peer_client,
> +					     (unsigned long)core_context);
> +	if (!core_ticket)
> +		goto out;
> +
> +	invalidation_ctx = (struct invalidation_ctx *)core_ticket->context;
> +	/* If context is not ready yet, mark it to be invalidated */
> +	if (!invalidation_ctx->func) {
> +		invalidation_ctx->peer_invalidated = 1;
> +		goto out;
> +	}
> +	invalidation_ctx->func(invalidation_ctx->cookie,
> +					invalidation_ctx->umem, 0, 0);
> +	if (invalidation_ctx->inflight_invalidation) {
> +		/* init the completion to wait on before letting other thread to run */
> +		init_completion(&invalidation_ctx->comp);
> +		mutex_unlock(&ib_peer_client->lock);
> +		need_unlock = 0;
> +		wait_for_completion(&invalidation_ctx->comp);
> +	}
> +
> +	kfree(invalidation_ctx);
> +out:
> +	if (need_unlock)
> +		mutex_unlock(&ib_peer_client->lock);
> +
> +	return 0;
>  }
>  
>  static int peer_ticket_exists(struct ib_peer_memory_client *ib_peer_client,
> @@ -168,11 +215,30 @@ int ib_peer_create_invalidation_ctx(struct ib_peer_memory_client *ib_peer_mem, s
>  void ib_peer_destroy_invalidation_ctx(struct ib_peer_memory_client *ib_peer_mem,
>  				      struct invalidation_ctx *invalidation_ctx)
>  {
> -	mutex_lock(&ib_peer_mem->lock);
> -	ib_peer_remove_context(ib_peer_mem, invalidation_ctx->context_ticket);
> -	mutex_unlock(&ib_peer_mem->lock);
> +	int peer_callback;
> +	int inflight_invalidation;
>  
> -	kfree(invalidation_ctx);
> +	/* If we are under peer callback lock was already taken.*/
> +	if (!invalidation_ctx->peer_callback)
> +		mutex_lock(&ib_peer_mem->lock);
> +	ib_peer_remove_context(ib_peer_mem, invalidation_ctx->context_ticket);
> +	/* make sure to check inflight flag after took the lock and remove from tree.
> +	 * in addition, from that point using local variables for peer_callback and
> +	 * inflight_invalidation as after the complete invalidation_ctx can't be accessed
> +	 * any more as it may be freed by the callback.
> +	 */
> +	peer_callback = invalidation_ctx->peer_callback;
> +	inflight_invalidation = invalidation_ctx->inflight_invalidation;
> +	if (inflight_invalidation)
> +		complete(&invalidation_ctx->comp);
> +
> +	/* On peer callback lock is handled externally */
> +	if (!peer_callback)
> +		mutex_unlock(&ib_peer_mem->lock);
> +
> +	/* in case under callback context or callback is pending let it free the invalidation context */
> +	if (!peer_callback && !inflight_invalidation)
> +		kfree(invalidation_ctx);
>  }
>  static int ib_memory_peer_check_mandatory(const struct peer_memory_client
>  						     *peer_client)
> @@ -261,13 +327,19 @@ void ib_unregister_peer_memory_client(void *reg_handle)
>  EXPORT_SYMBOL(ib_unregister_peer_memory_client);
>  
>  struct ib_peer_memory_client *ib_get_peer_client(struct ib_ucontext *context, unsigned long addr,
> -						 size_t size, void **peer_client_context)
> +						 size_t size, unsigned long peer_mem_flags,
> +						 void **peer_client_context)
>  {
>  	struct ib_peer_memory_client *ib_peer_client;
>  	int ret;
>  
>  	mutex_lock(&peer_memory_mutex);
>  	list_for_each_entry(ib_peer_client, &peer_memory_list, core_peer_list) {
> +		/* In case peer requires invalidation it can't own memory which doesn't support it */
> +		if (ib_peer_client->invalidation_required &&
> +		    (!(peer_mem_flags & IB_PEER_MEM_INVAL_SUPP)))
> +			continue;
> +
>  		ret = ib_peer_client->peer_mem->acquire(addr, size,
>  						   context->peer_mem_private_data,
>  						   context->peer_mem_name,
> diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
> index 0de9916..51f32a1 100644
> --- a/drivers/infiniband/core/umem.c
> +++ b/drivers/infiniband/core/umem.c
> @@ -44,12 +44,19 @@
>  
>  static struct ib_umem *peer_umem_get(struct ib_peer_memory_client *ib_peer_mem,
>  				     struct ib_umem *umem, unsigned long addr,
> -				     int dmasync)
> +				     int dmasync, unsigned long peer_mem_flags)
>  {
>  	int ret;
>  	const struct peer_memory_client *peer_mem = ib_peer_mem->peer_mem;
> +	struct invalidation_ctx *invalidation_ctx = NULL;
>  
>  	umem->ib_peer_mem = ib_peer_mem;
> +	if (peer_mem_flags & IB_PEER_MEM_INVAL_SUPP) {
> +		ret = ib_peer_create_invalidation_ctx(ib_peer_mem, umem, &invalidation_ctx);
> +		if (ret)
> +			goto end;
> +	}
> +
>  	/*
>  	 * We always request write permissions to the pages, to force breaking of any CoW
>  	 * during the registration of the MR. For read-only MRs we use the "force" flag to
> @@ -60,7 +67,9 @@ static struct ib_umem *peer_umem_get(struct ib_peer_memory_client *ib_peer_mem,
>  				  1, !umem->writable,
>  				  &umem->sg_head,
>  				  umem->peer_mem_client_context,
> -				  NULL);
> +				  invalidation_ctx ?
> +				  (void *)invalidation_ctx->context_ticket : NULL);
> +

NULL may be a valid "ticket" once converted to unsigned long and looked
up in the ticket list.
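
One way to avoid the ambiguity, sketched under the assumption that the
ticket keeps being passed as a void * core_context: reserve ticket 0,
e.g. by initializing the counter in ib_register_peer_memory_client():

  /* ticket 0 is reserved, so a NULL core_context can never match */
  ib_peer_client->last_ticket = 1;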

>  	if (ret)
>  		goto out;
>  
> @@ -84,6 +93,9 @@ put_pages:
>  	peer_mem->put_pages(umem->peer_mem_client_context,
>  					&umem->sg_head);
>  out:
> +	if (invalidation_ctx)
> +		ib_peer_destroy_invalidation_ctx(ib_peer_mem, invalidation_ctx);
> +end:
>  	ib_put_peer_client(ib_peer_mem, umem->peer_mem_client_context);
>  	kfree(umem);
>  	return ERR_PTR(ret);
> @@ -91,15 +103,19 @@ out:
>  
>  static void peer_umem_release(struct ib_umem *umem)
>  {
> -	const struct peer_memory_client *peer_mem =
> -				umem->ib_peer_mem->peer_mem;
> +	struct ib_peer_memory_client *ib_peer_mem = umem->ib_peer_mem;
> +	const struct peer_memory_client *peer_mem = ib_peer_mem->peer_mem;
> +	struct invalidation_ctx *invalidation_ctx = umem->invalidation_ctx;
> +
> +	if (invalidation_ctx)
> +		ib_peer_destroy_invalidation_ctx(ib_peer_mem, invalidation_ctx);
>  
>  	peer_mem->dma_unmap(&umem->sg_head,
>  			    umem->peer_mem_client_context,
>  			    umem->context->device->dma_device);
>  	peer_mem->put_pages(&umem->sg_head,
>  			    umem->peer_mem_client_context);
> -	ib_put_peer_client(umem->ib_peer_mem, umem->peer_mem_client_context);
> +	ib_put_peer_client(ib_peer_mem, umem->peer_mem_client_context);
>  	kfree(umem);
>  }
>  
> @@ -127,6 +143,27 @@ static void __ib_umem_release(struct ib_device *dev, struct ib_umem *umem, int d
>  
>  }
>  
> +int ib_umem_activate_invalidation_notifier(struct ib_umem *umem,
> +					   umem_invalidate_func_t func,
> +					   void *cookie)
> +{
> +	struct invalidation_ctx *invalidation_ctx = umem->invalidation_ctx;
> +	int ret = 0;
> +
> +	mutex_lock(&umem->ib_peer_mem->lock);
> +	if (invalidation_ctx->peer_invalidated) {
> +		pr_err("ib_umem_activate_invalidation_notifier: pages were invalidated by peer\n");
> +		ret = -EINVAL;
> +		goto end;
> +	}
> +	invalidation_ctx->func = func;
> +	invalidation_ctx->cookie = cookie;
> +	/* from that point any pending invalidations can be called */
> +end:
> +	mutex_unlock(&umem->ib_peer_mem->lock);
> +	return ret;
> +}
> +EXPORT_SYMBOL(ib_umem_activate_invalidation_notifier);
>  /**
>   * ib_umem_get - Pin and DMA map userspace memory.
>   * @context: userspace context to pin memory for
> @@ -179,11 +216,11 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr,
>  	if (peer_mem_flags & IB_PEER_MEM_ALLOW) {
>  		struct ib_peer_memory_client *peer_mem_client;
>  
> -		peer_mem_client =  ib_get_peer_client(context, addr, size,
> +		peer_mem_client =  ib_get_peer_client(context, addr, size, peer_mem_flags,
>  						      &umem->peer_mem_client_context);
>  		if (peer_mem_client)
>  			return peer_umem_get(peer_mem_client, umem, addr,
> -					dmasync);
> +					dmasync, peer_mem_flags);
>  	}
>  
>  	/* We assume the memory is from hugetlb until proved otherwise */
> diff --git a/include/rdma/ib_peer_mem.h b/include/rdma/ib_peer_mem.h
> index d3fbb50..8f67aaf 100644
> --- a/include/rdma/ib_peer_mem.h
> +++ b/include/rdma/ib_peer_mem.h
> @@ -22,6 +22,7 @@ struct ib_peer_memory_client {
>  
>  enum ib_peer_mem_flags {
>  	IB_PEER_MEM_ALLOW	= 1,
> +	IB_PEER_MEM_INVAL_SUPP = (1<<1),
>  };
>  
>  struct core_ticket {
> @@ -31,7 +32,8 @@ struct core_ticket {
>  };
>  
>  struct ib_peer_memory_client *ib_get_peer_client(struct ib_ucontext *context, unsigned long addr,
> -						 size_t size, void **peer_client_context);
> +						 size_t size, unsigned long peer_mem_flags,
> +						 void **peer_client_context);
>  
>  void ib_put_peer_client(struct ib_peer_memory_client *ib_peer_client,
>  			void *peer_client_context);
> diff --git a/include/rdma/ib_umem.h b/include/rdma/ib_umem.h
> index 4b8a042..83d6059 100644
> --- a/include/rdma/ib_umem.h
> +++ b/include/rdma/ib_umem.h
> @@ -39,10 +39,21 @@
>  #include <rdma/ib_peer_mem.h>
>  
>  struct ib_ucontext;
> +struct ib_umem;
> +
> +typedef void (*umem_invalidate_func_t)(void *invalidation_cookie,
> +					    struct ib_umem *umem,
> +					    unsigned long addr, size_t size);
>  
>  struct invalidation_ctx {
>  	struct ib_umem *umem;
>  	unsigned long context_ticket;
> +	umem_invalidate_func_t func;
> +	void *cookie;
> +	int peer_callback;
> +	int inflight_invalidation;
> +	int peer_invalidated;
> +	struct completion comp;
>  };
>  
>  struct ib_umem {
> @@ -73,6 +84,9 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr,
>  			       unsigned long peer_mem_flags);
>  void ib_umem_release(struct ib_umem *umem);
>  int ib_umem_page_count(struct ib_umem *umem);
> +int  ib_umem_activate_invalidation_notifier(struct ib_umem *umem,
> +					    umem_invalidate_func_t func,
> +					    void *cookie);
>  
>  #else /* CONFIG_INFINIBAND_USER_MEM */
>  
> @@ -87,6 +101,9 @@ static inline struct ib_umem *ib_umem_get(struct ib_ucontext *context,
>  static inline void ib_umem_release(struct ib_umem *umem) { }
>  static inline int ib_umem_page_count(struct ib_umem *umem) { return 0; }
>  
> +static inline int ib_umem_activate_invalidation_notifier(struct ib_umem *umem,
> +							 umem_invalidate_func_t func,
> +							 void *cookie) {return 0; }
>  #endif /* CONFIG_INFINIBAND_USER_MEM */
>  
>  #endif /* IB_UMEM_H */



* Re: [PATCH for-next 1/9] IB/core: Introduce peer client interface
       [not found]     ` <1412176717-11979-2-git-send-email-yishaih-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
@ 2014-10-01 16:34       ` Bart Van Assche
       [not found]         ` <542C2D23.30508-HInyCGIudOg@public.gmane.org>
  0 siblings, 1 reply; 22+ messages in thread
From: Bart Van Assche @ 2014-10-01 16:34 UTC (permalink / raw)
  To: Yishai Hadas, roland-DgEjT+Ai2ygdnm+yROfE0A
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, raindel-VPRAkNaXOzVWk0Htik3J/w

On 10/01/14 17:18, Yishai Hadas wrote:
> +static int num_registered_peers;

Is the only purpose of this variable to check whether or not
peer_memory_list is empty? In that case please drop this variable and
use list_empty() instead.
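
For reference, a minimal sketch of that change (the surrounding return
value is illustrative only):

  if (list_empty(&peer_memory_list))
          return NULL;    /* no registered peer memory clients */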

> +static int ib_invalidate_peer_memory(void *reg_handle, void *core_context)
> +
> +{
> +	return -ENOSYS;
> +}

Please follow the Linux kernel coding style, which means no empty line
above the function body.

> +#define PEER_MEM_MANDATORY_FUNC(x) {\
> +	offsetof(struct peer_memory_client, x), #x }

Shouldn't the opening brace have been placed on the same line as the
offsetof() macro to improve readability?
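
I.e., something along these lines:

  #define PEER_MEM_MANDATORY_FUNC(x) \
          { offsetof(struct peer_memory_client, x), #x }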

> +	if (invalidate_callback) {
> +		*invalidate_callback = ib_invalidate_peer_memory;
> +		ib_peer_client->invalidation_required = 1;
> +	}
> +	mutex_lock(&peer_memory_mutex);
> +	list_add_tail(&ib_peer_client->core_peer_list, &peer_memory_list);
> +	num_registered_peers++;
> +	mutex_unlock(&peer_memory_mutex);
> +	return ib_peer_client;

Please insert an empty line before mutex_lock() and after mutex_unlock().

> +void ib_unregister_peer_memory_client(void *reg_handle)
> +{
> +	struct ib_peer_memory_client *ib_peer_client =
> +		(struct ib_peer_memory_client *)reg_handle;

No cast is needed when assigning a void pointer to a non-void pointer.
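
I.e., simply:

  struct ib_peer_memory_client *ib_peer_client = reg_handle;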

> +struct peer_memory_client {
> +	char	name[IB_PEER_MEMORY_NAME_MAX];
> +	char	version[IB_PEER_MEMORY_VER_MAX];
> +	/* The peer-direct controller (IB CORE) uses this callback to detect if a virtual address is under
> +	 * the responsibility of a specific peer direct client. If the answer is positive further calls
> +	 * for memory management will be directed to the callback of this peer driver.
> +	 * Any peer internal error should resulted in a zero answer, in case address range
> +	 * really belongs to the peer, no owner will be found and application will get an error
> +	 * from IB CORE as expected.
> +	 * Parameters:
> +		addr                  [IN]  - virtual address to be checked whether belongs to.
> +		size                  [IN]  - size of memory area starting at addr.
> +		peer_mem_private_data [IN]  - The contents of ib_ucontext-> peer_mem_private_data.
> +					      This parameter allows usage of the peer-direct
> +					      API in implementations where it is impossible
> +					      to detect if the memory belongs to the device
> +					      based upon the virtual address alone. In such
> +					      cases, the peer device can create a special
> +					      ib_ucontext, which will be associated with the
> +					      relevant peer memory.
> +		peer_mem_name         [IN]  - The contents of ib_ucontext-> peer_mem_name.
> +					      Used to identify the peer memory client that
> +					      initialized the ib_ucontext.
> +					      This parameter is normally used along with
> +					      peer_mem_private_data.
> +		client_context        [OUT] - peer opaque data which holds a peer context for
> +					      the acquired address range, will be provided
> +					      back to the peer memory in subsequent
> +					      calls for that given memory.
> +
> +	* Return value:
> +	*	1 - virtual address belongs to the peer device, otherwise 0
> +	*/
> +	int (*acquire)(unsigned long addr, size_t size, void *peer_mem_private_data,
> +		       char *peer_mem_name, void **client_context);
> +	/* The peer memory client is expected to pin the physical pages of the given address range
> +	 * and to fill sg_table with the information of the
> +	 * physical pages associated with the given address range. This function is
> +	 * equivalent to the kernel API of get_user_pages(), but targets peer memory.
> +	 * Parameters:
> +		addr           [IN] - start virtual address of that given allocation.
> +		size           [IN] - size of memory area starting at addr.
> +		write          [IN] - indicates whether the pages will be written to by the caller.
> +				      Same meaning as of kernel API get_user_pages, can be
> +				      ignored if not relevant.
> +		force          [IN] - indicates whether to force write access even if user
> +				      mapping is readonly. Same meaning as of kernel API
> +				      get_user_pages, can be ignored if not relevant.
> +		sg_head        [IN/OUT] - pointer to head of struct sg_table.
> +					  The peer client should allocate a table big
> +					  enough to store all of the required entries. This
> +					  function should fill the table with physical addresses
> +					  and sizes of the memory segments composing this
> +					  memory mapping.
> +					  The table allocation can be done using sg_alloc_table.
> +					  Filling in the physical memory addresses and size can
> +					  be done using sg_set_page.
> +		client_context [IN] - peer context for the given allocation, as received from
> +				      the acquire call.
> +		core_context   [IN] - opaque IB core context. If the peer client wishes to
> +				      invalidate any of the pages pinned through this API,
> +				      it must provide this context as an argument to the
> +				      invalidate callback.
> +
> +	* Return value:
> +	*	0 success, otherwise errno error code.
> +	*/
> +	int (*get_pages)(unsigned long addr,
> +			 size_t size, int write, int force,
> +			 struct sg_table *sg_head,
> +			 void *client_context, void *core_context);
> +	/* The peer-direct controller (IB CORE) calls this function to request from the
> +	 * peer driver to fill the sg_table with dma address mapping for the peer memory exposed.
> +	 * The parameters provided have the parameters for calling dma_map_sg.
> +	 * Parameters:
> +		sg_head        [IN/OUT] - pointer to head of struct sg_table. The peer memory
> +					  should fill the dma_address & dma_length for
> +					  each scatter gather entry in the table.
> +		client_context [IN] - peer context for the allocation mapped.
> +		dma_device     [IN] - the RDMA capable device which requires access to the
> +				      peer memory.
> +		dmasync        [IN] - flush in-flight DMA when the memory region is written.
> +				      Same meaning as with host memory mapping, can be ignored if not relevant.
> +		nmap           [OUT] - number of mapped/set entries.
> +
> +	* Return value:
> +	*		0 success, otherwise errno error code.
> +	*/
> +	int (*dma_map)(struct sg_table *sg_head, void *client_context,
> +		       struct device *dma_device, int dmasync, int *nmap);
> +	/* This callback is the opposite of the dma map API, it should take relevant actions
> +	 * to unmap the memory.
> +	* Parameters:
> +		sg_head        [IN/OUT] - pointer to head of struct sg_table. The peer memory
> +					  should fill the dma_address & dma_length for
> +					  each scatter gather entry in the table.
> +		client_context [IN] - peer context for the allocation mapped.
> +		dma_device     [IN] - the RDMA capable device which requires access to the
> +				      peer memory.
> +		dmasync        [IN] - flush in-flight DMA when the memory region is written.
> +				      Same meaning as with host memory mapping, can be ignored if not relevant.
> +		nmap           [OUT] - number of mapped/set entries.
> +
> +	* Return value:
> +	*	0 success, otherwise errno error code.
> +	*/
> +	int (*dma_unmap)(struct sg_table *sg_head, void *client_context,
> +			 struct device  *dma_device);
> +	/* This callback is the opposite of the get_pages API, it should remove the pinning
> +	 * from the pages, it's the peer-direct equivalent of the kernel API put_page.
> +	 * Parameters:
> +		sg_head        [IN] - pointer to head of struct sg_table.
> +		client_context [IN] - peer context for that given allocation.
> +	*/
> +	void (*put_pages)(struct sg_table *sg_head, void *client_context);
> +	/* This callback returns page size for the given allocation
> +	 * Parameters:
> +		sg_head        [IN] - pointer to head of struct sg_table.
> +		client_context [IN] - peer context for that given allocation.
> +	* Return value:
> +	*	Page size in bytes
> +	*/
> +	unsigned long (*get_page_size)(void *client_context);
> +	/* This callback is the opposite of the acquire call, let peer release all resources associated
> +	 * with the acquired context. The call will be performed only for contexts that have been
> +	 * successfully acquired (i.e. acquire returned a non-zero value).
> +	 * Parameters:
> +	 *	client_context [IN] - peer context for the given allocation.
> +	*/
> +	void (*release)(void *client_context);
> +
> +};

All these comments inside a struct make a struct definition hard to 
read. Please use kernel-doc style instead. See also 
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/Documentation/kernel-doc-nano-HOWTO.txt.
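
For illustration, the member documentation could move into a kernel-doc
comment above the struct, roughly like this (wording is only a sketch):

  /**
   * struct peer_memory_client - peer memory client callbacks
   * @name:          peer client name
   * @version:       peer client version
   * @acquire:       detect whether a virtual address range belongs to
   *                 this peer client; returns 1 if it does, 0 otherwise
   * @get_pages:     pin the pages of the given range and fill the
   *                 sg_table (peer equivalent of get_user_pages())
   * @dma_map:       fill dma_address/dma_length for each sg entry
   * @dma_unmap:     undo the dma_map operation
   * @put_pages:     undo the get_pages pinning
   * @get_page_size: return the page size of the allocation, in bytes
   * @release:       free all resources of an acquired context
   */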

Thanks,

Bart.


* RE: [PATCH for-next 9/9] Samples: Peer memory client example
       [not found]     ` <1412176717-11979-10-git-send-email-yishaih-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
@ 2014-10-01 17:16       ` Hefty, Sean
       [not found]         ` <1828884A29C6694DAF28B7E6B8A8237399DE5096-P5GAC/sN6hkd3b2yrw5b5LfspsVTdybXVpNB7YpNyf8@public.gmane.org>
  0 siblings, 1 reply; 22+ messages in thread
From: Hefty, Sean @ 2014-10-01 17:16 UTC (permalink / raw)
  To: Yishai Hadas, roland-DgEjT+Ai2ygdnm+yROfE0A
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, raindel-VPRAkNaXOzVWk0Htik3J/w

> Adds an example of a peer memory client which implements the peer memory
> API as defined under include/rdma/peer_mem.h.
> It uses the HOST memory functionality to implement the APIs and
> can be a good reference for peer memory client writers.

Is there a real user of these changes?

* Re: [PATCH for-next 4/9] IB/core: Infrastructure to manage peer core context
       [not found]     ` <1412176717-11979-5-git-send-email-yishaih-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
@ 2014-10-01 22:24       ` Yann Droneaud
       [not found]         ` <1412202261.28184.0.camel-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>
  0 siblings, 1 reply; 22+ messages in thread
From: Yann Droneaud @ 2014-10-01 22:24 UTC (permalink / raw)
  To: Yishai Hadas
  Cc: roland-DgEjT+Ai2ygdnm+yROfE0A, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	raindel-VPRAkNaXOzVWk0Htik3J/w


On Wednesday, October 1, 2014 at 18:18 +0300, Yishai Hadas wrote:
> Adds an infrastructure to manage the core context for a given umem;
> it's needed for the invalidation flow.
>
> The core context is supplied to peer clients as opaque data for the
> memory pages represented by a umem.
>
> If the peer client needs to invalidate memory it provided through the
> peer memory callbacks, it should call the invalidation callback,
> supplying the relevant core context. IB core will use this context to
> invalidate the relevant memory.
>
> To prevent cases where inflight invalidation calls run in parallel
> with the release of this memory (e.g. by dereg_mr), we must ensure
> the context is valid before accessing it; that's why we couldn't use
> the core context pointer directly. For that reason we added a lookup
> table mapping a ticket id to a core context.

You could have used the context pointer as the key, instead of creating
the "ticket" abstraction. The context pointer can be looked up in a data
structure which tracks the current contexts.

But I'm not sure I understand the purpose of the indirection:
if dereg_mr() can release the context in parallel,
isn't the context pointer stored in the "ticket" going to point to
something that is no longer valid?

>  The peer client gets/supplies the ticket
> id; the core checks whether it exists before accessing its
> corresponding context.
> 

Could you explain the expected lifetime of the ticket id, and whether
it will be exchanged with "remote" parties (remote node, hardware,
userspace, etc.)?

> Signed-off-by: Yishai Hadas <yishaih-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
> Signed-off-by: Shachar Raindel <raindel-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
> ---
>  drivers/infiniband/core/peer_mem.c |  132 ++++++++++++++++++++++++++++++++++++
>  include/rdma/ib_peer_mem.h         |   19 +++++
>  include/rdma/ib_umem.h             |    6 ++
>  3 files changed, 157 insertions(+), 0 deletions(-)
> 
> diff --git a/drivers/infiniband/core/peer_mem.c b/drivers/infiniband/core/peer_mem.c
> index 3936e13..ad10672 100644
> --- a/drivers/infiniband/core/peer_mem.c
> +++ b/drivers/infiniband/core/peer_mem.c
> @@ -44,6 +44,136 @@ static int ib_invalidate_peer_memory(void *reg_handle, void *core_context)
>  	return -ENOSYS;
>  }
>  
> +static int peer_ticket_exists(struct ib_peer_memory_client *ib_peer_client,
> +			      unsigned long ticket)
> +{
> +	struct core_ticket *core_ticket;
> +
> +	list_for_each_entry(core_ticket, &ib_peer_client->core_ticket_list,
> +			    ticket_list) {
> +		if (core_ticket->key == ticket)
> +			return 1;
> +	}
> +
> +	return 0;
> +}
> +
> +static int peer_get_free_ticket(struct ib_peer_memory_client *ib_peer_client,
> +				unsigned long *new_ticket)
> +{
> +	unsigned long candidate_ticket = ib_peer_client->last_ticket + 1;

Overflow?

> +	static int max_retries = 1000;

This shouldn't be static, and why 1000?

> +	int i;
> +
> +	for (i = 0; i < max_retries; i++) {
> +		if (peer_ticket_exists(ib_peer_client, candidate_ticket)) {
> +			candidate_ticket++;

Overflow?

> +			continue;
> +		}
> +		*new_ticket = candidate_ticket;
> +		return 0;
> +	}
> +

Counting the number of allocated tickets could prevent looping in the
(unlikely) case where every number is in use.

> +	return -EINVAL;
> +}
> +

> +static int ib_peer_insert_context(struct ib_peer_memory_client *ib_peer_client,
> +				  void *context,
> +				  unsigned long *context_ticket)
> +{
> +	struct core_ticket *core_ticket = kzalloc(sizeof(*core_ticket), GFP_KERNEL);
> +	int ret;
> +
> +	if (!core_ticket)
> +		return -ENOMEM;
> +
> +	mutex_lock(&ib_peer_client->lock);
> +	if (ib_peer_client->last_ticket < ib_peer_client->last_ticket + 1 &&
> +	    !ib_peer_client->ticket_wrapped) {
> +		core_ticket->key = ib_peer_client->last_ticket++;
> +	} else {
> +		/* special/rare case when wrap around occurred, not expected on 64 bit machines */

Some people still have 32-bit systems...

> +		unsigned long new_ticket;
> +
> +		ib_peer_client->ticket_wrapped = 1;

The whole mechanism for handling wrapping seems fragile, at best.
Wrapping could happen multiple times, by the way.

Additionally, it would make more sense to have ticket number handling in
peer_get_free_ticket().

> +		ret = peer_get_free_ticket(ib_peer_client, &new_ticket);
> +		if (ret) {
> +			mutex_unlock(&ib_peer_client->lock);
> +			kfree(core_ticket);
> +			return ret;
> +		}
> +		ib_peer_client->last_ticket = new_ticket;
> +		core_ticket->key = ib_peer_client->last_ticket;
> +	}
> +	core_ticket->context = context;
> +	list_add_tail(&core_ticket->ticket_list,
> +		      &ib_peer_client->core_ticket_list);
> +	*context_ticket = core_ticket->key;
> +	mutex_unlock(&ib_peer_client->lock);
> +	return 0;
> +}
> +

Perhaps idr could be used to track the "ticket" number?
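
For completeness, the idr-based variant hinted at here could look like
the sketch below; note that idr_alloc_cyclic() reduces, but does not
eliminate, id reuse:

  static DEFINE_IDR(peer_ctx_idr);

  /* inside ib_peer_insert_context(), under ib_peer_client->lock */
  int id = idr_alloc_cyclic(&peer_ctx_idr, context, 1, 0, GFP_KERNEL);
  if (id < 0)
          return id;
  *context_ticket = id;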

> +/* Caller should be holding the peer client lock, specifically, the caller should hold ib_peer_client->lock */
> +static int ib_peer_remove_context(struct ib_peer_memory_client *ib_peer_client,
> +				  unsigned long key)
> +{
> +	struct core_ticket *core_ticket;
> +
> +	list_for_each_entry(core_ticket, &ib_peer_client->core_ticket_list,
> +			    ticket_list) {
> +		if (core_ticket->key == key) {
> +			list_del(&core_ticket->ticket_list);
> +			kfree(core_ticket);
> +			return 0;
> +		}
> +	}
> +
> +	return 1;
> +}
> +
> +/**
> +** ib_peer_create_invalidation_ctx - creates invalidation context for a given umem
> +** @ib_peer_mem: peer client to be used
> +** @umem: umem struct belongs to that context
> +** @invalidation_ctx: output context
> +**/
> +int ib_peer_create_invalidation_ctx(struct ib_peer_memory_client *ib_peer_mem, struct ib_umem *umem,
> +				    struct invalidation_ctx **invalidation_ctx)
> +{
> +	int ret;
> +	struct invalidation_ctx *ctx;
> +
> +	ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);
> +	if (!ctx)
> +		return -ENOMEM;
> +
> +	ret = ib_peer_insert_context(ib_peer_mem, ctx,
> +				     &ctx->context_ticket);
> +	if (ret) {
> +		kfree(ctx);
> +		return ret;
> +	}
> +
> +	ctx->umem = umem;
> +	umem->invalidation_ctx = ctx;
> +	*invalidation_ctx = ctx;
> +	return 0;
> +}
> +
> +/**
> + * ** ib_peer_destroy_invalidation_ctx - destroy a given invalidation context
> + * ** @ib_peer_mem: peer client to be used
> + * ** @invalidation_ctx: context to be invalidated
> + * **/
> +void ib_peer_destroy_invalidation_ctx(struct ib_peer_memory_client *ib_peer_mem,
> +				      struct invalidation_ctx *invalidation_ctx)
> +{
> +	mutex_lock(&ib_peer_mem->lock);
> +	ib_peer_remove_context(ib_peer_mem, invalidation_ctx->context_ticket);
> +	mutex_unlock(&ib_peer_mem->lock);
> +
> +	kfree(invalidation_ctx);
> +}
>  static int ib_memory_peer_check_mandatory(const struct peer_memory_client
>  						     *peer_client)
>  {
> @@ -94,6 +224,8 @@ void *ib_register_peer_memory_client(const struct peer_memory_client *peer_clien
>  	if (!ib_peer_client)
>  		return NULL;
>  
> +	INIT_LIST_HEAD(&ib_peer_client->core_ticket_list);
> +	mutex_init(&ib_peer_client->lock);
>  	init_completion(&ib_peer_client->unload_comp);
>  	kref_init(&ib_peer_client->ref);
>  	ib_peer_client->peer_mem = peer_client;
> diff --git a/include/rdma/ib_peer_mem.h b/include/rdma/ib_peer_mem.h
> index 98056c5..d3fbb50 100644
> --- a/include/rdma/ib_peer_mem.h
> +++ b/include/rdma/ib_peer_mem.h
> @@ -4,6 +4,8 @@
>  #include <rdma/peer_mem.h>
>  
>  struct ib_ucontext;
> +struct ib_umem;
> +struct invalidation_ctx;
>  
>  struct ib_peer_memory_client {
>  	const struct peer_memory_client *peer_mem;
> @@ -11,16 +13,33 @@ struct ib_peer_memory_client {
>  	int invalidation_required;
>  	struct kref ref;
>  	struct completion unload_comp;
> +	/* lock is used via the invalidation flow */
> +	struct mutex lock;
> +	struct list_head   core_ticket_list;
> +	unsigned long       last_ticket;
> +	int ticket_wrapped;
>  };
>  
>  enum ib_peer_mem_flags {
>  	IB_PEER_MEM_ALLOW	= 1,
>  };
>  
> +struct core_ticket {
> +	unsigned long key;
> +	void *context;
> +	struct list_head   ticket_list;
> +};
> +
>  struct ib_peer_memory_client *ib_get_peer_client(struct ib_ucontext *context, unsigned long addr,
>  						 size_t size, void **peer_client_context);
>  
>  void ib_put_peer_client(struct ib_peer_memory_client *ib_peer_client,
>  			void *peer_client_context);
>  
> +int ib_peer_create_invalidation_ctx(struct ib_peer_memory_client *ib_peer_mem, struct ib_umem *umem,
> +				    struct invalidation_ctx **invalidation_ctx);
> +
> +void ib_peer_destroy_invalidation_ctx(struct ib_peer_memory_client *ib_peer_mem,
> +				      struct invalidation_ctx *invalidation_ctx);
> +
>  #endif
> diff --git a/include/rdma/ib_umem.h b/include/rdma/ib_umem.h
> index a22dde0..4b8a042 100644
> --- a/include/rdma/ib_umem.h
> +++ b/include/rdma/ib_umem.h
> @@ -40,6 +40,11 @@
>  
>  struct ib_ucontext;
>  
> +struct invalidation_ctx {
> +	struct ib_umem *umem;
> +	unsigned long context_ticket;
> +};
> +
>  struct ib_umem {
>  	struct ib_ucontext     *context;
>  	size_t			length;
> @@ -56,6 +61,7 @@ struct ib_umem {
>  	int             npages;
>  	/* peer memory that manages this umem */
>  	struct ib_peer_memory_client *ib_peer_mem;
> +	struct invalidation_ctx *invalidation_ctx;
>  	/* peer memory private context */
>  	void *peer_mem_client_context;
>  };




* Re: [PATCH for-next 9/9] Samples: Peer memory client example
       [not found]         ` <1828884A29C6694DAF28B7E6B8A8237399DE5096-P5GAC/sN6hkd3b2yrw5b5LfspsVTdybXVpNB7YpNyf8@public.gmane.org>
@ 2014-10-02  3:14           ` Jason Gunthorpe
       [not found]             ` <20141002031441.GA10386-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
  2014-10-07 16:57           ` Davis, Arlin R
  1 sibling, 1 reply; 22+ messages in thread
From: Jason Gunthorpe @ 2014-10-02  3:14 UTC (permalink / raw)
  To: Hefty, Sean
  Cc: Yishai Hadas, roland-DgEjT+Ai2ygdnm+yROfE0A,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	raindel-VPRAkNaXOzVWk0Htik3J/w

On Wed, Oct 01, 2014 at 05:16:12PM +0000, Hefty, Sean wrote:
> > Adds an example of a peer memory client which implements the peer memory
> > API as defined under include/rdma/peer_mem.h.
> > It uses the HOST memory functionality to implement the APIs and
> > can be a good reference for peer memory client writers.
> 
> Is there a real user of these changes?

Agreed.

Can you also discuss what is going on at the PCI-E level? How are the
peer-to-peer transactions addressed? Is this elaborate scheme just a
way to 'window' GPU memory or is the NIC sending special PCI-E packets
at the GPU?

I'm really confused about why this is all necessary; we can already map
PCI-E memory into user space, and there were much simpler patches
floating around to make that work several years ago.

Jason

* RE: [PATCH for-next 9/9] Samples: Peer memory client example
       [not found]             ` <20141002031441.GA10386-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
@ 2014-10-02 13:35               ` Shachar Raindel
  0 siblings, 0 replies; 22+ messages in thread
From: Shachar Raindel @ 2014-10-02 13:35 UTC (permalink / raw)
  To: Jason Gunthorpe, Hefty, Sean
  Cc: Yishai Hadas, roland-DgEjT+Ai2ygdnm+yROfE0A,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA

> On Wed, Oct 01, 2014 at 05:16:12PM +0000, Hefty, Sean wrote:
> > > Adds an example of a peer memory client which implements the peer
> memory
> > > API as defined under include/rdma/peer_mem.h.
> > > It uses the HOST memory functionality to implement the APIs and
> > > can be a good reference for peer memory client writers.
> >
> > Is there a real user of these changes?
> 
> Agreed..
> 
> Can you also discuss what is going on at the PCI-E level? How are the
> peer-to-peer transactions addressed? Is this elaborate scheme just a
> way to 'window' GPU memory or is the NIC sending special PCI-E packets
> at the GPU?
> 

The current implementation is using a 'window' on the GPU memory,
opened through one of the device's PCI-e BARs. Future iterations of the
technology might use a different kind of messaging. The proposed
interface enables such mechanisms, as both parties of the peer-to-peer
communication are explicitly informed of the peer identity for the
given region.

> I'm really confused why this is all necessary, we can already map
> PCI-E memory into user space, and there were much simpler patches
> floating around to make that work several years ago..
> 

We believe that a specialized interface for pinning/registering
peer-to-peer memory regions is needed here.

First of all, most hardware vendors don't provide a user space
mechanism for mapping device memory over PCI-E. Even the vendors that
do support such mapping require the use of different, proprietary
interfaces to do so. As such, a solution which relies on a user space
based mapping will be extremely clunky. The user space code will have
to keep intimate knowledge of how to map the memory for each and every
one of the supported hardware vendors. This adds an additional user
space/kernel dependency, making portability and usability harder.
The proposed solution provides a simple, "one stop shop" for all
memory registration needs. The application simply provides a pointer
to the reg_mr verb, and internally the kernel handles any mapping
and pinning needed. This interface is easier to use from the user
perspective, compared to the suggested alternative.
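
From the application's point of view this is an ordinary registration
call, e.g. (gpu_buf being a device pointer obtained from the vendor's
allocator; the access flags are only an example):

  struct ibv_mr *mr = ibv_reg_mr(pd, gpu_buf, len,
                                 IBV_ACCESS_LOCAL_WRITE |
                                 IBV_ACCESS_REMOTE_READ);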

Additionally, there are cases where the peer memory client, which
provides the memory, requires an immediate invalidation of the memory
mapping. For example, when the accelerator card is swapping tasks and
the previously allocated memory is being discarded or swapped out. The
current umem interface does not support this kind of
functionality. The suggested patchset defines an enriched interface,
where the RDMA low level driver is notified when the memory must be
invalidated. This interface is implemented in two low level drivers as
part of this patchset. A possible future peer memory client could
replace the functionality of umem for standard host memory (similar to
the example), and use the mmu_notifiers callbacks to invalidate memory
that is no longer accessible.
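
On the low level driver side, the flow described above boils down to a
sketch like the following (my_mr and my_hw_invalidate_mkey are
hypothetical driver internals; the ib_umem_get() argument list follows
this patchset):

  static void my_umem_invalidate(void *cookie, struct ib_umem *umem,
                                 unsigned long addr, size_t size)
  {
          struct my_mr *mr = cookie;

          /* fence HW access; the pages must not be touched once we return */
          my_hw_invalidate_mkey(mr);
  }

  ...

  umem = ib_umem_get(context, start, length, access, 0,
                     IB_PEER_MEM_ALLOW | IB_PEER_MEM_INVAL_SUPP);
  if (!IS_ERR(umem) && umem->ib_peer_mem)
          ib_umem_activate_invalidation_notifier(umem, my_umem_invalidate,
                                                 mr);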


Thanks,
--Shachar

* Re: [PATCH for-next 1/9] IB/core: Introduce peer client interface
       [not found]         ` <542C2D23.30508-HInyCGIudOg@public.gmane.org>
@ 2014-10-02 14:37           ` Yishai Hadas
  0 siblings, 0 replies; 22+ messages in thread
From: Yishai Hadas @ 2014-10-02 14:37 UTC (permalink / raw)
  To: Bart Van Assche, Yishai Hadas, roland-DgEjT+Ai2ygdnm+yROfE0A
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, raindel-VPRAkNaXOzVWk0Htik3J/w

On 10/1/2014 7:34 PM, Bart Van Assche wrote:
> On 10/01/14 17:18, Yishai Hadas wrote:
>> +static int num_registered_peers;
>
> Is the only purpose of this variable to check whether or not 
> peer_memory_list is empty ? In that case please drop this variable and 
> use list_empty() instead.
     Agree.
>
>> +static int ib_invalidate_peer_memory(void *reg_handle, void 
>> *core_context)
>> +
>> +{
>> +    return -ENOSYS;
>> +}
> Please follow the Linux kernel coding style which means no empty line 
> above the function body.
     OK
>
>> +#define PEER_MEM_MANDATORY_FUNC(x) {\
>> +    offsetof(struct peer_memory_client, x), #x }
>
> Shouldn't the opening brace have been placed on the same line as the 
> offsetof() macro to improve readability ?
>
     OK
>> +    if (invalidate_callback) {
>> +        *invalidate_callback = ib_invalidate_peer_memory;
>> +        ib_peer_client->invalidation_required = 1;
>> +    }
>> +    mutex_lock(&peer_memory_mutex);
>> +    list_add_tail(&ib_peer_client->core_peer_list, &peer_memory_list);
>> +    num_registered_peers++;
>> +    mutex_unlock(&peer_memory_mutex);
>> +    return ib_peer_client;
>
> Please insert an empty line before mutex_lock() and after mutex_unlock().
>
     OK
>> +void ib_unregister_peer_memory_client(void *reg_handle)
>> +{
>> +    struct ib_peer_memory_client *ib_peer_client =
>> +        (struct ib_peer_memory_client *)reg_handle;
>
> No cast is needed when assigning a void pointer to a non-void pointer.
>
     Agree.
>> [snip: the full struct peer_memory_client definition, quoted verbatim
>> from the patch above]
>
> All these comments inside a struct make a struct definition hard to 
> read. Please use kernel-doc style instead. See also 
> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/Documentation/kernel-doc-nano-HOWTO.txt.
>
     Thanks, will fix in next series to match the kernel-doc style.
> Thanks,
>
> Bart.

* RE: [PATCH for-next 4/9] IB/core: Infrastructure to manage peer core context
       [not found]         ` <1412202261.28184.0.camel-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>
@ 2014-10-02 15:02           ` Shachar Raindel
  0 siblings, 0 replies; 22+ messages in thread
From: Shachar Raindel @ 2014-10-02 15:02 UTC (permalink / raw)
  To: Yann Droneaud, Yishai Hadas
  Cc: roland-DgEjT+Ai2ygdnm+yROfE0A, linux-rdma-u79uwXL29TY76Z2rM5mHXA

> On Wednesday, October 1, 2014 at 18:18 +0300, Yishai Hadas wrote:
> > Adds an infrastructure to manage core context for a given umem,
> > it's needed for the invalidation flow.
> >
> > Core context is supplied to peer clients as some opaque data for a
> given
> > memory pages represented by a umem.
> >
> > If the peer client needs to invalidate memory it provided through the
> peer memory callbacks,
> > it should call the invalidation callback, supplying the relevant core
> context.
> > IB core will use this context to invalidate the relevant memory.
> >
> > To prevent cases when there are inflight invalidation calls in
> parallel
> > to releasing this memory (e.g. by dereg_mr) we must ensure that
> context
> > is valid before accessing it, that's why couldn't use the core context
> > pointer directly. For that reason we added a lookup table to map
> between
> > a ticket id to a core context.
> 
> You could have use the context pointer as the key, instead of creating
> the "ticket" abstraction. The context pointer can be looked up in a data
> structure which track the current contexts.
> 
> But I'm not sure to understand the purpose of the indirection:
> if dereg_mr() can release the context in parallel,
> is the context pointer stored in the "ticket" going to point to
> something no more valid ?
> 
> >  Peer client will get/supply the ticket
> > id, core will check whether exists before accessing its corresponding
> > context.
> >
> 
> Could you explain the expected lifetime of the ticket id and whether it
> will be exchanged with "remote" parties (remote node, hardware,
> userspace, etc.)
> 

The ticket id is provided to the peer memory client as part of the
get_pages API. The only "remote" party using it is the peer memory
client. It is used in the invalidation flow, to specify which memory
registration should be invalidated. This flow might be called
asynchronously, in parallel to an ongoing dereg_mr operation. As such,
the invalidation flow might be called after the memory registration
has been completely released. Relying on a pointer-based or IDR-based
ticket value can result in spurious invalidation of unrelated memory
regions. Internally, we carefully lock the data structures and
synchronize as needed when extracting the context from the
ticket. This ensures a proper, synchronized release of the memory
mapping. The ticket mechanism allows us to safely ignore inflight
invalidation calls that arrived too late.
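
This can be seen in ib_invalidate_peer_memory() quoted earlier in the
thread: a stale ticket simply fails the lookup, and the late
invalidation becomes a no-op:

  core_ticket = ib_peer_search_context(ib_peer_client,
                                       (unsigned long)core_context);
  if (!core_ticket)
          goto out;       /* registration already gone, nothing to do */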

We will change the ticket type in the API to be a 64-bit value. This
will allow us to avoid handling wrap-around issues, as doing 2**64
registrations would take a huge amount of time - assuming 1 microsecond
per registration, you get:

You have: 1 microseconds * (2**64)
You want: years
	* 584554.53

This will allow us to clean up all of the overflow/wraparound related
code.

> > Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
> > Signed-off-by: Shachar Raindel <raindel@mellanox.com>
> > ---
> >  drivers/infiniband/core/peer_mem.c |  132
> ++++++++++++++++++++++++++++++++++++
> >  include/rdma/ib_peer_mem.h         |   19 +++++
> >  include/rdma/ib_umem.h             |    6 ++
> >  3 files changed, 157 insertions(+), 0 deletions(-)
> >
> > diff --git a/drivers/infiniband/core/peer_mem.c
> b/drivers/infiniband/core/peer_mem.c
> > index 3936e13..ad10672 100644
> > --- a/drivers/infiniband/core/peer_mem.c
> > +++ b/drivers/infiniband/core/peer_mem.c
> > @@ -44,6 +44,136 @@ static int ib_invalidate_peer_memory(void
> *reg_handle, void *core_context)
> >  	return -ENOSYS;
> >  }
> >
> > +static int peer_ticket_exists(struct ib_peer_memory_client
> *ib_peer_client,
> > +			      unsigned long ticket)
> > +{
> > +	struct core_ticket *core_ticket;
> > +
> > +	list_for_each_entry(core_ticket, &ib_peer_client->core_ticket_list,
> > +			    ticket_list) {
> > +		if (core_ticket->key == ticket)
> > +			return 1;
> > +	}
> > +
> > +	return 0;
> > +}
> > +
> > +static int peer_get_free_ticket(struct ib_peer_memory_client
> *ib_peer_client,
> > +				unsigned long *new_ticket)
> > +{
> > +	unsigned long candidate_ticket = ib_peer_client->last_ticket + 1;
> 
> Overflow ?
> 

Will convert to 64-bit, to prevent overflow.

> > +	static int max_retries = 1000;
> 
> It's no static, and why 1000 ?
> 

Will be removed in next patch set, as overflow handling is not needed
for 64-bit values.

> > +	int i;
> > +
> > +	for (i = 0; i < max_retries; i++) {
> > +		if (peer_ticket_exists(ib_peer_client, candidate_ticket)) {
> > +			candidate_ticket++;
> 
> Overflow ?
> 

Will be removed in next patch set, as overflow handling is not needed
for 64-bit values.

> > +			continue;
> > +		}
> > +		*new_ticket = candidate_ticket;
> > +		return 0;
> > +	}
> > +
> 
> Counting the number of allocated ticket number could prevent looping in
> the case every numbers are used (unlikely).
> 

The entire loop will be removed in next patch set, as overflow
handling is not needed for 64-bit values.


> > +	return -EINVAL;
> > +}
> > +
> 
> > +static int ib_peer_insert_context(struct ib_peer_memory_client
> *ib_peer_client,
> > +				  void *context,
> > +				  unsigned long *context_ticket)
> > +{
> > +	struct core_ticket *core_ticket = kzalloc(sizeof(*core_ticket),
> GFP_KERNEL);
> > +	int ret;
> > +
> > +	if (!core_ticket)
> > +		return -ENOMEM;
> > +
> > +	mutex_lock(&ib_peer_client->lock);
> > +	if (ib_peer_client->last_ticket < ib_peer_client->last_ticket + 1
> &&
> > +	    !ib_peer_client->ticket_wrapped) {
> > +		core_ticket->key = ib_peer_client->last_ticket++;
> > +	} else {
> > +		/* special/rare case when wrap around occurred, not expected
> on 64 bit machines */
> 
> Some still have 32bits system ...
> 

All Linux systems support explicit 64-bit values. We will change the
ticket type to be an explicit 64-bit value (u64). This way, we will
also handle 32-bit machines.

> > +		unsigned long new_ticket;
> > +
> > +		ib_peer_client->ticket_wrapped = 1;
> 
> The whole mechanism to handle wrapping seems fragile, at best.
> Wrapping could happen multiple times by the way.
> 

Will be removed in next patch set, as overflow handling is not needed
for 64-bit values.

> Additionally, it would make more sense to have ticket number handling in
> peer_get_free_ticket().
> 

peer_get_free_ticket will be removed in next patch set, as overflow
handling is not needed for 64-bit values. Getting a free ticket will
become a simple "++" operation.
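
I.e., with a u64 ticket the allocation path shrinks to roughly the
following (with ib_peer_client->lock held by the caller, as today):

  core_ticket->key = ib_peer_client->last_ticket++;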

> > +		ret = peer_get_free_ticket(ib_peer_client, &new_ticket);
> > +		if (ret) {
> > +			mutex_unlock(&ib_peer_client->lock);
> > +			kfree(core_ticket);
> > +			return ret;
> > +		}
> > +		ib_peer_client->last_ticket = new_ticket;
> > +		core_ticket->key = ib_peer_client->last_ticket;
> > +	}
> > +	core_ticket->context = context;
> > +	list_add_tail(&core_ticket->ticket_list,
> > +		      &ib_peer_client->core_ticket_list);
> > +	*context_ticket = core_ticket->key;
> > +	mutex_unlock(&ib_peer_client->lock);
> > +	return 0;
> > +}
> > +
> 
> Perhaps idr could be used to track the "ticket" number ?
> 

Sadly, we can't use idr, as idr recycles ticket numbers. The whole
goal of the mechanism here is to allow us to ignore obsolete tickets
that reference memory regions which have already been destroyed. A
mechanism that recycles ticket numbers might result in spurious
invalidations.
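
To illustrate the point (sketch only; ib_peer_search_context is
introduced in patch 5/9 of this series): with never-reused tickets, a
late invalidation for an already-destroyed region simply fails the
lookup and is ignored, while a recycling allocator such as idr could
hand the same id to a new, unrelated region:

  mutex_lock(&ib_peer_client->lock);
  core_ticket = ib_peer_search_context(ib_peer_client, ticket);
  if (!core_ticket) {
          /* region already deregistered; the stale callback is a
           * harmless no-op because this ticket can never be reused
           */
          mutex_unlock(&ib_peer_client->lock);
          return 0;
  }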

> > +/* Caller should be holding the peer client lock, specifically, the
> caller should hold ib_peer_client->lock */
> > +static int ib_peer_remove_context(struct ib_peer_memory_client
> *ib_peer_client,
> > +				  unsigned long key)
> > +{
> > +	struct core_ticket *core_ticket;
> > +
> > +	list_for_each_entry(core_ticket, &ib_peer_client->core_ticket_list,
> > +			    ticket_list) {
> > +		if (core_ticket->key == key) {
> > +			list_del(&core_ticket->ticket_list);
> > +			kfree(core_ticket);
> > +			return 0;
> > +		}
> > +	}
> > +
> > +	return 1;
> > +}
> > +
> > +/**
> > +** ib_peer_create_invalidation_ctx - creates invalidation context for
> a given umem
> > +** @ib_peer_mem: peer client to be used
> > +** @umem: umem struct belongs to that context
> > +** @invalidation_ctx: output context
> > +**/
> > +int ib_peer_create_invalidation_ctx(struct ib_peer_memory_client
> *ib_peer_mem, struct ib_umem *umem,
> > +				    struct invalidation_ctx **invalidation_ctx)
> > +{
> > +	int ret;
> > +	struct invalidation_ctx *ctx;
> > +
> > +	ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);
> > +	if (!ctx)
> > +		return -ENOMEM;
> > +
> > +	ret = ib_peer_insert_context(ib_peer_mem, ctx,
> > +				     &ctx->context_ticket);
> > +	if (ret) {
> > +		kfree(ctx);
> > +		return ret;
> > +	}
> > +
> > +	ctx->umem = umem;
> > +	umem->invalidation_ctx = ctx;
> > +	*invalidation_ctx = ctx;
> > +	return 0;
> > +}
> > +
> > +/**
> > + * ** ib_peer_destroy_invalidation_ctx - destroy a given invalidation
> context
> > + * ** @ib_peer_mem: peer client to be used
> > + * ** @invalidation_ctx: context to be invalidated
> > + * **/
> > +void ib_peer_destroy_invalidation_ctx(struct ib_peer_memory_client
> *ib_peer_mem,
> > +				      struct invalidation_ctx *invalidation_ctx)
> > +{
> > +	mutex_lock(&ib_peer_mem->lock);
> > +	ib_peer_remove_context(ib_peer_mem, invalidation_ctx-
> >context_ticket);
> > +	mutex_unlock(&ib_peer_mem->lock);
> > +
> > +	kfree(invalidation_ctx);
> > +}
> >  static int ib_memory_peer_check_mandatory(const struct
> peer_memory_client
> >  						     *peer_client)
> >  {
> > @@ -94,6 +224,8 @@ void *ib_register_peer_memory_client(const struct
> peer_memory_client *peer_clien
> >  	if (!ib_peer_client)
> >  		return NULL;
> >
> > +	INIT_LIST_HEAD(&ib_peer_client->core_ticket_list);
> > +	mutex_init(&ib_peer_client->lock);
> >  	init_completion(&ib_peer_client->unload_comp);
> >  	kref_init(&ib_peer_client->ref);
> >  	ib_peer_client->peer_mem = peer_client;
> > diff --git a/include/rdma/ib_peer_mem.h b/include/rdma/ib_peer_mem.h
> > index 98056c5..d3fbb50 100644
> > --- a/include/rdma/ib_peer_mem.h
> > +++ b/include/rdma/ib_peer_mem.h
> > @@ -4,6 +4,8 @@
> >  #include <rdma/peer_mem.h>
> >
> >  struct ib_ucontext;
> > +struct ib_umem;
> > +struct invalidation_ctx;
> >
> >  struct ib_peer_memory_client {
> >  	const struct peer_memory_client *peer_mem;
> > @@ -11,16 +13,33 @@ struct ib_peer_memory_client {
> >  	int invalidation_required;
> >  	struct kref ref;
> >  	struct completion unload_comp;
> > +	/* lock is used via the invalidation flow */
> > +	struct mutex lock;
> > +	struct list_head   core_ticket_list;
> > +	unsigned long       last_ticket;
> > +	int ticket_wrapped;
> >  };
> >
> >  enum ib_peer_mem_flags {
> >  	IB_PEER_MEM_ALLOW	= 1,
> >  };
> >
> > +struct core_ticket {
> > +	unsigned long key;
> > +	void *context;
> > +	struct list_head   ticket_list;
> > +};
> > +
> >  struct ib_peer_memory_client *ib_get_peer_client(struct ib_ucontext
> *context, unsigned long addr,
> >  						 size_t size, void
> **peer_client_context);
> >
> >  void ib_put_peer_client(struct ib_peer_memory_client *ib_peer_client,
> >  			void *peer_client_context);
> >
> > +int ib_peer_create_invalidation_ctx(struct ib_peer_memory_client
> *ib_peer_mem, struct ib_umem *umem,
> > +				    struct invalidation_ctx **invalidation_ctx);
> > +
> > +void ib_peer_destroy_invalidation_ctx(struct ib_peer_memory_client
> *ib_peer_mem,
> > +				      struct invalidation_ctx *invalidation_ctx);
> > +
> >  #endif
> > diff --git a/include/rdma/ib_umem.h b/include/rdma/ib_umem.h
> > index a22dde0..4b8a042 100644
> > --- a/include/rdma/ib_umem.h
> > +++ b/include/rdma/ib_umem.h
> > @@ -40,6 +40,11 @@
> >
> >  struct ib_ucontext;
> >
> > +struct invalidation_ctx {
> > +	struct ib_umem *umem;
> > +	unsigned long context_ticket;
> > +};
> > +
> >  struct ib_umem {
> >  	struct ib_ucontext     *context;
> >  	size_t			length;
> > @@ -56,6 +61,7 @@ struct ib_umem {
> >  	int             npages;
> >  	/* peer memory that manages this umem */
> >  	struct ib_peer_memory_client *ib_peer_mem;
> > +	struct invalidation_ctx *invalidation_ctx;
> >  	/* peer memory private context */
> >  	void *peer_mem_client_context;
> >  };
> 
> 



* RE: [PATCH for-next 5/9] IB/core: Invalidation support for peer memory
       [not found]         ` <1412180704.4380.40.camel-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>
@ 2014-10-02 15:05           ` Shachar Raindel
  0 siblings, 0 replies; 22+ messages in thread
From: Shachar Raindel @ 2014-10-02 15:05 UTC (permalink / raw)
  To: Yann Droneaud, Yishai Hadas
  Cc: roland-DgEjT+Ai2ygdnm+yROfE0A, linux-rdma-u79uwXL29TY76Z2rM5mHXA



> -----Original Message-----
> From: Yann Droneaud [mailto:ydroneaud@opteya.com]
> Sent: Wednesday, October 01, 2014 7:25 PM
> To: Yishai Hadas
> Cc: roland@kernel.org; linux-rdma@vger.kernel.org; Shachar Raindel
> Subject: Re: [PATCH for-next 5/9] IB/core: Invalidation support for peer
> memory
> 
> On Wednesday, 01 October 2014 at 18:18 +0300, Yishai Hadas wrote:
> > Adds the required functionality to invalidate a given peer
> > memory represented by some core context.
> >
> > Each umem that was built over peer memory and supports invalidation
> > has an invalidation context assigned to it, holding the data needed
> > to manage it. Once the peer calls the invalidation callback, the
> > following actions are taken:
> >
> > 1) Take the lock on the peer client, to sync with any inflight
> > dereg_mr on that memory.
> > 2) Once the lock is taken, look up the ticket id to find the matching
> > core context.
> > 3) If a match is found, call the umem invalidation function;
> > otherwise, return.
> >
> > Some notes:
> > 1) As the peer invalidate callback is defined to be blocking, it must
> > return only once the pages are guaranteed not to be accessed any
> > more. For that reason, ib_invalidate_peer_memory waits for a
> > completion event when another inflight call is coming in as part of
> > dereg_mr.
> >
> > 2) The peer memory API assumes that a lock might be taken by a peer
> > client to protect its memory operations. Specifically, its invalidate
> > callback might be called under that lock, which could lead to an
> > AB/BA deadlock if the IB core called the get/put pages APIs with the
> > IB core peer's lock taken. For that reason,
> > ib_umem_activate_invalidation_notifier takes the lock and then checks
> > for an inflight invalidation state before activating the notifier.
> >
> > 3) Once a peer client admits, as part of its registration, that it
> > may require invalidation support, it can't be the owner of a memory
> > range which doesn't support it.
> >
> > Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
> > Signed-off-by: Shachar Raindel <raindel@mellanox.com>
> > ---
> >  drivers/infiniband/core/peer_mem.c |   86
> +++++++++++++++++++++++++++++++++---
> >  drivers/infiniband/core/umem.c     |   51 ++++++++++++++++++---
> >  include/rdma/ib_peer_mem.h         |    4 +-
> >  include/rdma/ib_umem.h             |   17 +++++++
> >  4 files changed, 143 insertions(+), 15 deletions(-)
> >
> > diff --git a/drivers/infiniband/core/peer_mem.c
> b/drivers/infiniband/core/peer_mem.c
> > index ad10672..d6bd192 100644
> > --- a/drivers/infiniband/core/peer_mem.c
> > +++ b/drivers/infiniband/core/peer_mem.c
> > @@ -38,10 +38,57 @@ static DEFINE_MUTEX(peer_memory_mutex);
> >  static LIST_HEAD(peer_memory_list);
> >  static int num_registered_peers;
> >
> > -static int ib_invalidate_peer_memory(void *reg_handle, void
> *core_context)
> > +/* Caller should be holding the peer client lock, ib_peer_client-
> >lock */
> > +static struct core_ticket *ib_peer_search_context(struct
> ib_peer_memory_client *ib_peer_client,
> > +						  unsigned long key)
> > +{
> > +	struct core_ticket *core_ticket;
> > +
> > +	list_for_each_entry(core_ticket, &ib_peer_client->core_ticket_list,
> > +			    ticket_list) {
> > +		if (core_ticket->key == key)
> > +			return core_ticket;
> > +	}
> >
> > +	return NULL;
> > +}
> > +
> 
> You now have two functions that look up a key in the ticket list:
> see peer_ticket_exists().
> 

As discussed in the previous e-mail, peer_ticket_exists will be deleted.

> > +static int ib_invalidate_peer_memory(void *reg_handle, void
> *core_context)
> >  {
> > -	return -ENOSYS;
> > +	struct ib_peer_memory_client *ib_peer_client =
> > +		(struct ib_peer_memory_client *)reg_handle;
> > +	struct invalidation_ctx *invalidation_ctx;
> > +	struct core_ticket *core_ticket;
> > +	int need_unlock = 1;
> > +
> > +	mutex_lock(&ib_peer_client->lock);
> > +	core_ticket = ib_peer_search_context(ib_peer_client,
> > +					     (unsigned long)core_context);
> > +	if (!core_ticket)
> > +		goto out;
> > +
> > +	invalidation_ctx = (struct invalidation_ctx *)core_ticket->context;
> > +	/* If context is not ready yet, mark it to be invalidated */
> > +	if (!invalidation_ctx->func) {
> > +		invalidation_ctx->peer_invalidated = 1;
> > +		goto out;
> > +	}
> > +	invalidation_ctx->func(invalidation_ctx->cookie,
> > +					invalidation_ctx->umem, 0, 0);
> > +	if (invalidation_ctx->inflight_invalidation) {
> > +		/* init the completion to wait on before letting other thread
> to run */
> > +		init_completion(&invalidation_ctx->comp);
> > +		mutex_unlock(&ib_peer_client->lock);
> > +		need_unlock = 0;
> > +		wait_for_completion(&invalidation_ctx->comp);
> > +	}
> > +
> > +	kfree(invalidation_ctx);
> > +out:
> > +	if (need_unlock)
> > +		mutex_unlock(&ib_peer_client->lock);
> > +
> > +	return 0;
> >  }
> >
> >  static int peer_ticket_exists(struct ib_peer_memory_client
> *ib_peer_client,
> > @@ -168,11 +215,30 @@ int ib_peer_create_invalidation_ctx(struct
> ib_peer_memory_client *ib_peer_mem, s
> >  void ib_peer_destroy_invalidation_ctx(struct ib_peer_memory_client
> *ib_peer_mem,
> >  				      struct invalidation_ctx *invalidation_ctx)
> >  {
> > -	mutex_lock(&ib_peer_mem->lock);
> > -	ib_peer_remove_context(ib_peer_mem, invalidation_ctx-
> >context_ticket);
> > -	mutex_unlock(&ib_peer_mem->lock);
> > +	int peer_callback;
> > +	int inflight_invalidation;
> >
> > -	kfree(invalidation_ctx);
> > +	/* If we are under peer callback lock was already taken.*/
> > +	if (!invalidation_ctx->peer_callback)
> > +		mutex_lock(&ib_peer_mem->lock);
> > +	ib_peer_remove_context(ib_peer_mem, invalidation_ctx-
> >context_ticket);
> > +	/* make sure to check inflight flag after took the lock and remove
> from tree.
> > +	 * in addition, from that point using local variables for
> peer_callback and
> > +	 * inflight_invalidation as after the complete invalidation_ctx
> can't be accessed
> > +	 * any more as it may be freed by the callback.
> > +	 */
> > +	peer_callback = invalidation_ctx->peer_callback;
> > +	inflight_invalidation = invalidation_ctx->inflight_invalidation;
> > +	if (inflight_invalidation)
> > +		complete(&invalidation_ctx->comp);
> > +
> > +	/* On peer callback lock is handled externally */
> > +	if (!peer_callback)
> > +		mutex_unlock(&ib_peer_mem->lock);
> > +
> > +	/* in case under callback context or callback is pending let it
> free the invalidation context */
> > +	if (!peer_callback && !inflight_invalidation)
> > +		kfree(invalidation_ctx);
> >  }
> >  static int ib_memory_peer_check_mandatory(const struct
> peer_memory_client
> >  						     *peer_client)
> > @@ -261,13 +327,19 @@ void ib_unregister_peer_memory_client(void
> *reg_handle)
> >  EXPORT_SYMBOL(ib_unregister_peer_memory_client);
> >
> >  struct ib_peer_memory_client *ib_get_peer_client(struct ib_ucontext
> *context, unsigned long addr,
> > -						 size_t size, void
> **peer_client_context)
> > +						 size_t size, unsigned long
> peer_mem_flags,
> > +						 void **peer_client_context)
> >  {
> >  	struct ib_peer_memory_client *ib_peer_client;
> >  	int ret;
> >
> >  	mutex_lock(&peer_memory_mutex);
> >  	list_for_each_entry(ib_peer_client, &peer_memory_list,
> core_peer_list) {
> > +		/* In case peer requires invalidation it can't own memory
> which doesn't support it */
> > +		if (ib_peer_client->invalidation_required &&
> > +		    (!(peer_mem_flags & IB_PEER_MEM_INVAL_SUPP)))
> > +			continue;
> > +
> >  		ret = ib_peer_client->peer_mem->acquire(addr, size,
> >  						   context->peer_mem_private_data,
> >  						   context->peer_mem_name,
> > diff --git a/drivers/infiniband/core/umem.c
> b/drivers/infiniband/core/umem.c
> > index 0de9916..51f32a1 100644
> > --- a/drivers/infiniband/core/umem.c
> > +++ b/drivers/infiniband/core/umem.c
> > @@ -44,12 +44,19 @@
> >
> >  static struct ib_umem *peer_umem_get(struct ib_peer_memory_client
> *ib_peer_mem,
> >  				     struct ib_umem *umem, unsigned long addr,
> > -				     int dmasync)
> > +				     int dmasync, unsigned long peer_mem_flags)
> >  {
> >  	int ret;
> >  	const struct peer_memory_client *peer_mem = ib_peer_mem->peer_mem;
> > +	struct invalidation_ctx *invalidation_ctx = NULL;
> >
> >  	umem->ib_peer_mem = ib_peer_mem;
> > +	if (peer_mem_flags & IB_PEER_MEM_INVAL_SUPP) {
> > +		ret = ib_peer_create_invalidation_ctx(ib_peer_mem, umem,
> &invalidation_ctx);
> > +		if (ret)
> > +			goto end;
> > +	}
> > +
> >  	/*
> >  	 * We always request write permissions to the pages, to force
> breaking of any CoW
> >  	 * during the registration of the MR. For read-only MRs we use the
> "force" flag to
> > @@ -60,7 +67,9 @@ static struct ib_umem *peer_umem_get(struct
> ib_peer_memory_client *ib_peer_mem,
> >  				  1, !umem->writable,
> >  				  &umem->sg_head,
> >  				  umem->peer_mem_client_context,
> > -				  NULL);
> > +				  invalidation_ctx ?
> > +				  (void *)invalidation_ctx->context_ticket : NULL);
> > +
> 
> NULL may be a valid "ticket" once converted to unsigned long and looked
> up in the ticket list.
> 

We will move to a u64 value, and make sure the first ticket value
is 1. A ticket value of 0 will then indicate that there is no
invalidation support for this memory.
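
So the core_context argument handed to get_pages() would become
(sketch, assuming the u64 conversion):

  /* 0 is reserved for "no invalidation context", so a valid ticket
   * can never be confused with the no-invalidation case
   */
  invalidation_ctx ? invalidation_ctx->context_ticket : 0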

> >  	if (ret)
> >  		goto out;
> >
> > @@ -84,6 +93,9 @@ put_pages:
> >  	peer_mem->put_pages(umem->peer_mem_client_context,
> >  					&umem->sg_head);
> >  out:
> > +	if (invalidation_ctx)
> > +		ib_peer_destroy_invalidation_ctx(ib_peer_mem,
> invalidation_ctx);
> > +end:
> >  	ib_put_peer_client(ib_peer_mem, umem->peer_mem_client_context);
> >  	kfree(umem);
> >  	return ERR_PTR(ret);
> > @@ -91,15 +103,19 @@ out:
> >
> >  static void peer_umem_release(struct ib_umem *umem)
> >  {
> > -	const struct peer_memory_client *peer_mem =
> > -				umem->ib_peer_mem->peer_mem;
> > +	struct ib_peer_memory_client *ib_peer_mem = umem->ib_peer_mem;
> > +	const struct peer_memory_client *peer_mem = ib_peer_mem->peer_mem;
> > +	struct invalidation_ctx *invalidation_ctx = umem->invalidation_ctx;
> > +
> > +	if (invalidation_ctx)
> > +		ib_peer_destroy_invalidation_ctx(ib_peer_mem,
> invalidation_ctx);
> >
> >  	peer_mem->dma_unmap(&umem->sg_head,
> >  			    umem->peer_mem_client_context,
> >  			    umem->context->device->dma_device);
> >  	peer_mem->put_pages(&umem->sg_head,
> >  			    umem->peer_mem_client_context);
> > -	ib_put_peer_client(umem->ib_peer_mem, umem-
> >peer_mem_client_context);
> > +	ib_put_peer_client(ib_peer_mem, umem->peer_mem_client_context);
> >  	kfree(umem);
> >  }
> >
> > @@ -127,6 +143,27 @@ static void __ib_umem_release(struct ib_device
> *dev, struct ib_umem *umem, int d
> >
> >  }
> >
> > +int ib_umem_activate_invalidation_notifier(struct ib_umem *umem,
> > +					   umem_invalidate_func_t func,
> > +					   void *cookie)
> > +{
> > +	struct invalidation_ctx *invalidation_ctx = umem->invalidation_ctx;
> > +	int ret = 0;
> > +
> > +	mutex_lock(&umem->ib_peer_mem->lock);
> > +	if (invalidation_ctx->peer_invalidated) {
> > +		pr_err("ib_umem_activate_invalidation_notifier: pages were
> invalidated by peer\n");
> > +		ret = -EINVAL;
> > +		goto end;
> > +	}
> > +	invalidation_ctx->func = func;
> > +	invalidation_ctx->cookie = cookie;
> > +	/* from that point any pending invalidations can be called */
> > +end:
> > +	mutex_unlock(&umem->ib_peer_mem->lock);
> > +	return ret;
> > +}
> > +EXPORT_SYMBOL(ib_umem_activate_invalidation_notifier);
> >  /**
> >   * ib_umem_get - Pin and DMA map userspace memory.
> >   * @context: userspace context to pin memory for
> > @@ -179,11 +216,11 @@ struct ib_umem *ib_umem_get(struct ib_ucontext
> *context, unsigned long addr,
> >  	if (peer_mem_flags & IB_PEER_MEM_ALLOW) {
> >  		struct ib_peer_memory_client *peer_mem_client;
> >
> > -		peer_mem_client =  ib_get_peer_client(context, addr, size,
> > +		peer_mem_client =  ib_get_peer_client(context, addr, size,
> peer_mem_flags,
> >  						      &umem->peer_mem_client_context);
> >  		if (peer_mem_client)
> >  			return peer_umem_get(peer_mem_client, umem, addr,
> > -					dmasync);
> > +					dmasync, peer_mem_flags);
> >  	}
> >
> >  	/* We assume the memory is from hugetlb until proved otherwise */
> > diff --git a/include/rdma/ib_peer_mem.h b/include/rdma/ib_peer_mem.h
> > index d3fbb50..8f67aaf 100644
> > --- a/include/rdma/ib_peer_mem.h
> > +++ b/include/rdma/ib_peer_mem.h
> > @@ -22,6 +22,7 @@ struct ib_peer_memory_client {
> >
> >  enum ib_peer_mem_flags {
> >  	IB_PEER_MEM_ALLOW	= 1,
> > +	IB_PEER_MEM_INVAL_SUPP = (1<<1),
> >  };
> >
> >  struct core_ticket {
> > @@ -31,7 +32,8 @@ struct core_ticket {
> >  };
> >
> >  struct ib_peer_memory_client *ib_get_peer_client(struct ib_ucontext
> *context, unsigned long addr,
> > -						 size_t size, void
> **peer_client_context);
> > +						 size_t size, unsigned long
> peer_mem_flags,
> > +						 void **peer_client_context);
> >
> >  void ib_put_peer_client(struct ib_peer_memory_client *ib_peer_client,
> >  			void *peer_client_context);
> > diff --git a/include/rdma/ib_umem.h b/include/rdma/ib_umem.h
> > index 4b8a042..83d6059 100644
> > --- a/include/rdma/ib_umem.h
> > +++ b/include/rdma/ib_umem.h
> > @@ -39,10 +39,21 @@
> >  #include <rdma/ib_peer_mem.h>
> >
> >  struct ib_ucontext;
> > +struct ib_umem;
> > +
> > +typedef void (*umem_invalidate_func_t)(void *invalidation_cookie,
> > +					    struct ib_umem *umem,
> > +					    unsigned long addr, size_t size);
> >
> >  struct invalidation_ctx {
> >  	struct ib_umem *umem;
> >  	unsigned long context_ticket;
> > +	umem_invalidate_func_t func;
> > +	void *cookie;
> > +	int peer_callback;
> > +	int inflight_invalidation;
> > +	int peer_invalidated;
> > +	struct completion comp;
> >  };
> >
> >  struct ib_umem {
> > @@ -73,6 +84,9 @@ struct ib_umem *ib_umem_get(struct ib_ucontext
> *context, unsigned long addr,
> >  			       unsigned long peer_mem_flags);
> >  void ib_umem_release(struct ib_umem *umem);
> >  int ib_umem_page_count(struct ib_umem *umem);
> > +int  ib_umem_activate_invalidation_notifier(struct ib_umem *umem,
> > +					    umem_invalidate_func_t func,
> > +					    void *cookie);
> >
> >  #else /* CONFIG_INFINIBAND_USER_MEM */
> >
> > @@ -87,6 +101,9 @@ static inline struct ib_umem *ib_umem_get(struct
> ib_ucontext *context,
> >  static inline void ib_umem_release(struct ib_umem *umem) { }
> >  static inline int ib_umem_page_count(struct ib_umem *umem) { return
> 0; }
> >
> > +static inline int ib_umem_activate_invalidation_notifier(struct
> ib_umem *umem,
> > +							 umem_invalidate_func_t func,
> > +							 void *cookie) {return 0; }
> >  #endif /* CONFIG_INFINIBAND_USER_MEM */
> >
> >  #endif /* IB_UMEM_H */
> 
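
For reference, a low level IB hardware driver that supports
invalidation would consume this export roughly as follows (a hedged
sketch; the mlx4/mlx5 patches in this series are the real users, and
my_mr_invalidate and the mr cookie are illustrative names only):

  static void my_mr_invalidate(void *cookie, struct ib_umem *umem,
                               unsigned long addr, size_t size)
  {
          /* must guarantee the HW no longer accesses the umem's pages
           * before returning, e.g. by tearing down the MR mappings
           */
  }

  ...
  umem = ib_umem_get(context, start, length, access_flags, 0,
                     IB_PEER_MEM_ALLOW | IB_PEER_MEM_INVAL_SUPP);
  ...
  if (umem->ib_peer_mem && umem->invalidation_ctx)
          ret = ib_umem_activate_invalidation_notifier(umem,
                                                       my_mr_invalidate,
                                                       mr);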



* Re: [PATCH for-next 6/9] IB/core: Sysfs support for peer memory
       [not found]     ` <1412176717-11979-7-git-send-email-yishaih-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
@ 2014-10-02 16:41       ` Jason Gunthorpe
  0 siblings, 0 replies; 22+ messages in thread
From: Jason Gunthorpe @ 2014-10-02 16:41 UTC (permalink / raw)
  To: Yishai Hadas
  Cc: roland-DgEjT+Ai2ygdnm+yROfE0A, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	raindel-VPRAkNaXOzVWk0Htik3J/w

On Wed, Oct 01, 2014 at 06:18:34PM +0300, Yishai Hadas wrote:
> Supplies the required functionality to expose information and
> statistics over sysfs for a given peer memory client.
> 
> This mechanism enables a userspace application to check
> which peers are available (based on name & version) and, based on
> that, decide whether it can run successfully.

Seems like this belongs in debugfs..

At a minimum you must provide documentation for sysfs files.

Jason
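
For what it's worth, sysfs interfaces are documented under
Documentation/ABI/ in the kernel tree; an entry for this interface
might look roughly like the sketch below (the path comes from the
cover letter, while the attribute name and values are illustrative
only):

  What:         /sys/kernel/mm/memory_peers/<peer_name>/version
  Date:         October 2014
  Contact:      linux-rdma@vger.kernel.org
  Description:
                Version string reported by the named peer memory
                client at registration time.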


* RE: [PATCH for-next 9/9] Samples: Peer memory client example
       [not found]         ` <1828884A29C6694DAF28B7E6B8A8237399DE5096-P5GAC/sN6hkd3b2yrw5b5LfspsVTdybXVpNB7YpNyf8@public.gmane.org>
  2014-10-02  3:14           ` Jason Gunthorpe
@ 2014-10-07 16:57           ` Davis, Arlin R
       [not found]             ` <54347E5A035A054EAE9D05927FB467F977CC9244-P5GAC/sN6hmkrb+BlOpmy7fspsVTdybXVpNB7YpNyf8@public.gmane.org>
  1 sibling, 1 reply; 22+ messages in thread
From: Davis, Arlin R @ 2014-10-07 16:57 UTC (permalink / raw)
  To: Hefty, Sean, Yishai Hadas, roland-DgEjT+Ai2ygdnm+yROfE0A
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	raindel-VPRAkNaXOzVWk0Htik3J/w, Coffman, Jerrie L

> -----Original Message-----
> From: linux-rdma-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org [mailto:linux-rdma-
> owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org] On Behalf Of Hefty, Sean
> Sent: Wednesday, October 01, 2014 10:16 AM
> To: Yishai Hadas; roland-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org
> Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org; raindel-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org
> Subject: RE: [PATCH for-next 9/9] Samples: Peer memory client example
> 
> > Adds an example of a peer memory client which implements the peer
> > memory API as defined under include/rdma/peer_mem.h.
> > It uses the HOST memory functionality to implement the APIs and can be
> > a good reference for peer memory client writers.
> 
> Is there a real user of these changes?

CCL (co-processor communication link) Direct for Intel Xeon Phi, included in
OFED 3.12-1 and OFED-3.5-2-MIC, uses the peer-direct interface.   





* RE: [PATCH for-next 9/9] Samples: Peer memory client example
       [not found]             ` <54347E5A035A054EAE9D05927FB467F977CC9244-P5GAC/sN6hmkrb+BlOpmy7fspsVTdybXVpNB7YpNyf8@public.gmane.org>
@ 2014-10-07 17:09               ` Hefty, Sean
  0 siblings, 0 replies; 22+ messages in thread
From: Hefty, Sean @ 2014-10-07 17:09 UTC (permalink / raw)
  To: Davis, Arlin R, Yishai Hadas, roland-DgEjT+Ai2ygdnm+yROfE0A
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	raindel-VPRAkNaXOzVWk0Htik3J/w, Coffman, Jerrie L

> > > Adds an example of a peer memory client which implements the peer
> > > memory API as defined under include/rdma/peer_mem.h.
> > > It uses the HOST memory functionality to implement the APIs and can be
> > > a good reference for peer memory client writers.
> >
> > Is there a real user of these changes?
> 
> CCL (co-processor communication link) Direct for Intel Xeon Phi, included
> in
> OFED 3.12-1 and OFED-3.5-2-MIC, uses the peer-direct interface.

And where are the upstream patches for this?


end of thread

Thread overview: 22+ messages
2014-10-01 15:18 [PATCH for-next 0/9] Peer-Direct support Yishai Hadas
     [not found] ` <1412176717-11979-1-git-send-email-yishaih-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
2014-10-01 15:18   ` [PATCH for-next 1/9] IB/core: Introduce peer client interface Yishai Hadas
     [not found]     ` <1412176717-11979-2-git-send-email-yishaih-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
2014-10-01 16:34       ` Bart Van Assche
     [not found]         ` <542C2D23.30508-HInyCGIudOg@public.gmane.org>
2014-10-02 14:37           ` Yishai Hadas
2014-10-01 15:18   ` [PATCH for-next 2/9] IB/core: Get/put peer memory client Yishai Hadas
2014-10-01 15:18   ` [PATCH for-next 3/9] IB/core: Umem tunneling peer memory APIs Yishai Hadas
2014-10-01 15:18   ` [PATCH for-next 4/9] IB/core: Infrastructure to manage peer core context Yishai Hadas
     [not found]     ` <1412176717-11979-5-git-send-email-yishaih-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
2014-10-01 22:24       ` Yann Droneaud
     [not found]         ` <1412202261.28184.0.camel-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>
2014-10-02 15:02           ` Shachar Raindel
2014-10-01 15:18   ` [PATCH for-next 5/9] IB/core: Invalidation support for peer memory Yishai Hadas
     [not found]     ` <1412176717-11979-6-git-send-email-yishaih-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
2014-10-01 16:25       ` Yann Droneaud
     [not found]         ` <1412180704.4380.40.camel-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>
2014-10-02 15:05           ` Shachar Raindel
2014-10-01 15:18   ` [PATCH for-next 6/9] IB/core: Sysfs " Yishai Hadas
     [not found]     ` <1412176717-11979-7-git-send-email-yishaih-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
2014-10-02 16:41       ` Jason Gunthorpe
2014-10-01 15:18   ` [PATCH for-next 7/9] IB/mlx4: Invalidation support for MR over " Yishai Hadas
2014-10-01 15:18   ` [PATCH for-next 8/9] IB/mlx5: " Yishai Hadas
2014-10-01 15:18   ` [PATCH for-next 9/9] Samples: Peer memory client example Yishai Hadas
     [not found]     ` <1412176717-11979-10-git-send-email-yishaih-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
2014-10-01 17:16       ` Hefty, Sean
     [not found]         ` <1828884A29C6694DAF28B7E6B8A8237399DE5096-P5GAC/sN6hkd3b2yrw5b5LfspsVTdybXVpNB7YpNyf8@public.gmane.org>
2014-10-02  3:14           ` Jason Gunthorpe
     [not found]             ` <20141002031441.GA10386-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2014-10-02 13:35               ` Shachar Raindel
2014-10-07 16:57           ` Davis, Arlin R
     [not found]             ` <54347E5A035A054EAE9D05927FB467F977CC9244-P5GAC/sN6hmkrb+BlOpmy7fspsVTdybXVpNB7YpNyf8@public.gmane.org>
2014-10-07 17:09               ` Hefty, Sean
