* [PATCH 00/16] IB/hfi1: Add a page pinning cache for PSM sdma
From: Dennis Dalessandro @ 2016-03-08 19:14 UTC (permalink / raw)
  To: dledford-H+wXaHxf7aLQT0dZR+AlfA; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA

This patch series improves the performance of user SDMA transfers from
PSM applications by caching user buffer pages after pinning them.
Subsequent uses of the same user buffer then avoid the cost of pinning,
and later unpinning, the same pages. Buffer pages are unpinned when the
context is torn down or when the driver determines that the buffer
should be evicted from the cache. Cache evictions happen when a request
arrives for a new, uncached buffer while the current size of the cache
is at its pre-defined limit.
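
In outline, the fast path being built toward looks roughly like the
sketch below. This is a minimal illustration only; names such as
sdma_cache_lookup() and sdma_cache_evict() are invented for this
description and are not the functions added by the series:

        /* Illustrative-only sketch of the SDMA pin cache fast path */
        node = sdma_cache_lookup(pq, buf_addr, buf_len);
        if (!node) {
                /* Evict only when adding an uncached buffer would push
                 * the cache past its pre-defined limit.
                 */
                if (cache_size(pq) >= cache_limit)
                        sdma_cache_evict(pq);   /* unpins cold pages */
                node = sdma_cache_pin_insert(pq, buf_addr, buf_len);
        }
        /* node->pages is reused here; no new get_user_pages() call */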

This series applies on top of the "Adaptive PIO bug fixes..." series
submitted yesterday and should apply cleanly to Doug's GitHub hfi1
branch.

Can also be seen in my GitHub at:
https://github.com/ddalessa/kernel/tree/for-4.6.

---

Mitko Haralanov (16):
      IB/hfi1: Re-factor MMU notification code
      IB/hfi1: Allow MMU function execution in IRQ context
      IB/hfi1: Prevent NULL pointer dereference
      IB/hfi1: Allow remove MMU callbacks to free nodes
      IB/hfi1: Remove the use of add/remove RB function pointers
      IB/hfi1: Notify remove MMU/RB callback of calling context
      IB/hfi1: Use interval RB trees
      IB/hfi1: Add MMU tracing
      IB/hfi1: Remove compare callback
      IB/hfi1: Add filter callback
      IB/hfi1: Adjust last address values for intervals
      IB/hfi1: Implement SDMA-side buffer caching
      IB/hfi1: Add pin query function
      IB/hfi1: Specify mm when releasing pages
      IB/hfi1: Switch to using the pin query function
      IB/hfi1: Add SDMA cache eviction algorithm


 drivers/infiniband/hw/hfi1/Makefile       |    2 
 drivers/infiniband/hw/hfi1/file_ops.c     |    1 
 drivers/infiniband/hw/hfi1/hfi.h          |   16 +
 drivers/infiniband/hw/hfi1/mmu_rb.c       |  292 +++++++++++++++++++++++
 drivers/infiniband/hw/hfi1/mmu_rb.h       |   73 ++++++
 drivers/infiniband/hw/hfi1/trace.c        |    1 
 drivers/infiniband/hw/hfi1/trace.h        |    1 
 drivers/infiniband/hw/hfi1/user_exp_rcv.c |  362 ++++++++---------------------
 drivers/infiniband/hw/hfi1/user_pages.c   |   63 ++++-
 drivers/infiniband/hw/hfi1/user_sdma.c    |  319 +++++++++++++++++---------
 drivers/infiniband/hw/hfi1/user_sdma.h    |    4 
 11 files changed, 747 insertions(+), 387 deletions(-)
 create mode 100644 drivers/infiniband/hw/hfi1/mmu_rb.c
 create mode 100644 drivers/infiniband/hw/hfi1/mmu_rb.h

-- 
-Denny

* [PATCH 01/16] IB/hfi1: Re-factor MMU notification code
From: Dennis Dalessandro @ 2016-03-08 19:14 UTC (permalink / raw)
  To: dledford-H+wXaHxf7aLQT0dZR+AlfA
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Mitko Haralanov, Dean Luick,
	Jubin John

From: Mitko Haralanov <mitko.haralanov-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>

The MMU notification code added to the expected receive side has been
re-factored and split into its own file. This was done in order to
make the code more general and, therefore, usable by other parts of
the driver.

The caching behavior remains the same. However, the handling of the
RB tree (insertions, deletions, and searches) as well as the MMU
invalidation processing is now handled by functions in the
mmu_rb.[ch] files.
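
For reference, a user of the new interface looks roughly like the
sketch below. The my_*() callbacks are illustrative placeholders (the
concrete ones for expected receive remain in user_exp_rcv.c); per
hfi1_mmu_rb_register(), compare and invalidate are mandatory while
insert and remove are optional:

        static struct mmu_rb_ops my_ops = {
                .compare    = my_compare,       /* required */
                .invalidate = my_invalidate,    /* required */
                .insert     = my_insert,        /* optional */
                .remove     = my_remove,        /* optional */
        };

        fd->tid_rb_root = RB_ROOT;
        ret = hfi1_mmu_rb_register(&fd->tid_rb_root, &my_ops);
        ...
        hfi1_mmu_rb_unregister(&fd->tid_rb_root);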

Reviewed-by: Dennis Dalessandro <dennis.dalessandro-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Reviewed-by: Dean Luick <dean.luick-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Signed-off-by: Mitko Haralanov <mitko.haralanov-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Signed-off-by: Jubin John <jubin.john-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
---
 drivers/infiniband/hw/hfi1/Makefile       |    2 
 drivers/infiniband/hw/hfi1/file_ops.c     |    1 
 drivers/infiniband/hw/hfi1/hfi.h          |   14 +
 drivers/infiniband/hw/hfi1/mmu_rb.c       |  304 ++++++++++++++++++++++++++
 drivers/infiniband/hw/hfi1/mmu_rb.h       |   73 ++++++
 drivers/infiniband/hw/hfi1/user_exp_rcv.c |  336 +++++++----------------------
 6 files changed, 471 insertions(+), 259 deletions(-)
 create mode 100644 drivers/infiniband/hw/hfi1/mmu_rb.c
 create mode 100644 drivers/infiniband/hw/hfi1/mmu_rb.h

diff --git a/drivers/infiniband/hw/hfi1/Makefile b/drivers/infiniband/hw/hfi1/Makefile
index 9b11706..8dc5938 100644
--- a/drivers/infiniband/hw/hfi1/Makefile
+++ b/drivers/infiniband/hw/hfi1/Makefile
@@ -9,7 +9,7 @@ obj-$(CONFIG_INFINIBAND_HFI1) += hfi1.o
 
 hfi1-y := affinity.o chip.o device.o diag.o driver.o efivar.o \
 	eprom.o file_ops.o firmware.o \
-	init.o intr.o mad.o pcie.o pio.o pio_copy.o platform.o \
+	init.o intr.o mad.o mmu_rb.o pcie.o pio.o pio_copy.o platform.o \
 	qp.o qsfp.o rc.o ruc.o sdma.o sysfs.o trace.o twsi.o \
 	uc.o ud.o user_exp_rcv.o user_pages.o user_sdma.o verbs.o \
 	verbs_txreq.o
diff --git a/drivers/infiniband/hw/hfi1/file_ops.c b/drivers/infiniband/hw/hfi1/file_ops.c
index e4490ae..e460261 100644
--- a/drivers/infiniband/hw/hfi1/file_ops.c
+++ b/drivers/infiniband/hw/hfi1/file_ops.c
@@ -58,6 +58,7 @@
 #include "user_exp_rcv.h"
 #include "eprom.h"
 #include "aspm.h"
+#include "mmu_rb.h"
 
 #undef pr_fmt
 #define pr_fmt(fmt) DRIVER_NAME ": " fmt
diff --git a/drivers/infiniband/hw/hfi1/hfi.h b/drivers/infiniband/hw/hfi1/hfi.h
index 5722883..78c8e24 100644
--- a/drivers/infiniband/hw/hfi1/hfi.h
+++ b/drivers/infiniband/hw/hfi1/hfi.h
@@ -1179,6 +1179,7 @@ struct hfi1_devdata {
 #define PT_EAGER    1
 #define PT_INVALID  2
 
+struct tid_rb_node;
 struct mmu_rb_node;
 
 /* Private data for file operations */
@@ -1189,20 +1190,17 @@ struct hfi1_filedata {
 	struct hfi1_user_sdma_pkt_q *pq;
 	/* for cpu affinity; -1 if none */
 	int rec_cpu_num;
-	struct mmu_notifier mn;
 	struct rb_root tid_rb_root;
-	struct mmu_rb_node **entry_to_rb;
+	struct tid_rb_node **entry_to_rb;
 	spinlock_t tid_lock; /* protect tid_[limit,used] counters */
 	u32 tid_limit;
 	u32 tid_used;
-	spinlock_t rb_lock; /* protect tid_rb_root RB tree */
 	u32 *invalid_tids;
 	u32 invalid_tid_idx;
-	spinlock_t invalid_lock; /* protect the invalid_tids array */
-	int (*mmu_rb_insert)(struct hfi1_filedata *, struct rb_root *,
-			     struct mmu_rb_node *);
-	void (*mmu_rb_remove)(struct hfi1_filedata *, struct rb_root *,
-			      struct mmu_rb_node *);
+	/* protect invalid_tids array and invalid_tid_idx */
+	spinlock_t invalid_lock;
+	int (*mmu_rb_insert)(struct rb_root *, struct mmu_rb_node *);
+	void (*mmu_rb_remove)(struct rb_root *, struct mmu_rb_node *);
 };
 
 extern struct list_head hfi1_dev_list;
diff --git a/drivers/infiniband/hw/hfi1/mmu_rb.c b/drivers/infiniband/hw/hfi1/mmu_rb.c
new file mode 100644
index 0000000..779ebaf
--- /dev/null
+++ b/drivers/infiniband/hw/hfi1/mmu_rb.c
@@ -0,0 +1,304 @@
+/*
+ * Copyright(c) 2016 Intel Corporation.
+ *
+ * This file is provided under a dual BSD/GPLv2 license.  When using or
+ * redistributing this file, you may do so under either license.
+ *
+ * GPL LICENSE SUMMARY
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * BSD LICENSE
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ *
+ *  - Redistributions of source code must retain the above copyright
+ *    notice, this list of conditions and the following disclaimer.
+ *  - Redistributions in binary form must reproduce the above copyright
+ *    notice, this list of conditions and the following disclaimer in
+ *    the documentation and/or other materials provided with the
+ *    distribution.
+ *  - Neither the name of Intel Corporation nor the names of its
+ *    contributors may be used to endorse or promote products derived
+ *    from this software without specific prior written permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ *
+ */
+#include <linux/list.h>
+#include <linux/mmu_notifier.h>
+#include <linux/rbtree.h>
+
+#include "mmu_rb.h"
+#include "trace.h"
+
+struct mmu_rb_handler {
+	struct list_head list;
+	struct mmu_notifier mn;
+	struct rb_root *root;
+	spinlock_t lock;        /* protect the RB tree */
+	struct mmu_rb_ops *ops;
+};
+
+static LIST_HEAD(mmu_rb_handlers);
+static DEFINE_SPINLOCK(mmu_rb_lock); /* protect mmu_rb_handlers list */
+
+static struct mmu_rb_handler *find_mmu_handler(struct rb_root *);
+static inline void mmu_notifier_page(struct mmu_notifier *, struct mm_struct *,
+				     unsigned long);
+static inline void mmu_notifier_range_start(struct mmu_notifier *,
+					    struct mm_struct *,
+					    unsigned long, unsigned long);
+static void mmu_notifier_mem_invalidate(struct mmu_notifier *,
+					unsigned long, unsigned long);
+static struct mmu_rb_node *__mmu_rb_search(struct mmu_rb_handler *,
+					   unsigned long, unsigned long);
+
+static struct mmu_notifier_ops mn_opts = {
+	.invalidate_page = mmu_notifier_page,
+	.invalidate_range_start = mmu_notifier_range_start,
+};
+
+int hfi1_mmu_rb_register(struct rb_root *root, struct mmu_rb_ops *ops)
+{
+	struct mmu_rb_handler *handlr;
+
+	if (!ops->compare || !ops->invalidate)
+		return -EINVAL;
+
+	handlr = kmalloc(sizeof(*handlr), GFP_KERNEL);
+	if (!handlr)
+		return -ENOMEM;
+
+	handlr->root = root;
+	handlr->ops = ops;
+	INIT_HLIST_NODE(&handlr->mn.hlist);
+	spin_lock_init(&handlr->lock);
+	handlr->mn.ops = &mn_opts;
+	spin_lock(&mmu_rb_lock);
+	list_add_tail(&handlr->list, &mmu_rb_handlers);
+	spin_unlock(&mmu_rb_lock);
+
+	return mmu_notifier_register(&handlr->mn, current->mm);
+}
+
+void hfi1_mmu_rb_unregister(struct rb_root *root)
+{
+	struct mmu_rb_handler *handler = find_mmu_handler(root);
+
+	spin_lock(&mmu_rb_lock);
+	list_del(&handler->list);
+	spin_unlock(&mmu_rb_lock);
+
+	if (!RB_EMPTY_ROOT(root)) {
+		struct rb_node *node;
+		struct mmu_rb_node *rbnode;
+
+		while ((node = rb_first(root))) {
+			rbnode = rb_entry(node, struct mmu_rb_node, node);
+			if (handler->ops->remove)
+				handler->ops->remove(root, rbnode);
+			rb_erase(node, root);
+			kfree(rbnode);
+		}
+	}
+
+	if (current->mm)
+		mmu_notifier_unregister(&handler->mn, current->mm);
+	kfree(handler);
+}
+
+int hfi1_mmu_rb_insert(struct rb_root *root, struct mmu_rb_node *mnode)
+{
+	struct rb_node **new, *parent = NULL;
+	struct mmu_rb_handler *handler = find_mmu_handler(root);
+	struct mmu_rb_node *this;
+	int res, ret = 0;
+
+	if (!handler)
+		return -EINVAL;
+
+	new = &handler->root->rb_node;
+	spin_lock(&handler->lock);
+	while (*new) {
+		this = container_of(*new, struct mmu_rb_node, node);
+		res = handler->ops->compare(this, mnode->addr, mnode->len);
+		parent = *new;
+
+		if (res < 0) {
+			new = &((*new)->rb_left);
+		} else if (res > 0) {
+			new = &((*new)->rb_right);
+		} else {
+			ret = 1;
+			goto unlock;
+		}
+	}
+
+	if (handler->ops->insert) {
+		ret = handler->ops->insert(root, mnode);
+		if (ret)
+			goto unlock;
+	}
+
+	rb_link_node(&mnode->node, parent, new);
+	rb_insert_color(&mnode->node, root);
+unlock:
+	spin_unlock(&handler->lock);
+	return ret;
+}
+
+/* Caller must hold handler lock */
+static struct mmu_rb_node *__mmu_rb_search(struct mmu_rb_handler *handler,
+					   unsigned long addr,
+					   unsigned long len)
+{
+	struct rb_node *node = handler->root->rb_node;
+	struct mmu_rb_node *mnode;
+	int res;
+
+	while (node) {
+		mnode = container_of(node, struct mmu_rb_node, node);
+		res = handler->ops->compare(mnode, addr, len);
+
+		if (res < 0)
+			node = node->rb_left;
+		else if (res > 0)
+			node = node->rb_right;
+		else
+			return mnode;
+	}
+	return NULL;
+}
+
+static void __mmu_rb_remove(struct mmu_rb_handler *handler,
+			    struct mmu_rb_node *node)
+{
+	/* Validity of handler and node pointers has been checked by caller. */
+	if (handler->ops->remove)
+		handler->ops->remove(handler->root, node);
+	rb_erase(&node->node, handler->root);
+}
+
+struct mmu_rb_node *hfi1_mmu_rb_search(struct rb_root *root, unsigned long addr,
+				       unsigned long len)
+{
+	struct mmu_rb_handler *handler = find_mmu_handler(root);
+	struct mmu_rb_node *node;
+
+	if (!handler)
+		return ERR_PTR(-EINVAL);
+
+	spin_lock(&handler->lock);
+	node = __mmu_rb_search(handler, addr, len);
+	spin_unlock(&handler->lock);
+
+	return node;
+}
+
+void hfi1_mmu_rb_remove(struct rb_root *root, struct mmu_rb_node *node)
+{
+	struct mmu_rb_handler *handler = find_mmu_handler(root);
+
+	if (!handler || !node)
+		return;
+
+	spin_lock(&handler->lock);
+	__mmu_rb_remove(handler, node);
+	spin_unlock(&handler->lock);
+}
+
+static struct mmu_rb_handler *find_mmu_handler(struct rb_root *root)
+{
+	struct mmu_rb_handler *handler;
+
+	spin_lock(&mmu_rb_lock);
+	list_for_each_entry(handler, &mmu_rb_handlers, list) {
+		if (handler->root == root)
+			goto unlock;
+	}
+	handler = NULL;
+unlock:
+	spin_unlock(&mmu_rb_lock);
+	return handler;
+}
+
+static inline void mmu_notifier_page(struct mmu_notifier *mn,
+				     struct mm_struct *mm, unsigned long addr)
+{
+	mmu_notifier_mem_invalidate(mn, addr, addr + PAGE_SIZE);
+}
+
+static inline void mmu_notifier_range_start(struct mmu_notifier *mn,
+					    struct mm_struct *mm,
+					    unsigned long start,
+					    unsigned long end)
+{
+	mmu_notifier_mem_invalidate(mn, start, end);
+}
+
+static void mmu_notifier_mem_invalidate(struct mmu_notifier *mn,
+					unsigned long start, unsigned long end)
+{
+	struct mmu_rb_handler *handler =
+		container_of(mn, struct mmu_rb_handler, mn);
+	struct rb_root *root = handler->root;
+	struct mmu_rb_node *node;
+	unsigned long addr = start;
+
+	spin_lock(&handler->lock);
+	while (addr < end) {
+		/*
+		 * There is no good way to provide a reasonable length to the
+		 * search function at this point. Using the remaining length in
+		 * the invalidation range is not the right thing to do.
+		 * We have to rely on the fact that the insertion algorithm
+		 * takes care of any overlap or length restrictions by using the
+		 * actual size of each node. Therefore, we can use a page as an
+		 * arbitrary, non-zero value.
+		 */
+		node = __mmu_rb_search(handler, addr, PAGE_SIZE);
+
+		if (!node) {
+			/*
+			 * Didn't find a node at this address. However, the
+			 * range could be bigger than what we have registered
+			 * so we have to keep looking.
+			 */
+			addr += PAGE_SIZE;
+			continue;
+		}
+		if (handler->ops->invalidate(root, node))
+			__mmu_rb_remove(handler, node);
+
+		/*
+		 * The next address to be looked up is computed based
+		 * on the node's starting address. This is due to the
+		 * fact that the range where we start might be in the
+		 * middle of the node's buffer so simply incrementing
+		 * the address by the node's size would result in a
+		 * bad address.
+		 */
+		addr = node->addr + node->len;
+	}
+	spin_unlock(&handler->lock);
+}
diff --git a/drivers/infiniband/hw/hfi1/mmu_rb.h b/drivers/infiniband/hw/hfi1/mmu_rb.h
new file mode 100644
index 0000000..9fe1076
--- /dev/null
+++ b/drivers/infiniband/hw/hfi1/mmu_rb.h
@@ -0,0 +1,73 @@
+/*
+ * Copyright(c) 2016 Intel Corporation.
+ *
+ * This file is provided under a dual BSD/GPLv2 license.  When using or
+ * redistributing this file, you may do so under either license.
+ *
+ * GPL LICENSE SUMMARY
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * BSD LICENSE
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ *
+ *  - Redistributions of source code must retain the above copyright
+ *    notice, this list of conditions and the following disclaimer.
+ *  - Redistributions in binary form must reproduce the above copyright
+ *    notice, this list of conditions and the following disclaimer in
+ *    the documentation and/or other materials provided with the
+ *    distribution.
+ *  - Neither the name of Intel Corporation nor the names of its
+ *    contributors may be used to endorse or promote products derived
+ *    from this software without specific prior written permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ *
+ */
+#ifndef _HFI1_MMU_RB_H
+#define _HFI1_MMU_RB_H
+
+#include "hfi.h"
+
+struct mmu_rb_node {
+	struct rb_node node;
+	unsigned long addr;
+	unsigned long len;
+};
+
+struct mmu_rb_ops {
+	int (*compare)(struct mmu_rb_node *, unsigned long,
+		       unsigned long);
+	int (*insert)(struct rb_root *, struct mmu_rb_node *);
+	void (*remove)(struct rb_root *, struct mmu_rb_node *);
+	int (*invalidate)(struct rb_root *, struct mmu_rb_node *);
+};
+
+int hfi1_mmu_rb_register(struct rb_root *root, struct mmu_rb_ops *ops);
+void hfi1_mmu_rb_unregister(struct rb_root *);
+int hfi1_mmu_rb_insert(struct rb_root *, struct mmu_rb_node *);
+void hfi1_mmu_rb_remove(struct rb_root *, struct mmu_rb_node *);
+struct mmu_rb_node *hfi1_mmu_rb_search(struct rb_root *, unsigned long,
+				       unsigned long);
+
+#endif /* _HFI1_MMU_RB_H */
diff --git a/drivers/infiniband/hw/hfi1/user_exp_rcv.c b/drivers/infiniband/hw/hfi1/user_exp_rcv.c
index fccae50..c9e05dd 100644
--- a/drivers/infiniband/hw/hfi1/user_exp_rcv.c
+++ b/drivers/infiniband/hw/hfi1/user_exp_rcv.c
@@ -48,6 +48,7 @@
 
 #include "user_exp_rcv.h"
 #include "trace.h"
+#include "mmu_rb.h"
 
 struct tid_group {
 	struct list_head list;
@@ -57,11 +58,9 @@ struct tid_group {
 	u8 map;
 };
 
-struct mmu_rb_node {
-	struct rb_node rbnode;
-	unsigned long virt;
+struct tid_rb_node {
+	struct mmu_rb_node mmu;
 	unsigned long phys;
-	unsigned long len;
 	struct tid_group *grp;
 	u32 rcventry;
 	dma_addr_t dma_addr;
@@ -70,16 +69,6 @@ struct mmu_rb_node {
 	struct page *pages[0];
 };
 
-enum mmu_call_types {
-	MMU_INVALIDATE_PAGE = 0,
-	MMU_INVALIDATE_RANGE = 1
-};
-
-static const char * const mmu_types[] = {
-	"PAGE",
-	"RANGE"
-};
-
 struct tid_pageset {
 	u16 idx;
 	u16 count;
@@ -99,28 +88,21 @@ static int set_rcvarray_entry(struct file *, unsigned long, u32,
 			      struct tid_group *, struct page **, unsigned);
 static inline int mmu_addr_cmp(struct mmu_rb_node *, unsigned long,
 			       unsigned long);
-static struct mmu_rb_node *mmu_rb_search(struct rb_root *, unsigned long);
-static int mmu_rb_insert_by_addr(struct hfi1_filedata *, struct rb_root *,
-				 struct mmu_rb_node *);
-static int mmu_rb_insert_by_entry(struct hfi1_filedata *, struct rb_root *,
-				  struct mmu_rb_node *);
-static void mmu_rb_remove_by_addr(struct hfi1_filedata *, struct rb_root *,
-				  struct mmu_rb_node *);
-static void mmu_rb_remove_by_entry(struct hfi1_filedata *, struct rb_root *,
-				   struct mmu_rb_node *);
-static void mmu_notifier_mem_invalidate(struct mmu_notifier *,
-					unsigned long, unsigned long,
-					enum mmu_call_types);
-static inline void mmu_notifier_page(struct mmu_notifier *, struct mm_struct *,
-				     unsigned long);
-static inline void mmu_notifier_range_start(struct mmu_notifier *,
-					    struct mm_struct *,
-					    unsigned long, unsigned long);
+static int mmu_rb_insert(struct rb_root *, struct mmu_rb_node *);
+static void mmu_rb_remove(struct rb_root *, struct mmu_rb_node *);
+static int mmu_rb_invalidate(struct rb_root *, struct mmu_rb_node *);
 static int program_rcvarray(struct file *, unsigned long, struct tid_group *,
 			    struct tid_pageset *, unsigned, u16, struct page **,
 			    u32 *, unsigned *, unsigned *);
 static int unprogram_rcvarray(struct file *, u32, struct tid_group **);
-static void clear_tid_node(struct hfi1_filedata *, u16, struct mmu_rb_node *);
+static void clear_tid_node(struct hfi1_filedata *, u16, struct tid_rb_node *);
+
+static struct mmu_rb_ops tid_rb_ops = {
+	.compare = mmu_addr_cmp,
+	.insert = mmu_rb_insert,
+	.remove = mmu_rb_remove,
+	.invalidate = mmu_rb_invalidate
+};
 
 static inline u32 rcventry2tidinfo(u32 rcventry)
 {
@@ -167,11 +149,6 @@ static inline void tid_group_move(struct tid_group *group,
 	tid_group_add_tail(group, s2);
 }
 
-static struct mmu_notifier_ops mn_opts = {
-	.invalidate_page = mmu_notifier_page,
-	.invalidate_range_start = mmu_notifier_range_start,
-};
-
 /*
  * Initialize context and file private data needed for Expected
  * receive caching. This needs to be done after the context has
@@ -185,11 +162,8 @@ int hfi1_user_exp_rcv_init(struct file *fp)
 	unsigned tidbase;
 	int i, ret = 0;
 
-	INIT_HLIST_NODE(&fd->mn.hlist);
-	spin_lock_init(&fd->rb_lock);
 	spin_lock_init(&fd->tid_lock);
 	spin_lock_init(&fd->invalid_lock);
-	fd->mn.ops = &mn_opts;
 	fd->tid_rb_root = RB_ROOT;
 
 	if (!uctxt->subctxt_cnt || !fd->subctxt) {
@@ -239,7 +213,7 @@ int hfi1_user_exp_rcv_init(struct file *fp)
 		 * fails, continue but turn off the TID caching for
 		 * all user contexts.
 		 */
-		ret = mmu_notifier_register(&fd->mn, current->mm);
+		ret = hfi1_mmu_rb_register(&fd->tid_rb_root, &tid_rb_ops);
 		if (ret) {
 			dd_dev_info(dd,
 				    "Failed MMU notifier registration %d\n",
@@ -250,11 +224,11 @@ int hfi1_user_exp_rcv_init(struct file *fp)
 	}
 
 	if (HFI1_CAP_IS_USET(TID_UNMAP)) {
-		fd->mmu_rb_insert = mmu_rb_insert_by_entry;
-		fd->mmu_rb_remove = mmu_rb_remove_by_entry;
+		fd->mmu_rb_insert = mmu_rb_insert;
+		fd->mmu_rb_remove = mmu_rb_remove;
 	} else {
-		fd->mmu_rb_insert = mmu_rb_insert_by_addr;
-		fd->mmu_rb_remove = mmu_rb_remove_by_addr;
+		fd->mmu_rb_insert = hfi1_mmu_rb_insert;
+		fd->mmu_rb_remove = hfi1_mmu_rb_remove;
 	}
 
 	/*
@@ -295,8 +269,8 @@ int hfi1_user_exp_rcv_free(struct hfi1_filedata *fd)
 	 * The notifier would have been removed when the process's mm
 	 * was freed.
 	 */
-	if (current->mm && !HFI1_CAP_IS_USET(TID_UNMAP))
-		mmu_notifier_unregister(&fd->mn, current->mm);
+	if (!HFI1_CAP_IS_USET(TID_UNMAP))
+		hfi1_mmu_rb_unregister(&fd->tid_rb_root);
 
 	kfree(fd->invalid_tids);
 
@@ -312,19 +286,6 @@ int hfi1_user_exp_rcv_free(struct hfi1_filedata *fd)
 			list_del_init(&grp->list);
 			kfree(grp);
 		}
-		spin_lock(&fd->rb_lock);
-		if (!RB_EMPTY_ROOT(&fd->tid_rb_root)) {
-			struct rb_node *node;
-			struct mmu_rb_node *rbnode;
-
-			while ((node = rb_first(&fd->tid_rb_root))) {
-				rbnode = rb_entry(node, struct mmu_rb_node,
-						  rbnode);
-				rb_erase(&rbnode->rbnode, &fd->tid_rb_root);
-				kfree(rbnode);
-			}
-		}
-		spin_unlock(&fd->rb_lock);
 		hfi1_clear_tids(uctxt);
 	}
 
@@ -866,7 +827,7 @@ static int set_rcvarray_entry(struct file *fp, unsigned long vaddr,
 	int ret;
 	struct hfi1_filedata *fd = fp->private_data;
 	struct hfi1_ctxtdata *uctxt = fd->uctxt;
-	struct mmu_rb_node *node;
+	struct tid_rb_node *node;
 	struct hfi1_devdata *dd = uctxt->dd;
 	struct rb_root *root = &fd->tid_rb_root;
 	dma_addr_t phys;
@@ -890,9 +851,9 @@ static int set_rcvarray_entry(struct file *fp, unsigned long vaddr,
 		return -EFAULT;
 	}
 
-	node->virt = vaddr;
+	node->mmu.addr = vaddr;
+	node->mmu.len = npages * PAGE_SIZE;
 	node->phys = page_to_phys(pages[0]);
-	node->len = npages * PAGE_SIZE;
 	node->npages = npages;
 	node->rcventry = rcventry;
 	node->dma_addr = phys;
@@ -900,21 +861,19 @@ static int set_rcvarray_entry(struct file *fp, unsigned long vaddr,
 	node->freed = false;
 	memcpy(node->pages, pages, sizeof(struct page *) * npages);
 
-	spin_lock(&fd->rb_lock);
-	ret = fd->mmu_rb_insert(fd, root, node);
-	spin_unlock(&fd->rb_lock);
+	ret = fd->mmu_rb_insert(root, &node->mmu);
 
 	if (ret) {
 		hfi1_cdbg(TID, "Failed to insert RB node %u 0x%lx, 0x%lx %d",
-			  node->rcventry, node->virt, node->phys, ret);
+			  node->rcventry, node->mmu.addr, node->phys, ret);
 		pci_unmap_single(dd->pcidev, phys, npages * PAGE_SIZE,
 				 PCI_DMA_FROMDEVICE);
 		kfree(node);
 		return -EFAULT;
 	}
 	hfi1_put_tid(dd, rcventry, PT_EXPECTED, phys, ilog2(npages) + 1);
-	trace_hfi1_exp_tid_reg(uctxt->ctxt, fd->subctxt, rcventry,
-			       npages, node->virt, node->phys, phys);
+	trace_hfi1_exp_tid_reg(uctxt->ctxt, fd->subctxt, rcventry, npages,
+			       node->mmu.addr, node->phys, phys);
 	return 0;
 }
 
@@ -924,7 +883,7 @@ static int unprogram_rcvarray(struct file *fp, u32 tidinfo,
 	struct hfi1_filedata *fd = fp->private_data;
 	struct hfi1_ctxtdata *uctxt = fd->uctxt;
 	struct hfi1_devdata *dd = uctxt->dd;
-	struct mmu_rb_node *node;
+	struct tid_rb_node *node;
 	u8 tidctrl = EXP_TID_GET(tidinfo, CTRL);
 	u32 tididx = EXP_TID_GET(tidinfo, IDX) << 1, rcventry;
 
@@ -939,14 +898,11 @@ static int unprogram_rcvarray(struct file *fp, u32 tidinfo,
 
 	rcventry = tididx + (tidctrl - 1);
 
-	spin_lock(&fd->rb_lock);
 	node = fd->entry_to_rb[rcventry];
-	if (!node || node->rcventry != (uctxt->expected_base + rcventry)) {
-		spin_unlock(&fd->rb_lock);
+	if (!node || node->rcventry != (uctxt->expected_base + rcventry))
 		return -EBADF;
-	}
-	fd->mmu_rb_remove(fd, &fd->tid_rb_root, node);
-	spin_unlock(&fd->rb_lock);
+	fd->mmu_rb_remove(&fd->tid_rb_root, &node->mmu);
+
 	if (grp)
 		*grp = node->grp;
 	clear_tid_node(fd, fd->subctxt, node);
@@ -954,13 +910,13 @@ static int unprogram_rcvarray(struct file *fp, u32 tidinfo,
 }
 
 static void clear_tid_node(struct hfi1_filedata *fd, u16 subctxt,
-			   struct mmu_rb_node *node)
+			   struct tid_rb_node *node)
 {
 	struct hfi1_ctxtdata *uctxt = fd->uctxt;
 	struct hfi1_devdata *dd = uctxt->dd;
 
 	trace_hfi1_exp_tid_unreg(uctxt->ctxt, fd->subctxt, node->rcventry,
-				 node->npages, node->virt, node->phys,
+				 node->npages, node->mmu.addr, node->phys,
 				 node->dma_addr);
 
 	hfi1_put_tid(dd, node->rcventry, PT_INVALID, 0, 0);
@@ -970,7 +926,7 @@ static void clear_tid_node(struct hfi1_filedata *fd, u16 subctxt,
 	 */
 	flush_wc();
 
-	pci_unmap_single(dd->pcidev, node->dma_addr, node->len,
+	pci_unmap_single(dd->pcidev, node->dma_addr, node->mmu.len,
 			 PCI_DMA_FROMDEVICE);
 	hfi1_release_user_pages(node->pages, node->npages, true);
 
@@ -997,216 +953,96 @@ static void unlock_exp_tids(struct hfi1_ctxtdata *uctxt,
 	list_for_each_entry_safe(grp, ptr, &set->list, list) {
 		list_del_init(&grp->list);
 
-		spin_lock(&fd->rb_lock);
 		for (i = 0; i < grp->size; i++) {
 			if (grp->map & (1 << i)) {
 				u16 rcventry = grp->base + i;
-				struct mmu_rb_node *node;
+				struct tid_rb_node *node;
 
 				node = fd->entry_to_rb[rcventry -
 							  uctxt->expected_base];
 				if (!node || node->rcventry != rcventry)
 					continue;
-				fd->mmu_rb_remove(fd, root, node);
+				fd->mmu_rb_remove(root, &node->mmu);
 				clear_tid_node(fd, -1, node);
 			}
 		}
-		spin_unlock(&fd->rb_lock);
 	}
 }
 
-static inline void mmu_notifier_page(struct mmu_notifier *mn,
-				     struct mm_struct *mm, unsigned long addr)
-{
-	mmu_notifier_mem_invalidate(mn, addr, addr + PAGE_SIZE,
-				    MMU_INVALIDATE_PAGE);
-}
-
-static inline void mmu_notifier_range_start(struct mmu_notifier *mn,
-					    struct mm_struct *mm,
-					    unsigned long start,
-					    unsigned long end)
+static int mmu_rb_invalidate(struct rb_root *root, struct mmu_rb_node *mnode)
 {
-	mmu_notifier_mem_invalidate(mn, start, end, MMU_INVALIDATE_RANGE);
-}
+	struct hfi1_filedata *fdata =
+		container_of(root, struct hfi1_filedata, tid_rb_root);
+	struct hfi1_ctxtdata *uctxt = fdata->uctxt;
+	struct tid_rb_node *node =
+		container_of(mnode, struct tid_rb_node, mmu);
 
-static void mmu_notifier_mem_invalidate(struct mmu_notifier *mn,
-					unsigned long start, unsigned long end,
-					enum mmu_call_types type)
-{
-	struct hfi1_filedata *fd = container_of(mn, struct hfi1_filedata, mn);
-	struct hfi1_ctxtdata *uctxt = fd->uctxt;
-	struct rb_root *root = &fd->tid_rb_root;
-	struct mmu_rb_node *node;
-	unsigned long addr = start;
+	if (node->freed)
+		return 0;
 
-	trace_hfi1_mmu_invalidate(uctxt->ctxt, fd->subctxt, mmu_types[type],
-				  start, end);
+	trace_hfi1_exp_tid_inval(uctxt->ctxt, fdata->subctxt, node->mmu.addr,
+				 node->rcventry, node->npages, node->dma_addr);
+	node->freed = true;
 
-	spin_lock(&fd->rb_lock);
-	while (addr < end) {
-		node = mmu_rb_search(root, addr);
+	spin_lock(&fdata->invalid_lock);
+	if (fdata->invalid_tid_idx < uctxt->expected_count) {
+		fdata->invalid_tids[fdata->invalid_tid_idx] =
+			rcventry2tidinfo(node->rcventry - uctxt->expected_base);
+		fdata->invalid_tids[fdata->invalid_tid_idx] |=
+			EXP_TID_SET(LEN, node->npages);
+		if (!fdata->invalid_tid_idx) {
+			unsigned long *ev;
 
-		if (!node) {
 			/*
-			 * Didn't find a node at this address. However, the
-			 * range could be bigger than what we have registered
-			 * so we have to keep looking.
+			 * hfi1_set_uevent_bits() sets a user event flag
+			 * for all processes. Because calling into the
+			 * driver to process TID cache invalidations is
+			 * expensive and TID cache invalidations are
+			 * handled on a per-process basis, we can
+			 * optimize this to set the flag only for the
+			 * process in question.
 			 */
-			addr += PAGE_SIZE;
-			continue;
-		}
-
-		/*
-		 * The next address to be looked up is computed based
-		 * on the node's starting address. This is due to the
-		 * fact that the range where we start might be in the
-		 * middle of the node's buffer so simply incrementing
-		 * the address by the node's size would result is a
-		 * bad address.
-		 */
-		addr = node->virt + (node->npages * PAGE_SIZE);
-		if (node->freed)
-			continue;
-
-		trace_hfi1_exp_tid_inval(uctxt->ctxt, fd->subctxt, node->virt,
-					 node->rcventry, node->npages,
-					 node->dma_addr);
-		node->freed = true;
-
-		spin_lock(&fd->invalid_lock);
-		if (fd->invalid_tid_idx < uctxt->expected_count) {
-			fd->invalid_tids[fd->invalid_tid_idx] =
-				rcventry2tidinfo(node->rcventry -
-						 uctxt->expected_base);
-			fd->invalid_tids[fd->invalid_tid_idx] |=
-				EXP_TID_SET(LEN, node->npages);
-			if (!fd->invalid_tid_idx) {
-				unsigned long *ev;
-
-				/*
-				 * hfi1_set_uevent_bits() sets a user event flag
-				 * for all processes. Because calling into the
-				 * driver to process TID cache invalidations is
-				 * expensive and TID cache invalidations are
-				 * handled on a per-process basis, we can
-				 * optimize this to set the flag only for the
-				 * process in question.
-				 */
-				ev = uctxt->dd->events +
-					(((uctxt->ctxt -
-					   uctxt->dd->first_user_ctxt) *
-					  HFI1_MAX_SHARED_CTXTS) + fd->subctxt);
-				set_bit(_HFI1_EVENT_TID_MMU_NOTIFY_BIT, ev);
-			}
-			fd->invalid_tid_idx++;
+			ev = uctxt->dd->events +
+				(((uctxt->ctxt - uctxt->dd->first_user_ctxt) *
+				  HFI1_MAX_SHARED_CTXTS) + fdata->subctxt);
+			set_bit(_HFI1_EVENT_TID_MMU_NOTIFY_BIT, ev);
 		}
-		spin_unlock(&fd->invalid_lock);
+		fdata->invalid_tid_idx++;
 	}
-	spin_unlock(&fd->rb_lock);
+	spin_unlock(&fdata->invalid_lock);
+	return 0;
 }
 
-static inline int mmu_addr_cmp(struct mmu_rb_node *node, unsigned long addr,
-			       unsigned long len)
+static int mmu_addr_cmp(struct mmu_rb_node *node, unsigned long addr,
+			unsigned long len)
 {
-	if ((addr + len) <= node->virt)
+	if ((addr + len) <= node->addr)
 		return -1;
-	else if (addr >= node->virt && addr < (node->virt + node->len))
+	else if (addr >= node->addr && addr < (node->addr + node->len))
 		return 0;
 	else
 		return 1;
 }
 
-static inline int mmu_entry_cmp(struct mmu_rb_node *node, u32 entry)
-{
-	if (entry < node->rcventry)
-		return -1;
-	else if (entry > node->rcventry)
-		return 1;
-	else
-		return 0;
-}
-
-static struct mmu_rb_node *mmu_rb_search(struct rb_root *root,
-					 unsigned long addr)
-{
-	struct rb_node *node = root->rb_node;
-
-	while (node) {
-		struct mmu_rb_node *mnode =
-			container_of(node, struct mmu_rb_node, rbnode);
-		/*
-		 * When searching, use at least one page length for size. The
-		 * MMU notifier will not give us anything less than that. We
-		 * also don't need anything more than a page because we are
-		 * guaranteed to have non-overlapping buffers in the tree.
-		 */
-		int result = mmu_addr_cmp(mnode, addr, PAGE_SIZE);
-
-		if (result < 0)
-			node = node->rb_left;
-		else if (result > 0)
-			node = node->rb_right;
-		else
-			return mnode;
-	}
-	return NULL;
-}
-
-static int mmu_rb_insert_by_entry(struct hfi1_filedata *fdata,
-				  struct rb_root *root,
-				  struct mmu_rb_node *node)
+static int mmu_rb_insert(struct rb_root *root, struct mmu_rb_node *node)
 {
+	struct hfi1_filedata *fdata =
+		container_of(root, struct hfi1_filedata, tid_rb_root);
+	struct tid_rb_node *tnode =
+		container_of(node, struct tid_rb_node, mmu);
 	u32 base = fdata->uctxt->expected_base;
 
-	fdata->entry_to_rb[node->rcventry - base] = node;
+	fdata->entry_to_rb[tnode->rcventry - base] = tnode;
 	return 0;
 }
 
-static int mmu_rb_insert_by_addr(struct hfi1_filedata *fdata,
-				 struct rb_root *root, struct mmu_rb_node *node)
-{
-	struct rb_node **new = &root->rb_node, *parent = NULL;
-	u32 base = fdata->uctxt->expected_base;
-
-	/* Figure out where to put new node */
-	while (*new) {
-		struct mmu_rb_node *this =
-			container_of(*new, struct mmu_rb_node, rbnode);
-		int result = mmu_addr_cmp(this, node->virt, node->len);
-
-		parent = *new;
-		if (result < 0)
-			new = &((*new)->rb_left);
-		else if (result > 0)
-			new = &((*new)->rb_right);
-		else
-			return 1;
-	}
-
-	/* Add new node and rebalance tree. */
-	rb_link_node(&node->rbnode, parent, new);
-	rb_insert_color(&node->rbnode, root);
-
-	fdata->entry_to_rb[node->rcventry - base] = node;
-	return 0;
-}
-
-static void mmu_rb_remove_by_entry(struct hfi1_filedata *fdata,
-				   struct rb_root *root,
-				   struct mmu_rb_node *node)
-{
-	u32 base = fdata->uctxt->expected_base;
-
-	fdata->entry_to_rb[node->rcventry - base] = NULL;
-}
-
-static void mmu_rb_remove_by_addr(struct hfi1_filedata *fdata,
-				  struct rb_root *root,
-				  struct mmu_rb_node *node)
+static void mmu_rb_remove(struct rb_root *root, struct mmu_rb_node *node)
 {
+	struct hfi1_filedata *fdata =
+		container_of(root, struct hfi1_filedata, tid_rb_root);
+	struct tid_rb_node *tnode =
+		container_of(node, struct tid_rb_node, mmu);
 	u32 base = fdata->uctxt->expected_base;
 
-	fdata->entry_to_rb[node->rcventry - base] = NULL;
-	rb_erase(&node->rbnode, root);
+	fdata->entry_to_rb[tnode->rcventry - base] = NULL;
 }


* [PATCH 02/16] IB/hfi1: Allow MMU function execution in IRQ context
From: Dennis Dalessandro @ 2016-03-08 19:14 UTC (permalink / raw)
  To: dledford-H+wXaHxf7aLQT0dZR+AlfA
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Mitko Haralanov, Dean Luick,
	Jubin John

From: Mitko Haralanov <mitko.haralanov-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>

Future users of the MMU/RB functions might be searching or
manipulating the MMU RB trees in interrupt context. Therefore,
the MMU/RB functions need to be able to run in interrupt
context. This requires that we use the IRQ-aware API for
spin locks.
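
Concretely, every plain spin_lock()/spin_unlock() pair in mmu_rb.c
becomes the IRQ-saving variant, as the hunks below show:

        unsigned long flags;

        spin_lock_irqsave(&handler->lock, flags); /* also masks local IRQs */
        /* ... RB tree search or manipulation ... */
        spin_unlock_irqrestore(&handler->lock, flags);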

Reviewed-by: Dennis Dalessandro <dennis.dalessandro-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Reviewed-by: Dean Luick <dean.luick-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Signed-off-by: Mitko Haralanov <mitko.haralanov-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Signed-off-by: Jubin John <jubin.john-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
---
 drivers/infiniband/hw/hfi1/mmu_rb.c |   36 ++++++++++++++++++++---------------
 1 files changed, 21 insertions(+), 15 deletions(-)

diff --git a/drivers/infiniband/hw/hfi1/mmu_rb.c b/drivers/infiniband/hw/hfi1/mmu_rb.c
index 779ebaf..648f7e0 100644
--- a/drivers/infiniband/hw/hfi1/mmu_rb.c
+++ b/drivers/infiniband/hw/hfi1/mmu_rb.c
@@ -81,6 +81,7 @@ static struct mmu_notifier_ops mn_opts = {
 int hfi1_mmu_rb_register(struct rb_root *root, struct mmu_rb_ops *ops)
 {
 	struct mmu_rb_handler *handlr;
+	unsigned long flags;
 
 	if (!ops->compare || !ops->invalidate)
 		return -EINVAL;
@@ -94,9 +95,9 @@ int hfi1_mmu_rb_register(struct rb_root *root, struct mmu_rb_ops *ops)
 	INIT_HLIST_NODE(&handlr->mn.hlist);
 	spin_lock_init(&handlr->lock);
 	handlr->mn.ops = &mn_opts;
-	spin_lock(&mmu_rb_lock);
+	spin_lock_irqsave(&mmu_rb_lock, flags);
 	list_add_tail(&handlr->list, &mmu_rb_handlers);
-	spin_unlock(&mmu_rb_lock);
+	spin_unlock_irqrestore(&mmu_rb_lock, flags);
 
 	return mmu_notifier_register(&handlr->mn, current->mm);
 }
@@ -104,10 +105,11 @@ int hfi1_mmu_rb_register(struct rb_root *root, struct mmu_rb_ops *ops)
 void hfi1_mmu_rb_unregister(struct rb_root *root)
 {
 	struct mmu_rb_handler *handler = find_mmu_handler(root);
+	unsigned long flags;
 
-	spin_lock(&mmu_rb_lock);
+	spin_lock_irqsave(&mmu_rb_lock, flags);
 	list_del(&handler->list);
-	spin_unlock(&mmu_rb_lock);
+	spin_unlock_irqrestore(&mmu_rb_lock, flags);
 
 	if (!RB_EMPTY_ROOT(root)) {
 		struct rb_node *node;
@@ -132,13 +134,14 @@ int hfi1_mmu_rb_insert(struct rb_root *root, struct mmu_rb_node *mnode)
 	struct rb_node **new, *parent = NULL;
 	struct mmu_rb_handler *handler = find_mmu_handler(root);
 	struct mmu_rb_node *this;
+	unsigned long flags;
 	int res, ret = 0;
 
 	if (!handler)
 		return -EINVAL;
 
 	new = &handler->root->rb_node;
-	spin_lock(&handler->lock);
+	spin_lock_irqsave(&handler->lock, flags);
 	while (*new) {
 		this = container_of(*new, struct mmu_rb_node, node);
 		res = handler->ops->compare(this, mnode->addr, mnode->len);
@@ -163,7 +166,7 @@ int hfi1_mmu_rb_insert(struct rb_root *root, struct mmu_rb_node *mnode)
 	rb_link_node(&mnode->node, parent, new);
 	rb_insert_color(&mnode->node, root);
 unlock:
-	spin_unlock(&handler->lock);
+	spin_unlock_irqrestore(&handler->lock, flags);
 	return ret;
 }
 
@@ -204,13 +207,14 @@ struct mmu_rb_node *hfi1_mmu_rb_search(struct rb_root *root, unsigned long addr,
 {
 	struct mmu_rb_handler *handler = find_mmu_handler(root);
 	struct mmu_rb_node *node;
+	unsigned long flags;
 
 	if (!handler)
 		return ERR_PTR(-EINVAL);
 
-	spin_lock(&handler->lock);
+	spin_lock_irqsave(&handler->lock, flags);
 	node = __mmu_rb_search(handler, addr, len);
-	spin_unlock(&handler->lock);
+	spin_unlock_irqrestore(&handler->lock, flags);
 
 	return node;
 }
@@ -218,27 +222,29 @@ struct mmu_rb_node *hfi1_mmu_rb_search(struct rb_root *root, unsigned long addr,
 void hfi1_mmu_rb_remove(struct rb_root *root, struct mmu_rb_node *node)
 {
 	struct mmu_rb_handler *handler = find_mmu_handler(root);
+	unsigned long flags;
 
 	if (!handler || !node)
 		return;
 
-	spin_lock(&handler->lock);
+	spin_lock_irqsave(&handler->lock, flags);
 	__mmu_rb_remove(handler, node);
-	spin_unlock(&handler->lock);
+	spin_unlock_irqrestore(&handler->lock, flags);
 }
 
 static struct mmu_rb_handler *find_mmu_handler(struct rb_root *root)
 {
 	struct mmu_rb_handler *handler;
+	unsigned long flags;
 
-	spin_lock(&mmu_rb_lock);
+	spin_lock_irqsave(&mmu_rb_lock, flags);
 	list_for_each_entry(handler, &mmu_rb_handlers, list) {
 		if (handler->root == root)
 			goto unlock;
 	}
 	handler = NULL;
 unlock:
-	spin_unlock(&mmu_rb_lock);
+	spin_unlock_irqrestore(&mmu_rb_lock, flags);
 	return handler;
 }
 
@@ -263,9 +269,9 @@ static void mmu_notifier_mem_invalidate(struct mmu_notifier *mn,
 		container_of(mn, struct mmu_rb_handler, mn);
 	struct rb_root *root = handler->root;
 	struct mmu_rb_node *node;
-	unsigned long addr = start;
+	unsigned long addr = start, flags;
 
-	spin_lock(&handler->lock);
+	spin_lock_irqsave(&handler->lock, flags);
 	while (addr < end) {
 		/*
 		 * There is no good way to provide a reasonable length to the
@@ -300,5 +306,5 @@ static void mmu_notifier_mem_invalidate(struct mmu_notifier *mn,
 		 */
 		addr = node->addr + node->len;
 	}
-	spin_unlock(&handler->lock);
+	spin_unlock_irqrestore(&handler->lock, flags);
 }


* [PATCH 03/16] IB/hfi1: Prevent NULL pointer dereference
From: Dennis Dalessandro @ 2016-03-08 19:14 UTC (permalink / raw)
  To: dledford-H+wXaHxf7aLQT0dZR+AlfA
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Mitko Haralanov, Dean Luick,
	Jubin John

From: Mitko Haralanov <mitko.haralanov-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>

Prevent a potential NULL pointer dereference (found
by code inspection) when unregistering an MMU handler.

Reviewed-by: Dennis Dalessandro <dennis.dalessandro-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Reviewed-by: Dean Luick <dean.luick-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Signed-off-by: Mitko Haralanov <mitko.haralanov-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Signed-off-by: Jubin John <jubin.john-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
---
 drivers/infiniband/hw/hfi1/mmu_rb.c |    3 +++
 1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/drivers/infiniband/hw/hfi1/mmu_rb.c b/drivers/infiniband/hw/hfi1/mmu_rb.c
index 648f7e0..f42a33b 100644
--- a/drivers/infiniband/hw/hfi1/mmu_rb.c
+++ b/drivers/infiniband/hw/hfi1/mmu_rb.c
@@ -107,6 +107,9 @@ void hfi1_mmu_rb_unregister(struct rb_root *root)
 	struct mmu_rb_handler *handler = find_mmu_handler(root);
 	unsigned long flags;
 
+	if (!handler)
+		return;
+
 	spin_lock_irqsave(&mmu_rb_lock, flags);
 	list_del(&handler->list);
 	spin_unlock_irqrestore(&mmu_rb_lock, flags);


* [PATCH 04/16] IB/hfi1: Allow remove MMU callbacks to free nodes
From: Dennis Dalessandro @ 2016-03-08 19:14 UTC (permalink / raw)
  To: dledford-H+wXaHxf7aLQT0dZR+AlfA
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Mitko Haralanov, Dean Luick,
	Jubin John

From: Mitko Haralanov <mitko.haralanov-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>

In order to allow the remove MMU callbacks to free the
RB nodes, it is necessary to prevent any references to
the nodes after the remove callback has been called.

Therefore, remove the node from the tree prior to calling
the callback. In other words, the MMU/RB API now guarantees
that all RB node operations it performs will be done prior
to calling the remove callback and that the RB node will
not be touched afterwards.
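
Sketched against the internal remove path, the guarantee is:

        rb_erase(&node->node, handler->root); /* tree drops the node first */
        if (handler->ops->remove)
                handler->ops->remove(handler->root, node); /* may free it */
        /* 'node' must not be dereferenced past this point, which is why
         * the invalidation loop below saves node->addr and node->len
         * before the callbacks run.
         */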

Reviewed-by: Dennis Dalessandro <dennis.dalessandro-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Reviewed-by: Dean Luick <dean.luick-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Signed-off-by: Mitko Haralanov <mitko.haralanov-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Signed-off-by: Jubin John <jubin.john-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
---
 drivers/infiniband/hw/hfi1/mmu_rb.c |   12 +++++++-----
 1 files changed, 7 insertions(+), 5 deletions(-)

diff --git a/drivers/infiniband/hw/hfi1/mmu_rb.c b/drivers/infiniband/hw/hfi1/mmu_rb.c
index f42a33b..a3515d7 100644
--- a/drivers/infiniband/hw/hfi1/mmu_rb.c
+++ b/drivers/infiniband/hw/hfi1/mmu_rb.c
@@ -120,10 +120,9 @@ void hfi1_mmu_rb_unregister(struct rb_root *root)
 
 		while ((node = rb_first(root))) {
 			rbnode = rb_entry(node, struct mmu_rb_node, node);
+			rb_erase(node, root);
 			if (handler->ops->remove)
 				handler->ops->remove(root, rbnode);
-			rb_erase(node, root);
-			kfree(rbnode);
 		}
 	}
 
@@ -200,9 +199,9 @@ static void __mmu_rb_remove(struct mmu_rb_handler *handler,
 			    struct mmu_rb_node *node)
 {
 	/* Validity of handler and node pointers has been checked by caller. */
+	rb_erase(&node->node, handler->root);
 	if (handler->ops->remove)
 		handler->ops->remove(handler->root, node);
-	rb_erase(&node->node, handler->root);
 }
 
 struct mmu_rb_node *hfi1_mmu_rb_search(struct rb_root *root, unsigned long addr,
@@ -272,7 +271,7 @@ static void mmu_notifier_mem_invalidate(struct mmu_notifier *mn,
 		container_of(mn, struct mmu_rb_handler, mn);
 	struct rb_root *root = handler->root;
 	struct mmu_rb_node *node;
-	unsigned long addr = start, flags;
+	unsigned long addr = start, naddr, nlen, flags;
 
 	spin_lock_irqsave(&handler->lock, flags);
 	while (addr < end) {
@@ -296,6 +295,9 @@ static void mmu_notifier_mem_invalidate(struct mmu_notifier *mn,
 			addr += PAGE_SIZE;
 			continue;
 		}
+
+		naddr = node->addr;
+		nlen = node->len;
 		if (handler->ops->invalidate(root, node))
 			__mmu_rb_remove(handler, node);
 
@@ -307,7 +309,7 @@ static void mmu_notifier_mem_invalidate(struct mmu_notifier *mn,
 		 * the address by the node's size would result in a
 		 * bad address.
 		 */
-		addr = node->addr + node->len;
+		addr = naddr + nlen;
 	}
 	spin_unlock_irqrestore(&handler->lock, flags);
 }


* [PATCH 05/16] IB/hfi1: Remove the use of add/remove RB function pointers
From: Dennis Dalessandro @ 2016-03-08 19:14 UTC (permalink / raw)
  To: dledford-H+wXaHxf7aLQT0dZR+AlfA
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Mitko Haralanov, Dean Luick,
	Jubin John

From: Mitko Haralanov <mitko.haralanov-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>

The use of function pointers for RB node insertion and removal in the
expected receive code path was meant as a small performance
optimization. However, maintaining it, especially with the new MMU
API, would become more troublesome as the API is extended.

Since the performance benefit is minor, remove the function pointers
and replace them with direct calls.
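
The replacement is an explicit capability check at each former
indirect call site; for example, in set_rcvarray_entry():

        /* before: indirect call through a per-fd function pointer */
        ret = fd->mmu_rb_insert(root, &node->mmu);

        /* after: direct calls selected by the TID_UNMAP capability */
        if (HFI1_CAP_IS_USET(TID_UNMAP))
                ret = mmu_rb_insert(root, &node->mmu);
        else
                ret = hfi1_mmu_rb_insert(root, &node->mmu);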

Reviewed-by: Dennis Dalessandro <dennis.dalessandro-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Reviewed-by: Dean Luick <dean.luick-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Signed-off-by: Mitko Haralanov <mitko.haralanov-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Signed-off-by: Jubin John <jubin.john-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
---
 drivers/infiniband/hw/hfi1/hfi.h          |    2 --
 drivers/infiniband/hw/hfi1/user_exp_rcv.c |   25 ++++++++++++++-----------
 2 files changed, 14 insertions(+), 13 deletions(-)

diff --git a/drivers/infiniband/hw/hfi1/hfi.h b/drivers/infiniband/hw/hfi1/hfi.h
index 78c8e24..2107cdc 100644
--- a/drivers/infiniband/hw/hfi1/hfi.h
+++ b/drivers/infiniband/hw/hfi1/hfi.h
@@ -1199,8 +1199,6 @@ struct hfi1_filedata {
 	u32 invalid_tid_idx;
 	/* protect invalid_tids array and invalid_tid_idx */
 	spinlock_t invalid_lock;
-	int (*mmu_rb_insert)(struct rb_root *, struct mmu_rb_node *);
-	void (*mmu_rb_remove)(struct rb_root *, struct mmu_rb_node *);
 };
 
 extern struct list_head hfi1_dev_list;
diff --git a/drivers/infiniband/hw/hfi1/user_exp_rcv.c b/drivers/infiniband/hw/hfi1/user_exp_rcv.c
index c9e05dd..b0b193f 100644
--- a/drivers/infiniband/hw/hfi1/user_exp_rcv.c
+++ b/drivers/infiniband/hw/hfi1/user_exp_rcv.c
@@ -223,14 +223,6 @@ int hfi1_user_exp_rcv_init(struct file *fp)
 		}
 	}
 
-	if (HFI1_CAP_IS_USET(TID_UNMAP)) {
-		fd->mmu_rb_insert = mmu_rb_insert;
-		fd->mmu_rb_remove = mmu_rb_remove;
-	} else {
-		fd->mmu_rb_insert = hfi1_mmu_rb_insert;
-		fd->mmu_rb_remove = hfi1_mmu_rb_remove;
-	}
-
 	/*
 	 * PSM does not have a good way to separate, count, and
 	 * effectively enforce a limit on RcvArray entries used by
@@ -861,7 +853,10 @@ static int set_rcvarray_entry(struct file *fp, unsigned long vaddr,
 	node->freed = false;
 	memcpy(node->pages, pages, sizeof(struct page *) * npages);
 
-	ret = fd->mmu_rb_insert(root, &node->mmu);
+	if (HFI1_CAP_IS_USET(TID_UNMAP))
+		ret = mmu_rb_insert(root, &node->mmu);
+	else
+		ret = hfi1_mmu_rb_insert(root, &node->mmu);
 
 	if (ret) {
 		hfi1_cdbg(TID, "Failed to insert RB node %u 0x%lx, 0x%lx %d",
@@ -901,7 +896,10 @@ static int unprogram_rcvarray(struct file *fp, u32 tidinfo,
 	node = fd->entry_to_rb[rcventry];
 	if (!node || node->rcventry != (uctxt->expected_base + rcventry))
 		return -EBADF;
-	fd->mmu_rb_remove(&fd->tid_rb_root, &node->mmu);
+	if (HFI1_CAP_IS_USET(TID_UNMAP))
+		mmu_rb_remove(&fd->tid_rb_root, &node->mmu);
+	else
+		hfi1_mmu_rb_remove(&fd->tid_rb_root, &node->mmu);
 
 	if (grp)
 		*grp = node->grp;
@@ -962,7 +960,12 @@ static void unlock_exp_tids(struct hfi1_ctxtdata *uctxt,
 							  uctxt->expected_base];
 				if (!node || node->rcventry != rcventry)
 					continue;
-				fd->mmu_rb_remove(root, &node->mmu);
+				if (HFI1_CAP_IS_USET(TID_UNMAP))
+					mmu_rb_remove(&fd->tid_rb_root,
+						      &node->mmu);
+				else
+					hfi1_mmu_rb_remove(&fd->tid_rb_root,
+							   &node->mmu);
 				clear_tid_node(fd, -1, node);
 			}
 		}


* [PATCH 06/16] IB/hfi1: Notify remove MMU/RB callback of calling context
From: Dennis Dalessandro @ 2016-03-08 19:14 UTC (permalink / raw)
  To: dledford-H+wXaHxf7aLQT0dZR+AlfA
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Mitko Haralanov, Dean Luick,
	Jubin John

From: Mitko Haralanov <mitko.haralanov-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>

Tell the remove MMU/RB callback whether it is being called as part of
a memory invalidation. This can be important in preventing a deadlock:
when the callback is invoked from the kernel's MMU invalidation
functions, the mmap_sem semaphore is already held, so the callback
must not attempt to take it again.
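
A callback can use the new flag roughly as follows (my_remove() is an
illustrative placeholder, not code from this patch):

        static void my_remove(struct rb_root *root, struct mmu_rb_node *node,
                              bool notifier)
        {
                if (notifier) {
                        /* Invoked from MMU invalidation: mmap_sem may
                         * already be held, so defer any work needing it.
                         */
                } else {
                        /* Process context: safe to take mmap_sem here. */
                }
        }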

Reviewed-by: Dennis Dalessandro <dennis.dalessandro-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Reviewed-by: Dean Luick <dean.luick-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Signed-off-by: Mitko Haralanov <mitko.haralanov-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Signed-off-by: Jubin John <jubin.john-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
---
 drivers/infiniband/hw/hfi1/mmu_rb.c       |   10 +++++-----
 drivers/infiniband/hw/hfi1/mmu_rb.h       |    2 +-
 drivers/infiniband/hw/hfi1/user_exp_rcv.c |    9 +++++----
 3 files changed, 11 insertions(+), 10 deletions(-)

diff --git a/drivers/infiniband/hw/hfi1/mmu_rb.c b/drivers/infiniband/hw/hfi1/mmu_rb.c
index a3515d7..29d6d3e 100644
--- a/drivers/infiniband/hw/hfi1/mmu_rb.c
+++ b/drivers/infiniband/hw/hfi1/mmu_rb.c
@@ -122,7 +122,7 @@ void hfi1_mmu_rb_unregister(struct rb_root *root)
 			rbnode = rb_entry(node, struct mmu_rb_node, node);
 			rb_erase(node, root);
 			if (handler->ops->remove)
-				handler->ops->remove(root, rbnode);
+				handler->ops->remove(root, rbnode, false);
 		}
 	}
 
@@ -196,12 +196,12 @@ static struct mmu_rb_node *__mmu_rb_search(struct mmu_rb_handler *handler,
 }
 
 static void __mmu_rb_remove(struct mmu_rb_handler *handler,
-			    struct mmu_rb_node *node)
+			    struct mmu_rb_node *node, bool arg)
 {
 	/* Validity of handler and node pointers has been checked by caller. */
 	rb_erase(&node->node, handler->root);
 	if (handler->ops->remove)
-		handler->ops->remove(handler->root, node);
+		handler->ops->remove(handler->root, node, arg);
 }
 
 struct mmu_rb_node *hfi1_mmu_rb_search(struct rb_root *root, unsigned long addr,
@@ -230,7 +230,7 @@ void hfi1_mmu_rb_remove(struct rb_root *root, struct mmu_rb_node *node)
 		return;
 
 	spin_lock_irqsave(&handler->lock, flags);
-	__mmu_rb_remove(handler, node);
+	__mmu_rb_remove(handler, node, false);
 	spin_unlock_irqrestore(&handler->lock, flags);
 }
 
@@ -299,7 +299,7 @@ static void mmu_notifier_mem_invalidate(struct mmu_notifier *mn,
 		naddr = node->addr;
 		nlen = node->len;
 		if (handler->ops->invalidate(root, node))
-			__mmu_rb_remove(handler, node);
+			__mmu_rb_remove(handler, node, true);
 
 		/*
 		 * The next address to be looked up is computed based
diff --git a/drivers/infiniband/hw/hfi1/mmu_rb.h b/drivers/infiniband/hw/hfi1/mmu_rb.h
index 9fe1076..fdd9787 100644
--- a/drivers/infiniband/hw/hfi1/mmu_rb.h
+++ b/drivers/infiniband/hw/hfi1/mmu_rb.h
@@ -59,7 +59,7 @@ struct mmu_rb_ops {
 	int (*compare)(struct mmu_rb_node *, unsigned long,
 		       unsigned long);
 	int (*insert)(struct rb_root *, struct mmu_rb_node *);
-	void (*remove)(struct rb_root *, struct mmu_rb_node *);
+	void (*remove)(struct rb_root *, struct mmu_rb_node *, bool);
 	int (*invalidate)(struct rb_root *, struct mmu_rb_node *);
 };
 
diff --git a/drivers/infiniband/hw/hfi1/user_exp_rcv.c b/drivers/infiniband/hw/hfi1/user_exp_rcv.c
index b0b193f..1d971c0 100644
--- a/drivers/infiniband/hw/hfi1/user_exp_rcv.c
+++ b/drivers/infiniband/hw/hfi1/user_exp_rcv.c
@@ -89,7 +89,7 @@ static int set_rcvarray_entry(struct file *, unsigned long, u32,
 static inline int mmu_addr_cmp(struct mmu_rb_node *, unsigned long,
 			       unsigned long);
 static int mmu_rb_insert(struct rb_root *, struct mmu_rb_node *);
-static void mmu_rb_remove(struct rb_root *, struct mmu_rb_node *);
+static void mmu_rb_remove(struct rb_root *, struct mmu_rb_node *, bool);
 static int mmu_rb_invalidate(struct rb_root *, struct mmu_rb_node *);
 static int program_rcvarray(struct file *, unsigned long, struct tid_group *,
 			    struct tid_pageset *, unsigned, u16, struct page **,
@@ -897,7 +897,7 @@ static int unprogram_rcvarray(struct file *fp, u32 tidinfo,
 	if (!node || node->rcventry != (uctxt->expected_base + rcventry))
 		return -EBADF;
 	if (HFI1_CAP_IS_USET(TID_UNMAP))
-		mmu_rb_remove(&fd->tid_rb_root, &node->mmu);
+		mmu_rb_remove(&fd->tid_rb_root, &node->mmu, false);
 	else
 		hfi1_mmu_rb_remove(&fd->tid_rb_root, &node->mmu);
 
@@ -962,7 +962,7 @@ static void unlock_exp_tids(struct hfi1_ctxtdata *uctxt,
 					continue;
 				if (HFI1_CAP_IS_USET(TID_UNMAP))
 					mmu_rb_remove(&fd->tid_rb_root,
-						      &node->mmu);
+						      &node->mmu, false);
 				else
 					hfi1_mmu_rb_remove(&fd->tid_rb_root,
 							   &node->mmu);
@@ -1039,7 +1039,8 @@ static int mmu_rb_insert(struct rb_root *root, struct mmu_rb_node *node)
 	return 0;
 }
 
-static void mmu_rb_remove(struct rb_root *root, struct mmu_rb_node *node)
+static void mmu_rb_remove(struct rb_root *root, struct mmu_rb_node *node,
+			  bool notifier)
 {
 	struct hfi1_filedata *fdata =
 		container_of(root, struct hfi1_filedata, tid_rb_root);
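
For illustration, a remove callback might honor the new flag roughly
as in the sketch below. This is illustrative only, not part of the
patch; example_node_pages() is a hypothetical helper:

static void example_rb_remove(struct rb_root *root, struct mmu_rb_node *node,
			      bool from_notifier)
{
	/*
	 * The MMU notifier path already holds mmap_sem, so take it
	 * only when we were NOT called from an invalidation.
	 */
	if (!from_notifier) {
		down_write(&current->mm->mmap_sem);
		current->mm->pinned_vm -= example_node_pages(node);
		up_write(&current->mm->mmap_sem);
	}
	/* Teardown that is safe in either context goes here. */
}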


* [PATCH 07/16] IB/hfi1: Use interval RB trees
From: Dennis Dalessandro @ 2016-03-08 19:14 UTC (permalink / raw)
  To: dledford-H+wXaHxf7aLQT0dZR+AlfA
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Mitko Haralanov, Dean Luick,
	Jubin John

From: Mitko Haralanov <mitko.haralanov-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>

Interval RB trees can handle RB nodes which hold ranged
information. This is exactly the usage pattern of the
buffer cache implemented in the expected receive code
path.

Convert the MMU/RB functions to use the interval RB
tree API. This will help with future users of the
caching API, as well.

Reviewed-by: Dennis Dalessandro <dennis.dalessandro-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Reviewed-by: Dean Luick <dean.luick-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Signed-off-by: Mitko Haralanov <mitko.haralanov-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Signed-off-by: Jubin John <jubin.john-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
---
 drivers/infiniband/hw/hfi1/mmu_rb.c |  106 +++++++++++------------------------
 drivers/infiniband/hw/hfi1/mmu_rb.h |    3 +
 2 files changed, 34 insertions(+), 75 deletions(-)

diff --git a/drivers/infiniband/hw/hfi1/mmu_rb.c b/drivers/infiniband/hw/hfi1/mmu_rb.c
index 29d6d3e..540e267 100644
--- a/drivers/infiniband/hw/hfi1/mmu_rb.c
+++ b/drivers/infiniband/hw/hfi1/mmu_rb.c
@@ -46,7 +46,7 @@
  */
 #include <linux/list.h>
 #include <linux/mmu_notifier.h>
-#include <linux/rbtree.h>
+#include <linux/interval_tree_generic.h>
 
 #include "mmu_rb.h"
 #include "trace.h"
@@ -62,6 +62,8 @@ struct mmu_rb_handler {
 static LIST_HEAD(mmu_rb_handlers);
 static DEFINE_SPINLOCK(mmu_rb_lock); /* protect mmu_rb_handlers list */
 
+static unsigned long mmu_node_start(struct mmu_rb_node *);
+static unsigned long mmu_node_last(struct mmu_rb_node *);
 static struct mmu_rb_handler *find_mmu_handler(struct rb_root *);
 static inline void mmu_notifier_page(struct mmu_notifier *, struct mm_struct *,
 				     unsigned long);
@@ -78,6 +80,19 @@ static struct mmu_notifier_ops mn_opts = {
 	.invalidate_range_start = mmu_notifier_range_start,
 };
 
+INTERVAL_TREE_DEFINE(struct mmu_rb_node, node, unsigned long, __last,
+		     mmu_node_start, mmu_node_last, static, __mmu_int_rb);
+
+static unsigned long mmu_node_start(struct mmu_rb_node *node)
+{
+	return node->addr & PAGE_MASK;
+}
+
+static unsigned long mmu_node_last(struct mmu_rb_node *node)
+{
+	return ((node->addr & PAGE_MASK) + node->len);
+}
+
 int hfi1_mmu_rb_register(struct rb_root *root, struct mmu_rb_ops *ops)
 {
 	struct mmu_rb_handler *handlr;
@@ -133,40 +148,27 @@ void hfi1_mmu_rb_unregister(struct rb_root *root)
 
 int hfi1_mmu_rb_insert(struct rb_root *root, struct mmu_rb_node *mnode)
 {
-	struct rb_node **new, *parent = NULL;
 	struct mmu_rb_handler *handler = find_mmu_handler(root);
-	struct mmu_rb_node *this;
+	struct mmu_rb_node *node;
 	unsigned long flags;
-	int res, ret = 0;
+	int ret = 0;
 
 	if (!handler)
 		return -EINVAL;
 
-	new = &handler->root->rb_node;
 	spin_lock_irqsave(&handler->lock, flags);
-	while (*new) {
-		this = container_of(*new, struct mmu_rb_node, node);
-		res = handler->ops->compare(this, mnode->addr, mnode->len);
-		parent = *new;
-
-		if (res < 0) {
-			new = &((*new)->rb_left);
-		} else if (res > 0) {
-			new = &((*new)->rb_right);
-		} else {
-			ret = 1;
-			goto unlock;
-		}
+	node = __mmu_rb_search(handler, mnode->addr, mnode->len);
+	if (node) {
+		ret = -EINVAL;
+		goto unlock;
 	}
+	__mmu_int_rb_insert(mnode, root);
 
 	if (handler->ops->insert) {
 		ret = handler->ops->insert(root, mnode);
 		if (ret)
-			goto unlock;
+			__mmu_int_rb_remove(mnode, root);
 	}
-
-	rb_link_node(&mnode->node, parent, new);
-	rb_insert_color(&mnode->node, root);
 unlock:
 	spin_unlock_irqrestore(&handler->lock, flags);
 	return ret;
@@ -177,29 +179,17 @@ static struct mmu_rb_node *__mmu_rb_search(struct mmu_rb_handler *handler,
 					   unsigned long addr,
 					   unsigned long len)
 {
-	struct rb_node *node = handler->root->rb_node;
-	struct mmu_rb_node *mnode;
-	int res;
-
-	while (node) {
-		mnode = container_of(node, struct mmu_rb_node, node);
-		res = handler->ops->compare(mnode, addr, len);
-
-		if (res < 0)
-			node = node->rb_left;
-		else if (res > 0)
-			node = node->rb_right;
-		else
-			return mnode;
-	}
-	return NULL;
+	struct mmu_rb_node *node;
+
+	node = __mmu_int_rb_iter_first(handler->root, addr, len);
+	return node;
 }
 
 static void __mmu_rb_remove(struct mmu_rb_handler *handler,
 			    struct mmu_rb_node *node, bool arg)
 {
 	/* Validity of handler and node pointers has been checked by caller. */
-	rb_erase(&node->node, handler->root);
+	__mmu_int_rb_remove(node, handler->root);
 	if (handler->ops->remove)
 		handler->ops->remove(handler->root, node, arg);
 }
@@ -271,45 +261,13 @@ static void mmu_notifier_mem_invalidate(struct mmu_notifier *mn,
 		container_of(mn, struct mmu_rb_handler, mn);
 	struct rb_root *root = handler->root;
 	struct mmu_rb_node *node;
-	unsigned long addr = start, naddr, nlen, flags;
+	unsigned long flags;
 
 	spin_lock_irqsave(&handler->lock, flags);
-	while (addr < end) {
-		/*
-		 * There is no good way to provide a reasonable length to the
-		 * search function at this point. Using the remaining length in
-		 * the invalidation range is not the right thing to do.
-		 * We have to rely on the fact that the insertion algorithm
-		 * takes care of any overlap or length restrictions by using the
-		 * actual size of each node. Therefore, we can use a page as an
-		 * arbitrary, non-zero value.
-		 */
-		node = __mmu_rb_search(handler, addr, PAGE_SIZE);
-
-		if (!node) {
-			/*
-			 * Didn't find a node at this address. However, the
-			 * range could be bigger than what we have registered
-			 * so we have to keep looking.
-			 */
-			addr += PAGE_SIZE;
-			continue;
-		}
-
-		naddr = node->addr;
-		nlen = node->len;
+	for (node = __mmu_int_rb_iter_first(root, start, end); node;
+	     node = __mmu_int_rb_iter_next(node, start, end)) {
 		if (handler->ops->invalidate(root, node))
 			__mmu_rb_remove(handler, node, true);
-
-		/*
-		 * The next address to be looked up is computed based
-		 * on the node's starting address. This is due to the
-		 * fact that the range where we start might be in the
-		 * middle of the node's buffer so simply incrementing
-		 * the address by the node's size would result in a
-		 * bad address.
-		 */
-		addr = naddr + nlen;
 	}
 	spin_unlock_irqrestore(&handler->lock, flags);
 }
diff --git a/drivers/infiniband/hw/hfi1/mmu_rb.h b/drivers/infiniband/hw/hfi1/mmu_rb.h
index fdd9787..abed3a6 100644
--- a/drivers/infiniband/hw/hfi1/mmu_rb.h
+++ b/drivers/infiniband/hw/hfi1/mmu_rb.h
@@ -50,9 +50,10 @@
 #include "hfi.h"
 
 struct mmu_rb_node {
-	struct rb_node node;
 	unsigned long addr;
 	unsigned long len;
+	unsigned long __last;
+	struct rb_node node;
 };
 
 struct mmu_rb_ops {
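
For reference, INTERVAL_TREE_DEFINE() generates iterators over an
inclusive [start, last] range. A minimal sketch of visiting every
cached node overlapping a buffer (illustrative only; root, addr and
len are assumed to be in scope), using an inclusive last address as
the later patches in this series do:

	struct mmu_rb_node *n;

	for (n = __mmu_int_rb_iter_first(root, addr, addr + len - 1); n;
	     n = __mmu_int_rb_iter_next(n, addr, addr + len - 1))
		pr_debug("overlap: addr 0x%lx, len %lu\n", n->addr, n->len);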


* [PATCH 08/16] IB/hfi1: Add MMU tracing
From: Dennis Dalessandro @ 2016-03-08 19:14 UTC (permalink / raw)
  To: dledford-H+wXaHxf7aLQT0dZR+AlfA
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Mitko Haralanov, Dean Luick,
	Jubin John

From: Mitko Haralanov <mitko.haralanov-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>

Add a new tracepoint type for the MMU functions and calls
to that tracepoint to allow tracing of MMU functionality.

Reviewed-by: Dennis Dalessandro <dennis.dalessandro-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Reviewed-by: Dean Luick <dean.luick-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Signed-off-by: Mitko Haralanov <mitko.haralanov-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Signed-off-by: Jubin John <jubin.john-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
---
 drivers/infiniband/hw/hfi1/mmu_rb.c |   10 ++++++++++
 drivers/infiniband/hw/hfi1/trace.c  |    1 +
 drivers/infiniband/hw/hfi1/trace.h  |    1 +
 3 files changed, 12 insertions(+), 0 deletions(-)

diff --git a/drivers/infiniband/hw/hfi1/mmu_rb.c b/drivers/infiniband/hw/hfi1/mmu_rb.c
index 540e267..c30373d 100644
--- a/drivers/infiniband/hw/hfi1/mmu_rb.c
+++ b/drivers/infiniband/hw/hfi1/mmu_rb.c
@@ -157,6 +157,8 @@ int hfi1_mmu_rb_insert(struct rb_root *root, struct mmu_rb_node *mnode)
 		return -EINVAL;
 
 	spin_lock_irqsave(&handler->lock, flags);
+	hfi1_cdbg(MMU, "Inserting node addr 0x%llx, len %u", mnode->addr,
+		  mnode->len);
 	node = __mmu_rb_search(handler, mnode->addr, mnode->len);
 	if (node) {
 		ret = -EINVAL;
@@ -181,7 +183,11 @@ static struct mmu_rb_node *__mmu_rb_search(struct mmu_rb_handler *handler,
 {
 	struct mmu_rb_node *node;
 
+	hfi1_cdbg(MMU, "Searching for addr 0x%llx, len %u", addr, len);
 	node = __mmu_int_rb_iter_first(handler->root, addr, len);
+	if (node)
+		hfi1_cdbg(MMU, "Found node addr 0x%llx, len %u", node->addr,
+			  node->len);
 	return node;
 }
 
@@ -189,6 +195,8 @@ static void __mmu_rb_remove(struct mmu_rb_handler *handler,
 			    struct mmu_rb_node *node, bool arg)
 {
 	/* Validity of handler and node pointers has been checked by caller. */
+	hfi1_cdbg(MMU, "Removing node addr 0x%llx, len %u", node->addr,
+		  node->len);
 	__mmu_int_rb_remove(node, handler->root);
 	if (handler->ops->remove)
 		handler->ops->remove(handler->root, node, arg);
@@ -266,6 +274,8 @@ static void mmu_notifier_mem_invalidate(struct mmu_notifier *mn,
 	spin_lock_irqsave(&handler->lock, flags);
 	for (node = __mmu_int_rb_iter_first(root, start, end); node;
 	     node = __mmu_int_rb_iter_next(node, start, end)) {
+		hfi1_cdbg(MMU, "Invalidating node addr 0x%llx, len %u",
+			  node->addr, node->len);
 		if (handler->ops->invalidate(root, node))
 			__mmu_rb_remove(handler, node, true);
 	}
diff --git a/drivers/infiniband/hw/hfi1/trace.c b/drivers/infiniband/hw/hfi1/trace.c
index 6821d7c..8b62fef 100644
--- a/drivers/infiniband/hw/hfi1/trace.c
+++ b/drivers/infiniband/hw/hfi1/trace.c
@@ -232,3 +232,4 @@ __hfi1_trace_fn(DC8051);
 __hfi1_trace_fn(FIRMWARE);
 __hfi1_trace_fn(RCVCTRL);
 __hfi1_trace_fn(TID);
+__hfi1_trace_fn(MMU);
diff --git a/drivers/infiniband/hw/hfi1/trace.h b/drivers/infiniband/hw/hfi1/trace.h
index 4d91c18..963dc94 100644
--- a/drivers/infiniband/hw/hfi1/trace.h
+++ b/drivers/infiniband/hw/hfi1/trace.h
@@ -1340,6 +1340,7 @@ __hfi1_trace_def(DC8051);
 __hfi1_trace_def(FIRMWARE);
 __hfi1_trace_def(RCVCTRL);
 __hfi1_trace_def(TID);
+__hfi1_trace_def(MMU);
 
 #define hfi1_cdbg(which, fmt, ...) \
 	__hfi1_trace_##which(__func__, fmt, ##__VA_ARGS__)
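
With the MMU trace type defined, call sites use the existing
hfi1_cdbg() macro shown above; for example, the insert-path call
added to mmu_rb.c in this patch:

	hfi1_cdbg(MMU, "Inserting node addr 0x%llx, len %u",
		  mnode->addr, mnode->len);
	/* expands, per trace.h, to __hfi1_trace_MMU(__func__, fmt, ...) */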


* [PATCH 09/16] IB/hfi1: Remove compare callback
From: Dennis Dalessandro @ 2016-03-08 19:15 UTC (permalink / raw)
  To: dledford-H+wXaHxf7aLQT0dZR+AlfA
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Mitko Haralanov, Dean Luick,
	Jubin John

From: Mitko Haralanov <mitko.haralanov-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>

Interval RB trees provide their own searching function,
which also takes care of determining the path through
the tree that should be taken.

This makes the compare callback unnecessary.

Reviewed-by: Dennis Dalessandro <dennis.dalessandro-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Reviewed-by: Dean Luick <dean.luick-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Signed-off-by: Mitko Haralanov <mitko.haralanov-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Signed-off-by: Jubin John <jubin.john-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
---
 drivers/infiniband/hw/hfi1/mmu_rb.c       |    2 +-
 drivers/infiniband/hw/hfi1/mmu_rb.h       |    2 --
 drivers/infiniband/hw/hfi1/user_exp_rcv.c |   14 --------------
 3 files changed, 1 insertions(+), 17 deletions(-)

diff --git a/drivers/infiniband/hw/hfi1/mmu_rb.c b/drivers/infiniband/hw/hfi1/mmu_rb.c
index c30373d..5d27fee 100644
--- a/drivers/infiniband/hw/hfi1/mmu_rb.c
+++ b/drivers/infiniband/hw/hfi1/mmu_rb.c
@@ -98,7 +98,7 @@ int hfi1_mmu_rb_register(struct rb_root *root, struct mmu_rb_ops *ops)
 	struct mmu_rb_handler *handlr;
 	unsigned long flags;
 
-	if (!ops->compare || !ops->invalidate)
+	if (!ops->invalidate)
 		return -EINVAL;
 
 	handlr = kmalloc(sizeof(*handlr), GFP_KERNEL);
diff --git a/drivers/infiniband/hw/hfi1/mmu_rb.h b/drivers/infiniband/hw/hfi1/mmu_rb.h
index abed3a6..9c26009 100644
--- a/drivers/infiniband/hw/hfi1/mmu_rb.h
+++ b/drivers/infiniband/hw/hfi1/mmu_rb.h
@@ -57,8 +57,6 @@ struct mmu_rb_node {
 };
 
 struct mmu_rb_ops {
-	int (*compare)(struct mmu_rb_node *, unsigned long,
-		       unsigned long);
 	int (*insert)(struct rb_root *, struct mmu_rb_node *);
 	void (*remove)(struct rb_root *, struct mmu_rb_node *, bool);
 	int (*invalidate)(struct rb_root *, struct mmu_rb_node *);
diff --git a/drivers/infiniband/hw/hfi1/user_exp_rcv.c b/drivers/infiniband/hw/hfi1/user_exp_rcv.c
index 1d971c0..bf670cb 100644
--- a/drivers/infiniband/hw/hfi1/user_exp_rcv.c
+++ b/drivers/infiniband/hw/hfi1/user_exp_rcv.c
@@ -86,8 +86,6 @@ static void unlock_exp_tids(struct hfi1_ctxtdata *, struct exp_tid_set *,
 static u32 find_phys_blocks(struct page **, unsigned, struct tid_pageset *);
 static int set_rcvarray_entry(struct file *, unsigned long, u32,
 			      struct tid_group *, struct page **, unsigned);
-static inline int mmu_addr_cmp(struct mmu_rb_node *, unsigned long,
-			       unsigned long);
 static int mmu_rb_insert(struct rb_root *, struct mmu_rb_node *);
 static void mmu_rb_remove(struct rb_root *, struct mmu_rb_node *, bool);
 static int mmu_rb_invalidate(struct rb_root *, struct mmu_rb_node *);
@@ -98,7 +96,6 @@ static int unprogram_rcvarray(struct file *, u32, struct tid_group **);
 static void clear_tid_node(struct hfi1_filedata *, u16, struct tid_rb_node *);
 
 static struct mmu_rb_ops tid_rb_ops = {
-	.compare = mmu_addr_cmp,
 	.insert = mmu_rb_insert,
 	.remove = mmu_rb_remove,
 	.invalidate = mmu_rb_invalidate
@@ -1016,17 +1013,6 @@ static int mmu_rb_invalidate(struct rb_root *root, struct mmu_rb_node *mnode)
 	return 0;
 }
 
-static int mmu_addr_cmp(struct mmu_rb_node *node, unsigned long addr,
-			unsigned long len)
-{
-	if ((addr + len) <= node->addr)
-		return -1;
-	else if (addr >= node->addr && addr < (node->addr + node->len))
-		return 0;
-	else
-		return 1;
-}
-
 static int mmu_rb_insert(struct rb_root *root, struct mmu_rb_node *node)
 {
 	struct hfi1_filedata *fdata =


* [PATCH 10/16] IB/hfi1: Add filter callback
From: Dennis Dalessandro @ 2016-03-08 19:15 UTC (permalink / raw)
  To: dledford-H+wXaHxf7aLQT0dZR+AlfA
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Mitko Haralanov, Dean Luick,
	Jubin John

From: Mitko Haralanov <mitko.haralanov-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>

This commit adds a filter callback, which can be used to
narrow the set of interval RB nodes matching a given
interval down to a single one.

This is needed for the upcoming SDMA-side caching where buffers
will need to be filtered by their virtual address.

Reviewed-by: Dennis Dalessandro <dennis.dalessandro-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Reviewed-by: Dean Luick <dean.luick-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Signed-off-by: Mitko Haralanov <mitko.haralanov-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Signed-off-by: Jubin John <jubin.john-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
---
 drivers/infiniband/hw/hfi1/mmu_rb.c |   19 ++++++++++++++-----
 drivers/infiniband/hw/hfi1/mmu_rb.h |    1 +
 2 files changed, 15 insertions(+), 5 deletions(-)

diff --git a/drivers/infiniband/hw/hfi1/mmu_rb.c b/drivers/infiniband/hw/hfi1/mmu_rb.c
index 5d27fee..6edd5f0 100644
--- a/drivers/infiniband/hw/hfi1/mmu_rb.c
+++ b/drivers/infiniband/hw/hfi1/mmu_rb.c
@@ -181,13 +181,22 @@ static struct mmu_rb_node *__mmu_rb_search(struct mmu_rb_handler *handler,
 					   unsigned long addr,
 					   unsigned long len)
 {
-	struct mmu_rb_node *node;
+	struct mmu_rb_node *node = NULL;
 
 	hfi1_cdbg(MMU, "Searching for addr 0x%llx, len %u", addr, len);
-	node = __mmu_int_rb_iter_first(handler->root, addr, len);
-	if (node)
-		hfi1_cdbg(MMU, "Found node addr 0x%llx, len %u", node->addr,
-			  node->len);
+	if (!handler->ops->filter) {
+		node = __mmu_int_rb_iter_first(handler->root, addr,
+					       (addr + len) - 1);
+	} else {
+		for (node = __mmu_int_rb_iter_first(handler->root, addr,
+						    (addr + len) - 1);
+		     node;
+		     node = __mmu_int_rb_iter_next(node, addr,
+						   (addr + len) - 1)) {
+			if (handler->ops->filter(node, addr, len))
+				return node;
+		}
+	}
 	return node;
 }
 
diff --git a/drivers/infiniband/hw/hfi1/mmu_rb.h b/drivers/infiniband/hw/hfi1/mmu_rb.h
index 9c26009..f8523fd 100644
--- a/drivers/infiniband/hw/hfi1/mmu_rb.h
+++ b/drivers/infiniband/hw/hfi1/mmu_rb.h
@@ -57,6 +57,7 @@ struct mmu_rb_node {
 };
 
 struct mmu_rb_ops {
+	bool (*filter)(struct mmu_rb_node *, unsigned long, unsigned long);
 	int (*insert)(struct rb_root *, struct mmu_rb_node *);
 	void (*remove)(struct rb_root *, struct mmu_rb_node *, bool);
 	int (*invalidate)(struct rb_root *, struct mmu_rb_node *);
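
A filter can be as simple as an exact start-address match; this is
essentially the filter that the SDMA caching patch later in this
series registers:

static bool sdma_rb_filter(struct mmu_rb_node *node, unsigned long addr,
			   unsigned long len)
{
	return (bool)(node->addr == addr);
}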


* [PATCH 11/16] IB/hfi1: Adjust last address values for intervals
From: Dennis Dalessandro @ 2016-03-08 19:15 UTC (permalink / raw)
  To: dledford-H+wXaHxf7aLQT0dZR+AlfA
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Mitko Haralanov, Dean Luick,
	Jubin John

From: Mitko Haralanov <mitko.haralanov-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>

Last address values for intervals in the interval RB tree
nodes should point to the last byte of the interval (the
interval tree API treats the last value as inclusive) in
order to avoid ambiguous ranges.

Reviewed-by: Dennis Dalessandro <dennis.dalessandro-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Reviewed-by: Dean Luick <dean.luick-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Signed-off-by: Mitko Haralanov <mitko.haralanov-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Signed-off-by: Jubin John <jubin.john-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
---
 drivers/infiniband/hw/hfi1/mmu_rb.c |    6 +++---
 1 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/infiniband/hw/hfi1/mmu_rb.c b/drivers/infiniband/hw/hfi1/mmu_rb.c
index 6edd5f0..c7ad016 100644
--- a/drivers/infiniband/hw/hfi1/mmu_rb.c
+++ b/drivers/infiniband/hw/hfi1/mmu_rb.c
@@ -90,7 +90,7 @@ static unsigned long mmu_node_start(struct mmu_rb_node *node)
 
 static unsigned long mmu_node_last(struct mmu_rb_node *node)
 {
-	return ((node->addr & PAGE_MASK) + node->len);
+	return PAGE_ALIGN((node->addr & PAGE_MASK) + node->len) - 1;
 }
 
 int hfi1_mmu_rb_register(struct rb_root *root, struct mmu_rb_ops *ops)
@@ -281,8 +281,8 @@ static void mmu_notifier_mem_invalidate(struct mmu_notifier *mn,
 	unsigned long flags;
 
 	spin_lock_irqsave(&handler->lock, flags);
-	for (node = __mmu_int_rb_iter_first(root, start, end); node;
-	     node = __mmu_int_rb_iter_next(node, start, end)) {
+	for (node = __mmu_int_rb_iter_first(root, start, end - 1); node;
+	     node = __mmu_int_rb_iter_next(node, start, end - 1)) {
 		hfi1_cdbg(MMU, "Invalidating node addr 0x%llx, len %u",
 			  node->addr, node->len);
 		if (handler->ops->invalidate(root, node))
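
A worked example of the adjusted values, assuming 4 KiB pages:

	/*
	 * Node: addr = 0x1000, len = 0x2000 (two pages).
	 *   mmu_node_start() -> 0x1000
	 *   mmu_node_last()  -> PAGE_ALIGN(0x1000 + 0x2000) - 1 = 0x2fff
	 * An invalidation of [0x3000, 0x4000) is queried as
	 * [0x3000, 0x3fff] and correctly does not overlap the node.
	 */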


* [PATCH 12/16] IB/hfi1: Implement SDMA-side buffer caching
From: Dennis Dalessandro @ 2016-03-08 19:15 UTC (permalink / raw)
  To: dledford-H+wXaHxf7aLQT0dZR+AlfA
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Mitko Haralanov, Dean Luick,
	Jubin John

From: Mitko Haralanov <mitko.haralanov-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>

Add support for caching of user buffers used for SDMA
transfers. This change improves performance by avoiding
repeated pinning of the pages of buffers that the
application re-uses.

While the pinning operation has been made heavier by the
extra code that searches the cache tree, re-allocates page
arrays, and performs future cache evictions, that cost is
amortized against the savings each time the same buffer is
re-used. It is also worth noting that in most cases the
cost of pinning will be much lower because the buffer is
already in the cache.

Reviewed-by: Dennis Dalessandro <dennis.dalessandro-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Reviewed-by: Dean Luick <dean.luick-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Signed-off-by: Mitko Haralanov <mitko.haralanov-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Signed-off-by: Jubin John <jubin.john-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
---
 drivers/infiniband/hw/hfi1/user_sdma.c |  255 +++++++++++++++++++-------------
 drivers/infiniband/hw/hfi1/user_sdma.h |    1 
 2 files changed, 155 insertions(+), 101 deletions(-)

diff --git a/drivers/infiniband/hw/hfi1/user_sdma.c b/drivers/infiniband/hw/hfi1/user_sdma.c
index 14fe079..a53edb9 100644
--- a/drivers/infiniband/hw/hfi1/user_sdma.c
+++ b/drivers/infiniband/hw/hfi1/user_sdma.c
@@ -68,6 +68,7 @@
 #include "verbs.h"  /* for the headers */
 #include "common.h" /* for struct hfi1_tid_info */
 #include "trace.h"
+#include "mmu_rb.h"
 
 static uint hfi1_sdma_comp_ring_size = 128;
 module_param_named(sdma_comp_size, hfi1_sdma_comp_ring_size, uint, S_IRUGO);
@@ -145,9 +146,6 @@ MODULE_PARM_DESC(sdma_comp_size, "Size of User SDMA completion ring. Default: 12
 /* Last packet in the request */
 #define TXREQ_FLAGS_REQ_LAST_PKT BIT(0)
 
-/* Last packet that uses a particular io vector */
-#define TXREQ_FLAGS_IOVEC_LAST_PKT BIT(0)
-
 #define SDMA_REQ_IN_USE     0
 #define SDMA_REQ_FOR_THREAD 1
 #define SDMA_REQ_SEND_DONE  2
@@ -183,6 +181,13 @@ struct user_sdma_iovec {
 	u64 offset;
 };
 
+struct sdma_mmu_node {
+	struct mmu_rb_node rb;
+	atomic_t refcount;
+	struct page **pages;
+	unsigned npages;
+};
+
 struct user_sdma_request {
 	struct sdma_req_info info;
 	struct hfi1_user_sdma_pkt_q *pq;
@@ -252,11 +257,6 @@ struct user_sdma_txreq {
 	struct sdma_txreq txreq;
 	struct list_head list;
 	struct user_sdma_request *req;
-	struct {
-		struct user_sdma_iovec *vec;
-		u8 flags;
-	} iovecs[3];
-	int idx;
 	u16 flags;
 	unsigned busycount;
 	u64 seqnum;
@@ -277,7 +277,7 @@ static inline void pq_update(struct hfi1_user_sdma_pkt_q *);
 static void user_sdma_free_request(struct user_sdma_request *, bool);
 static int pin_vector_pages(struct user_sdma_request *,
 			    struct user_sdma_iovec *);
-static void unpin_vector_pages(struct user_sdma_iovec *);
+static void unpin_vector_pages(struct page **, unsigned);
 static int check_header_template(struct user_sdma_request *,
 				 struct hfi1_pkt_header *, u32, u32);
 static int set_txreq_header(struct user_sdma_request *,
@@ -296,6 +296,17 @@ static int defer_packet_queue(
 	struct sdma_txreq *,
 	unsigned seq);
 static void activate_packet_queue(struct iowait *, int);
+static bool sdma_rb_filter(struct mmu_rb_node *, unsigned long, unsigned long);
+static int sdma_rb_insert(struct rb_root *, struct mmu_rb_node *);
+static void sdma_rb_remove(struct rb_root *, struct mmu_rb_node *, bool);
+static int sdma_rb_invalidate(struct rb_root *, struct mmu_rb_node *);
+
+static struct mmu_rb_ops sdma_rb_ops = {
+	.filter = sdma_rb_filter,
+	.insert = sdma_rb_insert,
+	.remove = sdma_rb_remove,
+	.invalidate = sdma_rb_invalidate
+};
 
 static int defer_packet_queue(
 	struct sdma_engine *sde,
@@ -385,6 +396,7 @@ int hfi1_user_sdma_alloc_queues(struct hfi1_ctxtdata *uctxt, struct file *fp)
 	pq->state = SDMA_PKT_Q_INACTIVE;
 	atomic_set(&pq->n_reqs, 0);
 	init_waitqueue_head(&pq->wait);
+	pq->sdma_rb_root = RB_ROOT;
 
 	iowait_init(&pq->busy, 0, NULL, defer_packet_queue,
 		    activate_packet_queue, NULL);
@@ -415,6 +427,12 @@ int hfi1_user_sdma_alloc_queues(struct hfi1_ctxtdata *uctxt, struct file *fp)
 	cq->nentries = hfi1_sdma_comp_ring_size;
 	fd->cq = cq;
 
+	ret = hfi1_mmu_rb_register(&pq->sdma_rb_root, &sdma_rb_ops);
+	if (ret) {
+		dd_dev_err(dd, "Failed to register with MMU %d", ret);
+		goto done;
+	}
+
 	spin_lock_irqsave(&uctxt->sdma_qlock, flags);
 	list_add(&pq->list, &uctxt->sdma_queues);
 	spin_unlock_irqrestore(&uctxt->sdma_qlock, flags);
@@ -444,6 +462,7 @@ int hfi1_user_sdma_free_queues(struct hfi1_filedata *fd)
 	hfi1_cdbg(SDMA, "[%u:%u:%u] Freeing user SDMA queues", uctxt->dd->unit,
 		  uctxt->ctxt, fd->subctxt);
 	pq = fd->pq;
+	hfi1_mmu_rb_unregister(&pq->sdma_rb_root);
 	if (pq) {
 		spin_lock_irqsave(&uctxt->sdma_qlock, flags);
 		if (!list_empty(&pq->list))
@@ -477,7 +496,7 @@ int hfi1_user_sdma_process_request(struct file *fp, struct iovec *iovec,
 	struct hfi1_user_sdma_pkt_q *pq = fd->pq;
 	struct hfi1_user_sdma_comp_q *cq = fd->cq;
 	struct hfi1_devdata *dd = pq->dd;
-	unsigned long idx = 0, unpinned;
+	unsigned long idx = 0;
 	u8 pcount = initial_pkt_count;
 	struct sdma_req_info info;
 	struct user_sdma_request *req;
@@ -498,14 +517,6 @@ int hfi1_user_sdma_process_request(struct file *fp, struct iovec *iovec,
 		return -EFAULT;
 	}
 
-	/* Process any completed vectors */
-	unpinned = xchg(&pq->unpinned, 0);
-	if (unpinned) {
-		down_write(&current->mm->mmap_sem);
-		current->mm->pinned_vm -= unpinned;
-		up_write(&current->mm->mmap_sem);
-	}
-
 	trace_hfi1_sdma_user_reqinfo(dd, uctxt->ctxt, fd->subctxt,
 				     (u16 *)&info);
 	if (cq->comps[info.comp_idx].status == QUEUED ||
@@ -609,7 +620,11 @@ int hfi1_user_sdma_process_request(struct file *fp, struct iovec *iovec,
 	while (i < req->data_iovs) {
 		INIT_LIST_HEAD(&req->iovs[i].list);
 		memcpy(&req->iovs[i].iov, iovec + idx++, sizeof(struct iovec));
-		req->iovs[i].offset = 0;
+		ret = pin_vector_pages(req, &req->iovs[i]);
+		if (ret) {
+			req->status = ret;
+			goto free_req;
+		}
 		req->data_len += req->iovs[i++].iov.iov_len;
 	}
 	SDMA_DBG(req, "total data length %u", req->data_len);
@@ -827,9 +842,7 @@ static int user_sdma_send_pkts(struct user_sdma_request *req, unsigned maxpkts)
 		tx->flags = 0;
 		tx->req = req;
 		tx->busycount = 0;
-		tx->idx = -1;
 		INIT_LIST_HEAD(&tx->list);
-		memset(tx->iovecs, 0, sizeof(tx->iovecs));
 
 		if (req->seqnum == req->info.npkts - 1)
 			tx->flags |= TXREQ_FLAGS_REQ_LAST_PKT;
@@ -850,18 +863,6 @@ static int user_sdma_send_pkts(struct user_sdma_request *req, unsigned maxpkts)
 				WARN_ON(iovec->offset);
 			}
 
-			/*
-			 * This request might include only a header and no user
-			 * data, so pin pages only if there is data and the
-			 * pages have not been pinned already.
-			 */
-			if (unlikely(!iovec->pages && iovec->iov.iov_len)) {
-				ret = pin_vector_pages(req, iovec);
-				if (ret)
-					goto free_tx;
-			}
-
-			tx->iovecs[++tx->idx].vec = iovec;
 			datalen = compute_data_length(req, tx);
 			if (!datalen) {
 				SDMA_DBG(req,
@@ -960,19 +961,10 @@ static int user_sdma_send_pkts(struct user_sdma_request *req, unsigned maxpkts)
 			data_sent += len;
 			if (unlikely(queued < datalen &&
 				     pageidx == iovec->npages &&
-				     req->iov_idx < req->data_iovs - 1 &&
-				     tx->idx < ARRAY_SIZE(tx->iovecs))) {
+				     req->iov_idx < req->data_iovs - 1)) {
 				iovec->offset += iov_offset;
-				tx->iovecs[tx->idx].flags |=
-					TXREQ_FLAGS_IOVEC_LAST_PKT;
 				iovec = &req->iovs[++req->iov_idx];
-				if (!iovec->pages) {
-					ret = pin_vector_pages(req, iovec);
-					if (ret)
-						goto free_txreq;
-				}
 				iov_offset = 0;
-				tx->iovecs[++tx->idx].vec = iovec;
 			}
 		}
 		/*
@@ -983,18 +975,8 @@ static int user_sdma_send_pkts(struct user_sdma_request *req, unsigned maxpkts)
 		if (req_opcode(req->info.ctrl) == EXPECTED)
 			req->tidoffset += datalen;
 		req->sent += data_sent;
-		if (req->data_len) {
-			tx->iovecs[tx->idx].vec->offset += iov_offset;
-			/*
-			 * If we've reached the end of the io vector, mark it
-			 * so the callback can unpin the pages and free it.
-			 */
-			if (tx->iovecs[tx->idx].vec->offset ==
-			    tx->iovecs[tx->idx].vec->iov.iov_len)
-				tx->iovecs[tx->idx].flags |=
-					TXREQ_FLAGS_IOVEC_LAST_PKT;
-		}
-
+		if (req->data_len)
+			iovec->offset += iov_offset;
 		list_add_tail(&tx->txreq.list, &req->txps);
 		/*
 		 * It is important to increment this here as it is used to
@@ -1047,38 +1029,78 @@ static inline int num_user_pages(const struct iovec *iov)
 
 static int pin_vector_pages(struct user_sdma_request *req,
 			    struct user_sdma_iovec *iovec) {
-	int pinned, npages;
+	int ret = 0, pinned, npages;
+	struct page **pages;
+	struct hfi1_user_sdma_pkt_q *pq = req->pq;
+	struct sdma_mmu_node *node = NULL;
+	struct mmu_rb_node *rb_node;
+
+	rb_node = hfi1_mmu_rb_search(&pq->sdma_rb_root,
+				     (unsigned long)iovec->iov.iov_base,
+				     iovec->iov.iov_len);
+	if (rb_node)
+		node = container_of(rb_node, struct sdma_mmu_node, rb);
+
+	if (!node) {
+		node = kzalloc(sizeof(*node), GFP_KERNEL);
+		if (!node)
+			return -ENOMEM;
 
-	npages = num_user_pages(&iovec->iov);
-	iovec->pages = kcalloc(npages, sizeof(*iovec->pages), GFP_KERNEL);
-	if (!iovec->pages) {
-		SDMA_DBG(req, "Failed page array alloc");
-		return -ENOMEM;
+		node->rb.addr = (unsigned long)iovec->iov.iov_base;
+		node->rb.len = iovec->iov.iov_len;
+		atomic_set(&node->refcount, 0);
 	}
 
-	pinned = hfi1_acquire_user_pages((unsigned long)iovec->iov.iov_base,
-					 npages, 0, iovec->pages);
-
-	if (pinned < 0)
-		return pinned;
+	npages = num_user_pages(&iovec->iov);
+	if (node->npages < npages) {
+		pages = kcalloc(npages, sizeof(*pages), GFP_KERNEL);
+		if (!pages) {
+			SDMA_DBG(req, "Failed page array alloc");
+			ret = -ENOMEM;
+			goto bail;
+		}
+		memcpy(pages, node->pages, node->npages * sizeof(*pages));
+
+		npages -= node->npages;
+		pinned = hfi1_acquire_user_pages(
+			((unsigned long)iovec->iov.iov_base +
+			 (node->npages * PAGE_SIZE)), npages, 0,
+			pages + node->npages);
+		if (pinned < 0) {
+			kfree(pages);
+			ret = pinned;
+			goto bail;
+		}
+		if (pinned != npages) {
+			unpin_vector_pages(pages, pinned);
+			ret = -EFAULT;
+			goto bail;
+		}
+		kfree(node->pages);
+		node->pages = pages;
+		node->npages += pinned;
+		npages = node->npages;
+	}
+	iovec->pages = node->pages;
+	iovec->npages = npages;
 
-	iovec->npages = pinned;
-	if (pinned != npages) {
-		SDMA_DBG(req, "Failed to pin pages (%d/%u)", pinned, npages);
-		unpin_vector_pages(iovec);
-		return -EFAULT;
+	if (!rb_node) {
+		if (hfi1_mmu_rb_insert(&req->pq->sdma_rb_root, &node->rb))
+			goto bail;
+	} else {
+		atomic_inc(&node->refcount);
 	}
 	return 0;
+bail:
+	if (!rb_node)
+		kfree(node);
+	return ret;
 }
 
-static void unpin_vector_pages(struct user_sdma_iovec *iovec)
+static void unpin_vector_pages(struct page **pages, unsigned npages)
 {
-	hfi1_release_user_pages(iovec->pages, iovec->npages, 0);
-
-	kfree(iovec->pages);
-	iovec->pages = NULL;
-	iovec->npages = 0;
-	iovec->offset = 0;
+	hfi1_release_user_pages(pages, npages, 0);
+	kfree(pages);
 }
 
 static int check_header_template(struct user_sdma_request *req,
@@ -1360,7 +1382,6 @@ static void user_sdma_txreq_cb(struct sdma_txreq *txreq, int status)
 	struct hfi1_user_sdma_pkt_q *pq;
 	struct hfi1_user_sdma_comp_q *cq;
 	u16 idx;
-	int i, j;
 
 	if (!tx->req)
 		return;
@@ -1369,24 +1390,6 @@ static void user_sdma_txreq_cb(struct sdma_txreq *txreq, int status)
 	pq = req->pq;
 	cq = req->cq;
 
-	/*
-	 * If we have any io vectors associated with this txreq,
-	 * check whether they need to be 'freed'.
-	 */
-	for (i = tx->idx; i >= 0; i--) {
-		if (tx->iovecs[i].flags & TXREQ_FLAGS_IOVEC_LAST_PKT) {
-			struct user_sdma_iovec *vec =
-				tx->iovecs[i].vec;
-
-			for (j = 0; j < vec->npages; j++)
-				put_page(vec->pages[j]);
-			xadd(&pq->unpinned, vec->npages);
-			kfree(vec->pages);
-			vec->pages = NULL;
-			vec->npages = 0;
-		}
-	}
-
 	if (status != SDMA_TXREQ_S_OK) {
 		SDMA_DBG(req, "SDMA completion with error %d",
 			 status);
@@ -1439,12 +1442,26 @@ static void user_sdma_free_request(struct user_sdma_request *req, bool unpin)
 			kmem_cache_free(req->pq->txreq_cache, tx);
 		}
 	}
-	if (req->data_iovs && unpin) {
+	if (req->data_iovs) {
+		struct sdma_mmu_node *node;
+		struct mmu_rb_node *mnode;
 		int i;
 
-		for (i = 0; i < req->data_iovs; i++)
-			if (req->iovs[i].npages && req->iovs[i].pages)
-				unpin_vector_pages(&req->iovs[i]);
+		for (i = 0; i < req->data_iovs; i++) {
+			mnode = hfi1_mmu_rb_search(
+				&req->pq->sdma_rb_root,
+				(unsigned long)req->iovs[i].iov.iov_base,
+				req->iovs[i].iov.iov_len);
+			if (!mnode)
+				continue;
+
+			node = container_of(mnode, struct sdma_mmu_node, rb);
+			if (unpin)
+				hfi1_mmu_rb_remove(&req->pq->sdma_rb_root,
+						   &node->rb);
+			else
+				atomic_dec(&node->refcount);
+		}
 	}
 	kfree(req->tids);
 	clear_bit(SDMA_REQ_IN_USE, &req->flags);
@@ -1463,3 +1480,39 @@ static inline void set_comp_state(struct hfi1_user_sdma_pkt_q *pq,
 	trace_hfi1_sdma_user_completion(pq->dd, pq->ctxt, pq->subctxt,
 					idx, state, ret);
 }
+
+static bool sdma_rb_filter(struct mmu_rb_node *node, unsigned long addr,
+			   unsigned long len)
+{
+	return (bool)(node->addr == addr);
+}
+
+static int sdma_rb_insert(struct rb_root *root, struct mmu_rb_node *mnode)
+{
+	struct sdma_mmu_node *node =
+		container_of(mnode, struct sdma_mmu_node, rb);
+
+	atomic_inc(&node->refcount);
+	return 0;
+}
+
+static void sdma_rb_remove(struct rb_root *root, struct mmu_rb_node *mnode,
+			   bool notifier)
+{
+	struct sdma_mmu_node *node =
+		container_of(mnode, struct sdma_mmu_node, rb);
+
+	if (!notifier)
+		unpin_vector_pages(node->pages, node->npages);
+	kfree(node);
+}
+
+static int sdma_rb_invalidate(struct rb_root *root, struct mmu_rb_node *mnode)
+{
+	struct sdma_mmu_node *node =
+		container_of(mnode, struct sdma_mmu_node, rb);
+
+	if (!atomic_read(&node->refcount))
+		return 1;
+	return 0;
+}
diff --git a/drivers/infiniband/hw/hfi1/user_sdma.h b/drivers/infiniband/hw/hfi1/user_sdma.h
index e0d0fe0..39866b5 100644
--- a/drivers/infiniband/hw/hfi1/user_sdma.h
+++ b/drivers/infiniband/hw/hfi1/user_sdma.h
@@ -67,6 +67,7 @@ struct hfi1_user_sdma_pkt_q {
 	unsigned state;
 	wait_queue_head_t wait;
 	unsigned long unpinned;
+	struct rb_root sdma_rb_root;
 };
 
 struct hfi1_user_sdma_comp_q {
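
To summarize the lifecycle the diff above implements (all names are
from the patch):

	/*
	 * Request submit: pin_vector_pages()
	 *   cache hit  -> reuse node->pages; atomic_inc(&node->refcount)
	 *   cache miss -> pin pages; hfi1_mmu_rb_insert()
	 *                 (sdma_rb_insert takes the initial reference)
	 * Request free: user_sdma_free_request()
	 *   unpin -> hfi1_mmu_rb_remove() (sdma_rb_remove unpins and frees)
	 *   else  -> atomic_dec(&node->refcount); the buffer stays cached
	 * Invalidation: sdma_rb_invalidate() lets the MMU notifier evict
	 *               only nodes whose refcount is zero
	 */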


* [PATCH 13/16] IB/hfi1: Add pin query function
From: Dennis Dalessandro @ 2016-03-08 19:15 UTC (permalink / raw)
  To: dledford-H+wXaHxf7aLQT0dZR+AlfA
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Mitko Haralanov, Dean Luick,
	Jubin John

From: Mitko Haralanov <mitko.haralanov-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>

System administrators can use the locked memory
ulimit setting to set the maximum amount of memory
a user can lock/pin. However, this setting alone is not
enough to guarantee good operation of the hfi1 driver,
because it is not fine-grained enough to account for the
limit being shared by multiple user processes and caches.

Therefore, a better limiting algorithm is needed. This
is where the new hfi1_can_pin_pages() function and the
cache_size module parameter come in.

The function works by looking at the ulimit and cache_size
value to compute a cache size. The algorithm examines the
ulimit value and, if it is not "unlimited", computes a
per-cache limit based on the number of configured user
contexts.

After that, the lower of the two - cache_size and computed
per-cache limit - is used.

Reviewed-by: Dennis Dalessandro <dennis.dalessandro-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Reviewed-by: Dean Luick <dean.luick-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Signed-off-by: Mitko Haralanov <mitko.haralanov-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Signed-off-by: Jubin John <jubin.john-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
---
 drivers/infiniband/hw/hfi1/hfi.h        |    1 +
 drivers/infiniband/hw/hfi1/user_pages.c |   52 +++++++++++++++++++++++++++----
 2 files changed, 47 insertions(+), 6 deletions(-)

diff --git a/drivers/infiniband/hw/hfi1/hfi.h b/drivers/infiniband/hw/hfi1/hfi.h
index 2107cdc..ff3b37a 100644
--- a/drivers/infiniband/hw/hfi1/hfi.h
+++ b/drivers/infiniband/hw/hfi1/hfi.h
@@ -1664,6 +1664,7 @@ void shutdown_led_override(struct hfi1_pportdata *ppd);
  */
 #define DEFAULT_RCVHDR_ENTSIZE 32
 
+bool hfi1_can_pin_pages(struct hfi1_devdata *, u32, u32);
 int hfi1_acquire_user_pages(unsigned long, size_t, bool, struct page **);
 void hfi1_release_user_pages(struct page **, size_t, bool);
 
diff --git a/drivers/infiniband/hw/hfi1/user_pages.c b/drivers/infiniband/hw/hfi1/user_pages.c
index 3bf8108..bd7a8ab 100644
--- a/drivers/infiniband/hw/hfi1/user_pages.c
+++ b/drivers/infiniband/hw/hfi1/user_pages.c
@@ -48,22 +48,62 @@
 #include <linux/mm.h>
 #include <linux/sched.h>
 #include <linux/device.h>
+#include <linux/module.h>
 
 #include "hfi.h"
 
-int hfi1_acquire_user_pages(unsigned long vaddr, size_t npages, bool writable,
-			    struct page **pages)
+static unsigned long cache_size = 256;
+module_param(cache_size, ulong, S_IRUGO | S_IWUSR);
+MODULE_PARM_DESC(cache_size, "Send and receive side cache size limit (in MB)");
+
+/*
+ * Determine whether the caller can pin pages.
+ *
+ * This function should be used in the implementation of buffer caches.
+ * The cache implementation should call this function prior to attempting
+ * to pin buffer pages in order to determine whether they should do so.
+ * The function computes cache limits based on the configured ulimit and
+ * cache size. Use of this function is especially important for caches
+ * which are not limited in any other way (e.g. by HW resources) and, thus,
+ * could keep caching buffers.
+ *
+ */
+bool hfi1_can_pin_pages(struct hfi1_devdata *dd, u32 nlocked, u32 npages)
 {
-	unsigned long pinned, lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
+	unsigned long ulimit = rlimit(RLIMIT_MEMLOCK), pinned, cache_limit,
+		size = (cache_size * (1UL << 20)); /* convert to bytes */
+	unsigned usr_ctxts = dd->num_rcv_contexts - dd->first_user_ctxt;
 	bool can_lock = capable(CAP_IPC_LOCK);
-	int ret;
+
+	/*
+	 * Calculate per-cache size. The calculation below uses only a quarter
+	 * of the available per-context limit. This leaves space for other
+	 * pinning. Should we worry about shared ctxts?
+	 */
+	cache_limit = (ulimit / usr_ctxts) / 4;
+
+	/* If ulimit isn't set to "unlimited" and is smaller than cache_size. */
+	if (ulimit != (-1UL) && size > cache_limit)
+		size = cache_limit;
+
+	/* Convert to number of pages */
+	size = DIV_ROUND_UP(size, PAGE_SIZE);
 
 	down_read(&current->mm->mmap_sem);
 	pinned = current->mm->pinned_vm;
 	up_read(&current->mm->mmap_sem);
 
-	if (pinned + npages > lock_limit && !can_lock)
-		return -ENOMEM;
+	/* First, check the absolute limit against all pinned pages. */
+	if (pinned + npages >= ulimit && !can_lock)
+		return false;
+
+	return ((nlocked + npages) <= size) || can_lock;
+}
+
+int hfi1_acquire_user_pages(unsigned long vaddr, size_t npages, bool writable,
+			    struct page **pages)
+{
+	int ret;
 
 	ret = get_user_pages_fast(vaddr, npages, writable, pages);
 	if (ret < 0)
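
A worked example with hypothetical numbers: with RLIMIT_MEMLOCK set
to 64 MiB and 16 configured user contexts, cache_limit =
(64 MiB / 16) / 4 = 1 MiB. Since that is below the default
cache_size of 256 MiB, this cache may pin at most 1 MiB (256 pages
at 4 KiB per page), unless the caller has CAP_IPC_LOCK.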


* [PATCH 14/16] IB/hfi1: Specify mm when releasing pages
From: Dennis Dalessandro @ 2016-03-08 19:15 UTC (permalink / raw)
  To: dledford-H+wXaHxf7aLQT0dZR+AlfA
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Mitko Haralanov, Dean Luick,
	Jubin John

From: Mitko Haralanov <mitko.haralanov-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>

This change adds a pointer to the process mm_struct when
calling hfi1_release_user_pages().

Previously, the function used the mm_struct of the current
process to adjust the number of pinned pages. However, in some
cases, namely when unpinning pages due to an MMU notifier call,
we must not drop into that code block, as doing so would cause
a deadlock (the MMU notifiers take the process' mmap_sem prior
to calling the callbacks).

By allowing the caller to specify the pointer to the mm_struct,
the caller gains finer control over that part of
hfi1_release_user_pages().

Reviewed-by: Dennis Dalessandro <dennis.dalessandro-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Reviewed-by: Dean Luick <dean.luick-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Signed-off-by: Mitko Haralanov <mitko.haralanov-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Signed-off-by: Jubin John <jubin.john-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
---
 drivers/infiniband/hw/hfi1/hfi.h          |    2 +-
 drivers/infiniband/hw/hfi1/user_exp_rcv.c |    4 ++--
 drivers/infiniband/hw/hfi1/user_pages.c   |   11 ++++++-----
 drivers/infiniband/hw/hfi1/user_sdma.c    |   19 +++++++++++++------
 4 files changed, 22 insertions(+), 14 deletions(-)

diff --git a/drivers/infiniband/hw/hfi1/hfi.h b/drivers/infiniband/hw/hfi1/hfi.h
index ff3b37a..3dc644d 100644
--- a/drivers/infiniband/hw/hfi1/hfi.h
+++ b/drivers/infiniband/hw/hfi1/hfi.h
@@ -1666,7 +1666,7 @@ void shutdown_led_override(struct hfi1_pportdata *ppd);
 
 bool hfi1_can_pin_pages(struct hfi1_devdata *, u32, u32);
 int hfi1_acquire_user_pages(unsigned long, size_t, bool, struct page **);
-void hfi1_release_user_pages(struct page **, size_t, bool);
+void hfi1_release_user_pages(struct mm_struct *, struct page **, size_t, bool);
 
 static inline void clear_rcvhdrtail(const struct hfi1_ctxtdata *rcd)
 {
diff --git a/drivers/infiniband/hw/hfi1/user_exp_rcv.c b/drivers/infiniband/hw/hfi1/user_exp_rcv.c
index bf670cb..591605a 100644
--- a/drivers/infiniband/hw/hfi1/user_exp_rcv.c
+++ b/drivers/infiniband/hw/hfi1/user_exp_rcv.c
@@ -550,7 +550,7 @@ nomem:
 	 * for example), unpin all unmapped pages so we can pin them next time.
 	 */
 	if (mapped_pages != pinned)
-		hfi1_release_user_pages(&pages[mapped_pages],
+		hfi1_release_user_pages(current->mm, &pages[mapped_pages],
 					pinned - mapped_pages,
 					false);
 bail:
@@ -923,7 +923,7 @@ static void clear_tid_node(struct hfi1_filedata *fd, u16 subctxt,
 
 	pci_unmap_single(dd->pcidev, node->dma_addr, node->mmu.len,
 			 PCI_DMA_FROMDEVICE);
-	hfi1_release_user_pages(node->pages, node->npages, true);
+	hfi1_release_user_pages(current->mm, node->pages, node->npages, true);
 
 	node->grp->used--;
 	node->grp->map &= ~(1 << (node->rcventry - node->grp->base));
diff --git a/drivers/infiniband/hw/hfi1/user_pages.c b/drivers/infiniband/hw/hfi1/user_pages.c
index bd7a8ab..88e10b5 100644
--- a/drivers/infiniband/hw/hfi1/user_pages.c
+++ b/drivers/infiniband/hw/hfi1/user_pages.c
@@ -116,7 +116,8 @@ int hfi1_acquire_user_pages(unsigned long vaddr, size_t npages, bool writable,
 	return ret;
 }
 
-void hfi1_release_user_pages(struct page **p, size_t npages, bool dirty)
+void hfi1_release_user_pages(struct mm_struct *mm, struct page **p,
+			     size_t npages, bool dirty)
 {
 	size_t i;
 
@@ -126,9 +127,9 @@ void hfi1_release_user_pages(struct page **p, size_t npages, bool dirty)
 		put_page(p[i]);
 	}
 
-	if (current->mm) { /* during close after signal, mm can be NULL */
-		down_write(&current->mm->mmap_sem);
-		current->mm->pinned_vm -= npages;
-		up_write(&current->mm->mmap_sem);
+	if (mm) { /* during close after signal, mm can be NULL */
+		down_write(&mm->mmap_sem);
+		mm->pinned_vm -= npages;
+		up_write(&mm->mmap_sem);
 	}
 }
diff --git a/drivers/infiniband/hw/hfi1/user_sdma.c b/drivers/infiniband/hw/hfi1/user_sdma.c
index a53edb9..bf55a41 100644
--- a/drivers/infiniband/hw/hfi1/user_sdma.c
+++ b/drivers/infiniband/hw/hfi1/user_sdma.c
@@ -277,7 +277,7 @@ static inline void pq_update(struct hfi1_user_sdma_pkt_q *);
 static void user_sdma_free_request(struct user_sdma_request *, bool);
 static int pin_vector_pages(struct user_sdma_request *,
 			    struct user_sdma_iovec *);
-static void unpin_vector_pages(struct page **, unsigned);
+static void unpin_vector_pages(struct mm_struct *, struct page **, unsigned);
 static int check_header_template(struct user_sdma_request *,
 				 struct hfi1_pkt_header *, u32, u32);
 static int set_txreq_header(struct user_sdma_request *,
@@ -1072,7 +1072,7 @@ static int pin_vector_pages(struct user_sdma_request *req,
 			goto bail;
 		}
 		if (pinned != npages) {
-			unpin_vector_pages(pages, pinned);
+			unpin_vector_pages(current->mm, pages, pinned);
 			ret = -EFAULT;
 			goto bail;
 		}
@@ -1097,9 +1097,10 @@ bail:
 	return ret;
 }
 
-static void unpin_vector_pages(struct page **pages, unsigned npages)
+static void unpin_vector_pages(struct mm_struct *mm, struct page **pages,
+			       unsigned npages)
 {
-	hfi1_release_user_pages(pages, npages, 0);
+	hfi1_release_user_pages(mm, pages, npages, 0);
 	kfree(pages);
 }
 
@@ -1502,8 +1503,14 @@ static void sdma_rb_remove(struct rb_root *root, struct mmu_rb_node *mnode,
 	struct sdma_mmu_node *node =
 		container_of(mnode, struct sdma_mmu_node, rb);
 
-	if (!notifier)
-		unpin_vector_pages(node->pages, node->npages);
+	unpin_vector_pages(notifier ? NULL : current->mm, node->pages,
+			   node->npages);
+	/*
+	 * If called by the MMU notifier, we have to adjust the pinned
+	 * page count ourselves.
+	 */
+	if (notifier)
+		current->mm->pinned_vm -= node->npages;
 	kfree(node);
 }
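
Schematically, the two cases at the call sites above (a sketch using
the patch's own signature):

	/* Normal path: pinned-page accounting is adjusted under mmap_sem. */
	hfi1_release_user_pages(current->mm, pages, npages, false);

	/*
	 * MMU-notifier path: mmap_sem is already held, so pass NULL and
	 * adjust pinned_vm in the caller (as sdma_rb_remove() does above).
	 */
	hfi1_release_user_pages(NULL, pages, npages, false);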
 


* [PATCH 15/16] IB/hfi1: Switch to using the pin query function
From: Dennis Dalessandro @ 2016-03-08 19:15 UTC (permalink / raw)
  To: dledford-H+wXaHxf7aLQT0dZR+AlfA
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Mitko Haralanov, Dean Luick,
	Jubin John

From: Mitko Haralanov <mitko.haralanov-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>

Use the new function to query whether the expected receive
user buffer can be pinned successfully. This requires adding
a new variable to the hfi1_filedata structure to hold the
number of pages pinned by the expected receive code.

Reviewed-by: Dennis Dalessandro <dennis.dalessandro-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Reviewed-by: Dean Luick <dean.luick-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Signed-off-by: Mitko Haralanov <mitko.haralanov-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Signed-off-by: Jubin John <jubin.john-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
---
 drivers/infiniband/hw/hfi1/hfi.h          |    1 +
 drivers/infiniband/hw/hfi1/user_exp_rcv.c |    8 +++++++-
 2 files changed, 8 insertions(+), 1 deletions(-)

diff --git a/drivers/infiniband/hw/hfi1/hfi.h b/drivers/infiniband/hw/hfi1/hfi.h
index 3dc644d..16cbdc4 100644
--- a/drivers/infiniband/hw/hfi1/hfi.h
+++ b/drivers/infiniband/hw/hfi1/hfi.h
@@ -1190,6 +1190,7 @@ struct hfi1_filedata {
 	struct hfi1_user_sdma_pkt_q *pq;
 	/* for cpu affinity; -1 if none */
 	int rec_cpu_num;
+	u32 tid_n_pinned;
 	struct rb_root tid_rb_root;
 	struct tid_rb_node **entry_to_rb;
 	spinlock_t tid_lock; /* protect tid_[limit,used] counters */
diff --git a/drivers/infiniband/hw/hfi1/user_exp_rcv.c b/drivers/infiniband/hw/hfi1/user_exp_rcv.c
index 591605a..0861e09 100644
--- a/drivers/infiniband/hw/hfi1/user_exp_rcv.c
+++ b/drivers/infiniband/hw/hfi1/user_exp_rcv.c
@@ -396,11 +396,14 @@ int hfi1_user_exp_rcv_setup(struct file *fp, struct hfi1_tid_info *tinfo)
 	 * pages, accept the amount pinned so far and program only that.
 	 * User space knows how to deal with partially programmed buffers.
 	 */
+	if (!hfi1_can_pin_pages(dd, fd->tid_n_pinned, npages))
+		return -ENOMEM;
 	pinned = hfi1_acquire_user_pages(vaddr, npages, true, pages);
 	if (pinned <= 0) {
 		ret = pinned;
 		goto bail;
 	}
+	fd->tid_n_pinned += npages;
 
 	/* Find sets of physically contiguous pages */
 	npagesets = find_phys_blocks(pages, pinned, pagesets);
@@ -549,10 +552,12 @@ nomem:
 	 * If not everything was mapped (due to insufficient RcvArray entries,
 	 * for example), unpin all unmapped pages so we can pin them next time.
 	 */
-	if (mapped_pages != pinned)
+	if (mapped_pages != pinned) {
 		hfi1_release_user_pages(current->mm, &pages[mapped_pages],
 					pinned - mapped_pages,
 					false);
+		fd->tid_n_pinned -= pinned - mapped_pages;
+	}
 bail:
 	kfree(pagesets);
 	kfree(pages);
@@ -924,6 +929,7 @@ static void clear_tid_node(struct hfi1_filedata *fd, u16 subctxt,
 	pci_unmap_single(dd->pcidev, node->dma_addr, node->mmu.len,
 			 PCI_DMA_FROMDEVICE);
 	hfi1_release_user_pages(current->mm, node->pages, node->npages, true);
+	fd->tid_n_pinned -= node->npages;
 
 	node->grp->used--;
 	node->grp->map &= ~(1 << (node->rcventry - node->grp->base));

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH 16/16] IB/hfi1: Add SDMA cache eviction algorithm
       [not found] ` <20160308191210.30542.91885.stgit-9QXIwq+3FY+1XWohqUldA0EOCMrvLtNR@public.gmane.org>
                     ` (14 preceding siblings ...)
  2016-03-08 19:15   ` [PATCH 15/16] IB/hfi1: Switch to using the pin query function Dennis Dalessandro
@ 2016-03-08 19:15   ` Dennis Dalessandro
  2016-03-08 20:56   ` [PATCH 00/16] IB/hfi1: Add a page pinning cache for PSM sdma Or Gerlitz
  16 siblings, 0 replies; 36+ messages in thread
From: Dennis Dalessandro @ 2016-03-08 19:15 UTC (permalink / raw)
  To: dledford-H+wXaHxf7aLQT0dZR+AlfA
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Mitko Haralanov, Dean Luick,
	Jubin John

From: Mitko Haralanov <mitko.haralanov-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>

This commit adds a cache eviction algorithm for the SDMA
user buffer cache.

Besides the interval RB tree used for node lookup, the cache
nodes are also arranged in a doubly-linked list. When a node is
used, it is moved to the head of the list. Less recently used
nodes naturally drift toward the tail of the list.

When the cache limit is reached, the eviction code traverses
the linked list in reverse, freeing buffers until enough space
has been freed to fit the new user buffer. This guarantees that
only the least recently used cache nodes are removed from the
cache.

Reviewed-by: Dennis Dalessandro <dennis.dalessandro-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Reviewed-by: Dean Luick <dean.luick-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Signed-off-by: Mitko Haralanov <mitko.haralanov-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Signed-off-by: Jubin John <jubin.john-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
---
 drivers/infiniband/hw/hfi1/user_sdma.c |   61 +++++++++++++++++++++++++++++++-
 drivers/infiniband/hw/hfi1/user_sdma.h |    3 ++
 2 files changed, 62 insertions(+), 2 deletions(-)

diff --git a/drivers/infiniband/hw/hfi1/user_sdma.c b/drivers/infiniband/hw/hfi1/user_sdma.c
index bf55a41..46e254d 100644
--- a/drivers/infiniband/hw/hfi1/user_sdma.c
+++ b/drivers/infiniband/hw/hfi1/user_sdma.c
@@ -183,6 +183,8 @@ struct user_sdma_iovec {
 
 struct sdma_mmu_node {
 	struct mmu_rb_node rb;
+	struct list_head list;
+	struct hfi1_user_sdma_pkt_q *pq;
 	atomic_t refcount;
 	struct page **pages;
 	unsigned npages;
@@ -397,6 +399,8 @@ int hfi1_user_sdma_alloc_queues(struct hfi1_ctxtdata *uctxt, struct file *fp)
 	atomic_set(&pq->n_reqs, 0);
 	init_waitqueue_head(&pq->wait);
 	pq->sdma_rb_root = RB_ROOT;
+	INIT_LIST_HEAD(&pq->evict);
+	spin_lock_init(&pq->evict_lock);
 
 	iowait_init(&pq->busy, 0, NULL, defer_packet_queue,
 		    activate_packet_queue, NULL);
@@ -1027,9 +1031,33 @@ static inline int num_user_pages(const struct iovec *iov)
 	return 1 + ((epage - spage) >> PAGE_SHIFT);
 }
 
+/* Caller must hold pq->evict_lock */
+static u32 sdma_cache_evict(struct hfi1_user_sdma_pkt_q *pq, u32 npages)
+{
+	u32 cleared = 0;
+	struct sdma_mmu_node *node, *ptr;
+
+	list_for_each_entry_safe_reverse(node, ptr, &pq->evict, list) {
+		/* Make sure that no one is still using the node. */
+		if (!atomic_read(&node->refcount)) {
+			/*
+			 * Need to use the page count now as the remove callback
+			 * will free the node.
+			 */
+			cleared += node->npages;
+			spin_unlock(&pq->evict_lock);
+			hfi1_mmu_rb_remove(&pq->sdma_rb_root, &node->rb);
+			spin_lock(&pq->evict_lock);
+			if (cleared >= npages)
+				break;
+		}
+	}
+	return cleared;
+}
+
 static int pin_vector_pages(struct user_sdma_request *req,
 			    struct user_sdma_iovec *iovec) {
-	int ret = 0, pinned, npages;
+	int ret = 0, pinned, npages, cleared;
 	struct page **pages;
 	struct hfi1_user_sdma_pkt_q *pq = req->pq;
 	struct sdma_mmu_node *node = NULL;
@@ -1048,7 +1076,9 @@ static int pin_vector_pages(struct user_sdma_request *req,
 
 		node->rb.addr = (unsigned long)iovec->iov.iov_base;
 		node->rb.len = iovec->iov.iov_len;
+		node->pq = pq;
 		atomic_set(&node->refcount, 0);
+		INIT_LIST_HEAD(&node->list);
 	}
 
 	npages = num_user_pages(&iovec->iov);
@@ -1062,6 +1092,14 @@ static int pin_vector_pages(struct user_sdma_request *req,
 		memcpy(pages, node->pages, node->npages * sizeof(*pages));
 
 		npages -= node->npages;
+retry:
+		if (!hfi1_can_pin_pages(pq->dd, pq->n_locked, npages)) {
+			spin_lock(&pq->evict_lock);
+			cleared = sdma_cache_evict(pq, npages);
+			spin_unlock(&pq->evict_lock);
+			if (cleared >= npages)
+				goto retry;
+		}
 		pinned = hfi1_acquire_user_pages(
 			((unsigned long)iovec->iov.iov_base +
 			 (node->npages * PAGE_SIZE)), npages, 0,
@@ -1080,13 +1118,27 @@ static int pin_vector_pages(struct user_sdma_request *req,
 		node->pages = pages;
 		node->npages += pinned;
 		npages = node->npages;
+		spin_lock(&pq->evict_lock);
+		if (!rb_node)
+			list_add(&node->list, &pq->evict);
+		else
+			list_move(&node->list, &pq->evict);
+		pq->n_locked += pinned;
+		spin_unlock(&pq->evict_lock);
 	}
 	iovec->pages = node->pages;
 	iovec->npages = npages;
 
 	if (!rb_node) {
-		if (hfi1_mmu_rb_insert(&req->pq->sdma_rb_root, &node->rb))
+		ret = hfi1_mmu_rb_insert(&req->pq->sdma_rb_root, &node->rb);
+		if (ret) {
+			spin_lock(&pq->evict_lock);
+			list_del(&node->list);
+			pq->n_locked -= node->npages;
+			spin_unlock(&pq->evict_lock);
+			ret = 0;
 			goto bail;
+		}
 	} else {
 		atomic_inc(&node->refcount);
 	}
@@ -1503,6 +1555,11 @@ static void sdma_rb_remove(struct rb_root *root, struct mmu_rb_node *mnode,
 	struct sdma_mmu_node *node =
 		container_of(mnode, struct sdma_mmu_node, rb);
 
+	spin_lock(&node->pq->evict_lock);
+	list_del(&node->list);
+	node->pq->n_locked -= node->npages;
+	spin_unlock(&node->pq->evict_lock);
+
 	unpin_vector_pages(notifier ? NULL : current->mm, node->pages,
 			   node->npages);
 	/*
diff --git a/drivers/infiniband/hw/hfi1/user_sdma.h b/drivers/infiniband/hw/hfi1/user_sdma.h
index 39866b5..b9240e3 100644
--- a/drivers/infiniband/hw/hfi1/user_sdma.h
+++ b/drivers/infiniband/hw/hfi1/user_sdma.h
@@ -68,6 +68,9 @@ struct hfi1_user_sdma_pkt_q {
 	wait_queue_head_t wait;
 	unsigned long unpinned;
 	struct rb_root sdma_rb_root;
+	u32 n_locked;
+	struct list_head evict;
+	spinlock_t evict_lock; /* protect evict and n_locked */
 };
 
 struct hfi1_user_sdma_comp_q {

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* Re: [PATCH 00/16] IB/hfi1: Add a page pinning cache for PSM sdma
       [not found] ` <20160308191210.30542.91885.stgit-9QXIwq+3FY+1XWohqUldA0EOCMrvLtNR@public.gmane.org>
                     ` (15 preceding siblings ...)
  2016-03-08 19:15   ` [PATCH 16/16] IB/hfi1: Add SDMA cache eviction algorithm Dennis Dalessandro
@ 2016-03-08 20:56   ` Or Gerlitz
       [not found]     ` <CAJ3xEMj8gz6Xw1bqA677fC8=U2ktLQ6b4W20SeNrztN7u6UKZg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  16 siblings, 1 reply; 36+ messages in thread
From: Or Gerlitz @ 2016-03-08 20:56 UTC (permalink / raw)
  To: Dennis Dalessandro, Doug Ledford; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA

On Tue, Mar 8, 2016 at 9:14 PM, Dennis Dalessandro
<dennis.dalessandro-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org> wrote:

> This patch series adds a performance improvement to user SDMA transfers from
> PSM applications by caching user buffer pages after pinning them. Subsequent
> uses of the same user buffer will not incur the cost of pinning the same pages
> again. Additionally, the cost of unpinning the same pages is avoided.  Buffer
> pages are unpinned when the context is torn down or when the driver determines
> that the buffer should be evicted from the cache. Cache evictions happen when
> there is a request for a new, uncached buffer and the current size of the cache
> has reached a pre-defined limit.

If indeed there's a need for such a pin-down cache, it should have
been implemented in the IB core, b/c both the problem and the solution
are generic and have nothing to do with a particular device driver.

The problem is that you are bypassing the IB core altogether with a
proprietary character device in the hfi1 driver, which managed to sneak
in even in the presence of reviewer comments during the submission.

Doug, what do you suggest here?


Or.

> Mitko Haralanov (16):
>       IB/hfi1: Re-factor MMU notification code
>       IB/hfi1: Allow MMU function execution in IRQ context
>       IB/hfi1: Prevent NULL pointer dereference
>       IB/hfi1: Allow remove MMU callbacks to free nodes
>       IB/hfi1: Remove the use of add/remove RB function pointers
>       IB/hfi1: Notify remove MMU/RB callback of calling context
>       IB/hfi1: Use interval RB trees
>       IB/hfi1: Add MMU tracing
>       IB/hfi1: Remove compare callback
>       IB/hfi1: Add filter callback
>       IB/hfi1: Adjust last address values for intervals
>       IB/hfi1: Implement SDMA-side buffer caching
>       IB/hfi1: Add pin query function
>       IB/hfi1: Specify mm when releasing pages
>       IB/hfi1: Switch to using the pin query function
>       IB/hfi1: Add SDMA cache eviction algorithm
>
>
>  drivers/infiniband/hw/hfi1/Makefile       |    2
>  drivers/infiniband/hw/hfi1/file_ops.c     |    1
>  drivers/infiniband/hw/hfi1/hfi.h          |   16 +
>  drivers/infiniband/hw/hfi1/mmu_rb.c       |  292 +++++++++++++++++++++++
>  drivers/infiniband/hw/hfi1/mmu_rb.h       |   73 ++++++
>  drivers/infiniband/hw/hfi1/trace.c        |    1
>  drivers/infiniband/hw/hfi1/trace.h        |    1
>  drivers/infiniband/hw/hfi1/user_exp_rcv.c |  362 ++++++++---------------------
>  drivers/infiniband/hw/hfi1/user_pages.c   |   63 ++++-
>  drivers/infiniband/hw/hfi1/user_sdma.c    |  319 +++++++++++++++++---------
>  drivers/infiniband/hw/hfi1/user_sdma.h    |    4
>  11 files changed, 747 insertions(+), 387 deletions(-)
>  create mode 100644 drivers/infiniband/hw/hfi1/mmu_rb.c
>  create mode 100644 drivers/infiniband/hw/hfi1/mmu_rb.h
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 00/16] IB/hfi1: Add a page pinning cache for PSM sdma
       [not found]     ` <CAJ3xEMj8gz6Xw1bqA677fC8=U2ktLQ6b4W20SeNrztN7u6UKZg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2016-03-09  0:21       ` Dennis Dalessandro
       [not found]         ` <20160309002157.GA20105-W4f6Xiosr+yv7QzWx2u06xL4W9x8LtSr@public.gmane.org>
  0 siblings, 1 reply; 36+ messages in thread
From: Dennis Dalessandro @ 2016-03-09  0:21 UTC (permalink / raw)
  To: Or Gerlitz; +Cc: Doug Ledford, linux-rdma-u79uwXL29TY76Z2rM5mHXA

On Tue, Mar 08, 2016 at 10:56:48PM +0200, Or Gerlitz wrote:
>On Tue, Mar 8, 2016 at 9:14 PM, Dennis Dalessandro
><dennis.dalessandro-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org> wrote:
>
>> This patch series adds a performance improvement to user SDMA transfers from
>> PSM applications by caching user buffer pages after pinning them. Subsequent
>> uses of the same user buffer will not incur the cost of pinning the same pages
>> again. Additionally, the cost of unpinning the same pages is avoided.  Buffer
>> pages are unpinned when the context is torn down or when the driver determines
>> that the buffer should be evicted from the cache. Cache evictions happen when
>> there is a request for a new, uncached buffer and the current size of the cache
>> has reached a pre-defined limit.
>
>If indeed there's a need for such a pin down cache, it should have
>been implemented in the IB core, b/c both the problem and the solution
>are generic and have nothing to do with certain device driver.

Since this is for PSM, it has nothing to do with the IB core. I fail to see
what that would achieve.

>The problem is that you are bypassing the IB core altogether with a
>proprietary character device in the hfi1 driver which managed to snick
>in even in the presence of reviewer comments during the submission.

Yes there were reviewer comments, and the maintainer made a decision.
I would also not call this proprietary. It is open source after all:

See:  https://github.com/01org/opa-psm2

-Denny
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 00/16] IB/hfi1: Add a page pinning cache for PSM sdma
       [not found]         ` <20160309002157.GA20105-W4f6Xiosr+yv7QzWx2u06xL4W9x8LtSr@public.gmane.org>
@ 2016-03-09  5:07           ` Leon Romanovsky
       [not found]             ` <20160309050731.GM13396-2ukJVAZIZ/Y@public.gmane.org>
  2016-03-09 14:47           ` Or Gerlitz
  1 sibling, 1 reply; 36+ messages in thread
From: Leon Romanovsky @ 2016-03-09  5:07 UTC (permalink / raw)
  To: Dennis Dalessandro
  Cc: Or Gerlitz, Doug Ledford, linux-rdma-u79uwXL29TY76Z2rM5mHXA

On Tue, Mar 08, 2016 at 07:21:58PM -0500, Dennis Dalessandro wrote:
> On Tue, Mar 08, 2016 at 10:56:48PM +0200, Or Gerlitz wrote:
> >On Tue, Mar 8, 2016 at 9:14 PM, Dennis Dalessandro
> ><dennis.dalessandro-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org> wrote:
> >
> >>This patch series adds a performance improvement to user SDMA transfers from
> >>PSM applications by caching user buffer pages after pinning them. Subsequent
> >>uses of the same user buffer will not incur the cost of pinning the same pages
> >>again. Additionally, the cost of unpinning the same pages is avoided.  Buffer
> >>pages are unpinned when the context is torn down or when the driver determines
> >>that the buffer should be evicted from the cache. Cache evictions happen when
> >>there is a request for a new, uncached buffer and the current size of the cache
> >>has reached a pre-defined limit.
> >
> >If indeed there's a need for such a pin down cache, it should have
> >been implemented in the IB core, b/c both the problem and the solution
> >are generic and have nothing to do with certain device driver.
> 
> Since this is for psm it has nothing to do with the IB core. I fail to see
> what that would achieve.

I tend to agree with Or; your proposal goes far beyond the driver code.
Do you expect that every driver which needs to support PSM will
implement (copy/paste) the same generic logic?
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 00/16] IB/hfi1: Add a page pinning cache for PSM sdma
       [not found]         ` <20160309002157.GA20105-W4f6Xiosr+yv7QzWx2u06xL4W9x8LtSr@public.gmane.org>
  2016-03-09  5:07           ` Leon Romanovsky
@ 2016-03-09 14:47           ` Or Gerlitz
       [not found]             ` <CAJ3xEMhcs-6vDzsNk9nJKoNxTWnfaUtDmNHhLp1EPyerjnw2Xg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  1 sibling, 1 reply; 36+ messages in thread
From: Or Gerlitz @ 2016-03-09 14:47 UTC (permalink / raw)
  To: Dennis Dalessandro; +Cc: Doug Ledford, linux-rdma-u79uwXL29TY76Z2rM5mHXA

On Wed, Mar 9, 2016 at 2:21 AM, Dennis Dalessandro
<dennis.dalessandro-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org> wrote:
> On Tue, Mar 08, 2016 at 10:56:48PM +0200, Or Gerlitz wrote:
>>

>> If indeed there's a need for such a pin down cache, it should have
>> been implemented in the IB core, b/c both the problem and the solution
>> are generic and have nothing to do with certain device driver.

> Since this is for psm it has nothing to do with the IB core. I fail to see
> what that would achieve.

So your logic goes like this:

1. hfi is a driver which plugs into the IB core.

2. OTOH hfi implements a proprietary interconnect, but we can use the
IB core for fabric management, so we change the IB core MAD code to
support the interconnect.

3. OTOH the maintainer allowed hfi to expose a proprietary char device
to user space.

4. So from now on, we can patch hfi with 300 patches per release that
deal with extensions to functionality exposed by the char device, and
we don't care to solve generic problems that have nothing to do with
the HW used for the device.

Really?

>> The problem is that you are bypassing the IB core altogether with a
>> proprietary character device in the hfi1 driver which managed to snick
>> in even in the presence of reviewer comments during the submission.

> Yes there were reviewer comments, and the maintainer made a decision.
> I would also not call this proprietary. It is open source after all:
>
> See:  https://github.com/01org/opa-psm2
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 00/16] IB/hfi1: Add a page pinning cache for PSM sdma
       [not found]             ` <CAJ3xEMhcs-6vDzsNk9nJKoNxTWnfaUtDmNHhLp1EPyerjnw2Xg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2016-03-09 15:19               ` Dennis Dalessandro
  0 siblings, 0 replies; 36+ messages in thread
From: Dennis Dalessandro @ 2016-03-09 15:19 UTC (permalink / raw)
  To: Or Gerlitz; +Cc: Doug Ledford, linux-rdma-u79uwXL29TY76Z2rM5mHXA

On Wed, Mar 09, 2016 at 04:47:04PM +0200, Or Gerlitz wrote:
>4. so from now on, we can patch hfi with 300 patches per release that
>deal with extensions
>to functionality exposed by the char device  and we don't care to
>solve generic problems
>that has nothing to do with the HW used for the device.
>
>Really?

There is certainly a large number of patches pending for 4.6, I'll give you
that, and yes it is greater than 300. However, these cover 3 drivers: qib,
hfi1, and rdmavt.  The bulk of that work is the refactoring of the verbs
code from qib and hfi1 into rdmavt, which was done in a very incremental
process.

The vast majority of these pending patches are not related to "extensions to 
functionality exposed by the char device".

In fact, I'd say that the verbs refactoring is precisely solving a generic 
problem.  So I can't agree with you here.

-Denny
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 00/16] IB/hfi1: Add a page pinning cache for PSM sdma
       [not found]             ` <20160309050731.GM13396-2ukJVAZIZ/Y@public.gmane.org>
@ 2016-03-09 19:21               ` Dennis Dalessandro
       [not found]                 ` <20160309192109.GB15031-W4f6Xiosr+yv7QzWx2u06xL4W9x8LtSr@public.gmane.org>
  0 siblings, 1 reply; 36+ messages in thread
From: Dennis Dalessandro @ 2016-03-09 19:21 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Or Gerlitz, Doug Ledford, linux-rdma-u79uwXL29TY76Z2rM5mHXA

On Wed, Mar 09, 2016 at 07:07:31AM +0200, Leon Romanovsky wrote:
>On Tue, Mar 08, 2016 at 07:21:58PM -0500, Dennis Dalessandro wrote:
>> On Tue, Mar 08, 2016 at 10:56:48PM +0200, Or Gerlitz wrote:
>> >On Tue, Mar 8, 2016 at 9:14 PM, Dennis Dalessandro
>> ><dennis.dalessandro-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org> wrote:
>> >
>> >>This patch series adds a performance improvement to user SDMA transfers from
>> >>PSM applications by caching user buffer pages after pinning them. Subsequent
>> >>uses of the same user buffer will not incur the cost of pinning the same pages
>> >>again. Additionally, the cost of unpinning the same pages is avoided.  Buffer
>> >>pages are unpinned when the context is torn down or when the driver determines
>> >>that the buffer should be evicted from the cache. Cache evictions happen when
>> >>there is a request for a new, uncached buffer and the current size of the cache
>> >>has reached a pre-defined limit.
>> >
>> >If indeed there's a need for such a pin down cache, it should have
>> >been implemented in the IB core, b/c both the problem and the solution
>> >are generic and have nothing to do with certain device driver.
>> 
>> Since this is for psm it has nothing to do with the IB core. I fail to see
>> what that would achieve.
>
>I tend to agree with Or, your proposal goes far beyond the driver code.
>Do you expect that all drivers which will be needed to support PSM will
>implement (copy/paste) the same generic logic?

We should not be putting something which is specific to PSM in the IB core.  
We feel the driver is the most appropriate place for this.

-Denny
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 00/16] IB/hfi1: Add a page pinning cache for PSM sdma
       [not found]                 ` <20160309192109.GB15031-W4f6Xiosr+yv7QzWx2u06xL4W9x8LtSr@public.gmane.org>
@ 2016-03-10  4:56                   ` Leon Romanovsky
       [not found]                     ` <20160310045609.GC1977-2ukJVAZIZ/Y@public.gmane.org>
  2016-03-10  7:11                   ` Or Gerlitz
  1 sibling, 1 reply; 36+ messages in thread
From: Leon Romanovsky @ 2016-03-10  4:56 UTC (permalink / raw)
  To: Dennis Dalessandro
  Cc: Or Gerlitz, Doug Ledford, linux-rdma-u79uwXL29TY76Z2rM5mHXA

On Wed, Mar 09, 2016 at 02:21:10PM -0500, Dennis Dalessandro wrote:
> On Wed, Mar 09, 2016 at 07:07:31AM +0200, Leon Romanovsky wrote:
> >On Tue, Mar 08, 2016 at 07:21:58PM -0500, Dennis Dalessandro wrote:
> >>On Tue, Mar 08, 2016 at 10:56:48PM +0200, Or Gerlitz wrote:
> >>>On Tue, Mar 8, 2016 at 9:14 PM, Dennis Dalessandro
> >>><dennis.dalessandro-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org> wrote:
> >>>
> >>>>This patch series adds a performance improvement to user SDMA transfers from
> >>>>PSM applications by caching user buffer pages after pinning them. Subsequent
> >>>>uses of the same user buffer will not incur the cost of pinning the same pages
> >>>>again. Additionally, the cost of unpinning the same pages is avoided.  Buffer
> >>>>pages are unpinned when the context is torn down or when the driver determines
> >>>>that the buffer should be evicted from the cache. Cache evictions happen when
> >>>>there is a request for a new, uncached buffer and the current size of the cache
> >>>>has reached a pre-defined limit.
> >>>
> >>>If indeed there's a need for such a pin down cache, it should have
> >>>been implemented in the IB core, b/c both the problem and the solution
> >>>are generic and have nothing to do with certain device driver.
> >>
> >>Since this is for psm it has nothing to do with the IB core. I fail to see
> >>what that would achieve.
> >
> >I tend to agree with Or, your proposal goes far beyond the driver code.
> >Do you expect that all drivers which will be needed to support PSM will
> >implement (copy/paste) the same generic logic?
> 
> We should not be putting something which is specific to PSM in the IB core.
> We feel the driver is the most appropriate place for this.

To use the cache logic - yes.
To implement the cache logic - no.

Can you share with us the technical justification for implementing a
general memory-management cache mechanism in the driver and not in a
more appropriate layer (IB core and/or MM)?

Thanks.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 00/16] IB/hfi1: Add a page pinning cache for PSM sdma
       [not found]                 ` <20160309192109.GB15031-W4f6Xiosr+yv7QzWx2u06xL4W9x8LtSr@public.gmane.org>
  2016-03-10  4:56                   ` Leon Romanovsky
@ 2016-03-10  7:11                   ` Or Gerlitz
  1 sibling, 0 replies; 36+ messages in thread
From: Or Gerlitz @ 2016-03-10  7:11 UTC (permalink / raw)
  To: Dennis Dalessandro
  Cc: Leon Romanovsky, Doug Ledford, linux-rdma-u79uwXL29TY76Z2rM5mHXA

On Wed, Mar 9, 2016 at 9:21 PM, Dennis Dalessandro
<dennis.dalessandro-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org> wrote:

> We should not be putting something which is specific to PSM in the IB core.
> We feel the driver is the most appropriate place for this.


Actually, the cache itself belongs in user space.

What is needed in the kernel is code that uses MMU notifications to
invalidate the cache when munmap and friends are called on a cache
entry. That is generic functionality which you could add to the IB
core, which does have a char device, or (better) to a generic char
device that you would write and PSM would use. There was some work on
this in the 2000s; look for code from Roland Dreier and/or Pete
Wyckoff.
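
To make that concrete, here is a minimal sketch of such a notifier
hook against the 4.5-era mmu_notifier API (struct psm_cache and
psm_cache_invalidate() are hypothetical names, not existing code):

#include <linux/mmu_notifier.h>

/* hypothetical cache object embedding the notifier */
struct psm_cache {
        struct mmu_notifier mn;
        /* ... interval tree of cached, pinned user buffers ... */
};

/* stand-in: drop cache entries overlapping [start, end), unpin
 * their pages, and queue an invalidation event for user space */
static void psm_cache_invalidate(struct psm_cache *cache,
                                 unsigned long start, unsigned long end)
{
}

static void psm_invalidate_range_start(struct mmu_notifier *mn,
                                       struct mm_struct *mm,
                                       unsigned long start,
                                       unsigned long end)
{
        struct psm_cache *cache = container_of(mn, struct psm_cache, mn);

        /* munmap() and friends are tearing down [start, end) */
        psm_cache_invalidate(cache, start, end);
}

static const struct mmu_notifier_ops psm_mn_ops = {
        .invalidate_range_start = psm_invalidate_range_start,
};

/* registration would be: cache->mn.ops = &psm_mn_ops;
 * mmu_notifier_register(&cache->mn, current->mm); */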

Or.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 00/16] IB/hfi1: Add a page pinning cache for PSM sdma
       [not found]                     ` <20160310045609.GC1977-2ukJVAZIZ/Y@public.gmane.org>
@ 2016-03-14  7:09                       ` Leon Romanovsky
       [not found]                         ` <20160314070951.GB26456-2ukJVAZIZ/Y@public.gmane.org>
  0 siblings, 1 reply; 36+ messages in thread
From: Leon Romanovsky @ 2016-03-14  7:09 UTC (permalink / raw)
  To: Doug Ledford, Or Gerlitz, Dennis Dalessandro
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA

On Thu, Mar 10, 2016 at 06:56:09AM +0200, Leon Romanovsky wrote:
> On Wed, Mar 09, 2016 at 02:21:10PM -0500, Dennis Dalessandro wrote:
> > On Wed, Mar 09, 2016 at 07:07:31AM +0200, Leon Romanovsky wrote:
> > >On Tue, Mar 08, 2016 at 07:21:58PM -0500, Dennis Dalessandro wrote:
> > >>On Tue, Mar 08, 2016 at 10:56:48PM +0200, Or Gerlitz wrote:
> > >>>On Tue, Mar 8, 2016 at 9:14 PM, Dennis Dalessandro
> > >>><dennis.dalessandro-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org> wrote:
> > >>>
> > >>>>This patch series adds a performance improvement to user SDMA transfers from
> > >>>>PSM applications by caching user buffer pages after pinning them. Subsequent
> > >>>>uses of the same user buffer will not incur the cost of pinning the same pages
> > >>>>again. Additionally, the cost of unpinning the same pages is avoided.  Buffer
> > >>>>pages are unpinned when the context is torn down or when the driver determines
> > >>>>that the buffer should be evicted from the cache. Cache evictions happen when
> > >>>>there is a request for a new, uncached buffer and the current size of the cache
> > >>>>has reached a pre-defined limit.
> > >>>
> > >>>If indeed there's a need for such a pin down cache, it should have
> > >>>been implemented in the IB core, b/c both the problem and the solution
> > >>>are generic and have nothing to do with certain device driver.
> > >>
> > >>Since this is for psm it has nothing to do with the IB core. I fail to see
> > >>what that would achieve.
> > >
> > >I tend to agree with Or, your proposal goes far beyond the driver code.
> > >Do you expect that all drivers which will be needed to support PSM will
> > >implement (copy/paste) the same generic logic?
> > 
> > We should not be putting something which is specific to PSM in the IB core.
> > We feel the driver is the most appropriate place for this.
> 
> To use cache logic - yes
> To implement cache logic - no
> 
> Can you share with us the technical justification of implementing general
> memory management cache mechanism in driver and not in appropriate layer
> (IB/core and/or MM)?

Hi Doug,
I saw that you picked up these patches and they are now in your GitHub
repository. Since the original author probably missed my and Or's
questions, it would be great if you could share with us your technical
view on the topic.

In addition to my question regarding the appropriate layer (IB core or
MM), I see that it could also be implemented natively in the user-space
layer.

Thanks.

> 
> Thanks.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 00/16] IB/hfi1: Add a page pinning cache for PSM sdma
       [not found]                         ` <20160314070951.GB26456-2ukJVAZIZ/Y@public.gmane.org>
@ 2016-03-14 12:01                           ` Dennis Dalessandro
       [not found]                             ` <20160314120152.GA30838-W4f6Xiosr+yv7QzWx2u06xL4W9x8LtSr@public.gmane.org>
  0 siblings, 1 reply; 36+ messages in thread
From: Dennis Dalessandro @ 2016-03-14 12:01 UTC (permalink / raw)
  To: Doug Ledford, Or Gerlitz, linux-rdma-u79uwXL29TY76Z2rM5mHXA

On Mon, Mar 14, 2016 at 09:09:51AM +0200, Leon Romanovsky wrote:
>Hi Doug,
>I saw that you picked these patches and it is now in your github
>repository. Since the original author probably missed my and Or's questions, It
>will be great if you can share with us your technical view on the topic.
>
>In addition to my question regarding appropriate layer - IB/core. MM, I see that
>it can be natively implemented in user-space layer too.

Leon,

The original author did not miss your questions. When I say "we" or "our" 
you can be assured that includes the original author.

Our position continues to be that this is best located in the hfi1 driver.
If something else comes along that can make use of the same/similar 
functionality we can then move it to the core or elsewhere, once we have a 
better understanding of the use cases and how best to generalize the code. 

-Denny
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 00/16] IB/hfi1: Add a page pinning cache for PSM sdma
       [not found]                             ` <20160314120152.GA30838-W4f6Xiosr+yv7QzWx2u06xL4W9x8LtSr@public.gmane.org>
@ 2016-03-14 12:27                               ` Leon Romanovsky
  2016-03-14 21:19                               ` Or Gerlitz
  1 sibling, 0 replies; 36+ messages in thread
From: Leon Romanovsky @ 2016-03-14 12:27 UTC (permalink / raw)
  To: Dennis Dalessandro
  Cc: Doug Ledford, Or Gerlitz, linux-rdma-u79uwXL29TY76Z2rM5mHXA

On Mon, Mar 14, 2016 at 08:01:53AM -0400, Dennis Dalessandro wrote:
> On Mon, Mar 14, 2016 at 09:09:51AM +0200, Leon Romanovsky wrote:
> >Hi Doug,
> >I saw that you picked these patches and it is now in your github
> >repository. Since the original author probably missed my and Or's questions, It
> >will be great if you can share with us your technical view on the topic.
> >
> >In addition to my question regarding appropriate layer - IB/core. MM, I see that
> >it can be natively implemented in user-space layer too.
> 
> Leon,
> 
> The original author did not miss your questions. When I say "we" or "our"
> you can be assured that includes the original author.
> 
> Our position continues to be that this is best located in the hfi1 driver.
> If something else comes along that can make use of the same/similar
> functionality we can then move it to the core or elsewhere, once we have a
> better understanding of the use cases and how best to generalize the code.

Denny,

There is no doubt that your understanding of this part of the code is
superior to mine. That is exactly why I am asking you and/or Doug to
share your knowledge and give the community the TECHNICAL reasons
why you placed the cache mechanism logic in the driver and not in
another, more appropriate place.

Thanks

> 
> -Denny
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 00/16] IB/hfi1: Add a page pinning cache for PSM sdma
       [not found]                             ` <20160314120152.GA30838-W4f6Xiosr+yv7QzWx2u06xL4W9x8LtSr@public.gmane.org>
  2016-03-14 12:27                               ` Leon Romanovsky
@ 2016-03-14 21:19                               ` Or Gerlitz
       [not found]                                 ` <CAJ3xEMiSgXFT46r0Y7DZGr3mpJFSLrtYrMdz+DbCXDozbMYZRA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  1 sibling, 1 reply; 36+ messages in thread
From: Or Gerlitz @ 2016-03-14 21:19 UTC (permalink / raw)
  To: Dennis Dalessandro, Doug Ledford
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Leon Romanovsky

On Mon, Mar 14, 2016 at 2:01 PM, Dennis Dalessandro
<dennis.dalessandro-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org> wrote:
> On Mon, Mar 14, 2016 at 09:09:51AM +0200, Leon Romanovsky wrote:
>>
>> Hi Doug,
>> I saw that you picked these patches and it is now in your github
>> repository. Since the original author probably missed my and Or's
>> questions, It
>> will be great if you can share with us your technical view on the topic.
>>
>> In addition to my question regarding appropriate layer - IB/core. MM, I
>> see that
>> it can be natively implemented in user-space layer too.

> The original author did not miss your questions. When I say "we" or "our"
> you can be assured that includes the original author.

Just to make sure we're on the same page(...) submitting code to the
upstream kernel is not send-and-forget; authors should respond to
questions posed on their patches, and that is very much not how you
are acting here.

> Our position continues to be that this is best located in the hfi1 driver.
> If something else comes along that can make use of the same/similar
> functionality we can then move it to the core or elsewhere, once we have a
> better understanding of the use cases and how best to generalize the code.

As I wrote you on this thread, you have all you need in user space,
except for MMU notifications for cache entries which were invalidated
by munmap, which indeed should originate from the kernel -- as such
the cache doesn't belong in the kernel at all. You didn't respond to
that, just ignored it, and now you are again talking about the cache
being in the core. Why?

Doug, as Dennis noted, you already picked ~300 patches for the 4.6
hfi1 driver from him. I don't see why you would pick this patch series
there too instead of letting us discuss it further. Thoughts?

Or.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 00/16] IB/hfi1: Add a page pinning cache for PSM sdma
       [not found]                                 ` <CAJ3xEMiSgXFT46r0Y7DZGr3mpJFSLrtYrMdz+DbCXDozbMYZRA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2016-03-15  1:20                                   ` Dennis Dalessandro
       [not found]                                     ` <20160315012049.GA30055-W4f6Xiosr+yv7QzWx2u06xL4W9x8LtSr@public.gmane.org>
  0 siblings, 1 reply; 36+ messages in thread
From: Dennis Dalessandro @ 2016-03-15  1:20 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: Doug Ledford, linux-rdma-u79uwXL29TY76Z2rM5mHXA, Leon Romanovsky

On Mon, Mar 14, 2016 at 11:19:53PM +0200, Or Gerlitz wrote:
>On Mon, Mar 14, 2016 at 2:01 PM, Dennis Dalessandro
><dennis.dalessandro-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org> wrote:
>> On Mon, Mar 14, 2016 at 09:09:51AM +0200, Leon Romanovsky wrote:
>>>
>>> Hi Doug,
>>> I saw that you picked these patches and it is now in your github
>>> repository. Since the original author probably missed my and Or's
>>> questions, It
>>> will be great if you can share with us your technical view on the topic.
>>>
>>> In addition to my question regarding appropriate layer - IB/core. MM, I
>>> see that
>>> it can be natively implemented in user-space layer too.
>
>> The original author did not miss your questions. When I say "we" or "our"
>> you can be assured that includes the original author.
>
>Just to make sure we're on the same page(...) submitting code to the
>upstream kernel is not send-and-forget, authors should respond to
>questions posed on their patches and this is very much not the case
>how you act here.
>
>> Our position continues to be that this is best located in the hfi1 driver.
>> If something else comes along that can make use of the same/similar
>> functionality we can then move it to the core or elsewhere, once we have a
>> better understanding of the use cases and how best to generalize the code.
>
>As I wrote your on this thread, you have all you need in user-space,
>expect for mmu notifications for cache entries which were invalidated
>by memunmaps, which indeed should originate from the kernel -- as such
>the cache doesn't belong to the kernel at all. You didn't respond on
>that, just ignored, and now again talking on the cache being in the
>core, why?

I said "core or elsewhere". Elsewhere may well include user space. That is 
just a different solution than what we have proposed. I agree that it is 
certainly one alternative. I'd like to see what Doug's opinion on the matter 
is.

>Doug, As Dennis noted, you already picked ~300 patches for 4.6 / hfi1
>driver from him. I don't see why picking this patch series there too
>and not letting us to further discuss it out, thoughts?
>
>Or.

I believe in taking an iterative approach to kernel development; if the code
doesn't cause problems, I don't see any harm in adding the series. If Doug
agrees with you and doesn't like this approach but is willing to let it
stand for now, I think the changes can be made in a subsequent release.

-Denny
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 00/16] IB/hfi1: Add a page pinning cache for PSM sdma
       [not found]                                     ` <20160315012049.GA30055-W4f6Xiosr+yv7QzWx2u06xL4W9x8LtSr@public.gmane.org>
@ 2016-03-15  7:03                                       ` Or Gerlitz
       [not found]                                         ` <CAJ3xEMhToYOBMo_nNi2fFpTexUmUqb3zBX7zWWYjCDfPWpWn-w-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 36+ messages in thread
From: Or Gerlitz @ 2016-03-15  7:03 UTC (permalink / raw)
  To: Dennis Dalessandro, Mitko Haralanov
  Cc: Doug Ledford, linux-rdma-u79uwXL29TY76Z2rM5mHXA, Leon Romanovsky,
	Christoph Lameter

On Tue, Mar 15, 2016 at 3:20 AM, Dennis Dalessandro
<dennis.dalessandro-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org> wrote:
> On Mon, Mar 14, 2016 at 11:19:53PM +0200, Or Gerlitz wrote:
>> On Mon, Mar 14, 2016 at 2:01 PM, Dennis Dalessandro
>> <dennis.dalessandro-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org> wrote:
>>> On Mon, Mar 14, 2016 at 09:09:51AM +0200, Leon Romanovsky wrote:
>>>>
>>>>
>>>> Hi Doug,
>>>> I saw that you picked these patches and it is now in your github
>>>> repository. Since the original author probably missed my and Or's
>>>> questions, It
>>>> will be great if you can share with us your technical view on the topic.
>>>>
>>>> In addition to my question regarding appropriate layer - IB/core. MM, I
>>>> see that
>>>> it can be natively implemented in user-space layer too.

>>> The original author did not miss your questions. When I say "we" or "our"
>>> you can be assured that includes the original author.

>> Just to make sure we're on the same page(...) submitting code to the
>> upstream kernel is not send-and-forget, authors should respond to
>> questions posed on their patches and this is very much not the case
>> how you act here.

>>> Our position continues to be that this is best located in the hfi1 driver.

>>> If something else comes along that can make use of the same/similar
>>> functionality we can then move it to the core or elsewhere, once we have a
>>> better understanding of the use cases and how best to generalize the code

>> As I wrote your on this thread, you have all you need in user-space,
>> expect for mmu notifications for cache entries which were invalidated
>> by memunmaps, which indeed should originate from the kernel -- as such
>> the cache doesn't belong to the kernel at all. You didn't respond on
>> that, just ignored, and now again talking on the cache being in the
>> core, why?

> I said "core or elsewhere". Elsewhere may well include user space. That is
> just a different solution than what we have proposed. I agree that it is
> certainly one alternative. I'd like to see what Doug's opinion on the matter is.

A reviewer gave you a comment that the code you're proposing does
not belong in the kernel and should reside in user space. You didn't
provide your reasoning on if/why you think otherwise; you just say

"I disagree and let's get the maintainer's opinion"

You are again very much wrong w.r.t. how proper open-source
development goes. The maintainer should give his opinion AFTER he
hears your counter-arguments. He doesn't have to make them for you.
By all means, you should be able to justify the design/code you're
proposing, and if you can't -- ask to put this submission aside and
go to HW.

This is a terribly improper approach and I would ask you to stop it, now.

>> Doug, As Dennis noted, you already picked ~300 patches for 4.6 / hfi1
>> driver from him. I don't see why picking this patch series there too
>> and not letting us to further discuss it out, thoughts?

> I believe in taking an iterative approach to kernel development, if the code
> doesn't cause problems I don't see any harm adding the series. If Doug
> agrees with you and doesn't like this approach but is willing to let it
> stand for now,  I think the changes can be done in a subsequent release.

An iterative approach doesn't mean we should pick code into the kernel
and throw it away later. There are rare times that happens, when people
were not able to agree on something for years and the maintainer says
enough is enough, we'll take it and see later.

But this is not the case here; your patches were proposed only in the
last month, and if you really believe in them, convince the reviewer
and get the series in for 4.7. Thoughts?

Or.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 00/16] IB/hfi1: Add a page pinning cache for PSM sdma
       [not found]                                         ` <CAJ3xEMhToYOBMo_nNi2fFpTexUmUqb3zBX7zWWYjCDfPWpWn-w-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2016-03-15 19:37                                           ` Dennis Dalessandro
       [not found]                                             ` <20160315193731.GA20662-W4f6Xiosr+yv7QzWx2u06xL4W9x8LtSr@public.gmane.org>
  0 siblings, 1 reply; 36+ messages in thread
From: Dennis Dalessandro @ 2016-03-15 19:37 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: Mitko Haralanov, Doug Ledford, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	Leon Romanovsky, Christoph Lameter

On Tue, Mar 15, 2016 at 09:03:49AM +0200, Or Gerlitz wrote:
>>> Doug, As Dennis noted, you already picked ~300 patches for 4.6 / hfi1
>>> driver from him. I don't see why picking this patch series there too
>>> and not letting us to further discuss it out, thoughts?

I forgot this in my last reply, but I did want to mention it again: we have
not submitted 300 patches for hfi1.  We have submitted 300+ patches across 3
different drivers.  Doug's GitHub branch says hfi1, but it contains changes
for hfi1, rdmavt, and qib.

>> I believe in taking an iterative approach to kernel development, if the code
>> doesn't cause problems I don't see any harm adding the series. If Doug
>> agrees with you and doesn't like this approach but is willing to let it
>> stand for now,  I think the changes can be done in a subsequent release.
>
>Iterative approach doesn't say we should pick code into the kernel and throw
>it away later. There are rare times it happens, when people were not able
>to agree on something for years and the maintainer say enough is enough,
>we'll take it and see later.

I never said anything about throwing it all away. My point is, we are talking
about a single driver and exactly one user of it.  It's not like we are
adding core kernel functionality here that impacts multiple drivers, or
having to worry about changing ABIs across a bunch of different users. We
have the freedom to fix and change things as needed.

Regardless, that stuff aside, I wanted to talk to some folks who are
more familiar with the inner workings of PSM than I am, so I appreciate the
patience while I gathered feedback from the interested parties.

So basically what happens in this series is that PSM calls into the kernel
to do an SDMA transfer, and the caching is a side effect that happens under
the covers.  We can see this from the code in this series. What may not be
so apparent is that the user library does not have to do anything to take
advantage of the cache. Further, it doesn't even have to know about the
cache, which is one of the appeals of this patch series. The other appeal,
and what trumps all, is performance.

From a performance standpoint, we don't want this in user space. If the
cache were at the user level, there would end up being a system call to pin
the buffer, another to do the SDMA transfer from said buffer, and another to
unpin the buffer when the kernel notifier has invalidated that buffer.  This
gets even more complex when you consider cache evictions due to reaching the
limit on pinned pages.

These extra system calls add overhead, which may be acceptable for a normal
server or desktop environment, but is not acceptable when dealing with a
high-performance interconnect where we want the absolute best performance.

I should also point out that these extra calls would be made on a per-buffer
basis, not per-context or per-application.
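
To illustrate the difference (the function names below are made up
for the sketch; they are not the real hfi1 ABI):

#include <stddef.h>

/* illustrative prototypes only -- not the real interface */
int pin_buffer(int fd, void *buf, size_t len);
int sdma_transfer(int fd, void *buf, size_t len);
int unpin_buffer(int fd, void *buf, size_t len);

/* user-space cache: up to three kernel crossings per buffer */
static void miss_path_user_cache(int fd, void *buf, size_t len)
{
        pin_buffer(fd, buf, len);    /* crossing 1: pin and register  */
        sdma_transfer(fd, buf, len); /* crossing 2: the transfer      */
        /* later, on eviction or an invalidation notification: */
        unpin_buffer(fd, buf, len);  /* crossing 3: release the pages */
}

/* in-kernel cache (this series): one crossing; pinning, caching,
 * and eviction all happen under the covers */
static void miss_path_kernel_cache(int fd, void *buf, size_t len)
{
        sdma_transfer(fd, buf, len);
}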

-Denny
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 00/16] IB/hfi1: Add a page pinning cache for PSM sdma
       [not found]                                             ` <20160315193731.GA20662-W4f6Xiosr+yv7QzWx2u06xL4W9x8LtSr@public.gmane.org>
@ 2016-03-16 15:53                                               ` Or Gerlitz
       [not found]                                                 ` <CAJ3xEMjWxdwM0wQR85_-wHpPhYSWz1W0T8H_nDjnp2JS7G29Qw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 36+ messages in thread
From: Or Gerlitz @ 2016-03-16 15:53 UTC (permalink / raw)
  To: Dennis Dalessandro
  Cc: Mitko Haralanov, Doug Ledford, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	Leon Romanovsky, Christoph Lameter

On Tue, Mar 15, 2016 at 9:37 PM, Dennis Dalessandro
<dennis.dalessandro-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org> wrote:
> On Tue, Mar 15, 2016 at 09:03:49AM +0200, Or Gerlitz wrote:

>> Iterative approach doesn't say we should pick code into the kernel and throw
>> it away later. There are rare times it happens, when people were not able
>> to agree on something for years and the maintainer say enough is enough,
>> we'll take it and see later.

> Never said anything about throwing it all away. My point is, we are talking
> about a single driver and exactly one user of it.  It's not like we are
> adding core kernel functionality here that impacts multiple drivers, or have
> to worry about changing ABIs across a bunch of different users. We have the
> freedom to fix and change things as needed.

The fact that there is one user of driver X doesn't justify functionality Y
residing in the kernel rather than in user space.


> Regardless that stuff aside, I wanted to talk to some folks who are
> more familiar with the inner workings of psm than I. So I appreciate the
> patience while I gathered the feedback from interested parties.
> So basically what happens in this series is that psm calls into the kernel
> to do an sdma transfer, the caching is a side effect that happens under the
> covers.  We can see this from the code in this series. What may not be so
> apparent is that the user library does not have to do anything to take
> advantage of the cache. Further it doesn't even have to know about the
> cache.  Which is one of the appeals of this patch series. The other, and
> what trumps all, is performance.

The fact that under your current design the user-space library is not involved
with the caching doesn't mean the correct place for the cache is the kernel.

> Taking a performance standpoint we don't want this in user space. If the
> cache were at the user level there would end up being a system call to pin
> the buffer, another to do the SDMA transfer from said buffer, and another to
> unpin the buffer when the kernel notifier has invalidated that buffer.  This
> gets even more complex when you consider cache evictions due to reaching the
> limit on pinned pages.
> These extra system calls add overhead, which may be acceptable for a normal
> server or desktop environment, but are not acceptable when dealing with a
> high performance interconnect where we want the absolute best performance.

> I should also point out that these extra calls would be done on a per-buffer
> basis, not per-context, or per-application.

The cache exists to amortize the cost of pin/unpin and such, correct? This
means that over long runs, if there's nice locality of the same process
using the same pages, you pay the register/pin cost once, use lazy
de-registration/unpinning afterwards, and only de-register when the MMU
notifier tells you this is a must, or under some eviction policy.

Since the cost is amortized, the system-call overhead should be negligible
(also, the same system call can be used to evict X and register Y). Do you
have performance data that shows otherwise?
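
As a back-of-the-envelope model of that amortization claim (the
numbers below are illustrative assumptions, not measurements):

#include <stdio.h>

int main(void)
{
        double pin_us = 50.0;  /* one-time cost to pin a buffer (assumed)  */
        double call_us = 0.3;  /* one extra user/kernel crossing (assumed) */
        int reuses = 1000;     /* transfers that hit the same buffer       */

        /* the pin cost, spread over every transfer that reuses it */
        printf("amortized pin: %.3f us/transfer\n", pin_us / reuses);
        /* the extra syscall is paid per transfer, but stays small */
        printf("extra syscall: %.3f us/transfer\n", call_us);
        return 0;
}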

Or.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 00/16] IB/hfi1: Add a page pinning cache for PSM sdma
       [not found]                                                 ` <CAJ3xEMjWxdwM0wQR85_-wHpPhYSWz1W0T8H_nDjnp2JS7G29Qw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2016-03-16 17:23                                                   ` Dennis Dalessandro
       [not found]                                                     ` <20160316172309.GB26530-W4f6Xiosr+yv7QzWx2u06xL4W9x8LtSr@public.gmane.org>
  0 siblings, 1 reply; 36+ messages in thread
From: Dennis Dalessandro @ 2016-03-16 17:23 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: Mitko Haralanov, Doug Ledford, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	Leon Romanovsky, Christoph Lameter

On Wed, Mar 16, 2016 at 05:53:37PM +0200, Or Gerlitz wrote:
>> Taking a performance standpoint we don't want this in user space. If the
>> cache were at the user level there would end up being a system call to pin
>> the buffer, another to do the SDMA transfer from said buffer, and another to
>> unpin the buffer when the kernel notifier has invalidated that buffer.  This
>> gets even more complex when you consider cache evictions due to reaching the
>> limit on pinned pages.
>> These extra system calls add overhead, which may be acceptable for a normal
>> server or desktop environment, but are not acceptable when dealing with a
>> high performance interconnect where we want the absolute best performance.
>
>> I should also point out that these extra calls would be done on a per-buffer
>> basis, not per-context, or per-application.
>
>The cache comes to amortize the cost of pin/unpin and such, correct? this means
>that over longs runs, if there's nice locality of the same process
>using the same pages,
>you pay the register/pin cost once, later use lazy
>de-registration/pinning, and only do that
>when MMU notifier tells you this is a must, or under some eviction policy.
>
>Since the cost is amortized, the system calls over-head should be negligible
>(also, the same system call can be used to evict X and register Y), do you
>have performance data that shows otherwise?

I don't personally have the data but will check with some folks here.

-Denny
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 00/16] IB/hfi1: Add a page pinning cache for PSM sdma
       [not found]                                                     ` <20160316172309.GB26530-W4f6Xiosr+yv7QzWx2u06xL4W9x8LtSr@public.gmane.org>
@ 2016-03-22  7:33                                                       ` Or Gerlitz
       [not found]                                                         ` <CAJ3xEMg5OPBT63=VV4EKnZUWSHGatDRL0n7O4iBVrN9bMUw5dg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 36+ messages in thread
From: Or Gerlitz @ 2016-03-22  7:33 UTC (permalink / raw)
  To: Dennis Dalessandro, Doug Ledford
  Cc: Mitko Haralanov, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	Leon Romanovsky, Christoph Lameter

On Wed, Mar 16, 2016 at 7:23 PM, Dennis Dalessandro
<dennis.dalessandro-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org> wrote:
> On Wed, Mar 16, 2016 at 05:53:37PM +0200, Or Gerlitz wrote:

>> The cache comes to amortize the cost of pin/unpin and such, correct? this means
>> that over longs runs, if there's nice locality of the same process
>> using the same pages, you pay the register/pin cost once, later use lazy
>> de-registration/pinning, and only do that
>> when MMU notifier tells you this is a must, or under some eviction policy.

>> Since the cost is amortized, the system calls over-head should be negligible
>> (also, the same system call can be used to evict X and register Y), do you
>> have performance data that shows otherwise?

> I don't personally have the data but will check with some folks here.

Dennis, I didn't see a reply from you here...

Doug, as the review discussion is still in progress, I guess you'll
wait for the talks to conclude before adding this to your 2nd 4.6 pull
request?

Or.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 00/16] IB/hfi1: Add a page pinning cache for PSM sdma
       [not found]                                                         ` <CAJ3xEMg5OPBT63=VV4EKnZUWSHGatDRL0n7O4iBVrN9bMUw5dg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2016-03-22 11:46                                                           ` Dennis Dalessandro
  0 siblings, 0 replies; 36+ messages in thread
From: Dennis Dalessandro @ 2016-03-22 11:46 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: Doug Ledford, Mitko Haralanov, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	Leon Romanovsky, Christoph Lameter

On Tue, Mar 22, 2016 at 09:33:55AM +0200, Or Gerlitz wrote:
>On Wed, Mar 16, 2016 at 7:23 PM, Dennis Dalessandro
><dennis.dalessandro-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org> wrote:
>> On Wed, Mar 16, 2016 at 05:53:37PM +0200, Or Gerlitz wrote:
>
>>> The cache serves to amortize the cost of pin/unpin and such, correct? This
>>> means that over long runs, if the same process keeps using the same pages
>>> with nice locality, you pay the register/pin cost once, rely on lazy
>>> de-registration/unpinning afterwards, and only do that work when the MMU
>>> notifier tells you it is a must, or under some eviction policy.
>
>>> Since the cost is amortized, the system call overhead should be negligible
>>> (also, the same system call can be used to evict X and register Y). Do you
>>> have performance data that shows otherwise?
>
>> I don't personally have the data but will check with some folks here.
>
>Dennis, didn't see a reply from you here...

I have been out of the office for a few days. I did check into the
performance aspect and have been assured that the increased number of
system calls does cause a performance problem, though I do not have any
published data to point you to.

We are particularly worried about the eviction case, which is where the
amortization argument doesn't hold up. If the eviction policy evicts
buffers that have not been used recently in favor of a new buffer (a very
common policy), evictions, and the pin/unpin work they trigger, happen
more often than you might expect, as the toy example below shows. We do
not want any additional cost in that path.
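
To make that concrete, here is a toy user space simulation (hypothetical
numbers and structures, nothing to do with the actual hfi1 code) of a
4-slot LRU cache serving a cyclic working set of 5 buffers. LRU always
evicts exactly the buffer that is needed next, so every access costs a
pin and, once the cache is full, an unpin as well:

#include <stdio.h>

#define CACHE_SLOTS 4

int main(void)
{
	int cache[CACHE_SLOTS];		/* buffer ids; index 0 is LRU */
	int used = 0, pins = 0, unpins = 0;

	for (int iter = 0; iter < 100; iter++) {
		int buf = iter % 5;	/* working set of 5 buffers */
		int hit = -1;

		for (int i = 0; i < used; i++)
			if (cache[i] == buf)
				hit = i;

		if (hit >= 0) {
			/* hit: reorder to MRU, no pin/unpin at all */
			for (int i = hit; i < used - 1; i++)
				cache[i] = cache[i + 1];
			cache[used - 1] = buf;
			continue;
		}

		if (used == CACHE_SLOTS) {
			/* cache full: evict the LRU buffer = one unpin */
			unpins++;
			for (int i = 0; i < used - 1; i++)
				cache[i] = cache[i + 1];
			used--;
		}

		pins++;			/* miss = one pin */
		cache[used++] = buf;
	}

	/* prints: accesses=100 pins=100 unpins=96 */
	printf("accesses=100 pins=%d unpins=%d\n", pins, unpins);
	return 0;
}

If the cache lives in the kernel, those pins and unpins stay inside the
driver. If it lives in user space, every one of them is an extra system
call, and the "amortized" cost becomes the full cost.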

Another reason we do not want this in user space is that it changes the
semantics of the SDMA transfer call. Normally, user space applications
don't care what has to be done under the covers to transfer a buffer. The
application or library calls into the kernel, which takes care of the
details, including pinning the pages while the underlying hardware pulls
the data from the buffer. The user application does not, and should not,
care about any of that, and SDMA transfers work the same way today.
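
For illustration, today a submission looks roughly like this from the
application's point of view (a simplified sketch; the real request
carries a request header iovec and completion tracking, but note that
pinning never appears anywhere):

#include <sys/uio.h>

static ssize_t sdma_send(int hfi1_fd, void *buf, size_t len)
{
	struct iovec iov = {
		.iov_base = buf,
		.iov_len  = len,
	};

	/* The driver pins the pages, feeds the SDMA engine, and
	 * unpins (or, with this series, caches) them internally. */
	return writev(hfi1_fd, &iov, 1);
}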

However, if we move the cache into user space, all of that changes. User
space now has to track the pinning state of each buffer. The normal SDMA
transfer request can no longer be used as-is, because it will, or at
least might, pin/unpin pages it shouldn't. The kernel would have to
provide two semantically different versions of the call: one which
pins/unpins and one which does not. That is yet another complication
being pushed, unnecessarily, into user space.
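
Concretely, user space would end up having to choose between request
flavors along these lines (the flags and names here are made up purely
for illustration; no such interface exists):

/* hypothetical, for illustration only */
#define SDMA_REQ_KERNEL_PIN	0x1	/* driver pins/unpins, as today */
#define SDMA_REQ_PREPINNED	0x2	/* pages pinned by the user space
					 * cache; driver must leave the
					 * pin state alone */

struct sdma_xfer_req {
	unsigned long vaddr;	/* user buffer */
	unsigned long length;
	unsigned int flags;	/* SDMA_REQ_*: caller must get it right */
};

Every caller now has to know, per buffer, which flavor is correct, and a
wrong flags value becomes a class of bug that simply cannot exist with
the current semantics.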

> Doug, as the review discussion here is still in progress, I guess you'll
> wait for it to settle before adding this into your 2nd 4.6 pull request?

I think we have made it clear why we do not want to put this in user
space. Can we? Yes. But just because we could doesn't mean we should. So
what technical reason makes you feel so strongly that this doesn't belong
in our particular driver? Does it somehow impact other RDMA drivers, the
IB core, or anything else?

-Denny

Thread overview: 36+ messages
2016-03-08 19:14 [PATCH 00/16] IB/hfi1: Add a page pinning cache for PSM sdma Dennis Dalessandro
     [not found] ` <20160308191210.30542.91885.stgit-9QXIwq+3FY+1XWohqUldA0EOCMrvLtNR@public.gmane.org>
2016-03-08 19:14   ` [PATCH 01/16] IB/hfi1: Re-factor MMU notification code Dennis Dalessandro
2016-03-08 19:14   ` [PATCH 02/16] IB/hfi1: Allow MMU function execution in IRQ context Dennis Dalessandro
2016-03-08 19:14   ` [PATCH 03/16] IB/hfi1: Prevent NULL pointer dereference Dennis Dalessandro
2016-03-08 19:14   ` [PATCH 04/16] IB/hfi1: Allow remove MMU callbacks to free nodes Dennis Dalessandro
2016-03-08 19:14   ` [PATCH 05/16] IB/hfi1: Remove the use of add/remove RB function pointers Dennis Dalessandro
2016-03-08 19:14   ` [PATCH 06/16] IB/hfi1: Notify remove MMU/RB callback of calling context Dennis Dalessandro
2016-03-08 19:14   ` [PATCH 07/16] IB/hfi1: Use interval RB trees Dennis Dalessandro
2016-03-08 19:14   ` [PATCH 08/16] IB/hfi1: Add MMU tracing Dennis Dalessandro
2016-03-08 19:15   ` [PATCH 09/16] IB/hfi1: Remove compare callback Dennis Dalessandro
2016-03-08 19:15   ` [PATCH 10/16] IB/hfi1: Add filter callback Dennis Dalessandro
2016-03-08 19:15   ` [PATCH 11/16] IB/hfi1: Adjust last address values for intervals Dennis Dalessandro
2016-03-08 19:15   ` [PATCH 12/16] IB/hfi1: Implement SDMA-side buffer caching Dennis Dalessandro
2016-03-08 19:15   ` [PATCH 13/16] IB/hfi1: Add pin query function Dennis Dalessandro
2016-03-08 19:15   ` [PATCH 14/16] IB/hfi1: Specify mm when releasing pages Dennis Dalessandro
2016-03-08 19:15   ` [PATCH 15/16] IB/hfi1: Switch to using the pin query function Dennis Dalessandro
2016-03-08 19:15   ` [PATCH 16/16] IB/hfi1: Add SDMA cache eviction algorithm Dennis Dalessandro
2016-03-08 20:56   ` [PATCH 00/16] IB/hfi1: Add a page pinning cache for PSM sdma Or Gerlitz
     [not found]     ` <CAJ3xEMj8gz6Xw1bqA677fC8=U2ktLQ6b4W20SeNrztN7u6UKZg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2016-03-09  0:21       ` Dennis Dalessandro
     [not found]         ` <20160309002157.GA20105-W4f6Xiosr+yv7QzWx2u06xL4W9x8LtSr@public.gmane.org>
2016-03-09  5:07           ` Leon Romanovsky
     [not found]             ` <20160309050731.GM13396-2ukJVAZIZ/Y@public.gmane.org>
2016-03-09 19:21               ` Dennis Dalessandro
     [not found]                 ` <20160309192109.GB15031-W4f6Xiosr+yv7QzWx2u06xL4W9x8LtSr@public.gmane.org>
2016-03-10  4:56                   ` Leon Romanovsky
     [not found]                     ` <20160310045609.GC1977-2ukJVAZIZ/Y@public.gmane.org>
2016-03-14  7:09                       ` Leon Romanovsky
     [not found]                         ` <20160314070951.GB26456-2ukJVAZIZ/Y@public.gmane.org>
2016-03-14 12:01                           ` Dennis Dalessandro
     [not found]                             ` <20160314120152.GA30838-W4f6Xiosr+yv7QzWx2u06xL4W9x8LtSr@public.gmane.org>
2016-03-14 12:27                               ` Leon Romanovsky
2016-03-14 21:19                               ` Or Gerlitz
     [not found]                                 ` <CAJ3xEMiSgXFT46r0Y7DZGr3mpJFSLrtYrMdz+DbCXDozbMYZRA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2016-03-15  1:20                                   ` Dennis Dalessandro
     [not found]                                     ` <20160315012049.GA30055-W4f6Xiosr+yv7QzWx2u06xL4W9x8LtSr@public.gmane.org>
2016-03-15  7:03                                       ` Or Gerlitz
     [not found]                                         ` <CAJ3xEMhToYOBMo_nNi2fFpTexUmUqb3zBX7zWWYjCDfPWpWn-w-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2016-03-15 19:37                                           ` Dennis Dalessandro
     [not found]                                             ` <20160315193731.GA20662-W4f6Xiosr+yv7QzWx2u06xL4W9x8LtSr@public.gmane.org>
2016-03-16 15:53                                               ` Or Gerlitz
     [not found]                                                 ` <CAJ3xEMjWxdwM0wQR85_-wHpPhYSWz1W0T8H_nDjnp2JS7G29Qw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2016-03-16 17:23                                                   ` Dennis Dalessandro
     [not found]                                                     ` <20160316172309.GB26530-W4f6Xiosr+yv7QzWx2u06xL4W9x8LtSr@public.gmane.org>
2016-03-22  7:33                                                       ` Or Gerlitz
     [not found]                                                         ` <CAJ3xEMg5OPBT63=VV4EKnZUWSHGatDRL0n7O4iBVrN9bMUw5dg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2016-03-22 11:46                                                           ` Dennis Dalessandro
2016-03-10  7:11                   ` Or Gerlitz
2016-03-09 14:47           ` Or Gerlitz
     [not found]             ` <CAJ3xEMhcs-6vDzsNk9nJKoNxTWnfaUtDmNHhLp1EPyerjnw2Xg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2016-03-09 15:19               ` Dennis Dalessandro
