linux-mm.kvack.org archive mirror
From: j.glisse@gmail.com
To: akpm@linux-foundation.org
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	"Linus Torvalds" <torvalds@linux-foundation.org>,
	joro@8bytes.org, "Mel Gorman" <mgorman@suse.de>,
	"H. Peter Anvin" <hpa@zytor.com>,
	"Peter Zijlstra" <peterz@infradead.org>,
	"Andrea Arcangeli" <aarcange@redhat.com>,
	"Johannes Weiner" <jweiner@redhat.com>,
	"Larry Woodman" <lwoodman@redhat.com>,
	"Rik van Riel" <riel@redhat.com>,
	"Dave Airlie" <airlied@redhat.com>,
	"Brendan Conoboy" <blc@redhat.com>,
	"Joe Donohue" <jdonohue@redhat.com>,
	"Duncan Poole" <dpoole@nvidia.com>,
	"Sherry Cheung" <SCheung@nvidia.com>,
	"Subhash Gutti" <sgutti@nvidia.com>,
	"John Hubbard" <jhubbard@nvidia.com>,
	"Mark Hairgrove" <mhairgrove@nvidia.com>,
	"Lucien Dunning" <ldunning@nvidia.com>,
	"Cameron Buschardt" <cabuschardt@nvidia.com>,
	"Arvind Gopalakrishnan" <arvindg@nvidia.com>,
	"Haggai Eran" <haggaie@mellanox.com>,
	"Shachar Raindel" <raindel@mellanox.com>,
	"Liran Liss" <liranl@mellanox.com>,
	"Roland Dreier" <roland@purestorage.com>,
	"Ben Sander" <ben.sander@amd.com>,
	"Greg Stoner" <Greg.Stoner@amd.com>,
	"John Bridgman" <John.Bridgman@amd.com>,
	"Michael Mantor" <Michael.Mantor@amd.com>,
	"Paul Blinzer" <Paul.Blinzer@amd.com>,
	"Laurent Morichetti" <Laurent.Morichetti@amd.com>,
	"Alexander Deucher" <Alexander.Deucher@amd.com>,
	"Oded Gabbay" <Oded.Gabbay@amd.com>,
	"Jérôme Glisse" <jglisse@redhat.com>,
	"Jatin Kumar" <jakumar@nvidia.com>
Subject: [PATCH 06/36] HMM: add HMM page table v2.
Date: Thu, 21 May 2015 15:31:15 -0400
Message-ID: <1432236705-4209-7-git-send-email-j.glisse@gmail.com>
In-Reply-To: <1432236705-4209-1-git-send-email-j.glisse@gmail.com>

From: Jérôme Glisse <jglisse@redhat.com>

The main purpose of heterogeneous memory management (HMM) is to mirror a
process address space. To do so it must maintain a secondary page table that
the device driver uses to program the device or to build a device-specific
page table.

A radix tree cannot be used for this secondary page table because HMM needs
more flags than RADIX_TREE_MAX_TAGS (while this limit can be increased, we
believe HMM will require so many flags that the cost would become
prohibitive for other users of the radix tree).

Moreover, the radix tree is built around long, but HMM needs to store DMA
addresses, and on some platforms sizeof(dma_addr_t) > sizeof(long). The
radix tree is therefore unsuitable for HMM, which is why we introduce this
code: it allows creating a page table that can grow and shrink dynamically.

The design is very close to the CPU page table, as it reuses some of its
features, such as the spinlock embedded in struct page.

Changed since v1:
  - Use PAGE_SHIFT as the shift value to reserve the low bits for private
    device-specific flags. This allows device drivers to use some of the
    lower bits for their own device-specific purposes.
  - Add a set of helpers for atomically clearing, setting and testing bits
    on a dma_addr_t pointer. Atomicity is useful only for the dirty bit.
  - Differentiate between DMA-mapped entries and non-mapped entries (pfn).
  - Split page directory entry and page table entry helpers.

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Signed-off-by: Sherry Cheung <SCheung@nvidia.com>
Signed-off-by: Subhash Gutti <sgutti@nvidia.com>
Signed-off-by: Mark Hairgrove <mhairgrove@nvidia.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Jatin Kumar <jakumar@nvidia.com>
---
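Note: the entry format listed in the changelog above can be illustrated with
a small user-space sketch. It is not part of the patch. It assumes a 64-bit
dma_addr_t and 4 KiB pages, every sketch_* name is hypothetical, and the bit
helpers are plain non-atomic stand-ins for the hmm_pte_*_bit() helpers added
in include/linux/hmm_pt.h.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumptions for this sketch: 64-bit dma_addr_t and 4 KiB pages. */
typedef uint64_t sketch_dma_addr_t;
#define SKETCH_PAGE_SHIFT		12

/* Bit positions copied from include/linux/hmm_pt.h in this patch. */
#define SKETCH_PTE_VALID_DMA_BIT	0
#define SKETCH_PTE_VALID_PFN_BIT	1
#define SKETCH_PTE_WRITE_BIT		2
#define SKETCH_PTE_DIRTY_BIT		3
#define SKETCH_PTE_HW_SHIFT		4

/* Plain (non-atomic) stand-in for hmm_pte_set_bit(). */
static void sketch_pte_set_bit(sketch_dma_addr_t *ptep, unsigned bit)
{
	assert(bit < SKETCH_PAGE_SHIFT);	/* flags live below the pfn */
	*ptep |= (sketch_dma_addr_t)1 << bit;
}

/* Plain (non-atomic) stand-in for hmm_pte_test_bit(). */
static bool sketch_pte_test_bit(const sketch_dma_addr_t *ptep, unsigned bit)
{
	assert(bit < SKETCH_PAGE_SHIFT);
	return (*ptep >> bit) & 1;
}

/* Build a pfn entry: pfn in the high bits, VALID_PFN flag in the low bits. */
static sketch_dma_addr_t sketch_pte_from_pfn(uint64_t pfn)
{
	return (pfn << SKETCH_PAGE_SHIFT) |
	       ((sketch_dma_addr_t)1 << SKETCH_PTE_VALID_PFN_BIT);
}

/* Extract the pfn, or 0 if the entry is not a valid pfn entry. */
static uint64_t sketch_pte_pfn(sketch_dma_addr_t pte)
{
	return sketch_pte_test_bit(&pte, SKETCH_PTE_VALID_PFN_BIT) ?
	       pte >> SKETCH_PAGE_SHIFT : 0;
}
```

Under these assumptions the bits from SKETCH_PTE_HW_SHIFT up to
SKETCH_PAGE_SHIFT - 1 (bits 4 to 11) are the ones left for driver private
flags, and setting them does not disturb the pfn stored above them.
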
 MAINTAINERS            |   2 +
 include/linux/hmm_pt.h | 380 +++++++++++++++++++++++++++++++++++++++++++
 mm/Makefile            |   2 +-
 mm/hmm_pt.c            | 425 +++++++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 808 insertions(+), 1 deletion(-)
 create mode 100644 include/linux/hmm_pt.h
 create mode 100644 mm/hmm_pt.c
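
Note: the per-level shift arithmetic that hmm_pt_init() performs in
mm/hmm_pt.c can be tried out in user space with the sketch below. It is not
part of the patch. It assumes 4 KiB pages, uses a portable loop in place of
the kernel's __fls() and ilog2(), and every sketch_* name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT	12	/* assumption: 4 KiB pages */
#define SKETCH_MAX_LEVEL	6	/* mirrors HMM_PT_MAX_LEVEL */

struct sketch_layout {
	unsigned shift[SKETCH_MAX_LEVEL];	/* address shift per level */
	unsigned llevel;			/* last (page table entry) level */
	uint64_t directory_mask;		/* index mask within a directory */
};

/* Index of the most significant set bit; returns 0 when x is 0 or 1. */
static unsigned sketch_fls(uint64_t x)
{
	unsigned bit = 0;

	while (x >>= 1)
		bit++;
	return bit;
}

/* Same arithmetic as hmm_pt_init(), minus locking and pgd allocation. */
static void sketch_layout_init(struct sketch_layout *l, uint64_t last,
			       size_t entry_size)
{
	unsigned dshift, i = 0;

	/* The entry size must be a power of two, as sizeof(dma_addr_t) is. */
	assert(entry_size && !(entry_size & (entry_size - 1)));
	dshift = SKETCH_PAGE_SHIFT - sketch_fls(entry_size);

	last &= ~((UINT64_C(1) << SKETCH_PAGE_SHIFT) - 1);
	l->shift[0] = (sketch_fls(last >> SKETCH_PAGE_SHIFT) / dshift) * dshift +
		      SKETCH_PAGE_SHIFT;
	while (l->shift[i] > SKETCH_PAGE_SHIFT) {
		l->shift[i + 1] = l->shift[i] - dshift;
		i++;
	}
	l->llevel = i;
	l->directory_mask = (UINT64_C(1) << dshift) - 1;
}

/* Directory index of addr at a given level, as in hmm_pt_index(). */
static unsigned sketch_index(const struct sketch_layout *l, uint64_t addr,
			     unsigned level)
{
	return (unsigned)((addr >> l->shift[level]) & l->directory_mask);
}
```

With last = (1 << 48) - 1 and 8-byte entries this should give shifts of 39,
30, 21 and 12 with llevel = 3. Because last is inclusive, passing 1 << 48
instead yields a root shift of 48 and one extra level.
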

diff --git a/MAINTAINERS b/MAINTAINERS
index 2f2a2be..8cd0aa7 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -4736,6 +4736,8 @@ L:	linux-mm@kvack.org
 S:	Maintained
 F:	mm/hmm.c
 F:	include/linux/hmm.h
+F:	mm/hmm_pt.c
+F:	include/linux/hmm_pt.h
 
 HOST AP DRIVER
 M:	Jouni Malinen <j@w1.fi>
diff --git a/include/linux/hmm_pt.h b/include/linux/hmm_pt.h
new file mode 100644
index 0000000..330edb2
--- /dev/null
+++ b/include/linux/hmm_pt.h
@@ -0,0 +1,380 @@
+/*
+ * Copyright 2014 Red Hat Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * Authors: Jérôme Glisse <jglisse@redhat.com>
+ */
+/*
+ * This provides a set of helpers for the HMM page table. See
+ * include/linux/hmm.h for a description of what HMM is.
+ *
+ * The HMM page table relies on a locking mechanism similar to the CPU page
+ * table for page table updates. It uses the spinlock embedded inside struct
+ * page to protect changes to a page table directory, which should minimize
+ * lock contention for concurrent updates.
+ *
+ * It also provides a directory tree protection mechanism. Unlike the CPU page
+ * table, no mmap semaphore protects the directory tree from removal. This is
+ * intentional, so that concurrent removal/insertion of directories inside the
+ * tree can happen.
+ *
+ * Anyone walking down the page table must therefore protect each directory it
+ * traverses so that it is not freed by some other thread. This is done with a
+ * reference counter for each directory. Before traversing a directory a
+ * reference is taken, and once traversal is done the reference is dropped.
+ *
+ * A directory entry dereference and the refcount increment of the
+ * sub-directory page must happen inside an RCU read-side critical section so
+ * that directory page removal can gracefully wait for all other threads that
+ * might have dereferenced the directory.
+ */
+#ifndef _HMM_PT_H
+#define _HMM_PT_H
+
+/*
+ * The HMM page table entry does not reflect any specific hardware. It is just
+ * a common entry format used by HMM internally and exposed to HMM users so
+ * they can extract information out of the HMM page table.
+ *
+ * Device drivers should only rely on the helpers and should not traverse the
+ * page table themselves.
+ */
+#define HMM_PT_MAX_LEVEL	6
+
+#define HMM_PDE_VALID_BIT	0
+#define HMM_PDE_VALID		(1 << HMM_PDE_VALID_BIT)
+#define HMM_PDE_PFN_MASK	(~((dma_addr_t)((1 << PAGE_SHIFT) - 1)))
+
+static inline dma_addr_t hmm_pde_from_pfn(dma_addr_t pfn)
+{
+	return (pfn << PAGE_SHIFT) | HMM_PDE_VALID;
+}
+
+static inline unsigned long hmm_pde_pfn(dma_addr_t pde)
+{
+	return (pde & HMM_PDE_VALID) ? pde >> PAGE_SHIFT : 0;
+}
+
+
+/*
+ * The HMM_PTE_VALID_DMA_BIT is set for a valid DMA-mapped entry, while for a
+ * pfn entry the HMM_PTE_VALID_PFN_BIT is set. If the hmm_device is associated
+ * with a valid struct device then the device driver will be supplied with
+ * DMA-mapped entries, otherwise it will be supplied with pfn entries.
+ *
+ * In the first case the device driver must ignore any pfn entry, as one might
+ * show up as a transient state while HMM is mapping the page.
+ */
+#define HMM_PTE_VALID_DMA_BIT	0
+#define HMM_PTE_VALID_PFN_BIT	1
+#define HMM_PTE_WRITE_BIT	2
+#define HMM_PTE_DIRTY_BIT	3
+/*
+ * Reserve some bits for device driver private flags. Note that these can only
+ * be manipulated using the hmm_pte_*_bit() set of helpers.
+ *
+ * WARNING: ONLY SET/CLEAR THESE FLAGS ON PTE ENTRIES THAT HAVE THE VALID BIT
+ * SET, AS OTHERWISE ANY BIT SET BY THE DRIVER WILL BE OVERWRITTEN BY HMM.
+ */
+#define HMM_PTE_HW_SHIFT	4
+
+#define HMM_PTE_PFN_MASK	(~((dma_addr_t)((1 << PAGE_SHIFT) - 1)))
+#define HMM_PTE_DMA_MASK	(~((dma_addr_t)((1 << PAGE_SHIFT) - 1)))
+
+
+#ifdef __BIG_ENDIAN
+/*
+ * The dma_addr_t casting we do on little endian does not work on big endian.
+ * It would require some macro trickery to adjust the bit value depending on
+ * the number of bits unsigned long has in comparison to dma_addr_t. This is
+ * just low on the todo list for now.
+ */
+#error "HMM not supported on BIG_ENDIAN architecture.\n"
+#else /* __BIG_ENDIAN */
+static inline void hmm_pte_clear_bit(dma_addr_t *ptep, unsigned char bit)
+{
+	clear_bit(bit, (unsigned long *)ptep);
+}
+
+static inline void hmm_pte_set_bit(dma_addr_t *ptep, unsigned char bit)
+{
+	set_bit(bit, (unsigned long *)ptep);
+}
+
+static inline bool hmm_pte_test_bit(dma_addr_t *ptep, unsigned char bit)
+{
+	return !!test_bit(bit, (unsigned long *)ptep);
+}
+
+static inline bool hmm_pte_test_and_clear_bit(dma_addr_t *ptep,
+					      unsigned char bit)
+{
+	return !!test_and_clear_bit(bit, (unsigned long *)ptep);
+}
+
+static inline bool hmm_pte_test_and_set_bit(dma_addr_t *ptep,
+					    unsigned char bit)
+{
+	return !!test_and_set_bit(bit, (unsigned long *)ptep);
+}
+#endif /* __BIG_ENDIAN */
+
+
+#define HMM_PTE_CLEAR_BIT(name, bit)\
+	static inline void hmm_pte_clear_##name(dma_addr_t *ptep)\
+	{\
+		hmm_pte_clear_bit(ptep, bit);\
+	}
+
+#define HMM_PTE_SET_BIT(name, bit)\
+	static inline void hmm_pte_set_##name(dma_addr_t *ptep)\
+	{\
+		hmm_pte_set_bit(ptep, bit);\
+	}
+
+#define HMM_PTE_TEST_BIT(name, bit)\
+	static inline bool hmm_pte_test_##name(dma_addr_t *ptep)\
+	{\
+		return hmm_pte_test_bit(ptep, bit);\
+	}
+
+#define HMM_PTE_TEST_AND_CLEAR_BIT(name, bit)\
+	static inline bool hmm_pte_test_and_clear_##name(dma_addr_t *ptep)\
+	{\
+		return hmm_pte_test_and_clear_bit(ptep, bit);\
+	}
+
+#define HMM_PTE_TEST_AND_SET_BIT(name, bit)\
+	static inline bool hmm_pte_test_and_set_##name(dma_addr_t *ptep)\
+	{\
+		return hmm_pte_test_and_set_bit(ptep, bit);\
+	}
+
+#define HMM_PTE_BIT_HELPER(name, bit)\
+	HMM_PTE_CLEAR_BIT(name, bit)\
+	HMM_PTE_SET_BIT(name, bit)\
+	HMM_PTE_TEST_BIT(name, bit)\
+	HMM_PTE_TEST_AND_CLEAR_BIT(name, bit)\
+	HMM_PTE_TEST_AND_SET_BIT(name, bit)
+
+HMM_PTE_BIT_HELPER(valid_dma, HMM_PTE_VALID_DMA_BIT)
+HMM_PTE_BIT_HELPER(valid_pfn, HMM_PTE_VALID_PFN_BIT)
+HMM_PTE_BIT_HELPER(dirty, HMM_PTE_DIRTY_BIT)
+HMM_PTE_BIT_HELPER(write, HMM_PTE_WRITE_BIT)
+
+static inline dma_addr_t hmm_pte_from_pfn(dma_addr_t pfn)
+{
+	return (pfn << PAGE_SHIFT) | (1 << HMM_PTE_VALID_PFN_BIT);
+}
+
+static inline unsigned long hmm_pte_pfn(dma_addr_t pte)
+{
+	return hmm_pte_test_valid_pfn(&pte) ? pte >> PAGE_SHIFT : 0;
+}
+
+
+/* struct hmm_pt - HMM page table structure.
+ *
+ * @mask: Array of address mask value of each level.
+ * @directory_mask: Mask for directory index (see below).
+ * @last: Last valid address (inclusive).
+ * @pgd: page global directory (top first level of the directory tree).
+ * @lock: Shared lock, used if spinlock_t does not fit in struct page.
+ * @shift: Array of address shift value of each level.
+ * @llevel: Last level.
+ *
+ * The index into each directory for a given address and level is :
+ *   (address >> shift[level]) & directory_mask
+ *
+ * Only hmm_pt.last field needs to be set before calling hmm_pt_init().
+ */
+struct hmm_pt {
+	unsigned long		mask[HMM_PT_MAX_LEVEL];
+	unsigned long		directory_mask;
+	unsigned long		last;
+	dma_addr_t		*pgd;
+	spinlock_t		lock;
+	unsigned char		shift[HMM_PT_MAX_LEVEL];
+	unsigned char		llevel;
+};
+
+int hmm_pt_init(struct hmm_pt *pt);
+void hmm_pt_fini(struct hmm_pt *pt);
+
+static inline unsigned hmm_pt_index(struct hmm_pt *pt,
+				    unsigned long addr,
+				    unsigned level)
+{
+	return (addr >> pt->shift[level]) & pt->directory_mask;
+}
+
+#if USE_SPLIT_PTE_PTLOCKS && !ALLOC_SPLIT_PTLOCKS
+static inline void hmm_pt_directory_lock(struct hmm_pt *pt,
+					 struct page *ptd,
+					 unsigned level)
+{
+	if (level)
+		spin_lock(&ptd->ptl);
+	else
+		spin_lock(&pt->lock);
+}
+
+static inline void hmm_pt_directory_unlock(struct hmm_pt *pt,
+					   struct page *ptd,
+					   unsigned level)
+{
+	if (level)
+		spin_unlock(&ptd->ptl);
+	else
+		spin_unlock(&pt->lock);
+}
+#else /* USE_SPLIT_PTE_PTLOCKS && !ALLOC_SPLIT_PTLOCKS */
+static inline void hmm_pt_directory_lock(struct hmm_pt *pt,
+					 struct page *ptd,
+					 unsigned level)
+{
+	spin_lock(&pt->lock);
+}
+
+static inline void hmm_pt_directory_unlock(struct hmm_pt *pt,
+					   struct page *ptd,
+					   unsigned level)
+{
+	spin_unlock(&pt->lock);
+}
+#endif
+
+static inline unsigned long hmm_pt_level_start(struct hmm_pt *pt,
+					       unsigned long addr,
+					       unsigned level)
+{
+	return addr & pt->mask[level];
+}
+
+static inline unsigned long hmm_pt_level_end(struct hmm_pt *pt,
+					     unsigned long addr,
+					     unsigned level)
+{
+	return (addr | (~pt->mask[level])) + 1UL;
+}
+
+static inline unsigned long hmm_pt_level_next(struct hmm_pt *pt,
+					      unsigned long addr,
+					      unsigned long end,
+					      unsigned level)
+{
+	addr = (addr | (~pt->mask[level])) + 1UL;
+	return (addr - 1 < end - 1) ? addr : end;
+}
+
+
+/* struct hmm_pt_iter - page table iterator states.
+ *
+ * @ptd: Array of directory struct page pointers, one for each level.
+ * @ptdp: Array of pointers to the mapped directory of each level.
+ * @dead_directories: List of directories that died while walking page table.
+ * @cur: Current address.
+ */
+struct hmm_pt_iter {
+	struct page		*ptd[HMM_PT_MAX_LEVEL - 1];
+	dma_addr_t		*ptdp[HMM_PT_MAX_LEVEL - 1];
+	struct list_head	dead_directories;
+	unsigned long		cur;
+};
+
+void hmm_pt_iter_init(struct hmm_pt_iter *iter);
+void hmm_pt_iter_fini(struct hmm_pt_iter *iter, struct hmm_pt *pt);
+unsigned long hmm_pt_iter_next(struct hmm_pt_iter *iter,
+			       struct hmm_pt *pt,
+			       unsigned long addr,
+			       unsigned long end);
+dma_addr_t *hmm_pt_iter_update(struct hmm_pt_iter *iter,
+			       struct hmm_pt *pt,
+			       unsigned long addr);
+dma_addr_t *hmm_pt_iter_fault(struct hmm_pt_iter *iter,
+			      struct hmm_pt *pt,
+			      unsigned long addr);
+
+/* hmm_pt_iter_directory_ref() - reference a directory.
+ *
+ * @iter: Iterator states that currently protect the directory.
+ * @level: Level of the directory to reference.
+ *
+ * This function will reference a directory but it is illegal for refcount to
+ * be 0 as this helper should only be called when the iterator is protecting
+ * the directory (ie the iterator holds a reference on the directory).
+ *
+ * HMM users will call this with level = pt.llevel, any other value is
+ * suspicious outside of hmm_pt code.
+ */
+static inline void hmm_pt_iter_directory_ref(struct hmm_pt_iter *iter,
+					     char level)
+{
+	/* Nothing to do for root level. */
+	if (!level)
+		return;
+
+	if (!atomic_inc_not_zero(&iter->ptd[level - 1]->_mapcount))
+		/* Illegal this should not happen. */
+		BUG();
+}
+
+/* hmm_pt_iter_directory_unref() - unreference a directory.
+ *
+ * @iter: Iterator states that currently protect the directory.
+ * @level: Level of the directory to unreference.
+ *
+ * This function will unreference a directory but it is illegal for refcount to
+ * reach 0 here as this helper should only be called when the iterator is
+ * protecting the directory (ie the iterator holds a reference on it).
+ *
+ * HMM users will call this with level = pt.llevel, any other value is
+ * suspicious outside of hmm_pt code.
+ */
+static inline void hmm_pt_iter_directory_unref(struct hmm_pt_iter *iter,
+					       char level)
+{
+	/* Nothing to do for root level. */
+	if (!level)
+		return;
+
+	if (!atomic_dec_and_test(&iter->ptd[level - 1]->_mapcount))
+		return;
+
+	/* Illegal this should not happen. */
+	BUG();
+}
+
+static inline dma_addr_t *hmm_pt_iter_ptdp(struct hmm_pt_iter *iter,
+					   struct hmm_pt *pt,
+					   unsigned long addr)
+{
+	BUG_ON(!iter->ptd[pt->llevel - 1] ||
+	       addr < hmm_pt_level_start(pt, iter->cur, pt->llevel) ||
+	       addr >= hmm_pt_level_end(pt, iter->cur, pt->llevel));
+	return &iter->ptdp[pt->llevel - 1][hmm_pt_index(pt, addr, pt->llevel)];
+}
+
+static inline void hmm_pt_iter_directory_lock(struct hmm_pt_iter *iter,
+					      struct hmm_pt *pt)
+{
+	hmm_pt_directory_lock(pt, iter->ptd[pt->llevel - 1], pt->llevel);
+}
+
+static inline void hmm_pt_iter_directory_unlock(struct hmm_pt_iter *iter,
+						struct hmm_pt *pt)
+{
+	hmm_pt_directory_unlock(pt, iter->ptd[pt->llevel - 1], pt->llevel);
+}
+
+
+#endif /* _HMM_PT_H */
diff --git a/mm/Makefile b/mm/Makefile
index 90ca9c4..04d7d45 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -78,4 +78,4 @@ obj-$(CONFIG_CMA)	+= cma.o
 obj-$(CONFIG_MEMORY_BALLOON) += balloon_compaction.o
 obj-$(CONFIG_PAGE_EXTENSION) += page_ext.o
 obj-$(CONFIG_CMA_DEBUGFS) += cma_debug.o
-obj-$(CONFIG_HMM) += hmm.o
+obj-$(CONFIG_HMM) += hmm.o hmm_pt.o
diff --git a/mm/hmm_pt.c b/mm/hmm_pt.c
new file mode 100644
index 0000000..49b200e
--- /dev/null
+++ b/mm/hmm_pt.c
@@ -0,0 +1,425 @@
+/*
+ * Copyright 2014 Red Hat Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * Authors: Jérôme Glisse <jglisse@redhat.com>
+ */
+/*
+ * Helpers for the HMM page table. See include/linux/hmm.h for a description
+ * of what HMM is, and include/linux/hmm_pt.h for the page table interface.
+ */
+#include <linux/highmem.h>
+#include <linux/slab.h>
+#include <linux/hmm_pt.h>
+
+/* hmm_pt_init() - initialize HMM page table.
+ *
+ * @pt: HMM page table to initialize.
+ *
+ * This function will initialize the HMM page table and allocate memory for the
+ * global directory. Only the hmm_pt.last field needs to be set prior to
+ * calling this function.
+ */
+int hmm_pt_init(struct hmm_pt *pt)
+{
+	unsigned directory_shift, i = 0, npgd;
+
+	pt->last &= PAGE_MASK;
+	spin_lock_init(&pt->lock);
+	/* Directory shift is the number of bits that a single directory level
+	 * represents. For instance if PAGE_SIZE is 4096 and each entry takes 8
+	 * bytes (sizeof(dma_addr_t) == 8) then directory_shift = 9.
+	 */
+	directory_shift = PAGE_SHIFT - ilog2(sizeof(dma_addr_t));
+	/* Level 0 is the root level of the page table. It might use fewer
+	 * bits than directory_shift but all sub-directory levels will use all
+	 * directory_shift bits.
+	 *
+	 * For instance if hmm_pt.last == (1UL << 48) - 1, PAGE_SHIFT == 12 and
+	 * sizeof(dma_addr_t) == 8 then :
+	 *   directory_shift = 9
+	 *   shift[0] = 39
+	 *   shift[1] = 30
+	 *   shift[2] = 21
+	 *   shift[3] = 12
+	 *   llevel = 3
+	 *
+	 * Note that shift[llevel] == PAGE_SHIFT because the last level
+	 * corresponds to the page table entry level (ignoring the case of huge
+	 * pages).
+	 */
+	pt->shift[0] = ((__fls(pt->last >> PAGE_SHIFT) / directory_shift) *
+			directory_shift) + PAGE_SHIFT;
+	while (pt->shift[i++] > PAGE_SHIFT)
+		pt->shift[i] = pt->shift[i - 1] - directory_shift;
+	pt->llevel = i - 1;
+	pt->directory_mask = (1 << directory_shift) - 1;
+
+	for (i = 0; i <= pt->llevel; ++i)
+		pt->mask[i] = ~((1UL << pt->shift[i]) - 1);
+
+	npgd = (pt->last >> pt->shift[0]) + 1;
+	pt->pgd = kzalloc(npgd * sizeof(dma_addr_t), GFP_KERNEL);
+	if (!pt->pgd)
+		return -ENOMEM;
+
+	return 0;
+}
+EXPORT_SYMBOL(hmm_pt_init);
+
+static void hmm_pt_fini_directory(struct hmm_pt *pt,
+				  struct page *ptd,
+				  unsigned level)
+{
+	dma_addr_t *ptdp;
+	unsigned i;
+
+	if (level == pt->llevel)
+		return;
+
+	ptdp = kmap(ptd);
+	for (i = 0; i <= pt->directory_mask; ++i) {
+		struct page *lptd;
+
+		if (!(ptdp[i] & HMM_PDE_VALID))
+			continue;
+		lptd = pfn_to_page(hmm_pde_pfn(ptdp[i]));
+		ptdp[i] = 0;
+		hmm_pt_fini_directory(pt, lptd, level + 1);
+		atomic_set(&lptd->_mapcount, -1);
+		__free_page(lptd);
+	}
+	kunmap(ptd);
+}
+
+/* hmm_pt_fini() - finalize HMM page table.
+ *
+ * @pt: HMM page table to finalize.
+ *
+ * This function will free all resources of a directory page table.
+ */
+void hmm_pt_fini(struct hmm_pt *pt)
+{
+	unsigned i;
+
+	/* Free all directories. */
+	for (i = 0; i <= (pt->last >> pt->shift[0]); ++i) {
+		struct page *ptd;
+
+		if (!(pt->pgd[i] & HMM_PDE_VALID))
+			continue;
+		ptd = pfn_to_page(hmm_pde_pfn(pt->pgd[i]));
+		pt->pgd[i] = 0;
+		hmm_pt_fini_directory(pt, ptd, 1);
+		atomic_set(&ptd->_mapcount, -1);
+		__free_page(ptd);
+	}
+
+	kfree(pt->pgd);
+	pt->pgd = NULL;
+}
+EXPORT_SYMBOL(hmm_pt_fini);
+
+
+/* hmm_pt_iter_init() - initialize iterator states.
+ *
+ * @iter: Iterator states.
+ *
+ * This function will initialize iterator states. It must always be paired with
+ * a call to hmm_pt_iter_fini().
+ */
+void hmm_pt_iter_init(struct hmm_pt_iter *iter)
+{
+	memset(iter->ptd, 0, sizeof(void *) * (HMM_PT_MAX_LEVEL - 1));
+	memset(iter->ptdp, 0, sizeof(void *) * (HMM_PT_MAX_LEVEL - 1));
+	INIT_LIST_HEAD(&iter->dead_directories);
+}
+EXPORT_SYMBOL(hmm_pt_iter_init);
+
+/* hmm_pt_iter_directory_unref_safe() - unref a directory that is safe to free.
+ *
+ * @iter: Iterator states.
+ * @pt: HMM page table.
+ * @level: Level of the directory to unref.
+ *
+ * This function will unreference a directory and add it to the dead list if
+ * the directory no longer has any reference. It will also clear the entry for
+ * that directory in the upper level directory and drop the reference on the
+ * upper directory.
+ */
+static void hmm_pt_iter_directory_unref_safe(struct hmm_pt_iter *iter,
+					     struct hmm_pt *pt,
+					     unsigned level)
+{
+	struct page *upper_ptd;
+	dma_addr_t *upper_ptdp;
+
+	/* Nothing to do for root level. */
+	if (!level)
+		return;
+
+	if (!atomic_dec_and_test(&iter->ptd[level - 1]->_mapcount))
+		return;
+
+	upper_ptd = level > 1 ? iter->ptd[level - 2] : NULL;
+	upper_ptdp = level > 1 ? iter->ptdp[level - 2] : pt->pgd;
+	upper_ptdp = &upper_ptdp[hmm_pt_index(pt, iter->cur, level - 1)];
+	hmm_pt_directory_lock(pt, upper_ptd, level - 1);
+	/*
+	 * There might be a race between decrementing the reference count on a
+	 * directory and another thread trying to fault in a new directory. To
+	 * avoid erasing the new directory entry we need to check that the entry
+	 * still corresponds to the directory we are removing.
+	 */
+	if (hmm_pde_pfn(*upper_ptdp) == page_to_pfn(iter->ptd[level - 1]))
+		*upper_ptdp = 0;
+	hmm_pt_directory_unlock(pt, upper_ptd, level - 1);
+
+	/* Add it to delayed free list. */
+	list_add_tail(&iter->ptd[level - 1]->lru, &iter->dead_directories);
+
+	/*
+	 * The upper directory is not safe to unref as we have an extra ref and
+	 * thus refcount should not reach 0.
+	 */
+	hmm_pt_iter_directory_unref(iter, level - 1);
+}
+
+static void hmm_pt_iter_unprotect_directory(struct hmm_pt_iter *iter,
+					    struct hmm_pt *pt,
+					    unsigned level)
+{
+	if (!iter->ptd[level - 1])
+		return;
+	kunmap(iter->ptd[level - 1]);
+	hmm_pt_iter_directory_unref_safe(iter, pt, level);
+	iter->ptd[level - 1] = NULL;
+}
+
+/* hmm_pt_iter_protect_directory() - protect a directory.
+ *
+ * @iter: Iterator states.
+ * @ptd: directory struct page to protect.
+ * @addr: Address of the directory.
+ * @level: Level of this directory (> 0).
+ * Returns -EINVAL on error, 1 if protection succeeded, 0 otherwise.
+ *
+ * This function will protect a directory by taking a reference. It will also
+ * map the directory to allow CPU access.
+ *
+ * Calls to this function must be made from inside the RCU read-side critical
+ * section that converts the table entry to the directory struct page. Doing so
+ * supports concurrent removal of directories, because this function takes the
+ * reference inside the RCU critical section and thus RCU synchronization
+ * guarantees that we can safely free the directory.
+ */
+int hmm_pt_iter_protect_directory(struct hmm_pt_iter *iter,
+				  struct page *ptd,
+				  unsigned long addr,
+				  unsigned level)
+{
+	/* This must be called inside an RCU read section. */
+	BUG_ON(!rcu_read_lock_held());
+
+	if (!level || iter->ptd[level - 1]) {
+		rcu_read_unlock();
+		return -EINVAL;
+	}
+
+	if (!atomic_inc_not_zero(&ptd->_mapcount)) {
+		rcu_read_unlock();
+		return 0;
+	}
+
+	rcu_read_unlock();
+
+	iter->ptd[level - 1] = ptd;
+	iter->ptdp[level - 1] = kmap(ptd);
+	iter->cur = addr;
+
+	return 1;
+}
+
+unsigned long hmm_pt_iter_next(struct hmm_pt_iter *iter,
+			       struct hmm_pt *pt,
+			       unsigned long addr,
+			       unsigned long end)
+{
+	unsigned i;
+
+	for (i = pt->llevel; i >= 1; --i) {
+		if (!iter->ptd[i - 1])
+			continue;
+		if (addr >= hmm_pt_level_start(pt, iter->cur, i) &&
+		    addr < hmm_pt_level_end(pt, iter->cur, i))
+			return hmm_pt_level_next(pt, iter->cur, end, i);
+	}
+
+	/*
+	 * No need for RCU protection, the worst case is that we return a
+	 * now dead address.
+	 */
+	if (pt->pgd[hmm_pt_index(pt, addr, 0)] & HMM_PDE_VALID)
+		return hmm_pt_level_next(pt, addr, end, pt->llevel);
+	for (; addr < end; addr = hmm_pt_level_next(pt, addr, end, 0))
+		if (pt->pgd[hmm_pt_index(pt, addr, 0)] & HMM_PDE_VALID)
+			return addr;
+	return end;
+}
+EXPORT_SYMBOL(hmm_pt_iter_next);
+
+dma_addr_t *hmm_pt_iter_update(struct hmm_pt_iter *iter,
+			       struct hmm_pt *pt,
+			       unsigned long addr)
+{
+	int i;
+
+	addr &= PAGE_MASK;
+
+	if (iter->ptd[pt->llevel - 1] &&
+	    addr >= hmm_pt_level_start(pt, iter->cur, pt->llevel) &&
+	    addr < hmm_pt_level_end(pt, iter->cur, pt->llevel))
+		return hmm_pt_iter_ptdp(iter, pt, addr);
+
+	/* First unprotect any directory that does not cover the address. */
+	for (i = pt->llevel; i >= 1; --i) {
+		if (!iter->ptd[i - 1])
+			continue;
+		if (addr >= hmm_pt_level_start(pt, iter->cur, i) &&
+		    addr < hmm_pt_level_end(pt, iter->cur, i))
+			break;
+		hmm_pt_iter_unprotect_directory(iter, pt, i);
+	}
+
+	/* Walk down to last level of the directory tree. */
+	for (; i < pt->llevel; ++i) {
+		struct page *ptd;
+		dma_addr_t pte, *ptdp;
+
+		rcu_read_lock();
+		ptdp = i ? iter->ptdp[i - 1] : pt->pgd;
+		pte = ACCESS_ONCE(ptdp[hmm_pt_index(pt, addr, i)]);
+		if (!(pte & HMM_PDE_VALID)) {
+			rcu_read_unlock();
+			return NULL;
+		}
+		ptd = pfn_to_page(hmm_pde_pfn(pte));
+		/* RCU read unlock inside hmm_pt_iter_protect_directory(). */
+		if (hmm_pt_iter_protect_directory(iter, ptd, addr, i + 1) != 1)
+			return NULL;
+	}
+
+	return hmm_pt_iter_ptdp(iter, pt, addr);
+}
+EXPORT_SYMBOL(hmm_pt_iter_update);
+
+dma_addr_t *hmm_pt_iter_fault(struct hmm_pt_iter *iter,
+			      struct hmm_pt *pt,
+			      unsigned long addr)
+{
+	dma_addr_t *ptdp = hmm_pt_iter_update(iter, pt, addr);
+	struct page *new = NULL;
+	int i;
+
+	if (ptdp)
+		return ptdp;
+
+	/* Populate directory tree structures. */
+	for (i = 1; i <= pt->llevel; ++i) {
+		struct page *upper_ptd;
+		dma_addr_t *upper_ptdp;
+
+		if (iter->ptd[i - 1])
+			continue;
+
+		new = new ? new : alloc_page(GFP_HIGHUSER | __GFP_ZERO);
+		if (!new)
+			return NULL;
+
+		upper_ptd = i > 1 ? iter->ptd[i - 2] : NULL;
+		upper_ptdp = i > 1 ? iter->ptdp[i - 2] : pt->pgd;
+		upper_ptdp = &upper_ptdp[hmm_pt_index(pt, addr, i - 1)];
+		hmm_pt_directory_lock(pt, upper_ptd, i - 1);
+		if (((*upper_ptdp) & HMM_PDE_VALID)) {
+			struct page *ptd;
+
+			ptd = pfn_to_page(hmm_pde_pfn(*upper_ptdp));
+			if (atomic_inc_not_zero(&ptd->_mapcount)) {
+				/* Already allocated by another thread. */
+				iter->ptd[i - 1] = ptd;
+				hmm_pt_directory_unlock(pt, upper_ptd, i - 1);
+				iter->ptdp[i - 1] = kmap(ptd);
+				iter->cur = hmm_pt_level_start(pt, addr, i);
+				continue;
+			}
+			/*
+			 * We raced with removal of a dead directory, so it is
+			 * safe to overwrite *upper_ptdp with the new entry.
+			 */
+		}
+		/* Initialize struct page field for the directory. */
+		atomic_set(&new->_mapcount, 1);
+#if USE_SPLIT_PTE_PTLOCKS && !ALLOC_SPLIT_PTLOCKS
+		spin_lock_init(&new->ptl);
+#endif
+		*upper_ptdp = hmm_pde_from_pfn(page_to_pfn(new));
+		hmm_pt_iter_directory_ref(iter, i - 1);
+		/* Unlock upper directory and map the new directory. */
+		hmm_pt_directory_unlock(pt, upper_ptd, i - 1);
+		iter->ptd[i - 1] = new;
+		iter->ptdp[i - 1] = kmap(new);
+		iter->cur = hmm_pt_level_start(pt, addr, i);
+		new = NULL;
+	}
+	if (new)
+		__free_page(new);
+	return hmm_pt_iter_ptdp(iter, pt, addr);
+}
+
+/* hmm_pt_iter_fini() - finalize iterator.
+ *
+ * @iter: Iterator states.
+ * @pt: HMM page table.
+ *
+ * This function will clean up the iterator by unmapping and unreferencing any
+ * directory still mapped and referenced. It will also free any dead directory.
+ */
+void hmm_pt_iter_fini(struct hmm_pt_iter *iter, struct hmm_pt *pt)
+{
+	struct page *ptd, *tmp;
+	unsigned i;
+
+	for (i = pt->llevel; i >= 1; --i) {
+		if (!iter->ptd[i - 1])
+			continue;
+		hmm_pt_iter_unprotect_directory(iter, pt, i);
+	}
+
+	/* Avoid useless synchronize_rcu() if there is no directory to free. */
+	if (list_empty(&iter->dead_directories))
+		return;
+
+	/*
+	 * Some iterators may have dereferenced a dead directory entry and
+	 * looked up the struct page but not yet checked the reference count.
+	 * As all of the above happens in an RCU read-side critical section, we
+	 * must wait for a grace period before freeing any of the dead
+	 * directory pages.
+	 */
+	synchronize_rcu();
+	list_for_each_entry_safe(ptd, tmp, &iter->dead_directories, lru) {
+		list_del(&ptd->lru);
+		atomic_set(&ptd->_mapcount, -1);
+		__free_page(ptd);
+	}
+}
+EXPORT_SYMBOL(hmm_pt_iter_fini);
-- 
1.9.3


2015-05-21 20:23   ` [PATCH 27/36] HMM: CPU page fault on migrated memory jglisse
2015-05-21 20:23   ` [PATCH 28/36] HMM: add mirror fault support for system to device memory migration jglisse
2015-05-21 20:23   ` [PATCH 29/36] IB/mlx5: add a new paramter to __mlx_ib_populated_pas for ODP with HMM jglisse
2015-05-21 20:23   ` [PATCH 30/36] IB/mlx5: add a new paramter to mlx5_ib_update_mtt() " jglisse
2015-05-21 20:23   ` [PATCH 31/36] IB/odp: export rbt_ib_umem_for_each_in_range() jglisse
2015-05-21 20:23   ` [PATCH 32/36] IB/odp/hmm: add new kernel option to use HMM for ODP jglisse
2015-05-21 20:23   ` [PATCH 33/36] IB/odp/hmm: add core infiniband structure and helper for ODP with HMM jglisse
2015-06-24 13:59     ` Haggai Eran
2015-05-21 20:23   ` [PATCH 34/36] IB/mlx5/hmm: add mlx5 HMM device initialization and callback jglisse
2015-05-21 20:23   ` [PATCH 35/36] IB/mlx5/hmm: add page fault support for ODP on HMM jglisse
2015-05-21 20:23   ` [PATCH 36/36] IB/mlx5/hmm: enable ODP using HMM jglisse
2015-05-30  3:01 ` HMM (Heterogeneous Memory Management) v8 John Hubbard
2015-05-31  6:56 ` Haggai Eran
