Linux-mm Archive on lore.kernel.org
 help / color / Atom feed
* [PATCH v4 0/9] Emulated coherent graphics memory take 2
@ 2019-10-08  9:14 Thomas Hellström (VMware)
  2019-10-08  9:15 ` [PATCH v4 1/9] mm: Remove BUG_ON mmap_sem not held from xxx_trans_huge_lock() Thomas Hellström (VMware)
                   ` (8 more replies)
  0 siblings, 9 replies; 32+ messages in thread
From: Thomas Hellström (VMware) @ 2019-10-08  9:14 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: torvalds, Thomas Hellström, Andrew Morton, Matthew Wilcox,
	Will Deacon, Peter Zijlstra, Rik van Riel, Minchan Kim,
	Michal Hocko, Huang Ying, Jérôme Glisse,
	Kirill A . Shutemov

From: Thomas Hellström <thellstrom@vmware.com>

Graphics APIs like OpenGL 4.4 and Vulkan require the graphics driver
to provide coherent graphics memory, meaning that the GPU sees any
content written to the coherent memory on the next GPU operation that
touches that memory, and the CPU sees any content written by the GPU
to that memory immediately after any fence object trailing the GPU
operation is signaled.

Paravirtual drivers that otherwise require explicit synchronization
needs to do this by hooking up dirty tracking to pagefault handlers
and buffer object validation.

Provide mm helpers needed for this and that also allow for huge pmd-
and pud entries (patch 1-3), and the associated vmwgfx code (patch 4-7).

The code has been tested and exercised by a tailored version of mesa
where we disable all explicit synchronization and assume graphics memory
is coherent. The performance loss varies of course; a typical number is
around 5%.

I would like to merge this code through the DRM tree, so an ack to include
the new mm helpers in that merge would be greatly appreciated.

Changes since RFC:
- Merge conflict changes moved to the correct patch. Fixes intra-patchset
  compile errors.
- Be more aggressive when turning ttm vm code into helpers. This makes sure
  we can use a const qualifier on the vmwgfx vm_ops.
- Reinstate a lost comment an fix an error path that was broken when turning
  the ttm vm code into helpers.
- Remove explicit type-casts of struct vm_area_struct::vm_private_data
- Clarify the locking inversion that makes us not being able to use the mm
  pagewalk code.

Changes since v1:
- Removed the vmwgfx maintainer entry for as_dirty_helpers.c, updated
  commit message accordingly
- Removed the TTM patches from the series as they are merged separately
  through DRM.
Changes since v2:
- Split out the pagewalk code from as_dirty_helpers.c and document locking.
- Add pre_vma and post_vma callbacks to the pagewalk code.
- Remove huge pmd and -pud asserts that would trip when we protect vmas with
  struct address_space::i_mmap_rwsem rather than with
  struct vm_area_struct::mmap_sem.
- Do some naming cleanup in as_dirty_helpers.c
Changes since v3:
- Extensive renaming of the dirty helpers including the filename.
- Update walk_page_mapping() doc.
- Update the pagewalk code to not unconditionally split pmds if a pte_entry()
  callback is present. Update the dirty helper pmd_entry accordingly.
- Use separate walk ops for the dirty helpers.
- Update the pagewalk code to take the pagetable lock in walk_pte_range.

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: Kirill A. Shutemov <kirill@shutemov.name>


^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH v4 1/9] mm: Remove BUG_ON mmap_sem not held from xxx_trans_huge_lock()
  2019-10-08  9:14 [PATCH v4 0/9] Emulated coherent graphics memory take 2 Thomas Hellström (VMware)
@ 2019-10-08  9:15 ` Thomas Hellström (VMware)
  2019-10-08  9:15 ` [PATCH v4 2/9] mm: pagewalk: Take the pagetable lock in walk_pte_range() Thomas Hellström (VMware)
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 32+ messages in thread
From: Thomas Hellström (VMware) @ 2019-10-08  9:15 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: torvalds, Thomas Hellstrom, Andrew Morton, Matthew Wilcox,
	Will Deacon, Peter Zijlstra, Rik van Riel, Minchan Kim,
	Michal Hocko, Huang Ying, Jérôme Glisse,
	Kirill A . Shutemov

From: Thomas Hellstrom <thellstrom@vmware.com>

The caller needs to make sure that the vma is not torn down during the
lock operation and can also use the i_mmap_rwsem for file-backed vmas.
Remove the BUG_ON. We could, as an alternative, add a test that either
vma->vm_mm->mmap_sem or vma->vm_file->f_mapping->i_mmap_rwsem are held.

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Signed-off-by: Thomas Hellstrom <thellstrom@vmware.com>
Acked-by: Kirill A. Shutemov <kirill@shutemov.name>
---
 include/linux/huge_mm.h | 2 --
 1 file changed, 2 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 93d5cf0bc716..0b84e13e88e2 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -216,7 +216,6 @@ static inline int is_swap_pmd(pmd_t pmd)
 static inline spinlock_t *pmd_trans_huge_lock(pmd_t *pmd,
 		struct vm_area_struct *vma)
 {
-	VM_BUG_ON_VMA(!rwsem_is_locked(&vma->vm_mm->mmap_sem), vma);
 	if (is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) || pmd_devmap(*pmd))
 		return __pmd_trans_huge_lock(pmd, vma);
 	else
@@ -225,7 +224,6 @@ static inline spinlock_t *pmd_trans_huge_lock(pmd_t *pmd,
 static inline spinlock_t *pud_trans_huge_lock(pud_t *pud,
 		struct vm_area_struct *vma)
 {
-	VM_BUG_ON_VMA(!rwsem_is_locked(&vma->vm_mm->mmap_sem), vma);
 	if (pud_trans_huge(*pud) || pud_devmap(*pud))
 		return __pud_trans_huge_lock(pud, vma);
 	else
-- 
2.21.0



^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH v4 2/9] mm: pagewalk: Take the pagetable lock in walk_pte_range()
  2019-10-08  9:14 [PATCH v4 0/9] Emulated coherent graphics memory take 2 Thomas Hellström (VMware)
  2019-10-08  9:15 ` [PATCH v4 1/9] mm: Remove BUG_ON mmap_sem not held from xxx_trans_huge_lock() Thomas Hellström (VMware)
@ 2019-10-08  9:15 ` Thomas Hellström (VMware)
  2019-10-09 15:14   ` Kirill A. Shutemov
  2019-10-08  9:15 ` [PATCH v4 3/9] mm: pagewalk: Don't split transhuge pmds when a pmd_entry is present Thomas Hellström (VMware)
                   ` (6 subsequent siblings)
  8 siblings, 1 reply; 32+ messages in thread
From: Thomas Hellström (VMware) @ 2019-10-08  9:15 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: torvalds, Thomas Hellstrom, Matthew Wilcox, Will Deacon,
	Peter Zijlstra, Rik van Riel, Minchan Kim, Michal Hocko,
	Huang Ying, Jérôme Glisse, Kirill A . Shutemov

From: Thomas Hellstrom <thellstrom@vmware.com>

Without the lock, anybody modifying a pte from within this function might
have it concurrently modified by someone else.

Cc: Matthew Wilcox <willy@infradead.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Thomas Hellstrom <thellstrom@vmware.com>
---
 mm/pagewalk.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/mm/pagewalk.c b/mm/pagewalk.c
index d48c2a986ea3..83c0b78363b4 100644
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -10,8 +10,9 @@ static int walk_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 	pte_t *pte;
 	int err = 0;
 	const struct mm_walk_ops *ops = walk->ops;
+	spinlock_t *ptl;
 
-	pte = pte_offset_map(pmd, addr);
+	pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
 	for (;;) {
 		err = ops->pte_entry(pte, addr, addr + PAGE_SIZE, walk);
 		if (err)
@@ -22,7 +23,7 @@ static int walk_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 		pte++;
 	}
 
-	pte_unmap(pte);
+	pte_unmap_unlock(pte - 1, ptl);
 	return err;
 }
 
-- 
2.21.0



^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH v4 3/9] mm: pagewalk: Don't split transhuge pmds when a pmd_entry is present
  2019-10-08  9:14 [PATCH v4 0/9] Emulated coherent graphics memory take 2 Thomas Hellström (VMware)
  2019-10-08  9:15 ` [PATCH v4 1/9] mm: Remove BUG_ON mmap_sem not held from xxx_trans_huge_lock() Thomas Hellström (VMware)
  2019-10-08  9:15 ` [PATCH v4 2/9] mm: pagewalk: Take the pagetable lock in walk_pte_range() Thomas Hellström (VMware)
@ 2019-10-08  9:15 ` Thomas Hellström (VMware)
  2019-10-09 15:27   ` Kirill A. Shutemov
  2019-10-08  9:15 ` [PATCH v4 4/9] mm: Add a walk_page_mapping() function to the pagewalk code Thomas Hellström (VMware)
                   ` (5 subsequent siblings)
  8 siblings, 1 reply; 32+ messages in thread
From: Thomas Hellström (VMware) @ 2019-10-08  9:15 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: torvalds, Thomas Hellstrom, Matthew Wilcox, Will Deacon,
	Peter Zijlstra, Rik van Riel, Minchan Kim, Michal Hocko,
	Huang Ying, Jérôme Glisse, Kirill A . Shutemov

From: Thomas Hellstrom <thellstrom@vmware.com>

The pagewalk code was unconditionally splitting transhuge pmds when a
pte_entry was present. However ideally we'd want to handle transhuge pmds
in the pmd_entry function and ptes in pte_entry function. So don't split
huge pmds when there is a pmd_entry function present, but let the callback
take care of it if necessary.

In order to make sure a virtual address range is handled by one and only
one callback, and since pmd entries may be unstable, we introduce a
pmd_entry return code that tells the walk code to continue processing this
pmd entry rather than to move on. Since caller-defined positive return
codes (up to 2) are used by current callers, use a high value that allows a
large range of positive caller-defined return codes for future users.

Cc: Matthew Wilcox <willy@infradead.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Thomas Hellstrom <thellstrom@vmware.com>
---
 include/linux/pagewalk.h |  8 ++++++++
 mm/pagewalk.c            | 28 +++++++++++++++++++++-------
 2 files changed, 29 insertions(+), 7 deletions(-)

diff --git a/include/linux/pagewalk.h b/include/linux/pagewalk.h
index bddd9759bab9..c4a013eb445d 100644
--- a/include/linux/pagewalk.h
+++ b/include/linux/pagewalk.h
@@ -4,6 +4,11 @@
 
 #include <linux/mm.h>
 
+/* Highest positive pmd_entry caller-specific return value */
+#define PAGE_WALK_CALLER_MAX     (INT_MAX / 2)
+/* The handler did not handle the entry. Fall back to the next level */
+#define PAGE_WALK_FALLBACK       (PAGE_WALK_CALLER_MAX + 1)
+
 struct mm_walk;
 
 /**
@@ -16,6 +21,9 @@ struct mm_walk;
  *			this handler is required to be able to handle
  *			pmd_trans_huge() pmds.  They may simply choose to
  *			split_huge_page() instead of handling it explicitly.
+ *                      If the handler did not handle the PMD, or split the
+ *                      PMD and wants it handled by the PTE handler, it
+ *                      should return PAGE_WALK_FALLBACK.
  * @pte_entry:		if set, called for each non-empty PTE (4th-level) entry
  * @pte_hole:		if set, called for each hole at all levels
  * @hugetlb_entry:	if set, called for each hugetlb entry
diff --git a/mm/pagewalk.c b/mm/pagewalk.c
index 83c0b78363b4..f844c2a2aa60 100644
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -50,10 +50,18 @@ static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
 		 * This implies that each ->pmd_entry() handler
 		 * needs to know about pmd_trans_huge() pmds
 		 */
-		if (ops->pmd_entry)
+		if (ops->pmd_entry) {
 			err = ops->pmd_entry(pmd, addr, next, walk);
-		if (err)
-			break;
+			if (!err)
+				continue;
+			else if (err <= PAGE_WALK_CALLER_MAX)
+				break;
+			WARN_ON(err != PAGE_WALK_FALLBACK);
+			err = 0;
+			if (pmd_trans_unstable(pmd))
+				goto again;
+			/* Fall through */
+		}
 
 		/*
 		 * Check this here so we only break down trans_huge
@@ -61,8 +69,8 @@ static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
 		 */
 		if (!ops->pte_entry)
 			continue;
-
-		split_huge_pmd(walk->vma, pmd, addr);
+		if (!ops->pmd_entry)
+			split_huge_pmd(walk->vma, pmd, addr);
 		if (pmd_trans_unstable(pmd))
 			goto again;
 		err = walk_pte_range(pmd, addr, next, walk);
@@ -281,11 +289,17 @@ static int __walk_page_range(unsigned long start, unsigned long end,
  *
  *  - 0  : succeeded to handle the current entry, and if you don't reach the
  *         end address yet, continue to walk.
- *  - >0 : succeeded to handle the current entry, and return to the caller
- *         with caller specific value.
+ *  - >0, and <= PAGE_WALK_CALLER_MAX : succeeded to handle the current entry,
+ *         and return to the caller with caller specific value.
  *  - <0 : failed to handle the current entry, and return to the caller
  *         with error code.
  *
+ * For pmd_entry(), a value <= PAGE_WALK_CALLER_MAX indicates that the entry
+ * was handled by the callback. PAGE_WALK_FALLBACK indicates that the entry
+ * could not be handled by the callback and should be re-checked. If the
+ * callback needs the entry to be handled by the next level, it should
+ * split the entry and then return PAGE_WALK_FALLBACK.
+ *
  * Before starting to walk page table, some callers want to check whether
  * they really want to walk over the current vma, typically by checking
  * its vm_flags. walk_page_test() and @ops->test_walk() are used for this
-- 
2.21.0



^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH v4 4/9] mm: Add a walk_page_mapping() function to the pagewalk code
  2019-10-08  9:14 [PATCH v4 0/9] Emulated coherent graphics memory take 2 Thomas Hellström (VMware)
                   ` (2 preceding siblings ...)
  2019-10-08  9:15 ` [PATCH v4 3/9] mm: pagewalk: Don't split transhuge pmds when a pmd_entry is present Thomas Hellström (VMware)
@ 2019-10-08  9:15 ` Thomas Hellström (VMware)
  2019-10-08  9:15 ` [PATCH v4 5/9] mm: Add write-protect and clean utilities for address space ranges Thomas Hellström (VMware)
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 32+ messages in thread
From: Thomas Hellström (VMware) @ 2019-10-08  9:15 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: torvalds, Thomas Hellstrom, Andrew Morton, Matthew Wilcox,
	Will Deacon, Peter Zijlstra, Rik van Riel, Minchan Kim,
	Michal Hocko, Huang Ying, Jérôme Glisse,
	Kirill A . Shutemov

From: Thomas Hellstrom <thellstrom@vmware.com>

For users that want to travers all page table entries pointing into a
region of a struct address_space mapping, introduce a walk_page_mapping()
function.

The walk_page_mapping() function will be initially be used for dirty-
tracking in virtual graphics drivers.

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Signed-off-by: Thomas Hellstrom <thellstrom@vmware.com>
---
 include/linux/pagewalk.h |  9 ++++
 mm/pagewalk.c            | 94 +++++++++++++++++++++++++++++++++++++++-
 2 files changed, 102 insertions(+), 1 deletion(-)

diff --git a/include/linux/pagewalk.h b/include/linux/pagewalk.h
index c4a013eb445d..9b2742911eff 100644
--- a/include/linux/pagewalk.h
+++ b/include/linux/pagewalk.h
@@ -32,6 +32,9 @@ struct mm_walk;
  *			"do page table walk over the current vma", returning
  *			a negative value means "abort current page table walk
  *			right now" and returning 1 means "skip the current vma"
+ * @pre_vma:            if set, called before starting walk on a non-null vma.
+ * @post_vma:           if set, called after a walk on a non-null vma, provided
+ *                      that @pre_vma and the vma walk succeeded.
  */
 struct mm_walk_ops {
 	int (*pud_entry)(pud_t *pud, unsigned long addr,
@@ -47,6 +50,9 @@ struct mm_walk_ops {
 			     struct mm_walk *walk);
 	int (*test_walk)(unsigned long addr, unsigned long next,
 			struct mm_walk *walk);
+	int (*pre_vma)(unsigned long start, unsigned long end,
+		       struct mm_walk *walk);
+	void (*post_vma)(struct mm_walk *walk);
 };
 
 /**
@@ -70,5 +76,8 @@ int walk_page_range(struct mm_struct *mm, unsigned long start,
 		void *private);
 int walk_page_vma(struct vm_area_struct *vma, const struct mm_walk_ops *ops,
 		void *private);
+int walk_page_mapping(struct address_space *mapping, pgoff_t first_index,
+		      pgoff_t nr, const struct mm_walk_ops *ops,
+		      void *private);
 
 #endif /* _LINUX_PAGEWALK_H */
diff --git a/mm/pagewalk.c b/mm/pagewalk.c
index f844c2a2aa60..8de4574e7753 100644
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -262,13 +262,23 @@ static int __walk_page_range(unsigned long start, unsigned long end,
 {
 	int err = 0;
 	struct vm_area_struct *vma = walk->vma;
+	const struct mm_walk_ops *ops = walk->ops;
+
+	if (vma && ops->pre_vma) {
+		err = ops->pre_vma(start, end, walk);
+		if (err)
+			return err;
+	}
 
 	if (vma && is_vm_hugetlb_page(vma)) {
-		if (walk->ops->hugetlb_entry)
+		if (ops->hugetlb_entry)
 			err = walk_hugetlb_range(start, end, walk);
 	} else
 		err = walk_pgd_range(start, end, walk);
 
+	if (vma && ops->post_vma)
+		ops->post_vma(walk);
+
 	return err;
 }
 
@@ -305,6 +315,11 @@ static int __walk_page_range(unsigned long start, unsigned long end,
  * its vm_flags. walk_page_test() and @ops->test_walk() are used for this
  * purpose.
  *
+ * If operations need to be staged before and committed after a vma is walked,
+ * there are two callbacks, pre_vma() and post_vma(). Note that post_vma(),
+ * since it is intended to handle commit-type operations, can't return any
+ * errors.
+ *
  * struct mm_walk keeps current values of some common data like vma and pmd,
  * which are useful for the access from callbacks. If you want to pass some
  * caller-specific data to callbacks, @private should be helpful.
@@ -391,3 +406,80 @@ int walk_page_vma(struct vm_area_struct *vma, const struct mm_walk_ops *ops,
 		return err;
 	return __walk_page_range(vma->vm_start, vma->vm_end, &walk);
 }
+
+/**
+ * walk_page_mapping - walk all memory areas mapped into a struct address_space.
+ * @mapping: Pointer to the struct address_space
+ * @first_index: First page offset in the address_space
+ * @nr: Number of incremental page offsets to cover
+ * @ops:	operation to call during the walk
+ * @private:	private data for callbacks' usage
+ *
+ * This function walks all memory areas mapped into a struct address_space.
+ * The walk is limited to only the given page-size index range, but if
+ * the index boundaries cross a huge page-table entry, that entry will be
+ * included.
+ *
+ * Also see walk_page_range() for additional information.
+ *
+ * Locking:
+ *   This function can't require that the struct mm_struct::mmap_sem is held,
+ *   since @mapping may be mapped by multiple processes. Instead
+ *   @mapping->i_mmap_rwsem must be held. This might have implications in the
+ *   callbacks, and it's up tho the caller to ensure that the
+ *   struct mm_struct::mmap_sem is not needed.
+ *
+ *   Also this means that a caller can't rely on the struct
+ *   vm_area_struct::vm_flags to be constant across a call,
+ *   except for immutable flags. Callers requiring this shouldn't use
+ *   this function.
+ *
+ * Return: 0 on success, negative error code on failure, positive number on
+ * caller defined premature termination.
+ */
+int walk_page_mapping(struct address_space *mapping, pgoff_t first_index,
+		      pgoff_t nr, const struct mm_walk_ops *ops,
+		      void *private)
+{
+	struct mm_walk walk = {
+		.ops		= ops,
+		.private	= private,
+	};
+	struct vm_area_struct *vma;
+	pgoff_t vba, vea, cba, cea;
+	unsigned long start_addr, end_addr;
+	int err = 0;
+
+	lockdep_assert_held(&mapping->i_mmap_rwsem);
+	vma_interval_tree_foreach(vma, &mapping->i_mmap, first_index,
+				  first_index + nr - 1) {
+		/* Clip to the vma */
+		vba = vma->vm_pgoff;
+		vea = vba + vma_pages(vma);
+		cba = first_index;
+		cba = max(cba, vba);
+		cea = first_index + nr;
+		cea = min(cea, vea);
+
+		start_addr = ((cba - vba) << PAGE_SHIFT) + vma->vm_start;
+		end_addr = ((cea - vba) << PAGE_SHIFT) + vma->vm_start;
+		if (start_addr >= end_addr)
+			continue;
+
+		walk.vma = vma;
+		walk.mm = vma->vm_mm;
+
+		err = walk_page_test(vma->vm_start, vma->vm_end, &walk);
+		if (err > 0) {
+			err = 0;
+			break;
+		} else if (err < 0)
+			break;
+
+		err = __walk_page_range(start_addr, end_addr, &walk);
+		if (err)
+			break;
+	}
+
+	return err;
+}
-- 
2.21.0



^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH v4 5/9] mm: Add write-protect and clean utilities for address space ranges
  2019-10-08  9:14 [PATCH v4 0/9] Emulated coherent graphics memory take 2 Thomas Hellström (VMware)
                   ` (3 preceding siblings ...)
  2019-10-08  9:15 ` [PATCH v4 4/9] mm: Add a walk_page_mapping() function to the pagewalk code Thomas Hellström (VMware)
@ 2019-10-08  9:15 ` Thomas Hellström (VMware)
  2019-10-08 17:06   ` Linus Torvalds
  2019-10-08  9:15 ` [PATCH v4 6/9] drm/vmwgfx: Implement an infrastructure for write-coherent resources Thomas Hellström (VMware)
                   ` (3 subsequent siblings)
  8 siblings, 1 reply; 32+ messages in thread
From: Thomas Hellström (VMware) @ 2019-10-08  9:15 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: torvalds, Thomas Hellstrom, Andrew Morton, Matthew Wilcox,
	Will Deacon, Peter Zijlstra, Rik van Riel, Minchan Kim,
	Michal Hocko, Huang Ying, Jérôme Glisse,
	Kirill A . Shutemov

From: Thomas Hellstrom <thellstrom@vmware.com>

Add two utilities to 1) write-protect and 2) clean all ptes pointing into
a range of an address space.
The utilities are intended to aid in tracking dirty pages (either
driver-allocated system memory or pci device memory).
The write-protect utility should be used in conjunction with
page_mkwrite() and pfn_mkwrite() to trigger write page-faults on page
accesses. Typically one would want to use this on sparse accesses into
large memory regions. The clean utility should be used to utilize
hardware dirtying functionality and avoid the overhead of page-faults,
typically on large accesses into small memory regions.

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Signed-off-by: Thomas Hellstrom <thellstrom@vmware.com>
---
 include/linux/mm.h         |  13 +-
 mm/Kconfig                 |   3 +
 mm/Makefile                |   1 +
 mm/mapping_dirty_helpers.c | 308 +++++++++++++++++++++++++++++++++++++
 4 files changed, 324 insertions(+), 1 deletion(-)
 create mode 100644 mm/mapping_dirty_helpers.c

diff --git a/include/linux/mm.h b/include/linux/mm.h
index cc292273e6ba..4bc93477375e 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2637,7 +2637,6 @@ typedef int (*pte_fn_t)(pte_t *pte, unsigned long addr, void *data);
 extern int apply_to_page_range(struct mm_struct *mm, unsigned long address,
 			       unsigned long size, pte_fn_t fn, void *data);
 
-
 #ifdef CONFIG_PAGE_POISONING
 extern bool page_poisoning_enabled(void);
 extern void kernel_poison_pages(struct page *page, int numpages, int enable);
@@ -2878,5 +2877,17 @@ static inline int pages_identical(struct page *page1, struct page *page2)
 	return !memcmp_pages(page1, page2);
 }
 
+#ifdef CONFIG_MAPPING_DIRTY_HELPERS
+unsigned long clean_record_shared_mapping_range(struct address_space *mapping,
+						pgoff_t first_index, pgoff_t nr,
+						pgoff_t bitmap_pgoff,
+						unsigned long *bitmap,
+						pgoff_t *start,
+						pgoff_t *end);
+
+unsigned long wp_shared_mapping_range(struct address_space *mapping,
+				      pgoff_t first_index, pgoff_t nr);
+#endif
+
 #endif /* __KERNEL__ */
 #endif /* _LINUX_MM_H */
diff --git a/mm/Kconfig b/mm/Kconfig
index a5dae9a7eb51..550f7aceb679 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -736,4 +736,7 @@ config ARCH_HAS_PTE_SPECIAL
 config ARCH_HAS_HUGEPD
 	bool
 
+config MAPPING_DIRTY_HELPERS
+        bool
+
 endmenu
diff --git a/mm/Makefile b/mm/Makefile
index d996846697ef..1937cc251883 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -107,3 +107,4 @@ obj-$(CONFIG_PERCPU_STATS) += percpu-stats.o
 obj-$(CONFIG_ZONE_DEVICE) += memremap.o
 obj-$(CONFIG_HMM_MIRROR) += hmm.o
 obj-$(CONFIG_MEMFD_CREATE) += memfd.o
+obj-$(CONFIG_MAPPING_DIRTY_HELPERS) += mapping_dirty_helpers.o
diff --git a/mm/mapping_dirty_helpers.c b/mm/mapping_dirty_helpers.c
new file mode 100644
index 000000000000..10572e558154
--- /dev/null
+++ b/mm/mapping_dirty_helpers.c
@@ -0,0 +1,308 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/pagewalk.h>
+#include <linux/hugetlb.h>
+#include <linux/bitops.h>
+#include <linux/mmu_notifier.h>
+#include <asm/cacheflush.h>
+#include <asm/tlbflush.h>
+
+/**
+ * struct wp_walk - Private struct for pagetable walk callbacks
+ * @range: Range for mmu notifiers
+ * @tlbflush_start: Address of first modified pte
+ * @tlbflush_end: Address of last modified pte + 1
+ * @total: Total number of modified ptes
+ */
+struct wp_walk {
+	struct mmu_notifier_range range;
+	unsigned long tlbflush_start;
+	unsigned long tlbflush_end;
+	unsigned long total;
+};
+
+/**
+ * wp_pte - Write-protect a pte
+ * @pte: Pointer to the pte
+ * @addr: The virtual page address
+ * @walk: pagetable walk callback argument
+ *
+ * The function write-protects a pte and records the range in
+ * virtual address space of touched ptes for efficient range TLB flushes.
+ */
+static int wp_pte(pte_t *pte, unsigned long addr, unsigned long end,
+		  struct mm_walk *walk)
+{
+	struct wp_walk *wpwalk = walk->private;
+	pte_t ptent = *pte;
+
+	if (pte_write(ptent)) {
+		pte_t old_pte = ptep_modify_prot_start(walk->vma, addr, pte);
+
+		ptent = pte_wrprotect(old_pte);
+		ptep_modify_prot_commit(walk->vma, addr, pte, old_pte, ptent);
+		wpwalk->total++;
+		wpwalk->tlbflush_start = min(wpwalk->tlbflush_start, addr);
+		wpwalk->tlbflush_end = max(wpwalk->tlbflush_end,
+					   addr + PAGE_SIZE);
+	}
+
+	return 0;
+}
+
+/**
+ * struct clean_walk - Private struct for the clean_record_pte function.
+ * @base: struct wp_walk we derive from
+ * @bitmap_pgoff: Address_space Page offset of the first bit in @bitmap
+ * @bitmap: Bitmap with one bit for each page offset in the address_space range
+ * covered.
+ * @start: Address_space page offset of first modified pte relative
+ * to @bitmap_pgoff
+ * @end: Address_space page offset of last modified pte relative
+ * to @bitmap_pgoff
+ */
+struct clean_walk {
+	struct wp_walk base;
+	pgoff_t bitmap_pgoff;
+	unsigned long *bitmap;
+	pgoff_t start;
+	pgoff_t end;
+};
+
+#define to_clean_walk(_wpwalk) container_of(_wpwalk, struct clean_walk, base)
+
+/**
+ * clean_record_pte - Clean a pte and record its address space offset in a
+ * bitmap
+ * @pte: Pointer to the pte
+ * @addr: The virtual page address
+ * @walk: pagetable walk callback argument
+ *
+ * The function cleans a pte and records the range in
+ * virtual address space of touched ptes for efficient TLB flushes.
+ * It also records dirty ptes in a bitmap representing page offsets
+ * in the address_space, as well as the first and last of the bits
+ * touched.
+ */
+static int clean_record_pte(pte_t *pte, unsigned long addr,
+			    unsigned long end, struct mm_walk *walk)
+{
+	struct wp_walk *wpwalk = walk->private;
+	struct clean_walk *cwalk = to_clean_walk(wpwalk);
+	pte_t ptent = *pte;
+
+	if (pte_dirty(ptent)) {
+		pgoff_t pgoff = ((addr - walk->vma->vm_start) >> PAGE_SHIFT) +
+			walk->vma->vm_pgoff - cwalk->bitmap_pgoff;
+		pte_t old_pte = ptep_modify_prot_start(walk->vma, addr, pte);
+
+		ptent = pte_mkclean(old_pte);
+		ptep_modify_prot_commit(walk->vma, addr, pte, old_pte, ptent);
+
+		wpwalk->total++;
+		wpwalk->tlbflush_start = min(wpwalk->tlbflush_start, addr);
+		wpwalk->tlbflush_end = max(wpwalk->tlbflush_end,
+					   addr + PAGE_SIZE);
+
+		__set_bit(pgoff, cwalk->bitmap);
+		cwalk->start = min(cwalk->start, pgoff);
+		cwalk->end = max(cwalk->end, pgoff + 1);
+	}
+
+	return 0;
+}
+
+/* wp_clean_pmd_entry - The pagewalk pmd callback. */
+static int wp_clean_pmd_entry(pmd_t *pmd, unsigned long addr, unsigned long end,
+			      struct mm_walk *walk)
+{
+	/* Dirty-tracking should be handled on the pte level */
+	pmd_t pmdval = pmd_read_atomic(pmd);
+
+	if (pmd_trans_huge(pmdval) || pmd_devmap(pmdval)) {
+		WARN_ON(pmd_write(pmdval) || pmd_dirty(pmdval));
+		return 0;
+	}
+
+	return PAGE_WALK_FALLBACK;
+}
+
+/* wp_clean_pud_entry - The pagewalk pud callback. */
+static int wp_clean_pud_entry(pud_t *pud, unsigned long addr, unsigned long end,
+			      struct mm_walk *walk)
+{
+	/* Dirty-tracking should be handled on the pte level */
+	pud_t pudval = READ_ONCE(*pud);
+
+	if (pud_trans_huge(pudval) || pud_devmap(pudval))
+		WARN_ON(pud_write(pudval) || pud_dirty(pudval));
+
+	return 0;
+}
+
+/*
+ * wp_clean_pre_vma - The pagewalk pre_vma callback.
+ *
+ * The pre_vma callback performs the cache flush, stages the tlb flush
+ * and calls the necessary mmu notifiers.
+ */
+static int wp_clean_pre_vma(unsigned long start, unsigned long end,
+			    struct mm_walk *walk)
+{
+	struct wp_walk *wpwalk = walk->private;
+
+	wpwalk->tlbflush_start = end;
+	wpwalk->tlbflush_end = start;
+
+	mmu_notifier_range_init(&wpwalk->range, MMU_NOTIFY_PROTECTION_PAGE, 0,
+				walk->vma, walk->mm, start, end);
+	mmu_notifier_invalidate_range_start(&wpwalk->range);
+	flush_cache_range(walk->vma, start, end);
+
+	/*
+	 * We're not using tlb_gather_mmu() since typically
+	 * only a small subrange of PTEs are affected, whereas
+	 * tlb_gather_mmu() records the full range.
+	 */
+	inc_tlb_flush_pending(walk->mm);
+
+	return 0;
+}
+
+/*
+ * wp_clean_post_vma - The pagewalk post_vma callback.
+ *
+ * The post_vma callback performs the tlb flush and calls necessary mmu
+ * notifiers.
+ */
+static void wp_clean_post_vma(struct mm_walk *walk)
+{
+	struct wp_walk *wpwalk = walk->private;
+
+	if (wpwalk->tlbflush_end > wpwalk->tlbflush_start)
+		flush_tlb_range(walk->vma, wpwalk->tlbflush_start,
+				wpwalk->tlbflush_end);
+
+	mmu_notifier_invalidate_range_end(&wpwalk->range);
+	dec_tlb_flush_pending(walk->mm);
+}
+
+/*
+ * wp_clean_test_walk - The pagewalk test_walk callback.
+ *
+ * Won't perform dirty-tracking on COW, read-only or HUGETLB vmas.
+ */
+static int wp_clean_test_walk(unsigned long start, unsigned long end,
+			      struct mm_walk *walk)
+{
+	/* Skip non-applicable VMAs */
+	if ((walk->vma->vm_flags & (VM_SHARED | VM_MAYWRITE | VM_HUGETLB)) !=
+	    (VM_SHARED | VM_MAYWRITE))
+		return 1;
+
+	return 0;
+}
+
+static const struct mm_walk_ops clean_walk_ops = {
+	.pte_entry = clean_record_pte,
+	.pmd_entry = wp_clean_pmd_entry,
+	.pud_entry = wp_clean_pud_entry,
+	.test_walk = wp_clean_test_walk,
+	.pre_vma = wp_clean_pre_vma,
+	.post_vma = wp_clean_post_vma
+};
+
+static const struct mm_walk_ops wp_walk_ops = {
+	.pte_entry = wp_pte,
+	.pmd_entry = wp_clean_pmd_entry,
+	.pud_entry = wp_clean_pud_entry,
+	.test_walk = wp_clean_test_walk,
+	.pre_vma = wp_clean_pre_vma,
+	.post_vma = wp_clean_post_vma
+};
+
+/**
+ * wp_shared_mapping_range - Write-protect all ptes in an address space range
+ * @mapping: The address_space we want to write protect
+ * @first_index: The first page offset in the range
+ * @nr: Number of incremental page offsets to cover
+ *
+ * Note: This function currently skips transhuge page-table entries, since
+ * it's intended for dirty-tracking on the PTE level. It will warn on
+ * encountering transhuge write-enabled entries, though, and can easily be
+ * extended to handle them as well.
+ *
+ * Return: The number of ptes actually write-protected. Note that
+ * already write-protected ptes are not counted.
+ */
+unsigned long wp_shared_mapping_range(struct address_space *mapping,
+				      pgoff_t first_index, pgoff_t nr)
+{
+	struct wp_walk wpwalk = { .total = 0 };
+
+	i_mmap_lock_read(mapping);
+	WARN_ON(walk_page_mapping(mapping, first_index, nr, &wp_walk_ops,
+				  &wpwalk));
+	i_mmap_unlock_read(mapping);
+
+	return wpwalk.total;
+}
+EXPORT_SYMBOL_GPL(wp_shared_mapping_range);
+
+/**
+ * clean_record_shared_mapping_range - Clean and record all ptes in an
+ * address space range
+ * @mapping: The address_space we want to clean
+ * @first_index: The first page offset in the range
+ * @nr: Number of incremental page offsets to cover
+ * @bitmap_pgoff: The page offset of the first bit in @bitmap
+ * @bitmap: Pointer to a bitmap of at least @nr bits. The bitmap needs to
+ * cover the whole range @first_index..@first_index + @nr.
+ * @start: Pointer to number of the first set bit in @bitmap.
+ * is modified as new bits are set by the function.
+ * @end: Pointer to the number of the last set bit in @bitmap.
+ * none set. The value is modified as new bits are set by the function.
+ *
+ * Note: When this function returns there is no guarantee that a CPU has
+ * not already dirtied new ptes. However it will not clean any ptes not
+ * reported in the bitmap.
+ *
+ * If a caller needs to make sure all dirty ptes are picked up and none
+ * additional are added, it first needs to write-protect the address-space
+ * range and make sure new writers are blocked in page_mkwrite() or
+ * pfn_mkwrite(). And then after a TLB flush following the write-protection
+ * pick up all dirty bits.
+ *
+ * Note: This function currently skips transhuge page-table entries, since
+ * it's intended for dirty-tracking on the PTE level. It will warn on
+ * encountering transhuge dirty entries, though, and can easily be extended
+ * to handle them as well.
+ *
+ * Return: The number of dirty ptes actually cleaned.
+ */
+unsigned long clean_record_shared_mapping_range(struct address_space *mapping,
+						pgoff_t first_index, pgoff_t nr,
+						pgoff_t bitmap_pgoff,
+						unsigned long *bitmap,
+						pgoff_t *start,
+						pgoff_t *end)
+{
+	bool none_set = (*start >= *end);
+	struct clean_walk cwalk = {
+		.base = { .total = 0 },
+		.bitmap_pgoff = bitmap_pgoff,
+		.bitmap = bitmap,
+		.start = none_set ? nr : *start,
+		.end = none_set ? 0 : *end,
+	};
+
+	i_mmap_lock_read(mapping);
+	WARN_ON(walk_page_mapping(mapping, first_index, nr, &clean_walk_ops,
+				  &cwalk.base));
+	i_mmap_unlock_read(mapping);
+
+	*start = cwalk.start;
+	*end = cwalk.end;
+
+	return cwalk.base.total;
+}
+EXPORT_SYMBOL_GPL(clean_record_shared_mapping_range);
-- 
2.21.0



^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH v4 6/9] drm/vmwgfx: Implement an infrastructure for write-coherent resources
  2019-10-08  9:14 [PATCH v4 0/9] Emulated coherent graphics memory take 2 Thomas Hellström (VMware)
                   ` (4 preceding siblings ...)
  2019-10-08  9:15 ` [PATCH v4 5/9] mm: Add write-protect and clean utilities for address space ranges Thomas Hellström (VMware)
@ 2019-10-08  9:15 ` Thomas Hellström (VMware)
  2019-10-08  9:15 ` [PATCH v4 7/9] drm/vmwgfx: Use an RBtree instead of linked list for MOB resources Thomas Hellström (VMware)
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 32+ messages in thread
From: Thomas Hellström (VMware) @ 2019-10-08  9:15 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: torvalds, Thomas Hellstrom, Andrew Morton, Matthew Wilcox,
	Will Deacon, Peter Zijlstra, Rik van Riel, Minchan Kim,
	Michal Hocko, Huang Ying, Jérôme Glisse,
	Kirill A . Shutemov, Deepak Rawat

From: Thomas Hellstrom <thellstrom@vmware.com>

This infrastructure will, for coherent resources, make sure that
from the user-space point of view, data written by the CPU is immediately
automatically available to the GPU at resource validation time.

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Signed-off-by: Thomas Hellstrom <thellstrom@vmware.com>
Reviewed-by: Deepak Rawat <drawat@vmware.com>
---
 drivers/gpu/drm/vmwgfx/Kconfig                |   1 +
 drivers/gpu/drm/vmwgfx/Makefile               |   2 +-
 drivers/gpu/drm/vmwgfx/vmwgfx_bo.c            |   5 +-
 drivers/gpu/drm/vmwgfx/vmwgfx_drv.h           |  23 +-
 drivers/gpu/drm/vmwgfx/vmwgfx_execbuf.c       |   1 -
 drivers/gpu/drm/vmwgfx/vmwgfx_page_dirty.c    | 421 ++++++++++++++++++
 drivers/gpu/drm/vmwgfx/vmwgfx_resource.c      |  57 +++
 drivers/gpu/drm/vmwgfx/vmwgfx_resource_priv.h |  11 +
 drivers/gpu/drm/vmwgfx/vmwgfx_ttm_glue.c      |  15 +-
 drivers/gpu/drm/vmwgfx/vmwgfx_validation.c    |  71 +++
 drivers/gpu/drm/vmwgfx/vmwgfx_validation.h    |  16 +-
 11 files changed, 602 insertions(+), 21 deletions(-)
 create mode 100644 drivers/gpu/drm/vmwgfx/vmwgfx_page_dirty.c

diff --git a/drivers/gpu/drm/vmwgfx/Kconfig b/drivers/gpu/drm/vmwgfx/Kconfig
index 6b28a326f8bb..15acdf2a7c0f 100644
--- a/drivers/gpu/drm/vmwgfx/Kconfig
+++ b/drivers/gpu/drm/vmwgfx/Kconfig
@@ -8,6 +8,7 @@ config DRM_VMWGFX
 	select FB_CFB_IMAGEBLIT
 	select DRM_TTM
 	select FB
+	select MAPPING_DIRTY_HELPERS
 	# Only needed for the transitional use of drm_crtc_init - can be removed
 	# again once vmwgfx sets up the primary plane itself.
 	select DRM_KMS_HELPER
diff --git a/drivers/gpu/drm/vmwgfx/Makefile b/drivers/gpu/drm/vmwgfx/Makefile
index 8841bd30e1e5..c877a21a0739 100644
--- a/drivers/gpu/drm/vmwgfx/Makefile
+++ b/drivers/gpu/drm/vmwgfx/Makefile
@@ -8,7 +8,7 @@ vmwgfx-y := vmwgfx_execbuf.o vmwgfx_gmr.o vmwgfx_kms.o vmwgfx_drv.o \
 	    vmwgfx_cmdbuf_res.o vmwgfx_cmdbuf.o vmwgfx_stdu.o \
 	    vmwgfx_cotable.o vmwgfx_so.o vmwgfx_binding.o vmwgfx_msg.o \
 	    vmwgfx_simple_resource.o vmwgfx_va.o vmwgfx_blit.o \
-	    vmwgfx_validation.o \
+	    vmwgfx_validation.o vmwgfx_page_dirty.o \
 	    ttm_object.o ttm_lock.o
 
 obj-$(CONFIG_DRM_VMWGFX) := vmwgfx.o
diff --git a/drivers/gpu/drm/vmwgfx/vmwgfx_bo.c b/drivers/gpu/drm/vmwgfx/vmwgfx_bo.c
index aad8d8140259..869aeaec2f86 100644
--- a/drivers/gpu/drm/vmwgfx/vmwgfx_bo.c
+++ b/drivers/gpu/drm/vmwgfx/vmwgfx_bo.c
@@ -462,6 +462,7 @@ void vmw_bo_bo_free(struct ttm_buffer_object *bo)
 {
 	struct vmw_buffer_object *vmw_bo = vmw_buffer_object(bo);
 
+	WARN_ON(vmw_bo->dirty);
 	vmw_bo_unmap(vmw_bo);
 	kfree(vmw_bo);
 }
@@ -475,8 +476,10 @@ void vmw_bo_bo_free(struct ttm_buffer_object *bo)
 static void vmw_user_bo_destroy(struct ttm_buffer_object *bo)
 {
 	struct vmw_user_buffer_object *vmw_user_bo = vmw_user_buffer_object(bo);
+	struct vmw_buffer_object *vbo = &vmw_user_bo->vbo;
 
-	vmw_bo_unmap(&vmw_user_bo->vbo);
+	WARN_ON(vbo->dirty);
+	vmw_bo_unmap(vbo);
 	ttm_prime_object_kfree(vmw_user_bo, prime);
 }
 
diff --git a/drivers/gpu/drm/vmwgfx/vmwgfx_drv.h b/drivers/gpu/drm/vmwgfx/vmwgfx_drv.h
index 5eb73ded8e07..7944dbbbdd72 100644
--- a/drivers/gpu/drm/vmwgfx/vmwgfx_drv.h
+++ b/drivers/gpu/drm/vmwgfx/vmwgfx_drv.h
@@ -105,6 +105,7 @@ struct vmw_fpriv {
  * @dx_query_ctx: DX context if this buffer object is used as a DX query MOB
  * @map: Kmap object for semi-persistent mappings
  * @res_prios: Eviction priority counts for attached resources
+ * @dirty: structure for user-space dirty-tracking
  */
 struct vmw_buffer_object {
 	struct ttm_buffer_object base;
@@ -115,6 +116,7 @@ struct vmw_buffer_object {
 	/* Protected by reservation */
 	struct ttm_bo_kmap_obj map;
 	u32 res_prios[TTM_MAX_BO_PRIORITY];
+	struct vmw_bo_dirty *dirty;
 };
 
 /**
@@ -145,7 +147,8 @@ struct vmw_res_func;
  * @res_dirty: Resource contains data not yet in the backup buffer. Protected
  * by resource reserved.
  * @backup_dirty: Backup buffer contains data not yet in the HW resource.
- * Protecte by resource reserved.
+ * Protected by resource reserved.
+ * @coherent: Emulate coherency by tracking vm accesses.
  * @backup: The backup buffer if any. Protected by resource reserved.
  * @backup_offset: Offset into the backup buffer if any. Protected by resource
  * reserved. Note that only a few resource types can have a @backup_offset
@@ -162,14 +165,16 @@ struct vmw_res_func;
  * @hw_destroy: Callback to destroy the resource on the device, as part of
  * resource destruction.
  */
+struct vmw_resource_dirty;
 struct vmw_resource {
 	struct kref kref;
 	struct vmw_private *dev_priv;
 	int id;
 	u32 used_prio;
 	unsigned long backup_size;
-	bool res_dirty;
-	bool backup_dirty;
+	u32 res_dirty : 1;
+	u32 backup_dirty : 1;
+	u32 coherent : 1;
 	struct vmw_buffer_object *backup;
 	unsigned long backup_offset;
 	unsigned long pin_count;
@@ -177,6 +182,7 @@ struct vmw_resource {
 	struct list_head lru_head;
 	struct list_head mob_head;
 	struct list_head binding_head;
+	struct vmw_resource_dirty *dirty;
 	void (*res_free) (struct vmw_resource *res);
 	void (*hw_destroy) (struct vmw_resource *res);
 };
@@ -716,6 +722,8 @@ extern void vmw_resource_evict_all(struct vmw_private *dev_priv);
 extern void vmw_resource_unbind_list(struct vmw_buffer_object *vbo);
 void vmw_resource_mob_attach(struct vmw_resource *res);
 void vmw_resource_mob_detach(struct vmw_resource *res);
+void vmw_resource_dirty_update(struct vmw_resource *res, pgoff_t start,
+			       pgoff_t end);
 
 /**
  * vmw_resource_mob_attached - Whether a resource currently has a mob attached
@@ -1403,6 +1411,15 @@ int vmw_host_log(const char *log);
 #define VMW_DEBUG_USER(fmt, ...)                                              \
 	DRM_DEBUG_DRIVER(fmt, ##__VA_ARGS__)
 
+/* Resource dirtying - vmwgfx_page_dirty.c */
+void vmw_bo_dirty_scan(struct vmw_buffer_object *vbo);
+int vmw_bo_dirty_add(struct vmw_buffer_object *vbo);
+void vmw_bo_dirty_transfer_to_res(struct vmw_resource *res);
+void vmw_bo_dirty_clear_res(struct vmw_resource *res);
+void vmw_bo_dirty_release(struct vmw_buffer_object *vbo);
+vm_fault_t vmw_bo_vm_fault(struct vm_fault *vmf);
+vm_fault_t vmw_bo_vm_mkwrite(struct vm_fault *vmf);
+
 /**
  * VMW_DEBUG_KMS - Debug output for kernel mode-setting
  *
diff --git a/drivers/gpu/drm/vmwgfx/vmwgfx_execbuf.c b/drivers/gpu/drm/vmwgfx/vmwgfx_execbuf.c
index ff86d49dc5e8..934ad7c0c342 100644
--- a/drivers/gpu/drm/vmwgfx/vmwgfx_execbuf.c
+++ b/drivers/gpu/drm/vmwgfx/vmwgfx_execbuf.c
@@ -2560,7 +2560,6 @@ static int vmw_cmd_dx_check_subresource(struct vmw_private *dev_priv,
 		     offsetof(typeof(*cmd), sid));
 
 	cmd = container_of(header, typeof(*cmd), header);
-
 	return vmw_cmd_res_check(dev_priv, sw_context, vmw_res_surface,
 				 VMW_RES_DIRTY_NONE, user_surface_converter,
 				 &cmd->sid, NULL);
diff --git a/drivers/gpu/drm/vmwgfx/vmwgfx_page_dirty.c b/drivers/gpu/drm/vmwgfx/vmwgfx_page_dirty.c
new file mode 100644
index 000000000000..060c1e492f25
--- /dev/null
+++ b/drivers/gpu/drm/vmwgfx/vmwgfx_page_dirty.c
@@ -0,0 +1,421 @@
+// SPDX-License-Identifier: GPL-2.0 OR MIT
+/**************************************************************************
+ *
+ * Copyright 2019 VMware, Inc., Palo Alto, CA., USA
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a
+ * copy of this software and associated documentation files (the
+ * "Software"), to deal in the Software without restriction, including
+ * without limitation the rights to use, copy, modify, merge, publish,
+ * distribute, sub license, and/or sell copies of the Software, and to
+ * permit persons to whom the Software is furnished to do so, subject to
+ * the following conditions:
+ *
+ * The above copyright notice and this permission notice (including the
+ * next paragraph) shall be included in all copies or substantial portions
+ * of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NON-INFRINGEMENT. IN NO EVENT SHALL
+ * THE COPYRIGHT HOLDERS, AUTHORS AND/OR ITS SUPPLIERS BE LIABLE FOR ANY CLAIM,
+ * DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR
+ * OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE
+ * USE OR OTHER DEALINGS IN THE SOFTWARE.
+ *
+ **************************************************************************/
+#include "vmwgfx_drv.h"
+
+/*
+ * Different methods for tracking dirty:
+ * VMW_BO_DIRTY_PAGETABLE - Scan the pagetable for hardware dirty bits
+ * VMW_BO_DIRTY_MKWRITE - Write-protect page table entries and record write-
+ * accesses in the VM mkwrite() callback
+ */
+enum vmw_bo_dirty_method {
+	VMW_BO_DIRTY_PAGETABLE,
+	VMW_BO_DIRTY_MKWRITE,
+};
+
+/*
+ * No dirtied pages at scan trigger a transition to the _MKWRITE method,
+ * similarly a certain percentage of dirty pages trigger a transition to
+ * the _PAGETABLE method. How many triggers should we wait for before
+ * changing method?
+ */
+#define VMW_DIRTY_NUM_CHANGE_TRIGGERS 2
+
+/* Percentage to trigger a transition to the _PAGETABLE method */
+#define VMW_DIRTY_PERCENTAGE 10
+
+/**
+ * struct vmw_bo_dirty - Dirty information for buffer objects
+ * @start: First currently dirty bit
+ * @end: Last currently dirty bit + 1
+ * @method: The currently used dirty method
+ * @change_count: Number of consecutive method change triggers
+ * @ref_count: Reference count for this structure
+ * @bitmap_size: The size of the bitmap in bits. Typically equal to the
+ * nuber of pages in the bo.
+ * @size: The accounting size for this struct.
+ * @bitmap: A bitmap where each bit represents a page. A set bit means a
+ * dirty page.
+ */
+struct vmw_bo_dirty {
+	unsigned long start;
+	unsigned long end;
+	enum vmw_bo_dirty_method method;
+	unsigned int change_count;
+	unsigned int ref_count;
+	unsigned long bitmap_size;
+	size_t size;
+	unsigned long bitmap[0];
+};
+
+/**
+ * vmw_bo_dirty_scan_pagetable - Perform a pagetable scan for dirty bits
+ * @vbo: The buffer object to scan
+ *
+ * Scans the pagetable for dirty bits. Clear those bits and modify the
+ * dirty structure with the results. This function may change the
+ * dirty-tracking method.
+ */
+static void vmw_bo_dirty_scan_pagetable(struct vmw_buffer_object *vbo)
+{
+	struct vmw_bo_dirty *dirty = vbo->dirty;
+	pgoff_t offset = drm_vma_node_start(&vbo->base.base.vma_node);
+	struct address_space *mapping = vbo->base.bdev->dev_mapping;
+	pgoff_t num_marked;
+
+	num_marked = clean_record_shared_mapping_range
+		(mapping,
+		 offset, dirty->bitmap_size,
+		 offset, &dirty->bitmap[0],
+		 &dirty->start, &dirty->end);
+	if (num_marked == 0)
+		dirty->change_count++;
+	else
+		dirty->change_count = 0;
+
+	if (dirty->change_count > VMW_DIRTY_NUM_CHANGE_TRIGGERS) {
+		dirty->change_count = 0;
+		dirty->method = VMW_BO_DIRTY_MKWRITE;
+		wp_shared_mapping_range(mapping,
+					offset, dirty->bitmap_size);
+		clean_record_shared_mapping_range(mapping,
+						  offset, dirty->bitmap_size,
+						  offset, &dirty->bitmap[0],
+						  &dirty->start, &dirty->end);
+	}
+}
+
+/**
+ * vmw_bo_dirty_scan_mkwrite - Reset the mkwrite dirty-tracking method
+ * @vbo: The buffer object to scan
+ *
+ * Write-protect pages written to so that consecutive write accesses will
+ * trigger a call to mkwrite.
+ *
+ * This function may change the dirty-tracking method.
+ */
+static void vmw_bo_dirty_scan_mkwrite(struct vmw_buffer_object *vbo)
+{
+	struct vmw_bo_dirty *dirty = vbo->dirty;
+	unsigned long offset = drm_vma_node_start(&vbo->base.base.vma_node);
+	struct address_space *mapping = vbo->base.bdev->dev_mapping;
+	pgoff_t num_marked;
+
+	if (dirty->end <= dirty->start)
+		return;
+
+	num_marked = wp_shared_mapping_range(vbo->base.bdev->dev_mapping,
+					dirty->start + offset,
+					dirty->end - dirty->start);
+
+	if (100UL * num_marked / dirty->bitmap_size >
+	    VMW_DIRTY_PERCENTAGE) {
+		dirty->change_count++;
+	} else {
+		dirty->change_count = 0;
+	}
+
+	if (dirty->change_count > VMW_DIRTY_NUM_CHANGE_TRIGGERS) {
+		pgoff_t start = 0;
+		pgoff_t end = dirty->bitmap_size;
+
+		dirty->method = VMW_BO_DIRTY_PAGETABLE;
+		clean_record_shared_mapping_range(mapping, offset, end, offset,
+						  &dirty->bitmap[0],
+						  &start, &end);
+		bitmap_clear(&dirty->bitmap[0], 0, dirty->bitmap_size);
+		if (dirty->start < dirty->end)
+			bitmap_set(&dirty->bitmap[0], dirty->start,
+				   dirty->end - dirty->start);
+		dirty->change_count = 0;
+	}
+}
+
+
+/**
+ * vmw_bo_dirty_scan - Scan for dirty pages and add them to the dirty
+ * tracking structure
+ * @vbo: The buffer object to scan
+ *
+ * This function may change the dirty tracking method.
+ */
+void vmw_bo_dirty_scan(struct vmw_buffer_object *vbo)
+{
+	struct vmw_bo_dirty *dirty = vbo->dirty;
+
+	if (dirty->method == VMW_BO_DIRTY_PAGETABLE)
+		vmw_bo_dirty_scan_pagetable(vbo);
+	else
+		vmw_bo_dirty_scan_mkwrite(vbo);
+}
+
+/**
+ * vmw_bo_dirty_add - Add a dirty-tracking user to a buffer object
+ * @vbo: The buffer object
+ *
+ * This function registers a dirty-tracking user to a buffer object.
+ * A user can be for example a resource or a vma in a special user-space
+ * mapping.
+ *
+ * Return: Zero on success, -ENOMEM on memory allocation failure.
+ */
+int vmw_bo_dirty_add(struct vmw_buffer_object *vbo)
+{
+	struct vmw_bo_dirty *dirty = vbo->dirty;
+	pgoff_t num_pages = vbo->base.num_pages;
+	size_t size, acc_size;
+	int ret;
+	static struct ttm_operation_ctx ctx = {
+		.interruptible = false,
+		.no_wait_gpu = false
+	};
+
+	if (dirty) {
+		dirty->ref_count++;
+		return 0;
+	}
+
+	size = sizeof(*dirty) + BITS_TO_LONGS(num_pages) * sizeof(long);
+	acc_size = ttm_round_pot(size);
+	ret = ttm_mem_global_alloc(&ttm_mem_glob, acc_size, &ctx);
+	if (ret) {
+		VMW_DEBUG_USER("Out of graphics memory for buffer object "
+			       "dirty tracker.\n");
+		return ret;
+	}
+	dirty = kvzalloc(size, GFP_KERNEL);
+	if (!dirty) {
+		ret = -ENOMEM;
+		goto out_no_dirty;
+	}
+
+	dirty->size = acc_size;
+	dirty->bitmap_size = num_pages;
+	dirty->start = dirty->bitmap_size;
+	dirty->end = 0;
+	dirty->ref_count = 1;
+	if (num_pages < PAGE_SIZE / sizeof(pte_t)) {
+		dirty->method = VMW_BO_DIRTY_PAGETABLE;
+	} else {
+		struct address_space *mapping = vbo->base.bdev->dev_mapping;
+		pgoff_t offset = drm_vma_node_start(&vbo->base.base.vma_node);
+
+		dirty->method = VMW_BO_DIRTY_MKWRITE;
+
+		/* Write-protect and then pick up already dirty bits */
+		wp_shared_mapping_range(mapping, offset, num_pages);
+		clean_record_shared_mapping_range(mapping, offset, num_pages,
+						  offset,
+						  &dirty->bitmap[0],
+						  &dirty->start, &dirty->end);
+	}
+
+	vbo->dirty = dirty;
+
+	return 0;
+
+out_no_dirty:
+	ttm_mem_global_free(&ttm_mem_glob, acc_size);
+	return ret;
+}
+
+/**
+ * vmw_bo_dirty_release - Release a dirty-tracking user from a buffer object
+ * @vbo: The buffer object
+ *
+ * This function releases a dirty-tracking user from a buffer object.
+ * If the reference count reaches zero, then the dirty-tracking object is
+ * freed and the pointer to it cleared.
+ *
+ * Return: Zero on success, -ENOMEM on memory allocation failure.
+ */
+void vmw_bo_dirty_release(struct vmw_buffer_object *vbo)
+{
+	struct vmw_bo_dirty *dirty = vbo->dirty;
+
+	if (dirty && --dirty->ref_count == 0) {
+		size_t acc_size = dirty->size;
+
+		kvfree(dirty);
+		ttm_mem_global_free(&ttm_mem_glob, acc_size);
+		vbo->dirty = NULL;
+	}
+}
+
+/**
+ * vmw_bo_dirty_transfer_to_res - Pick up a resource's dirty region from
+ * its backing mob.
+ * @res: The resource
+ *
+ * This function will pick up all dirty ranges affecting the resource from
+ * it's backup mob, and call vmw_resource_dirty_update() once for each
+ * range. The transferred ranges will be cleared from the backing mob's
+ * dirty tracking.
+ */
+void vmw_bo_dirty_transfer_to_res(struct vmw_resource *res)
+{
+	struct vmw_buffer_object *vbo = res->backup;
+	struct vmw_bo_dirty *dirty = vbo->dirty;
+	pgoff_t start, cur, end;
+	unsigned long res_start = res->backup_offset;
+	unsigned long res_end = res->backup_offset + res->backup_size;
+
+	WARN_ON_ONCE(res_start & ~PAGE_MASK);
+	res_start >>= PAGE_SHIFT;
+	res_end = DIV_ROUND_UP(res_end, PAGE_SIZE);
+
+	if (res_start >= dirty->end || res_end <= dirty->start)
+		return;
+
+	cur = max(res_start, dirty->start);
+	res_end = max(res_end, dirty->end);
+	while (cur < res_end) {
+		unsigned long num;
+
+		start = find_next_bit(&dirty->bitmap[0], res_end, cur);
+		if (start >= res_end)
+			break;
+
+		end = find_next_zero_bit(&dirty->bitmap[0], res_end, start + 1);
+		cur = end + 1;
+		num = end - start;
+		bitmap_clear(&dirty->bitmap[0], start, num);
+		vmw_resource_dirty_update(res, start, end);
+	}
+
+	if (res_start <= dirty->start && res_end > dirty->start)
+		dirty->start = res_end;
+	if (res_start < dirty->end && res_end >= dirty->end)
+		dirty->end = res_start;
+}
+
+/**
+ * vmw_bo_dirty_clear_res - Clear a resource's dirty region from
+ * its backing mob.
+ * @res: The resource
+ *
+ * This function will clear all dirty ranges affecting the resource from
+ * it's backup mob's dirty tracking.
+ */
+void vmw_bo_dirty_clear_res(struct vmw_resource *res)
+{
+	unsigned long res_start = res->backup_offset;
+	unsigned long res_end = res->backup_offset + res->backup_size;
+	struct vmw_buffer_object *vbo = res->backup;
+	struct vmw_bo_dirty *dirty = vbo->dirty;
+
+	res_start >>= PAGE_SHIFT;
+	res_end = DIV_ROUND_UP(res_end, PAGE_SIZE);
+
+	if (res_start >= dirty->end || res_end <= dirty->start)
+		return;
+
+	res_start = max(res_start, dirty->start);
+	res_end = min(res_end, dirty->end);
+	bitmap_clear(&dirty->bitmap[0], res_start, res_end - res_start);
+
+	if (res_start <= dirty->start && res_end > dirty->start)
+		dirty->start = res_end;
+	if (res_start < dirty->end && res_end >= dirty->end)
+		dirty->end = res_start;
+}
+
+vm_fault_t vmw_bo_vm_mkwrite(struct vm_fault *vmf)
+{
+	struct vm_area_struct *vma = vmf->vma;
+	struct ttm_buffer_object *bo = (struct ttm_buffer_object *)
+	    vma->vm_private_data;
+	vm_fault_t ret;
+	unsigned long page_offset;
+	unsigned int save_flags;
+	struct vmw_buffer_object *vbo =
+		container_of(bo, typeof(*vbo), base);
+
+	/*
+	 * mkwrite() doesn't handle the VM_FAULT_RETRY return value correctly.
+	 * So make sure the TTM helpers are aware.
+	 */
+	save_flags = vmf->flags;
+	vmf->flags &= ~FAULT_FLAG_ALLOW_RETRY;
+	ret = ttm_bo_vm_reserve(bo, vmf);
+	vmf->flags = save_flags;
+	if (ret)
+		return ret;
+
+	page_offset = vmf->pgoff - drm_vma_node_start(&bo->base.vma_node);
+	if (unlikely(page_offset >= bo->num_pages)) {
+		ret = VM_FAULT_SIGBUS;
+		goto out_unlock;
+	}
+
+	if (vbo->dirty && vbo->dirty->method == VMW_BO_DIRTY_MKWRITE &&
+	    !test_bit(page_offset, &vbo->dirty->bitmap[0])) {
+		struct vmw_bo_dirty *dirty = vbo->dirty;
+
+		__set_bit(page_offset, &dirty->bitmap[0]);
+		dirty->start = min(dirty->start, page_offset);
+		dirty->end = max(dirty->end, page_offset + 1);
+	}
+
+out_unlock:
+	dma_resv_unlock(bo->base.resv);
+	return ret;
+}
+
+vm_fault_t vmw_bo_vm_fault(struct vm_fault *vmf)
+{
+	struct vm_area_struct *vma = vmf->vma;
+	struct ttm_buffer_object *bo = (struct ttm_buffer_object *)
+	    vma->vm_private_data;
+	struct vmw_buffer_object *vbo =
+		container_of(bo, struct vmw_buffer_object, base);
+	pgoff_t num_prefault;
+	pgprot_t prot;
+	vm_fault_t ret;
+
+	ret = ttm_bo_vm_reserve(bo, vmf);
+	if (ret)
+		return ret;
+
+	/*
+	 * This will cause mkwrite() to be called for each pte on
+	 * write-enable vmas.
+	 */
+	if (vbo->dirty && vbo->dirty->method == VMW_BO_DIRTY_MKWRITE)
+		prot = vma->vm_page_prot;
+	else
+		prot = vm_get_page_prot(vma->vm_flags);
+
+	num_prefault = (vma->vm_flags & VM_RAND_READ) ? 0 :
+		TTM_BO_VM_NUM_PREFAULT;
+	ret = ttm_bo_vm_fault_reserved(vmf, prot, num_prefault);
+	if (ret == VM_FAULT_RETRY && !(vmf->flags & FAULT_FLAG_RETRY_NOWAIT))
+		return ret;
+
+	dma_resv_unlock(bo->base.resv);
+	return ret;
+}
diff --git a/drivers/gpu/drm/vmwgfx/vmwgfx_resource.c b/drivers/gpu/drm/vmwgfx/vmwgfx_resource.c
index 5581a7826b4c..e4c97a4cf2ff 100644
--- a/drivers/gpu/drm/vmwgfx/vmwgfx_resource.c
+++ b/drivers/gpu/drm/vmwgfx/vmwgfx_resource.c
@@ -119,6 +119,10 @@ static void vmw_resource_release(struct kref *kref)
 		}
 		res->backup_dirty = false;
 		vmw_resource_mob_detach(res);
+		if (res->dirty)
+			res->func->dirty_free(res);
+		if (res->coherent)
+			vmw_bo_dirty_release(res->backup);
 		ttm_bo_unreserve(bo);
 		vmw_bo_unreference(&res->backup);
 	}
@@ -208,7 +212,9 @@ int vmw_resource_init(struct vmw_private *dev_priv, struct vmw_resource *res,
 	res->backup_offset = 0;
 	res->backup_dirty = false;
 	res->res_dirty = false;
+	res->coherent = false;
 	res->used_prio = 3;
+	res->dirty = NULL;
 	if (delay_id)
 		return 0;
 	else
@@ -395,6 +401,30 @@ static int vmw_resource_do_validate(struct vmw_resource *res,
 			vmw_resource_mob_attach(res);
 	}
 
+	/*
+	 * Handle the case where the backup mob is marked coherent but
+	 * the resource isn't.
+	 */
+	if (func->dirty_alloc && vmw_resource_mob_attached(res) &&
+	    !res->coherent) {
+		if (res->backup->dirty && !res->dirty) {
+			ret = func->dirty_alloc(res);
+			if (ret)
+				return ret;
+		} else if (!res->backup->dirty && res->dirty) {
+			func->dirty_free(res);
+		}
+	}
+
+	/*
+	 * Transfer the dirty regions to the resource and update
+	 * the resource.
+	 */
+	if (res->dirty) {
+		vmw_bo_dirty_transfer_to_res(res);
+		return func->dirty_sync(res);
+	}
+
 	return 0;
 
 out_bind_failed:
@@ -433,16 +463,28 @@ void vmw_resource_unreserve(struct vmw_resource *res,
 	if (switch_backup && new_backup != res->backup) {
 		if (res->backup) {
 			vmw_resource_mob_detach(res);
+			if (res->coherent)
+				vmw_bo_dirty_release(res->backup);
 			vmw_bo_unreference(&res->backup);
 		}
 
 		if (new_backup) {
 			res->backup = vmw_bo_reference(new_backup);
+
+			/*
+			 * The validation code should already have added a
+			 * dirty tracker here.
+			 */
+			WARN_ON(res->coherent && !new_backup->dirty);
+
 			vmw_resource_mob_attach(res);
 		} else {
 			res->backup = NULL;
 		}
+	} else if (switch_backup && res->coherent) {
+		vmw_bo_dirty_release(res->backup);
 	}
+
 	if (switch_backup)
 		res->backup_offset = new_backup_offset;
 
@@ -1008,3 +1050,18 @@ enum vmw_res_type vmw_res_type(const struct vmw_resource *res)
 {
 	return res->func->res_type;
 }
+
+/**
+ * vmw_resource_update_dirty - Update a resource's dirty tracker with a
+ * sequential range of touched backing store memory.
+ * @res: The resource.
+ * @start: The first page touched.
+ * @end: The last page touched + 1.
+ */
+void vmw_resource_dirty_update(struct vmw_resource *res, pgoff_t start,
+			       pgoff_t end)
+{
+	if (res->dirty)
+		res->func->dirty_range_add(res, start << PAGE_SHIFT,
+					   end << PAGE_SHIFT);
+}
diff --git a/drivers/gpu/drm/vmwgfx/vmwgfx_resource_priv.h b/drivers/gpu/drm/vmwgfx/vmwgfx_resource_priv.h
index 984e588c62ca..c85144286cfe 100644
--- a/drivers/gpu/drm/vmwgfx/vmwgfx_resource_priv.h
+++ b/drivers/gpu/drm/vmwgfx/vmwgfx_resource_priv.h
@@ -71,6 +71,12 @@ struct vmw_user_resource_conv {
  * @commit_notify:     If the resource is a command buffer managed resource,
  *                     callback to notify that a define or remove command
  *                     has been committed to the device.
+ * @dirty_alloc:       Allocate a dirty tracker. NULL if dirty-tracking is not
+ *                     supported.
+ * @dirty_free:        Free the dirty tracker.
+ * @dirty_sync:        Upload the dirty mob contents to the resource.
+ * @dirty_add_range:   Add a sequential dirty range to the resource
+ *                     dirty tracker.
  */
 struct vmw_res_func {
 	enum vmw_res_type res_type;
@@ -90,6 +96,11 @@ struct vmw_res_func {
 		       struct ttm_validate_buffer *val_buf);
 	void (*commit_notify)(struct vmw_resource *res,
 			      enum vmw_cmdbuf_res_state state);
+	int (*dirty_alloc)(struct vmw_resource *res);
+	void (*dirty_free)(struct vmw_resource *res);
+	int (*dirty_sync)(struct vmw_resource *res);
+	void (*dirty_range_add)(struct vmw_resource *res, size_t start,
+				 size_t end);
 };
 
 /**
diff --git a/drivers/gpu/drm/vmwgfx/vmwgfx_ttm_glue.c b/drivers/gpu/drm/vmwgfx/vmwgfx_ttm_glue.c
index 5a7b8bb420de..ce288756531b 100644
--- a/drivers/gpu/drm/vmwgfx/vmwgfx_ttm_glue.c
+++ b/drivers/gpu/drm/vmwgfx/vmwgfx_ttm_glue.c
@@ -29,10 +29,23 @@
 
 int vmw_mmap(struct file *filp, struct vm_area_struct *vma)
 {
+	static const struct vm_operations_struct vmw_vm_ops = {
+		.pfn_mkwrite = vmw_bo_vm_mkwrite,
+		.page_mkwrite = vmw_bo_vm_mkwrite,
+		.fault = vmw_bo_vm_fault,
+		.open = ttm_bo_vm_open,
+		.close = ttm_bo_vm_close
+	};
 	struct drm_file *file_priv = filp->private_data;
 	struct vmw_private *dev_priv = vmw_priv(file_priv->minor->dev);
+	int ret = ttm_bo_mmap(filp, vma, &dev_priv->bdev);
 
-	return ttm_bo_mmap(filp, vma, &dev_priv->bdev);
+	if (ret)
+		return ret;
+
+	vma->vm_ops = &vmw_vm_ops;
+
+	return 0;
 }
 
 /* struct vmw_validation_mem callback */
diff --git a/drivers/gpu/drm/vmwgfx/vmwgfx_validation.c b/drivers/gpu/drm/vmwgfx/vmwgfx_validation.c
index f611b2290a1b..71349a7bae90 100644
--- a/drivers/gpu/drm/vmwgfx/vmwgfx_validation.c
+++ b/drivers/gpu/drm/vmwgfx/vmwgfx_validation.c
@@ -33,6 +33,8 @@
  * struct vmw_validation_bo_node - Buffer object validation metadata.
  * @base: Metadata used for TTM reservation- and validation.
  * @hash: A hash entry used for the duplicate detection hash table.
+ * @coherent_count: If switching backup buffers, number of new coherent
+ * resources that will have this buffer as a backup buffer.
  * @as_mob: Validate as mob.
  * @cpu_blit: Validate for cpu blit access.
  *
@@ -42,6 +44,7 @@
 struct vmw_validation_bo_node {
 	struct ttm_validate_buffer base;
 	struct drm_hash_item hash;
+	unsigned int coherent_count;
 	u32 as_mob : 1;
 	u32 cpu_blit : 1;
 };
@@ -459,6 +462,19 @@ int vmw_validation_res_reserve(struct vmw_validation_context *ctx,
 			if (ret)
 				goto out_unreserve;
 		}
+
+		if (val->switching_backup && val->new_backup &&
+		    res->coherent) {
+			struct vmw_validation_bo_node *bo_node =
+				vmw_validation_find_bo_dup(ctx,
+							   val->new_backup);
+
+			if (WARN_ON(!bo_node)) {
+				ret = -EINVAL;
+				goto out_unreserve;
+			}
+			bo_node->coherent_count++;
+		}
 	}
 
 	return 0;
@@ -562,6 +578,9 @@ int vmw_validation_bo_validate(struct vmw_validation_context *ctx, bool intr)
 	int ret;
 
 	list_for_each_entry(entry, &ctx->bo_list, base.head) {
+		struct vmw_buffer_object *vbo =
+			container_of(entry->base.bo, typeof(*vbo), base);
+
 		if (entry->cpu_blit) {
 			struct ttm_operation_ctx ctx = {
 				.interruptible = intr,
@@ -576,6 +595,27 @@ int vmw_validation_bo_validate(struct vmw_validation_context *ctx, bool intr)
 		}
 		if (ret)
 			return ret;
+
+		/*
+		 * Rather than having the resource code allocating the bo
+		 * dirty tracker in resource_unreserve() where we can't fail,
+		 * Do it here when validating the buffer object.
+		 */
+		if (entry->coherent_count) {
+			unsigned int coherent_count = entry->coherent_count;
+
+			while (coherent_count) {
+				ret = vmw_bo_dirty_add(vbo);
+				if (ret)
+					return ret;
+
+				coherent_count--;
+			}
+			entry->coherent_count -= coherent_count;
+		}
+
+		if (vbo->dirty)
+			vmw_bo_dirty_scan(vbo);
 	}
 	return 0;
 }
@@ -828,3 +868,34 @@ int vmw_validation_preload_res(struct vmw_validation_context *ctx,
 	ctx->mem_size_left += size;
 	return 0;
 }
+
+/**
+ * vmw_validation_bo_backoff - Unreserve buffer objects registered with a
+ * validation context
+ * @ctx: The validation context
+ *
+ * This function unreserves the buffer objects previously reserved using
+ * vmw_validation_bo_reserve. It's typically used as part of an error path
+ */
+void vmw_validation_bo_backoff(struct vmw_validation_context *ctx)
+{
+	struct vmw_validation_bo_node *entry;
+
+	/*
+	 * Switching coherent resource backup buffers failed.
+	 * Release corresponding buffer object dirty trackers.
+	 */
+	list_for_each_entry(entry, &ctx->bo_list, base.head) {
+		if (entry->coherent_count) {
+			unsigned int coherent_count = entry->coherent_count;
+			struct vmw_buffer_object *vbo =
+				container_of(entry->base.bo, typeof(*vbo),
+					     base);
+
+			while (coherent_count--)
+				vmw_bo_dirty_release(vbo);
+		}
+	}
+
+	ttm_eu_backoff_reservation(&ctx->ticket, &ctx->bo_list);
+}
diff --git a/drivers/gpu/drm/vmwgfx/vmwgfx_validation.h b/drivers/gpu/drm/vmwgfx/vmwgfx_validation.h
index 0e063743dd86..4cee3f732588 100644
--- a/drivers/gpu/drm/vmwgfx/vmwgfx_validation.h
+++ b/drivers/gpu/drm/vmwgfx/vmwgfx_validation.h
@@ -173,20 +173,6 @@ vmw_validation_bo_reserve(struct vmw_validation_context *ctx,
 				      NULL, true);
 }
 
-/**
- * vmw_validation_bo_backoff - Unreserve buffer objects registered with a
- * validation context
- * @ctx: The validation context
- *
- * This function unreserves the buffer objects previously reserved using
- * vmw_validation_bo_reserve. It's typically used as part of an error path
- */
-static inline void
-vmw_validation_bo_backoff(struct vmw_validation_context *ctx)
-{
-	ttm_eu_backoff_reservation(&ctx->ticket, &ctx->bo_list);
-}
-
 /**
  * vmw_validation_bo_fence - Unreserve and fence buffer objects registered
  * with a validation context
@@ -269,4 +255,6 @@ int vmw_validation_preload_res(struct vmw_validation_context *ctx,
 			       unsigned int size);
 void vmw_validation_res_set_dirty(struct vmw_validation_context *ctx,
 				  void *val_private, u32 dirty);
+void vmw_validation_bo_backoff(struct vmw_validation_context *ctx);
+
 #endif
-- 
2.21.0



^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH v4 7/9] drm/vmwgfx: Use an RBtree instead of linked list for MOB resources
  2019-10-08  9:14 [PATCH v4 0/9] Emulated coherent graphics memory take 2 Thomas Hellström (VMware)
                   ` (5 preceding siblings ...)
  2019-10-08  9:15 ` [PATCH v4 6/9] drm/vmwgfx: Implement an infrastructure for write-coherent resources Thomas Hellström (VMware)
@ 2019-10-08  9:15 ` Thomas Hellström (VMware)
  2019-10-08  9:15 ` [PATCH v4 8/9] drm/vmwgfx: Implement an infrastructure for read-coherent resources Thomas Hellström (VMware)
  2019-10-08  9:15 ` [PATCH v4 9/9] drm/vmwgfx: Add surface dirty-tracking callbacks Thomas Hellström (VMware)
  8 siblings, 0 replies; 32+ messages in thread
From: Thomas Hellström (VMware) @ 2019-10-08  9:15 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: torvalds, Thomas Hellstrom, Andrew Morton, Matthew Wilcox,
	Will Deacon, Peter Zijlstra, Rik van Riel, Minchan Kim,
	Michal Hocko, Huang Ying, Jérôme Glisse,
	Kirill A . Shutemov, Deepak Rawat

From: Thomas Hellstrom <thellstrom@vmware.com>

With emulated coherent memory we need to be able to quickly look up
a resource from the MOB offset. Instead of traversing a linked list with
O(n) worst case, use an RBtree with O(log n) worst case complexity.

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Signed-off-by: Thomas Hellstrom <thellstrom@vmware.com>
Reviewed-by: Deepak Rawat <drawat@vmware.com>
---
 drivers/gpu/drm/vmwgfx/vmwgfx_bo.c       |  5 ++--
 drivers/gpu/drm/vmwgfx/vmwgfx_drv.h      | 10 +++----
 drivers/gpu/drm/vmwgfx/vmwgfx_resource.c | 33 +++++++++++++++++-------
 3 files changed, 32 insertions(+), 16 deletions(-)

diff --git a/drivers/gpu/drm/vmwgfx/vmwgfx_bo.c b/drivers/gpu/drm/vmwgfx/vmwgfx_bo.c
index 869aeaec2f86..18e4b329e563 100644
--- a/drivers/gpu/drm/vmwgfx/vmwgfx_bo.c
+++ b/drivers/gpu/drm/vmwgfx/vmwgfx_bo.c
@@ -463,6 +463,7 @@ void vmw_bo_bo_free(struct ttm_buffer_object *bo)
 	struct vmw_buffer_object *vmw_bo = vmw_buffer_object(bo);
 
 	WARN_ON(vmw_bo->dirty);
+	WARN_ON(!RB_EMPTY_ROOT(&vmw_bo->res_tree));
 	vmw_bo_unmap(vmw_bo);
 	kfree(vmw_bo);
 }
@@ -479,6 +480,7 @@ static void vmw_user_bo_destroy(struct ttm_buffer_object *bo)
 	struct vmw_buffer_object *vbo = &vmw_user_bo->vbo;
 
 	WARN_ON(vbo->dirty);
+	WARN_ON(!RB_EMPTY_ROOT(&vbo->res_tree));
 	vmw_bo_unmap(vbo);
 	ttm_prime_object_kfree(vmw_user_bo, prime);
 }
@@ -514,8 +516,7 @@ int vmw_bo_init(struct vmw_private *dev_priv,
 	memset(vmw_bo, 0, sizeof(*vmw_bo));
 	BUILD_BUG_ON(TTM_MAX_BO_PRIORITY <= 3);
 	vmw_bo->base.priority = 3;
-
-	INIT_LIST_HEAD(&vmw_bo->res_list);
+	vmw_bo->res_tree = RB_ROOT;
 
 	ret = ttm_bo_init(bdev, &vmw_bo->base, size,
 			  ttm_bo_type_device, placement,
diff --git a/drivers/gpu/drm/vmwgfx/vmwgfx_drv.h b/drivers/gpu/drm/vmwgfx/vmwgfx_drv.h
index 7944dbbbdd72..53f8522ae032 100644
--- a/drivers/gpu/drm/vmwgfx/vmwgfx_drv.h
+++ b/drivers/gpu/drm/vmwgfx/vmwgfx_drv.h
@@ -100,7 +100,7 @@ struct vmw_fpriv {
 /**
  * struct vmw_buffer_object - TTM buffer object with vmwgfx additions
  * @base: The TTM buffer object
- * @res_list: List of resources using this buffer object as a backing MOB
+ * @res_tree: RB tree of resources using this buffer object as a backing MOB
  * @pin_count: pin depth
  * @dx_query_ctx: DX context if this buffer object is used as a DX query MOB
  * @map: Kmap object for semi-persistent mappings
@@ -109,7 +109,7 @@ struct vmw_fpriv {
  */
 struct vmw_buffer_object {
 	struct ttm_buffer_object base;
-	struct list_head res_list;
+	struct rb_root res_tree;
 	s32 pin_count;
 	/* Not ref-counted.  Protected by binding_mutex */
 	struct vmw_resource *dx_query_ctx;
@@ -157,8 +157,8 @@ struct vmw_res_func;
  * pin-count greater than zero. It is not on the resource LRU lists and its
  * backup buffer is pinned. Hence it can't be evicted.
  * @func: Method vtable for this resource. Immutable.
+ * @mob_node; Node for the MOB backup rbtree. Protected by @backup reserved.
  * @lru_head: List head for the LRU list. Protected by @dev_priv::resource_lock.
- * @mob_head: List head for the MOB backup list. Protected by @backup reserved.
  * @binding_head: List head for the context binding list. Protected by
  * the @dev_priv::binding_mutex
  * @res_free: The resource destructor.
@@ -179,8 +179,8 @@ struct vmw_resource {
 	unsigned long backup_offset;
 	unsigned long pin_count;
 	const struct vmw_res_func *func;
+	struct rb_node mob_node;
 	struct list_head lru_head;
-	struct list_head mob_head;
 	struct list_head binding_head;
 	struct vmw_resource_dirty *dirty;
 	void (*res_free) (struct vmw_resource *res);
@@ -733,7 +733,7 @@ void vmw_resource_dirty_update(struct vmw_resource *res, pgoff_t start,
  */
 static inline bool vmw_resource_mob_attached(const struct vmw_resource *res)
 {
-	return !list_empty(&res->mob_head);
+	return !RB_EMPTY_NODE(&res->mob_node);
 }
 
 /**
diff --git a/drivers/gpu/drm/vmwgfx/vmwgfx_resource.c b/drivers/gpu/drm/vmwgfx/vmwgfx_resource.c
index e4c97a4cf2ff..328ad46076ff 100644
--- a/drivers/gpu/drm/vmwgfx/vmwgfx_resource.c
+++ b/drivers/gpu/drm/vmwgfx/vmwgfx_resource.c
@@ -40,11 +40,24 @@
 void vmw_resource_mob_attach(struct vmw_resource *res)
 {
 	struct vmw_buffer_object *backup = res->backup;
+	struct rb_node **new = &backup->res_tree.rb_node, *parent = NULL;
 
 	dma_resv_assert_held(res->backup->base.base.resv);
 	res->used_prio = (res->res_dirty) ? res->func->dirty_prio :
 		res->func->prio;
-	list_add_tail(&res->mob_head, &backup->res_list);
+
+	while (*new) {
+		struct vmw_resource *this =
+			container_of(*new, struct vmw_resource, mob_node);
+
+		parent = *new;
+		new = (res->backup_offset < this->backup_offset) ?
+			&((*new)->rb_left) : &((*new)->rb_right);
+	}
+
+	rb_link_node(&res->mob_node, parent, new);
+	rb_insert_color(&res->mob_node, &backup->res_tree);
+
 	vmw_bo_prio_add(backup, res->used_prio);
 }
 
@@ -58,7 +71,8 @@ void vmw_resource_mob_detach(struct vmw_resource *res)
 
 	dma_resv_assert_held(backup->base.base.resv);
 	if (vmw_resource_mob_attached(res)) {
-		list_del_init(&res->mob_head);
+		rb_erase(&res->mob_node, &backup->res_tree);
+		RB_CLEAR_NODE(&res->mob_node);
 		vmw_bo_prio_del(backup, res->used_prio);
 	}
 }
@@ -204,8 +218,8 @@ int vmw_resource_init(struct vmw_private *dev_priv, struct vmw_resource *res,
 	res->res_free = res_free;
 	res->dev_priv = dev_priv;
 	res->func = func;
+	RB_CLEAR_NODE(&res->mob_node);
 	INIT_LIST_HEAD(&res->lru_head);
-	INIT_LIST_HEAD(&res->mob_head);
 	INIT_LIST_HEAD(&res->binding_head);
 	res->id = -1;
 	res->backup = NULL;
@@ -754,19 +768,20 @@ int vmw_resource_validate(struct vmw_resource *res, bool intr)
  */
 void vmw_resource_unbind_list(struct vmw_buffer_object *vbo)
 {
-
-	struct vmw_resource *res, *next;
 	struct ttm_validate_buffer val_buf = {
 		.bo = &vbo->base,
 		.num_shared = 0
 	};
 
 	dma_resv_assert_held(vbo->base.base.resv);
-	list_for_each_entry_safe(res, next, &vbo->res_list, mob_head) {
-		if (!res->func->unbind)
-			continue;
+	while (!RB_EMPTY_ROOT(&vbo->res_tree)) {
+		struct rb_node *node = vbo->res_tree.rb_node;
+		struct vmw_resource *res =
+			container_of(node, struct vmw_resource, mob_node);
+
+		if (!WARN_ON_ONCE(!res->func->unbind))
+			(void) res->func->unbind(res, res->res_dirty, &val_buf);
 
-		(void) res->func->unbind(res, res->res_dirty, &val_buf);
 		res->backup_dirty = true;
 		res->res_dirty = false;
 		vmw_resource_mob_detach(res);
-- 
2.21.0



^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH v4 8/9] drm/vmwgfx: Implement an infrastructure for read-coherent resources
  2019-10-08  9:14 [PATCH v4 0/9] Emulated coherent graphics memory take 2 Thomas Hellström (VMware)
                   ` (6 preceding siblings ...)
  2019-10-08  9:15 ` [PATCH v4 7/9] drm/vmwgfx: Use an RBtree instead of linked list for MOB resources Thomas Hellström (VMware)
@ 2019-10-08  9:15 ` Thomas Hellström (VMware)
  2019-10-08  9:15 ` [PATCH v4 9/9] drm/vmwgfx: Add surface dirty-tracking callbacks Thomas Hellström (VMware)
  8 siblings, 0 replies; 32+ messages in thread
From: Thomas Hellström (VMware) @ 2019-10-08  9:15 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: torvalds, Thomas Hellstrom, Andrew Morton, Matthew Wilcox,
	Will Deacon, Peter Zijlstra, Rik van Riel, Minchan Kim,
	Michal Hocko, Huang Ying, Jérôme Glisse,
	Kirill A . Shutemov, Deepak Rawat

From: Thomas Hellstrom <thellstrom@vmware.com>

Similar to write-coherent resources, make sure that from the user-space
point of view, GPU rendered contents is automatically available for
reading by the CPU.

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Signed-off-by: Thomas Hellstrom <thellstrom@vmware.com>
Reviewed-by: Deepak Rawat <drawat@vmware.com>
---
 drivers/gpu/drm/vmwgfx/vmwgfx_drv.h           |   7 +-
 drivers/gpu/drm/vmwgfx/vmwgfx_page_dirty.c    |  77 ++++++++++++-
 drivers/gpu/drm/vmwgfx/vmwgfx_resource.c      | 103 +++++++++++++++++-
 drivers/gpu/drm/vmwgfx/vmwgfx_resource_priv.h |   2 +
 drivers/gpu/drm/vmwgfx/vmwgfx_validation.c    |   3 +-
 5 files changed, 181 insertions(+), 11 deletions(-)

diff --git a/drivers/gpu/drm/vmwgfx/vmwgfx_drv.h b/drivers/gpu/drm/vmwgfx/vmwgfx_drv.h
index 53f8522ae032..729a2e93acf1 100644
--- a/drivers/gpu/drm/vmwgfx/vmwgfx_drv.h
+++ b/drivers/gpu/drm/vmwgfx/vmwgfx_drv.h
@@ -680,7 +680,8 @@ extern void vmw_resource_unreference(struct vmw_resource **p_res);
 extern struct vmw_resource *vmw_resource_reference(struct vmw_resource *res);
 extern struct vmw_resource *
 vmw_resource_reference_unless_doomed(struct vmw_resource *res);
-extern int vmw_resource_validate(struct vmw_resource *res, bool intr);
+extern int vmw_resource_validate(struct vmw_resource *res, bool intr,
+				 bool dirtying);
 extern int vmw_resource_reserve(struct vmw_resource *res, bool interruptible,
 				bool no_backup);
 extern bool vmw_resource_needs_backup(const struct vmw_resource *res);
@@ -724,6 +725,8 @@ void vmw_resource_mob_attach(struct vmw_resource *res);
 void vmw_resource_mob_detach(struct vmw_resource *res);
 void vmw_resource_dirty_update(struct vmw_resource *res, pgoff_t start,
 			       pgoff_t end);
+int vmw_resources_clean(struct vmw_buffer_object *vbo, pgoff_t start,
+			pgoff_t end, pgoff_t *num_prefault);
 
 /**
  * vmw_resource_mob_attached - Whether a resource currently has a mob attached
@@ -1417,6 +1420,8 @@ int vmw_bo_dirty_add(struct vmw_buffer_object *vbo);
 void vmw_bo_dirty_transfer_to_res(struct vmw_resource *res);
 void vmw_bo_dirty_clear_res(struct vmw_resource *res);
 void vmw_bo_dirty_release(struct vmw_buffer_object *vbo);
+void vmw_bo_dirty_unmap(struct vmw_buffer_object *vbo,
+			pgoff_t start, pgoff_t end);
 vm_fault_t vmw_bo_vm_fault(struct vm_fault *vmf);
 vm_fault_t vmw_bo_vm_mkwrite(struct vm_fault *vmf);
 
diff --git a/drivers/gpu/drm/vmwgfx/vmwgfx_page_dirty.c b/drivers/gpu/drm/vmwgfx/vmwgfx_page_dirty.c
index 060c1e492f25..f07aa857587c 100644
--- a/drivers/gpu/drm/vmwgfx/vmwgfx_page_dirty.c
+++ b/drivers/gpu/drm/vmwgfx/vmwgfx_page_dirty.c
@@ -155,7 +155,6 @@ static void vmw_bo_dirty_scan_mkwrite(struct vmw_buffer_object *vbo)
 	}
 }
 
-
 /**
  * vmw_bo_dirty_scan - Scan for dirty pages and add them to the dirty
  * tracking structure
@@ -173,6 +172,53 @@ void vmw_bo_dirty_scan(struct vmw_buffer_object *vbo)
 		vmw_bo_dirty_scan_mkwrite(vbo);
 }
 
+/**
+ * vmw_bo_dirty_pre_unmap - write-protect and pick up dirty pages before
+ * an unmap_mapping_range operation.
+ * @vbo: The buffer object,
+ * @start: First page of the range within the buffer object.
+ * @end: Last page of the range within the buffer object + 1.
+ *
+ * If we're using the _PAGETABLE scan method, we may leak dirty pages
+ * when calling unmap_mapping_range(). This function makes sure we pick
+ * up all dirty pages.
+ */
+static void vmw_bo_dirty_pre_unmap(struct vmw_buffer_object *vbo,
+				   pgoff_t start, pgoff_t end)
+{
+	struct vmw_bo_dirty *dirty = vbo->dirty;
+	unsigned long offset = drm_vma_node_start(&vbo->base.base.vma_node);
+	struct address_space *mapping = vbo->base.bdev->dev_mapping;
+
+	if (dirty->method != VMW_BO_DIRTY_PAGETABLE || start >= end)
+		return;
+
+	wp_shared_mapping_range(mapping, start + offset, end - start);
+	clean_record_shared_mapping_range(mapping, start + offset,
+					  end - start, offset,
+					  &dirty->bitmap[0], &dirty->start,
+					  &dirty->end);
+}
+
+/**
+ * vmw_bo_dirty_unmap - Clear all ptes pointing to a range within a bo
+ * @vbo: The buffer object,
+ * @start: First page of the range within the buffer object.
+ * @end: Last page of the range within the buffer object + 1.
+ *
+ * This is similar to ttm_bo_unmap_virtual_locked() except it takes a subrange.
+ */
+void vmw_bo_dirty_unmap(struct vmw_buffer_object *vbo,
+			pgoff_t start, pgoff_t end)
+{
+	unsigned long offset = drm_vma_node_start(&vbo->base.base.vma_node);
+	struct address_space *mapping = vbo->base.bdev->dev_mapping;
+
+	vmw_bo_dirty_pre_unmap(vbo, start, end);
+	unmap_shared_mapping_range(mapping, (offset + start) << PAGE_SHIFT,
+				   (loff_t) (end - start) << PAGE_SHIFT);
+}
+
 /**
  * vmw_bo_dirty_add - Add a dirty-tracking user to a buffer object
  * @vbo: The buffer object
@@ -401,21 +447,42 @@ vm_fault_t vmw_bo_vm_fault(struct vm_fault *vmf)
 	if (ret)
 		return ret;
 
+	num_prefault = (vma->vm_flags & VM_RAND_READ) ? 1 :
+		TTM_BO_VM_NUM_PREFAULT;
+
+	if (vbo->dirty) {
+		pgoff_t allowed_prefault;
+		unsigned long page_offset;
+
+		page_offset = vmf->pgoff -
+			drm_vma_node_start(&bo->base.vma_node);
+		if (page_offset >= bo->num_pages ||
+		    vmw_resources_clean(vbo, page_offset,
+					page_offset + PAGE_SIZE,
+					&allowed_prefault)) {
+			ret = VM_FAULT_SIGBUS;
+			goto out_unlock;
+		}
+
+		num_prefault = min(num_prefault, allowed_prefault);
+	}
+
 	/*
-	 * This will cause mkwrite() to be called for each pte on
-	 * write-enable vmas.
+	 * If we don't track dirty using the MKWRITE method, make sure
+	 * sure the page protection is write-enabled so we don't get
+	 * a lot of unnecessary write faults.
 	 */
 	if (vbo->dirty && vbo->dirty->method == VMW_BO_DIRTY_MKWRITE)
 		prot = vma->vm_page_prot;
 	else
 		prot = vm_get_page_prot(vma->vm_flags);
 
-	num_prefault = (vma->vm_flags & VM_RAND_READ) ? 0 :
-		TTM_BO_VM_NUM_PREFAULT;
 	ret = ttm_bo_vm_fault_reserved(vmf, prot, num_prefault);
 	if (ret == VM_FAULT_RETRY && !(vmf->flags & FAULT_FLAG_RETRY_NOWAIT))
 		return ret;
 
+out_unlock:
 	dma_resv_unlock(bo->base.resv);
+
 	return ret;
 }
diff --git a/drivers/gpu/drm/vmwgfx/vmwgfx_resource.c b/drivers/gpu/drm/vmwgfx/vmwgfx_resource.c
index 328ad46076ff..c76faf33972e 100644
--- a/drivers/gpu/drm/vmwgfx/vmwgfx_resource.c
+++ b/drivers/gpu/drm/vmwgfx/vmwgfx_resource.c
@@ -393,7 +393,8 @@ static int vmw_resource_buf_alloc(struct vmw_resource *res,
  * should be retried once resources have been freed up.
  */
 static int vmw_resource_do_validate(struct vmw_resource *res,
-				    struct ttm_validate_buffer *val_buf)
+				    struct ttm_validate_buffer *val_buf,
+				    bool dirtying)
 {
 	int ret = 0;
 	const struct vmw_res_func *func = res->func;
@@ -435,6 +436,15 @@ static int vmw_resource_do_validate(struct vmw_resource *res,
 	 * the resource.
 	 */
 	if (res->dirty) {
+		if (dirtying && !res->res_dirty) {
+			pgoff_t start = res->backup_offset >> PAGE_SHIFT;
+			pgoff_t end = __KERNEL_DIV_ROUND_UP
+				(res->backup_offset + res->backup_size,
+				 PAGE_SIZE);
+
+			vmw_bo_dirty_unmap(res->backup, start, end);
+		}
+
 		vmw_bo_dirty_transfer_to_res(res);
 		return func->dirty_sync(res);
 	}
@@ -679,6 +689,7 @@ static int vmw_resource_do_evict(struct ww_acquire_ctx *ticket,
  *                         to the device.
  * @res: The resource to make visible to the device.
  * @intr: Perform waits interruptible if possible.
+ * @dirtying: Pending GPU operation will dirty the resource
  *
  * On succesful return, any backup DMA buffer pointed to by @res->backup will
  * be reserved and validated.
@@ -688,7 +699,8 @@ static int vmw_resource_do_evict(struct ww_acquire_ctx *ticket,
  * Return: Zero on success, -ERESTARTSYS if interrupted, negative error code
  * on failure.
  */
-int vmw_resource_validate(struct vmw_resource *res, bool intr)
+int vmw_resource_validate(struct vmw_resource *res, bool intr,
+			  bool dirtying)
 {
 	int ret;
 	struct vmw_resource *evict_res;
@@ -705,7 +717,7 @@ int vmw_resource_validate(struct vmw_resource *res, bool intr)
 	if (res->backup)
 		val_buf.bo = &res->backup->base;
 	do {
-		ret = vmw_resource_do_validate(res, &val_buf);
+		ret = vmw_resource_do_validate(res, &val_buf, dirtying);
 		if (likely(ret != -EBUSY))
 			break;
 
@@ -1005,7 +1017,7 @@ int vmw_resource_pin(struct vmw_resource *res, bool interruptible)
 			/* Do we really need to pin the MOB as well? */
 			vmw_bo_pin_reserved(vbo, true);
 		}
-		ret = vmw_resource_validate(res, interruptible);
+		ret = vmw_resource_validate(res, interruptible, true);
 		if (vbo)
 			ttm_bo_unreserve(&vbo->base);
 		if (ret)
@@ -1080,3 +1092,86 @@ void vmw_resource_dirty_update(struct vmw_resource *res, pgoff_t start,
 		res->func->dirty_range_add(res, start << PAGE_SHIFT,
 					   end << PAGE_SHIFT);
 }
+
+/**
+ * vmw_resources_clean - Clean resources intersecting a mob range
+ * @vbo: The mob buffer object
+ * @start: The mob page offset starting the range
+ * @end: The mob page offset ending the range
+ * @num_prefault: Returns how many pages including the first have been
+ * cleaned and are ok to prefault
+ */
+int vmw_resources_clean(struct vmw_buffer_object *vbo, pgoff_t start,
+			pgoff_t end, pgoff_t *num_prefault)
+{
+	struct rb_node *cur = vbo->res_tree.rb_node;
+	struct vmw_resource *found = NULL;
+	unsigned long res_start = start << PAGE_SHIFT;
+	unsigned long res_end = end << PAGE_SHIFT;
+	unsigned long last_cleaned = 0;
+
+	/*
+	 * Find the resource with lowest backup_offset that intersects the
+	 * range.
+	 */
+	while (cur) {
+		struct vmw_resource *cur_res =
+			container_of(cur, struct vmw_resource, mob_node);
+
+		if (cur_res->backup_offset >= res_end) {
+			cur = cur->rb_left;
+		} else if (cur_res->backup_offset + cur_res->backup_size <=
+			   res_start) {
+			cur = cur->rb_right;
+		} else {
+			found = cur_res;
+			cur = cur->rb_left;
+			/* Continue to look for resources with lower offsets */
+		}
+	}
+
+	/*
+	 * In order of increasing backup_offset, clean dirty resorces
+	 * intersecting the range.
+	 */
+	while (found) {
+		if (found->res_dirty) {
+			int ret;
+
+			if (!found->func->clean)
+				return -EINVAL;
+
+			ret = found->func->clean(found);
+			if (ret)
+				return ret;
+
+			found->res_dirty = false;
+		}
+		last_cleaned = found->backup_offset + found->backup_size;
+		cur = rb_next(&found->mob_node);
+		if (!cur)
+			break;
+
+		found = container_of(cur, struct vmw_resource, mob_node);
+		if (found->backup_offset >= res_end)
+			break;
+	}
+
+	/*
+	 * Set number of pages allowed prefaulting and fence the buffer object
+	 */
+	*num_prefault = 1;
+	if (last_cleaned > res_start) {
+		struct ttm_buffer_object *bo = &vbo->base;
+
+		*num_prefault = __KERNEL_DIV_ROUND_UP(last_cleaned - res_start,
+						      PAGE_SIZE);
+		vmw_bo_fence_single(bo, NULL);
+		if (bo->moving)
+			dma_fence_put(bo->moving);
+		bo->moving = dma_fence_get
+			(dma_resv_get_excl(bo->base.resv));
+	}
+
+	return 0;
+}
diff --git a/drivers/gpu/drm/vmwgfx/vmwgfx_resource_priv.h b/drivers/gpu/drm/vmwgfx/vmwgfx_resource_priv.h
index c85144286cfe..3b7438b2d289 100644
--- a/drivers/gpu/drm/vmwgfx/vmwgfx_resource_priv.h
+++ b/drivers/gpu/drm/vmwgfx/vmwgfx_resource_priv.h
@@ -77,6 +77,7 @@ struct vmw_user_resource_conv {
  * @dirty_sync:        Upload the dirty mob contents to the resource.
  * @dirty_add_range:   Add a sequential dirty range to the resource
  *                     dirty tracker.
+ * @clean:             Clean the resource.
  */
 struct vmw_res_func {
 	enum vmw_res_type res_type;
@@ -101,6 +102,7 @@ struct vmw_res_func {
 	int (*dirty_sync)(struct vmw_resource *res);
 	void (*dirty_range_add)(struct vmw_resource *res, size_t start,
 				 size_t end);
+	int (*clean)(struct vmw_resource *res);
 };
 
 /**
diff --git a/drivers/gpu/drm/vmwgfx/vmwgfx_validation.c b/drivers/gpu/drm/vmwgfx/vmwgfx_validation.c
index 71349a7bae90..9aaf807ed73c 100644
--- a/drivers/gpu/drm/vmwgfx/vmwgfx_validation.c
+++ b/drivers/gpu/drm/vmwgfx/vmwgfx_validation.c
@@ -641,7 +641,8 @@ int vmw_validation_res_validate(struct vmw_validation_context *ctx, bool intr)
 		struct vmw_resource *res = val->res;
 		struct vmw_buffer_object *backup = res->backup;
 
-		ret = vmw_resource_validate(res, intr);
+		ret = vmw_resource_validate(res, intr, val->dirty_set &&
+					    val->dirty);
 		if (ret) {
 			if (ret != -ERESTARTSYS)
 				DRM_ERROR("Failed to validate resource.\n");
-- 
2.21.0



^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH v4 9/9] drm/vmwgfx: Add surface dirty-tracking callbacks
  2019-10-08  9:14 [PATCH v4 0/9] Emulated coherent graphics memory take 2 Thomas Hellström (VMware)
                   ` (7 preceding siblings ...)
  2019-10-08  9:15 ` [PATCH v4 8/9] drm/vmwgfx: Implement an infrastructure for read-coherent resources Thomas Hellström (VMware)
@ 2019-10-08  9:15 ` Thomas Hellström (VMware)
  8 siblings, 0 replies; 32+ messages in thread
From: Thomas Hellström (VMware) @ 2019-10-08  9:15 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: torvalds, Thomas Hellstrom, Andrew Morton, Matthew Wilcox,
	Will Deacon, Peter Zijlstra, Rik van Riel, Minchan Kim,
	Michal Hocko, Huang Ying, Jérôme Glisse,
	Kirill A . Shutemov, Deepak Rawat

From: Thomas Hellstrom <thellstrom@vmware.com>

Add the callbacks necessary to implement emulated coherent memory for
surfaces. Add a flag to the gb_surface_create ioctl to indicate that
surface memory should be coherent.
Also bump the drm minor version to signal the availability of coherent
surfaces.

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Signed-off-by: Thomas Hellstrom <thellstrom@vmware.com>
Reviewed-by: Deepak Rawat <drawat@vmware.com>
---
 .../device_include/svga3d_surfacedefs.h       | 233 ++++++++++-
 drivers/gpu/drm/vmwgfx/vmwgfx_drv.h           |   4 +-
 drivers/gpu/drm/vmwgfx/vmwgfx_surface.c       | 395 +++++++++++++++++-
 include/uapi/drm/vmwgfx_drm.h                 |   4 +-
 4 files changed, 629 insertions(+), 7 deletions(-)

diff --git a/drivers/gpu/drm/vmwgfx/device_include/svga3d_surfacedefs.h b/drivers/gpu/drm/vmwgfx/device_include/svga3d_surfacedefs.h
index f2bfd3d80598..61414f105c67 100644
--- a/drivers/gpu/drm/vmwgfx/device_include/svga3d_surfacedefs.h
+++ b/drivers/gpu/drm/vmwgfx/device_include/svga3d_surfacedefs.h
@@ -1280,7 +1280,6 @@ svga3dsurface_get_pixel_offset(SVGA3dSurfaceFormat format,
 	return offset;
 }
 
-
 static inline u32
 svga3dsurface_get_image_offset(SVGA3dSurfaceFormat format,
 			       surf_size_struct baseLevelSize,
@@ -1375,4 +1374,236 @@ svga3dsurface_is_screen_target_format(SVGA3dSurfaceFormat format)
 	return svga3dsurface_is_dx_screen_target_format(format);
 }
 
+/**
+ * struct svga3dsurface_mip - Mimpmap level information
+ * @bytes: Bytes required in the backing store of this mipmap level.
+ * @img_stride: Byte stride per image.
+ * @row_stride: Byte stride per block row.
+ * @size: The size of the mipmap.
+ */
+struct svga3dsurface_mip {
+	size_t bytes;
+	size_t img_stride;
+	size_t row_stride;
+	struct drm_vmw_size size;
+
+};
+
+/**
+ * struct svga3dsurface_cache - Cached surface information
+ * @desc: Pointer to the surface descriptor
+ * @mip: Array of mipmap level information. Valid size is @num_mip_levels.
+ * @mip_chain_bytes: Bytes required in the backing store for the whole chain
+ * of mip levels.
+ * @sheet_bytes: Bytes required in the backing store for a sheet
+ * representing a single sample.
+ * @num_mip_levels: Valid size of the @mip array. Number of mipmap levels in
+ * a chain.
+ * @num_layers: Number of slices in an array texture or number of faces in
+ * a cubemap texture.
+ */
+struct svga3dsurface_cache {
+	const struct svga3d_surface_desc *desc;
+	struct svga3dsurface_mip mip[DRM_VMW_MAX_MIP_LEVELS];
+	size_t mip_chain_bytes;
+	size_t sheet_bytes;
+	u32 num_mip_levels;
+	u32 num_layers;
+};
+
+/**
+ * struct svga3dsurface_loc - Surface location
+ * @sub_resource: Surface subresource. Defined as layer * num_mip_levels +
+ * mip_level.
+ * @x: X coordinate.
+ * @y: Y coordinate.
+ * @z: Z coordinate.
+ */
+struct svga3dsurface_loc {
+	u32 sub_resource;
+	u32 x, y, z;
+};
+
+/**
+ * svga3dsurface_subres - Compute the subresource from layer and mipmap.
+ * @cache: Surface layout data.
+ * @mip_level: The mipmap level.
+ * @layer: The surface layer (face or array slice).
+ *
+ * Return: The subresource.
+ */
+static inline u32 svga3dsurface_subres(const struct svga3dsurface_cache *cache,
+				       u32 mip_level, u32 layer)
+{
+	return cache->num_mip_levels * layer + mip_level;
+}
+
+/**
+ * svga3dsurface_setup_cache - Build a surface cache entry
+ * @size: The surface base level dimensions.
+ * @format: The surface format.
+ * @num_mip_levels: Number of mipmap levels.
+ * @num_layers: Number of layers.
+ * @cache: Pointer to a struct svga3dsurface_cach object to be filled in.
+ *
+ * Return: Zero on success, -EINVAL on invalid surface layout.
+ */
+static inline int svga3dsurface_setup_cache(const struct drm_vmw_size *size,
+					    SVGA3dSurfaceFormat format,
+					    u32 num_mip_levels,
+					    u32 num_layers,
+					    u32 num_samples,
+					    struct svga3dsurface_cache *cache)
+{
+	const struct svga3d_surface_desc *desc;
+	u32 i;
+
+	memset(cache, 0, sizeof(*cache));
+	cache->desc = desc = svga3dsurface_get_desc(format);
+	cache->num_mip_levels = num_mip_levels;
+	cache->num_layers = num_layers;
+	for (i = 0; i < cache->num_mip_levels; i++) {
+		struct svga3dsurface_mip *mip = &cache->mip[i];
+
+		mip->size = svga3dsurface_get_mip_size(*size, i);
+		mip->bytes = svga3dsurface_get_image_buffer_size
+			(desc, &mip->size, 0);
+		mip->row_stride =
+			__KERNEL_DIV_ROUND_UP(mip->size.width,
+					      desc->block_size.width) *
+			desc->bytes_per_block * num_samples;
+		if (!mip->row_stride)
+			goto invalid_dim;
+
+		mip->img_stride =
+			__KERNEL_DIV_ROUND_UP(mip->size.height,
+					      desc->block_size.height) *
+			mip->row_stride;
+		if (!mip->img_stride)
+			goto invalid_dim;
+
+		cache->mip_chain_bytes += mip->bytes;
+	}
+	cache->sheet_bytes = cache->mip_chain_bytes * num_layers;
+	if (!cache->sheet_bytes)
+		goto invalid_dim;
+
+	return 0;
+
+invalid_dim:
+	VMW_DEBUG_USER("Invalid surface layout for dirty tracking.\n");
+	return -EINVAL;
+}
+
+/**
+ * svga3dsurface_get_loc - Get a surface location from an offset into the
+ * backing store
+ * @cache: Surface layout data.
+ * @loc: Pointer to a struct svga3dsurface_loc to be filled in.
+ * @offset: Offset into the surface backing store.
+ */
+static inline void
+svga3dsurface_get_loc(const struct svga3dsurface_cache *cache,
+		      struct svga3dsurface_loc *loc,
+		      size_t offset)
+{
+	const struct svga3dsurface_mip *mip = &cache->mip[0];
+	const struct svga3d_surface_desc *desc = cache->desc;
+	u32 layer;
+	int i;
+
+	if (offset >= cache->sheet_bytes)
+		offset %= cache->sheet_bytes;
+
+	layer = offset / cache->mip_chain_bytes;
+	offset -= layer * cache->mip_chain_bytes;
+	for (i = 0; i < cache->num_mip_levels; ++i, ++mip) {
+		if (mip->bytes > offset)
+			break;
+		offset -= mip->bytes;
+	}
+
+	loc->sub_resource = svga3dsurface_subres(cache, i, layer);
+	loc->z = offset / mip->img_stride;
+	offset -= loc->z * mip->img_stride;
+	loc->z *= desc->block_size.depth;
+	loc->y = offset / mip->row_stride;
+	offset -= loc->y * mip->row_stride;
+	loc->y *= desc->block_size.height;
+	loc->x = offset / desc->bytes_per_block;
+	loc->x *= desc->block_size.width;
+}
+
+/**
+ * svga3dsurface_inc_loc - Clamp increment a surface location with one block
+ * size
+ * in each dimension.
+ * @loc: Pointer to a struct svga3dsurface_loc to be incremented.
+ *
+ * When computing the size of a range as size = end - start, the range does not
+ * include the end element. However a location representing the last byte
+ * of a touched region in the backing store *is* included in the range.
+ * This function modifies such a location to match the end definition
+ * given as start + size which is the one used in a SVGA3dBox.
+ */
+static inline void
+svga3dsurface_inc_loc(const struct svga3dsurface_cache *cache,
+		      struct svga3dsurface_loc *loc)
+{
+	const struct svga3d_surface_desc *desc = cache->desc;
+	u32 mip = loc->sub_resource % cache->num_mip_levels;
+	const struct drm_vmw_size *size = &cache->mip[mip].size;
+
+	loc->sub_resource++;
+	loc->x += desc->block_size.width;
+	if (loc->x > size->width)
+		loc->x = size->width;
+	loc->y += desc->block_size.height;
+	if (loc->y > size->height)
+		loc->y = size->height;
+	loc->z += desc->block_size.depth;
+	if (loc->z > size->depth)
+		loc->z = size->depth;
+}
+
+/**
+ * svga3dsurface_min_loc - The start location in a subresource
+ * @cache: Surface layout data.
+ * @sub_resource: The subresource.
+ * @loc: Pointer to a struct svga3dsurface_loc to be filled in.
+ */
+static inline void
+svga3dsurface_min_loc(const struct svga3dsurface_cache *cache,
+		      u32 sub_resource,
+		      struct svga3dsurface_loc *loc)
+{
+	loc->sub_resource = sub_resource;
+	loc->x = loc->y = loc->z = 0;
+}
+
+/**
+ * svga3dsurface_min_loc - The end location in a subresource
+ * @cache: Surface layout data.
+ * @sub_resource: The subresource.
+ * @loc: Pointer to a struct svga3dsurface_loc to be filled in.
+ *
+ * Following the end definition given in svga3dsurface_inc_loc(),
+ * Compute the end location of a surface subresource.
+ */
+static inline void
+svga3dsurface_max_loc(const struct svga3dsurface_cache *cache,
+		      u32 sub_resource,
+		      struct svga3dsurface_loc *loc)
+{
+	const struct drm_vmw_size *size;
+	u32 mip;
+
+	loc->sub_resource = sub_resource + 1;
+	mip = sub_resource % cache->num_mip_levels;
+	size = &cache->mip[mip].size;
+	loc->x = size->width;
+	loc->y = size->height;
+	loc->z = size->depth;
+}
+
 #endif /* _SVGA3D_SURFACEDEFS_H_ */
diff --git a/drivers/gpu/drm/vmwgfx/vmwgfx_drv.h b/drivers/gpu/drm/vmwgfx/vmwgfx_drv.h
index 729a2e93acf1..f5261e1c96d7 100644
--- a/drivers/gpu/drm/vmwgfx/vmwgfx_drv.h
+++ b/drivers/gpu/drm/vmwgfx/vmwgfx_drv.h
@@ -56,9 +56,9 @@
 
 
 #define VMWGFX_DRIVER_NAME "vmwgfx"
-#define VMWGFX_DRIVER_DATE "20180704"
+#define VMWGFX_DRIVER_DATE "20190328"
 #define VMWGFX_DRIVER_MAJOR 2
-#define VMWGFX_DRIVER_MINOR 15
+#define VMWGFX_DRIVER_MINOR 16
 #define VMWGFX_DRIVER_PATCHLEVEL 0
 #define VMWGFX_FIFO_STATIC_SIZE (1024*1024)
 #define VMWGFX_MAX_RELOCATIONS 2048
diff --git a/drivers/gpu/drm/vmwgfx/vmwgfx_surface.c b/drivers/gpu/drm/vmwgfx/vmwgfx_surface.c
index 29d8794f0421..876bada5b35e 100644
--- a/drivers/gpu/drm/vmwgfx/vmwgfx_surface.c
+++ b/drivers/gpu/drm/vmwgfx/vmwgfx_surface.c
@@ -68,6 +68,20 @@ struct vmw_surface_offset {
 	uint32_t bo_offset;
 };
 
+/**
+ * vmw_surface_dirty - Surface dirty-tracker
+ * @cache: Cached layout information of the surface.
+ * @size: Accounting size for the struct vmw_surface_dirty.
+ * @num_subres: Number of subresources.
+ * @boxes: Array of SVGA3dBoxes indicating dirty regions. One per subresource.
+ */
+struct vmw_surface_dirty {
+	struct svga3dsurface_cache cache;
+	size_t size;
+	u32 num_subres;
+	SVGA3dBox boxes[0];
+};
+
 static void vmw_user_surface_free(struct vmw_resource *res);
 static struct vmw_resource *
 vmw_user_surface_base_to_res(struct ttm_base_object *base);
@@ -96,6 +110,13 @@ vmw_gb_surface_reference_internal(struct drm_device *dev,
 				  struct drm_vmw_gb_surface_ref_ext_rep *rep,
 				  struct drm_file *file_priv);
 
+static void vmw_surface_dirty_free(struct vmw_resource *res);
+static int vmw_surface_dirty_alloc(struct vmw_resource *res);
+static int vmw_surface_dirty_sync(struct vmw_resource *res);
+static void vmw_surface_dirty_range_add(struct vmw_resource *res, size_t start,
+					size_t end);
+static int vmw_surface_clean(struct vmw_resource *res);
+
 static const struct vmw_user_resource_conv user_surface_conv = {
 	.object_type = VMW_RES_SURFACE,
 	.base_obj_to_res = vmw_user_surface_base_to_res,
@@ -133,7 +154,12 @@ static const struct vmw_res_func vmw_gb_surface_func = {
 	.create = vmw_gb_surface_create,
 	.destroy = vmw_gb_surface_destroy,
 	.bind = vmw_gb_surface_bind,
-	.unbind = vmw_gb_surface_unbind
+	.unbind = vmw_gb_surface_unbind,
+	.dirty_alloc = vmw_surface_dirty_alloc,
+	.dirty_free = vmw_surface_dirty_free,
+	.dirty_sync = vmw_surface_dirty_sync,
+	.dirty_range_add = vmw_surface_dirty_range_add,
+	.clean = vmw_surface_clean,
 };
 
 /**
@@ -641,6 +667,7 @@ static void vmw_user_surface_free(struct vmw_resource *res)
 	struct vmw_private *dev_priv = srf->res.dev_priv;
 	uint32_t size = user_srf->size;
 
+	WARN_ON_ONCE(res->dirty);
 	if (user_srf->master)
 		drm_master_put(&user_srf->master);
 	kfree(srf->offsets);
@@ -1168,10 +1195,16 @@ static int vmw_gb_surface_bind(struct vmw_resource *res,
 		cmd2->header.id = SVGA_3D_CMD_UPDATE_GB_SURFACE;
 		cmd2->header.size = sizeof(cmd2->body);
 		cmd2->body.sid = res->id;
-		res->backup_dirty = false;
 	}
 	vmw_fifo_commit(dev_priv, submit_size);
 
+	if (res->backup->dirty && res->backup_dirty) {
+		/* We've just made a full upload. Cear dirty regions. */
+		vmw_bo_dirty_clear_res(res);
+	}
+
+	res->backup_dirty = false;
+
 	return 0;
 }
 
@@ -1636,7 +1669,8 @@ vmw_gb_surface_define_internal(struct drm_device *dev,
 			}
 		}
 	} else if (req->base.drm_surface_flags &
-		   drm_vmw_surface_flag_create_buffer)
+		   (drm_vmw_surface_flag_create_buffer |
+		    drm_vmw_surface_flag_coherent))
 		ret = vmw_user_bo_alloc(dev_priv, tfile,
 					res->backup_size,
 					req->base.drm_surface_flags &
@@ -1650,6 +1684,26 @@ vmw_gb_surface_define_internal(struct drm_device *dev,
 		goto out_unlock;
 	}
 
+	if (req->base.drm_surface_flags & drm_vmw_surface_flag_coherent) {
+		struct vmw_buffer_object *backup = res->backup;
+
+		ttm_bo_reserve(&backup->base, false, false, NULL);
+		if (!res->func->dirty_alloc)
+			ret = -EINVAL;
+		if (!ret)
+			ret = vmw_bo_dirty_add(backup);
+		if (!ret) {
+			res->coherent = true;
+			ret = res->func->dirty_alloc(res);
+		}
+		ttm_bo_unreserve(&backup->base);
+		if (ret) {
+			vmw_resource_unreference(&res);
+			goto out_unlock;
+		}
+
+	}
+
 	tmp = vmw_resource_reference(res);
 	ret = ttm_prime_object_init(tfile, res->backup_size, &user_srf->prime,
 				    req->base.drm_surface_flags &
@@ -1758,3 +1812,338 @@ vmw_gb_surface_reference_internal(struct drm_device *dev,
 
 	return ret;
 }
+
+/**
+ * vmw_subres_dirty_add - Add a dirty region to a subresource
+ * @dirty: The surfaces's dirty tracker.
+ * @loc_start: The location corresponding to the start of the region.
+ * @loc_end: The location corresponding to the end of the region.
+ *
+ * As we are assuming that @loc_start and @loc_end represent a sequential
+ * range of backing store memory, if the region spans multiple lines then
+ * regardless of the x coordinate, the full lines are dirtied.
+ * Correspondingly if the region spans multiple z slices, then full rather
+ * than partial z slices are dirtied.
+ */
+static void vmw_subres_dirty_add(struct vmw_surface_dirty *dirty,
+				 const struct svga3dsurface_loc *loc_start,
+				 const struct svga3dsurface_loc *loc_end)
+{
+	const struct svga3dsurface_cache *cache = &dirty->cache;
+	SVGA3dBox *box = &dirty->boxes[loc_start->sub_resource];
+	u32 mip = loc_start->sub_resource % cache->num_mip_levels;
+	const struct drm_vmw_size *size = &cache->mip[mip].size;
+	u32 box_c2 = box->z + box->d;
+
+	if (WARN_ON(loc_start->sub_resource >= dirty->num_subres))
+		return;
+
+	if (box->d == 0 || box->z > loc_start->z)
+		box->z = loc_start->z;
+	if (box_c2 < loc_end->z)
+		box->d = loc_end->z - box->z;
+
+	if (loc_start->z + 1 == loc_end->z) {
+		box_c2 = box->y + box->h;
+		if (box->h == 0 || box->y > loc_start->y)
+			box->y = loc_start->y;
+		if (box_c2 < loc_end->y)
+			box->h = loc_end->y - box->y;
+
+		if (loc_start->y + 1 == loc_end->y) {
+			box_c2 = box->x + box->w;
+			if (box->w == 0 || box->x > loc_start->x)
+				box->x = loc_start->x;
+			if (box_c2 < loc_end->x)
+				box->w = loc_end->x - box->x;
+		} else {
+			box->x = 0;
+			box->w = size->width;
+		}
+	} else {
+		box->y = 0;
+		box->h = size->height;
+		box->x = 0;
+		box->w = size->width;
+	}
+}
+
+/**
+ * vmw_subres_dirty_full - Mark a full subresource as dirty
+ * @dirty: The surface's dirty tracker.
+ * @subres: The subresource
+ */
+static void vmw_subres_dirty_full(struct vmw_surface_dirty *dirty, u32 subres)
+{
+	const struct svga3dsurface_cache *cache = &dirty->cache;
+	u32 mip = subres % cache->num_mip_levels;
+	const struct drm_vmw_size *size = &cache->mip[mip].size;
+	SVGA3dBox *box = &dirty->boxes[subres];
+
+	box->x = 0;
+	box->y = 0;
+	box->z = 0;
+	box->w = size->width;
+	box->h = size->height;
+	box->d = size->depth;
+}
+
+/*
+ * vmw_surface_tex_dirty_add_range - The dirty_add_range callback for texture
+ * surfaces.
+ */
+static void vmw_surface_tex_dirty_range_add(struct vmw_resource *res,
+					    size_t start, size_t end)
+{
+	struct vmw_surface_dirty *dirty =
+		(struct vmw_surface_dirty *) res->dirty;
+	size_t backup_end = res->backup_offset + res->backup_size;
+	struct svga3dsurface_loc loc1, loc2;
+	const struct svga3dsurface_cache *cache;
+
+	start = max_t(size_t, start, res->backup_offset) - res->backup_offset;
+	end = min(end, backup_end) - res->backup_offset;
+	cache = &dirty->cache;
+	svga3dsurface_get_loc(cache, &loc1, start);
+	svga3dsurface_get_loc(cache, &loc2, end - 1);
+	svga3dsurface_inc_loc(cache, &loc2);
+
+	if (loc1.sub_resource + 1 == loc2.sub_resource) {
+		/* Dirty range covers a single sub-resource */
+		vmw_subres_dirty_add(dirty, &loc1, &loc2);
+	} else {
+		/* Dirty range covers multiple sub-resources */
+		struct svga3dsurface_loc loc_min, loc_max;
+		u32 sub_res = loc1.sub_resource;
+
+		svga3dsurface_max_loc(cache, loc1.sub_resource, &loc_max);
+		vmw_subres_dirty_add(dirty, &loc1, &loc_max);
+		svga3dsurface_min_loc(cache, loc2.sub_resource - 1, &loc_min);
+		vmw_subres_dirty_add(dirty, &loc_min, &loc2);
+		for (sub_res = loc1.sub_resource + 1;
+		     sub_res < loc2.sub_resource - 1; ++sub_res)
+			vmw_subres_dirty_full(dirty, sub_res);
+	}
+}
+
+/*
+ * vmw_surface_tex_dirty_add_range - The dirty_add_range callback for buffer
+ * surfaces.
+ */
+static void vmw_surface_buf_dirty_range_add(struct vmw_resource *res,
+					    size_t start, size_t end)
+{
+	struct vmw_surface_dirty *dirty =
+		(struct vmw_surface_dirty *) res->dirty;
+	const struct svga3dsurface_cache *cache = &dirty->cache;
+	size_t backup_end = res->backup_offset + cache->mip_chain_bytes;
+	SVGA3dBox *box = &dirty->boxes[0];
+	u32 box_c2;
+
+	box->h = box->d = 1;
+	start = max_t(size_t, start, res->backup_offset) - res->backup_offset;
+	end = min(end, backup_end) - res->backup_offset;
+	box_c2 = box->x + box->w;
+	if (box->w == 0 || box->x > start)
+		box->x = start;
+	if (box_c2 < end)
+		box->w = end - box->x;
+}
+
+/*
+ * vmw_surface_tex_dirty_add_range - The dirty_add_range callback for surfaces
+ */
+static void vmw_surface_dirty_range_add(struct vmw_resource *res, size_t start,
+					size_t end)
+{
+	struct vmw_surface *srf = vmw_res_to_srf(res);
+
+	if (WARN_ON(end <= res->backup_offset ||
+		    start >= res->backup_offset + res->backup_size))
+		return;
+
+	if (srf->format == SVGA3D_BUFFER)
+		vmw_surface_buf_dirty_range_add(res, start, end);
+	else
+		vmw_surface_tex_dirty_range_add(res, start, end);
+}
+
+/*
+ * vmw_surface_dirty_sync - The surface's dirty_sync callback.
+ */
+static int vmw_surface_dirty_sync(struct vmw_resource *res)
+{
+	struct vmw_private *dev_priv = res->dev_priv;
+	bool has_dx = 0;
+	u32 i, num_dirty;
+	struct vmw_surface_dirty *dirty =
+		(struct vmw_surface_dirty *) res->dirty;
+	size_t alloc_size;
+	const struct svga3dsurface_cache *cache = &dirty->cache;
+	struct {
+		SVGA3dCmdHeader header;
+		SVGA3dCmdDXUpdateSubResource body;
+	} *cmd1;
+	struct {
+		SVGA3dCmdHeader header;
+		SVGA3dCmdUpdateGBImage body;
+	} *cmd2;
+	void *cmd;
+
+	num_dirty = 0;
+	for (i = 0; i < dirty->num_subres; ++i) {
+		const SVGA3dBox *box = &dirty->boxes[i];
+
+		if (box->d)
+			num_dirty++;
+	}
+
+	if (!num_dirty)
+		goto out;
+
+	alloc_size = num_dirty * ((has_dx) ? sizeof(*cmd1) : sizeof(*cmd2));
+	cmd = VMW_FIFO_RESERVE(dev_priv, alloc_size);
+	if (!cmd)
+		return -ENOMEM;
+
+	cmd1 = cmd;
+	cmd2 = cmd;
+
+	for (i = 0; i < dirty->num_subres; ++i) {
+		const SVGA3dBox *box = &dirty->boxes[i];
+
+		if (!box->d)
+			continue;
+
+		/*
+		 * DX_UPDATE_SUBRESOURCE is aware of array surfaces.
+		 * UPDATE_GB_IMAGE is not.
+		 */
+		if (has_dx) {
+			cmd1->header.id = SVGA_3D_CMD_DX_UPDATE_SUBRESOURCE;
+			cmd1->header.size = sizeof(cmd1->body);
+			cmd1->body.sid = res->id;
+			cmd1->body.subResource = i;
+			cmd1->body.box = *box;
+			cmd1++;
+		} else {
+			cmd2->header.id = SVGA_3D_CMD_UPDATE_GB_IMAGE;
+			cmd2->header.size = sizeof(cmd2->body);
+			cmd2->body.image.sid = res->id;
+			cmd2->body.image.face = i / cache->num_mip_levels;
+			cmd2->body.image.mipmap = i -
+				(cache->num_mip_levels * cmd2->body.image.face);
+			cmd2->body.box = *box;
+			cmd2++;
+		}
+
+	}
+	vmw_fifo_commit(dev_priv, alloc_size);
+ out:
+	memset(&dirty->boxes[0], 0, sizeof(dirty->boxes[0]) *
+	       dirty->num_subres);
+
+	return 0;
+}
+
+/*
+ * vmw_surface_dirty_alloc - The surface's dirty_alloc callback.
+ */
+static int vmw_surface_dirty_alloc(struct vmw_resource *res)
+{
+	struct vmw_surface *srf = vmw_res_to_srf(res);
+	struct vmw_surface_dirty *dirty;
+	u32 num_layers = 1;
+	u32 num_mip;
+	u32 num_subres;
+	u32 num_samples;
+	size_t dirty_size, acc_size;
+	static struct ttm_operation_ctx ctx = {
+		.interruptible = false,
+		.no_wait_gpu = false
+	};
+	int ret;
+
+	if (srf->array_size)
+		num_layers = srf->array_size;
+	else if (srf->flags & SVGA3D_SURFACE_CUBEMAP)
+		num_layers *= SVGA3D_MAX_SURFACE_FACES;
+
+	num_mip = srf->mip_levels[0];
+	if (!num_mip)
+		num_mip = 1;
+
+	num_subres = num_layers * num_mip;
+	dirty_size = sizeof(*dirty) + num_subres * sizeof(dirty->boxes[0]);
+	acc_size = ttm_round_pot(dirty_size);
+	ret = ttm_mem_global_alloc(vmw_mem_glob(res->dev_priv),
+				   acc_size, &ctx);
+	if (ret) {
+		VMW_DEBUG_USER("Out of graphics memory for surface "
+			       "dirty tracker.\n");
+		return ret;
+	}
+
+	dirty = kvzalloc(dirty_size, GFP_KERNEL);
+	if (!dirty) {
+		ret = -ENOMEM;
+		goto out_no_dirty;
+	}
+
+	num_samples = max_t(u32, 1, srf->multisample_count);
+	ret = svga3dsurface_setup_cache(&srf->base_size, srf->format, num_mip,
+					num_layers, num_samples, &dirty->cache);
+	if (ret)
+		goto out_no_cache;
+
+	dirty->num_subres = num_subres;
+	dirty->size = acc_size;
+	res->dirty = (struct vmw_resource_dirty *) dirty;
+
+	return 0;
+
+out_no_cache:
+	kvfree(dirty);
+out_no_dirty:
+	ttm_mem_global_free(vmw_mem_glob(res->dev_priv), acc_size);
+	return ret;
+}
+
+/*
+ * vmw_surface_dirty_free - The surface's dirty_free callback
+ */
+static void vmw_surface_dirty_free(struct vmw_resource *res)
+{
+	struct vmw_surface_dirty *dirty =
+		(struct vmw_surface_dirty *) res->dirty;
+	size_t acc_size = dirty->size;
+
+	kvfree(dirty);
+	ttm_mem_global_free(vmw_mem_glob(res->dev_priv), acc_size);
+	res->dirty = NULL;
+}
+
+/*
+ * vmw_surface_clean - The surface's clean callback
+ */
+static int vmw_surface_clean(struct vmw_resource *res)
+{
+	struct vmw_private *dev_priv = res->dev_priv;
+	size_t alloc_size;
+	struct {
+		SVGA3dCmdHeader header;
+		SVGA3dCmdReadbackGBSurface body;
+	} *cmd;
+
+	alloc_size = sizeof(*cmd);
+	cmd = VMW_FIFO_RESERVE(dev_priv, alloc_size);
+	if (!cmd)
+		return -ENOMEM;
+
+	cmd->header.id = SVGA_3D_CMD_READBACK_GB_SURFACE;
+	cmd->header.size = sizeof(cmd->body);
+	cmd->body.sid = res->id;
+	vmw_fifo_commit(dev_priv, alloc_size);
+
+	return 0;
+}
diff --git a/include/uapi/drm/vmwgfx_drm.h b/include/uapi/drm/vmwgfx_drm.h
index 399f58317cff..02cab33f2f25 100644
--- a/include/uapi/drm/vmwgfx_drm.h
+++ b/include/uapi/drm/vmwgfx_drm.h
@@ -891,11 +891,13 @@ struct drm_vmw_shader_arg {
  *                                      surface.
  * @drm_vmw_surface_flag_create_buffer: Create a backup buffer if none is
  *                                      given.
+ * @drm_vmw_surface_flag_coherent:      Back surface with coherent memory.
  */
 enum drm_vmw_surface_flags {
 	drm_vmw_surface_flag_shareable = (1 << 0),
 	drm_vmw_surface_flag_scanout = (1 << 1),
-	drm_vmw_surface_flag_create_buffer = (1 << 2)
+	drm_vmw_surface_flag_create_buffer = (1 << 2),
+	drm_vmw_surface_flag_coherent = (1 << 3),
 };
 
 /**
-- 
2.21.0



^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v4 5/9] mm: Add write-protect and clean utilities for address space ranges
  2019-10-08  9:15 ` [PATCH v4 5/9] mm: Add write-protect and clean utilities for address space ranges Thomas Hellström (VMware)
@ 2019-10-08 17:06   ` Linus Torvalds
  2019-10-08 18:25     ` Thomas Hellstrom
  0 siblings, 1 reply; 32+ messages in thread
From: Linus Torvalds @ 2019-10-08 17:06 UTC (permalink / raw)
  To: Thomas Hellström (VMware)
  Cc: Linux Kernel Mailing List, Linux-MM, Thomas Hellstrom,
	Andrew Morton, Matthew Wilcox, Will Deacon, Peter Zijlstra,
	Rik van Riel, Minchan Kim, Michal Hocko, Huang Ying,
	Jérôme Glisse, Kirill A . Shutemov

On Tue, Oct 8, 2019 at 2:15 AM Thomas Hellström (VMware)
<thomas_os@shipmail.org> wrote:
>
> Add two utilities to 1) write-protect and 2) clean all ptes pointing into
> a range of an address space.

The code looks much simpler and easier to understand now, I think, but
that also makes some thing more obvious..

> +static int clean_record_pte(pte_t *pte, unsigned long addr,
> +                           unsigned long end, struct mm_walk *walk)
> +{
> +       struct wp_walk *wpwalk = walk->private;
> +       struct clean_walk *cwalk = to_clean_walk(wpwalk);
> +       pte_t ptent = *pte;
> +
> +       if (pte_dirty(ptent)) {
> +               pgoff_t pgoff = ((addr - walk->vma->vm_start) >> PAGE_SHIFT) +
> +                       walk->vma->vm_pgoff - cwalk->bitmap_pgoff;
> +               pte_t old_pte = ptep_modify_prot_start(walk->vma, addr, pte);
> +
> +               ptent = pte_mkclean(old_pte);
> +               ptep_modify_prot_commit(walk->vma, addr, pte, old_pte, ptent);
> +
> +               wpwalk->total++;
> +               wpwalk->tlbflush_start = min(wpwalk->tlbflush_start, addr);
> +               wpwalk->tlbflush_end = max(wpwalk->tlbflush_end,
> +                                          addr + PAGE_SIZE);
> +
> +               __set_bit(pgoff, cwalk->bitmap);

The above looks fundamentally racy.

You clean the pte in memory, but it remains dirty and writable in the
TLB, and you only flush it much later.

So now writes can continue to happen to that page, without it
necessarily being marked dirty again in the page tables, because all
the CPU TLB caches say "it's already dirty".

This may be ok - you've moved the diry bit into that bitmap, and you
will flush the TLB before then taking action on the bitmap. So you
haven't really _lost_ any dirty bits, but it still may be worth a
comment.

You do have comments about the _other_ issues (ie the whole "If a
caller needs to make sure all dirty ptes are picked up and none
additional are added..") but you don't actually have comments about
the TLB race basically potentially now causing "missing" dirty stuff.

And this may actually be a big problem on some architectures. Not x86,
and maybe nothing else, but I have this dim memory of some
architectures being unhappy due to virtual caches, and writeback when
the page table entry says it's read-only and clean.

Our normal "clean pages" is very anal about this, and does

                        flush_cache_page(vma, address, pte_pfn(*pte));
                        entry = ptep_clear_flush(vma, address, pte);
                        entry = pte_wrprotect(entry);
                        entry = pte_mkclean(entry);
                        set_pte_at(vma->vm_mm, address, pte, entry);

when it cleans a page, and I note both the cache flush and the
"clear_flush()" (which invalidates the pte and does a synchronous TLB
flush) and we have magic semantics for that at least on s390 because
there are some low-level architecture details wrt flushing TLB entries
and modifying them.

End result: I think the code is probably ok in practice because you
don't mind the slightly fuzzy dirty bit state, and it's _probably_ ok
on all architectures that use drm for the TLB invalidation side. But I
think it bears a bit of thinking about.

              Linus


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v4 5/9] mm: Add write-protect and clean utilities for address space ranges
  2019-10-08 17:06   ` Linus Torvalds
@ 2019-10-08 18:25     ` Thomas Hellstrom
  0 siblings, 0 replies; 32+ messages in thread
From: Thomas Hellstrom @ 2019-10-08 18:25 UTC (permalink / raw)
  To: Linus Torvalds, Thomas Hellström (VMware)
  Cc: Linux Kernel Mailing List, Linux-MM, Andrew Morton,
	Matthew Wilcox, Will Deacon, Peter Zijlstra, Rik van Riel,
	Minchan Kim, Michal Hocko, Huang Ying, Jérôme Glisse,
	Kirill A . Shutemov

On 10/8/19 7:07 PM, Linus Torvalds wrote:
> On Tue, Oct 8, 2019 at 2:15 AM Thomas Hellström (VMware)
> <thomas_os@shipmail.org> wrote:
>> Add two utilities to 1) write-protect and 2) clean all ptes pointing into
>> a range of an address space.
> The code looks much simpler and easier to understand now, I think, but
> that also makes some thing more obvious..
>
>> +static int clean_record_pte(pte_t *pte, unsigned long addr,
>> +                           unsigned long end, struct mm_walk *walk)
>> +{
>> +       struct wp_walk *wpwalk = walk->private;
>> +       struct clean_walk *cwalk = to_clean_walk(wpwalk);
>> +       pte_t ptent = *pte;
>> +
>> +       if (pte_dirty(ptent)) {
>> +               pgoff_t pgoff = ((addr - walk->vma->vm_start) >> PAGE_SHIFT) +
>> +                       walk->vma->vm_pgoff - cwalk->bitmap_pgoff;
>> +               pte_t old_pte = ptep_modify_prot_start(walk->vma, addr, pte);
>> +
>> +               ptent = pte_mkclean(old_pte);
>> +               ptep_modify_prot_commit(walk->vma, addr, pte, old_pte, ptent);
>> +
>> +               wpwalk->total++;
>> +               wpwalk->tlbflush_start = min(wpwalk->tlbflush_start, addr);
>> +               wpwalk->tlbflush_end = max(wpwalk->tlbflush_end,
>> +                                          addr + PAGE_SIZE);
>> +
>> +               __set_bit(pgoff, cwalk->bitmap);
> The above looks fundamentally racy.
>
> You clean the pte in memory, but it remains dirty and writable in the
> TLB, and you only flush it much later.
>
> So now writes can continue to happen to that page, without it
> necessarily being marked dirty again in the page tables, because all
> the CPU TLB caches say "it's already dirty".
>
> This may be ok - you've moved the diry bit into that bitmap, and you
> will flush the TLB before then taking action on the bitmap. So you
> haven't really _lost_ any dirty bits, but it still may be worth a
> comment.
>
> You do have comments about the _other_ issues (ie the whole "If a
> caller needs to make sure all dirty ptes are picked up and none
> additional are added..") but you don't actually have comments about
> the TLB race basically potentially now causing "missing" dirty stuff.
>
> And this may actually be a big problem on some architectures. Not x86,
> and maybe nothing else, but I have this dim memory of some
> architectures being unhappy due to virtual caches, and writeback when
> the page table entry says it's read-only and clean.
>
> Our normal "clean pages" is very anal about this, and does
>
>                         flush_cache_page(vma, address, pte_pfn(*pte));
>                         entry = ptep_clear_flush(vma, address, pte);
>                         entry = pte_wrprotect(entry);
>                         entry = pte_mkclean(entry);
>                         set_pte_at(vma->vm_mm, address, pte, entry);
>
> when it cleans a page, and I note both the cache flush and the
> "clear_flush()" (which invalidates the pte and does a synchronous TLB
> flush) and we have magic semantics for that at least on s390 because
> there are some low-level architecture details wrt flushing TLB entries
> and modifying them.
>
> End result: I think the code is probably ok in practice because you
> don't mind the slightly fuzzy dirty bit state, and it's _probably_ ok
> on all architectures that use drm for the TLB invalidation side. But I
> think it bears a bit of thinking about.

Yes, there's been some considerable thinking behind this. We do have the
cache flush in the pre_vma callback, and as you mention the TLB flush
happens before we use the bitmap: any pte that may be subject to a race
is recorded in the bitmap, and the guarantee we want to provide with the
function is

a) All ptes that are dirtied before the function starts executing will
be recorded in the bitmap.
b) All ptes dirtied after that will either be recorded in the bitmap,
remain dirty or both.

I probably should add that in the docs.

And the actual writeback happens asynchronously a lot later *should* be
OK, as long as caches are synced for the dma operation. Of course if
this somehow fails on a particular architecture I guess we need to
rethink and use ptep_clear_flush() at least on that architecture.

Thanks,
Thomas.





>
>               Linus
>



^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v4 2/9] mm: pagewalk: Take the pagetable lock in walk_pte_range()
  2019-10-08  9:15 ` [PATCH v4 2/9] mm: pagewalk: Take the pagetable lock in walk_pte_range() Thomas Hellström (VMware)
@ 2019-10-09 15:14   ` Kirill A. Shutemov
  2019-10-09 16:07     ` Linus Torvalds
  0 siblings, 1 reply; 32+ messages in thread
From: Kirill A. Shutemov @ 2019-10-09 15:14 UTC (permalink / raw)
  To: Thomas Hellström (VMware)
  Cc: linux-kernel, linux-mm, torvalds, Thomas Hellstrom,
	Matthew Wilcox, Will Deacon, Peter Zijlstra, Rik van Riel,
	Minchan Kim, Michal Hocko, Huang Ying, Jérôme Glisse

On Tue, Oct 08, 2019 at 11:15:01AM +0200, Thomas Hellström (VMware) wrote:
> From: Thomas Hellstrom <thellstrom@vmware.com>
> 
> Without the lock, anybody modifying a pte from within this function might
> have it concurrently modified by someone else.
> 
> Cc: Matthew Wilcox <willy@infradead.org>
> Cc: Will Deacon <will.deacon@arm.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Rik van Riel <riel@surriel.com>
> Cc: Minchan Kim <minchan@kernel.org>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Huang Ying <ying.huang@intel.com>
> Cc: Jérôme Glisse <jglisse@redhat.com>
> Cc: Kirill A. Shutemov <kirill@shutemov.name>
> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
> Signed-off-by: Thomas Hellstrom <thellstrom@vmware.com>
> ---
>  mm/pagewalk.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/pagewalk.c b/mm/pagewalk.c
> index d48c2a986ea3..83c0b78363b4 100644
> --- a/mm/pagewalk.c
> +++ b/mm/pagewalk.c
> @@ -10,8 +10,9 @@ static int walk_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
>  	pte_t *pte;
>  	int err = 0;
>  	const struct mm_walk_ops *ops = walk->ops;
> +	spinlock_t *ptl;
>  
> -	pte = pte_offset_map(pmd, addr);
> +	pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
>  	for (;;) {
>  		err = ops->pte_entry(pte, addr, addr + PAGE_SIZE, walk);
>  		if (err)
> @@ -22,7 +23,7 @@ static int walk_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
>  		pte++;
>  	}
>  
> -	pte_unmap(pte);
> +	pte_unmap_unlock(pte - 1, ptl);

NAK.

If ->pte_entry() fails on the first entry of the page table, pte - 1 will
point out side the page table.

And the '- 1' is totally unnecessary as we break the loop before pte++ on
the last iteration.

-- 
 Kirill A. Shutemov


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v4 3/9] mm: pagewalk: Don't split transhuge pmds when a pmd_entry is present
  2019-10-08  9:15 ` [PATCH v4 3/9] mm: pagewalk: Don't split transhuge pmds when a pmd_entry is present Thomas Hellström (VMware)
@ 2019-10-09 15:27   ` Kirill A. Shutemov
  2019-10-09 15:27     ` Kirill A. Shutemov
                       ` (2 more replies)
  0 siblings, 3 replies; 32+ messages in thread
From: Kirill A. Shutemov @ 2019-10-09 15:27 UTC (permalink / raw)
  To: Thomas Hellström (VMware)
  Cc: linux-kernel, linux-mm, torvalds, Thomas Hellstrom,
	Matthew Wilcox, Will Deacon, Peter Zijlstra, Rik van Riel,
	Minchan Kim, Michal Hocko, Huang Ying, Jérôme Glisse

On Tue, Oct 08, 2019 at 11:15:02AM +0200, Thomas Hellström (VMware) wrote:
> From: Thomas Hellstrom <thellstrom@vmware.com>
> 
> The pagewalk code was unconditionally splitting transhuge pmds when a
> pte_entry was present. However ideally we'd want to handle transhuge pmds
> in the pmd_entry function and ptes in pte_entry function. So don't split
> huge pmds when there is a pmd_entry function present, but let the callback
> take care of it if necessary.

Do we have any current user that expect split_huge_pmd() in this scenario.

> 
> In order to make sure a virtual address range is handled by one and only
> one callback, and since pmd entries may be unstable, we introduce a
> pmd_entry return code that tells the walk code to continue processing this
> pmd entry rather than to move on. Since caller-defined positive return
> codes (up to 2) are used by current callers, use a high value that allows a
> large range of positive caller-defined return codes for future users.
> 
> Cc: Matthew Wilcox <willy@infradead.org>
> Cc: Will Deacon <will.deacon@arm.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Rik van Riel <riel@surriel.com>
> Cc: Minchan Kim <minchan@kernel.org>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Huang Ying <ying.huang@intel.com>
> Cc: Jérôme Glisse <jglisse@redhat.com>
> Cc: Kirill A. Shutemov <kirill@shutemov.name>
> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
> Signed-off-by: Thomas Hellstrom <thellstrom@vmware.com>
> ---
>  include/linux/pagewalk.h |  8 ++++++++
>  mm/pagewalk.c            | 28 +++++++++++++++++++++-------
>  2 files changed, 29 insertions(+), 7 deletions(-)
> 
> diff --git a/include/linux/pagewalk.h b/include/linux/pagewalk.h
> index bddd9759bab9..c4a013eb445d 100644
> --- a/include/linux/pagewalk.h
> +++ b/include/linux/pagewalk.h
> @@ -4,6 +4,11 @@
>  
>  #include <linux/mm.h>
>  
> +/* Highest positive pmd_entry caller-specific return value */
> +#define PAGE_WALK_CALLER_MAX     (INT_MAX / 2)
> +/* The handler did not handle the entry. Fall back to the next level */
> +#define PAGE_WALK_FALLBACK       (PAGE_WALK_CALLER_MAX + 1)
> +

That's hacky.

Maybe just use an error code for this? -EAGAIN?

>  struct mm_walk;
>  
>  /**
> @@ -16,6 +21,9 @@ struct mm_walk;
>   *			this handler is required to be able to handle
>   *			pmd_trans_huge() pmds.  They may simply choose to
>   *			split_huge_page() instead of handling it explicitly.
> + *                      If the handler did not handle the PMD, or split the
> + *                      PMD and wants it handled by the PTE handler, it
> + *                      should return PAGE_WALK_FALLBACK.

Indentation is broken. Use tabs.

>   * @pte_entry:		if set, called for each non-empty PTE (4th-level) entry
>   * @pte_hole:		if set, called for each hole at all levels
>   * @hugetlb_entry:	if set, called for each hugetlb entry
> diff --git a/mm/pagewalk.c b/mm/pagewalk.c
> index 83c0b78363b4..f844c2a2aa60 100644
> --- a/mm/pagewalk.c
> +++ b/mm/pagewalk.c
> @@ -50,10 +50,18 @@ static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
>  		 * This implies that each ->pmd_entry() handler
>  		 * needs to know about pmd_trans_huge() pmds
>  		 */
> -		if (ops->pmd_entry)
> +		if (ops->pmd_entry) {
>  			err = ops->pmd_entry(pmd, addr, next, walk);
> -		if (err)
> -			break;
> +			if (!err)
> +				continue;
> +			else if (err <= PAGE_WALK_CALLER_MAX)
> +				break;
> +			WARN_ON(err != PAGE_WALK_FALLBACK);
> +			err = 0;
> +			if (pmd_trans_unstable(pmd))
> +				goto again;
> +			/* Fall through */
> +		}
>  
>  		/*
>  		 * Check this here so we only break down trans_huge
> @@ -61,8 +69,8 @@ static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
>  		 */
>  		if (!ops->pte_entry)
>  			continue;
> -
> -		split_huge_pmd(walk->vma, pmd, addr);
> +		if (!ops->pmd_entry)
> +			split_huge_pmd(walk->vma, pmd, addr);
>  		if (pmd_trans_unstable(pmd))
>  			goto again;
>  		err = walk_pte_range(pmd, addr, next, walk);
> @@ -281,11 +289,17 @@ static int __walk_page_range(unsigned long start, unsigned long end,
>   *
>   *  - 0  : succeeded to handle the current entry, and if you don't reach the
>   *         end address yet, continue to walk.
> - *  - >0 : succeeded to handle the current entry, and return to the caller
> - *         with caller specific value.
> + *  - >0, and <= PAGE_WALK_CALLER_MAX : succeeded to handle the current entry,
> + *         and return to the caller with caller specific value.
>   *  - <0 : failed to handle the current entry, and return to the caller
>   *         with error code.
>   *
> + * For pmd_entry(), a value <= PAGE_WALK_CALLER_MAX indicates that the entry
> + * was handled by the callback. PAGE_WALK_FALLBACK indicates that the entry
> + * could not be handled by the callback and should be re-checked. If the
> + * callback needs the entry to be handled by the next level, it should
> + * split the entry and then return PAGE_WALK_FALLBACK.
> + *
>   * Before starting to walk page table, some callers want to check whether
>   * they really want to walk over the current vma, typically by checking
>   * its vm_flags. walk_page_test() and @ops->test_walk() are used for this
> -- 
> 2.21.0
> 

-- 
 Kirill A. Shutemov


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v4 3/9] mm: pagewalk: Don't split transhuge pmds when a pmd_entry is present
  2019-10-09 15:27   ` Kirill A. Shutemov
@ 2019-10-09 15:27     ` Kirill A. Shutemov
  2019-10-09 16:20     ` Thomas Hellström (VMware)
  2019-10-09 16:21     ` Linus Torvalds
  2 siblings, 0 replies; 32+ messages in thread
From: Kirill A. Shutemov @ 2019-10-09 15:27 UTC (permalink / raw)
  To: Thomas Hellström (VMware)
  Cc: linux-kernel, linux-mm, torvalds, Thomas Hellstrom,
	Matthew Wilcox, Will Deacon, Peter Zijlstra, Rik van Riel,
	Minchan Kim, Michal Hocko, Huang Ying, Jérôme Glisse

On Tue, Oct 08, 2019 at 11:15:02AM +0200, Thomas Hellström (VMware) wrote:
> From: Thomas Hellstrom <thellstrom@vmware.com>
> 
> The pagewalk code was unconditionally splitting transhuge pmds when a
> pte_entry was present. However ideally we'd want to handle transhuge pmds
> in the pmd_entry function and ptes in pte_entry function. So don't split
> huge pmds when there is a pmd_entry function present, but let the callback
> take care of it if necessary.

Do we have any current user that expect split_huge_pmd() in this scenario.

> 
> In order to make sure a virtual address range is handled by one and only
> one callback, and since pmd entries may be unstable, we introduce a
> pmd_entry return code that tells the walk code to continue processing this
> pmd entry rather than to move on. Since caller-defined positive return
> codes (up to 2) are used by current callers, use a high value that allows a
> large range of positive caller-defined return codes for future users.
> 
> Cc: Matthew Wilcox <willy@infradead.org>
> Cc: Will Deacon <will.deacon@arm.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Rik van Riel <riel@surriel.com>
> Cc: Minchan Kim <minchan@kernel.org>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Huang Ying <ying.huang@intel.com>
> Cc: Jérôme Glisse <jglisse@redhat.com>
> Cc: Kirill A. Shutemov <kirill@shutemov.name>
> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
> Signed-off-by: Thomas Hellstrom <thellstrom@vmware.com>
> ---
>  include/linux/pagewalk.h |  8 ++++++++
>  mm/pagewalk.c            | 28 +++++++++++++++++++++-------
>  2 files changed, 29 insertions(+), 7 deletions(-)
> 
> diff --git a/include/linux/pagewalk.h b/include/linux/pagewalk.h
> index bddd9759bab9..c4a013eb445d 100644
> --- a/include/linux/pagewalk.h
> +++ b/include/linux/pagewalk.h
> @@ -4,6 +4,11 @@
>  
>  #include <linux/mm.h>
>  
> +/* Highest positive pmd_entry caller-specific return value */
> +#define PAGE_WALK_CALLER_MAX     (INT_MAX / 2)
> +/* The handler did not handle the entry. Fall back to the next level */
> +#define PAGE_WALK_FALLBACK       (PAGE_WALK_CALLER_MAX + 1)
> +

That's hacky.

Maybe just use an error code for this? -EAGAIN?

>  struct mm_walk;
>  
>  /**
> @@ -16,6 +21,9 @@ struct mm_walk;
>   *			this handler is required to be able to handle
>   *			pmd_trans_huge() pmds.  They may simply choose to
>   *			split_huge_page() instead of handling it explicitly.
> + *                      If the handler did not handle the PMD, or split the
> + *                      PMD and wants it handled by the PTE handler, it
> + *                      should return PAGE_WALK_FALLBACK.

Indentation is broken. Use tabs.

>   * @pte_entry:		if set, called for each non-empty PTE (4th-level) entry
>   * @pte_hole:		if set, called for each hole at all levels
>   * @hugetlb_entry:	if set, called for each hugetlb entry
> diff --git a/mm/pagewalk.c b/mm/pagewalk.c
> index 83c0b78363b4..f844c2a2aa60 100644
> --- a/mm/pagewalk.c
> +++ b/mm/pagewalk.c
> @@ -50,10 +50,18 @@ static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
>  		 * This implies that each ->pmd_entry() handler
>  		 * needs to know about pmd_trans_huge() pmds
>  		 */
> -		if (ops->pmd_entry)
> +		if (ops->pmd_entry) {
>  			err = ops->pmd_entry(pmd, addr, next, walk);
> -		if (err)
> -			break;
> +			if (!err)
> +				continue;
> +			else if (err <= PAGE_WALK_CALLER_MAX)
> +				break;
> +			WARN_ON(err != PAGE_WALK_FALLBACK);
> +			err = 0;
> +			if (pmd_trans_unstable(pmd))
> +				goto again;
> +			/* Fall through */
> +		}
>  
>  		/*
>  		 * Check this here so we only break down trans_huge
> @@ -61,8 +69,8 @@ static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
>  		 */
>  		if (!ops->pte_entry)
>  			continue;
> -
> -		split_huge_pmd(walk->vma, pmd, addr);
> +		if (!ops->pmd_entry)
> +			split_huge_pmd(walk->vma, pmd, addr);
>  		if (pmd_trans_unstable(pmd))
>  			goto again;
>  		err = walk_pte_range(pmd, addr, next, walk);
> @@ -281,11 +289,17 @@ static int __walk_page_range(unsigned long start, unsigned long end,
>   *
>   *  - 0  : succeeded to handle the current entry, and if you don't reach the
>   *         end address yet, continue to walk.
> - *  - >0 : succeeded to handle the current entry, and return to the caller
> - *         with caller specific value.
> + *  - >0, and <= PAGE_WALK_CALLER_MAX : succeeded to handle the current entry,
> + *         and return to the caller with caller specific value.
>   *  - <0 : failed to handle the current entry, and return to the caller
>   *         with error code.
>   *
> + * For pmd_entry(), a value <= PAGE_WALK_CALLER_MAX indicates that the entry
> + * was handled by the callback. PAGE_WALK_FALLBACK indicates that the entry
> + * could not be handled by the callback and should be re-checked. If the
> + * callback needs the entry to be handled by the next level, it should
> + * split the entry and then return PAGE_WALK_FALLBACK.
> + *
>   * Before starting to walk page table, some callers want to check whether
>   * they really want to walk over the current vma, typically by checking
>   * its vm_flags. walk_page_test() and @ops->test_walk() are used for this
> -- 
> 2.21.0
> 

-- 
 Kirill A. Shutemov



^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v4 2/9] mm: pagewalk: Take the pagetable lock in walk_pte_range()
  2019-10-09 15:14   ` Kirill A. Shutemov
@ 2019-10-09 16:07     ` Linus Torvalds
  0 siblings, 0 replies; 32+ messages in thread
From: Linus Torvalds @ 2019-10-09 16:07 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Thomas Hellström (VMware),
	Linux Kernel Mailing List, Linux-MM, Thomas Hellstrom,
	Matthew Wilcox, Will Deacon, Peter Zijlstra, Rik van Riel,
	Minchan Kim, Michal Hocko, Huang Ying, Jérôme Glisse

On Wed, Oct 9, 2019 at 8:14 AM Kirill A. Shutemov <kirill@shutemov.name> wrote:
>
> If ->pte_entry() fails on the first entry of the page table, pte - 1 will
> point out side the page table.
>
> And the '- 1' is totally unnecessary as we break the loop before pte++ on
> the last iteration.

Good catch. Too much copying the wrong pattern from other sources.

I do wish we didn't have this pattern of "update pte, then do
pte_unmap as long as it's in the same page". Yeah, it avoids a
variable, but still... But it is what it is, and we just need to be
careful.

             Linus


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v4 3/9] mm: pagewalk: Don't split transhuge pmds when a pmd_entry is present
  2019-10-09 15:27   ` Kirill A. Shutemov
  2019-10-09 15:27     ` Kirill A. Shutemov
@ 2019-10-09 16:20     ` Thomas Hellström (VMware)
  2019-10-09 16:20       ` Thomas Hellström (VMware)
  2019-10-09 16:21     ` Linus Torvalds
  2 siblings, 1 reply; 32+ messages in thread
From: Thomas Hellström (VMware) @ 2019-10-09 16:20 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: linux-kernel, linux-mm, torvalds, Thomas Hellstrom,
	Matthew Wilcox, Will Deacon, Peter Zijlstra, Rik van Riel,
	Minchan Kim, Michal Hocko, Huang Ying, Jérôme Glisse

Hi, Kirill.

Thanks for reviewing.

On 10/9/19 5:27 PM, Kirill A. Shutemov wrote:
> On Tue, Oct 08, 2019 at 11:15:02AM +0200, Thomas Hellström (VMware) wrote:
>> From: Thomas Hellstrom <thellstrom@vmware.com>
>>
>> The pagewalk code was unconditionally splitting transhuge pmds when a
>> pte_entry was present. However ideally we'd want to handle transhuge pmds
>> in the pmd_entry function and ptes in pte_entry function. So don't split
>> huge pmds when there is a pmd_entry function present, but let the callback
>> take care of it if necessary.
> Do we have any current user that expect split_huge_pmd() in this scenario.

No. All current users either have pmd_entry (no splitting) or pte_entry 
(unconditional splitting)

>
>> In order to make sure a virtual address range is handled by one and only
>> one callback, and since pmd entries may be unstable, we introduce a
>> pmd_entry return code that tells the walk code to continue processing this
>> pmd entry rather than to move on. Since caller-defined positive return
>> codes (up to 2) are used by current callers, use a high value that allows a
>> large range of positive caller-defined return codes for future users.
>>
>> Cc: Matthew Wilcox <willy@infradead.org>
>> Cc: Will Deacon <will.deacon@arm.com>
>> Cc: Peter Zijlstra <peterz@infradead.org>
>> Cc: Rik van Riel <riel@surriel.com>
>> Cc: Minchan Kim <minchan@kernel.org>
>> Cc: Michal Hocko <mhocko@suse.com>
>> Cc: Huang Ying <ying.huang@intel.com>
>> Cc: Jérôme Glisse <jglisse@redhat.com>
>> Cc: Kirill A. Shutemov <kirill@shutemov.name>
>> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
>> Signed-off-by: Thomas Hellstrom <thellstrom@vmware.com>
>> ---
>>   include/linux/pagewalk.h |  8 ++++++++
>>   mm/pagewalk.c            | 28 +++++++++++++++++++++-------
>>   2 files changed, 29 insertions(+), 7 deletions(-)
>>
>> diff --git a/include/linux/pagewalk.h b/include/linux/pagewalk.h
>> index bddd9759bab9..c4a013eb445d 100644
>> --- a/include/linux/pagewalk.h
>> +++ b/include/linux/pagewalk.h
>> @@ -4,6 +4,11 @@
>>   
>>   #include <linux/mm.h>
>>   
>> +/* Highest positive pmd_entry caller-specific return value */
>> +#define PAGE_WALK_CALLER_MAX     (INT_MAX / 2)
>> +/* The handler did not handle the entry. Fall back to the next level */
>> +#define PAGE_WALK_FALLBACK       (PAGE_WALK_CALLER_MAX + 1)
>> +
> That's hacky.
>
> Maybe just use an error code for this? -EAGAIN?

I agree this is hacky. But IMO it's a reasonably safe option. My 
thinking was that in the long run we'd move the positive return codes to 
the mm_walk private and introduce a PAGE_WALK_TERMINATE code as well.

Perhaps a completely clean and safe way would be to add an "int 
walk_control" in the struct mm_walk?

I'm pretty sure using an error code will come back and bite us at some 
point, if someone just blindly forwards error messages. But if you 
insist, I'll use -EAGAIN.

Please let me know what you think.

Thanks,

Thomas




^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v4 3/9] mm: pagewalk: Don't split transhuge pmds when a pmd_entry is present
  2019-10-09 16:20     ` Thomas Hellström (VMware)
@ 2019-10-09 16:20       ` Thomas Hellström (VMware)
  0 siblings, 0 replies; 32+ messages in thread
From: Thomas Hellström (VMware) @ 2019-10-09 16:20 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: linux-kernel, linux-mm, torvalds, Thomas Hellstrom,
	Matthew Wilcox, Will Deacon, Peter Zijlstra, Rik van Riel,
	Minchan Kim, Michal Hocko, Huang Ying, Jérôme Glisse

Hi, Kirill.

Thanks for reviewing.

On 10/9/19 5:27 PM, Kirill A. Shutemov wrote:
> On Tue, Oct 08, 2019 at 11:15:02AM +0200, Thomas Hellström (VMware) wrote:
>> From: Thomas Hellstrom <thellstrom@vmware.com>
>>
>> The pagewalk code was unconditionally splitting transhuge pmds when a
>> pte_entry was present. However ideally we'd want to handle transhuge pmds
>> in the pmd_entry function and ptes in pte_entry function. So don't split
>> huge pmds when there is a pmd_entry function present, but let the callback
>> take care of it if necessary.
> Do we have any current user that expect split_huge_pmd() in this scenario.

No. All current users either have pmd_entry (no splitting) or pte_entry
(unconditional splitting)

>
>> In order to make sure a virtual address range is handled by one and only
>> one callback, and since pmd entries may be unstable, we introduce a
>> pmd_entry return code that tells the walk code to continue processing this
>> pmd entry rather than to move on. Since caller-defined positive return
>> codes (up to 2) are used by current callers, use a high value that allows a
>> large range of positive caller-defined return codes for future users.
>>
>> Cc: Matthew Wilcox <willy@infradead.org>
>> Cc: Will Deacon <will.deacon@arm.com>
>> Cc: Peter Zijlstra <peterz@infradead.org>
>> Cc: Rik van Riel <riel@surriel.com>
>> Cc: Minchan Kim <minchan@kernel.org>
>> Cc: Michal Hocko <mhocko@suse.com>
>> Cc: Huang Ying <ying.huang@intel.com>
>> Cc: Jérôme Glisse <jglisse@redhat.com>
>> Cc: Kirill A. Shutemov <kirill@shutemov.name>
>> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
>> Signed-off-by: Thomas Hellstrom <thellstrom@vmware.com>
>> ---
>>   include/linux/pagewalk.h |  8 ++++++++
>>   mm/pagewalk.c            | 28 +++++++++++++++++++++-------
>>   2 files changed, 29 insertions(+), 7 deletions(-)
>>
>> diff --git a/include/linux/pagewalk.h b/include/linux/pagewalk.h
>> index bddd9759bab9..c4a013eb445d 100644
>> --- a/include/linux/pagewalk.h
>> +++ b/include/linux/pagewalk.h
>> @@ -4,6 +4,11 @@
>>   
>>   #include <linux/mm.h>
>>   
>> +/* Highest positive pmd_entry caller-specific return value */
>> +#define PAGE_WALK_CALLER_MAX     (INT_MAX / 2)
>> +/* The handler did not handle the entry. Fall back to the next level */
>> +#define PAGE_WALK_FALLBACK       (PAGE_WALK_CALLER_MAX + 1)
>> +
> That's hacky.
>
> Maybe just use an error code for this? -EAGAIN?

I agree this is hacky. But IMO it's a reasonably safe option. My 
thinking was that in the long run we'd move the positive return codes to 
the mm_walk private and introduce a PAGE_WALK_TERMINATE code as well.

Perhaps a completely clean and safe way would be to add an "int 
walk_control" in the struct mm_walk?

I'm pretty sure using an error code will come back and bite us at some 
point, if someone just blindly forwards error messages. But if you 
insist, I'll use -EAGAIN.

Please let me know what you think.

Thanks,

Thomas





^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v4 3/9] mm: pagewalk: Don't split transhuge pmds when a pmd_entry is present
  2019-10-09 15:27   ` Kirill A. Shutemov
  2019-10-09 15:27     ` Kirill A. Shutemov
  2019-10-09 16:20     ` Thomas Hellström (VMware)
@ 2019-10-09 16:21     ` Linus Torvalds
  2019-10-09 17:03       ` Thomas Hellström (VMware)
  2 siblings, 1 reply; 32+ messages in thread
From: Linus Torvalds @ 2019-10-09 16:21 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Thomas Hellström (VMware),
	Linux Kernel Mailing List, Linux-MM, Thomas Hellstrom,
	Matthew Wilcox, Will Deacon, Peter Zijlstra, Rik van Riel,
	Minchan Kim, Michal Hocko, Huang Ying, Jérôme Glisse

On Wed, Oct 9, 2019 at 8:27 AM Kirill A. Shutemov <kirill@shutemov.name> wrote:
>
> Do we have any current user that expect split_huge_pmd() in this scenario.

No. There are no current users of the pmd callback and the pte
callback at all, that I could find.

But it looks like the new drm use does want a "I can't handle the
hugepage, please split it and I'll fo the ptes instead".

> That's hacky.
>
> Maybe just use an error code for this? -EAGAIN?

I actually like the PAGE_WALK_FALLBACK thing as more documentation
than "it's an error, but not one you return", although I do detest the
particular value chosen, which is just a nasty bitpattern.

Maybe it could use an error value, just one that makes no sense, and
is hidden by the PAGE_WALK_FALLBACK define, ie something like

  #define PAGE_WALK_FALLBACK (-ECHILD)

or something like that.

And I suspect the conditional would be cleaner if it was written something like

        if (!err)
                continue;
        if (err != PAGE_WALK_FALLBACK)
                break;
        err = 0;
        if (pmd_trans_unstable(pmd))
                goto again;
        .. do the split ..

and skip the WARN_ON() and the odd "non-zero but smaller than MAX test"

            Linus


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v4 3/9] mm: pagewalk: Don't split transhuge pmds when a pmd_entry is present
  2019-10-09 16:21     ` Linus Torvalds
@ 2019-10-09 17:03       ` Thomas Hellström (VMware)
  2019-10-09 17:16         ` Linus Torvalds
  0 siblings, 1 reply; 32+ messages in thread
From: Thomas Hellström (VMware) @ 2019-10-09 17:03 UTC (permalink / raw)
  To: Linus Torvalds, Kirill A. Shutemov
  Cc: Linux Kernel Mailing List, Linux-MM, Thomas Hellstrom,
	Matthew Wilcox, Will Deacon, Peter Zijlstra, Rik van Riel,
	Minchan Kim, Michal Hocko, Huang Ying, Jérôme Glisse

On 10/9/19 6:21 PM, Linus Torvalds wrote:
> On Wed, Oct 9, 2019 at 8:27 AM Kirill A. Shutemov <kirill@shutemov.name> wrote:
>> Do we have any current user that expect split_huge_pmd() in this scenario.
> No. There are no current users of the pmd callback and the pte
> callback at all, that I could find.
>
> But it looks like the new drm use does want a "I can't handle the
> hugepage, please split it and I'll fo the ptes instead".
>
Nope, it handles the hugepages by ignoring them, since they should be 
read-only, but if pmd_entry() was called with something else than a 
hugepage, then it requests the fallback, but never a split.

/Thomas




^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v4 3/9] mm: pagewalk: Don't split transhuge pmds when a pmd_entry is present
  2019-10-09 17:03       ` Thomas Hellström (VMware)
@ 2019-10-09 17:16         ` Linus Torvalds
  2019-10-09 18:52           ` Thomas Hellstrom
  0 siblings, 1 reply; 32+ messages in thread
From: Linus Torvalds @ 2019-10-09 17:16 UTC (permalink / raw)
  To: Thomas Hellström (VMware)
  Cc: Kirill A. Shutemov, Linux Kernel Mailing List, Linux-MM,
	Thomas Hellstrom, Matthew Wilcox, Will Deacon, Peter Zijlstra,
	Rik van Riel, Minchan Kim, Michal Hocko, Huang Ying,
	Jérôme Glisse

On Wed, Oct 9, 2019 at 10:03 AM Thomas Hellström (VMware)
<thomas_os@shipmail.org> wrote:
>
> Nope, it handles the hugepages by ignoring them, since they should be
> read-only, but if pmd_entry() was called with something else than a
> hugepage, then it requests the fallback, but never a split.

But  PAGE_WALK_FALLBACK _is_ a split.

Oh, except you did this

-               split_huge_pmd(walk->vma, pmd, addr);
+               if (!ops->pmd_entry)
+                       split_huge_pmd(walk->vma, pmd, addr);


so it avoids the split.

No, that's unacceptable. And makes no sense anyway. If it doesn't
split the pmd, then it shouldn't walk the pte's - because they don't
exist. And if it's not a hugepmd, then the split is a no-op, so the
test makes no sense.

I hadn't noticed that part of the patch. That simply can't be right. I
don't think you've tested this, because you never actually have
hugepages, do you?

You didn't notice or realize that split_huge_pmd() just does that

                if (is_swap_pmd(*____pmd) || pmd_trans_huge(*____pmd)   \
                                        || pmd_devmap(*____pmd))        \

thing and doesn't do anythign at all if it's not huge.

So no. That code makes no sense at all, and I didn't realize how
senseless it was, becasue I stupidly missed that "make the split
conditional" - which is insane and wrong - and I thought that you
wanted PAGE_WALK_FALLBACK to split a pmd and fall back to per-pte
entries, which is what the name implies.

But that's not what you wanted at all.

Just get rid of PAGE_WALK_FALLBACK entirely then, and make the rule be
that a zero return value just means "split and do ptes". Which is what
you want (see above why "split" simply is wrong, and isn't an issue
for you anyway.

That won't change any existing cases, since even if they do have a
zero return value, they don't have a pte_entry() callback, so they
won't do that "split and do ptes" anyway.

             Linus


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v4 3/9] mm: pagewalk: Don't split transhuge pmds when a pmd_entry is present
  2019-10-09 17:16         ` Linus Torvalds
@ 2019-10-09 18:52           ` Thomas Hellstrom
  2019-10-09 19:20             ` Linus Torvalds
  0 siblings, 1 reply; 32+ messages in thread
From: Thomas Hellstrom @ 2019-10-09 18:52 UTC (permalink / raw)
  To: Linus Torvalds, Thomas Hellström (VMware)
  Cc: Kirill A. Shutemov, Linux Kernel Mailing List, Linux-MM,
	Matthew Wilcox, Will Deacon, Peter Zijlstra, Rik van Riel,
	Minchan Kim, Michal Hocko, Huang Ying, Jérôme Glisse

Hi,

On 10/9/19 7:17 PM, Linus Torvalds wrote:
> On Wed, Oct 9, 2019 at 10:03 AM Thomas Hellström (VMware)
> <thomas_os@shipmail.org> wrote:
>> Nope, it handles the hugepages by ignoring them, since they should be
>> read-only, but if pmd_entry() was called with something else than a
>> hugepage, then it requests the fallback, but never a split.
> But  PAGE_WALK_FALLBACK _is_ a split.
>
> Oh, except you did this
>
> -               split_huge_pmd(walk->vma, pmd, addr);
> +               if (!ops->pmd_entry)
> +                       split_huge_pmd(walk->vma, pmd, addr);
>
>
> so it avoids the split.
>
> No, that's unacceptable. And makes no sense anyway. If it doesn't
> split the pmd, then it shouldn't walk the pte's - because they don't
> exist. And if it's not a hugepmd, then the split is a no-op, so the
> test makes no sense.
>
> I hadn't noticed that part of the patch. That simply can't be right. I
> don't think you've tested this, because you never actually have
> hugepages, do you?
>
> You didn't notice or realize that split_huge_pmd() just does that
>
>                 if (is_swap_pmd(*____pmd) || pmd_trans_huge(*____pmd)   \
>                                         || pmd_devmap(*____pmd))        \
>
> thing and doesn't do anythign at all if it's not huge.
>
> So no. That code makes no sense at all, and I didn't realize how
> senseless it was, becasue I stupidly missed that "make the split
> conditional" - which is insane and wrong - and I thought that you
> wanted PAGE_WALK_FALLBACK to split a pmd and fall back to per-pte
> entries, which is what the name implies.
>
> But that's not what you wanted at all.
>
> Just get rid of PAGE_WALK_FALLBACK entirely then, and make the rule be
> that a zero return value just means "split and do ptes". Which is what
> you want (see above why "split" simply is wrong, and isn't an issue
> for you anyway.
>
> That won't change any existing cases, since even if they do have a
> zero return value, they don't have a pte_entry() callback, so they
> won't do that "split and do ptes" anyway.
>
>              Linus
>
Hmm, so we have the following cases we need to handle when returning
from the pmd_entry() handler.

1) Huge pmd was handled - Returns 0 and continues.
2) A pmd is otherwise unstable, typically someone just zapped a huge
pmd. Returns PAGE_WALK_FALLBACK, gets caught in the pmd_trans_unstable()
test and retries.
3) A pte directory - Returns PAGE_WALK_FALLBACK, falls through, avoids
the split and continues to the next level. Yeah that split avoidance
test is indeed made unnecessary by the preceding pmd_trans_unstable() test.

-               split_huge_pmd(walk->vma, pmd, addr);
+               if (!ops->pmd_entry)
+                       split_huge_pmd(walk->vma, pmd, addr);

But as the commit message says, PAGE_WALK_FALLBACK is necessary to have
a virtual address range being handled once and only once. Therefore we
must distinguish between 1) and 2) since 2) must be retried until it's
handled correctly.

So we need the PAGE_WALK_FALLBACK. And if we instead were to combine 1)
and 3) in a single return value and use, for example PAGE_WALK_RETRY for
2)  the following could happen.

a) we handle the huge pmd and return 0 from pte_entry().
b) another process splits it.
c) we fall through to the pte level and handle the same address range
again...

So to summarize, yes the test in the code you cite is unnecessary. But
if we want to guarantee a virtual address range being handled once and
only once we need the PAGE_WALK_FALLBACK, (perhaps renamed). If not, we
can do without it similar to your original patch.

Thanks,

/Thomas





^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v4 3/9] mm: pagewalk: Don't split transhuge pmds when a pmd_entry is present
  2019-10-09 18:52           ` Thomas Hellstrom
@ 2019-10-09 19:20             ` Linus Torvalds
  2019-10-09 20:06               ` Thomas Hellström (VMware)
  0 siblings, 1 reply; 32+ messages in thread
From: Linus Torvalds @ 2019-10-09 19:20 UTC (permalink / raw)
  To: Thomas Hellstrom
  Cc: Thomas Hellström (VMware),
	Kirill A. Shutemov, Linux Kernel Mailing List, Linux-MM,
	Matthew Wilcox, Will Deacon, Peter Zijlstra, Rik van Riel,
	Minchan Kim, Michal Hocko, Huang Ying, Jérôme Glisse

On Wed, Oct 9, 2019 at 11:52 AM Thomas Hellstrom <thellstrom@vmware.com> wrote:
>
> Hmm, so we have the following cases we need to handle when returning
> from the pmd_entry() handler.

No, we really don't.

> 1) Huge pmd was handled - Returns 0 and continues.

No.

That case simply DOES NOT EXIST.

The only case that exists is

"pmd was seen, we return 0 and then look at wherer pte level is relevant".

Note that this has nothing to do with huge or not.

> 2) A pmd is otherwise unstable, typically someone just zapped a huge
> pmd. Returns PAGE_WALK_FALLBACK, gets caught in the pmd_trans_unstable()
> test and retries.

No. PAGE_WALK_FALLBACK doesn't exist, is completely broken in your
patch, and is immaterial.

It falls under the previous heading: a pmd was seen, returns zero, and
we go on with life.

If you don't have a pte callback - like EVERY SINGLE CURRENT USER -
that "goes on with life" is just "go to the next pmd entry".

And if you do have a pte callback - like your new case, that "go on
with life" is to look at the pte cases.

> 3) A pte directory - Returns PAGE_WALK_FALLBACK, falls through, avoids
> the split and continues to the next level. Yeah that split avoidance
> test is indeed made unnecessary by the preceding pmd_trans_unstable() test.

Again, no. This case does not exist. It's the same case as above: it
returns 0 and goes on to the pte level.


> -               split_huge_pmd(walk->vma, pmd, addr);
> +               if (!ops->pmd_entry)
> +                       split_huge_pmd(walk->vma, pmd, addr);
>
> But as the commit message says, PAGE_WALK_FALLBACK is necessary to have
> a virtual address range being handled once and only once.

No. Your logic is garbage. The above code is completely broken.

YOU CAN NOT AVOID TRHE SPLIT AND THEN GO ON AT THE PTE LEVEL.

Don't you get it? There *is* no PTE level if you didn't split.

And your "being handled once and only once" is garbage too. If you ask
for both a pmd callback and a pte callback, you get both. It's that
simple.

There are zero users that actually do it now, and you don't want to do
it either, so all your arguments are just pointless.

> So we need the PAGE_WALK_FALLBACK.

No we don't. You make no sense. Your case doesn't want it, no existing
cases want it, nobody wants it.

When you actually have a case that wants it, let's look at it then.
Right now, you introduced fundamentally buggy code because your
thinking is fuzzy and broken.

So what you should do is to just always return 0 in your pmd_entry().
Boom, done. The only reason for the pmd_entry existing at all is to
get the warning. Then, if you don't want to split it, you make that
warning just return an error (or a positive value) instead and say
"ok, that was bad, we don't handle it at all".

And in some _future_ life, if anybody wants to actually say "yeah,
let's not split it", make it have some "yeah I handled it" case.

In fact, I would suggest that positive return values be exactly that
"I did it" case, and that they just add up instead of breaking out.
Only an actual error would break out, and 0 would then (continue to)
mean "continue with next level".

But right now, no such user even exists.

               Linus


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v4 3/9] mm: pagewalk: Don't split transhuge pmds when a pmd_entry is present
  2019-10-09 19:20             ` Linus Torvalds
@ 2019-10-09 20:06               ` Thomas Hellström (VMware)
  2019-10-09 20:20                 ` Linus Torvalds
  0 siblings, 1 reply; 32+ messages in thread
From: Thomas Hellström (VMware) @ 2019-10-09 20:06 UTC (permalink / raw)
  To: Linus Torvalds, Thomas Hellstrom
  Cc: Kirill A. Shutemov, Linux Kernel Mailing List, Linux-MM,
	Matthew Wilcox, Will Deacon, Peter Zijlstra, Rik van Riel,
	Minchan Kim, Michal Hocko, Huang Ying, Jérôme Glisse

On 10/9/19 9:20 PM, Linus Torvalds wrote:
>
> No. Your logic is garbage. The above code is completely broken.
>
> YOU CAN NOT AVOID TRHE SPLIT AND THEN GO ON AT THE PTE LEVEL.
>
> Don't you get it? There *is* no PTE level if you didn't split.

Hmm, This paragraph makes me think we have very different perceptions about what I'm trying to achieve.

I wanted the pte level to *only* get called for *pre-existing* pte entries.
Surely those must be able to exist even if we don't split occasional huge pmds in the pagewalk code?

>
> So what you should do is to just always return 0 in your pmd_entry().
> Boom, done. The only reason for the pmd_entry existing at all is to
> get the warning. Then, if you don't want to split it, you make that
> warning just return an error (or a positive value) instead and say
> "ok, that was bad, we don't handle it at all".
>
> And in some _future_ life, if anybody wants to actually say "yeah,
> let's not split it", make it have some "yeah I handled it" case.

Well yes, this is exactly what I want. Because any huge pmd we encounter 
should be read-only.

/Thomas




^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v4 3/9] mm: pagewalk: Don't split transhuge pmds when a pmd_entry is present
  2019-10-09 20:06               ` Thomas Hellström (VMware)
@ 2019-10-09 20:20                 ` Linus Torvalds
  2019-10-09 22:30                   ` Thomas Hellström (VMware)
  0 siblings, 1 reply; 32+ messages in thread
From: Linus Torvalds @ 2019-10-09 20:20 UTC (permalink / raw)
  To: Thomas Hellström (VMware)
  Cc: Thomas Hellstrom, Kirill A. Shutemov, Linux Kernel Mailing List,
	Linux-MM, Matthew Wilcox, Will Deacon, Peter Zijlstra,
	Rik van Riel, Minchan Kim, Michal Hocko, Huang Ying,
	Jérôme Glisse

On Wed, Oct 9, 2019 at 1:06 PM Thomas Hellström (VMware)
<thomas_os@shipmail.org> wrote:
>
> On 10/9/19 9:20 PM, Linus Torvalds wrote:
> >
> > Don't you get it? There *is* no PTE level if you didn't split.
>
> Hmm, This paragraph makes me think we have very different perceptions about what I'm trying to achieve.

It's not about what you're trying to achieve.

It's about the actual code.

You cannot do that

> -               split_huge_pmd(walk->vma, pmd, addr);
> +               if (!ops->pmd_entry)
> +                       split_huge_pmd(walk->vma, pmd, addr);

it's insane.

You *have* to call split_huge_pmd() if you're doing to call the
pte_entry() function.

I don't understand why you are arguing. This is not about "feelings"
and "intentions" or about "trying to achieve".

This is about cold hard "you can't do that", and this is now the third
time I tell you _why_ you can't do that: you can't walk the last level
if you don't _have_ a last level. You have to split the pmd to do so.

End of story.

> I wanted the pte level to *only* get called for *pre-existing* pte entries.

Again, I told you what the solution was.

But the fact is, it's not what your other code even wants or does.

Seriously. You have two cases you care about in your callbacks

 - an actual hugepmd. This is an ERROR for you and you do a huge
WARN_ON() for it to let people know.

 - regular pages. This is what your other code actually handles.

So for the hugepomd case, you have two choices:

 - handle it by splitting and deal with the regular pages: "return 0;"

 - actually error out: "return -EINVAL".

There simply are no other choices from looking at the users you added.

> Surely those must be able to exist even if we don't split occasional huge pmds in the pagewalk code?

And I told you how to handle that case when you eventually hit it.

But that is immaterial. Because it's not what you actually do right
now. Right now you have a huge WARN_ON().

                Linus


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v4 3/9] mm: pagewalk: Don't split transhuge pmds when a pmd_entry is present
  2019-10-09 20:20                 ` Linus Torvalds
@ 2019-10-09 22:30                   ` Thomas Hellström (VMware)
  2019-10-09 23:50                     ` Thomas Hellström (VMware)
  2019-10-09 23:51                     ` Linus Torvalds
  0 siblings, 2 replies; 32+ messages in thread
From: Thomas Hellström (VMware) @ 2019-10-09 22:30 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Thomas Hellstrom, Kirill A. Shutemov, Linux Kernel Mailing List,
	Linux-MM, Matthew Wilcox, Will Deacon, Peter Zijlstra,
	Rik van Riel, Minchan Kim, Michal Hocko, Huang Ying,
	Jérôme Glisse

On 10/9/19 10:20 PM, Linus Torvalds wrote:
> On Wed, Oct 9, 2019 at 1:06 PM Thomas Hellström (VMware)
> <thomas_os@shipmail.org> wrote:
>> On 10/9/19 9:20 PM, Linus Torvalds wrote:
>>> Don't you get it? There *is* no PTE level if you didn't split.
>> Hmm, This paragraph makes me think we have very different perceptions about what I'm trying to achieve.
> It's not about what you're trying to achieve.
>
> It's about the actual code.
>
> You cannot do that
>
>> -               split_huge_pmd(walk->vma, pmd, addr);
>> +               if (!ops->pmd_entry)
>> +                       split_huge_pmd(walk->vma, pmd, addr);
> it's insane.
>
> You *have* to call split_huge_pmd() if you're doing to call the
> pte_entry() function.
>
> I don't understand why you are arguing. This is not about "feelings"
> and "intentions" or about "trying to achieve".
>
> This is about cold hard "you can't do that", and this is now the third
> time I tell you _why_ you can't do that: you can't walk the last level
> if you don't _have_ a last level. You have to split the pmd to do so.
It's not so much arguing but rather trying to understand your concerns 
and your perception of what the final code should look like.
>
> End of story.

So is it that you want pte_entry() to be strictly called for *each* 
virtual address, even if we have a pmd_entry()?
In that case I completely follow your arguments, meaning we skip this 
patch completely?

My take on the change was that pmd_entry() returning 0 would mean we 
could actually skip the pte level completely and nothing would otherwise 
pass down to the next level for which split_huge_pmd() wasn't a NOP, 
similar to how pud_entry does things. FWIW, see

https://lore.kernel.org/lkml/20191004123732.xpr3vroee5mhg2zt@box.shutemov.name/

and we could in the long run transform the pte walk in many pmd_entry 
callbacks into pte_entry callbacks.


>
>> I wanted the pte level to *only* get called for *pre-existing* pte entries.
> Again, I told you what the solution was.
>
> But the fact is, it's not what your other code even wants or does.
>
> Seriously. You have two cases you care about in your callbacks
>
>   - an actual hugepmd. This is an ERROR for you and you do a huge
> WARN_ON() for it to let people know.
No, it's typically a NOP, since the hugepmd should be read-only. 
Write-enabled huge pages are split in fault().
>
>   - regular pages. This is what your other code actually handles.
>
> So for the hugepomd case, you have two choices:
>
>   - handle it by splitting and deal with the regular pages: "return 0;"
Well, this is not what we want to do, and the reason we have the patch 
in the first place.
>
>   - actually error out: "return -EINVAL".

/Thomas




^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v4 3/9] mm: pagewalk: Don't split transhuge pmds when a pmd_entry is present
  2019-10-09 22:30                   ` Thomas Hellström (VMware)
@ 2019-10-09 23:50                     ` Thomas Hellström (VMware)
  2019-10-09 23:51                     ` Linus Torvalds
  1 sibling, 0 replies; 32+ messages in thread
From: Thomas Hellström (VMware) @ 2019-10-09 23:50 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Thomas Hellstrom, Kirill A. Shutemov, Linux Kernel Mailing List,
	Linux-MM, Matthew Wilcox, Will Deacon, Peter Zijlstra,
	Rik van Riel, Minchan Kim, Michal Hocko, Huang Ying,
	Jérôme Glisse

On 10/10/19 12:30 AM, Thomas Hellström (VMware) wrote:
> On 10/9/19 10:20 PM, Linus Torvalds wrote:
>> On Wed, Oct 9, 2019 at 1:06 PM Thomas Hellström (VMware)
>> <thomas_os@shipmail.org> wrote:
>>> On 10/9/19 9:20 PM, Linus Torvalds wrote:
>>>> Don't you get it? There *is* no PTE level if you didn't split.
>>> Hmm, This paragraph makes me think we have very different 
>>> perceptions about what I'm trying to achieve.
>> It's not about what you're trying to achieve.
>>
>> It's about the actual code.
>>
>> You cannot do that
>>
>>> - split_huge_pmd(walk->vma, pmd, addr);
>>> +               if (!ops->pmd_entry)
>>> +                       split_huge_pmd(walk->vma, pmd, addr);
>> it's insane.
>>
>> You *have* to call split_huge_pmd() if you're doing to call the
>> pte_entry() function.
>>
>> I don't understand why you are arguing. This is not about "feelings"
>> and "intentions" or about "trying to achieve".
>>
>> This is about cold hard "you can't do that", and this is now the third
>> time I tell you _why_ you can't do that: you can't walk the last level
>> if you don't _have_ a last level. You have to split the pmd to do so.
> It's not so much arguing but rather trying to understand your concerns 
> and your perception of what the final code should look like.
>>
>> End of story.
>
> So is it that you want pte_entry() to be strictly called for *each* 
> virtual address, even if we have a pmd_entry()?
> In that case I completely follow your arguments, meaning we skip this 
> patch completely?

Or if you're still OK with your original patch

https://lore.kernel.org/lkml/CAHk-=wj5NiFPouYd6zUgY4K7VovOAxQT-xhDRjD6j5hifBWi_g@mail.gmail.com/

I'd happily use that instead.

Thanks,

Thomas




^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v4 3/9] mm: pagewalk: Don't split transhuge pmds when a pmd_entry is present
  2019-10-09 22:30                   ` Thomas Hellström (VMware)
  2019-10-09 23:50                     ` Thomas Hellström (VMware)
@ 2019-10-09 23:51                     ` Linus Torvalds
  2019-10-10  0:18                       ` Linus Torvalds
  2019-10-10  1:09                       ` Thomas Hellström (VMware)
  1 sibling, 2 replies; 32+ messages in thread
From: Linus Torvalds @ 2019-10-09 23:51 UTC (permalink / raw)
  To: Thomas Hellström (VMware)
  Cc: Thomas Hellstrom, Kirill A. Shutemov, Linux Kernel Mailing List,
	Linux-MM, Matthew Wilcox, Will Deacon, Peter Zijlstra,
	Rik van Riel, Minchan Kim, Michal Hocko, Huang Ying,
	Jérôme Glisse

On Wed, Oct 9, 2019 at 3:31 PM Thomas Hellström (VMware)
<thomas_os@shipmail.org> wrote:
>
> On 10/9/19 10:20 PM, Linus Torvalds wrote:
> >
> > You *have* to call split_huge_pmd() if you're doing to call the
> > pte_entry() function.
> >
> > End of story.
>
> So is it that you want pte_entry() to be strictly called for *each*
> virtual address, even if we have a pmd_entry()?

Thomas, you're not reading the emails.

You are conflating two issues:

 (a) the conditional split_huge_pmd()

 (b) the what to do about the return value of pmd_entry().

and I'm now FOR THE LAST TIME telling you that (a) is completely
wrong, buggy crap. It will not happen. I missed that part of the patch
in my original read-through, because it was so senseless.

Get it through you head. The "conditional split_huge_pmd()" thing is
wrong, wrong, wrong.

And it is COMPLETELY WRONG regardless of any "each virtual address"
thing that you keep bringing up. The "each virtual address" argument
is irrelevant, pointless, and does not matter.

So stop bringing that thing up. Really. The conditional
split_huge_pmd() is wrong.

It's wrong,  because the whole notion of "don't split pmd and then
call the pte walker" is completely idiotic and utterly senseless,
because without the splitting the pte's DO NOT EXIST.

What part of that is not getting through?

> In that case I completely follow your arguments, meaning we skip this
> patch completely?

Well, yes and no.

The part of the patch that is complete and utter garbage, and that you
really need to *understand* why it's complete and utter garbage is
this part:

                if (!ops->pte_entry)
                        continue;
-
-               split_huge_pmd(walk->vma, pmd, addr);
+               if (!ops->pmd_entry)
+                       split_huge_pmd(walk->vma, pmd, addr);
                if (pmd_trans_unstable(pmd))
                        goto again;
                err = walk_pte_range(pmd, addr, next, walk);

Look, you cannot call "walk_pte_range()" without calling
split_huge_pmd(), because walk_pte_range() cannot deal with a huge
pmd.

walk_pte_range() does that pte_offset_map() on the pmd (well, with
your other patch in place it does the locking version), and then it
walks every pte entry of it. But that cannot possibly work if you
didn't split it.

See what I've been trying tog tell you? YOU CANNOT MAKE THAT SPLITTING
BE CONDITIONAL!

What can be done - and what the line above it does - is to skip
walking entirely. We do it when we don't have a pte_entry walker at
all, obviously.

But we can't do is "oh, we had a pmd walker, so now I'll skip
splitting". That is insane, and cannot possibly make sense.

Now, that gets us back to the (b) part above - what to do about the
return value of "pmd_entry()".

What *would* make sense, for example, is saying "ok, pmd_entry()
already handled this, so we'll just continue, and not bother with the
pte_range() at all".

Note how that is VERY VERY different from "let's not split".

Yes, for that case we also don't split, but we don't split because
we're not even walking it. See the difference?

Not splitting and then walking: wrong, insane, and not going to happen.

Nor splitting because we're not going to walk it: sane, and we already
have one such case.

> My take on the change was that pmd_entry() returning 0 would mean we
> could actually skip the pte level completely and nothing would otherwise
> pass down to the next level for which split_huge_pmd() wasn't a NOP,
> similar to how pud_entry does things.

And that would have been a sensible operation, and yes, you had that

+                       if (!err)
+                               continue;

and that was it. Fine.

BUT.

That was not all that your patch did. Kirill felt that your
PAGE_WALK_FALLBACK thing was hacky, which is why I looked more at the
patch to see what it really wanted to do, and noticed that crazy
conditional splitting that didn't make sense, and has *NOTHING* to do
with your "skip the level completely".

I _agree_ that skipping the level completely makes sense.

But the "don't split and then don't skip the level" makes zero sense
what-so-ever. Ever. Under no circumstances can that be valid as per
above.

There's also the argument that right now, there are no users that
actually want to skip the level.

Even your use case doesn't really want that, because in your use-case,
the only case that would do it is the error case that isn't supposed
to happen.

And if it *is* supposed to happen, in many ways it would be more
sensible to just return a positive value saying "I took care of it, go
on to the next entry", wouldn't you agree?

Now, I actually tried to look through every single existing pmd_entry
case, because I wanted to see who is returning positive values right
now, and which of them might *want* to say "ok, I handled it" or "now
do the pte walking".

Quite a lot of them seem to really want to do that "ok, now do the pte
walking", because they currently do it inside the pmd function because
of how the original interface was badly done. So while we _currently_
do not have a "ok, I did everything for this pmd, skip it" vs a "ok,
continue with pte" case, it clearly does make sense. No question about
it.

I did find one single user of a positive return value:
queue_pages_pte_range() returns

 * 0 - pages are placed on the right node or queued successfully.
 * 1 - there is unmovable page, and MPOL_MF_MOVE* & MPOL_MF_STRICT were
 *     specified.
 * -EIO - only MPOL_MF_STRICT was specified and an existing page was already
 *        on a node that does not follow the policy.

to match queue_pages_range(), but that function has two callers:

 - the first ignores the return value entirely

 - the second would be perfectly fine with -EBUSY as a return value
instead, which would actually make more sense.

Everybody else returns 0 or a negative error, as far as I can tell.

End result:

 (a) right now nobody wants the "skip" behavior. You think you'll
eventually want it

 (b) right now, the "return positive value" is actually a horribly
ugly pointless hack, which could be made to be an error value and
cleaned up in the process

 (c) to me that really argues that we should just make the rule be

     - negative error means break out with error

     - 0 means continue down the next level

     - positive could be trivially be made to just mean "ok, I did it,
you can just continue".

and I think that would make a lot of sense.

So I do think we can get rid of the PAGE_WALK_FALLBACK hack entirely.
It becomes "return 0", which is what everybody really does right now.
If there's a pte_entry(), it will split and call that pte_entry for
each pte, the way it does now. Sensible and simple.

And then - for _future_ uses - we can add that "positive return value
means that I already handled it" and the walker would just skip to the
next entry.

But - again - in no alternate reality, past, future or present, does
it make sense to skip the splitting if you walk pte's.

               Linus


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v4 3/9] mm: pagewalk: Don't split transhuge pmds when a pmd_entry is present
  2019-10-09 23:51                     ` Linus Torvalds
@ 2019-10-10  0:18                       ` Linus Torvalds
  2019-10-10  1:09                       ` Thomas Hellström (VMware)
  1 sibling, 0 replies; 32+ messages in thread
From: Linus Torvalds @ 2019-10-10  0:18 UTC (permalink / raw)
  To: Thomas Hellström (VMware)
  Cc: Thomas Hellstrom, Kirill A. Shutemov, Linux Kernel Mailing List,
	Linux-MM, Matthew Wilcox, Will Deacon, Peter Zijlstra,
	Rik van Riel, Minchan Kim, Michal Hocko, Huang Ying,
	Jérôme Glisse

[-- Attachment #1: Type: text/plain, Size: 1696 bytes --]

On Wed, Oct 9, 2019 at 4:51 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
>  (a) right now nobody wants the "skip" behavior. You think you'll
> eventually want it
>
>  (b) right now, the "return positive value" is actually a horribly
> ugly pointless hack, which could be made to be an error value and
> cleaned up in the process
>
>  (c) to me that really argues that we should just make the rule be
>
>      - negative error means break out with error
>
>      - 0 means continue down the next level
>
>      - positive could be trivially be made to just mean "ok, I did it,
> you can just continue".
>
> and I think that would make a lot of sense.

So here's an ENTIRELY untested patch, but the return value of
pmd_entry() now makes conceptual sense to me. The whole "I hit an
error, I did nothing, I already did it myself" to me is the intuitive
meaning of {neg,0,pos} handling.

I think we probably should do this same thing for the upper levels too
to be consistent, but I think this at least makes sense, and is
simple, and avoids any hacky PAGE_WALK_CALLER_MAX magic.

I also wonder if some caller might want to get a count of "how many X
handled", and we'd just sum up all the positive return values as we're
traversing things, but that falls under "nobody seems to want it right
now", so I'm not adding extra code for something that might not be
useful.

And it is possible that I missed some other pmd_entry() callback that
returns a positive value. I did check them all, but mistakes happen
and I might have missed some case...

Again: entirely and utterly untested. It compiles. That's all I'm
going to guarantee, and even that might be a fluke.

                 Linus

[-- Attachment #2: patch.diff --]
[-- Type: text/x-patch, Size: 2426 bytes --]

 mm/mempolicy.c | 10 +++++-----
 mm/pagewalk.c  | 11 ++++++++---
 2 files changed, 13 insertions(+), 8 deletions(-)

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 4ae967bcf954..f8c99591592b 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -482,7 +482,7 @@ static int queue_pages_pmd(pmd_t *pmd, spinlock_t *ptl, unsigned long addr,
  *
  * queue_pages_pte_range() has three possible return values:
  * 0 - pages are placed on the right node or queued successfully.
- * 1 - there is unmovable page, and MPOL_MF_MOVE* & MPOL_MF_STRICT were
+ * -EBUSY - there is unmovable page, and MPOL_MF_MOVE* & MPOL_MF_STRICT were
  *     specified.
  * -EIO - only MPOL_MF_STRICT was specified and an existing page was already
  *        on a node that does not follow the policy.
@@ -669,7 +669,7 @@ static const struct mm_walk_ops queue_pages_walk_ops = {
  * passed via @private.
  *
  * queue_pages_range() has three possible return values:
- * 1 - there is unmovable page, but MPOL_MF_MOVE* & MPOL_MF_STRICT were
+ * -EBUSY - there is unmovable page, but MPOL_MF_MOVE* & MPOL_MF_STRICT were
  *     specified.
  * 0 - queue pages successfully or no misplaced page.
  * -EIO - there is misplaced page and only MPOL_MF_STRICT was specified.
@@ -1285,8 +1285,8 @@ static long do_mbind(unsigned long start, unsigned long len,
 	ret = queue_pages_range(mm, start, end, nmask,
 			  flags | MPOL_MF_INVERT, &pagelist);
 
-	if (ret < 0) {
-		err = -EIO;
+	if (ret < 0 && ret != -EBUSY) {
+		err = ret;
 		goto up_out;
 	}
 
@@ -1303,7 +1303,7 @@ static long do_mbind(unsigned long start, unsigned long len,
 				putback_movable_pages(&pagelist);
 		}
 
-		if ((ret > 0) || (nr_failed && (flags & MPOL_MF_STRICT)))
+		if ((ret < 0) || (nr_failed && (flags & MPOL_MF_STRICT)))
 			err = -EIO;
 	} else
 		putback_movable_pages(&pagelist);
diff --git a/mm/pagewalk.c b/mm/pagewalk.c
index d48c2a986ea3..eb9d292588a2 100644
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -49,10 +49,15 @@ static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
 		 * This implies that each ->pmd_entry() handler
 		 * needs to know about pmd_trans_huge() pmds
 		 */
-		if (ops->pmd_entry)
+		if (ops->pmd_entry) {
 			err = ops->pmd_entry(pmd, addr, next, walk);
-		if (err)
-			break;
+			if (err < 0)
+				break;
+			if (err > 0) {
+				err = 0;
+				continue;
+			}
+		}
 
 		/*
 		 * Check this here so we only break down trans_huge

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v4 3/9] mm: pagewalk: Don't split transhuge pmds when a pmd_entry is present
  2019-10-09 23:51                     ` Linus Torvalds
  2019-10-10  0:18                       ` Linus Torvalds
@ 2019-10-10  1:09                       ` Thomas Hellström (VMware)
  2019-10-10  2:07                         ` Linus Torvalds
  1 sibling, 1 reply; 32+ messages in thread
From: Thomas Hellström (VMware) @ 2019-10-10  1:09 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Thomas Hellstrom, Kirill A. Shutemov, Linux Kernel Mailing List,
	Linux-MM, Matthew Wilcox, Will Deacon, Peter Zijlstra,
	Rik van Riel, Minchan Kim, Michal Hocko, Huang Ying,
	Jérôme Glisse

On 10/10/19 1:51 AM, Linus Torvalds wrote:
> On Wed, Oct 9, 2019 at 3:31 PM Thomas Hellström (VMware)
> <thomas_os@shipmail.org> wrote:
>> On 10/9/19 10:20 PM, Linus Torvalds wrote:
>>> You *have* to call split_huge_pmd() if you're doing to call the
>>> pte_entry() function.
>>>
>>> End of story.
>> So is it that you want pte_entry() to be strictly called for *each*
>> virtual address, even if we have a pmd_entry()?
> Thomas, you're not reading the emails.
>
> You are conflating two issues:
>
>   (a) the conditional split_huge_pmd()
>
>   (b) the what to do about the return value of pmd_entry().
>
> and I'm now FOR THE LAST TIME telling you that (a) is completely
> wrong, buggy crap. It will not happen. I missed that part of the patch
> in my original read-through, because it was so senseless.
>
> Get it through you head. The "conditional split_huge_pmd()" thing is
> wrong, wrong, wrong.
>
> And it is COMPLETELY WRONG regardless of any "each virtual address"
> thing that you keep bringing up. The "each virtual address" argument
> is irrelevant, pointless, and does not matter.
>
> So stop bringing that thing up. Really. The conditional
> split_huge_pmd() is wrong.
>
> It's wrong,  because the whole notion of "don't split pmd and then
> call the pte walker" is completely idiotic and utterly senseless,
> because without the splitting the pte's DO NOT EXIST.
>
> What part of that is not getting through?
>
>> In that case I completely follow your arguments, meaning we skip this
>> patch completely?
> Well, yes and no.
>
> The part of the patch that is complete and utter garbage, and that you
> really need to *understand* why it's complete and utter garbage is
> this part:
>
>                  if (!ops->pte_entry)
>                          continue;
> -
> -               split_huge_pmd(walk->vma, pmd, addr);
> +               if (!ops->pmd_entry)
> +                       split_huge_pmd(walk->vma, pmd, addr);
>                  if (pmd_trans_unstable(pmd))
>                          goto again;
>                  err = walk_pte_range(pmd, addr, next, walk);
>
> Look, you cannot call "walk_pte_range()" without calling
> split_huge_pmd(), because walk_pte_range() cannot deal with a huge
> pmd.
>
> walk_pte_range() does that pte_offset_map() on the pmd (well, with
> your other patch in place it does the locking version), and then it
> walks every pte entry of it. But that cannot possibly work if you
> didn't split it.

Thank you for your patience!

Yes, I very well *do* understand that we need to split a huge pmd to 
walk the pte range, and I've never been against removing that 
conditional. What I've said is that it is pointless anyway, because 
we've already verified that the only path coming from the pmd_entry 
(with the patch applied) has the pmd *already split* and stable.

Your original patch does exactly the same!

So please let's move on from the splitting issue. We don't disagree 
here. The conditional is gone to never be resurrected.

>
> Now, that gets us back to the (b) part above - what to do about the
> return value of "pmd_entry()".
>
> What *would* make sense, for example, is saying "ok, pmd_entry()
> already handled this, so we'll just continue, and not bother with the
> pte_range() at all".
>
> Note how that is VERY VERY different from "let's not split".
>
> Yes, for that case we also don't split, but we don't split because
> we're not even walking it. See the difference?
>
> Not splitting and then walking: wrong, insane, and not going to happen.
>
> Nor splitting because we're not going to walk it: sane, and we already
> have one such case.
>
> But the "don't split and then don't skip the level" makes zero sense
> what-so-ever. Ever. Under no circumstances can that be valid as per
> above.

Agreed.

>
> There's also the argument that right now, there are no users that
> actually want to skip the level.
>
> Even your use case doesn't really want that, because in your use-case,
> the only case that would do it is the error case that isn't supposed
> to happen.
>
> And if it *is* supposed to happen, in many ways it would be more
> sensible to just return a positive value saying "I took care of it, go
> on to the next entry", wouldn't you agree?

Indeed.

>
> Now, I actually tried to look through every single existing pmd_entry
> case, because I wanted to see who is returning positive values right
> now, and which of them might *want* to say "ok, I handled it" or "now
> do the pte walking".
>
> Quite a lot of them seem to really want to do that "ok, now do the pte
> walking", because they currently do it inside the pmd function because
> of how the original interface was badly done. So while we _currently_
> do not have a "ok, I did everything for this pmd, skip it" vs a "ok,
> continue with pte" case, it clearly does make sense. No question about
> it.
>
> I did find one single user of a positive return value:
> queue_pages_pte_range() returns
>
>   * 0 - pages are placed on the right node or queued successfully.
>   * 1 - there is unmovable page, and MPOL_MF_MOVE* & MPOL_MF_STRICT were
>   *     specified.
>   * -EIO - only MPOL_MF_STRICT was specified and an existing page was already
>   *        on a node that does not follow the policy.
>
> to match queue_pages_range(), but that function has two callers:
>
>   - the first ignores the return value entirely
>
>   - the second would be perfectly fine with -EBUSY as a return value
> instead, which would actually make more sense.
>
> Everybody else returns 0 or a negative error, as far as I can tell.
>
> End result:
>
>   (a) right now nobody wants the "skip" behavior. You think you'll
> eventually want it
>
>   (b) right now, the "return positive value" is actually a horribly
> ugly pointless hack, which could be made to be an error value and
> cleaned up in the process
>
>   (c) to me that really argues that we should just make the rule be
>
>       - negative error means break out with error
>
>       - 0 means continue down the next level
>
>       - positive could be trivially be made to just mean "ok, I did it,
> you can just continue".
>
> and I think that would make a lot of sense.
>
Sure. I'm fine with this. So for next spin, I'll skip this patch, and do 
the positive value rework as part of the prerequisites for the huge page 
enablement. IIRC there is another user of positive values that should be 
trivial to fix.

Does that sound OK?

Thanks,

Thomas




^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v4 3/9] mm: pagewalk: Don't split transhuge pmds when a pmd_entry is present
  2019-10-10  1:09                       ` Thomas Hellström (VMware)
@ 2019-10-10  2:07                         ` Linus Torvalds
  2019-10-10  6:15                           ` Thomas Hellström (VMware)
  0 siblings, 1 reply; 32+ messages in thread
From: Linus Torvalds @ 2019-10-10  2:07 UTC (permalink / raw)
  To: Thomas Hellström (VMware)
  Cc: Thomas Hellstrom, Kirill A. Shutemov, Linux Kernel Mailing List,
	Linux-MM, Matthew Wilcox, Will Deacon, Peter Zijlstra,
	Rik van Riel, Minchan Kim, Michal Hocko, Huang Ying,
	Jérôme Glisse

On Wed, Oct 9, 2019 at 6:10 PM Thomas Hellström (VMware)
<thomas_os@shipmail.org> wrote:
>
> Your original patch does exactly the same!

Oh, no. You misread my original patch.

Look again.

The logic in my original patch was very different. It said that

 - *if* we have a pmd_entry function, then we obviously call that one.

    And if - after calling the pmd_entry function - we are still a
hugepage, then we skip the pte_entry case entirely.

   And part of skipping is obviously "don't split" - but it never had
that "don't split and then call the pte walker" case.

 - and if we *don't* have a pmd_entry function, but we do have a
pte_entry function, then we always split before calling it.

Notice the difference?

So instead of looking at the return value of the pmd_entry() function,
the approach of that suggested patch was to basically say that if the
pmd_entry function wants us to go another level deeper and it was a
hugepmd, it needed to split the pmd to make that happen.

That's actually very different from what your patch did. My original
patch never tried to walk the pte level without having one - either it
*checked* that it had one, or it split.

But I see where you might have misread the patch, particularly if you
only looked at it as a patch, not as the end result of the patch.

The end result of that patch was this:

                if (ops->pmd_entry) {
                        err = ops->pmd_entry(pmd, addr, next, walk);
                        if (err)
                                break;
                        /* No pte level walking? */
                        if (!ops->pte_entry)
                                continue;
                        /* No pte level at all? */
                        if (is_swap_pmd(*pmd) || pmd_trans_huge(*pmd)
|| pmd_devmap(*pmd))
                                continue;
                } else {
                        if (!ops->pte_entry)
                                continue;

                        split_huge_pmd(walk->vma, pmd, addr);
                        if (pmd_trans_unstable(pmd))
                                goto again;
                }
                err = walk_pte_range(pmd, addr, next, walk);

and look at thew two different sides of the if-statement: if they get
to "walk_pte_range()", both cases wil have verified that there
actually _is_ a pte level. They will just have done it differently. -
the "we didn't have a pmd function" will have split the pmd if it was
a hugepmd, while the "we do have a pmd_entry" case will just check
whether it's still a huge-pmd, and done a "continue" if it was and
never even tried to walk the ptes.

But I think the "change pmd_entry to have a sane return code" is a
simpler and more flexible model, and then the pmd_entry code can just
let the pte walker split the pmd if needed.

So I liked that part of your patch.

           Linus


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v4 3/9] mm: pagewalk: Don't split transhuge pmds when a pmd_entry is present
  2019-10-10  2:07                         ` Linus Torvalds
@ 2019-10-10  6:15                           ` Thomas Hellström (VMware)
  0 siblings, 0 replies; 32+ messages in thread
From: Thomas Hellström (VMware) @ 2019-10-10  6:15 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Thomas Hellstrom, Kirill A. Shutemov, Linux Kernel Mailing List,
	Linux-MM, Matthew Wilcox, Will Deacon, Peter Zijlstra,
	Rik van Riel, Minchan Kim, Michal Hocko, Huang Ying,
	Jérôme Glisse

On 10/10/19 4:07 AM, Linus Torvalds wrote:
> On Wed, Oct 9, 2019 at 6:10 PM Thomas Hellström (VMware)
> <thomas_os@shipmail.org> wrote:
>> Your original patch does exactly the same!
> Oh, no. You misread my original patch.
>
> Look again.
>
> The logic in my original patch was very different. It said that
>
>   - *if* we have a pmd_entry function, then we obviously call that one.
>
>      And if - after calling the pmd_entry function - we are still a
> hugepage, then we skip the pte_entry case entirely.
>
>     And part of skipping is obviously "don't split" - but it never had
> that "don't split and then call the pte walker" case.
>
>   - and if we *don't* have a pmd_entry function, but we do have a
> pte_entry function, then we always split before calling it.
>
> Notice the difference?

 From what I can tell, my patch is doing the same. At least that always 
was the intention. To determine whether to skip pte and skip split, your 
patch uses

                         /* No pte level at all? */
                         if (is_swap_pmd(*pmd) || pmd_trans_huge(*pmd)
|| pmd_devmap(*pmd))
                                 continue;

whereas my patch does

                         if (pmd_trans_unstable(pmd))
                                 goto again;
			/* Fall through */

which is the same (pmd_trans_unstable() is the same test as you do, but 
not racy). Yes, it's missing the test for pmd_devmap() but I think 
that's an mm bug been discussed elsewhere, and we also rerun because a 
huge / none pmd at this (FALLBACK) point is probably a race and unintended.

>
> But I think the "change pmd_entry to have a sane return code" is a
> simpler and more flexible model, and then the pmd_entry code can just
> let the pte walker split the pmd if needed.

OK, let's aim for that then.

Thanks,

Thomas


>
> So I liked that part of your patch.
>
>             Linus




^ permalink raw reply	[flat|nested] 32+ messages in thread

end of thread, back to index

Thread overview: 32+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-10-08  9:14 [PATCH v4 0/9] Emulated coherent graphics memory take 2 Thomas Hellström (VMware)
2019-10-08  9:15 ` [PATCH v4 1/9] mm: Remove BUG_ON mmap_sem not held from xxx_trans_huge_lock() Thomas Hellström (VMware)
2019-10-08  9:15 ` [PATCH v4 2/9] mm: pagewalk: Take the pagetable lock in walk_pte_range() Thomas Hellström (VMware)
2019-10-09 15:14   ` Kirill A. Shutemov
2019-10-09 16:07     ` Linus Torvalds
2019-10-08  9:15 ` [PATCH v4 3/9] mm: pagewalk: Don't split transhuge pmds when a pmd_entry is present Thomas Hellström (VMware)
2019-10-09 15:27   ` Kirill A. Shutemov
2019-10-09 15:27     ` Kirill A. Shutemov
2019-10-09 16:20     ` Thomas Hellström (VMware)
2019-10-09 16:20       ` Thomas Hellström (VMware)
2019-10-09 16:21     ` Linus Torvalds
2019-10-09 17:03       ` Thomas Hellström (VMware)
2019-10-09 17:16         ` Linus Torvalds
2019-10-09 18:52           ` Thomas Hellstrom
2019-10-09 19:20             ` Linus Torvalds
2019-10-09 20:06               ` Thomas Hellström (VMware)
2019-10-09 20:20                 ` Linus Torvalds
2019-10-09 22:30                   ` Thomas Hellström (VMware)
2019-10-09 23:50                     ` Thomas Hellström (VMware)
2019-10-09 23:51                     ` Linus Torvalds
2019-10-10  0:18                       ` Linus Torvalds
2019-10-10  1:09                       ` Thomas Hellström (VMware)
2019-10-10  2:07                         ` Linus Torvalds
2019-10-10  6:15                           ` Thomas Hellström (VMware)
2019-10-08  9:15 ` [PATCH v4 4/9] mm: Add a walk_page_mapping() function to the pagewalk code Thomas Hellström (VMware)
2019-10-08  9:15 ` [PATCH v4 5/9] mm: Add write-protect and clean utilities for address space ranges Thomas Hellström (VMware)
2019-10-08 17:06   ` Linus Torvalds
2019-10-08 18:25     ` Thomas Hellstrom
2019-10-08  9:15 ` [PATCH v4 6/9] drm/vmwgfx: Implement an infrastructure for write-coherent resources Thomas Hellström (VMware)
2019-10-08  9:15 ` [PATCH v4 7/9] drm/vmwgfx: Use an RBtree instead of linked list for MOB resources Thomas Hellström (VMware)
2019-10-08  9:15 ` [PATCH v4 8/9] drm/vmwgfx: Implement an infrastructure for read-coherent resources Thomas Hellström (VMware)
2019-10-08  9:15 ` [PATCH v4 9/9] drm/vmwgfx: Add surface dirty-tracking callbacks Thomas Hellström (VMware)

Linux-mm Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/linux-mm/0 linux-mm/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 linux-mm linux-mm/ https://lore.kernel.org/linux-mm \
		linux-mm@kvack.org linux-mm@archiver.kernel.org
	public-inbox-index linux-mm

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kvack.linux-mm


AGPL code for this site: git clone https://public-inbox.org/ public-inbox