linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v2 0/5] Emulated coherent graphics memory take 2
@ 2019-09-26 11:55 Thomas Hellström (VMware)
  2019-09-26 11:55 ` [PATCH v2 1/5] mm: Add write-protect and clean utilities for address space ranges Thomas Hellström (VMware)
                   ` (4 more replies)
  0 siblings, 5 replies; 23+ messages in thread
From: Thomas Hellström (VMware) @ 2019-09-26 11:55 UTC (permalink / raw)
  To: linux-kernel, dri-devel, linux-mm
  Cc: pv-drivers, linux-graphics-maintainer, Thomas Hellström,
	Andrew Morton, Matthew Wilcox, Will Deacon, Peter Zijlstra,
	Rik van Riel, Minchan Kim, Michal Hocko, Huang Ying,
	Souptick Joarder, Jérôme Glisse, Christian König,
	Christoph Hellwig

From: Thomas Hellström <thellstrom@vmware.com>

Graphics APIs like OpenGL 4.4 and Vulkan require the graphics driver
to provide coherent graphics memory, meaning that the GPU sees any
content written to the coherent memory on the next GPU operation that
touches that memory, and the CPU sees any content written by the GPU
to that memory immediately after any fence object trailing the GPU
operation has signaled.

Paravirtual drivers that otherwise require explicit synchronization
needs to do this by hooking up dirty tracking to pagefault handlers
and buffer object validation.

The mm patch page walk interface has been reworked to be similar to the
reworked page-walk code (mm/pagewalk.c). There have been two other solutions
to consider:
1) Using the page-walk code. That is currently not possible since it requires
the mmap-sem to be held for the struct vm_area_struct vm_flags and for huge
page splitting. The pagewalk code in this patchset can't hold the mmap sems
since it will lead to locking inversion. We have an established locking order
mmap_sem -> dma_reservation -> i_mmap_lock, whereas holding the mmap_sem in
this case would require dma_reservation -> i_mmap_lock -> mmap_sem.
Instead it uses an operation mode similar to unmap_mapping_range() where the
i_mmap_lock is held.
2) Using apply_to_page_range(). The primary use of this code is to fill
page tables. The operation modes are IMO sufficiently different to motivate
re-implementing the page-walk.

The code has been tested and exercised by a tailored version of mesa
where we disable all explicit synchronization and assume graphics memory
is coherent. The performance loss varies of course; a typical number is
around 5%.

I would like to merge this code through the DRM tree, so an ack to include
the new mm helpers in that merge would be greatly appreciated.

Changes since RFC:
- Merge conflict changes moved to the correct patch. Fixes intra-patchset
  compile errors.
- Be more aggressive when turning ttm vm code into helpers. This makes sure
  we can use a const qualifier on the vmwgfx vm_ops.
- Reinstate a lost comment an fix an error path that was broken when turning
  the ttm vm code into helpers.
- Remove explicit type-casts of struct vm_area_struct::vm_private_data
- Clarify the locking inversion that makes us not being able to use the mm
  pagewalk code.

Changes since v1:
- Removed the vmwgfx maintainer entry for as_dirty_helpers.c, updated
  commit message accordingly
- Removed the TTM patches from the series as they are merged separately
  through DRM.

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Souptick Joarder <jrdr.linux@gmail.com>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Christoph Hellwig <hch@infradead.org>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* [PATCH v2 1/5] mm: Add write-protect and clean utilities for address space ranges
  2019-09-26 11:55 [PATCH v2 0/5] Emulated coherent graphics memory take 2 Thomas Hellström (VMware)
@ 2019-09-26 11:55 ` Thomas Hellström (VMware)
  2019-09-26 12:03   ` Ack to merge through DRM? WAS " Thomas Hellström (VMware)
  2019-09-26 11:55 ` [PATCH v2 2/5] drm/vmwgfx: Implement an infrastructure for write-coherent resources Thomas Hellström (VMware)
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 23+ messages in thread
From: Thomas Hellström (VMware) @ 2019-09-26 11:55 UTC (permalink / raw)
  To: linux-kernel, dri-devel, linux-mm
  Cc: pv-drivers, linux-graphics-maintainer, Thomas Hellstrom,
	Andrew Morton, Matthew Wilcox, Will Deacon, Peter Zijlstra,
	Rik van Riel, Minchan Kim, Michal Hocko, Huang Ying,
	Souptick Joarder, Jérôme Glisse, Ralph Campbell

From: Thomas Hellstrom <thellstrom@vmware.com>

Add two utilities to a) write-protect and b) clean all ptes pointing into
a range of an address space.
The utilities are intended to aid in tracking dirty pages (either
driver-allocated system memory or pci device memory).
The write-protect utility should be used in conjunction with
page_mkwrite() and pfn_mkwrite() to trigger write page-faults on page
accesses. Typically one would want to use this on sparse accesses into
large memory regions. The clean utility should be used to utilize
hardware dirtying functionality and avoid the overhead of page-faults,
typically on large accesses into small memory regions.

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Souptick Joarder <jrdr.linux@gmail.com>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org

Signed-off-by: Thomas Hellstrom <thellstrom@vmware.com>
Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
---
 include/linux/mm.h    |  13 +-
 mm/Kconfig            |   3 +
 mm/Makefile           |   1 +
 mm/as_dirty_helpers.c | 392 ++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 408 insertions(+), 1 deletion(-)
 create mode 100644 mm/as_dirty_helpers.c

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 0334ca97c584..27ff341ecbdc 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2657,7 +2657,6 @@ typedef int (*pte_fn_t)(pte_t *pte, unsigned long addr, void *data);
 extern int apply_to_page_range(struct mm_struct *mm, unsigned long address,
 			       unsigned long size, pte_fn_t fn, void *data);
 
-
 #ifdef CONFIG_PAGE_POISONING
 extern bool page_poisoning_enabled(void);
 extern void kernel_poison_pages(struct page *page, int numpages, int enable);
@@ -2891,5 +2890,17 @@ void __init setup_nr_node_ids(void);
 static inline void setup_nr_node_ids(void) {}
 #endif
 
+#ifdef CONFIG_AS_DIRTY_HELPERS
+unsigned long apply_as_clean(struct address_space *mapping,
+			     pgoff_t first_index, pgoff_t nr,
+			     pgoff_t bitmap_pgoff,
+			     unsigned long *bitmap,
+			     pgoff_t *start,
+			     pgoff_t *end);
+
+unsigned long apply_as_wrprotect(struct address_space *mapping,
+				 pgoff_t first_index, pgoff_t nr);
+#endif
+
 #endif /* __KERNEL__ */
 #endif /* _LINUX_MM_H */
diff --git a/mm/Kconfig b/mm/Kconfig
index 56cec636a1fc..594350e9d78e 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -736,4 +736,7 @@ config ARCH_HAS_PTE_SPECIAL
 config ARCH_HAS_HUGEPD
 	bool
 
+config AS_DIRTY_HELPERS
+        bool
+
 endmenu
diff --git a/mm/Makefile b/mm/Makefile
index d0b295c3b764..4086f1eefbc6 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -105,3 +105,4 @@ obj-$(CONFIG_PERCPU_STATS) += percpu-stats.o
 obj-$(CONFIG_ZONE_DEVICE) += memremap.o
 obj-$(CONFIG_HMM_MIRROR) += hmm.o
 obj-$(CONFIG_MEMFD_CREATE) += memfd.o
+obj-$(CONFIG_AS_DIRTY_HELPERS) += as_dirty_helpers.o
diff --git a/mm/as_dirty_helpers.c b/mm/as_dirty_helpers.c
new file mode 100644
index 000000000000..d4cc37dcb144
--- /dev/null
+++ b/mm/as_dirty_helpers.c
@@ -0,0 +1,392 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/mm.h>
+#include <linux/mm_types.h>
+#include <linux/hugetlb.h>
+#include <linux/bitops.h>
+#include <linux/mmu_notifier.h>
+#include <asm/cacheflush.h>
+#include <asm/tlbflush.h>
+
+/**
+ * struct as_walk - Argument to struct as_walk_ops callbacks.
+ * @vma: Pointer to the struct vmw_area_struct currently being walked.
+ *
+ * Embeddable argument to struct as_walk_ops callbacks.
+ */
+struct as_walk {
+	struct vm_area_struct *vma;
+};
+
+/**
+ * struct as_walk_ops - Callbacks for entries of various page table levels.
+ * extend for additional level support.
+ */
+struct as_walk_ops {
+	/**
+	 * pte-entry: Callback for PTEs
+	 * @pte: Pointer to the PTE.
+	 * @addr: Virtual address.
+	 * @asw: Struct as_walk argument for the walk. Embed for additional
+	 * data.
+	 */
+	void (*const pte_entry) (pte_t *pte, unsigned long addr,
+				 struct as_walk *asw);
+};
+
+/* Page-walking code */
+static void walk_as_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
+			      const struct as_walk_ops *ops,
+			      struct as_walk *asw)
+{
+	struct mm_struct *mm = asw->vma->vm_mm;
+	pte_t *pte;
+	spinlock_t *ptl;
+
+	pte = (mm == &init_mm) ?
+		pte_offset_kernel(pmd, addr) :
+		pte_offset_map_lock(mm, pmd, addr, &ptl);
+
+	arch_enter_lazy_mmu_mode();
+
+	do {
+		ops->pte_entry(pte++, addr, asw);
+	} while (addr += PAGE_SIZE, addr != end);
+
+	arch_leave_lazy_mmu_mode();
+
+	if (mm != &init_mm)
+		pte_unmap_unlock(pte - 1, ptl);
+}
+
+static void walk_as_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
+			      const struct as_walk_ops *ops,
+			      struct as_walk *asw)
+{
+	pmd_t *pmd = pmd_offset(pud, addr);
+	unsigned long next;
+
+	do {
+		next = pmd_addr_end(addr, end);
+		if (pmd_none_or_clear_bad(pmd))
+			continue;
+		if (WARN_ON(pmd_huge(*pmd)))
+			continue;
+		walk_as_pte_range(pmd, addr, next, ops, asw);
+	} while (pmd++, addr = next, addr != end);
+}
+
+static void walk_as_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end,
+			      const struct as_walk_ops *ops,
+			      struct as_walk *asw)
+{
+	pud_t *pud = pud_offset(p4d, addr);
+	unsigned long next;
+
+	do {
+		next = pud_addr_end(addr, end);
+		if (pud_none_or_clear_bad(pud))
+			continue;
+		if (WARN_ON(pud_huge(*pud)))
+			continue;
+		walk_as_pmd_range(pud, addr, next, ops, asw);
+	} while (pud++, addr = next, addr != end);
+}
+
+static void walk_as_p4d_range(pgd_t *pgd, unsigned long addr, unsigned long end,
+			      const struct as_walk_ops *ops,
+			      struct as_walk *asw)
+{
+	p4d_t *p4d = p4d_offset(pgd, addr);
+	unsigned long next;
+
+	do {
+		next = p4d_addr_end(addr, end);
+		if (p4d_none_or_clear_bad(p4d))
+			continue;
+		walk_as_pud_range(p4d, addr, next, ops, asw);
+	} while (p4d++, addr = next, addr != end);
+}
+
+static void walk_as_pfn_range(unsigned long addr, unsigned long end,
+			      const struct as_walk_ops *ops,
+			      struct as_walk *asw)
+{
+	pgd_t *pgd = pgd_offset(asw->vma->vm_mm, addr);
+	unsigned long next;
+
+	do {
+		next = pgd_addr_end(addr, end);
+		if (pgd_none_or_clear_bad(pgd))
+			continue;
+		walk_as_p4d_range(pgd, addr, next, ops, asw);
+	} while (pgd++, addr = next, addr != end);
+}
+
+
+/**
+ * struct as_walk_range - Argument for apply_as_range
+ * @asw: The struct as_walk we embed for the page walk
+ * @start: Address of first modified pte
+ * @end: Address of last modified pte + 1
+ * @total: Total number of modified ptes
+ */
+struct as_walk_range {
+	struct as_walk base;
+	unsigned long start;
+	unsigned long end;
+	unsigned long total;
+};
+
+#define to_as_walk_range(_asw) container_of(_asw, struct as_walk_range, base)
+
+/**
+ * apply_pt_wrprotect - Leaf pte callback to write-protect a pte
+ * @pte: Pointer to the pte
+ * @addr: The virtual page address
+ * @asw: Pointer to a struct as_walk embedded in a struct as_walk_range
+ *
+ * The function write-protects a pte and records the range in
+ * virtual address space of touched ptes for efficient range TLB flushes.
+ */
+static void apply_pt_wrprotect(pte_t *pte, unsigned long addr,
+			       struct as_walk *asw)
+{
+	struct as_walk_range *awr = to_as_walk_range(asw);
+	pte_t ptent = *pte;
+
+	if (pte_write(ptent)) {
+		pte_t old_pte = ptep_modify_prot_start(asw->vma, addr, pte);
+
+		ptent = pte_wrprotect(old_pte);
+		ptep_modify_prot_commit(asw->vma, addr, pte, old_pte, ptent);
+		awr->total++;
+		awr->start = min(awr->start, addr);
+		awr->end = max(awr->end, addr + PAGE_SIZE);
+	}
+}
+
+/**
+ * struct as_walk_clean - Argument structure for apply_pt_clean
+ * @base: struct as_walk we derive from
+ * @bitmap_pgoff: Address_space Page offset of the first bit in @bitmap
+ * @bitmap: Bitmap with one bit for each page offset in the address_space range
+ * covered.
+ * @start: Address_space page offset of first modified pte relative
+ * to @bitmap_pgoff
+ * @end: Address_space page offset of last modified pte relative
+ * to @bitmap_pgoff
+ */
+struct as_walk_clean {
+	struct as_walk_range base;
+	pgoff_t bitmap_pgoff;
+	unsigned long *bitmap;
+	pgoff_t start;
+	pgoff_t end;
+};
+
+#define to_as_walk_clean(_awr) container_of(_awr, struct as_walk_clean, base)
+
+/**
+ * apply_pt_clean - Leaf pte callback to clean a pte
+ * @pte: Pointer to the pte
+ * @addr: The virtual page address
+ * @asw: Pointer to a struct as_walk embedded in a struct as_walk_clean
+ *
+ * The function cleans a pte and records the range in
+ * virtual address space of touched ptes for efficient TLB flushes.
+ * It also records dirty ptes in a bitmap representing page offsets
+ * in the address_space, as well as the first and last of the bits
+ * touched.
+ */
+static void apply_pt_clean(pte_t *pte, unsigned long addr, struct as_walk *asw)
+{
+	struct as_walk_range *awr = to_as_walk_range(asw);
+	struct as_walk_clean *clean = to_as_walk_clean(awr);
+	pte_t ptent = *pte;
+
+	if (pte_dirty(ptent)) {
+		pgoff_t pgoff = ((addr - asw->vma->vm_start) >> PAGE_SHIFT) +
+			asw->vma->vm_pgoff - clean->bitmap_pgoff;
+		pte_t old_pte = ptep_modify_prot_start(asw->vma, addr, pte);
+
+		ptent = pte_mkclean(old_pte);
+		ptep_modify_prot_commit(asw->vma, addr, pte, old_pte, ptent);
+
+		awr->total++;
+		awr->start = min(awr->start, addr);
+		awr->end = max(awr->end, addr + PAGE_SIZE);
+
+		__set_bit(pgoff, clean->bitmap);
+		clean->start = min(clean->start, pgoff);
+		clean->end = max(clean->end, pgoff + 1);
+	}
+}
+
+/**
+ * apply_as_range - Apply a pte callback to all PTEs pointing into a range
+ * of an address_space.
+ * @mapping: Pointer to the struct address_space
+ * @aas: Closure structure
+ * @first_index: First page offset in the address_space
+ * @nr: Number of incremental page offsets to cover
+ *
+ * Return: Number of ptes touched. Note that this number might be larger
+ * than @nr if there are overlapping vmas
+ */
+static unsigned long apply_as_range(struct address_space *mapping,
+				    pgoff_t first_index, pgoff_t nr,
+				    const struct as_walk_ops *ops,
+				    struct as_walk_range *awr)
+{
+	struct vm_area_struct *vma;
+	pgoff_t vba, vea, cba, cea;
+	unsigned long start_addr, end_addr;
+	struct mmu_notifier_range range;
+
+	i_mmap_lock_read(mapping);
+	vma_interval_tree_foreach(vma, &mapping->i_mmap, first_index,
+				  first_index + nr - 1) {
+		unsigned long vm_flags = READ_ONCE(vma->vm_flags);
+
+		/*
+		 * We can only do advisory flag tests below, since we can't
+		 * require the mm's mmap_sem to be held to protect the flags.
+		 * Therefore, callers that strictly depend on specific vm_flags
+		 * to remain constant throughout the operation must ensure
+		 * those flags are immutable for all relevant vmas or can't use
+		 * this function. Fixing this properly would require the
+		 * vm_flags to be protected by a separate lock taken after the
+		 * i_mmap_lock
+		 */
+
+		/* Skip non-applicable VMAs */
+		if ((vm_flags & (VM_SHARED | VM_WRITE)) !=
+		    (VM_SHARED | VM_WRITE))
+			continue;
+
+		/* Warn on and skip VMAs whose flags indicate illegal usage */
+		if (WARN_ON((vm_flags & (VM_HUGETLB | VM_IO)) != VM_IO))
+			continue;
+
+		/* Clip to the vma */
+		vba = vma->vm_pgoff;
+		vea = vba + vma_pages(vma);
+		cba = first_index;
+		cba = max(cba, vba);
+		cea = first_index + nr;
+		cea = min(cea, vea);
+
+		/* Translate to virtual address */
+		start_addr = ((cba - vba) << PAGE_SHIFT) + vma->vm_start;
+		end_addr = ((cea - vba) << PAGE_SHIFT) + vma->vm_start;
+		if (start_addr >= end_addr)
+			continue;
+
+		awr->base.vma = vma;
+		awr->start = end_addr;
+		awr->end = start_addr;
+
+		mmu_notifier_range_init(&range, MMU_NOTIFY_PROTECTION_PAGE, 0,
+					vma, vma->vm_mm, start_addr, end_addr);
+		mmu_notifier_invalidate_range_start(&range);
+
+		/* Is this needed when we only change protection? */
+		flush_cache_range(vma, start_addr, end_addr);
+
+		/*
+		 * We're not using tlb_gather_mmu() since typically
+		 * only a small subrange of PTEs are affected, whereas
+		 * tlb_gather_mmu() records the full range.
+		 */
+		inc_tlb_flush_pending(vma->vm_mm);
+		walk_as_pfn_range(start_addr, end_addr, ops, &awr->base);
+		if (awr->end > awr->start)
+			flush_tlb_range(vma, awr->start, awr->end);
+
+		mmu_notifier_invalidate_range_end(&range);
+		dec_tlb_flush_pending(vma->vm_mm);
+	}
+	i_mmap_unlock_read(mapping);
+
+	return awr->total;
+}
+
+/**
+ * apply_as_wrprotect - Write-protect all ptes in an address_space range
+ * @mapping: The address_space we want to write protect
+ * @first_index: The first page offset in the range
+ * @nr: Number of incremental page offsets to cover
+ *
+ * WARNING: This function should only be used for address spaces whose
+ * vmas are marked VM_IO and that do not contain huge pages.
+ * To avoid interference with COW'd pages, vmas not marked VM_SHARED are
+ * simply skipped.
+ *
+ * Return: The number of ptes actually write-protected. Note that
+ * already write-protected ptes are not counted.
+ */
+unsigned long apply_as_wrprotect(struct address_space *mapping,
+				 pgoff_t first_index, pgoff_t nr)
+{
+	static const struct as_walk_ops ops = {
+		.pte_entry = apply_pt_wrprotect
+	};
+	struct as_walk_range awr = { .total = 0 };
+
+	return apply_as_range(mapping, first_index, nr, &ops, &awr);
+}
+EXPORT_SYMBOL_GPL(apply_as_wrprotect);
+
+/**
+ * apply_as_clean - Clean all ptes in an address_space range
+ * @mapping: The address_space we want to clean
+ * @first_index: The first page offset in the range
+ * @nr: Number of incremental page offsets to cover
+ * @bitmap_pgoff: The page offset of the first bit in @bitmap
+ * @bitmap: Pointer to a bitmap of at least @nr bits. The bitmap needs to
+ * cover the whole range @first_index..@first_index + @nr.
+ * @start: Pointer to number of the first set bit in @bitmap.
+ * is modified as new bits are set by the function.
+ * @end: Pointer to the number of the last set bit in @bitmap.
+ * none set. The value is modified as new bits are set by the function.
+ *
+ * Note: When this function returns there is no guarantee that a CPU has
+ * not already dirtied new ptes. However it will not clean any ptes not
+ * reported in the bitmap.
+ *
+ * If a caller needs to make sure all dirty ptes are picked up and none
+ * additional are added, it first needs to write-protect the address-space
+ * range and make sure new writers are blocked in page_mkwrite() or
+ * pfn_mkwrite(). And then after a TLB flush following the write-protection
+ * pick up all dirty bits.
+ *
+ * WARNING: This function should only be used for address spaces whose
+ * vmas are marked VM_IO and that do not contain huge pages.
+ * To avoid interference with COW'd pages, vmas not marked VM_SHARED are
+ * simply skipped.
+ *
+ * Return: The number of dirty ptes actually cleaned.
+ */
+unsigned long apply_as_clean(struct address_space *mapping,
+			     pgoff_t first_index, pgoff_t nr,
+			     pgoff_t bitmap_pgoff,
+			     unsigned long *bitmap,
+			     pgoff_t *start,
+			     pgoff_t *end)
+{
+	bool none_set = (*start >= *end);
+	static const struct as_walk_ops ops = { .pte_entry = apply_pt_clean };
+	struct as_walk_clean clean = {
+		.base = { .total = 0, },
+		.bitmap_pgoff = bitmap_pgoff,
+		.bitmap = bitmap,
+		.start = none_set ? nr : *start,
+		.end = none_set ? 0 : *end,
+	};
+	unsigned long ret = apply_as_range(mapping, first_index, nr, &ops,
+					   &clean.base);
+	*start = clean.start;
+	*end = clean.end;
+	return ret;
+}
+EXPORT_SYMBOL_GPL(apply_as_clean);
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH v2 2/5] drm/vmwgfx: Implement an infrastructure for write-coherent resources
  2019-09-26 11:55 [PATCH v2 0/5] Emulated coherent graphics memory take 2 Thomas Hellström (VMware)
  2019-09-26 11:55 ` [PATCH v2 1/5] mm: Add write-protect and clean utilities for address space ranges Thomas Hellström (VMware)
@ 2019-09-26 11:55 ` Thomas Hellström (VMware)
  2019-09-26 11:55 ` [PATCH v2 3/5] drm/vmwgfx: Use an RBtree instead of linked list for MOB resources Thomas Hellström (VMware)
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 23+ messages in thread
From: Thomas Hellström (VMware) @ 2019-09-26 11:55 UTC (permalink / raw)
  To: linux-kernel, dri-devel, linux-mm
  Cc: pv-drivers, linux-graphics-maintainer, Thomas Hellstrom,
	Andrew Morton, Matthew Wilcox, Will Deacon, Peter Zijlstra,
	Rik van Riel, Minchan Kim, Michal Hocko, Huang Ying,
	Souptick Joarder, Jérôme Glisse, Christian König,
	Christoph Hellwig, Deepak Rawat

From: Thomas Hellstrom <thellstrom@vmware.com>

This infrastructure will, for coherent resources, make sure that
from the user-space point of view, data written by the CPU is immediately
automatically available to the GPU at resource validation time.

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Souptick Joarder <jrdr.linux@gmail.com>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: "Christian König" <christian.koenig@amd.com>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Thomas Hellstrom <thellstrom@vmware.com>
Reviewed-by: Deepak Rawat <drawat@vmware.com>
---
 drivers/gpu/drm/vmwgfx/Kconfig                |   1 +
 drivers/gpu/drm/vmwgfx/Makefile               |   2 +-
 drivers/gpu/drm/vmwgfx/vmwgfx_bo.c            |   5 +-
 drivers/gpu/drm/vmwgfx/vmwgfx_drv.h           |  23 +-
 drivers/gpu/drm/vmwgfx/vmwgfx_execbuf.c       |   1 -
 drivers/gpu/drm/vmwgfx/vmwgfx_page_dirty.c    | 417 ++++++++++++++++++
 drivers/gpu/drm/vmwgfx/vmwgfx_resource.c      |  57 +++
 drivers/gpu/drm/vmwgfx/vmwgfx_resource_priv.h |  11 +
 drivers/gpu/drm/vmwgfx/vmwgfx_ttm_glue.c      |  15 +-
 drivers/gpu/drm/vmwgfx/vmwgfx_validation.c    |  71 +++
 drivers/gpu/drm/vmwgfx/vmwgfx_validation.h    |  16 +-
 11 files changed, 598 insertions(+), 21 deletions(-)
 create mode 100644 drivers/gpu/drm/vmwgfx/vmwgfx_page_dirty.c

diff --git a/drivers/gpu/drm/vmwgfx/Kconfig b/drivers/gpu/drm/vmwgfx/Kconfig
index 6b28a326f8bb..d5fd81a521f6 100644
--- a/drivers/gpu/drm/vmwgfx/Kconfig
+++ b/drivers/gpu/drm/vmwgfx/Kconfig
@@ -8,6 +8,7 @@ config DRM_VMWGFX
 	select FB_CFB_IMAGEBLIT
 	select DRM_TTM
 	select FB
+	select AS_DIRTY_HELPERS
 	# Only needed for the transitional use of drm_crtc_init - can be removed
 	# again once vmwgfx sets up the primary plane itself.
 	select DRM_KMS_HELPER
diff --git a/drivers/gpu/drm/vmwgfx/Makefile b/drivers/gpu/drm/vmwgfx/Makefile
index 8841bd30e1e5..c877a21a0739 100644
--- a/drivers/gpu/drm/vmwgfx/Makefile
+++ b/drivers/gpu/drm/vmwgfx/Makefile
@@ -8,7 +8,7 @@ vmwgfx-y := vmwgfx_execbuf.o vmwgfx_gmr.o vmwgfx_kms.o vmwgfx_drv.o \
 	    vmwgfx_cmdbuf_res.o vmwgfx_cmdbuf.o vmwgfx_stdu.o \
 	    vmwgfx_cotable.o vmwgfx_so.o vmwgfx_binding.o vmwgfx_msg.o \
 	    vmwgfx_simple_resource.o vmwgfx_va.o vmwgfx_blit.o \
-	    vmwgfx_validation.o \
+	    vmwgfx_validation.o vmwgfx_page_dirty.o \
 	    ttm_object.o ttm_lock.o
 
 obj-$(CONFIG_DRM_VMWGFX) := vmwgfx.o
diff --git a/drivers/gpu/drm/vmwgfx/vmwgfx_bo.c b/drivers/gpu/drm/vmwgfx/vmwgfx_bo.c
index aad8d8140259..869aeaec2f86 100644
--- a/drivers/gpu/drm/vmwgfx/vmwgfx_bo.c
+++ b/drivers/gpu/drm/vmwgfx/vmwgfx_bo.c
@@ -462,6 +462,7 @@ void vmw_bo_bo_free(struct ttm_buffer_object *bo)
 {
 	struct vmw_buffer_object *vmw_bo = vmw_buffer_object(bo);
 
+	WARN_ON(vmw_bo->dirty);
 	vmw_bo_unmap(vmw_bo);
 	kfree(vmw_bo);
 }
@@ -475,8 +476,10 @@ void vmw_bo_bo_free(struct ttm_buffer_object *bo)
 static void vmw_user_bo_destroy(struct ttm_buffer_object *bo)
 {
 	struct vmw_user_buffer_object *vmw_user_bo = vmw_user_buffer_object(bo);
+	struct vmw_buffer_object *vbo = &vmw_user_bo->vbo;
 
-	vmw_bo_unmap(&vmw_user_bo->vbo);
+	WARN_ON(vbo->dirty);
+	vmw_bo_unmap(vbo);
 	ttm_prime_object_kfree(vmw_user_bo, prime);
 }
 
diff --git a/drivers/gpu/drm/vmwgfx/vmwgfx_drv.h b/drivers/gpu/drm/vmwgfx/vmwgfx_drv.h
index 5eb73ded8e07..7944dbbbdd72 100644
--- a/drivers/gpu/drm/vmwgfx/vmwgfx_drv.h
+++ b/drivers/gpu/drm/vmwgfx/vmwgfx_drv.h
@@ -105,6 +105,7 @@ struct vmw_fpriv {
  * @dx_query_ctx: DX context if this buffer object is used as a DX query MOB
  * @map: Kmap object for semi-persistent mappings
  * @res_prios: Eviction priority counts for attached resources
+ * @dirty: structure for user-space dirty-tracking
  */
 struct vmw_buffer_object {
 	struct ttm_buffer_object base;
@@ -115,6 +116,7 @@ struct vmw_buffer_object {
 	/* Protected by reservation */
 	struct ttm_bo_kmap_obj map;
 	u32 res_prios[TTM_MAX_BO_PRIORITY];
+	struct vmw_bo_dirty *dirty;
 };
 
 /**
@@ -145,7 +147,8 @@ struct vmw_res_func;
  * @res_dirty: Resource contains data not yet in the backup buffer. Protected
  * by resource reserved.
  * @backup_dirty: Backup buffer contains data not yet in the HW resource.
- * Protecte by resource reserved.
+ * Protected by resource reserved.
+ * @coherent: Emulate coherency by tracking vm accesses.
  * @backup: The backup buffer if any. Protected by resource reserved.
  * @backup_offset: Offset into the backup buffer if any. Protected by resource
  * reserved. Note that only a few resource types can have a @backup_offset
@@ -162,14 +165,16 @@ struct vmw_res_func;
  * @hw_destroy: Callback to destroy the resource on the device, as part of
  * resource destruction.
  */
+struct vmw_resource_dirty;
 struct vmw_resource {
 	struct kref kref;
 	struct vmw_private *dev_priv;
 	int id;
 	u32 used_prio;
 	unsigned long backup_size;
-	bool res_dirty;
-	bool backup_dirty;
+	u32 res_dirty : 1;
+	u32 backup_dirty : 1;
+	u32 coherent : 1;
 	struct vmw_buffer_object *backup;
 	unsigned long backup_offset;
 	unsigned long pin_count;
@@ -177,6 +182,7 @@ struct vmw_resource {
 	struct list_head lru_head;
 	struct list_head mob_head;
 	struct list_head binding_head;
+	struct vmw_resource_dirty *dirty;
 	void (*res_free) (struct vmw_resource *res);
 	void (*hw_destroy) (struct vmw_resource *res);
 };
@@ -716,6 +722,8 @@ extern void vmw_resource_evict_all(struct vmw_private *dev_priv);
 extern void vmw_resource_unbind_list(struct vmw_buffer_object *vbo);
 void vmw_resource_mob_attach(struct vmw_resource *res);
 void vmw_resource_mob_detach(struct vmw_resource *res);
+void vmw_resource_dirty_update(struct vmw_resource *res, pgoff_t start,
+			       pgoff_t end);
 
 /**
  * vmw_resource_mob_attached - Whether a resource currently has a mob attached
@@ -1403,6 +1411,15 @@ int vmw_host_log(const char *log);
 #define VMW_DEBUG_USER(fmt, ...)                                              \
 	DRM_DEBUG_DRIVER(fmt, ##__VA_ARGS__)
 
+/* Resource dirtying - vmwgfx_page_dirty.c */
+void vmw_bo_dirty_scan(struct vmw_buffer_object *vbo);
+int vmw_bo_dirty_add(struct vmw_buffer_object *vbo);
+void vmw_bo_dirty_transfer_to_res(struct vmw_resource *res);
+void vmw_bo_dirty_clear_res(struct vmw_resource *res);
+void vmw_bo_dirty_release(struct vmw_buffer_object *vbo);
+vm_fault_t vmw_bo_vm_fault(struct vm_fault *vmf);
+vm_fault_t vmw_bo_vm_mkwrite(struct vm_fault *vmf);
+
 /**
  * VMW_DEBUG_KMS - Debug output for kernel mode-setting
  *
diff --git a/drivers/gpu/drm/vmwgfx/vmwgfx_execbuf.c b/drivers/gpu/drm/vmwgfx/vmwgfx_execbuf.c
index ff86d49dc5e8..934ad7c0c342 100644
--- a/drivers/gpu/drm/vmwgfx/vmwgfx_execbuf.c
+++ b/drivers/gpu/drm/vmwgfx/vmwgfx_execbuf.c
@@ -2560,7 +2560,6 @@ static int vmw_cmd_dx_check_subresource(struct vmw_private *dev_priv,
 		     offsetof(typeof(*cmd), sid));
 
 	cmd = container_of(header, typeof(*cmd), header);
-
 	return vmw_cmd_res_check(dev_priv, sw_context, vmw_res_surface,
 				 VMW_RES_DIRTY_NONE, user_surface_converter,
 				 &cmd->sid, NULL);
diff --git a/drivers/gpu/drm/vmwgfx/vmwgfx_page_dirty.c b/drivers/gpu/drm/vmwgfx/vmwgfx_page_dirty.c
new file mode 100644
index 000000000000..be3302a8e309
--- /dev/null
+++ b/drivers/gpu/drm/vmwgfx/vmwgfx_page_dirty.c
@@ -0,0 +1,417 @@
+// SPDX-License-Identifier: GPL-2.0 OR MIT
+/**************************************************************************
+ *
+ * Copyright 2019 VMware, Inc., Palo Alto, CA., USA
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a
+ * copy of this software and associated documentation files (the
+ * "Software"), to deal in the Software without restriction, including
+ * without limitation the rights to use, copy, modify, merge, publish,
+ * distribute, sub license, and/or sell copies of the Software, and to
+ * permit persons to whom the Software is furnished to do so, subject to
+ * the following conditions:
+ *
+ * The above copyright notice and this permission notice (including the
+ * next paragraph) shall be included in all copies or substantial portions
+ * of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NON-INFRINGEMENT. IN NO EVENT SHALL
+ * THE COPYRIGHT HOLDERS, AUTHORS AND/OR ITS SUPPLIERS BE LIABLE FOR ANY CLAIM,
+ * DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR
+ * OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE
+ * USE OR OTHER DEALINGS IN THE SOFTWARE.
+ *
+ **************************************************************************/
+#include "vmwgfx_drv.h"
+
+/*
+ * Different methods for tracking dirty:
+ * VMW_BO_DIRTY_PAGETABLE - Scan the pagetable for hardware dirty bits
+ * VMW_BO_DIRTY_MKWRITE - Write-protect page table entries and record write-
+ * accesses in the VM mkwrite() callback
+ */
+enum vmw_bo_dirty_method {
+	VMW_BO_DIRTY_PAGETABLE,
+	VMW_BO_DIRTY_MKWRITE,
+};
+
+/*
+ * No dirtied pages at scan trigger a transition to the _MKWRITE method,
+ * similarly a certain percentage of dirty pages trigger a transition to
+ * the _PAGETABLE method. How many triggers should we wait for before
+ * changing method?
+ */
+#define VMW_DIRTY_NUM_CHANGE_TRIGGERS 2
+
+/* Percentage to trigger a transition to the _PAGETABLE method */
+#define VMW_DIRTY_PERCENTAGE 10
+
+/**
+ * struct vmw_bo_dirty - Dirty information for buffer objects
+ * @start: First currently dirty bit
+ * @end: Last currently dirty bit + 1
+ * @method: The currently used dirty method
+ * @change_count: Number of consecutive method change triggers
+ * @ref_count: Reference count for this structure
+ * @bitmap_size: The size of the bitmap in bits. Typically equal to the
+ * nuber of pages in the bo.
+ * @size: The accounting size for this struct.
+ * @bitmap: A bitmap where each bit represents a page. A set bit means a
+ * dirty page.
+ */
+struct vmw_bo_dirty {
+	unsigned long start;
+	unsigned long end;
+	enum vmw_bo_dirty_method method;
+	unsigned int change_count;
+	unsigned int ref_count;
+	unsigned long bitmap_size;
+	size_t size;
+	unsigned long bitmap[0];
+};
+
+/**
+ * vmw_bo_dirty_scan_pagetable - Perform a pagetable scan for dirty bits
+ * @vbo: The buffer object to scan
+ *
+ * Scans the pagetable for dirty bits. Clear those bits and modify the
+ * dirty structure with the results. This function may change the
+ * dirty-tracking method.
+ */
+static void vmw_bo_dirty_scan_pagetable(struct vmw_buffer_object *vbo)
+{
+	struct vmw_bo_dirty *dirty = vbo->dirty;
+	pgoff_t offset = drm_vma_node_start(&vbo->base.base.vma_node);
+	struct address_space *mapping = vbo->base.bdev->dev_mapping;
+	pgoff_t num_marked;
+
+	num_marked = apply_as_clean(mapping,
+				    offset, dirty->bitmap_size,
+				    offset, &dirty->bitmap[0],
+				    &dirty->start, &dirty->end);
+	if (num_marked == 0)
+		dirty->change_count++;
+	else
+		dirty->change_count = 0;
+
+	if (dirty->change_count > VMW_DIRTY_NUM_CHANGE_TRIGGERS) {
+		dirty->change_count = 0;
+		dirty->method = VMW_BO_DIRTY_MKWRITE;
+		apply_as_wrprotect(mapping,
+				   offset, dirty->bitmap_size);
+		apply_as_clean(mapping,
+			       offset, dirty->bitmap_size,
+			       offset, &dirty->bitmap[0],
+			       &dirty->start, &dirty->end);
+	}
+}
+
+/**
+ * vmw_bo_dirty_scan_mkwrite - Reset the mkwrite dirty-tracking method
+ * @vbo: The buffer object to scan
+ *
+ * Write-protect pages written to so that consecutive write accesses will
+ * trigger a call to mkwrite.
+ *
+ * This function may change the dirty-tracking method.
+ */
+static void vmw_bo_dirty_scan_mkwrite(struct vmw_buffer_object *vbo)
+{
+	struct vmw_bo_dirty *dirty = vbo->dirty;
+	unsigned long offset = drm_vma_node_start(&vbo->base.base.vma_node);
+	struct address_space *mapping = vbo->base.bdev->dev_mapping;
+	pgoff_t num_marked;
+
+	if (dirty->end <= dirty->start)
+		return;
+
+	num_marked = apply_as_wrprotect(vbo->base.bdev->dev_mapping,
+					dirty->start + offset,
+					dirty->end - dirty->start);
+
+	if (100UL * num_marked / dirty->bitmap_size >
+	    VMW_DIRTY_PERCENTAGE) {
+		dirty->change_count++;
+	} else {
+		dirty->change_count = 0;
+	}
+
+	if (dirty->change_count > VMW_DIRTY_NUM_CHANGE_TRIGGERS) {
+		pgoff_t start = 0;
+		pgoff_t end = dirty->bitmap_size;
+
+		dirty->method = VMW_BO_DIRTY_PAGETABLE;
+		apply_as_clean(mapping, offset, end, offset, &dirty->bitmap[0],
+			       &start, &end);
+		bitmap_clear(&dirty->bitmap[0], 0, dirty->bitmap_size);
+		if (dirty->start < dirty->end)
+			bitmap_set(&dirty->bitmap[0], dirty->start,
+				   dirty->end - dirty->start);
+		dirty->change_count = 0;
+	}
+}
+
+
+/**
+ * vmw_bo_dirty_scan - Scan for dirty pages and add them to the dirty
+ * tracking structure
+ * @vbo: The buffer object to scan
+ *
+ * This function may change the dirty tracking method.
+ */
+void vmw_bo_dirty_scan(struct vmw_buffer_object *vbo)
+{
+	struct vmw_bo_dirty *dirty = vbo->dirty;
+
+	if (dirty->method == VMW_BO_DIRTY_PAGETABLE)
+		vmw_bo_dirty_scan_pagetable(vbo);
+	else
+		vmw_bo_dirty_scan_mkwrite(vbo);
+}
+
+/**
+ * vmw_bo_dirty_add - Add a dirty-tracking user to a buffer object
+ * @vbo: The buffer object
+ *
+ * This function registers a dirty-tracking user to a buffer object.
+ * A user can be for example a resource or a vma in a special user-space
+ * mapping.
+ *
+ * Return: Zero on success, -ENOMEM on memory allocation failure.
+ */
+int vmw_bo_dirty_add(struct vmw_buffer_object *vbo)
+{
+	struct vmw_bo_dirty *dirty = vbo->dirty;
+	pgoff_t num_pages = vbo->base.num_pages;
+	size_t size, acc_size;
+	int ret;
+	static struct ttm_operation_ctx ctx = {
+		.interruptible = false,
+		.no_wait_gpu = false
+	};
+
+	if (dirty) {
+		dirty->ref_count++;
+		return 0;
+	}
+
+	size = sizeof(*dirty) + BITS_TO_LONGS(num_pages) * sizeof(long);
+	acc_size = ttm_round_pot(size);
+	ret = ttm_mem_global_alloc(&ttm_mem_glob, acc_size, &ctx);
+	if (ret) {
+		VMW_DEBUG_USER("Out of graphics memory for buffer object "
+			       "dirty tracker.\n");
+		return ret;
+	}
+	dirty = kvzalloc(size, GFP_KERNEL);
+	if (!dirty) {
+		ret = -ENOMEM;
+		goto out_no_dirty;
+	}
+
+	dirty->size = acc_size;
+	dirty->bitmap_size = num_pages;
+	dirty->start = dirty->bitmap_size;
+	dirty->end = 0;
+	dirty->ref_count = 1;
+	if (num_pages < PAGE_SIZE / sizeof(pte_t)) {
+		dirty->method = VMW_BO_DIRTY_PAGETABLE;
+	} else {
+		struct address_space *mapping = vbo->base.bdev->dev_mapping;
+		pgoff_t offset = drm_vma_node_start(&vbo->base.base.vma_node);
+
+		dirty->method = VMW_BO_DIRTY_MKWRITE;
+
+		/* Write-protect and then pick up already dirty bits */
+		apply_as_wrprotect(mapping, offset, num_pages);
+		apply_as_clean(mapping, offset, num_pages, offset,
+			       &dirty->bitmap[0], &dirty->start, &dirty->end);
+	}
+
+	vbo->dirty = dirty;
+
+	return 0;
+
+out_no_dirty:
+	ttm_mem_global_free(&ttm_mem_glob, acc_size);
+	return ret;
+}
+
+/**
+ * vmw_bo_dirty_release - Release a dirty-tracking user from a buffer object
+ * @vbo: The buffer object
+ *
+ * This function releases a dirty-tracking user from a buffer object.
+ * If the reference count reaches zero, then the dirty-tracking object is
+ * freed and the pointer to it cleared.
+ *
+ * Return: Zero on success, -ENOMEM on memory allocation failure.
+ */
+void vmw_bo_dirty_release(struct vmw_buffer_object *vbo)
+{
+	struct vmw_bo_dirty *dirty = vbo->dirty;
+
+	if (dirty && --dirty->ref_count == 0) {
+		size_t acc_size = dirty->size;
+
+		kvfree(dirty);
+		ttm_mem_global_free(&ttm_mem_glob, acc_size);
+		vbo->dirty = NULL;
+	}
+}
+
+/**
+ * vmw_bo_dirty_transfer_to_res - Pick up a resource's dirty region from
+ * its backing mob.
+ * @res: The resource
+ *
+ * This function will pick up all dirty ranges affecting the resource from
+ * it's backup mob, and call vmw_resource_dirty_update() once for each
+ * range. The transferred ranges will be cleared from the backing mob's
+ * dirty tracking.
+ */
+void vmw_bo_dirty_transfer_to_res(struct vmw_resource *res)
+{
+	struct vmw_buffer_object *vbo = res->backup;
+	struct vmw_bo_dirty *dirty = vbo->dirty;
+	pgoff_t start, cur, end;
+	unsigned long res_start = res->backup_offset;
+	unsigned long res_end = res->backup_offset + res->backup_size;
+
+	WARN_ON_ONCE(res_start & ~PAGE_MASK);
+	res_start >>= PAGE_SHIFT;
+	res_end = DIV_ROUND_UP(res_end, PAGE_SIZE);
+
+	if (res_start >= dirty->end || res_end <= dirty->start)
+		return;
+
+	cur = max(res_start, dirty->start);
+	res_end = max(res_end, dirty->end);
+	while (cur < res_end) {
+		unsigned long num;
+
+		start = find_next_bit(&dirty->bitmap[0], res_end, cur);
+		if (start >= res_end)
+			break;
+
+		end = find_next_zero_bit(&dirty->bitmap[0], res_end, start + 1);
+		cur = end + 1;
+		num = end - start;
+		bitmap_clear(&dirty->bitmap[0], start, num);
+		vmw_resource_dirty_update(res, start, end);
+	}
+
+	if (res_start <= dirty->start && res_end > dirty->start)
+		dirty->start = res_end;
+	if (res_start < dirty->end && res_end >= dirty->end)
+		dirty->end = res_start;
+}
+
+/**
+ * vmw_bo_dirty_clear_res - Clear a resource's dirty region from
+ * its backing mob.
+ * @res: The resource
+ *
+ * This function will clear all dirty ranges affecting the resource from
+ * it's backup mob's dirty tracking.
+ */
+void vmw_bo_dirty_clear_res(struct vmw_resource *res)
+{
+	unsigned long res_start = res->backup_offset;
+	unsigned long res_end = res->backup_offset + res->backup_size;
+	struct vmw_buffer_object *vbo = res->backup;
+	struct vmw_bo_dirty *dirty = vbo->dirty;
+
+	res_start >>= PAGE_SHIFT;
+	res_end = DIV_ROUND_UP(res_end, PAGE_SIZE);
+
+	if (res_start >= dirty->end || res_end <= dirty->start)
+		return;
+
+	res_start = max(res_start, dirty->start);
+	res_end = min(res_end, dirty->end);
+	bitmap_clear(&dirty->bitmap[0], res_start, res_end - res_start);
+
+	if (res_start <= dirty->start && res_end > dirty->start)
+		dirty->start = res_end;
+	if (res_start < dirty->end && res_end >= dirty->end)
+		dirty->end = res_start;
+}
+
+vm_fault_t vmw_bo_vm_mkwrite(struct vm_fault *vmf)
+{
+	struct vm_area_struct *vma = vmf->vma;
+	struct ttm_buffer_object *bo = (struct ttm_buffer_object *)
+	    vma->vm_private_data;
+	vm_fault_t ret;
+	unsigned long page_offset;
+	unsigned int save_flags;
+	struct vmw_buffer_object *vbo =
+		container_of(bo, typeof(*vbo), base);
+
+	/*
+	 * mkwrite() doesn't handle the VM_FAULT_RETRY return value correctly.
+	 * So make sure the TTM helpers are aware.
+	 */
+	save_flags = vmf->flags;
+	vmf->flags &= ~FAULT_FLAG_ALLOW_RETRY;
+	ret = ttm_bo_vm_reserve(bo, vmf);
+	vmf->flags = save_flags;
+	if (ret)
+		return ret;
+
+	page_offset = vmf->pgoff - drm_vma_node_start(&bo->base.vma_node);
+	if (unlikely(page_offset >= bo->num_pages)) {
+		ret = VM_FAULT_SIGBUS;
+		goto out_unlock;
+	}
+
+	if (vbo->dirty && vbo->dirty->method == VMW_BO_DIRTY_MKWRITE &&
+	    !test_bit(page_offset, &vbo->dirty->bitmap[0])) {
+		struct vmw_bo_dirty *dirty = vbo->dirty;
+
+		__set_bit(page_offset, &dirty->bitmap[0]);
+		dirty->start = min(dirty->start, page_offset);
+		dirty->end = max(dirty->end, page_offset + 1);
+	}
+
+out_unlock:
+	dma_resv_unlock(bo->base.resv);
+	return ret;
+}
+
+vm_fault_t vmw_bo_vm_fault(struct vm_fault *vmf)
+{
+	struct vm_area_struct *vma = vmf->vma;
+	struct ttm_buffer_object *bo = (struct ttm_buffer_object *)
+	    vma->vm_private_data;
+	struct vmw_buffer_object *vbo =
+		container_of(bo, struct vmw_buffer_object, base);
+	pgoff_t num_prefault;
+	pgprot_t prot;
+	vm_fault_t ret;
+
+	ret = ttm_bo_vm_reserve(bo, vmf);
+	if (ret)
+		return ret;
+
+	/*
+	 * This will cause mkwrite() to be called for each pte on
+	 * write-enable vmas.
+	 */
+	if (vbo->dirty && vbo->dirty->method == VMW_BO_DIRTY_MKWRITE)
+		prot = vma->vm_page_prot;
+	else
+		prot = vm_get_page_prot(vma->vm_flags);
+
+	num_prefault = (vma->vm_flags & VM_RAND_READ) ? 0 :
+		TTM_BO_VM_NUM_PREFAULT;
+	ret = ttm_bo_vm_fault_reserved(vmf, prot, num_prefault);
+	if (ret == VM_FAULT_RETRY && !(vmf->flags & FAULT_FLAG_RETRY_NOWAIT))
+		return ret;
+
+	dma_resv_unlock(bo->base.resv);
+	return ret;
+}
diff --git a/drivers/gpu/drm/vmwgfx/vmwgfx_resource.c b/drivers/gpu/drm/vmwgfx/vmwgfx_resource.c
index 5581a7826b4c..e4c97a4cf2ff 100644
--- a/drivers/gpu/drm/vmwgfx/vmwgfx_resource.c
+++ b/drivers/gpu/drm/vmwgfx/vmwgfx_resource.c
@@ -119,6 +119,10 @@ static void vmw_resource_release(struct kref *kref)
 		}
 		res->backup_dirty = false;
 		vmw_resource_mob_detach(res);
+		if (res->dirty)
+			res->func->dirty_free(res);
+		if (res->coherent)
+			vmw_bo_dirty_release(res->backup);
 		ttm_bo_unreserve(bo);
 		vmw_bo_unreference(&res->backup);
 	}
@@ -208,7 +212,9 @@ int vmw_resource_init(struct vmw_private *dev_priv, struct vmw_resource *res,
 	res->backup_offset = 0;
 	res->backup_dirty = false;
 	res->res_dirty = false;
+	res->coherent = false;
 	res->used_prio = 3;
+	res->dirty = NULL;
 	if (delay_id)
 		return 0;
 	else
@@ -395,6 +401,30 @@ static int vmw_resource_do_validate(struct vmw_resource *res,
 			vmw_resource_mob_attach(res);
 	}
 
+	/*
+	 * Handle the case where the backup mob is marked coherent but
+	 * the resource isn't.
+	 */
+	if (func->dirty_alloc && vmw_resource_mob_attached(res) &&
+	    !res->coherent) {
+		if (res->backup->dirty && !res->dirty) {
+			ret = func->dirty_alloc(res);
+			if (ret)
+				return ret;
+		} else if (!res->backup->dirty && res->dirty) {
+			func->dirty_free(res);
+		}
+	}
+
+	/*
+	 * Transfer the dirty regions to the resource and update
+	 * the resource.
+	 */
+	if (res->dirty) {
+		vmw_bo_dirty_transfer_to_res(res);
+		return func->dirty_sync(res);
+	}
+
 	return 0;
 
 out_bind_failed:
@@ -433,16 +463,28 @@ void vmw_resource_unreserve(struct vmw_resource *res,
 	if (switch_backup && new_backup != res->backup) {
 		if (res->backup) {
 			vmw_resource_mob_detach(res);
+			if (res->coherent)
+				vmw_bo_dirty_release(res->backup);
 			vmw_bo_unreference(&res->backup);
 		}
 
 		if (new_backup) {
 			res->backup = vmw_bo_reference(new_backup);
+
+			/*
+			 * The validation code should already have added a
+			 * dirty tracker here.
+			 */
+			WARN_ON(res->coherent && !new_backup->dirty);
+
 			vmw_resource_mob_attach(res);
 		} else {
 			res->backup = NULL;
 		}
+	} else if (switch_backup && res->coherent) {
+		vmw_bo_dirty_release(res->backup);
 	}
+
 	if (switch_backup)
 		res->backup_offset = new_backup_offset;
 
@@ -1008,3 +1050,18 @@ enum vmw_res_type vmw_res_type(const struct vmw_resource *res)
 {
 	return res->func->res_type;
 }
+
+/**
+ * vmw_resource_update_dirty - Update a resource's dirty tracker with a
+ * sequential range of touched backing store memory.
+ * @res: The resource.
+ * @start: The first page touched.
+ * @end: The last page touched + 1.
+ */
+void vmw_resource_dirty_update(struct vmw_resource *res, pgoff_t start,
+			       pgoff_t end)
+{
+	if (res->dirty)
+		res->func->dirty_range_add(res, start << PAGE_SHIFT,
+					   end << PAGE_SHIFT);
+}
diff --git a/drivers/gpu/drm/vmwgfx/vmwgfx_resource_priv.h b/drivers/gpu/drm/vmwgfx/vmwgfx_resource_priv.h
index 984e588c62ca..c85144286cfe 100644
--- a/drivers/gpu/drm/vmwgfx/vmwgfx_resource_priv.h
+++ b/drivers/gpu/drm/vmwgfx/vmwgfx_resource_priv.h
@@ -71,6 +71,12 @@ struct vmw_user_resource_conv {
  * @commit_notify:     If the resource is a command buffer managed resource,
  *                     callback to notify that a define or remove command
  *                     has been committed to the device.
+ * @dirty_alloc:       Allocate a dirty tracker. NULL if dirty-tracking is not
+ *                     supported.
+ * @dirty_free:        Free the dirty tracker.
+ * @dirty_sync:        Upload the dirty mob contents to the resource.
+ * @dirty_add_range:   Add a sequential dirty range to the resource
+ *                     dirty tracker.
  */
 struct vmw_res_func {
 	enum vmw_res_type res_type;
@@ -90,6 +96,11 @@ struct vmw_res_func {
 		       struct ttm_validate_buffer *val_buf);
 	void (*commit_notify)(struct vmw_resource *res,
 			      enum vmw_cmdbuf_res_state state);
+	int (*dirty_alloc)(struct vmw_resource *res);
+	void (*dirty_free)(struct vmw_resource *res);
+	int (*dirty_sync)(struct vmw_resource *res);
+	void (*dirty_range_add)(struct vmw_resource *res, size_t start,
+				 size_t end);
 };
 
 /**
diff --git a/drivers/gpu/drm/vmwgfx/vmwgfx_ttm_glue.c b/drivers/gpu/drm/vmwgfx/vmwgfx_ttm_glue.c
index 5a7b8bb420de..ce288756531b 100644
--- a/drivers/gpu/drm/vmwgfx/vmwgfx_ttm_glue.c
+++ b/drivers/gpu/drm/vmwgfx/vmwgfx_ttm_glue.c
@@ -29,10 +29,23 @@
 
 int vmw_mmap(struct file *filp, struct vm_area_struct *vma)
 {
+	static const struct vm_operations_struct vmw_vm_ops = {
+		.pfn_mkwrite = vmw_bo_vm_mkwrite,
+		.page_mkwrite = vmw_bo_vm_mkwrite,
+		.fault = vmw_bo_vm_fault,
+		.open = ttm_bo_vm_open,
+		.close = ttm_bo_vm_close
+	};
 	struct drm_file *file_priv = filp->private_data;
 	struct vmw_private *dev_priv = vmw_priv(file_priv->minor->dev);
+	int ret = ttm_bo_mmap(filp, vma, &dev_priv->bdev);
 
-	return ttm_bo_mmap(filp, vma, &dev_priv->bdev);
+	if (ret)
+		return ret;
+
+	vma->vm_ops = &vmw_vm_ops;
+
+	return 0;
 }
 
 /* struct vmw_validation_mem callback */
diff --git a/drivers/gpu/drm/vmwgfx/vmwgfx_validation.c b/drivers/gpu/drm/vmwgfx/vmwgfx_validation.c
index f611b2290a1b..71349a7bae90 100644
--- a/drivers/gpu/drm/vmwgfx/vmwgfx_validation.c
+++ b/drivers/gpu/drm/vmwgfx/vmwgfx_validation.c
@@ -33,6 +33,8 @@
  * struct vmw_validation_bo_node - Buffer object validation metadata.
  * @base: Metadata used for TTM reservation- and validation.
  * @hash: A hash entry used for the duplicate detection hash table.
+ * @coherent_count: If switching backup buffers, number of new coherent
+ * resources that will have this buffer as a backup buffer.
  * @as_mob: Validate as mob.
  * @cpu_blit: Validate for cpu blit access.
  *
@@ -42,6 +44,7 @@
 struct vmw_validation_bo_node {
 	struct ttm_validate_buffer base;
 	struct drm_hash_item hash;
+	unsigned int coherent_count;
 	u32 as_mob : 1;
 	u32 cpu_blit : 1;
 };
@@ -459,6 +462,19 @@ int vmw_validation_res_reserve(struct vmw_validation_context *ctx,
 			if (ret)
 				goto out_unreserve;
 		}
+
+		if (val->switching_backup && val->new_backup &&
+		    res->coherent) {
+			struct vmw_validation_bo_node *bo_node =
+				vmw_validation_find_bo_dup(ctx,
+							   val->new_backup);
+
+			if (WARN_ON(!bo_node)) {
+				ret = -EINVAL;
+				goto out_unreserve;
+			}
+			bo_node->coherent_count++;
+		}
 	}
 
 	return 0;
@@ -562,6 +578,9 @@ int vmw_validation_bo_validate(struct vmw_validation_context *ctx, bool intr)
 	int ret;
 
 	list_for_each_entry(entry, &ctx->bo_list, base.head) {
+		struct vmw_buffer_object *vbo =
+			container_of(entry->base.bo, typeof(*vbo), base);
+
 		if (entry->cpu_blit) {
 			struct ttm_operation_ctx ctx = {
 				.interruptible = intr,
@@ -576,6 +595,27 @@ int vmw_validation_bo_validate(struct vmw_validation_context *ctx, bool intr)
 		}
 		if (ret)
 			return ret;
+
+		/*
+		 * Rather than having the resource code allocating the bo
+		 * dirty tracker in resource_unreserve() where we can't fail,
+		 * Do it here when validating the buffer object.
+		 */
+		if (entry->coherent_count) {
+			unsigned int coherent_count = entry->coherent_count;
+
+			while (coherent_count) {
+				ret = vmw_bo_dirty_add(vbo);
+				if (ret)
+					return ret;
+
+				coherent_count--;
+			}
+			entry->coherent_count -= coherent_count;
+		}
+
+		if (vbo->dirty)
+			vmw_bo_dirty_scan(vbo);
 	}
 	return 0;
 }
@@ -828,3 +868,34 @@ int vmw_validation_preload_res(struct vmw_validation_context *ctx,
 	ctx->mem_size_left += size;
 	return 0;
 }
+
+/**
+ * vmw_validation_bo_backoff - Unreserve buffer objects registered with a
+ * validation context
+ * @ctx: The validation context
+ *
+ * This function unreserves the buffer objects previously reserved using
+ * vmw_validation_bo_reserve. It's typically used as part of an error path
+ */
+void vmw_validation_bo_backoff(struct vmw_validation_context *ctx)
+{
+	struct vmw_validation_bo_node *entry;
+
+	/*
+	 * Switching coherent resource backup buffers failed.
+	 * Release corresponding buffer object dirty trackers.
+	 */
+	list_for_each_entry(entry, &ctx->bo_list, base.head) {
+		if (entry->coherent_count) {
+			unsigned int coherent_count = entry->coherent_count;
+			struct vmw_buffer_object *vbo =
+				container_of(entry->base.bo, typeof(*vbo),
+					     base);
+
+			while (coherent_count--)
+				vmw_bo_dirty_release(vbo);
+		}
+	}
+
+	ttm_eu_backoff_reservation(&ctx->ticket, &ctx->bo_list);
+}
diff --git a/drivers/gpu/drm/vmwgfx/vmwgfx_validation.h b/drivers/gpu/drm/vmwgfx/vmwgfx_validation.h
index 0e063743dd86..4cee3f732588 100644
--- a/drivers/gpu/drm/vmwgfx/vmwgfx_validation.h
+++ b/drivers/gpu/drm/vmwgfx/vmwgfx_validation.h
@@ -173,20 +173,6 @@ vmw_validation_bo_reserve(struct vmw_validation_context *ctx,
 				      NULL, true);
 }
 
-/**
- * vmw_validation_bo_backoff - Unreserve buffer objects registered with a
- * validation context
- * @ctx: The validation context
- *
- * This function unreserves the buffer objects previously reserved using
- * vmw_validation_bo_reserve. It's typically used as part of an error path
- */
-static inline void
-vmw_validation_bo_backoff(struct vmw_validation_context *ctx)
-{
-	ttm_eu_backoff_reservation(&ctx->ticket, &ctx->bo_list);
-}
-
 /**
  * vmw_validation_bo_fence - Unreserve and fence buffer objects registered
  * with a validation context
@@ -269,4 +255,6 @@ int vmw_validation_preload_res(struct vmw_validation_context *ctx,
 			       unsigned int size);
 void vmw_validation_res_set_dirty(struct vmw_validation_context *ctx,
 				  void *val_private, u32 dirty);
+void vmw_validation_bo_backoff(struct vmw_validation_context *ctx);
+
 #endif
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH v2 3/5] drm/vmwgfx: Use an RBtree instead of linked list for MOB resources
  2019-09-26 11:55 [PATCH v2 0/5] Emulated coherent graphics memory take 2 Thomas Hellström (VMware)
  2019-09-26 11:55 ` [PATCH v2 1/5] mm: Add write-protect and clean utilities for address space ranges Thomas Hellström (VMware)
  2019-09-26 11:55 ` [PATCH v2 2/5] drm/vmwgfx: Implement an infrastructure for write-coherent resources Thomas Hellström (VMware)
@ 2019-09-26 11:55 ` Thomas Hellström (VMware)
  2019-09-26 11:55 ` [PATCH v2 4/5] drm/vmwgfx: Implement an infrastructure for read-coherent resources Thomas Hellström (VMware)
  2019-09-26 11:55 ` [PATCH v2 5/5] drm/vmwgfx: Add surface dirty-tracking callbacks Thomas Hellström (VMware)
  4 siblings, 0 replies; 23+ messages in thread
From: Thomas Hellström (VMware) @ 2019-09-26 11:55 UTC (permalink / raw)
  To: linux-kernel, dri-devel, linux-mm
  Cc: pv-drivers, linux-graphics-maintainer, Thomas Hellstrom,
	Andrew Morton, Matthew Wilcox, Will Deacon, Peter Zijlstra,
	Rik van Riel, Minchan Kim, Michal Hocko, Huang Ying,
	Souptick Joarder, Jérôme Glisse, Christian König,
	Christoph Hellwig, Deepak Rawat

From: Thomas Hellstrom <thellstrom@vmware.com>

With emulated coherent memory we need to be able to quickly look up
a resource from the MOB offset. Instead of traversing a linked list with
O(n) worst case, use an RBtree with O(log n) worst case complexity.

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Souptick Joarder <jrdr.linux@gmail.com>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: "Christian König" <christian.koenig@amd.com>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Thomas Hellstrom <thellstrom@vmware.com>
Reviewed-by: Deepak Rawat <drawat@vmware.com>
---
 drivers/gpu/drm/vmwgfx/vmwgfx_bo.c       |  5 ++--
 drivers/gpu/drm/vmwgfx/vmwgfx_drv.h      | 10 +++----
 drivers/gpu/drm/vmwgfx/vmwgfx_resource.c | 33 +++++++++++++++++-------
 3 files changed, 32 insertions(+), 16 deletions(-)

diff --git a/drivers/gpu/drm/vmwgfx/vmwgfx_bo.c b/drivers/gpu/drm/vmwgfx/vmwgfx_bo.c
index 869aeaec2f86..18e4b329e563 100644
--- a/drivers/gpu/drm/vmwgfx/vmwgfx_bo.c
+++ b/drivers/gpu/drm/vmwgfx/vmwgfx_bo.c
@@ -463,6 +463,7 @@ void vmw_bo_bo_free(struct ttm_buffer_object *bo)
 	struct vmw_buffer_object *vmw_bo = vmw_buffer_object(bo);
 
 	WARN_ON(vmw_bo->dirty);
+	WARN_ON(!RB_EMPTY_ROOT(&vmw_bo->res_tree));
 	vmw_bo_unmap(vmw_bo);
 	kfree(vmw_bo);
 }
@@ -479,6 +480,7 @@ static void vmw_user_bo_destroy(struct ttm_buffer_object *bo)
 	struct vmw_buffer_object *vbo = &vmw_user_bo->vbo;
 
 	WARN_ON(vbo->dirty);
+	WARN_ON(!RB_EMPTY_ROOT(&vbo->res_tree));
 	vmw_bo_unmap(vbo);
 	ttm_prime_object_kfree(vmw_user_bo, prime);
 }
@@ -514,8 +516,7 @@ int vmw_bo_init(struct vmw_private *dev_priv,
 	memset(vmw_bo, 0, sizeof(*vmw_bo));
 	BUILD_BUG_ON(TTM_MAX_BO_PRIORITY <= 3);
 	vmw_bo->base.priority = 3;
-
-	INIT_LIST_HEAD(&vmw_bo->res_list);
+	vmw_bo->res_tree = RB_ROOT;
 
 	ret = ttm_bo_init(bdev, &vmw_bo->base, size,
 			  ttm_bo_type_device, placement,
diff --git a/drivers/gpu/drm/vmwgfx/vmwgfx_drv.h b/drivers/gpu/drm/vmwgfx/vmwgfx_drv.h
index 7944dbbbdd72..53f8522ae032 100644
--- a/drivers/gpu/drm/vmwgfx/vmwgfx_drv.h
+++ b/drivers/gpu/drm/vmwgfx/vmwgfx_drv.h
@@ -100,7 +100,7 @@ struct vmw_fpriv {
 /**
  * struct vmw_buffer_object - TTM buffer object with vmwgfx additions
  * @base: The TTM buffer object
- * @res_list: List of resources using this buffer object as a backing MOB
+ * @res_tree: RB tree of resources using this buffer object as a backing MOB
  * @pin_count: pin depth
  * @dx_query_ctx: DX context if this buffer object is used as a DX query MOB
  * @map: Kmap object for semi-persistent mappings
@@ -109,7 +109,7 @@ struct vmw_fpriv {
  */
 struct vmw_buffer_object {
 	struct ttm_buffer_object base;
-	struct list_head res_list;
+	struct rb_root res_tree;
 	s32 pin_count;
 	/* Not ref-counted.  Protected by binding_mutex */
 	struct vmw_resource *dx_query_ctx;
@@ -157,8 +157,8 @@ struct vmw_res_func;
  * pin-count greater than zero. It is not on the resource LRU lists and its
  * backup buffer is pinned. Hence it can't be evicted.
  * @func: Method vtable for this resource. Immutable.
+ * @mob_node; Node for the MOB backup rbtree. Protected by @backup reserved.
  * @lru_head: List head for the LRU list. Protected by @dev_priv::resource_lock.
- * @mob_head: List head for the MOB backup list. Protected by @backup reserved.
  * @binding_head: List head for the context binding list. Protected by
  * the @dev_priv::binding_mutex
  * @res_free: The resource destructor.
@@ -179,8 +179,8 @@ struct vmw_resource {
 	unsigned long backup_offset;
 	unsigned long pin_count;
 	const struct vmw_res_func *func;
+	struct rb_node mob_node;
 	struct list_head lru_head;
-	struct list_head mob_head;
 	struct list_head binding_head;
 	struct vmw_resource_dirty *dirty;
 	void (*res_free) (struct vmw_resource *res);
@@ -733,7 +733,7 @@ void vmw_resource_dirty_update(struct vmw_resource *res, pgoff_t start,
  */
 static inline bool vmw_resource_mob_attached(const struct vmw_resource *res)
 {
-	return !list_empty(&res->mob_head);
+	return !RB_EMPTY_NODE(&res->mob_node);
 }
 
 /**
diff --git a/drivers/gpu/drm/vmwgfx/vmwgfx_resource.c b/drivers/gpu/drm/vmwgfx/vmwgfx_resource.c
index e4c97a4cf2ff..328ad46076ff 100644
--- a/drivers/gpu/drm/vmwgfx/vmwgfx_resource.c
+++ b/drivers/gpu/drm/vmwgfx/vmwgfx_resource.c
@@ -40,11 +40,24 @@
 void vmw_resource_mob_attach(struct vmw_resource *res)
 {
 	struct vmw_buffer_object *backup = res->backup;
+	struct rb_node **new = &backup->res_tree.rb_node, *parent = NULL;
 
 	dma_resv_assert_held(res->backup->base.base.resv);
 	res->used_prio = (res->res_dirty) ? res->func->dirty_prio :
 		res->func->prio;
-	list_add_tail(&res->mob_head, &backup->res_list);
+
+	while (*new) {
+		struct vmw_resource *this =
+			container_of(*new, struct vmw_resource, mob_node);
+
+		parent = *new;
+		new = (res->backup_offset < this->backup_offset) ?
+			&((*new)->rb_left) : &((*new)->rb_right);
+	}
+
+	rb_link_node(&res->mob_node, parent, new);
+	rb_insert_color(&res->mob_node, &backup->res_tree);
+
 	vmw_bo_prio_add(backup, res->used_prio);
 }
 
@@ -58,7 +71,8 @@ void vmw_resource_mob_detach(struct vmw_resource *res)
 
 	dma_resv_assert_held(backup->base.base.resv);
 	if (vmw_resource_mob_attached(res)) {
-		list_del_init(&res->mob_head);
+		rb_erase(&res->mob_node, &backup->res_tree);
+		RB_CLEAR_NODE(&res->mob_node);
 		vmw_bo_prio_del(backup, res->used_prio);
 	}
 }
@@ -204,8 +218,8 @@ int vmw_resource_init(struct vmw_private *dev_priv, struct vmw_resource *res,
 	res->res_free = res_free;
 	res->dev_priv = dev_priv;
 	res->func = func;
+	RB_CLEAR_NODE(&res->mob_node);
 	INIT_LIST_HEAD(&res->lru_head);
-	INIT_LIST_HEAD(&res->mob_head);
 	INIT_LIST_HEAD(&res->binding_head);
 	res->id = -1;
 	res->backup = NULL;
@@ -754,19 +768,20 @@ int vmw_resource_validate(struct vmw_resource *res, bool intr)
  */
 void vmw_resource_unbind_list(struct vmw_buffer_object *vbo)
 {
-
-	struct vmw_resource *res, *next;
 	struct ttm_validate_buffer val_buf = {
 		.bo = &vbo->base,
 		.num_shared = 0
 	};
 
 	dma_resv_assert_held(vbo->base.base.resv);
-	list_for_each_entry_safe(res, next, &vbo->res_list, mob_head) {
-		if (!res->func->unbind)
-			continue;
+	while (!RB_EMPTY_ROOT(&vbo->res_tree)) {
+		struct rb_node *node = vbo->res_tree.rb_node;
+		struct vmw_resource *res =
+			container_of(node, struct vmw_resource, mob_node);
+
+		if (!WARN_ON_ONCE(!res->func->unbind))
+			(void) res->func->unbind(res, res->res_dirty, &val_buf);
 
-		(void) res->func->unbind(res, res->res_dirty, &val_buf);
 		res->backup_dirty = true;
 		res->res_dirty = false;
 		vmw_resource_mob_detach(res);
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH v2 4/5] drm/vmwgfx: Implement an infrastructure for read-coherent resources
  2019-09-26 11:55 [PATCH v2 0/5] Emulated coherent graphics memory take 2 Thomas Hellström (VMware)
                   ` (2 preceding siblings ...)
  2019-09-26 11:55 ` [PATCH v2 3/5] drm/vmwgfx: Use an RBtree instead of linked list for MOB resources Thomas Hellström (VMware)
@ 2019-09-26 11:55 ` Thomas Hellström (VMware)
  2019-09-26 11:55 ` [PATCH v2 5/5] drm/vmwgfx: Add surface dirty-tracking callbacks Thomas Hellström (VMware)
  4 siblings, 0 replies; 23+ messages in thread
From: Thomas Hellström (VMware) @ 2019-09-26 11:55 UTC (permalink / raw)
  To: linux-kernel, dri-devel, linux-mm
  Cc: pv-drivers, linux-graphics-maintainer, Thomas Hellstrom,
	Andrew Morton, Matthew Wilcox, Will Deacon, Peter Zijlstra,
	Rik van Riel, Minchan Kim, Michal Hocko, Huang Ying,
	Souptick Joarder, Jérôme Glisse, Christian König,
	Christoph Hellwig, Deepak Rawat

From: Thomas Hellstrom <thellstrom@vmware.com>

Similar to write-coherent resources, make sure that from the user-space
point of view, GPU rendered contents is automatically available for
reading by the CPU.

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Souptick Joarder <jrdr.linux@gmail.com>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: "Christian König" <christian.koenig@amd.com>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Thomas Hellstrom <thellstrom@vmware.com>
Reviewed-by: Deepak Rawat <drawat@vmware.com>
---
 drivers/gpu/drm/vmwgfx/vmwgfx_drv.h           |   7 +-
 drivers/gpu/drm/vmwgfx/vmwgfx_page_dirty.c    |  75 ++++++++++++-
 drivers/gpu/drm/vmwgfx/vmwgfx_resource.c      | 103 +++++++++++++++++-
 drivers/gpu/drm/vmwgfx/vmwgfx_resource_priv.h |   2 +
 drivers/gpu/drm/vmwgfx/vmwgfx_validation.c    |   3 +-
 5 files changed, 179 insertions(+), 11 deletions(-)

diff --git a/drivers/gpu/drm/vmwgfx/vmwgfx_drv.h b/drivers/gpu/drm/vmwgfx/vmwgfx_drv.h
index 53f8522ae032..729a2e93acf1 100644
--- a/drivers/gpu/drm/vmwgfx/vmwgfx_drv.h
+++ b/drivers/gpu/drm/vmwgfx/vmwgfx_drv.h
@@ -680,7 +680,8 @@ extern void vmw_resource_unreference(struct vmw_resource **p_res);
 extern struct vmw_resource *vmw_resource_reference(struct vmw_resource *res);
 extern struct vmw_resource *
 vmw_resource_reference_unless_doomed(struct vmw_resource *res);
-extern int vmw_resource_validate(struct vmw_resource *res, bool intr);
+extern int vmw_resource_validate(struct vmw_resource *res, bool intr,
+				 bool dirtying);
 extern int vmw_resource_reserve(struct vmw_resource *res, bool interruptible,
 				bool no_backup);
 extern bool vmw_resource_needs_backup(const struct vmw_resource *res);
@@ -724,6 +725,8 @@ void vmw_resource_mob_attach(struct vmw_resource *res);
 void vmw_resource_mob_detach(struct vmw_resource *res);
 void vmw_resource_dirty_update(struct vmw_resource *res, pgoff_t start,
 			       pgoff_t end);
+int vmw_resources_clean(struct vmw_buffer_object *vbo, pgoff_t start,
+			pgoff_t end, pgoff_t *num_prefault);
 
 /**
  * vmw_resource_mob_attached - Whether a resource currently has a mob attached
@@ -1417,6 +1420,8 @@ int vmw_bo_dirty_add(struct vmw_buffer_object *vbo);
 void vmw_bo_dirty_transfer_to_res(struct vmw_resource *res);
 void vmw_bo_dirty_clear_res(struct vmw_resource *res);
 void vmw_bo_dirty_release(struct vmw_buffer_object *vbo);
+void vmw_bo_dirty_unmap(struct vmw_buffer_object *vbo,
+			pgoff_t start, pgoff_t end);
 vm_fault_t vmw_bo_vm_fault(struct vm_fault *vmf);
 vm_fault_t vmw_bo_vm_mkwrite(struct vm_fault *vmf);
 
diff --git a/drivers/gpu/drm/vmwgfx/vmwgfx_page_dirty.c b/drivers/gpu/drm/vmwgfx/vmwgfx_page_dirty.c
index be3302a8e309..1914d34c183a 100644
--- a/drivers/gpu/drm/vmwgfx/vmwgfx_page_dirty.c
+++ b/drivers/gpu/drm/vmwgfx/vmwgfx_page_dirty.c
@@ -153,7 +153,6 @@ static void vmw_bo_dirty_scan_mkwrite(struct vmw_buffer_object *vbo)
 	}
 }
 
-
 /**
  * vmw_bo_dirty_scan - Scan for dirty pages and add them to the dirty
  * tracking structure
@@ -171,6 +170,51 @@ void vmw_bo_dirty_scan(struct vmw_buffer_object *vbo)
 		vmw_bo_dirty_scan_mkwrite(vbo);
 }
 
+/**
+ * vmw_bo_dirty_pre_unmap - write-protect and pick up dirty pages before
+ * an unmap_mapping_range operation.
+ * @vbo: The buffer object,
+ * @start: First page of the range within the buffer object.
+ * @end: Last page of the range within the buffer object + 1.
+ *
+ * If we're using the _PAGETABLE scan method, we may leak dirty pages
+ * when calling unmap_mapping_range(). This function makes sure we pick
+ * up all dirty pages.
+ */
+static void vmw_bo_dirty_pre_unmap(struct vmw_buffer_object *vbo,
+				   pgoff_t start, pgoff_t end)
+{
+	struct vmw_bo_dirty *dirty = vbo->dirty;
+	unsigned long offset = drm_vma_node_start(&vbo->base.base.vma_node);
+	struct address_space *mapping = vbo->base.bdev->dev_mapping;
+
+	if (dirty->method != VMW_BO_DIRTY_PAGETABLE || start >= end)
+		return;
+
+	apply_as_wrprotect(mapping, start + offset, end - start);
+	apply_as_clean(mapping, start + offset, end - start, offset,
+		       &dirty->bitmap[0], &dirty->start, &dirty->end);
+}
+
+/**
+ * vmw_bo_dirty_unmap - Clear all ptes pointing to a range within a bo
+ * @vbo: The buffer object,
+ * @start: First page of the range within the buffer object.
+ * @end: Last page of the range within the buffer object + 1.
+ *
+ * This is similar to ttm_bo_unmap_virtual_locked() except it takes a subrange.
+ */
+void vmw_bo_dirty_unmap(struct vmw_buffer_object *vbo,
+			pgoff_t start, pgoff_t end)
+{
+	unsigned long offset = drm_vma_node_start(&vbo->base.base.vma_node);
+	struct address_space *mapping = vbo->base.bdev->dev_mapping;
+
+	vmw_bo_dirty_pre_unmap(vbo, start, end);
+	unmap_shared_mapping_range(mapping, (offset + start) << PAGE_SHIFT,
+				   (loff_t) (end - start) << PAGE_SHIFT);
+}
+
 /**
  * vmw_bo_dirty_add - Add a dirty-tracking user to a buffer object
  * @vbo: The buffer object
@@ -397,21 +441,42 @@ vm_fault_t vmw_bo_vm_fault(struct vm_fault *vmf)
 	if (ret)
 		return ret;
 
+	num_prefault = (vma->vm_flags & VM_RAND_READ) ? 1 :
+		TTM_BO_VM_NUM_PREFAULT;
+
+	if (vbo->dirty) {
+		pgoff_t allowed_prefault;
+		unsigned long page_offset;
+
+		page_offset = vmf->pgoff -
+			drm_vma_node_start(&bo->base.vma_node);
+		if (page_offset >= bo->num_pages ||
+		    vmw_resources_clean(vbo, page_offset,
+					page_offset + PAGE_SIZE,
+					&allowed_prefault)) {
+			ret = VM_FAULT_SIGBUS;
+			goto out_unlock;
+		}
+
+		num_prefault = min(num_prefault, allowed_prefault);
+	}
+
 	/*
-	 * This will cause mkwrite() to be called for each pte on
-	 * write-enable vmas.
+	 * If we don't track dirty using the MKWRITE method, make sure
+	 * sure the page protection is write-enabled so we don't get
+	 * a lot of unnecessary write faults.
 	 */
 	if (vbo->dirty && vbo->dirty->method == VMW_BO_DIRTY_MKWRITE)
 		prot = vma->vm_page_prot;
 	else
 		prot = vm_get_page_prot(vma->vm_flags);
 
-	num_prefault = (vma->vm_flags & VM_RAND_READ) ? 0 :
-		TTM_BO_VM_NUM_PREFAULT;
 	ret = ttm_bo_vm_fault_reserved(vmf, prot, num_prefault);
 	if (ret == VM_FAULT_RETRY && !(vmf->flags & FAULT_FLAG_RETRY_NOWAIT))
 		return ret;
 
+out_unlock:
 	dma_resv_unlock(bo->base.resv);
+
 	return ret;
 }
diff --git a/drivers/gpu/drm/vmwgfx/vmwgfx_resource.c b/drivers/gpu/drm/vmwgfx/vmwgfx_resource.c
index 328ad46076ff..c76faf33972e 100644
--- a/drivers/gpu/drm/vmwgfx/vmwgfx_resource.c
+++ b/drivers/gpu/drm/vmwgfx/vmwgfx_resource.c
@@ -393,7 +393,8 @@ static int vmw_resource_buf_alloc(struct vmw_resource *res,
  * should be retried once resources have been freed up.
  */
 static int vmw_resource_do_validate(struct vmw_resource *res,
-				    struct ttm_validate_buffer *val_buf)
+				    struct ttm_validate_buffer *val_buf,
+				    bool dirtying)
 {
 	int ret = 0;
 	const struct vmw_res_func *func = res->func;
@@ -435,6 +436,15 @@ static int vmw_resource_do_validate(struct vmw_resource *res,
 	 * the resource.
 	 */
 	if (res->dirty) {
+		if (dirtying && !res->res_dirty) {
+			pgoff_t start = res->backup_offset >> PAGE_SHIFT;
+			pgoff_t end = __KERNEL_DIV_ROUND_UP
+				(res->backup_offset + res->backup_size,
+				 PAGE_SIZE);
+
+			vmw_bo_dirty_unmap(res->backup, start, end);
+		}
+
 		vmw_bo_dirty_transfer_to_res(res);
 		return func->dirty_sync(res);
 	}
@@ -679,6 +689,7 @@ static int vmw_resource_do_evict(struct ww_acquire_ctx *ticket,
  *                         to the device.
  * @res: The resource to make visible to the device.
  * @intr: Perform waits interruptible if possible.
+ * @dirtying: Pending GPU operation will dirty the resource
  *
  * On succesful return, any backup DMA buffer pointed to by @res->backup will
  * be reserved and validated.
@@ -688,7 +699,8 @@ static int vmw_resource_do_evict(struct ww_acquire_ctx *ticket,
  * Return: Zero on success, -ERESTARTSYS if interrupted, negative error code
  * on failure.
  */
-int vmw_resource_validate(struct vmw_resource *res, bool intr)
+int vmw_resource_validate(struct vmw_resource *res, bool intr,
+			  bool dirtying)
 {
 	int ret;
 	struct vmw_resource *evict_res;
@@ -705,7 +717,7 @@ int vmw_resource_validate(struct vmw_resource *res, bool intr)
 	if (res->backup)
 		val_buf.bo = &res->backup->base;
 	do {
-		ret = vmw_resource_do_validate(res, &val_buf);
+		ret = vmw_resource_do_validate(res, &val_buf, dirtying);
 		if (likely(ret != -EBUSY))
 			break;
 
@@ -1005,7 +1017,7 @@ int vmw_resource_pin(struct vmw_resource *res, bool interruptible)
 			/* Do we really need to pin the MOB as well? */
 			vmw_bo_pin_reserved(vbo, true);
 		}
-		ret = vmw_resource_validate(res, interruptible);
+		ret = vmw_resource_validate(res, interruptible, true);
 		if (vbo)
 			ttm_bo_unreserve(&vbo->base);
 		if (ret)
@@ -1080,3 +1092,86 @@ void vmw_resource_dirty_update(struct vmw_resource *res, pgoff_t start,
 		res->func->dirty_range_add(res, start << PAGE_SHIFT,
 					   end << PAGE_SHIFT);
 }
+
+/**
+ * vmw_resources_clean - Clean resources intersecting a mob range
+ * @vbo: The mob buffer object
+ * @start: The mob page offset starting the range
+ * @end: The mob page offset ending the range
+ * @num_prefault: Returns how many pages including the first have been
+ * cleaned and are ok to prefault
+ */
+int vmw_resources_clean(struct vmw_buffer_object *vbo, pgoff_t start,
+			pgoff_t end, pgoff_t *num_prefault)
+{
+	struct rb_node *cur = vbo->res_tree.rb_node;
+	struct vmw_resource *found = NULL;
+	unsigned long res_start = start << PAGE_SHIFT;
+	unsigned long res_end = end << PAGE_SHIFT;
+	unsigned long last_cleaned = 0;
+
+	/*
+	 * Find the resource with lowest backup_offset that intersects the
+	 * range.
+	 */
+	while (cur) {
+		struct vmw_resource *cur_res =
+			container_of(cur, struct vmw_resource, mob_node);
+
+		if (cur_res->backup_offset >= res_end) {
+			cur = cur->rb_left;
+		} else if (cur_res->backup_offset + cur_res->backup_size <=
+			   res_start) {
+			cur = cur->rb_right;
+		} else {
+			found = cur_res;
+			cur = cur->rb_left;
+			/* Continue to look for resources with lower offsets */
+		}
+	}
+
+	/*
+	 * In order of increasing backup_offset, clean dirty resorces
+	 * intersecting the range.
+	 */
+	while (found) {
+		if (found->res_dirty) {
+			int ret;
+
+			if (!found->func->clean)
+				return -EINVAL;
+
+			ret = found->func->clean(found);
+			if (ret)
+				return ret;
+
+			found->res_dirty = false;
+		}
+		last_cleaned = found->backup_offset + found->backup_size;
+		cur = rb_next(&found->mob_node);
+		if (!cur)
+			break;
+
+		found = container_of(cur, struct vmw_resource, mob_node);
+		if (found->backup_offset >= res_end)
+			break;
+	}
+
+	/*
+	 * Set number of pages allowed prefaulting and fence the buffer object
+	 */
+	*num_prefault = 1;
+	if (last_cleaned > res_start) {
+		struct ttm_buffer_object *bo = &vbo->base;
+
+		*num_prefault = __KERNEL_DIV_ROUND_UP(last_cleaned - res_start,
+						      PAGE_SIZE);
+		vmw_bo_fence_single(bo, NULL);
+		if (bo->moving)
+			dma_fence_put(bo->moving);
+		bo->moving = dma_fence_get
+			(dma_resv_get_excl(bo->base.resv));
+	}
+
+	return 0;
+}
diff --git a/drivers/gpu/drm/vmwgfx/vmwgfx_resource_priv.h b/drivers/gpu/drm/vmwgfx/vmwgfx_resource_priv.h
index c85144286cfe..3b7438b2d289 100644
--- a/drivers/gpu/drm/vmwgfx/vmwgfx_resource_priv.h
+++ b/drivers/gpu/drm/vmwgfx/vmwgfx_resource_priv.h
@@ -77,6 +77,7 @@ struct vmw_user_resource_conv {
  * @dirty_sync:        Upload the dirty mob contents to the resource.
  * @dirty_add_range:   Add a sequential dirty range to the resource
  *                     dirty tracker.
+ * @clean:             Clean the resource.
  */
 struct vmw_res_func {
 	enum vmw_res_type res_type;
@@ -101,6 +102,7 @@ struct vmw_res_func {
 	int (*dirty_sync)(struct vmw_resource *res);
 	void (*dirty_range_add)(struct vmw_resource *res, size_t start,
 				 size_t end);
+	int (*clean)(struct vmw_resource *res);
 };
 
 /**
diff --git a/drivers/gpu/drm/vmwgfx/vmwgfx_validation.c b/drivers/gpu/drm/vmwgfx/vmwgfx_validation.c
index 71349a7bae90..9aaf807ed73c 100644
--- a/drivers/gpu/drm/vmwgfx/vmwgfx_validation.c
+++ b/drivers/gpu/drm/vmwgfx/vmwgfx_validation.c
@@ -641,7 +641,8 @@ int vmw_validation_res_validate(struct vmw_validation_context *ctx, bool intr)
 		struct vmw_resource *res = val->res;
 		struct vmw_buffer_object *backup = res->backup;
 
-		ret = vmw_resource_validate(res, intr);
+		ret = vmw_resource_validate(res, intr, val->dirty_set &&
+					    val->dirty);
 		if (ret) {
 			if (ret != -ERESTARTSYS)
 				DRM_ERROR("Failed to validate resource.\n");
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH v2 5/5] drm/vmwgfx: Add surface dirty-tracking callbacks
  2019-09-26 11:55 [PATCH v2 0/5] Emulated coherent graphics memory take 2 Thomas Hellström (VMware)
                   ` (3 preceding siblings ...)
  2019-09-26 11:55 ` [PATCH v2 4/5] drm/vmwgfx: Implement an infrastructure for read-coherent resources Thomas Hellström (VMware)
@ 2019-09-26 11:55 ` Thomas Hellström (VMware)
  4 siblings, 0 replies; 23+ messages in thread
From: Thomas Hellström (VMware) @ 2019-09-26 11:55 UTC (permalink / raw)
  To: linux-kernel, dri-devel, linux-mm
  Cc: pv-drivers, linux-graphics-maintainer, Thomas Hellstrom,
	Andrew Morton, Matthew Wilcox, Will Deacon, Peter Zijlstra,
	Rik van Riel, Minchan Kim, Michal Hocko, Huang Ying,
	Souptick Joarder, Jérôme Glisse, Christian König,
	Christoph Hellwig, Deepak Rawat

From: Thomas Hellstrom <thellstrom@vmware.com>

Add the callbacks necessary to implement emulated coherent memory for
surfaces. Add a flag to the gb_surface_create ioctl to indicate that
surface memory should be coherent.
Also bump the drm minor version to signal the availability of coherent
surfaces.

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Souptick Joarder <jrdr.linux@gmail.com>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: "Christian König" <christian.koenig@amd.com>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Thomas Hellstrom <thellstrom@vmware.com>
Reviewed-by: Deepak Rawat <drawat@vmware.com>
---
 .../device_include/svga3d_surfacedefs.h       | 233 ++++++++++-
 drivers/gpu/drm/vmwgfx/vmwgfx_drv.h           |   4 +-
 drivers/gpu/drm/vmwgfx/vmwgfx_surface.c       | 395 +++++++++++++++++-
 include/uapi/drm/vmwgfx_drm.h                 |   4 +-
 4 files changed, 629 insertions(+), 7 deletions(-)

diff --git a/drivers/gpu/drm/vmwgfx/device_include/svga3d_surfacedefs.h b/drivers/gpu/drm/vmwgfx/device_include/svga3d_surfacedefs.h
index f2bfd3d80598..61414f105c67 100644
--- a/drivers/gpu/drm/vmwgfx/device_include/svga3d_surfacedefs.h
+++ b/drivers/gpu/drm/vmwgfx/device_include/svga3d_surfacedefs.h
@@ -1280,7 +1280,6 @@ svga3dsurface_get_pixel_offset(SVGA3dSurfaceFormat format,
 	return offset;
 }
 
-
 static inline u32
 svga3dsurface_get_image_offset(SVGA3dSurfaceFormat format,
 			       surf_size_struct baseLevelSize,
@@ -1375,4 +1374,236 @@ svga3dsurface_is_screen_target_format(SVGA3dSurfaceFormat format)
 	return svga3dsurface_is_dx_screen_target_format(format);
 }
 
+/**
+ * struct svga3dsurface_mip - Mimpmap level information
+ * @bytes: Bytes required in the backing store of this mipmap level.
+ * @img_stride: Byte stride per image.
+ * @row_stride: Byte stride per block row.
+ * @size: The size of the mipmap.
+ */
+struct svga3dsurface_mip {
+	size_t bytes;
+	size_t img_stride;
+	size_t row_stride;
+	struct drm_vmw_size size;
+
+};
+
+/**
+ * struct svga3dsurface_cache - Cached surface information
+ * @desc: Pointer to the surface descriptor
+ * @mip: Array of mipmap level information. Valid size is @num_mip_levels.
+ * @mip_chain_bytes: Bytes required in the backing store for the whole chain
+ * of mip levels.
+ * @sheet_bytes: Bytes required in the backing store for a sheet
+ * representing a single sample.
+ * @num_mip_levels: Valid size of the @mip array. Number of mipmap levels in
+ * a chain.
+ * @num_layers: Number of slices in an array texture or number of faces in
+ * a cubemap texture.
+ */
+struct svga3dsurface_cache {
+	const struct svga3d_surface_desc *desc;
+	struct svga3dsurface_mip mip[DRM_VMW_MAX_MIP_LEVELS];
+	size_t mip_chain_bytes;
+	size_t sheet_bytes;
+	u32 num_mip_levels;
+	u32 num_layers;
+};
+
+/**
+ * struct svga3dsurface_loc - Surface location
+ * @sub_resource: Surface subresource. Defined as layer * num_mip_levels +
+ * mip_level.
+ * @x: X coordinate.
+ * @y: Y coordinate.
+ * @z: Z coordinate.
+ */
+struct svga3dsurface_loc {
+	u32 sub_resource;
+	u32 x, y, z;
+};
+
+/**
+ * svga3dsurface_subres - Compute the subresource from layer and mipmap.
+ * @cache: Surface layout data.
+ * @mip_level: The mipmap level.
+ * @layer: The surface layer (face or array slice).
+ *
+ * Return: The subresource.
+ */
+static inline u32 svga3dsurface_subres(const struct svga3dsurface_cache *cache,
+				       u32 mip_level, u32 layer)
+{
+	return cache->num_mip_levels * layer + mip_level;
+}
+
+/**
+ * svga3dsurface_setup_cache - Build a surface cache entry
+ * @size: The surface base level dimensions.
+ * @format: The surface format.
+ * @num_mip_levels: Number of mipmap levels.
+ * @num_layers: Number of layers.
+ * @cache: Pointer to a struct svga3dsurface_cach object to be filled in.
+ *
+ * Return: Zero on success, -EINVAL on invalid surface layout.
+ */
+static inline int svga3dsurface_setup_cache(const struct drm_vmw_size *size,
+					    SVGA3dSurfaceFormat format,
+					    u32 num_mip_levels,
+					    u32 num_layers,
+					    u32 num_samples,
+					    struct svga3dsurface_cache *cache)
+{
+	const struct svga3d_surface_desc *desc;
+	u32 i;
+
+	memset(cache, 0, sizeof(*cache));
+	cache->desc = desc = svga3dsurface_get_desc(format);
+	cache->num_mip_levels = num_mip_levels;
+	cache->num_layers = num_layers;
+	for (i = 0; i < cache->num_mip_levels; i++) {
+		struct svga3dsurface_mip *mip = &cache->mip[i];
+
+		mip->size = svga3dsurface_get_mip_size(*size, i);
+		mip->bytes = svga3dsurface_get_image_buffer_size
+			(desc, &mip->size, 0);
+		mip->row_stride =
+			__KERNEL_DIV_ROUND_UP(mip->size.width,
+					      desc->block_size.width) *
+			desc->bytes_per_block * num_samples;
+		if (!mip->row_stride)
+			goto invalid_dim;
+
+		mip->img_stride =
+			__KERNEL_DIV_ROUND_UP(mip->size.height,
+					      desc->block_size.height) *
+			mip->row_stride;
+		if (!mip->img_stride)
+			goto invalid_dim;
+
+		cache->mip_chain_bytes += mip->bytes;
+	}
+	cache->sheet_bytes = cache->mip_chain_bytes * num_layers;
+	if (!cache->sheet_bytes)
+		goto invalid_dim;
+
+	return 0;
+
+invalid_dim:
+	VMW_DEBUG_USER("Invalid surface layout for dirty tracking.\n");
+	return -EINVAL;
+}
+
+/**
+ * svga3dsurface_get_loc - Get a surface location from an offset into the
+ * backing store
+ * @cache: Surface layout data.
+ * @loc: Pointer to a struct svga3dsurface_loc to be filled in.
+ * @offset: Offset into the surface backing store.
+ */
+static inline void
+svga3dsurface_get_loc(const struct svga3dsurface_cache *cache,
+		      struct svga3dsurface_loc *loc,
+		      size_t offset)
+{
+	const struct svga3dsurface_mip *mip = &cache->mip[0];
+	const struct svga3d_surface_desc *desc = cache->desc;
+	u32 layer;
+	int i;
+
+	if (offset >= cache->sheet_bytes)
+		offset %= cache->sheet_bytes;
+
+	layer = offset / cache->mip_chain_bytes;
+	offset -= layer * cache->mip_chain_bytes;
+	for (i = 0; i < cache->num_mip_levels; ++i, ++mip) {
+		if (mip->bytes > offset)
+			break;
+		offset -= mip->bytes;
+	}
+
+	loc->sub_resource = svga3dsurface_subres(cache, i, layer);
+	loc->z = offset / mip->img_stride;
+	offset -= loc->z * mip->img_stride;
+	loc->z *= desc->block_size.depth;
+	loc->y = offset / mip->row_stride;
+	offset -= loc->y * mip->row_stride;
+	loc->y *= desc->block_size.height;
+	loc->x = offset / desc->bytes_per_block;
+	loc->x *= desc->block_size.width;
+}
+
+/**
+ * svga3dsurface_inc_loc - Clamp increment a surface location with one block
+ * size
+ * in each dimension.
+ * @loc: Pointer to a struct svga3dsurface_loc to be incremented.
+ *
+ * When computing the size of a range as size = end - start, the range does not
+ * include the end element. However a location representing the last byte
+ * of a touched region in the backing store *is* included in the range.
+ * This function modifies such a location to match the end definition
+ * given as start + size which is the one used in a SVGA3dBox.
+ */
+static inline void
+svga3dsurface_inc_loc(const struct svga3dsurface_cache *cache,
+		      struct svga3dsurface_loc *loc)
+{
+	const struct svga3d_surface_desc *desc = cache->desc;
+	u32 mip = loc->sub_resource % cache->num_mip_levels;
+	const struct drm_vmw_size *size = &cache->mip[mip].size;
+
+	loc->sub_resource++;
+	loc->x += desc->block_size.width;
+	if (loc->x > size->width)
+		loc->x = size->width;
+	loc->y += desc->block_size.height;
+	if (loc->y > size->height)
+		loc->y = size->height;
+	loc->z += desc->block_size.depth;
+	if (loc->z > size->depth)
+		loc->z = size->depth;
+}
+
+/**
+ * svga3dsurface_min_loc - The start location in a subresource
+ * @cache: Surface layout data.
+ * @sub_resource: The subresource.
+ * @loc: Pointer to a struct svga3dsurface_loc to be filled in.
+ */
+static inline void
+svga3dsurface_min_loc(const struct svga3dsurface_cache *cache,
+		      u32 sub_resource,
+		      struct svga3dsurface_loc *loc)
+{
+	loc->sub_resource = sub_resource;
+	loc->x = loc->y = loc->z = 0;
+}
+
+/**
+ * svga3dsurface_min_loc - The end location in a subresource
+ * @cache: Surface layout data.
+ * @sub_resource: The subresource.
+ * @loc: Pointer to a struct svga3dsurface_loc to be filled in.
+ *
+ * Following the end definition given in svga3dsurface_inc_loc(),
+ * Compute the end location of a surface subresource.
+ */
+static inline void
+svga3dsurface_max_loc(const struct svga3dsurface_cache *cache,
+		      u32 sub_resource,
+		      struct svga3dsurface_loc *loc)
+{
+	const struct drm_vmw_size *size;
+	u32 mip;
+
+	loc->sub_resource = sub_resource + 1;
+	mip = sub_resource % cache->num_mip_levels;
+	size = &cache->mip[mip].size;
+	loc->x = size->width;
+	loc->y = size->height;
+	loc->z = size->depth;
+}
+
 #endif /* _SVGA3D_SURFACEDEFS_H_ */
diff --git a/drivers/gpu/drm/vmwgfx/vmwgfx_drv.h b/drivers/gpu/drm/vmwgfx/vmwgfx_drv.h
index 729a2e93acf1..f5261e1c96d7 100644
--- a/drivers/gpu/drm/vmwgfx/vmwgfx_drv.h
+++ b/drivers/gpu/drm/vmwgfx/vmwgfx_drv.h
@@ -56,9 +56,9 @@
 
 
 #define VMWGFX_DRIVER_NAME "vmwgfx"
-#define VMWGFX_DRIVER_DATE "20180704"
+#define VMWGFX_DRIVER_DATE "20190328"
 #define VMWGFX_DRIVER_MAJOR 2
-#define VMWGFX_DRIVER_MINOR 15
+#define VMWGFX_DRIVER_MINOR 16
 #define VMWGFX_DRIVER_PATCHLEVEL 0
 #define VMWGFX_FIFO_STATIC_SIZE (1024*1024)
 #define VMWGFX_MAX_RELOCATIONS 2048
diff --git a/drivers/gpu/drm/vmwgfx/vmwgfx_surface.c b/drivers/gpu/drm/vmwgfx/vmwgfx_surface.c
index 29d8794f0421..876bada5b35e 100644
--- a/drivers/gpu/drm/vmwgfx/vmwgfx_surface.c
+++ b/drivers/gpu/drm/vmwgfx/vmwgfx_surface.c
@@ -68,6 +68,20 @@ struct vmw_surface_offset {
 	uint32_t bo_offset;
 };
 
+/**
+ * vmw_surface_dirty - Surface dirty-tracker
+ * @cache: Cached layout information of the surface.
+ * @size: Accounting size for the struct vmw_surface_dirty.
+ * @num_subres: Number of subresources.
+ * @boxes: Array of SVGA3dBoxes indicating dirty regions. One per subresource.
+ */
+struct vmw_surface_dirty {
+	struct svga3dsurface_cache cache;
+	size_t size;
+	u32 num_subres;
+	SVGA3dBox boxes[0];
+};
+
 static void vmw_user_surface_free(struct vmw_resource *res);
 static struct vmw_resource *
 vmw_user_surface_base_to_res(struct ttm_base_object *base);
@@ -96,6 +110,13 @@ vmw_gb_surface_reference_internal(struct drm_device *dev,
 				  struct drm_vmw_gb_surface_ref_ext_rep *rep,
 				  struct drm_file *file_priv);
 
+static void vmw_surface_dirty_free(struct vmw_resource *res);
+static int vmw_surface_dirty_alloc(struct vmw_resource *res);
+static int vmw_surface_dirty_sync(struct vmw_resource *res);
+static void vmw_surface_dirty_range_add(struct vmw_resource *res, size_t start,
+					size_t end);
+static int vmw_surface_clean(struct vmw_resource *res);
+
 static const struct vmw_user_resource_conv user_surface_conv = {
 	.object_type = VMW_RES_SURFACE,
 	.base_obj_to_res = vmw_user_surface_base_to_res,
@@ -133,7 +154,12 @@ static const struct vmw_res_func vmw_gb_surface_func = {
 	.create = vmw_gb_surface_create,
 	.destroy = vmw_gb_surface_destroy,
 	.bind = vmw_gb_surface_bind,
-	.unbind = vmw_gb_surface_unbind
+	.unbind = vmw_gb_surface_unbind,
+	.dirty_alloc = vmw_surface_dirty_alloc,
+	.dirty_free = vmw_surface_dirty_free,
+	.dirty_sync = vmw_surface_dirty_sync,
+	.dirty_range_add = vmw_surface_dirty_range_add,
+	.clean = vmw_surface_clean,
 };
 
 /**
@@ -641,6 +667,7 @@ static void vmw_user_surface_free(struct vmw_resource *res)
 	struct vmw_private *dev_priv = srf->res.dev_priv;
 	uint32_t size = user_srf->size;
 
+	WARN_ON_ONCE(res->dirty);
 	if (user_srf->master)
 		drm_master_put(&user_srf->master);
 	kfree(srf->offsets);
@@ -1168,10 +1195,16 @@ static int vmw_gb_surface_bind(struct vmw_resource *res,
 		cmd2->header.id = SVGA_3D_CMD_UPDATE_GB_SURFACE;
 		cmd2->header.size = sizeof(cmd2->body);
 		cmd2->body.sid = res->id;
-		res->backup_dirty = false;
 	}
 	vmw_fifo_commit(dev_priv, submit_size);
 
+	if (res->backup->dirty && res->backup_dirty) {
+		/* We've just made a full upload. Cear dirty regions. */
+		vmw_bo_dirty_clear_res(res);
+	}
+
+	res->backup_dirty = false;
+
 	return 0;
 }
 
@@ -1636,7 +1669,8 @@ vmw_gb_surface_define_internal(struct drm_device *dev,
 			}
 		}
 	} else if (req->base.drm_surface_flags &
-		   drm_vmw_surface_flag_create_buffer)
+		   (drm_vmw_surface_flag_create_buffer |
+		    drm_vmw_surface_flag_coherent))
 		ret = vmw_user_bo_alloc(dev_priv, tfile,
 					res->backup_size,
 					req->base.drm_surface_flags &
@@ -1650,6 +1684,26 @@ vmw_gb_surface_define_internal(struct drm_device *dev,
 		goto out_unlock;
 	}
 
+	if (req->base.drm_surface_flags & drm_vmw_surface_flag_coherent) {
+		struct vmw_buffer_object *backup = res->backup;
+
+		ttm_bo_reserve(&backup->base, false, false, NULL);
+		if (!res->func->dirty_alloc)
+			ret = -EINVAL;
+		if (!ret)
+			ret = vmw_bo_dirty_add(backup);
+		if (!ret) {
+			res->coherent = true;
+			ret = res->func->dirty_alloc(res);
+		}
+		ttm_bo_unreserve(&backup->base);
+		if (ret) {
+			vmw_resource_unreference(&res);
+			goto out_unlock;
+		}
+
+	}
+
 	tmp = vmw_resource_reference(res);
 	ret = ttm_prime_object_init(tfile, res->backup_size, &user_srf->prime,
 				    req->base.drm_surface_flags &
@@ -1758,3 +1812,338 @@ vmw_gb_surface_reference_internal(struct drm_device *dev,
 
 	return ret;
 }
+
+/**
+ * vmw_subres_dirty_add - Add a dirty region to a subresource
+ * @dirty: The surfaces's dirty tracker.
+ * @loc_start: The location corresponding to the start of the region.
+ * @loc_end: The location corresponding to the end of the region.
+ *
+ * As we are assuming that @loc_start and @loc_end represent a sequential
+ * range of backing store memory, if the region spans multiple lines then
+ * regardless of the x coordinate, the full lines are dirtied.
+ * Correspondingly if the region spans multiple z slices, then full rather
+ * than partial z slices are dirtied.
+ */
+static void vmw_subres_dirty_add(struct vmw_surface_dirty *dirty,
+				 const struct svga3dsurface_loc *loc_start,
+				 const struct svga3dsurface_loc *loc_end)
+{
+	const struct svga3dsurface_cache *cache = &dirty->cache;
+	SVGA3dBox *box = &dirty->boxes[loc_start->sub_resource];
+	u32 mip = loc_start->sub_resource % cache->num_mip_levels;
+	const struct drm_vmw_size *size = &cache->mip[mip].size;
+	u32 box_c2 = box->z + box->d;
+
+	if (WARN_ON(loc_start->sub_resource >= dirty->num_subres))
+		return;
+
+	if (box->d == 0 || box->z > loc_start->z)
+		box->z = loc_start->z;
+	if (box_c2 < loc_end->z)
+		box->d = loc_end->z - box->z;
+
+	if (loc_start->z + 1 == loc_end->z) {
+		box_c2 = box->y + box->h;
+		if (box->h == 0 || box->y > loc_start->y)
+			box->y = loc_start->y;
+		if (box_c2 < loc_end->y)
+			box->h = loc_end->y - box->y;
+
+		if (loc_start->y + 1 == loc_end->y) {
+			box_c2 = box->x + box->w;
+			if (box->w == 0 || box->x > loc_start->x)
+				box->x = loc_start->x;
+			if (box_c2 < loc_end->x)
+				box->w = loc_end->x - box->x;
+		} else {
+			box->x = 0;
+			box->w = size->width;
+		}
+	} else {
+		box->y = 0;
+		box->h = size->height;
+		box->x = 0;
+		box->w = size->width;
+	}
+}
+
+/**
+ * vmw_subres_dirty_full - Mark a full subresource as dirty
+ * @dirty: The surface's dirty tracker.
+ * @subres: The subresource
+ */
+static void vmw_subres_dirty_full(struct vmw_surface_dirty *dirty, u32 subres)
+{
+	const struct svga3dsurface_cache *cache = &dirty->cache;
+	u32 mip = subres % cache->num_mip_levels;
+	const struct drm_vmw_size *size = &cache->mip[mip].size;
+	SVGA3dBox *box = &dirty->boxes[subres];
+
+	box->x = 0;
+	box->y = 0;
+	box->z = 0;
+	box->w = size->width;
+	box->h = size->height;
+	box->d = size->depth;
+}
+
+/*
+ * vmw_surface_tex_dirty_add_range - The dirty_add_range callback for texture
+ * surfaces.
+ */
+static void vmw_surface_tex_dirty_range_add(struct vmw_resource *res,
+					    size_t start, size_t end)
+{
+	struct vmw_surface_dirty *dirty =
+		(struct vmw_surface_dirty *) res->dirty;
+	size_t backup_end = res->backup_offset + res->backup_size;
+	struct svga3dsurface_loc loc1, loc2;
+	const struct svga3dsurface_cache *cache;
+
+	start = max_t(size_t, start, res->backup_offset) - res->backup_offset;
+	end = min(end, backup_end) - res->backup_offset;
+	cache = &dirty->cache;
+	svga3dsurface_get_loc(cache, &loc1, start);
+	svga3dsurface_get_loc(cache, &loc2, end - 1);
+	svga3dsurface_inc_loc(cache, &loc2);
+
+	if (loc1.sub_resource + 1 == loc2.sub_resource) {
+		/* Dirty range covers a single sub-resource */
+		vmw_subres_dirty_add(dirty, &loc1, &loc2);
+	} else {
+		/* Dirty range covers multiple sub-resources */
+		struct svga3dsurface_loc loc_min, loc_max;
+		u32 sub_res = loc1.sub_resource;
+
+		svga3dsurface_max_loc(cache, loc1.sub_resource, &loc_max);
+		vmw_subres_dirty_add(dirty, &loc1, &loc_max);
+		svga3dsurface_min_loc(cache, loc2.sub_resource - 1, &loc_min);
+		vmw_subres_dirty_add(dirty, &loc_min, &loc2);
+		for (sub_res = loc1.sub_resource + 1;
+		     sub_res < loc2.sub_resource - 1; ++sub_res)
+			vmw_subres_dirty_full(dirty, sub_res);
+	}
+}
+
+/*
+ * vmw_surface_tex_dirty_add_range - The dirty_add_range callback for buffer
+ * surfaces.
+ */
+static void vmw_surface_buf_dirty_range_add(struct vmw_resource *res,
+					    size_t start, size_t end)
+{
+	struct vmw_surface_dirty *dirty =
+		(struct vmw_surface_dirty *) res->dirty;
+	const struct svga3dsurface_cache *cache = &dirty->cache;
+	size_t backup_end = res->backup_offset + cache->mip_chain_bytes;
+	SVGA3dBox *box = &dirty->boxes[0];
+	u32 box_c2;
+
+	box->h = box->d = 1;
+	start = max_t(size_t, start, res->backup_offset) - res->backup_offset;
+	end = min(end, backup_end) - res->backup_offset;
+	box_c2 = box->x + box->w;
+	if (box->w == 0 || box->x > start)
+		box->x = start;
+	if (box_c2 < end)
+		box->w = end - box->x;
+}
+
+/*
+ * vmw_surface_tex_dirty_add_range - The dirty_add_range callback for surfaces
+ */
+static void vmw_surface_dirty_range_add(struct vmw_resource *res, size_t start,
+					size_t end)
+{
+	struct vmw_surface *srf = vmw_res_to_srf(res);
+
+	if (WARN_ON(end <= res->backup_offset ||
+		    start >= res->backup_offset + res->backup_size))
+		return;
+
+	if (srf->format == SVGA3D_BUFFER)
+		vmw_surface_buf_dirty_range_add(res, start, end);
+	else
+		vmw_surface_tex_dirty_range_add(res, start, end);
+}
+
+/*
+ * vmw_surface_dirty_sync - The surface's dirty_sync callback.
+ */
+static int vmw_surface_dirty_sync(struct vmw_resource *res)
+{
+	struct vmw_private *dev_priv = res->dev_priv;
+	bool has_dx = 0;
+	u32 i, num_dirty;
+	struct vmw_surface_dirty *dirty =
+		(struct vmw_surface_dirty *) res->dirty;
+	size_t alloc_size;
+	const struct svga3dsurface_cache *cache = &dirty->cache;
+	struct {
+		SVGA3dCmdHeader header;
+		SVGA3dCmdDXUpdateSubResource body;
+	} *cmd1;
+	struct {
+		SVGA3dCmdHeader header;
+		SVGA3dCmdUpdateGBImage body;
+	} *cmd2;
+	void *cmd;
+
+	num_dirty = 0;
+	for (i = 0; i < dirty->num_subres; ++i) {
+		const SVGA3dBox *box = &dirty->boxes[i];
+
+		if (box->d)
+			num_dirty++;
+	}
+
+	if (!num_dirty)
+		goto out;
+
+	alloc_size = num_dirty * ((has_dx) ? sizeof(*cmd1) : sizeof(*cmd2));
+	cmd = VMW_FIFO_RESERVE(dev_priv, alloc_size);
+	if (!cmd)
+		return -ENOMEM;
+
+	cmd1 = cmd;
+	cmd2 = cmd;
+
+	for (i = 0; i < dirty->num_subres; ++i) {
+		const SVGA3dBox *box = &dirty->boxes[i];
+
+		if (!box->d)
+			continue;
+
+		/*
+		 * DX_UPDATE_SUBRESOURCE is aware of array surfaces.
+		 * UPDATE_GB_IMAGE is not.
+		 */
+		if (has_dx) {
+			cmd1->header.id = SVGA_3D_CMD_DX_UPDATE_SUBRESOURCE;
+			cmd1->header.size = sizeof(cmd1->body);
+			cmd1->body.sid = res->id;
+			cmd1->body.subResource = i;
+			cmd1->body.box = *box;
+			cmd1++;
+		} else {
+			cmd2->header.id = SVGA_3D_CMD_UPDATE_GB_IMAGE;
+			cmd2->header.size = sizeof(cmd2->body);
+			cmd2->body.image.sid = res->id;
+			cmd2->body.image.face = i / cache->num_mip_levels;
+			cmd2->body.image.mipmap = i -
+				(cache->num_mip_levels * cmd2->body.image.face);
+			cmd2->body.box = *box;
+			cmd2++;
+		}
+
+	}
+	vmw_fifo_commit(dev_priv, alloc_size);
+ out:
+	memset(&dirty->boxes[0], 0, sizeof(dirty->boxes[0]) *
+	       dirty->num_subres);
+
+	return 0;
+}
+
+/*
+ * vmw_surface_dirty_alloc - The surface's dirty_alloc callback.
+ */
+static int vmw_surface_dirty_alloc(struct vmw_resource *res)
+{
+	struct vmw_surface *srf = vmw_res_to_srf(res);
+	struct vmw_surface_dirty *dirty;
+	u32 num_layers = 1;
+	u32 num_mip;
+	u32 num_subres;
+	u32 num_samples;
+	size_t dirty_size, acc_size;
+	static struct ttm_operation_ctx ctx = {
+		.interruptible = false,
+		.no_wait_gpu = false
+	};
+	int ret;
+
+	if (srf->array_size)
+		num_layers = srf->array_size;
+	else if (srf->flags & SVGA3D_SURFACE_CUBEMAP)
+		num_layers *= SVGA3D_MAX_SURFACE_FACES;
+
+	num_mip = srf->mip_levels[0];
+	if (!num_mip)
+		num_mip = 1;
+
+	num_subres = num_layers * num_mip;
+	dirty_size = sizeof(*dirty) + num_subres * sizeof(dirty->boxes[0]);
+	acc_size = ttm_round_pot(dirty_size);
+	ret = ttm_mem_global_alloc(vmw_mem_glob(res->dev_priv),
+				   acc_size, &ctx);
+	if (ret) {
+		VMW_DEBUG_USER("Out of graphics memory for surface "
+			       "dirty tracker.\n");
+		return ret;
+	}
+
+	dirty = kvzalloc(dirty_size, GFP_KERNEL);
+	if (!dirty) {
+		ret = -ENOMEM;
+		goto out_no_dirty;
+	}
+
+	num_samples = max_t(u32, 1, srf->multisample_count);
+	ret = svga3dsurface_setup_cache(&srf->base_size, srf->format, num_mip,
+					num_layers, num_samples, &dirty->cache);
+	if (ret)
+		goto out_no_cache;
+
+	dirty->num_subres = num_subres;
+	dirty->size = acc_size;
+	res->dirty = (struct vmw_resource_dirty *) dirty;
+
+	return 0;
+
+out_no_cache:
+	kvfree(dirty);
+out_no_dirty:
+	ttm_mem_global_free(vmw_mem_glob(res->dev_priv), acc_size);
+	return ret;
+}
+
+/*
+ * vmw_surface_dirty_free - The surface's dirty_free callback
+ */
+static void vmw_surface_dirty_free(struct vmw_resource *res)
+{
+	struct vmw_surface_dirty *dirty =
+		(struct vmw_surface_dirty *) res->dirty;
+	size_t acc_size = dirty->size;
+
+	kvfree(dirty);
+	ttm_mem_global_free(vmw_mem_glob(res->dev_priv), acc_size);
+	res->dirty = NULL;
+}
+
+/*
+ * vmw_surface_clean - The surface's clean callback
+ */
+static int vmw_surface_clean(struct vmw_resource *res)
+{
+	struct vmw_private *dev_priv = res->dev_priv;
+	size_t alloc_size;
+	struct {
+		SVGA3dCmdHeader header;
+		SVGA3dCmdReadbackGBSurface body;
+	} *cmd;
+
+	alloc_size = sizeof(*cmd);
+	cmd = VMW_FIFO_RESERVE(dev_priv, alloc_size);
+	if (!cmd)
+		return -ENOMEM;
+
+	cmd->header.id = SVGA_3D_CMD_READBACK_GB_SURFACE;
+	cmd->header.size = sizeof(cmd->body);
+	cmd->body.sid = res->id;
+	vmw_fifo_commit(dev_priv, alloc_size);
+
+	return 0;
+}
diff --git a/include/uapi/drm/vmwgfx_drm.h b/include/uapi/drm/vmwgfx_drm.h
index 399f58317cff..02cab33f2f25 100644
--- a/include/uapi/drm/vmwgfx_drm.h
+++ b/include/uapi/drm/vmwgfx_drm.h
@@ -891,11 +891,13 @@ struct drm_vmw_shader_arg {
  *                                      surface.
  * @drm_vmw_surface_flag_create_buffer: Create a backup buffer if none is
  *                                      given.
+ * @drm_vmw_surface_flag_coherent:      Back surface with coherent memory.
  */
 enum drm_vmw_surface_flags {
 	drm_vmw_surface_flag_shareable = (1 << 0),
 	drm_vmw_surface_flag_scanout = (1 << 1),
-	drm_vmw_surface_flag_create_buffer = (1 << 2)
+	drm_vmw_surface_flag_create_buffer = (1 << 2),
+	drm_vmw_surface_flag_coherent = (1 << 3),
 };
 
 /**
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Ack to merge through DRM? WAS Re: [PATCH v2 1/5] mm: Add write-protect and clean utilities for address space ranges
  2019-09-26 11:55 ` [PATCH v2 1/5] mm: Add write-protect and clean utilities for address space ranges Thomas Hellström (VMware)
@ 2019-09-26 12:03   ` Thomas Hellström (VMware)
  2019-09-26 19:19     ` Linus Torvalds
  0 siblings, 1 reply; 23+ messages in thread
From: Thomas Hellström (VMware) @ 2019-09-26 12:03 UTC (permalink / raw)
  To: linux-kernel, dri-devel, linux-mm, Andrew Morton, Matthew Wilcox
  Cc: Linus Torvalds

On 9/26/19 1:55 PM, Thomas Hellström (VMware) wrote:
> From: Thomas Hellstrom <thellstrom@vmware.com>
>
> Add two utilities to a) write-protect and b) clean all ptes pointing into
> a range of an address space.
> The utilities are intended to aid in tracking dirty pages (either
> driver-allocated system memory or pci device memory).
> The write-protect utility should be used in conjunction with
> page_mkwrite() and pfn_mkwrite() to trigger write page-faults on page
> accesses. Typically one would want to use this on sparse accesses into
> large memory regions. The clean utility should be used to utilize
> hardware dirtying functionality and avoid the overhead of page-faults,
> typically on large accesses into small memory regions.
>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Matthew Wilcox <willy@infradead.org>
> Cc: Will Deacon <will.deacon@arm.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Rik van Riel <riel@surriel.com>
> Cc: Minchan Kim <minchan@kernel.org>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Huang Ying <ying.huang@intel.com>
> Cc: Souptick Joarder <jrdr.linux@gmail.com>
> Cc: "Jérôme Glisse" <jglisse@redhat.com>
> Cc: linux-mm@kvack.org
> Cc: linux-kernel@vger.kernel.org
>
> Signed-off-by: Thomas Hellstrom <thellstrom@vmware.com>
> Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
> ---

Hi!

I wonder if I can get an ack from an mm maintainer to merge this through 
DRM along with the vmwgfx patches? Andrew? Matthew?

Thanks,
Thomas




^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Ack to merge through DRM? WAS Re: [PATCH v2 1/5] mm: Add write-protect and clean utilities for address space ranges
  2019-09-26 12:03   ` Ack to merge through DRM? WAS " Thomas Hellström (VMware)
@ 2019-09-26 19:19     ` Linus Torvalds
  2019-09-26 20:09       ` Thomas Hellström (VMware)
  0 siblings, 1 reply; 23+ messages in thread
From: Linus Torvalds @ 2019-09-26 19:19 UTC (permalink / raw)
  To: Thomas Hellström (VMware)
  Cc: Linux Kernel Mailing List, dri-devel, Linux-MM, Andrew Morton,
	Matthew Wilcox

On Thu, Sep 26, 2019 at 5:03 AM Thomas Hellström (VMware)
<thomas_os@shipmail.org> wrote:
>
> I wonder if I can get an ack from an mm maintainer to merge this through
> DRM along with the vmwgfx patches? Andrew? Matthew?

It would have helped to actually point to the patch itself, instead of
just quoting the commit message.

Looks like this:

     https://lore.kernel.org/lkml/20190926115548.44000-2-thomas_os@shipmail.org/

but why is the code in question not just using the regular page
walkers. The commit log shows no explanation of what's so special
about this?

Is the only reason the locking magic? Because if that's the reason,
then afaik we already have a function for that: it's
__walk_page_range().

Yes, it's static right now, but that's imho not a reason to duplicate
all the walking (badly).

Is there some other magic reason that isn't documented?

              Linus

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Ack to merge through DRM? WAS Re: [PATCH v2 1/5] mm: Add write-protect and clean utilities for address space ranges
  2019-09-26 19:19     ` Linus Torvalds
@ 2019-09-26 20:09       ` Thomas Hellström (VMware)
  2019-09-26 20:16         ` Linus Torvalds
  0 siblings, 1 reply; 23+ messages in thread
From: Thomas Hellström (VMware) @ 2019-09-26 20:09 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Linux Kernel Mailing List, dri-devel, Linux-MM, Andrew Morton,
	Matthew Wilcox

Hi,

On 9/26/19 9:19 PM, Linus Torvalds wrote:
> On Thu, Sep 26, 2019 at 5:03 AM Thomas Hellström (VMware)
> <thomas_os@shipmail.org> wrote:
>> I wonder if I can get an ack from an mm maintainer to merge this through
>> DRM along with the vmwgfx patches? Andrew? Matthew?
> It would have helped to actually point to the patch itself, instead of
> just quoting the commit message.
>
> Looks like this:
>
>       https://lore.kernel.org/lkml/20190926115548.44000-2-thomas_os@shipmail.org/
>
> but why is the code in question not just using the regular page
> walkers. The commit log shows no explanation of what's so special
> about this?
>
> Is the only reason the locking magic? Because if that's the reason,
> then afaik we already have a function for that: it's
> __walk_page_range().
>
> Yes, it's static right now, but that's imho not a reason to duplicate
> all the walking (badly).
>
> Is there some other magic reason that isn't documented?
>
>                Linus

There is a discussion around this subject in

https://lore.kernel.org/lkml/20190926115548.44000-1-thomas_os@shipmail.org/

The main point is that there is an assert in pud_trans_huge_lock() that 
the mmap_sem is held, and we don't have it. Presumably we should be able 
to get away with the i_mmap_lock, but in addition I would have had to 
include the walk_as_pte_range() as the walk::pmd_entry anyway, so the 
amount of duplicated page walk code isn't that big.

That said, if people are OK with me modifying the assert in 
pud_trans_huge_lock() and make __walk_page_range non-static, it should 
probably be possible to make it work, yes.

Thanks,
Thomas



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Ack to merge through DRM? WAS Re: [PATCH v2 1/5] mm: Add write-protect and clean utilities for address space ranges
  2019-09-26 20:09       ` Thomas Hellström (VMware)
@ 2019-09-26 20:16         ` Linus Torvalds
  2019-09-26 20:55           ` Thomas Hellström (VMware)
                             ` (2 more replies)
  0 siblings, 3 replies; 23+ messages in thread
From: Linus Torvalds @ 2019-09-26 20:16 UTC (permalink / raw)
  To: Thomas Hellström (VMware)
  Cc: Linux Kernel Mailing List, dri-devel, Linux-MM, Andrew Morton,
	Matthew Wilcox

On Thu, Sep 26, 2019 at 1:09 PM Thomas Hellström (VMware)
<thomas_os@shipmail.org> wrote:
>
> That said, if people are OK with me modifying the assert in
> pud_trans_huge_lock() and make __walk_page_range non-static, it should
> probably be possible to make it work, yes.

I don't think you need to modify that assert at all.

That thing only exists when there's a "pud_entry" op in the walker,
and then you absolutely need to have that mmap_lock.

As far as I can tell, you fundamentally only ever work on a pte level
in your address space walker already and actually have a WARN_ON() on
the pud_huge thing, so no pud entry can possibly apply.

So no, the assert in pud_trans_huge_lock() does not seem to be a
reason not to just use the existing page table walkers.

And once you get rid of the walking, what is left? Just the "iterate
over the inode mappings" part. Which could just be done in
mm/pagewalk.c, and then you don't even need to remove the static.

So making it be just another walking in pagewalk.c would seem to be
the simplest model.

Call it "walk_page_mapping()". And talk extensively about how the
locking differs a lot from the usual "walk_page_vma()" things.

The then actual "apply" functions (what a horrid name) could be in the
users. They shouldn't be mixed in with the walking functions anyway.
They are callbacks, not walkers.

             Linus

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Ack to merge through DRM? WAS Re: [PATCH v2 1/5] mm: Add write-protect and clean utilities for address space ranges
  2019-09-26 20:16         ` Linus Torvalds
@ 2019-09-26 20:55           ` Thomas Hellström (VMware)
  2019-09-26 22:20             ` Linus Torvalds
  2019-09-27 12:17           ` Kirill A. Shutemov
  2019-10-02  9:21           ` Thomas Hellström (VMware)
  2 siblings, 1 reply; 23+ messages in thread
From: Thomas Hellström (VMware) @ 2019-09-26 20:55 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Linux Kernel Mailing List, dri-devel, Linux-MM, Andrew Morton,
	Matthew Wilcox

On 9/26/19 10:16 PM, Linus Torvalds wrote:
> On Thu, Sep 26, 2019 at 1:09 PM Thomas Hellström (VMware)
> <thomas_os@shipmail.org> wrote:
>> That said, if people are OK with me modifying the assert in
>> pud_trans_huge_lock() and make __walk_page_range non-static, it should
>> probably be possible to make it work, yes.
> I don't think you need to modify that assert at all.
>
> That thing only exists when there's a "pud_entry" op in the walker,
> and then you absolutely need to have that mmap_lock.
>
> As far as I can tell, you fundamentally only ever work on a pte level
> in your address space walker already and actually have a WARN_ON() on
> the pud_huge thing, so no pud entry can possibly apply.

Well, we're working on supporting huge puds and pmds in the graphics 
VMAs, although in the write-notify cases we're looking at here, we would 
probably want to split them down to PTE level. But if we would want to 
extend that or if we would want to make this interface general, we'd 
probably want to support also a pud_entry callback.

Looking at zap_pud_range() which when called from unmap_mapping_pages() 
uses identical locking (no mmap_sem), it seems we should be able to get 
away with i_mmap_lock(), making sure the whole page table doesn't 
disappear under us. So it's not clear to me why the mmap_sem is strictly 
needed here. Better to sort those restrictions out now rather than when 
huge entries start appearing.

>
> So no, the assert in pud_trans_huge_lock() does not seem to be a
> reason not to just use the existing page table walkers.
>
> And once you get rid of the walking, what is left? Just the "iterate
> over the inode mappings" part. Which could just be done in
> mm/pagewalk.c, and then you don't even need to remove the static.
>
> So making it be just another walking in pagewalk.c would seem to be
> the simplest model.
>
> Call it "walk_page_mapping()". And talk extensively about how the
> locking differs a lot from the usual "walk_page_vma()" things.

Sure. Can do that.

Thanks,

Thomas



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Ack to merge through DRM? WAS Re: [PATCH v2 1/5] mm: Add write-protect and clean utilities for address space ranges
  2019-09-26 20:55           ` Thomas Hellström (VMware)
@ 2019-09-26 22:20             ` Linus Torvalds
  2019-09-27  5:55               ` Thomas Hellström (VMware)
  2019-09-27 12:26               ` Kirill A. Shutemov
  0 siblings, 2 replies; 23+ messages in thread
From: Linus Torvalds @ 2019-09-26 22:20 UTC (permalink / raw)
  To: Thomas Hellström (VMware)
  Cc: Linux Kernel Mailing List, dri-devel, Linux-MM, Andrew Morton,
	Matthew Wilcox

On Thu, Sep 26, 2019 at 1:55 PM Thomas Hellström (VMware)
<thomas_os@shipmail.org> wrote:
>
> Well, we're working on supporting huge puds and pmds in the graphics
> VMAs, although in the write-notify cases we're looking at here, we would
> probably want to split them down to PTE level.

Well, that's what the existing walker code does if you don't have that
"pud_entry()" callback.

That said, I assume you would *not* want to do that if the huge
pud/pmd is already clean and read-only, but just continue.

So you may want to have a special pud_entry() that handles that case.
Eventually. Maybe. Although honestly, if you're doing dirty tracking,
I doubt it makes much sense to use largepages.

> Looking at zap_pud_range() which when called from unmap_mapping_pages()
> uses identical locking (no mmap_sem), it seems we should be able to get
> away with i_mmap_lock(), making sure the whole page table doesn't
> disappear under us. So it's not clear to me why the mmap_sem is strictly
> needed here. Better to sort those restrictions out now rather than when
> huge entries start appearing.

zap_pud_range()actually does have that

               VM_BUG_ON_VMA(!rwsem_is_locked(&tlb->mm->mmap_sem), vma);

exactly for the case where it might have to split the pud entry.

Zapping the whole thing it does do without the assert.

I'm not going to swear the mmap_sem is absolutely required, since a
shared vma should be stable due to the i_mmap_lock, but splitting the
hugepage really is a fairly big deal.

It can't happen if you zap the *whole* mapping, but it can happen if
you have a start/end range. Like you do.

Also, in general it's probably not a great idea to look at
zap_page_range() (and copy_page_range()) for ideas.

They are kind of special, since they tend to be used for fundamental
whole-address-space operations (ie fork/exit) and so as a result they
get to do special things that a normal page walker generally shouldn't
do.

It's why they've never gotten translated to use the generic walker code.

           Linus

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Ack to merge through DRM? WAS Re: [PATCH v2 1/5] mm: Add write-protect and clean utilities for address space ranges
  2019-09-26 22:20             ` Linus Torvalds
@ 2019-09-27  5:55               ` Thomas Hellström (VMware)
  2019-09-27  9:27                 ` Thomas Hellström (VMware)
  2019-09-27 12:26               ` Kirill A. Shutemov
  1 sibling, 1 reply; 23+ messages in thread
From: Thomas Hellström (VMware) @ 2019-09-27  5:55 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Linux Kernel Mailing List, dri-devel, Linux-MM, Andrew Morton,
	Matthew Wilcox

On 9/27/19 12:20 AM, Linus Torvalds wrote:
> On Thu, Sep 26, 2019 at 1:55 PM Thomas Hellström (VMware)
> <thomas_os@shipmail.org> wrote:
>> Well, we're working on supporting huge puds and pmds in the graphics
>> VMAs, although in the write-notify cases we're looking at here, we would
>> probably want to split them down to PTE level.
> Well, that's what the existing walker code does if you don't have that
> "pud_entry()" callback.
>
> That said, I assume you would *not* want to do that if the huge
> pud/pmd is already clean and read-only, but just continue.
>
> So you may want to have a special pud_entry() that handles that case.
> Eventually. Maybe. Although honestly, if you're doing dirty tracking,
> I doubt it makes much sense to use largepages.

The approach we're looking at in this case is to keep huge entries 
write-protected and split them in the wp_huge_xxx() code's fallback path 
with the mmap_sem held. This means that there will actually be huge 
entries in the page-walking code soon, but as you say, only entries that 
we want to ignore and not split. So we'd also need a way to avoid the 
pagewalk splitting for the situation when someone faults a huge entry in 
just before the call to split_huge_xxx.

>
>> Looking at zap_pud_range() which when called from unmap_mapping_pages()
>> uses identical locking (no mmap_sem), it seems we should be able to get
>> away with i_mmap_lock(), making sure the whole page table doesn't
>> disappear under us. So it's not clear to me why the mmap_sem is strictly
>> needed here. Better to sort those restrictions out now rather than when
>> huge entries start appearing.
> zap_pud_range()actually does have that
>
>                 VM_BUG_ON_VMA(!rwsem_is_locked(&tlb->mm->mmap_sem), vma);
>
> exactly for the case where it might have to split the pud entry.

Yes. My take on this is that locking the PUD ptl can be done either with 
the mmap_sem or the i_mmap_lock if present and that we should update the 
asserts in xxx_trans_huge_lock to reflect that. But when actually 
splitting transhuge pages you don't want to race with khugepaged, so you 
need the mmap_sem. For the graphics VMAs (MIXEDMAP), khugepaged never 
touches them. Yet.

>
> It's why they've never gotten translated to use the generic walker code.

OK. Yes there are a number of various specialized pagewalks all over the 
mm code.

But another thing that worries me is that the page-table modifications 
that happen in the callback use functionality that is not guaranteed to 
be exported, and that mm people don't want them to be exported because 
you don't want the drivers to go hacking around in page tables, which 
means that the two callbacks used here would need to be a set of core 
helpers anyway.

So I figure what I would end up with would actually be an extern 
__walk_page_range anyway, and slightly modified asserts. Do you think 
that could be acceptible?

Thanks,

Thomas


>
>             Linus



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Ack to merge through DRM? WAS Re: [PATCH v2 1/5] mm: Add write-protect and clean utilities for address space ranges
  2019-09-27  5:55               ` Thomas Hellström (VMware)
@ 2019-09-27  9:27                 ` Thomas Hellström (VMware)
  0 siblings, 0 replies; 23+ messages in thread
From: Thomas Hellström (VMware) @ 2019-09-27  9:27 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Linux Kernel Mailing List, dri-devel, Linux-MM, Andrew Morton,
	Matthew Wilcox

On 9/27/19 7:55 AM, Thomas Hellström (VMware) wrote:
> On 9/27/19 12:20 AM, Linus Torvalds wrote:
>> On Thu, Sep 26, 2019 at 1:55 PM Thomas Hellström (VMware)
>> <thomas_os@shipmail.org> wrote:
>>> Well, we're working on supporting huge puds and pmds in the graphics
>>> VMAs, although in the write-notify cases we're looking at here, we 
>>> would
>>> probably want to split them down to PTE level.
>> Well, that's what the existing walker code does if you don't have that
>> "pud_entry()" callback.
>>
>> That said, I assume you would *not* want to do that if the huge
>> pud/pmd is already clean and read-only, but just continue.
>>
>> So you may want to have a special pud_entry() that handles that case.
>> Eventually. Maybe. Although honestly, if you're doing dirty tracking,
>> I doubt it makes much sense to use largepages.
>
> The approach we're looking at in this case is to keep huge entries 
> write-protected and split them in the wp_huge_xxx() code's fallback 
> path with the mmap_sem held. This means that there will actually be 
> huge entries in the page-walking code soon, but as you say, only 
> entries that we want to ignore and not split. So we'd also need a way 
> to avoid the pagewalk splitting for the situation when someone faults 
> a huge entry in just before the call to split_huge_xxx.
>
>>
>>> Looking at zap_pud_range() which when called from unmap_mapping_pages()
>>> uses identical locking (no mmap_sem), it seems we should be able to get
>>> away with i_mmap_lock(), making sure the whole page table doesn't
>>> disappear under us. So it's not clear to me why the mmap_sem is 
>>> strictly
>>> needed here. Better to sort those restrictions out now rather than when
>>> huge entries start appearing.
>> zap_pud_range()actually does have that
>>
>> VM_BUG_ON_VMA(!rwsem_is_locked(&tlb->mm->mmap_sem), vma);
>>
>> exactly for the case where it might have to split the pud entry.
>
> Yes. My take on this is that locking the PUD ptl can be done either 
> with the mmap_sem or the i_mmap_lock if present and that we should 
> update the asserts in xxx_trans_huge_lock to reflect that. But when 
> actually splitting transhuge pages you don't want to race with 
> khugepaged, so you need the mmap_sem. For the graphics VMAs 
> (MIXEDMAP), khugepaged never touches them. Yet.
>
>>
>> It's why they've never gotten translated to use the generic walker code.
>
> OK. Yes there are a number of various specialized pagewalks all over 
> the mm code.
>
> But another thing that worries me is that the page-table modifications 
> that happen in the callback use functionality that is not guaranteed 
> to be exported, and that mm people don't want them to be exported 
> because you don't want the drivers to go hacking around in page 
> tables, which means that the two callbacks used here would need to be 
> a set of core helpers anyway.
>
> So I figure what I would end up with would actually be an extern 
> __walk_page_range anyway, and slightly modified asserts. Do you think 
> that could be acceptible?


Actually, I'll give your original suggestion a try and see what I come 
up with.

Thanks,
Thomas


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Ack to merge through DRM? WAS Re: [PATCH v2 1/5] mm: Add write-protect and clean utilities for address space ranges
  2019-09-26 20:16         ` Linus Torvalds
  2019-09-26 20:55           ` Thomas Hellström (VMware)
@ 2019-09-27 12:17           ` Kirill A. Shutemov
  2019-09-27 16:39             ` Linus Torvalds
  2019-10-02  9:21           ` Thomas Hellström (VMware)
  2 siblings, 1 reply; 23+ messages in thread
From: Kirill A. Shutemov @ 2019-09-27 12:17 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Thomas Hellström (VMware),
	Linux Kernel Mailing List, dri-devel, Linux-MM, Andrew Morton,
	Matthew Wilcox

On Thu, Sep 26, 2019 at 01:16:55PM -0700, Linus Torvalds wrote:
> On Thu, Sep 26, 2019 at 1:09 PM Thomas Hellström (VMware)
> <thomas_os@shipmail.org> wrote:
> >
> > That said, if people are OK with me modifying the assert in
> > pud_trans_huge_lock() and make __walk_page_range non-static, it should
> > probably be possible to make it work, yes.
> 
> I don't think you need to modify that assert at all.
> 
> That thing only exists when there's a "pud_entry" op in the walker,
> and then you absolutely need to have that mmap_lock.
> 
> As far as I can tell, you fundamentally only ever work on a pte level
> in your address space walker already and actually have a WARN_ON() on
> the pud_huge thing, so no pud entry can possibly apply.
> 
> So no, the assert in pud_trans_huge_lock() does not seem to be a
> reason not to just use the existing page table walkers.
> 
> And once you get rid of the walking, what is left? Just the "iterate
> over the inode mappings" part. Which could just be done in
> mm/pagewalk.c, and then you don't even need to remove the static.
> 
> So making it be just another walking in pagewalk.c would seem to be
> the simplest model.
> 
> Call it "walk_page_mapping()". And talk extensively about how the
> locking differs a lot from the usual "walk_page_vma()" things.

Walking mappings of a page is what rmap does. This code thas to be
integrated there.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Ack to merge through DRM? WAS Re: [PATCH v2 1/5] mm: Add write-protect and clean utilities for address space ranges
  2019-09-26 22:20             ` Linus Torvalds
  2019-09-27  5:55               ` Thomas Hellström (VMware)
@ 2019-09-27 12:26               ` Kirill A. Shutemov
  1 sibling, 0 replies; 23+ messages in thread
From: Kirill A. Shutemov @ 2019-09-27 12:26 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Thomas Hellström (VMware),
	Linux Kernel Mailing List, dri-devel, Linux-MM, Andrew Morton,
	Matthew Wilcox

On Thu, Sep 26, 2019 at 03:20:42PM -0700, Linus Torvalds wrote:
> On Thu, Sep 26, 2019 at 1:55 PM Thomas Hellström (VMware)
> <thomas_os@shipmail.org> wrote:
> >
> > Well, we're working on supporting huge puds and pmds in the graphics
> > VMAs, although in the write-notify cases we're looking at here, we would
> > probably want to split them down to PTE level.
> 
> Well, that's what the existing walker code does if you don't have that
> "pud_entry()" callback.
> 
> That said, I assume you would *not* want to do that if the huge
> pud/pmd is already clean and read-only, but just continue.
> 
> So you may want to have a special pud_entry() that handles that case.
> Eventually. Maybe. Although honestly, if you're doing dirty tracking,
> I doubt it makes much sense to use largepages.
> 
> > Looking at zap_pud_range() which when called from unmap_mapping_pages()
> > uses identical locking (no mmap_sem), it seems we should be able to get
> > away with i_mmap_lock(), making sure the whole page table doesn't
> > disappear under us. So it's not clear to me why the mmap_sem is strictly
> > needed here. Better to sort those restrictions out now rather than when
> > huge entries start appearing.
> 
> zap_pud_range()actually does have that
> 
>                VM_BUG_ON_VMA(!rwsem_is_locked(&tlb->mm->mmap_sem), vma);

The VM_BUG is a blind copy from PMD layer and it's bogus. i_mmap_lock()
works fine for file mappings.

The PMD was intended for THP case at the time when there were only
anon-THP. The check was relaxed and later dropped for file-THP on PMD
level. It has to be dropped on PUD too. We don't have anon-THP on PUD
level at all, only DAX played with them.
 
-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Ack to merge through DRM? WAS Re: [PATCH v2 1/5] mm: Add write-protect and clean utilities for address space ranges
  2019-09-27 12:17           ` Kirill A. Shutemov
@ 2019-09-27 16:39             ` Linus Torvalds
  2019-09-30 13:03               ` Kirill A. Shutemov
  0 siblings, 1 reply; 23+ messages in thread
From: Linus Torvalds @ 2019-09-27 16:39 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Thomas Hellström (VMware),
	Linux Kernel Mailing List, dri-devel, Linux-MM, Andrew Morton,
	Matthew Wilcox

On Fri, Sep 27, 2019 at 5:17 AM Kirill A. Shutemov <kirill@shutemov.name> wrote:
>
> > Call it "walk_page_mapping()". And talk extensively about how the
> > locking differs a lot from the usual "walk_page_vma()" things.
>
> Walking mappings of a page is what rmap does. This code thas to be
> integrated there.

Well, that's very questionable.

The rmap code mainly does the "page -> virtual" mapping.  One page at a time.

The page walker code does the "virtual -> pte" mapping. Always a whole
range at a time.

The new code wants a combination of both.

It very much is about walking ranges - as in mm/pagewalk.c. It's just
that it walks potentially multiple ranges, based on where the address
space is mapped.

I think it has way more commonalities with the page walking code than
it has with the rmap code. But yes, there is some of that "look up
mappings based on address space" in there too, but it's the least part
of it

And as Thomas pointed out, it also has commonalities with
unmap_mapping_pages() in mm/memory.c. In many ways that part is the
closest.

I'd say that from a code sharing standpoint, mm/rmap.c is absolutely
the wrong place. It's the furthest away from what Thomas wants to do.

The mm/pagewalk.c code has the most actual code that could be shared,
and the addition would be smallest there.

And conceptually the closest analogue in terms of what it _does_ is
unmap_mapping_range() in mm/memory.c, but I see no room for sharing
actual code there unless we completely change how we do
zap_page_range() and add a lot of configurability there (which we
don't want, because page table teardown at exit is really a pretty
critical operation - I commonly see copy_page_range() and
zap_page_range() on profiles if you have things like script-heavyu
traditional UNIX loads).

So I think conceptually, mm/memory.c and unmap_mapping_range() is
closest but I don't think it's practical to share code.

And between mm/pagewalk.c and mm/rmap.c, I think the page walking has
way more of actual practical code sharing, and is also conceptually
closer because most of the code is about walking a range, not looking
up the mapping of one page.

               Linus

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Ack to merge through DRM? WAS Re: [PATCH v2 1/5] mm: Add write-protect and clean utilities for address space ranges
  2019-09-27 16:39             ` Linus Torvalds
@ 2019-09-30 13:03               ` Kirill A. Shutemov
  2019-09-30 17:12                 ` Linus Torvalds
  0 siblings, 1 reply; 23+ messages in thread
From: Kirill A. Shutemov @ 2019-09-30 13:03 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Thomas Hellström (VMware),
	Linux Kernel Mailing List, dri-devel, Linux-MM, Andrew Morton,
	Matthew Wilcox

On Fri, Sep 27, 2019 at 09:39:48AM -0700, Linus Torvalds wrote:
> On Fri, Sep 27, 2019 at 5:17 AM Kirill A. Shutemov <kirill@shutemov.name> wrote:
> >
> > > Call it "walk_page_mapping()". And talk extensively about how the
> > > locking differs a lot from the usual "walk_page_vma()" things.
> >
> > Walking mappings of a page is what rmap does. This code thas to be
> > integrated there.
> 
> Well, that's very questionable.
> 
> The rmap code mainly does the "page -> virtual" mapping.  One page at a time.
> 
> The page walker code does the "virtual -> pte" mapping. Always a whole
> range at a time.

Have you seen page_vma_mapped_walk()? I made it specifically for rmap code
to cover cases when a THP is mapped with PTEs. To me it's not a big
stretch to make it cover multiple pages too.

> So I think conceptually, mm/memory.c and unmap_mapping_range() is
> closest but I don't think it's practical to share code.
> 
> And between mm/pagewalk.c and mm/rmap.c, I think the page walking has
> way more of actual practical code sharing, and is also conceptually
> closer because most of the code is about walking a range, not looking
> up the mapping of one page.

I guess it's matter of personal preferences, but page table walkers based
on callback always felt wrong to me.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Ack to merge through DRM? WAS Re: [PATCH v2 1/5] mm: Add write-protect and clean utilities for address space ranges
  2019-09-30 13:03               ` Kirill A. Shutemov
@ 2019-09-30 17:12                 ` Linus Torvalds
  2019-09-30 17:38                   ` Thomas Hellström (VMware)
  0 siblings, 1 reply; 23+ messages in thread
From: Linus Torvalds @ 2019-09-30 17:12 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Thomas Hellström (VMware),
	Linux Kernel Mailing List, dri-devel, Linux-MM, Andrew Morton,
	Matthew Wilcox

On Mon, Sep 30, 2019 at 6:04 AM Kirill A. Shutemov <kirill@shutemov.name> wrote:
>
> Have you seen page_vma_mapped_walk()? I made it specifically for rmap code
> to cover cases when a THP is mapped with PTEs. To me it's not a big
> stretch to make it cover multiple pages too.

I agree that is closer, but it does make for calling that big complex
function for every iteration step.

Of course, you are right that the callback approach is problematic
too, now that we have retpoline issues, making those very expensive.
But at least that hopefully gets fixed some day and gets to be a rare
problem.

Matter ot taste, I guess.

              Linus

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Ack to merge through DRM? WAS Re: [PATCH v2 1/5] mm: Add write-protect and clean utilities for address space ranges
  2019-09-30 17:12                 ` Linus Torvalds
@ 2019-09-30 17:38                   ` Thomas Hellström (VMware)
  0 siblings, 0 replies; 23+ messages in thread
From: Thomas Hellström (VMware) @ 2019-09-30 17:38 UTC (permalink / raw)
  To: Linus Torvalds, Kirill A. Shutemov
  Cc: Linux Kernel Mailing List, dri-devel, Linux-MM, Andrew Morton,
	Matthew Wilcox

On 9/30/19 7:12 PM, Linus Torvalds wrote:
> On Mon, Sep 30, 2019 at 6:04 AM Kirill A. Shutemov <kirill@shutemov.name> wrote:
>> Have you seen page_vma_mapped_walk()? I made it specifically for rmap code
>> to cover cases when a THP is mapped with PTEs. To me it's not a big
>> stretch to make it cover multiple pages too.
> I agree that is closer, but it does make for calling that big complex
> function for every iteration step.
>
> Of course, you are right that the callback approach is problematic
> too, now that we have retpoline issues, making those very expensive.
> But at least that hopefully gets fixed some day and gets to be a rare
> problem.
>
> Matter ot taste, I guess.
>
>                Linus

Matthew Wilcox suggested something similar before the pagewalk.c rewrite:

https://lore.kernel.org/lkml/20190806190937.GD30179@bombadil.infradead.org/

Still, In this case I'd opt for using the pagewalk code: In the dirty 
helpers we don't ever use a struct page, but only deal with PTE entries, 
same as the pagewalk code does, but not page_vma_mapped_walk(). The 
underlying memory may well be IO memory.

/Thomas



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Ack to merge through DRM? WAS Re: [PATCH v2 1/5] mm: Add write-protect and clean utilities for address space ranges
  2019-09-26 20:16         ` Linus Torvalds
  2019-09-26 20:55           ` Thomas Hellström (VMware)
  2019-09-27 12:17           ` Kirill A. Shutemov
@ 2019-10-02  9:21           ` Thomas Hellström (VMware)
  2019-10-02 13:18             ` Kirill A. Shutemov
  2 siblings, 1 reply; 23+ messages in thread
From: Thomas Hellström (VMware) @ 2019-10-02  9:21 UTC (permalink / raw)
  To: Linus Torvalds, Kirill A. Shutemov
  Cc: Linux Kernel Mailing List, dri-devel, Linux-MM, Andrew Morton,
	Matthew Wilcox

On 9/26/19 10:16 PM, Linus Torvalds wrote:
> On Thu, Sep 26, 2019 at 1:09 PM Thomas Hellström (VMware)
> <thomas_os@shipmail.org> wrote:
>> That said, if people are OK with me modifying the assert in
>> pud_trans_huge_lock() and make __walk_page_range non-static, it should
>> probably be possible to make it work, yes.
> I don't think you need to modify that assert at all.
>
> That thing only exists when there's a "pud_entry" op in the walker,
> and then you absolutely need to have that mmap_lock.
>
> As far as I can tell, you fundamentally only ever work on a pte level
> in your address space walker already and actually have a WARN_ON() on
> the pud_huge thing, so no pud entry can possibly apply.
>
> So no, the assert in pud_trans_huge_lock() does not seem to be a
> reason not to just use the existing page table walkers.
>
> And once you get rid of the walking, what is left? Just the "iterate
> over the inode mappings" part. Which could just be done in
> mm/pagewalk.c, and then you don't even need to remove the static.
>
> So making it be just another walking in pagewalk.c would seem to be
> the simplest model.
>
> Call it "walk_page_mapping()". And talk extensively about how the
> locking differs a lot from the usual "walk_page_vma()" things.
>
> The then actual "apply" functions (what a horrid name) could be in the
> users. They shouldn't be mixed in with the walking functions anyway.
> They are callbacks, not walkers.
>
>               Linus

Linus, Kirill

I've pushed a reworked version based on the pagewalk code here:

https://cgit.freedesktop.org/~thomash/linux/log/?h=pagewalk

(top three patched)

with users included here:

https://cgit.freedesktop.org/~thomash/linux/log/?h=coherent-rebased

Do you think this could work? The reason that the "mm: Add write-protect 
and clean.." code is still in mm as a set of helpers, is of course that 
much of the needed functionality is not exported, presumably since we 
want to keep page table manipulation in mm.

Thanks,

Thomas



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Ack to merge through DRM? WAS Re: [PATCH v2 1/5] mm: Add write-protect and clean utilities for address space ranges
  2019-10-02  9:21           ` Thomas Hellström (VMware)
@ 2019-10-02 13:18             ` Kirill A. Shutemov
  2019-10-02 13:28               ` Thomas Hellström (VMware)
  0 siblings, 1 reply; 23+ messages in thread
From: Kirill A. Shutemov @ 2019-10-02 13:18 UTC (permalink / raw)
  To: Thomas Hellström (VMware)
  Cc: Linus Torvalds, Linux Kernel Mailing List, dri-devel, Linux-MM,
	Andrew Morton, Matthew Wilcox

On Wed, Oct 02, 2019 at 11:21:01AM +0200, Thomas Hellström (VMware) wrote:
> On 9/26/19 10:16 PM, Linus Torvalds wrote:
> > On Thu, Sep 26, 2019 at 1:09 PM Thomas Hellström (VMware)
> > <thomas_os@shipmail.org> wrote:
> > > That said, if people are OK with me modifying the assert in
> > > pud_trans_huge_lock() and make __walk_page_range non-static, it should
> > > probably be possible to make it work, yes.
> > I don't think you need to modify that assert at all.
> > 
> > That thing only exists when there's a "pud_entry" op in the walker,
> > and then you absolutely need to have that mmap_lock.
> > 
> > As far as I can tell, you fundamentally only ever work on a pte level
> > in your address space walker already and actually have a WARN_ON() on
> > the pud_huge thing, so no pud entry can possibly apply.
> > 
> > So no, the assert in pud_trans_huge_lock() does not seem to be a
> > reason not to just use the existing page table walkers.
> > 
> > And once you get rid of the walking, what is left? Just the "iterate
> > over the inode mappings" part. Which could just be done in
> > mm/pagewalk.c, and then you don't even need to remove the static.
> > 
> > So making it be just another walking in pagewalk.c would seem to be
> > the simplest model.
> > 
> > Call it "walk_page_mapping()". And talk extensively about how the
> > locking differs a lot from the usual "walk_page_vma()" things.
> > 
> > The then actual "apply" functions (what a horrid name) could be in the
> > users. They shouldn't be mixed in with the walking functions anyway.
> > They are callbacks, not walkers.
> > 
> >               Linus
> 
> Linus, Kirill
> 
> I've pushed a reworked version based on the pagewalk code here:
> 
> https://cgit.freedesktop.org/~thomash/linux/log/?h=pagewalk
> 
> (top three patched)
> 
> with users included here:
> 
> https://cgit.freedesktop.org/~thomash/linux/log/?h=coherent-rebased
> 
> Do you think this could work? The reason that the "mm: Add write-protect and
> clean.." code is still in mm as a set of helpers, is of course that much of
> the needed functionality is not exported, presumably since we want to keep
> page table manipulation in mm.

Could you post it to the mailing list? It's easier to review this way.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Ack to merge through DRM? WAS Re: [PATCH v2 1/5] mm: Add write-protect and clean utilities for address space ranges
  2019-10-02 13:18             ` Kirill A. Shutemov
@ 2019-10-02 13:28               ` Thomas Hellström (VMware)
  0 siblings, 0 replies; 23+ messages in thread
From: Thomas Hellström (VMware) @ 2019-10-02 13:28 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Linus Torvalds, Linux Kernel Mailing List, dri-devel, Linux-MM,
	Andrew Morton, Matthew Wilcox

On 10/2/19 3:18 PM, Kirill A. Shutemov wrote:
> On Wed, Oct 02, 2019 at 11:21:01AM +0200, Thomas Hellström (VMware) wrote:
>> On 9/26/19 10:16 PM, Linus Torvalds wrote:
>>> On Thu, Sep 26, 2019 at 1:09 PM Thomas Hellström (VMware)
>>> <thomas_os@shipmail.org> wrote:
>>>> That said, if people are OK with me modifying the assert in
>>>> pud_trans_huge_lock() and make __walk_page_range non-static, it should
>>>> probably be possible to make it work, yes.
>>> I don't think you need to modify that assert at all.
>>>
>>> That thing only exists when there's a "pud_entry" op in the walker,
>>> and then you absolutely need to have that mmap_lock.
>>>
>>> As far as I can tell, you fundamentally only ever work on a pte level
>>> in your address space walker already and actually have a WARN_ON() on
>>> the pud_huge thing, so no pud entry can possibly apply.
>>>
>>> So no, the assert in pud_trans_huge_lock() does not seem to be a
>>> reason not to just use the existing page table walkers.
>>>
>>> And once you get rid of the walking, what is left? Just the "iterate
>>> over the inode mappings" part. Which could just be done in
>>> mm/pagewalk.c, and then you don't even need to remove the static.
>>>
>>> So making it be just another walking in pagewalk.c would seem to be
>>> the simplest model.
>>>
>>> Call it "walk_page_mapping()". And talk extensively about how the
>>> locking differs a lot from the usual "walk_page_vma()" things.
>>>
>>> The then actual "apply" functions (what a horrid name) could be in the
>>> users. They shouldn't be mixed in with the walking functions anyway.
>>> They are callbacks, not walkers.
>>>
>>>                Linus
>> Linus, Kirill
>>
>> I've pushed a reworked version based on the pagewalk code here:
>>
>> https://cgit.freedesktop.org/~thomash/linux/log/?h=pagewalk
>>
>> (top three patched)
>>
>> with users included here:
>>
>> https://cgit.freedesktop.org/~thomash/linux/log/?h=coherent-rebased
>>
>> Do you think this could work? The reason that the "mm: Add write-protect and
>> clean.." code is still in mm as a set of helpers, is of course that much of
>> the needed functionality is not exported, presumably since we want to keep
>> page table manipulation in mm.
> Could you post it to the mailing list? It's easier to review this way.
>
Sure.

/Thomas



^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2019-10-02 13:28 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-09-26 11:55 [PATCH v2 0/5] Emulated coherent graphics memory take 2 Thomas Hellström (VMware)
2019-09-26 11:55 ` [PATCH v2 1/5] mm: Add write-protect and clean utilities for address space ranges Thomas Hellström (VMware)
2019-09-26 12:03   ` Ack to merge through DRM? WAS " Thomas Hellström (VMware)
2019-09-26 19:19     ` Linus Torvalds
2019-09-26 20:09       ` Thomas Hellström (VMware)
2019-09-26 20:16         ` Linus Torvalds
2019-09-26 20:55           ` Thomas Hellström (VMware)
2019-09-26 22:20             ` Linus Torvalds
2019-09-27  5:55               ` Thomas Hellström (VMware)
2019-09-27  9:27                 ` Thomas Hellström (VMware)
2019-09-27 12:26               ` Kirill A. Shutemov
2019-09-27 12:17           ` Kirill A. Shutemov
2019-09-27 16:39             ` Linus Torvalds
2019-09-30 13:03               ` Kirill A. Shutemov
2019-09-30 17:12                 ` Linus Torvalds
2019-09-30 17:38                   ` Thomas Hellström (VMware)
2019-10-02  9:21           ` Thomas Hellström (VMware)
2019-10-02 13:18             ` Kirill A. Shutemov
2019-10-02 13:28               ` Thomas Hellström (VMware)
2019-09-26 11:55 ` [PATCH v2 2/5] drm/vmwgfx: Implement an infrastructure for write-coherent resources Thomas Hellström (VMware)
2019-09-26 11:55 ` [PATCH v2 3/5] drm/vmwgfx: Use an RBtree instead of linked list for MOB resources Thomas Hellström (VMware)
2019-09-26 11:55 ` [PATCH v2 4/5] drm/vmwgfx: Implement an infrastructure for read-coherent resources Thomas Hellström (VMware)
2019-09-26 11:55 ` [PATCH v2 5/5] drm/vmwgfx: Add surface dirty-tracking callbacks Thomas Hellström (VMware)

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).