All of lore.kernel.org
* [PATCH 1/2] mm: Report attempts to overwrite PTE from remap_pfn_range()
@ 2014-06-13 16:26 Chris Wilson
  2014-06-13 16:26 ` [PATCH 2/2] drm/i915: Use remap_pfn_range() to prefault all PTE in a single pass Chris Wilson
                   ` (2 more replies)
  0 siblings, 3 replies; 22+ messages in thread
From: Chris Wilson @ 2014-06-13 16:26 UTC (permalink / raw)
  To: intel-gfx
  Cc: Rik van Riel, Peter Zijlstra, Cyrill Gorcunov, linux-mm,
	Mel Gorman, Johannes Weiner, Andrew Morton, Kirill A. Shutemov

When using remap_pfn_range() from a fault handler, we are exposed to
races between concurrent faults. Rather than hitting a BUG, report the
error back to the caller, like vm_insert_pfn().
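
For illustration, a sketch of how the i915 fault handler in patch 2/2 of this
series consumes the new error return (see that patch for the full context):

	ret = remap_pfn_range(vma, vma->vm_start,
			      pfn, vma->vm_end - vma->vm_start,
			      vma->vm_page_prot);
	if (ret) {
		/* -EBUSY: a concurrent (serialised) fault already populated
		 * the PTEs, so leave its work in place.  Any other error may
		 * have left the page tables partially populated, so reset
		 * them before reporting the failure.
		 */
		if (ret != -EBUSY)
			zap_vma_ptes(vma, vma->vm_start,
				     vma->vm_end - vma->vm_start);
	}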

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Cyrill Gorcunov <gorcunov@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: linux-mm@kvack.org
---
 mm/memory.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 037b812a9531..6603a9e6a731 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2306,19 +2306,23 @@ static int remap_pte_range(struct mm_struct *mm, pmd_t *pmd,
 {
 	pte_t *pte;
 	spinlock_t *ptl;
+	int ret = 0;
 
 	pte = pte_alloc_map_lock(mm, pmd, addr, &ptl);
 	if (!pte)
 		return -ENOMEM;
 	arch_enter_lazy_mmu_mode();
 	do {
-		BUG_ON(!pte_none(*pte));
+		if (!pte_none(*pte)) {
+			ret = -EBUSY;
+			break;
+		}
 		set_pte_at(mm, addr, pte, pte_mkspecial(pfn_pte(pfn, prot)));
 		pfn++;
 	} while (pte++, addr += PAGE_SIZE, addr != end);
 	arch_leave_lazy_mmu_mode();
 	pte_unmap_unlock(pte - 1, ptl);
-	return 0;
+	return ret;
 }
 
 static inline int remap_pmd_range(struct mm_struct *mm, pud_t *pud,
-- 
2.0.0

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH 2/2] drm/i915: Use remap_pfn_range() to prefault all PTE in a single pass
  2014-06-13 16:26 [PATCH 1/2] mm: Report attempts to overwrite PTE from remap_pfn_range() Chris Wilson
@ 2014-06-13 16:26 ` Chris Wilson
  2014-06-13 16:34 ` [PATCH 1/2] mm: Report attempts to overwrite PTE from remap_pfn_range() Chris Wilson
  2014-06-16 13:41   ` Kirill A. Shutemov
  2 siblings, 0 replies; 22+ messages in thread
From: Chris Wilson @ 2014-06-13 16:26 UTC (permalink / raw)
  To: intel-gfx; +Cc: linux-mm

On an Ivybridge i7-3720qm with 1600MHz DDR3, with 32 fences,
Upload rate for 2 linear surfaces:  8134MiB/s -> 8154MiB/s
Upload rate for 2 tiled surfaces:   8625MiB/s -> 8632MiB/s
Upload rate for 4 linear surfaces:  8127MiB/s -> 8134MiB/s
Upload rate for 4 tiled surfaces:   8602MiB/s -> 8629MiB/s
Upload rate for 8 linear surfaces:  8124MiB/s -> 8137MiB/s
Upload rate for 8 tiled surfaces:   8603MiB/s -> 8624MiB/s
Upload rate for 16 linear surfaces: 8123MiB/s -> 8128MiB/s
Upload rate for 16 tiled surfaces:  8606MiB/s -> 8618MiB/s
Upload rate for 32 linear surfaces: 8121MiB/s -> 8128MiB/s
Upload rate for 32 tiled surfaces:  8605MiB/s -> 8614MiB/s
Upload rate for 64 linear surfaces: 8121MiB/s -> 8127MiB/s
Upload rate for 64 tiled surfaces:  3017MiB/s -> 5127MiB/s

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Testcase: igt/gem_fence_upload/performance
Testcase: igt/gem_mmap_gtt
Reviewed-by: Brad Volkin <bradley.d.volkin@intel.com>
Cc: linux-mm@kvack.org
---
 drivers/gpu/drm/i915/i915_gem.c | 29 +++++++++++++++--------------
 1 file changed, 15 insertions(+), 14 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index c313cb2b641b..e6246634b419 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -1565,22 +1565,23 @@ int i915_gem_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
 	pfn = dev_priv->gtt.mappable_base + i915_gem_obj_ggtt_offset(obj);
 	pfn >>= PAGE_SHIFT;
 
-	if (!obj->fault_mappable) {
-		int i;
+	ret = remap_pfn_range(vma, vma->vm_start,
+			      pfn, vma->vm_end - vma->vm_start,
+			      vma->vm_page_prot);
+	if (ret) {
+		/* After passing the sanity checks on remap_pfn_range(), we may
+		 * abort whilst updating the pagetables due to ENOMEM and leave
+		 * the tables in an inconsistent state. Reset them all now.
+		 * However, we do not want to undo the work of another thread
+		 * that beat us to prefaulting the PTEs.
+		 */
+		if (ret != -EBUSY)
+			zap_vma_ptes(vma, vma->vm_start, vma->vm_end - vma->vm_start);
+		goto unpin;
+	}
 
-		for (i = 0; i < obj->base.size >> PAGE_SHIFT; i++) {
-			ret = vm_insert_pfn(vma,
-					    (unsigned long)vma->vm_start + i * PAGE_SIZE,
-					    pfn + i);
-			if (ret)
-				break;
-		}
+	obj->fault_mappable = true;
 
-		obj->fault_mappable = true;
-	} else
-		ret = vm_insert_pfn(vma,
-				    (unsigned long)vmf->virtual_address,
-				    pfn + page_offset);
 unpin:
 	i915_gem_object_ggtt_unpin(obj);
 unlock:
-- 
2.0.0

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH 1/2] mm: Report attempts to overwrite PTE from remap_pfn_range()
  2014-06-13 16:26 [PATCH 1/2] mm: Report attempts to overwrite PTE from remap_pfn_range() Chris Wilson
  2014-06-13 16:26 ` [PATCH 2/2] drm/i915: Use remap_pfn_range() to prefault all PTE in a single pass Chris Wilson
@ 2014-06-13 16:34 ` Chris Wilson
  2014-06-16 13:41   ` Kirill A. Shutemov
  2 siblings, 0 replies; 22+ messages in thread
From: Chris Wilson @ 2014-06-13 16:34 UTC (permalink / raw)
  To: intel-gfx
  Cc: Andrew Morton, Kirill A. Shutemov, Peter Zijlstra, Rik van Riel,
	Mel Gorman, Cyrill Gorcunov, Johannes Weiner, linux-mm

On Fri, Jun 13, 2014 at 05:26:17PM +0100, Chris Wilson wrote:
> When using remap_pfn_range() from a fault handler, we are exposed to
> races between concurrent faults. Rather than hitting a BUG, report the
> error back to the caller, like vm_insert_pfn().
> 
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Rik van Riel <riel@redhat.com>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Cyrill Gorcunov <gorcunov@gmail.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: linux-mm@kvack.org
> ---
>  mm/memory.c | 8 ++++++--
>  1 file changed, 6 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/memory.c b/mm/memory.c
> index 037b812a9531..6603a9e6a731 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -2306,19 +2306,23 @@ static int remap_pte_range(struct mm_struct *mm, pmd_t *pmd,
>  {
>  	pte_t *pte;
>  	spinlock_t *ptl;
> +	int ret = 0;
>  
>  	pte = pte_alloc_map_lock(mm, pmd, addr, &ptl);
>  	if (!pte)
>  		return -ENOMEM;
>  	arch_enter_lazy_mmu_mode();
>  	do {
> -		BUG_ON(!pte_none(*pte));
> +		if (!pte_none(*pte)) {
> +			ret = -EBUSY;
> +			break;
> +		}
>  		set_pte_at(mm, addr, pte, pte_mkspecial(pfn_pte(pfn, prot)));
>  		pfn++;
>  	} while (pte++, addr += PAGE_SIZE, addr != end);
>  	arch_leave_lazy_mmu_mode();
>  	pte_unmap_unlock(pte - 1, ptl);

Oh. That will want the EBUSY path to increment pte or we will try to
unmap the wrong page.
-Chris
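
A minimal sketch of that fix (the same change appears in the v3 patch later in
this thread): advance pte before breaking out, so that
pte_unmap_unlock(pte - 1, ptl) still refers to the entry the walk stopped on.

	do {
		if (!pte_none(*pte)) {
			ret = -EBUSY;
			pte++;	/* keep pte - 1 pointing at the entry we stopped on */
			break;
		}
		set_pte_at(mm, addr, pte, pte_mkspecial(pfn_pte(pfn, prot)));
		pfn++;
	} while (pte++, addr += PAGE_SIZE, addr != end);
	arch_leave_lazy_mmu_mode();
	pte_unmap_unlock(pte - 1, ptl);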

-- 
Chris Wilson, Intel Open Source Technology Centre

^ permalink raw reply	[flat|nested] 22+ messages in thread

* RE: [PATCH 1/2] mm: Report attempts to overwrite PTE from remap_pfn_range()
  2014-06-13 16:26 [PATCH 1/2] mm: Report attempts to overwrite PTE from remap_pfn_range() Chris Wilson
@ 2014-06-16 13:41   ` Kirill A. Shutemov
  2014-06-13 16:34 ` [PATCH 1/2] mm: Report attempts to overwrite PTE from remap_pfn_range() Chris Wilson
  2014-06-16 13:41   ` Kirill A. Shutemov
  2 siblings, 0 replies; 22+ messages in thread
From: Kirill A. Shutemov @ 2014-06-16 13:41 UTC (permalink / raw)
  To: Chris Wilson
  Cc: intel-gfx, Andrew Morton, Kirill A. Shutemov, Peter Zijlstra,
	Rik van Riel, Mel Gorman, Cyrill Gorcunov, Johannes Weiner,
	linux-mm

Chris Wilson wrote:
> When using remap_pfn_range() from a fault handler, we are exposed to
> races between concurrent faults. Rather than hitting a BUG, report the
> error back to the caller, like vm_insert_pfn().
> 
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Rik van Riel <riel@redhat.com>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Cyrill Gorcunov <gorcunov@gmail.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: linux-mm@kvack.org
> ---
>  mm/memory.c | 8 ++++++--
>  1 file changed, 6 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/memory.c b/mm/memory.c
> index 037b812a9531..6603a9e6a731 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -2306,19 +2306,23 @@ static int remap_pte_range(struct mm_struct *mm, pmd_t *pmd,
>  {
>  	pte_t *pte;
>  	spinlock_t *ptl;
> +	int ret = 0;
>  
>  	pte = pte_alloc_map_lock(mm, pmd, addr, &ptl);
>  	if (!pte)
>  		return -ENOMEM;
>  	arch_enter_lazy_mmu_mode();
>  	do {
> -		BUG_ON(!pte_none(*pte));
> +		if (!pte_none(*pte)) {
> +			ret = -EBUSY;
> +			break;

I think you need to at least remove the entries you've set up if the check
fails on anything other than the first iteration.

And nobody propagates your -EBUSY back up to remap_pfn_range(): the caller
will see -ENOMEM, which is not what you want, I believe.
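
For context, a sketch of the loop in the then-current remap_pmd_range() where
the -EBUSY gets folded into -ENOMEM (the same code is visible in the
[PATCH 1/4] diff later in this thread):

	do {
		next = pmd_addr_end(addr, end);
		if (remap_pte_range(mm, pmd, addr, next,
				pfn + (addr >> PAGE_SHIFT), prot))
			return -ENOMEM;	/* -EBUSY from remap_pte_range() is lost here */
	} while (pmd++, addr = next, addr != end);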

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH] mm: Report attempts to overwrite PTE from remap_pfn_range()
  2014-06-16 13:41   ` Kirill A. Shutemov
  (?)
@ 2014-06-19  7:19   ` Chris Wilson
  2014-06-19 11:50       ` Kirill A. Shutemov
  -1 siblings, 1 reply; 22+ messages in thread
From: Chris Wilson @ 2014-06-19  7:19 UTC (permalink / raw)
  To: intel-gfx
  Cc: Chris Wilson, Andrew Morton, Kirill A. Shutemov, Peter Zijlstra,
	Rik van Riel, Mel Gorman, Cyrill Gorcunov, Johannes Weiner,
	linux-mm

When using remap_pfn_range() from a fault handler, we are exposed to
races between concurrent faults. Rather than hitting a BUG, report the
error back to the caller, like vm_insert_pfn().

v2: Fix the pte address for unmapping along the error path.
v3: Report the error back and cleanup partial remaps.

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Cyrill Gorcunov <gorcunov@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: linux-mm@kvack.org
---

Whilst this has the semantics I want to allow two concurrent, but
serialised, pagefaults that try to prefault the same object to succeed,
it looks fragile and fraught with subtlety.
-Chris

---
 mm/memory.c | 54 ++++++++++++++++++++++++++++++++++++++----------------
 1 file changed, 38 insertions(+), 16 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index d67fd9f..be51fcc 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1657,32 +1657,41 @@ EXPORT_SYMBOL(vm_insert_mixed);
  * in null mappings (currently treated as "copy-on-access")
  */
 static int remap_pte_range(struct mm_struct *mm, pmd_t *pmd,
-			unsigned long addr, unsigned long end,
-			unsigned long pfn, pgprot_t prot)
+			   unsigned long addr, unsigned long end,
+			   unsigned long pfn, pgprot_t prot,
+			   bool first)
 {
 	pte_t *pte;
 	spinlock_t *ptl;
+	int err = 0;
 
 	pte = pte_alloc_map_lock(mm, pmd, addr, &ptl);
 	if (!pte)
 		return -ENOMEM;
 	arch_enter_lazy_mmu_mode();
 	do {
-		BUG_ON(!pte_none(*pte));
+		if (!pte_none(*pte)) {
+			err = first ? -EBUSY : -EINVAL;
+			pte++;
+			break;
+		}
+		first = false;
 		set_pte_at(mm, addr, pte, pte_mkspecial(pfn_pte(pfn, prot)));
 		pfn++;
 	} while (pte++, addr += PAGE_SIZE, addr != end);
 	arch_leave_lazy_mmu_mode();
 	pte_unmap_unlock(pte - 1, ptl);
-	return 0;
+	return err;
 }
 
 static inline int remap_pmd_range(struct mm_struct *mm, pud_t *pud,
-			unsigned long addr, unsigned long end,
-			unsigned long pfn, pgprot_t prot)
+				  unsigned long addr, unsigned long end,
+				  unsigned long pfn, pgprot_t prot,
+				  bool first)
 {
 	pmd_t *pmd;
 	unsigned long next;
+	int err;
 
 	pfn -= addr >> PAGE_SHIFT;
 	pmd = pmd_alloc(mm, pud, addr);
@@ -1691,19 +1700,23 @@ static inline int remap_pmd_range(struct mm_struct *mm, pud_t *pud,
 	VM_BUG_ON(pmd_trans_huge(*pmd));
 	do {
 		next = pmd_addr_end(addr, end);
-		if (remap_pte_range(mm, pmd, addr, next,
-				pfn + (addr >> PAGE_SHIFT), prot))
-			return -ENOMEM;
+		err = remap_pte_range(mm, pmd, addr, next,
+				      pfn + (addr >> PAGE_SHIFT), prot, first);
+		if (err)
+			return err;
+
+		first = false;
 	} while (pmd++, addr = next, addr != end);
 	return 0;
 }
 
 static inline int remap_pud_range(struct mm_struct *mm, pgd_t *pgd,
-			unsigned long addr, unsigned long end,
-			unsigned long pfn, pgprot_t prot)
+				  unsigned long addr, unsigned long end,
+				  unsigned long pfn, pgprot_t prot, bool first)
 {
 	pud_t *pud;
 	unsigned long next;
+	int err;
 
 	pfn -= addr >> PAGE_SHIFT;
 	pud = pud_alloc(mm, pgd, addr);
@@ -1711,9 +1724,12 @@ static inline int remap_pud_range(struct mm_struct *mm, pgd_t *pgd,
 		return -ENOMEM;
 	do {
 		next = pud_addr_end(addr, end);
-		if (remap_pmd_range(mm, pud, addr, next,
-				pfn + (addr >> PAGE_SHIFT), prot))
-			return -ENOMEM;
+		err = remap_pmd_range(mm, pud, addr, next,
+				      pfn + (addr >> PAGE_SHIFT), prot, first);
+		if (err)
+			return err;
+
+		first = false;
 	} while (pud++, addr = next, addr != end);
 	return 0;
 }
@@ -1735,6 +1751,7 @@ int remap_pfn_range(struct vm_area_struct *vma, unsigned long addr,
 	unsigned long next;
 	unsigned long end = addr + PAGE_ALIGN(size);
 	struct mm_struct *mm = vma->vm_mm;
+	bool first = true;
 	int err;
 
 	/*
@@ -1774,13 +1791,18 @@ int remap_pfn_range(struct vm_area_struct *vma, unsigned long addr,
 	do {
 		next = pgd_addr_end(addr, end);
 		err = remap_pud_range(mm, pgd, addr, next,
-				pfn + (addr >> PAGE_SHIFT), prot);
+				      pfn + (addr >> PAGE_SHIFT), prot, first);
 		if (err)
 			break;
+
+		first = false;
 	} while (pgd++, addr = next, addr != end);
 
-	if (err)
+	if (err) {
 		untrack_pfn(vma, pfn, PAGE_ALIGN(size));
+		if (err != -EBUSY)
+			zap_page_range_single(vma, addr, size, NULL);
+	}
 
 	return err;
 }
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* RE: [PATCH] mm: Report attempts to overwrite PTE from remap_pfn_range()
  2014-06-19  7:19   ` [PATCH] " Chris Wilson
@ 2014-06-19 11:50       ` Kirill A. Shutemov
  0 siblings, 0 replies; 22+ messages in thread
From: Kirill A. Shutemov @ 2014-06-19 11:50 UTC (permalink / raw)
  To: Chris Wilson
  Cc: intel-gfx, Andrew Morton, Kirill A. Shutemov, Peter Zijlstra,
	Rik van Riel, Mel Gorman, Cyrill Gorcunov, Johannes Weiner,
	linux-mm

Chris Wilson wrote:
> When using remap_pfn_range() from a fault handler, we are exposed to
> races between concurrent faults. Rather than hitting a BUG, report the
> error back to the caller, like vm_insert_pfn().
> 
> v2: Fix the pte address for unmapping along the error path.
> v3: Report the error back and cleanup partial remaps.
> 
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Rik van Riel <riel@redhat.com>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Cyrill Gorcunov <gorcunov@gmail.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: linux-mm@kvack.org
> ---
> 
> Whilst this has the semantics I want to allow two concurrent, but
> serialised, pagefaults that try to prefault the same object to succeed,
> it looks fragile and fraught with subtlety.
> -Chris
> 
> ---
>  mm/memory.c | 54 ++++++++++++++++++++++++++++++++++++++----------------
>  1 file changed, 38 insertions(+), 16 deletions(-)
> 
> diff --git a/mm/memory.c b/mm/memory.c
> index d67fd9f..be51fcc 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1657,32 +1657,41 @@ EXPORT_SYMBOL(vm_insert_mixed);
>   * in null mappings (currently treated as "copy-on-access")
>   */
>  static int remap_pte_range(struct mm_struct *mm, pmd_t *pmd,
> -			unsigned long addr, unsigned long end,
> -			unsigned long pfn, pgprot_t prot)
> +			   unsigned long addr, unsigned long end,
> +			   unsigned long pfn, pgprot_t prot,
> +			   bool first)
>  {

With this long parameter list, wouldn't it be cleaner to pass down a pointer
to a structure instead? This could simplify the code, I believe.

>  	pte_t *pte;
>  	spinlock_t *ptl;
> +	int err = 0;
>  
>  	pte = pte_alloc_map_lock(mm, pmd, addr, &ptl);
>  	if (!pte)
>  		return -ENOMEM;
>  	arch_enter_lazy_mmu_mode();
>  	do {
> -		BUG_ON(!pte_none(*pte));
> +		if (!pte_none(*pte)) {
> +			err = first ? -EBUSY : -EINVAL;
> +			pte++;
> +			break;
> +		}
> +		first = false;
>  		set_pte_at(mm, addr, pte, pte_mkspecial(pfn_pte(pfn, prot)));
>  		pfn++;
>  	} while (pte++, addr += PAGE_SIZE, addr != end);
>  	arch_leave_lazy_mmu_mode();
>  	pte_unmap_unlock(pte - 1, ptl);
> -	return 0;
> +	return err;
>  }
>  
>  static inline int remap_pmd_range(struct mm_struct *mm, pud_t *pud,
> -			unsigned long addr, unsigned long end,
> -			unsigned long pfn, pgprot_t prot)
> +				  unsigned long addr, unsigned long end,
> +				  unsigned long pfn, pgprot_t prot,
> +				  bool first)
>  {
>  	pmd_t *pmd;
>  	unsigned long next;
> +	int err;
>  
>  	pfn -= addr >> PAGE_SHIFT;
>  	pmd = pmd_alloc(mm, pud, addr);
> @@ -1691,19 +1700,23 @@ static inline int remap_pmd_range(struct mm_struct *mm, pud_t *pud,
>  	VM_BUG_ON(pmd_trans_huge(*pmd));
>  	do {
>  		next = pmd_addr_end(addr, end);
> -		if (remap_pte_range(mm, pmd, addr, next,
> -				pfn + (addr >> PAGE_SHIFT), prot))
> -			return -ENOMEM;
> +		err = remap_pte_range(mm, pmd, addr, next,
> +				      pfn + (addr >> PAGE_SHIFT), prot, first);
> +		if (err)
> +			return err;
> +
> +		first = false;
>  	} while (pmd++, addr = next, addr != end);
>  	return 0;
>  }
>  
>  static inline int remap_pud_range(struct mm_struct *mm, pgd_t *pgd,
> -			unsigned long addr, unsigned long end,
> -			unsigned long pfn, pgprot_t prot)
> +				  unsigned long addr, unsigned long end,
> +				  unsigned long pfn, pgprot_t prot, bool first)
>  {
>  	pud_t *pud;
>  	unsigned long next;
> +	int err;
>  
>  	pfn -= addr >> PAGE_SHIFT;
>  	pud = pud_alloc(mm, pgd, addr);
> @@ -1711,9 +1724,12 @@ static inline int remap_pud_range(struct mm_struct *mm, pgd_t *pgd,
>  		return -ENOMEM;
>  	do {
>  		next = pud_addr_end(addr, end);
> -		if (remap_pmd_range(mm, pud, addr, next,
> -				pfn + (addr >> PAGE_SHIFT), prot))
> -			return -ENOMEM;
> +		err = remap_pmd_range(mm, pud, addr, next,
> +				      pfn + (addr >> PAGE_SHIFT), prot, first);
> +		if (err)
> +			return err;
> +
> +		first = false;
>  	} while (pud++, addr = next, addr != end);
>  	return 0;
>  }
> @@ -1735,6 +1751,7 @@ int remap_pfn_range(struct vm_area_struct *vma, unsigned long addr,
>  	unsigned long next;
>  	unsigned long end = addr + PAGE_ALIGN(size);
>  	struct mm_struct *mm = vma->vm_mm;
> +	bool first = true;
>  	int err;
>  
>  	/*
> @@ -1774,13 +1791,18 @@ int remap_pfn_range(struct vm_area_struct *vma, unsigned long addr,
>  	do {
>  		next = pgd_addr_end(addr, end);
>  		err = remap_pud_range(mm, pgd, addr, next,
> -				pfn + (addr >> PAGE_SHIFT), prot);
> +				      pfn + (addr >> PAGE_SHIFT), prot, first);
>  		if (err)
>  			break;
> +
> +		first = false;
>  	} while (pgd++, addr = next, addr != end);
>  
> -	if (err)
> +	if (err) {
>  		untrack_pfn(vma, pfn, PAGE_ALIGN(size));
> +		if (err != -EBUSY)
> +			zap_page_range_single(vma, addr, size, NULL);

Hm. If I read it correctly, you zap whole range, not only what you've
set up. Looks wrong.

And for after zap, you probably want to return -EBUSY to caller of
remap_pfn_range(), not -EINVAL.

> +	}
>  
>  	return err;
>  }
> -- 
> 1.9.1
> 

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] mm: Report attempts to overwrite PTE from remap_pfn_range()
  2014-06-19 11:50       ` Kirill A. Shutemov
  (?)
@ 2014-06-19 12:00       ` Chris Wilson
  2014-06-19 12:57           ` Kirill A. Shutemov
  -1 siblings, 1 reply; 22+ messages in thread
From: Chris Wilson @ 2014-06-19 12:00 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: intel-gfx, Andrew Morton, Peter Zijlstra, Rik van Riel,
	Mel Gorman, Cyrill Gorcunov, Johannes Weiner, linux-mm

On Thu, Jun 19, 2014 at 02:50:18PM +0300, Kirill A. Shutemov wrote:
> > +	if (err) {
> >  		untrack_pfn(vma, pfn, PAGE_ALIGN(size));
> > +		if (err != -EBUSY)
> > +			zap_page_range_single(vma, addr, size, NULL);
> 
> Hm. If I read it correctly, you zap whole range, not only what you've
> set up. Looks wrong.

Yes. I didn't fancy threading the last touched pte back, but that should
be easier if moving to a struct.
 
> And for after zap, you probably want to return -EBUSY to caller of
> remap_pfn_range(), not -EINVAL.

No, it has to be EINVAL for my purpose. If we return EBUSY, the caller
will just report VM_NOPAGE back to the fault handler, and the fault will
be retriggered - but the overlapping object will still be present. So the
EINVAL is there to report that the range conflicts with another and that
the caller should abort. It's a nasty semantic that works only when the
concurrent pagefaults are serialised around the call to remap_pfn_range().

The alternative would be to always report EINVAL and clean up, and
export pte_exists() so that the caller can detect when the PTEs have
already been populated by the concurrent fault.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] mm: Report attempts to overwrite PTE from remap_pfn_range()
  2014-06-19 12:00       ` Chris Wilson
@ 2014-06-19 12:57           ` Kirill A. Shutemov
  0 siblings, 0 replies; 22+ messages in thread
From: Kirill A. Shutemov @ 2014-06-19 12:57 UTC (permalink / raw)
  To: Chris Wilson
  Cc: Kirill A. Shutemov, intel-gfx, Andrew Morton, Peter Zijlstra,
	Rik van Riel, Mel Gorman, Cyrill Gorcunov, Johannes Weiner,
	linux-mm

Chris Wilson wrote:
> On Thu, Jun 19, 2014 at 02:50:18PM +0300, Kirill A. Shutemov wrote:
> > > +	if (err) {
> > >  		untrack_pfn(vma, pfn, PAGE_ALIGN(size));
> > > +		if (err != -EBUSY)
> > > +			zap_page_range_single(vma, addr, size, NULL);
> > 
> > Hm. If I read it correctly, you zap whole range, not only what you've
> > set up. Looks wrong.
> 
> Yes. I didn't fancy threading the last touched pte back, but that should
> be easier if moving to a struct.
>  
> > And for after zap, you probably want to return -EBUSY to caller of
> > remap_pfn_range(), not -EINVAL.
> 
> No, it has to be EINVAL for my purpose. If we return EBUSY, the caller
> will just report VM_NOPAGE back to the fault handler, and the fault will
> be retriggered - but the overlapping object will still be present.

IIUC, what you're saying makes sense only if the range starts from PTE
you've got fault to. I failed to see why this assumption should be part of
remap_pfn_range() interface.

One possible option is to create a variant of remap_pfn_range() which will
return how many PTEs it was able to setup, before hitting the !pte_none().
Caller will decide what to do with partially filled range.
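
(Purely as a hypothetical sketch of such a variant; the name and signature
below are illustrative only and are not defined anywhere in this thread:)

	/*
	 * Populate PTEs for [addr, addr + size), stopping at the first
	 * already-populated entry.  Returns the number of PTEs set up,
	 * or a negative error code.
	 */
	long remap_pfn_range_partial(struct vm_area_struct *vma, unsigned long addr,
				     unsigned long pfn, unsigned long size,
				     pgprot_t prot);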

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] mm: Report attempts to overwrite PTE from remap_pfn_range()
  2014-06-19 12:57           ` Kirill A. Shutemov
  (?)
@ 2014-06-19 13:22           ` Chris Wilson
  2014-06-19 13:59               ` Kirill A. Shutemov
  -1 siblings, 1 reply; 22+ messages in thread
From: Chris Wilson @ 2014-06-19 13:22 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: intel-gfx, Andrew Morton, Peter Zijlstra, Rik van Riel,
	Mel Gorman, Cyrill Gorcunov, Johannes Weiner, linux-mm

On Thu, Jun 19, 2014 at 03:57:46PM +0300, Kirill A. Shutemov wrote:
> Chris Wilson wrote:
> > On Thu, Jun 19, 2014 at 02:50:18PM +0300, Kirill A. Shutemov wrote:
> > > > +	if (err) {
> > > >  		untrack_pfn(vma, pfn, PAGE_ALIGN(size));
> > > > +		if (err != -EBUSY)
> > > > +			zap_page_range_single(vma, addr, size, NULL);
> > > 
> > > Hm. If I read it correctly, you zap whole range, not only what you've
> > > set up. Looks wrong.
> > 
> > Yes. I didn't fancy threading the last touched pte back, but that should
> > be easier if moving to a struct.
> >  
> > > And for after zap, you probably want to return -EBUSY to caller of
> > > remap_pfn_range(), not -EINVAL.
> > 
> > No, it has to be EINVAL for my purpose. If we return EBUSY, the caller
> > will just report VM_NOPAGE back to the fault handler, and the fault will
> > be retriggered - but the overlapping object will still be present.
> 
> IIUC, what you're saying makes sense only if the range starts from PTE
> you've got fault to. I failed to see why this assumption should be part of
> remap_pfn_range() interface.

That I agree with.
 
> One possible option is to create a variant of remap_pfn_range() which will
> return how many PTEs it was able to setup, before hitting the !pte_none().
> Caller will decide what to do with partially filled range.

Looked at just returning the address remap_pfn_range() got up to, which is
easy enough, but I think given that remap_pfn_range() will clean up
correctly after a failed remap, any EBUSY from partway through would be
a pathological driver error. 
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] mm: Report attempts to overwrite PTE from remap_pfn_range()
  2014-06-19 13:22           ` Chris Wilson
@ 2014-06-19 13:59               ` Kirill A. Shutemov
  0 siblings, 0 replies; 22+ messages in thread
From: Kirill A. Shutemov @ 2014-06-19 13:59 UTC (permalink / raw)
  To: Chris Wilson
  Cc: Kirill A. Shutemov, intel-gfx, Andrew Morton, Peter Zijlstra,
	Rik van Riel, Mel Gorman, Cyrill Gorcunov, Johannes Weiner,
	linux-mm

Chris Wilson wrote:
> On Thu, Jun 19, 2014 at 03:57:46PM +0300, Kirill A. Shutemov wrote:
> > One possible option is to create a variant of remap_pfn_range() which will
> > return how many PTEs it was able to setup, before hitting the !pte_none().
> > Caller will decide what to do with partially filled range.
> 
> Looked at just returning the address remap_pfn_range() got up to, which is
> easy enough, but I think given that remap_pfn_range() will clean up
> correctly after a failed remap, any EBUSY from partway through would be
> a pathological driver error. 

I would prefer to keep the remap_pfn_range() interface intact, with BUG_ON()
on an unexpected !pte_none(), and to introduce a new function with more
flexible behaviour (sharing the underlying infrastructure).
This way we can avoid changing every remap_pfn_range() caller.
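
(This is essentially the shape the follow-up series later in this thread
takes; for reference, the new helper it exports, quoted from the [PATCH 3/4]
mail:)

	struct io_mapping;
	int remap_io_mapping(struct vm_area_struct *,
			     unsigned long addr, unsigned long pfn, unsigned long size,
			     struct io_mapping *iomap);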

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH 1/4] mm: Refactor remap_pfn_range()
  2014-06-19 13:59               ` Kirill A. Shutemov
  (?)
@ 2014-06-21 15:53               ` Chris Wilson
  2014-06-21 15:53                 ` [PATCH 2/4] io-mapping: Always create a struct to hold metadata about the io-mapping Chris Wilson
                                   ` (3 more replies)
  -1 siblings, 4 replies; 22+ messages in thread
From: Chris Wilson @ 2014-06-21 15:53 UTC (permalink / raw)
  To: intel-gfx
  Cc: Rik van Riel, Peter Zijlstra, Cyrill Gorcunov, linux-mm,
	Mel Gorman, Johannes Weiner, Andrew Morton, Kirill A. Shutemov

In preparation for exporting very similar functionality through another
interface, gut the current remap_pfn_range(). The motivating factor here
is to reuse the PGD/PUD/PMD/PTE walker, but allow back propagation of
errors rather than BUG_ON.

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Cyrill Gorcunov <gorcunov@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: linux-mm@kvack.org
---
 mm/memory.c | 102 +++++++++++++++++++++++++++++++++---------------------------
 1 file changed, 57 insertions(+), 45 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 037b812a9531..d2c7fe88a289 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2295,71 +2295,81 @@ int vm_insert_mixed(struct vm_area_struct *vma, unsigned long addr,
 }
 EXPORT_SYMBOL(vm_insert_mixed);
 
+struct remap_pfn {
+	struct mm_struct *mm;
+	unsigned long addr;
+	unsigned long pfn;
+	pgprot_t prot;
+};
+
 /*
  * maps a range of physical memory into the requested pages. the old
  * mappings are removed. any references to nonexistent pages results
  * in null mappings (currently treated as "copy-on-access")
  */
-static int remap_pte_range(struct mm_struct *mm, pmd_t *pmd,
-			unsigned long addr, unsigned long end,
-			unsigned long pfn, pgprot_t prot)
+static inline int remap_pfn(struct remap_pfn *r, pte_t *pte)
+{
+	if (!pte_none(*pte))
+		return -EBUSY;
+
+	set_pte_at(r->mm, r->addr, pte,
+		   pte_mkspecial(pfn_pte(r->pfn, r->prot)));
+	r->pfn++;
+	r->addr += PAGE_SIZE;
+	return 0;
+}
+
+static int remap_pte_range(struct remap_pfn *r, pmd_t *pmd, unsigned long end)
 {
 	pte_t *pte;
 	spinlock_t *ptl;
+	int err;
 
-	pte = pte_alloc_map_lock(mm, pmd, addr, &ptl);
+	pte = pte_alloc_map_lock(r->mm, pmd, r->addr, &ptl);
 	if (!pte)
 		return -ENOMEM;
+
 	arch_enter_lazy_mmu_mode();
 	do {
-		BUG_ON(!pte_none(*pte));
-		set_pte_at(mm, addr, pte, pte_mkspecial(pfn_pte(pfn, prot)));
-		pfn++;
-	} while (pte++, addr += PAGE_SIZE, addr != end);
+		err = remap_pfn(r, pte++);
+	} while (err == 0 && r->addr < end);
 	arch_leave_lazy_mmu_mode();
+
 	pte_unmap_unlock(pte - 1, ptl);
-	return 0;
+	return err;
 }
 
-static inline int remap_pmd_range(struct mm_struct *mm, pud_t *pud,
-			unsigned long addr, unsigned long end,
-			unsigned long pfn, pgprot_t prot)
+static inline int remap_pmd_range(struct remap_pfn *r, pud_t *pud, unsigned long end)
 {
 	pmd_t *pmd;
-	unsigned long next;
+	int err;
 
-	pfn -= addr >> PAGE_SHIFT;
-	pmd = pmd_alloc(mm, pud, addr);
+	pmd = pmd_alloc(r->mm, pud, r->addr);
 	if (!pmd)
 		return -ENOMEM;
 	VM_BUG_ON(pmd_trans_huge(*pmd));
+
 	do {
-		next = pmd_addr_end(addr, end);
-		if (remap_pte_range(mm, pmd, addr, next,
-				pfn + (addr >> PAGE_SHIFT), prot))
-			return -ENOMEM;
-	} while (pmd++, addr = next, addr != end);
-	return 0;
+		err = remap_pte_range(r, pmd++, pmd_addr_end(r->addr, end));
+	} while (err == 0 && r->addr < end);
+
+	return err;
 }
 
-static inline int remap_pud_range(struct mm_struct *mm, pgd_t *pgd,
-			unsigned long addr, unsigned long end,
-			unsigned long pfn, pgprot_t prot)
+static inline int remap_pud_range(struct remap_pfn *r, pgd_t *pgd, unsigned long end)
 {
 	pud_t *pud;
-	unsigned long next;
+	int err;
 
-	pfn -= addr >> PAGE_SHIFT;
-	pud = pud_alloc(mm, pgd, addr);
+	pud = pud_alloc(r->mm, pgd, r->addr);
 	if (!pud)
 		return -ENOMEM;
+
 	do {
-		next = pud_addr_end(addr, end);
-		if (remap_pmd_range(mm, pud, addr, next,
-				pfn + (addr >> PAGE_SHIFT), prot))
-			return -ENOMEM;
-	} while (pud++, addr = next, addr != end);
-	return 0;
+		err = remap_pmd_range(r, pud++, pud_addr_end(r->addr, end));
+	} while (err == 0 && r->addr < end);
+
+	return err;
 }
 
 /**
@@ -2375,10 +2385,9 @@ static inline int remap_pud_range(struct mm_struct *mm, pgd_t *pgd,
 int remap_pfn_range(struct vm_area_struct *vma, unsigned long addr,
 		    unsigned long pfn, unsigned long size, pgprot_t prot)
 {
-	pgd_t *pgd;
-	unsigned long next;
 	unsigned long end = addr + PAGE_ALIGN(size);
-	struct mm_struct *mm = vma->vm_mm;
+	struct remap_pfn r;
+	pgd_t *pgd;
 	int err;
 
 	/*
@@ -2412,19 +2421,22 @@ int remap_pfn_range(struct vm_area_struct *vma, unsigned long addr,
 	vma->vm_flags |= VM_IO | VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP;
 
 	BUG_ON(addr >= end);
-	pfn -= addr >> PAGE_SHIFT;
-	pgd = pgd_offset(mm, addr);
 	flush_cache_range(vma, addr, end);
+
+	r.mm = vma->vm_mm;
+	r.addr = addr;
+	r.pfn = pfn;
+	r.prot = prot;
+
+	pgd = pgd_offset(r.mm, addr);
 	do {
-		next = pgd_addr_end(addr, end);
-		err = remap_pud_range(mm, pgd, addr, next,
-				pfn + (addr >> PAGE_SHIFT), prot);
-		if (err)
-			break;
-	} while (pgd++, addr = next, addr != end);
+		err = remap_pud_range(&r, pgd++, pgd_addr_end(r.addr, end));
+	} while (err == 0 && r.addr < end);
 
-	if (err)
+	if (err) {
 		untrack_pfn(vma, pfn, PAGE_ALIGN(size));
+		BUG_ON(err == -EBUSY);
+	}
 
 	return err;
 }
-- 
2.0.0

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH 2/4] io-mapping: Always create a struct to hold metadata about the io-mapping
  2014-06-21 15:53               ` [PATCH 1/4] mm: Refactor remap_pfn_range() Chris Wilson
@ 2014-06-21 15:53                 ` Chris Wilson
  2014-06-21 15:53                 ` [PATCH 3/4] mm: Export remap_io_mapping() Chris Wilson
                                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 22+ messages in thread
From: Chris Wilson @ 2014-06-21 15:53 UTC (permalink / raw)
  To: intel-gfx; +Cc: linux-mm

Currently, we only allocate a structure to hold metadata if we need to
allocate an ioremap for every access, such as on x86-32. However, it
would be useful to store basic information about the io-mapping, such as
its page protection, on all platforms.

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: linux-mm@kvack.org
---
 include/linux/io-mapping.h | 52 ++++++++++++++++++++++++++++------------------
 1 file changed, 32 insertions(+), 20 deletions(-)

diff --git a/include/linux/io-mapping.h b/include/linux/io-mapping.h
index 657fab4efab3..e053011f50bb 100644
--- a/include/linux/io-mapping.h
+++ b/include/linux/io-mapping.h
@@ -31,16 +31,17 @@
  * See Documentation/io-mapping.txt
  */
 
-#ifdef CONFIG_HAVE_ATOMIC_IOMAP
-
-#include <asm/iomap.h>
-
 struct io_mapping {
 	resource_size_t base;
 	unsigned long size;
 	pgprot_t prot;
+	void __iomem *iomem;
 };
 
+
+#ifdef CONFIG_HAVE_ATOMIC_IOMAP
+
+#include <asm/iomap.h>
 /*
  * For small address space machines, mapping large objects
  * into the kernel virtual space isn't practical. Where
@@ -119,48 +120,59 @@ io_mapping_unmap(void __iomem *vaddr)
 #else
 
 #include <linux/uaccess.h>
-
-/* this struct isn't actually defined anywhere */
-struct io_mapping;
+#include <asm/pgtable_types.h>
 
 /* Create the io_mapping object*/
 static inline struct io_mapping *
 io_mapping_create_wc(resource_size_t base, unsigned long size)
 {
-	return (struct io_mapping __force *) ioremap_wc(base, size);
+	struct io_mapping *iomap;
+
+	iomap = kmalloc(sizeof(*iomap), GFP_KERNEL);
+	if (!iomap)
+		return NULL;
+
+	iomap->base = base;
+	iomap->size = size;
+	iomap->iomem = ioremap_wc(base, size);
+	iomap->prot = pgprot_writecombine(PAGE_KERNEL_IO);
+
+	return iomap;
 }
 
 static inline void
 io_mapping_free(struct io_mapping *mapping)
 {
-	iounmap((void __force __iomem *) mapping);
+	iounmap(mapping->iomem);
+	kfree(mapping);
 }
 
-/* Atomic map/unmap */
+/* Non-atomic map/unmap */
 static inline void __iomem *
-io_mapping_map_atomic_wc(struct io_mapping *mapping,
-			 unsigned long offset)
+io_mapping_map_wc(struct io_mapping *mapping, unsigned long offset)
 {
-	pagefault_disable();
-	return ((char __force __iomem *) mapping) + offset;
+	return mapping->iomem + offset;
 }
 
 static inline void
-io_mapping_unmap_atomic(void __iomem *vaddr)
+io_mapping_unmap(void __iomem *vaddr)
 {
-	pagefault_enable();
 }
 
-/* Non-atomic map/unmap */
+/* Atomic map/unmap */
 static inline void __iomem *
-io_mapping_map_wc(struct io_mapping *mapping, unsigned long offset)
+io_mapping_map_atomic_wc(struct io_mapping *mapping,
+			 unsigned long offset)
 {
-	return ((char __force __iomem *) mapping) + offset;
+	pagefault_disable();
+	return io_mapping_map_wc(mapping, offset);
 }
 
 static inline void
-io_mapping_unmap(void __iomem *vaddr)
+io_mapping_unmap_atomic(void __iomem *vaddr)
 {
+	io_mapping_unmap(vaddr);
+	pagefault_enable();
 }
 
 #endif /* HAVE_ATOMIC_IOMAP */
-- 
2.0.0

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH 3/4] mm: Export remap_io_mapping()
  2014-06-21 15:53               ` [PATCH 1/4] mm: Refactor remap_pfn_range() Chris Wilson
  2014-06-21 15:53                 ` [PATCH 2/4] io-mapping: Always create a struct to hold metadata about the io-mapping Chris Wilson
@ 2014-06-21 15:53                 ` Chris Wilson
  2014-06-30 14:32                     ` Kirill A. Shutemov
  2014-06-21 15:53                 ` [PATCH 4/4] drm/i915: Use remap_io_mapping() to prefault all PTE in a single pass Chris Wilson
  2014-06-30 14:26                   ` Kirill A. Shutemov
  3 siblings, 1 reply; 22+ messages in thread
From: Chris Wilson @ 2014-06-21 15:53 UTC (permalink / raw)
  To: intel-gfx
  Cc: Rik van Riel, Peter Zijlstra, Cyrill Gorcunov, linux-mm,
	Mel Gorman, Johannes Weiner, Andrew Morton, Kirill A. Shutemov

This is similar to remap_pfn_range(), and uses the recently refactored
code to do the page table walking. The key difference is that it propagates
its errors back to the caller, as this is required for use from within a
pagefault handler. The other difference is that it combines the page
protection from the io-mapping, which is known when the io-mapping is
created, with the per-vma page protection flags. This avoids having to walk
the entire system description to rediscover the special page protection
established for the io-mapping.

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Cyrill Gorcunov <gorcunov@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: linux-mm@kvack.org
---
 include/linux/mm.h |  4 ++++
 mm/memory.c        | 46 ++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 50 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index d6777060449f..aa766bbc6981 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1941,6 +1941,10 @@ unsigned long change_prot_numa(struct vm_area_struct *vma,
 struct vm_area_struct *find_extend_vma(struct mm_struct *, unsigned long addr);
 int remap_pfn_range(struct vm_area_struct *, unsigned long addr,
 			unsigned long pfn, unsigned long size, pgprot_t);
+struct io_mapping;
+int remap_io_mapping(struct vm_area_struct *,
+		     unsigned long addr, unsigned long pfn, unsigned long size,
+		     struct io_mapping *iomap);
 int vm_insert_page(struct vm_area_struct *, unsigned long addr, struct page *);
 int vm_insert_pfn(struct vm_area_struct *vma, unsigned long addr,
 			unsigned long pfn);
diff --git a/mm/memory.c b/mm/memory.c
index d2c7fe88a289..8af2bd2de98e 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -61,6 +61,7 @@
 #include <linux/string.h>
 #include <linux/dma-debug.h>
 #include <linux/debugfs.h>
+#include <linux/io-mapping.h>
 
 #include <asm/io.h>
 #include <asm/pgalloc.h>
@@ -2443,6 +2444,51 @@ int remap_pfn_range(struct vm_area_struct *vma, unsigned long addr,
 EXPORT_SYMBOL(remap_pfn_range);
 
 /**
+ * remap_io_mapping - remap an IO mapping to userspace
+ * @vma: user vma to map to
+ * @addr: target user address to start at
+ * @pfn: physical address of kernel memory
+ * @size: size of map area
+ * @iomap: the source io_mapping
+ *
+ *  Note: this is only safe if the mm semaphore is held when called.
+ */
+int remap_io_mapping(struct vm_area_struct *vma,
+		     unsigned long addr, unsigned long pfn, unsigned long size,
+		     struct io_mapping *iomap)
+{
+	unsigned long end = addr + PAGE_ALIGN(size);
+	struct remap_pfn r;
+	pgd_t *pgd;
+	int err;
+
+	if (WARN_ON(addr >= end))
+		return -EINVAL;
+
+#define MUST_SET (VM_IO | VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP)
+	BUG_ON(is_cow_mapping(vma->vm_flags));
+	BUG_ON((vma->vm_flags & MUST_SET) != MUST_SET);
+#undef MUST_SET
+
+	r.mm = vma->vm_mm;
+	r.addr = addr;
+	r.pfn = pfn;
+	r.prot = __pgprot((pgprot_val(iomap->prot) & _PAGE_CACHE_MASK) |
+			  (pgprot_val(vma->vm_page_prot) & ~_PAGE_CACHE_MASK));
+
+	pgd = pgd_offset(r.mm, addr);
+	do {
+		err = remap_pud_range(&r, pgd++, pgd_addr_end(r.addr, end));
+	} while (err == 0 && r.addr < end);
+
+	if (err)
+		zap_page_range_single(vma, addr, r.addr - addr, NULL);
+
+	return err;
+}
+EXPORT_SYMBOL(remap_io_mapping);
+
+/**
  * vm_iomap_memory - remap memory to userspace
  * @vma: user vma to map to
  * @start: start of area
-- 
2.0.0

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH 4/4] drm/i915: Use remap_io_mapping() to prefault all PTE in a single pass
  2014-06-21 15:53               ` [PATCH 1/4] mm: Refactor remap_pfn_range() Chris Wilson
  2014-06-21 15:53                 ` [PATCH 2/4] io-mapping: Always create a struct to hold metadata about the io-mapping Chris Wilson
  2014-06-21 15:53                 ` [PATCH 3/4] mm: Export remap_io_mapping() Chris Wilson
@ 2014-06-21 15:53                 ` Chris Wilson
  2014-06-30 14:26                   ` Kirill A. Shutemov
  3 siblings, 0 replies; 22+ messages in thread
From: Chris Wilson @ 2014-06-21 15:53 UTC (permalink / raw)
  To: intel-gfx; +Cc: linux-mm

On an Ivybridge i7-3720qm with 1600MHz DDR3, with 32 fences,
Upload rate for 2 linear surfaces:  8134MiB/s -> 8154MiB/s
Upload rate for 2 tiled surfaces:   8625MiB/s -> 8632MiB/s
Upload rate for 4 linear surfaces:  8127MiB/s -> 8134MiB/s
Upload rate for 4 tiled surfaces:   8602MiB/s -> 8629MiB/s
Upload rate for 8 linear surfaces:  8124MiB/s -> 8137MiB/s
Upload rate for 8 tiled surfaces:   8603MiB/s -> 8624MiB/s
Upload rate for 16 linear surfaces: 8123MiB/s -> 8128MiB/s
Upload rate for 16 tiled surfaces:  8606MiB/s -> 8618MiB/s
Upload rate for 32 linear surfaces: 8121MiB/s -> 8128MiB/s
Upload rate for 32 tiled surfaces:  8605MiB/s -> 8614MiB/s
Upload rate for 64 linear surfaces: 8121MiB/s -> 8127MiB/s
Upload rate for 64 tiled surfaces:  3017MiB/s -> 5202MiB/s

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Testcase: igt/gem_fence_upload/performance
Testcase: igt/gem_mmap_gtt
Reviewed-by: Brad Volkin <bradley.d.volkin@intel.com>
Cc: linux-mm@kvack.org
---
 drivers/gpu/drm/i915/i915_gem.c | 23 ++++++-----------------
 1 file changed, 6 insertions(+), 17 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index f6d123828926..f0628e40cb1d 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -1565,25 +1565,14 @@ int i915_gem_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
 	pfn = dev_priv->gtt.mappable_base + i915_gem_obj_ggtt_offset(obj);
 	pfn >>= PAGE_SHIFT;
 
-	if (!obj->fault_mappable) {
-		unsigned long size = min_t(unsigned long,
-					   vma->vm_end - vma->vm_start,
-					   obj->base.size);
-		int i;
+	ret = remap_io_mapping(vma,
+			       vma->vm_start, pfn, vma->vm_end - vma->vm_start,
+			       dev_priv->gtt.mappable);
+	if (ret)
+		goto unpin;
 
-		for (i = 0; i < size >> PAGE_SHIFT; i++) {
-			ret = vm_insert_pfn(vma,
-					    (unsigned long)vma->vm_start + i * PAGE_SIZE,
-					    pfn + i);
-			if (ret)
-				break;
-		}
+	obj->fault_mappable = true;
 
-		obj->fault_mappable = true;
-	} else
-		ret = vm_insert_pfn(vma,
-				    (unsigned long)vmf->virtual_address,
-				    pfn + page_offset);
 unpin:
 	i915_gem_object_ggtt_unpin(obj);
 unlock:
-- 
2.0.0

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* RE: [PATCH 1/4] mm: Refactor remap_pfn_range()
  2014-06-21 15:53               ` [PATCH 1/4] mm: Refactor remap_pfn_range() Chris Wilson
@ 2014-06-30 14:26                   ` Kirill A. Shutemov
  2014-06-21 15:53                 ` [PATCH 3/4] mm: Export remap_io_mapping() Chris Wilson
                                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 22+ messages in thread
From: Kirill A. Shutemov @ 2014-06-30 14:26 UTC (permalink / raw)
  To: Chris Wilson
  Cc: intel-gfx, linux-mm, Andrew Morton, Kirill A. Shutemov,
	Peter Zijlstra, Rik van Riel, Mel Gorman, Cyrill Gorcunov,
	Johannes Weiner

Chris Wilson wrote:
> In preparation for exporting very similar functionality through another
> interface, gut the current remap_pfn_range(). The motivating factor here
> is to reuse the PGD/PUD/PMD/PTE walker, but allow back propagation of
> errors rather than BUG_ON.
> 
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Rik van Riel <riel@redhat.com>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Cyrill Gorcunov <gorcunov@gmail.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: linux-mm@kvack.org
> ---
>  mm/memory.c | 102 +++++++++++++++++++++++++++++++++---------------------------
>  1 file changed, 57 insertions(+), 45 deletions(-)
> 
> diff --git a/mm/memory.c b/mm/memory.c
> index 037b812a9531..d2c7fe88a289 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -2295,71 +2295,81 @@ int vm_insert_mixed(struct vm_area_struct *vma, unsigned long addr,
>  }
>  EXPORT_SYMBOL(vm_insert_mixed);
>  
> +struct remap_pfn {
> +	struct mm_struct *mm;
> +	unsigned long addr;
> +	unsigned long pfn;
> +	pgprot_t prot;
> +};
> +
>  /*
>   * maps a range of physical memory into the requested pages. the old
>   * mappings are removed. any references to nonexistent pages results
>   * in null mappings (currently treated as "copy-on-access")
>   */
> -static int remap_pte_range(struct mm_struct *mm, pmd_t *pmd,
> -			unsigned long addr, unsigned long end,
> -			unsigned long pfn, pgprot_t prot)
> +static inline int remap_pfn(struct remap_pfn *r, pte_t *pte)
> +{
> +	if (!pte_none(*pte))
> +		return -EBUSY;
> +
> +	set_pte_at(r->mm, r->addr, pte,
> +		   pte_mkspecial(pfn_pte(r->pfn, r->prot)));
> +	r->pfn++;
> +	r->addr += PAGE_SIZE;
> +	return 0;
> +}
> +
> +static int remap_pte_range(struct remap_pfn *r, pmd_t *pmd, unsigned long end)
>  {
>  	pte_t *pte;
>  	spinlock_t *ptl;
> +	int err;
>  
> -	pte = pte_alloc_map_lock(mm, pmd, addr, &ptl);
> +	pte = pte_alloc_map_lock(r->mm, pmd, r->addr, &ptl);
>  	if (!pte)
>  		return -ENOMEM;
> +
>  	arch_enter_lazy_mmu_mode();
>  	do {
> -		BUG_ON(!pte_none(*pte));
> -		set_pte_at(mm, addr, pte, pte_mkspecial(pfn_pte(pfn, prot)));
> -		pfn++;
> -	} while (pte++, addr += PAGE_SIZE, addr != end);
> +		err = remap_pfn(r, pte++);
> +	} while (err == 0 && r->addr < end);
>  	arch_leave_lazy_mmu_mode();
> +
>  	pte_unmap_unlock(pte - 1, ptl);
> -	return 0;
> +	return err;
>  }
>  
> -static inline int remap_pmd_range(struct mm_struct *mm, pud_t *pud,
> -			unsigned long addr, unsigned long end,
> -			unsigned long pfn, pgprot_t prot)
> +static inline int remap_pmd_range(struct remap_pfn *r, pud_t *pud, unsigned long end)
>  {
>  	pmd_t *pmd;
> -	unsigned long next;
> +	int err;
>  
> -	pfn -= addr >> PAGE_SHIFT;
> -	pmd = pmd_alloc(mm, pud, addr);
> +	pmd = pmd_alloc(r->mm, pud, r->addr);
>  	if (!pmd)
>  		return -ENOMEM;
>  	VM_BUG_ON(pmd_trans_huge(*pmd));
> +
>  	do {
> -		next = pmd_addr_end(addr, end);
> -		if (remap_pte_range(mm, pmd, addr, next,
> -				pfn + (addr >> PAGE_SHIFT), prot))
> -			return -ENOMEM;
> -	} while (pmd++, addr = next, addr != end);
> -	return 0;
> +		err = remap_pte_range(r, pmd++, pmd_addr_end(r->addr, end));
> +	} while (err == 0 && r->addr < end);
> +
> +	return err;
>  }
>  
> -static inline int remap_pud_range(struct mm_struct *mm, pgd_t *pgd,
> -			unsigned long addr, unsigned long end,
> -			unsigned long pfn, pgprot_t prot)
> +static inline int remap_pud_range(struct remap_pfn *r, pgd_t *pgd, unsigned long end)
>  {
>  	pud_t *pud;
> -	unsigned long next;
> +	int err;
>  
> -	pfn -= addr >> PAGE_SHIFT;
> -	pud = pud_alloc(mm, pgd, addr);
> +	pud = pud_alloc(r->mm, pgd, r->addr);
>  	if (!pud)
>  		return -ENOMEM;
> +
>  	do {
> -		next = pud_addr_end(addr, end);
> -		if (remap_pmd_range(mm, pud, addr, next,
> -				pfn + (addr >> PAGE_SHIFT), prot))
> -			return -ENOMEM;
> -	} while (pud++, addr = next, addr != end);
> -	return 0;
> +		err = remap_pmd_range(r, pud++, pud_addr_end(r->addr, end));
> +	} while (err == 0 && r->addr < end);
> +
> +	return err;
>  }
>  
>  /**
> @@ -2375,10 +2385,9 @@ static inline int remap_pud_range(struct mm_struct *mm, pgd_t *pgd,
>  int remap_pfn_range(struct vm_area_struct *vma, unsigned long addr,
>  		    unsigned long pfn, unsigned long size, pgprot_t prot)
>  {
> -	pgd_t *pgd;
> -	unsigned long next;
>  	unsigned long end = addr + PAGE_ALIGN(size);
> -	struct mm_struct *mm = vma->vm_mm;
> +	struct remap_pfn r;
> +	pgd_t *pgd;
>  	int err;
>  
>  	/*
> @@ -2412,19 +2421,22 @@ int remap_pfn_range(struct vm_area_struct *vma, unsigned long addr,
>  	vma->vm_flags |= VM_IO | VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP;
>  
>  	BUG_ON(addr >= end);
> -	pfn -= addr >> PAGE_SHIFT;
> -	pgd = pgd_offset(mm, addr);
>  	flush_cache_range(vma, addr, end);
> +
> +	r.mm = vma->vm_mm;
> +	r.addr = addr;
> +	r.pfn = pfn;
> +	r.prot = prot;
> +
> +	pgd = pgd_offset(r.mm, addr);
>  	do {
> -		next = pgd_addr_end(addr, end);
> -		err = remap_pud_range(mm, pgd, addr, next,
> -				pfn + (addr >> PAGE_SHIFT), prot);
> -		if (err)
> -			break;
> -	} while (pgd++, addr = next, addr != end);
> +		err = remap_pud_range(&r, pgd++, pgd_addr_end(r.addr, end));
> +	} while (err == 0 && r.addr < end);
>  
> -	if (err)
> +	if (err) {
>  		untrack_pfn(vma, pfn, PAGE_ALIGN(size));
> +		BUG_ON(err == -EBUSY);

We probably need a comment for the BUG_ON().

Otherwise,

Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

> +	}
>  
>  	return err;
>  }
-- 
 Kirill A. Shutemov


^ permalink raw reply	[flat|nested] 22+ messages in thread
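
Regarding the request above for a comment on the BUG_ON(): a possible
wording, assuming the rationale is that remap_pfn_range() callers map into
an empty range (hence the original BUG_ON(!pte_none(*pte))), while the
-EBUSY return is only meant for fault-handler users such as
remap_io_mapping(). A sketch only, not taken from any posted revision of
the patch:

	if (err) {
		untrack_pfn(vma, pfn, PAGE_ALIGN(size));
		/*
		 * remap_pfn_range() callers must map into an empty range,
		 * so a populated PTE (-EBUSY) indicates a kernel bug; keep
		 * the historical BUG_ON() here.  Users that can race, such
		 * as fault handlers, use remap_io_mapping() instead and
		 * handle -EBUSY themselves.
		 */
		BUG_ON(err == -EBUSY);
	}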


* RE: [PATCH 3/4] mm: Export remap_io_mapping()
  2014-06-21 15:53                 ` [PATCH 3/4] mm: Export remap_io_mapping() Chris Wilson
@ 2014-06-30 14:32                     ` Kirill A. Shutemov
  0 siblings, 0 replies; 22+ messages in thread
From: Kirill A. Shutemov @ 2014-06-30 14:32 UTC (permalink / raw)
  To: Chris Wilson
  Cc: intel-gfx, linux-mm, Andrew Morton, Kirill A. Shutemov,
	Peter Zijlstra, Rik van Riel, Mel Gorman, Cyrill Gorcunov,
	Johannes Weiner

Chris Wilson wrote:
> This is similar to remap_pfn_range(), and uses the recently refactored
> code to do the page table walking. The key difference is that it back
> propagates its error, as this is required for use from within a pagefault
> handler. The other difference is that it combines the page protection
> from the io-mapping, which is known when the io-mapping is created,
> with the per-vma page protection flags. This avoids having to walk the
> entire system description to rediscover the special page protection
> established for the io-mapping.
> 
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>

Looks okay to me:

Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

-- 
 Kirill A. Shutemov


^ permalink raw reply	[flat|nested] 22+ messages in thread
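
The commit message quoted above describes combining the io-mapping's page
protection with the per-vma protection bits. A minimal sketch of how such a
helper could sit on top of the struct remap_pfn walker from patch 1/4 is
shown below; the io_mapping 'prot' field, the x86-style _PAGE_CACHE_MASK
split, and the omission of vm_flags/PAT bookkeeping are assumptions made
for illustration, not the actual code of patches 2/4 and 3/4:

/*
 * Sketch only: assumes struct io_mapping carries the pgprot_t it was
 * created with (cf. patch 2/4) and that this lives in mm/memory.c next
 * to the remap_p*_range() helpers from patch 1/4.
 */
int remap_io_mapping(struct vm_area_struct *vma,
		     unsigned long addr, unsigned long pfn, unsigned long size,
		     struct io_mapping *iomap)
{
	unsigned long end = addr + PAGE_ALIGN(size);
	struct remap_pfn r;
	pgd_t *pgd;
	int err;

	r.mm = vma->vm_mm;
	r.addr = addr;
	r.pfn = pfn;
	/* cache attributes from the io-mapping, access bits from the vma */
	r.prot = __pgprot((pgprot_val(iomap->prot) & _PAGE_CACHE_MASK) |
			  (pgprot_val(vma->vm_page_prot) & ~_PAGE_CACHE_MASK));

	flush_cache_range(vma, addr, end);

	pgd = pgd_offset(r.mm, addr);
	do {
		err = remap_pud_range(&r, pgd++, pgd_addr_end(r.addr, end));
	} while (err == 0 && r.addr < end);

	/* -EBUSY and -ENOMEM propagate back to the caller (the fault handler) */
	return err;
}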


end of thread, other threads:[~2014-06-30 14:32 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-06-13 16:26 [PATCH 1/2] mm: Report attempts to overwrite PTE from remap_pfn_range() Chris Wilson
2014-06-13 16:26 ` [PATCH 2/2] drm/i915: Use remap_pfn_range() to prefault all PTE in a single pass Chris Wilson
2014-06-13 16:34 ` [PATCH 1/2] mm: Report attempts to overwrite PTE from remap_pfn_range() Chris Wilson
2014-06-16 13:41 ` Kirill A. Shutemov
2014-06-16 13:41   ` Kirill A. Shutemov
2014-06-19  7:19   ` [PATCH] " Chris Wilson
2014-06-19 11:50     ` Kirill A. Shutemov
2014-06-19 11:50       ` Kirill A. Shutemov
2014-06-19 12:00       ` Chris Wilson
2014-06-19 12:57         ` Kirill A. Shutemov
2014-06-19 12:57           ` Kirill A. Shutemov
2014-06-19 13:22           ` Chris Wilson
2014-06-19 13:59             ` Kirill A. Shutemov
2014-06-19 13:59               ` Kirill A. Shutemov
2014-06-21 15:53               ` [PATCH 1/4] mm: Refactor remap_pfn_range() Chris Wilson
2014-06-21 15:53                 ` [PATCH 2/4] io-mapping: Always create a struct to hold metadata about the io-mapping Chris Wilson
2014-06-21 15:53                 ` [PATCH 3/4] mm: Export remap_io_mapping() Chris Wilson
2014-06-30 14:32                   ` Kirill A. Shutemov
2014-06-30 14:32                     ` Kirill A. Shutemov
2014-06-21 15:53                 ` [PATCH 4/4] drm/i915: Use remap_io_mapping() to prefault all PTE in a single pass Chris Wilson
2014-06-30 14:26                 ` [PATCH 1/4] mm: Refactor remap_pfn_range() Kirill A. Shutemov
2014-06-30 14:26                   ` Kirill A. Shutemov
