* [PATCH 0/3] Fixes for vm_insert_pfn_prot()
@ 2016-01-25 17:25 Matthew Wilcox
  2016-01-25 17:25 ` [PATCH 1/3] x86: Honour passed pgprot in track_pfn_insert() and track_pfn_remap() Matthew Wilcox
                   ` (2 more replies)
  0 siblings, 3 replies; 19+ messages in thread
From: Matthew Wilcox @ 2016-01-25 17:25 UTC (permalink / raw)
  To: Ingo Molnar, Andy Lutomirski
  Cc: Matthew Wilcox, Kees Cook, Andrew Morton, linux-kernel, linux-mm

From: Matthew Wilcox <willy@linux.intel.com>

Commit 1745cbc5d0 recently added vm_insert_pfn_prot().  Unfortunately,
it doesn't actually work on x86 with PAT enabled (which is basically
all machines, so I don't know if anyone actually tested it).  Also,
vm_insert_pfn_prot() continues a couple of old-school traditions:
taking an unsigned long instead of a pfn_t, and returning an errno that
then has to be translated in the fault handler.
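
For illustration, the translation each fault handler currently has to
open-code around the vm_ flavour looks roughly like this (a sketch;
compare the 'out:' label removed from __dax_fault in patch 3):

	error = vm_insert_pfn_prot(vma, vaddr, pfn, prot);
	if (error == -ENOMEM)
		return VM_FAULT_OOM;
	/* -EBUSY is fine, somebody else faulted on the same PTE */
	if (error < 0 && error != -EBUSY)
		return VM_FAULT_SIGBUS;
	return VM_FAULT_NOPAGE;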

I was looking at adding a somewhat similar function for DAX, so this
patchset includes changing DAX to use Andy's interface.  I'd like to see
at least the first two patches go into Ingo's tree.  The third patch can
find its way into the -mm tree later to stay with the other DAX patches.

Matthew Wilcox (3):
  x86: Honour passed pgprot in track_pfn_insert() and track_pfn_remap()
  mm: Convert vm_insert_pfn_prot to vmf_insert_pfn_prot
  dax: Handle write faults more efficiently

 arch/x86/entry/vdso/vma.c |  6 ++--
 arch/x86/mm/pat.c         |  4 +--
 fs/dax.c                  | 73 ++++++++++++++++++++++++++++++++++-------------
 include/linux/mm.h        |  4 +--
 mm/memory.c               | 31 +++++++++++---------
 5 files changed, 78 insertions(+), 40 deletions(-)

-- 
2.7.0.rc3


* [PATCH 1/3] x86: Honour passed pgprot in track_pfn_insert() and track_pfn_remap()
  2016-01-25 17:25 [PATCH 0/3] Fixes for vm_insert_pfn_prot() Matthew Wilcox
@ 2016-01-25 17:25 ` Matthew Wilcox
  2016-01-25 17:33   ` Andy Lutomirski
  2016-02-09 16:09   ` [tip:x86/asm] x86/mm: " tip-bot for Matthew Wilcox
  2016-01-25 17:25 ` [PATCH 2/3] mm: Convert vm_insert_pfn_prot to vmf_insert_pfn_prot Matthew Wilcox
  2016-01-25 17:25 ` [PATCH 3/3] dax: Handle write faults more efficiently Matthew Wilcox
  2 siblings, 2 replies; 19+ messages in thread
From: Matthew Wilcox @ 2016-01-25 17:25 UTC (permalink / raw)
  To: Ingo Molnar, Andy Lutomirski
  Cc: Matthew Wilcox, Kees Cook, Andrew Morton, linux-kernel, linux-mm

From: Matthew Wilcox <willy@linux.intel.com>

track_pfn_insert() overwrites the pgprot that is passed in with a value
based on the VMA's page_prot.  This is a problem for people trying to
do clever things with the new vm_insert_pfn_prot() as it will simply
overwrite the passed protection flags.  If we use the current value of
the pgprot as the base, then it will behave as people are expecting.

Also fix track_pfn_remap() in the same way.

Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
---
 arch/x86/mm/pat.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/mm/pat.c b/arch/x86/mm/pat.c
index f4ae536..04e2e71 100644
--- a/arch/x86/mm/pat.c
+++ b/arch/x86/mm/pat.c
@@ -943,7 +943,7 @@ int track_pfn_remap(struct vm_area_struct *vma, pgprot_t *prot,
 			return -EINVAL;
 	}
 
-	*prot = __pgprot((pgprot_val(vma->vm_page_prot) & (~_PAGE_CACHE_MASK)) |
+	*prot = __pgprot((pgprot_val(*prot) & (~_PAGE_CACHE_MASK)) |
 			 cachemode2protval(pcm));
 
 	return 0;
@@ -959,7 +959,7 @@ int track_pfn_insert(struct vm_area_struct *vma, pgprot_t *prot,
 
 	/* Set prot based on lookup */
 	pcm = lookup_memtype(pfn_t_to_phys(pfn));
-	*prot = __pgprot((pgprot_val(vma->vm_page_prot) & (~_PAGE_CACHE_MASK)) |
+	*prot = __pgprot((pgprot_val(*prot) & (~_PAGE_CACHE_MASK)) |
 			 cachemode2protval(pcm));
 
 	return 0;
-- 
2.7.0.rc3


* [PATCH 2/3] mm: Convert vm_insert_pfn_prot to vmf_insert_pfn_prot
  2016-01-25 17:25 [PATCH 0/3] Fixes for vm_insert_pfn_prot() Matthew Wilcox
  2016-01-25 17:25 ` [PATCH 1/3] x86: Honour passed pgprot in track_pfn_insert() and track_pfn_remap() Matthew Wilcox
@ 2016-01-25 17:25 ` Matthew Wilcox
  2016-01-25 17:35   ` Andy Lutomirski
  2016-01-25 17:25 ` [PATCH 3/3] dax: Handle write faults more efficiently Matthew Wilcox
  2 siblings, 1 reply; 19+ messages in thread
From: Matthew Wilcox @ 2016-01-25 17:25 UTC (permalink / raw)
  To: Ingo Molnar, Andy Lutomirski
  Cc: Matthew Wilcox, Kees Cook, Andrew Morton, linux-kernel, linux-mm

From: Matthew Wilcox <willy@linux.intel.com>

Other than the name, the vmf_ version takes a pfn_t parameter, and
returns a VM_FAULT_ code suitable for returning from a fault handler.

This patch also prevents vm_insert_pfn() from returning -EBUSY.
This is a good thing as several callers handled it incorrectly (and
none intentionally treat -EBUSY as a different case from 0).
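
A typical incorrect pattern looks like this (a sketch, not taken from
any one caller): treating every non-zero return as fatal, which turns
a benign racing fault (-EBUSY) into SIGBUS:

	err = vm_insert_pfn(vma, addr, pfn);
	if (err)			/* wrong: -EBUSY only means somebody */
		return VM_FAULT_SIGBUS;	/* else installed the PTE first */
	return VM_FAULT_NOPAGE;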

Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
---
 arch/x86/entry/vdso/vma.c |  6 +++---
 include/linux/mm.h        |  4 ++--
 mm/memory.c               | 31 ++++++++++++++++++-------------
 3 files changed, 23 insertions(+), 18 deletions(-)

diff --git a/arch/x86/entry/vdso/vma.c b/arch/x86/entry/vdso/vma.c
index 7c912fe..660bb69 100644
--- a/arch/x86/entry/vdso/vma.c
+++ b/arch/x86/entry/vdso/vma.c
@@ -9,6 +9,7 @@
 #include <linux/sched.h>
 #include <linux/slab.h>
 #include <linux/init.h>
+#include <linux/pfn_t.h>
 #include <linux/random.h>
 #include <linux/elf.h>
 #include <linux/cpu.h>
@@ -131,10 +132,9 @@ static int vvar_fault(const struct vm_special_mapping *sm,
 	} else if (sym_offset == image->sym_hpet_page) {
 #ifdef CONFIG_HPET_TIMER
 		if (hpet_address && vclock_was_used(VCLOCK_HPET)) {
-			ret = vm_insert_pfn_prot(
-				vma,
+			return vmf_insert_pfn_prot(vma,
 				(unsigned long)vmf->virtual_address,
-				hpet_address >> PAGE_SHIFT,
+				phys_to_pfn_t(hpet_address, PFN_DEV),
 				pgprot_noncached(PAGE_READONLY));
 		}
 #endif
diff --git a/include/linux/mm.h b/include/linux/mm.h
index fa6da9a..19f8741 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2138,8 +2138,8 @@ int remap_pfn_range(struct vm_area_struct *, unsigned long addr,
 int vm_insert_page(struct vm_area_struct *, unsigned long addr, struct page *);
 int vm_insert_pfn(struct vm_area_struct *vma, unsigned long addr,
 			unsigned long pfn);
-int vm_insert_pfn_prot(struct vm_area_struct *vma, unsigned long addr,
-			unsigned long pfn, pgprot_t pgprot);
+int vmf_insert_pfn_prot(struct vm_area_struct *vma, unsigned long addr,
+			pfn_t pfn, pgprot_t pgprot);
 int vm_insert_mixed(struct vm_area_struct *vma, unsigned long addr,
 			pfn_t pfn);
 int vm_iomap_memory(struct vm_area_struct *vma, phys_addr_t start, unsigned long len);
diff --git a/mm/memory.c b/mm/memory.c
index a2eaeef..9b57318 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1554,7 +1554,11 @@ out:
 int vm_insert_pfn(struct vm_area_struct *vma, unsigned long addr,
 			unsigned long pfn)
 {
-	return vm_insert_pfn_prot(vma, addr, pfn, vma->vm_page_prot);
+	int result = vmf_insert_pfn_prot(vma, addr,
+			__pfn_to_pfn_t(pfn, PFN_DEV), vma->vm_page_prot);
+	if (result & VM_FAULT_ERROR)
+		return -EFAULT;
+	return 0;
 }
 EXPORT_SYMBOL(vm_insert_pfn);
 
@@ -1570,13 +1574,13 @@ EXPORT_SYMBOL(vm_insert_pfn);
  *
  * This only makes sense for IO mappings, and it makes no sense for
  * cow mappings.  In general, using multiple vmas is preferable;
- * vm_insert_pfn_prot should only be used if using multiple VMAs is
+ * vmf_insert_pfn_prot should only be used if using multiple VMAs is
  * impractical.
  */
-int vm_insert_pfn_prot(struct vm_area_struct *vma, unsigned long addr,
-			unsigned long pfn, pgprot_t pgprot)
+int vmf_insert_pfn_prot(struct vm_area_struct *vma, unsigned long addr,
+			pfn_t pfn, pgprot_t pgprot)
 {
-	int ret;
+	int error;
 	/*
 	 * Technically, architectures with pte_special can avoid all these
 	 * restrictions (same for remap_pfn_range).  However we would like
@@ -1587,18 +1591,19 @@ int vm_insert_pfn_prot(struct vm_area_struct *vma, unsigned long addr,
 	BUG_ON((vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP)) ==
 						(VM_PFNMAP|VM_MIXEDMAP));
 	BUG_ON((vma->vm_flags & VM_PFNMAP) && is_cow_mapping(vma->vm_flags));
-	BUG_ON((vma->vm_flags & VM_MIXEDMAP) && pfn_valid(pfn));
+	BUG_ON((vma->vm_flags & VM_MIXEDMAP) && pfn_t_valid(pfn));
 
 	if (addr < vma->vm_start || addr >= vma->vm_end)
-		return -EFAULT;
-	if (track_pfn_insert(vma, &pgprot, __pfn_to_pfn_t(pfn, PFN_DEV)))
-		return -EINVAL;
-
-	ret = insert_pfn(vma, addr, __pfn_to_pfn_t(pfn, PFN_DEV), pgprot);
+		return VM_FAULT_SIGBUS;
+	if (track_pfn_insert(vma, &pgprot, pfn))
+		return VM_FAULT_SIGBUS;
 
-	return ret;
+	error = insert_pfn(vma, addr, pfn, pgprot);
+	if (error == -EBUSY || !error)
+		return VM_FAULT_NOPAGE;
+	return VM_FAULT_SIGBUS;
 }
-EXPORT_SYMBOL(vm_insert_pfn_prot);
+EXPORT_SYMBOL(vmf_insert_pfn_prot);
 
 int vm_insert_mixed(struct vm_area_struct *vma, unsigned long addr,
 			pfn_t pfn)
-- 
2.7.0.rc3


* [PATCH 3/3] dax: Handle write faults more efficiently
  2016-01-25 17:25 [PATCH 0/3] Fixes for vm_insert_pfn_prot() Matthew Wilcox
  2016-01-25 17:25 ` [PATCH 1/3] x86: Honour passed pgprot in track_pfn_insert() and track_pfn_remap() Matthew Wilcox
  2016-01-25 17:25 ` [PATCH 2/3] mm: Convert vm_insert_pfn_prot to vmf_insert_pfn_prot Matthew Wilcox
@ 2016-01-25 17:25 ` Matthew Wilcox
  2016-01-25 17:38   ` Andy Lutomirski
  2 siblings, 1 reply; 19+ messages in thread
From: Matthew Wilcox @ 2016-01-25 17:25 UTC (permalink / raw)
  To: Ingo Molnar, Andy Lutomirski
  Cc: Matthew Wilcox, Kees Cook, Andrew Morton, linux-kernel, linux-mm

From: Matthew Wilcox <willy@linux.intel.com>

When we handle a write-fault on a DAX mapping, we currently insert a
read-only mapping and then take the page fault again to convert it to
a writable mapping.  This is necessary for the case where we cover a
hole with a read-only zero page, but when we have a data block already
allocated, it is inefficient.

Use the recently added vmf_insert_pfn_prot() to insert a writable mapping,
even though the default VM flags say to use a read-only mapping.

Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
---
 fs/dax.c | 73 ++++++++++++++++++++++++++++++++++++++++++++++------------------
 1 file changed, 53 insertions(+), 20 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 206650f..3f6138d 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -519,9 +519,44 @@ int dax_writeback_mapping_range(struct address_space *mapping, loff_t start,
 }
 EXPORT_SYMBOL_GPL(dax_writeback_mapping_range);
 
+/*
+ * The default page protections for DAX VMAs are set to "copy" so that
+ * we get notifications when zero pages are written to.  This function
+ * is called when we're inserting a mapping to a data page.  If this is
+ * a write fault, we've already done all the necessary accounting and
+ * it's pointless to insert this translation entry read-only.  Convert
+ * the pgprot to be writable.
+ *
+ * While this is not the most elegant code, the compiler can see that (on
+ * any sane architecture) all four arms of the conditional are the same.
+ */
+static pgprot_t dax_pgprot(struct vm_area_struct *vma, bool write)
+{
+	pgprot_t pgprot = vma->vm_page_prot;
+	if (!write)
+		return pgprot;
+	if ((vma->vm_flags & (VM_READ|VM_EXEC)) == (VM_READ|VM_EXEC))
+		return __pgprot(pgprot_val(pgprot) ^
+				pgprot_val(__P111) ^
+				pgprot_val(__S111));
+	else if ((vma->vm_flags & (VM_READ|VM_EXEC)) == VM_READ)
+		return __pgprot(pgprot_val(pgprot) ^
+				pgprot_val(__P110) ^
+				pgprot_val(__S110));
+	else if ((vma->vm_flags & (VM_READ|VM_EXEC)) == VM_EXEC)
+		return __pgprot(pgprot_val(pgprot) ^
+				pgprot_val(__P011) ^
+				pgprot_val(__S011));
+	else
+		return __pgprot(pgprot_val(pgprot) ^
+				pgprot_val(__P010) ^
+				pgprot_val(__S010));
+}
+
 static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh,
 			struct vm_area_struct *vma, struct vm_fault *vmf)
 {
+	bool write = vmf->flags & FAULT_FLAG_WRITE;
 	unsigned long vaddr = (unsigned long)vmf->virtual_address;
 	struct address_space *mapping = inode->i_mapping;
 	struct block_device *bdev = bh->b_bdev;
@@ -530,7 +565,7 @@ static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh,
 		.size = bh->b_size,
 	};
 	pgoff_t size;
-	int error;
+	int result;
 
 	i_mmap_lock_read(mapping);
 
@@ -542,15 +577,11 @@ static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh,
 	 * allocated past the end of the file.
 	 */
 	size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
-	if (unlikely(vmf->pgoff >= size)) {
-		error = -EIO;
-		goto out;
-	}
+	if (unlikely(vmf->pgoff >= size))
+		goto sigbus;
 
-	if (dax_map_atomic(bdev, &dax) < 0) {
-		error = PTR_ERR(dax.addr);
-		goto out;
-	}
+	if (dax_map_atomic(bdev, &dax) < 0)
+		goto sigbus;
 
 	if (buffer_unwritten(bh) || buffer_new(bh)) {
 		clear_pmem(dax.addr, PAGE_SIZE);
@@ -558,17 +589,19 @@ static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh,
 	}
 	dax_unmap_atomic(bdev, &dax);
 
-	error = dax_radix_entry(mapping, vmf->pgoff, dax.sector, false,
-			vmf->flags & FAULT_FLAG_WRITE);
-	if (error)
-		goto out;
+	if (dax_radix_entry(mapping, vmf->pgoff, dax.sector, false, write))
+		goto sigbus;
 
-	error = vm_insert_mixed(vma, vaddr, dax.pfn);
+	result = vmf_insert_pfn_prot(vma, vaddr, dax.pfn,
+					dax_pgprot(vma, write));
 
  out:
 	i_mmap_unlock_read(mapping);
+	return result;
 
-	return error;
+ sigbus:
+	result = VM_FAULT_SIGBUS;
+	goto out;
 }
 
 /**
@@ -599,7 +632,7 @@ int __dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 	unsigned blkbits = inode->i_blkbits;
 	sector_t block;
 	pgoff_t size;
-	int error;
+	int result, error;
 	int major = 0;
 
 	size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
@@ -701,19 +734,19 @@ int __dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 	 * indicate what the callback should do via the uptodate variable, same
 	 * as for normal BH based IO completions.
 	 */
-	error = dax_insert_mapping(inode, &bh, vma, vmf);
+	result = dax_insert_mapping(inode, &bh, vma, vmf);
 	if (buffer_unwritten(&bh)) {
 		if (complete_unwritten)
-			complete_unwritten(&bh, !error);
+			complete_unwritten(&bh, !(result & VM_FAULT_ERROR));
 		else
 			WARN_ON_ONCE(!(vmf->flags & FAULT_FLAG_WRITE));
 	}
+	return result | major;
 
  out:
 	if (error == -ENOMEM)
 		return VM_FAULT_OOM | major;
-	/* -EBUSY is fine, somebody else faulted on the same PTE */
-	if ((error < 0) && (error != -EBUSY))
+	if (error < 0)
 		return VM_FAULT_SIGBUS | major;
 	return VM_FAULT_NOPAGE | major;
 
-- 
2.7.0.rc3


* Re: [PATCH 1/3] x86: Honour passed pgprot in track_pfn_insert() and track_pfn_remap()
  2016-01-25 17:25 ` [PATCH 1/3] x86: Honour passed pgprot in track_pfn_insert() and track_pfn_remap() Matthew Wilcox
@ 2016-01-25 17:33   ` Andy Lutomirski
  2016-01-25 17:46     ` Andy Lutomirski
  2016-01-27  4:40     ` Matthew Wilcox
  2016-02-09 16:09   ` [tip:x86/asm] x86/mm: " tip-bot for Matthew Wilcox
  1 sibling, 2 replies; 19+ messages in thread
From: Andy Lutomirski @ 2016-01-25 17:33 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Ingo Molnar, Matthew Wilcox, Kees Cook, Andrew Morton,
	linux-kernel, linux-mm

On Mon, Jan 25, 2016 at 9:25 AM, Matthew Wilcox
<matthew.r.wilcox@intel.com> wrote:
> From: Matthew Wilcox <willy@linux.intel.com>
>
> track_pfn_insert() overwrites the pgprot that is passed in with a value
> based on the VMA's page_prot.  This is a problem for people trying to
> do clever things with the new vm_insert_pfn_prot() as it will simply
> overwrite the passed protection flags.  If we use the current value of
> the pgprot as the base, then it will behave as people are expecting.
>
> Also fix track_pfn_remap() in the same way.

Well that's embarrassing.  Presumably it worked for me because I only
overrode the cacheability bits and lookup_memtype did the right thing.

But shouldn't the PAT code change the memtype if vm_insert_pfn_prot
requests it?  Or are there no callers that actually need that?  (HPET
doesn't, because there's a plain old ioremapped mapping.)

--Andy


* Re: [PATCH 2/3] mm: Convert vm_insert_pfn_prot to vmf_insert_pfn_prot
  2016-01-25 17:25 ` [PATCH 2/3] mm: Convert vm_insert_pfn_prot to vmf_insert_pfn_prot Matthew Wilcox
@ 2016-01-25 17:35   ` Andy Lutomirski
  2016-01-27  4:18     ` Matthew Wilcox
  0 siblings, 1 reply; 19+ messages in thread
From: Andy Lutomirski @ 2016-01-25 17:35 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Ingo Molnar, Matthew Wilcox, Kees Cook, Andrew Morton,
	linux-kernel, linux-mm

On Mon, Jan 25, 2016 at 9:25 AM, Matthew Wilcox
<matthew.r.wilcox@intel.com> wrote:
> From: Matthew Wilcox <willy@linux.intel.com>
>
> Other than the name, the vmf_ version takes a pfn_t parameter, and
> returns a VM_FAULT_ code suitable for returning from a fault handler.
>
> This patch also prevents vm_insert_pfn() from returning -EBUSY.
> This is a good thing as several callers handled it incorrectly (and
> none intentionally treat -EBUSY as a different case from 0).
>
> Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
> ---
>  arch/x86/entry/vdso/vma.c |  6 +++---
>  include/linux/mm.h        |  4 ++--
>  mm/memory.c               | 31 ++++++++++++++++++-------------
>  3 files changed, 23 insertions(+), 18 deletions(-)
>
> diff --git a/arch/x86/entry/vdso/vma.c b/arch/x86/entry/vdso/vma.c
> index 7c912fe..660bb69 100644
> --- a/arch/x86/entry/vdso/vma.c
> +++ b/arch/x86/entry/vdso/vma.c
> @@ -9,6 +9,7 @@
>  #include <linux/sched.h>
>  #include <linux/slab.h>
>  #include <linux/init.h>
> +#include <linux/pfn_t.h>
>  #include <linux/random.h>
>  #include <linux/elf.h>
>  #include <linux/cpu.h>
> @@ -131,10 +132,9 @@ static int vvar_fault(const struct vm_special_mapping *sm,
>         } else if (sym_offset == image->sym_hpet_page) {
>  #ifdef CONFIG_HPET_TIMER
>                 if (hpet_address && vclock_was_used(VCLOCK_HPET)) {
> -                       ret = vm_insert_pfn_prot(
> -                               vma,
> +                       return vmf_insert_pfn_prot(vma,
>                                 (unsigned long)vmf->virtual_address,
> -                               hpet_address >> PAGE_SHIFT,
> +                               phys_to_pfn_t(hpet_address, PFN_DEV),
>                                 pgprot_noncached(PAGE_READONLY));
>                 }

This would be even nicer if you added vmf_insert_pfn as well :)

--Andy


* Re: [PATCH 3/3] dax: Handle write faults more efficiently
  2016-01-25 17:25 ` [PATCH 3/3] dax: Handle write faults more efficiently Matthew Wilcox
@ 2016-01-25 17:38   ` Andy Lutomirski
  2016-01-27  4:17     ` Matthew Wilcox
  0 siblings, 1 reply; 19+ messages in thread
From: Andy Lutomirski @ 2016-01-25 17:38 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Ingo Molnar, Matthew Wilcox, Kees Cook, Andrew Morton,
	linux-kernel, linux-mm

On Mon, Jan 25, 2016 at 9:25 AM, Matthew Wilcox
<matthew.r.wilcox@intel.com> wrote:
> From: Matthew Wilcox <willy@linux.intel.com>
>
> When we handle a write-fault on a DAX mapping, we currently insert a
> read-only mapping and then take the page fault again to convert it to
> a writable mapping.  This is necessary for the case where we cover a
> hole with a read-only zero page, but when we have a data block already
> allocated, it is inefficient.
>
> Use the recently added vmf_insert_pfn_prot() to insert a writable mapping,
> even though the default VM flags say to use a read-only mapping.

Conceptually, I like this.  Do you need to make sure to do all the
do_wp_page work, though?  (E.g. we currently update mtime in there.
Some day I'll fix that, but it'll be replaced with a set_bit to force
a deferred mtime update.)

--Andy


* Re: [PATCH 1/3] x86: Honour passed pgprot in track_pfn_insert() and track_pfn_remap()
  2016-01-25 17:33   ` Andy Lutomirski
@ 2016-01-25 17:46     ` Andy Lutomirski
  2016-01-27  4:40     ` Matthew Wilcox
  1 sibling, 0 replies; 19+ messages in thread
From: Andy Lutomirski @ 2016-01-25 17:46 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Ingo Molnar, Matthew Wilcox, Kees Cook, Andrew Morton,
	linux-kernel, linux-mm

On Mon, Jan 25, 2016 at 9:33 AM, Andy Lutomirski <luto@amacapital.net> wrote:
> On Mon, Jan 25, 2016 at 9:25 AM, Matthew Wilcox
> <matthew.r.wilcox@intel.com> wrote:
>> From: Matthew Wilcox <willy@linux.intel.com>
>>
>> track_pfn_insert() overwrites the pgprot that is passed in with a value
>> based on the VMA's page_prot.  This is a problem for people trying to
>> do clever things with the new vm_insert_pfn_prot() as it will simply
>> overwrite the passed protection flags.  If we use the current value of
>> the pgprot as the base, then it will behave as people are expecting.
>>
>> Also fix track_pfn_remap() in the same way.
>
> Well that's embarrassing.  Presumably it worked for me because I only
> overrode the cacheability bits and lookup_memtype did the right thing.
>
> But shouldn't the PAT code change the memtype if vm_insert_pfn_prot
> requests it?  Or are there no callers that actually need that?  (HPET
> doesn't, because there's a plain old ioremapped mapping.)
>

Looking a bit further, track_pfn_remap does this, while
track_pfn_insert does not.  I don't know why

I'm also a bit confused as to how any of this works.  There doesn't
seem to be any reference counting of memtypes, so I don't understand
why, say, remapping the same range twice and then freeing them in FIFO
order doesn't break the memtype code.  (There's VM_PAT, but that seems
likely to be extremely fragile.)

--Andy


* Re: [PATCH 3/3] dax: Handle write faults more efficiently
  2016-01-25 17:38   ` Andy Lutomirski
@ 2016-01-27  4:17     ` Matthew Wilcox
  2016-01-27  5:22       ` Andy Lutomirski
  2016-01-27  6:01       ` Andy Lutomirski
  0 siblings, 2 replies; 19+ messages in thread
From: Matthew Wilcox @ 2016-01-27  4:17 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Matthew Wilcox, Ingo Molnar, Kees Cook, Andrew Morton,
	linux-kernel, linux-mm

On Mon, Jan 25, 2016 at 09:38:19AM -0800, Andy Lutomirski wrote:
> On Mon, Jan 25, 2016 at 9:25 AM, Matthew Wilcox
> <matthew.r.wilcox@intel.com> wrote:
> > From: Matthew Wilcox <willy@linux.intel.com>
> >
> > When we handle a write-fault on a DAX mapping, we currently insert a
> > read-only mapping and then take the page fault again to convert it to
> > a writable mapping.  This is necessary for the case where we cover a
> > hole with a read-only zero page, but when we have a data block already
> > allocated, it is inefficient.
> >
> > Use the recently added vmf_insert_pfn_prot() to insert a writable mapping,
> > even though the default VM flags say to use a read-only mapping.
> 
> Conceptually, I like this.  Do you need to make sure to do all the
> do_wp_page work, though?  (E.g. we currently update mtime in there.
> Some day I'll fix that, but it'll be replaced with a set_bit to force
> a deferred mtime update.)

We update mtime in the ->fault handler of filesystems which support DAX
like this:

        if (vmf->flags & FAULT_FLAG_WRITE) {
                sb_start_pagefault(inode->i_sb);
                file_update_time(vma->vm_file);
        }

so I think we're covered.


* Re: [PATCH 2/3] mm: Convert vm_insert_pfn_prot to vmf_insert_pfn_prot
  2016-01-25 17:35   ` Andy Lutomirski
@ 2016-01-27  4:18     ` Matthew Wilcox
  0 siblings, 0 replies; 19+ messages in thread
From: Matthew Wilcox @ 2016-01-27  4:18 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Matthew Wilcox, Ingo Molnar, Kees Cook, Andrew Morton,
	linux-kernel, linux-mm

On Mon, Jan 25, 2016 at 09:35:36AM -0800, Andy Lutomirski wrote:
> On Mon, Jan 25, 2016 at 9:25 AM, Matthew Wilcox
> <matthew.r.wilcox@intel.com> wrote:
> > From: Matthew Wilcox <willy@linux.intel.com>
> >
> > Other than the name, the vmf_ version takes a pfn_t parameter, and
> > returns a VM_FAULT_ code suitable for returning from a fault handler.
> >
> > This patch also prevents vm_insert_pfn() from returning -EBUSY.
> > This is a good thing as several callers handled it incorrectly (and
> > none intentionally treat -EBUSY as a different case from 0).
> >
> > Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
> 
> This would be even nicer if you added vmf_insert_pfn as well :)

I've sent out patches adding it before ... my most recent attempt on
January 5th tied up with the DAX support for 1GB pages.  I'll keep
sending it until it sticks :-)


* Re: [PATCH 1/3] x86: Honour passed pgprot in track_pfn_insert() and track_pfn_remap()
  2016-01-25 17:33   ` Andy Lutomirski
  2016-01-25 17:46     ` Andy Lutomirski
@ 2016-01-27  4:40     ` Matthew Wilcox
  2016-01-27  5:44       ` Andy Lutomirski
  1 sibling, 1 reply; 19+ messages in thread
From: Matthew Wilcox @ 2016-01-27  4:40 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Matthew Wilcox, Ingo Molnar, Kees Cook, Andrew Morton,
	linux-kernel, linux-mm

On Mon, Jan 25, 2016 at 09:33:35AM -0800, Andy Lutomirski wrote:
> On Mon, Jan 25, 2016 at 9:25 AM, Matthew Wilcox
> <matthew.r.wilcox@intel.com> wrote:
> > From: Matthew Wilcox <willy@linux.intel.com>
> >
> > track_pfn_insert() overwrites the pgprot that is passed in with a value
> > based on the VMA's page_prot.  This is a problem for people trying to
> > do clever things with the new vm_insert_pfn_prot() as it will simply
> > overwrite the passed protection flags.  If we use the current value of
> > the pgprot as the base, then it will behave as people are expecting.
> >
> > Also fix track_pfn_remap() in the same way.
> 
> Well that's embarrassing.  Presumably it worked for me because I only
> overrode the cacheability bits and lookup_memtype did the right thing.
> 
> But shouldn't the PAT code change the memtype if vm_insert_pfn_prot
> requests it?  Or are there no callers that actually need that?  (HPET
> doesn't, because there's a plain old ioremapped mapping.)

I'm confused.  Here's what I understand:

 - on x86, the bits in pgprot can be considered as two sets of bits;
   the 'cacheability bits' -- those in _PAGE_CACHE_MASK and the
   'protection bits' -- PRESENT, RW, USER, ACCESSED, NX
 - The purpose of track_pfn_insert() is to ensure that the cacheability bits
   are the same on all mappings of a given page, as strongly advised by the
   Intel manuals [1].  So track_pfn_insert() is really only supposed to
   modify _PAGE_CACHE_MASK of the passed pgprot, but in fact it ends up
   modifying the protection bits as well, due to the bug.

I don't think you overrode the cacheability bits at all.  It looks to
me like your patch ends up mapping the HPET into userspace writable.

I don't think the vm_insert_pfn_prot() call gets to change the memtype.
For one, that page may already be mapped into a differet userspace using
the pre-existing memtype, and [1] continues to bite you.  Then there
may be outstanding kernel users of the page that's being mapped in.

So I think track_pfn_insert() is doing the right thing with respect to
the cacheability bits (overwrite the ones passed in), it's just doing
an unexpected thing with regard to the protection bits, which my patch
should fix.
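
In code, the entire fix is which pgprot the cache bits get folded into
(this mirrors the patch; _PAGE_CACHE_MASK covers only the cacheability
bits):

	/* before: built from the VMA, discarding the caller's prot */
	*prot = __pgprot((pgprot_val(vma->vm_page_prot) & (~_PAGE_CACHE_MASK)) |
			 cachemode2protval(pcm));
	/* after: the caller's protection bits survive; the cache bits
	 * are still forced to the tracked memtype */
	*prot = __pgprot((pgprot_val(*prot) & (~_PAGE_CACHE_MASK)) |
			 cachemode2protval(pcm));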

[1] "The PAT allows any memory type to be specified in the page tables,
and therefore it is possible to have a single physical page mapped to
two or more different linear addresses, each with different memory
types. Intel does not support this practice because it may lead to
undefined operations that can result in a system failure. In particular,
a WC page must never be aliased to a cacheable page because WC writes
may not check the processor caches."


* Re: [PATCH 3/3] dax: Handle write faults more efficiently
  2016-01-27  4:17     ` Matthew Wilcox
@ 2016-01-27  5:22       ` Andy Lutomirski
  2016-01-27  6:01       ` Andy Lutomirski
  1 sibling, 0 replies; 19+ messages in thread
From: Andy Lutomirski @ 2016-01-27  5:22 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Matthew Wilcox, Ingo Molnar, Kees Cook, Andrew Morton,
	linux-kernel, linux-mm

On Tue, Jan 26, 2016 at 8:17 PM, Matthew Wilcox <willy@linux.intel.com> wrote:
> On Mon, Jan 25, 2016 at 09:38:19AM -0800, Andy Lutomirski wrote:
>> On Mon, Jan 25, 2016 at 9:25 AM, Matthew Wilcox
>> <matthew.r.wilcox@intel.com> wrote:
>> > From: Matthew Wilcox <willy@linux.intel.com>
>> >
>> > When we handle a write-fault on a DAX mapping, we currently insert a
>> > read-only mapping and then take the page fault again to convert it to
>> > a writable mapping.  This is necessary for the case where we cover a
>> > hole with a read-only zero page, but when we have a data block already
>> > allocated, it is inefficient.
>> >
>> > Use the recently added vmf_insert_pfn_prot() to insert a writable mapping,
>> > even though the default VM flags say to use a read-only mapping.
>>
>> Conceptually, I like this.  Do you need to make sure to do all the
>> do_wp_page work, though?  (E.g. we currently update mtime in there.
>> Some day I'll fix that, but it'll be replaced with a set_bit to force
>> a deferred mtime update.)
>
> We update mtime in the ->fault handler of filesystems which support DAX
> like this:
>
>         if (vmf->flags & FAULT_FLAG_WRITE) {
>                 sb_start_pagefault(inode->i_sb);
>                 file_update_time(vma->vm_file);
>         }
>
> so I think we're covered.

Sounds good.

On second reading, though, what ensures that the vma is
VM_WRITE|VM_SHARED?  If nothing else, some nice comments might help.

A WARN_ON_ONCE that the pgprot you're starting with is RO would be
nice if there's a generic way to do that.  Actually, having a generic
pgprot_writable could make this less ugly.
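
(For illustration, such a helper might look like this on x86 -- purely
a hypothetical sketch, no generic pgprot_writable exists at this point:)

	/* hypothetical; x86-only, a generic version would need a
	 * per-arch definition */
	static inline pgprot_t pgprot_writable(pgprot_t prot)
	{
		return __pgprot(pgprot_val(prot) | _PAGE_RW);
	}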

Also, this optimization could be generalized, albeit a bit slower, by
having handle_pte_fault check whether the inserted pte is read-only for
a write fault and, if so, continue down the function to the wp_page
logic.  After all, returning to the arch entry code and retrying the
fault the old-fashioned way is both very slow and has an outcome
that's known in advance.
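
(A rough sketch of that idea, going by the tail of handle_pte_fault()
in this era's mm/memory.c -- untested, signatures from memory:)

	/* after the handler has installed a PTE: if this was a write
	 * fault and the PTE is still read-only, fall through to the wp
	 * logic instead of returning to the arch code and retrying */
	entry = *pte;
	if ((flags & FAULT_FLAG_WRITE) && !pte_write(entry))
		return do_wp_page(mm, vma, address, pte, pmd, ptl, entry);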

--Andy


* Re: [PATCH 1/3] x86: Honour passed pgprot in track_pfn_insert() and track_pfn_remap()
  2016-01-27  4:40     ` Matthew Wilcox
@ 2016-01-27  5:44       ` Andy Lutomirski
  2016-01-29 14:49         ` Matthew Wilcox
  0 siblings, 1 reply; 19+ messages in thread
From: Andy Lutomirski @ 2016-01-27  5:44 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Matthew Wilcox, Ingo Molnar, Kees Cook, Andrew Morton,
	linux-kernel, linux-mm

On Tue, Jan 26, 2016 at 8:40 PM, Matthew Wilcox <willy@linux.intel.com> wrote:
> On Mon, Jan 25, 2016 at 09:33:35AM -0800, Andy Lutomirski wrote:
>> On Mon, Jan 25, 2016 at 9:25 AM, Matthew Wilcox
>> <matthew.r.wilcox@intel.com> wrote:
>> > From: Matthew Wilcox <willy@linux.intel.com>
>> >
>> > track_pfn_insert() overwrites the pgprot that is passed in with a value
>> > based on the VMA's page_prot.  This is a problem for people trying to
>> > do clever things with the new vm_insert_pfn_prot() as it will simply
>> > overwrite the passed protection flags.  If we use the current value of
>> > the pgprot as the base, then it will behave as people are expecting.
>> >
>> > Also fix track_pfn_remap() in the same way.
>>
>> Well that's embarrassing.  Presumably it worked for me because I only
>> overrode the cacheability bits and lookup_memtype did the right thing.
>>
>> But shouldn't the PAT code change the memtype if vm_insert_pfn_prot
>> requests it?  Or are there no callers that actually need that?  (HPET
>> doesn't, because there's a plain old ioremapped mapping.)
>
> I'm confused.  Here's what I understand:
>
>  - on x86, the bits in pgprot can be considered as two sets of bits;
>    the 'cacheability bits' -- those in _PAGE_CACHE_MASK and the
>    'protection bits' -- PRESENT, RW, USER, ACCESSED, NX
>  - The purpose of track_pfn_insert() is to ensure that the cacheability bits
>    are the same on all mappings of a given page, as strongly advised by the
>    Intel manuals [1].  So track_pfn_insert() is really only supposed to
>    modify _PAGE_CACHE_MASK of the passed pgprot, but in fact it ends up
>    modifying the protection bits as well, due to the bug.
>
> I don't think you overrode the cacheability bits at all.  It looks to
> me like your patch ends up mapping the HPET into userspace writable.

I sure hope not.  If vm_page_prot was writable, something was already
broken, because this is the vvar mapping, and the vvar mapping is
VM_READ (and not even VM_MAYREAD).

>
> I don't think the vm_insert_pfn_prot() call gets to change the memtype.
> For one, that page may already be mapped into a differet userspace using
> the pre-existing memtype, and [1] continues to bite you.  Then there
> may be outstanding kernel users of the page that's being mapped in.

So why was remap_pfn_range different?  I'm sure there was a reason.

I don't think that whatever_pfn_prot should ever map a page
inconsistently, but I find it surprising that some of the variants
call reserve_memtype to change the memtype and others don't.

Anyway, this is in no way an objection to your patches.

--Andy


* Re: [PATCH 3/3] dax: Handle write faults more efficiently
  2016-01-27  4:17     ` Matthew Wilcox
  2016-01-27  5:22       ` Andy Lutomirski
@ 2016-01-27  6:01       ` Andy Lutomirski
  1 sibling, 0 replies; 19+ messages in thread
From: Andy Lutomirski @ 2016-01-27  6:01 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Matthew Wilcox, Ingo Molnar, Kees Cook, Andrew Morton,
	linux-kernel, linux-mm

On Tue, Jan 26, 2016 at 8:17 PM, Matthew Wilcox <willy@linux.intel.com> wrote:
> On Mon, Jan 25, 2016 at 09:38:19AM -0800, Andy Lutomirski wrote:
>> On Mon, Jan 25, 2016 at 9:25 AM, Matthew Wilcox
>> <matthew.r.wilcox@intel.com> wrote:
>> > From: Matthew Wilcox <willy@linux.intel.com>
>> >
>> > When we handle a write-fault on a DAX mapping, we currently insert a
>> > read-only mapping and then take the page fault again to convert it to
>> > a writable mapping.  This is necessary for the case where we cover a
>> > hole with a read-only zero page, but when we have a data block already
>> > allocated, it is inefficient.
>> >
>> > Use the recently added vmf_insert_pfn_prot() to insert a writable mapping,
>> > even though the default VM flags say to use a read-only mapping.
>>
>> Conceptually, I like this.  Do you need to make sure to do all the
>> do_wp_page work, though?  (E.g. we currently update mtime in there.
>> Some day I'll fix that, but it'll be replaced with a set_bit to force
>> a deferred mtime update.)
>
> We update mtime in the ->fault handler of filesystems which support DAX
> like this:
>
>         if (vmf->flags & FAULT_FLAG_WRITE) {
>                 sb_start_pagefault(inode->i_sb);
>                 file_update_time(vma->vm_file);
>         }
>
> so I think we're covered.

A question that came up on IRC: if the page is a reflinked page on XFS
(whenever that feature lands), then presumably XFS has real work to do
in page_mkwrite.  If so, what ensures that page_mkwrite gets called?

As a half-baked alternative to this patch, there's a generic
optimization for this case.  do_shared_fault normally calls
do_page_mkwrite and installs the resulting page with the writable bit
set.  But if __do_fault returns VM_FAULT_NOPAGE, then this
optimization is skipped.  Could we add VM_FAULT_NOPAGE_READONLY (or
VM_FAULT_NOPAGE | VM_FAULT_READONLY) as a hint that a page was
installed but that it was installed readonly?  If we did that, then
do_shared_fault could check that bit and go through the wp_page logic
rather than returning to userspace.
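
(Sketched with hypothetical names -- neither VM_FAULT_READONLY nor this
exact control flow exists:)

	#define VM_FAULT_READONLY	0x2000	/* hypothetical: PTE installed read-only */

	/* in do_shared_fault(), roughly: */
	ret = __do_fault(vma, address, pgoff, flags, NULL, &fault_page);
	if (ret & VM_FAULT_NOPAGE) {
		if (!(ret & VM_FAULT_READONLY))
			return ret;	/* writable PTE already in place */
		/* read-only PTE installed: continue into the
		 * do_page_mkwrite()/wp_page logic instead of returning */
	}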

--Andy


* Re: [PATCH 1/3] x86: Honour passed pgprot in track_pfn_insert() and track_pfn_remap()
  2016-01-27  5:44       ` Andy Lutomirski
@ 2016-01-29 14:49         ` Matthew Wilcox
  2016-01-29 22:19           ` Andy Lutomirski
  2016-02-09 14:24           ` Ingo Molnar
  0 siblings, 2 replies; 19+ messages in thread
From: Matthew Wilcox @ 2016-01-29 14:49 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Matthew Wilcox, Ingo Molnar, Kees Cook, Andrew Morton,
	linux-kernel, linux-mm

On Tue, Jan 26, 2016 at 09:44:24PM -0800, Andy Lutomirski wrote:
> On Tue, Jan 26, 2016 at 8:40 PM, Matthew Wilcox <willy@linux.intel.com> wrote:
> > On Mon, Jan 25, 2016 at 09:33:35AM -0800, Andy Lutomirski wrote:
> >> On Mon, Jan 25, 2016 at 9:25 AM, Matthew Wilcox
> >> <matthew.r.wilcox@intel.com> wrote:
> >> > From: Matthew Wilcox <willy@linux.intel.com>
> >> >
> >> > track_pfn_insert() overwrites the pgprot that is passed in with a value
> >> > based on the VMA's page_prot.  This is a problem for people trying to
> >> > do clever things with the new vm_insert_pfn_prot() as it will simply
> >> > overwrite the passed protection flags.  If we use the current value of
> >> > the pgprot as the base, then it will behave as people are expecting.
> >> >
> >> > Also fix track_pfn_remap() in the same way.
> >>
> >> Well that's embarrassing.  Presumably it worked for me because I only
> >> overrode the cacheability bits and lookup_memtype did the right thing.
> >>
> >> But shouldn't the PAT code change the memtype if vm_insert_pfn_prot
> >> requests it?  Or are there no callers that actually need that?  (HPET
> >> doesn't, because there's a plain old ioremapped mapping.)
> >
> > I'm confused.  Here's what I understand:
> >
> >  - on x86, the bits in pgprot can be considered as two sets of bits;
> >    the 'cacheability bits' -- those in _PAGE_CACHE_MASK and the
> >    'protection bits' -- PRESENT, RW, USER, ACCESSED, NX
> >  - The purpose of track_pfn_insert() is to ensure that the cacheability bits
> >    are the same on all mappings of a given page, as strongly advised by the
> >    Intel manuals [1].  So track_pfn_insert() is really only supposed to
> >    modify _PAGE_CACHE_MASK of the passed pgprot, but in fact it ends up
> >    modifying the protection bits as well, due to the bug.
> >
> > I don't think you overrode the cacheability bits at all.  It looks to
> > me like your patch ends up mapping the HPET into userspace writable.
> 
> I sure hope not.  If vm_page_prot was writable, something was already
> broken, because this is the vvar mapping, and the vvar mapping is
> VM_READ (and not even VM_MAYREAD).

I do beg your pardon.  I thought you were inserting a readonly page
into the middle of a writable mapping.  Instead you're inserting a
non-executable page into the middle of a VM_READ | VM_EXEC mapping.
Sorry for the confusion.  I should have written:

"like your patch ends up mapping the HPET into userspace executable"

which is far less exciting.

> > I don't think the vm_insert_pfn_prot() call gets to change the memtype.
> > For one, that page may already be mapped into a different userspace using
> > the pre-existing memtype, and [1] continues to bite you.  Then there
> > may be outstanding kernel users of the page that's being mapped in.
> 
> So why was remap_pfn_range different?  I'm sure there was a reason.

Yeah, doesn't make sense to me either.


* Re: [PATCH 1/3] x86: Honour passed pgprot in track_pfn_insert() and track_pfn_remap()
  2016-01-29 14:49         ` Matthew Wilcox
@ 2016-01-29 22:19           ` Andy Lutomirski
  2016-02-09 14:24           ` Ingo Molnar
  1 sibling, 0 replies; 19+ messages in thread
From: Andy Lutomirski @ 2016-01-29 22:19 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Matthew Wilcox, Ingo Molnar, Kees Cook, Andrew Morton,
	linux-kernel, linux-mm

On Fri, Jan 29, 2016 at 6:49 AM, Matthew Wilcox <willy@linux.intel.com> wrote:
> On Tue, Jan 26, 2016 at 09:44:24PM -0800, Andy Lutomirski wrote:
>> On Tue, Jan 26, 2016 at 8:40 PM, Matthew Wilcox <willy@linux.intel.com> wrote:
>> > On Mon, Jan 25, 2016 at 09:33:35AM -0800, Andy Lutomirski wrote:
>> >> On Mon, Jan 25, 2016 at 9:25 AM, Matthew Wilcox
>> >> <matthew.r.wilcox@intel.com> wrote:
>> >> > From: Matthew Wilcox <willy@linux.intel.com>
>> >> >
>> >> > track_pfn_insert() overwrites the pgprot that is passed in with a value
>> >> > based on the VMA's page_prot.  This is a problem for people trying to
>> >> > do clever things with the new vm_insert_pfn_prot() as it will simply
>> >> > overwrite the passed protection flags.  If we use the current value of
>> >> > the pgprot as the base, then it will behave as people are expecting.
>> >> >
>> >> > Also fix track_pfn_remap() in the same way.
>> >>
>> >> Well that's embarrassing.  Presumably it worked for me because I only
>> >> overrode the cacheability bits and lookup_memtype did the right thing.
>> >>
>> >> But shouldn't the PAT code change the memtype if vm_insert_pfn_prot
>> >> requests it?  Or are there no callers that actually need that?  (HPET
>> >> doesn't, because there's a plain old ioremapped mapping.)
>> >
>> > I'm confused.  Here's what I understand:
>> >
>> >  - on x86, the bits in pgprot can be considered as two sets of bits;
>> >    the 'cacheability bits' -- those in _PAGE_CACHE_MASK and the
>> >    'protection bits' -- PRESENT, RW, USER, ACCESSED, NX
>> >  - The purpose of track_pfn_insert() is to ensure that the cacheability bits
>> >    are the same on all mappings of a given page, as strongly advised by the
>> >    Intel manuals [1].  So track_pfn_insert() is really only supposed to
>> >    modify _PAGE_CACHE_MASK of the passed pgprot, but in fact it ends up
>> >    modifying the protection bits as well, due to the bug.
>> >
>> > I don't think you overrode the cacheability bits at all.  It looks to
>> > me like your patch ends up mapping the HPET into userspace writable.
>>
>> I sure hope not.  If vm_page_prot was writable, something was already
>> broken, because this is the vvar mapping, and the vvar mapping is
>> VM_READ (and not even VM_MAYREAD).
>
> I do beg your pardon.  I thought you were inserting a readonly page
> into the middle of a writable mapping.  Instead you're inserting a
> non-executable page into the middle of a VM_READ | VM_EXEC mapping.
> Sorry for the confusion.  I should have written:
>
> "like your patch ends up mapping the HPET into userspace executable"
>
> which is far less exciting.

I think it's not even that.  That particular mapping is just VM_READ.

Anyway, this patch is:

Acked-by: Andy Lutomirski <luto@kernel.org>

Ingo etc: this patch should probably go into tip:x86/asm -- the code
currently in there is wrong, even if it has no obvious symptom.

--Andy


* Re: [PATCH 1/3] x86: Honour passed pgprot in track_pfn_insert() and track_pfn_remap()
  2016-01-29 14:49         ` Matthew Wilcox
  2016-01-29 22:19           ` Andy Lutomirski
@ 2016-02-09 14:24           ` Ingo Molnar
  2016-02-10  3:06             ` Andy Lutomirski
  1 sibling, 1 reply; 19+ messages in thread
From: Ingo Molnar @ 2016-02-09 14:24 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Andy Lutomirski, Matthew Wilcox, Ingo Molnar, Kees Cook,
	Andrew Morton, linux-kernel, linux-mm


* Matthew Wilcox <willy@linux.intel.com> wrote:

> > I sure hope not.  If vm_page_prot was writable, something was already broken, 
> > because this is the vvar mapping, and the vvar mapping is VM_READ (and not 
> > even VM_MAYREAD).
> 
> I do beg your pardon.  I thought you were inserting a readonly page into the
> middle of a writable mapping.  Instead you're inserting a non-executable page 
> into the middle of a VM_READ | VM_EXEC mapping. Sorry for the confusion.  I 
> should have written:
> 
> "like your patch ends up mapping the HPET into userspace executable"
> 
> which is far less exciting.

Btw., as a side note, an executable HPET page has its own dangers as
well: because its value constantly changes, it can probabilistically
contain 'sensible' (and dangerous) executable x86 instructions that
exploits can return to.

So only mapping it readable (which Andy's patch attempts, I think) is worthwhile.

Thanks,

	Ingo


* [tip:x86/asm] x86/mm: Honour passed pgprot in track_pfn_insert() and track_pfn_remap()
  2016-01-25 17:25 ` [PATCH 1/3] x86: Honour passed pgprot in track_pfn_insert() and track_pfn_remap() Matthew Wilcox
  2016-01-25 17:33   ` Andy Lutomirski
@ 2016-02-09 16:09   ` tip-bot for Matthew Wilcox
  1 sibling, 0 replies; 19+ messages in thread
From: tip-bot for Matthew Wilcox @ 2016-02-09 16:09 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: keescook, tglx, linux-kernel, mingo, luto, peterz, willy, luto,
	torvalds, hpa, akpm

Commit-ID:  dd7b6847670a84b7bb7c38f8e69b2f12059bca66
Gitweb:     http://git.kernel.org/tip/dd7b6847670a84b7bb7c38f8e69b2f12059bca66
Author:     Matthew Wilcox <willy@linux.intel.com>
AuthorDate: Mon, 25 Jan 2016 12:25:15 -0500
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Tue, 9 Feb 2016 15:25:36 +0100

x86/mm: Honour passed pgprot in track_pfn_insert() and track_pfn_remap()

track_pfn_insert() overwrites the pgprot that is passed in with a value
based on the VMA's page_prot.  This is a problem for people trying to
do clever things with the new vm_insert_pfn_prot() as it will simply
overwrite the passed protection flags.  If we use the current value of
the pgprot as the base, then it will behave as people are expecting.

Also fix track_pfn_remap() in the same way.

Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
Acked-by: Andy Lutomirski <luto@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/1453742717-10326-2-git-send-email-matthew.r.wilcox@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/mm/pat.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/mm/pat.c b/arch/x86/mm/pat.c
index f4ae536..04e2e71 100644
--- a/arch/x86/mm/pat.c
+++ b/arch/x86/mm/pat.c
@@ -943,7 +943,7 @@ int track_pfn_remap(struct vm_area_struct *vma, pgprot_t *prot,
 			return -EINVAL;
 	}
 
-	*prot = __pgprot((pgprot_val(vma->vm_page_prot) & (~_PAGE_CACHE_MASK)) |
+	*prot = __pgprot((pgprot_val(*prot) & (~_PAGE_CACHE_MASK)) |
 			 cachemode2protval(pcm));
 
 	return 0;
@@ -959,7 +959,7 @@ int track_pfn_insert(struct vm_area_struct *vma, pgprot_t *prot,
 
 	/* Set prot based on lookup */
 	pcm = lookup_memtype(pfn_t_to_phys(pfn));
-	*prot = __pgprot((pgprot_val(vma->vm_page_prot) & (~_PAGE_CACHE_MASK)) |
+	*prot = __pgprot((pgprot_val(*prot) & (~_PAGE_CACHE_MASK)) |
 			 cachemode2protval(pcm));
 
 	return 0;


* Re: [PATCH 1/3] x86: Honour passed pgprot in track_pfn_insert() and track_pfn_remap()
  2016-02-09 14:24           ` Ingo Molnar
@ 2016-02-10  3:06             ` Andy Lutomirski
  0 siblings, 0 replies; 19+ messages in thread
From: Andy Lutomirski @ 2016-02-10  3:06 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Matthew Wilcox, Kees Cook, Ingo Molnar, Matthew Wilcox, linux-mm,
	linux-kernel, Andrew Morton

On Feb 9, 2016 6:24 AM, "Ingo Molnar" <mingo@kernel.org> wrote:
>
>
> * Matthew Wilcox <willy@linux.intel.com> wrote:
>
> > > I sure hope not.  If vm_page_prot was writable, something was already broken,
> > > because this is the vvar mapping, and the vvar mapping is VM_READ (and not
> > > even VM_MAYREAD).
> >
> > I do beg your pardon.  I thought you were inserting a readonly page into the
> > middle of a writable mapping.  Instead you're inserting a non-executable page
> > into the middle of a VM_READ | VM_EXEC mapping. Sorry for the confusion.  I
> > should have written:
> >
> > "like your patch ends up mapping the HPET into userspace executable"
> >
> > which is far less exciting.
>
> Btw., a side note, an executable HPET page has its own dangers as well, for
> example because it always changes in value, it can probabilistically represent
> 'sensible' (and dangerous) executable x86 instructions that exploits can return
> to.
>
> So only mapping it readable (which Andy's patch attempts I think) is worthwile.

The whole vma is readable but not executable, so I don't think this
was a problem.  It's also at a randomized address, which helps.

--Andy
