kvm.vger.kernel.org archive mirror
* [PATCH 0/2] KVM: do not assume PTE is writable after follow_pfn
@ 2021-02-05 10:32 Paolo Bonzini
  2021-02-05 10:32 ` [PATCH 1/2] mm: provide a sane PTE walking API for modules Paolo Bonzini
                   ` (2 more replies)
  0 siblings, 3 replies; 17+ messages in thread
From: Paolo Bonzini @ 2021-02-05 10:32 UTC (permalink / raw)
  To: linux-kernel, kvm; +Cc: jgg, linux-mm, Andrew Morton, dan.j.williams

This series is the first step towards fixing KVM's usage of follow_pfn.
The immediate fix here is that KVM is not checking the writability of
the PFN, which actually dates back to way before the introduction of
follow_pfn in commit add6a0cd1c5b ("KVM: MMU: try to fix up page faults
before giving up", 2016-07-05).  There are more changes needed to
invalidate gfn-to-pfn caches from MMU notifiers, but this issue will
be tackled later.

A more fundamental issue however is that the follow_pfn function is
basically impossible to use correctly.  Almost all users for example
are assuming that the page is writable; KVM was not alone in this
mistake.  follow_pte, despite not being exported for modules, is a
far saner API.  Therefore, patch 1 simplifies follow_pte a bit and
makes it available to modules.
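
As an illustration (not part of the patches themselves, and assuming
the caller has its own pfn and writable variables), the calling
pattern that the new follow_pte enables is the one patch 2 adopts
in KVM:

	pte_t *ptep;
	spinlock_t *ptl;
	int r;

	r = follow_pte(vma->vm_mm, addr, &ptep, &ptl);
	if (r)
		return r;
	/* *ptep is only stable while ptl is held. */
	writable = pte_write(*ptep);
	pfn = pte_pfn(*ptep);
	pte_unmap_unlock(ptep, ptl);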

Please review and possibly ack for inclusion in the KVM tree,
thanks!

Paolo


Paolo Bonzini (2):
  mm: provide a sane PTE walking API for modules
  KVM: do not assume PTE is writable after follow_pfn

 arch/s390/pci/pci_mmio.c |  2 +-
 fs/dax.c                 |  5 +++--
 include/linux/mm.h       |  6 ++++--
 mm/memory.c              | 35 ++++++++++++++++++++++++++++++-----
 virt/kvm/kvm_main.c      | 15 ++++++++++++---
 5 files changed, 50 insertions(+), 13 deletions(-)

-- 
2.26.2



* [PATCH 1/2] mm: provide a sane PTE walking API for modules
  2021-02-05 10:32 [PATCH 0/2] KVM: do not assume PTE is writable after follow_pfn Paolo Bonzini
@ 2021-02-05 10:32 ` Paolo Bonzini
  2021-02-05 13:49   ` Jason Gunthorpe
  2021-02-08 17:39   ` Christoph Hellwig
  2021-02-05 10:32 ` [PATCH 2/2] KVM: do not assume PTE is writable after follow_pfn Paolo Bonzini
  2021-02-05 18:14 ` [PATCH 0/2] " Peter Xu
  2 siblings, 2 replies; 17+ messages in thread
From: Paolo Bonzini @ 2021-02-05 10:32 UTC (permalink / raw)
  To: linux-kernel, kvm; +Cc: jgg, linux-mm, Andrew Morton, dan.j.williams

Currently, the follow_pfn function is exported for modules but
follow_pte is not.  However, follow_pfn is very easy to misuse,
because it does not provide protections (so most of its callers
assume the page is writable!) and because it returns after having
already unlocked the page table lock.

Provide instead a simplified version of follow_pte that does
not have the pmdpp and range arguments.  The older version
survives as follow_invalidate_pte() for use by fs/dax.c.
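
In other words, callers that used to write

	follow_pte(mm, address, NULL, &ptep, NULL, &ptl);

now simply write

	follow_pte(mm, address, &ptep, &ptl);

as the conversions below show.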

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 arch/s390/pci/pci_mmio.c |  2 +-
 fs/dax.c                 |  5 +++--
 include/linux/mm.h       |  6 ++++--
 mm/memory.c              | 35 ++++++++++++++++++++++++++++++-----
 4 files changed, 38 insertions(+), 10 deletions(-)

diff --git a/arch/s390/pci/pci_mmio.c b/arch/s390/pci/pci_mmio.c
index 18f2d10c3176..5922dce94bfd 100644
--- a/arch/s390/pci/pci_mmio.c
+++ b/arch/s390/pci/pci_mmio.c
@@ -170,7 +170,7 @@ SYSCALL_DEFINE3(s390_pci_mmio_write, unsigned long, mmio_addr,
 	if (!(vma->vm_flags & VM_WRITE))
 		goto out_unlock_mmap;
 
-	ret = follow_pte(vma->vm_mm, mmio_addr, NULL, &ptep, NULL, &ptl);
+	ret = follow_pte(vma->vm_mm, mmio_addr, &ptep, &ptl);
 	if (ret)
 		goto out_unlock_mmap;
 
@@ -311,7 +311,7 @@ SYSCALL_DEFINE3(s390_pci_mmio_read, unsigned long, mmio_addr,
 	if (!(vma->vm_flags & VM_WRITE))
 		goto out_unlock_mmap;
 
-	ret = follow_pte(vma->vm_mm, mmio_addr, NULL, &ptep, NULL, &ptl);
+	ret = follow_pte(vma->vm_mm, mmio_addr, &ptep, &ptl);
 	if (ret)
 		goto out_unlock_mmap;
 
diff --git a/fs/dax.c b/fs/dax.c
index 26d5dcd2d69e..b14874c7b983 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -810,11 +810,12 @@ static void dax_entry_mkclean(struct address_space *mapping, pgoff_t index,
 		address = pgoff_address(index, vma);
 
 		/*
-		 * Note because we provide range to follow_pte it will call
+		 * follow_invalidate_pte() will use the range to call
 		 * mmu_notifier_invalidate_range_start() on our behalf before
 		 * taking any lock.
 		 */
-		if (follow_pte(vma->vm_mm, address, &range, &ptep, &pmdp, &ptl))
+		if (follow_invalidate_pte(vma->vm_mm, address, &range, &ptep,
+					  &pmdp, &ptl))
 			continue;
 
 		/*
diff --git a/include/linux/mm.h b/include/linux/mm.h
index ecdf8a8cd6ae..90b527260edf 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1658,9 +1658,11 @@ void free_pgd_range(struct mmu_gather *tlb, unsigned long addr,
 		unsigned long end, unsigned long floor, unsigned long ceiling);
 int
 copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma);
+int follow_invalidate_pte(struct mm_struct *mm, unsigned long address,
+			  struct mmu_notifier_range *range, pte_t **ptepp, pmd_t **pmdpp,
+			  spinlock_t **ptlp);
 int follow_pte(struct mm_struct *mm, unsigned long address,
-		struct mmu_notifier_range *range, pte_t **ptepp, pmd_t **pmdpp,
-		spinlock_t **ptlp);
+	       pte_t **ptepp, spinlock_t **ptlp);
 int follow_pfn(struct vm_area_struct *vma, unsigned long address,
 	unsigned long *pfn);
 int follow_phys(struct vm_area_struct *vma, unsigned long address,
diff --git a/mm/memory.c b/mm/memory.c
index feff48e1465a..b247d296931c 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4709,9 +4709,9 @@ int __pmd_alloc(struct mm_struct *mm, pud_t *pud, unsigned long address)
 }
 #endif /* __PAGETABLE_PMD_FOLDED */
 
-int follow_pte(struct mm_struct *mm, unsigned long address,
-	       struct mmu_notifier_range *range, pte_t **ptepp, pmd_t **pmdpp,
-	       spinlock_t **ptlp)
+int follow_invalidate_pte(struct mm_struct *mm, unsigned long address,
+			  struct mmu_notifier_range *range, pte_t **ptepp,
+			  pmd_t **pmdpp, spinlock_t **ptlp)
 {
 	pgd_t *pgd;
 	p4d_t *p4d;
@@ -4776,6 +4776,31 @@ int follow_pte(struct mm_struct *mm, unsigned long address,
 	return -EINVAL;
 }
 
+/**
+ * follow_pte - look up PTE at a user virtual address
+ * @mm: the mm_struct of the target address space
+ * @address: user virtual address
+ * @ptepp: location to store found PTE
+ * @ptlp: location to store the lock for the PTE
+ *
+ * On a successful return, the pointer to the PTE is stored in @ptepp;
+ * the corresponding lock is taken and its location is stored in @ptlp.
+ * The contents of the PTE are only stable until @ptlp is released;
+ * any further use, if any, must be protected against invalidation
+ * with MMU notifiers.
+ *
+ * Only IO mappings and raw PFN mappings are allowed.  The mmap semaphore
+ * should be taken for read.
+ *
+ * Return: zero on success, -ve otherwise.
+ */
+int follow_pte(struct mm_struct *mm, unsigned long address,
+	       pte_t **ptepp, spinlock_t **ptlp)
+{
+	return follow_invalidate_pte(mm, address, NULL, ptepp, NULL, ptlp);
+}
+EXPORT_SYMBOL_GPL(follow_pte);
+
 /**
  * follow_pfn - look up PFN at a user virtual address
  * @vma: memory mapping
@@ -4796,7 +4821,7 @@ int follow_pfn(struct vm_area_struct *vma, unsigned long address,
 	if (!(vma->vm_flags & (VM_IO | VM_PFNMAP)))
 		return ret;
 
-	ret = follow_pte(vma->vm_mm, address, NULL, &ptep, NULL, &ptl);
+	ret = follow_pte(vma->vm_mm, address, &ptep, &ptl);
 	if (ret)
 		return ret;
 	*pfn = pte_pfn(*ptep);
@@ -4817,7 +4842,7 @@ int follow_phys(struct vm_area_struct *vma,
 	if (!(vma->vm_flags & (VM_IO | VM_PFNMAP)))
 		goto out;
 
-	if (follow_pte(vma->vm_mm, address, NULL, &ptep, NULL, &ptl))
+	if (follow_pte(vma->vm_mm, address, &ptep, &ptl))
 		goto out;
 	pte = *ptep;
 
-- 
2.26.2




* [PATCH 2/2] KVM: do not assume PTE is writable after follow_pfn
  2021-02-05 10:32 [PATCH 0/2] KVM: do not assume PTE is writable after follow_pfn Paolo Bonzini
  2021-02-05 10:32 ` [PATCH 1/2] mm: provide a sane PTE walking API for modules Paolo Bonzini
@ 2021-02-05 10:32 ` Paolo Bonzini
  2021-02-05 15:43   ` kernel test robot
  2021-02-05 17:41   ` kernel test robot
  2021-02-05 18:14 ` [PATCH 0/2] " Peter Xu
  2 siblings, 2 replies; 17+ messages in thread
From: Paolo Bonzini @ 2021-02-05 10:32 UTC (permalink / raw)
  To: linux-kernel, kvm; +Cc: jgg, linux-mm, Andrew Morton, dan.j.williams

In order to convert an HVA to a PFN, KVM usually tries to use
the get_user_pages family of functions.  This, however, is not
possible for VM_IO vmas; in that case, KVM instead uses follow_pfn.

In doing so, however, KVM loses the information on whether the
PFN is writable.  That is usually not a problem, because the main
use of VM_IO vmas with KVM is for BARs in PCI device assignment;
it is nevertheless a bug.  To fix it, use follow_pte and check
pte_write while under the protection of the PTE lock.  The
information can then be used either to fail hva_to_pfn_remapped
or to be passed back to the caller via *writable.

Usage of follow_pfn was introduced in commit add6a0cd1c5b ("KVM: MMU: try to fix
up page faults before giving up", 2016-07-05); however, even older versions
have the same issue, all the way back to commit 2e2e3738af33 ("KVM:
Handle vma regions with no backing page", 2008-07-20), as they also did
not check whether the PFN was writable.

Fixes: 2e2e3738af33 ("KVM: Handle vma regions with no backing page")
Reported-by: David Stevens <stevensd@google.com>
Cc: 3pvd@google.com
Cc: Jann Horn <jannh@google.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 virt/kvm/kvm_main.c | 15 ++++++++++++---
 1 file changed, 12 insertions(+), 3 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index f34a85b93c2d..ee4ac2618ec5 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1907,9 +1907,11 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma,
 			       kvm_pfn_t *p_pfn)
 {
 	unsigned long pfn;
+	pte_t *ptep;
+	spinlock_t *ptl;
 	int r;
 
-	r = follow_pfn(vma, addr, &pfn);
+	r = follow_pte(vma->vm_mm, addr, &ptep, &ptl);
 	if (r) {
 		/*
 		 * get_user_pages fails for VM_IO and VM_PFNMAP vmas and does
@@ -1924,14 +1926,19 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma,
 		if (r)
 			return r;
 
-		r = follow_pfn(vma, addr, &pfn);
+		r = follow_pte(vma->vm_mm, addr, &ptep, &ptl);
 		if (r)
 			return r;
+	}
 
+	if (write_fault && !pte_write(*ptep)) {
+		pfn = KVM_PFN_ERR_RO_FAULT;
+		goto out;
 	}
 
 	if (writable)
-		*writable = true;
+		*writable = pte_write(*ptep);
+	pfn = pte_pfn(*ptep);
 
 	/*
 	 * Get a reference here because callers of *hva_to_pfn* and
@@ -1946,6 +1953,8 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma,
 	 */ 
 	kvm_get_pfn(pfn);
 
+out:
+	pte_unmap_unlock(ptep, ptl);
 	*p_pfn = pfn;
 	return 0;
 }
-- 
2.26.2



* Re: [PATCH 1/2] mm: provide a sane PTE walking API for modules
  2021-02-05 10:32 ` [PATCH 1/2] mm: provide a sane PTE walking API for modules Paolo Bonzini
@ 2021-02-05 13:49   ` Jason Gunthorpe
  2021-02-08 17:39   ` Christoph Hellwig
  1 sibling, 0 replies; 17+ messages in thread
From: Jason Gunthorpe @ 2021-02-05 13:49 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: linux-kernel, kvm, linux-mm, Andrew Morton, dan.j.williams

On Fri, Feb 05, 2021 at 05:32:58AM -0500, Paolo Bonzini wrote:
> Currently, the follow_pfn function is exported for modules but
> follow_pte is not.  However, follow_pfn is very easy to misuse,
> because it does not provide protections (so most of its callers
> assume the page is writable!) and because it returns after having
> already unlocked the page table lock.
> 
> Provide instead a simplified version of follow_pte that does
> not have the pmdpp and range arguments.  The older version
> survives as follow_invalidate_pte() for use by fs/dax.c.
> 
> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> ---
>  arch/s390/pci/pci_mmio.c |  2 +-
>  fs/dax.c                 |  5 +++--
>  include/linux/mm.h       |  6 ++++--
>  mm/memory.c              | 35 ++++++++++++++++++++++++++++++-----
>  4 files changed, 38 insertions(+), 10 deletions(-)

Looks good to me, thanks

Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>

Jason


* Re: [PATCH 2/2] KVM: do not assume PTE is writable after follow_pfn
  2021-02-05 10:32 ` [PATCH 2/2] KVM: do not assume PTE is writable after follow_pfn Paolo Bonzini
@ 2021-02-05 15:43   ` kernel test robot
  2021-02-05 17:41   ` kernel test robot
  1 sibling, 0 replies; 17+ messages in thread
From: kernel test robot @ 2021-02-05 15:43 UTC (permalink / raw)
  To: Paolo Bonzini, linux-kernel, kvm
  Cc: kbuild-all, jgg, linux-mm, Andrew Morton, dan.j.williams

[-- Attachment #1: Type: text/plain, Size: 2497 bytes --]

Hi Paolo,

I love your patch! Perhaps something to improve:

[auto build test WARNING on linux/master]
[also build test WARNING on linus/master v5.11-rc6]
[cannot apply to kvm/linux-next hnaz-linux-mm/master]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/Paolo-Bonzini/KVM-do-not-assume-PTE-is-writable-after-follow_pfn/20210205-183434
base:   https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git 2ab38c17aac10bf55ab3efde4c4db3893d8691d2
config: i386-randconfig-c004-20210205 (attached as .config)
compiler: gcc-9 (Debian 9.3.0-15) 9.3.0
reproduce (this is a W=1 build):
        # https://github.com/0day-ci/linux/commit/c2a981899d009fddc691f2edf5ad54e9e5a57401
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review Paolo-Bonzini/KVM-do-not-assume-PTE-is-writable-after-follow_pfn/20210205-183434
        git checkout c2a981899d009fddc691f2edf5ad54e9e5a57401
        # save the attached .config to linux build tree
        make W=1 ARCH=i386 

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All warnings (new ones prefixed by >>):

   In file included from arch/x86/kvm/../../../virt/kvm/kvm_main.c:18:
   arch/x86/kvm/../../../virt/kvm/kvm_main.c: In function 'hva_to_pfn_remapped':
>> include/linux/kvm_host.h:89:30: warning: conversion from 'long long unsigned int' to 'long unsigned int' changes value from '9218868437227405314' to '2' [-Woverflow]
      89 | #define KVM_PFN_ERR_RO_FAULT (KVM_PFN_ERR_MASK + 2)
         |                              ^
   arch/x86/kvm/../../../virt/kvm/kvm_main.c:1932:9: note: in expansion of macro 'KVM_PFN_ERR_RO_FAULT'
    1932 |   pfn = KVM_PFN_ERR_RO_FAULT;
         |         ^~~~~~~~~~~~~~~~~~~~


vim +89 include/linux/kvm_host.h

9c5b11728344e1 Xiao Guangrong 2012-08-03  86  
9c5b11728344e1 Xiao Guangrong 2012-08-03  87  #define KVM_PFN_ERR_FAULT	(KVM_PFN_ERR_MASK)
9c5b11728344e1 Xiao Guangrong 2012-08-03  88  #define KVM_PFN_ERR_HWPOISON	(KVM_PFN_ERR_MASK + 1)
81c52c56e2b435 Xiao Guangrong 2012-10-16 @89  #define KVM_PFN_ERR_RO_FAULT	(KVM_PFN_ERR_MASK + 2)
6c8ee57be9350c Xiao Guangrong 2012-08-03  90  
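
For reference, the truncation happens because hva_to_pfn_remapped()
declares "unsigned long pfn" while KVM_PFN_ERR_RO_FAULT is a kvm_pfn_t
(u64) constant that does not fit in a 32-bit unsigned long.  A minimal
fix sketch, assuming the local's type is simply widened, would be:

	-	unsigned long pfn;
	+	kvm_pfn_t pfn;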

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 32689 bytes --]


* Re: [PATCH 2/2] KVM: do not assume PTE is writable after follow_pfn
  2021-02-05 10:32 ` [PATCH 2/2] KVM: do not assume PTE is writable after follow_pfn Paolo Bonzini
  2021-02-05 15:43   ` kernel test robot
@ 2021-02-05 17:41   ` kernel test robot
  1 sibling, 0 replies; 17+ messages in thread
From: kernel test robot @ 2021-02-05 17:41 UTC (permalink / raw)
  To: Paolo Bonzini, linux-kernel, kvm
  Cc: kbuild-all, jgg, linux-mm, Andrew Morton, dan.j.williams

[-- Attachment #1: Type: text/plain, Size: 2550 bytes --]

Hi Paolo,

I love your patch! Yet something to improve:

[auto build test ERROR on linux/master]
[also build test ERROR on linus/master v5.11-rc6 next-20210125]
[cannot apply to kvm/linux-next hnaz-linux-mm/master]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/Paolo-Bonzini/KVM-do-not-assume-PTE-is-writable-after-follow_pfn/20210205-183434
base:   https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git 2ab38c17aac10bf55ab3efde4c4db3893d8691d2
config: i386-randconfig-a011-20210205 (attached as .config)
compiler: gcc-9 (Debian 9.3.0-15) 9.3.0
reproduce (this is a W=1 build):
        # https://github.com/0day-ci/linux/commit/c2a981899d009fddc691f2edf5ad54e9e5a57401
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review Paolo-Bonzini/KVM-do-not-assume-PTE-is-writable-after-follow_pfn/20210205-183434
        git checkout c2a981899d009fddc691f2edf5ad54e9e5a57401
        # save the attached .config to linux build tree
        make W=1 ARCH=i386 

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

   In file included from arch/x86/kvm/../../../virt/kvm/kvm_main.c:18:
   arch/x86/kvm/../../../virt/kvm/kvm_main.c: In function 'hva_to_pfn_remapped':
>> include/linux/kvm_host.h:89:30: error: conversion from 'long long unsigned int' to 'long unsigned int' changes value from '9218868437227405314' to '2' [-Werror=overflow]
      89 | #define KVM_PFN_ERR_RO_FAULT (KVM_PFN_ERR_MASK + 2)
         |                              ^
   arch/x86/kvm/../../../virt/kvm/kvm_main.c:1932:9: note: in expansion of macro 'KVM_PFN_ERR_RO_FAULT'
    1932 |   pfn = KVM_PFN_ERR_RO_FAULT;
         |         ^~~~~~~~~~~~~~~~~~~~
   cc1: all warnings being treated as errors


vim +89 include/linux/kvm_host.h

9c5b11728344e1 Xiao Guangrong 2012-08-03  86  
9c5b11728344e1 Xiao Guangrong 2012-08-03  87  #define KVM_PFN_ERR_FAULT	(KVM_PFN_ERR_MASK)
9c5b11728344e1 Xiao Guangrong 2012-08-03  88  #define KVM_PFN_ERR_HWPOISON	(KVM_PFN_ERR_MASK + 1)
81c52c56e2b435 Xiao Guangrong 2012-10-16 @89  #define KVM_PFN_ERR_RO_FAULT	(KVM_PFN_ERR_MASK + 2)
6c8ee57be9350c Xiao Guangrong 2012-08-03  90  

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 40584 bytes --]


* Re: [PATCH 0/2] KVM: do not assume PTE is writable after follow_pfn
  2021-02-05 10:32 [PATCH 0/2] KVM: do not assume PTE is writable after follow_pfn Paolo Bonzini
  2021-02-05 10:32 ` [PATCH 1/2] mm: provide a sane PTE walking API for modules Paolo Bonzini
  2021-02-05 10:32 ` [PATCH 2/2] KVM: do not assume PTE is writable after follow_pfn Paolo Bonzini
@ 2021-02-05 18:14 ` Peter Xu
  2021-02-08 18:51   ` Jason Gunthorpe
  2 siblings, 1 reply; 17+ messages in thread
From: Peter Xu @ 2021-02-05 18:14 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: linux-kernel, kvm, jgg, linux-mm, Andrew Morton, dan.j.williams

On Fri, Feb 05, 2021 at 05:32:57AM -0500, Paolo Bonzini wrote:
> This series is the first step towards fixing KVM's usage of follow_pfn.
> The immediate fix here is that KVM is not checking the writability of
> the PFN, which actually dates back to way before the introduction of
> follow_pfn in commit add6a0cd1c5b ("KVM: MMU: try to fix up page faults
> before giving up", 2016-07-05).  There are more changes needed to
> invalidate gfn-to-pfn caches from MMU notifiers, but this issue will
> be tackled later.
> 
> A more fundamental issue however is that the follow_pfn function is
> basically impossible to use correctly.  Almost all users for example
> are assuming that the page is writable; KVM was not alone in this
> mistake.  follow_pte, despite not being exported for modules, is a
> far saner API.  Therefore, patch 1 simplifies follow_pte a bit and
> makes it available to modules.
> 
> Please review and possibly ack for inclusion in the KVM tree,
> thanks!

FWIW, the patches look correct to me (once the build report against patch 2 is fixed):

Reviewed-by: Peter Xu <peterx@redhat.com>

But I do have a question on why dax as the only user needs to pass in the
notifier to follow_pte() for initialization.

Indeed there's a difference in how the notifier's start/end are
initialized depending on whether it's a huge pmd, but since the pmdp
is passed back too, I assume the caller should know how to initialize
the notifier anyway.

The thing is that, at least in the current code, we could send
meaningless notifiers, e.g., in follow_pte():

	if (range) {
		mmu_notifier_range_init(range, MMU_NOTIFY_CLEAR, 0, NULL, mm,
					address & PAGE_MASK,
					(address & PAGE_MASK) + PAGE_SIZE);
		mmu_notifier_invalidate_range_start(range);
	}
	ptep = pte_offset_map_lock(mm, pmd, address, ptlp);
	if (!pte_present(*ptep))
		goto unlock;
	*ptepp = ptep;
	return 0;
unlock:
	pte_unmap_unlock(ptep, *ptlp);
	if (range)
		mmu_notifier_invalidate_range_end(range);

The notification could be meaningless if we take the "goto unlock" path.

Ideally it seems we could move the notifier code to the caller (as
most mmu notifier users do), which would also avoid the invalidation
entirely when follow_pte() returns -EINVAL.
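
Concretely, the caller-managed pattern would look roughly like this
(only a sketch, reusing the single-page range setup from the snippet
above):

	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, NULL, mm,
				address & PAGE_MASK,
				(address & PAGE_MASK) + PAGE_SIZE);
	mmu_notifier_invalidate_range_start(&range);
	if (!follow_pte(mm, address, &ptep, &ptl)) {
		/* ... update the PTE while holding ptl ... */
		pte_unmap_unlock(ptep, ptl);
	}
	mmu_notifier_invalidate_range_end(&range);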

Thanks,

-- 
Peter Xu



* Re: [PATCH 1/2] mm: provide a sane PTE walking API for modules
  2021-02-05 10:32 ` [PATCH 1/2] mm: provide a sane PTE walking API for modules Paolo Bonzini
  2021-02-05 13:49   ` Jason Gunthorpe
@ 2021-02-08 17:39   ` Christoph Hellwig
  2021-02-08 18:18     ` Paolo Bonzini
  1 sibling, 1 reply; 17+ messages in thread
From: Christoph Hellwig @ 2021-02-08 17:39 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: linux-kernel, kvm, jgg, linux-mm, Andrew Morton, dan.j.williams

> +int follow_invalidate_pte(struct mm_struct *mm, unsigned long address,
> +			  struct mmu_notifier_range *range, pte_t **ptepp, pmd_t **pmdpp,
> +			  spinlock_t **ptlp);

This adds a very pointless overly long line.

> +/**
> + * follow_pte - look up PTE at a user virtual address
> + * @mm: the mm_struct of the target address space
> + * @address: user virtual address
> + * @ptepp: location to store found PTE
> + * @ptlp: location to store the lock for the PTE
> + *
> + * On a successful return, the pointer to the PTE is stored in @ptepp;
> + * the corresponding lock is taken and its location is stored in @ptlp.
> + * The contents of the PTE are only stable until @ptlp is released;
> + * any further use, if any, must be protected against invalidation
> + * with MMU notifiers.
> + *
> + * Only IO mappings and raw PFN mappings are allowed.  The mmap semaphore
> + * should be taken for read.
> + *
> + * Return: zero on success, -ve otherwise.
> + */
> +int follow_pte(struct mm_struct *mm, unsigned long address,
> +	       pte_t **ptepp, spinlock_t **ptlp)
> +{
> +	return follow_invalidate_pte(mm, address, NULL, ptepp, NULL, ptlp);
> +}
> +EXPORT_SYMBOL_GPL(follow_pte);

I still don't think this is good as a general API.  Please document this
as KVM only for now, and hopefully next merge window I'll finish an
export variant restricting us to specific modules.
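
One way such a restricted export could look, purely as a sketch and
assuming the existing symbol-namespace machinery is acceptable for
the job, is to export into a namespace that only KVM imports, e.g. a
hypothetical MMU_INTERNAL namespace:

	/* mm/memory.c */
	EXPORT_SYMBOL_NS_GPL(follow_pte, MMU_INTERNAL);

	/* virt/kvm/kvm_main.c */
	MODULE_IMPORT_NS(MMU_INTERNAL);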


* Re: [PATCH 1/2] mm: provide a sane PTE walking API for modules
  2021-02-08 17:39   ` Christoph Hellwig
@ 2021-02-08 18:18     ` Paolo Bonzini
  2021-02-09  8:14       ` Christoph Hellwig
  0 siblings, 1 reply; 17+ messages in thread
From: Paolo Bonzini @ 2021-02-08 18:18 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-kernel, kvm, jgg, linux-mm, Andrew Morton, dan.j.williams

On 08/02/21 18:39, Christoph Hellwig wrote:
>> +int follow_pte(struct mm_struct *mm, unsigned long address,
>> +	       pte_t **ptepp, spinlock_t **ptlp)
>> +{
>> +	return follow_invalidate_pte(mm, address, NULL, ptepp, NULL, ptlp);
>> +}
>> +EXPORT_SYMBOL_GPL(follow_pte);
>
> I still don't think this is good as a general API.  Please document this
> as KVM only for now, and hopefully next merge window I'll finish an
> export variant restricting us to specific modules.

Fair enough.  I would expect that pretty much everyone using follow_pfn 
will at least want to switch to this one (as it's less bad and not 
impossible to use correctly), but I'll squash this in:

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 90b527260edf..24b292fce8e5 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1659,8 +1659,8 @@ void free_pgd_range(struct mmu_gather *tlb, unsigned long addr,
 int
 copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma);
 int follow_invalidate_pte(struct mm_struct *mm, unsigned long address,
-			  struct mmu_notifier_range *range, pte_t **ptepp, pmd_t **pmdpp,
-			  spinlock_t **ptlp);
+			  struct mmu_notifier_range *range, pte_t **ptepp,
+			  pmd_t **pmdpp, spinlock_t **ptlp);
 int follow_pte(struct mm_struct *mm, unsigned long address,
 	       pte_t **ptepp, spinlock_t **ptlp);
 int follow_pfn(struct vm_area_struct *vma, unsigned long address,
diff --git a/mm/memory.c b/mm/memory.c
index 3632f7416248..c8679b15c004 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4792,6 +4792,9 @@ int follow_invalidate_pte(struct mm_struct *mm, unsigned long address,
  * Only IO mappings and raw PFN mappings are allowed.  The mmap semaphore
  * should be taken for read.
  *
+ * KVM uses this function.  While it is arguably less bad than
+ * ``follow_pfn``, it is not a good general-purpose API.
+ *
  * Return: zero on success, -ve otherwise.
  */
 int follow_pte(struct mm_struct *mm, unsigned long address,
@@ -4809,6 +4812,9 @@ EXPORT_SYMBOL_GPL(follow_pte);
  *
  * Only IO mappings and raw PFN mappings are allowed.
  *
+ * This function does not allow the caller to read the permissions
+ * of the PTE.  Do not use it.
+ *
  * Return: zero and the pfn at @pfn on success, -ve otherwise.
  */
 int follow_pfn(struct vm_area_struct *vma, unsigned long address,


(apologies in advance if Thunderbird destroys the patch).

Paolo



* Re: [PATCH 0/2] KVM: do not assume PTE is writable after follow_pfn
  2021-02-05 18:14 ` [PATCH 0/2] " Peter Xu
@ 2021-02-08 18:51   ` Jason Gunthorpe
  2021-02-08 22:02     ` Peter Xu
  0 siblings, 1 reply; 17+ messages in thread
From: Jason Gunthorpe @ 2021-02-08 18:51 UTC (permalink / raw)
  To: Peter Xu, dan.j.williams
  Cc: Paolo Bonzini, linux-kernel, kvm, linux-mm, Andrew Morton

On Fri, Feb 05, 2021 at 01:14:11PM -0500, Peter Xu wrote:

> But I do have a question on why dax as the only user needs to pass in the
> notifier to follow_pte() for initialization.

Not sure either, why does DAX opencode something very much like
page_mkclean() with dax_entry_mkclean()?

Also it looks like DAX uses the wrong notifier, it calls
MMU_NOTIFY_CLEAR but page_mkclean_one() uses
MMU_NOTIFY_PROTECTION_PAGE for the same PTE modification sequence??

page_mkclean() has some technique to make the notifier have the right
size without becoming entangled in the PTL locks..

Jason


* Re: [PATCH 0/2] KVM: do not assume PTE is writable after follow_pfn
  2021-02-08 18:51   ` Jason Gunthorpe
@ 2021-02-08 22:02     ` Peter Xu
  2021-02-08 23:26       ` Jason Gunthorpe
  0 siblings, 1 reply; 17+ messages in thread
From: Peter Xu @ 2021-02-08 22:02 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: dan.j.williams, Paolo Bonzini, linux-kernel, kvm, linux-mm,
	Andrew Morton

On Mon, Feb 08, 2021 at 02:51:33PM -0400, Jason Gunthorpe wrote:
> On Fri, Feb 05, 2021 at 01:14:11PM -0500, Peter Xu wrote:
> 
> > But I do have a question on why dax as the only user needs to pass in the
> > notifier to follow_pte() for initialization.
> 
> Not sure either, why does DAX opencode something very much like
> page_mkclean() with dax_entry_mkclean()?
> 
> Also it looks like DAX uses the wrong notifier, it calls
> MMU_NOTIFY_CLEAR but page_mkclean_one() uses
> MMU_NOTIFY_PROTECTION_PAGE for the same PTE modification sequence??
> 
> page_mkclean() has some technique to make the notifier have the right
> size without becoming entangled in the PTL locks..

Right.  I guess it's because dax doesn't have "struct page*" on the back, so it
can't directly use page_mkclean().  However the whole logic does look very
similar, so maybe they can be merged in some way.

Thanks,

-- 
Peter Xu



* Re: [PATCH 0/2] KVM: do not assume PTE is writable after follow_pfn
  2021-02-08 22:02     ` Peter Xu
@ 2021-02-08 23:26       ` Jason Gunthorpe
  2021-02-09  0:23         ` Peter Xu
  2021-02-09  8:19         ` Christoph Hellwig
  0 siblings, 2 replies; 17+ messages in thread
From: Jason Gunthorpe @ 2021-02-08 23:26 UTC (permalink / raw)
  To: Peter Xu
  Cc: dan.j.williams, Paolo Bonzini, linux-kernel, kvm, linux-mm,
	Andrew Morton

On Mon, Feb 08, 2021 at 05:02:59PM -0500, Peter Xu wrote:
> On Mon, Feb 08, 2021 at 02:51:33PM -0400, Jason Gunthorpe wrote:
> > On Fri, Feb 05, 2021 at 01:14:11PM -0500, Peter Xu wrote:
> > 
> > > But I do have a question on why dax as the only user needs to pass in the
> > > notifier to follow_pte() for initialization.
> > 
> > Not sure either, why does DAX opencode something very much like
> > page_mkclean() with dax_entry_mkclean()?
> > 
> > Also it looks like DAX uses the wrong notifier, it calls
> > MMU_NOTIFY_CLEAR but page_mkclean_one() uses
> > MMU_NOTIFY_PROTECTION_PAGE for the same PTE modification sequence??
> > 
> > page_mkclean() has some technique to make the notifier have the right
> > size without becoming entangled in the PTL locks..
> 
> Right.  I guess it's because dax doesn't have "struct page*" on the
> back, so it

It doesn't? I thought DAX cases did?

Jason


* Re: [PATCH 0/2] KVM: do not assume PTE is writable after follow_pfn
  2021-02-08 23:26       ` Jason Gunthorpe
@ 2021-02-09  0:23         ` Peter Xu
  2021-02-09  8:19         ` Christoph Hellwig
  1 sibling, 0 replies; 17+ messages in thread
From: Peter Xu @ 2021-02-09  0:23 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: dan.j.williams, Paolo Bonzini, linux-kernel, kvm, linux-mm,
	Andrew Morton

On Mon, Feb 08, 2021 at 07:26:25PM -0400, Jason Gunthorpe wrote:
> On Mon, Feb 08, 2021 at 05:02:59PM -0500, Peter Xu wrote:
> > On Mon, Feb 08, 2021 at 02:51:33PM -0400, Jason Gunthorpe wrote:
> > > On Fri, Feb 05, 2021 at 01:14:11PM -0500, Peter Xu wrote:
> > > 
> > > > But I do have a question on why dax as the only user needs to pass in the
> > > > notifier to follow_pte() for initialization.
> > > 
> > > Not sure either, why does DAX opencode something very much like
> > > page_mkclean() with dax_entry_mkclean()?
> > > 
> > > Also it looks like DAX uses the wrong notifier, it calls
> > > MMU_NOTIFY_CLEAR but page_mkclean_one() uses
> > > MMU_NOTIFY_PROTECTION_PAGE for the same PTE modification sequence??
> > > 
> > > page_mkclean() has some technique to make the notifier have the right
> > > size without becoming entangled in the PTL locks..
> > 
> > Right.  I guess it's because dax doesn't have "struct page*" on the
> > back, so it
> 
> It doesn't? I thought DAX cases did?

I'm not familiar with dax at all... but it seems so: e.g.,
dax_iomap_pte_fault() looks like the general fault handler for dax
mappings, in which there are calls to things like
vmf_insert_mixed_mkwrite() trying to install PTEs with a raw pfn.
Or I could also be missing something very important...  Thanks,

-- 
Peter Xu



* Re: [PATCH 1/2] mm: provide a sane PTE walking API for modules
  2021-02-08 18:18     ` Paolo Bonzini
@ 2021-02-09  8:14       ` Christoph Hellwig
  2021-02-09  9:19         ` Paolo Bonzini
  0 siblings, 1 reply; 17+ messages in thread
From: Christoph Hellwig @ 2021-02-09  8:14 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Christoph Hellwig, linux-kernel, kvm, jgg, linux-mm,
	Andrew Morton, dan.j.williams, Daniel Vetter

On Mon, Feb 08, 2021 at 07:18:56PM +0100, Paolo Bonzini wrote:
> Fair enough.  I would expect that pretty much everyone using follow_pfn will
> at least want to switch to this one (as it's less bad and not impossible to
> use correctly), but I'll squash this in:


Daniel looked into them, so he may correct me, but the other follow_pfn
users and their destiny are:

 - SGX, which is not modular; I think I just saw a patch to kill its
   follow_pfn uses
 - v4l videobuf and frame vector: I think those are going away
   entirely as they implement a rather broken pre-dmabuf P2P scheme
 - vfio: should use MMU notifiers eventually

Daniel, what happened to your follow_pfn series?


* Re: [PATCH 0/2] KVM: do not assume PTE is writable after follow_pfn
  2021-02-08 23:26       ` Jason Gunthorpe
  2021-02-09  0:23         ` Peter Xu
@ 2021-02-09  8:19         ` Christoph Hellwig
  2021-02-09 10:02           ` Joao Martins
  1 sibling, 1 reply; 17+ messages in thread
From: Christoph Hellwig @ 2021-02-09  8:19 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Peter Xu, dan.j.williams, Paolo Bonzini, linux-kernel, kvm,
	linux-mm, Andrew Morton

On Mon, Feb 08, 2021 at 07:26:25PM -0400, Jason Gunthorpe wrote:
> > > page_mkclean() has some technique to make the notifier have the right
> > > size without becoming entangled in the PTL locks..
> > 
> > Right.  I guess it's because dax doesn't have "struct page*" on the
> > back, so it
> 
> It doesn't? I thought DAX cases did?

File system DAX has a struct page, device DAX does not.  Which means
everything using iomap should have a page available, but I'm adding
Dan as he should know the details :)


* Re: [PATCH 1/2] mm: provide a sane PTE walking API for modules
  2021-02-09  8:14       ` Christoph Hellwig
@ 2021-02-09  9:19         ` Paolo Bonzini
  0 siblings, 0 replies; 17+ messages in thread
From: Paolo Bonzini @ 2021-02-09  9:19 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-kernel, kvm, jgg, linux-mm, Andrew Morton, dan.j.williams,
	Daniel Vetter

On 09/02/21 09:14, Christoph Hellwig wrote:
> On Mon, Feb 08, 2021 at 07:18:56PM +0100, Paolo Bonzini wrote:
>> Fair enough.  I would expect that pretty much everyone using follow_pfn will
>> at least want to switch to this one (as it's less bad and not impossible to
>> use correctly), but I'll squash this in:
> 
> 
> Daniel looked into them, so he may correct me, but the other follow_pfn
> users and their destiny are:
> 
>   - SGX, which is not modular and I think I just saw a patch to kill them
>   - v4l videobuf and frame vector: I think those are going away
>     entirely as they implement a rather broken pre-dmabuf P2P scheme
>   - vfio: should use MMU notifiers eventually

Yes, I'm thinking mostly of vfio, which could use follow_pte as a 
short-term fix for just the missing permission check.

There's also s390 PCI, which is also not modular.

Paolo

> Daniel, what happened to your follow_pfn series?




* Re: [PATCH 0/2] KVM: do not assume PTE is writable after follow_pfn
  2021-02-09  8:19         ` Christoph Hellwig
@ 2021-02-09 10:02           ` Joao Martins
  0 siblings, 0 replies; 17+ messages in thread
From: Joao Martins @ 2021-02-09 10:02 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Peter Xu, dan.j.williams, Paolo Bonzini, linux-kernel, kvm,
	linux-mm, Andrew Morton, Jason Gunthorpe

On 2/9/21 8:19 AM, Christoph Hellwig wrote:
> On Mon, Feb 08, 2021 at 07:26:25PM -0400, Jason Gunthorpe wrote:
>>>> page_mkclean() has some technique to make the notifier have the right
>>>> size without becoming entangled in the PTL locks..
>>>
>>> Right.  I guess it's because dax doesn't have "struct page*" on the
>>> back, so it
>>
>> It doesn't? I thought DAX cases did?
> 
> File system DAX has a struct page, device DAX does not.  Which means
> everything using iomap should have a page available, but I'm adding
> Dan as he should know the details :)
> 
Both fsdax and device-dax have struct page. It's the pmem block device
that doesn't.

