* [PATCH 0/2] KVM: do not assume PTE is writable after follow_pfn

Thread overview (17+ messages):
  2021-02-05 10:32 [PATCH 0/2] KVM: do not assume PTE is writable after follow_pfn  Paolo Bonzini
  2021-02-05 10:32 ` [PATCH 1/2] mm: provide a sane PTE walking API for modules  Paolo Bonzini
                     ` (2 more replies)

From: Paolo Bonzini @ 2021-02-05 10:32 UTC
To: linux-kernel, kvm
Cc: jgg, linux-mm, Andrew Morton, dan.j.williams

This series is the first step towards fixing KVM's usage of follow_pfn.
The immediate fix here is that KVM is not checking the writability of
the PFN, which actually dates back to way before the introduction of
follow_pfn in commit add6a0cd1c5b ("KVM: MMU: try to fix up page faults
before giving up", 2016-07-05).  There are more changes needed to
invalidate gfn-to-pfn caches from MMU notifiers, but that issue will be
tackled later.

A more fundamental issue, however, is that the follow_pfn function is
basically impossible to use correctly.  Almost all users, for example,
are assuming that the page is writable; KVM was not alone in this
mistake.  follow_pte, despite not being exported for modules, is a far
saner API.  Therefore, patch 1 simplifies follow_pte a bit and makes it
available to modules.

Please review and possibly ack for inclusion in the KVM tree, thanks!

Paolo

Paolo Bonzini (2):
  mm: provide a sane PTE walking API for modules
  KVM: do not assume PTE is writable after follow_pfn

 arch/s390/pci/pci_mmio.c |  2 +-
 fs/dax.c                 |  5 +++--
 include/linux/mm.h       |  6 ++++--
 mm/memory.c              | 35 ++++++++++++++++++++++++++++++-----
 virt/kvm/kvm_main.c      | 15 ++++++++++++---
 5 files changed, 50 insertions(+), 13 deletions(-)

-- 
2.26.2
* [PATCH 1/2] mm: provide a sane PTE walking API for modules

From: Paolo Bonzini @ 2021-02-05 10:32 UTC
To: linux-kernel, kvm
Cc: jgg, linux-mm, Andrew Morton, dan.j.williams

Currently, the follow_pfn function is exported for modules but
follow_pte is not.  However, follow_pfn is very easy to misuse, because
it does not provide protections (so most of its callers assume the page
is writable!) and because it returns after having already unlocked the
page table lock.

Provide instead a simplified version of follow_pte that does not have
the pmdpp and range arguments.  The older version survives as
follow_invalidate_pte() for use by fs/dax.c.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 arch/s390/pci/pci_mmio.c |  2 +-
 fs/dax.c                 |  5 +++--
 include/linux/mm.h       |  6 ++++--
 mm/memory.c              | 35 ++++++++++++++++++++++++++++++-----
 4 files changed, 38 insertions(+), 10 deletions(-)

diff --git a/arch/s390/pci/pci_mmio.c b/arch/s390/pci/pci_mmio.c
index 18f2d10c3176..5922dce94bfd 100644
--- a/arch/s390/pci/pci_mmio.c
+++ b/arch/s390/pci/pci_mmio.c
@@ -170,7 +170,7 @@ SYSCALL_DEFINE3(s390_pci_mmio_write, unsigned long, mmio_addr,
 	if (!(vma->vm_flags & VM_WRITE))
 		goto out_unlock_mmap;
 
-	ret = follow_pte(vma->vm_mm, mmio_addr, NULL, &ptep, NULL, &ptl);
+	ret = follow_pte(vma->vm_mm, mmio_addr, &ptep, &ptl);
 	if (ret)
 		goto out_unlock_mmap;
 
@@ -311,7 +311,7 @@ SYSCALL_DEFINE3(s390_pci_mmio_read, unsigned long, mmio_addr,
 	if (!(vma->vm_flags & VM_WRITE))
 		goto out_unlock_mmap;
 
-	ret = follow_pte(vma->vm_mm, mmio_addr, NULL, &ptep, NULL, &ptl);
+	ret = follow_pte(vma->vm_mm, mmio_addr, &ptep, &ptl);
 	if (ret)
 		goto out_unlock_mmap;
 
diff --git a/fs/dax.c b/fs/dax.c
index 26d5dcd2d69e..b14874c7b983 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -810,11 +810,12 @@ static void dax_entry_mkclean(struct address_space *mapping, pgoff_t index,
 		address = pgoff_address(index, vma);
 
 		/*
-		 * Note because we provide range to follow_pte it will call
+		 * follow_invalidate_pte() will use the range to call
 		 * mmu_notifier_invalidate_range_start() on our behalf before
 		 * taking any lock.
 		 */
-		if (follow_pte(vma->vm_mm, address, &range, &ptep, &pmdp, &ptl))
+		if (follow_invalidate_pte(vma->vm_mm, address, &range, &ptep,
+					  &pmdp, &ptl))
 			continue;
 
 		/*
diff --git a/include/linux/mm.h b/include/linux/mm.h
index ecdf8a8cd6ae..90b527260edf 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1658,9 +1658,11 @@ void free_pgd_range(struct mmu_gather *tlb, unsigned long addr,
 		unsigned long end, unsigned long floor, unsigned long ceiling);
 int
 copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma);
+int follow_invalidate_pte(struct mm_struct *mm, unsigned long address,
+			  struct mmu_notifier_range *range, pte_t **ptepp, pmd_t **pmdpp,
+			  spinlock_t **ptlp);
 int follow_pte(struct mm_struct *mm, unsigned long address,
-	       struct mmu_notifier_range *range, pte_t **ptepp, pmd_t **pmdpp,
-	       spinlock_t **ptlp);
+	       pte_t **ptepp, spinlock_t **ptlp);
 int follow_pfn(struct vm_area_struct *vma, unsigned long address,
 	unsigned long *pfn);
 int follow_phys(struct vm_area_struct *vma, unsigned long address,
diff --git a/mm/memory.c b/mm/memory.c
index feff48e1465a..b247d296931c 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4709,9 +4709,9 @@ int __pmd_alloc(struct mm_struct *mm, pud_t *pud, unsigned long address)
 }
 #endif /* __PAGETABLE_PMD_FOLDED */
 
-int follow_pte(struct mm_struct *mm, unsigned long address,
-	       struct mmu_notifier_range *range, pte_t **ptepp, pmd_t **pmdpp,
-	       spinlock_t **ptlp)
+int follow_invalidate_pte(struct mm_struct *mm, unsigned long address,
+			  struct mmu_notifier_range *range, pte_t **ptepp,
+			  pmd_t **pmdpp, spinlock_t **ptlp)
 {
 	pgd_t *pgd;
 	p4d_t *p4d;
@@ -4776,6 +4776,31 @@ int follow_pte(struct mm_struct *mm, unsigned long address,
 	return -EINVAL;
 }
 
+/**
+ * follow_pte - look up PTE at a user virtual address
+ * @vma: memory mapping
+ * @address: user virtual address
+ * @ptepp: location to store found PTE
+ * @ptlp: location to store the lock for the PTE
+ *
+ * On a successful return, the pointer to the PTE is stored in @ptepp;
+ * the corresponding lock is taken and its location is stored in @ptlp.
+ * The contents of the PTE are only stable until @ptlp is released;
+ * any further use, if any, must be protected against invalidation
+ * with MMU notifiers.
+ *
+ * Only IO mappings and raw PFN mappings are allowed.  The mmap semaphore
+ * should be taken for read.
+ *
+ * Return: zero on success, -ve otherwise.
+ */
+int follow_pte(struct mm_struct *mm, unsigned long address,
+	       pte_t **ptepp, spinlock_t **ptlp)
+{
+	return follow_invalidate_pte(mm, address, NULL, ptepp, NULL, ptlp);
+}
+EXPORT_SYMBOL_GPL(follow_pte);
+
 /**
  * follow_pfn - look up PFN at a user virtual address
  * @vma: memory mapping
@@ -4796,7 +4821,7 @@ int follow_pfn(struct vm_area_struct *vma, unsigned long address,
 	if (!(vma->vm_flags & (VM_IO | VM_PFNMAP)))
 		return ret;
 
-	ret = follow_pte(vma->vm_mm, address, NULL, &ptep, NULL, &ptl);
+	ret = follow_pte(vma->vm_mm, address, &ptep, &ptl);
 	if (ret)
 		return ret;
 	*pfn = pte_pfn(*ptep);
@@ -4817,7 +4842,7 @@ int follow_phys(struct vm_area_struct *vma,
 	if (!(vma->vm_flags & (VM_IO | VM_PFNMAP)))
 		goto out;
 
-	if (follow_pte(vma->vm_mm, address, NULL, &ptep, NULL, &ptl))
+	if (follow_pte(vma->vm_mm, address, &ptep, &ptl))
 		goto out;
 	pte = *ptep;
-- 
2.26.2
* Re: [PATCH 1/2] mm: provide a sane PTE walking API for modules

From: Jason Gunthorpe @ 2021-02-05 13:49 UTC
To: Paolo Bonzini
Cc: linux-kernel, kvm, linux-mm, Andrew Morton, dan.j.williams

On Fri, Feb 05, 2021 at 05:32:58AM -0500, Paolo Bonzini wrote:
> Currently, the follow_pfn function is exported for modules but
> follow_pte is not.  However, follow_pfn is very easy to misuse,
> because it does not provide protections (so most of its callers
> assume the page is writable!) and because it returns after having
> already unlocked the page table lock.
> 
> Provide instead a simplified version of follow_pte that does
> not have the pmdpp and range arguments.  The older version
> survives as follow_invalidate_pte() for use by fs/dax.c.
> 
> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> ---
>  arch/s390/pci/pci_mmio.c |  2 +-
>  fs/dax.c                 |  5 +++--
>  include/linux/mm.h       |  6 ++++--
>  mm/memory.c              | 35 ++++++++++++++++++++++++++++++-----
>  4 files changed, 38 insertions(+), 10 deletions(-)

Looks good to me, thanks

Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>

Jason
* Re: [PATCH 1/2] mm: provide a sane PTE walking API for modules

From: Christoph Hellwig @ 2021-02-08 17:39 UTC
To: Paolo Bonzini
Cc: linux-kernel, kvm, jgg, linux-mm, Andrew Morton, dan.j.williams

> +int follow_invalidate_pte(struct mm_struct *mm, unsigned long address,
> +		struct mmu_notifier_range *range, pte_t **ptepp, pmd_t **pmdpp,
> +		spinlock_t **ptlp);

This adds a very pointless overly long line.

> +/**
> + * follow_pte - look up PTE at a user virtual address
> + * @vma: memory mapping
> + * @address: user virtual address
> + * @ptepp: location to store found PTE
> + * @ptlp: location to store the lock for the PTE
> + *
> + * On a successful return, the pointer to the PTE is stored in @ptepp;
> + * the corresponding lock is taken and its location is stored in @ptlp.
> + * The contents of the PTE are only stable until @ptlp is released;
> + * any further use, if any, must be protected against invalidation
> + * with MMU notifiers.
> + *
> + * Only IO mappings and raw PFN mappings are allowed.  The mmap semaphore
> + * should be taken for read.
> + *
> + * Return: zero on success, -ve otherwise.
> + */
> +int follow_pte(struct mm_struct *mm, unsigned long address,
> +	       pte_t **ptepp, spinlock_t **ptlp)
> +{
> +	return follow_invalidate_pte(mm, address, NULL, ptepp, NULL, ptlp);
> +}
> +EXPORT_SYMBOL_GPL(follow_pte);

I still don't think this is good as a general API.  Please document this
as KVM only for now, and hopefully next merge window I'll finish an
export variant restricting us to specific modules.
* Re: [PATCH 1/2] mm: provide a sane PTE walking API for modules

From: Paolo Bonzini @ 2021-02-08 18:18 UTC
To: Christoph Hellwig
Cc: linux-kernel, kvm, jgg, linux-mm, Andrew Morton, dan.j.williams

On 08/02/21 18:39, Christoph Hellwig wrote:
>> +int follow_pte(struct mm_struct *mm, unsigned long address,
>> +	       pte_t **ptepp, spinlock_t **ptlp)
>> +{
>> +	return follow_invalidate_pte(mm, address, NULL, ptepp, NULL, ptlp);
>> +}
>> +EXPORT_SYMBOL_GPL(follow_pte);
> 
> I still don't think this is good as a general API.  Please document this
> as KVM only for now, and hopefully next merge window I'll finish an
> export variant restricting us to specific modules.

Fair enough.  I would expect that pretty much everyone using follow_pfn
will at least want to switch to this one (as it's less bad and not
impossible to use correctly), but I'll squash this in:

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 90b527260edf..24b292fce8e5 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1659,8 +1659,8 @@ void free_pgd_range(struct mmu_gather *tlb, unsigned long addr,
 int
 copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma);
 int follow_invalidate_pte(struct mm_struct *mm, unsigned long address,
-			  struct mmu_notifier_range *range, pte_t **ptepp, pmd_t **pmdpp,
-			  spinlock_t **ptlp);
+			  struct mmu_notifier_range *range, pte_t **ptepp,
+			  pmd_t **pmdpp, spinlock_t **ptlp);
 int follow_pte(struct mm_struct *mm, unsigned long address,
 	       pte_t **ptepp, spinlock_t **ptlp);
 int follow_pfn(struct vm_area_struct *vma, unsigned long address,
diff --git a/mm/memory.c b/mm/memory.c
index 3632f7416248..c8679b15c004 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4792,6 +4792,9 @@ int follow_invalidate_pte(struct mm_struct *mm, unsigned long address,
  * Only IO mappings and raw PFN mappings are allowed.  The mmap semaphore
  * should be taken for read.
  *
+ * KVM uses this function.  While it is arguably less bad than
+ * ``follow_pfn``, it is not a good general-purpose API.
+ *
  * Return: zero on success, -ve otherwise.
  */
 int follow_pte(struct mm_struct *mm, unsigned long address,
@@ -4809,6 +4812,9 @@ EXPORT_SYMBOL_GPL(follow_pte);
  *
  * Only IO mappings and raw PFN mappings are allowed.
  *
+ * This function does not allow the caller to read the permissions
+ * of the PTE.  Do not use it.
+ *
  * Return: zero and the pfn at @pfn on success, -ve otherwise.
  */
 int follow_pfn(struct vm_area_struct *vma, unsigned long address,

(apologies in advance if Thunderbird destroys the patch).

Paolo
* Re: [PATCH 1/2] mm: provide a sane PTE walking API for modules

From: Christoph Hellwig @ 2021-02-09 8:14 UTC
To: Paolo Bonzini
Cc: Christoph Hellwig, linux-kernel, kvm, jgg, linux-mm, Andrew Morton,
    dan.j.williams, Daniel Vetter

On Mon, Feb 08, 2021 at 07:18:56PM +0100, Paolo Bonzini wrote:
> Fair enough.  I would expect that pretty much everyone using follow_pfn
> will at least want to switch to this one (as it's less bad and not
> impossible to use correctly), but I'll squash this in:

Daniel looked into them, so he may correct me, but the other follow_pfn
users and their destiny are:

 - SGX, which is not modular, and I think I just saw a patch to kill them
 - v4l videobuf and frame vector: I think those are going away
   entirely as they implement a rather broken pre-dmabuf P2P scheme
 - vfio: should use MMU notifiers eventually

Daniel, what happened to your follow_pfn series?
* Re: [PATCH 1/2] mm: provide a sane PTE walking API for modules

From: Paolo Bonzini @ 2021-02-09 9:19 UTC
To: Christoph Hellwig
Cc: linux-kernel, kvm, jgg, linux-mm, Andrew Morton, dan.j.williams,
    Daniel Vetter

On 09/02/21 09:14, Christoph Hellwig wrote:
> On Mon, Feb 08, 2021 at 07:18:56PM +0100, Paolo Bonzini wrote:
>> Fair enough.  I would expect that pretty much everyone using follow_pfn
>> will at least want to switch to this one (as it's less bad and not
>> impossible to use correctly), but I'll squash this in:
> 
> Daniel looked into them, so he may correct me, but the other follow_pfn
> users and their destiny are:
> 
>  - SGX, which is not modular and I think I just saw a patch to kill them
>  - v4l videobuf and frame vector: I think those are going away
>    entirely as they implement a rather broken pre-dmabuf P2P scheme
>  - vfio: should use MMU notifiers eventually

Yes, I'm thinking mostly of vfio, which could use follow_pte as a
short-term fix for just the missing permission check.  There's also
s390 PCI, which is likewise not modular.

Paolo

> Daniel, what happened to your follow_pfn series?
* [PATCH 2/2] KVM: do not assume PTE is writable after follow_pfn

From: Paolo Bonzini @ 2021-02-05 10:32 UTC
To: linux-kernel, kvm
Cc: jgg, linux-mm, Andrew Morton, dan.j.williams

In order to convert an HVA to a PFN, KVM usually tries to use the
get_user_pages family of functions.  This however is not possible for
VM_IO vmas; in that case, KVM instead uses follow_pfn.

In doing this, however, KVM loses the information on whether the PFN is
writable.  That is usually not a problem because the main use of VM_IO
vmas with KVM is for BARs in PCI device assignment; however, it is a
bug nonetheless.

To fix it, use follow_pte and check pte_write while under the
protection of the PTE lock.  The information can be used to fail
hva_to_pfn_remapped or passed back to the caller via *writable.

Usage of follow_pfn was introduced in commit add6a0cd1c5b ("KVM: MMU:
try to fix up page faults before giving up", 2016-07-05); however, even
older versions have the same issue, all the way back to commit
2e2e3738af33 ("KVM: Handle vma regions with no backing page",
2008-07-20), as they also did not check whether the PFN was writable.

Fixes: 2e2e3738af33 ("KVM: Handle vma regions with no backing page")
Reported-by: David Stevens <stevensd@google.com>
Cc: 3pvd@google.com
Cc: Jann Horn <jannh@google.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 virt/kvm/kvm_main.c | 15 ++++++++++++---
 1 file changed, 12 insertions(+), 3 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index f34a85b93c2d..ee4ac2618ec5 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1907,9 +1907,11 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma,
 			       kvm_pfn_t *p_pfn)
 {
 	unsigned long pfn;
+	pte_t *ptep;
+	spinlock_t *ptl;
 	int r;
 
-	r = follow_pfn(vma, addr, &pfn);
+	r = follow_pte(vma->vm_mm, addr, &ptep, &ptl);
 	if (r) {
 		/*
 		 * get_user_pages fails for VM_IO and VM_PFNMAP vmas and does
@@ -1924,14 +1926,19 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma,
 		if (r)
 			return r;
 
-		r = follow_pfn(vma, addr, &pfn);
+		r = follow_pte(vma->vm_mm, addr, &ptep, &ptl);
 		if (r)
 			return r;
+	}
 
+	if (write_fault && !pte_write(*ptep)) {
+		pfn = KVM_PFN_ERR_RO_FAULT;
+		goto out;
 	}
 
 	if (writable)
-		*writable = true;
+		*writable = pte_write(*ptep);
+	pfn = pte_pfn(*ptep);
 
 	/*
 	 * Get a reference here because callers of *hva_to_pfn* and
@@ -1946,6 +1953,8 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma,
 	 */
 	kvm_get_pfn(pfn);
 
+out:
+	pte_unmap_unlock(ptep, ptl);
 	*p_pfn = pfn;
 	return 0;
 }
-- 
2.26.2
* Re: [PATCH 2/2] KVM: do not assume PTE is writable after follow_pfn

From: kernel test robot @ 2021-02-05 15:43 UTC
To: Paolo Bonzini, linux-kernel, kvm
Cc: kbuild-all, jgg, linux-mm, Andrew Morton, dan.j.williams

Hi Paolo,

I love your patch!  Perhaps something to improve:

[auto build test WARNING on linux/master]
[also build test WARNING on linus/master v5.11-rc6]
[cannot apply to kvm/linux-next hnaz-linux-mm/master]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/Paolo-Bonzini/KVM-do-not-assume-PTE-is-writable-after-follow_pfn/20210205-183434
base:   https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git 2ab38c17aac10bf55ab3efde4c4db3893d8691d2
config: i386-randconfig-c004-20210205 (attached as .config)
compiler: gcc-9 (Debian 9.3.0-15) 9.3.0
reproduce (this is a W=1 build):
        # https://github.com/0day-ci/linux/commit/c2a981899d009fddc691f2edf5ad54e9e5a57401
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review Paolo-Bonzini/KVM-do-not-assume-PTE-is-writable-after-follow_pfn/20210205-183434
        git checkout c2a981899d009fddc691f2edf5ad54e9e5a57401
        # save the attached .config to linux build tree
        make W=1 ARCH=i386

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All warnings (new ones prefixed by >>):

   In file included from arch/x86/kvm/../../../virt/kvm/kvm_main.c:18:
   arch/x86/kvm/../../../virt/kvm/kvm_main.c: In function 'hva_to_pfn_remapped':
>> include/linux/kvm_host.h:89:30: warning: conversion from 'long long unsigned int' to 'long unsigned int' changes value from '9218868437227405314' to '2' [-Woverflow]
      89 | #define KVM_PFN_ERR_RO_FAULT	(KVM_PFN_ERR_MASK + 2)
         |                              ^
   arch/x86/kvm/../../../virt/kvm/kvm_main.c:1932:9: note: in expansion of macro 'KVM_PFN_ERR_RO_FAULT'
    1932 |   pfn = KVM_PFN_ERR_RO_FAULT;
         |         ^~~~~~~~~~~~~~~~~~~~

vim +89 include/linux/kvm_host.h

9c5b11728344e1 Xiao Guangrong 2012-08-03  86  
9c5b11728344e1 Xiao Guangrong 2012-08-03  87  #define KVM_PFN_ERR_FAULT	(KVM_PFN_ERR_MASK)
9c5b11728344e1 Xiao Guangrong 2012-08-03  88  #define KVM_PFN_ERR_HWPOISON	(KVM_PFN_ERR_MASK + 1)
81c52c56e2b435 Xiao Guangrong 2012-10-16 @89  #define KVM_PFN_ERR_RO_FAULT	(KVM_PFN_ERR_MASK + 2)
6c8ee57be9350c Xiao Guangrong 2012-08-03  90  

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org

[-- Attachment #2: .config.gz -- Type: application/gzip, Size: 32689 bytes --]
* Re: [PATCH 2/2] KVM: do not assume PTE is writable after follow_pfn

From: kernel test robot @ 2021-02-05 17:41 UTC
To: Paolo Bonzini, linux-kernel, kvm
Cc: kbuild-all, jgg, linux-mm, Andrew Morton, dan.j.williams

Hi Paolo,

I love your patch!  Yet something to improve:

[auto build test ERROR on linux/master]
[also build test ERROR on linus/master v5.11-rc6 next-20210125]
[cannot apply to kvm/linux-next hnaz-linux-mm/master]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/Paolo-Bonzini/KVM-do-not-assume-PTE-is-writable-after-follow_pfn/20210205-183434
base:   https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git 2ab38c17aac10bf55ab3efde4c4db3893d8691d2
config: i386-randconfig-a011-20210205 (attached as .config)
compiler: gcc-9 (Debian 9.3.0-15) 9.3.0
reproduce (this is a W=1 build):
        # https://github.com/0day-ci/linux/commit/c2a981899d009fddc691f2edf5ad54e9e5a57401
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review Paolo-Bonzini/KVM-do-not-assume-PTE-is-writable-after-follow_pfn/20210205-183434
        git checkout c2a981899d009fddc691f2edf5ad54e9e5a57401
        # save the attached .config to linux build tree
        make W=1 ARCH=i386

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

   In file included from arch/x86/kvm/../../../virt/kvm/kvm_main.c:18:
   arch/x86/kvm/../../../virt/kvm/kvm_main.c: In function 'hva_to_pfn_remapped':
>> include/linux/kvm_host.h:89:30: error: conversion from 'long long unsigned int' to 'long unsigned int' changes value from '9218868437227405314' to '2' [-Werror=overflow]
      89 | #define KVM_PFN_ERR_RO_FAULT	(KVM_PFN_ERR_MASK + 2)
         |                              ^
   arch/x86/kvm/../../../virt/kvm/kvm_main.c:1932:9: note: in expansion of macro 'KVM_PFN_ERR_RO_FAULT'
    1932 |   pfn = KVM_PFN_ERR_RO_FAULT;
         |         ^~~~~~~~~~~~~~~~~~~~
   cc1: all warnings being treated as errors

vim +89 include/linux/kvm_host.h

9c5b11728344e1 Xiao Guangrong 2012-08-03  86  
9c5b11728344e1 Xiao Guangrong 2012-08-03  87  #define KVM_PFN_ERR_FAULT	(KVM_PFN_ERR_MASK)
9c5b11728344e1 Xiao Guangrong 2012-08-03  88  #define KVM_PFN_ERR_HWPOISON	(KVM_PFN_ERR_MASK + 1)
81c52c56e2b435 Xiao Guangrong 2012-10-16 @89  #define KVM_PFN_ERR_RO_FAULT	(KVM_PFN_ERR_MASK + 2)
6c8ee57be9350c Xiao Guangrong 2012-08-03  90  

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org

[-- Attachment #2: .config.gz -- Type: application/gzip, Size: 40584 bytes --]
* Re: [PATCH 0/2] KVM: do not assume PTE is writable after follow_pfn

From: Peter Xu @ 2021-02-05 18:14 UTC
To: Paolo Bonzini
Cc: linux-kernel, kvm, jgg, linux-mm, Andrew Morton, dan.j.williams

On Fri, Feb 05, 2021 at 05:32:57AM -0500, Paolo Bonzini wrote:
> This series is the first step towards fixing KVM's usage of follow_pfn.
> The immediate fix here is that KVM is not checking the writability of
> the PFN, which actually dates back to way before the introduction of
> follow_pfn in commit add6a0cd1c5b ("KVM: MMU: try to fix up page faults
> before giving up", 2016-07-05).  There are more changes needed to
> invalidate gfn-to-pfn caches from MMU notifiers, but this issue will
> be tackled later.
> 
> A more fundamental issue however is that the follow_pfn function is
> basically impossible to use correctly.  Almost all users for example
> are assuming that the page is writable; KVM was not alone in this
> mistake.  follow_pte, despite not being exported for modules, is a
> far saner API.  Therefore, patch 1 simplifies follow_pte a bit and
> makes it available to modules.
> 
> Please review and possibly ack for inclusion in the KVM tree,
> thanks!

FWIW, the patches look correct to me (with the patch 2 robot report
fixed):

Reviewed-by: Peter Xu <peterx@redhat.com>

But I do have a question on why dax, as the only user, needs to pass
the notifier into follow_pte() for initialization.  Indeed there is a
difference in how the notifier's start/end are initialized depending on
whether it is a huge pmd, but since the pmdp is passed back too, I
assume the caller should know how to initialize the notifier anyway.
The thing is, at least in the current code we can send meaningless
notifies, e.g. in follow_pte():

	if (range) {
		mmu_notifier_range_init(range, MMU_NOTIFY_CLEAR, 0,
					NULL, mm, address & PAGE_MASK,
					(address & PAGE_MASK) + PAGE_SIZE);
		mmu_notifier_invalidate_range_start(range);
	}
	ptep = pte_offset_map_lock(mm, pmd, address, ptlp);
	if (!pte_present(*ptep))
		goto unlock;
	*ptepp = ptep;
	return 0;
unlock:
	pte_unmap_unlock(ptep, *ptlp);
	if (range)
		mmu_notifier_invalidate_range_end(range);

The notify could be meaningless if we take the "goto unlock" path.
Ideally it seems we could move the notifier code to the caller (as most
mmu notifier users do), and we could also avoid doing it when
follow_pte() returns -EINVAL.

Thanks,

-- 
Peter Xu
* Re: [PATCH 0/2] KVM: do not assume PTE is writable after follow_pfn

From: Jason Gunthorpe @ 2021-02-08 18:51 UTC
To: Peter Xu, dan.j.williams
Cc: Paolo Bonzini, linux-kernel, kvm, linux-mm, Andrew Morton

On Fri, Feb 05, 2021 at 01:14:11PM -0500, Peter Xu wrote:
> But I do have a question on why dax as the only user needs to pass in
> the notifier to follow_pte() for initialization.

Not sure either, why does DAX opencode something very much like
page_mkclean() with dax_entry_mkclean()?

Also it looks like DAX uses the wrong notifier, it calls
MMU_NOTIFY_CLEAR but page_mkclean_one() uses
MMU_NOTIFY_PROTECTION_PAGE for the same PTE modification sequence??

page_mkclean() has some technique to make the notifier have the right
size without becoming entangled in the PTL locks..

Jason
* Re: [PATCH 0/2] KVM: do not assume PTE is writable after follow_pfn

From: Peter Xu @ 2021-02-08 22:02 UTC
To: Jason Gunthorpe
Cc: dan.j.williams, Paolo Bonzini, linux-kernel, kvm, linux-mm,
    Andrew Morton

On Mon, Feb 08, 2021 at 02:51:33PM -0400, Jason Gunthorpe wrote:
> On Fri, Feb 05, 2021 at 01:14:11PM -0500, Peter Xu wrote:
> 
>> But I do have a question on why dax as the only user needs to pass in
>> the notifier to follow_pte() for initialization.
> 
> Not sure either, why does DAX opencode something very much like
> page_mkclean() with dax_entry_mkclean()?
> 
> Also it looks like DAX uses the wrong notifier, it calls
> MMU_NOTIFY_CLEAR but page_mkclean_one() uses
> MMU_NOTIFY_PROTECTION_PAGE for the same PTE modification sequence??
> 
> page_mkclean() has some technique to make the notifier have the right
> size without becoming entangled in the PTL locks..

Right.  I guess it's because dax doesn't have "struct page*" on the
back, so it can't directly use page_mkclean().  However the whole logic
does look very similar, so maybe they can be merged in some way.

Thanks,

-- 
Peter Xu
* Re: [PATCH 0/2] KVM: do not assume PTE is writable after follow_pfn

From: Jason Gunthorpe @ 2021-02-08 23:26 UTC
To: Peter Xu
Cc: dan.j.williams, Paolo Bonzini, linux-kernel, kvm, linux-mm,
    Andrew Morton

On Mon, Feb 08, 2021 at 05:02:59PM -0500, Peter Xu wrote:
> On Mon, Feb 08, 2021 at 02:51:33PM -0400, Jason Gunthorpe wrote:
>> On Fri, Feb 05, 2021 at 01:14:11PM -0500, Peter Xu wrote:
>> 
>>> But I do have a question on why dax as the only user needs to pass
>>> in the notifier to follow_pte() for initialization.
>> 
>> Not sure either, why does DAX opencode something very much like
>> page_mkclean() with dax_entry_mkclean()?
>> 
>> Also it looks like DAX uses the wrong notifier, it calls
>> MMU_NOTIFY_CLEAR but page_mkclean_one() uses
>> MMU_NOTIFY_PROTECTION_PAGE for the same PTE modification sequence??
>> 
>> page_mkclean() has some technique to make the notifier have the right
>> size without becoming entangled in the PTL locks..
> 
> Right.  I guess it's because dax doesn't have "struct page*" on the
> back, so it

It doesn't?  I thought DAX cases did?

Jason
* Re: [PATCH 0/2] KVM: do not assume PTE is writable after follow_pfn

From: Peter Xu @ 2021-02-09 0:23 UTC
To: Jason Gunthorpe
Cc: dan.j.williams, Paolo Bonzini, linux-kernel, kvm, linux-mm,
    Andrew Morton

On Mon, Feb 08, 2021 at 07:26:25PM -0400, Jason Gunthorpe wrote:
> On Mon, Feb 08, 2021 at 05:02:59PM -0500, Peter Xu wrote:
>> Right.  I guess it's because dax doesn't have "struct page*" on the
>> back, so it
> 
> It doesn't?  I thought DAX cases did?

I'm not familiar with dax at all... but it seems so: e.g.
dax_iomap_pte_fault() looks like the general fault handler for dax
mappings, in which there are calls to things like
vmf_insert_mixed_mkwrite() trying to install ptes with a raw pfn.

Or I could also be missing something very important..

Thanks,

-- 
Peter Xu
* Re: [PATCH 0/2] KVM: do not assume PTE is writable after follow_pfn
  2021-02-08 23:26           ` Jason Gunthorpe
  2021-02-09  0:23             ` Peter Xu
@ 2021-02-09  8:19             ` Christoph Hellwig
  2021-02-09 10:02               ` Joao Martins
  1 sibling, 1 reply; 17+ messages in thread
From: Christoph Hellwig @ 2021-02-09  8:19 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Peter Xu, dan.j.williams, Paolo Bonzini, linux-kernel, kvm,
	linux-mm, Andrew Morton

On Mon, Feb 08, 2021 at 07:26:25PM -0400, Jason Gunthorpe wrote:
> > > page_mkclean() has some technique to make the notifier have the right
> > > size without becoming entangled in the PTL locks..
> >
> > Right. I guess it's because dax doesn't have "struct page*" on the
> > back, so it
>
> It doesn't? I thought DAX cases did?

File system DAX has a struct page, device DAX does not. Which means
everything using iomap should have a page available, but I'm adding
Dan as he should know the details :)

^ permalink raw reply	[flat|nested] 17+ messages in thread
* Re: [PATCH 0/2] KVM: do not assume PTE is writable after follow_pfn
  2021-02-09  8:19             ` Christoph Hellwig
@ 2021-02-09 10:02               ` Joao Martins
  0 siblings, 0 replies; 17+ messages in thread
From: Joao Martins @ 2021-02-09 10:02 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Peter Xu, dan.j.williams, Paolo Bonzini, linux-kernel, kvm,
	linux-mm, Andrew Morton, Jason Gunthorpe

On 2/9/21 8:19 AM, Christoph Hellwig wrote:
> On Mon, Feb 08, 2021 at 07:26:25PM -0400, Jason Gunthorpe wrote:
>>>> page_mkclean() has some technique to make the notifier have the right
>>>> size without becoming entangled in the PTL locks..
>>>
>>> Right. I guess it's because dax doesn't have "struct page*" on the
>>> back, so it
>>
>> It doesn't? I thought DAX cases did?
>
> File system DAX has a struct page, device DAX does not. Which means
> everything using iomap should have a page available, but I'm adding
> Dan as he should know the details :)
>
Both fsdax and device-dax have struct page. It's the pmem block device
that doesn't.

^ permalink raw reply	[flat|nested] 17+ messages in thread
end of thread, other threads:[~2021-02-09 10:06 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-02-05 10:32 [PATCH 0/2] KVM: do not assume PTE is writable after follow_pfn Paolo Bonzini
2021-02-05 10:32 ` [PATCH 1/2] mm: provide a sane PTE walking API for modules Paolo Bonzini
2021-02-05 13:49   ` Jason Gunthorpe
2021-02-08 17:39   ` Christoph Hellwig
2021-02-08 18:18     ` Paolo Bonzini
2021-02-09  8:14       ` Christoph Hellwig
2021-02-09  9:19         ` Paolo Bonzini
2021-02-05 10:32 ` [PATCH 2/2] KVM: do not assume PTE is writable after follow_pfn Paolo Bonzini
2021-02-05 15:43   ` kernel test robot
2021-02-05 17:41   ` kernel test robot
2021-02-05 18:14 ` [PATCH 0/2] " Peter Xu
2021-02-08 18:51   ` Jason Gunthorpe
2021-02-08 22:02     ` Peter Xu
2021-02-08 23:26       ` Jason Gunthorpe
2021-02-09  0:23         ` Peter Xu
2021-02-09  8:19         ` Christoph Hellwig
2021-02-09 10:02           ` Joao Martins