* [PATCH] vfio/pci: make the vfio_pci_mmap_fault reentrant
From: Zeng Tao @ 2021-03-08 11:11 UTC
  To: alex.williamson
  Cc: linuxarm, Zeng Tao, Cornelia Huck, Kevin Tian, Andrew Morton,
	Peter Xu, Giovanni Cabiddu, Michel Lespinasse, Jann Horn,
	Max Gurtovoy, kvm, linux-kernel

We have met the following error when testing with DPDK testpmd:
[ 1591.733256] kernel BUG at mm/memory.c:2177!
[ 1591.739515] Internal error: Oops - BUG: 0 [#1] PREEMPT SMP
[ 1591.747381] Modules linked in: vfio_iommu_type1 vfio_pci vfio_virqfd vfio pv680_mii(O)
[ 1591.760536] CPU: 2 PID: 227 Comm: lcore-worker-2 Tainted: G O 5.11.0-rc3+ #1
[ 1591.770735] Hardware name:  , BIOS HixxxxFPGA 1P B600 V121-1
[ 1591.778872] pstate: 40400009 (nZcv daif +PAN -UAO -TCO BTYPE=--)
[ 1591.786134] pc : remap_pfn_range+0x214/0x340
[ 1591.793564] lr : remap_pfn_range+0x1b8/0x340
[ 1591.799117] sp : ffff80001068bbd0
[ 1591.803476] x29: ffff80001068bbd0 x28: 0000042eff6f0000
[ 1591.810404] x27: 0000001100910000 x26: 0000001300910000
[ 1591.817457] x25: 0068000000000fd3 x24: ffffa92f1338e358
[ 1591.825144] x23: 0000001140000000 x22: 0000000000000041
[ 1591.832506] x21: 0000001300910000 x20: ffffa92f141a4000
[ 1591.839520] x19: 0000001100a00000 x18: 0000000000000000
[ 1591.846108] x17: 0000000000000000 x16: ffffa92f11844540
[ 1591.853570] x15: 0000000000000000 x14: 0000000000000000
[ 1591.860768] x13: fffffc0000000000 x12: 0000000000000880
[ 1591.868053] x11: ffff0821bf3d01d0 x10: ffff5ef2abd89000
[ 1591.875932] x9 : ffffa92f12ab0064 x8 : ffffa92f136471c0
[ 1591.883208] x7 : 0000001140910000 x6 : 0000000200000000
[ 1591.890177] x5 : 0000000000000001 x4 : 0000000000000001
[ 1591.896656] x3 : 0000000000000000 x2 : 0168044000000fd3
[ 1591.903215] x1 : ffff082126261880 x0 : fffffc2084989868
[ 1591.910234] Call trace:
[ 1591.914837]  remap_pfn_range+0x214/0x340
[ 1591.921765]  vfio_pci_mmap_fault+0xac/0x130 [vfio_pci]
[ 1591.931200]  __do_fault+0x44/0x12c
[ 1591.937031]  handle_mm_fault+0xcc8/0x1230
[ 1591.942475]  do_page_fault+0x16c/0x484
[ 1591.948635]  do_translation_fault+0xbc/0xd8
[ 1591.954171]  do_mem_abort+0x4c/0xc0
[ 1591.960316]  el0_da+0x40/0x80
[ 1591.965585]  el0_sync_handler+0x168/0x1b0
[ 1591.971608]  el0_sync+0x174/0x180
[ 1591.978312] Code: eb1b027f 540000c0 f9400022 b4fffe02 (d4210000)

The cause is that vfio_pci_mmap_fault() is not reentrant: if multiple
threads fault on the same address at the same time, each of them tries
to remap an already-populated pte and we hit the BUG_ON() above.
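
The check that fires is the pte_none() assertion in remap_pte_range()
(mm/memory.c, abridged here for reference):

	do {
		BUG_ON(!pte_none(*pte));
		set_pte_at(mm, addr, pte, pte_mkspecial(pfn_pte(pfn, prot)));
		pfn++;
	} while (pte++, addr += PAGE_SIZE, addr != end);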

Fix the issue by making vfio_pci_mmap_fault() reentrant. There is also
a second issue: when io_remap_pfn_range() fails, the earlier
__vfio_pci_add_vma() must be undone. Fix that by moving
__vfio_pci_add_vma() below io_remap_pfn_range().

Fixes: 11c4cd07ba11 ("vfio-pci: Fault mmaps to enable vma tracking")
Signed-off-by: Zeng Tao <prime.zeng@hisilicon.com>
---
 drivers/vfio/pci/vfio_pci.c | 14 ++++++++++----
 1 file changed, 10 insertions(+), 4 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index 65e7e6b..6928c37 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -1613,6 +1613,7 @@ static vm_fault_t vfio_pci_mmap_fault(struct vm_fault *vmf)
 	struct vm_area_struct *vma = vmf->vma;
 	struct vfio_pci_device *vdev = vma->vm_private_data;
 	vm_fault_t ret = VM_FAULT_NOPAGE;
+	unsigned long pfn;
 
 	mutex_lock(&vdev->vma_lock);
 	down_read(&vdev->memory_lock);
@@ -1623,18 +1624,23 @@ static vm_fault_t vfio_pci_mmap_fault(struct vm_fault *vmf)
 		goto up_out;
 	}
 
-	if (__vfio_pci_add_vma(vdev, vma)) {
-		ret = VM_FAULT_OOM;
+	if (!follow_pfn(vma, vma->vm_start, &pfn)) {
 		mutex_unlock(&vdev->vma_lock);
 		goto up_out;
 	}
 
-	mutex_unlock(&vdev->vma_lock);
 
 	if (io_remap_pfn_range(vma, vma->vm_start, vma->vm_pgoff,
-			       vma->vm_end - vma->vm_start, vma->vm_page_prot))
+			       vma->vm_end - vma->vm_start, vma->vm_page_prot)) {
 		ret = VM_FAULT_SIGBUS;
+		mutex_unlock(&vdev->vma_lock);
+		goto up_out;
+	}
+
+	if (__vfio_pci_add_vma(vdev, vma))
+		ret = VM_FAULT_OOM;
 
+	mutex_unlock(&vdev->vma_lock);
 up_out:
 	up_read(&vdev->memory_lock);
 	return ret;
-- 
2.8.1



* Re: [PATCH] vfio/pci: make the vfio_pci_mmap_fault reentrant
From: Alex Williamson @ 2021-03-08 20:21 UTC
  To: Zeng Tao
  Cc: linuxarm, Cornelia Huck, Kevin Tian, Andrew Morton, Peter Xu,
	Giovanni Cabiddu, Michel Lespinasse, Jann Horn, Max Gurtovoy,
	kvm, linux-kernel, Jason Gunthorpe

On Mon, 8 Mar 2021 19:11:26 +0800
Zeng Tao <prime.zeng@hisilicon.com> wrote:

> We have met the following error when testing with DPDK testpmd:
> [ 1591.733256] kernel BUG at mm/memory.c:2177!
> [...]
> 
> The cause is that vfio_pci_mmap_fault() is not reentrant: if multiple
> threads fault on the same address at the same time, each of them tries
> to remap an already-populated pte and we hit the BUG_ON() above.
> 
> Fix the issue by making vfio_pci_mmap_fault() reentrant. There is also
> a second issue: when io_remap_pfn_range() fails, the earlier
> __vfio_pci_add_vma() must be undone. Fix that by moving
> __vfio_pci_add_vma() below io_remap_pfn_range().
> 
> Fixes: 11c4cd07ba11 ("vfio-pci: Fault mmaps to enable vma tracking")
> Signed-off-by: Zeng Tao <prime.zeng@hisilicon.com>
> ---
>  drivers/vfio/pci/vfio_pci.c | 14 ++++++++++----
>  1 file changed, 10 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
> index 65e7e6b..6928c37 100644
> --- a/drivers/vfio/pci/vfio_pci.c
> +++ b/drivers/vfio/pci/vfio_pci.c
> @@ -1613,6 +1613,7 @@ static vm_fault_t vfio_pci_mmap_fault(struct vm_fault *vmf)
>  	struct vm_area_struct *vma = vmf->vma;
>  	struct vfio_pci_device *vdev = vma->vm_private_data;
>  	vm_fault_t ret = VM_FAULT_NOPAGE;
> +	unsigned long pfn;
>  
>  	mutex_lock(&vdev->vma_lock);
>  	down_read(&vdev->memory_lock);
> @@ -1623,18 +1624,23 @@ static vm_fault_t vfio_pci_mmap_fault(struct vm_fault *vmf)
>  		goto up_out;
>  	}
>  
> -	if (__vfio_pci_add_vma(vdev, vma)) {
> -		ret = VM_FAULT_OOM;
> +	if (!follow_pfn(vma, vma->vm_start, &pfn)) {
>  		mutex_unlock(&vdev->vma_lock);
>  		goto up_out;
>  	}
>  
> -	mutex_unlock(&vdev->vma_lock);


If I understand correctly, I think you're using (perhaps slightly
abusing) the vma_lock to extend the serialization of the vma_list
manipulation to include io_remap_pfn_range() such that you can test
whether the pte has already been populated using follow_pfn().  In that
case we return VM_FAULT_NOPAGE without trying to repopulate the page
and therefore avoid the BUG_ON in remap_pte_range() triggered by trying
to overwrite an existing pte, and less importantly, a duplicate vma in
our list.  I wonder if use of follow_pfn() is still strongly
discouraged for this use case.

I'm surprised that it's left to the fault handler to provide this
serialization, is this because we're filling the entire vma rather than
only the faulting page?

As we move to unmap_mapping_range()[1] we remove all of the complexity
of managing a list of vmas to zap based on whether device memory is
enabled, including the vma_lock.  Are we going to need to replace that
with another lock here, or is there a better approach to handling
concurrency of this fault handler?  Jason/Peter?  Thanks,

Alex

[1]https://lore.kernel.org/kvm/161401267316.16443.11184767955094847849.stgit@gimli.home/

>  
>  	if (io_remap_pfn_range(vma, vma->vm_start, vma->vm_pgoff,
> -			       vma->vm_end - vma->vm_start, vma->vm_page_prot))
> +			       vma->vm_end - vma->vm_start, vma->vm_page_prot)) {
>  		ret = VM_FAULT_SIGBUS;
> +		mutex_unlock(&vdev->vma_lock);
> +		goto up_out;
> +	}
> +
> +	if (__vfio_pci_add_vma(vdev, vma))
> +		ret = VM_FAULT_OOM;
>  
> +	mutex_unlock(&vdev->vma_lock);
>  up_out:
>  	up_read(&vdev->memory_lock);
>  	return ret;



* Re: [PATCH] vfio/pci: make the vfio_pci_mmap_fault reentrant
From: Peter Xu @ 2021-03-08 22:56 UTC
  To: Alex Williamson
  Cc: Zeng Tao, linuxarm, Cornelia Huck, Kevin Tian, Andrew Morton,
	Giovanni Cabiddu, Michel Lespinasse, Jann Horn, Max Gurtovoy,
	kvm, linux-kernel, Jason Gunthorpe

On Mon, Mar 08, 2021 at 01:21:06PM -0700, Alex Williamson wrote:
> On Mon, 8 Mar 2021 19:11:26 +0800
> Zeng Tao <prime.zeng@hisilicon.com> wrote:
> 
> > We have met the following error when testing with DPDK testpmd:
> > [ 1591.733256] kernel BUG at mm/memory.c:2177!
> > [...]
> > 
> > The cause is that vfio_pci_mmap_fault() is not reentrant: if multiple
> > threads fault on the same address at the same time, each of them tries
> > to remap an already-populated pte and we hit the BUG_ON() above.
> > 
> > Fix the issue by making vfio_pci_mmap_fault() reentrant. There is also
> > a second issue: when io_remap_pfn_range() fails, the earlier
> > __vfio_pci_add_vma() must be undone. Fix that by moving
> > __vfio_pci_add_vma() below io_remap_pfn_range().
> > 
> > Fixes: 11c4cd07ba11 ("vfio-pci: Fault mmaps to enable vma tracking")
> > Signed-off-by: Zeng Tao <prime.zeng@hisilicon.com>
> > ---
> >  drivers/vfio/pci/vfio_pci.c | 14 ++++++++++----
> >  1 file changed, 10 insertions(+), 4 deletions(-)
> > 
> > diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
> > index 65e7e6b..6928c37 100644
> > --- a/drivers/vfio/pci/vfio_pci.c
> > +++ b/drivers/vfio/pci/vfio_pci.c
> > @@ -1613,6 +1613,7 @@ static vm_fault_t vfio_pci_mmap_fault(struct vm_fault *vmf)
> >  	struct vm_area_struct *vma = vmf->vma;
> >  	struct vfio_pci_device *vdev = vma->vm_private_data;
> >  	vm_fault_t ret = VM_FAULT_NOPAGE;
> > +	unsigned long pfn;
> >  
> >  	mutex_lock(&vdev->vma_lock);
> >  	down_read(&vdev->memory_lock);
> > @@ -1623,18 +1624,23 @@ static vm_fault_t vfio_pci_mmap_fault(struct vm_fault *vmf)
> >  		goto up_out;
> >  	}
> >  
> > -	if (__vfio_pci_add_vma(vdev, vma)) {
> > -		ret = VM_FAULT_OOM;
> > +	if (!follow_pfn(vma, vma->vm_start, &pfn)) {
> >  		mutex_unlock(&vdev->vma_lock);
> >  		goto up_out;
> >  	}
> >  
> > -	mutex_unlock(&vdev->vma_lock);
> 
> 
> If I understand correctly, I think you're using (perhaps slightly
> abusing) the vma_lock to extend the serialization of the vma_list
> manipulation to include io_remap_pfn_range() such that you can test
> whether the pte has already been populated using follow_pfn().  In that
> case we return VM_FAULT_NOPAGE without trying to repopulate the page
> and therefore avoid the BUG_ON in remap_pte_range() triggered by trying
> to overwrite an existing pte, and less importantly, a duplicate vma in
> our list.  I wonder if use of follow_pfn() is still strongly
> discouraged for this use case.
> 
> I'm surprised that it's left to the fault handler to provide this
> serialization, is this because we're filling the entire vma rather than
> only the faulting page?

There's definitely some kind of serialization in the process using the
pgtable locks, which gives me the feeling that the BUG_ON() in
remap_pte_range() seems too strong - "!pte_none(*pte)" could arguably
return -EEXIST instead.

However there'll still be the issue of the duplicated vma in vma_list -
that seems to be a sign that it's still better to fix it from the vfio
layer.

> 
> As we move to unmap_mapping_range()[1] we remove all of the complexity
> of managing a list of vmas to zap based on whether device memory is
> enabled, including the vma_lock.  Are we going to need to replace that
> with another lock here, or is there a better approach to handling
> concurrency of this fault handler?  Jason/Peter?  Thanks,

Not looked into the new series of unmap_mapping_range() yet..  But for
the current code base: instead of follow_pte(), maybe we could simply do
the ordering by searching the vma list before inserting into it?
Because if the vma already exists, it means the pte installation is
done, or at least in progress.  Then we could return VM_FAULT_RETRY
hoping that it'll be done soon.

Then maybe it would also make some sense to have vma_lock protect the
whole io_remap_pfn_range() too? - not for the ordering, but so that
holding the vma_lock guarantees the current vma has all its ptes
installed and the next memory access is guaranteed to succeed.  That
seems more efficient than looping on VM_FAULT_RETRY page faults until
it's done.
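
For the current code that could look like the below (hypothetical
sketch; __vfio_pci_vma_on_list() is a made-up helper reusing the
existing vma_list bookkeeping):

	/* Caller holds vma_lock */
	static bool __vfio_pci_vma_on_list(struct vfio_pci_device *vdev,
					   struct vm_area_struct *vma)
	{
		struct vfio_pci_mmap_vma *mmap_vma;

		list_for_each_entry(mmap_vma, &vdev->vma_list, vma_next)
			if (mmap_vma->vma == vma)
				return true;
		return false;
	}

If it returns true, the fault handler would bail out with
VM_FAULT_RETRY (or with VM_FAULT_NOPAGE once vma_lock also covers
io_remap_pfn_range()).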

Thanks,

-- 
Peter Xu



* Re: [PATCH] vfio/pci: make the vfio_pci_mmap_fault reentrant
From: Jason Gunthorpe @ 2021-03-08 23:43 UTC
  To: Alex Williamson
  Cc: Zeng Tao, linuxarm, Cornelia Huck, Kevin Tian, Andrew Morton,
	Peter Xu, Giovanni Cabiddu, Michel Lespinasse, Jann Horn,
	Max Gurtovoy, kvm, linux-kernel

On Mon, Mar 08, 2021 at 01:21:06PM -0700, Alex Williamson wrote:
> > diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
> > index 65e7e6b..6928c37 100644
> > +++ b/drivers/vfio/pci/vfio_pci.c
> > @@ -1613,6 +1613,7 @@ static vm_fault_t vfio_pci_mmap_fault(struct vm_fault *vmf)
> >  	struct vm_area_struct *vma = vmf->vma;
> >  	struct vfio_pci_device *vdev = vma->vm_private_data;
> >  	vm_fault_t ret = VM_FAULT_NOPAGE;
> > +	unsigned long pfn;
> >  
> >  	mutex_lock(&vdev->vma_lock);
> >  	down_read(&vdev->memory_lock);
> > @@ -1623,18 +1624,23 @@ static vm_fault_t vfio_pci_mmap_fault(struct vm_fault *vmf)
> >  		goto up_out;
> >  	}
> >  
> > -	if (__vfio_pci_add_vma(vdev, vma)) {
> > -		ret = VM_FAULT_OOM;
> > +	if (!follow_pfn(vma, vma->vm_start, &pfn)) {
> >  		mutex_unlock(&vdev->vma_lock);
> >  		goto up_out;

Gah, no new follow_pfn users please, we are trying to delete this
stuff..

I believe the right fix is to change the fault handler to use
vmf_insert_pfn_prot() which has all the right locking/etc

> I'm surprised that it's left to the fault handler to provide this
> serialization, is this because we're filling the entire vma rather than
> only the faulting page?

I think it is because remap_pfn is not intended to be called from a
fault handler. The fault handler APIs seem to be named vmf_* ..

If you want to use the remap API it has to be done and managed outside
the fault handler. I.e. when the MMIO transitions from valid->invalid,
vfio-pci wipes the address space; when it transitions from
invalid->valid it calls remap_pfn_range(). vfio-pci provides its own
locking to protect these state transitions, and the fault handler then
simply always triggers SIGBUS.
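
Roughly like this sketch (vfio_pci_set_memory() is a made-up name, and
it assumes the mappings share one address_space reachable through a
vdev->inode, as in the other series):

	static void vfio_pci_set_memory(struct vfio_pci_device *vdev,
					bool enabled)
	{
		down_write(&vdev->memory_lock);
		/* valid -> invalid: zap every pte so that the next
		 * access faults and raises SIGBUS */
		if (!enabled)
			unmap_mapping_range(vdev->inode->i_mapping,
					    0, 0, true);
		/* invalid -> valid would instead walk the tracked vmas
		 * and io_remap_pfn_range() each of them, under the same
		 * lock */
		up_write(&vdev->memory_lock);
	}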

I recall we discussed this design when you made the original patches
but I don't completely recall why it ended this way, however I think
the reason might disappear after the address_space conversion in your
other series.

Jason


* Re: [PATCH] vfio/pci: make the vfio_pci_mmap_fault reentrant
From: Zengtao (B) @ 2021-03-09  3:49 UTC
  To: Peter Xu, Alex Williamson
  Cc: Cornelia Huck, Kevin Tian, Andrew Morton, Giovanni Cabiddu,
	Michel Lespinasse, Jann Horn, Max Gurtovoy, kvm, linux-kernel,
	Jason Gunthorpe, Linuxarm

Hi guys:

Thanks for the helpful comments. After rethinking the issue, I have proposed
the following change:
1. follow_pte instead of follow_pfn.
2. vmf_insert_pfn loops instead of io_remap_pfn_range.
3. proper undo when a call fails.
4. keep the bigger lock range to avoid unnecessary pte installs.

Please take a look and share your comments, thanks.

static vm_fault_t vfio_pci_mmap_fault(struct vm_fault *vmf)
{
	struct vm_area_struct *vma = vmf->vma;
	struct vfio_pci_device *vdev = vma->vm_private_data;
	vm_fault_t ret = VM_FAULT_NOPAGE;
	unsigned long vaddr, pfn;
	pte_t *ptep;
	spinlock_t *ptl;

	mutex_lock(&vdev->vma_lock);
	down_read(&vdev->memory_lock);

	if (!__vfio_pci_memory_enabled(vdev)) {
		ret = VM_FAULT_SIGBUS;
		goto up_out;
	}

	/* Pte already installed by a concurrent fault, nothing to do */
	if (!follow_pte(vma->vm_mm, vma->vm_start, &ptep, &ptl)) {
		pte_unmap_unlock(ptep, ptl);
		goto up_out;
	}

	for (vaddr = vma->vm_start, pfn = vma->vm_pgoff;
	     vaddr < vma->vm_end; vaddr += PAGE_SIZE, pfn++) {
		ret = vmf_insert_pfn(vma, vaddr, pfn);
		if (ret != VM_FAULT_NOPAGE)
			goto zap_vma;
	}

	if (__vfio_pci_add_vma(vdev, vma)) {
		ret = VM_FAULT_OOM;
		goto zap_vma;
	}

	mutex_unlock(&vdev->vma_lock);
	up_read(&vdev->memory_lock);
	return ret;

zap_vma:
	zap_vma_ptes(vma, vma->vm_start, vaddr - vma->vm_start);
up_out:
	mutex_unlock(&vdev->vma_lock);
	up_read(&vdev->memory_lock);
	return ret;
}

> -----Original Message-----
> From: Peter Xu [mailto:peterx@redhat.com]
> Sent: March 9, 2021 6:56
> To: Alex Williamson <alex.williamson@redhat.com>
> Cc: Zeng Tao <prime.zeng@hisilicon.com>; linuxarm@huawei.com; Cornelia
> Huck <cohuck@redhat.com>; Kevin Tian <kevin.tian@intel.com>; Andrew
> Morton <akpm@linux-foundation.org>; Giovanni Cabiddu
> <giovanni.cabiddu@intel.com>; Michel Lespinasse <walken@google.com>; Jann
> Horn <jannh@google.com>; Max Gurtovoy <mgurtovoy@nvidia.com>;
> kvm@vger.kernel.org; linux-kernel@vger.kernel.org; Jason Gunthorpe
> <jgg@nvidia.com>
> Subject: Re: [PATCH] vfio/pci: make the vfio_pci_mmap_fault reentrant
> 
> On Mon, Mar 08, 2021 at 01:21:06PM -0700, Alex Williamson wrote:
> > On Mon, 8 Mar 2021 19:11:26 +0800
> > Zeng Tao <prime.zeng@hisilicon.com> wrote:
> >
> > > We have met the following error when testing with DPDK testpmd:
> > > [ 1591.733256] kernel BUG at mm/memory.c:2177!
> > > [...]
> > >
> > > The cause is that vfio_pci_mmap_fault() is not reentrant: if multiple
> > > threads fault on the same address at the same time, each of them tries
> > > to remap an already-populated pte and we hit the BUG_ON() above.
> > >
> > > Fix the issue by making vfio_pci_mmap_fault() reentrant. There is also
> > > a second issue: when io_remap_pfn_range() fails, the earlier
> > > __vfio_pci_add_vma() must be undone. Fix that by moving
> > > __vfio_pci_add_vma() below io_remap_pfn_range().
> > >
> > > Fixes: 11c4cd07ba11 ("vfio-pci: Fault mmaps to enable vma tracking")
> > > Signed-off-by: Zeng Tao <prime.zeng@hisilicon.com>
> > > ---
> > >  drivers/vfio/pci/vfio_pci.c | 14 ++++++++++----
> > >  1 file changed, 10 insertions(+), 4 deletions(-)
> > >
> > > diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
> > > index 65e7e6b..6928c37 100644
> > > --- a/drivers/vfio/pci/vfio_pci.c
> > > +++ b/drivers/vfio/pci/vfio_pci.c
> > > @@ -1613,6 +1613,7 @@ static vm_fault_t vfio_pci_mmap_fault(struct vm_fault *vmf)
> > >  	struct vm_area_struct *vma = vmf->vma;
> > >  	struct vfio_pci_device *vdev = vma->vm_private_data;
> > >  	vm_fault_t ret = VM_FAULT_NOPAGE;
> > > +	unsigned long pfn;
> > >
> > >  	mutex_lock(&vdev->vma_lock);
> > >  	down_read(&vdev->memory_lock);
> > > @@ -1623,18 +1624,23 @@ static vm_fault_t vfio_pci_mmap_fault(struct vm_fault *vmf)
> > >  		goto up_out;
> > >  	}
> > >
> > > -	if (__vfio_pci_add_vma(vdev, vma)) {
> > > -		ret = VM_FAULT_OOM;
> > > +	if (!follow_pfn(vma, vma->vm_start, &pfn)) {
> > >  		mutex_unlock(&vdev->vma_lock);
> > >  		goto up_out;
> > >  	}
> > >
> > > -	mutex_unlock(&vdev->vma_lock);
> >
> >
> > If I understand correctly, I think you're using (perhaps slightly
> > abusing) the vma_lock to extend the serialization of the vma_list
> > manipulation to include io_remap_pfn_range() such that you can test
> > whether the pte has already been populated using follow_pfn().  In
> > that case we return VM_FAULT_NOPAGE without trying to repopulate the
> > page and therefore avoid the BUG_ON in remap_pte_range() triggered by
> > trying to overwrite an existing pte, and less importantly, a duplicate
> > vma in our list.  I wonder if use of follow_pfn() is still strongly
> > discouraged for this use case.
> >
> > I'm surprised that it's left to the fault handler to provide this
> > serialization, is this because we're filling the entire vma rather
> > than only the faulting page?
> 
> There's definitely some kind of serialization in the process using the
> pgtable locks, which gives me the feeling that the BUG_ON() in
> remap_pte_range() seems too strong - "!pte_none(*pte)" could arguably
> return -EEXIST instead.
> 
> However there'll still be the issue of the duplicated vma in vma_list -
> that seems to be a sign that it's still better to fix it from the vfio
> layer.
> 
> >
> > As we move to unmap_mapping_range()[1] we remove all of the complexity
> > of managing a list of vmas to zap based on whether device memory is
> > enabled, including the vma_lock.  Are we going to need to replace that
> > with another lock here, or is there a better approach to handling
> > concurrency of this fault handler?  Jason/Peter?  Thanks,
> 
> Not looked into the new series of unmap_mapping_range() yet..  But for
> the current code base: instead of follow_pte(), maybe we could simply do
> the ordering by searching the vma list before inserting into it?
> Because if the vma already exists, it means the pte installation is
> done, or at least in progress.  Then we could return VM_FAULT_RETRY
> hoping that it'll be done soon.
> 
> Then maybe it would also make some sense to have vma_lock protect the
> whole io_remap_pfn_range() too? - not for the ordering, but so that
> holding the vma_lock guarantees the current vma has all its ptes
> installed and the next memory access is guaranteed to succeed.  That
> seems more efficient than looping on VM_FAULT_RETRY page faults until
> it's done.
> 
> Thanks,
> 
> --
> Peter Xu


* Re: [PATCH] vfio/pci: make the vfio_pci_mmap_fault reentrant
From: Jason Gunthorpe @ 2021-03-09 12:46 UTC
  To: Zengtao (B)
  Cc: Peter Xu, Alex Williamson, Cornelia Huck, Kevin Tian,
	Andrew Morton, Giovanni Cabiddu, Michel Lespinasse, Jann Horn,
	Max Gurtovoy, kvm, linux-kernel, Linuxarm

On Tue, Mar 09, 2021 at 03:49:09AM +0000, Zengtao (B) wrote:
> Hi guys:
> 
> Thanks for the helpful comments. After rethinking the issue, I have proposed
> the following change:
> 1. follow_pte instead of follow_pfn.

Still no on follow_pfn, you don't need it once you use vmf_insert_pfn

> 2. vmf_insert_pfn loops instead of io_remap_pfn_range
> 3. proper undo when a call fails.
> 4. keep the bigger lock range to avoid unnecessary pte installs.

Why do we need locks at all here?

Jason


* Re: [PATCH] vfio/pci: make the vfio_pci_mmap_fault reentrant
From: Alex Williamson @ 2021-03-09 15:29 UTC
  To: Jason Gunthorpe
  Cc: Zengtao (B),
	Peter Xu, Cornelia Huck, Kevin Tian, Andrew Morton,
	Giovanni Cabiddu, Michel Lespinasse, Jann Horn, Max Gurtovoy,
	kvm, linux-kernel, Linuxarm

On Tue, 9 Mar 2021 08:46:09 -0400
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Tue, Mar 09, 2021 at 03:49:09AM +0000, Zengtao (B) wrote:
> > Hi guys:
> > 
> > Thanks for the helpful comments. After rethinking the issue, I have proposed
> > the following change:
> > 1. follow_pte instead of follow_pfn.  
> 
> Still no on follow_pfn, you don't need it once you use vmf_insert_pfn

vmf_insert_pfn() only solves the BUG_ON, follow_pte() is being used
here to determine whether the translation is already present to avoid
both duplicate work in inserting the translation and allocating a
duplicate vma tracking structure.

> > 2. vmf_insert_pfn loops instead of io_remap_pfn_range
> > 3. proper undo when a call fails.
> > 4. keep the bigger lock range to avoid unnecessary pte installs.
> 
> Why do we need locks at all here?

For the vma tracking and testing whether the fault is already
populated.  Once we get rid of the vma list, maybe it makes sense to
only insert the faulting page rather than the entire vma, at which
point I think we'd have no reason to serialize.  Thanks,

Alex



* Re: [PATCH] vfio/pci: make the vfio_pci_mmap_fault reentrant
From: Jason Gunthorpe @ 2021-03-09 16:40 UTC
  To: Alex Williamson
  Cc: Zengtao (B),
	Peter Xu, Cornelia Huck, Kevin Tian, Andrew Morton,
	Giovanni Cabiddu, Michel Lespinasse, Jann Horn, Max Gurtovoy,
	kvm, linux-kernel, Linuxarm

On Tue, Mar 09, 2021 at 08:29:51AM -0700, Alex Williamson wrote:
> On Tue, 9 Mar 2021 08:46:09 -0400
> Jason Gunthorpe <jgg@nvidia.com> wrote:
> 
> > On Tue, Mar 09, 2021 at 03:49:09AM +0000, Zengtao (B) wrote:
> > > Hi guys:
> > > 
> > > Thanks for the helpful comments. After rethinking the issue, I have proposed
> > > the following change:
> > > 1. follow_pte instead of follow_pfn.  
> > 
> > Still no on follow_pfn, you don't need it once you use vmf_insert_pfn
> 
> vmf_insert_pfn() only solves the BUG_ON, follow_pte() is being used
> here to determine whether the translation is already present to avoid
> both duplicate work in inserting the translation and allocating a
> duplicate vma tracking structure.
 
Oh.. Doing something stateful in fault is not nice at all

I would rather see __vfio_pci_add_vma() search the vma_list for dups
than call follow_pfn/pte..

> For the vma tracking and testing whether the fault is already
> populated.  Once we get rid of the vma list, maybe it makes sense to
> only insert the faulting page rather than the entire vma, at which
> point I think we'd have no reason to serialize.  Thanks,

Yes, the address_space stuff is a much better solution to all of this.

Jason


* Re: [PATCH] vfio/pci: make the vfio_pci_mmap_fault reentrant
From: Peter Xu @ 2021-03-09 18:47 UTC
  To: Jason Gunthorpe
  Cc: Alex Williamson, Zengtao (B),
	Cornelia Huck, Kevin Tian, Andrew Morton, Giovanni Cabiddu,
	Michel Lespinasse, Jann Horn, Max Gurtovoy, kvm, linux-kernel,
	Linuxarm

On Tue, Mar 09, 2021 at 12:40:04PM -0400, Jason Gunthorpe wrote:
> On Tue, Mar 09, 2021 at 08:29:51AM -0700, Alex Williamson wrote:
> > On Tue, 9 Mar 2021 08:46:09 -0400
> > Jason Gunthorpe <jgg@nvidia.com> wrote:
> > 
> > > On Tue, Mar 09, 2021 at 03:49:09AM +0000, Zengtao (B) wrote:
> > > > Hi guys:
> > > > 
> > > > Thanks for the helpful comments. After rethinking the issue, I have proposed
> > > > the following change:
> > > > 1. follow_pte instead of follow_pfn.  
> > > 
> > > Still no on follow_pfn, you don't need it once you use vmf_insert_pfn
> > 
> > vmf_insert_pfn() only solves the BUG_ON, follow_pte() is being used
> > here to determine whether the translation is already present to avoid
> > both duplicate work in inserting the translation and allocating a
> > duplicate vma tracking structure.
>  
> Oh.. Doing something stateful in fault is not nice at all
> 
> I would rather see __vfio_pci_add_vma() search the vma_list for dups
> than call follow_pfn/pte..

It seems to me that searching vma list is still the simplest way to fix the
problem for the current code base.  I see io_remap_pfn_range() is also used in
the new series - maybe that'll need to be moved to where PCI_COMMAND_MEMORY got
turned on/off in the new series (I just noticed remap_pfn_range modifies vma
flags..), as you suggested in the other email.

Thanks,

-- 
Peter Xu



* Re: [PATCH] vfio/pci: make the vfio_pci_mmap_fault reentrant
From: Alex Williamson @ 2021-03-09 19:26 UTC
  To: Peter Xu
  Cc: Jason Gunthorpe, Zengtao (B),
	Cornelia Huck, Kevin Tian, Andrew Morton, Giovanni Cabiddu,
	Michel Lespinasse, Jann Horn, Max Gurtovoy, kvm, linux-kernel,
	Linuxarm

On Tue, 9 Mar 2021 13:47:39 -0500
Peter Xu <peterx@redhat.com> wrote:

> On Tue, Mar 09, 2021 at 12:40:04PM -0400, Jason Gunthorpe wrote:
> > On Tue, Mar 09, 2021 at 08:29:51AM -0700, Alex Williamson wrote:  
> > > On Tue, 9 Mar 2021 08:46:09 -0400
> > > Jason Gunthorpe <jgg@nvidia.com> wrote:
> > >   
> > > > On Tue, Mar 09, 2021 at 03:49:09AM +0000, Zengtao (B) wrote:  
> > > > > Hi guys:
> > > > > 
> > > > > Thanks for the helpful comments. After rethinking the issue, I have proposed
> > > > > the following change:
> > > > > 1. follow_pte instead of follow_pfn.    
> > > > 
> > > > Still no on follow_pfn, you don't need it once you use vmf_insert_pfn  
> > > 
> > > vmf_insert_pfn() only solves the BUG_ON, follow_pte() is being used
> > > here to determine whether the translation is already present to avoid
> > > both duplicate work in inserting the translation and allocating a
> > > duplicate vma tracking structure.  
> >  
> > Oh.. Doing something stateful in fault is not nice at all
> > 
> > I would rather see __vfio_pci_add_vma() search the vma_list for dups
> > than call follow_pfn/pte..  
> 
> It seems to me that searching vma list is still the simplest way to fix the
> problem for the current code base.  I see io_remap_pfn_range() is also used in
> the new series - maybe that'll need to be moved to where PCI_COMMAND_MEMORY got
> turned on/off in the new series (I just noticed remap_pfn_range modifies vma
> flags..), as you suggested in the other email.


In the new series, I think the fault handler becomes (untested):

static vm_fault_t vfio_pci_mmap_fault(struct vm_fault *vmf)
{
        struct vm_area_struct *vma = vmf->vma;
        struct vfio_pci_device *vdev = vma->vm_private_data;
        unsigned long base_pfn, pgoff;
        vm_fault_t ret = VM_FAULT_SIGBUS;

        if (vfio_pci_bar_vma_to_pfn(vma, &base_pfn))
                return ret;

        pgoff = (vmf->address - vma->vm_start) >> PAGE_SHIFT;

        down_read(&vdev->memory_lock);

        if (__vfio_pci_memory_enabled(vdev))
                ret = vmf_insert_pfn(vma, vmf->address, pgoff + base_pfn);

        up_read(&vdev->memory_lock);

        return ret;
}

Thanks,
Alex



* Re: [PATCH] vfio/pci: make the vfio_pci_mmap_fault reentrant
From: Peter Xu @ 2021-03-09 19:48 UTC
  To: Alex Williamson
  Cc: Jason Gunthorpe, Zengtao (B),
	Cornelia Huck, Kevin Tian, Andrew Morton, Giovanni Cabiddu,
	Michel Lespinasse, Jann Horn, Max Gurtovoy, kvm, linux-kernel,
	Linuxarm

On Tue, Mar 09, 2021 at 12:26:07PM -0700, Alex Williamson wrote:
> On Tue, 9 Mar 2021 13:47:39 -0500
> Peter Xu <peterx@redhat.com> wrote:
> 
> > On Tue, Mar 09, 2021 at 12:40:04PM -0400, Jason Gunthorpe wrote:
> > > On Tue, Mar 09, 2021 at 08:29:51AM -0700, Alex Williamson wrote:  
> > > > On Tue, 9 Mar 2021 08:46:09 -0400
> > > > Jason Gunthorpe <jgg@nvidia.com> wrote:
> > > >   
> > > > > On Tue, Mar 09, 2021 at 03:49:09AM +0000, Zengtao (B) wrote:  
> > > > > > Hi guys:
> > > > > > 
> > > > > > Thanks for the helpful comments. After rethinking the issue, I have proposed
> > > > > > the following change:
> > > > > > 1. follow_pte instead of follow_pfn.    
> > > > > 
> > > > > Still no on follow_pfn, you don't need it once you use vmf_insert_pfn  
> > > > 
> > > > vmf_insert_pfn() only solves the BUG_ON, follow_pte() is being used
> > > > here to determine whether the translation is already present to avoid
> > > > both duplicate work in inserting the translation and allocating a
> > > > duplicate vma tracking structure.  
> > >  
> > > Oh.. Doing something stateful in fault is not nice at all
> > > 
> > > I would rather see __vfio_pci_add_vma() search the vma_list for dups
> > > than call follow_pfn/pte..  
> > 
> > It seems to me that searching vma list is still the simplest way to fix the
> > problem for the current code base.  I see io_remap_pfn_range() is also used in
> > the new series - maybe that'll need to be moved to where PCI_COMMAND_MEMORY got
> > turned on/off in the new series (I just noticed remap_pfn_range modifies vma
> > flags..), as you suggested in the other email.
> 
> 
> In the new series, I think the fault handler becomes (untested):
> 
> static vm_fault_t vfio_pci_mmap_fault(struct vm_fault *vmf)
> {
>         struct vm_area_struct *vma = vmf->vma;
>         struct vfio_pci_device *vdev = vma->vm_private_data;
>         unsigned long base_pfn, pgoff;
>         vm_fault_t ret = VM_FAULT_SIGBUS;
> 
>         if (vfio_pci_bar_vma_to_pfn(vma, &base_pfn))
>                 return ret;
> 
>         pgoff = (vmf->address - vma->vm_start) >> PAGE_SHIFT;
> 
>         down_read(&vdev->memory_lock);
> 
>         if (__vfio_pci_memory_enabled(vdev))
>                 ret = vmf_insert_pfn(vma, vmf->address, pgoff + base_pfn);
> 
>         up_read(&vdev->memory_lock);
> 
>         return ret;
> }

It's just that the initial MMIO access delay would be spread to the 1st
access of each mmio page rather than being paid once as in the previous
pre-fault scheme.  I think a userspace that cares enough about the delay
should pre-fault all pages anyway, but just raising this up.  Otherwise
looks sane.
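
(For completeness, a userspace that wants the old behaviour back can
prefault by hand after mmap(); hypothetical example, prefault() is a
made-up helper:)

	#include <stddef.h>

	static volatile unsigned char sink;

	/* Touch one byte per page so every pte is populated up front */
	static void prefault(const unsigned char *base, size_t len,
			     size_t page_size)
	{
		size_t off;

		for (off = 0; off < len; off += page_size)
			sink = base[off];
	}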

Thanks,

-- 
Peter Xu



* Re: [PATCH] vfio/pci: make the vfio_pci_mmap_fault reentrant
From: Alex Williamson @ 2021-03-09 19:56 UTC
  To: Peter Xu
  Cc: Jason Gunthorpe, Zengtao (B),
	Cornelia Huck, Kevin Tian, Andrew Morton, Giovanni Cabiddu,
	Michel Lespinasse, Jann Horn, Max Gurtovoy, kvm, linux-kernel,
	Linuxarm

On Tue, 9 Mar 2021 12:26:07 -0700
Alex Williamson <alex.williamson@redhat.com> wrote:

> On Tue, 9 Mar 2021 13:47:39 -0500
> Peter Xu <peterx@redhat.com> wrote:
> 
> > On Tue, Mar 09, 2021 at 12:40:04PM -0400, Jason Gunthorpe wrote:  
> > > On Tue, Mar 09, 2021 at 08:29:51AM -0700, Alex Williamson wrote:    
> > > > On Tue, 9 Mar 2021 08:46:09 -0400
> > > > Jason Gunthorpe <jgg@nvidia.com> wrote:
> > > >     
> > > > > On Tue, Mar 09, 2021 at 03:49:09AM +0000, Zengtao (B) wrote:    
> > > > > > Hi guys:
> > > > > > 
> > > > > > Thanks for the helpful comments. After rethinking the issue, I have proposed
> > > > > > the following change:
> > > > > > 1. follow_pte instead of follow_pfn.      
> > > > > 
> > > > > Still no on follow_pfn, you don't need it once you use vmf_insert_pfn    
> > > > 
> > > > vmf_insert_pfn() only solves the BUG_ON, follow_pte() is being used
> > > > here to determine whether the translation is already present to avoid
> > > > both duplicate work in inserting the translation and allocating a
> > > > duplicate vma tracking structure.    
> > >  
> > > Oh.. Doing something stateful in fault is not nice at all
> > > 
> > > I would rather see __vfio_pci_add_vma() search the vma_list for dups
> > > than call follow_pfn/pte..    
> > 
> > It seems to me that searching vma list is still the simplest way to fix the
> > problem for the current code base.  I see io_remap_pfn_range() is also used in
> > the new series - maybe that'll need to be moved to where PCI_COMMAND_MEMORY got
> > turned on/off in the new series (I just noticed remap_pfn_range modifies vma
> > flags..), as you suggested in the other email.  
> 
> 
> In the new series, I think the fault handler becomes (untested):
> 
> static vm_fault_t vfio_pci_mmap_fault(struct vm_fault *vmf)
> {
>         struct vm_area_struct *vma = vmf->vma;
>         struct vfio_pci_device *vdev = vma->vm_private_data;
>         unsigned long base_pfn, pgoff;
>         vm_fault_t ret = VM_FAULT_SIGBUS;
> 
>         if (vfio_pci_bar_vma_to_pfn(vma, &base_pfn))
>                 return ret;
> 
>         pgoff = (vmf->address - vma->vm_start) >> PAGE_SHIFT;
> 
>         down_read(&vdev->memory_lock);
> 
>         if (__vfio_pci_memory_enabled(vdev))
>                 ret = vmf_insert_pfn(vma, vmf->address, pgoff + base_pfn);
> 
>         up_read(&vdev->memory_lock);
> 
>         return ret;
> }

And I think this is what we end up with for the current code base:

diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index 65e7e6b44578..2f247ab18c66 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -1568,19 +1568,24 @@ void vfio_pci_memory_unlock_and_restore(struct vfio_pci_device *vdev, u16 cmd)
 }
 
 /* Caller holds vma_lock */
-static int __vfio_pci_add_vma(struct vfio_pci_device *vdev,
-			      struct vm_area_struct *vma)
+static struct vfio_pci_mmap_vma *__vfio_pci_add_vma(struct vfio_pci_device *vdev,
+						     struct vm_area_struct *vma)
 {
 	struct vfio_pci_mmap_vma *mmap_vma;
 
+	list_for_each_entry(mmap_vma, &vdev->vma_list, vma_next) {
+		if (mmap_vma->vma == vma)
+			return ERR_PTR(-EEXIST);
+	}
+
 	mmap_vma = kmalloc(sizeof(*mmap_vma), GFP_KERNEL);
 	if (!mmap_vma)
-		return -ENOMEM;
+		return ERR_PTR(-ENOMEM);
 
 	mmap_vma->vma = vma;
 	list_add(&mmap_vma->vma_next, &vdev->vma_list);
 
-	return 0;
+	return mmap_vma;
 }
 
 /*
@@ -1612,30 +1617,39 @@ static vm_fault_t vfio_pci_mmap_fault(struct vm_fault *vmf)
 {
 	struct vm_area_struct *vma = vmf->vma;
 	struct vfio_pci_device *vdev = vma->vm_private_data;
-	vm_fault_t ret = VM_FAULT_NOPAGE;
+	struct vfio_pci_mmap_vma *mmap_vma;
+	unsigned long vaddr, pfn;
+	vm_fault_t ret;
 
 	mutex_lock(&vdev->vma_lock);
 	down_read(&vdev->memory_lock);
 
 	if (!__vfio_pci_memory_enabled(vdev)) {
 		ret = VM_FAULT_SIGBUS;
-		mutex_unlock(&vdev->vma_lock);
 		goto up_out;
 	}
 
-	if (__vfio_pci_add_vma(vdev, vma)) {
-		ret = VM_FAULT_OOM;
-		mutex_unlock(&vdev->vma_lock);
+	mmap_vma = __vfio_pci_add_vma(vdev, vma);
+	if (IS_ERR(mmap_vma)) {
+		/* A concurrent fault might have already inserted the page */
+		ret = (PTR_ERR(mmap_vma) == -EEXIST) ? VM_FAULT_NOPAGE :
+						       VM_FAULT_OOM;
 		goto up_out;
 	}
 
-	mutex_unlock(&vdev->vma_lock);
-
-	if (io_remap_pfn_range(vma, vma->vm_start, vma->vm_pgoff,
-			       vma->vm_end - vma->vm_start, vma->vm_page_prot))
-		ret = VM_FAULT_SIGBUS;
-
+	for (vaddr = vma->vm_start, pfn = vma->vm_pgoff;
+	     vaddr < vma->vm_end; vaddr += PAGE_SIZE, pfn++) {
+		ret = vmf_insert_pfn(vma, vaddr, pfn);
+		if (ret != VM_FAULT_NOPAGE) {
+			zap_vma_ptes(vma, vma->vm_start,
+				     vma->vm_end - vma->vm_start);
+			list_del(&mmap_vma->vma_next);
+			kfree(mmap_vma);
+			break;
+		}
+	}
 up_out:
+	mutex_unlock(&vdev->vma_lock);
 	up_read(&vdev->memory_lock);
 	return ret;
 }



* Re: [PATCH] vfio/pci: make the vfio_pci_mmap_fault reentrant
From: Alex Williamson @ 2021-03-09 20:11 UTC
  To: Peter Xu
  Cc: Jason Gunthorpe, Zengtao (B),
	Cornelia Huck, Kevin Tian, Andrew Morton, Giovanni Cabiddu,
	Michel Lespinasse, Jann Horn, Max Gurtovoy, kvm, linux-kernel,
	Linuxarm

On Tue, 9 Mar 2021 14:48:24 -0500
Peter Xu <peterx@redhat.com> wrote:

> On Tue, Mar 09, 2021 at 12:26:07PM -0700, Alex Williamson wrote:
> > On Tue, 9 Mar 2021 13:47:39 -0500
> > Peter Xu <peterx@redhat.com> wrote:
> >   
> > > On Tue, Mar 09, 2021 at 12:40:04PM -0400, Jason Gunthorpe wrote:  
> > > > On Tue, Mar 09, 2021 at 08:29:51AM -0700, Alex Williamson wrote:    
> > > > > On Tue, 9 Mar 2021 08:46:09 -0400
> > > > > Jason Gunthorpe <jgg@nvidia.com> wrote:
> > > > >     
> > > > > > On Tue, Mar 09, 2021 at 03:49:09AM +0000, Zengtao (B) wrote:    
> > > > > > > Hi guys:
> > > > > > > 
> > > > > > > Thanks for the helpful comments. After rethinking the issue, I have proposed
> > > > > > > the following change:
> > > > > > > 1. follow_pte instead of follow_pfn.      
> > > > > > 
> > > > > > Still no on follow_pfn, you don't need it once you use vmf_insert_pfn    
> > > > > 
> > > > > vmf_insert_pfn() only solves the BUG_ON, follow_pte() is being used
> > > > > here to determine whether the translation is already present to avoid
> > > > > both duplicate work in inserting the translation and allocating a
> > > > > duplicate vma tracking structure.    
> > > >  
> > > > Oh.. Doing something stateful in fault is not nice at all
> > > > 
> > > > I would rather see __vfio_pci_add_vma() search the vma_list for dups
> > > > than call follow_pfn/pte..    
> > > 
> > > It seems to me that searching vma list is still the simplest way to fix the
> > > problem for the current code base.  I see io_remap_pfn_range() is also used in
> > > the new series - maybe that'll need to be moved to where PCI_COMMAND_MEMORY got
> > > turned on/off in the new series (I just noticed remap_pfn_range modifies vma
> > > flags..), as you suggested in the other email.  
> > 
> > 
> > In the new series, I think the fault handler becomes (untested):
> > 
> > static vm_fault_t vfio_pci_mmap_fault(struct vm_fault *vmf)
> > {
> >         struct vm_area_struct *vma = vmf->vma;
> >         struct vfio_pci_device *vdev = vma->vm_private_data;
> >         unsigned long base_pfn, pgoff;
> >         vm_fault_t ret = VM_FAULT_SIGBUS;
> > 
> >         if (vfio_pci_bar_vma_to_pfn(vma, &base_pfn))
> >                 return ret;
> > 
> >         pgoff = (vmf->address - vma->vm_start) >> PAGE_SHIFT;
> > 
> >         down_read(&vdev->memory_lock);
> > 
> >         if (__vfio_pci_memory_enabled(vdev))
> >                 ret = vmf_insert_pfn(vma, vmf->address, pgoff + base_pfn);
> > 
> >         up_read(&vdev->memory_lock);
> > 
> >         return ret;
> > }  
> 
> It's just that the initial MMIO access delay would be spread to the 1st
> access of each mmio page rather than being paid once as in the previous
> pre-fault scheme.  I think a userspace that cares enough about the delay
> should pre-fault all pages anyway, but just raising this up.  Otherwise
> looks sane.

Yep, this is a concern.  Is it safe to have concurrent loops fully
populating the same vma with vmf_insert_pfn()?  If it is then we could
just ignore that we're doing duplicate work when we hit this race
condition.  Otherwise we'd need to serialize again, perhaps via a lock
and flag stored in a struct linked from vm_private_data, along with
tracking to free that object :-\  Thanks,

Alex



* Re: [PATCH] vfio/pci: make the vfio_pci_mmap_fault reentrant
From: Peter Xu @ 2021-03-09 21:00 UTC
  To: Alex Williamson
  Cc: Jason Gunthorpe, Zengtao (B),
	Cornelia Huck, Kevin Tian, Andrew Morton, Giovanni Cabiddu,
	Michel Lespinasse, Jann Horn, Max Gurtovoy, kvm, linux-kernel,
	Linuxarm

On Tue, Mar 09, 2021 at 01:11:04PM -0700, Alex Williamson wrote:
> > It's just that the initial MMIO access delay would be spread to the 1st
> > access of each mmio page rather than being paid once as in the previous
> > pre-fault scheme.  I think a userspace that cares enough about the delay
> > should pre-fault all pages anyway, but just raising this up.  Otherwise
> > looks sane.
> 
> > Yep, this is a concern.  Is it safe to have concurrent loops fully
> > populating the same vma with vmf_insert_pfn()?

AFAIU it's safe, and probably the (so far) best way for a userspace to
quickly populate a huge chunk of mmap()ed region for either MMIO or RAM.
Indeed from that pov vmf_insert_pfn() seems to be even more efficient at
prefaulting since it can be threaded.

Thanks,

-- 
Peter Xu



* Re: [PATCH] vfio/pci: make the vfio_pci_mmap_fault reentrant
From: Alex Williamson @ 2021-03-09 21:43 UTC
  To: Peter Xu
  Cc: Jason Gunthorpe, Zengtao (B),
	Cornelia Huck, Kevin Tian, Andrew Morton, Giovanni Cabiddu,
	Michel Lespinasse, Jann Horn, Max Gurtovoy, kvm, linux-kernel,
	Linuxarm

On Tue, 9 Mar 2021 16:00:36 -0500
Peter Xu <peterx@redhat.com> wrote:

> On Tue, Mar 09, 2021 at 01:11:04PM -0700, Alex Williamson wrote:
> > > It's just that the initial MMIO access delay would be spread to the 1st
> > > access of each mmio page rather than being paid once as in the previous
> > > pre-fault scheme.  I think a userspace that cares enough about the delay
> > > should pre-fault all pages anyway, but just raising this up.  Otherwise
> > > looks sane.
> > 
> > Yep, this is a concern.  Is it safe to have concurrent loops fully
> > populating the same vma with vmf_insert_pfn()?
> 
> AFAIU it's safe, and probably the (so far) best way for a userspace to
> quickly populate a huge chunk of mmap()ed region for either MMIO or RAM.
> Indeed from that pov vmf_insert_pfn() seems to be even more efficient at
> prefaulting since it can be threaded.

Ok, then we'll keep the loop and expect that a race might incur
duplicate work, but should be safe.

It also occurred to me that Jason was suggesting the _prot version of
vmf_insert_pfn(), which I think is necessary if we want to keep the
same semantics where the default io_remap_pfn_range() was applying
pgprot_decrypted() onto vma->vm_page_prot.  So if we don't want to
break SME use cases we had better apply that ourselves, e.g. (untested
sketch, with pfn as in the fault handler earlier in this thread):
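
	/* Keep the decrypted prot that io_remap_pfn_range() would have
	 * applied through vma->vm_page_prot */
	ret = vmf_insert_pfn_prot(vma, vmf->address, pfn,
				  pgprot_decrypted(vma->vm_page_prot));

Thanks,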

Alex



* Re: [PATCH] vfio/pci: make the vfio_pci_mmap_fault reentrant
From: Jason Gunthorpe @ 2021-03-09 23:41 UTC
  To: Alex Williamson
  Cc: Peter Xu, Zengtao (B),
	Cornelia Huck, Kevin Tian, Andrew Morton, Giovanni Cabiddu,
	Michel Lespinasse, Jann Horn, Max Gurtovoy, kvm, linux-kernel,
	Linuxarm

On Tue, Mar 09, 2021 at 12:26:07PM -0700, Alex Williamson wrote:

> In the new series, I think the fault handler becomes (untested):
> 
> static vm_fault_t vfio_pci_mmap_fault(struct vm_fault *vmf)
> {
>         struct vm_area_struct *vma = vmf->vma;
>         struct vfio_pci_device *vdev = vma->vm_private_data;
>         unsigned long base_pfn, pgoff;
>         vm_fault_t ret = VM_FAULT_SIGBUS;
> 
>         if (vfio_pci_bar_vma_to_pfn(vma, &base_pfn))
>                 return ret;
> 
>         pgoff = (vmf->address - vma->vm_start) >> PAGE_SHIFT;

I don't think this math is completely safe; it needs to parse the
vm_pgoff..

I'm worried userspace could split/punch/mangle a VMA using
munmap/mremap/etc/etc in a way that does update the vm_pgoff but is
incompatible with the above.

Jason


* Re: [PATCH] vfio/pci: make the vfio_pci_mmap_fault reentrant
From: Jason Gunthorpe @ 2021-03-09 23:45 UTC
  To: Alex Williamson
  Cc: Peter Xu, Zengtao (B),
	Cornelia Huck, Kevin Tian, Andrew Morton, Giovanni Cabiddu,
	Michel Lespinasse, Jann Horn, Max Gurtovoy, kvm, linux-kernel,
	Linuxarm

On Tue, Mar 09, 2021 at 12:56:39PM -0700, Alex Williamson wrote:

> And I think this is what we end up with for the current code base:

Yeah, that looks Ok
 
> diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
> index 65e7e6b44578..2f247ab18c66 100644
> +++ b/drivers/vfio/pci/vfio_pci.c
> @@ -1568,19 +1568,24 @@ void vfio_pci_memory_unlock_and_restore(struct vfio_pci_device *vdev, u16 cmd)
>  }
>  
>  /* Caller holds vma_lock */
> -static int __vfio_pci_add_vma(struct vfio_pci_device *vdev,
> -			      struct vm_area_struct *vma)
> +static struct vfio_pci_mmap_vma *__vfio_pci_add_vma(struct vfio_pci_device *vdev,
> +						     struct vm_area_struct *vma)
>  {
>  	struct vfio_pci_mmap_vma *mmap_vma;
>  
> +	list_for_each_entry(mmap_vma, &vdev->vma_list, vma_next) {
> +		if (mmap_vma->vma == vma)
> +			return ERR_PTR(-EEXIST);
> +	}
> +
>  	mmap_vma = kmalloc(sizeof(*mmap_vma), GFP_KERNEL);
>  	if (!mmap_vma)
> -		return -ENOMEM;
> +		return ERR_PTR(-ENOMEM);
>  
>  	mmap_vma->vma = vma;
>  	list_add(&mmap_vma->vma_next, &vdev->vma_list);
>  
> -	return 0;
> +	return mmap_vma;
>  }
>  
>  /*
> @@ -1612,30 +1617,39 @@ static vm_fault_t vfio_pci_mmap_fault(struct vm_fault *vmf)
>  {
>  	struct vm_area_struct *vma = vmf->vma;
>  	struct vfio_pci_device *vdev = vma->vm_private_data;
> -	vm_fault_t ret = VM_FAULT_NOPAGE;
> +	struct vfio_pci_mmap_vma *mmap_vma;
> +	unsigned long vaddr, pfn;
> +	vm_fault_t ret;
>  
>  	mutex_lock(&vdev->vma_lock);
>  	down_read(&vdev->memory_lock);
>  
>  	if (!__vfio_pci_memory_enabled(vdev)) {
>  		ret = VM_FAULT_SIGBUS;
> -		mutex_unlock(&vdev->vma_lock);
>  		goto up_out;
>  	}
>  
> -	if (__vfio_pci_add_vma(vdev, vma)) {
> -		ret = VM_FAULT_OOM;
> -		mutex_unlock(&vdev->vma_lock);
> +	mmap_vma = __vfio_pci_add_vma(vdev, vma);
> +	if (IS_ERR(mmap_vma)) {
> +		/* A concurrent fault might have already inserted the page */
> +		ret = (PTR_ERR(mmap_vma) == -EEXIST) ? VM_FAULT_NOPAGE :
> +						       VM_FAULT_OOM;

I think -EEXIST should not be an error; let's just fall through to
vmf_insert_pfn() and let the MM resolve the race naturally.

I suspect returning VM_FAULT_NOPAGE will be adverse to userspace if
it hits this race??

Also the _prot variant does look needed, at least due to SME, but
possibly also to ensure NC gets set..
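
Putting those two together, roughly (a sketch only, untested, reusing
the names from the diff above):

	mmap_vma = __vfio_pci_add_vma(vdev, vma);
	if (IS_ERR(mmap_vma) && PTR_ERR(mmap_vma) != -EEXIST) {
		ret = VM_FAULT_OOM;
		goto up_out;
	}

	/*
	 * On -EEXIST we fall through: vmf_insert_pfn_prot() quietly
	 * leaves an already-present pte alone, so racing faults
	 * resolve themselves.  pgprot_decrypted() covers the SME
	 * case; whether more is needed for NC is an open question.
	 */
	for (vaddr = vma->vm_start; vaddr < vma->vm_end;
	     vaddr += PAGE_SIZE, pfn++) {
		ret = vmf_insert_pfn_prot(vma, vaddr, pfn,
				pgprot_decrypted(vma->vm_page_prot));
		if (ret != VM_FAULT_NOPAGE)
			break;
	}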

Jason


* Re: [PATCH] vfio/pci: make the vfio_pci_mmap_fault reentrant
  2021-03-09 23:41                 ` Jason Gunthorpe
@ 2021-03-10  6:08                   ` Alex Williamson
  2021-03-11  3:32                     ` Re: " Zengtao (B)
  0 siblings, 1 reply; 20+ messages in thread
From: Alex Williamson @ 2021-03-10  6:08 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Peter Xu, Zengtao (B),
	Cornelia Huck, Kevin Tian, Andrew Morton, Giovanni Cabiddu,
	Michel Lespinasse, Jann Horn, Max Gurtovoy, kvm, linux-kernel,
	Linuxarm

On Tue, 9 Mar 2021 19:41:27 -0400
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Tue, Mar 09, 2021 at 12:26:07PM -0700, Alex Williamson wrote:
> 
> > In the new series, I think the fault handler becomes (untested):
> > 
> > static vm_fault_t vfio_pci_mmap_fault(struct vm_fault *vmf)
> > {
> >         struct vm_area_struct *vma = vmf->vma;
> >         struct vfio_pci_device *vdev = vma->vm_private_data;
> >         unsigned long base_pfn, pgoff;
> >         vm_fault_t ret = VM_FAULT_SIGBUS;
> > 
> >         if (vfio_pci_bar_vma_to_pfn(vma, &base_pfn))
> >                 return ret;
> > 
> >         pgoff = (vmf->address - vma->vm_start) >> PAGE_SHIFT;  
> 
> I don't think this math is completely safe, it needs to parse the
> vm_pgoff..
> 
> I'm worried userspace could split/punch/mangle a VMA using
> > munmap/mremap/etc/etc in a way that does update the vm_pgoff but is
> incompatible with the above.

parsing vm_pgoff is done in:

static int vfio_pci_bar_vma_to_pfn(struct vm_area_struct *vma,
                                   unsigned long *pfn)
{
        struct vfio_pci_device *vdev = vma->vm_private_data;
        struct pci_dev *pdev = vdev->pdev;
        int index;
        u64 pgoff;

        index = vma->vm_pgoff >> (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT);

        if (index >= VFIO_PCI_ROM_REGION_INDEX ||
            !vdev->bar_mmap_supported[index] || !vdev->barmap[index])
                return -EINVAL;

        pgoff = vma->vm_pgoff &
                ((1U << (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT)) - 1);

        *pfn = (pci_resource_start(pdev, index) >> PAGE_SHIFT) + pgoff;

        return 0;
}

But given Peter's concern about faulting individual pages, I think the
fault handler becomes:

static vm_fault_t vfio_pci_mmap_fault(struct vm_fault *vmf)
{
        struct vm_area_struct *vma = vmf->vma;
        struct vfio_pci_device *vdev = vma->vm_private_data;
        unsigned long vaddr, pfn;
        vm_fault_t ret = VM_FAULT_SIGBUS;

        if (vfio_pci_bar_vma_to_pfn(vma, &pfn))
                return ret;

        down_read(&vdev->memory_lock);

        if (__vfio_pci_memory_enabled(vdev)) {
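                /*
                 * Populate the full vma rather than just the faulting
                 * page, per Peter's concern about per-page faulting.
                 */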
                for (vaddr = vma->vm_start;
                     vaddr < vma->vm_end; vaddr += PAGE_SIZE, pfn++) {
                        ret = vmf_insert_pfn_prot(vma, vaddr, pfn,
                                        pgprot_decrypted(vma->vm_page_prot));
                        if (ret != VM_FAULT_NOPAGE) {
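                                /* Unwind any partially populated range */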
                                zap_vma_ptes(vma, vma->vm_start,
                                             vaddr - vma->vm_start);
                                break;
                        }
                }
        }

        up_read(&vdev->memory_lock);

        return ret;
}

Thanks,
Alex



* Re: [PATCH] vfio/pci: make the vfio_pci_mmap_fault reentrant
  2021-03-09 23:45                   ` Jason Gunthorpe
@ 2021-03-10  6:23                     ` Alex Williamson
  0 siblings, 0 replies; 20+ messages in thread
From: Alex Williamson @ 2021-03-10  6:23 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Peter Xu, Zengtao (B),
	Cornelia Huck, Kevin Tian, Andrew Morton, Giovanni Cabiddu,
	Michel Lespinasse, Jann Horn, Max Gurtovoy, kvm, linux-kernel,
	Linuxarm

On Tue, 9 Mar 2021 19:45:03 -0400
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Tue, Mar 09, 2021 at 12:56:39PM -0700, Alex Williamson wrote:
> 
> > And I think this is what we end up with for the current code base:  
> 
> Yeah, that looks Ok
>  
> > diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
> > index 65e7e6b44578..2f247ab18c66 100644
> > +++ b/drivers/vfio/pci/vfio_pci.c
> > @@ -1568,19 +1568,24 @@ void vfio_pci_memory_unlock_and_restore(struct vfio_pci_device *vdev, u16 cmd)
> >  }
> >  
> >  /* Caller holds vma_lock */
> > -static int __vfio_pci_add_vma(struct vfio_pci_device *vdev,
> > -			      struct vm_area_struct *vma)
> > +struct vfio_pci_mmap_vma *__vfio_pci_add_vma(struct vfio_pci_device *vdev,
> > +					     struct vm_area_struct *vma)
> >  {
> >  	struct vfio_pci_mmap_vma *mmap_vma;
> >  
> > +	list_for_each_entry(mmap_vma, &vdev->vma_list, vma_next) {
> > +		if (mmap_vma->vma == vma)
> > +			return ERR_PTR(-EEXIST);
> > +	}
> > +
> >  	mmap_vma = kmalloc(sizeof(*mmap_vma), GFP_KERNEL);
> >  	if (!mmap_vma)
> > -		return -ENOMEM;
> > +		return ERR_PTR(-ENOMEM);
> >  
> >  	mmap_vma->vma = vma;
> >  	list_add(&mmap_vma->vma_next, &vdev->vma_list);
> >  
> > -	return 0;
> > +	return mmap_vma;
> >  }
> >  
> >  /*
> > @@ -1612,30 +1617,39 @@ static vm_fault_t vfio_pci_mmap_fault(struct vm_fault *vmf)
> >  {
> >  	struct vm_area_struct *vma = vmf->vma;
> >  	struct vfio_pci_device *vdev = vma->vm_private_data;
> > -	vm_fault_t ret = VM_FAULT_NOPAGE;
> > +	struct vfio_pci_mmap_vma *mmap_vma;
> > +	unsigned long vaddr, pfn;
> > +	vm_fault_t ret;
> >  
> >  	mutex_lock(&vdev->vma_lock);
> >  	down_read(&vdev->memory_lock);
> >  
> >  	if (!__vfio_pci_memory_enabled(vdev)) {
> >  		ret = VM_FAULT_SIGBUS;
> > -		mutex_unlock(&vdev->vma_lock);
> >  		goto up_out;
> >  	}
> >  
> > -	if (__vfio_pci_add_vma(vdev, vma)) {
> > -		ret = VM_FAULT_OOM;
> > -		mutex_unlock(&vdev->vma_lock);
> > +	mmap_vma = __vfio_pci_add_vma(vdev, vma);
> > +	if (IS_ERR(mmap_vma)) {
> > +		/* A concurrent fault might have already inserted the page */
> > +		ret = (PTR_ERR(mmap_vma) == -EEXIST) ? VM_FAULT_NOPAGE :
> > +						       VM_FAULT_OOM;  
> 
> I think -EEXIST should not be an error; let's just fall through to
> vmf_insert_pfn() and let the MM resolve the race naturally.
> 
> I suspect returning VM_FAULT_NOPAGE will be adverse to userspace if
> it hits this race??

Given the serialization on vma_lock, if the vma_list entry exists then
the full vma should already be populated, so I don't see the NOPAGE
issue you're worried about.  However, if we wanted to be more similar
to what we expect the new version to do, we could proceed through
re-inserting the pages on -EEXIST.  Zeng Tao's re-ordering to add the
vma_list entry only after successfully inserting all the pfns might work
better for that.
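
Roughly, a sketch only (untested, assuming the vfio_pci_bar_vma_to_pfn()
helper quoted elsewhere in the thread):

static vm_fault_t vfio_pci_mmap_fault(struct vm_fault *vmf)
{
	struct vm_area_struct *vma = vmf->vma;
	struct vfio_pci_device *vdev = vma->vm_private_data;
	unsigned long vaddr, pfn;
	vm_fault_t ret = VM_FAULT_NOPAGE;

	if (vfio_pci_bar_vma_to_pfn(vma, &pfn))
		return VM_FAULT_SIGBUS;

	mutex_lock(&vdev->vma_lock);
	down_read(&vdev->memory_lock);

	if (!__vfio_pci_memory_enabled(vdev)) {
		ret = VM_FAULT_SIGBUS;
		goto up_out;
	}

	for (vaddr = vma->vm_start; vaddr < vma->vm_end;
	     vaddr += PAGE_SIZE, pfn++) {
		ret = vmf_insert_pfn_prot(vma, vaddr, pfn,
				pgprot_decrypted(vma->vm_page_prot));
		if (ret != VM_FAULT_NOPAGE) {
			zap_vma_ptes(vma, vma->vm_start,
				     vaddr - vma->vm_start);
			goto up_out;
		}
	}

	/*
	 * Track the vma only after every pfn is in place, so a failure
	 * above never leaves a stale vma_list entry; -EEXIST from a
	 * racing fault is benign here.
	 */
	if (PTR_ERR_OR_ZERO(__vfio_pci_add_vma(vdev, vma)) == -ENOMEM) {
		/* Untracked ptes would be missed by a later zap; undo */
		zap_vma_ptes(vma, vma->vm_start, vma->vm_end - vma->vm_start);
		ret = VM_FAULT_OOM;
	}

up_out:
	up_read(&vdev->memory_lock);
	mutex_unlock(&vdev->vma_lock);
	return ret;
}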
 
> Also the _prot variant does look needed, at least due to SME, but
> possibly also to ensure NC gets set..

If we need more than pgprot_decrypted(vma->vm_page_prot), please let us
know, but that's all we were getting from io_remap_pfn_range() afaict.
Thanks,

Alex



* Re: [PATCH] vfio/pci: make the vfio_pci_mmap_fault reentrant
  2021-03-10  6:08                   ` Alex Williamson
@ 2021-03-11  3:32                     ` Zengtao (B)
  0 siblings, 0 replies; 20+ messages in thread
From: Zengtao (B) @ 2021-03-11  3:32 UTC (permalink / raw)
  To: Alex Williamson, Jason Gunthorpe
  Cc: Peter Xu, Cornelia Huck, Kevin Tian, Andrew Morton,
	Giovanni Cabiddu, Michel Lespinasse, Jann Horn, Max Gurtovoy,
	kvm, linux-kernel, Linuxarm

Hi Alex,

> -----Original Message-----
> From: Alex Williamson [mailto:alex.williamson@redhat.com]
> Sent: March 10, 2021 14:09
> To: Jason Gunthorpe <jgg@nvidia.com>
> Cc: Peter Xu <peterx@redhat.com>; Zengtao (B) <prime.zeng@hisilicon.com>;
> Cornelia Huck <cohuck@redhat.com>; Kevin Tian <kevin.tian@intel.com>;
> Andrew Morton <akpm@linux-foundation.org>; Giovanni Cabiddu
> <giovanni.cabiddu@intel.com>; Michel Lespinasse <walken@google.com>; Jann
> Horn <jannh@google.com>; Max Gurtovoy <mgurtovoy@nvidia.com>;
> kvm@vger.kernel.org; linux-kernel@vger.kernel.org; Linuxarm
> <linuxarm@huawei.com>
> Subject: Re: [PATCH] vfio/pci: make the vfio_pci_mmap_fault reentrant
> 
> On Tue, 9 Mar 2021 19:41:27 -0400
> Jason Gunthorpe <jgg@nvidia.com> wrote:
> 
> > On Tue, Mar 09, 2021 at 12:26:07PM -0700, Alex Williamson wrote:
> >
> > > In the new series, I think the fault handler becomes (untested):
> > >
> > > static vm_fault_t vfio_pci_mmap_fault(struct vm_fault *vmf)
> > > {
> > >         struct vm_area_struct *vma = vmf->vma;
> > >         struct vfio_pci_device *vdev = vma->vm_private_data;
> > >         unsigned long base_pfn, pgoff;
> > >         vm_fault_t ret = VM_FAULT_SIGBUS;
> > >
> > >         if (vfio_pci_bar_vma_to_pfn(vma, &base_pfn))
> > >                 return ret;
> > >
> > >         pgoff = (vmf->address - vma->vm_start) >> PAGE_SHIFT;
> >
> > I don't think this math is completely safe, it needs to parse the
> > vm_pgoff..
> >
> > I'm worried userspace could split/punch/mangle a VMA using
> > munmap/mremap/etc/etc in a way that does update the vm_pgoff but is
> > incompatible with the above.
> 
> parsing vm_pgoff is done in:
> 
> static int vfio_pci_bar_vma_to_pfn(struct vm_area_struct *vma,
>                                    unsigned long *pfn)
> {
>         struct vfio_pci_device *vdev = vma->vm_private_data;
>         struct pci_dev *pdev = vdev->pdev;
>         int index;
>         u64 pgoff;
> 
>         index = vma->vm_pgoff >> (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT);
> 
>         if (index >= VFIO_PCI_ROM_REGION_INDEX ||
>             !vdev->bar_mmap_supported[index] || !vdev->barmap[index])
>                 return -EINVAL;
> 
>         pgoff = vma->vm_pgoff &
>                 ((1U << (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT)) - 1);
> 
>         *pfn = (pci_resource_start(pdev, index) >> PAGE_SHIFT) + pgoff;
> 
>         return 0;
> }
> 
> But given Peter's concern about faulting individual pages, I think the fault handler
> becomes:
> 
> static vm_fault_t vfio_pci_mmap_fault(struct vm_fault *vmf)
> {
>         struct vm_area_struct *vma = vmf->vma;
>         struct vfio_pci_device *vdev = vma->vm_private_data;
>         unsigned long vaddr, pfn;
>         vm_fault_t ret = VM_FAULT_SIGBUS;
> 
>         if (vfio_pci_bar_vma_to_pfn(vma, &pfn))
>                 return ret;
> 
>         down_read(&vdev->memory_lock);
> 
>         if (__vfio_pci_memory_enabled(vdev)) {
>                 for (vaddr = vma->vm_start;
>                      vaddr < vma->vm_end; vaddr += PAGE_SIZE, pfn++) {
One concern here is performance: since you are doing the mapping for
the whole vma, what about using block mappings where applicable?
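
For example, something like the below rough sketch (untested; it assumes
a vm_ops->huge_fault style entry point, and that the hypothetical
alignment checks shown here are sufficient):

#ifdef CONFIG_TRANSPARENT_HUGEPAGE
	/* Rough sketch, untested: attempt one PMD-sized block mapping
	 * instead of 512 individual ptes when address, pfn and the
	 * remaining vma are all PMD-aligned, falling back to the pte
	 * loop otherwise.
	 */
	if (IS_ALIGNED(vmf->address, PMD_SIZE) &&
	    IS_ALIGNED(PFN_PHYS(pfn), PMD_SIZE) &&
	    vmf->address + PMD_SIZE <= vma->vm_end)
		ret = vmf_insert_pfn_pmd(vmf, __pfn_to_pfn_t(pfn, PFN_DEV),
					 vmf->flags & FAULT_FLAG_WRITE);
#endif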

>                         ret = vmf_insert_pfn_prot(vma, vaddr, pfn,
>                                         pgprot_decrypted(vma->vm_page_prot));
>                         if (ret != VM_FAULT_NOPAGE) {
>                                 zap_vma_ptes(vma, vma->vm_start,
>                                              vaddr - vma->vm_start);
>                                 break;
>                         }
>                 }
>         }
> 
>         up_read(&vdev->memory_lock);
> 
>         return ret;
> }
> 
> Thanks,
> Alex


end of thread, other threads:[~2021-03-11  3:33 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-03-08 11:11 [PATCH] vfio/pci: make the vfio_pci_mmap_fault reentrant Zeng Tao
2021-03-08 20:21 ` Alex Williamson
2021-03-08 22:56   ` Peter Xu
2021-03-09  3:49     ` Re: " Zengtao (B)
2021-03-09 12:46       ` Jason Gunthorpe
2021-03-09 15:29         ` Alex Williamson
2021-03-09 16:40           ` Jason Gunthorpe
2021-03-09 18:47             ` Peter Xu
2021-03-09 19:26               ` Alex Williamson
2021-03-09 19:48                 ` Peter Xu
2021-03-09 20:11                   ` Alex Williamson
2021-03-09 21:00                     ` Peter Xu
2021-03-09 21:43                       ` Alex Williamson
2021-03-09 19:56                 ` Alex Williamson
2021-03-09 23:45                   ` Jason Gunthorpe
2021-03-10  6:23                     ` Alex Williamson
2021-03-09 23:41                 ` Jason Gunthorpe
2021-03-10  6:08                   ` Alex Williamson
2021-03-11  3:32                     ` Re: " Zengtao (B)
2021-03-08 23:43   ` Jason Gunthorpe
