* [PATCH] vfio: remove useless judgement @ 2022-06-27 3:51 lizhe.67 2022-06-27 22:06 ` Alex Williamson 2022-06-30 19:51 ` Alex Williamson 0 siblings, 2 replies; 7+ messages in thread From: lizhe.67 @ 2022-06-27 3:51 UTC (permalink / raw) To: alex.williamson, cohuck, jgg; +Cc: lizhe.67, kvm, linux-kernel, lizefan.x From: Li Zhe <lizhe.67@bytedance.com> In function vfio_dma_do_unmap(), we currently prevent process to unmap vfio dma region whose mm_struct is different from the vfio_dma->task. In our virtual machine scenario which is using kvm and qemu, this judgement stops us from liveupgrading our qemu, which uses fork() && exec() to load the new binary but the new process cannot do the VFIO_IOMMU_UNMAP_DMA action during vm exit because of this judgement. This judgement is added in commit 8f0d5bb95f76 ("vfio iommu type1: Add task structure to vfio_dma") for the security reason. But it seems that no other task who has no family relationship with old and new process can get the same vfio_dma struct here for the reason of resource isolation. So this patch delete it. Signed-off-by: Li Zhe <lizhe.67@bytedance.com> Reviewed-by: Jason Gunthorpe <jgg@ziepe.ca> --- drivers/vfio/vfio_iommu_type1.c | 6 ------ 1 file changed, 6 deletions(-) diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c index c13b9290e357..a8ff00dad834 100644 --- a/drivers/vfio/vfio_iommu_type1.c +++ b/drivers/vfio/vfio_iommu_type1.c @@ -1377,12 +1377,6 @@ static int vfio_dma_do_unmap(struct vfio_iommu *iommu, if (!iommu->v2 && iova > dma->iova) break; - /* - * Task with same address space who mapped this iova range is - * allowed to unmap the iova range. - */ - if (dma->task->mm != current->mm) - break; if (invalidate_vaddr) { if (dma->vaddr_invalid) { -- 2.20.1 ^ permalink raw reply related [flat|nested] 7+ messages in thread
* Re: [PATCH] vfio: remove useless judgement 2022-06-27 3:51 [PATCH] vfio: remove useless judgement lizhe.67 @ 2022-06-27 22:06 ` Alex Williamson 2022-06-28 12:48 ` Steven Sistare 2022-06-30 19:51 ` Alex Williamson 1 sibling, 1 reply; 7+ messages in thread From: Alex Williamson @ 2022-06-27 22:06 UTC (permalink / raw) To: Steve Sistare; +Cc: lizhe.67, cohuck, jgg, kvm, linux-kernel, lizefan.x Hey Steve, how did you get around this for cpr or is this a gap? Thanks, Alex On Mon, 27 Jun 2022 11:51:09 +0800 lizhe.67@bytedance.com wrote: > From: Li Zhe <lizhe.67@bytedance.com> > > In function vfio_dma_do_unmap(), we currently prevent process to unmap > vfio dma region whose mm_struct is different from the vfio_dma->task. > In our virtual machine scenario which is using kvm and qemu, this > judgement stops us from liveupgrading our qemu, which uses fork() && > exec() to load the new binary but the new process cannot do the > VFIO_IOMMU_UNMAP_DMA action during vm exit because of this judgement. > > This judgement is added in commit 8f0d5bb95f76 ("vfio iommu type1: Add > task structure to vfio_dma") for the security reason. But it seems that > no other task who has no family relationship with old and new process > can get the same vfio_dma struct here for the reason of resource > isolation. So this patch delete it. > > Signed-off-by: Li Zhe <lizhe.67@bytedance.com> > Reviewed-by: Jason Gunthorpe <jgg@ziepe.ca> > --- > drivers/vfio/vfio_iommu_type1.c | 6 ------ > 1 file changed, 6 deletions(-) > > diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c > index c13b9290e357..a8ff00dad834 100644 > --- a/drivers/vfio/vfio_iommu_type1.c > +++ b/drivers/vfio/vfio_iommu_type1.c > @@ -1377,12 +1377,6 @@ static int vfio_dma_do_unmap(struct vfio_iommu *iommu, > > if (!iommu->v2 && iova > dma->iova) > break; > - /* > - * Task with same address space who mapped this iova range is > - * allowed to unmap the iova range. > - */ > - if (dma->task->mm != current->mm) > - break; > > if (invalidate_vaddr) { > if (dma->vaddr_invalid) { ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH] vfio: remove useless judgement 2022-06-27 22:06 ` Alex Williamson @ 2022-06-28 12:48 ` Steven Sistare 2022-06-28 13:03 ` Jason Gunthorpe 0 siblings, 1 reply; 7+ messages in thread From: Steven Sistare @ 2022-06-28 12:48 UTC (permalink / raw) To: Alex Williamson; +Cc: lizhe.67, cohuck, jgg, kvm, linux-kernel, lizefan.x For cpr, old qemu directly exec's new qemu, so task does not change. To support fork+exec, the ownership test needs to be deleted or modified. Pinned page accounting is another issue, as the parent counts pins in its mm->locked_vm. If the child unmaps, it cannot simply decrement its own mm->locked_vm counter. As you and I have discussed, the count is also wrong in the direct exec model, because exec clears mm->locked_vm. I am thinking vfio could count pins in struct user locked_vm to handle both models. The user struct and its count would persist across direct exec, and be shared by parent and child for fork+exec. However, that does change the RLIMIT_MEMLOCK value that applications must set, because the limit must accommodate vfio plus other sub-systems that count in user->locked_vm, which includes io_uring, skbuff, xdp, and perf. Plus, the limit must accommodate all processes of that user, not just a single process. Folks like fork+exec because it allows recovery if the new qemu process fails to initialize. One can fall back to the original process, if the above issues are fixed. - Steve On 6/27/2022 6:06 PM, Alex Williamson wrote: > > Hey Steve, how did you get around this for cpr or is this a gap? > Thanks, > > Alex > > On Mon, 27 Jun 2022 11:51:09 +0800 > lizhe.67@bytedance.com wrote: > >> From: Li Zhe <lizhe.67@bytedance.com> >> >> In function vfio_dma_do_unmap(), we currently prevent process to unmap >> vfio dma region whose mm_struct is different from the vfio_dma->task. >> In our virtual machine scenario which is using kvm and qemu, this >> judgement stops us from liveupgrading our qemu, which uses fork() && >> exec() to load the new binary but the new process cannot do the >> VFIO_IOMMU_UNMAP_DMA action during vm exit because of this judgement. >> >> This judgement is added in commit 8f0d5bb95f76 ("vfio iommu type1: Add >> task structure to vfio_dma") for the security reason. But it seems that >> no other task who has no family relationship with old and new process >> can get the same vfio_dma struct here for the reason of resource >> isolation. So this patch delete it. >> >> Signed-off-by: Li Zhe <lizhe.67@bytedance.com> >> Reviewed-by: Jason Gunthorpe <jgg@ziepe.ca> >> --- >> drivers/vfio/vfio_iommu_type1.c | 6 ------ >> 1 file changed, 6 deletions(-) >> >> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c >> index c13b9290e357..a8ff00dad834 100644 >> --- a/drivers/vfio/vfio_iommu_type1.c >> +++ b/drivers/vfio/vfio_iommu_type1.c >> @@ -1377,12 +1377,6 @@ static int vfio_dma_do_unmap(struct vfio_iommu *iommu, >> >> if (!iommu->v2 && iova > dma->iova) >> break; >> - /* >> - * Task with same address space who mapped this iova range is >> - * allowed to unmap the iova range. >> - */ >> - if (dma->task->mm != current->mm) >> - break; >> >> if (invalidate_vaddr) { >> if (dma->vaddr_invalid) { > ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH] vfio: remove useless judgement 2022-06-28 12:48 ` Steven Sistare @ 2022-06-28 13:03 ` Jason Gunthorpe 2022-06-28 13:54 ` Steven Sistare 0 siblings, 1 reply; 7+ messages in thread From: Jason Gunthorpe @ 2022-06-28 13:03 UTC (permalink / raw) To: Steven Sistare Cc: Alex Williamson, lizhe.67, cohuck, kvm, linux-kernel, lizefan.x On Tue, Jun 28, 2022 at 08:48:11AM -0400, Steven Sistare wrote: > For cpr, old qemu directly exec's new qemu, so task does not change. > > To support fork+exec, the ownership test needs to be deleted or modified. > > Pinned page accounting is another issue, as the parent counts pins in its > mm->locked_vm. If the child unmaps, it cannot simply decrement its own > mm->locked_vm counter. It is fine already: mm = async ? get_task_mm(dma->task) : dma->task->mm; if (!mm) return -ESRCH; /* process exited */ ret = mmap_write_lock_killable(mm); if (!ret) { ret = __account_locked_vm(mm, abs(npage), npage > 0, dma->task, dma->lock_cap); Each 'dma' already stores a pointer to the mm that sourced it and only manipulates the counter in that mm. AFAICT 'current' is not used during unmap. > As you and I have discussed, the count is also wrong in the direct > exec model, because exec clears mm->locked_vm. Really? Yikes, I thought exec would generate a new mm? > I am thinking vfio could count pins in struct user locked_vm to handle both > models. The user struct and its count would persist across direct exec, > and be shared by parent and child for fork+exec. However, that does change > the RLIMIT_MEMLOCK value that applications must set, because the limit must > accommodate vfio plus other sub-systems that count in user->locked_vm, which > includes io_uring, skbuff, xdp, and perf. Plus, the limit must accommodate all > processes of that user, not just a single process. We discussed this, for iommufd we are currently planning to go this way and will See How it Goes. Jason ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH] vfio: remove useless judgement 2022-06-28 13:03 ` Jason Gunthorpe @ 2022-06-28 13:54 ` Steven Sistare 2022-06-28 14:04 ` Jason Gunthorpe 0 siblings, 1 reply; 7+ messages in thread From: Steven Sistare @ 2022-06-28 13:54 UTC (permalink / raw) To: Jason Gunthorpe Cc: Alex Williamson, lizhe.67, cohuck, kvm, linux-kernel, lizefan.x On 6/28/2022 9:03 AM, Jason Gunthorpe wrote: > On Tue, Jun 28, 2022 at 08:48:11AM -0400, Steven Sistare wrote: >> For cpr, old qemu directly exec's new qemu, so task does not change. >> >> To support fork+exec, the ownership test needs to be deleted or modified. >> >> Pinned page accounting is another issue, as the parent counts pins in its >> mm->locked_vm. If the child unmaps, it cannot simply decrement its own >> mm->locked_vm counter. > > It is fine already: > > mm = async ? get_task_mm(dma->task) : dma->task->mm; > if (!mm) > return -ESRCH; /* process exited */ > > ret = mmap_write_lock_killable(mm); > if (!ret) { > ret = __account_locked_vm(mm, abs(npage), npage > 0, dma->task, > dma->lock_cap); > > Each 'dma' already stores a pointer to the mm that sourced it and only > manipulates the counter in that mm. AFAICT 'current' is not used > during unmap. Ah yes, no problem then. Limits become looser, though, as the child can pin an additional RLIMIT_MEMLOCK of pages. That is the natural consequence of mm->locked_vm being a per process limit, but probably not what the application wants. Another argument for switching to user->locked_vm. >> As you and I have discussed, the count is also wrong in the direct >> exec model, because exec clears mm->locked_vm. > > Really? Yikes, I thought exec would generate a new mm? Yes, exec creates a new mm with locked_vm = 0. The old locked_vm count is dropped on the floor. The existing dma points to the same task, but task->mm has changed, and dma->task->mm->locked_vm is 0. An unmap ioctl drives it negative. I have prototyped a few possible fixes. One changes vfio to use user->locked_vm. Another changes to mm->pinned_vm and preserves it during exec. A third preserves mm->locked_vm across exec, but that is not practical, because mm->locked_vm mixes vfio pins and mlocks. The mlock component must be cleared during exec, and we don't have a separate count for it. >> I am thinking vfio could count pins in struct user locked_vm to handle both >> models. The user struct and its count would persist across direct exec, >> and be shared by parent and child for fork+exec. However, that does change >> the RLIMIT_MEMLOCK value that applications must set, because the limit must >> accommodate vfio plus other sub-systems that count in user->locked_vm, which >> includes io_uring, skbuff, xdp, and perf. Plus, the limit must accommodate all >> processes of that user, not just a single process. > > We discussed this, for iommufd we are currently planning to go this > way and will See How it Goes. Yes, I have followed that thread with interest. - Steve ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH] vfio: remove useless judgement 2022-06-28 13:54 ` Steven Sistare @ 2022-06-28 14:04 ` Jason Gunthorpe 0 siblings, 0 replies; 7+ messages in thread From: Jason Gunthorpe @ 2022-06-28 14:04 UTC (permalink / raw) To: Steven Sistare Cc: Alex Williamson, lizhe.67, cohuck, kvm, linux-kernel, lizefan.x On Tue, Jun 28, 2022 at 09:54:19AM -0400, Steven Sistare wrote: > >> As you and I have discussed, the count is also wrong in the direct > >> exec model, because exec clears mm->locked_vm. > > > > Really? Yikes, I thought exec would generate a new mm? > > Yes, exec creates a new mm with locked_vm = 0. The old locked_vm count is dropped > on the floor. The existing dma points to the same task, but task->mm has changed, > and dma->task->mm->locked_vm is 0. An unmap ioctl drives it > negative. Oh.. This is probably a bug, vfio should never use task->mm, the mm itself should be held using mmgrab instead. Otherwise exec case is broken as you describe. > I have prototyped a few possible fixes. One changes vfio to use user->locked_vm. > Another changes to mm->pinned_vm and preserves it during exec. A third preserves > mm->locked_vm across exec, but that is not practical, because mm->locked_vm mixes > vfio pins and mlocks. The mlock component must be cleared during exec, and we don't > have a separate count for it. Lossing locked_vm on exec/fork is the correct and expected behavior for the core kernel code, the bug is that vfio drives it negative. Jason ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH] vfio: remove useless judgement 2022-06-27 3:51 [PATCH] vfio: remove useless judgement lizhe.67 2022-06-27 22:06 ` Alex Williamson @ 2022-06-30 19:51 ` Alex Williamson 1 sibling, 0 replies; 7+ messages in thread From: Alex Williamson @ 2022-06-30 19:51 UTC (permalink / raw) To: lizhe.67; +Cc: cohuck, jgg, kvm, linux-kernel, lizefan.x On Mon, 27 Jun 2022 11:51:09 +0800 lizhe.67@bytedance.com wrote: > From: Li Zhe <lizhe.67@bytedance.com> > > In function vfio_dma_do_unmap(), we currently prevent process to unmap > vfio dma region whose mm_struct is different from the vfio_dma->task. > In our virtual machine scenario which is using kvm and qemu, this > judgement stops us from liveupgrading our qemu, which uses fork() && > exec() to load the new binary but the new process cannot do the > VFIO_IOMMU_UNMAP_DMA action during vm exit because of this judgement. > > This judgement is added in commit 8f0d5bb95f76 ("vfio iommu type1: Add > task structure to vfio_dma") for the security reason. But it seems that > no other task who has no family relationship with old and new process > can get the same vfio_dma struct here for the reason of resource > isolation. So this patch delete it. > > Signed-off-by: Li Zhe <lizhe.67@bytedance.com> > Reviewed-by: Jason Gunthorpe <jgg@ziepe.ca> > --- > drivers/vfio/vfio_iommu_type1.c | 6 ------ > 1 file changed, 6 deletions(-) Applied to vfio next branch for v5.20. Thanks, Alex > diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c > index c13b9290e357..a8ff00dad834 100644 > --- a/drivers/vfio/vfio_iommu_type1.c > +++ b/drivers/vfio/vfio_iommu_type1.c > @@ -1377,12 +1377,6 @@ static int vfio_dma_do_unmap(struct vfio_iommu *iommu, > > if (!iommu->v2 && iova > dma->iova) > break; > - /* > - * Task with same address space who mapped this iova range is > - * allowed to unmap the iova range. > - */ > - if (dma->task->mm != current->mm) > - break; > > if (invalidate_vaddr) { > if (dma->vaddr_invalid) { ^ permalink raw reply [flat|nested] 7+ messages in thread
end of thread, other threads:[~2022-06-30 19:52 UTC | newest] Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2022-06-27 3:51 [PATCH] vfio: remove useless judgement lizhe.67 2022-06-27 22:06 ` Alex Williamson 2022-06-28 12:48 ` Steven Sistare 2022-06-28 13:03 ` Jason Gunthorpe 2022-06-28 13:54 ` Steven Sistare 2022-06-28 14:04 ` Jason Gunthorpe 2022-06-30 19:51 ` Alex Williamson
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).