* [PATCH] vfio: remove useless judgement
@ 2022-06-27 3:51 lizhe.67
2022-06-27 22:06 ` Alex Williamson
2022-06-30 19:51 ` Alex Williamson
0 siblings, 2 replies; 7+ messages in thread
From: lizhe.67 @ 2022-06-27 3:51 UTC (permalink / raw)
To: alex.williamson, cohuck, jgg; +Cc: lizhe.67, kvm, linux-kernel, lizefan.x
From: Li Zhe <lizhe.67@bytedance.com>
In function vfio_dma_do_unmap(), we currently prevent process to unmap
vfio dma region whose mm_struct is different from the vfio_dma->task.
In our virtual machine scenario which is using kvm and qemu, this
judgement stops us from liveupgrading our qemu, which uses fork() &&
exec() to load the new binary but the new process cannot do the
VFIO_IOMMU_UNMAP_DMA action during vm exit because of this judgement.
This judgement is added in commit 8f0d5bb95f76 ("vfio iommu type1: Add
task structure to vfio_dma") for the security reason. But it seems that
no other task who has no family relationship with old and new process
can get the same vfio_dma struct here for the reason of resource
isolation. So this patch delete it.
Signed-off-by: Li Zhe <lizhe.67@bytedance.com>
Reviewed-by: Jason Gunthorpe <jgg@ziepe.ca>
---
drivers/vfio/vfio_iommu_type1.c | 6 ------
1 file changed, 6 deletions(-)
diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index c13b9290e357..a8ff00dad834 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -1377,12 +1377,6 @@ static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
if (!iommu->v2 && iova > dma->iova)
break;
- /*
- * Task with same address space who mapped this iova range is
- * allowed to unmap the iova range.
- */
- if (dma->task->mm != current->mm)
- break;
if (invalidate_vaddr) {
if (dma->vaddr_invalid) {
--
2.20.1
^ permalink raw reply related [flat|nested] 7+ messages in thread
* Re: [PATCH] vfio: remove useless judgement
2022-06-27 3:51 [PATCH] vfio: remove useless judgement lizhe.67
@ 2022-06-27 22:06 ` Alex Williamson
2022-06-28 12:48 ` Steven Sistare
2022-06-30 19:51 ` Alex Williamson
1 sibling, 1 reply; 7+ messages in thread
From: Alex Williamson @ 2022-06-27 22:06 UTC (permalink / raw)
To: Steve Sistare; +Cc: lizhe.67, cohuck, jgg, kvm, linux-kernel, lizefan.x
Hey Steve, how did you get around this for cpr or is this a gap?
Thanks,
Alex
On Mon, 27 Jun 2022 11:51:09 +0800
lizhe.67@bytedance.com wrote:
> From: Li Zhe <lizhe.67@bytedance.com>
>
> In function vfio_dma_do_unmap(), we currently prevent process to unmap
> vfio dma region whose mm_struct is different from the vfio_dma->task.
> In our virtual machine scenario which is using kvm and qemu, this
> judgement stops us from liveupgrading our qemu, which uses fork() &&
> exec() to load the new binary but the new process cannot do the
> VFIO_IOMMU_UNMAP_DMA action during vm exit because of this judgement.
>
> This judgement is added in commit 8f0d5bb95f76 ("vfio iommu type1: Add
> task structure to vfio_dma") for the security reason. But it seems that
> no other task who has no family relationship with old and new process
> can get the same vfio_dma struct here for the reason of resource
> isolation. So this patch delete it.
>
> Signed-off-by: Li Zhe <lizhe.67@bytedance.com>
> Reviewed-by: Jason Gunthorpe <jgg@ziepe.ca>
> ---
> drivers/vfio/vfio_iommu_type1.c | 6 ------
> 1 file changed, 6 deletions(-)
>
> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> index c13b9290e357..a8ff00dad834 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -1377,12 +1377,6 @@ static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
>
> if (!iommu->v2 && iova > dma->iova)
> break;
> - /*
> - * Task with same address space who mapped this iova range is
> - * allowed to unmap the iova range.
> - */
> - if (dma->task->mm != current->mm)
> - break;
>
> if (invalidate_vaddr) {
> if (dma->vaddr_invalid) {
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH] vfio: remove useless judgement
2022-06-27 22:06 ` Alex Williamson
@ 2022-06-28 12:48 ` Steven Sistare
2022-06-28 13:03 ` Jason Gunthorpe
0 siblings, 1 reply; 7+ messages in thread
From: Steven Sistare @ 2022-06-28 12:48 UTC (permalink / raw)
To: Alex Williamson; +Cc: lizhe.67, cohuck, jgg, kvm, linux-kernel, lizefan.x
For cpr, old qemu directly exec's new qemu, so task does not change.
To support fork+exec, the ownership test needs to be deleted or modified.
Pinned page accounting is another issue, as the parent counts pins in its
mm->locked_vm. If the child unmaps, it cannot simply decrement its own
mm->locked_vm counter. As you and I have discussed, the count is also
wrong in the direct exec model, because exec clears mm->locked_vm. I am
thinking vfio could count pins in struct user locked_vm to handle both
models. The user struct and its count would persist across direct exec,
and be shared by parent and child for fork+exec. However, that does change
the RLIMIT_MEMLOCK value that applications must set, because the limit must
accommodate vfio plus other sub-systems that count in user->locked_vm, which
includes io_uring, skbuff, xdp, and perf. Plus, the limit must accommodate all
processes of that user, not just a single process.
Folks like fork+exec because it allows recovery if the new qemu process fails to
initialize. One can fall back to the original process, if the above issues are fixed.
- Steve
On 6/27/2022 6:06 PM, Alex Williamson wrote:
>
> Hey Steve, how did you get around this for cpr or is this a gap?
> Thanks,
>
> Alex
>
> On Mon, 27 Jun 2022 11:51:09 +0800
> lizhe.67@bytedance.com wrote:
>
>> From: Li Zhe <lizhe.67@bytedance.com>
>>
>> In function vfio_dma_do_unmap(), we currently prevent process to unmap
>> vfio dma region whose mm_struct is different from the vfio_dma->task.
>> In our virtual machine scenario which is using kvm and qemu, this
>> judgement stops us from liveupgrading our qemu, which uses fork() &&
>> exec() to load the new binary but the new process cannot do the
>> VFIO_IOMMU_UNMAP_DMA action during vm exit because of this judgement.
>>
>> This judgement is added in commit 8f0d5bb95f76 ("vfio iommu type1: Add
>> task structure to vfio_dma") for the security reason. But it seems that
>> no other task who has no family relationship with old and new process
>> can get the same vfio_dma struct here for the reason of resource
>> isolation. So this patch delete it.
>>
>> Signed-off-by: Li Zhe <lizhe.67@bytedance.com>
>> Reviewed-by: Jason Gunthorpe <jgg@ziepe.ca>
>> ---
>> drivers/vfio/vfio_iommu_type1.c | 6 ------
>> 1 file changed, 6 deletions(-)
>>
>> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
>> index c13b9290e357..a8ff00dad834 100644
>> --- a/drivers/vfio/vfio_iommu_type1.c
>> +++ b/drivers/vfio/vfio_iommu_type1.c
>> @@ -1377,12 +1377,6 @@ static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
>>
>> if (!iommu->v2 && iova > dma->iova)
>> break;
>> - /*
>> - * Task with same address space who mapped this iova range is
>> - * allowed to unmap the iova range.
>> - */
>> - if (dma->task->mm != current->mm)
>> - break;
>>
>> if (invalidate_vaddr) {
>> if (dma->vaddr_invalid) {
>
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH] vfio: remove useless judgement
2022-06-28 12:48 ` Steven Sistare
@ 2022-06-28 13:03 ` Jason Gunthorpe
2022-06-28 13:54 ` Steven Sistare
0 siblings, 1 reply; 7+ messages in thread
From: Jason Gunthorpe @ 2022-06-28 13:03 UTC (permalink / raw)
To: Steven Sistare
Cc: Alex Williamson, lizhe.67, cohuck, kvm, linux-kernel, lizefan.x
On Tue, Jun 28, 2022 at 08:48:11AM -0400, Steven Sistare wrote:
> For cpr, old qemu directly exec's new qemu, so task does not change.
>
> To support fork+exec, the ownership test needs to be deleted or modified.
>
> Pinned page accounting is another issue, as the parent counts pins in its
> mm->locked_vm. If the child unmaps, it cannot simply decrement its own
> mm->locked_vm counter.
It is fine already:
mm = async ? get_task_mm(dma->task) : dma->task->mm;
if (!mm)
return -ESRCH; /* process exited */
ret = mmap_write_lock_killable(mm);
if (!ret) {
ret = __account_locked_vm(mm, abs(npage), npage > 0, dma->task,
dma->lock_cap);
Each 'dma' already stores a pointer to the mm that sourced it and only
manipulates the counter in that mm. AFAICT 'current' is not used
during unmap.
> As you and I have discussed, the count is also wrong in the direct
> exec model, because exec clears mm->locked_vm.
Really? Yikes, I thought exec would generate a new mm?
> I am thinking vfio could count pins in struct user locked_vm to handle both
> models. The user struct and its count would persist across direct exec,
> and be shared by parent and child for fork+exec. However, that does change
> the RLIMIT_MEMLOCK value that applications must set, because the limit must
> accommodate vfio plus other sub-systems that count in user->locked_vm, which
> includes io_uring, skbuff, xdp, and perf. Plus, the limit must accommodate all
> processes of that user, not just a single process.
We discussed this, for iommufd we are currently planning to go this
way and will See How it Goes.
Jason
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH] vfio: remove useless judgement
2022-06-28 13:03 ` Jason Gunthorpe
@ 2022-06-28 13:54 ` Steven Sistare
2022-06-28 14:04 ` Jason Gunthorpe
0 siblings, 1 reply; 7+ messages in thread
From: Steven Sistare @ 2022-06-28 13:54 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Alex Williamson, lizhe.67, cohuck, kvm, linux-kernel, lizefan.x
On 6/28/2022 9:03 AM, Jason Gunthorpe wrote:
> On Tue, Jun 28, 2022 at 08:48:11AM -0400, Steven Sistare wrote:
>> For cpr, old qemu directly exec's new qemu, so task does not change.
>>
>> To support fork+exec, the ownership test needs to be deleted or modified.
>>
>> Pinned page accounting is another issue, as the parent counts pins in its
>> mm->locked_vm. If the child unmaps, it cannot simply decrement its own
>> mm->locked_vm counter.
>
> It is fine already:
>
> mm = async ? get_task_mm(dma->task) : dma->task->mm;
> if (!mm)
> return -ESRCH; /* process exited */
>
> ret = mmap_write_lock_killable(mm);
> if (!ret) {
> ret = __account_locked_vm(mm, abs(npage), npage > 0, dma->task,
> dma->lock_cap);
>
> Each 'dma' already stores a pointer to the mm that sourced it and only
> manipulates the counter in that mm. AFAICT 'current' is not used
> during unmap.
Ah yes, no problem then.
Limits become looser, though, as the child can pin an additional RLIMIT_MEMLOCK
of pages. That is the natural consequence of mm->locked_vm being a per process limit,
but probably not what the application wants. Another argument for switching to
user->locked_vm.
>> As you and I have discussed, the count is also wrong in the direct
>> exec model, because exec clears mm->locked_vm.
>
> Really? Yikes, I thought exec would generate a new mm?
Yes, exec creates a new mm with locked_vm = 0. The old locked_vm count is dropped
on the floor. The existing dma points to the same task, but task->mm has changed,
and dma->task->mm->locked_vm is 0. An unmap ioctl drives it negative.
I have prototyped a few possible fixes. One changes vfio to use user->locked_vm.
Another changes to mm->pinned_vm and preserves it during exec. A third preserves
mm->locked_vm across exec, but that is not practical, because mm->locked_vm mixes
vfio pins and mlocks. The mlock component must be cleared during exec, and we don't
have a separate count for it.
>> I am thinking vfio could count pins in struct user locked_vm to handle both
>> models. The user struct and its count would persist across direct exec,
>> and be shared by parent and child for fork+exec. However, that does change
>> the RLIMIT_MEMLOCK value that applications must set, because the limit must
>> accommodate vfio plus other sub-systems that count in user->locked_vm, which
>> includes io_uring, skbuff, xdp, and perf. Plus, the limit must accommodate all
>> processes of that user, not just a single process.
>
> We discussed this, for iommufd we are currently planning to go this
> way and will See How it Goes.
Yes, I have followed that thread with interest.
- Steve
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH] vfio: remove useless judgement
2022-06-28 13:54 ` Steven Sistare
@ 2022-06-28 14:04 ` Jason Gunthorpe
0 siblings, 0 replies; 7+ messages in thread
From: Jason Gunthorpe @ 2022-06-28 14:04 UTC (permalink / raw)
To: Steven Sistare
Cc: Alex Williamson, lizhe.67, cohuck, kvm, linux-kernel, lizefan.x
On Tue, Jun 28, 2022 at 09:54:19AM -0400, Steven Sistare wrote:
> >> As you and I have discussed, the count is also wrong in the direct
> >> exec model, because exec clears mm->locked_vm.
> >
> > Really? Yikes, I thought exec would generate a new mm?
>
> Yes, exec creates a new mm with locked_vm = 0. The old locked_vm count is dropped
> on the floor. The existing dma points to the same task, but task->mm has changed,
> and dma->task->mm->locked_vm is 0. An unmap ioctl drives it
> negative.
Oh.. This is probably a bug, vfio should never use task->mm, the mm
itself should be held using mmgrab instead.
Otherwise exec case is broken as you describe.
> I have prototyped a few possible fixes. One changes vfio to use user->locked_vm.
> Another changes to mm->pinned_vm and preserves it during exec. A third preserves
> mm->locked_vm across exec, but that is not practical, because mm->locked_vm mixes
> vfio pins and mlocks. The mlock component must be cleared during exec, and we don't
> have a separate count for it.
Lossing locked_vm on exec/fork is the correct and expected behavior
for the core kernel code, the bug is that vfio drives it negative.
Jason
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH] vfio: remove useless judgement
2022-06-27 3:51 [PATCH] vfio: remove useless judgement lizhe.67
2022-06-27 22:06 ` Alex Williamson
@ 2022-06-30 19:51 ` Alex Williamson
1 sibling, 0 replies; 7+ messages in thread
From: Alex Williamson @ 2022-06-30 19:51 UTC (permalink / raw)
To: lizhe.67; +Cc: cohuck, jgg, kvm, linux-kernel, lizefan.x
On Mon, 27 Jun 2022 11:51:09 +0800
lizhe.67@bytedance.com wrote:
> From: Li Zhe <lizhe.67@bytedance.com>
>
> In function vfio_dma_do_unmap(), we currently prevent process to unmap
> vfio dma region whose mm_struct is different from the vfio_dma->task.
> In our virtual machine scenario which is using kvm and qemu, this
> judgement stops us from liveupgrading our qemu, which uses fork() &&
> exec() to load the new binary but the new process cannot do the
> VFIO_IOMMU_UNMAP_DMA action during vm exit because of this judgement.
>
> This judgement is added in commit 8f0d5bb95f76 ("vfio iommu type1: Add
> task structure to vfio_dma") for the security reason. But it seems that
> no other task who has no family relationship with old and new process
> can get the same vfio_dma struct here for the reason of resource
> isolation. So this patch delete it.
>
> Signed-off-by: Li Zhe <lizhe.67@bytedance.com>
> Reviewed-by: Jason Gunthorpe <jgg@ziepe.ca>
> ---
> drivers/vfio/vfio_iommu_type1.c | 6 ------
> 1 file changed, 6 deletions(-)
Applied to vfio next branch for v5.20. Thanks,
Alex
> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> index c13b9290e357..a8ff00dad834 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -1377,12 +1377,6 @@ static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
>
> if (!iommu->v2 && iova > dma->iova)
> break;
> - /*
> - * Task with same address space who mapped this iova range is
> - * allowed to unmap the iova range.
> - */
> - if (dma->task->mm != current->mm)
> - break;
>
> if (invalidate_vaddr) {
> if (dma->vaddr_invalid) {
^ permalink raw reply [flat|nested] 7+ messages in thread
end of thread, other threads:[~2022-06-30 19:52 UTC | newest]
Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-06-27 3:51 [PATCH] vfio: remove useless judgement lizhe.67
2022-06-27 22:06 ` Alex Williamson
2022-06-28 12:48 ` Steven Sistare
2022-06-28 13:03 ` Jason Gunthorpe
2022-06-28 13:54 ` Steven Sistare
2022-06-28 14:04 ` Jason Gunthorpe
2022-06-30 19:51 ` Alex Williamson
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).