Sorry for the typo in my previous email. Please read Adrian Reber* On 12/22/2021 8:49 PM, Bhardwaj, Rajneesh wrote: > > Adding Adrian Rebel who is the CRIU maintainer and CRIU list > > On 12/22/2021 3:53 PM, Daniel Vetter wrote: >> On Mon, Dec 20, 2021 at 01:12:51PM -0500, Bhardwaj, Rajneesh wrote: >>> On 12/20/2021 4:29 AM, Daniel Vetter wrote: >>>> On Fri, Dec 10, 2021 at 07:58:50AM +0100, Christian König wrote: >>>>> Am 09.12.21 um 19:28 schrieb Felix Kuehling: >>>>>> Am 2021-12-09 um 10:30 a.m. schrieb Christian König: >>>>>>> That still won't work. >>>>>>> >>>>>>> But I think we could do this change for the amdgpu mmap callback only. >>>>>> If graphics user mode has problems with it, we could even make this >>>>>> specific to KFD BOs in the amdgpu_gem_object_mmap callback. >>>>> I think it's fine for the whole amdgpu stack, my concern is more about >>>>> radeon, nouveau and the ARM stacks which are using this as well. >>>>> >>>>> That blew up so nicely the last time we tried to change it and I know of at >>>>> least one case where radeon was/is used with BOs in a child process. >>>> I'm way late and burried again, but I think it'd be good to be consistent > > > I had committed this change into our amd-staging-drm-next branch last > week after I got the ACK and RB from Felix and Christian. > > >>>> here across drivers. Or at least across drm drivers. And we've had the vma >>>> open/close refcounting to make fork work since forever. >>>> >>>> I think if we do this we should really only do this for mmap() where this >>>> applies, but reading through the thread here I'm honestly confused why >>>> this is a problem. If CRIU can't handle forked mmaps it needs to be >>>> thought that, not hacked around. Or at least I'm not understanding why >>>> this shouldn't work ... >>>> -Daniel >>>> >>> Hi Daniel >>> >>> In the v2 >>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flore.kernel.org%2Fall%2Fa1a865f5-ad2c-29c8-cbe4-2635d53eceb6%40amd.com%2FT%2F&data=04%7C01%7Crajneesh.bhardwaj%40amd.com%7Ce4634a16c37149da173408d9c58d1338%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637758031981907821%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&sdata=h0z4sO19bsJecMqeHGdz%2BHZElKuyzK%2BW%2FMbLWA79I10%3D&reserved=0 >>> I pretty much limited the scope of the change to KFD BOs on mmap. Regarding >>> CRIU, I think its not a CRIU problem as CRIU on restore, only tries to >>> recreate all the child processes and then mmaps all the VMAs it sees (as per >>> checkpoint snapshot) in the new process address space after the VMA >>> placements are finalized in the position independent code phase. Since the >>> inherited VMAs don't have access rights the criu mmap fails. >> Still sounds funky. I think minimally we should have an ack from CRIU >> developers that this is officially the right way to solve this problem. I >> really don't want to have random one-off hacks that don't work across the >> board, for a problem where we (drm subsystem) really shouldn't be the only >> one with this problem. Where "this problem" means that the mmap space is >> per file description, and not per underlying inode or real device or >> whatever. That part sounds like a CRIU problem, and I expect CRIU folks >> want a consistent solution across the board for this. Hence please grab an >> ack from them. >> >> Cheers, Daniel > > > Maybe Adrian can share his views on this. > > Hi Adrian - For the context, on CRIU restore we see mmap failures ( in > PIE restore phase) due to permission issues on the (render node) VMAs > that were inherited since the application that check pointed had > forked.  The VMAs ideally should not be in the child process but the > smaps file shows these VMAs in the child address space. We didn't want > to use madvise to avoid this copy and rather change in the kernel mode > to limit the impact to our user space library thunk. Based on my > understanding, during PIE restore phase, after the VMA placements are > finalized, CRIU does a sys_mmap on all the VMA it sees in the VmaEntry > list and I think its not an issue as per CRIU design but do you think > we could handle this corner case better inside CRIU? > > >>> Regards, >>> >>> Rajneesh >>> >>>>> Regards, >>>>> Christian. >>>>> >>>>>> Regards, >>>>>>   Felix >>>>>> >>>>>> >>>>>>> Regards, >>>>>>> Christian. >>>>>>> >>>>>>> Am 09.12.21 um 16:29 schrieb Bhardwaj, Rajneesh: >>>>>>>> Sounds good. I will send a v2 with only ttm_bo_mmap_obj change. Thank >>>>>>>> you! >>>>>>>> >>>>>>>> On 12/9/2021 10:27 AM, Christian König wrote: >>>>>>>>> Hi Rajneesh, >>>>>>>>> >>>>>>>>> yes, separating this from the drm_gem_mmap_obj() change is certainly >>>>>>>>> a good idea. >>>>>>>>> >>>>>>>>>> The child cannot access the BOs mapped by the parent anyway with >>>>>>>>>> access restrictions applied >>>>>>>>> exactly that is not correct. That behavior is actively used by some >>>>>>>>> userspace stacks as far as I know. >>>>>>>>> >>>>>>>>> Regards, >>>>>>>>> Christian. >>>>>>>>> >>>>>>>>> Am 09.12.21 um 16:23 schrieb Bhardwaj, Rajneesh: >>>>>>>>>> Thanks Christian. Would it make it less intrusive if I just use the >>>>>>>>>> flag for ttm bo mmap and remove the drm_gem_mmap_obj change from >>>>>>>>>> this patch? For our use case, just the ttm_bo_mmap_obj change >>>>>>>>>> should suffice and we don't want to put any more work arounds in >>>>>>>>>> the user space (thunk, in our case). >>>>>>>>>> >>>>>>>>>> The child cannot access the BOs mapped by the parent anyway with >>>>>>>>>> access restrictions applied so I wonder why even inherit the vma? >>>>>>>>>> >>>>>>>>>> On 12/9/2021 2:54 AM, Christian König wrote: >>>>>>>>>>> Am 08.12.21 um 21:53 schrieb Rajneesh Bhardwaj: >>>>>>>>>>>> When an application having open file access to a node forks, its >>>>>>>>>>>> shared >>>>>>>>>>>> mappings also get reflected in the address space of child process >>>>>>>>>>>> even >>>>>>>>>>>> though it cannot access them with the object permissions applied. >>>>>>>>>>>> With the >>>>>>>>>>>> existing permission checks on the gem objects, it might be >>>>>>>>>>>> reasonable to >>>>>>>>>>>> also create the VMAs with VM_DONTCOPY flag so a user space >>>>>>>>>>>> application >>>>>>>>>>>> doesn't need to explicitly call the madvise(addr, len, >>>>>>>>>>>> MADV_DONTFORK) >>>>>>>>>>>> system call to prevent the pages in the mapped range to appear in >>>>>>>>>>>> the >>>>>>>>>>>> address space of the child process. It also prevents the memory >>>>>>>>>>>> leaks >>>>>>>>>>>> due to additional reference counts on the mapped BOs in the child >>>>>>>>>>>> process that prevented freeing the memory in the parent for which >>>>>>>>>>>> we had >>>>>>>>>>>> worked around earlier in the user space inside the thunk library. >>>>>>>>>>>> >>>>>>>>>>>> Additionally, we faced this issue when using CRIU to checkpoint >>>>>>>>>>>> restore >>>>>>>>>>>> an application that had such inherited mappings in the child which >>>>>>>>>>>> confuse CRIU when it mmaps on restore. Having this flag set for the >>>>>>>>>>>> render node VMAs helps. VMAs mapped via KFD already take care of >>>>>>>>>>>> this so >>>>>>>>>>>> this is needed only for the render nodes. >>>>>>>>>>> Unfortunately that is most likely a NAK. We already tried >>>>>>>>>>> something similar. >>>>>>>>>>> >>>>>>>>>>> While it is illegal by the OpenGL specification and doesn't work >>>>>>>>>>> for most userspace stacks, we do have some implementations which >>>>>>>>>>> call fork() with a GL context open and expect it to work. >>>>>>>>>>> >>>>>>>>>>> Regards, >>>>>>>>>>> Christian. >>>>>>>>>>> >>>>>>>>>>>> Cc: Felix Kuehling >>>>>>>>>>>> >>>>>>>>>>>> Signed-off-by: David Yat Sin >>>>>>>>>>>> Signed-off-by: Rajneesh Bhardwaj >>>>>>>>>>>> --- >>>>>>>>>>>>   drivers/gpu/drm/drm_gem.c       | 3 ++- >>>>>>>>>>>>   drivers/gpu/drm/ttm/ttm_bo_vm.c | 2 +- >>>>>>>>>>>>   2 files changed, 3 insertions(+), 2 deletions(-) >>>>>>>>>>>> >>>>>>>>>>>> diff --git a/drivers/gpu/drm/drm_gem.c b/drivers/gpu/drm/drm_gem.c >>>>>>>>>>>> index 09c820045859..d9c4149f36dd 100644 >>>>>>>>>>>> --- a/drivers/gpu/drm/drm_gem.c >>>>>>>>>>>> +++ b/drivers/gpu/drm/drm_gem.c >>>>>>>>>>>> @@ -1058,7 +1058,8 @@ int drm_gem_mmap_obj(struct drm_gem_object >>>>>>>>>>>> *obj, unsigned long obj_size, >>>>>>>>>>>>               goto err_drm_gem_object_put; >>>>>>>>>>>>           } >>>>>>>>>>>>   -        vma->vm_flags |= VM_IO | VM_PFNMAP | VM_DONTEXPAND | >>>>>>>>>>>> VM_DONTDUMP; >>>>>>>>>>>> +        vma->vm_flags |= VM_IO | VM_PFNMAP | VM_DONTEXPAND >>>>>>>>>>>> +                | VM_DONTDUMP | VM_DONTCOPY; >>>>>>>>>>>>           vma->vm_page_prot = >>>>>>>>>>>> pgprot_writecombine(vm_get_page_prot(vma->vm_flags)); >>>>>>>>>>>>           vma->vm_page_prot = pgprot_decrypted(vma->vm_page_prot); >>>>>>>>>>>>       } >>>>>>>>>>>> diff --git a/drivers/gpu/drm/ttm/ttm_bo_vm.c >>>>>>>>>>>> b/drivers/gpu/drm/ttm/ttm_bo_vm.c >>>>>>>>>>>> index 33680c94127c..420a4898fdd2 100644 >>>>>>>>>>>> --- a/drivers/gpu/drm/ttm/ttm_bo_vm.c >>>>>>>>>>>> +++ b/drivers/gpu/drm/ttm/ttm_bo_vm.c >>>>>>>>>>>> @@ -566,7 +566,7 @@ int ttm_bo_mmap_obj(struct vm_area_struct >>>>>>>>>>>> *vma, struct ttm_buffer_object *bo) >>>>>>>>>>>>         vma->vm_private_data = bo; >>>>>>>>>>>>   -    vma->vm_flags |= VM_PFNMAP; >>>>>>>>>>>> +    vma->vm_flags |= VM_PFNMAP | VM_DONTCOPY; >>>>>>>>>>>>       vma->vm_flags |= VM_IO | VM_DONTEXPAND | VM_DONTDUMP; >>>>>>>>>>>>       return 0; >>>>>>>>>>>>   }