Subject: Re: Proposal to add CRIU support to DRM render nodes
Date: Mon, 1 Apr 2024 17:56:28 +0100
From: Tvrtko Ursulin
To: Felix Kuehling, dri-devel@lists.freedesktop.org, amd-gfx@lists.freedesktop.org, Dave Airlie, Daniel Vetter, criu@openvz.org
Cc: "Errabolu, Ramesh", Christian König

On 01/04/2024 17:37, Felix Kuehling wrote:
> On 2024-04-01 11:09, Tvrtko Ursulin wrote:
>>
>> On 28/03/2024 20:42, Felix Kuehling wrote:
>>>
>>> On 2024-03-28 12:03, Tvrtko Ursulin wrote:
>>>>
>>>> Hi Felix,
>>>>
>>>> I had one more thought while browsing around the amdgpu CRIU plugin. It appears it relies on the KFD support being compiled in and /dev/kfd present, correct? AFAICT at least, it relies on that to figure out the amdgpu DRM node.
>>>>
>>>> It would probably be good to consider designing things without that dependency, so that checkpointing an application which does not use /dev/kfd is possible, or so that it still works if the kernel does not even have the KFD support compiled in.
>>>
>>> Yeah, if we want to support graphics apps that don't use KFD, we should definitely do that. Currently we get a lot of topology information from KFD, not even from the /dev/kfd device but from the sysfs nodes exposed by KFD. We'd need to get GPU device info from the render nodes instead. And if KFD is available, we may need to integrate both sources of information.
>>>
>>>> It could perhaps mean no more than adding some GPU discovery code into CRIU, which should be flexible enough to account for things like re-assigned minor numbers due to driver reload.
>>>
>>> Do you mean adding GPU discovery to the core CRIU, or to the plugin? I was thinking this is still part of the plugin.
>>
>> Yes, I agree. I was only thinking about adding some DRM device discovery code in a more decoupled fashion from the current plugin, both for the reason discussed above (decoupling a bit from the reliance on kfd sysfs), and also so that, if/when a new DRM driver wants to implement this, the code could be moved to some common plugin area.
>>
>> I am not sure how feasible that would be though. The "gpu id" concept and its matching in the current kernel code and CRIU plugin - is that value tied to the physical GPU instance or how does it work?
>
> The concept of the GPU ID is that it's stable while the system is up, even when devices get added and removed dynamically. It was baked into the API early on, but I don't think we ever fully validated device hot plug. I think the closest we're getting is with our latest MI GPUs and dynamic partition mode change.

Doesn't it read the saved gpu id from the image file while doing restore and try to open the render node matching it? Maybe I am misreading the code. But if it does, does it imply that in practice it could be stable across reboots? Or that it is not possible to restore to a different instance of maybe the same GPU model installed in a system?
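As an illustration of the decoupled discovery idea above (not a concrete proposal, only an untested sketch using the stock libdrm device enumeration API), the plugin could walk the render nodes without touching the kfd sysfs topology at all:

#include <stdio.h>
#include <xf86drm.h>

/* Sketch: list render nodes and their PCI identity via libdrm, with no
 * dependency on /dev/kfd or the KFD sysfs topology. */
static void list_render_nodes(void)
{
    drmDevicePtr devices[64];
    int i, count = drmGetDevices2(0, devices, 64);

    for (i = 0; i < count; i++) {
        drmDevicePtr dev = devices[i];

        if (!(dev->available_nodes & (1 << DRM_NODE_RENDER)))
            continue;

        printf("%s", dev->nodes[DRM_NODE_RENDER]);
        if (dev->bustype == DRM_BUS_PCI)
            printf(" pci %04x:%02x:%02x.%d %04x:%04x",
                   dev->businfo.pci->domain, dev->businfo.pci->bus,
                   dev->businfo.pci->dev, dev->businfo.pci->func,
                   dev->deviceinfo.pci->vendor_id,
                   dev->deviceinfo.pci->device_id);
        printf("\n");
    }

    if (count > 0)
        drmFreeDevices(devices, count);
}

How a checkpointed "gpu id" would then be matched back to one of those nodes (by PCI bus location, or by some unique GPU identity for the reboot case) would of course still be driver policy.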
> This also highlights another aspect on those spatially partitioned GPUs. GPU IDs identify device partitions, not devices. Similarly, each partition has its own render node, and the KFD topology info in sysfs points to the render-minor number corresponding to each GPU ID.

I am not familiar with this. This is not SR-IOV but some other kind of partitioning? Would you have any links where I could read more?

Regards,

Tvrtko

>>>> Otherwise I am eagerly awaiting to hear more about the design specifics around dma-buf handling, and also to see how to extend this to other DRM related anonymous fds.
>>>
>>> I've been pretty far under-water lately. I hope I'll find time to work on this more, but it's probably going to be at least a few weeks.
>>
>> Got it.
>>
>> Regards,
>>
>> Tvrtko
>>
>>> Regards,
>>>   Felix
>>>
>>>>
>>>> Regards,
>>>>
>>>> Tvrtko
>>>>
>>>> On 15/03/2024 18:36, Tvrtko Ursulin wrote:
>>>>>
>>>>> On 15/03/2024 02:33, Felix Kuehling wrote:
>>>>>>
>>>>>> On 2024-03-12 5:45, Tvrtko Ursulin wrote:
>>>>>>>
>>>>>>> On 11/03/2024 14:48, Tvrtko Ursulin wrote:
>>>>>>>>
>>>>>>>> Hi Felix,
>>>>>>>>
>>>>>>>> On 06/12/2023 21:23, Felix Kuehling wrote:
>>>>>>>>> Executive Summary: We need to add CRIU support to DRM render nodes in order to maintain CRIU support for ROCm applications once they start relying on render nodes for more GPU memory management. In this email I'm providing some background on why we are doing this, and outlining some of the problems we need to solve to checkpoint and restore render node state and shared memory (DMABuf) state. I have some thoughts on the API design, leaning on what we did for KFD, but would like to get feedback from the DRI community regarding that API and to what extent there is interest in making that generic.
>>>>>>>>>
>>>>>>>>> We are working on using DRM render nodes for virtual address mappings in ROCm applications to implement the CUDA11-style VM API and improve interoperability between graphics and compute. This uses DMABufs for sharing buffer objects between KFD and multiple render node devices, as well as between processes. In the long run this also provides a path to moving all or most memory management from the KFD ioctl API to libdrm.
>>>>>>>>>
>>>>>>>>> Once ROCm user mode starts using render nodes for virtual address management, that creates a problem for checkpointing and restoring ROCm applications with CRIU. Currently there is no support for checkpointing and restoring render node state, other than CPU virtual address mappings. Support will be needed for checkpointing GEM buffer objects and handles, their GPU virtual address mappings and memory sharing relationships between devices and processes.
>>>>>>>>>
>>>>>>>>> Eventually, if full CRIU support for graphics applications is desired, more state would need to be captured, including scheduler contexts and BO lists. Most of this state is driver-specific.
>>>>>>>>>
>>>>>>>>> After some internal discussions we decided to take our design process public as this potentially touches DRM GEM and DMABuf APIs and may have implications for other drivers in the future.
>>>>>>>>> One basic question before going into any API details: Is there a desire to have CRIU support for other DRM drivers?
>>>>>>>>
>>>>>>>> This sounds like a very interesting feature overall, although I cannot answer the last question here.
>>>>>>>
>>>>>>> I forgot to finish this thought. I cannot answer / don't know of any concrete plans, but I think the feature is pretty cool and if amdgpu gets it working I wouldn't be surprised if other drivers got interested.
>>>>>>
>>>>>> Thanks, that's good to hear!
>>>>>>
>>>>>>>> Funnily enough, it has a tiny relation to an i915 feature I recently implemented on Mesa's request, which is to be able to "upload" the GPU context from the GPU hang error state and replay the hanging request. It is kind of (at a stretch) a very special tiny subset of checkpoint and restore, so I am mentioning it as a curiosity.
>>>>>>>>
>>>>>>>> And there is also another partial conceptual intersection with the (at the moment not yet upstream) i915 online debugger, this part being in the area of discovering and enumerating GPU resources belonging to the client.
>>>>>>>>
>>>>>>>> I don't see any immediate design or code sharing opportunities though, but just mentioning it.
>>>>>>>>
>>>>>>>> I did spend some time reading your plugin and kernel implementation out of curiosity and have some comments and questions.
>>>>>>>>
>>>>>>>>> With that out of the way, some considerations for a possible DRM CRIU API (either generic or AMDGPU driver specific): The API goes through several phases during checkpoint and restore:
>>>>>>>>>
>>>>>>>>> Checkpoint:
>>>>>>>>>
>>>>>>>>>  1. Process-info (enumerates objects and sizes so user mode can allocate memory for the checkpoint, stops execution on the GPU)
>>>>>>>>>  2. Checkpoint (store object metadata for BOs, queues, etc.)
>>>>>>>>>  3. Unpause (resumes execution after the checkpoint is complete)
>>>>>>>>>
>>>>>>>>> Restore:
>>>>>>>>>
>>>>>>>>>  1. Restore (restore objects, VMAs are not in the right place at this time)
>>>>>>>>>  2. Resume (final fixups after the VMAs are sorted out, resume execution)
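For illustration, a render node analogue of those five phases could be a single ioctl taking an op code plus pointers to user-allocated buffers, along the lines of the sketch below. Every name here is invented for the example and is not existing uapi; the real KFD interface differs in detail.

#include <linux/types.h>

/* Hypothetical sketch only - not existing uapi. */
enum drm_criu_op {
    DRM_CRIU_OP_PROCESS_INFO,   /* report object counts/sizes, stop GPU execution */
    DRM_CRIU_OP_CHECKPOINT,     /* fill user-allocated buffers with object metadata */
    DRM_CRIU_OP_UNPAUSE,        /* resume execution after a successful checkpoint */
    DRM_CRIU_OP_RESTORE,        /* recreate objects; VMAs not yet at final addresses */
    DRM_CRIU_OP_RESUME,         /* final fixups once VMAs are in place, resume */
};

struct drm_criu_args {
    __u64 bos;                  /* user pointer to an array of BO buckets */
    __u64 priv_data;            /* user pointer to an opaque, driver-specific blob */
    __u64 priv_data_size;
    __u32 num_bos;
    __u32 op;                   /* enum drm_criu_op */
};

PROCESS_INFO would fill in num_bos and priv_data_size so the plugin can size its allocations before calling CHECKPOINT, mirroring the flow described in the list above.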
>>>>>>>> Btw, does checkpointing guarantee that all relevant activity is idled? For instance, are dma_resv objects free of fences which would need to be restored for things to continue executing sensibly? Or how is that handled?
>>>>>>
>>>>>> In our compute use cases, we suspend user mode queues. This can include CWSR (compute-wave-save-restore) where the state of in-flight waves is stored in memory and can be reloaded and resumed from memory later. We don't use any fences other than "eviction fences", which are signaled after the queues are suspended. And those fences are never handed to user mode. So we don't need to worry about any fence state in the checkpoint.
>>>>>>
>>>>>> If we extended this to support the kernel mode command submission APIs, I would expect that we'd wait for all current submissions to complete, and stop new ones from being sent to the HW before taking the checkpoint. When we take the checkpoint in the CRIU plugin, the CPU threads are already frozen and cannot submit any more work. If we wait for all currently pending submissions to drain, I think we don't need to save any fence state because all the fences will have signaled. (I may be missing some intricacies and I'm afraid it may not be that simple in reality, but that's my opening bid. ;)
>>>>>
>>>>> It feels feasible to me too, for the normally behaving clients at least.
>>>>>
>>>>> Presumably, given that the whole checkpointing is not instant, it would be okay to wait a second or two longer for the in-progress submissions to complete. After which the kernel would need to prune all signalled fences from the respective container objects before checkpointing.
>>>>>
>>>>> For the "misbehaving" clients who have perhaps queued up too much work, either still in the scheduler with unsatisfied dependencies, or already submitted to the hardware and/or driver backend, is there a timeout concept in CRIU so it would be possible to say something like "try to checkpoint, but if the kernel says no within time period t then give up"?
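To make the timeout idea above concrete, on the kernel side this could be little more than a bounded wait on each BO's reservation object before the state is serialised. A hypothetical helper (the function name is invented; dma_resv_wait_timeout() itself is the existing kernel API):

#include <drm/drm_gem.h>
#include <linux/dma-resv.h>
#include <linux/errno.h>

/* Hypothetical sketch: give a BO a bounded amount of time to go idle before
 * checkpointing, instead of blocking forever on a misbehaving client. */
static int criu_wait_bo_idle(struct drm_gem_object *obj,
                             unsigned long timeout_jiffies)
{
    long ret;

    /* DMA_RESV_USAGE_BOOKKEEP waits for every fence tracked in the resv. */
    ret = dma_resv_wait_timeout(obj->resv, DMA_RESV_USAGE_BOOKKEEP,
                                true, timeout_jiffies);
    if (ret == 0)
        return -ETIMEDOUT;      /* still busy - checkpoint can bail out */

    return ret < 0 ? ret : 0;   /* interrupted, or idle */
}

After a successful wait the signalled fences could then be pruned from the containers before checkpointing, as suggested above.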
>>>>>>>>> For some more background about our implementation in KFD, you can refer to this whitepaper: https://github.com/checkpoint-restore/criu/blob/criu-dev/plugins/amdgpu/README.md
>>>>>>>>>
>>>>>>>>> Potential objections to a KFD-style CRIU API in DRM render nodes, I'll address each of them in more detail below:
>>>>>>>>>
>>>>>>>>>   * Opaque information in the checkpoint data that user mode can't interpret or do anything with
>>>>>>>>>   * A second API for creating objects (e.g. BOs) that is separate from the regular BO creation API
>>>>>>>>>   * Kernel mode would need to be involved in restoring BO sharing relationships rather than replaying BO creation, export and import from user mode
>>>>>>>>>
>>>>>>>>> # Opaque information in the checkpoint
>>>>>>>>>
>>>>>>>>> This comes out of ABI compatibility considerations. Adding any new objects or attributes to the driver/HW state that needs to be checkpointed could potentially break the ABI of the CRIU checkpoint/restore ioctl if the plugin needs to parse that information. Therefore, much of the information in our KFD CRIU ioctl API is opaque. It is written by kernel mode in the checkpoint, it is consumed by kernel mode when restoring the checkpoint, but user mode doesn't care about the contents or binary layout, so there is no user mode ABI to break. This is how we were able to maintain CRIU support when we added the SVM API to KFD without changing the CRIU plugin and without breaking our ABI.
>>>>>>>>>
>>>>>>>>> Opaque information may also lend itself to API abstraction, if this becomes a generic DRM API with driver-specific callbacks that fill in HW-specific opaque data.
>>>>>>>>
>>>>>>>> This feels sound in principle to me. Fundamentally the state is very hardware specific, and/or driver version specific, so I don't see what could be gained in practice by making it much less opaque. (Apart from making things more complicated.)
>>>>>>>>
>>>>>>>> I was however unsure of the current split of how you dump buffer objects, with some data in the defined bo structure and some in completely opaque private data. Is there a benefit to that split, or maybe in other words, is there a benefit in having part transparent and part opaque for buffer objects?
>>>>>>
>>>>>> Some of the buffer object state is needed by the plugin. E.g. the size and mmap offset are needed to match VMAs with BOs. I'd have to review the plugin in detail to prove that all the fields are, in fact, needed by the plugin, but that was the intent. Anything that the plugin doesn't need to know should be in the opaque data structures.
>>>>>
>>>>> Right, got it.
>>>>>
>>>>> Would it make sense to put the opaque data in the same block as the defined one? I mean, instead of separating the two in the binary image, for instance have struct kfd_criu_bo_bucket carry a trailing priv_data blob? Maybe it is too late now if the image format is not versioned or something. (A hypothetical layout along those lines is sketched further below.)
>>>>>
>>>>>>>> To slightly touch upon the question of whether this could become a generic DRM API: it feels it would be hard to do from the start. What sounds more feasible is that, if/when generic looking helpers can be spotted while developing the RFC, the code is structured so they can easily be promoted to shared/common code at some future point.
>>>>>>
>>>>>> Yes, that's how this usually goes, in my experience. Thanks for confirming.
>>>>>>
>>>>>>>>> # Second API for creating objects
>>>>>>>>>
>>>>>>>>> Creating BOs and other objects when restoring a checkpoint needs more information than the usual BO alloc and similar APIs provide. For example, we need to restore BOs with the same GEM handles so that user mode can continue using those handles after resuming execution. If BOs are shared through DMABufs without dynamic attachment, we need to restore pinned BOs as pinned. Validation of virtual addresses and handling of MMU notifiers must be suspended until the virtual address space is restored. For user mode queues we need to save and restore a lot of queue execution state so that execution can resume cleanly.
>>>>>>>>
>>>>>>>> This also sounds justified to me. Restore needing to create all the internal objects is definitely special, and it sounds better to add uapi to create them directly with the correct properties than to add uapi to adjust internal properties after creation. In any case you would need some new uapi - at the least to adjust things after creation - at which point you may as well have both in one. Internally the implementation can be split or common, whatever makes sense for a given object type, but new uapi definitely sounds like it is required.
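Picking up the two points above - the plugin-visible vs. opaque split and the need to recreate BOs with their original GEM handles - a hypothetical render node bucket with a trailing opaque blob might look like this (again, invented names, not existing uapi):

#include <linux/types.h>

/* Hypothetical sketch only - one entry per BO in the checkpoint image.
 * The leading fields are what the plugin itself needs to interpret (e.g. to
 * match VMAs to BOs); the trailing blob is opaque, driver-versioned data the
 * plugin only carries around. */
struct drm_criu_bo_bucket {
    __u64 size;         /* BO size, for matching VMAs to BOs */
    __u64 mmap_offset;  /* fake mmap offset, for matching VMAs to BOs */
    __u32 gem_handle;   /* handle to recreate verbatim on restore */
    __u32 flags;        /* e.g. exported-as-dmabuf, must-be-pinned */
    __u32 priv_size;    /* length of the opaque tail */
    __u32 pad;
    __u8  priv[];       /* driver-specific state, opaque to user mode */
};

Whether the opaque tail lives inline like this or in a separate blob is then mostly a question of image layout and versioning, as noted above.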
>>>>>>>>> # Restoring buffer sharing relationships
>>>>>>>>>
>>>>>>>>> Different GEM handles in different render nodes and processes can refer to the same underlying shared memory, either by directly pointing to the same GEM object, or by creating an import attachment that may get its SG tables invalidated and updated dynamically through dynamic attachment callbacks. In the latter case it's obvious who is the exporter and who is the importer. In the first case, either one could be the exporter, and it's not clear who would need to create the BO and who would need to
>>>>>>>>
>>>>>>>> To see if I follow the former case correctly.
>>>>>>>>
>>>>>>>> This could be two clients A and B, where B has imported a dma-buf shared BO from A and has since closed the dma-buf fd? Which results in a single BO with a reference count of 2 and obj->import_attach unset. History of who created the object is lost.
>>>>>>
>>>>>> Yes. In the amdgpu driver this happens when the exporting and importing device are the same.
>>>>>>
>>>>>>>> In fact it could even be that two imported objects remain (clients A, B and C) and A, who originally created the BO, has since fully released it. So any kind of "creator" tracking, if added, wouldn't be fully reliable either.
>>>>>>
>>>>>> That's a good point.
>>>>>>
>>>>>>>>> import it when restoring the checkpoint. To further complicate things, multiple processes in a checkpoint get restored concurrently. So there is no guarantee that an exporter has restored a shared BO at the time an importer is trying to restore its import.
>>>>>>>>>
>>>>>>>>> A proposal to deal with these problems would be to treat importers and exporters the same. Whoever restores first ends up creating the BO and potentially attaching to it. The other process(es) can find BOs that were already restored by another process by looking them up with a unique ID that could be based on the DMABuf inode number. An alternative would be a two-pass approach that restores BOs in two passes:
>>>>>>>>>
>>>>>>>>>  1. Restore exported BOs
>>>>>>>>>  2. Restore imports
>>>>>>>>>
>>>>>>>>> With some inter-process synchronization in CRIU itself between these two passes. This may require changes in the core CRIU, outside our plugin. Both approaches depend on identifying BOs with some unique ID that could be based on the DMABuf inode number in the checkpoint. However, we would need to identify the processes in the same restore session, possibly based on parent/child process relationships, to create a scope where those IDs are valid during restore.
>>>>>>>>
>>>>>>>> If my understanding above is on the right track, then I think this is the only thing which can be done (for all scenarios).
>>>>>>
>>>>>> I presented two alternatives. I think you're in favor of the first one, where it doesn't matter who is the importer and who is the exporter. I think the two-pass approach requires that you can identify an exporter. And as you pointed out, the exporter may already have dropped their reference to the BO.
>>>>>
>>>>> Yep.
>>>>>
>>>>>>>> I also *think* it would be safe. At least at the moment I cannot think what could go wrong. Semantics are that it doesn't really matter who created the object.
>>>>>>
>>>>>> I would agree. What matters is that the object is recreated on the correct device, and that all the direct references and import attachments pointing to it are restored.
>>>>>>
>>>>>>>>> Finally, we would also need to checkpoint and restore DMABuf file descriptors themselves. These are anonymous file descriptors. The CRIU plugin could probably be taught to recreate them from the original exported BO based on the inode number, which could be queried with fstat in the checkpoint. It would need help from the render node CRIU API to find the right BO from the inode, which may be from a different process in the same restore session.
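The plugin-side half of that is straightforward: on reasonably recent kernels every exported dma-buf is backed by its own inode, so a sketch for obtaining the unique ID could be as simple as (helper name made up):

#include <stdint.h>
#include <sys/stat.h>

/* Sketch: return the inode number identifying a dma-buf fd, or 0 on error.
 * st_ino is unique per exported buffer for as long as the buffer exists. */
static uint64_t dmabuf_unique_id(int dmabuf_fd)
{
    struct stat st;

    if (fstat(dmabuf_fd, &st))
        return 0;

    return st.st_ino;
}

The harder part, as noted, is the kernel-side lookup from that inode back to a BO which may belong to another process in the same restore session.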
>>>>>>>> This part feels like it is breaking the component separation a bit, because even for buffers fully owned by amdgpu, strictly speaking the dma-buf fd is not. At least my understanding from the above is that you propose to attempt to import the fd, from the kernel side, during the checkpoint process? Like:
>>>>>>>>
>>>>>>>> Checkpoint:
>>>>>>>>
>>>>>>>> CRIU for each anon fd:
>>>>>>>>    amdgpu_plugin(fd)
>>>>>>>>      -> attempt in-kernel dma-buf import (passes fd to kernel via ioctl?)
>>>>>>>>          -> is it ours? (no -> error)
>>>>>>>>              -> create a record mapping fd number to amdgpu BO
>>>>>>>>
>>>>>>>> Restore:
>>>>>>>>
>>>>>>>> for each dma-buf fd record:
>>>>>>>>     create BO if it does not exist
>>>>>>>>     export BO to the same fd
>>>>>>>>     close BO handle if not in the regular BO handle records
>>>>>>>>
>>>>>>>> Or, since you mention lookup by inode, that would need to be a dmabuf_plugin so it can look up inodes in the private mount space. However, how it would co-operate with amdgpu_plugin is not clear to me.
>>>>>>
>>>>>> The way I think about the ownership is, whichever driver created the underlying BO owns the checkpointing of the dmabuf. You need driver-specific information to link the dmabuf with the driver's BO, and you need the right driver to recreate the BO and the dmabuf fd when restoring the checkpoint.
>>>>>>
>>>>>> It gets really interesting if you have an amdgpu plugin and an i915 plugin, and they checkpoint an application that shares BOs between the two devices through DMABufs. E.g. if i915 created a BO and amdgpu imported it, then during restore i915 needs to restore the dmabuf before the amdgpu import of it can be restored. I think that brings us back to a two-phase approach to restoring the memory sharing relationships. Uff.
>>>>>
>>>>> I think this part of the discussion somewhat depends on the previous part about idling. If it is feasible to completely idle and prune, and fail if that is not happening quickly enough, then maybe there wouldn't be too much hierarchical state to save.
>>>>>
>>>>> Otherwise my idea was that there is a top-level drm_plugin.so which understands amdgpu fds, i915, syncobj, sync_file, and uses some new uapi to uniquely identify each, associate it with the correct driver, and then internally dispatches to amdgpu|i915|dmabuf|..._plugin.so, building the in-memory representation of their relationships. As long as all objects and their relationships have been recorded, I think everything could then be correctly restored.
>>>>>
>>>>> It is possible there is a flaw in my thinking and that something in the CRIU design would make this impossible? I think it would require the top-level drm_plugin.so to hold all state in memory until the whole checkpointing is done, and then verify nothing is incomplete, failing it all if it is. (For instance if one plugin discovered a reference to an object which was not discovered by any other plugin, or things like that.) It may need some further tweaks to CRIU common code.
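On the narrower question of how a top-level plugin could even tell what kind of anonymous fd it has been handed (which comes up again just below), the usual user space tricks are the device number for DRM nodes and the /proc link target for anonymous inodes. A rough sketch - the exact anon_inode link strings vary and would need to be verified against the kernel, they are only illustrative here:

#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/sysmacros.h>
#include <unistd.h>

#define DRM_MAJOR 226   /* char device major of DRM primary/render nodes */

/* Sketch: classify an inherited fd before deciding which sub-plugin, if any,
 * should be asked to checkpoint it. */
static const char *classify_fd(int fd)
{
    char link[64], target[256];
    struct stat st;
    ssize_t len;

    if (fstat(fd, &st))
        return "unknown";

    if (S_ISCHR(st.st_mode) && major(st.st_rdev) == DRM_MAJOR)
        return "drm-device";

    snprintf(link, sizeof(link), "/proc/self/fd/%d", fd);
    len = readlink(link, target, sizeof(target) - 1);
    if (len < 0)
        return "unknown";
    target[len] = '\0';

    /* Illustrative substring matches - check the real link contents. */
    if (strstr(target, "dmabuf"))
        return "dmabuf";
    if (strstr(target, "sync_file"))
        return "sync-file";
    if (strstr(target, "syncobj"))
        return "drm-syncobj";

    return "unknown";
}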
>>>>> Maybe I need to better understand how exactly you mean to query the DRM driver about random anonymous fds. I see it as a problem in the design, possibly even the implementation, but maybe I am missing something which makes it not so. I mean, even with my general idea I don't know how one would determine which driver to query about a particular anonymous inode.
>>>>>
>>>>>>> I later also realised that I was maybe increasing the scope for you here. :) You did state the focus is ROCm applications, which possibly don't care about dma-resv, fences, syncobjs etc?
>>>>>>
>>>>>> That's my focus for now. But I don't want to engineer a solution that would preclude your use cases in the future.
>>>>>>
>>>>>>> But I think the "how to handle dma-bufs" design question is still relevant and interesting. For example, I had this thought that perhaps what would be needed is a CRIU plugin hierarchy.
>>>>>>>
>>>>>>> Because fundamentally we would be snapshotting a hierarchy of kernel objects belonging to different drivers (kfd, amdgpu, dma-buf, ...). And if one day someone were to try to handle dma fences and drm syncobjs, the argument for a hierarchical design would be even stronger, I think.
>>>>>>>
>>>>>>> Something like a drm_plugin.so could call sub-plugins (amdgpu, dma-buf, sync file, ...) and internally build the representation of the whole state and the relationships between the objects.
>>>>>>
>>>>>> Maybe. I guess a structure similar to libdrm makes sense. I'm not sure it's strictly a hierarchy. Maybe more like some common code shared by multiple GPU driver plugins. I think the common checkpoint state is quite limited and restoring it requires the GPU-specific drivers anyway.
>>>>>>
>>>>>> Also, the idea of building a representation of the whole state doesn't work well with the CRIU design, because "the whole state" can include multiple processes that restore themselves concurrently and only synchronize with each other in a few places in the restore process. I feel, if we can work out how to checkpoint and restore shared objects between processes, we can do the same for shared objects between drivers without imposing a strict hierarchy and some omniscient entity that needs to know "the whole state".
>>>>>
>>>>> Okay, this continues on the same problem space as above. And you obviously know how CRIU works much better than me.
>>>>>
>>>>>>> With that kind of design there probably would be a need to define some common kernel side API and uapi, so all involved objects can be enumerated with some unique IDs etc.
>>>>>>>
>>>>>>> Now... the counter argument: the more state from different drivers one would want to handle, the bigger this project would get. Would it even be feasible is the question, to the point that it may be simpler to just run the workload in a VM via SR-IOV and simply hibernate the whole thin guest. :)
>>>>>>
>>>>>> Well, CRIU kind of tries to do that, but with containers instead of VMs. ;)
>>>>>
>>>>> It would definitely be useful for hardware and drivers without SR-IOV support, so let's hope it is doable. :)
>>>>>
>>>>> Regards,
>>>>>
>>>>> Tvrtko