* Proposal to add CRIU support to DRM render nodes
@ 2023-12-06 21:23 Felix Kuehling
2024-01-15 18:58 ` Felix Kuehling
2024-03-11 14:48 ` Tvrtko Ursulin
0 siblings, 2 replies; 14+ messages in thread
From: Felix Kuehling @ 2023-12-06 21:23 UTC (permalink / raw)
To: dri-devel, amd-gfx, Dave Airlie, Daniel Vetter, criu
Cc: Errabolu, Ramesh, Christian König
Executive Summary: We need to add CRIU support to DRM render nodes in
order to maintain CRIU support for ROCm applications once they start
relying on render nodes for more GPU memory management. In this email
I'm providing some background on why we are doing this, and outlining some
of the problems we need to solve to checkpoint and restore render node
state and shared memory (DMABuf) state. I have some thoughts on the API
design, leaning on what we did for KFD, but would like to get feedback
from the DRI community regarding that API and to what extent there is
interest in making that generic.
We are working on using DRM render nodes for virtual address mappings in
ROCm applications to implement the CUDA11-style VM API and improve
interoperability between graphics and compute. This uses DMABufs for
sharing buffer objects between KFD and multiple render node devices, as
well as between processes. In the long run this also provides a path to
moving all or most memory management from the KFD ioctl API to libdrm.
Once ROCm user mode starts using render nodes for virtual address
management, that creates a problem for checkpointing and restoring ROCm
applications with CRIU. Currently there is no support for checkpointing
and restoring render node state, other than CPU virtual address
mappings. Support will be needed for checkpointing GEM buffer objects
and handles, their GPU virtual address mappings and memory sharing
relationships between devices and processes.
Eventually, if full CRIU support for graphics applications is desired,
more state would need to be captured, including scheduler contexts and
BO lists. Most of this state is driver-specific.
After some internal discussions we decided to take our design process
public as this potentially touches DRM GEM and DMABuf APIs and may have
implications for other drivers in the future.
One basic question before going into any API details: Is there a desire
to have CRIU support for other DRM drivers?
With that out of the way, some considerations for a possible DRM CRIU
API (either generic or AMDGPU driver-specific): The API goes through
several phases during checkpoint and restore:
Checkpoint:
1. Process-info (enumerates objects and sizes so user mode can allocate
memory for the checkpoint, stops execution on the GPU)
2. Checkpoint (store object metadata for BOs, queues, etc.)
3. Unpause (resumes execution after the checkpoint is complete)
Restore:
1. Restore (restore objects, VMAs are not in the right place at this time)
2. Resume (final fixups after the VMAs are sorted out, resume execution)
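To make the phases above concrete, here is a rough sketch of what such an
ioctl interface could look like. All names, fields and values are purely
illustrative assumptions for this discussion, not an existing amdgpu or DRM
uAPI:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical phase selector for a single CRIU ioctl; one op per phase
 * described above. Names are illustrative only. */
enum drm_criu_op {
	DRM_CRIU_OP_PROCESS_INFO = 1,	/* enumerate objects/sizes, stop GPU execution */
	DRM_CRIU_OP_CHECKPOINT   = 2,	/* store object metadata (BOs, queues, etc.) */
	DRM_CRIU_OP_UNPAUSE      = 3,	/* resume execution after the checkpoint */
	DRM_CRIU_OP_RESTORE      = 4,	/* recreate objects; VMAs not final yet */
	DRM_CRIU_OP_RESUME       = 5,	/* final fixups once VMAs are sorted out */
};

/* Hypothetical args struct shared by all phases. PROCESS_INFO fills in
 * num_objects and priv_data_size so user mode can allocate buffers for
 * the CHECKPOINT call. */
struct drm_criu_args {
	uint64_t objects_ptr;	/* user pointer to object metadata array */
	uint32_t num_objects;	/* in: array capacity, out: object count */
	uint32_t op;		/* one of enum drm_criu_op */
	uint64_t priv_data_ptr;	/* driver-private opaque blob */
	uint64_t priv_data_size;
};
```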
For some more background about our implementation in KFD, you can refer
to this whitepaper:
https://github.com/checkpoint-restore/criu/blob/criu-dev/plugins/amdgpu/README.md
Potential objections to a KFD-style CRIU API in DRM render nodes; I'll
address each of them in more detail below:
* Opaque information in the checkpoint data that user mode can't
interpret or do anything with
* A second API for creating objects (e.g. BOs) that is separate from
the regular BO creation API
* Kernel mode would need to be involved in restoring BO sharing
relationships rather than replaying BO creation, export and import
from user mode
# Opaque information in the checkpoint
This comes out of ABI compatibility considerations. Adding any new
objects or attributes to the driver/HW state that needs to be
checkpointed could potentially break the ABI of the CRIU
checkpoint/restore ioctl if the plugin needs to parse that information.
Therefore, much of the information in our KFD CRIU ioctl API is opaque.
It is written by kernel mode in the checkpoint, it is consumed by kernel
mode when restoring the checkpoint, but user mode doesn't care about the
contents or binary layout, so there is no user mode ABI to break. This
is how we were able to maintain CRIU support when we added the SVM API
to KFD without changing the CRIU plugin and without breaking our ABI.
Opaque information may also lend itself to API abstraction, if this
becomes a generic DRM API with driver-specific callbacks that fill in
HW-specific opaque data.
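The following toy sketch illustrates the opaque-blob idea in user-space C.
Everything here is hypothetical (there is no such driver interface); the
point is only that the plugin carries blob and size around without ever
interpreting the contents, so the blob's binary layout is not user-mode ABI:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Stand-in for driver/HW-specific state; its layout is private to the
 * "driver" and can change without breaking the checkpoint ABI. */
struct fake_hw_state {
	uint64_t gpu_va;
	uint32_t flags;
};

/* "Kernel" side writes the opaque blob during checkpoint. Returns the
 * number of bytes written, or 0 if the buffer is too small. */
static size_t driver_checkpoint(const struct fake_hw_state *s,
				uint8_t *blob, size_t cap)
{
	if (cap < sizeof(*s))
		return 0;
	memcpy(blob, s, sizeof(*s));	/* layout known only to the driver */
	return sizeof(*s);
}

/* "Kernel" side consumes the same blob on restore; the plugin in between
 * never looked inside it. */
static void driver_restore(struct fake_hw_state *s,
			   const uint8_t *blob, size_t size)
{
	assert(size == sizeof(*s));
	memcpy(s, blob, size);
}
```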
# Second API for creating objects
Creating BOs and other objects when restoring a checkpoint needs more
information than the usual BO alloc and similar APIs provide. For
example, we need to restore BOs with the same GEM handles so that user
mode can continue using those handles after resuming execution. If BOs
are shared through DMABufs without dynamic attachment, we need to
restore pinned BOs as pinned. Validation of virtual addresses and
handling MMU notifiers must be suspended until the virtual address space
is restored. For user mode queues we need to save and restore a lot of
queue execution state so that execution can resume cleanly.
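As an illustration of why the regular BO creation API is not enough, a
restore-time creation call could look roughly like this. Again, all names
and fields are assumptions made up for this sketch, not a proposed or
existing ioctl layout; the key difference from a normal GEM create is that
properties the kernel normally chooses (the handle, pinning) become inputs:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical restore-time BO creation args. */
struct drm_criu_restore_bo {
	uint64_t size;		/* BO size from the checkpoint */
	uint64_t gpu_va;	/* recreate the same GPU virtual address mapping */
	uint32_t gem_handle;	/* in: exact handle to recreate (normally out-only) */
	uint32_t flags;
/* BO was shared through DMABuf without dynamic attachment, so it must be
 * restored pinned. */
#define DRM_CRIU_BO_PINNED	(1u << 0)
	uint64_t priv_data_ptr;	/* opaque driver/HW-specific state */
	uint64_t priv_data_size;
};
```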
# Restoring buffer sharing relationships
Different GEM handles in different render nodes and processes can refer
to the same underlying shared memory, either by directly pointing to the
same GEM object, or by creating an import attachment that may get its SG
tables invalidated and updated dynamically through dynamic attachment
callbacks. In the latter case it's obvious who is the exporter and who
is the importer. In the first case, either one could be the exporter,
and it's not clear who would need to create the BO and who would need to
import it when restoring the checkpoint. To further complicate things,
multiple processes in a checkpoint get restored concurrently. So there
is no guarantee that an exporter has restored a shared BO at the time an
importer is trying to restore its import.
A proposal to deal with these problems would be to treat importers and
exporters the same. Whoever restores first ends up creating the BO and
potentially attaching to it. The other process(es) can find BOs that
were already restored by another process by looking them up with a unique
ID that could be based on the DMABuf inode number. An alternative would
be a two-pass approach that restores BOs in two passes:
1. Restore exported BOs
2. Restore imports
This would need some inter-process synchronization in CRIU itself between
these two passes, which may require changes in core CRIU, outside our plugin.
Both approaches depend on identifying BOs with some unique ID that could
be based on the DMABuf inode number in the checkpoint. However, we would
need to identify the processes in the same restore session, possibly
based on parent/child process relationships, to create a scope where
those IDs are valid during restore.
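The "whoever restores first creates the BO" approach can be sketched as a
get-or-create lookup in a per-restore-session registry keyed by the unique
ID. This is a hypothetical user-space illustration only; kernel code would
use an xarray/IDR with proper locking and real GEM objects:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* One entry per shared BO restored in this session, keyed by the unique
 * ID (e.g. the DMABuf inode number from the checkpoint). */
struct session_bo {
	uint64_t inode;		/* unique ID from the checkpoint */
	void *gem_obj;		/* stand-in for the shared underlying object */
	struct session_bo *next;
};

static struct session_bo *registry;	/* scope: one restore session */

/* Importer and exporter call the same function: the first caller creates
 * the BO, later callers find and attach to the existing one. */
static struct session_bo *restore_shared_bo(uint64_t inode)
{
	struct session_bo *bo;

	for (bo = registry; bo; bo = bo->next)
		if (bo->inode == inode)
			return bo;	/* already restored by another process */

	bo = calloc(1, sizeof(*bo));
	bo->inode = inode;
	bo->gem_obj = malloc(1);	/* stands in for real BO creation */
	bo->next = registry;
	registry = bo;
	return bo;
}
```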
Finally, we would also need to checkpoint and restore DMABuf file
descriptors themselves. These are anonymous file descriptors. The CRIU
plugin could probably be taught to recreate them from the original
exported BO based on the inode number that could be queried with fstat
in the checkpoint. It would need help from the render node CRIU API to
find the right BO from the inode, which may be from a different process
in the same restore session.
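The checkpoint-side inode query mentioned above is a plain fstat. The sketch
below demonstrates it on an ordinary file descriptor, since creating a real
DMABuf requires a device; a plugin would call this on the DMABuf fd itself,
and two fds referring to the same underlying file report the same st_ino:

```c
#include <assert.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

/* Query the inode number that could serve as the BO's unique ID in the
 * checkpoint. Returns 0 on error. */
static unsigned long dmabuf_unique_id(int fd)
{
	struct stat st;

	if (fstat(fd, &st) != 0)
		return 0;
	return (unsigned long)st.st_ino;
}
```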
Regards,
Felix
* Re: Proposal to add CRIU support to DRM render nodes
From: Felix Kuehling @ 2024-01-15 18:58 UTC (permalink / raw)
To: dri-devel, amd-gfx, Dave Airlie, Daniel Vetter, criu
Cc: Errabolu, Ramesh, Christian König
I haven't seen any replies to this proposal. Either it got lost in the
pre-holiday noise, or there is genuinely no interest in this.
If it's the latter, I would look for an AMDGPU driver-specific solution
with minimally invasive changes in DRM and DMABuf code, if needed. Maybe
it could be generalized later if there is interest then.
Regards,
Felix
On 2023-12-06 16:23, Felix Kuehling wrote:
> Executive Summary: We need to add CRIU support to DRM render nodes in
> order to maintain CRIU support for ROCm applications once they start
> relying on render nodes for more GPU memory management. In this email
> I'm providing some background on why we are doing this, and outlining
> some of the problems we need to solve to checkpoint and restore render
> node state and shared memory (DMABuf) state. I have some thoughts on
> the API design, leaning on what we did for KFD, but would like to get
> feedback from the DRI community regarding that API and to what extent
> there is interest in making that generic.
>
> We are working on using DRM render nodes for virtual address mappings
> in ROCm applications to implement the CUDA11-style VM API and improve
> interoperability between graphics and compute. This uses DMABufs for
> sharing buffer objects between KFD and multiple render node devices,
> as well as between processes. In the long run this also provides a
> path to moving all or most memory management from the KFD ioctl API to
> libdrm.
>
> Once ROCm user mode starts using render nodes for virtual address
> management, that creates a problem for checkpointing and restoring
> ROCm applications with CRIU. Currently there is no support for
> checkpointing and restoring render node state, other than CPU virtual
> address mappings. Support will be needed for checkpointing GEM buffer
> objects and handles, their GPU virtual address mappings and memory
> sharing relationships between devices and processes.
>
> Eventually, if full CRIU support for graphics applications is desired,
> more state would need to be captured, including scheduler contexts and
> BO lists. Most of this state is driver-specific.
>
> After some internal discussions we decided to take our design process
> public as this potentially touches DRM GEM and DMABuf APIs and may
> have implications for other drivers in the future.
>
> One basic question before going into any API details: Is there a
> desire to have CRIU support for other DRM drivers?
>
> With that out of the way, some considerations for a possible DRM CRIU
> API (either generic or AMDGPU driver-specific): The API goes through
> several phases during checkpoint and restore:
>
> Checkpoint:
>
> 1. Process-info (enumerates objects and sizes so user mode can
> allocate memory for the checkpoint, stops execution on the GPU)
> 2. Checkpoint (store object metadata for BOs, queues, etc.)
> 3. Unpause (resumes execution after the checkpoint is complete)
>
> Restore:
>
> 1. Restore (restore objects, VMAs are not in the right place at this
> time)
> 2. Resume (final fixups after the VMAs are sorted out, resume execution)
>
> For some more background about our implementation in KFD, you can
> refer to this whitepaper:
> https://github.com/checkpoint-restore/criu/blob/criu-dev/plugins/amdgpu/README.md
>
> Potential objections to a KFD-style CRIU API in DRM render nodes, I'll
> address each of them in more detail below:
>
> * Opaque information in the checkpoint data that user mode can't
> interpret or do anything with
> * A second API for creating objects (e.g. BOs) that is separate from
> the regular BO creation API
> * Kernel mode would need to be involved in restoring BO sharing
> relationships rather than replaying BO creation, export and import
> from user mode
>
> # Opaque information in the checkpoint
>
> This comes out of ABI compatibility considerations. Adding any new
> objects or attributes to the driver/HW state that needs to be
> checkpointed could potentially break the ABI of the CRIU
> checkpoint/restore ioctl if the plugin needs to parse that
> information. Therefore, much of the information in our KFD CRIU ioctl
> API is opaque. It is written by kernel mode in the checkpoint, it is
> consumed by kernel mode when restoring the checkpoint, but user mode
> doesn't care about the contents or binary layout, so there is no user
> mode ABI to break. This is how we were able to maintain CRIU support
> when we added the SVM API to KFD without changing the CRIU plugin and
> without breaking our ABI.
>
> Opaque information may also lend itself to API abstraction, if this
> becomes a generic DRM API with driver-specific callbacks that fill in
> HW-specific opaque data.
>
> # Second API for creating objects
>
> Creating BOs and other objects when restoring a checkpoint needs more
> information than the usual BO alloc and similar APIs provide. For
> example, we need to restore BOs with the same GEM handles so that user
> mode can continue using those handles after resuming execution. If BOs
> are shared through DMABufs without dynamic attachment, we need to
> restore pinned BOs as pinned. Validation of virtual addresses and
> handling MMU notifiers must be suspended until the virtual address
> space is restored. For user mode queues we need to save and restore a
> lot of queue execution state so that execution can resume cleanly.
>
> # Restoring buffer sharing relationships
>
> Different GEM handles in different render nodes and processes can
> refer to the same underlying shared memory, either by directly
> pointing to the same GEM object, or by creating an import attachment
> that may get its SG tables invalidated and updated dynamically through
> dynamic attachment callbacks. In the latter case it's obvious who is
> the exporter and who is the importer. In the first case, either one
> could be the exporter, and it's not clear who would need to create the
> BO and who would need to import it when restoring the checkpoint. To
> further complicate things, multiple processes in a checkpoint get
> restored concurrently. So there is no guarantee that an exporter has
> restored a shared BO at the time an importer is trying to restore its
> import.
>
> A proposal to deal with these problems would be to treat importers and
> exporters the same. Whoever restores first ends up creating the BO
> and potentially attaching to it. The other process(es) can find BOs
> that were already restored by another process by looking them up with a
> unique ID that could be based on the DMABuf inode number. An
> alternative would be a two-pass approach that restores BOs in two
> passes:
>
> 1. Restore exported BOs
> 2. Restore imports
>
> This would need some inter-process synchronization in CRIU itself
> between these two passes, which may require changes in core CRIU,
> outside our plugin. Both approaches depend on identifying BOs with some unique ID
> that could be based on the DMABuf inode number in the checkpoint.
> However, we would need to identify the processes in the same restore
> session, possibly based on parent/child process relationships, to
> create a scope where those IDs are valid during restore.
>
> Finally, we would also need to checkpoint and restore DMABuf file
> descriptors themselves. These are anonymous file descriptors. The CRIU
> plugin could probably be taught to recreate them from the original
> exported BO based on the inode number that could be queried with fstat
> in the checkpoint. It would need help from the render node CRIU API to
> find the right BO from the inode, which may be from a different
> process in the same restore session.
>
> Regards,
> Felix
>
>
* Re: Proposal to add CRIU support to DRM render nodes
From: Tvrtko Ursulin @ 2024-03-11 14:48 UTC (permalink / raw)
To: Felix Kuehling, dri-devel, amd-gfx, Dave Airlie, Daniel Vetter, criu
Cc: Errabolu, Ramesh, Christian König
Hi Felix,
On 06/12/2023 21:23, Felix Kuehling wrote:
> Executive Summary: We need to add CRIU support to DRM render nodes in
> order to maintain CRIU support for ROCm applications once they start
> relying on render nodes for more GPU memory management. In this email
> I'm providing some background on why we are doing this, and outlining some
> of the problems we need to solve to checkpoint and restore render node
> state and shared memory (DMABuf) state. I have some thoughts on the API
> design, leaning on what we did for KFD, but would like to get feedback
> from the DRI community regarding that API and to what extent there is
> interest in making that generic.
>
> We are working on using DRM render nodes for virtual address mappings in
> ROCm applications to implement the CUDA11-style VM API and improve
> interoperability between graphics and compute. This uses DMABufs for
> sharing buffer objects between KFD and multiple render node devices, as
> well as between processes. In the long run this also provides a path to
> moving all or most memory management from the KFD ioctl API to libdrm.
>
> Once ROCm user mode starts using render nodes for virtual address
> management, that creates a problem for checkpointing and restoring ROCm
> applications with CRIU. Currently there is no support for checkpointing
> and restoring render node state, other than CPU virtual address
> mappings. Support will be needed for checkpointing GEM buffer objects
> and handles, their GPU virtual address mappings and memory sharing
> relationships between devices and processes.
>
> Eventually, if full CRIU support for graphics applications is desired,
> more state would need to be captured, including scheduler contexts and
> BO lists. Most of this state is driver-specific.
>
> After some internal discussions we decided to take our design process
> public as this potentially touches DRM GEM and DMABuf APIs and may have
> implications for other drivers in the future.
>
> One basic question before going into any API details: Is there a desire
> to have CRIU support for other DRM drivers?
This sounds like a very interesting feature overall, although I
cannot answer the last question here.
Funnily enough, it has a tiny relation to an i915 feature I recently
implemented at Mesa's request, which is to be able to "upload" the GPU
context from the GPU hang error state and replay the hanging request. It
is kind of (at a stretch) a very special tiny subset of checkpoint and
restore, so I am mentioning it only as a curiosity.
And there is also another partial conceptual intersection with the (at
the moment not yet upstream) i915 online debugger, in the area of
discovering and enumerating GPU resources belonging to the client. I
don't see immediate design or code sharing opportunities, though; I am
just mentioning it.
I did spend some time reading your plugin and kernel implementation out
of curiosity and have some comments and questions.
> With that out of the way, some considerations for a possible DRM CRIU
> API (either generic or AMDGPU driver-specific): The API goes through
> several phases during checkpoint and restore:
>
> Checkpoint:
>
> 1. Process-info (enumerates objects and sizes so user mode can allocate
> memory for the checkpoint, stops execution on the GPU)
> 2. Checkpoint (store object metadata for BOs, queues, etc.)
> 3. Unpause (resumes execution after the checkpoint is complete)
>
> Restore:
>
> 1. Restore (restore objects, VMAs are not in the right place at this time)
> 2. Resume (final fixups after the VMAs are sorted out, resume execution)
Btw, does checkpointing guarantee that all relevant activity is idled?
For instance, are dma_resv objects free of fences that would need to be
restored for things to continue executing sensibly? Or how is that handled?
> For some more background about our implementation in KFD, you can refer
> to this whitepaper:
> https://github.com/checkpoint-restore/criu/blob/criu-dev/plugins/amdgpu/README.md
>
> Potential objections to a KFD-style CRIU API in DRM render nodes, I'll
> address each of them in more detail below:
>
> * Opaque information in the checkpoint data that user mode can't
> interpret or do anything with
> * A second API for creating objects (e.g. BOs) that is separate from
> the regular BO creation API
> * Kernel mode would need to be involved in restoring BO sharing
> relationships rather than replaying BO creation, export and import
> from user mode
>
> # Opaque information in the checkpoint
>
> This comes out of ABI compatibility considerations. Adding any new
> objects or attributes to the driver/HW state that needs to be
> checkpointed could potentially break the ABI of the CRIU
> checkpoint/restore ioctl if the plugin needs to parse that information.
> Therefore, much of the information in our KFD CRIU ioctl API is opaque.
> It is written by kernel mode in the checkpoint, it is consumed by kernel
> mode when restoring the checkpoint, but user mode doesn't care about the
> contents or binary layout, so there is no user mode ABI to break. This
> is how we were able to maintain CRIU support when we added the SVM API
> to KFD without changing the CRIU plugin and without breaking our ABI.
>
> Opaque information may also lend itself to API abstraction, if this
> becomes a generic DRM API with driver-specific callbacks that fill in
> HW-specific opaque data.
This feels sound in principle to me. Fundamentally the state is very
hardware-specific and/or driver-version-specific, so I don't see what
could be gained in practice by making it much less opaque.
(Apart from making things more complicated.)
I was however unsure of the current split of how you dump buffer objects
with some data in the defined bo structure, and some in completely
opaque private data. Is there a benefit to that split, or maybe in other
words, is there a benefit to having part transparent and part opaque for
buffer objects?
To slightly touch upon the question of whether this could become a
generic DRM API: it feels it would be hard to do from the start. What
sounds more feasible is that, if and when generic-looking helpers are
spotted while developing the RFC, the code could be structured so they
can easily be promoted to shared/common at some future point.
> # Second API for creating objects
>
> Creating BOs and other objects when restoring a checkpoint needs more
> information than the usual BO alloc and similar APIs provide. For
> example, we need to restore BOs with the same GEM handles so that user
> mode can continue using those handles after resuming execution. If BOs
> are shared through DMABufs without dynamic attachment, we need to
> restore pinned BOs as pinned. Validation of virtual addresses and
> handling MMU notifiers must be suspended until the virtual address space
> is restored. For user mode queues we need to save and restore a lot of
> queue execution state so that execution can resume cleanly.
This also sounds justified to me. Restore having to create all internal
objects is definitely special, and it sounds better to add uapi to
create them directly with the correct properties than to add uapi to
adjust internal properties after creation. In either case you would
need some new uapi, at least to adjust after creation, at which point
you may as well have both in one. Internally the implementation can be
split or common, whatever makes sense for a given object type, but new
uapi definitely sounds required.
> # Restoring buffer sharing relationships
>
> Different GEM handles in different render nodes and processes can refer
> to the same underlying shared memory, either by directly pointing to the
> same GEM object, or by creating an import attachment that may get its SG
> tables invalidated and updated dynamically through dynamic attachment
> callbacks. In the latter case it's obvious who is the exporter and who
> is the importer. In the first case, either one could be the exporter,
> and it's not clear who would need to create the BO and who would need to
To see if I follow the former case correctly.
This could be two clients A and B, where B has imported a dma-buf shared
BO from A and has since closed the dma-buf fd? Which results in a single
BO with reference count of 2 and obj->import_attach unset. History of
who created the object is lost.
In fact it could even be that two imported objects remain (clients A, B
and C) and A, who originally created the BO, has since fully released
it. So any kind of "creator" tracking, if added, wouldn't be fully
reliable either.
> import it when restoring the checkpoint. To further complicate things,
> multiple processes in a checkpoint get restored concurrently. So there
> is no guarantee that an exporter has restored a shared BO at the time an
> importer is trying to restore its import.
>
> A proposal to deal with these problems would be to treat importers and
> exporters the same. Whoever restores first ends up creating the BO and
> potentially attaching to it. The other process(es) can find BOs that
> were already restored by another process by looking them up with a unique
> ID that could be based on the DMABuf inode number. An alternative would
> be a two-pass approach that restores BOs in two passes:
>
> 1. Restore exported BOs
> 2. Restore imports
>
> This would need some inter-process synchronization in CRIU itself
> between these two passes, which may require changes in core CRIU,
> outside our plugin.
> Both approaches depend on identifying BOs with some unique ID that could
> be based on the DMABuf inode number in the checkpoint. However, we would
> need to identify the processes in the same restore session, possibly
> based on parent/child process relationships, to create a scope where
> those IDs are valid during restore.
If my understanding above is on the right track, then I think this is
the only thing which can be done (for all scenarios).
I also *think* it would be safe. At least at the moment I cannot think
of what could go wrong. Semantics are that it doesn't really matter who
created the object.
> Finally, we would also need to checkpoint and restore DMABuf file
> descriptors themselves. These are anonymous file descriptors. The CRIU
> plugin could probably be taught to recreate them from the original
> exported BO based on the inode number that could be queried with fstat
> in the checkpoint. It would need help from the render node CRIU API to
> find the right BO from the inode, which may be from a different process
> in the same restore session.
This part feels like it is breaking the component separation a bit
because even for buffers fully owned by amdgpu, strictly speaking the
dma-buf fd is not. At least my understanding from the above is that you
propose to attempt to import the fd, from the kernel side, during the
checkpoint process? Like:
Checkpoint:
  CRIU, for each anon fd:
    amdgpu_plugin(fd)
      -> attempt in-kernel dma-buf import (passes fd to kernel via ioctl?)
      -> is it ours? (no -> error)
      -> create a record mapping fd number to amdgpu BO

Restore:
  for each dma-buf fd record:
    create BO if it does not exist
    export BO to the same fd
    close BO handle if not in regular BO handle records
Or, since you mention lookup by inode, that would need to be a
dmabuf_plugin so it can look up inodes in the private mount space.
However, how it would co-operate with amdgpu_plugin is not clear to me.
Regards,
Tvrtko
* Re: Proposal to add CRIU support to DRM render nodes
From: Tvrtko Ursulin @ 2024-03-12 9:45 UTC (permalink / raw)
To: Felix Kuehling, dri-devel, amd-gfx, Dave Airlie, Daniel Vetter, criu
Cc: Errabolu, Ramesh, Christian König
On 11/03/2024 14:48, Tvrtko Ursulin wrote:
>
> Hi Felix,
>
> On 06/12/2023 21:23, Felix Kuehling wrote:
>> Executive Summary: We need to add CRIU support to DRM render nodes in
>> order to maintain CRIU support for ROCm applications once they start
>> relying on render nodes for more GPU memory management. In this email
>> I'm providing some background on why we are doing this, and outlining
>> some of the problems we need to solve to checkpoint and restore render
>> node state and shared memory (DMABuf) state. I have some thoughts on
>> the API design, leaning on what we did for KFD, but would like to get
>> feedback from the DRI community regarding that API and to what extent
>> there is interest in making that generic.
>>
>> We are working on using DRM render nodes for virtual address mappings
>> in ROCm applications to implement the CUDA11-style VM API and improve
>> interoperability between graphics and compute. This uses DMABufs for
>> sharing buffer objects between KFD and multiple render node devices,
>> as well as between processes. In the long run this also provides a
>> path to moving all or most memory management from the KFD ioctl API to
>> libdrm.
>>
>> Once ROCm user mode starts using render nodes for virtual address
>> management, that creates a problem for checkpointing and restoring
>> ROCm applications with CRIU. Currently there is no support for
>> checkpointing and restoring render node state, other than CPU virtual
>> address mappings. Support will be needed for checkpointing GEM buffer
>> objects and handles, their GPU virtual address mappings and memory
>> sharing relationships between devices and processes.
>>
>> Eventually, if full CRIU support for graphics applications is desired,
>> more state would need to be captured, including scheduler contexts and
>> BO lists. Most of this state is driver-specific.
>>
>> After some internal discussions we decided to take our design process
>> public as this potentially touches DRM GEM and DMABuf APIs and may
>> have implications for other drivers in the future.
>>
>> One basic question before going into any API details: Is there a
>> desire to have CRIU support for other DRM drivers?
>
> This sounds like a very interesting feature on the overall, although I
> cannot answer on the last question here.
I forgot to finish this thought. I cannot answer / don't know of any
concrete plans, but I think the feature is pretty cool, and if amdgpu
gets it working I wouldn't be surprised if other drivers got interested.
> Funnily enough, it has a tiny relation to an i915 feature I recently
> implemented on Mesa's request, which is to be able to "upload" the GPU
> context from the GPU hang error state and replay the hanging request. It
> is kind of (at a stretch) a very special tiny subset of checkpoint and
> restore, so I am mentioning it only as a curiosity.
>
> And there is also another partial conceptual intersection with the (at
> the moment not yet upstream) i915 online debugger. This part being in the
> area of discovering and enumerating GPU resources belonging to the client.
>
> I don't see immediate design or code sharing opportunities, though; I am
> just mentioning it.
>
> I did spend some time reading your plugin and kernel implementation out
> of curiosity and have some comments and questions.
>
>> With that out of the way, some considerations for a possible DRM CRIU
>> API (either generic or AMDGPU driver-specific): The API goes through
>> several phases during checkpoint and restore:
>>
>> Checkpoint:
>>
>> 1. Process-info (enumerates objects and sizes so user mode can allocate
>> memory for the checkpoint, stops execution on the GPU)
>> 2. Checkpoint (store object metadata for BOs, queues, etc.)
>> 3. Unpause (resumes execution after the checkpoint is complete)
>>
>> Restore:
>>
>> 1. Restore (restore objects, VMAs are not in the right place at this
>> time)
>> 2. Resume (final fixups after the VMAs are sorted out, resume execution)
>
> Btw, does checkpointing guarantee that all relevant activity is idled?
> For instance, are dma_resv objects free of fences that would need to be
> restored for things to continue executing sensibly? Or how is that handled?
>
>> For some more background about our implementation in KFD, you can
>> refer to this whitepaper:
>> https://github.com/checkpoint-restore/criu/blob/criu-dev/plugins/amdgpu/README.md
>>
>> Potential objections to a KFD-style CRIU API in DRM render nodes, I'll
>> address each of them in more detail below:
>>
>> * Opaque information in the checkpoint data that user mode can't
>> interpret or do anything with
>> * A second API for creating objects (e.g. BOs) that is separate from
>> the regular BO creation API
>> * Kernel mode would need to be involved in restoring BO sharing
>> relationships rather than replaying BO creation, export and import
>> from user mode
>>
>> # Opaque information in the checkpoint
>>
>> This comes out of ABI compatibility considerations. Adding any new
>> objects or attributes to the driver/HW state that needs to be
>> checkpointed could potentially break the ABI of the CRIU
>> checkpoint/restore ioctl if the plugin needs to parse that
>> information. Therefore, much of the information in our KFD CRIU ioctl
>> API is opaque. It is written by kernel mode in the checkpoint, it is
>> consumed by kernel mode when restoring the checkpoint, but user mode
>> doesn't care about the contents or binary layout, so there is no user
>> mode ABI to break. This is how we were able to maintain CRIU support
>> when we added the SVM API to KFD without changing the CRIU plugin and
>> without breaking our ABI.
>>
>> Opaque information may also lend itself to API abstraction, if this
>> becomes a generic DRM API with driver-specific callbacks that fill in
>> HW-specific opaque data.
>
> This feels sound in principle to me. Fundamentally the state is very
> hardware-specific and/or driver-version-specific, so I don't see
> anything that could be gained in practice by making it much less opaque.
> (Apart from making things more complicated.)
>
> I was however unsure of the current split of how you dump buffer objects,
> with some data in the defined bo structure and some in completely
> opaque private data. Is there a benefit to that split, or, in other
> words, is there a benefit in having part transparent and part opaque for
> buffer objects?
>
> To slightly touch upon the question of whether this could become a
> generic DRM API: it feels it would be hard to do from the start. What
> sounds more feasible is that if/when generic-looking helpers are spotted
> while developing the RFC, the code can be structured so they can
> easily be promoted to shared/common code at some future moment.
>
>> # Second API for creating objects
>>
>> Creating BOs and other objects when restoring a checkpoint needs more
>> information than the usual BO alloc and similar APIs provide. For
>> example, we need to restore BOs with the same GEM handles so that user
>> mode can continue using those handles after resuming execution. If BOs
>> are shared through DMABufs without dynamic attachment, we need to
>> restore pinned BOs as pinned. Validation of virtual addresses and
>> handling MMU notifiers must be suspended until the virtual address
>> space is restored. For user mode queues we need to save and restore a
>> lot of queue execution state so that execution can resume cleanly.
>
> This also sounds justified to me. Having restore create all internal
> objects is definitely special, and it sounds better to add uapi to create
> them directly with the correct properties than to add uapi to adjust
> internal properties after creation. In either case you would need
> some new uapi - at least to adjust after creation - at which point you
> may as well have both in one. Internally the implementation can be split
> or common, whatever makes sense for a given object type, but new uapi
> definitely sounds required.
>> # Restoring buffer sharing relationships
>>
>> Different GEM handles in different render nodes and processes can
>> refer to the same underlying shared memory, either by directly
>> pointing to the same GEM object, or by creating an import attachment
>> that may get its SG tables invalidated and updated dynamically through
>> dynamic attachment callbacks. In the latter case it's obvious, who is
>> the exporter and who is the importer. In the first case, either one
>> could be the exporter, and it's not clear who would need to create the
>> BO and who would need to
>
> To see if I follow the former case correctly.
>
> This could be two clients A and B, where B has imported a dma-buf shared
> BO from A and has since closed the dma-buf fd? Which results in a single
> BO with reference count of 2 and obj->import_attach unset. History of
> who created the object is lost.
>
> In fact it could even be that two imported objects remain (clients A, B
> and C) and A, who originally created the BO, has since fully released
> it. So any kind of "creator" tracking, if added, wouldn't be fully
> reliable either.
>
>> import it when restoring the checkpoint. To further complicate things,
>> multiple processes in a checkpoint get restored concurrently. So there
>> is no guarantee that an exporter has restored a shared BO at the time
>> an importer is trying to restore its import.
>>
>> A proposal to deal with these problems would be to treat importers and
>> exporters the same. Whoever restores first, ends up creating the BO
>> and potentially attaching to it. The other process(es) can find BOs
>> that were already restored by another process by looking it up with a
>> unique ID that could be based on the DMABuf inode number. An
>> alternative would be a two-pass approach that restores BOs in
>> two passes:
>>
>> 1. Restore exported BOs
>> 2. Restore imports
>>
>> With some inter-process synchronization in CRIU itself between these
>> two passes. This may require changes in the core CRIU, outside our
>> plugin. Both approaches depend on identifying BOs with some unique ID
>> that could be based on the DMABuf inode number in the checkpoint.
>> However, we would need to identify the processes in the same restore
>> session, possibly based on parent/child process relationships, to
>> create a scope where those IDs are valid during restore.
>
> If my understanding above is on the right track, then I think this is
> the only thing which can be done (for all scenarios).
>
> I also *think* it would be safe. At least at the moment I cannot think
> what could go wrong. Semantics are that it doesn't really matter who
> created the object.
>
>> Finally, we would also need to checkpoint and restore DMABuf file
>> descriptors themselves. These are anonymous file descriptors. The CRIU
>> plugin could probably be taught to recreate them from the original
>> exported BO based on the inode number that could be queried with fstat
>> in the checkpoint. It would need help from the render node CRIU API to
>> find the right BO from the inode, which may be from a different
>> process in the same restore session.
>
> This part feels like it is breaking the component separation a bit
> because even for buffers fully owned by amdgpu, strictly speaking the
> dma-buf fd is not. At least my understanding from the above is that you
> propose to attempt to import the fd, from the kernel side, during the
> checkpoint process? Like:
>
> Checkpoint:
>
> CRIU for each anon fd:
> amdgpu_plugin(fd)
> -> attempt in kernel dma buf import (passes fd to kernel via ioctl?)
> -> is it ours? (no -> error)
> -> create a record mapping fd number to amdgpu BO
>
> Restore:
>
> for each dma-buf fd record:
> create BO if it does not exist
> export BO to same fd
> close BO handle if not in regular BO handle records
>
> Or since you mention lookup by inode, that would need to be a
> dmabuf_plugin so it can look up inodes in the private mount space.
> However, how it would co-operate with amdgpu_plugin is not clear to me.
I later also realised that I was maybe increasing the scope for you
here. :) You did state the focus is ROCm applications, which possibly
don't care about dma-resv, fences, syncobjs etc.?
But I think the "how to handle dma-bufs" design question is still
relevant and interesting. For example I had this thought that perhaps
what would be needed is a CRIU plugin hierarchy.
Because fundamentally we would be snapshotting a hierarchy of kernel
objects belonging to different drivers (kfd, amdgpu, dma-buf, ...). And
if one day someone were to try to handle dma fences and drm syncobjs,
the argument for a hierarchical design would be even stronger, I think.
Something like a drm_plugin.so could call sub-plugins (amdgpu, dma-buf,
sync file, ...) and internally build a representation of the whole
state and the relationships between the objects.
With that kind of design there probably would be a need to define some
common kernel side api and uapi, so all involved objects can be
enumerated with some unique ids etc.
Now... the counter-argument: the more state from different drivers
one wants to handle, the bigger this project gets. The question is whether
it would even be feasible, to the point that it may be simpler to just
run the workload in a VM via SR-IOV and simply hibernate the whole thin
guest. :)
Regards,
Tvrtko
* Re: Proposal to add CRIU support to DRM render nodes
2024-03-12 9:45 ` Tvrtko Ursulin
@ 2024-03-15 2:33 ` Felix Kuehling
2024-03-15 18:36 ` Tvrtko Ursulin
0 siblings, 1 reply; 14+ messages in thread
From: Felix Kuehling @ 2024-03-15 2:33 UTC (permalink / raw)
To: Tvrtko Ursulin, dri-devel, amd-gfx, Dave Airlie, Daniel Vetter, criu
Cc: Errabolu, Ramesh, Christian König
On 2024-03-12 5:45, Tvrtko Ursulin wrote:
>
> On 11/03/2024 14:48, Tvrtko Ursulin wrote:
>>
>> Hi Felix,
>>
>> On 06/12/2023 21:23, Felix Kuehling wrote:
>>> Executive Summary: We need to add CRIU support to DRM render nodes
>>> in order to maintain CRIU support for ROCm applications once they
>>> start relying on render nodes for more GPU memory management. In
>>> this email I'm providing some background on why we are doing this, and
>>> outlining some of the problems we need to solve to checkpoint and
>>> restore render node state and shared memory (DMABuf) state. I have
>>> some thoughts on the API design, leaning on what we did for KFD, but
>>> would like to get feedback from the DRI community regarding that API
>>> and to what extent there is interest in making that generic.
>>>
>>> We are working on using DRM render nodes for virtual address
>>> mappings in ROCm applications to implement the CUDA11-style VM API
>>> and improve interoperability between graphics and compute. This uses
>>> DMABufs for sharing buffer objects between KFD and multiple render
>>> node devices, as well as between processes. In the long run this
>>> also provides a path to moving all or most memory management from
>>> the KFD ioctl API to libdrm.
>>>
>>> Once ROCm user mode starts using render nodes for virtual address
>>> management, that creates a problem for checkpointing and restoring
>>> ROCm applications with CRIU. Currently there is no support for
>>> checkpointing and restoring render node state, other than CPU
>>> virtual address mappings. Support will be needed for checkpointing
>>> GEM buffer objects and handles, their GPU virtual address mappings
>>> and memory sharing relationships between devices and processes.
>>>
>>> Eventually, if full CRIU support for graphics applications is
>>> desired, more state would need to be captured, including scheduler
>>> contexts and BO lists. Most of this state is driver-specific.
>>>
>>> After some internal discussions we decided to take our design
>>> process public as this potentially touches DRM GEM and DMABuf APIs
>>> and may have implications for other drivers in the future.
>>>
>>> One basic question before going into any API details: Is there a
>>> desire to have CRIU support for other DRM drivers?
>>
>> This sounds like a very interesting feature overall, although
>> I cannot answer the last question here.
>
> I forgot to finish this thought. I cannot answer / don't know of any
> concrete plans, but I think the feature is pretty cool, and if amdgpu
> gets it working I wouldn't be surprised if other drivers became interested.
Thanks, that's good to hear!
>
>> Funnily enough, it has a tiny relation to an i915 feature I recently
>> implemented on Mesa's request, which is to be able to "upload" the
>> GPU context from the GPU hang error state and replay the hanging
>> request. It is kind of (at a stretch) a very special tiny subset of
>> checkpoint and restore, so I am mentioning it only as a curiosity.
>>
>> And there is also another partial conceptual intersection with the (at
>> the moment not yet upstream) i915 online debugger. This part being in
>> the area of discovering and enumerating GPU resources belonging to the
>> client.
>>
>> I don't see an immediate design or code sharing opportunities though
>> but just mentioning.
>>
>> I did spend some time reading your plugin and kernel implementation
>> out of curiosity and have some comments and questions.
>>
>>> With that out of the way, some considerations for a possible DRM
>>> CRIU API (either generic or AMDGPU driver-specific): The API goes
>>> through several phases during checkpoint and restore:
>>>
>>> Checkpoint:
>>>
>>> 1. Process-info (enumerates objects and sizes so user mode can
>>> allocate
>>> memory for the checkpoint, stops execution on the GPU)
>>> 2. Checkpoint (store object metadata for BOs, queues, etc.)
>>> 3. Unpause (resumes execution after the checkpoint is complete)
>>>
>>> Restore:
>>>
>>> 1. Restore (restore objects, VMAs are not in the right place at
>>> this time)
>>> 2. Resume (final fixups after the VMAs are sorted out, resume
>>> execution)
>>
>> Btw, is check-pointing guaranteeing that all relevant activity is
>> idled? For instance, are dma_resv objects free of fences which would
>> need to be restored for things to continue executing sensibly? Or how
>> is that handled?
In our compute use cases, we suspend user mode queues. This can include
CWSR (compute-wave-save-restore) where the state of in-flight waves is
stored in memory and can be reloaded and resumed from memory later. We
don't use any fences other than "eviction fences", which are signaled
after the queues are suspended. And those fences are never handed to
user mode. So we don't need to worry about any fence state in the
checkpoint.
If we extended this to support the kernel mode command submission APIs,
I would expect that we'd wait for all current submissions to complete,
and stop new ones from being sent to the HW before taking the
checkpoint. When we take the checkpoint in the CRIU plugin, the CPU
threads are already frozen and cannot submit any more work. If we wait
for all currently pending submissions to drain, I think we don't need to
save any fence state because all the fences will have signaled. (I may
be missing some intricacies and I'm afraid it may not be that simple in
reality, but that's my opening bid. ;)
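A toy userspace model of that argument: if the freezer stops new submissions and the kernel drains the in-flight ones, the checkpoint only ever sees signaled fences and has no fence state to serialize. Purely illustrative, not driver code:

```c
/* Toy model of "drain before checkpoint": once every pending
 * submission has retired, all fences are signaled and nothing
 * fence-related needs to go into the checkpoint. */
#include <stddef.h>

/* Returns 1 if every fence in the array is signaled (non-zero),
 * i.e. the checkpoint may proceed without recording fence state. */
int all_fences_signaled(const int *signaled, size_t n)
{
	for (size_t i = 0; i < n; i++)
		if (!signaled[i])
			return 0;
	return 1;
}
```

With the CPU threads already frozen, `n` cannot grow while draining, which is what makes the "wait until all_fences_signaled" approach terminate for well-behaved clients.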
>>
>>> For some more background about our implementation in KFD, you can
>>> refer to this whitepaper:
>>> https://github.com/checkpoint-restore/criu/blob/criu-dev/plugins/amdgpu/README.md
>>>
>>> Potential objections to a KFD-style CRIU API in DRM render nodes,
>>> I'll address each of them in more detail below:
>>>
>>> * Opaque information in the checkpoint data that user mode can't
>>> interpret or do anything with
>>> * A second API for creating objects (e.g. BOs) that is separate from
>>> the regular BO creation API
>>> * Kernel mode would need to be involved in restoring BO sharing
>>> relationships rather than replaying BO creation, export and import
>>> from user mode
>>>
>>> # Opaque information in the checkpoint
>>>
>>> This comes out of ABI compatibility considerations. Adding any new
>>> objects or attributes to the driver/HW state that needs to be
>>> checkpointed could potentially break the ABI of the CRIU
>>> checkpoint/restore ioctl if the plugin needs to parse that
>>> information. Therefore, much of the information in our KFD CRIU
>>> ioctl API is opaque. It is written by kernel mode in the checkpoint,
>>> it is consumed by kernel mode when restoring the checkpoint, but
>>> user mode doesn't care about the contents or binary layout, so there
>>> is no user mode ABI to break. This is how we were able to maintain
>>> CRIU support when we added the SVM API to KFD without changing the
>>> CRIU plugin and without breaking our ABI.
>>>
>>> Opaque information may also lend itself to API abstraction, if this
>>> becomes a generic DRM API with driver-specific callbacks that fill
>>> in HW-specific opaque data.
>>
>> This feels sound in principle to me. Fundamentally the state is very
>> hardware-specific and/or driver-version-specific, so I don't see
>> anything that could be gained in practice by making it much less
>> opaque. (Apart from making things more complicated.)
>>
>> I was however unsure of the current split of how you dump buffer
>> objects, with some data in the defined bo structure and some in
>> completely opaque private data. Is there a benefit to that split, or,
>> in other words, is there a benefit in having part transparent and
>> part opaque for buffer objects?
Some of the buffer object state is needed by the plugin. E.g. the size
and mmap offset are needed to match VMAs with BOs. I'd have to review
the plugin in detail to prove that all the fields are, in fact, needed
by the plugin, but that was the intent. Anything that the plugin doesn't
need to know should be in the opaque data structures.
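As a sketch of that matching step (the field and function names below are made up, not the actual plugin code): the transparent size and mmap offset are enough for the plugin to find which dumped BO backs a given VMA:

```c
/* Sketch of why some BO fields must stay transparent to the plugin:
 * matching a dumped VMA back to the BO it maps only needs the BO's
 * mmap offset and size. Names are illustrative. */
#include <stdint.h>

struct bo_record {
	uint64_t mmap_offset; /* fake offset from the BO dump */
	uint64_t size;
};

/* Return the index of the BO backing the VMA at [vma_off, vma_off +
 * vma_len), or -1 if no dumped BO covers that range. */
int match_vma_to_bo(const struct bo_record *bos, int n,
		    uint64_t vma_off, uint64_t vma_len)
{
	for (int i = 0; i < n; i++)
		if (vma_off >= bos[i].mmap_offset &&
		    vma_off + vma_len <= bos[i].mmap_offset + bos[i].size)
			return i;
	return -1;
}
```

Everything else about the BO (placement, HW state, ...) can stay in the opaque blob, since the plugin never needs to interpret it.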
>>
>> To slightly touch upon the question of whether this could become a
>> generic DRM API: it feels it would be hard to do from the start.
>> What sounds more feasible is that if/when generic-looking helpers are
>> spotted while developing the RFC, the code can be structured so they
>> can easily be promoted to shared/common code at some future moment.
Yes, that's how this usually goes, in my experience. Thanks for confirming.
>>
>>> # Second API for creating objects
>>>
>>> Creating BOs and other objects when restoring a checkpoint needs
>>> more information than the usual BO alloc and similar APIs provide.
>>> For example, we need to restore BOs with the same GEM handles so
>>> that user mode can continue using those handles after resuming
>>> execution. If BOs are shared through DMABufs without dynamic
>>> attachment, we need to restore pinned BOs as pinned. Validation of
>>> virtual addresses and handling MMU notifiers must be suspended until
>>> the virtual address space is restored. For user mode queues we need
>>> to save and restore a lot of queue execution state so that execution
>>> can resume cleanly.
>>
>> This also sounds justified to me. Having restore create all internal
>> objects is definitely special, and it sounds better to add uapi to
>> create them directly with the correct properties than to add uapi to
>> adjust internal properties after creation. In either case you would
>> need some new uapi - at least to adjust after creation - at which point
>> you may as well have both in one. Internally the implementation can be
>> split or common, whatever makes sense for a given object type, but new
>> uapi definitely sounds required.
>>> # Restoring buffer sharing relationships
>>>
>>> Different GEM handles in different render nodes and processes can
>>> refer to the same underlying shared memory, either by directly
>>> pointing to the same GEM object, or by creating an import attachment
>>> that may get its SG tables invalidated and updated dynamically
>>> through dynamic attachment callbacks. In the latter case it's
>>> obvious who is the exporter and who is the importer. In the first
>>> case, either one could be the exporter, and it's not clear who would
>>> need to create the BO and who would need to
>>
>> To see if I follow the former case correctly.
>>
>> This could be two clients A and B, where B has imported a dma-buf
>> shared BO from A and has since closed the dma-buf fd? Which results
>> in a single BO with reference count of 2 and obj->import_attach
>> unset. History of who created the object is lost.
Yes. In the amdgpu driver this happens when the exporting and importing
devices are the same.
>>
>> In fact it could even be that two imported objects remain (clients A,
>> B and C) and A, who originally created the BO, has since fully
>> released it. So any kind of "creator" tracking, if added, wouldn't be
>> fully reliable either.
That's a good point.
>>
>>> import it when restoring the checkpoint. To further complicate
>>> things, multiple processes in a checkpoint get restored
>>> concurrently. So there is no guarantee that an exporter has restored
>>> a shared BO at the time an importer is trying to restore its import.
>>>
>>> A proposal to deal with these problems would be to treat importers
>>> and exporters the same. Whoever restores first, ends up creating the
>>> BO and potentially attaching to it. The other process(es) can find
>>> BOs that were already restored by another process by looking it up
>>> with a unique ID that could be based on the DMABuf inode number. An
>>> alternative would be a two-pass approach that restores BOs
>>> in two passes:
>>>
>>> 1. Restore exported BOs
>>> 2. Restore imports
>>>
>>> With some inter-process synchronization in CRIU itself between these
>>> two passes. This may require changes in the core CRIU, outside our
>>> plugin. Both approaches depend on identifying BOs with some unique
>>> ID that could be based on the DMABuf inode number in the checkpoint.
>>> However, we would need to identify the processes in the same restore
>>> session, possibly based on parent/child process relationships, to
>>> create a scope where those IDs are valid during restore.
>>
>> If my understanding above is on the right track, then I think this is
>> the only thing which can be done (for all scenarios).
I presented two alternatives. I think you're in favor of the first one,
where it doesn't matter who is the importer and exporter. I think the
two-pass approach requires that you can identify an exporter. And as you
pointed out, the exporter may already have dropped their reference to
the BO.
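A minimal single-process model of the first alternative — whoever restores first creates the object, later restorers find it by a unique ID (e.g. the DMABuf inode number recorded in the checkpoint). Names are hypothetical, and real code would need cross-process synchronization around the lookup:

```c
/* Toy "first restorer creates" registry keyed by a unique ID such as
 * the checkpointed DMABuf inode number. Single-process simulation,
 * no locking; illustrative only. */
#include <stdint.h>

#define MAX_SHARED 64

static uint64_t ids[MAX_SHARED];
static int nids;
int creations; /* how many BOs were actually created (not just found) */

/* Return a handle (index) for the shared BO with this unique ID,
 * creating it only if no other "process" restored it first. */
int restore_shared_bo(uint64_t unique_id)
{
	for (int i = 0; i < nids; i++)
		if (ids[i] == unique_id)
			return i;   /* already restored: just reference it */
	ids[nids] = unique_id;
	creations++;
	return nids++;              /* first restorer creates the BO */
}
```

The key property being modeled: two restorers racing on the same ID end up with one underlying object, regardless of which of them was the original exporter.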
>>
>> I also *think* it would be safe. At least at the moment I cannot
>> think what could go wrong. Semantics are that it doesn't really
>> matter who created the object.
I would agree. What matters is that the object is recreated on the
correct device, and that all the direct references and import
attachments pointing to it are restored.
>>
>>> Finally, we would also need to checkpoint and restore DMABuf file
>>> descriptors themselves. These are anonymous file descriptors. The
>>> CRIU plugin could probably be taught to recreate them from the
>>> original exported BO based on the inode number that could be queried
>>> with fstat in the checkpoint. It would need help from the render
>>> node CRIU API to find the right BO from the inode, which may be from
>>> a different process in the same restore session.
>>
>> This part feels like it is breaking the component separation a bit
>> because even for buffers fully owned by amdgpu, strictly speaking the
>> dma-buf fd is not. At least my understanding from the above is that
>> you propose to attempt to import the fd, from the kernel side, during
>> the checkpoint process? Like:
>>
>> Checkpoint:
>>
>> CRIU for each anon fd:
>> amdgpu_plugin(fd)
>> -> attempt in kernel dma buf import (passes fd to kernel via
>> ioctl?)
>> -> is it ours? (no -> error)
>> -> create a record mapping fd number to amdgpu BO
>>
>> Restore:
>>
>> for each dma-buf fd record:
>> create BO if it does not exist
>> export BO to same fd
>> close BO handle if not in regular BO handle records
>>
>> Or since you mention lookup by inode, that would need to be a
>> dmabuf_plugin so it can look up inodes in the private mount space.
>> However, how it would co-operate with amdgpu_plugin is not clear to me.
The way I think about the ownership is, whichever driver created the
underlying BO owns the checkpointing of the dmabuf. You need
driver-specific information to link the dmabuf with the driver's BO and
you need the right driver to recreate the BO and the dmabuf fd when
restoring the checkpoint.
It gets really interesting if you have an amdgpu plugin and an i915
plugin, and they checkpoint an application that shares BOs between the
two devices through DMABufs. E.g. if i915 created a BO and amdgpu
imported it, then during restore, i915 needs to restore the dmabuf
before the amdgpu import of it can be restored. I think that brings us
back to a two-phase approach to restoring the memory sharing
relationships. Uff.
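That two-phase idea can be modeled as: pass 1 restores all exported (original) BOs regardless of driver, pass 2 restores imports, so an importer always finds the exporter's object already recreated. A sketch with a made-up record layout:

```c
/* Toy model of two-pass restore for cross-driver sharing: exports
 * first, then imports, independent of record order. Illustrative. */
#include <stdbool.h>

struct bo_rec {
	int id;
	int import_of;  /* id of the exported BO this imports, or -1 */
	bool restored;
};

/* Returns true iff every import found its exporter already restored
 * when it was processed in pass 2. */
bool two_pass_restore(struct bo_rec *recs, int n)
{
	for (int i = 0; i < n; i++)          /* pass 1: exported BOs */
		if (recs[i].import_of < 0)
			recs[i].restored = true;
	for (int i = 0; i < n; i++) {        /* pass 2: imports */
		if (recs[i].import_of < 0)
			continue;
		bool found = false;
		for (int j = 0; j < n; j++)
			if (recs[j].id == recs[i].import_of &&
			    recs[j].restored)
				found = true;
		if (!found)
			return false;        /* exporter missing: give up */
		recs[i].restored = true;
	}
	return true;
}
```

In the real CRIU setting the two passes would span multiple processes and plugins, which is where the inter-process synchronization in core CRIU would come in.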
>
> I later also realised that I was maybe increasing the scope for you
> here. :) You did state the focus is ROCm applications, which possibly
> don't care about dma-resv, fences, syncobjs etc.?
That's my focus for now. But I don't want to engineer a solution that
would preclude your use cases in the future.
>
> But I think the "how to handle dma-bufs" design question is still
> relevant and interesting. For example I had this thought that perhaps
> what would be needed is a CRIU plugin hierarchy.
>
> Because fundamentally we would be snapshotting a hierarchy of kernel
> objects belonging to different drivers (kfd, amdgpu, dma-buf, ...).
> And if one day someone were to try to handle dma fences and drm
> syncobjs, the argument for a hierarchical design would be even
> stronger, I think.
>
> Something like a drm_plugin.so could call sub-plugins (amdgpu,
> dma-buf, sync file, ...) and internally build a representation of
> the whole state and the relationships between the objects.
Maybe. I guess a structure similar to libdrm makes sense. I'm not sure
it's strictly a hierarchy. Maybe more like some common code shared by
multiple GPU driver plugins. I think the common checkpoint state is
quite limited and restoring it requires the GPU-specific drivers anyway.
Also the idea of building a representation of the whole state doesn't
work well with the CRIU design, because "the whole state" can include
multiple processes that restore themselves concurrently and only
synchronize with each other in a few places in the restore process. I
feel that if we can work out how to checkpoint and restore shared objects
between processes, we can do the same for shared objects between drivers
without imposing a strict hierarchy and some omniscient entity that
needs to know "the whole state".
>
> With that kind of design there probably would be a need to define some
> common kernel side api and uapi, so all involved objects can be
> enumerated with some unique ids etc.
>
> Now... the counter-argument: the more state from different drivers
> one wants to handle, the bigger this project gets. The question is
> whether it would even be feasible, to the point that it may be simpler
> to just run the workload in a VM via SR-IOV and simply hibernate the
> whole thin guest. :)
Well, CRIU kind of tries to do that, but with containers instead of VMs. ;)
Regards,
Felix
>
> Regards,
>
> Tvrtko
* Re: Proposal to add CRIU support to DRM render nodes
2024-03-15 2:33 ` Felix Kuehling
@ 2024-03-15 18:36 ` Tvrtko Ursulin
2024-03-28 16:03 ` Tvrtko Ursulin
0 siblings, 1 reply; 14+ messages in thread
From: Tvrtko Ursulin @ 2024-03-15 18:36 UTC (permalink / raw)
To: Felix Kuehling, dri-devel, amd-gfx, Dave Airlie, Daniel Vetter, criu
Cc: Errabolu, Ramesh, Christian König
On 15/03/2024 02:33, Felix Kuehling wrote:
>
> On 2024-03-12 5:45, Tvrtko Ursulin wrote:
>>
>> On 11/03/2024 14:48, Tvrtko Ursulin wrote:
>>>
>>> Hi Felix,
>>>
>>> On 06/12/2023 21:23, Felix Kuehling wrote:
>>>> Executive Summary: We need to add CRIU support to DRM render nodes
>>>> in order to maintain CRIU support for ROCm applications once they
>>>> start relying on render nodes for more GPU memory management. In
>>>> this email I'm providing some background on why we are doing this, and
>>>> outlining some of the problems we need to solve to checkpoint and
>>>> restore render node state and shared memory (DMABuf) state. I have
>>>> some thoughts on the API design, leaning on what we did for KFD, but
>>>> would like to get feedback from the DRI community regarding that API
>>>> and to what extent there is interest in making that generic.
>>>>
>>>> We are working on using DRM render nodes for virtual address
>>>> mappings in ROCm applications to implement the CUDA11-style VM API
>>>> and improve interoperability between graphics and compute. This uses
>>>> DMABufs for sharing buffer objects between KFD and multiple render
>>>> node devices, as well as between processes. In the long run this
>>>> also provides a path to moving all or most memory management from
>>>> the KFD ioctl API to libdrm.
>>>>
>>>> Once ROCm user mode starts using render nodes for virtual address
>>>> management, that creates a problem for checkpointing and restoring
>>>> ROCm applications with CRIU. Currently there is no support for
>>>> checkpointing and restoring render node state, other than CPU
>>>> virtual address mappings. Support will be needed for checkpointing
>>>> GEM buffer objects and handles, their GPU virtual address mappings
>>>> and memory sharing relationships between devices and processes.
>>>>
>>>> Eventually, if full CRIU support for graphics applications is
>>>> desired, more state would need to be captured, including scheduler
>>>> contexts and BO lists. Most of this state is driver-specific.
>>>>
>>>> After some internal discussions we decided to take our design
>>>> process public as this potentially touches DRM GEM and DMABuf APIs
>>>> and may have implications for other drivers in the future.
>>>>
>>>> One basic question before going into any API details: Is there a
>>>> desire to have CRIU support for other DRM drivers?
>>>
>>> This sounds like a very interesting feature overall, although
>>> I cannot answer the last question here.
>>
>> I forgot to finish this thought. I cannot answer / don't know of any
>> concrete plans, but I think the feature is pretty cool, and if amdgpu
>> gets it working I wouldn't be surprised if other drivers became interested.
>
> Thanks, that's good to hear!
>
>
>>
>>> Funnily enough, it has a tiny relation to an i915 feature I recently
>>> implemented on Mesa's request, which is to be able to "upload" the
>>> GPU context from the GPU hang error state and replay the hanging
>>> request. It is kind of (at a stretch) a very special tiny subset of
>>> checkpoint and restore, so I am mentioning it only as a curiosity.
>>>
>>> And there is also another partial conceptual intersection with the (at
>>> the moment not yet upstream) i915 online debugger. This part being in
>>> the area of discovering and enumerating GPU resources belonging to the
>>> client.
>>>
>>> I don't see an immediate design or code sharing opportunities though
>>> but just mentioning.
>>>
>>> I did spend some time reading your plugin and kernel implementation
>>> out of curiosity and have some comments and questions.
>>>
>>>> With that out of the way, some considerations for a possible DRM
>>>> CRIU API (either generic or AMDGPU driver-specific): The API goes
>>>> through several phases during checkpoint and restore:
>>>>
>>>> Checkpoint:
>>>>
>>>> 1. Process-info (enumerates objects and sizes so user mode can
>>>> allocate
>>>> memory for the checkpoint, stops execution on the GPU)
>>>> 2. Checkpoint (store object metadata for BOs, queues, etc.)
>>>> 3. Unpause (resumes execution after the checkpoint is complete)
>>>>
>>>> Restore:
>>>>
>>>> 1. Restore (restore objects, VMAs are not in the right place at
>>>> this time)
>>>> 2. Resume (final fixups after the VMAs are sorted out, resume
>>>> execution)
>>>
>>> Btw, is check-pointing guaranteeing that all relevant activity is
>>> idled? For instance, are dma_resv objects free of fences which would
>>> need to be restored for things to continue executing sensibly? Or
>>> how is that handled?
>
> In our compute use cases, we suspend user mode queues. This can include
> CWSR (compute-wave-save-restore) where the state of in-flight waves is
> stored in memory and can be reloaded and resumed from memory later. We
> don't use any fences other than "eviction fences", that are signaled
> after the queues are suspended. And those fences are never handed to
> user mode. So we don't need to worry about any fence state in the
> checkpoint.
>
> If we extended this to support the kernel mode command submission APIs,
> I would expect that we'd wait for all current submissions to complete,
> and stop new ones from being sent to the HW before taking the
> checkpoint. When we take the checkpoint in the CRIU plugin, the CPU
> threads are already frozen and cannot submit any more work. If we wait
> for all currently pending submissions to drain, I think we don't need to
> save any fence state because all the fences will have signaled. (I may
> be missing some intricacies and I'm afraid it may not be that simple in
> reality, but that's my opening bid. ;)
It feels feasible to me too, for the normally behaving clients at least.
Presumably, given that the whole checkpointing is not instant, it would
be okay to wait a second or two longer for the in-progress submissions
to complete. After which the kernel would need to prune all signalled
fences from the respective container objects before checkpointing.
For the "misbehaving" clients who have perhaps queued up too much work,
either still in the scheduler with unsatisfied dependencies, or already
submitted to the hardware and/or driver backend, is there a timeout
concept in CRIU, so it would be possible to say something like "try to
checkpoint, but if the kernel says no within time period t then give up"?
>>>> For some more background about our implementation in KFD, you can
>>>> refer to this whitepaper:
>>>> https://github.com/checkpoint-restore/criu/blob/criu-dev/plugins/amdgpu/README.md
>>>>
>>>> Potential objections to a KFD-style CRIU API in DRM render nodes,
>>>> I'll address each of them in more detail below:
>>>>
>>>> * Opaque information in the checkpoint data that user mode can't
>>>> interpret or do anything with
>>>> * A second API for creating objects (e.g. BOs) that is separate from
>>>> the regular BO creation API
>>>> * Kernel mode would need to be involved in restoring BO sharing
>>>> relationships rather than replaying BO creation, export and import
>>>> from user mode
>>>>
>>>> # Opaque information in the checkpoint
>>>>
>>>> This comes out of ABI compatibility considerations. Adding any new
>>>> objects or attributes to the driver/HW state that needs to be
>>>> checkpointed could potentially break the ABI of the CRIU
>>>> checkpoint/restore ioctl if the plugin needs to parse that
>>>> information. Therefore, much of the information in our KFD CRIU
>>>> ioctl API is opaque. It is written by kernel mode in the checkpoint,
>>>> it is consumed by kernel mode when restoring the checkpoint, but
>>>> user mode doesn't care about the contents or binary layout, so there
>>>> is no user mode ABI to break. This is how we were able to maintain
>>>> CRIU support when we added the SVM API to KFD without changing the
>>>> CRIU plugin and without breaking our ABI.
>>>>
>>>> Opaque information may also lend itself to API abstraction, if this
>>>> becomes a generic DRM API with driver-specific callbacks that fill
>>>> in HW-specific opaque data.
>>>
>>> This feels sound in principle to me. Fundamentally the state is very
>>> hardware specific, and/or driver version specific, so I don't see
>>> what could be gained in practice by making it much less opaque.
>>> (Apart from making things more complicated.)
>>>
>>> I was however unsure of the current split of how you dump buffer
>>> objects with some data in the defined bo structure, and some in
>>> completely opaque private data. Is there a benefit to that split, or
>>> maybe in other words, is there a benefit in having part transparent
>>> and part opaque for buffer objects?
>
> Some of the buffer object state is needed by the plugin. E.g. the size
> and mmap offset are needed to match VMAs with BOs. I'd have to review
> the plugin in detail to prove that all the fields are, in fact, needed
> by the plugin, but that was the intent. Anything that the plugin doesn't
> need to know should be in the opaque data structures.
Right, got it.
Would it make sense to put the opaque data in the same block as the
defined one? I mean, instead of separating the two in the binary image,
for instance have struct kfd_criu_bo_bucket carry a trailing priv_data
blob? Maybe it is too late now if the image format is not versioned or
something.
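
As a minimal sketch of what such a size-prefixed trailing blob could
look like (field names invented for illustration; the real struct
kfd_criu_bo_bucket is laid out differently, with the opaque data
separated in the image):

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical bucket layout: the plugin-visible fields it needs to
 * match VMAs with BOs, followed by a size-prefixed opaque blob that is
 * written and consumed only by the kernel. */
struct criu_bo_bucket {
	uint64_t size;        /* BO size, needed by the plugin */
	uint64_t offset;      /* mmap offset, to match VMAs with BOs */
	uint32_t alloc_flags;
	uint32_t priv_size;   /* length of the opaque driver data below */
	uint8_t  priv_data[]; /* opaque, kernel-written, kernel-consumed */
};

/* Records in the image are walked by advancing past the variable-size
 * opaque tail, so the plugin never has to parse it. */
static inline size_t bucket_record_size(const struct criu_bo_bucket *b)
{
	return sizeof(*b) + b->priv_size;
}
```

The upside would be that one record is self-contained; the downside, as
noted, is that it changes the existing image format.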
>>> To slightly touch upon the question of whether this could become a
>>> generic DRM API. It feels it would be hard to do it from the start.
>>> What sounds more feasible is if/when generic looking helpers can be
>>> spotted while developing the RFC then potentially structure the code
>>> they can easily be promoted to shared/common at some future moment.
>
> Yes, that's how this usually goes, in my experience. Thanks for confirming.
>
>
>>>
>>>> # Second API for creating objects
>>>>
>>>> Creating BOs and other objects when restoring a checkpoint needs
>>>> more information than the usual BO alloc and similar APIs provide.
>>>> For example, we need to restore BOs with the same GEM handles so
>>>> that user mode can continue using those handles after resuming
>>>> execution. If BOs are shared through DMABufs without dynamic
>>>> attachment, we need to restore pinned BOs as pinned. Validation of
>>>> virtual addresses and handling MMU notifiers must be suspended until
>>>> the virtual address space is restored. For user mode queues we need
>>>> to save and restore a lot of queue execution state so that execution
>>>> can resume cleanly.
>>>
>>> This also sounds justified to me. Restoring all internal objects is
>>> definitely special, and it sounds better to add uapi to create them
>>> directly with the correct properties than to add uapi to adjust
>>> internal properties after creation. And in either case you would need
>>> some new uapi - at least to adjust after creation - at which point
>>> you may have both in one. Internally the implementation can be split
>>> or common, whatever makes sense for a given object type, but new
>>> uapi definitely sounds required.
>>>> # Restoring buffer sharing relationships
>>>>
>>>> Different GEM handles in different render nodes and processes can
>>>> refer to the same underlying shared memory, either by directly
>>>> pointing to the same GEM object, or by creating an import attachment
>>>> that may get its SG tables invalidated and updated dynamically
>>>> through dynamic attachment callbacks. In the latter case it's
>>>> obvious who is the exporter and who is the importer. In the first
>>>> case, either one could be the exporter, and it's not clear who would
>>>> need to create the BO and who would need to
>>>
>>> To see if I follow the former case correctly.
>>>
>>> This could be two clients A and B, where B has imported a dma-buf
>>> shared BO from A and has since closed the dma-buf fd? Which results
>>> in a single BO with reference count of 2 and obj->import_attach
>>> unset. History of who created the object is lost.
>
> Yes. In the amdgpu driver this happens when the exporting and importing
> device are the same.
>
>
>>>
>>> In fact it could even be that two imported objects remain (clients A,
>>> B and C) and A, who originally created the BO, has since fully
>>> released it. So any kind of "creator" tracking if added wouldn't be
>>> fully reliable either.
>
> That's a good point.
>
>
>>>
>>>> import it when restoring the checkpoint. To further complicate
>>>> things, multiple processes in a checkpoint get restored
>>>> concurrently. So there is no guarantee that an exporter has restored
>>>> a shared BO at the time an importer is trying to restore its import.
>>>>
>>>> A proposal to deal with these problems would be to treat importers
>>>> and exporters the same. Whoever restores first, ends up creating the
>>>> BO and potentially attaching to it. The other process(es) can find
>>>> BOs that were already restored by another process by looking it up
>>>> with a unique ID that could be based on the DMABuf inode number. An
>>>> alternative would be a two-pass approach that restores BOs in two
>>>> passes:
>>>>
>>>> 1. Restore exported BOs
>>>> 2. Restore imports
>>>>
>>>> With some inter-process synchronization in CRIU itself between these
>>>> two passes. This may require changes in the core CRIU, outside our
>>>> plugin. Both approaches depend on identifying BOs with some unique
>>>> ID that could be based on the DMABuf inode number in the checkpoint.
>>>> However, we would need to identify the processes in the same restore
>>>> session, possibly based on parent/child process relationships, to
>>>> create a scope where those IDs are valid during restore.
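
As a toy model of the first alternative, where importers and exporters
are treated the same and shared BOs are keyed by a restore-session-unique
ID such as the DMABuf inode number. All names are invented for
illustration; a real implementation would live in the kernel with proper
locking and a real lookup structure:

```c
#include <stdint.h>
#include <stdbool.h>

#define MAX_SHARED 64

/* Toy per-restore-session table mapping a unique ID (e.g. the DMABuf
 * inode number from the checkpoint) to the restored object. */
struct shared_bo { uint64_t ino; uint32_t handle; };

static struct shared_bo table[MAX_SHARED];
static int nshared;
static uint32_t next_handle = 1;

/* "Whoever restores first creates the BO": returns the handle for this
 * inode, creating the object on first use and finding it afterwards. */
static uint32_t restore_shared_bo(uint64_t ino, bool *created)
{
	for (int i = 0; i < nshared; i++) {
		if (table[i].ino == ino) {
			*created = false;
			return table[i].handle;
		}
	}
	table[nshared].ino = ino;
	table[nshared].handle = next_handle++;
	nshared++;
	*created = true;
	return table[nshared - 1].handle;
}
```

The point of the model is that no process needs to know whether it was
originally the importer or the exporter; ordering between concurrently
restoring processes stops mattering.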
>>>
>>> If my understanding above is on the right track, then I think this is
>>> the only thing which can be done (for all scenarios).
>
> I presented two alternatives. I think you're in favor of the first one,
> where it doesn't matter who is the importer and exporter. I think the
> two-pass approach requires that you can identify an exporter. And as you
> pointed out, the exporter may already have dropped their reference to
> the BO.
Yep.
>>> I also *think* it would be safe. At least at the moment I cannot
>>> think what could go wrong. Semantics are that it doesn't really
>>> matter who created the object.
>
> I would agree. What matters is that the object is recreated on the
> correct device, and that all the direct references and import
> attachments pointing to it are restored.
>
>
>>>
>>>> Finally, we would also need to checkpoint and restore DMABuf file
>>>> descriptors themselves. These are anonymous file descriptors. The
>>>> CRIU plugin could probably be taught to recreate them from the
>>>> original exported BO based on the inode number that could be queried
>>>> with fstat in the checkpoint. It would need help from the render
>>>> node CRIU API to find the right BO from the inode, which may be from
>>>> a different process in the same restore session.
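
The inode query itself is straightforward: fstat(2) works on DMABuf fds
like on any other fd, and since they are anonymous inodes, st_ino is the
identity the checkpoint could record. A small sketch (the test below
uses an ordinary fd as a stand-in, since creating a real DMABuf needs a
device):

```c
#include <fcntl.h>
#include <stdint.h>
#include <sys/stat.h>

/* Return the inode number identifying an fd, or -1 on error. For a
 * DMABuf fd this is the anonymous inode number that a kernel-side
 * lookup during restore could match against. */
static int64_t fd_inode(int fd)
{
	struct stat st;

	if (fstat(fd, &st) < 0)
		return -1;
	return (int64_t)st.st_ino;
}
```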
>>>
>>> This part feels like it is breaking the component separation a bit
>>> because even for buffers fully owned by amdgpu, strictly speaking the
>>> dma-buf fd is not. At least my understanding from the above is that
>>> you propose to attempt to import the fd, from the kernel side, during
>>> the checkpoint process? Like:
>>>
>>> Checkpoint:
>>>
>>> CRIU for each anon fd:
>>> amdgpu_plugin(fd)
>>> -> attempt in kernel dma buf import (passes fd to kernel via
>>> ioctl?)
>>> -> is it ours? (no -> error)
>>> -> create a record mapping fd number to amdgpu BO
>>>
>>> Restore:
>>>
>>> for each dma-buf fd record:
>>> create BO if it does not exist
>>> export BO to same fd
>>> close BO handle if not in regular BO handle records
>>>
>>> Or since you mention lookup by inode, that would need to be
>>> dmabuf_plugin so it can look up inodes in the private mount space.
>>> However how would it co-operate with amdgpu_plugin is not clear to me.
>
> The way I think about the ownership is, whichever driver created the
> underlying BO owns the checkpointing of the dmabuf. You need
> driver-specific information to link the dmabuf with the driver's BO and
> you need the right driver to recreate the BO and the dmabuf fd when
> restoring the checkpoint.
>
> It gets really interesting if you have an amdgpu plugin and an i915
> plugin, and they checkpoint an application that shares BOs between the
> two devices through DMABufs. E.g. if i915 created a BO and amdgpu
> imported it, then during restore, i915 needs to restore the dmabuf
> before the amdgpu import of it can be restored. I think that brings us
> back to a two-phase approach to restoring the memory sharing
> relationships. Uff.
I think this part of the discussion somewhat depends on the previous
part about idling. If it is feasible to completely idle and prune, and
fail if that is not happening quickly enough, then maybe there wouldn't
be too much hierarchical state to save.
Otherwise my idea was that there is a top-level drm_plugin.so which
understands amdgpu fds, i915, syncobj, sync_file, and uses some new uapi
to uniquely identify each, associate it with the correct driver, and then
internally dispatches to amdgpu|i915|dmabuf|..._plugin.so, building the
in-memory representation of their relationships. As long as all objects
and their relationships have been recorded, I think everything could then
be correctly restored.
It is possible there is a flaw in my thinking and that something in the
CRIU design would make this impossible? I think it would require the
top-level drm_plugin.so to hold all state in memory until the whole
checkpointing is done, and then verify nothing is incomplete, failing it
all if it is. (For instance, one plugin discovered a reference to an
object which was not discovered by any other plugin, or things like
that.) This may need some further tweaks to CRIU common code.
Maybe I need to better understand how exactly you mean to query the DRM
driver about random anonymous fds. I see it as a problem in the design,
possibly even implementation, but maybe I am missing something which
makes it not so. I mean, even with my general idea, I don't know how one
would determine which driver to query about a particular anonymous inode.
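
One possible starting point for that question: dma-buf fds publish their
exporter through the "exp_name:" line in /proc/<pid>/fdinfo/<fd>, which
a top-level plugin could parse to dispatch an anonymous fd to the right
driver sub-plugin. This is a sketch of only the parsing step on an
fdinfo-style buffer; reading the actual procfs file, and whether
exp_name alone is a sufficient dispatch key, are left open:

```c
#include <string.h>

/* Extract the exporter name from a dma-buf fdinfo text buffer (the
 * "exp_name:" line). Returns 0 on success, -1 if the line is absent. */
static int fdinfo_exporter(const char *fdinfo, char *out, size_t outlen)
{
	const char *p = strstr(fdinfo, "exp_name:");
	size_t n;

	if (!p)
		return -1;
	p += strlen("exp_name:");
	while (*p == ' ' || *p == '\t')
		p++;
	n = strcspn(p, "\n");
	if (n >= outlen)
		n = outlen - 1;    /* truncate to fit the caller's buffer */
	memcpy(out, p, n);
	out[n] = '\0';
	return 0;
}
```

Non-dma-buf anonymous fds (syncobj, sync_file, ...) would still need
their own identification scheme, so this only answers part of it.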
>> I later also realised that I was maybe increasing the scope for you
>> here. :) You did state focus is ROCm applications which possibly
>> doesn't care about dma-resv, fences, syncobjs etc?
>
> That's my focus for now. But I don't want to engineer a solution that
> would preclude your use cases in the future.
>
>
>>
>> But I think the "how to handle dma-bufs" design question is still
>> relevant and interesting. For example I had this thought that perhaps
>> what would be needed is a CRIU plugin hierarchy.
>>
>> Because fundamentally we would be snapshotting a hierarchy of kernel
>> objects belonging to different drivers (kfd, amdgpu, dma-buf, ...).
>> And if one day someone were to try to handle dma fences and drm
>> syncobjs, the argument for a hierarchical design would be even stronger
>> I think.
>>
>> Something like a drm_plugin.so could call sub-plugins (amdgpu,
>> dma-buf, sync file, ...) and internally build a representation of
>> the whole state and the relationships between the objects.
>
> Maybe. I guess a structure similar to libdrm makes sense. I'm not sure
> it's strictly a hierarchy. Maybe more like some common code shared by
> multiple GPU driver plugins. I think the common checkpoint state is
> quite limited and restoring it requires the GPU-specific drivers anyway.
>
> Also the idea of building a representation of the whole state doesn't
> work well with the CRIU design, because "the whole state" can include
> multiple processes that restore themselves concurrently and only
> synchronize with each other in a few places in the restore process. I
> feel, if we can work out how to checkpoint and restore shared objects
> between processes, we can do the same for shared objects between drivers
> without imposing a strict hierarchy and some omniscient entity that
> needs to know "the whole state".
Okay, this continues on the same problem space as above. And you
obviously know how CRIU works much better than me.
>> With that kind of design there probably would be a need to define some
>> common kernel side api and uapi, so all involved objects can be
>> enumerated with some unique ids etc.
>>
>> Now.. the counter argument.. the more state from different drivers
>> one would want to handle, the bigger this project would get. Whether it
>> would even be feasible is the question, to the point that it may be
>> simpler to just run the workload in a VM via SR-IOV and simply
>> hibernate the whole guest. :)
>
> Well, CRIU kind of tries to do that, but with containers instead of VMs. ;)
It would definitely be useful for hardware and drivers without SR-IOV
support, so let's hope it is doable. :)
Regards,
Tvrtko
* Re: Proposal to add CRIU support to DRM render nodes
2024-03-15 18:36 ` Tvrtko Ursulin
@ 2024-03-28 16:03 ` Tvrtko Ursulin
2024-03-28 20:42 ` Felix Kuehling
0 siblings, 1 reply; 14+ messages in thread
From: Tvrtko Ursulin @ 2024-03-28 16:03 UTC (permalink / raw)
To: Felix Kuehling, dri-devel, amd-gfx, Dave Airlie, Daniel Vetter, criu
Cc: Errabolu, Ramesh, Christian König
Hi Felix,
I had one more thought while browsing around the amdgpu CRIU plugin. It
appears it relies on the KFD support being compiled in and /dev/kfd
present, correct? AFAICT at least, it relies on that to figure out the
amdgpu DRM node.
It would probably be good to consider designing things without that
dependency, so that checkpointing an application which does not use
/dev/kfd is possible. Or if the kernel does not even have the KFD
support compiled in.
It could perhaps mean no more than adding some GPU discovery code into
CRIU, which should be flexible enough to account for things like
re-assigned minor numbers due to driver reload.
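
As a sketch of what that discovery could look like, assuming nothing
more than scanning a directory for render nodes rather than deriving
them from /dev/kfd. The directory is parameterized so the logic does not
hard-code /dev/dri; a real version would presumably also match devices
by PCI bus id via sysfs:

```c
#include <dirent.h>
#include <string.h>

/* Count the renderD* entries in a directory, e.g. /dev/dri. Because it
 * enumerates whatever nodes currently exist, it copes with minor
 * numbers re-assigned by a driver reload. Returns -1 if the directory
 * cannot be opened. */
static int count_render_nodes(const char *dir)
{
	DIR *d = opendir(dir);
	struct dirent *de;
	int n = 0;

	if (!d)
		return -1;
	while ((de = readdir(d)) != NULL) {
		if (strncmp(de->d_name, "renderD", 7) == 0)
			n++;
	}
	closedir(d);
	return n;
}
```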
Otherwise I am eagerly awaiting hearing more about the design specifics
around dma-buf handling, and also seeing how to extend this to other
DRM-related anonymous fds.
Regards,
Tvrtko
On 15/03/2024 18:36, Tvrtko Ursulin wrote:
>
> On 15/03/2024 02:33, Felix Kuehling wrote:
>>
>> On 2024-03-12 5:45, Tvrtko Ursulin wrote:
>>>
>>> On 11/03/2024 14:48, Tvrtko Ursulin wrote:
>>>>
>>>> Hi Felix,
>>>>
>>>> On 06/12/2023 21:23, Felix Kuehling wrote:
>>>>> Executive Summary: We need to add CRIU support to DRM render nodes
>>>>> in order to maintain CRIU support for ROCm application once they
>>>>> start relying on render nodes for more GPU memory management. In
>>>>> this email I'm providing some background why we are doing this, and
>>>>> outlining some of the problems we need to solve to checkpoint and
>>>>> restore render node state and shared memory (DMABuf) state. I have
>>>>> some thoughts on the API design, leaning on what we did for KFD,
>>>>> but would like to get feedback from the DRI community regarding
>>>>> that API and to what extent there is interest in making that generic.
>>>>>
>>>>> We are working on using DRM render nodes for virtual address
>>>>> mappings in ROCm applications to implement the CUDA11-style VM API
>>>>> and improve interoperability between graphics and compute. This
>>>>> uses DMABufs for sharing buffer objects between KFD and multiple
>>>>> render node devices, as well as between processes. In the long run
>>>>> this also provides a path to moving all or most memory management
>>>>> from the KFD ioctl API to libdrm.
>>>>>
>>>>> Once ROCm user mode starts using render nodes for virtual address
>>>>> management, that creates a problem for checkpointing and restoring
>>>>> ROCm applications with CRIU. Currently there is no support for
>>>>> checkpointing and restoring render node state, other than CPU
>>>>> virtual address mappings. Support will be needed for checkpointing
>>>>> GEM buffer objects and handles, their GPU virtual address mappings
>>>>> and memory sharing relationships between devices and processes.
>>>>>
>>>>> Eventually, if full CRIU support for graphics applications is
>>>>> desired, more state would need to be captured, including scheduler
>>>>> contexts and BO lists. Most of this state is driver-specific.
>>>>>
>>>>> After some internal discussions we decided to take our design
>>>>> process public as this potentially touches DRM GEM and DMABuf APIs
>>>>> and may have implications for other drivers in the future.
>>>>>
>>>>> One basic question before going into any API details: Is there a
>>>>> desire to have CRIU support for other DRM drivers?
>>>>
>>>> This sounds like a very interesting feature on the overall, although
>>>> I cannot answer on the last question here.
>>>
>>> I forgot to finish this thought. I cannot answer / don't know of any
>>> concrete plans, but I think feature is pretty cool and if amdgpu gets
>>> it working I wouldn't be surprised if other drivers would get
>>> interested.
>>
>> Thanks, that's good to hear!
>>
>>
>>>
>>>> Funnily enough, it has a tiny relation to an i915 feature I recently
>>>> implemented on Mesa's request, which is to be able to "upload" the
>>>> GPU context from the GPU hang error state and replay the hanging
>>>> request. It is kind of (at a stretch) a very special tiny subset of
>>>> checkpoint and restore, so I am mentioning it only as a curiosity.
>>>>
>>>> And there is also another partial conceptual intersection with the
>>>> (at the moment not yet upstream) i915 online debugger. This part being
>>>> in the area of discovering and enumerating GPU resources belonging to
>>>> the client.
>>>>
>>>> I don't see immediate design or code sharing opportunities though,
>>>> just mentioning it.
>>>>
>>>> I did spend some time reading your plugin and kernel implementation
>>>> out of curiosity and have some comments and questions.
>>>>
>>>>> With that out of the way, some considerations for a possible DRM
>>>>> CRIU API (either generic or AMDGPU driver specific): The API goes
>>>>> through several phases during checkpoint and restore:
>>>>>
>>>>> Checkpoint:
>>>>>
>>>>> 1. Process-info (enumerates objects and sizes so user mode can
>>>>> allocate
>>>>> memory for the checkpoint, stops execution on the GPU)
>>>>> 2. Checkpoint (store object metadata for BOs, queues, etc.)
>>>>> 3. Unpause (resumes execution after the checkpoint is complete)
>>>>>
>>>>> Restore:
>>>>>
>>>>> 1. Restore (restore objects, VMAs are not in the right place at
>>>>> this time)
>>>>> 2. Resume (final fixups after the VMAs are sorted out, resume
>>>>> execution)
>>>>
>>>> Btw is check-pointing guaranteeing all relevant activity is idled?
>>>> For instance dma_resv objects are free of fences which would need to
>>>> be restored for things to continue executing sensibly? Or how is that
>>>> handled?
>>
>> In our compute use cases, we suspend user mode queues. This can
>> include CWSR (compute-wave-save-restore) where the state of in-flight
>> waves is stored in memory and can be reloaded and resumed from memory
>> later. We don't use any fences other than "eviction fences", that are
>> signaled after the queues are suspended. And those fences are never
>> handed to user mode. So we don't need to worry about any fence state
>> in the checkpoint.
>>
>> If we extended this to support the kernel mode command submission
>> APIs, I would expect that we'd wait for all current submissions to
>> complete, and stop new ones from being sent to the HW before taking
>> the checkpoint. When we take the checkpoint in the CRIU plugin, the
>> CPU threads are already frozen and cannot submit any more work. If we
>> wait for all currently pending submissions to drain, I think we don't
>> need to save any fence state because all the fences will have
>> signaled. (I may be missing some intricacies and I'm afraid it may not
>> be that simple in reality, but that's my opening bid. ;)
>
> It feels feasible to me too, for the normally behaving clients at least.
>
> Presumably, given that the whole checkpointing is not instant, it would
> be okay to wait a second or two longer for the in-progress submissions
> to complete. After which the kernel would need to prune all signalled
> fences from the respective container objects before checkpointing.
>
> For the "misbehaving" clients who have perhaps queued up too much work,
> either still in the scheduler with unsatisfied dependencies, or already
> submitted to the hardware and/or driver backend, is there a timeout
> concept in CRIU, so it would be possible to say something like "try to
> checkpoint, but if the kernel says no within time period t then give up"?
>
>>>>> For some more background about our implementation in KFD, you can
>>>>> refer to this whitepaper:
>>>>> https://github.com/checkpoint-restore/criu/blob/criu-dev/plugins/amdgpu/README.md
>>>>>
>>>>> Potential objections to a KFD-style CRIU API in DRM render nodes,
>>>>> I'll address each of them in more detail below:
>>>>>
>>>>> * Opaque information in the checkpoint data that user mode can't
>>>>> interpret or do anything with
>>>>> * A second API for creating objects (e.g. BOs) that is separate from
>>>>> the regular BO creation API
>>>>> * Kernel mode would need to be involved in restoring BO sharing
>>>>> relationships rather than replaying BO creation, export and import
>>>>> from user mode
>>>>>
>>>>> # Opaque information in the checkpoint
>>>>>
>>>>> This comes out of ABI compatibility considerations. Adding any new
>>>>> objects or attributes to the driver/HW state that needs to be
>>>>> checkpointed could potentially break the ABI of the CRIU
>>>>> checkpoint/restore ioctl if the plugin needs to parse that
>>>>> information. Therefore, much of the information in our KFD CRIU
>>>>> ioctl API is opaque. It is written by kernel mode in the
>>>>> checkpoint, it is consumed by kernel mode when restoring the
>>>>> checkpoint, but user mode doesn't care about the contents or binary
>>>>> layout, so there is no user mode ABI to break. This is how we were
>>>>> able to maintain CRIU support when we added the SVM API to KFD
>>>>> without changing the CRIU plugin and without breaking our ABI.
>>>>>
>>>>> Opaque information may also lend itself to API abstraction, if this
>>>>> becomes a generic DRM API with driver-specific callbacks that fill
>>>>> in HW-specific opaque data.
>>>>
>>>> This feels sound in principle to me. Fundamentally the state is very
>>>> hardware specific, and/or driver version specific, so I don't see
>>>> what could be gained in practice by making it much less opaque.
>>>> (Apart from making things more complicated.)
>>>>
>>>> I was however unsure of the current split of how you dump buffer
>>>> objects with some data in the defined bo structure, and some in
>>>> completely opaque private data. Is there a benefit to that split, or
>>>> maybe in other words, is there a benefit in having part transparent
>>>> and part opaque for buffer objects?
>>
>> Some of the buffer object state is needed by the plugin. E.g. the size
>> and mmap offset are needed to match VMAs with BOs. I'd have to review
>> the plugin in detail to prove that all the fields are, in fact, needed
>> by the plugin, but that was the intent. Anything that the plugin
>> doesn't need to know should be in the opaque data structures.
>
> Right, got it.
>
> Would it make sense to put the opaque data in the same block as the
> defined one? I mean, instead of separating the two in the binary image,
> for instance have struct kfd_criu_bo_bucket carry a trailing priv_data
> blob? Maybe it is too late now if the image format is not versioned or
> something.
>
>>>> To slightly touch upon the question of whether this could become a
>>>> generic DRM API. It feels it would be hard to do it from the start.
>>>> What sounds more feasible is if/when generic looking helpers can be
>>>> spotted while developing the RFC then potentially structure the code
>>>> they can easily be promoted to shared/common at some future moment.
>>
>> Yes, that's how this usually goes, in my experience. Thanks for
>> confirming.
>>
>>
>>>>
>>>>> # Second API for creating objects
>>>>>
>>>>> Creating BOs and other objects when restoring a checkpoint needs
>>>>> more information than the usual BO alloc and similar APIs provide.
>>>>> For example, we need to restore BOs with the same GEM handles so
>>>>> that user mode can continue using those handles after resuming
>>>>> execution. If BOs are shared through DMABufs without dynamic
>>>>> attachment, we need to restore pinned BOs as pinned. Validation of
>>>>> virtual addresses and handling MMU notifiers must be suspended
>>>>> until the virtual address space is restored. For user mode queues
>>>>> we need to save and restore a lot of queue execution state so that
>>>>> execution can resume cleanly.
>>>>
>>>> This also sounds justified to me. Restoring all internal objects is
>>>> definitely special, and it sounds better to add uapi to create them
>>>> directly with the correct properties than to add uapi to adjust
>>>> internal properties after creation. And in either case you would need
>>>> some new uapi - at least to adjust after creation - at which point
>>>> you may have both in one. Internally the implementation can be split
>>>> or common, whatever makes sense for a given object type, but new
>>>> uapi definitely sounds required.
>>>>> # Restoring buffer sharing relationships
>>>>>
>>>>> Different GEM handles in different render nodes and processes can
>>>>> refer to the same underlying shared memory, either by directly
>>>>> pointing to the same GEM object, or by creating an import
>>>>> attachment that may get its SG tables invalidated and updated
>>>>> dynamically through dynamic attachment callbacks. In the latter
>>>>> case it's obvious who is the exporter and who is the importer. In
>>>>> the first case, either one could be the exporter, and it's not
>>>>> clear who would need to create the BO and who would need to
>>>>
>>>> To see if I follow the former case correctly.
>>>>
>>>> This could be two clients A and B, where B has imported a dma-buf
>>>> shared BO from A and has since closed the dma-buf fd? Which results
>>>> in a single BO with reference count of 2 and obj->import_attach
>>>> unset. History of who created the object is lost.
>>
>> Yes. In the amdgpu driver this happens when the exporting and importing
>> device are the same.
>>
>>
>>>>
>>>> In fact it could even be that two imported objects remain (clients
>>>> A, B and C) and A, who originally created the BO, has since fully
>>>> released it. So any kind of "creator" tracking if added wouldn't be
>>>> fully reliable either.
>>
>> That's a good point.
>>
>>
>>>>
>>>>> import it when restoring the checkpoint. To further complicate
>>>>> things, multiple processes in a checkpoint get restored
>>>>> concurrently. So there is no guarantee that an exporter has
>>>>> restored a shared BO at the time an importer is trying to restore
>>>>> its import.
>>>>>
>>>>> A proposal to deal with these problems would be to treat importers
>>>>> and exporters the same. Whoever restores first, ends up creating
>>>>> the BO and potentially attaching to it. The other process(es) can
>>>>> find BOs that were already restored by another process by looking
>>>>> it up with a unique ID that could be based on the DMABuf inode
>>>>> number. An alternative would be a two-pass approach that restores
>>>>> BOs in two passes:
>>>>>
>>>>> 1. Restore exported BOs
>>>>> 2. Restore imports
>>>>>
>>>>> With some inter-process synchronization in CRIU itself between
>>>>> these two passes. This may require changes in the core CRIU,
>>>>> outside our plugin. Both approaches depend on identifying BOs with
>>>>> some unique ID that could be based on the DMABuf inode number in
>>>>> the checkpoint. However, we would need to identify the processes in
>>>>> the same restore session, possibly based on parent/child process
>>>>> relationships, to create a scope where those IDs are valid during
>>>>> restore.
>>>>
>>>> If my understanding above is on the right track, then I think this
>>>> is the only thing which can be done (for all scenarios).
>>
>> I presented two alternatives. I think you're in favor of the first
>> one, where it doesn't matter who is the importer and exporter. I think
>> the two-pass approach requires that you can identify an exporter. And
>> as you pointed out, the exporter may already have dropped their
>> reference to the BO.
>
> Yep.
>
>>>> I also *think* it would be safe. At least at the moment I cannot
>>>> think what could go wrong. Semantics are that it doesn't really
>>>> matter who created the object.
>>
>> I would agree. What matters is that the object is recreated on the
>> correct device, and that all the direct references and import
>> attachments pointing to it are restored.
>>
>>
>>>>
>>>>> Finally, we would also need to checkpoint and restore DMABuf file
>>>>> descriptors themselves. These are anonymous file descriptors. The
>>>>> CRIU plugin could probably be taught to recreate them from the
>>>>> original exported BO based on the inode number that could be
>>>>> queried with fstat in the checkpoint. It would need help from the
>>>>> render node CRIU API to find the right BO from the inode, which may
>>>>> be from a different process in the same restore session.
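Reading the inode number of an anonymous fd at checkpoint time is straightforward with fstat; the same value would then key the BO lookup at restore. A minimal sketch, demonstrated on a regular file since creating a real DMABuf requires a device:

```python
import os
import tempfile

def checkpoint_fd_inode(fd):
    """Return the inode number identifying this fd's underlying file.
    For a DMABuf fd this is the anonymous inode number that the render
    node CRIU API would have to map back to the exporting BO."""
    return os.fstat(fd).st_ino

# Demonstrated on a regular file; a real plugin would call this on the
# DMABuf fds it finds in /proc/<pid>/fd during the checkpoint.
with tempfile.NamedTemporaryFile() as f:
    ino = checkpoint_fd_inode(f.fileno())
    assert ino == os.stat(f.name).st_ino
```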
>>>>
>>>> This part feels like it is breaking the component separation a bit
>>>> because even for buffers fully owned by amdgpu, strictly speaking
>>>> the dma-buf fd is not. At least my understanding from the above is
>>>> that you propose to attempt to import the fd, from the kernel side,
>>>> during the checkpoint process? Like:
>>>>
>>>> Checkpoint:
>>>>
>>>> CRIU for each anon fd:
>>>> amdgpu_plugin(fd)
>>>> -> attempt in kernel dma buf import (passes fd to kernel via
>>>> ioctl?)
>>>> -> is it ours? (no -> error)
>>>> -> create a record mapping fd number to amdgpu BO
>>>>
>>>> Restore:
>>>>
>>>> for each dma-buf fd record:
>>>> create BO if it does not exist
>>>> export BO to same fd
>>>> close BO handle if not in regular BO handle records
>>>>
>>>> Or since you mention lookup by inode, that would need to be
>>>> dmabuf_plugin so it can look up inodes in the private mount space.
>>>> However, how it would co-operate with amdgpu_plugin is not clear to me.
>>
>> The way I think about the ownership is, whichever driver created the
>> underlying BO owns the checkpointing of the dmabuf. You need
>> driver-specific information to link the dmabuf with the driver's BO
>> and you need the right driver to recreate the BO and the dmabuf fd
>> when restoring the checkpoint.
>>
>> It gets really interesting if you have an amdgpu plugin and an i915
>> plugin, and they checkpoint an application that shares BOs between the
>> two devices through DMABufs. E.g. if i915 created a BO and amdgpu
>> imported it, then during restore, i915 needs to restore the dmabuf
>> before the amdgpu import of it can be restored. I think that brings us
>> back to a two-phase approach to restoring the memory sharing
>> relationships. Uff.
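The two-phase idea for cross-driver sharing amounts to ordering restores so that every export exists before any import of it is replayed. A hypothetical sketch of that ordering (the record fields and driver names are illustrative, not an existing format):

```python
def restore_in_two_passes(records):
    """records: list of dicts; an exporter record has 'inode' and
    'driver', an importer record has 'driver' and the 'imports' inode
    it references. Pass 1 restores exported BOs, pass 2 replays
    imports, which by then can always find their exporter."""
    restored = {}  # inode -> driver that restored the backing BO
    # Pass 1: restore every record that owns its backing BO.
    for r in records:
        if "imports" not in r:
            restored[r["inode"]] = r["driver"]
    # Pass 2: replay imports; the export is guaranteed to exist now.
    order = []
    for r in records:
        if "imports" in r:
            assert r["imports"] in restored, "import restored before export"
            order.append((r["driver"], "import", restored[r["imports"]]))
    return order

# i915 exported a BO, amdgpu imported it: restore order is well defined.
recs = [{"driver": "amdgpu", "imports": 5},
        {"inode": 5, "driver": "i915"}]
assert restore_in_two_passes(recs) == [("amdgpu", "import", "i915")]
```

The inter-process (and here inter-driver) synchronization between the two passes is the part that would need support in core CRIU rather than in any single plugin.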
>
> I think this part of the discussion somewhat depends on the previous
> part about idling. If it is feasible to completely idle and prune, and
> fail if that is not happening quickly enough, then maybe there wouldn't
> be too much hierarchical state to save.
>
> Otherwise my idea was that there is a top-level drm_plugin.so which
> understands amdgpu fds, i915, syncobj, sync_file, and uses some new uapi
> to uniquely identify each, associate with the correct driver, and then
> internally dispatches to amdgpu|i915|dmabuf|..._plugin.so. Building the
> in memory representation of their relationships. As long as all objects
> and their relationships have been recorded I think everything could then
> be correctly restored.
>
> It is possible there is a flaw in my thinking and that something in CRIU
> design would make this impossible? I think it would require the
> top-level drm_plugin.so to hold all state in memory until the whole
> checkpointing is done, and then verify something is not incomplete,
> failing it all if it was. (For instance one plugin discovered a
> reference to an object which was not discovered by any other plugin or
> things like that.) May need some further tweaks to CRIU common code.
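The completeness check described here, failing the checkpoint if any plugin recorded a reference to an object that no plugin ever dumped, is easy to state precisely. A sketch under the assumption that each sub-plugin reports (object id, referenced ids) pairs to a top-level plugin:

```python
def verify_checkpoint(records):
    """records: mapping object_id -> set of object_ids it references.
    Returns the set of dangling references; a non-empty result means
    the checkpoint is incomplete and should be failed as a whole."""
    known = set(records)
    dangling = set()
    for obj, refs in records.items():
        dangling |= refs - known
    return dangling

ok = {"bo:1": set(), "syncobj:2": {"bo:1"}}
bad = {"syncobj:2": {"bo:1"}}  # bo:1 was never dumped by any plugin
assert verify_checkpoint(ok) == set()
assert verify_checkpoint(bad) == {"bo:1"}
```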
>
> Maybe I need to better understand how exactly you mean to query the DRM
> driver about random anonymous fds. I see it as a problem in the design,
> possibly even implementation, but maybe I am missing something which
> makes it not so. I mean even with my general idea I don't know how
> one would determine which driver to query about a particular anonymous inode.
>
>>> I later also realised that I was maybe increasing the scope for you
>>> here. :) You did state the focus is ROCm applications, which possibly
>>> don't care about dma-resv, fences, syncobjs etc?
>>
>> That's my focus for now. But I don't want to engineer a solution that
>> would preclude your use cases in the future.
>>
>>
>>>
>>> But I think the "how to handle dma-bufs" design question is still
>>> relevant and interesting. For example I had this thought that perhaps
>>> what would be needed is a CRIU plugin hierarchy.
>>>
>>> Because fundamentally we would be snapshotting a hierarchy of kernel
>>> objects belonging to different drivers (kfd, amdgpu, dma-buf, ...).
>>> And if one day someone were to try to handle dma fences and drm
>>> syncobjs, the argument for a hierarchical design would be even
>>> stronger I think.
>>>
>>> Something like a drm_plugin.so could call sub-plugins (amdgpu,
>>> dma-buf, sync file, ...) and internally build the representation of
>>> the whole state and the relationships between the objects.
>>
>> Maybe. I guess a structure similar to libdrm makes sense. I'm not sure
>> it's strictly a hierarchy. Maybe more like some common code shared by
>> multiple GPU driver plugins. I think the common checkpoint state is
>> quite limited and restoring it requires the GPU-specific drivers anyway.
>>
>> Also the idea of building a representation of the whole state doesn't
>> work well with the CRIU design, because "the whole state" can include
>> multiple processes that restore themselves concurrently and only
>> synchronize with each other in a few places in the restore process. I
>> feel, if we can work out how to checkpoint and restore shared objects
>> between processes, we can do the same for shared objects between
>> drivers without imposing a strict hierarchy and some omniscient entity
>> that needs to know "the whole state".
>
> Okay, this continues on the same problem space as above. And you
> obviously know how CRIU works much better than me.
>
>>> With that kind of design there probably would be a need to define
>>> some common kernel side api and uapi, so all involved objects can be
>>> enumerated with some unique ids etc.
>>>
>>> Now.. the counter argument.. the more state from different drivers
>>> one would want to handle, the bigger this project would get. Whether it
>>> would even be feasible is the question, to the point that it may be simpler
>>> to just run the workload in a VM via SR-IOV and simply hibernate the
>>> whole thin guest. :)
>>
>> Well, CRIU kind of tries to do that, but with containers instead of
>> VMs. ;)
>
> It would definitely be useful for hardware and drivers without SR-IOV
> support so let's hope it is doable. :)
>
> Regards,
>
> Tvrtko
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: Proposal to add CRIU support to DRM render nodes
2024-03-28 16:03 ` Tvrtko Ursulin
@ 2024-03-28 20:42 ` Felix Kuehling
2024-04-01 15:09 ` Tvrtko Ursulin
0 siblings, 1 reply; 14+ messages in thread
From: Felix Kuehling @ 2024-03-28 20:42 UTC (permalink / raw)
To: Tvrtko Ursulin, dri-devel, amd-gfx, Dave Airlie, Daniel Vetter, criu
Cc: Errabolu, Ramesh, Christian König
On 2024-03-28 12:03, Tvrtko Ursulin wrote:
>
> Hi Felix,
>
> I had one more thought while browsing around the amdgpu CRIU plugin.
> It appears it relies on the KFD support being compiled in and /dev/kfd
> present, correct? AFAICT at least, it relies on that to figure out the
> amdgpu DRM node.
>
> It would probably be good to consider designing things without that
> dependency. So that checkpointing an application which does not use
> /dev/kfd is possible. Or if the kernel does not even have the KFD
> support compiled in.
Yeah, if we want to support graphics apps that don't use KFD, we should
definitely do that. Currently we get a lot of topology information from
KFD, not even from the /dev/kfd device but from the sysfs nodes exposed
by KFD. We'd need to get GPU device info from the render nodes instead.
And if KFD is available, we may need to integrate both sources of
information.
>
> It could perhaps mean no more than adding some GPU discovery code into
> CRIU. Which should be flexible enough to account for things like
> re-assigned minor numbers due to driver reload.
Do you mean adding GPU discovery to the core CRIU, or to the plugin? I
was thinking this is still part of the plugin.
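Wherever the discovery code lives, surviving re-assigned minor numbers suggests keying devices by their PCI bus address, which sysfs exposes through the render node's `device` link. A sketch (function name is illustrative; the test emulates the sysfs layout in a temporary directory):

```python
import os

def render_node_pci_addr(sysfs_drm_dir, node):
    """Resolve e.g. 'renderD128' to its PCI bus address by following
    the device symlink, e.g. /sys/class/drm/renderD128/device ->
    .../0000:03:00.0. The PCI address is stable across driver reloads,
    unlike the render node minor number."""
    dev_link = os.path.join(sysfs_drm_dir, node, "device")
    return os.path.basename(os.path.realpath(dev_link))
```

On a real system this would be called as `render_node_pci_addr("/sys/class/drm", "renderD128")`, and the checkpoint would record the PCI address rather than the minor number.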
>
> Otherwise I am eagerly awaiting to hear more about the design
> specifics around dma-buf handling. And also seeing how to extend to
> other DRM related anonymous fds.
I've been pretty far under-water lately. I hope I'll find time to work
on this more, but it's probably going to be at least a few weeks.
Regards,
Felix
>
> Regards,
>
> Tvrtko
>
> On 15/03/2024 18:36, Tvrtko Ursulin wrote:
>>
>> On 15/03/2024 02:33, Felix Kuehling wrote:
>>>
>>> On 2024-03-12 5:45, Tvrtko Ursulin wrote:
>>>>
>>>> On 11/03/2024 14:48, Tvrtko Ursulin wrote:
>>>>>
>>>>> Hi Felix,
>>>>>
>>>>> On 06/12/2023 21:23, Felix Kuehling wrote:
>>>>>> Executive Summary: We need to add CRIU support to DRM render
>>>>>> nodes in order to maintain CRIU support for ROCm applications once
>>>>>> they start relying on render nodes for more GPU memory
>>>>>> management. In this email I'm providing some background why we
>>>>>> are doing this, and outlining some of the problems we need to
>>>>>> solve to checkpoint and restore render node state and shared
>>>>>> memory (DMABuf) state. I have some thoughts on the API design,
>>>>>> leaning on what we did for KFD, but would like to get feedback
>>>>>> from the DRI community regarding that API and to what extent
>>>>>> there is interest in making that generic.
>>>>>>
>>>>>> We are working on using DRM render nodes for virtual address
>>>>>> mappings in ROCm applications to implement the CUDA11-style VM
>>>>>> API and improve interoperability between graphics and compute.
>>>>>> This uses DMABufs for sharing buffer objects between KFD and
>>>>>> multiple render node devices, as well as between processes. In
>>>>>> the long run this also provides a path to moving all or most
>>>>>> memory management from the KFD ioctl API to libdrm.
>>>>>>
>>>>>> Once ROCm user mode starts using render nodes for virtual address
>>>>>> management, that creates a problem for checkpointing and
>>>>>> restoring ROCm applications with CRIU. Currently there is no
>>>>>> support for checkpointing and restoring render node state, other
>>>>>> than CPU virtual address mappings. Support will be needed for
>>>>>> checkpointing GEM buffer objects and handles, their GPU virtual
>>>>>> address mappings and memory sharing relationships between devices
>>>>>> and processes.
>>>>>>
>>>>>> Eventually, if full CRIU support for graphics applications is
>>>>>> desired, more state would need to be captured, including
>>>>>> scheduler contexts and BO lists. Most of this state is
>>>>>> driver-specific.
>>>>>>
>>>>>> After some internal discussions we decided to take our design
>>>>>> process public as this potentially touches DRM GEM and DMABuf
>>>>>> APIs and may have implications for other drivers in the future.
>>>>>>
>>>>>> One basic question before going into any API details: Is there a
>>>>>> desire to have CRIU support for other DRM drivers?
>>>>>
>>>>> This sounds like a very interesting feature on the overall,
>>>>> although I cannot answer on the last question here.
>>>>
>>>> I forgot to finish this thought. I cannot answer / don't know of
>>>> any concrete plans, but I think feature is pretty cool and if
>>>> amdgpu gets it working I wouldn't be surprised if other drivers
>>>> would get interested.
>>>
>>> Thanks, that's good to hear!
>>>
>>>
>>>>
>>>>> Funnily enough, it has a tiny relation to an i915 feature I
>>>>> recently implemented on Mesa's request, which is to be able to
>>>>> "upload" the GPU context from the GPU hang error state and replay
>>>>> the hanging request. It is kind of (at a stretch) a very special
>>>>> tiny subset of checkpoint and restore so I am not mentioning it as a
>>>>> curiosity.
>>>>>
>>>>> And there is also another partial conceptual intersection with the
>>>>> (at the moment not yet upstream) i915 online debugger. This part
>>>>> being in the area of discovering and enumerating GPU resources
>>>>> belonging to the client.
>>>>>
>>>>> I don't see an immediate design or code sharing opportunities
>>>>> though but just mentioning.
>>>>>
>>>>> I did spend some time reading your plugin and kernel
>>>>> implementation out of curiosity and have some comments and
>>>>> questions.
>>>>>
>>>>>> With that out of the way, some considerations for a possible DRM
>>>>>> CRIU API (either generic of AMDGPU driver specific): The API goes
>>>>>> through several phases during checkpoint and restore:
>>>>>>
>>>>>> Checkpoint:
>>>>>>
>>>>>> 1. Process-info (enumerates objects and sizes so user mode can
>>>>>> allocate
>>>>>> memory for the checkpoint, stops execution on the GPU)
>>>>>> 2. Checkpoint (store object metadata for BOs, queues, etc.)
>>>>>> 3. Unpause (resumes execution after the checkpoint is complete)
>>>>>>
>>>>>> Restore:
>>>>>>
>>>>>> 1. Restore (restore objects, VMAs are not in the right place at
>>>>>> this time)
>>>>>> 2. Resume (final fixups after the VMAs are sorted out, resume
>>>>>> execution)
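The phase ordering above can be made explicit as a small state machine; KFD models these phases as operations on a single CRIU ioctl, and a driver could reject out-of-order calls. The names and transition table below are an illustrative simplification, not the exact kernel uapi:

```python
from enum import Enum, auto

class Op(Enum):
    PROCESS_INFO = auto()  # enumerate objects/sizes, stop GPU execution
    CHECKPOINT = auto()    # dump object metadata (BOs, queues, ...)
    UNPAUSE = auto()       # resume execution after the checkpoint
    RESTORE = auto()       # recreate objects (VMAs not yet in place)
    RESUME = auto()        # final fixups once VMAs are sorted out

# Legal phase transitions; anything else would be rejected (-EINVAL).
NEXT = {
    None: {Op.PROCESS_INFO, Op.RESTORE},
    Op.PROCESS_INFO: {Op.CHECKPOINT},
    Op.CHECKPOINT: {Op.UNPAUSE},
    Op.RESTORE: {Op.RESUME},
}

def run(ops):
    state = None
    for op in ops:
        if op not in NEXT.get(state, set()):
            raise ValueError(f"illegal phase {op} after {state}")
        state = op
    return state

assert run([Op.PROCESS_INFO, Op.CHECKPOINT, Op.UNPAUSE]) is Op.UNPAUSE
assert run([Op.RESTORE, Op.RESUME]) is Op.RESUME
```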
>>>>>
>>>>> Btw does checkpointing guarantee that all relevant activity is idled?
>>>>> For instance dma_resv objects are free of fences which would need
>>>>> to be restored for things to continue executing sensibly? Or how is
>>>>> that handled?
>>>
>>> In our compute use cases, we suspend user mode queues. This can
>>> include CWSR (compute-wave-save-restore) where the state of
>>> in-flight waves is stored in memory and can be reloaded and resumed
>>> from memory later. We don't use any fences other than "eviction
>>> fences", that are signaled after the queues are suspended. And those
>>> fences are never handed to user mode. So we don't need to worry
>>> about any fence state in the checkpoint.
>>>
>>> If we extended this to support the kernel mode command submission
>>> APIs, I would expect that we'd wait for all current submissions to
>>> complete, and stop new ones from being sent to the HW before taking
>>> the checkpoint. When we take the checkpoint in the CRIU plugin, the
>>> CPU threads are already frozen and cannot submit any more work. If
>>> we wait for all currently pending submissions to drain, I think we
>>> don't need to save any fence state because all the fences will have
>>> signaled. (I may be missing some intricacies and I'm afraid it may
>>> not be that simple in reality, but that's my opening bid. ;)
>>
>> It feels feasible to me too, for the normally behaving clients at least.
>>
>> Presumably, given that the whole checkpointing is not instant, it
>> would be okay to wait a second or two longer for the in-progress
>> submissions to complete. After which the kernel would need to prune all
>> signalled fences from the respective container objects before
>> checkpointing.
>>
>> For the "misbehaving" clients who have perhaps queued up too much
>> work, either still in the scheduler with unsatisfied dependencies, or
>> already submitted to the hardware and/or driver backend, is there a
>> timeout concept in CRIU so it would be possible to say something like
>> "try to checkpoint but if the kernel says no within time period t then give
>> up"?
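A drain-with-timeout step could look like the sketch below: wait for outstanding submissions to signal, and abandon the checkpoint if the deadline passes. This is an illustrative simulation (callables stand in for fences), not CRIU or kernel API:

```python
import time

def drain_submissions(pending, timeout_s, poll_s=0.01):
    """Wait for all pending submissions (callables returning True once
    signalled) to complete. Returns True if everything drained within
    timeout_s; False means a misbehaving client queued too much work
    and the checkpoint attempt should be given up."""
    deadline = time.monotonic() + timeout_s
    while True:
        pending = [f for f in pending if not f()]
        if not pending:
            return True   # all fences signalled; no fence state to save
        if time.monotonic() >= deadline:
            return False  # give up: checkpoint would not be consistent
        time.sleep(poll_s)

assert drain_submissions([lambda: True, lambda: True], timeout_s=1.0)
assert not drain_submissions([lambda: False], timeout_s=0.05)
```

If the drain succeeds, all fences have signalled and, after pruning the signalled fences from their containers, no fence state needs to be captured in the checkpoint.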
>>
>>>>>> For some more background about our implementation in KFD, you can
>>>>>> refer to this whitepaper:
>>>>>> https://github.com/checkpoint-restore/criu/blob/criu-dev/plugins/amdgpu/README.md
>>>>>>
>>>>>>
>>>>>> Potential objections to a KFD-style CRIU API in DRM render nodes,
>>>>>> I'll address each of them in more detail below:
>>>>>>
>>>>>> * Opaque information in the checkpoint data that user mode can't
>>>>>> interpret or do anything with
>>>>>> * A second API for creating objects (e.g. BOs) that is separate
>>>>>> from
>>>>>> the regular BO creation API
>>>>>> * Kernel mode would need to be involved in restoring BO sharing
>>>>>> relationships rather than replaying BO creation, export and
>>>>>> import
>>>>>> from user mode
>>>>>>
>>>>>> # Opaque information in the checkpoint
>>>>>>
>>>>>> This comes out of ABI compatibility considerations. Adding any
>>>>>> new objects or attributes to the driver/HW state that needs to be
>>>>>> checkpointed could potentially break the ABI of the CRIU
>>>>>> checkpoint/restore ioctl if the plugin needs to parse that
>>>>>> information. Therefore, much of the information in our KFD CRIU
>>>>>> ioctl API is opaque. It is written by kernel mode in the
>>>>>> checkpoint, it is consumed by kernel mode when restoring the
>>>>>> checkpoint, but user mode doesn't care about the contents or
>>>>>> binary layout, so there is no user mode ABI to break. This is how
>>>>>> we were able to maintain CRIU support when we added the SVM API
>>>>>> to KFD without changing the CRIU plugin and without breaking our
>>>>>> ABI.
>>>>>>
>>>>>> Opaque information may also lend itself to API abstraction, if
>>>>>> this becomes a generic DRM API with driver-specific callbacks
>>>>>> that fill in HW-specific opaque data.
>>>>>
>>>>> This feels sound in principle to me. Fundamentally the state is
>>>>> very hardware specific, and/or driver version specific, so I don't
>>>>> see anything could be gained in practice by making it much less
>>>>> opaque. (Apart from making things more complicated.)
>>>>>
>>>>> I was however unsure of the current split of how you dump buffer
>>>>> objects with some data in the defined bo structure, and some in
>>>>> completely opaque private data. Is there a benefit to that split,
>>>>> or maybe in other words, is there a benefit on having part
>>>>> transparent and part opaque for buffer objects?
>>>
>>> Some of the buffer object state is needed by the plugin. E.g. the
>>> size and mmap offset are needed to match VMAs with BOs. I'd have to
>>> review the plugin in detail to prove that all the fields are, in
>>> fact, needed by the plugin, but that was the intent. Anything that
>>> the plugin doesn't need to know should be in the opaque data
>>> structures.
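The size/mmap-offset matching mentioned here can be sketched as follows (field names are illustrative; the real plugin matches `/proc/<pid>/maps` entries against the checkpointed BO buckets):

```python
def match_vmas_to_bos(vmas, bos):
    """vmas: list of (start, end, file_offset) from the process's memory
    map. bos: list of dicts with 'mmap_offset' and 'size' taken from the
    checkpoint. Returns {vma_start: bo_index} for every VMA that is
    backed by a known BO; unmatched VMAs are ordinary anonymous memory."""
    by_offset = {bo["mmap_offset"]: i for i, bo in enumerate(bos)}
    matches = {}
    for start, end, offset in vmas:
        i = by_offset.get(offset)
        if i is not None and end - start <= bos[i]["size"]:
            matches[start] = i  # this VMA maps (part of) this BO
    return matches

bos = [{"mmap_offset": 0x100000, "size": 0x4000}]
vmas = [(0x7f0000000000, 0x7f0000004000, 0x100000),  # maps BO 0
        (0x7f0000010000, 0x7f0000011000, 0x999999)]  # no matching BO
assert match_vmas_to_bos(vmas, bos) == {0x7f0000000000: 0}
```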
>>
>> Right, got it.
>>
>> Would it make sense to put the opaque data in the same block as the
>> defined one? I mean, instead of separating the two in the binary image,
>> have struct kfd_criu_bo_bucket carry a trailing priv_data
>> blob? Maybe it is too late now if the image format is not versioned
>> or something.
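The trailing-blob layout suggested here is simple to express: a fixed, user-visible header followed by a driver-private blob whose length is the only thing the plugin needs to know about it. A hypothetical packing sketch (not the actual `kfd_criu_bo_bucket` layout):

```python
import struct

# Illustrative header: size, mmap_offset, priv_size (little-endian).
HDR = struct.Struct("<QQI")

def pack_bo_bucket(size, mmap_offset, priv):
    """Defined fields first, opaque driver blob trailing. User mode can
    parse the header; the blob round-trips untouched to the kernel, so
    its contents can change without breaking the plugin ABI."""
    return HDR.pack(size, mmap_offset, len(priv)) + priv

def unpack_bo_bucket(buf):
    size, mmap_offset, priv_size = HDR.unpack_from(buf)
    priv = buf[HDR.size:HDR.size + priv_size]
    return size, mmap_offset, priv

blob = b"\xde\xad\xbe\xef"  # opaque HW state only the kernel understands
buf = pack_bo_bucket(0x4000, 0x100000, blob)
assert unpack_bo_bucket(buf) == (0x4000, 0x100000, blob)
```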
>>
>>>>> To slightly touch upon the question of whether this could become a
>>>>> generic DRM API. It feels it would be hard to do it from the
>>>>> start. What sounds more feasible is that if/when generic-looking
>>>>> helpers can be spotted while developing the RFC, the code can be
>>>>> structured so that they can easily be promoted to shared/common at
>>>>> some future moment.
>>>
>>> Yes, that's how this usually goes, in my experience. Thanks for
>>> confirming.
>>>
>>>
>>>>>
>>>>>> # Second API for creating objects
>>>>>>
>>>>>> Creating BOs and other objects when restoring a checkpoint needs
>>>>>> more information than the usual BO alloc and similar APIs
>>>>>> provide. For example, we need to restore BOs with the same GEM
>>>>>> handles so that user mode can continue using those handles after
>>>>>> resuming execution. If BOs are shared through DMABufs without
>>>>>> dynamic attachment, we need to restore pinned BOs as pinned.
>>>>>> Validation of virtual addresses and handling MMU notifiers must
>>>>>> be suspended until the virtual address space is restored. For
>>>>>> user mode queues we need to save and restore a lot of queue
>>>>>> execution state so that execution can resume cleanly.
>>>>>
>>>>> This also sounds justified to me. Restore-time creation of all
>>>>> internal objects is definitely special, and it sounds better to add
>>>>> uapi to create them directly with the correct properties than to add
>>>>> uapi to adjust internal properties after creation. In any case you
>>>>> would always need some new uapi - at least to adjust after creation.
>>>>> At which point you may have both in one. Internally the
>>>>> implementation can be split or common, whatever makes sense for a
>>>>> given object type, but new uapi definitely sounds required.
>>>>>> # Restoring buffer sharing relationships
>>>>>>
>>>>>> Different GEM handles in different render nodes and processes can
>>>>>> refer to the same underlying shared memory, either by directly
>>>>>> pointing to the same GEM object, or by creating an import
>>>>>> attachment that may get its SG tables invalidated and updated
>>>>>> dynamically through dynamic attachment callbacks. In the latter
>>>>>> case it's obvious who is the exporter and who is the importer.
>>>>>> In the first case, either one could be the exporter, and it's not
>>>>>> clear who would need to create the BO and who would need to
>>>>>
>>>>> To see if I follow the former case correctly.
>>>>>
>>>>> This could be two clients A and B, where B has imported a dma-buf
>>>>> shared BO from A and has since closed the dma-buf fd? Which
>>>>> results in a single BO with reference count of 2 and
>>>>> obj->import_attach unset. History of who created the object is lost.
>>>
>>> Yes. In the amdgpu driver this happens when the exporter and import
>>> device are the same.
>>>
>>>
>>>>>
>>>>> In fact it could even be that two imported objects remain (clients
>>>>> A, B and C) and A, who originally created the BO, has since fully
>>>>> released it. So any kind of "creator" tracking if added wouldn't
>>>>> be fully reliable either.
>>>
>>> That's a good point.
>>>
>>>
>>>>>
>>>>>> import it when restoring the checkpoint. To further complicate
>>>>>> things, multiple processes in a checkpoint get restored
>>>>>> concurrently. So there is no guarantee that an exporter has
>>>>>> restored a shared BO at the time an importer is trying to restore
>>>>>> its import.
>>>>>>
>>>>>> A proposal to deal with these problems would be to treat
>>>>>> importers and exporters the same. Whoever restores first, ends up
>>>>>> creating the BO and potentially attaching to it. The other
>>>>>> process(es) can find BOs that were already restored by another
>>>>>> process by looking it up with a unique ID that could be based on
>>>>>> the DMABuf inode number. An alternative would be a two-pass
>>>>>> approach that restores BOs in two passes:
>>>>>>
>>>>>> 1. Restore exported BOs
>>>>>> 2. Restore imports
>>>>>>
>>>>>> With some inter-process synchronization in CRIU itself between
>>>>>> these two passes. This may require changes in the core CRIU,
>>>>>> outside our plugin. Both approaches depend on identifying BOs
>>>>>> with some unique ID that could be based on the DMABuf inode
>>>>>> number in the checkpoint. However, we would need to identify the
>>>>>> processes in the same restore session, possibly based on
>>>>>> parent/child process relationships, to create a scope where those
>>>>>> IDs are valid during restore.
>>>>>
>>>>> If my understanding above is on the right track, then I think this
>>>>> is the only thing which can be done (for all scenarios).
>>>
>>> I presented two alternatives. I think you're in favor of the first
>>> one, where it doesn't matter who is the importer and exporter. I
>>> think the two-pass approach requires that you can identify an
>>> exporter. And as you pointed out, the exporter may already have
>>> dropped their reference to the BO.
>>
>> Yep.
>>
>>>>> I also *think* it would be safe. At least at the moment I cannot
>>>>> think what could go wrong. Semantics are that it doesn't really
>>>>> matter who created the object.
>>>
>>> I would agree. What matters is that the object is recreated on the
>>> correct device, and that all the direct references and import
>>> attachments pointing to it are restored.
>>>
>>>
>>>>>
>>>>>> Finally, we would also need to checkpoint and restore DMABuf file
>>>>>> descriptors themselves. These are anonymous file descriptors. The
>>>>>> CRIU plugin could probably be taught to recreate them from the
>>>>>> original exported BO based on the inode number that could be
>>>>>> queried with fstat in the checkpoint. It would need help from the
>>>>>> render node CRIU API to find the right BO from the inode, which
>>>>>> may be from a different process in the same restore session.
>>>>>
>>>>> This part feels like it is breaking the component separation a bit
>>>>> because even for buffers fully owned by amdgpu, strictly speaking
>>>>> the dma-buf fd is not. At least my understanding from the above is
>>>>> that you propose to attempt to import the fd, from the kernel
>>>>> side, during the checkpoint process? Like:
>>>>>
>>>>> Checkpoint:
>>>>>
>>>>> CRIU for each anon fd:
>>>>> amdgpu_plugin(fd)
>>>>> -> attempt in kernel dma buf import (passes fd to kernel via
>>>>> ioctl?)
>>>>> -> is it ours? (no -> error)
>>>>> -> create a record mapping fd number to amdgpu BO
>>>>>
>>>>> Restore:
>>>>>
>>>>> for each dma-buf fd record:
>>>>> create BO if it does not exist
>>>>> export BO to same fd
>>>>> close BO handle if not in regular BO handle records
>>>>>
>>>>> Or since you mention lookup by inode, that would need to be
>>>>> dmabuf_plugin so it can look up inodes in the private mount space.
>>>>> However, how it would co-operate with amdgpu_plugin is not clear to
>>>>> me.
>>>
>>> The way I think about the ownership is, whichever driver created the
>>> underlying BO owns the checkpointing of the dmabuf. You need
>>> driver-specific information to link the dmabuf with the driver's BO
>>> and you need the right driver to recreate the BO and the dmabuf fd
>>> when restoring the checkpoint.
>>>
>>> It gets really interesting if you have an amdgpu plugin and an i915
>>> plugin, and they checkpoint an application that shares BOs between
>>> the two devices through DMABufs. E.g. if i915 created a BO and
>>> amdgpu imported it, then during restore, i915 needs to restore the
>>> dmabuf before the amdgpu import of it can be restored. I think that
>>> brings us back to a two-phase approach to restoring the memory
>>> sharing relationships. Uff.
>>
>> I think this part of the discussion somewhat depends on the previous
>> part about idling. If it is feasible to completely idle and prune,
>> and fail if that is not happening quickly enough, then maybe there
>> wouldn't be too much hierarchical state to save.
>>
>> Otherwise my idea was that there is a top-level drm_plugin.so which
>> understands amdgpu fds, i915, syncobj, sync_file, and uses some new
>> uapi to uniquely identify each, associate with the correct driver,
>> and then internally dispatches to amdgpu|i915|dmabuf|..._plugin.so.
>> Building the in memory representation of their relationships. As long
>> as all objects and their relationships have been recorded I think
>> everything could then be correctly restored.
>>
>> It is possible there is a flaw in my thinking and that something in
>> CRIU design would make this impossible? I think it would require the
>> top-level drm_plugin.so to hold all state in memory until the whole
>> checkpointing is done, and then verify something is not incomplete,
>> failing it all if it was. (For instance one plugin discovered a
>> reference to an object which was not discovered by any other plugin or
>> things like that.) May need some further tweaks to CRIU common code.
>>
>> Maybe I need to better understand how exactly you mean to query the
>> DRM driver about random anonymous fds. I see it as a problem in the
>> design, possibly even implementation, but maybe I am missing
>> something which makes it not so. I mean even with my general idea I
>> don't know how one would determine which driver to query about a
>> particular anonymous inode.
>>
>>>> I later also realised that I was maybe increasing the scope for you
>>>> here. :) You did state the focus is ROCm applications, which possibly
>>>> don't care about dma-resv, fences, syncobjs etc?
>>>
>>> That's my focus for now. But I don't want to engineer a solution
>>> that would preclude your use cases in the future.
>>>
>>>
>>>>
>>>> But I think the "how to handle dma-bufs" design question is still
>>>> relevant and interesting. For example I had this thought that
>>>> perhaps what would be needed is a CRIU plugin hierarchy.
>>>>
>>>> Because fundamentally we would be snapshotting a hierarchy of kernel
>>>> objects belonging to different drivers (kfd, amdgpu, dma-buf, ...).
>>>> And if one day someone were to try to handle dma fences and drm
>>>> syncobjs, the argument for a hierarchical design would be even
>>>> stronger I think.
>>>>
>>>> Something like a drm_plugin.so could call sub-plugins (amdgpu,
>>>> dma-buf, sync file, ...) and internally build the representation of
>>>> the whole state and the relationships between the objects.
>>>
>>> Maybe. I guess a structure similar to libdrm makes sense. I'm not
>>> sure it's strictly a hierarchy. Maybe more like some common code
>>> shared by multiple GPU driver plugins. I think the common checkpoint
>>> state is quite limited and restoring it requires the GPU-specific
>>> drivers anyway.
>>>
>>> Also the idea of building a representation of the whole state
>>> doesn't work well with the CRIU design, because "the whole state"
>>> can include multiple processes that restore themselves concurrently
>>> and only synchronize with each other in a few places in the restore
>>> process. I feel, if we can work out how to checkpoint and restore
>>> shared objects between processes, we can do the same for shared
>>> objects between drivers without imposing a strict hierarchy and some
>>> omniscient entity that needs to know "the whole state".
>>
>> Okay, this continues on the same problem space as above. And you
>> obviously know how CRIU works much better than me.
>>
>>>> With that kind of design there probably would be a need to define
>>>> some common kernel side api and uapi, so all involved objects can
>>>> be enumerated with some unique ids etc.
>>>>
>>>> Now.. the counter argument.. the more state from different drivers
>>>> one would want to handle, the bigger this project would get. Whether
>>>> it would even be feasible is the question, to the point that it may be
>>>> simpler to just run the workload in a VM via SR-IOV and simply
>>>> hibernate the whole thin guest. :)
>>>
>>> Well, CRIU kind of tries to do that, but with containers instead of
>>> VMs. ;)
>>
>> It would definitely be useful for hardware and drivers without SR-IOV
>> support so let's hope it is doable. :)
>>
>> Regards,
>>
>> Tvrtko
* Re: Proposal to add CRIU support to DRM render nodes
2024-03-28 20:42 ` Felix Kuehling
@ 2024-04-01 15:09 ` Tvrtko Ursulin
2024-04-01 16:37 ` Felix Kuehling
0 siblings, 1 reply; 14+ messages in thread
From: Tvrtko Ursulin @ 2024-04-01 15:09 UTC (permalink / raw)
To: Felix Kuehling, dri-devel, amd-gfx, Dave Airlie, Daniel Vetter, criu
Cc: Errabolu, Ramesh, Christian König
On 28/03/2024 20:42, Felix Kuehling wrote:
>
> On 2024-03-28 12:03, Tvrtko Ursulin wrote:
>>
>> Hi Felix,
>>
>> I had one more thought while browsing around the amdgpu CRIU plugin.
>> It appears it relies on the KFD support being compiled in and /dev/kfd
>> present, correct? AFAICT at least, it relies on that to figure out the
>> amdgpu DRM node.
>>
>> It would probably be good to consider designing things without that
>> dependency. So that checkpointing an application which does not use
>> /dev/kfd is possible. Or if the kernel does not even have the KFD
>> support compiled in.
>
> Yeah, if we want to support graphics apps that don't use KFD, we should
> definitely do that. Currently we get a lot of topology information from
> KFD, not even from the /dev/kfd device but from the sysfs nodes exposed
> by KFD. We'd need to get GPU device info from the render nodes instead.
> And if KFD is available, we may need to integrate both sources of
> information.
>
>
>>
>> It could perhaps mean no more than adding some GPU discovery code into
>> CRIU. Which should be flexible enough to account for things like
>> re-assigned minor numbers due to driver reload.
>
> Do you mean adding GPU discovery to the core CRIU, or to the plugin? I
> was thinking this is still part of the plugin.
Yes I agree. I was only thinking about adding some DRM device discovery
code in a more decoupled fashion from the current plugin, for both the
reason discussed above (decoupling a bit from reliance on kfd sysfs),
and also so that if/when a new DRM driver wants to implement this, the
code could be moved to some common plugin area.
I am not sure how feasible that would be though. The "gpu id" concept
and its matching in the current kernel code and CRIU plugin - is that
value tied to the physical GPU instance, or how does it work?
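For illustration, discovery decoupled from kfd sysfs could start by simply scanning for DRM render nodes. A hedged Python sketch (everything here is illustrative, not CRIU plugin code); parsing the minor from the node name means re-assigned minors after a driver reload are picked up correctly:

```python
import os
import re

def find_render_nodes(dev_dir="/dev/dri"):
    """Enumerate DRM render nodes (renderD128+) without relying on /dev/kfd.

    Returns a sorted list of (minor, path) tuples. The minor number is
    parsed from the node name rather than assumed, so re-assigned minors
    after a driver reload are handled.
    """
    nodes = []
    if not os.path.isdir(dev_dir):
        return nodes  # no DRM devices (or not on Linux)
    for name in os.listdir(dev_dir):
        m = re.fullmatch(r"renderD(\d+)", name)
        if m:
            nodes.append((int(m.group(1)), os.path.join(dev_dir, name)))
    return sorted(nodes)
```

A real plugin would then open each node and query the driver name (e.g. via the DRM version ioctl) to identify which devices belong to amdgpu.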
>> Otherwise I am eagerly awaiting to hear more about the design
>> specifics around dma-buf handling. And also seeing how to extend to
>> other DRM related anonymous fds.
>
> I've been pretty far under-water lately. I hope I'll find time to work
> on this more, but it's probably going to be at least a few weeks.
Got it.
Regards,
Tvrtko
>
> Regards,
> Felix
>
>
>>
>> Regards,
>>
>> Tvrtko
>>
>> On 15/03/2024 18:36, Tvrtko Ursulin wrote:
>>>
>>> On 15/03/2024 02:33, Felix Kuehling wrote:
>>>>
>>>> On 2024-03-12 5:45, Tvrtko Ursulin wrote:
>>>>>
>>>>> On 11/03/2024 14:48, Tvrtko Ursulin wrote:
>>>>>>
>>>>>> Hi Felix,
>>>>>>
>>>>>> On 06/12/2023 21:23, Felix Kuehling wrote:
>>>>>>> Executive Summary: We need to add CRIU support to DRM render
>>>>>>> nodes in order to maintain CRIU support for ROCm applications once
>>>>>>> they start relying on render nodes for more GPU memory
>>>>>>> management. In this email I'm providing some background why we
>>>>>>> are doing this, and outlining some of the problems we need to
>>>>>>> solve to checkpoint and restore render node state and shared
>>>>>>> memory (DMABuf) state. I have some thoughts on the API design,
>>>>>>> leaning on what we did for KFD, but would like to get feedback
>>>>>>> from the DRI community regarding that API and to what extent
>>>>>>> there is interest in making that generic.
>>>>>>>
>>>>>>> We are working on using DRM render nodes for virtual address
>>>>>>> mappings in ROCm applications to implement the CUDA11-style VM
>>>>>>> API and improve interoperability between graphics and compute.
>>>>>>> This uses DMABufs for sharing buffer objects between KFD and
>>>>>>> multiple render node devices, as well as between processes. In
>>>>>>> the long run this also provides a path to moving all or most
>>>>>>> memory management from the KFD ioctl API to libdrm.
>>>>>>>
>>>>>>> Once ROCm user mode starts using render nodes for virtual address
>>>>>>> management, that creates a problem for checkpointing and
>>>>>>> restoring ROCm applications with CRIU. Currently there is no
>>>>>>> support for checkpointing and restoring render node state, other
>>>>>>> than CPU virtual address mappings. Support will be needed for
>>>>>>> checkpointing GEM buffer objects and handles, their GPU virtual
>>>>>>> address mappings and memory sharing relationships between devices
>>>>>>> and processes.
>>>>>>>
>>>>>>> Eventually, if full CRIU support for graphics applications is
>>>>>>> desired, more state would need to be captured, including
>>>>>>> scheduler contexts and BO lists. Most of this state is
>>>>>>> driver-specific.
>>>>>>>
>>>>>>> After some internal discussions we decided to take our design
>>>>>>> process public as this potentially touches DRM GEM and DMABuf
>>>>>>> APIs and may have implications for other drivers in the future.
>>>>>>>
>>>>>>> One basic question before going into any API details: Is there a
>>>>>>> desire to have CRIU support for other DRM drivers?
>>>>>>
>>>>>> This sounds like a very interesting feature overall, although I
>>>>>> cannot answer the last question here.
>>>>>
>>>>> I forgot to finish this thought. I cannot answer / don't know of
>>>>> any concrete plans, but I think the feature is pretty cool and if
>>>>> amdgpu gets it working I wouldn't be surprised if other drivers
>>>>> would get interested.
>>>>
>>>> Thanks, that's good to hear!
>>>>
>>>>
>>>>>
>>>>>> Funnily enough, it has a tiny relation to an i915 feature I
>>>>>> recently implemented on Mesa's request, which is to be able to
>>>>>> "upload" the GPU context from the GPU hang error state and replay
>>>>>> the hanging request. It is kind of (at a stretch) a very special
>>>>>> tiny subset of checkpoint and restore so I am mentioning it merely
>>>>>> as a curiosity.
>>>>>>
>>>>>> And there is also another partial conceptual intersection with the
>>>>>> (at the moment not yet upstream) i915 online debugger. This part
>>>>>> being in the area of discovering and enumerating GPU resources
>>>>>> belonging to the client.
>>>>>>
>>>>>> I don't see any immediate design or code sharing opportunities
>>>>>> though, but just mentioning it.
>>>>>>
>>>>>> I did spend some time reading your plugin and kernel
>>>>>> implementation out of curiosity and have some comments and
>>>>>> questions.
>>>>>>
>>>>>>> With that out of the way, some considerations for a possible DRM
>>>>>>> CRIU API (either generic or AMDGPU driver specific): The API goes
>>>>>>> through several phases during checkpoint and restore:
>>>>>>>
>>>>>>> Checkpoint:
>>>>>>>
>>>>>>> 1. Process-info (enumerates objects and sizes so user mode can
>>>>>>> allocate
>>>>>>> memory for the checkpoint, stops execution on the GPU)
>>>>>>> 2. Checkpoint (store object metadata for BOs, queues, etc.)
>>>>>>> 3. Unpause (resumes execution after the checkpoint is complete)
>>>>>>>
>>>>>>> Restore:
>>>>>>>
>>>>>>> 1. Restore (restore objects, VMAs are not in the right place at
>>>>>>> this time)
>>>>>>> 2. Resume (final fixups after the VMAs are sorted out, resume
>>>>>>> execution)
>>>>>>
>>>>>> Btw is check-pointing guaranteeing all relevant activity is idled?
>>>>>> For instance, are dma_resv objects free of fences which would need
>>>>>> to be restored for things to continue executing sensibly? Or how is
>>>>>> that handled?
>>>>
>>>> In our compute use cases, we suspend user mode queues. This can
>>>> include CWSR (compute-wave-save-restore) where the state of
>>>> in-flight waves is stored in memory and can be reloaded and resumed
>>>> from memory later. We don't use any fences other than "eviction
>>>> fences", that are signaled after the queues are suspended. And those
>>>> fences are never handed to user mode. So we don't need to worry
>>>> about any fence state in the checkpoint.
>>>>
>>>> If we extended this to support the kernel mode command submission
>>>> APIs, I would expect that we'd wait for all current submissions to
>>>> complete, and stop new ones from being sent to the HW before taking
>>>> the checkpoint. When we take the checkpoint in the CRIU plugin, the
>>>> CPU threads are already frozen and cannot submit any more work. If
>>>> we wait for all currently pending submissions to drain, I think we
>>>> don't need to save any fence state because all the fences will have
>>>> signaled. (I may be missing some intricacies and I'm afraid it may
>>>> not be that simple in reality, but that's my opening bid. ;)
>>>
>>> It feels feasible to me too, for the normally behaving clients at least.
>>>
>>> Presumably, given that the whole checkpointing is not instant, it
>>> would be okay to wait a second or two longer for the in-progress
>>> submissions to complete. After which the kernel would need to prune
>>> all
>>> signalled fences from the respective container objects before
>>> checkpointing.
>>>
>>> For the "misbehaving" clients who have perhaps queued up too much
>>> work, either still in the scheduler with unsatisfied dependencies, or
>>> already submitted to the hardware and/or driver backend, is there a
>>> timeout concept in CRIU so it would be possible to say something like
>>> "try to checkpoint, but give up if the kernel says no within time
>>> period t"?
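The drain-then-prune idea with a give-up timeout could look roughly like the following. This is a Python simulation with stand-in fence objects; nothing here is real kernel API, it only models the policy being discussed:

```python
import time

class Fence:
    """Stand-in for a dma_fence: signals at a fixed monotonic time."""

    def __init__(self, signals_at):
        self.signals_at = signals_at

    def signaled(self, now):
        return now >= self.signals_at

def drain_and_prune(fences, timeout, now_fn=time.monotonic):
    """Wait up to `timeout` seconds for all pending submissions to signal.

    On success, return the pruned (empty) fence list, so the checkpoint
    contains no fence state at all. If a misbehaving client still has
    pending work when the timeout expires, give up and fail the
    checkpoint rather than trying to serialize in-flight fences.
    """
    deadline = now_fn() + timeout
    while True:
        now = now_fn()
        pending = [f for f in fences if not f.signaled(now)]
        if not pending:
            return []  # everything signaled: safe to checkpoint
        if now >= deadline:
            raise TimeoutError(f"{len(pending)} submissions still pending")
        time.sleep(min(0.01, deadline - now))
```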
>>>
>>>>>>> For some more background about our implementation in KFD, you can
>>>>>>> refer to this whitepaper:
>>>>>>> https://github.com/checkpoint-restore/criu/blob/criu-dev/plugins/amdgpu/README.md
>>>>>>>
>>>>>>> Potential objections to a KFD-style CRIU API in DRM render nodes,
>>>>>>> I'll address each of them in more detail below:
>>>>>>>
>>>>>>> * Opaque information in the checkpoint data that user mode can't
>>>>>>> interpret or do anything with
>>>>>>> * A second API for creating objects (e.g. BOs) that is separate
>>>>>>> from
>>>>>>> the regular BO creation API
>>>>>>> * Kernel mode would need to be involved in restoring BO sharing
>>>>>>> relationships rather than replaying BO creation, export and
>>>>>>> import
>>>>>>> from user mode
>>>>>>>
>>>>>>> # Opaque information in the checkpoint
>>>>>>>
>>>>>>> This comes out of ABI compatibility considerations. Adding any
>>>>>>> new objects or attributes to the driver/HW state that needs to be
>>>>>>> checkpointed could potentially break the ABI of the CRIU
>>>>>>> checkpoint/restore ioctl if the plugin needs to parse that
>>>>>>> information. Therefore, much of the information in our KFD CRIU
>>>>>>> ioctl API is opaque. It is written by kernel mode in the
>>>>>>> checkpoint, it is consumed by kernel mode when restoring the
>>>>>>> checkpoint, but user mode doesn't care about the contents or
>>>>>>> binary layout, so there is no user mode ABI to break. This is how
>>>>>>> we were able to maintain CRIU support when we added the SVM API
>>>>>>> to KFD without changing the CRIU plugin and without breaking our
>>>>>>> ABI.
>>>>>>>
>>>>>>> Opaque information may also lend itself to API abstraction, if
>>>>>>> this becomes a generic DRM API with driver-specific callbacks
>>>>>>> that fill in HW-specific opaque data.
>>>>>>
>>>>>> This feels sound in principle to me. Fundamentally the state is
>>>>>> very hardware specific, and/or driver version specific, so I don't
>>>>>> see anything could be gained in practice by making it much less
>>>>>> opaque. (Apart from making things more complicated.)
>>>>>>
>>>>>> I was however unsure of the current split of how you dump buffer
>>>>>> objects with some data in the defined bo structure, and some in
>>>>>> completely opaque private data. Is there a benefit to that split,
>>>>>> or maybe in other words, is there a benefit on having part
>>>>>> transparent and part opaque for buffer objects?
>>>>
>>>> Some of the buffer object state is needed by the plugin. E.g. the
>>>> size and mmap offset are needed to match VMAs with BOs. I'd have to
>>>> review the plugin in detail to prove that all the fields are, in
>>>> fact, needed by the plugin, but that was the intent. Anything that
>>>> the plugin doesn't need to know should be in the opaque data
>>>> structures.
>>>
>>> Right, got it.
>>>
>>> Would it make sense to make the opaque data in the same block as the
>>> defined one? I mean instead of separating the two in the binary image
>>> for instance have struct kfd_criu_bo_bucket have a trailing priv_data
>>> blob? Maybe it is too late now if the image format is not versioned
>>> or something.
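The trailing-priv_data layout mentioned above can be sketched with a packed record: a small transparent header holding only the fields the plugin itself needs, followed by an opaque blob that user mode carries but never parses. The field choice and struct format below are illustrative, not the real kfd_criu_bo_bucket ABI:

```python
import struct

# Transparent part: fields the plugin needs (e.g. to match VMAs with BOs).
# Layout is hypothetical: size, mmap offset, GPU address, priv blob length.
BO_HEADER = struct.Struct("<QQQI")

def pack_bo_bucket(size, mmap_offset, gpu_addr, priv_data):
    """Serialize one BO record: defined header + trailing opaque blob.

    The kernel writes and later consumes `priv_data`; user mode only
    carries it, so new driver/HW state can be added to the blob without
    breaking the plugin's ABI.
    """
    header = BO_HEADER.pack(size, mmap_offset, gpu_addr, len(priv_data))
    return header + priv_data

def unpack_bo_bucket(buf):
    size, mmap_offset, gpu_addr, priv_len = BO_HEADER.unpack_from(buf)
    priv = buf[BO_HEADER.size:BO_HEADER.size + priv_len]
    return (size, mmap_offset, gpu_addr, priv)
```

Whether the image format can still move to this layout depends on how the existing images are versioned, as noted above.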
>>>
>>>>>> To slightly touch upon the question of whether this could become a
>>>>>> generic DRM API. It feels it would be hard to do from the
>>>>>> start. What sounds more feasible is that if/when generic-looking
>>>>>> helpers can be spotted while developing the RFC, the code can be
>>>>>> structured so they can easily be promoted to shared/common at
>>>>>> some future moment.
>>>>
>>>> Yes, that's how this usually goes, in my experience. Thanks for
>>>> confirming.
>>>>
>>>>
>>>>>>
>>>>>>> # Second API for creating objects
>>>>>>>
>>>>>>> Creating BOs and other objects when restoring a checkpoint needs
>>>>>>> more information than the usual BO alloc and similar APIs
>>>>>>> provide. For example, we need to restore BOs with the same GEM
>>>>>>> handles so that user mode can continue using those handles after
>>>>>>> resuming execution. If BOs are shared through DMABufs without
>>>>>>> dynamic attachment, we need to restore pinned BOs as pinned.
>>>>>>> Validation of virtual addresses and handling MMU notifiers must
>>>>>>> be suspended until the virtual address space is restored. For
>>>>>>> user mode queues we need to save and restore a lot of queue
>>>>>>> execution state so that execution can resume cleanly.
>>>>>>
>>>>>> This also sounds justified to me. Restoring all internal objects
>>>>>> is definitely special, and it sounds better to add uapi to create
>>>>>> them directly with the correct properties than to add uapi to
>>>>>> adjust internal properties after creation. In any case you would
>>>>>> need some new uapi - at the very least to adjust after creation -
>>>>>> at which point you may as well have both in one. Internally the
>>>>>> implementation can be split or common, whatever makes sense for a
>>>>>> given object type, but new uapi definitely sounds required.
>>>>>>> # Restoring buffer sharing relationships
>>>>>>>
>>>>>>> Different GEM handles in different render nodes and processes can
>>>>>>> refer to the same underlying shared memory, either by directly
>>>>>>> pointing to the same GEM object, or by creating an import
>>>>>>> attachment that may get its SG tables invalidated and updated
>>>>>>> dynamically through dynamic attachment callbacks. In the latter
>>>>>>> case it's obvious who is the exporter and who is the importer.
>>>>>>> In the first case, either one could be the exporter, and it's not
>>>>>>> clear who would need to create the BO and who would need to
>>>>>>
>>>>>> To see if I follow the former case correctly.
>>>>>>
>>>>>> This could be two clients A and B, where B has imported a dma-buf
>>>>>> shared BO from A and has since closed the dma-buf fd? Which
>>>>>> results in a single BO with reference count of 2 and
>>>>>> obj->import_attach unset. History of who created the object is lost.
>>>>
>>>> Yes. In the amdgpu driver this happens when the exporter and import
>>>> device are the same.
>>>>
>>>>
>>>>>>
>>>>>> In fact it could even be that two imported objects remain (clients
>>>>>> A, B and C) and A, who originally created the BO, has since fully
>>>>>> released it. So any kind of "creator" tracking if added wouldn't
>>>>>> be fully reliable either.
>>>>
>>>> That's a good point.
>>>>
>>>>
>>>>>>
>>>>>>> import it when restoring the checkpoint. To further complicate
>>>>>>> things, multiple processes in a checkpoint get restored
>>>>>>> concurrently. So there is no guarantee that an exporter has
>>>>>>> restored a shared BO at the time an importer is trying to restore
>>>>>>> its import.
>>>>>>>
>>>>>>> A proposal to deal with these problems would be to treat
>>>>>>> importers and exporters the same. Whoever restores first, ends up
>>>>>>> creating the BO and potentially attaching to it. The other
>>>>>>> process(es) can find BOs that were already restored by another
>>>>>>> process by looking it up with a unique ID that could be based on
>>>>>>> the DMABuf inode number. An alternative would be a two-pass
>>>>>>> approach that needs to restore BOs on two passes:
>>>>>>>
>>>>>>> 1. Restore exported BOs
>>>>>>> 2. Restore imports
>>>>>>>
>>>>>>> With some inter-process synchronization in CRIU itself between
>>>>>>> these two passes. This may require changes in the core CRIU,
>>>>>>> outside our plugin. Both approaches depend on identifying BOs
>>>>>>> with some unique ID that could be based on the DMABuf inode
>>>>>>> number in the checkpoint. However, we would need to identify the
>>>>>>> processes in the same restore session, possibly based on
>>>>>>> parent/child process relationships, to create a scope where those
>>>>>>> IDs are valid during restore.
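The first of the two alternatives — whoever restores first creates the BO, later restorers just look it up by the unique ID — can be modeled with a per-session registry. A hedged Python sketch; the class and callback names are hypothetical, and the key stands in for an ID derived from the DMABuf inode number:

```python
import threading

class SharedBoRegistry:
    """First-restorer-wins lookup for shared BOs within one restore session.

    Keys are unique IDs (e.g. based on the DMABuf inode number recorded in
    the checkpoint); values are whatever handle the restore produces.
    """

    def __init__(self):
        self._lock = threading.Lock()
        self._bos = {}
        self.creations = 0  # how many BOs were actually (re)created

    def get_or_create(self, unique_id, create_fn):
        with self._lock:
            if unique_id not in self._bos:
                # This restorer got here first: it creates the shared BO,
                # regardless of who originally exported it.
                self._bos[unique_id] = create_fn()
                self.creations += 1
            # Later restorers attach to the already-restored BO.
            return self._bos[unique_id]
```

In the real design this lookup would have to live in the kernel (or be mediated by it), since the concurrently restoring processes share no user-space state; the sketch only shows the exporter/importer symmetry.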
>>>>>>
>>>>>> If my understanding above is on the right track, then I think this
>>>>>> is the only thing which can be done (for all scenarios).
>>>>
>>>> I presented two alternatives. I think you're in favor of the first
>>>> one, where it doesn't matter who is the importer and exporter. I
>>>> think the two-pass approach requires that you can identify an
>>>> exporter. And as you pointed out, the exporter may already have
>>>> dropped their reference to the BO.
>>>
>>> Yep.
>>>
>>>>>> I also *think* it would be safe. At least at the moment I cannot
>>>>>> think what could go wrong. Semantics are that it doesn't really
>>>>>> matter who created the object.
>>>>
>>>> I would agree. What matters is that the object is recreated on the
>>>> correct device, and that all the direct references and import
>>>> attachments pointing to it are restored.
>>>>
>>>>
>>>>>>
>>>>>>> Finally, we would also need to checkpoint and restore DMABuf file
>>>>>>> descriptors themselves. These are anonymous file descriptors. The
>>>>>>> CRIU plugin could probably be taught to recreate them from the
>>>>>>> original exported BO based on the inode number that could be
>>>>>>> queried with fstat in the checkpoint. It would need help from the
>>>>>>> render node CRIU API to find the right BO from the inode, which
>>>>>>> may be from a different process in the same restore session.
>>>>>>
>>>>>> This part feels like it is breaking the component separation a bit
>>>>>> because even for buffers fully owned by amdgpu, strictly speaking
>>>>>> the dma-buf fd is not. At least my understanding from the above is
>>>>>> that you propose to attempt to import the fd, from the kernel
>>>>>> side, during the checkpoint process? Like:
>>>>>>
>>>>>> Checkpoint:
>>>>>>
>>>>>> CRIU for each anon fd:
>>>>>> amdgpu_plugin(fd)
>>>>>> -> attempt in kernel dma buf import (passes fd to kernel via
>>>>>> ioctl?)
>>>>>> -> is it ours? (no -> error)
>>>>>> -> create a record mapping fd number to amdgpu BO
>>>>>>
>>>>>> Restore:
>>>>>>
>>>>>> for each dma-buf fd record:
>>>>>> create BO if does not exists
>>>>>> export BO to same fd
>>>>>> close BO handle if not in regular BO handle records
>>>>>>
>>>>>> Or since you mention lookup by inode, that would need to be
>>>>>> dmabuf_plugin so it can look up inodes in the private mount space.
>>>>>> However, how it would co-operate with amdgpu_plugin is not clear to
>>>>>> me.
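On the checkpoint side, the record linking an anonymous fd to a BO could start from the inode number, which user space can already read with fstat. A minimal sketch (a regular file stands in for a DMABuf fd here; the record layout is hypothetical):

```python
import os

def dmabuf_checkpoint_record(fd, bo_handle):
    """Build a checkpoint record linking an fd to a driver BO handle.

    st_ino on a DMABuf fd is its anonymous inode number, unique per
    dma-buf while it exists, so it can serve as the lookup key that
    restore-time code uses to find the right BO - possibly one restored
    by a different process in the same restore session.
    """
    st = os.fstat(fd)
    return {"inode": st.st_ino, "bo_handle": bo_handle}
```

The open question raised above remains: given only an anonymous inode, something still has to decide which driver plugin to ask about it.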
>>>>
>>>> The way I think about the ownership is, whichever driver created the
>>>> underlying BO owns the checkpointing of the dmabuf. You need
>>>> driver-specific information to link the dmabuf with the driver's BO
>>>> and you need the right driver to recreate the BO and the dmabuf fd
>>>> when restoring the checkpoint.
>>>>
>>>> It gets really interesting if you have an amdgpu plugin and an i915
>>>> plugin, and they checkpoint an application that shares BOs between
>>>> the two devices through DMABufs. E.g. if i915 created a BO and
>>>> amdgpu imported it, then during restore, i915 needs to restore the
>>>> dmabuf before the amdgpu import of it can be restored. I think that
>>>> brings us back to a two-phase approach to restoring the memory
>>>> sharing relationships. Uff.
>>>
>>> I think this part of the discussion somewhat depends on the previous
>>> part about idling. If it is feasible to completely idle and prune,
>>> and fail if that is not happening quickly enough, then maybe there
>>> wouldn't be too much hierarchical state to save.
>>>
>>> Otherwise my idea was that there is a top-level drm_plugin.so which
>>> understands amdgpu fds, i915, syncobj, sync_file, and uses some new
>>> uapi to uniquely identify each, associate with the correct driver,
>>> and then internally dispatches to amdgpu|i915|dmabuf|..._plugin.so.
>>> Building the in memory representation of their relationships. As long
>>> as all objects and their relationships have been recorded I think
>>> everything could then be correctly restored.
>>>
>>> It is possible there is a flaw in my thinking and that something in
>>> the CRIU design would make this impossible? I think it would require
>>> the top-level drm_plugin.so to hold all state in memory until the
>>> whole checkpointing is done, and then verify nothing is incomplete,
>>> failing it all if it was. (For instance, one plugin discovered a
>>> reference to an object which was not discovered by any other plugin,
>>> or things like that.) May need some further tweaks to CRIU common code.
>>>
>>> Maybe I need to better understand how exactly you mean to query the
>>> DRM driver about random anonymous fds. I see it as a problem in the
>>> design, possibly even implementation, but maybe I am missing
>>> something which makes it not so. I mean even with my general idea I
>>> don't know how would one determine which driver to query about a
>>> particular anonymous inode.
>>>
>>>>> I later also realised that I was maybe increasing the scope for you
>>>>> here. :) You did state the focus is ROCm applications, which
>>>>> possibly don't care about dma-resv, fences, syncobjs etc?
>>>>
>>>> That's my focus for now. But I don't want to engineer a solution
>>>> that would preclude your use cases in the future.
>>>>
>>>>
>>>>>
>>>>> But I think the "how to handle dma-bufs" design question is still
>>>>> relevant and interesting. For example I had this thought that
>>>>> perhaps what would be needed is a CRIU plugin hierarchy.
>>>>>
>>>>> Because fundamentally we would be snapshotting a hierarchy of kernel
>>>>> objects belonging to different drivers (kfd, amdgpu, dma-buf, ...).
>>>>> And if one day someone were to try to handle dma fences and drm
>>>>> syncobjs, the argument for a hierarchical design would be even
>>>>> stronger I think.
>>>>>
>>>>> Something like a drm_plugin.so could call sub-plugins (amdgpu,
>>>>> dma-buf, sync file, ...) and internally build the representation of
>>>>> the whole state and the relationships between the objects.
>>>>
>>>> Maybe. I guess a structure similar to libdrm makes sense. I'm not
>>>> sure it's strictly a hierarchy. Maybe more like some common code
>>>> shared by multiple GPU driver plugins. I think the common checkpoint
>>>> state is quite limited and restoring it requires the GPU-specific
>>>> drivers anyway.
>>>>
>>>> Also the idea of building a representation of the whole state
>>>> doesn't work well with the CRIU design, because "the whole state"
>>>> can include multiple processes that restore themselves concurrently
>>>> and only synchronize with each other in a few places in the restore
>>>> process. I feel, if we can work out how to checkpoint and restore
>>>> shared objects between processes, we can do the same for shared
>>>> objects between drivers without imposing a strict hierarchy and some
>>>> omniscient entity that needs to know "the whole state".
>>>
>>> Okay, this continues on the same problem space as above. And you
>>> obviously know how CRIU works much better than me.
>>>
>>>>> With that kind of design there probably would be a need to define
>>>>> some common kernel side api and uapi, so all involved objects can
>>>>> be enumerated with some unique ids etc.
>>>>>
>>>>> Now.. the counter argument.. the more state from different drivers
>>>>> would one want to handle the bigger this project would get. Would
>>>>> it even be feasible is the question, to the point that it may be
>>>>> simpler to just run the workload in a VM via SR-IOV and simply
>>>>> hibernate the whole thin guest. :)
>>>>
>>>> Well, CRIU kind of tries to do that, but with containers instead of
>>>> VMs. ;)
>>>
>>> It would definitely be useful for hardware and drivers without SR-IOV
>>> support so lets hope it is doable. :)
>>>
>>> Regards,
>>>
>>> Tvrtko
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: Proposal to add CRIU support to DRM render nodes
2024-04-01 15:09 ` Tvrtko Ursulin
@ 2024-04-01 16:37 ` Felix Kuehling
2024-04-01 16:56 ` Tvrtko Ursulin
0 siblings, 1 reply; 14+ messages in thread
From: Felix Kuehling @ 2024-04-01 16:37 UTC (permalink / raw)
To: Tvrtko Ursulin, dri-devel, amd-gfx, Dave Airlie, Daniel Vetter, criu
Cc: Errabolu, Ramesh, Christian König
On 2024-04-01 11:09, Tvrtko Ursulin wrote:
>
> On 28/03/2024 20:42, Felix Kuehling wrote:
>>
>> On 2024-03-28 12:03, Tvrtko Ursulin wrote:
>>>
>>> Hi Felix,
>>>
>>> I had one more thought while browsing around the amdgpu CRIU plugin.
>>> It appears it relies on the KFD support being compiled in and
>>> /dev/kfd present, correct? AFAICT at least, it relies on that to
>>> figure out the amdgpu DRM node.
>>>
>>> It would probably be good to consider designing things without that
>>> dependency. So that checkpointing an application which does not use
>>> /dev/kfd is possible. Or if the kernel does not even have the KFD
>>> support compiled in.
>>
>> Yeah, if we want to support graphics apps that don't use KFD, we
>> should definitely do that. Currently we get a lot of topology
>> information from KFD, not even from the /dev/kfd device but from the
>> sysfs nodes exposed by KFD. We'd need to get GPU device info from the
>> render nodes instead. And if KFD is available, we may need to
>> integrate both sources of information.
>>
>>
>>>
>>> It could perhaps mean no more than adding some GPU discovery code
>>> into CRIU. Which should be flexible enough to account for things
>>> like re-assigned minor numbers due to driver reload.
>>
>> Do you mean adding GPU discovery to the core CRIU, or to the plugin?
>> I was thinking this is still part of the plugin.
>
> Yes I agree. I was only thinking about adding some DRM device
> discovery code in a more decoupled fashion from the current plugin,
> for both the reason discussed above (decoupling a bit from reliance on
> kfd sysfs), and also so that if/when a new DRM driver wants to
> implement this, the code could be moved to some common plugin area.
>
> I am not sure how feasible that would be though. The "gpu id" concept
> and its matching in the current kernel code and CRIU plugin - is that
> value tied to the physical GPU instance, or how does it work?
The concept of the GPU ID is that it's stable while the system is up,
even when devices get added and removed dynamically. It was baked into
the API early on, but I don't think we ever fully validated device hot
plug. I think the closest we're getting is with our latest MI GPUs and
dynamic partition mode change.
This also highlights another aspect on those spatially partitioned GPUs.
GPU IDs identify device partitions, not devices. Similarly, each
partition has its own render node, and the KFD topology info in sysfs
points to the render-minor number corresponding to each GPU ID.
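To illustrate the relationship, here is a hedged Python sketch of reading the gpu_id-to-render-minor mapping from the KFD topology in sysfs, as the current plugin's discovery effectively does. It assumes the sysfs layout with per-node gpu_id and properties files; a CPU-only node reports gpu_id 0 and is skipped:

```python
import os

def kfd_gpu_to_render_minor(topo="/sys/class/kfd/kfd/topology/nodes"):
    """Map KFD gpu_ids to DRM render minors from the KFD sysfs topology.

    On spatially partitioned GPUs each topology node is a partition, not
    a whole device, and each partition has its own gpu_id and its own
    render node (the drm_render_minor property).
    """
    mapping = {}
    if not os.path.isdir(topo):
        return mapping  # no KFD support compiled in / not loaded
    for node in os.listdir(topo):
        try:
            with open(os.path.join(topo, node, "gpu_id")) as f:
                gpu_id = int(f.read().strip())
            if gpu_id == 0:
                continue  # CPU node
            with open(os.path.join(topo, node, "properties")) as f:
                for line in f:
                    key, _, value = line.partition(" ")
                    if key == "drm_render_minor":
                        mapping[gpu_id] = int(value)
        except (OSError, ValueError):
            continue  # node went away or malformed entry
    return mapping
```

A KFD-independent variant, as discussed earlier in the thread, would have to derive equivalent per-partition information from the render nodes themselves.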
Regards,
Felix
>
>>> Otherwise I am eagerly awaiting to hear more about the design
>>> specifics around dma-buf handling. And also seeing how to extend to
>>> other DRM related anonymous fds.
>>
>> I've been pretty far under-water lately. I hope I'll find time to
>> work on this more, but it's probably going to be at least a few weeks.
>
> Got it.
>
> Regards,
>
> Tvrtko
>
>>
>> Regards,
>> Felix
>>
>>
>>>
>>> Regards,
>>>
>>> Tvrtko
>>>
>>> On 15/03/2024 18:36, Tvrtko Ursulin wrote:
>>>>
>>>> On 15/03/2024 02:33, Felix Kuehling wrote:
>>>>>
>>>>> On 2024-03-12 5:45, Tvrtko Ursulin wrote:
>>>>>>
>>>>>> On 11/03/2024 14:48, Tvrtko Ursulin wrote:
>>>>>>>
>>>>>>> Hi Felix,
>>>>>>>
>>>>>>> On 06/12/2023 21:23, Felix Kuehling wrote:
>>>>>>>> Executive Summary: We need to add CRIU support to DRM render
>>>>>>>> nodes in order to maintain CRIU support for ROCm applications
>>>>>>>> once they start relying on render nodes for more GPU memory
>>>>>>>> management. In this email I'm providing some background why we
>>>>>>>> are doing this, and outlining some of the problems we need to
>>>>>>>> solve to checkpoint and restore render node state and shared
>>>>>>>> memory (DMABuf) state. I have some thoughts on the API design,
>>>>>>>> leaning on what we did for KFD, but would like to get feedback
>>>>>>>> from the DRI community regarding that API and to what extent
>>>>>>>> there is interest in making that generic.
>>>>>>>>
>>>>>>>> We are working on using DRM render nodes for virtual address
>>>>>>>> mappings in ROCm applications to implement the CUDA11-style VM
>>>>>>>> API and improve interoperability between graphics and compute.
>>>>>>>> This uses DMABufs for sharing buffer objects between KFD and
>>>>>>>> multiple render node devices, as well as between processes. In
>>>>>>>> the long run this also provides a path to moving all or most
>>>>>>>> memory management from the KFD ioctl API to libdrm.
>>>>>>>>
>>>>>>>> Once ROCm user mode starts using render nodes for virtual
>>>>>>>> address management, that creates a problem for checkpointing
>>>>>>>> and restoring ROCm applications with CRIU. Currently there is
>>>>>>>> no support for checkpointing and restoring render node state,
>>>>>>>> other than CPU virtual address mappings. Support will be needed
>>>>>>>> for checkpointing GEM buffer objects and handles, their GPU
>>>>>>>> virtual address mappings and memory sharing relationships
>>>>>>>> between devices and processes.
>>>>>>>>
>>>>>>>> Eventually, if full CRIU support for graphics applications is
>>>>>>>> desired, more state would need to be captured, including
>>>>>>>> scheduler contexts and BO lists. Most of this state is
>>>>>>>> driver-specific.
>>>>>>>>
>>>>>>>> After some internal discussions we decided to take our design
>>>>>>>> process public as this potentially touches DRM GEM and DMABuf
>>>>>>>> APIs and may have implications for other drivers in the future.
>>>>>>>>
>>>>>>>> One basic question before going into any API details: Is there
>>>>>>>> a desire to have CRIU support for other DRM drivers?
>>>>>>>
>>>>>>> This sounds like a very interesting feature overall, although I
>>>>>>> cannot answer the last question here.
>>>>>>
>>>>>> I forgot to finish this thought. I cannot answer / don't know of
>>>>>> any concrete plans, but I think the feature is pretty cool and if
>>>>>> amdgpu gets it working I wouldn't be surprised if other drivers
>>>>>> would get interested.
>>>>>
>>>>> Thanks, that's good to hear!
>>>>>
>>>>>
>>>>>>
>>>>>>> Funnily enough, it has a tiny relation to an i915 feature I
>>>>>>> recently implemented on Mesa's request, which is to be able to
>>>>>>> "upload" the GPU context from the GPU hang error state and
>>>>>>> replay the hanging request. It is kind of (at a stretch) a very
>>>>>>> special tiny subset of checkpoint and restore so I am
>>>>>>> mentioning it merely as a curiosity.
>>>>>>>
>>>>>>> And there is also another partial conceptual intersection with the
>>>>>>> (at the moment not yet upstream) i915 online debugger. This part
>>>>>>> being in the area of discovering and enumerating GPU resources
>>>>>>> belonging to the client.
>>>>>>>
>>>>>>> I don't see any immediate design or code sharing opportunities
>>>>>>> though, but just mentioning it.
>>>>>>>
>>>>>>> I did spend some time reading your plugin and kernel
>>>>>>> implementation out of curiosity and have some comments and
>>>>>>> questions.
>>>>>>>
>>>>>>>> With that out of the way, some considerations for a possible
>>>>>>>> DRM CRIU API (either generic or AMDGPU driver specific): The
>>>>>>>> API goes through several phases during checkpoint and restore:
>>>>>>>>
>>>>>>>> Checkpoint:
>>>>>>>>
>>>>>>>> 1. Process-info (enumerates objects and sizes so user mode can
>>>>>>>> allocate
>>>>>>>> memory for the checkpoint, stops execution on the GPU)
>>>>>>>> 2. Checkpoint (store object metadata for BOs, queues, etc.)
>>>>>>>> 3. Unpause (resumes execution after the checkpoint is complete)
>>>>>>>>
>>>>>>>> Restore:
>>>>>>>>
>>>>>>>> 1. Restore (restore objects, VMAs are not in the right place
>>>>>>>> at this time)
>>>>>>>> 2. Resume (final fixups after the VMAs are sorted out, resume
>>>>>>>> execution)
>>>>>>>
>>>>>>> Btw is check-pointing guaranteeing all relevant activity is
>>>>>>> idled? For instance, are dma_resv objects free of fences which
>>>>>>> would need to be restored for things to continue executing
>>>>>>> sensibly? Or how is that handled?
>>>>>
>>>>> In our compute use cases, we suspend user mode queues. This can
>>>>> include CWSR (compute-wave-save-restore) where the state of
>>>>> in-flight waves is stored in memory and can be reloaded and
>>>>> resumed from memory later. We don't use any fences other than
>>>>> "eviction fences", that are signaled after the queues are
>>>>> suspended. And those fences are never handed to user mode. So we
>>>>> don't need to worry about any fence state in the checkpoint.
>>>>>
>>>>> If we extended this to support the kernel mode command submission
>>>>> APIs, I would expect that we'd wait for all current submissions to
>>>>> complete, and stop new ones from being sent to the HW before
>>>>> taking the checkpoint. When we take the checkpoint in the CRIU
>>>>> plugin, the CPU threads are already frozen and cannot submit any
>>>>> more work. If we wait for all currently pending submissions to
>>>>> drain, I think we don't need to save any fence state because all
>>>>> the fences will have signaled. (I may be missing some intricacies
>>>>> and I'm afraid it may not be that simple in reality, but that's my
>>>>> opening bid. ;)
>>>>
>>>> It feels feasible to me too, for the normally behaving clients at
>>>> least.
>>>>
>>>> Presumably, given that the whole checkpointing is not instant, it
>>>> would be okay to wait a second or two longer for the in-progress
>>>> submissions to complete. After that, the kernel would need to prune
>>>> all signalled fences from the respective container objects before
>>>> checkpointing.
>>>>
>>>> For the "misbehaving" clients who have perhaps queued up too much
>>>> work, either still in the scheduler with unsatisfied dependencies,
>>>> or already submitted to the hardware and/or driver backend, is
>>>> there a timeout concept in CRIU so it would be possible to say
>>>> something like "try to checkpoint, but if the kernel does not say
>>>> yes within time period t then give up"?
>>>>
>>>>>>>> For some more background about our implementation in KFD, you
>>>>>>>> can refer to this whitepaper:
>>>>>>>> https://github.com/checkpoint-restore/criu/blob/criu-dev/plugins/amdgpu/README.md
>>>>>>>>
>>>>>>>> Potential objections to a KFD-style CRIU API in DRM render
>>>>>>>> nodes, I'll address each of them in more detail below:
>>>>>>>>
>>>>>>>> * Opaque information in the checkpoint data that user mode can't
>>>>>>>> interpret or do anything with
>>>>>>>> * A second API for creating objects (e.g. BOs) that is
>>>>>>>> separate from
>>>>>>>> the regular BO creation API
>>>>>>>> * Kernel mode would need to be involved in restoring BO sharing
>>>>>>>> relationships rather than replaying BO creation, export and
>>>>>>>> import
>>>>>>>> from user mode
>>>>>>>>
>>>>>>>> # Opaque information in the checkpoint
>>>>>>>>
>>>>>>>> This comes out of ABI compatibility considerations. Adding any
>>>>>>>> new objects or attributes to the driver/HW state that needs to
>>>>>>>> be checkpointed could potentially break the ABI of the CRIU
>>>>>>>> checkpoint/restore ioctl if the plugin needs to parse that
>>>>>>>> information. Therefore, much of the information in our KFD CRIU
>>>>>>>> ioctl API is opaque. It is written by kernel mode in the
>>>>>>>> checkpoint, it is consumed by kernel mode when restoring the
>>>>>>>> checkpoint, but user mode doesn't care about the contents or
>>>>>>>> binary layout, so there is no user mode ABI to break. This is
>>>>>>>> how we were able to maintain CRIU support when we added the SVM
>>>>>>>> API to KFD without changing the CRIU plugin and without
>>>>>>>> breaking our ABI.
>>>>>>>>
>>>>>>>> Opaque information may also lend itself to API abstraction, if
>>>>>>>> this becomes a generic DRM API with driver-specific callbacks
>>>>>>>> that fill in HW-specific opaque data.
>>>>>>>
>>>>>>> This feels sound in principle to me. Fundamentally the state is
>>>>>>> very hardware specific, and/or driver version specific, so I
>>>>>>> don't see what could be gained in practice by making it much
>>>>>>> less opaque. (Apart from making things more complicated.)
>>>>>>>
>>>>>>> I was however unsure of the current split of how you dump buffer
>>>>>>> objects with some data in the defined bo structure, and some in
>>>>>>> completely opaque private data. Is there a benefit to that
>>>>>>> split, or maybe in other words, is there a benefit in having
>>>>>>> part transparent and part opaque for buffer objects?
>>>>>
>>>>> Some of the buffer object state is needed by the plugin. E.g. the
>>>>> size and mmap offset are needed to match VMAs with BOs. I'd have
>>>>> to review the plugin in detail to prove that all the fields are,
>>>>> in fact, needed by the plugin, but that was the intent. Anything
>>>>> that the plugin doesn't need to know should be in the opaque data
>>>>> structures.
>>>>
>>>> Right, got it.
>>>>
>>>> Would it make sense to put the opaque data in the same block as
>>>> the defined one? I mean instead of separating the two in the binary
>>>> image for instance have struct kfd_criu_bo_bucket have a trailing
>>>> priv_data blob? Maybe it is too late now if the image format is not
>>>> versioned or something.
>>>>
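A minimal sketch of that trailing-blob idea (all names hypothetical; the real struct kfd_criu_bo_bucket layout differs): the transparent fields the plugin needs sit up front, followed by a size-prefixed opaque blob that only the kernel interprets:

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical image record: a transparent header holding only what
 * the plugin itself needs (size and mmap offset for VMA matching),
 * followed directly by the opaque, kernel-only private data. */
struct criu_bo_record {
    uint64_t size;
    uint64_t mmap_offset;
    uint32_t priv_size;   /* length of the opaque blob below */
    uint8_t  priv_data[]; /* written and consumed by the kernel only */
};

static struct criu_bo_record *bo_record_pack(uint64_t size,
                                             uint64_t mmap_offset,
                                             const void *priv,
                                             uint32_t priv_size)
{
    struct criu_bo_record *r = malloc(sizeof(*r) + priv_size);

    if (!r)
        return NULL;
    r->size = size;
    r->mmap_offset = mmap_offset;
    r->priv_size = priv_size;
    memcpy(r->priv_data, priv, priv_size);
    return r;
}
```

With a layout like this the image format stays self-describing per record, since the opaque part carries its own length.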
>>>>>>> To slightly touch upon the question of whether this could become
>>>>>>> a generic DRM API. It feels it would be hard to do it from the
>>>>>>> start. What sounds more feasible is if/when generic looking
>>>>>>> helpers can be spotted while developing the RFC then potentially
>>>>>>> structure the code so they can easily be promoted to shared/common
>>>>>>> at some future moment.
>>>>>
>>>>> Yes, that's how this usually goes, in my experience. Thanks for
>>>>> confirming.
>>>>>
>>>>>
>>>>>>>
>>>>>>>> # Second API for creating objects
>>>>>>>>
>>>>>>>> Creating BOs and other objects when restoring a checkpoint
>>>>>>>> needs more information than the usual BO alloc and similar APIs
>>>>>>>> provide. For example, we need to restore BOs with the same GEM
>>>>>>>> handles so that user mode can continue using those handles
>>>>>>>> after resuming execution. If BOs are shared through DMABufs
>>>>>>>> without dynamic attachment, we need to restore pinned BOs as
>>>>>>>> pinned. Validation of virtual addresses and handling MMU
>>>>>>>> notifiers must be suspended until the virtual address space is
>>>>>>>> restored. For user mode queues we need to save and restore a
>>>>>>>> lot of queue execution state so that execution can resume cleanly.
>>>>>>>
>>>>>>> This also sounds justified to me. Restoring all the internal
>>>>>>> objects is definitely special, and it sounds better to add uapi
>>>>>>> to create them directly with the correct properties than to add
>>>>>>> uapi to adjust internal properties after creation. In either
>>>>>>> case you would need some new uapi - at least to adjust things
>>>>>>> after creation - at which point you may as well have both in
>>>>>>> one. Internally the implementation can be split or common,
>>>>>>> whatever makes sense for a given object type, but new uapi
>>>>>>> definitely sounds required.
>>>>>>>> # Restoring buffer sharing relationships
>>>>>>>>
>>>>>>>> Different GEM handles in different render nodes and processes
>>>>>>>> can refer to the same underlying shared memory, either by
>>>>>>>> directly pointing to the same GEM object, or by creating an
>>>>>>>> import attachment that may get its SG tables invalidated and
>>>>>>>> updated dynamically through dynamic attachment callbacks. In
>>>>>>>> the latter case it's obvious who is the exporter and who is
>>>>>>>> the importer. In the first case, either one could be the
>>>>>>>> exporter, and it's not clear who would need to create the BO
>>>>>>>> and who would need to
>>>>>>>
>>>>>>> To see if I follow the former case correctly.
>>>>>>>
>>>>>>> This could be two clients A and B, where B has imported a
>>>>>>> dma-buf shared BO from A and has since closed the dma-buf fd?
>>>>>>> Which results in a single BO with reference count of 2 and
>>>>>>> obj->import_attach unset. History of who created the object is
>>>>>>> lost.
>>>>>
>>>>> Yes. In the amdgpu driver this happens when the exporter and
>>>>> importer device are the same.
>>>>>
>>>>>
>>>>>>>
>>>>>>> In fact it could even be that two imported objects remain
>>>>>>> (clients A, B and C) and A, who originally created the BO, has
>>>>>>> since fully released it. So any kind of "creator" tracking if
>>>>>>> added wouldn't be fully reliable either.
>>>>>
>>>>> That's a good point.
>>>>>
>>>>>
>>>>>>>
>>>>>>>> import it when restoring the checkpoint. To further complicate
>>>>>>>> things, multiple processes in a checkpoint get restored
>>>>>>>> concurrently. So there is no guarantee that an exporter has
>>>>>>>> restored a shared BO at the time an importer is trying to
>>>>>>>> restore its import.
>>>>>>>>
>>>>>>>> A proposal to deal with these problems would be to treat
>>>>>>>> importers and exporters the same. Whoever restores first, ends
>>>>>>>> up creating the BO and potentially attaching to it. The other
>>>>>>>> process(es) can find BOs that were already restored by another
>>>>>>>> process by looking it up with a unique ID that could be based
>>>>>>>> on the DMABuf inode number. An alternative would be a two-pass
>>>>>>>> approach that restores BOs in two passes:
>>>>>>>>
>>>>>>>> 1. Restore exported BOs
>>>>>>>> 2. Restore imports
>>>>>>>>
>>>>>>>> With some inter-process synchronization in CRIU itself between
>>>>>>>> these two passes. This may require changes in the core CRIU,
>>>>>>>> outside our plugin. Both approaches depend on identifying BOs
>>>>>>>> with some unique ID that could be based on the DMABuf inode
>>>>>>>> number in the checkpoint. However, we would need to identify
>>>>>>>> the processes in the same restore session, possibly based on
>>>>>>>> parent/child process relationships, to create a scope where
>>>>>>>> those IDs are valid during restore.
>>>>>>>
>>>>>>> If my understanding above is on the right track, then I think
>>>>>>> this is the only thing which can be done (for all scenarios).
>>>>>
>>>>> I presented two alternatives. I think you're in favor of the first
>>>>> one, where it doesn't matter who is the importer and exporter. I
>>>>> think the two-pass approach requires that you can identify an
>>>>> exporter. And as you pointed out, the exporter may already have
>>>>> dropped their reference to the BO.
>>>>
>>>> Yep.
>>>>
>>>>>>> I also *think* it would be safe. At least at the moment I cannot
>>>>>>> think what could go wrong. Semantics are that it doesn't really
>>>>>>> matter who created the object.
>>>>>
>>>>> I would agree. What matters is that the object is recreated on the
>>>>> correct device, and that all the direct references and import
>>>>> attachments pointing to it are restored.
>>>>>
>>>>>
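The "whoever restores first creates it" approach could be sketched as a lookup-or-create table keyed by the unique ID, here the dmabuf inode number (all names hypothetical; a real implementation would need cross-process visibility and locking for this table):

```c
#include <stdbool.h>
#include <stdint.h>

#define MAX_SHARED_BOS 64

/* Hypothetical per-restore-session registry of shared BOs, keyed by
 * the dmabuf inode number recorded in the checkpoint.  In reality
 * this state would have to be shared (and locked) across the
 * concurrently restoring processes. */
struct shared_bo {
    uint64_t inode;
    int handle;
};

static struct shared_bo bo_table[MAX_SHARED_BOS];
static int bo_table_len;
static int next_handle = 1;

/* Whoever asks first for a given inode "creates" the BO; everyone
 * else, importer or exporter alike, finds the already-restored one. */
static int restore_shared_bo(uint64_t inode, bool *created)
{
    int i;

    for (i = 0; i < bo_table_len; i++) {
        if (bo_table[i].inode == inode) {
            *created = false;
            return bo_table[i].handle;
        }
    }
    bo_table[bo_table_len].inode = inode;
    bo_table[bo_table_len].handle = next_handle++;
    *created = true;
    return bo_table[bo_table_len++].handle;
}
```

The point of the sketch is only the semantics: the first caller creates, later callers attach, and nobody needs to know who originally exported the buffer.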
>>>>>>>
>>>>>>>> Finally, we would also need to checkpoint and restore DMABuf
>>>>>>>> file descriptors themselves. These are anonymous file
>>>>>>>> descriptors. The CRIU plugin could probably be taught to
>>>>>>>> recreate them from the original exported BO based on the inode
>>>>>>>> number that could be queried with fstat in the checkpoint. It
>>>>>>>> would need help from the render node CRIU API to find the right
>>>>>>>> BO from the inode, which may be from a different process in the
>>>>>>>> same restore session.
>>>>>>>
>>>>>>> This part feels like it is breaking the component separation a
>>>>>>> bit because even for buffers fully owned by amdgpu, strictly
>>>>>>> speaking the dma-buf fd is not. At least my understanding from
>>>>>>> the above is that you propose to attempt to import the fd, from
>>>>>>> the kernel side, during the checkpoint process? Like:
>>>>>>>
>>>>>>> Checkpoint:
>>>>>>>
>>>>>>> CRIU for each anon fd:
>>>>>>> amdgpu_plugin(fd)
>>>>>>> -> attempt in kernel dma buf import (passes fd to kernel
>>>>>>> via ioctl?)
>>>>>>> -> is it ours? (no -> error)
>>>>>>> -> create a record mapping fd number to amdgpu BO
>>>>>>>
>>>>>>> Restore:
>>>>>>>
>>>>>>> for each dma-buf fd record:
>>>>>>> create BO if does not exists
>>>>>>> export BO to same fd
>>>>>>> close BO handle if not in regular BO handle records
>>>>>>>
>>>>>>> Or since you mention lookup by inode, that would need to be
>>>>>>> dmabuf_plugin so it can look up inodes in the private mount
>>>>>>> space. However, how it would co-operate with amdgpu_plugin is
>>>>>>> not clear to me.
>>>>>
>>>>> The way I think about the ownership is, whichever driver created
>>>>> the underlying BO owns the checkpointing of the dmabuf. You need
>>>>> driver-specific information to link the dmabuf with the driver's
>>>>> BO and you need the right driver to recreate the BO and the dmabuf
>>>>> fd when restoring the checkpoint.
>>>>>
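Querying the inode that would serve as the unique ID is straightforward with fstat; a hypothetical plugin helper might look like this (shown generically on any fd; for a dmabuf fd, st_ino is the anonymous inode number discussed above):

```c
#include <stdint.h>
#include <sys/stat.h>
#include <unistd.h>

/* Return the inode number backing an open fd, or 0 on error.  For a
 * dmabuf fd this st_ino value is the anonymous inode that a
 * checkpoint could use as a unique, cross-process BO identifier. */
static uint64_t fd_inode(int fd)
{
    struct stat st;

    if (fstat(fd, &st) != 0)
        return 0;
    return (uint64_t)st.st_ino;
}
```

Two fds referring to the same underlying file report the same inode, which is exactly the property a cross-process BO identifier needs.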
>>>>> It gets really interesting if you have an amdgpu plugin and an
>>>>> i915 plugin, and they checkpoint an application that shares BOs
>>>>> between the two devices through DMABufs. E.g. if i915 created a BO
>>>>> and amdgpu imported it, then during restore, i915 needs to restore
>>>>> the dmabuf before the amdgpu import of it can be restored. I think
>>>>> that brings us back to a two-phase approach to restoring the
>>>>> memory sharing relationships. Uff.
>>>>
>>>> I think this part of the discussion somewhat depends on the
>>>> previous part about idling. If it is feasible to completely idle
>>>> and prune, and fail if that is not happening quickly enough, then
>>>> maybe there wouldn't be too much hierarchical state to save.
>>>>
>>>> Otherwise my idea was that there is a top-level drm_plugin.so which
>>>> understands amdgpu fds, i915, syncobj, sync_file, and uses some new
>>>> uapi to uniquely identify each, associate with the correct driver,
>>>> and then internally dispatches to amdgpu|i915|dmabuf|..._plugin.so,
>>>> building the in-memory representation of their relationships. As
>>>> long as all objects and their relationships have been recorded I
>>>> think everything could then be correctly restored.
>>>>
>>>> It is possible there is a flaw in my thinking and that something in
>>>> the CRIU design would make this impossible? I think it would require
>>>> the top-level drm_plugin.so to hold all state in memory until the
>>>> whole checkpointing is done, and then verify that nothing is
>>>> incomplete, failing it all if something is. (For instance one plugin
>>>> discovered a reference to an object which was not discovered by any
>>>> other plugin or things like that.) May need some further tweaks to
>>>> CRIU common code.
>>>>
>>>> Maybe I need to better understand how exactly you mean to query the
>>>> DRM driver about random anonymous fds. I see it as a problem in the
>>>> design, possibly even implementation, but maybe I am missing
>>>> something which makes it not so. I mean even with my general idea I
>>>> don't know how would one determine which driver to query about a
>>>> particular anonymous inode.
>>>>
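One possible (unverified, kernel-version dependent) way for a top-level plugin to classify an anonymous fd before dispatching to a sub-plugin is to read its /proc/self/fd link target, which names the backing file type; for instance "dmabuf" appears in the target for dmabuf fds on typical kernels, while a pipe shows as "pipe:[...]":

```c
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Resolve the /proc/self/fd/N link target into buf, e.g. "pipe:[1234]"
 * for a pipe, or (kernel-version dependent) a target naming "dmabuf"
 * for a dmabuf fd.  Returns 0 on success. */
static int classify_fd(int fd, char *buf, size_t len)
{
    char path[64];
    ssize_t n;

    snprintf(path, sizeof(path), "/proc/self/fd/%d", fd);
    n = readlink(path, buf, len - 1);
    if (n < 0)
        return -1;
    buf[n] = '\0';
    return 0;
}
```

This only answers "what kind of fd is this", not "which driver created the BO behind it"; the latter would still need the per-driver lookup-by-inode discussed above.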
>>>>>> I later also realised that I was maybe increasing the scope for
>>>>>> you here. :) You did state the focus is ROCm applications, which
>>>>>> possibly don't care about dma-resv, fences, syncobjs etc?
>>>>>
>>>>> That's my focus for now. But I don't want to engineer a solution
>>>>> that would preclude your use cases in the future.
>>>>>
>>>>>
>>>>>>
>>>>>> But I think the "how to handle dma-bufs" design question is still
>>>>>> relevant and interesting. For example I had this thought that
>>>>>> perhaps what would be needed is a CRIU plugin hierarchy.
>>>>>>
>>>>>> Because fundamentally we would be snapshotting a hierarchy of
>>>>>> kernel objects belonging to different drivers (kfd, amdgpu,
>>>>>> dma-buf, ...). And if one day someone were to try to handle dma
>>>>>> fences and drm syncobjs, the argument for a hierarchical design
>>>>>> would be even stronger I think.
>>>>>>
>>>>>> Something like a drm_plugin.so could call sub-plugins (amdgpu,
>>>>>> dma-buf, sync file, ...) and internally build the representation
>>>>>> of the whole state and the relationships between the objects.
>>>>>
>>>>> Maybe. I guess a structure similar to libdrm makes sense. I'm not
>>>>> sure it's strictly a hierarchy. Maybe more like some common code
>>>>> shared by multiple GPU driver plugins. I think the common
>>>>> checkpoint state is quite limited and restoring it requires the
>>>>> GPU-specific drivers anyway.
>>>>>
>>>>> Also the idea of building a representation of the whole state
>>>>> doesn't work well with the CRIU design, because "the whole state"
>>>>> can include multiple processes that restore themselves
>>>>> concurrently and only synchronize with each other in a few places
>>>>> in the restore process. I feel, if we can work out how to
>>>>> checkpoint and restore shared objects between processes, we can do
>>>>> the same for shared objects between drivers without imposing a
>>>>> strict hierarchy and some omniscient entity that needs to know
>>>>> "the whole state".
>>>>
>>>> Okay, this continues on the same problem space as above. And you
>>>> obviously know how CRIU works much better than me.
>>>>
>>>>>> With that kind of design there probably would be a need to define
>>>>>> some common kernel side api and uapi, so all involved objects can
>>>>>> be enumerated with some unique ids etc.
>>>>>>
>>>>>> Now.. the counter argument.. the more state from different
>>>>>> drivers would one want to handle the bigger this project would
>>>>>> get. The question is whether it would even be feasible, to the
>>>>>> point that it may be simpler to just run the workload in a VM via
>>>>>> SR-IOV and simply hibernate the whole guest. :)
>>>>>
>>>>> Well, CRIU kind of tries to do that, but with containers instead
>>>>> of VMs. ;)
>>>>
>>>> It would definitely be useful for hardware and drivers without
>>>> SR-IOV support so let's hope it is doable. :)
>>>>
>>>> Regards,
>>>>
>>>> Tvrtko
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: Proposal to add CRIU support to DRM render nodes
2024-04-01 16:37 ` Felix Kuehling
@ 2024-04-01 16:56 ` Tvrtko Ursulin
2024-04-01 17:58 ` Felix Kuehling
0 siblings, 1 reply; 14+ messages in thread
From: Tvrtko Ursulin @ 2024-04-01 16:56 UTC (permalink / raw)
To: Felix Kuehling, dri-devel, amd-gfx, Dave Airlie, Daniel Vetter, criu
Cc: Errabolu, Ramesh, Christian König
On 01/04/2024 17:37, Felix Kuehling wrote:
> On 2024-04-01 11:09, Tvrtko Ursulin wrote:
>>
>> On 28/03/2024 20:42, Felix Kuehling wrote:
>>>
>>> On 2024-03-28 12:03, Tvrtko Ursulin wrote:
>>>>
>>>> Hi Felix,
>>>>
>>>> I had one more thought while browsing around the amdgpu CRIU plugin.
>>>> It appears it relies on the KFD support being compiled in and
>>>> /dev/kfd present, correct? AFAICT at least, it relies on that to
>>>> figure out the amdgpu DRM node.
>>>>
>>>> It would probably be good to consider designing things without that
>>>> dependency. So that checkpointing an application which does not use
>>>> /dev/kfd is possible. Or if the kernel does not even have the KFD
>>>> support compiled in.
>>>
>>> Yeah, if we want to support graphics apps that don't use KFD, we
>>> should definitely do that. Currently we get a lot of topology
>>> information from KFD, not even from the /dev/kfd device but from the
>>> sysfs nodes exposed by KFD. We'd need to get GPU device info from the
>>> render nodes instead. And if KFD is available, we may need to
>>> integrate both sources of information.
>>>
>>>
>>>>
>>>> It could perhaps mean no more than adding some GPU discovery code
>>>> into CRIU, which should be flexible enough to account for things
>>>> like re-assigned minor numbers due to driver reload.
>>>
>>> Do you mean adding GPU discovery to the core CRIU, or to the plugin?
>>> I was thinking this is still part of the plugin.
>>
>> Yes I agree. I was only thinking about adding some DRM device
>> discovery code in a more decoupled fashion from the current plugin,
>> for both the reason discussed above (decoupling a bit from reliance on
>> kfd sysfs), and then also if/when a new DRM driver might want to
>> implement this the code could be move to some common plugin area.
>>
>> I am not sure how feasible that would be though. The "gpu id" concept
>> and its matching in the current kernel code and CRIU plugin - is that
>> value tied to the physical GPU instance, or how does it work?
>
> The concept of the GPU ID is that it's stable while the system is up,
> even when devices get added and removed dynamically. It was baked into
> the API early on, but I don't think we ever fully validated device hot
> plug. I think the closest we're getting is with our latest MI GPUs and
> dynamic partition mode change.
Doesn't it read the saved gpu id from the image file while doing restore
and try to open the render node to match it? Maybe I am misreading the
code. But if it does, does it imply that in practice it could be stable
across reboots? Or that it is not possible to restore to a different
instance of maybe the same GPU model installed in a system?
> This also highlights another aspect on those spatially partitioned GPUs.
> GPU IDs identify device partitions, not devices. Similarly, each
> partition has its own render node, and the KFD topology info in sysfs
> points to the render-minor number corresponding to each GPU ID.
I am not familiar with this. This is not SR-IOV but some other kind of
partitioning? Would you have any links where I could read more?
Regards,
Tvrtko
>>>> Otherwise I am eagerly awaiting to hear more about the design
>>>> specifics around dma-buf handling. And also seeing how to extend to
>>>> other DRM related anonymous fds.
>>>
>>> I've been pretty far under-water lately. I hope I'll find time to
>>> work on this more, but it's probably going to be at least a few weeks.
>>
>> Got it.
>>
>> Regards,
>>
>> Tvrtko
>>
>>>
>>> Regards,
>>> Felix
>>>
>>>
>>>>
>>>> Regards,
>>>>
>>>> Tvrtko
>>>>
>>>> On 15/03/2024 18:36, Tvrtko Ursulin wrote:
>>>>>
>>>>> On 15/03/2024 02:33, Felix Kuehling wrote:
>>>>>>
>>>>>> On 2024-03-12 5:45, Tvrtko Ursulin wrote:
>>>>>>>
>>>>>>> On 11/03/2024 14:48, Tvrtko Ursulin wrote:
>>>>>>>>
>>>>>>>> Hi Felix,
>>>>>>>>
>>>>>>>> On 06/12/2023 21:23, Felix Kuehling wrote:
>>>>>>>>> Executive Summary: We need to add CRIU support to DRM render
>>>>>>>>> nodes in order to maintain CRIU support for ROCm application
>>>>>>>>> once they start relying on render nodes for more GPU memory
>>>>>>>>> management. In this email I'm providing some background why we
>>>>>>>>> are doing this, and outlining some of the problems we need to
>>>>>>>>> solve to checkpoint and restore render node state and shared
>>>>>>>>> memory (DMABuf) state. I have some thoughts on the API design,
>>>>>>>>> leaning on what we did for KFD, but would like to get feedback
>>>>>>>>> from the DRI community regarding that API and to what extent
>>>>>>>>> there is interest in making that generic.
>>>>>>>>>
>>>>>>>>> We are working on using DRM render nodes for virtual address
>>>>>>>>> mappings in ROCm applications to implement the CUDA11-style VM
>>>>>>>>> API and improve interoperability between graphics and compute.
>>>>>>>>> This uses DMABufs for sharing buffer objects between KFD and
>>>>>>>>> multiple render node devices, as well as between processes. In
>>>>>>>>> the long run this also provides a path to moving all or most
>>>>>>>>> memory management from the KFD ioctl API to libdrm.
>>>>>>>>>
>>>>>>>>> Once ROCm user mode starts using render nodes for virtual
>>>>>>>>> address management, that creates a problem for checkpointing
>>>>>>>>> and restoring ROCm applications with CRIU. Currently there is
>>>>>>>>> no support for checkpointing and restoring render node state,
>>>>>>>>> other than CPU virtual address mappings. Support will be needed
>>>>>>>>> for checkpointing GEM buffer objects and handles, their GPU
>>>>>>>>> virtual address mappings and memory sharing relationships
>>>>>>>>> between devices and processes.
>>>>>>>>>
>>>>>>>>> Eventually, if full CRIU support for graphics applications is
>>>>>>>>> desired, more state would need to be captured, including
>>>>>>>>> scheduler contexts and BO lists. Most of this state is
>>>>>>>>> driver-specific.
>>>>>>>>>
>>>>>>>>> After some internal discussions we decided to take our design
>>>>>>>>> process public as this potentially touches DRM GEM and DMABuf
>>>>>>>>> APIs and may have implications for other drivers in the future.
>>>>>>>>>
>>>>>>>>> One basic question before going into any API details: Is there
>>>>>>>>> a desire to have CRIU support for other DRM drivers?
>>>>>>>>
>>>>>>>> This sounds like a very interesting feature overall,
>>>>>>>> although I cannot answer the last question here.
>>>>>>>
>>>>>>> I forgot to finish this thought. I cannot answer / don't know of
>>>>>>> any concrete plans, but I think the feature is pretty cool and if
>>>>>>> amdgpu gets it working I wouldn't be surprised if other drivers
>>>>>>> would get interested.
>>>>>>
>>>>>> Thanks, that's good to hear!
>>>>>>
>>>>>>
>>>>>>>
>>>>>>>> Funnily enough, it has a tiny relation to an i915 feature I
>>>>>>>> recently implemented on Mesa's request, which is to be able to
>>>>>>>> "upload" the GPU context from the GPU hang error state and
>>>>>>>> replay the hanging request. It is kind of (at a stretch) a very
>>>>>>>> special tiny subset of checkpoint and restore, so I am
>>>>>>>> mentioning it just as a curiosity.
>>>>>>>>
>>>>>>>> And there is also another partial conceptual intersection with
>>>>>>>> the (at the moment not yet upstream) i915 online debugger. This
>>>>>>>> part being in the area of discovering and enumerating GPU
>>>>>>>> resources belonging to the client.
>>>>>>>>
>>>>>>>> I don't see any immediate design or code sharing opportunities
>>>>>>>> though, but just mentioning it.
>>>>>>>>
>>>>>>>> I did spend some time reading your plugin and kernel
>>>>>>>> implementation out of curiosity and have some comments and
>>>>>>>> questions.
>>>>>>>>
>>>>>>>>> With that out of the way, some considerations for a possible
>>>>>>>>> DRM CRIU API (either generic or AMDGPU driver specific): The
>>>>>>>>> API goes through several phases during checkpoint and restore:
>>>>>>>>>
>>>>>>>>> Checkpoint:
>>>>>>>>>
>>>>>>>>> 1. Process-info (enumerates objects and sizes so user mode can
>>>>>>>>> allocate
>>>>>>>>> memory for the checkpoint, stops execution on the GPU)
>>>>>>>>> 2. Checkpoint (store object metadata for BOs, queues, etc.)
>>>>>>>>> 3. Unpause (resumes execution after the checkpoint is complete)
>>>>>>>>>
>>>>>>>>> Restore:
>>>>>>>>>
>>>>>>>>> 1. Restore (restore objects, VMAs are not in the right place
>>>>>>>>> at this time)
>>>>>>>>> 2. Resume (final fixups after the VMAs are sorted out, resume
>>>>>>>>> execution)
>>>>>>>>
>>>>>>>> Btw is check-pointing guaranteeing all relevant activity is
>>>>>>>> idled? For instance dma_resv objects are free of fences which
>>>>>>>> would need to be restored for things to continue executing
>>>>>>>> sensibly? Or how is that handled?
>>>>>>
>>>>>> In our compute use cases, we suspend user mode queues. This can
>>>>>> include CWSR (compute-wave-save-restore) where the state of
>>>>>> in-flight waves is stored in memory and can be reloaded and
>>>>>> resumed from memory later. We don't use any fences other than
>>>>>> "eviction fences", which are signaled after the queues are
>>>>>> suspended. And those fences are never handed to user mode. So we
>>>>>> don't need to worry about any fence state in the checkpoint.
>>>>>>
>>>>>> If we extended this to support the kernel mode command submission
>>>>>> APIs, I would expect that we'd wait for all current submissions to
>>>>>> complete, and stop new ones from being sent to the HW before
>>>>>> taking the checkpoint. When we take the checkpoint in the CRIU
>>>>>> plugin, the CPU threads are already frozen and cannot submit any
>>>>>> more work. If we wait for all currently pending submissions to
>>>>>> drain, I think we don't need to save any fence state because all
>>>>>> the fences will have signaled. (I may be missing some intricacies
>>>>>> and I'm afraid it may not be that simple in reality, but that's my
>>>>>> opening bid. ;)
>>>>>
>>>>> It feels feasible to me too, for the normally behaving clients at
>>>>> least.
>>>>>
>>>>> Presumably, given that the whole checkpointing is not instant, it
>>>>> would be okay to wait a second or two longer for the in-progress
>>>>> submissions to complete. After that, the kernel would need to prune
>>>>> all signalled fences from the respective container objects before
>>>>> checkpointing.
>>>>>
>>>>> For the "misbehaving" clients who have perhaps queued up too much
>>>>> work, either still in the scheduler with unsatisfied dependencies,
>>>>> or already submitted to the hardware and/or driver backend, is
>>>>> there a timeout concept in CRIU so it would be possible to say
>>>>> something like "try to checkpoint, but if the kernel does not say
>>>>> yes within time period t then give up"?
>>>>>
>>>>>>>>> For some more background about our implementation in KFD, you
>>>>>>>>> can refer to this whitepaper:
>>>>>>>>> https://github.com/checkpoint-restore/criu/blob/criu-dev/plugins/amdgpu/README.md
>>>>>>>>>
>>>>>>>>> Potential objections to a KFD-style CRIU API in DRM render
>>>>>>>>> nodes, I'll address each of them in more detail below:
>>>>>>>>>
>>>>>>>>> * Opaque information in the checkpoint data that user mode can't
>>>>>>>>> interpret or do anything with
>>>>>>>>> * A second API for creating objects (e.g. BOs) that is
>>>>>>>>> separate from
>>>>>>>>> the regular BO creation API
>>>>>>>>> * Kernel mode would need to be involved in restoring BO sharing
>>>>>>>>> relationships rather than replaying BO creation, export and
>>>>>>>>> import
>>>>>>>>> from user mode
>>>>>>>>>
>>>>>>>>> # Opaque information in the checkpoint
>>>>>>>>>
>>>>>>>>> This comes out of ABI compatibility considerations. Adding any
>>>>>>>>> new objects or attributes to the driver/HW state that needs to
>>>>>>>>> be checkpointed could potentially break the ABI of the CRIU
>>>>>>>>> checkpoint/restore ioctl if the plugin needs to parse that
>>>>>>>>> information. Therefore, much of the information in our KFD CRIU
>>>>>>>>> ioctl API is opaque. It is written by kernel mode in the
>>>>>>>>> checkpoint, it is consumed by kernel mode when restoring the
>>>>>>>>> checkpoint, but user mode doesn't care about the contents or
>>>>>>>>> binary layout, so there is no user mode ABI to break. This is
>>>>>>>>> how we were able to maintain CRIU support when we added the SVM
>>>>>>>>> API to KFD without changing the CRIU plugin and without
>>>>>>>>> breaking our ABI.
>>>>>>>>>
>>>>>>>>> Opaque information may also lend itself to API abstraction, if
>>>>>>>>> this becomes a generic DRM API with driver-specific callbacks
>>>>>>>>> that fill in HW-specific opaque data.
>>>>>>>>
>>>>>>>> This feels sound in principle to me. Fundamentally the state is
>>>>>>>> very hardware specific, and/or driver version specific, so I
>>>>>>>> don't see what could be gained in practice by making it much
>>>>>>>> less opaque. (Apart from making things more complicated.)
>>>>>>>>
>>>>>>>> I was however unsure of the current split of how you dump buffer
>>>>>>>> objects with some data in the defined bo structure, and some in
>>>>>>>> completely opaque private data. Is there a benefit to that
>>>>>>>> split, or maybe in other words, is there a benefit in having
>>>>>>>> part transparent and part opaque for buffer objects?
>>>>>>
>>>>>> Some of the buffer object state is needed by the plugin. E.g. the
>>>>>> size and mmap offset are needed to match VMAs with BOs. I'd have
>>>>>> to review the plugin in detail to prove that all the fields are,
>>>>>> in fact, needed by the plugin, but that was the intent. Anything
>>>>>> that the plugin doesn't need to know should be in the opaque data
>>>>>> structures.
>>>>>
>>>>> Right, got it.
>>>>>
>>>>> Would it make sense to put the opaque data in the same block as
>>>>> the defined one? I mean instead of separating the two in the binary
>>>>> image for instance have struct kfd_criu_bo_bucket have a trailing
>>>>> priv_data blob? Maybe it is too late now if the image format is not
>>>>> versioned or something.
>>>>>
>>>>>>>> To slightly touch upon the question of whether this could become
>>>>>>>> a generic DRM API. It feels it would be hard to do it from the
>>>>>>>> start. What sounds more feasible is if/when generic looking
>>>>>>>> helpers can be spotted while developing the RFC then potentially
>>>>>>>> structure the code so they can easily be promoted to shared/common
>>>>>>>> at some future moment.
>>>>>>
>>>>>> Yes, that's how this usually goes, in my experience. Thanks for
>>>>>> confirming.
>>>>>>
>>>>>>
>>>>>>>>
>>>>>>>>> # Second API for creating objects
>>>>>>>>>
>>>>>>>>> Creating BOs and other objects when restoring a checkpoint
>>>>>>>>> needs more information than the usual BO alloc and similar APIs
>>>>>>>>> provide. For example, we need to restore BOs with the same GEM
>>>>>>>>> handles so that user mode can continue using those handles
>>>>>>>>> after resuming execution. If BOs are shared through DMABufs
>>>>>>>>> without dynamic attachment, we need to restore pinned BOs as
>>>>>>>>> pinned. Validation of virtual addresses and handling MMU
>>>>>>>>> notifiers must be suspended until the virtual address space is
>>>>>>>>> restored. For user mode queues we need to save and restore a
>>>>>>>>> lot of queue execution state so that execution can resume cleanly.
>>>>>>>>
>>>>>>>> This also sounds justified to me. Restoring all internal
>>>>>>>> objects is definitely special, and it sounds better to add uapi
>>>>>>>> to create them directly with the correct properties than to add
>>>>>>>> uapi to adjust internal properties after creation. In either
>>>>>>>> case you would need some new uapi - at the very least to adjust
>>>>>>>> properties after creation - at which point you may have both in
>>>>>>>> one. Internally the implementation can be split or common,
>>>>>>>> whatever makes sense for a given object type, but new uapi
>>>>>>>> definitely sounds required.
>>>>>>>>> # Restoring buffer sharing relationships
>>>>>>>>>
>>>>>>>>> Different GEM handles in different render nodes and processes
>>>>>>>>> can refer to the same underlying shared memory, either by
>>>>>>>>> directly pointing to the same GEM object, or by creating an
>>>>>>>>> import attachment that may get its SG tables invalidated and
>>>>>>>>> updated dynamically through dynamic attachment callbacks. In
>>>>>>>>> the latter case it's obvious who is the exporter and who is
>>>>>>>>> the importer. In the first case, either one could be the
>>>>>>>>> exporter, and it's not clear who would need to create the BO
>>>>>>>>> and who would need to
>>>>>>>>
>>>>>>>> To see if I follow the former case correctly.
>>>>>>>>
>>>>>>>> This could be two clients A and B, where B has imported a
>>>>>>>> dma-buf shared BO from A and has since closed the dma-buf fd?
>>>>>>>> Which results in a single BO with reference count of 2 and
>>>>>>>> obj->import_attach unset. History of who created the object is
>>>>>>>> lost.
>>>>>>
>>>>>> Yes. In the amdgpu driver this happens when the exporting and
>>>>>> importing device are the same.
>>>>>>
>>>>>>
>>>>>>>>
>>>>>>>> In fact it could even be that two imported objects remain
>>>>>>>> (clients A, B and C) and A, who originally created the BO, has
>>>>>>>> since fully released it. So any kind of "creator" tracking if
>>>>>>>> added wouldn't be fully reliable either.
>>>>>>
>>>>>> That's a good point.
>>>>>>
>>>>>>
>>>>>>>>
>>>>>>>>> import it when restoring the checkpoint. To further complicate
>>>>>>>>> things, multiple processes in a checkpoint get restored
>>>>>>>>> concurrently. So there is no guarantee that an exporter has
>>>>>>>>> restored a shared BO at the time an importer is trying to
>>>>>>>>> restore its import.
>>>>>>>>>
>>>>>>>>> A proposal to deal with these problems would be to treat
>>>>>>>>> importers and exporters the same. Whoever restores first, ends
>>>>>>>>> up creating the BO and potentially attaching to it. The other
>>>>>>>>> process(es) can find BOs that were already restored by another
>>>>>>>>> process by looking it up with a unique ID that could be based
>>>>>>>>> on the DMABuf inode number. An alternative would be a two-pass
>>>>>>>>> approach that restores BOs in two passes:
>>>>>>>>>
>>>>>>>>> 1. Restore exported BOs
>>>>>>>>> 2. Restore imports
>>>>>>>>>
>>>>>>>>> With some inter-process synchronization in CRIU itself between
>>>>>>>>> these two passes. This may require changes in the core CRIU,
>>>>>>>>> outside our plugin. Both approaches depend on identifying BOs
>>>>>>>>> with some unique ID that could be based on the DMABuf inode
>>>>>>>>> number in the checkpoint. However, we would need to identify
>>>>>>>>> the processes in the same restore session, possibly based on
>>>>>>>>> parent/child process relationships, to create a scope where
>>>>>>>>> those IDs are valid during restore.
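The "whoever restores first ends up creating the BO" idea can be sketched roughly as below. All names here are made up for illustration and none of this is proposed uapi; the registry stands in for whatever per-restore-session table would map the DMABuf-inode-based unique IDs from the checkpoint to already-restored BOs.

```c
/* Hypothetical sketch of the "whoever restores first ends up creating
 * the BO" approach. The registry stands in for a per-restore-session
 * table mapping the DMABuf-inode-based unique IDs to restored BOs.
 * All names are illustrative; this is not proposed uapi. */
#include <assert.h>
#include <stddef.h>

#define MAX_SHARED_BOS 64

struct shared_bo_entry {
	unsigned long unique_id; /* e.g. derived from the DMABuf inode */
	int bo_handle;           /* handle of the BO restored for this ID */
};

static struct shared_bo_entry registry[MAX_SHARED_BOS];
static size_t registry_len;

/* If another process already restored this shared BO, reuse it;
 * otherwise create it via the supplied callback and record it. */
static int restore_shared_bo(unsigned long unique_id, int (*create_bo)(void))
{
	int handle;
	size_t i;

	for (i = 0; i < registry_len; i++)
		if (registry[i].unique_id == unique_id)
			return registry[i].bo_handle; /* first restorer won */

	if (registry_len >= MAX_SHARED_BOS)
		return -1;

	handle = create_bo();
	registry[registry_len].unique_id = unique_id;
	registry[registry_len].bo_handle = handle;
	registry_len++;
	return handle;
}

/* Stand-in for the driver's BO creation, counting how often it runs. */
static int creations;
static int fake_create_bo(void) { return 100 + creations++; }
```

As the text notes, the important part left out of this sketch is scoping: the IDs are only meaningful within one restore session, so the registry would have to be bounded by the set of processes restored together.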
>>>>>>>>
>>>>>>>> If my understanding above is on the right track, then I think
>>>>>>>> this is the only thing which can be done (for all scenarios).
>>>>>>
>>>>>> I presented two alternatives. I think you're in favor of the first
>>>>>> one, where it doesn't matter who is the importer and exporter. I
>>>>>> think the two-pass approach requires that you can identify an
>>>>>> exporter. And as you pointed out, the exporter may already have
>>>>>> dropped their reference to the BO.
>>>>>
>>>>> Yep.
>>>>>
>>>>>>>> I also *think* it would be safe. At least at the moment I cannot
>>>>>>>> think what could go wrong. Semantics are that it doesn't really
>>>>>>>> matter who created the object.
>>>>>>
>>>>>> I would agree. What matters is that the object is recreated on the
>>>>>> correct device, and that all the direct references and import
>>>>>> attachments pointing to it are restored.
>>>>>>
>>>>>>
>>>>>>>>
>>>>>>>>> Finally, we would also need to checkpoint and restore DMABuf
>>>>>>>>> file descriptors themselves. These are anonymous file
>>>>>>>>> descriptors. The CRIU plugin could probably be taught to
>>>>>>>>> recreate them from the original exported BO based on the inode
>>>>>>>>> number that could be queried with fstat in the checkpoint. It
>>>>>>>>> would need help from the render node CRIU API to find the right
>>>>>>>>> BO from the inode, which may be from a different process in the
>>>>>>>>> same restore session.
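For reference, the inode query mentioned here is just an fstat() on the file descriptor; st_ino is the value that would serve as the unique ID. A minimal sketch, assuming only that the checkpoint code holds an open fd (the demo opens /dev/null purely because a real dma-buf fd requires a DRM device):

```c
/* Minimal sketch: the inode number that could serve as the unique ID
 * for a dma-buf fd is available via plain fstat(). On a real dma-buf
 * fd, st_ino is the inode on the dmabuf pseudo-filesystem. The extra
 * headers are only needed by the demo below. */
#include <assert.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

static unsigned long dmabuf_unique_id(int fd)
{
	struct stat st;

	if (fstat(fd, &st) != 0)
		return 0; /* 0 = lookup failed; real code would report errno */
	return (unsigned long)st.st_ino;
}
```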
>>>>>>>>
>>>>>>>> This part feels like it is breaking the component separation a
>>>>>>>> bit because even for buffers fully owned by amdgpu, strictly
>>>>>>>> speaking the dma-buf fd is not. At least my understanding from
>>>>>>>> the above is that you propose to attempt to import the fd, from
>>>>>>>> the kernel side, during the checkpoint process? Like:
>>>>>>>>
>>>>>>>> Checkpoint:
>>>>>>>>
>>>>>>>> CRIU for each anon fd:
>>>>>>>> amdgpu_plugin(fd)
>>>>>>>> -> attempt in kernel dma buf import (passes fd to kernel
>>>>>>>> via ioctl?)
>>>>>>>> -> is it ours? (no -> error)
>>>>>>>> -> create a record mapping fd number to amdgpu BO
>>>>>>>>
>>>>>>>> Restore:
>>>>>>>>
>>>>>>>> for each dma-buf fd record:
>>>>>>>> create BO if it does not exist
>>>>>>>> export BO to same fd
>>>>>>>> close BO handle if not in regular BO handle records
>>>>>>>>
>>>>>>>> Or since you mention lookup by inode, that would need to be
>>>>>>>> dmabuf_plugin so it can look up inodes in the private mount
>>>>>>>> namespace. However, how it would co-operate with amdgpu_plugin
>>>>>>>> is not clear to me.
>>>>>>
>>>>>> The way I think about the ownership is, whichever driver created
>>>>>> the underlying BO owns the checkpointing of the dmabuf. You need
>>>>>> driver-specific information to link the dmabuf with the driver's
>>>>>> BO and you need the right driver to recreate the BO and the dmabuf
>>>>>> fd when restoring the checkpoint.
>>>>>>
>>>>>> It gets really interesting if you have an amdgpu plugin and an
>>>>>> i915 plugin, and they checkpoint an application that shares BOs
>>>>>> between the two devices through DMABufs. E.g. if i915 created a BO
>>>>>> and amdgpu imported it, then during restore, i915 needs to restore
>>>>>> the dmabuf before the amdgpu import of it can be restored. I think
>>>>>> that brings us back to a two-phase approach to restoring the
>>>>>> memory sharing relationships. Uff.
>>>>>
>>>>> I think this part of the discussion somewhat depends on the
>>>>> previous part about idling. If it is feasible to completely idle
>>>>> and prune, and fail if that is not happening quickly enough, then
>>>>> maybe there wouldn't be too much hierarchical state to save.
>>>>>
>>>>> Otherwise my idea was that there is a top-level drm_plugin.so which
>>>>> understands amdgpu fds, i915, syncobj, sync_file, and uses some new
>>>>> uapi to uniquely identify each, associate with the correct driver,
>>>>> and then internally dispatches to amdgpu|i915|dmabuf|..._plugin.so.
>>>>> Building the in memory representation of their relationships. As
>>>>> long as all objects and their relationships have been recorded I
>>>>> think everything could then be correctly restored.
>>>>>
>>>>> It is possible there is a flaw in my thinking and that something
>>>>> in the CRIU design would make this impossible? I think it would
>>>>> require the top-level drm_plugin.so to hold all state in memory
>>>>> until the whole checkpointing is done, and then verify nothing is
>>>>> incomplete, failing it all if something was. (For instance, one
>>>>> plugin discovered a reference to an object which was not
>>>>> discovered by any other plugin, or things like that.) It may need
>>>>> some further tweaks to CRIU common code.
>>>>>
>>>>> Maybe I need to better understand how exactly you mean to query the
>>>>> DRM driver about random anonymous fds. I see it as a problem in the
>>>>> design, possibly even implementation, but maybe I am missing
>>>>> something which makes it not so. I mean even with my general idea I
>>>>> don't know how would one determine which driver to query about a
>>>>> particular anonymous inode.
>>>>>
>>>>>>> I later also realised that I was maybe increasing the scope for
>>>>>>> you here. :) You did state focus is ROCm applications which
>>>>>>> possibly doesn't care about dma-resv, fences, syncobjs etc?
>>>>>>
>>>>>> That's my focus for now. But I don't want to engineer a solution
>>>>>> that would preclude your use cases in the future.
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> But I think the "how to handle dma-bufs" design question is still
>>>>>>> relevant and interesting. For example I had this thought that
>>>>>>> perhaps what would be needed is a CRIU plugin hierarchy.
>>>>>>>
>>>>>>> Because fundamentally we would be snapshotting a hierarchy of
>>>>>>> kernel objects belonging to different drivers (kfd, amdgpu,
>>>>>>> dma-buf, ...). And if one day someone were to try to handle dma
>>>>>>> fences and drm syncobjs, the argument for a hierarchical design
>>>>>>> would be even stronger I think.
>>>>>>>
>>>>>>> Something like a drm_plugin.so could call sub-plugins (amdgpu,
>>>>>>> dma-buf, sync file, ...) and internally build the representation
>>>>>>> of the whole state and the relationships between the objects.
>>>>>>
>>>>>> Maybe. I guess a structure similar to libdrm makes sense. I'm not
>>>>>> sure it's strictly a hierarchy. Maybe more like some common code
>>>>>> shared by multiple GPU driver plugins. I think the common
>>>>>> checkpoint state is quite limited and restoring it requires the
>>>>>> GPU-specific drivers anyway.
>>>>>>
>>>>>> Also the idea of building a representation of the whole state
>>>>>> doesn't work well with the CRIU design, because "the whole state"
>>>>>> can include multiple processes that restore themselves
>>>>>> concurrently and only synchronize with each other in a few places
>>>>>> in the restore process. I feel, if we can work out how to
>>>>>> checkpoint and restore shared objects between processes, we can do
>>>>>> the same for shared objects between drivers without imposing a
>>>>>> strict hierarchy and some omniscient entity that needs to know
>>>>>> "the whole state".
>>>>>
>>>>> Okay, this continues on the same problem space as above. And you
>>>>> obviously know how CRIU works much better than me.
>>>>>
>>>>>>> With that kind of design there probably would be a need to define
>>>>>>> some common kernel side api and uapi, so all involved objects can
>>>>>>> be enumerated with some unique ids etc.
>>>>>>>
>>>>>>> Now.. the counter argument.. the more state from different
>>>>>>> drivers would one want to handle the bigger this project would
>>>>>>> get. Would it even be feasible is the question, to the point that
>>>>>>> it may be simpler to just run the workload in a VM via SR-IOV and
>>>>>>> simply hibernate the whole thin guest. :)
>>>>>>
>>>>>> Well, CRIU kind of tries to do that, but with containers instead
>>>>>> of VMs. ;)
>>>>>
>>>>> It would definitely be useful for hardware and drivers without
>>>>> SR-IOV support so let's hope it is doable. :)
>>>>>
>>>>> Regards,
>>>>>
>>>>> Tvrtko
* Re: Proposal to add CRIU support to DRM render nodes
2024-04-01 16:56 ` Tvrtko Ursulin
@ 2024-04-01 17:58 ` Felix Kuehling
2024-04-16 14:04 ` Tvrtko Ursulin
0 siblings, 1 reply; 14+ messages in thread
From: Felix Kuehling @ 2024-04-01 17:58 UTC (permalink / raw)
To: Tvrtko Ursulin, dri-devel, amd-gfx, Dave Airlie, Daniel Vetter, criu
Cc: Errabolu, Ramesh, Christian König
On 2024-04-01 12:56, Tvrtko Ursulin wrote:
>
> On 01/04/2024 17:37, Felix Kuehling wrote:
>> On 2024-04-01 11:09, Tvrtko Ursulin wrote:
>>>
>>> On 28/03/2024 20:42, Felix Kuehling wrote:
>>>>
>>>> On 2024-03-28 12:03, Tvrtko Ursulin wrote:
>>>>>
>>>>> Hi Felix,
>>>>>
>>>>> I had one more thought while browsing around the amdgpu CRIU
>>>>> plugin. It appears it relies on the KFD support being compiled in
>>>>> and /dev/kfd present, correct? AFAICT at least, it relies on that
>>>>> to figure out the amdgpu DRM node.
>>>>>
>>>>> It would probably be good to consider designing things without
>>>>> that dependency. So that checkpointing an application which does
>>>>> not use /dev/kfd is possible. Or if the kernel does not even have
>>>>> the KFD support compiled in.
>>>>
>>>> Yeah, if we want to support graphics apps that don't use KFD, we
>>>> should definitely do that. Currently we get a lot of topology
>>>> information from KFD, not even from the /dev/kfd device but from
>>>> the sysfs nodes exposed by KFD. We'd need to get GPU device info
>>>> from the render nodes instead. And if KFD is available, we may need
>>>> to integrate both sources of information.
>>>>
>>>>
>>>>>
>>>>> It could perhaps mean no more than adding some GPU discovery code
>>>>> into CRIU. Which should be flexible enough to account for things
>>>>> like re-assigned minor numbers due to driver reload.
>>>>
>>>> Do you mean adding GPU discovery to the core CRIU, or to the
>>>> plugin? I was thinking this is still part of the plugin.
>>>
>>> Yes I agree. I was only thinking about adding some DRM device
>>> discovery code in a more decoupled fashion from the current plugin,
>>> for both the reason discussed above (decoupling a bit from reliance
>>> on kfd sysfs), and then also if/when a new DRM driver might want to
>>> implement this the code could be move to some common plugin area.
>>>
>>> I am not sure how feasible that would be though. The "gpu id"
>>> concept and its matching in the current kernel code and CRIU plugin
>>> - is that value tied to the physical GPU instance, or how does it
>>> work?
>>
>> The concept of the GPU ID is that it's stable while the system is up,
>> even when devices get added and removed dynamically. It was baked
>> into the API early on, but I don't think we ever fully validated
>> device hot plug. I think the closest we're getting is with our latest
>> MI GPUs and dynamic partition mode change.
>
> Doesn't it read the saved gpu id from the image file while doing
> restore and try to open the render node to match it? Maybe I am
> misreading the code.. But if it does, does it imply that in practice
> it could be stable across reboots? Or that it is not possible to
> restore to a different instance of maybe the same GPU model installed
> in a system?
Ah, the idea is, that when you restore on a different system, you may
get different GPU IDs. Or you may checkpoint an app running on GPU 1 but
restore it on GPU 2 on the same system. That's why we need to translate
GPU IDs in restored applications. User mode still uses the old GPU IDs,
but the kernel mode driver translates them to the actual GPU IDs of the
GPUs that the process was restored on.
>
>> This also highlights another aspect on those spatially partitioned
>> GPUs. GPU IDs identify device partitions, not devices. Similarly,
>> each partition has its own render node, and the KFD topology info in
>> sysfs points to the render-minor number corresponding to each GPU ID.
>
> I am not familiar with this. This is not SR-IOV but some other kind of
> partitioning? Would you have any links where I could read more?
Right, the bare-metal driver can partition a PF spatially without SR-IOV.
SR-IOV can also use spatial partitioning and expose each partition
through its own VF, but that's not useful for bare metal. Spatial
partitioning is new in MI300. There is some high-level info in this
whitepaper:
https://www.amd.com/content/dam/amd/en/documents/instinct-tech-docs/white-papers/amd-cdna-3-white-paper.pdf.
Regards,
Felix
>
> Regards,
>
> Tvrtko
>
>>>>> Otherwise I am eager to hear more about the design
>>>>> specifics around dma-buf handling. And also seeing how to extend
>>>>> to other DRM related anonymous fds.
>>>>
>>>> I've been pretty far under-water lately. I hope I'll find time to
>>>> work on this more, but it's probably going to be at least a few weeks.
>>>
>>> Got it.
>>>
>>> Regards,
>>>
>>> Tvrtko
>>>
>>>>
>>>> Regards,
>>>> Felix
>>>>
>>>>
>>>>>
>>>>> Regards,
>>>>>
>>>>> Tvrtko
>>>>>
>>>>> On 15/03/2024 18:36, Tvrtko Ursulin wrote:
>>>>>>
>>>>>> On 15/03/2024 02:33, Felix Kuehling wrote:
>>>>>>>
>>>>>>> On 2024-03-12 5:45, Tvrtko Ursulin wrote:
>>>>>>>>
>>>>>>>> On 11/03/2024 14:48, Tvrtko Ursulin wrote:
>>>>>>>>>
>>>>>>>>> Hi Felix,
>>>>>>>>>
>>>>>>>>> On 06/12/2023 21:23, Felix Kuehling wrote:
>>>>>>>>>> Executive Summary: We need to add CRIU support to DRM render
>>>>>>>>>> nodes in order to maintain CRIU support for ROCm application
>>>>>>>>>> once they start relying on render nodes for more GPU memory
>>>>>>>>>> management. In this email I'm providing some background why
>>>>>>>>>> we are doing this, and outlining some of the problems we need
>>>>>>>>>> to solve to checkpoint and restore render node state and
>>>>>>>>>> shared memory (DMABuf) state. I have some thoughts on the API
>>>>>>>>>> design, leaning on what we did for KFD, but would like to get
>>>>>>>>>> feedback from the DRI community regarding that API and to
>>>>>>>>>> what extent there is interest in making that generic.
>>>>>>>>>>
>>>>>>>>>> We are working on using DRM render nodes for virtual address
>>>>>>>>>> mappings in ROCm applications to implement the CUDA11-style
>>>>>>>>>> VM API and improve interoperability between graphics and
>>>>>>>>>> compute. This uses DMABufs for sharing buffer objects between
>>>>>>>>>> KFD and multiple render node devices, as well as between
>>>>>>>>>> processes. In the long run this also provides a path to
>>>>>>>>>> moving all or most memory management from the KFD ioctl API
>>>>>>>>>> to libdrm.
>>>>>>>>>>
>>>>>>>>>> Once ROCm user mode starts using render nodes for virtual
>>>>>>>>>> address management, that creates a problem for checkpointing
>>>>>>>>>> and restoring ROCm applications with CRIU. Currently there is
>>>>>>>>>> no support for checkpointing and restoring render node state,
>>>>>>>>>> other than CPU virtual address mappings. Support will be
>>>>>>>>>> needed for checkpointing GEM buffer objects and handles,
>>>>>>>>>> their GPU virtual address mappings and memory sharing
>>>>>>>>>> relationships between devices and processes.
>>>>>>>>>>
>>>>>>>>>> Eventually, if full CRIU support for graphics applications is
>>>>>>>>>> desired, more state would need to be captured, including
>>>>>>>>>> scheduler contexts and BO lists. Most of this state is
>>>>>>>>>> driver-specific.
>>>>>>>>>>
>>>>>>>>>> After some internal discussions we decided to take our design
>>>>>>>>>> process public as this potentially touches DRM GEM and DMABuf
>>>>>>>>>> APIs and may have implications for other drivers in the future.
>>>>>>>>>>
>>>>>>>>>> One basic question before going into any API details: Is
>>>>>>>>>> there a desire to have CRIU support for other DRM drivers?
>>>>>>>>>
>>>>>>>>> This sounds like a very interesting feature overall,
>>>>>>>>> although I cannot answer on the last question here.
>>>>>>>>
>>>>>>>> I forgot to finish this thought. I cannot answer / don't know
>>>>>>>> of any concrete plans, but I think the feature is pretty cool and
>>>>>>>> if amdgpu gets it working I wouldn't be surprised if other
>>>>>>>> drivers would get interested.
>>>>>>>
>>>>>>> Thanks, that's good to hear!
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>>> Funnily enough, it has a tiny relation to an i915 feature I
>>>>>>>>> recently implemented on Mesa's request, which is to be able to
>>>>>>>>> "upload" the GPU context from the GPU hang error state and
>>>>>>>>> replay the hanging request. It is kind of (at a stretch) a
>>>>>>>>> very special tiny subset of checkpoint and restore, so I am
>>>>>>>>> mentioning it as a curiosity.
>>>>>>>>>
>>>>>>>>> And there is also another partial conceptual intersect with
>>>>>>>>> the (at the moment not yet upstream) i915 online debugger.
>>>>>>>>> This part being in the area of discovering and enumerating GPU
>>>>>>>>> resources belonging to the client.
>>>>>>>>>
>>>>>>>>> I don't see any immediate design or code sharing opportunities
>>>>>>>>> though, but just mentioning.
>>>>>>>>>
>>>>>>>>> I did spend some time reading your plugin and kernel
>>>>>>>>> implementation out of curiosity and have some comments and
>>>>>>>>> questions.
>>>>>>>>>
>>>>>>>>>> With that out of the way, some considerations for a possible
>>>>>>>>>> DRM CRIU API (either generic of AMDGPU driver specific): The
>>>>>>>>>> API goes through several phases during checkpoint and restore:
>>>>>>>>>>
>>>>>>>>>> Checkpoint:
>>>>>>>>>>
>>>>>>>>>> 1. Process-info (enumerates objects and sizes so user mode
>>>>>>>>>> can allocate
>>>>>>>>>> memory for the checkpoint, stops execution on the GPU)
>>>>>>>>>> 2. Checkpoint (store object metadata for BOs, queues, etc.)
>>>>>>>>>> 3. Unpause (resumes execution after the checkpoint is complete)
>>>>>>>>>>
>>>>>>>>>> Restore:
>>>>>>>>>>
>>>>>>>>>> 1. Restore (restore objects, VMAs are not in the right place
>>>>>>>>>> at this time)
>>>>>>>>>> 2. Resume (final fixups after the VMAs are sorted out,
>>>>>>>>>> resume execution)
>>>>>>>>>
>>>>>>>>> Btw does checkpointing guarantee all relevant activity is
>>>>>>>>> idled? For instance, are dma_resv objects free of fences which
>>>>>>>>> would need to be restored for things to continue executing
>>>>>>>>> sensibly? Or how is that handled?
>>>>>>>
>>>>>>> In our compute use cases, we suspend user mode queues. This can
>>>>>>> include CWSR (compute-wave-save-restore) where the state of
>>>>>>> in-flight waves is stored in memory and can be reloaded and
>>>>>>> resumed from memory later. We don't use any fences other than
>>>>>>> "eviction fences", that are signaled after the queues are
>>>>>>> suspended. And those fences are never handed to user mode. So we
>>>>>>> don't need to worry about any fence state in the checkpoint.
>>>>>>>
>>>>>>> If we extended this to support the kernel mode command
>>>>>>> submission APIs, I would expect that we'd wait for all current
>>>>>>> submissions to complete, and stop new ones from being sent to
>>>>>>> the HW before taking the checkpoint. When we take the checkpoint
>>>>>>> in the CRIU plugin, the CPU threads are already frozen and
>>>>>>> cannot submit any more work. If we wait for all currently
>>>>>>> pending submissions to drain, I think we don't need to save any
>>>>>>> fence state because all the fences will have signaled. (I may be
>>>>>>> missing some intricacies and I'm afraid it may not be that
>>>>>>> simple in reality, but that's my opening bid. ;)
>>>>>>
>>>>>> It feels feasible to me too, for the normally behaving clients at
>>>>>> least.
>>>>>>
>>>>>> Presumably, given that the whole checkpointing is not instant, it
>>>>>> would be okay to wait a second or two longer for the in-progress
>>>>>> submissions to complete. After which the kernel would need to
>>>>>> prune all signalled fences from the respective container objects
>>>>>> before checkpointing.
>>>>>>
>>>>>> For the "misbehaving" clients who have perhaps queued up too much
>>>>>> work, either still in the scheduler with unsatisfied
>>>>>> dependencies, or already submitted to the hardware and/or driver
>>>>>> backend, is there a timeout concept in CRIU so it would be
>>>>>> possible to say something like "try to checkpoint, but if the
>>>>>> kernel says no within time period t then give up"?
>>>>>>
>>>>>>>>>> For some more background about our implementation in KFD, you
>>>>>>>>>> can refer to this whitepaper:
>>>>>>>>>> https://github.com/checkpoint-restore/criu/blob/criu-dev/plugins/amdgpu/README.md
>>>>>>>>>>
>>>>>>>>>> Potential objections to a KFD-style CRIU API in DRM render
>>>>>>>>>> nodes, I'll address each of them in more detail below:
>>>>>>>>>>
>>>>>>>>>> * Opaque information in the checkpoint data that user mode
>>>>>>>>>> can't
>>>>>>>>>> interpret or do anything with
>>>>>>>>>> * A second API for creating objects (e.g. BOs) that is
>>>>>>>>>> separate from
>>>>>>>>>> the regular BO creation API
>>>>>>>>>> * Kernel mode would need to be involved in restoring BO
>>>>>>>>>> sharing
>>>>>>>>>> relationships rather than replaying BO creation, export
>>>>>>>>>> and import
>>>>>>>>>> from user mode
>>>>>>>>>>
>>>>>>>>>> # Opaque information in the checkpoint
>>>>>>>>>>
>>>>>>>>>> This comes out of ABI compatibility considerations. Adding
>>>>>>>>>> any new objects or attributes to the driver/HW state that
>>>>>>>>>> needs to be checkpointed could potentially break the ABI of
>>>>>>>>>> the CRIU checkpoint/restore ioctl if the plugin needs to
>>>>>>>>>> parse that information. Therefore, much of the information in
>>>>>>>>>> our KFD CRIU ioctl API is opaque. It is written by kernel
>>>>>>>>>> mode in the checkpoint, it is consumed by kernel mode when
>>>>>>>>>> restoring the checkpoint, but user mode doesn't care about
>>>>>>>>>> the contents or binary layout, so there is no user mode ABI
>>>>>>>>>> to break. This is how we were able to maintain CRIU support
>>>>>>>>>> when we added the SVM API to KFD without changing the CRIU
>>>>>>>>>> plugin and without breaking our ABI.
>>>>>>>>>>
>>>>>>>>>> Opaque information may also lend itself to API abstraction,
>>>>>>>>>> if this becomes a generic DRM API with driver-specific
>>>>>>>>>> callbacks that fill in HW-specific opaque data.
>>>>>>>>>
>>>>>>>>> This feels sound in principle to me. Fundamentally the state
>>>>>>>>> is very hardware specific, and/or driver version specific, so I
>>>>>>>>> don't see what could be gained in practice by making it
>>>>>>>>> much less opaque. (Apart from making things more complicated.)
>>>>>>>>>
>>>>>>>>> I was however unsure of the current split of how you dump
>>>>>>>>> buffer objects with some data in the defined bo structure, and
>>>>>>>>> some in completely opaque private data. Is there a benefit to
>>>>>>>>> that split, or in other words, is there a benefit in having
>>>>>>>>> part transparent and part opaque for buffer objects?
>>>>>>>
>>>>>>> Some of the buffer object state is needed by the plugin. E.g.
>>>>>>> the size and mmap offset are needed to match VMAs with BOs. I'd
>>>>>>> have to review the plugin in detail to prove that all the fields
>>>>>>> are, in fact, needed by the plugin, but that was the intent.
>>>>>>> Anything that the plugin doesn't need to know should be in the
>>>>>>> opaque data structures.
>>>>>>
>>>>>> Right, got it.
>>>>>>
>>>>>> Would it make sense to make the opaque data in the same block as
>>>>>> the defined one? I mean instead of separating the two in the
>>>>>> binary image for instance have struct kfd_criu_bo_bucket have a
>>>>>> trailing priv_data blob? Maybe it is too late now if the image
>>>>>> format is not versioned or something.
>>>>>>
>>>>>>>>> To slightly touch upon the question of whether this could
>>>>>>>>> become a generic DRM API. It feels it would be hard to do it
>>>>>>>>> from the start. What sounds more feasible is that if/when
>>>>>>>>> generic-looking helpers can be spotted while developing the
>>>>>>>>> RFC, the code could be structured so they can easily be
>>>>>>>>> promoted to shared/common at some future moment.
>>>>>>>
>>>>>>> Yes, that's how this usually goes, in my experience. Thanks for
>>>>>>> confirming.
>>>>>>>
>>>>>>>
>>>>>>>>>
>>>>>>>>>> # Second API for creating objects
>>>>>>>>>>
>>>>>>>>>> Creating BOs and other objects when restoring a checkpoint
>>>>>>>>>> needs more information than the usual BO alloc and similar
>>>>>>>>>> APIs provide. For example, we need to restore BOs with the
>>>>>>>>>> same GEM handles so that user mode can continue using those
>>>>>>>>>> handles after resuming execution. If BOs are shared through
>>>>>>>>>> DMABufs without dynamic attachment, we need to restore pinned
>>>>>>>>>> BOs as pinned. Validation of virtual addresses and handling
>>>>>>>>>> MMU notifiers must be suspended until the virtual address
>>>>>>>>>> space is restored. For user mode queues we need to save and
>>>>>>>>>> restore a lot of queue execution state so that execution can
>>>>>>>>>> resume cleanly.
>>>>>>>>>
>>>>>>>>> This also sounds justified to me. Restoring all internal
>>>>>>>>> objects is definitely special, and it sounds better to add
>>>>>>>>> uapi to create them directly with the correct properties than
>>>>>>>>> to add uapi to adjust internal properties after creation. In
>>>>>>>>> either case you would need some new uapi - at the very least
>>>>>>>>> to adjust properties after creation - at which point you may
>>>>>>>>> have both in one. Internally the implementation can be split
>>>>>>>>> or common, whatever makes sense for a given object type, but
>>>>>>>>> new uapi definitely sounds required.
>>>>>>>>>> # Restoring buffer sharing relationships
>>>>>>>>>>
>>>>>>>>>> Different GEM handles in different render nodes and processes
>>>>>>>>>> can refer to the same underlying shared memory, either by
>>>>>>>>>> directly pointing to the same GEM object, or by creating an
>>>>>>>>>> import attachment that may get its SG tables invalidated and
>>>>>>>>>> updated dynamically through dynamic attachment callbacks. In
>>>>>>>>>> the latter case it's obvious who is the exporter and who is
>>>>>>>>>> the importer. In the first case, either one could be the
>>>>>>>>>> exporter, and it's not clear who would need to create the BO
>>>>>>>>>> and who would need to
>>>>>>>>>
>>>>>>>>> To see if I follow the former case correctly.
>>>>>>>>>
>>>>>>>>> This could be two clients A and B, where B has imported a
>>>>>>>>> dma-buf shared BO from A and has since closed the dma-buf fd?
>>>>>>>>> Which results in a single BO with reference count of 2 and
>>>>>>>>> obj->import_attach unset. History of who created the object is
>>>>>>>>> lost.
>>>>>>>
>>>>>>> Yes. In the amdgpu driver this happens when the exporting and
>>>>>>> importing device are the same.
>>>>>>>
>>>>>>>
>>>>>>>>>
>>>>>>>>> In fact it could even be that two imported objects remain
>>>>>>>>> (clients A, B and C) and A, who originally created the BO, has
>>>>>>>>> since fully released it. So any kind of "creator" tracking if
>>>>>>>>> added wouldn't be fully reliable either.
>>>>>>>
>>>>>>> That's a good point.
>>>>>>>
>>>>>>>
>>>>>>>>>
>>>>>>>>>> import it when restoring the checkpoint. To further
>>>>>>>>>> complicate things, multiple processes in a checkpoint get
>>>>>>>>>> restored concurrently. So there is no guarantee that an
>>>>>>>>>> exporter has restored a shared BO at the time an importer is
>>>>>>>>>> trying to restore its import.
>>>>>>>>>>
>>>>>>>>>> A proposal to deal with these problems would be to treat
>>>>>>>>>> importers and exporters the same. Whoever restores first,
>>>>>>>>>> ends up creating the BO and potentially attaching to it. The
>>>>>>>>>> other process(es) can find BOs that were already restored by
>>>>>>>>>> another process by looking it up with a unique ID that could
>>>>>>>>>> be based on the DMABuf inode number. An alternative would be
>>>>>>>>>> a two-pass approach that needs to restore BOs on two passes:
>>>>>>>>>>
>>>>>>>>>> 1. Restore exported BOs
>>>>>>>>>> 2. Restore imports
>>>>>>>>>>
>>>>>>>>>> With some inter-process synchronization in CRIU itself
>>>>>>>>>> between these two passes. This may require changes in the
>>>>>>>>>> core CRIU, outside our plugin. Both approaches depend on
>>>>>>>>>> identifying BOs with some unique ID that could be based on
>>>>>>>>>> the DMABuf inode number in the checkpoint. However, we would
>>>>>>>>>> need to identify the processes in the same restore session,
>>>>>>>>>> possibly based on parent/child process relationships, to
>>>>>>>>>> create a scope where those IDs are valid during restore.
>>>>>>>>>
>>>>>>>>> If my understanding above is on the right track, then I think
>>>>>>>>> this is the only thing which can be done (for all scenarios).
>>>>>>>
>>>>>>> I presented two alternatives. I think you're in favor of the
>>>>>>> first one, where it doesn't matter who is the importer and
>>>>>>> exporter. I think the two-pass approach requires that you can
>>>>>>> identify an exporter. And as you pointed out, the exporter may
>>>>>>> already have dropped their reference to the BO.
>>>>>>
>>>>>> Yep.
>>>>>>
>>>>>>>>> I also *think* it would be safe. At least at the moment I
>>>>>>>>> cannot think what could go wrong. Semantics are that it
>>>>>>>>> doesn't really matter who created the object.
>>>>>>>
>>>>>>> I would agree. What matters is that the object is recreated on
>>>>>>> the correct device, and that all the direct references and
>>>>>>> import attachments pointing to it are restored.
>>>>>>>
>>>>>>>
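As an illustration, the "whoever restores first creates the BO" lookup could be sketched roughly like this (a self-contained user-space mock in plain C; all structure and function names here are invented for illustration and are not actual driver or plugin code):

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Mock stand-in for a restored GEM BO; not actual driver code. */
struct mock_bo {
    uint64_t unique_id;   /* e.g. derived from the DMABuf inode number */
    int      creator_pid; /* whichever process happened to restore it first */
    int      refcount;
};

#define MAX_SHARED_BOS 64

struct restore_registry {
    struct mock_bo bos[MAX_SHARED_BOS];
    size_t count;
};

/*
 * Whoever restores first creates the BO; everyone else finds it by its
 * unique ID and just takes a reference. *created reports which case
 * this caller hit.
 */
struct mock_bo *restore_shared_bo(struct restore_registry *reg,
                                  uint64_t unique_id, int pid,
                                  bool *created)
{
    for (size_t i = 0; i < reg->count; i++) {
        if (reg->bos[i].unique_id == unique_id) {
            reg->bos[i].refcount++;
            *created = false;
            return &reg->bos[i];
        }
    }
    if (reg->count == MAX_SHARED_BOS)
        return NULL;

    struct mock_bo *bo = &reg->bos[reg->count++];
    bo->unique_id = unique_id;
    bo->creator_pid = pid;
    bo->refcount = 1;
    *created = true;
    return bo;
}
```

In practice the unique ID would be derived from the DMABuf inode number, and the registry would have to be shared between the processes of one restore session (the scoping problem discussed above).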
>>>>>>>>>
>>>>>>>>>> Finally, we would also need to checkpoint and restore DMABuf
>>>>>>>>>> file descriptors themselves. These are anonymous file
>>>>>>>>>> descriptors. The CRIU plugin could probably be taught to
>>>>>>>>>> recreate them from the original exported BO based on the
>>>>>>>>>> inode number that could be queried with fstat in the
>>>>>>>>>> checkpoint. It would need help from the render node CRIU API
>>>>>>>>>> to find the right BO from the inode, which may be from a
>>>>>>>>>> different process in the same restore session.
>>>>>>>>>
>>>>>>>>> This part feels like it is breaking the component separation a
>>>>>>>>> bit because even for buffers fully owned by amdgpu, strictly
>>>>>>>>> speaking the dma-buf fd is not. At least my understanding from
>>>>>>>>> the above is that you propose to attempt to import the fd,
>>>>>>>>> from the kernel side, during the checkpoint process? Like:
>>>>>>>>>
>>>>>>>>> Checkpoint:
>>>>>>>>>
>>>>>>>>> CRIU for each anon fd:
>>>>>>>>> amdgpu_plugin(fd)
>>>>>>>>> -> attempt in kernel dma buf import (passes fd to kernel
>>>>>>>>> via ioctl?)
>>>>>>>>> -> is it ours? (no -> error)
>>>>>>>>> -> create a record mapping fd number to amdgpu BO
>>>>>>>>>
>>>>>>>>> Restore:
>>>>>>>>>
>>>>>>>>> for each dma-buf fd record:
>>>>>>>>> create BO if it does not exist
>>>>>>>>> export BO to same fd
>>>>>>>>> close BO handle if not in regular BO handle records
>>>>>>>>>
>>>>>>>>> Or since you mention lookup by inode, that would need to be
>>>>>>>>> dmabuf_plugin so it can look up inodes in the private mount
>>>>>>>>> namespace. However, how it would co-operate with amdgpu_plugin
>>>>>>>>> is not clear to me.
>>>>>>>
>>>>>>> The way I think about the ownership is, whichever driver created
>>>>>>> the underlying BO owns the checkpointing of the dmabuf. You need
>>>>>>> driver-specific information to link the dmabuf with the driver's
>>>>>>> BO and you need the right driver to recreate the BO and the
>>>>>>> dmabuf fd when restoring the checkpoint.
>>>>>>>
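For reference, the inode-based identity mentioned above comes straight from fstat(): the st_ino of an anonymous fd identifies the underlying object, and every fd referring to the same object reports the same inode. A minimal sketch, using ordinary pipe fds as a stand-in since creating a real DMABuf needs a device:

```c
#include <sys/stat.h>
#include <unistd.h>

/*
 * Return the inode number backing an (anonymous) file descriptor, or 0
 * on error. For a real DMABuf fd, this st_ino would be the value the
 * checkpoint records to identify the shared buffer across processes.
 */
unsigned long fd_inode(int fd)
{
    struct stat st;

    if (fstat(fd, &st) != 0)
        return 0;
    return (unsigned long)st.st_ino;
}
```

Two fds referring to the same object (e.g. via dup) report the same st_ino, while unrelated objects differ, which is exactly the property a checkpoint would use to match DMABuf fds to BOs across processes.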
>>>>>>> It gets really interesting if you have an amdgpu plugin and an
>>>>>>> i915 plugin, and they checkpoint an application that shares BOs
>>>>>>> between the two devices through DMABufs. E.g. if i915 created a
>>>>>>> BO and amdgpu imported it, then during restore, i915 needs to
>>>>>>> restore the dmabuf before the amdgpu import of it can be
>>>>>>> restored. I think that brings us back to a two-phase approach to
>>>>>>> restoring the memory sharing relationships. Uff.
>>>>>>
>>>>>> I think this part of the discussion somewhat depends on the
>>>>>> previous part about idling. If it is feasible to completely idle
>>>>>> and prune, and fail if that is not happening quickly enough, then
>>>>>> maybe there wouldn't be too much hierarchical state to save.
>>>>>>
>>>>>> Otherwise my idea was that there is a top-level drm_plugin.so
>>>>>> which understands amdgpu fds, i915, syncobj, sync_file, and uses
>>>>>> some new uapi to uniquely identify each, associate each with the
>>>>>> correct driver, and then internally dispatches to
>>>>>> amdgpu|i915|dmabuf|..._plugin.so, building the in-memory
>>>>>> representation of their relationships. As long as all objects and
>>>>>> their relationships have been recorded, I think everything could
>>>>>> then be correctly restored.
>>>>>>
>>>>>> It is possible there is a flaw in my thinking and that something
>>>>>> in the CRIU design would make this impossible? I think it would
>>>>>> require the top-level drm_plugin.so to hold all state in memory
>>>>>> until the whole checkpointing is done, and then verify nothing is
>>>>>> incomplete, failing it all if it is. (For instance, one plugin
>>>>>> discovered a reference to an object which was not discovered by
>>>>>> any other plugin, or things like that.) This may need some further
>>>>>> tweaks to CRIU common code.
>>>>>>
>>>>>> Maybe I need to better understand how exactly you mean to query
>>>>>> the DRM driver about random anonymous fds. I see it as a problem
>>>>>> in the design, possibly even implementation, but maybe I am
>>>>>> missing something which makes it not so. I mean even with my
>>>>>> general idea, I don't know how one would determine which driver to
>>>>>> query about a particular anonymous inode.
>>>>>>
>>>>>>>> I later also realised that I was maybe increasing the scope for
>>>>>>>> you here. :) You did state focus is ROCm applications which
>>>>>>>> possibly doesn't care about dma-resv, fences, syncobjs etc?
>>>>>>>
>>>>>>> That's my focus for now. But I don't want to engineer a solution
>>>>>>> that would preclude your use cases in the future.
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> But I think the "how to handle dma-bufs" design question is
>>>>>>>> still relevant and interesting. For example I had this thought
>>>>>>>> that perhaps what would be needed is a CRIU plugin hierarchy.
>>>>>>>>
>>>>>>>> Because fundamentally we would be snapshotting a hierarchy of
>>>>>>>> kernel objects belonging to different drivers (kfd, amdgpu,
>>>>>>>> dma-buf, ...). And if one day someone were to try to handle
>>>>>>>> dma fences and drm syncobjs, the argument for a hierarchical
>>>>>>>> design would be even stronger, I think.
>>>>>>>>
>>>>>>>> Something like a drm_plugin.so could call sub-plugins (amdgpu,
>>>>>>>> dma-buf, sync file, ...) and internally build the
>>>>>>>> representation of the whole state and the relationships between
>>>>>>>> the objects.
>>>>>>>
>>>>>>> Maybe. I guess a structure similar to libdrm makes sense. I'm
>>>>>>> not sure it's strictly a hierarchy. Maybe more like some common
>>>>>>> code shared by multiple GPU driver plugins. I think the common
>>>>>>> checkpoint state is quite limited and restoring it requires the
>>>>>>> GPU-specific drivers anyway.
>>>>>>>
>>>>>>> Also the idea of building a representation of the whole state
>>>>>>> doesn't work well with the CRIU design, because "the whole
>>>>>>> state" can include multiple processes that restore themselves
>>>>>>> concurrently and only synchronize with each other in a few
>>>>>>> places in the restore process. I feel, if we can work out how to
>>>>>>> checkpoint and restore shared objects between processes, we can
>>>>>>> do the same for shared objects between drivers without imposing
>>>>>>> a strict hierarchy and some omniscient entity that needs to know
>>>>>>> "the whole state".
>>>>>>
>>>>>> Okay, this continues on the same problem space as above. And you
>>>>>> obviously know how CRIU works much better than me.
>>>>>>
>>>>>>>> With that kind of design there probably would be a need to
>>>>>>>> define some common kernel side api and uapi, so all involved
>>>>>>>> objects can be enumerated with some unique ids etc.
>>>>>>>>
>>>>>>>> Now.. the counter argument.. the more state from different
>>>>>>>> drivers one would want to handle, the bigger this project would
>>>>>>>> get. The question is whether it would even be feasible, to the
>>>>>>>> point that it may be simpler to just run the workload in a VM
>>>>>>>> via SR-IOV and simply hibernate the whole guest. :)
>>>>>>>
>>>>>>> Well, CRIU kind of tries to do that, but with containers instead
>>>>>>> of VMs. ;)
>>>>>>
>>>>>> It would definitely be useful for hardware and drivers without
>>>>>> SR-IOV support, so let's hope it is doable. :)
>>>>>>
>>>>>> Regards,
>>>>>>
>>>>>> Tvrtko
* Re: Proposal to add CRIU support to DRM render nodes
2024-04-01 17:58 ` Felix Kuehling
@ 2024-04-16 14:04 ` Tvrtko Ursulin
2024-05-03 14:44 ` Felix Kuehling
0 siblings, 1 reply; 14+ messages in thread
From: Tvrtko Ursulin @ 2024-04-16 14:04 UTC (permalink / raw)
To: Felix Kuehling, dri-devel, amd-gfx, Dave Airlie, Daniel Vetter, criu
Cc: Errabolu, Ramesh, Christian König
On 01/04/2024 18:58, Felix Kuehling wrote:
>
> On 2024-04-01 12:56, Tvrtko Ursulin wrote:
>>
>> On 01/04/2024 17:37, Felix Kuehling wrote:
>>> On 2024-04-01 11:09, Tvrtko Ursulin wrote:
>>>>
>>>> On 28/03/2024 20:42, Felix Kuehling wrote:
>>>>>
>>>>> On 2024-03-28 12:03, Tvrtko Ursulin wrote:
>>>>>>
>>>>>> Hi Felix,
>>>>>>
>>>>>> I had one more thought while browsing around the amdgpu CRIU
>>>>>> plugin. It appears it relies on the KFD support being compiled in
>>>>>> and /dev/kfd present, correct? AFAICT at least, it relies on that
>>>>>> to figure out the amdgpu DRM node.
>>>>>>
>>>>>> It would probably be good to consider designing things without
>>>>>> that dependency. So that checkpointing an application which does
>>>>>> not use /dev/kfd is possible. Or if the kernel does not even have
>>>>>> the KFD support compiled in.
>>>>>
>>>>> Yeah, if we want to support graphics apps that don't use KFD, we
>>>>> should definitely do that. Currently we get a lot of topology
>>>>> information from KFD, not even from the /dev/kfd device but from
>>>>> the sysfs nodes exposed by KFD. We'd need to get GPU device info
>>>>> from the render nodes instead. And if KFD is available, we may need
>>>>> to integrate both sources of information.
>>>>>
>>>>>
>>>>>>
>>>>>> It could perhaps mean no more than adding some GPU discovery code
>>>>>> into CRIU, which should be flexible enough to account for things
>>>>>> like re-assigned minor numbers due to driver reload.
>>>>>
>>>>> Do you mean adding GPU discovery to the core CRIU, or to the
>>>>> plugin? I was thinking this is still part of the plugin.
>>>>
>>>> Yes I agree. I was only thinking about adding some DRM device
>>>> discovery code in a more decoupled fashion from the current plugin,
>>>> for both the reason discussed above (decoupling a bit from reliance
>>>> on kfd sysfs), and then also if/when a new DRM driver might want to
>>>> implement this, the code could be moved to some common plugin area.
>>>>
>>>> I am not sure how feasible that would be though. The "gpu id"
>>>> concept and its matching in the current kernel code and CRIU plugin
>>>> - is that value tied to the physical GPU instance, or how does it
>>>> work?
>>>
>>> The concept of the GPU ID is that it's stable while the system is up,
>>> even when devices get added and removed dynamically. It was baked
>>> into the API early on, but I don't think we ever fully validated
>>> device hot plug. I think the closest we're getting is with our latest
>>> MI GPUs and dynamic partition mode change.
>>
>> Doesn't it read the saved gpu id from the image file while doing
>> restore and try to open the render node to match it? Maybe I am
>> misreading the code.. But if it does, does it imply that in practice
>> it could be stable across reboots? Or that it is not possible to
>> restore to a different instance of maybe the same GPU model installed
>> in a system?
>
> Ah, the idea is, that when you restore on a different system, you may
> get different GPU IDs. Or you may checkpoint an app running on GPU 1 but
> restore it on GPU 2 on the same system. That's why we need to translate
> GPU IDs in restored applications. User mode still uses the old GPU IDs,
> but the kernel mode driver translates them to the actual GPU IDs of the
> GPUs that the process was restored on.
I see.. I think. The normal flow is that ppd->user_gpu_id is set during
client init, but for restored clients it gets overridden during restore
so that further ioctls do not instantly fail.
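That translation could be pictured as a small per-process table (a mock sketch with invented names; the real kfd code keeps this state in its per-process data):

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Mock per-process table mapping the GPU IDs recorded in the checkpoint
 * (which restored user mode keeps passing in ioctls) to the actual GPU
 * IDs on the system the process was restored on. Names are invented
 * for illustration.
 */
struct gpu_id_map {
    uint32_t user_gpu_id;   /* ID from the checkpoint image */
    uint32_t actual_gpu_id; /* ID assigned on the restore target */
};

struct restored_process {
    struct gpu_id_map map[8];
    size_t n_gpus;
};

/* Translate a user-visible GPU ID to the real one; 0 means unknown. */
uint32_t translate_gpu_id(const struct restored_process *p,
                          uint32_t user_id)
{
    for (size_t i = 0; i < p->n_gpus; i++)
        if (p->map[i].user_gpu_id == user_id)
            return p->map[i].actual_gpu_id;
    return 0;
}
```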
And then in amdgpu_plugin_restore_file, when it is opening the render
node, it relies on the kfd topology to have filled in (more or less) the
target_gpu_id corresponding to the render node gpu id of the target GPU
- the one associated with the new kfd gpu_id?
I am digging into this because I am trying to see if some part of GPU
discovery could somehow be decoupled, so I could offer to work on at
least that until you start to tackle the main body of the feature. But
it looks properly tangled up.
Do you have any suggestions for what I could help with? Maybe
developing some sort of DRM device enumeration library, if you see a
way that would be useful in decoupling the device discovery from kfd.
We would need to define what sort of information you would need to be
queryable from it.
>>> This also highlights another aspect on those spatially partitioned
>>> GPUs. GPU IDs identify device partitions, not devices. Similarly,
>>> each partition has its own render node, and the KFD topology info in
>>> sysfs points to the render-minor number corresponding to each GPU ID.
>>
>> I am not familiar with this. This is not SR-IOV but some other kind of
>> partitioning? Would you have any links where I could read more?
>
> Right, the bare-metal driver can partition a PF spatially without SRIOV.
> SRIOV can also use spatial partitioning and expose each partition
> through its own VF, but that's not useful for bare metal. Spatial
> partitioning is new in MI300. There is some high-level info in this
> whitepaper:
> https://www.amd.com/content/dam/amd/en/documents/instinct-tech-docs/white-papers/amd-cdna-3-white-paper.pdf.
From the outside (userspace), does this simply look like multiple DRM
render nodes, or how exactly does it appear?
Regards,
Tvrtko
>
> Regards,
> Felix
>
>
>>
>> Regards,
>>
>> Tvrtko
>>
>>>>>> Otherwise I am eagerly awaiting to hear more about the design
>>>>>> specifics around dma-buf handling. And also seeing how to extend
>>>>>> to other DRM related anonymous fds.
>>>>>
>>>>> I've been pretty far under-water lately. I hope I'll find time to
>>>>> work on this more, but it's probably going to be at least a few weeks.
>>>>
>>>> Got it.
>>>>
>>>> Regards,
>>>>
>>>> Tvrtko
>>>>
>>>>>
>>>>> Regards,
>>>>> Felix
>>>>>
>>>>>
>>>>>>
>>>>>> Regards,
>>>>>>
>>>>>> Tvrtko
>>>>>>
>>>>>> On 15/03/2024 18:36, Tvrtko Ursulin wrote:
>>>>>>>
>>>>>>> On 15/03/2024 02:33, Felix Kuehling wrote:
>>>>>>>>
>>>>>>>> On 2024-03-12 5:45, Tvrtko Ursulin wrote:
>>>>>>>>>
>>>>>>>>> On 11/03/2024 14:48, Tvrtko Ursulin wrote:
>>>>>>>>>>
>>>>>>>>>> Hi Felix,
>>>>>>>>>>
>>>>>>>>>> On 06/12/2023 21:23, Felix Kuehling wrote:
>>>>>>>>>>> Executive Summary: We need to add CRIU support to DRM render
>>>>>>>>>>> nodes in order to maintain CRIU support for ROCm application
>>>>>>>>>>> once they start relying on render nodes for more GPU memory
>>>>>>>>>>> management. In this email I'm providing some background why
>>>>>>>>>>> we are doing this, and outlining some of the problems we need
>>>>>>>>>>> to solve to checkpoint and restore render node state and
>>>>>>>>>>> shared memory (DMABuf) state. I have some thoughts on the API
>>>>>>>>>>> design, leaning on what we did for KFD, but would like to get
>>>>>>>>>>> feedback from the DRI community regarding that API and to
>>>>>>>>>>> what extent there is interest in making that generic.
>>>>>>>>>>>
>>>>>>>>>>> We are working on using DRM render nodes for virtual address
>>>>>>>>>>> mappings in ROCm applications to implement the CUDA11-style
>>>>>>>>>>> VM API and improve interoperability between graphics and
>>>>>>>>>>> compute. This uses DMABufs for sharing buffer objects between
>>>>>>>>>>> KFD and multiple render node devices, as well as between
>>>>>>>>>>> processes. In the long run this also provides a path to
>>>>>>>>>>> moving all or most memory management from the KFD ioctl API
>>>>>>>>>>> to libdrm.
>>>>>>>>>>>
>>>>>>>>>>> Once ROCm user mode starts using render nodes for virtual
>>>>>>>>>>> address management, that creates a problem for checkpointing
>>>>>>>>>>> and restoring ROCm applications with CRIU. Currently there is
>>>>>>>>>>> no support for checkpointing and restoring render node state,
>>>>>>>>>>> other than CPU virtual address mappings. Support will be
>>>>>>>>>>> needed for checkpointing GEM buffer objects and handles,
>>>>>>>>>>> their GPU virtual address mappings and memory sharing
>>>>>>>>>>> relationships between devices and processes.
>>>>>>>>>>>
>>>>>>>>>>> Eventually, if full CRIU support for graphics applications is
>>>>>>>>>>> desired, more state would need to be captured, including
>>>>>>>>>>> scheduler contexts and BO lists. Most of this state is
>>>>>>>>>>> driver-specific.
>>>>>>>>>>>
>>>>>>>>>>> After some internal discussions we decided to take our design
>>>>>>>>>>> process public as this potentially touches DRM GEM and DMABuf
>>>>>>>>>>> APIs and may have implications for other drivers in the future.
>>>>>>>>>>>
>>>>>>>>>>> One basic question before going into any API details: Is
>>>>>>>>>>> there a desire to have CRIU support for other DRM drivers?
>>>>>>>>>>
>>>>>>>>>> This sounds like a very interesting feature overall, although
>>>>>>>>>> I cannot answer the last question here.
>>>>>>>>>
>>>>>>>>> I forgot to finish this thought. I cannot answer / don't know
>>>>>>>>> of any concrete plans, but I think feature is pretty cool and
>>>>>>>>> if amdgpu gets it working I wouldn't be surprised if other
>>>>>>>>> drivers would get interested.
>>>>>>>>
>>>>>>>> Thanks, that's good to hear!
>>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>>> Funnily enough, it has a tiny relation to an i915 feature I
>>>>>>>>>> recently implemented on Mesa's request, which is to be able to
>>>>>>>>>> "upload" the GPU context from the GPU hang error state and
>>>>>>>>>> replay the hanging request. It is kind of (at a stretch) a
>>>>>>>>>> very special tiny subset of checkpoint and restore, so I am
>>>>>>>>>> mentioning it just as a curiosity.
>>>>>>>>>>
>>>>>>>>>> And there is also another partial conceptual intersection
>>>>>>>>>> with the (at the moment not yet upstream) i915 online
>>>>>>>>>> debugger, this part being in the area of discovering and
>>>>>>>>>> enumerating GPU resources belonging to the client.
>>>>>>>>>>
>>>>>>>>>> I don't see any immediate design or code sharing opportunities
>>>>>>>>>> though, but just mentioning it.
>>>>>>>>>>
>>>>>>>>>> I did spend some time reading your plugin and kernel
>>>>>>>>>> implementation out of curiosity and have some comments and
>>>>>>>>>> questions.
>>>>>>>>>>
>>>>>>>>>>> With that out of the way, some considerations for a possible
>>>>>>>>>>> DRM CRIU API (either generic of AMDGPU driver specific): The
>>>>>>>>>>> API goes through several phases during checkpoint and restore:
>>>>>>>>>>>
>>>>>>>>>>> Checkpoint:
>>>>>>>>>>>
>>>>>>>>>>> 1. Process-info (enumerates objects and sizes so user mode
>>>>>>>>>>> can allocate
>>>>>>>>>>> memory for the checkpoint, stops execution on the GPU)
>>>>>>>>>>> 2. Checkpoint (store object metadata for BOs, queues, etc.)
>>>>>>>>>>> 3. Unpause (resumes execution after the checkpoint is complete)
>>>>>>>>>>>
>>>>>>>>>>> Restore:
>>>>>>>>>>>
>>>>>>>>>>> 1. Restore (restore objects, VMAs are not in the right place
>>>>>>>>>>> at this time)
>>>>>>>>>>> 2. Resume (final fixups after the VMAs are sorted out,
>>>>>>>>>>> resume execution)
>>>>>>>>>>
>>>>>>>>>> Btw, does checkpointing guarantee that all relevant activity
>>>>>>>>>> is idled? For instance, that dma_resv objects are free of
>>>>>>>>>> fences which would need to be restored for things to continue
>>>>>>>>>> executing sensibly? Or how is that handled?
>>>>>>>>
>>>>>>>> In our compute use cases, we suspend user mode queues. This can
>>>>>>>> include CWSR (compute-wave-save-restore) where the state of
>>>>>>>> in-flight waves is stored in memory and can be reloaded and
>>>>>>>> resumed from memory later. We don't use any fences other than
>>>>>>>> "eviction fences", that are signaled after the queues are
>>>>>>>> suspended. And those fences are never handed to user mode. So we
>>>>>>>> don't need to worry about any fence state in the checkpoint.
>>>>>>>>
>>>>>>>> If we extended this to support the kernel mode command
>>>>>>>> submission APIs, I would expect that we'd wait for all current
>>>>>>>> submissions to complete, and stop new ones from being sent to
>>>>>>>> the HW before taking the checkpoint. When we take the checkpoint
>>>>>>>> in the CRIU plugin, the CPU threads are already frozen and
>>>>>>>> cannot submit any more work. If we wait for all currently
>>>>>>>> pending submissions to drain, I think we don't need to save any
>>>>>>>> fence state because all the fences will have signaled. (I may be
>>>>>>>> missing some intricacies and I'm afraid it may not be that
>>>>>>>> simple in reality, but that's my opening bid. ;)
>>>>>>>
>>>>>>> It feels feasible to me too, for the normally behaving clients at
>>>>>>> least.
>>>>>>>
>>>>>>> Presumably, given that the whole checkpointing is not instant, it
>>>>>>> would be okay to wait a second or two longer for the in-progress
>>>>>>> submissions to complete, after which the kernel would need to
>>>>>>> prune all signalled fences from the respective container objects
>>>>>>> before checkpointing.
>>>>>>>
>>>>>>> For the "misbehaving" clients who have perhaps queued up too much
>>>>>>> work, either still in the scheduler with unsatisfied
>>>>>>> dependencies, or already submitted to the hardware and/or driver
>>>>>>> backend, is there a timeout concept in CRIU so it would be
>>>>>>> possible to say something like "try to checkpoint, but if the
>>>>>>> kernel says no within time period t then give up"?
>>>>>>>
>>>>>>>>>>> For some more background about our implementation in KFD, you
>>>>>>>>>>> can refer to this whitepaper:
>>>>>>>>>>> https://github.com/checkpoint-restore/criu/blob/criu-dev/plugins/amdgpu/README.md
>>>>>>>>>>>
>>>>>>>>>>> Potential objections to a KFD-style CRIU API in DRM render
>>>>>>>>>>> nodes, I'll address each of them in more detail below:
>>>>>>>>>>>
>>>>>>>>>>> * Opaque information in the checkpoint data that user mode
>>>>>>>>>>> can't
>>>>>>>>>>> interpret or do anything with
>>>>>>>>>>> * A second API for creating objects (e.g. BOs) that is
>>>>>>>>>>> separate from
>>>>>>>>>>> the regular BO creation API
>>>>>>>>>>> * Kernel mode would need to be involved in restoring BO
>>>>>>>>>>> sharing
>>>>>>>>>>> relationships rather than replaying BO creation, export
>>>>>>>>>>> and import
>>>>>>>>>>> from user mode
>>>>>>>>>>>
>>>>>>>>>>> # Opaque information in the checkpoint
>>>>>>>>>>>
>>>>>>>>>>> This comes out of ABI compatibility considerations. Adding
>>>>>>>>>>> any new objects or attributes to the driver/HW state that
>>>>>>>>>>> needs to be checkpointed could potentially break the ABI of
>>>>>>>>>>> the CRIU checkpoint/restore ioctl if the plugin needs to
>>>>>>>>>>> parse that information. Therefore, much of the information in
>>>>>>>>>>> our KFD CRIU ioctl API is opaque. It is written by kernel
>>>>>>>>>>> mode in the checkpoint, it is consumed by kernel mode when
>>>>>>>>>>> restoring the checkpoint, but user mode doesn't care about
>>>>>>>>>>> the contents or binary layout, so there is no user mode ABI
>>>>>>>>>>> to break. This is how we were able to maintain CRIU support
>>>>>>>>>>> when we added the SVM API to KFD without changing the CRIU
>>>>>>>>>>> plugin and without breaking our ABI.
>>>>>>>>>>>
>>>>>>>>>>> Opaque information may also lend itself to API abstraction,
>>>>>>>>>>> if this becomes a generic DRM API with driver-specific
>>>>>>>>>>> callbacks that fill in HW-specific opaque data.
>>>>>>>>>>
>>>>>>>>>> This feels sound in principle to me. Fundamentally the state
>>>>>>>>>> is very hardware specific and/or driver version specific, so I
>>>>>>>>>> don't see what could be gained in practice by making it much
>>>>>>>>>> less opaque. (Apart from making things more complicated.)
>>>>>>>>>>
>>>>>>>>>> I was however unsure of the current split of how you dump
>>>>>>>>>> buffer objects with some data in the defined bo structure, and
>>>>>>>>>> some in completely opaque private data. Is there a benefit to
>>>>>>>>>> that split, or maybe in other words, is there a benefit on
>>>>>>>>>> having part transparent and part opaque for buffer objects?
>>>>>>>>
>>>>>>>> Some of the buffer object state is needed by the plugin. E.g.
>>>>>>>> the size and mmap offset are needed to match VMAs with BOs. I'd
>>>>>>>> have to review the plugin in detail to prove that all the fields
>>>>>>>> are, in fact, needed by the plugin, but that was the intent.
>>>>>>>> Anything that the plugin doesn't need to know should be in the
>>>>>>>> opaque data structures.
>>>>>>>
>>>>>>> Right, got it.
>>>>>>>
>>>>>>> Would it make sense to put the opaque data in the same block as
>>>>>>> the defined one? I mean, instead of separating the two in the
>>>>>>> binary image, for instance have struct kfd_criu_bo_bucket carry a
>>>>>>> trailing priv_data blob? Maybe it is too late now if the image
>>>>>>> format is not versioned or something.
>>>>>>>
>>>>>>>>>> To slightly touch upon the question of whether this could
>>>>>>>>>> become a generic DRM API. It feels it would be hard to do it
>>>>>>>>>> from the start. What sounds more feasible is that if/when
>>>>>>>>>> generic-looking helpers can be spotted while developing the
>>>>>>>>>> RFC, the code can be structured so they can easily be promoted
>>>>>>>>>> to shared/common at some future moment.
>>>>>>>>
>>>>>>>> Yes, that's how this usually goes, in my experience. Thanks for
>>>>>>>> confirming.
>>>>>>>>
>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> # Second API for creating objects
>>>>>>>>>>>
>>>>>>>>>>> Creating BOs and other objects when restoring a checkpoint
>>>>>>>>>>> needs more information than the usual BO alloc and similar
>>>>>>>>>>> APIs provide. For example, we need to restore BOs with the
>>>>>>>>>>> same GEM handles so that user mode can continue using those
>>>>>>>>>>> handles after resuming execution. If BOs are shared through
>>>>>>>>>>> DMABufs without dynamic attachment, we need to restore pinned
>>>>>>>>>>> BOs as pinned. Validation of virtual addresses and handling
>>>>>>>>>>> MMU notifiers must be suspended until the virtual address
>>>>>>>>>>> space is restored. For user mode queues we need to save and
>>>>>>>>>>> restore a lot of queue execution state so that execution can
>>>>>>>>>>> resume cleanly.
>>>>>>>>>>
>>>>>>>>>> This also sounds justified to me. Creating all the internal
>>>>>>>>>> objects at restore is definitely special, and it sounds better
>>>>>>>>>> to add uapi to create them directly with the correct
>>>>>>>>>> properties than to add uapi to adjust internal properties
>>>>>>>>>> after creation. In either case you would need some new uapi -
>>>>>>>>>> at least to adjust after creation - at which point you may as
>>>>>>>>>> well have both in one. Internally the implementation can be
>>>>>>>>>> split or common, whatever makes sense for a given object type,
>>>>>>>>>> but new uapi definitely sounds required.
>>>>>>>>>>> # Restoring buffer sharing relationships
>>>>>>>>>>>
>>>>>>>>>>> Different GEM handles in different render nodes and processes
>>>>>>>>>>> can refer to the same underlying shared memory, either by
>>>>>>>>>>> directly pointing to the same GEM object, or by creating an
>>>>>>>>>>> import attachment that may get its SG tables invalidated and
>>>>>>>>>>> updated dynamically through dynamic attachment callbacks. In
>>>>>>>>>>> the latter case it's obvious, who is the exporter and who is
>>>>>>>>>>> the importer. In the first case, either one could be the
>>>>>>>>>>> exporter, and it's not clear who would need to create the BO
>>>>>>>>>>> and who would need to
>>>>>>>>>>
>>>>>>>>>> To see if I follow the former case correctly.
>>>>>>>>>>
>>>>>>>>>> This could be two clients A and B, where B has imported a
>>>>>>>>>> dma-buf shared BO from A and has since closed the dma-buf fd?
>>>>>>>>>> Which results in a single BO with reference count of 2 and
>>>>>>>>>> obj->import_attach unset. History of who created the object is
>>>>>>>>>> lost.
>>>>>>>>
>>>>>>>> Yes. In the amdgpu driver this happens when the exporter and
>>>>>>>> import device are the same.
>>>>>>>>
>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> In fact it could even be that two imported objects remain
>>>>>>>>>> (clients A, B and C) and A, who originally created the BO, has
>>>>>>>>>> since fully released it. So any kind of "creator" tracking if
>>>>>>>>>> added wouldn't be fully reliable either.
>>>>>>>>
>>>>>>>> That's a good point.
>>>>>>>>
>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> import it when restoring the checkpoint. To further
>>>>>>>>>>> complicate things, multiple processes in a checkpoint get
>>>>>>>>>>> restored concurrently. So there is no guarantee that an
>>>>>>>>>>> exporter has restored a shared BO at the time an importer is
>>>>>>>>>>> trying to restore its import.
>>>>>>>>>>>
>>>>>>>>>>> A proposal to deal with these problems would be to treat
>>>>>>>>>>> importers and exporters the same. Whoever restores first,
>>>>>>>>>>> ends up creating the BO and potentially attaching to it. The
>>>>>>>>>>> other process(es) can find BOs that were already restored by
>>>>>>>>>>> another process by looking it up with a unique ID that could
>>>>>>>>>>> be based on the DMABuf inode number. An alternative would be
>>>>>>>>>>> a two-pass approach that restores BOs in two passes:
>>>>>>>>>>>
>>>>>>>>>>> 1. Restore exported BOs
>>>>>>>>>>> 2. Restore imports
>>>>>>>>>>>
>>>>>>>>>>> With some inter-process synchronization in CRIU itself
>>>>>>>>>>> between these two passes. This may require changes in the
>>>>>>>>>>> core CRIU, outside our plugin. Both approaches depend on
>>>>>>>>>>> identifying BOs with some unique ID that could be based on
>>>>>>>>>>> the DMABuf inode number in the checkpoint. However, we would
>>>>>>>>>>> need to identify the processes in the same restore session,
>>>>>>>>>>> possibly based on parent/child process relationships, to
>>>>>>>>>>> create a scope where those IDs are valid during restore.
>>>>>>>>>>
>>>>>>>>>> If my understanding above is on the right track, then I think
>>>>>>>>>> this is the only thing which can be done (for all scenarios).
>>>>>>>>
>>>>>>>> I presented two alternatives. I think you're in favor of the
>>>>>>>> first one, where it doesn't matter who is the importer and
>>>>>>>> exporter. I think the two-pass approach requires that you can
>>>>>>>> identify an exporter. And as you pointed out, the exporter may
>>>>>>>> already have dropped their reference to the BO.
>>>>>>>
>>>>>>> Yep.
>>>>>>>
>>>>>>>>>> I also *think* it would be safe. At least at the moment I
>>>>>>>>>> cannot think what could go wrong. Semantics are that it
>>>>>>>>>> doesn't really matter who created the object.
>>>>>>>>
>>>>>>>> I would agree. What matters is that the object is recreated on
>>>>>>>> the correct device, and that all the direct references and
>>>>>>>> import attachments pointing to it are restored.
>>>>>>>>
>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> Finally, we would also need to checkpoint and restore DMABuf
>>>>>>>>>>> file descriptors themselves. These are anonymous file
>>>>>>>>>>> descriptors. The CRIU plugin could probably be taught to
>>>>>>>>>>> recreate them from the original exported BO based on the
>>>>>>>>>>> inode number that could be queried with fstat in the
>>>>>>>>>>> checkpoint. It would need help from the render node CRIU API
>>>>>>>>>>> to find the right BO from the inode, which may be from a
>>>>>>>>>>> different process in the same restore session.
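Querying the inode of an anonymous fd with fstat, as mentioned above, is straightforward. The sketch below uses a pipe fd as a stand-in for an exported DMABuf fd, since creating a real DMABuf needs a GPU and driver ioctls:

```python
# Record the anonymous fd's inode at checkpoint time; both ends of a
# pipe share one pipefs inode, which stands in for two fds referring
# to the same DMABuf.
import os

def dmabuf_unique_id(fd: int) -> int:
    return os.fstat(fd).st_ino

r, w = os.pipe()  # stand-in for an exported DMABuf fd
ino = dmabuf_unique_id(r)
assert ino == os.fstat(w).st_ino  # same anonymous inode
os.close(r); os.close(w)
```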
>>>>>>>>>>
>>>>>>>>>> This part feels like it is breaking the component separation a
>>>>>>>>>> bit because even for buffers fully owned by amdgpu, strictly
>>>>>>>>>> speaking the dma-buf fd is not. At least my understanding from
>>>>>>>>>> the above is that you propose to attempt to import the fd,
>>>>>>>>>> from the kernel side, during the checkpoint process? Like:
>>>>>>>>>>
>>>>>>>>>> Checkpoint:
>>>>>>>>>>
>>>>>>>>>> CRIU for each anon fd:
>>>>>>>>>> amdgpu_plugin(fd)
>>>>>>>>>> -> attempt in kernel dma buf import (passes fd to kernel
>>>>>>>>>> via ioctl?)
>>>>>>>>>> -> is it ours? (no -> error)
>>>>>>>>>> -> create a record mapping fd number to amdgpu BO
>>>>>>>>>>
>>>>>>>>>> Restore:
>>>>>>>>>>
>>>>>>>>>> for each dma-buf fd record:
>>>>>>>>>> create BO if does not exists
>>>>>>>>>> export BO to same fd
>>>>>>>>>> close BO handle if not in regular BO handle records
>>>>>>>>>>
>>>>>>>>>> Or since you mention lookup by inode, that would need to be
>>>>>>>>>> dmabuf_plugin so it can look up inodes in the private mount
>>>>>>>>>> namespace. However, how it would co-operate with amdgpu_plugin
>>>>>>>>>> is not clear to me.
>>>>>>>>
>>>>>>>> The way I think about the ownership is, whichever driver created
>>>>>>>> the underlying BO owns the checkpointing of the dmabuf. You need
>>>>>>>> driver-specific information to link the dmabuf with the driver's
>>>>>>>> BO and you need the right driver to recreate the BO and the
>>>>>>>> dmabuf fd when restoring the checkpoint.
>>>>>>>>
>>>>>>>> It gets really interesting if you have an amdgpu plugin and an
>>>>>>>> i915 plugin, and they checkpoint an application that shares BOs
>>>>>>>> between the two devices through DMABufs. E.g. if i915 created a
>>>>>>>> BO and amdgpu imported it, then during restore, i915 needs to
>>>>>>>> restore the dmabuf before the amdgpu import of it can be
>>>>>>>> restored. I think that brings us back to a two-phase approach to
>>>>>>>> restoring the memory sharing relationships. Uff.
>>>>>>>
>>>>>>> I think this part of the discussion somewhat depends on the
>>>>>>> previous part about idling. If it is feasible to completely idle
>>>>>>> and prune, and fail if that is not happening quickly enough, then
>>>>>>> maybe there wouldn't be too much hierarchical state to save.
>>>>>>>
>>>>>>> Otherwise my idea was that there is a top-level drm_plugin.so
>>>>>>> which understands amdgpu fds, i915, syncobj, sync_file, and uses
>>>>>>> some new uapi to uniquely identify each, associate with the
>>>>>>> correct driver, and then internally dispatches to
>>>>>>> amdgpu|i915|dmabuf|..._plugin.so. Building the in memory
>>>>>>> representation of their relationships. As long as all objects and
>>>>>>> their relationships have been recorded I think everything could
>>>>>>> then be correctly restored.
>>>>>>>
>>>>>>> It is possible there is a flaw in my thinking and that something
>>>>>>> in the CRIU design would make this impossible? I think it would
>>>>>>> require the top-level drm_plugin.so to hold all state in memory
>>>>>>> until the whole checkpointing is done, and then verify nothing is
>>>>>>> incomplete, failing it all if it is. (For instance one plugin
>>>>>>> discovered a reference to an object which was not discovered by
>>>>>>> any other plugin, or things like that.) May need some further
>>>>>>> tweaks to CRIU common code.
>>>>>>>
>>>>>>> Maybe I need to better understand how exactly you mean to query
>>>>>>> the DRM driver about random anonymous fds. I see it as a problem
>>>>>>> in the design, possibly even implementation, but maybe I am
>>>>>>> missing something which makes it not so. I mean even with my
>>>>>>> general idea I don't know how would one determine which driver to
>>>>>>> query about a particular anonymous inode.
>>>>>>>
>>>>>>>>> I later also realised that I was maybe increasing the scope
>>>>>>>>> for you here. :) You did state the focus is ROCm applications,
>>>>>>>>> which possibly don't care about dma-resv, fences, syncobjs etc?
>>>>>>>>
>>>>>>>> That's my focus for now. But I don't want to engineer a solution
>>>>>>>> that would preclude your use cases in the future.
>>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>> But I think the "how to handle dma-bufs" design question is
>>>>>>>>> still relevant and interesting. For example I had this thought
>>>>>>>>> that perhaps what would be needed is a CRIU plugin hierarchy.
>>>>>>>>>
>>>>>>>>> Because fundamentally we would be snapshotting a hierarchy of
>>>>>>>>> kernel objects belonging to different drivers (kfd, amdgpu,
>>>>>>>>> dma-buf, ...). And if one day someone were to try to handle
>>>>>>>>> dma fences and drm syncobjs, the argument for a hierarchical
>>>>>>>>> design would be even stronger I think.
>>>>>>>>>
>>>>>>>>> Something like a drm_plugin.so could call sub-plugins (amdgpu,
>>>>>>>>> dma-buf, sync file, ...) and internally build the
>>>>>>>>> representation of the whole state and the relationships
>>>>>>>>> between the objects.
>>>>>>>>
>>>>>>>> Maybe. I guess a structure similar to libdrm makes sense. I'm
>>>>>>>> not sure it's strictly a hierarchy. Maybe more like some common
>>>>>>>> code shared by multiple GPU driver plugins. I think the common
>>>>>>>> checkpoint state is quite limited and restoring it requires the
>>>>>>>> GPU-specific drivers anyway.
>>>>>>>>
>>>>>>>> Also the idea of building a representation of the whole state
>>>>>>>> doesn't work well with the CRIU design, because "the whole
>>>>>>>> state" can include multiple processes that restore themselves
>>>>>>>> concurrently and only synchronize with each other in a few
>>>>>>>> places in the restore process. I feel, if we can work out how to
>>>>>>>> checkpoint and restore shared objects between processes, we can
>>>>>>>> do the same for shared objects between drivers without imposing
>>>>>>>> a strict hierarchy and some omniscient entity that needs to know
>>>>>>>> "the whole state".
>>>>>>>
>>>>>>> Okay, this continues on the same problem space as above. And you
>>>>>>> obviously know how CRIU works much better than me.
>>>>>>>
>>>>>>>>> With that kind of design there probably would be a need to
>>>>>>>>> define some common kernel side api and uapi, so all involved
>>>>>>>>> objects can be enumerated with some unique ids etc.
>>>>>>>>>
>>>>>>>>> Now.. the counter argument.. the more state from different
>>>>>>>>> drivers one would want to handle, the bigger this project would
>>>>>>>>> get. The question is whether it would even be feasible, to the
>>>>>>>>> point that it may be simpler to just run the workload in a VM
>>>>>>>>> via SR-IOV and simply hibernate the whole guest. :)
>>>>>>>>
>>>>>>>> Well, CRIU kind of tries to do that, but with containers instead
>>>>>>>> of VMs. ;)
>>>>>>>
>>>>>>> It would definitely be useful for hardware and drivers without
>>>>>>> SR-IOV support, so let's hope it is doable. :)
>>>>>>>
>>>>>>> Regards,
>>>>>>>
>>>>>>> Tvrtko
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: Proposal to add CRIU support to DRM render nodes
2024-04-16 14:04 ` Tvrtko Ursulin
@ 2024-05-03 14:44 ` Felix Kuehling
0 siblings, 0 replies; 14+ messages in thread
From: Felix Kuehling @ 2024-05-03 14:44 UTC (permalink / raw)
To: Tvrtko Ursulin, dri-devel, amd-gfx, Dave Airlie, Daniel Vetter, criu
Cc: Errabolu, Ramesh, Christian König
On 2024-04-16 10:04, Tvrtko Ursulin wrote:
>
> On 01/04/2024 18:58, Felix Kuehling wrote:
>>
>> On 2024-04-01 12:56, Tvrtko Ursulin wrote:
>>>
>>> On 01/04/2024 17:37, Felix Kuehling wrote:
>>>> On 2024-04-01 11:09, Tvrtko Ursulin wrote:
>>>>>
>>>>> On 28/03/2024 20:42, Felix Kuehling wrote:
>>>>>>
>>>>>> On 2024-03-28 12:03, Tvrtko Ursulin wrote:
>>>>>>>
>>>>>>> Hi Felix,
>>>>>>>
>>>>>>> I had one more thought while browsing around the amdgpu CRIU plugin. It appears it relies on the KFD support being compiled in and /dev/kfd present, correct? AFAICT at least, it relies on that to figure out the amdgpu DRM node.
>>>>>>>
>>>>>>> It would probably be good to consider designing things without that dependency. So that checkpointing an application which does not use /dev/kfd is possible. Or if the kernel does not even have the KFD support compiled in.
>>>>>>
>>>>>> Yeah, if we want to support graphics apps that don't use KFD, we should definitely do that. Currently we get a lot of topology information from KFD, not even from the /dev/kfd device but from the sysfs nodes exposed by KFD. We'd need to get GPU device info from the render nodes instead. And if KFD is available, we may need to integrate both sources of information.
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> It could perhaps mean no more than adding some GPU discovery code into CRIU. Which should be flexible enough to account for things like re-assigned minor numbers due to driver reload.
>>>>>>
>>>>>> Do you mean adding GPU discovery to the core CRIU, or to the plugin. I was thinking this is still part of the plugin.
>>>>>
>>>>> Yes I agree. I was only thinking about adding some DRM device discovery code in a more decoupled fashion from the current plugin, for both the reason discussed above (decoupling a bit from reliance on kfd sysfs), and then also if/when a new DRM driver might want to implement this, the code could be moved to some common plugin area.
>>>>>
>>>>> I am not sure how feasible that would be though. The "gpu id" concept and its matching in the current kernel code and CRIU plugin - is that value tied to the physical GPU instance, or how does it work?
>>>>
>>>> The concept of the GPU ID is that it's stable while the system is up, even when devices get added and removed dynamically. It was baked into the API early on, but I don't think we ever fully validated device hot plug. I think the closest we're getting is with our latest MI GPUs and dynamic partition mode change.
>>>
>>> Doesn't it read the saved gpu id from the image file while doing restore and try to open the render node to match it? Maybe I am misreading the code.. But if it does, does it imply that in practice it could be stable across reboots? Or that it is not possible to restore to a different instance of maybe the same GPU model installed in a system?
>>
>> Ah, the idea is, that when you restore on a different system, you may get different GPU IDs. Or you may checkpoint an app running on GPU 1 but restore it on GPU 2 on the same system. That's why we need to translate GPU IDs in restored applications. User mode still uses the old GPU IDs, but the kernel mode driver translates them to the actual GPU IDs of the GPUs that the process was restored on.
>
> I see.. I think. The normal flow is that ppd->user_gpu_id is set during client init, but for restored clients it gets overridden during restore so that further ioctls do not instantly fail.
>
> And then in amdgpu_plugin_restore_file, when it is opening the render node, it relies on the kfd topology to have filled in (more or less) the target_gpu_id corresponding to the render node gpu id of the target GPU - the one associated with the new kfd gpu_id?
Yes.
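As a toy model of that translation (names loosely follow the ppd->user_gpu_id field mentioned above; this is illustrative Python, not the KFD code):

```python
# User mode keeps using the checkpoint-time gpu_ids; the kernel maps
# them to the ids of the GPUs the process was actually restored on.
class ProcessDeviceData:
    def __init__(self, user_gpu_id, actual_gpu_id):
        self.user_gpu_id = user_gpu_id      # id baked into the checkpoint
        self.actual_gpu_id = actual_gpu_id  # id assigned at restore time

def translate_gpu_id(ppds, user_gpu_id):
    for ppd in ppds:
        if ppd.user_gpu_id == user_gpu_id:
            return ppd.actual_gpu_id
    # Without the override, ioctls from a restored client would fail here.
    raise ValueError("unknown gpu_id 0x%x" % user_gpu_id)

# Checkpointed on GPU 0x1111, restored onto the GPU now known as 0x2222.
ppds = [ProcessDeviceData(0x1111, 0x2222)]
assert translate_gpu_id(ppds, 0x1111) == 0x2222
```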
>
> I am digging into this because I am trying to see if some part of GPU discovery could somehow be decoupled.. so I could offer to work on at least that until you start to tackle the main body of the feature. But it looks properly tangled up.
OK. Most of the interesting plugin code should be in amdgpu_plugin_topology.c. It currently has some pretty complicated logic to find a set of devices that matches the topology in the checkpoint, including shader ISA versions, numbers of compute units, memory sizes, firmware versions and IO-Links between GPUs. This was originally done to support P2P with XGMI links. I'm not sure we ever updated it to properly support PCIe P2P.
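A much-simplified sketch of that kind of matching, to illustrate the shape of the problem (the property names are made up, and the real logic in amdgpu_plugin_topology.c also checks firmware versions and IO-links between GPUs):

```python
# Find an assignment of checkpointed GPUs to locally present GPUs whose
# properties (ISA, CU count, memory size, ...) are compatible. Brute
# force over permutations; restore fails if no assignment exists.
from itertools import permutations

def compatible(src, dst):
    return (src["isa"] == dst["isa"]
            and dst["cu_count"] >= src["cu_count"]
            and dst["mem_bytes"] >= src["mem_bytes"])

def match_topology(checkpoint_gpus, local_gpus):
    for perm in permutations(local_gpus, len(checkpoint_gpus)):
        if all(compatible(s, d) for s, d in zip(checkpoint_gpus, perm)):
            return {s["gpu_id"]: d["gpu_id"]
                    for s, d in zip(checkpoint_gpus, perm)}
    return None  # no compatible device set on this system

ckpt = [{"gpu_id": 1, "isa": "gfx90a", "cu_count": 104, "mem_bytes": 64 << 30}]
local = [{"gpu_id": 7, "isa": "gfx90a", "cu_count": 110, "mem_bytes": 64 << 30}]
assert match_topology(ckpt, local) == {1: 7}
```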
>
> Do you have any suggestions for what I could help with? Maybe developing some sort of drm device enumeration library, if you see a way that would be useful in decoupling the device discovery from kfd. We would need to define what sort of information you would need to be queryable from it.
Maybe. I think a lot of device information is available with some amdgpu info-ioctl. It may not cover all the things we're checking in the KFD topology, though.
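A KFD-independent discovery pass could start from the render nodes themselves, e.g. walking /dev/dri and reading PCI ids from the DRM sysfs nodes. Purely an illustrative sketch; a real version would additionally use driver info ioctls (such as amdgpu's INFO ioctl) for memory sizes and the like:

```python
# Enumerate DRM render nodes without touching /sys/class/kfd. Render
# node minors start at 128; vendor id comes from the device's sysfs.
import glob, os

def enumerate_render_nodes():
    nodes = []
    for path in sorted(glob.glob("/dev/dri/renderD*")):
        minor = int(path.rsplit("renderD", 1)[1])
        vendor_path = "/sys/class/drm/renderD%d/device/vendor" % minor
        vendor = None
        if os.path.exists(vendor_path):
            with open(vendor_path) as f:
                vendor = f.read().strip()
        nodes.append({"path": path, "minor": minor, "vendor": vendor})
    return nodes

for node in enumerate_render_nodes():
    print(node["path"], node["minor"], node["vendor"])
```

On a machine with no GPU this simply returns an empty list, which is also the failure mode a plugin would have to handle.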
>
>>>> This also highlights another aspect on those spatially partitioned GPUs. GPU IDs identify device partitions, not devices. Similarly, each partition has its own render node, and the KFD topology info in sysfs points to the render-minor number corresponding to each GPU ID.
>>>
>>> I am not familiar with this. This is not SR-IOV but some other kind of partitioning? Would you have any links where I could read more?
>>
>> Right, the bare-metal driver can partition a PF spatially without SRIOV. SRIOV can also use spatial partitioning and expose each partition through its own VF, but that's not useful for bare metal. Spatial partitioning is new in MI300. There is some high-level info in this whitepaper: https://www.amd.com/content/dam/amd/en/documents/instinct-tech-docs/white-papers/amd-cdna-3-white-paper.pdf.
>
> From the outside (userspace) this looks simply like multiple DRM render nodes or how exactly?
Yes, that's correct. Each partition has its own render node and its own node in the KFD topology.
Regards,
Felix
>
> Regards,
>
> Tvrtko
>
>>
>> Regards,
>> Felix
>>
>>
>>>
>>> Regards,
>>>
>>> Tvrtko
>>>
>>>>>>> Otherwise I am eagerly awaiting to hear more about the design specifics around dma-buf handling. And also seeing how to extend to other DRM related anonymous fds.
>>>>>>
>>>>>> I've been pretty far under-water lately. I hope I'll find time to work on this more, but it's probably going to be at least a few weeks.
>>>>>
>>>>> Got it.
>>>>>
>>>>> Regards,
>>>>>
>>>>> Tvrtko
>>>>>
>>>>>>
>>>>>> Regards,
>>>>>> Felix
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> Regards,
>>>>>>>
>>>>>>> Tvrtko
>>>>>>>
>>>>>>> On 15/03/2024 18:36, Tvrtko Ursulin wrote:
>>>>>>>>
>>>>>>>> On 15/03/2024 02:33, Felix Kuehling wrote:
>>>>>>>>>
>>>>>>>>> On 2024-03-12 5:45, Tvrtko Ursulin wrote:
>>>>>>>>>>
>>>>>>>>>> On 11/03/2024 14:48, Tvrtko Ursulin wrote:
>>>>>>>>>>>
>>>>>>>>>>> Hi Felix,
>>>>>>>>>>>
>>>>>>>>>>> On 06/12/2023 21:23, Felix Kuehling wrote:
>>>>>>>>>>>> Executive Summary: We need to add CRIU support to DRM render nodes in order to maintain CRIU support for ROCm application once they start relying on render nodes for more GPU memory management. In this email I'm providing some background why we are doing this, and outlining some of the problems we need to solve to checkpoint and restore render node state and shared memory (DMABuf) state. I have some thoughts on the API design, leaning on what we did for KFD, but would like to get feedback from the DRI community regarding that API and to what extent there is interest in making that generic.
>>>>>>>>>>>>
>>>>>>>>>>>> We are working on using DRM render nodes for virtual address mappings in ROCm applications to implement the CUDA11-style VM API and improve interoperability between graphics and compute. This uses DMABufs for sharing buffer objects between KFD and multiple render node devices, as well as between processes. In the long run this also provides a path to moving all or most memory management from the KFD ioctl API to libdrm.
>>>>>>>>>>>>
>>>>>>>>>>>> Once ROCm user mode starts using render nodes for virtual address management, that creates a problem for checkpointing and restoring ROCm applications with CRIU. Currently there is no support for checkpointing and restoring render node state, other than CPU virtual address mappings. Support will be needed for checkpointing GEM buffer objects and handles, their GPU virtual address mappings and memory sharing relationships between devices and processes.
>>>>>>>>>>>>
>>>>>>>>>>>> Eventually, if full CRIU support for graphics applications is desired, more state would need to be captured, including scheduler contexts and BO lists. Most of this state is driver-specific.
>>>>>>>>>>>>
>>>>>>>>>>>> After some internal discussions we decided to take our design process public as this potentially touches DRM GEM and DMABuf APIs and may have implications for other drivers in the future.
>>>>>>>>>>>>
>>>>>>>>>>>> One basic question before going into any API details: Is there a desire to have CRIU support for other DRM drivers?
>>>>>>>>>>>
>>>>>>>>>>> This sounds like a very interesting feature on the overall, although I cannot answer on the last question here.
>>>>>>>>>>
>>>>>>>>>> I forgot to finish this thought. I cannot answer / don't know of any concrete plans, but I think the feature is pretty cool and if amdgpu gets it working I wouldn't be surprised if other drivers got interested.
>>>>>>>>>
>>>>>>>>> Thanks, that's good to hear!
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> Funnily enough, it has a tiny relation to an i915 feature I recently implemented on Mesa's request, which is to be able to "upload" the GPU context from the GPU hang error state and replay the hanging request. It is kind of (at a stretch) a very special tiny subset of checkpoint and restore, so I am mentioning it only as a curiosity.
>>>>>>>>>>>
>>>>>>>>>>> And there is also another partial conceptual intersection with the (at the moment not yet upstream) i915 online debugger. This part being in the area of discovering and enumerating GPU resources belonging to the client.
>>>>>>>>>>>
>>>>>>>>>>> I don't see an immediate design or code sharing opportunities though but just mentioning.
>>>>>>>>>>>
>>>>>>>>>>> I did spend some time reading your plugin and kernel implementation out of curiosity and have some comments and questions.
>>>>>>>>>>>
>>>>>>>>>>>> With that out of the way, some considerations for a possible DRM CRIU API (either generic of AMDGPU driver specific): The API goes through several phases during checkpoint and restore:
>>>>>>>>>>>>
>>>>>>>>>>>> Checkpoint:
>>>>>>>>>>>>
>>>>>>>>>>>> 1. Process-info (enumerates objects and sizes so user mode can allocate
>>>>>>>>>>>> memory for the checkpoint, stops execution on the GPU)
>>>>>>>>>>>> 2. Checkpoint (store object metadata for BOs, queues, etc.)
>>>>>>>>>>>> 3. Unpause (resumes execution after the checkpoint is complete)
>>>>>>>>>>>>
>>>>>>>>>>>> Restore:
>>>>>>>>>>>>
>>>>>>>>>>>> 1. Restore (restore objects, VMAs are not in the right place at this time)
>>>>>>>>>>>> 2. Resume (final fixups after the VMAs are sorted out, resume execution)
>>>>>>>>>>>
>>>>>>>>>>> Btw is check-pointing guaranteeing all relevant activity is idled? For instance dma_resv objects are free of fences which would need to be restored for things to continue executing sensibly? Or how is that handled?
>>>>>>>>>
>>>>>>>>> In our compute use cases, we suspend user mode queues. This can include CWSR (compute-wave-save-restore) where the state of in-flight waves is stored in memory and can be reloaded and resumed from memory later. We don't use any fences other than "eviction fences", that are signaled after the queues are suspended. And those fences are never handed to user mode. So we don't need to worry about any fence state in the checkpoint.
>>>>>>>>>
>>>>>>>>> If we extended this to support the kernel mode command submission APIs, I would expect that we'd wait for all current submissions to complete, and stop new ones from being sent to the HW before taking the checkpoint. When we take the checkpoint in the CRIU plugin, the CPU threads are already frozen and cannot submit any more work. If we wait for all currently pending submissions to drain, I think we don't need to save any fence state because all the fences will have signaled. (I may be missing some intricacies and I'm afraid it may not be that simple in reality, but that's my opening bid. ;)
>>>>>>>>
>>>>>>>> It feels feasible to me too, for the normally behaving clients at least.
>>>>>>>>
>>>>>>>> Presumably, given that the whole checkpointing is not instant, it would be okay to wait a second or two longer for the in-progress submissions to complete. After which the kernel would need to prune all signalled fences from the respective container objects before checkpointing.
>>>>>>>>
>>>>>>>> For the "misbehaving" clients who have perhaps queued up too much work, either still in the scheduler with unsatisfied dependencies, or already submitted to the hardware and/or driver backend, is there a timeout concept in CRIU so it would be possible to say something like "try to checkpoint, but if the kernel says no within time period t then give up"?
>>>>>>>>
>>>>>>>>>>>> For some more background about our implementation in KFD, you can refer to this whitepaper: https://github.com/checkpoint-restore/criu/blob/criu-dev/plugins/amdgpu/README.md
>>>>>>>>>>>>
>>>>>>>>>>>> Potential objections to a KFD-style CRIU API in DRM render nodes, I'll address each of them in more detail below:
>>>>>>>>>>>>
>>>>>>>>>>>> * Opaque information in the checkpoint data that user mode can't
>>>>>>>>>>>> interpret or do anything with
>>>>>>>>>>>> * A second API for creating objects (e.g. BOs) that is separate from
>>>>>>>>>>>> the regular BO creation API
>>>>>>>>>>>> * Kernel mode would need to be involved in restoring BO sharing
>>>>>>>>>>>> relationships rather than replaying BO creation, export and import
>>>>>>>>>>>> from user mode
>>>>>>>>>>>>
>>>>>>>>>>>> # Opaque information in the checkpoint
>>>>>>>>>>>>
>>>>>>>>>>>> This comes out of ABI compatibility considerations. Adding any new objects or attributes to the driver/HW state that needs to be checkpointed could potentially break the ABI of the CRIU checkpoint/restore ioctl if the plugin needs to parse that information. Therefore, much of the information in our KFD CRIU ioctl API is opaque. It is written by kernel mode in the checkpoint, it is consumed by kernel mode when restoring the checkpoint, but user mode doesn't care about the contents or binary layout, so there is no user mode ABI to break. This is how we were able to maintain CRIU support when we added the SVM API to KFD without changing the CRIU plugin and without breaking our ABI.
>>>>>>>>>>>>
>>>>>>>>>>>> Opaque information may also lend itself to API abstraction, if this becomes a generic DRM API with driver-specific callbacks that fill in HW-specific opaque data.
>>>>>>>>>>>
>>>>>>>>>>> This feels sound in principle to me. Fundamentally the state is very hardware specific, and/or driver version specific, so I don't see anything that could be gained in practice by making it much less opaque. (Apart from making things more complicated.)
>>>>>>>>>>>
>>>>>>>>>>> I was however unsure of the current split of how you dump buffer objects with some data in the defined bo structure, and some in completely opaque private data. Is there a benefit to that split, or maybe in other words, is there a benefit on having part transparent and part opaque for buffer objects?
>>>>>>>>>
>>>>>>>>> Some of the buffer object state is needed by the plugin. E.g. the size and mmap offset are needed to match VMAs with BOs. I'd have to review the plugin in detail to prove that all the fields are, in fact, needed by the plugin, but that was the intent. Anything that the plugin doesn't need to know should be in the opaque data structures.
>>>>>>>>
>>>>>>>> Right, got it.
>>>>>>>>
>>>>>>>> Would it make sense to make the opaque data in the same block as the defined one? I mean instead of separating the two in the binary image for instance have struct kfd_criu_bo_bucket have a trailing priv_data blob? Maybe it is too late now if the image format is not versioned or something.
>>>>>>>>
>>>>>>>>>>> To slightly touch upon the question of whether this could become a generic DRM API. It feels it would be hard to do it from the start. What sounds more feasible is if/when generic looking helpers can be spotted while developing the RFC then potentially structure the code they can easily be promoted to shared/common at some future moment.
>>>>>>>>>
>>>>>>>>> Yes, that's how this usually goes, in my experience. Thanks for confirming.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>> # Second API for creating objects
>>>>>>>>>>>>
>>>>>>>>>>>> Creating BOs and other objects when restoring a checkpoint needs more information than the usual BO alloc and similar APIs provide. For example, we need to restore BOs with the same GEM handles so that user mode can continue using those handles after resuming execution. If BOs are shared through DMABufs without dynamic attachment, we need to restore pinned BOs as pinned. Validation of virtual addresses and handling MMU notifiers must be suspended until the virtual address space is restored. For user mode queues we need to save and restore a lot of queue execution state so that execution can resume cleanly.
>>>>>>>>>>>
>>>>>>>>>>> This also sounds justified to me. Restore creating all internal objects is definitely special, and it sounds better to add uapi to create them directly with the correct properties than to add uapi to adjust internal properties after creation. And in any case you would need some new uapi - at least to adjust after creation. At which point you may have both in one. Internally the implementation can be split or common, whatever makes sense for a given object type, but new uapi definitely sounds required.
>>>>>>>>>>>> # Restoring buffer sharing relationships
>>>>>>>>>>>>
>>>>>>>>>>>> Different GEM handles in different render nodes and processes can refer to the same underlying shared memory, either by directly pointing to the same GEM object, or by creating an import attachment that may get its SG tables invalidated and updated dynamically through dynamic attachment callbacks. In the latter case it's obvious who is the exporter and who is the importer. In the first case, either one could be the exporter, and it's not clear who would need to create the BO and who would need to
>>>>>>>>>>>
>>>>>>>>>>> To see if I follow the former case correctly.
>>>>>>>>>>>
>>>>>>>>>>> This could be two clients A and B, where B has imported a dma-buf shared BO from A and has since closed the dma-buf fd? Which results in a single BO with reference count of 2 and obj->import_attach unset. History of who created the object is lost.
>>>>>>>>>
>>>>>>>>> Yes. In the amdgpu driver this happens when the exporter and import device are the same.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> In fact it could even be that two imported objects remain (clients A, B and C) and A, who originally created the BO, has since fully released it. So any kind of "creator" tracking if added wouldn't be fully reliable either.
>>>>>>>>>
>>>>>>>>> That's a good point.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>> import it when restoring the checkpoint. To further complicate things, multiple processes in a checkpoint get restored concurrently. So there is no guarantee that an exporter has restored a shared BO at the time an importer is trying to restore its import.
>>>>>>>>>>>>
>>>>>>>>>>>> A proposal to deal with these problems would be to treat importers and exporters the same. Whoever restores first ends up creating the BO and potentially attaching to it. The other process(es) can find BOs that were already restored by another process by looking them up with a unique ID that could be based on the DMABuf inode number. An alternative would be a two-pass approach that needs to restore BOs in two passes:
>>>>>>>>>>>>
>>>>>>>>>>>> 1. Restore exported BOs
>>>>>>>>>>>> 2. Restore imports
>>>>>>>>>>>>
>>>>>>>>>>>> With some inter-process synchronization in CRIU itself between these two passes. This may require changes in the core CRIU, outside our plugin. Both approaches depend on identifying BOs with some unique ID that could be based on the DMABuf inode number in the checkpoint. However, we would need to identify the processes in the same restore session, possibly based on parent/child process relationships, to create a scope where those IDs are valid during restore.
>>>>>>>>>>>
>>>>>>>>>>> If my understanding above is on the right track, then I think this is the only thing which can be done (for all scenarios).
>>>>>>>>>
>>>>>>>>> I presented two alternatives. I think you're in favor of the first one, where it doesn't matter who is the importer and exporter. I think the two-pass approach requires that you can identify an exporter. And as you pointed out, the exporter may already have dropped their reference to the BO.
>>>>>>>>
>>>>>>>> Yep.
>>>>>>>>
>>>>>>>>>>> I also *think* it would be safe. At least at the moment I cannot think what could go wrong. Semantics are that it doesn't really matter who created the object.
>>>>>>>>>
>>>>>>>>> I would agree. What matters is that the object is recreated on the correct device, and that all the direct references and import attachments pointing to it are restored.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>> Finally, we would also need to checkpoint and restore DMABuf file descriptors themselves. These are anonymous file descriptors. The CRIU plugin could probably be taught to recreate them from the original exported BO based on the inode number that could be queried with fstat in the checkpoint. It would need help from the render node CRIU API to find the right BO from the inode, which may be from a different process in the same restore session.
>>>>>>>>>>>
>>>>>>>>>>> This part feels like it is breaking the component separation a bit because even for buffers fully owned by amdgpu, strictly speaking the dma-buf fd is not. At least my understanding from the above is that you propose to attempt to import the fd, from the kernel side, during the checkpoint process? Like:
>>>>>>>>>>>
>>>>>>>>>>> Checkpoint:
>>>>>>>>>>>
>>>>>>>>>>> CRIU for each anon fd:
>>>>>>>>>>> amdgpu_plugin(fd)
>>>>>>>>>>> -> attempt in kernel dma buf import (passes fd to kernel via ioctl?)
>>>>>>>>>>> -> is it ours? (no -> error)
>>>>>>>>>>> -> create a record mapping fd number to amdgpu BO
>>>>>>>>>>>
>>>>>>>>>>> Restore:
>>>>>>>>>>>
>>>>>>>>>>> for each dma-buf fd record:
>>>>>>>>>>> create BO if does not exists
>>>>>>>>>>> export BO to same fd
>>>>>>>>>>> close BO handle if not in regular BO handle records
>>>>>>>>>>>
>>>>>>>>>>> Or since you mention lookup by inode, that would need to be dmabuf_plugin so it can look up inodes in the private mount namespace. However, how it would co-operate with amdgpu_plugin is not clear to me.
>>>>>>>>>
>>>>>>>>> The way I think about the ownership is, whichever driver created the underlying BO owns the checkpointing of the dmabuf. You need driver-specific information to link the dmabuf with the driver's BO and you need the right driver to recreate the BO and the dmabuf fd when restoring the checkpoint.
>>>>>>>>>
>>>>>>>>> It gets really interesting if you have an amdgpu plugin and an i915 plugin, and they checkpoint an application that shares BOs between the two devices through DMABufs. E.g. if i915 created a BO and amdgpu imported it, then during restore, i915 needs to restore the dmabuf before the amdgpu import of it can be restored. I think that brings us back to a two-phase approach to restoring the memory sharing relationships. Uff.
>>>>>>>>
>>>>>>>> I think this part of the discussion somewhat depends on the previous part about idling. If it is feasible to completely idle and prune, and fail if that is not happening quickly enough, then maybe there wouldn't be too much hierarchical state to save.
>>>>>>>>
>>>>>>>> Otherwise my idea was that there is a top-level drm_plugin.so which understands amdgpu fds, i915, syncobj, sync_file, and uses some new uapi to uniquely identify each, associate with the correct driver, and then internally dispatches to amdgpu|i915|dmabuf|..._plugin.so. Building the in memory representation of their relationships. As long as all objects and their relationships have been recorded I think everything could then be correctly restored.
>>>>>>>>
>>>>>>>> It is possible there is a flaw in my thinking and that something in the CRIU design would make this impossible? I think it would require the top-level drm_plugin.so to hold all state in memory until the whole checkpointing is done, and then verify nothing is incomplete, failing it all if it is. (For instance one plugin discovered a reference to an object which was not discovered by any other plugin, or things like that.) May need some further tweaks to CRIU common code.
>>>>>>>>
>>>>>>>> Maybe I need to better understand how exactly you mean to query the DRM driver about random anonymous fds. I see it as a problem in the design, possibly even the implementation, but maybe I am missing something which makes it not so. I mean, even with my general idea I don't know how one would determine which driver to query about a particular anonymous inode.
>>>>>>>>
>>>>>>>>>> I later also realised that I was maybe increasing the scope for you here. :) You did state the focus is ROCm applications, which possibly don't care about dma-resv, fences, syncobjs etc?
>>>>>>>>>
>>>>>>>>> That's my focus for now. But I don't want to engineer a solution that would preclude your use cases in the future.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> But I think the "how to handle dma-bufs" design question is still relevant and interesting. For example I had this thought that perhaps what would be needed is a CRIU plugin hierarchy.
>>>>>>>>>>
>>>>>>>>>> Because fundamentally we would be snapshotting a hierarchy of kernel objects belonging to different drivers (kfd, amdgpu, dma-buf, ...). And if one day someone were to try to handle dma fences and drm syncobjs, the argument for a hierarchical design would be even stronger, I think.
>>>>>>>>>>
>>>>>>>>>> Something like a drm_plugin.so could call sub-plugins (amdgpu, dma-buf, sync file, ...) and internally build a representation of the whole state and the relationships between the objects.
>>>>>>>>>
>>>>>>>>> Maybe. I guess a structure similar to libdrm makes sense. I'm not sure it's strictly a hierarchy. Maybe more like some common code shared by multiple GPU driver plugins. I think the common checkpoint state is quite limited and restoring it requires the GPU-specific drivers anyway.
>>>>>>>>>
>>>>>>>>> Also the idea of building a representation of the whole state doesn't work well with the CRIU design, because "the whole state" can include multiple processes that restore themselves concurrently and only synchronize with each other in a few places in the restore process. I feel, if we can work out how to checkpoint and restore shared objects between processes, we can do the same for shared objects between drivers without imposing a strict hierarchy and some omniscient entity that needs to know "the whole state".
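That decentralized model can be sketched as follows (illustrative Python; the names are hypothetical and threads stand in for the independently restoring processes): the importing side simply blocks at one of those few synchronization points until the exporting side has published the restored object, so no central entity ever needs to hold "the whole state".

```python
import threading

class SharedObjectTable:
    """Hypothetical cross-process rendezvous point (modeled here with
    threads): importers block until the exporting process has restored
    and published the shared object."""

    def __init__(self):
        self._cond = threading.Condition()
        self._objs = {}

    def publish(self, obj_id, fd):
        # Called by the restoring exporter once its object exists again.
        with self._cond:
            self._objs[obj_id] = fd
            self._cond.notify_all()

    def wait_for(self, obj_id, timeout=5.0):
        # Called by any importer; blocks until the object is published
        # or the restore is declared failed after a timeout.
        with self._cond:
            if not self._cond.wait_for(lambda: obj_id in self._objs,
                                       timeout):
                raise TimeoutError(f"object {obj_id} never restored")
            return self._objs[obj_id]
```

Each process only knows the identifiers of the objects it shares, which mirrors how CRIU restores proceed concurrently and synchronize only where they must.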
>>>>>>>>
>>>>>>>> Okay, this continues on the same problem space as above. And you obviously know how CRIU works much better than me.
>>>>>>>>
>>>>>>>>>> With that kind of design there probably would be a need to define some common kernel side api and uapi, so all involved objects can be enumerated with some unique ids etc.
>>>>>>>>>>
>>>>>>>>>> Now.. the counter argument.. the more state from different drivers one would want to handle, the bigger this project would get. Whether it would even be feasible is the question, to the point that it may be simpler to just run the workload in a VM via SR-IOV and simply hibernate the whole thin guest. :)
>>>>>>>>>
>>>>>>>>> Well, CRIU kind of tries to do that, but with containers instead of VMs. ;)
>>>>>>>>
>>>>>>>> It would definitely be useful for hardware and drivers without SR-IOV support, so let's hope it is doable. :)
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>>
>>>>>>>> Tvrtko
Thread overview: 14+ messages
2023-12-06 21:23 Proposal to add CRIU support to DRM render nodes Felix Kuehling
2024-01-15 18:58 ` Felix Kuehling
2024-03-11 14:48 ` Tvrtko Ursulin
2024-03-12 9:45 ` Tvrtko Ursulin
2024-03-15 2:33 ` Felix Kuehling
2024-03-15 18:36 ` Tvrtko Ursulin
2024-03-28 16:03 ` Tvrtko Ursulin
2024-03-28 20:42 ` Felix Kuehling
2024-04-01 15:09 ` Tvrtko Ursulin
2024-04-01 16:37 ` Felix Kuehling
2024-04-01 16:56 ` Tvrtko Ursulin
2024-04-01 17:58 ` Felix Kuehling
2024-04-16 14:04 ` Tvrtko Ursulin
2024-05-03 14:44 ` Felix Kuehling