* [PATCH 00/18] CHECKPOINT RESTORE WITH ROCm
@ 2021-08-19 13:36 David Yat Sin
From: David Yat Sin @ 2021-08-19 13:36 UTC (permalink / raw)
  To: amd-gfx; +Cc: felix.kuehling, rajneesh.bhardwaj, David Yat Sin

CRIU is a user space tool that is widely used for container live migration in datacentres. It can checkpoint a running application, saving its complete state, memory contents and system resources to images on disk, which can then be migrated to another machine and restored later. More information on CRIU can be found at https://criu.org/Main_Page

CRIU currently does not support Checkpoint / Restore for applications that have device files open, so it cannot checkpoint and restore applications using GPU devices, which are very complex and have their own privately managed VRAM. CRIU can, however, support external devices through a plugin architecture. This patch series adds initial support for ROCm applications while we work on the remaining features. We welcome feedback, especially with regard to the APIs, before involving a larger audience.

Our plugin code can be found at https://github.com/RadeonOpenCompute/criu/tree/criu-dev/plugins/amdgpu

We have tested the following scenarios:
- Checkpoint / Restore of a PyTorch (BERT) workload
- kfdtests with queues and events
- Gfx9 and Gfx10 based multi GPU test systems
- On bare metal and inside a Docker container
- Restoring on a different system

David Yat Sin (9):
  drm/amdkfd: CRIU Implement KFD pause ioctl
  drm/amdkfd: CRIU add queues support
  drm/amdkfd: CRIU restore queue ids
  drm/amdkfd: CRIU restore sdma id for queues
  drm/amdkfd: CRIU restore queue doorbell id
  drm/amdkfd: CRIU dump and restore queue mqds
  drm/amdkfd: CRIU dump/restore queue control stack
  drm/amdkfd: CRIU dump and restore events
  drm/amdkfd: CRIU implement gpu_id remapping

Rajneesh Bhardwaj (9):
  x86/configs: CRIU update release defconfig
  x86/configs: CRIU update debug rock defconfig
  drm/amdkfd: CRIU Introduce Checkpoint-Restore APIs
  drm/amdkfd: CRIU Implement KFD process_info ioctl
  drm/amdkfd: CRIU Implement KFD dumper ioctl
  drm/amdkfd: CRIU Implement KFD restore ioctl
  drm/amdkfd: CRIU Implement KFD resume ioctl
  Revert "drm/amdgpu: Remove verify_access shortcut for KFD BOs"
  drm/amdkfd: CRIU export kfd bos as prime dmabuf objects

 arch/x86/configs/rock-dbg_defconfig           |   53 +-
 arch/x86/configs/rock-rel_defconfig           |   13 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h    |    5 +-
 .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c  |   51 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c       |   27 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.h       |    2 +
 drivers/gpu/drm/amd/amdkfd/kfd_chardev.c      | 1730 +++++++++++++++--
 drivers/gpu/drm/amd/amdkfd/kfd_dbgdev.c       |    2 +-
 .../drm/amd/amdkfd/kfd_device_queue_manager.c |  187 +-
 .../drm/amd/amdkfd/kfd_device_queue_manager.h |   14 +-
 drivers/gpu/drm/amd/amdkfd/kfd_events.c       |  254 ++-
 drivers/gpu/drm/amd/amdkfd/kfd_mqd_manager.h  |   11 +
 .../gpu/drm/amd/amdkfd/kfd_mqd_manager_cik.c  |   76 +
 .../gpu/drm/amd/amdkfd/kfd_mqd_manager_v10.c  |   78 +
 .../gpu/drm/amd/amdkfd/kfd_mqd_manager_v9.c   |   86 +
 .../gpu/drm/amd/amdkfd/kfd_mqd_manager_vi.c   |   77 +
 drivers/gpu/drm/amd/amdkfd/kfd_priv.h         |  140 +-
 drivers/gpu/drm/amd/amdkfd/kfd_process.c      |   69 +-
 .../amd/amdkfd/kfd_process_queue_manager.c    |   72 +-
 include/uapi/linux/kfd_ioctl.h                |  110 +-
 20 files changed, 2743 insertions(+), 314 deletions(-)

-- 
2.17.1



Thread overview: 25+ messages
2021-08-19 13:36 [PATCH 00/18] CHECKPOINT RESTORE WITH ROCm David Yat Sin
2021-08-19 13:36 ` [PATCH 01/18] x86/configs: CRIU update release defconfig David Yat Sin
2021-08-19 13:36 ` [PATCH 02/18] x86/configs: CRIU update debug rock defconfig David Yat Sin
2021-08-19 13:36 ` [PATCH 03/18] drm/amdkfd: CRIU Introduce Checkpoint-Restore APIs David Yat Sin
2021-08-23 18:57   ` Felix Kuehling
2021-08-19 13:36 ` [PATCH 04/18] drm/amdkfd: CRIU Implement KFD process_info ioctl David Yat Sin
2021-08-19 13:37 ` [PATCH 05/18] drm/amdkfd: CRIU Implement KFD dumper ioctl David Yat Sin
2021-08-23 18:53   ` Felix Kuehling
2021-08-19 13:37 ` [PATCH 06/18] drm/amdkfd: CRIU Implement KFD restore ioctl David Yat Sin
2021-08-19 13:37 ` [PATCH 07/18] drm/amdkfd: CRIU Implement KFD resume ioctl David Yat Sin
2021-08-19 13:37 ` [PATCH 08/18] drm/amdkfd: CRIU Implement KFD pause ioctl David Yat Sin
2021-08-19 13:37 ` [PATCH 09/18] drm/amdkfd: CRIU add queues support David Yat Sin
2021-08-23 18:29   ` Felix Kuehling
2021-08-19 13:37 ` [PATCH 10/18] drm/amdkfd: CRIU restore queue ids David Yat Sin
2021-08-23 18:29   ` Felix Kuehling
2021-08-19 13:37 ` [PATCH 11/18] drm/amdkfd: CRIU restore sdma id for queues David Yat Sin
2021-08-19 13:37 ` [PATCH 12/18] drm/amdkfd: CRIU restore queue doorbell id David Yat Sin
2021-08-19 13:37 ` [PATCH 13/18] drm/amdkfd: CRIU dump and restore queue mqds David Yat Sin
2021-08-19 13:37 ` [PATCH 14/18] drm/amdkfd: CRIU dump/restore queue control stack David Yat Sin
2021-08-19 13:37 ` [PATCH 15/18] drm/amdkfd: CRIU dump and restore events David Yat Sin
2021-08-23 18:39   ` Felix Kuehling
2021-08-19 13:37 ` [PATCH 16/18] drm/amdkfd: CRIU implement gpu_id remapping David Yat Sin
2021-08-23 18:48   ` Felix Kuehling
2021-08-19 13:37 ` [PATCH 17/18] Revert "drm/amdgpu: Remove verify_access shortcut for KFD BOs" David Yat Sin
2021-08-19 13:37 ` [PATCH 18/18] drm/amdkfd: CRIU export kfd bos as prime dmabuf objects David Yat Sin
