* [Patch v4 00/24] CHECKPOINT RESTORE WITH ROCm
@ 2021-12-23  0:36 Rajneesh Bhardwaj
  2021-12-23  0:36 ` [Patch v4 01/24] x86/configs: CRIU update debug rock defconfig Rajneesh Bhardwaj
                   ` (23 more replies)
  0 siblings, 24 replies; 39+ messages in thread
From: Rajneesh Bhardwaj @ 2021-12-23  0:36 UTC (permalink / raw)
  To: amd-gfx, dri-devel
  Cc: daniel.vetter, felix.kuehling, Rajneesh Bhardwaj,
	alexander.deucher, airlied, christian.koenig

CRIU is a user space tool which is very popular for container live
migration in datacentres. It can checkpoint a running application, save
its complete state, memory contents and all system resources to images
on disk which can be migrated to another machine and restored later.
More information on CRIU can be found at https://criu.org/Main_Page

CRIU currently does not support checkpoint / restore of applications
that have device files open, so it cannot checkpoint and restore GPU
devices, which are very complex and have their own privately managed
VRAM. CRIU can, however, support external devices through a plugin
architecture. We feel that we are getting close to finalizing our IOCTL
APIs, which have changed again since V3 for an improved modular design.
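
As a rough illustration of the plugin idea only, the sketch below shows
how a checkpoint tool might hand device files it cannot handle itself to
a registered plugin. Every name in it (struct ext_plugin, find_plugin,
dump_device_file) and the device paths in the table are assumptions made
up for this example; it is not CRIU's plugin interface and not the KFD
ioctl API from this series.

```c
/* Toy model of plugin-style dispatch. All names are hypothetical and
 * do not correspond to CRIU's or the KFD driver's real interfaces. */
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Callbacks a device plugin supplies for files the core cannot handle. */
struct ext_plugin {
	const char *path_prefix;          /* device files this plugin claims */
	int (*dump)(const char *path);    /* save device state; 0 on success */
	int (*restore)(const char *path); /* recreate device state; 0 on success */
};

/* Stand-ins for the real work (saving VRAM, queues, events, ...). */
static int gpu_dump(const char *path)    { (void)path; return 0; }
static int gpu_restore(const char *path) { (void)path; return 0; }

/* Illustrative table; the paths are assumed, not taken from the series. */
static const struct ext_plugin plugins[] = {
	{ "/dev/kfd",         gpu_dump, gpu_restore },
	{ "/dev/dri/renderD", gpu_dump, gpu_restore },
};

/* Return the plugin claiming this path, or NULL if none does. */
static const struct ext_plugin *find_plugin(const char *path)
{
	assert(path != NULL);
	for (size_t i = 0; i < sizeof(plugins) / sizeof(plugins[0]); i++) {
		size_t n = strlen(plugins[i].path_prefix);

		if (strncmp(path, plugins[i].path_prefix, n) == 0)
			return &plugins[i];
	}
	return NULL;
}

/* Checkpoint one open device file: delegate to a plugin, or fail. */
int dump_device_file(const char *path)
{
	const struct ext_plugin *p = find_plugin(path);

	return p ? p->dump(path) : -1; /* -1: no plugin, cannot checkpoint */
}
```

In this model a device file with no matching plugin makes the checkpoint
fail, which mirrors the limitation described above; a matching plugin
takes over the save and restore of the device-specific state.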

Our changes to CRIU user space can be obtained from here:
https://github.com/RadeonOpenCompute/criu/tree/amdgpu_rfc-211222

We have tested the following scenarios:
 - Checkpoint / Restore of a PyTorch (BERT) workload
 - kfdtests with queues and events
 - Gfx9 and Gfx10 based multi-GPU test systems
 - On baremetal and inside a docker container
 - Restoring on a different system

V1: Initial
V2: Addressed review comments
V3: Rebased on latest amd-staging-drm-next (5.15 based)
V4: New API design and basic support for SVM. There is, however, an
outstanding issue with SVM restore which is currently being debugged;
hopefully it won't impact the ioctl APIs, as SVM ranges are treated as
private data hidden from user space, like queues and events, with the
new approach.


David Yat Sin (9):
  drm/amdkfd: CRIU Implement KFD unpause operation
  drm/amdkfd: CRIU add queues support
  drm/amdkfd: CRIU restore queue ids
  drm/amdkfd: CRIU restore sdma id for queues
  drm/amdkfd: CRIU restore queue doorbell id
  drm/amdkfd: CRIU checkpoint and restore queue mqds
  drm/amdkfd: CRIU checkpoint and restore queue control stack
  drm/amdkfd: CRIU checkpoint and restore events
  drm/amdkfd: CRIU implement gpu_id remapping

Rajneesh Bhardwaj (15):
  x86/configs: CRIU update debug rock defconfig
  x86/configs: Add rock-rel_defconfig for amd-feature-criu branch
  drm/amdkfd: CRIU Introduce Checkpoint-Restore APIs
  drm/amdkfd: CRIU Implement KFD process_info ioctl
  drm/amdkfd: CRIU Implement KFD checkpoint ioctl
  drm/amdkfd: CRIU Implement KFD restore ioctl
  drm/amdkfd: CRIU Implement KFD resume ioctl
  drm/amdkfd: CRIU export BOs as prime dmabuf objects
  drm/amdkfd: CRIU checkpoint and restore xnack mode
  drm/amdkfd: CRIU allow external mm for svm ranges
  drm/amdkfd: use user_gpu_id for svm ranges
  drm/amdkfd: CRIU Discover svm ranges
  drm/amdkfd: CRIU Save Shared Virtual Memory ranges
  drm/amdkfd: CRIU prepare for svm resume
  drm/amdkfd: CRIU resume shared virtual memory ranges

 arch/x86/configs/rock-dbg_defconfig           |   53 +-
 arch/x86/configs/rock-rel_defconfig           | 4927 +++++++++++++++++
 drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h    |    6 +-
 .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c  |   51 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c       |   20 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.h       |    2 +
 drivers/gpu/drm/amd/amdkfd/kfd_chardev.c      | 1453 ++++-
 drivers/gpu/drm/amd/amdkfd/kfd_dbgdev.c       |    2 +-
 .../drm/amd/amdkfd/kfd_device_queue_manager.c |  185 +-
 .../drm/amd/amdkfd/kfd_device_queue_manager.h |   18 +-
 drivers/gpu/drm/amd/amdkfd/kfd_events.c       |  313 +-
 drivers/gpu/drm/amd/amdkfd/kfd_mqd_manager.h  |   14 +
 .../gpu/drm/amd/amdkfd/kfd_mqd_manager_cik.c  |   72 +
 .../gpu/drm/amd/amdkfd/kfd_mqd_manager_v10.c  |   74 +
 .../gpu/drm/amd/amdkfd/kfd_mqd_manager_v9.c   |   89 +
 .../gpu/drm/amd/amdkfd/kfd_mqd_manager_vi.c   |   81 +
 drivers/gpu/drm/amd/amdkfd/kfd_priv.h         |  166 +-
 drivers/gpu/drm/amd/amdkfd/kfd_process.c      |   86 +-
 .../amd/amdkfd/kfd_process_queue_manager.c    |  377 +-
 drivers/gpu/drm/amd/amdkfd/kfd_svm.c          |  326 +-
 drivers/gpu/drm/amd/amdkfd/kfd_svm.h          |   39 +
 include/uapi/linux/kfd_ioctl.h                |   79 +-
 22 files changed, 8099 insertions(+), 334 deletions(-)
 create mode 100644 arch/x86/configs/rock-rel_defconfig

-- 
2.17.1



end of thread, other threads:[~2022-01-11 15:59 UTC | newest]

Thread overview: 39+ messages
2021-12-23  0:36 [Patch v4 00/24] CHECKPOINT RESTORE WITH ROCm Rajneesh Bhardwaj
2021-12-23  0:36 ` [Patch v4 01/24] x86/configs: CRIU update debug rock defconfig Rajneesh Bhardwaj
2021-12-23  0:36 ` [Patch v4 02/24] x86/configs: Add rock-rel_defconfig for amd-feature-criu branch Rajneesh Bhardwaj
2021-12-23  0:36 ` [Patch v4 03/24] drm/amdkfd: CRIU Introduce Checkpoint-Restore APIs Rajneesh Bhardwaj
2022-01-10 22:08   ` Felix Kuehling
2021-12-23  0:36 ` [Patch v4 04/24] drm/amdkfd: CRIU Implement KFD process_info ioctl Rajneesh Bhardwaj
2022-01-10 22:47   ` Felix Kuehling
2021-12-23  0:36 ` [Patch v4 05/24] drm/amdkfd: CRIU Implement KFD checkpoint ioctl Rajneesh Bhardwaj
2021-12-23  0:36 ` [Patch v4 06/24] drm/amdkfd: CRIU Implement KFD restore ioctl Rajneesh Bhardwaj
2022-01-10 23:01   ` Felix Kuehling
2021-12-23  0:36 ` [Patch v4 07/24] drm/amdkfd: CRIU Implement KFD resume ioctl Rajneesh Bhardwaj
2022-01-10 23:16   ` Felix Kuehling
2021-12-23  0:36 ` [Patch v4 08/24] drm/amdkfd: CRIU Implement KFD unpause operation Rajneesh Bhardwaj
2021-12-23  0:36 ` [Patch v4 09/24] drm/amdkfd: CRIU add queues support Rajneesh Bhardwaj
2021-12-23  0:36 ` [Patch v4 10/24] drm/amdkfd: CRIU restore queue ids Rajneesh Bhardwaj
2021-12-23  0:36 ` [Patch v4 11/24] drm/amdkfd: CRIU restore sdma id for queues Rajneesh Bhardwaj
2021-12-23  0:36 ` [Patch v4 12/24] drm/amdkfd: CRIU restore queue doorbell id Rajneesh Bhardwaj
2021-12-23  0:37 ` [Patch v4 13/24] drm/amdkfd: CRIU checkpoint and restore queue mqds Rajneesh Bhardwaj
2022-01-10 23:32   ` Felix Kuehling
2021-12-23  0:37 ` [Patch v4 14/24] drm/amdkfd: CRIU checkpoint and restore queue control stack Rajneesh Bhardwaj
2021-12-23  0:37 ` [Patch v4 15/24] drm/amdkfd: CRIU checkpoint and restore events Rajneesh Bhardwaj
2021-12-23  0:37 ` [Patch v4 16/24] drm/amdkfd: CRIU implement gpu_id remapping Rajneesh Bhardwaj
2021-12-23  0:37 ` [Patch v4 17/24] drm/amdkfd: CRIU export BOs as prime dmabuf objects Rajneesh Bhardwaj
2021-12-23  0:37 ` [Patch v4 18/24] drm/amdkfd: CRIU checkpoint and restore xnack mode Rajneesh Bhardwaj
2022-01-05 15:22   ` philip yang
2022-01-11  0:10     ` Felix Kuehling
2022-01-11 15:49       ` philip yang
2021-12-23  0:37 ` [Patch v4 19/24] drm/amdkfd: CRIU allow external mm for svm ranges Rajneesh Bhardwaj
2021-12-23  0:37 ` [Patch v4 20/24] drm/amdkfd: use user_gpu_id " Rajneesh Bhardwaj
2021-12-23  0:37 ` [Patch v4 21/24] drm/amdkfd: CRIU Discover " Rajneesh Bhardwaj
2022-01-05 14:48   ` philip yang
2022-01-10 23:11   ` philip yang
2021-12-23  0:37 ` [Patch v4 22/24] drm/amdkfd: CRIU Save Shared Virtual Memory ranges Rajneesh Bhardwaj
2021-12-23  0:37 ` [Patch v4 23/24] drm/amdkfd: CRIU prepare for svm resume Rajneesh Bhardwaj
2022-01-05 14:43   ` philip yang
2022-01-10 23:58     ` Felix Kuehling
2022-01-11 15:58       ` philip yang
2021-12-23  0:37 ` [Patch v4 24/24] drm/amdkfd: CRIU resume shared virtual memory ranges Rajneesh Bhardwaj
2022-01-11  0:03   ` Felix Kuehling
