* [RFC] CRIU support for ROCm
@ 2021-05-01  1:57 ` Felix Kuehling
From: Felix Kuehling @ 2021-05-01  1:57 UTC (permalink / raw)
  To: criu, amd-gfx list, DRI Development
  Cc: Alexander Mihalicyn, Pavel Emelyanov, Bhardwaj, Rajneesh,
	Pavel Tikhomirov, Yat Sin, David, Adrian Reber

We have been working on a prototype supporting CRIU (Checkpoint/Restore
In Userspace) for accelerated compute applications running on AMD GPUs
using ROCm (Radeon Open Compute Platform). We're happy to finally share
this work publicly to solicit feedback and advice. The end-goal is to
get this work included upstream in Linux and CRIU. A short whitepaper
describing our design and intention can be found on Github:
https://github.com/RadeonOpenCompute/criu/tree/criu-dev/test/others/ext-kfd/README.md.

We have RFC patch series for the kernel (based on Alex Deucher's
amd-staging-drm-next branch) and for CRIU including a new plugin and a
few core CRIU changes. I will send those to the respective mailing lists
separately in a minute. They can also be found on Github.

    CRIU+plugin: https://github.com/RadeonOpenCompute/criu/commits/criu-dev
    Kernel (KFD):
    https://github.com/RadeonOpenCompute/ROCK-Kernel-Driver/commits/fxkamd/criu-wip
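
For those not familiar with CRIU plugins: a plugin is a shared object that
CRIU loads at dump/restore time and that registers hooks for resources CRIU
cannot handle generically, such as our /dev/kfd file descriptors. A
stripped-down skeleton (illustrative only, not our actual amdgpu plugin; the
macro and hook names follow CRIU's criu-plugin.h as of this writing, and the
handler bodies are placeholders) looks roughly like this:

    #include <fcntl.h>
    #include "criu-plugin.h"

    static int example_init(int stage)
    {
            /* Open driver-specific resources needed for this stage. */
            return 0;
    }

    static void example_fini(int stage, int ret)
    {
            /* Release anything acquired in example_init(). */
    }

    /* Called for every open fd that CRIU cannot handle itself, e.g.
     * /dev/kfd; 'id' names the image file the plugin writes its state to. */
    static int example_dump_file(int fd, int id)
    {
            return 0;  /* placeholder: dump device state for fd here */
    }

    /* Called at restore time; must return a new fd for image 'id'. */
    static int example_restore_file(int id)
    {
            return open("/dev/null", O_RDWR);  /* placeholder for the device */
    }

    CR_PLUGIN_REGISTER("example_plugin", example_init, example_fini)
    CR_PLUGIN_REGISTER_HOOK(CR_PLUGIN_HOOK__DUMP_EXT_FILE, example_dump_file)
    CR_PLUGIN_REGISTER_HOOK(CR_PLUGIN_HOOK__RESTORE_EXT_FILE, example_restore_file)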

At this point this is very much a work in progress and not ready for
upstream inclusion. There are still several missing features, known
issues, and open questions that we would like to start addressing with
your feedback.

What's working and tested at this point:

  * Checkpoint and restore accelerated machine learning apps: PyTorch
    running Bert on systems with 1 or 2 GPUs (MI50 or MI100), 100%
    unmodified user mode stack
  * Checkpoint on one system, restore on a different system
  * Checkpoint on one GPU, restore on a different GPU

Major Known issues:

  * The KFD ioctl API is not final: Needs a complete redesign to allow
    future extension without breaking the ABI
  * Very slow: Need to implement DMA to dump VRAM contents

Missing or incomplete features:

  * Support for the new KFD SVM API
  * Check device topology during restore
  * Checkpoint and restore multiple processes
  * Support for applications using Mesa for video decode/encode
  * Testing with more different GPUs and workloads

Big Open questions:

  * What's the preferred way to publish our CRIU plugin? In-tree or
    out-of-tree?
  * What's the preferred way to distribute our CRIU plugin? Source?
    Binary .so? Whole CRIU? Just in-box support?
  * If our plugin can be upstreamed in the CRIU tree, what would be the
    right directory?

Best regards,
  Felix


* Re: [RFC] CRIU support for ROCm
  2021-05-01  1:57 ` Felix Kuehling
@ 2021-05-01 17:03   ` Adrian Reber
From: Adrian Reber @ 2021-05-01 17:03 UTC (permalink / raw)
  To: Felix Kuehling
  Cc: Alexander Mihalicyn, Pavel Emelyanov, Pavel Tikhomirov,
	DRI Development, Bhardwaj, Rajneesh, criu, Yat Sin, David,
	amd-gfx list

On Fri, Apr 30, 2021 at 09:57:45PM -0400, Felix Kuehling wrote:
> We have been working on a prototype supporting CRIU (Checkpoint/Restore
> In Userspace) for accelerated compute applications running on AMD GPUs
> using ROCm (Radeon Open Compute Platform). We're happy to finally share
> this work publicly to solicit feedback and advice. The end-goal is to
> get this work included upstream in Linux and CRIU. A short whitepaper
> describing our design and intention can be found on Github:
> https://github.com/RadeonOpenCompute/criu/tree/criu-dev/test/others/ext-kfd/README.md.
> 
> We have RFC patch series for the kernel (based on Alex Deucher's
> amd-staging-drm-next branch) and for CRIU including a new plugin and a
> few core CRIU changes. I will send those to the respective mailing lists
> separately in a minute. They can also be found on Github.
> 
>     CRIU+plugin: https://github.com/RadeonOpenCompute/criu/commits/criu-dev
>     Kernel (KFD):
>     https://github.com/RadeonOpenCompute/ROCK-Kernel-Driver/commits/fxkamd/criu-wip
> 
> At this point this is very much a work in progress and not ready for
> upstream inclusion. There are still several missing features, known
> issues, and open questions that we would like to start addressing with
> your feedback.
> 
> What's working and tested at this point:
> 
>   * Checkpoint and restore accelerated machine learning apps: PyTorch
>     running Bert on systems with 1 or 2 GPUs (MI50 or MI100), 100%
>     unmodified user mode stack
>   * Checkpoint on one system, restore on a different system
>   * Checkpoint on one GPU, restore on a different GPU

This is very impressive. As far as I know, this is the first larger
plugin written for CRIU that has been published publicly. It is also the
first GPU to be supported, something people have been asking for for many
years. In fact, it is the first hardware device supported through a plugin.

> Major Known issues:
> 
>   * The KFD ioctl API is not final: Needs a complete redesign to allow
>     future extension without breaking the ABI
>   * Very slow: Need to implement DMA to dump VRAM contents
> 
> Missing or incomplete features:
> 
>   * Support for the new KFD SVM API
>   * Check device topology during restore
>   * Checkpoint and restore multiple processes
>   * Support for applications using Mesa for video decode/encode
>   * Testing with more different GPUs and workloads
> 
> Big Open questions:
> 
>   * What's the preferred way to publish our CRIU plugin? In-tree or
>     out-of-tree?

I would do it in-tree.

>   * What's the preferred way to distribute our CRIU plugin? Source?
>     Binary .so? Whole CRIU? Just in-box support?

As you are planning to publish the source, I would make it part of the
CRIU repository; that way it will find its way into the packages of the
different distributions.

Does the plugin require any additional dependencies? If there is no
additional dependency on a library, the plugin can easily be part of
the existing packages.

>   * If our plugin can be upstreamed in the CRIU tree, what would be the
>     right directory?

I would just put it into criu/plugins/

It would also be good to have your patchset submitted as a PR on github
so that the changes get our normal CI test coverage.

		Adrian

* Re: [RFC] CRIU support for ROCm
  2021-05-01 17:03   ` Adrian Reber
@ 2021-05-03 18:21     ` Felix Kuehling
From: Felix Kuehling @ 2021-05-03 18:21 UTC (permalink / raw)
  To: Adrian Reber
  Cc: Alexander Mihalicyn, Pavel Emelyanov, Pavel Tikhomirov,
	DRI Development, Bhardwaj, Rajneesh, criu, Yat Sin, David,
	amd-gfx list

On 2021-05-01 at 1:03 p.m., Adrian Reber wrote:
> On Fri, Apr 30, 2021 at 09:57:45PM -0400, Felix Kuehling wrote:
>> We have been working on a prototype supporting CRIU (Checkpoint/Restore
>> In Userspace) for accelerated compute applications running on AMD GPUs
>> using ROCm (Radeon Open Compute Platform). We're happy to finally share
>> this work publicly to solicit feedback and advice. The end-goal is to
>> get this work included upstream in Linux and CRIU. A short whitepaper
>> describing our design and intention can be found on Github:
>> https://github.com/RadeonOpenCompute/criu/tree/criu-dev/test/others/ext-kfd/README.md
>>
>> We have RFC patch series for the kernel (based on Alex Deucher's
>> amd-staging-drm-next branch) and for CRIU including a new plugin and a
>> few core CRIU changes. I will send those to the respective mailing lists
>> separately in a minute. They can also be found on Github.
>>
>>     CRIU+plugin: https://github.com/RadeonOpenCompute/criu/commits/criu-dev
>>     Kernel (KFD):
>>     https://github.com/RadeonOpenCompute/ROCK-Kernel-Driver/commits/fxkamd/criu-wip
>>
>> At this point this is very much a work in progress and not ready for
>> upstream inclusion. There are still several missing features, known
>> issues, and open questions that we would like to start addressing with
>> your feedback.
>>
>> What's working and tested at this point:
>>
>>   * Checkpoint and restore accelerated machine learning apps: PyTorch
>>     running Bert on systems with 1 or 2 GPUs (MI50 or MI100), 100%
>>     unmodified user mode stack
>>   * Checkpoint on one system, restore on a different system
>>   * Checkpoint on one GPU, restore on a different GPU
> This is very impressive. As far as I know, this is the first larger
> plugin written for CRIU that has been published publicly. It is also the
> first GPU to be supported, something people have been asking for for many
> years. In fact, it is the first hardware device supported through a plugin.
>
>> Major Known issues:
>>
>>   * The KFD ioctl API is not final: Needs a complete redesign to allow
>>     future extension without breaking the ABI
>>   * Very slow: Need to implement DMA to dump VRAM contents
>>
>> Missing or incomplete features:
>>
>>   * Support for the new KFD SVM API
>>   * Check device topology during restore
>>   * Checkpoint and restore multiple processes
>>   * Support for applications using Mesa for video decode/encode
>>   * Testing with more different GPUs and workloads
>>
>> Big Open questions:
>>
>>   * What's the preferred way to publish our CRIU plugin? In-tree or
>>     out-of-tree?
> I would do it in-tree.
>
>>   * What's the preferred way to distribute our CRIU plugin? Source?
>>     Binary .so? Whole CRIU? Just in-box support?
> As you are planning to publish the source, I would make it part of the
> CRIU repository; that way it will find its way into the packages of the
> different distributions.

Thanks. These are the answers I was hoping for.


>
> Does the plugin require any additional dependencies? If there is no
> additional dependency on a library, the plugin can easily be part of
> the existing packages.

The DMA solution we're considering for saving VRAM contents would add a
dependency on libdrm and libdrm-amdgpu.
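
To give an idea of the shape of that dependency (a sketch only, not the
actual implementation; the render node path is just an example and the real
plugin would have to pick the node matching the checkpointed GPU), the plugin
would open a render node and initialize libdrm-amdgpu on it, and the SDMA
copies of VRAM buffer objects would then go through that device handle:

    /* Build with: gcc vram_dump.c $(pkg-config --cflags --libs libdrm_amdgpu) */
    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <unistd.h>
    #include <amdgpu.h>

    int main(void)
    {
            uint32_t major, minor;
            amdgpu_device_handle dev;
            int fd = open("/dev/dri/renderD128", O_RDWR); /* example node */

            if (fd < 0)
                    return 1;
            if (amdgpu_device_initialize(fd, &major, &minor, &dev)) {
                    close(fd);
                    return 1;
            }
            printf("libdrm-amdgpu interface version %u.%u\n", major, minor);

            /* A real implementation would import the checkpointed buffer
             * objects here and copy their VRAM contents using SDMA. */

            amdgpu_device_deinitialize(dev);
            close(fd);
            return 0;
    }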


>
>>   * If our plugin can be upstreamed in the CRIU tree, what would be the
>>     right directory?
> I would just put it into criu/plugins/

Sounds good.


>
> It would also be good to have your patchset submitted as a PR on github
> so that the changes get our normal CI test coverage.

We'll probably have to recreate our repository to start as a fork of the
upstream CRIU repository, so that we can easily send pull requests.
We're probably not going to be ready for upstreaming for a few more
months. Do you want to get occasional pull requests anyway, just to
run CI on our work-in-progress code?

Regards,
  Felix


>
> 		Adrian

* Re: [RFC] CRIU support for ROCm
  2021-05-03 18:21     ` Felix Kuehling
@ 2021-05-04 12:32       ` Adrian Reber
From: Adrian Reber @ 2021-05-04 12:32 UTC (permalink / raw)
  To: Felix Kuehling
  Cc: Alexander Mihalicyn, Pavel Emelyanov, Pavel Tikhomirov,
	DRI Development, Bhardwaj, Rajneesh, criu, Yat Sin, David,
	amd-gfx list

On Mon, May 03, 2021 at 02:21:53PM -0400, Felix Kuehling wrote:
> On 2021-05-01 at 1:03 p.m., Adrian Reber wrote:
> > On Fri, Apr 30, 2021 at 09:57:45PM -0400, Felix Kuehling wrote:
> >> We have been working on a prototype supporting CRIU (Checkpoint/Restore
> >> In Userspace) for accelerated compute applications running on AMD GPUs
> >> using ROCm (Radeon Open Compute Platform). We're happy to finally share
> >> this work publicly to solicit feedback and advice. The end-goal is to
> >> get this work included upstream in Linux and CRIU. A short whitepaper
> >> describing our design and intention can be found on Github:
> >> https://github.com/RadeonOpenCompute/criu/tree/criu-dev/test/others/ext-kfd/README.md
> >>
> >> We have RFC patch series for the kernel (based on Alex Deucher's
> >> amd-staging-drm-next branch) and for CRIU including a new plugin and a
> >> few core CRIU changes. I will send those to the respective mailing lists
> >> separately in a minute. They can also be found on Github.
> >>
> >>     CRIU+plugin: https://github.com/RadeonOpenCompute/criu/commits/criu-dev
> >>     Kernel (KFD):
> >>     https://github.com/RadeonOpenCompute/ROCK-Kernel-Driver/commits/fxkamd/criu-wip
> >>
> >> At this point this is very much a work in progress and not ready for
> >> upstream inclusion. There are still several missing features, known
> >> issues, and open questions that we would like to start addressing with
> >> your feedback.
> >>
> >> What's working and tested at this point:
> >>
> >>   * Checkpoint and restore accelerated machine learning apps: PyTorch
> >>     running Bert on systems with 1 or 2 GPUs (MI50 or MI100), 100%
> >>     unmodified user mode stack
> >>   * Checkpoint on one system, restore on a different system
> >>   * Checkpoint on one GPU, restore on a different GPU
> > This is very impressive. As far as I know, this is the first larger
> > plugin written for CRIU that has been published publicly. It is also the
> > first GPU to be supported, something people have been asking for for many
> > years. In fact, it is the first hardware device supported through a plugin.
> >
> >> Major Known issues:
> >>
> >>   * The KFD ioctl API is not final: Needs a complete redesign to allow
> >>     future extension without breaking the ABI
> >>   * Very slow: Need to implement DMA to dump VRAM contents
> >>
> >> Missing or incomplete features:
> >>
> >>   * Support for the new KFD SVM API
> >>   * Check device topology during restore
> >>   * Checkpoint and restore multiple processes
> >>   * Support for applications using Mesa for video decode/encode
> >>   * Testing with more different GPUs and workloads
> >>
> >> Big Open questions:
> >>
> >>   * What's the preferred way to publish our CRIU plugin? In-tree or
> >>     out-of-tree?
> > I would do it in-tree.
> >
> >>   * What's the preferred way to distribute our CRIU plugin? Source?
> >>     Binary .so? Whole CRIU? Just in-box support?
> > As you are planning to publish the source, I would make it part of the
> > CRIU repository; that way it will find its way into the packages of the
> > different distributions.
> 
> Thanks. These are the answers I was hoping for.
> 
> 
> >
> > Does the plugin require any additional dependencies? If there is no
> > additional dependency on a library, the plugin can easily be part of
> > the existing packages.
> 
> The DMA solution we're considering for saving VRAM contents would add a
> dependency on libdrm and libdrm-amdgpu.

For the CRIU packages I am maintaining I would probably put the plugin
in a sub-package so that not all users of the CRIU package have to
install the mentioned libraries.

> >>   * If our plugin can be upstreamed in the CRIU tree, what would be the
> >>     right directory?
> > I would just put it into criu/plugins/
> 
> Sounds good.
> 
> >
> > It would also be good to have your patchset submitted as a PR on github
> > so that the changes get our normal CI test coverage.
> 
> We'll probably have to recreate our repository to start as a fork of the
> upstream CRIU repository, so that we can easily send pull requests.
> We're probably not going to be ready for upstreaming for a few more
> months. Do you want to get occasional pull requests anyway, just to
> run CI on our work-in-progress code?

If you run it early through our CI it might make it easier for you to
see what it might break. Also, if your patches include fixes which are
not directly related to your plugin, it might make sense to submit those
patches earlier to reduce the size of the final patch. But this is up to
you.

		Adrian

* Re: [RFC] CRIU support for ROCm
  2021-05-01  1:57 ` Felix Kuehling
@ 2021-05-04 13:00   ` Daniel Vetter
From: Daniel Vetter @ 2021-05-04 13:00 UTC (permalink / raw)
  To: Felix Kuehling, airlied, Jason Ekstrand
  Cc: Alexander Mihalicyn, Pavel Emelyanov, Bhardwaj, Rajneesh,
	DRI Development, Pavel Tikhomirov, criu, Yat Sin, David,
	amd-gfx list, Adrian Reber

On Fri, Apr 30, 2021 at 09:57:45PM -0400, Felix Kuehling wrote:
> We have been working on a prototype supporting CRIU (Checkpoint/Restore
> In Userspace) for accelerated compute applications running on AMD GPUs
> using ROCm (Radeon Open Compute Platform). We're happy to finally share
> this work publicly to solicit feedback and advice. The end-goal is to
> get this work included upstream in Linux and CRIU. A short whitepaper
> describing our design and intention can be found on Github:
> https://github.com/RadeonOpenCompute/criu/tree/criu-dev/test/others/ext-kfd/README.md.
>
> We have RFC patch series for the kernel (based on Alex Deucher's
> amd-staging-drm-next branch) and for CRIU including a new plugin and a
> few core CRIU changes. I will send those to the respective mailing lists
> separately in a minute. They can also be found on Github.
>
>     CRIU+plugin: https://github.com/RadeonOpenCompute/criu/commits/criu-dev
>     Kernel (KFD):
>     https://github.com/RadeonOpenCompute/ROCK-Kernel-Driver/commits/fxkamd/criu-wip
>
> At this point this is very much a work in progress and not ready for
> upstream inclusion. There are still several missing features, known
> issues, and open questions that we would like to start addressing with
> your feedback.

Since the thread is a bit split I'm dumping the big thoughts here on this
RFC.

We've discussed this in the past, but I'm once more (insert meme here)
asking whether continuing to walk down the amdgpu vs amdkfd split is
really the right choice. It starts to feel a bit much like sunk cost
fallacy ...

- From the big thread we're having right now on dri-devel it's clear that
  3d will also move towards more and more a userspace submit model. But
  due to backwards compat issues it will be a mixed model, and in some
  cases we need to pick at runtime which model we're picking. A hard split
  between the amdgpu and the amdkfd world gets in the way here.

- There's use-cases for doing compute in vulkan (that was a discussion
  from Feb that I kicked again in private, since I think still
  unresolved). So you need a vulkan stack that runs on both amdgpu and
  amdvlk.

- Maybe not yet on amd's radar, but there's a lot of cloud computing. And
  maybe they also want CRIU for migrating their containers around. So that
  means CRIU for amdgpu too, not just amdkfd.

- What's much worse, and I don't think anyone in amd has realized this yet
  (at least not in a public thread I've seen). In vulkan you need to be
  able to switch from compute mode to dma-fence mode after
  pipelines/devices have been created already. This is because winsys are
  only initialized in a second step, until that's done you have to
  pessimistically assume that the user does pure compute. What's worse for
  buffer sharing you don't even have a clear signal on this stuff. So
  either

  - you figure out how to migrate all the buffers and state from amdkfd to
    amdgpu at runtime, and duplicate all the features. Which is rather
    pointless.

  - or you duplicate all the compute features to amdgpu so that vk can use
    them, and still reasonably easy migrate to winsys/dma-fence mode,
    which makes amdkfd rather redundant.

  I've discussed this problem extensively with Jason Ekstrand, and it's
  really nasty.

So purely from a technical pov, only looking at the AMD perspective here,
this doesn't make much sense to me. The only reason to keep doubling down
on amdkfd I'm seeing is that you've built your compute rocm stack on top
of it, and because of that the only option is to keep doing that. Which
stops making sense eventually, and we're getting to that point for sure.

The other side is a bit the upstream side, but that's a lot smaller:

- vulkan compute is one of the more reasonable ways to get cross vendor
  compute ecosystem off the ground. At least from what I know from
  background chatter, which you guys probably haven't all heard. amdkfd
  being the single very odd driver here requiring entirely different uapi
  for compute mode is not going to be great.

- CRIU will need new access rights handling (for the save/restore/resume
  stuff you're adding). Generally we standardize access rights checks
  across drivers, and leave everything else to render drivers (command
  submission, memory management, ...). By adding CRIU support to amdkfd
  we pretty much guarantee that we won't be able to standardize CRIU access
  rights across drivers. Which just plain sucks from an
  upstream/cross-vendor ecosystem pov.

And yes we'd still need a per-driver criu plugin in userspace, but the
same is true for amdvlk/radv/anv/ and all the other drivers we have:
Driver is different, access right management is still the same.

And secondly, just because nvidia refuses to collaborate in any
standards around gpu compute doesn't mean that's a good reason for us to
do the same in upstream.

Thirdly, it sounds like this is the first device-driver CRIU support, so I
think we need a solid agreement/standard here to set as an example for
everyone else. There's all the AI accel chips and fpga-for-compute stuff
that I expect will eventually also gain CRIU support.

So I'm asking once more, is this _really_ the right path forward? Both for
amd (at least long term), but also somewhat for upstream.

Cheers, Daniel


> What's working and tested at this point:
>
>   * Checkpoint and restore accelerated machine learning apps: PyTorch
>     running Bert on systems with 1 or 2 GPUs (MI50 or MI100), 100%
>     unmodified user mode stack
>   * Checkpoint on one system, restore on a different system
>   * Checkpoint on one GPU, restore on a different GPU
>
> Major Known issues:
>
>   * The KFD ioctl API is not final: Needs a complete redesign to allow
>     future extension without breaking the ABI
>   * Very slow: Need to implement DMA to dump VRAM contents
>
> Missing or incomplete features:
>
>   * Support for the new KFD SVM API
>   * Check device topology during restore
>   * Checkpoint and restore multiple processes
>   * Support for applications using Mesa for video decode/encode
>   * Testing with more different GPUs and workloads
>
> Big Open questions:
>
>   * What's the preferred way to publish our CRIU plugin? In-tree or
>     out-of-tree?
>   * What's the preferred way to distribute our CRIU plugin? Source?
>     Binary .so? Whole CRIU? Just in-box support?
>   * If our plugin can be upstreamed in the CRIU tree, what would be the
>     right directory?
>
> Best regards,
>   Felix
>
>

--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

* Re: [RFC] CRIU support for ROCm
  2021-05-04 13:00   ` Daniel Vetter
@ 2021-05-06 16:10     ` Felix Kuehling
From: Felix Kuehling @ 2021-05-06 16:10 UTC (permalink / raw)
  To: Daniel Vetter, airlied, Jason Ekstrand
  Cc: Alexander Mihalicyn, Pavel Emelyanov, Bhardwaj, Rajneesh,
	DRI Development, Pavel Tikhomirov, criu, Yat Sin, David,
	amd-gfx list, Adrian Reber

On 2021-05-04 at 9:00 a.m., Daniel Vetter wrote:
> On Fri, Apr 30, 2021 at 09:57:45PM -0400, Felix Kuehling wrote:
>> We have been working on a prototype supporting CRIU (Checkpoint/Restore
>> In Userspace) for accelerated compute applications running on AMD GPUs
>> using ROCm (Radeon Open Compute Platform). We're happy to finally share
>> this work publicly to solicit feedback and advice. The end-goal is to
>> get this work included upstream in Linux and CRIU. A short whitepaper
>> describing our design and intention can be found on Github:
>> https://github.com/RadeonOpenCompute/criu/tree/criu-dev/test/others/ext-kfd/README.md
>>
>> We have RFC patch series for the kernel (based on Alex Deucher's
>> amd-staging-drm-next branch) and for CRIU including a new plugin and a
>> few core CRIU changes. I will send those to the respective mailing lists
>> separately in a minute. They can also be found on Github.
>>
>>     CRIU+plugin: https://github.com/RadeonOpenCompute/criu/commits/criu-dev
>>     Kernel (KFD):
>>     https://github.com/RadeonOpenCompute/ROCK-Kernel-Driver/commits/fxkamd/criu-wip
>>
>> At this point this is very much a work in progress and not ready for
>> upstream inclusion. There are still several missing features, known
>> issues, and open questions that we would like to start addressing with
>> your feedback.
> Since the thread is a bit split I'm dumping the big thoughts here on this
> RFC.
>
> We've discussed this in the past, but I'm once more (insert meme here)
> asking whether continuing to walk down the amdgpu vs amdkfd split is
> really the right choice. It starts to feel a bit much like sunk cost
> fallacy ...

Hi Daniel,

Thanks for the feedback. I have some comments on your specific points
below. This is my own opinion at this point and may not reflect AMD's
position. I'm starting some internal discussions about unifying the KFD
and graphics APIs in the long run. But IMO this is going to take years
and won't be supported on our current compute GPUs, including Aldebaran,
which isn't even released yet.


>
> - From the big thread we're having right now on dri-devel it's clear that
>   3d will also move towards more and more a userspace submit model.

I'll need to start following dri-devel more closely and take a more
active role in those discussions. If there is an opportunity for a
unified memory management and command submission model for graphics and
compute on future hardware, I want to be sure that our compute
requirements are understood early on.


>  But
>   due to backwards compat issues it will be a mixed model, and in some
>   cases we need to pick at runtime which model we're picking. A hard split
>   between the amdgpu and the amdkfd world gets in the way here.

Backwards compatibility will force us to maintain KFD at least for GFXv9
and older AMD GPUs. As I understand it, the new user mode command
submission model will only be viable on GFXv10 or even newer GPUs. GFXv9
is our architecture for current compute GPUs (including Aldebaran which
is still in development), so this is going to be important for us for
years to come.


>
> - There's use-cases for doing compute in vulkan (that was a discussion
>   from Feb that I kicked again in private, since I think still
>   unresolved). So you need a vulkan stack that runs on both amdgpu and
>   amdvlk.

By amdvlk, do you mean AMD's "pro" driver that supports our non-RADV
Vulkan driver?


>
> - Maybe not yet on amd's radar, but there's a lot of cloud computing. And
>   maybe they also want CRIU for migrating their containers around. So that
>   means CRIU for amdgpu too, not just amdkfd.

Our strategy for compute support (OpenCL, HIP, OpenMP and maybe future
C++ standards) is to use ROCm on all current and future GPUs starting
with GFXv9. As long as ROCm is based on KFD, that means our compute
stack runs on KFD. We don't see KFD as something mutually exclusive with
graphics. They are different ioctl APIs, but they're sharing the
hardware and memory and are meant to work together.

We are already planning to add CRIU support to our render node API
because video acceleration on our compute stack still relies on Mesa and
the render node CS API.

I admit that we are not currently considering Vulkan as part of our
compute stack. I don't think it makes sense to build Vulkan on top of
ROCm because we have no plans to add graphics support to ROCm. What
commonality there is between Vulkan and other compute APIs will need to
be addressed at a lower level. If graphics is moving to user mode
command submission and a shared virtual memory model, I see an
opportunity to move ROCm to a common ioctl API on those future hardware
generations.


>
> - What's much worse, and I don't think anyone in amd has realized this yet
>   (at least not in a public thread I've seen). In vulkan you need to be
>   able to switch from compute mode to dma-fence mode after
>   pipelines/devices have been created already. This is because winsys are
>   only initialized in a second step, until that's done you have to
>   pessimistically assume that the user does pure compute. What's worse for
>   buffer sharing you don't even have a clear signal on this stuff. So
>   either
>
>   - you figure out how to migrate all the buffers and state from amdkfd to
>     amdgpu at runtime, and duplicate all the features. Which is rather
>     pointless.
>
>   - or you duplicate all the compute features to amdgpu so that vk can use
>     them, and still reasonably easy migrate to winsys/dma-fence mode,
>     which makes amdkfd rather redundant.

The basic "compute features" that ROCm depends on are a
shared-virtual-memory model and user mode command submission. If those
become part of the graphics ioctl API, I see no problem with
implementing proper Vulkan compute on top of it.

For more consistent support for tools such as debuggers, profilers, CRIU,
etc., it would make sense to have a common ioctl API for ROCm and Vulkan,
which would effectively deprecate the KFD API. And I'm definitely open
to working on that for future HW generations that can support it.

We'll probably have to pick a target GPU where we make such a switch,
and use the time until then to get things ready, prototyping in current
GFXv10 or newer GPUs. This feels to me a bit like the transition from
radeon to amdgpu, where GFXv7 was supported by both for some time.

 

>
>   I've discussed this problem extensively with Jason Ekstrand, and it's
>   really nasty.
>
> So purely from a technical pov, only looking at the AMD perspective here,
> this doesn't make much sense to me. The only reason to keep doubling down
> on amdkfd I'm seeing is that you've built your compute rocm stack on top
> of it, and because of that the only option is to keep doing that. Which
> stops making sense eventually, and we're getting to that point for sure.

As long as our compute GPUs are based on the GFXv9 architecture, I think
we don't really have an alternative to doubling down on KFD. We can't
just stop work on KFD and tell our customers that they'll not get any
new features for the next 2 or 3 years that it takes to build a common
upstream memory management and scheduling API and the HW that can
support it.


>
> The other side is a bit the upstream side, but that's a lot smaller:
>
> - vulkan compute is one of the more reasonable ways to get cross vendor
>   compute ecosystem off the ground. At least from what I know from
>   background chatter, which you guys probably haven't all heard. amdkfd
>   being the single very odd driver here requiring entirely different uapi
>   for compute mode is not going to be great.

Like I said, KFD is not mutually exclusive with anything else.

Vulkan may be a good option for desktops. For HPC uses or general
portable non-graphics compute code, I think something like OpenMP or
future C++ standards would be a better way forward. There is no
technical reason why both cannot coexist in the same application.


>
> - CRIU will need new access rights handling (for the save/restore/resume
>   stuff you're adding). Generally we standardize access rights checks
>   across drivers, and leave everything else to render drivers (command
>   submission, memory management, ...). By adding CRIU support to amdkfd
>   we pretty much guarantee that we won't be able to standardize CRIU access
>   rights across drivers. Which just plain sucks from an
>   upstream/cross-vendor ecosystem pov.

By access rights, do you mean requiring root for some ioctls, or being
ptrace-attached for others? These are driven by how CRIU works and
interacts with its target processes. I think they will apply equally to
any driver implementing CRIU support. I don't see how graphics drivers
specifically drive standardization of CRIU access rights.


>
> And yes we'd still need a per-driver criu plugin in userspace, but the
> same is true for amdvlk/radv/anv/ and all the other drivers we have:
> Driver is different, access right management is still the same.
>
> And secondly, just because nvidia refuses to collaborate in any
> standards around gpu compute doesn't mean that's a good reason for us to
> do the same in upstream.
>
> Thirdly, it sounds like this is the first device-driver CRIU support, so I
> think we need a solid agreement/standard here to set as an example for
> everyone else. There's all the AI accel chips and fpga-for-compute stuff
> that I expect will eventually also gain CRIU support.

Sure. My experience with CRIU so far is that CRIU requires root (even
without our plugin). Regardless of that, CRIU attaches to the target
process with ptrace. The kernel's security policy for allowing ptrace
access is influenced by many factors. We chose not to duplicate this in
the driver. Therefore we check whether the caller of our CRIU dump
ioctl is ptrace-attached to the target process. If it is, that means it
has already passed all the kernel's security checks. We deliberately do
not want to implement our own security policy in the driver.
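
As a rough illustration of that approach, here is a minimal sketch of
how such a ptrace-attachment gate could look in a dump ioctl handler.
The helper name and the exact check are assumptions for illustration,
not the actual KFD patch:

    /*
     * Hedged sketch, not the real KFD code: allow the dump ioctl only
     * if the caller is the ptrace parent of the target process, so the
     * driver defers to the kernel's existing ptrace security policy
     * instead of inventing its own.
     */
    #include <linux/ptrace.h>
    #include <linux/rcupdate.h>
    #include <linux/sched.h>

    static bool criu_dump_caller_allowed(struct task_struct *target)
    {
            bool allowed;

            rcu_read_lock();
            /* ptrace_parent() returns the tracer of @target, or NULL */
            allowed = (ptrace_parent(target) == current);
            rcu_read_unlock();

            return allowed;
    }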

There would be ways to remove the root requirement from our CRIU
restore ioctl, if that ever becomes a requirement for CRIU itself. It
would require some sanity checking of the HW state that is being
restored.


>
> So I'm asking once more, is this _really_ the right path forward? Both for
> amd (at least long term), but also somewhat for upstream.

I think long-term it makes sense to plan for a unified API. For current
hardware, I don't see a way to get there for multiple reasons:

 1. Maintaining backwards compatibility with the existing user mode stack
 2. Lacking HW capabilities for graphics user mode command submission
    and HW scheduling

Therefore KFD will be important for AMD's compute GPU strategy and our
customers for years to come. Not being able to continue developing it
upstream would be a major impediment for AMD. Again, this is my personal
opinion.

Regards,
  Felix


>
> Cheers, Daniel
>
>
>> What's working and tested at this point:
>>
>>   * Checkpoint and restore accelerated machine learning apps: PyTorch
>>     running Bert on systems with 1 or 2 GPUs (MI50 or MI100), 100%
>>     unmodified user mode stack
>>   * Checkpoint on one system, restore on a different system
>>   * Checkpoint on one GPU, restore on a different GPU
>>
>> Major Known issues:
>>
>>   * The KFD ioctl API is not final: Needs a complete redesign to allow
>>     future extension without breaking the ABI
>>   * Very slow: Need to implement DMA to dump VRAM contents
>>
>> Missing or incomplete features:
>>
>>   * Support for the new KFD SVM API
>>   * Check device topology during restore
>>   * Checkpoint and restore multiple processes
>>   * Support for applications using Mesa for video decode/encode
>>   * Testing with more different GPUs and workloads
>>
>> Big Open questions:
>>
>>   * What's the preferred way to publish our CRIU plugin? In-tree or
>>     out-of-tree?
>>   * What's the preferred way to distribute our CRIU plugin? Source?
>>     Binary .so? Whole CRIU? Just in-box support?
>>   * If our plugin can be upstreamed in the CRIU tree, what would be the
>>     right directory?
>>
>> Best regards,
>>   Felix
>>
>>
>> _______________________________________________
>> dri-devel mailing list
>> dri-devel@lists.freedesktop.org
>> https://lists.freedesktop.org/mailman/listinfo/dri-devel
> --
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch/

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC] CRIU support for ROCm
  2021-05-06 16:10     ` Felix Kuehling
@ 2021-05-07  9:32       ` Daniel Vetter
  -1 siblings, 0 replies; 18+ messages in thread
From: Daniel Vetter @ 2021-05-07  9:32 UTC (permalink / raw)
  To: Felix Kuehling
  Cc: Alexander Mihalicyn, Pavel Emelyanov, Bhardwaj, Rajneesh,
	DRI Development, Pavel Tikhomirov, criu, Yat Sin, David,
	amd-gfx list, Jason Ekstrand, Adrian Reber

On Thu, May 06, 2021 at 12:10:15PM -0400, Felix Kuehling wrote:
> Am 2021-05-04 um 9:00 a.m. schrieb Daniel Vetter:
> > On Fri, Apr 30, 2021 at 09:57:45PM -0400, Felix Kuehling wrote:
> >> We have been working on a prototype supporting CRIU (Checkpoint/Restore
> >> In Userspace) for accelerated compute applications running on AMD GPUs
> >> using ROCm (Radeon Open Compute Platform). We're happy to finally share
> >> this work publicly to solicit feedback and advice. The end-goal is to
> >> get this work included upstream in Linux and CRIU. A short whitepaper
> >> describing our design and intention can be found on Github:
> >> https://github.com/RadeonOpenCompute/criu/tree/criu-dev/test/others/ext-kfd/README.md
> >>
> >> We have RFC patch series for the kernel (based on Alex Deucher's
> >> amd-staging-drm-next branch) and for CRIU including a new plugin and a
> >> few core CRIU changes. I will send those to the respective mailing lists
> >> separately in a minute. They can also be found on Github.
> >>
> >>     CRIU+plugin: https://github.com/RadeonOpenCompute/criu/commits/criu-dev
> >>     Kernel (KFD):
> >>     https://github.com/RadeonOpenCompute/ROCK-Kernel-Driver/commits/fxkamd/criu-wip
> >>
> >> At this point this is very much a work in progress and not ready for
> >> upstream inclusion. There are still several missing features, known
> >> issues, and open questions that we would like to start addressing with
> >> your feedback.
> > Since the thread is a bit split I'm dumping the big thoughts here on this
> > RFC.
> >
> > We've discussed this in the past, but I'm once more (insert meme here)
> > asking whether continuing to walk down the amdgpu vs amdkfd split is
> > really the right choice. It starts to feel a bit much like sunk cost
> > fallacy ...
> 
> Hi Daniel,
> 
> Thanks for the feedback. I have some comments to your specific points
> below. This is my own opinion at this point and may not reflect AMDs
> position. I'm starting some internal discussions about unifying the KFD
> and graphics APIs in the long run. But IMO this is going to take years
> and won't be supported on our current compute GPUs, including Aldebaran
> which isn't even released yet.

Well, yeah, that's why I'm bringing this up early (and I think we're at
round 2 or 3 of this discussion by now). I know how many years this
takes to roll out.

> > - From the big thread we're having right now on dri-devel it's clear that
> >   3d will also move towards more and more a userspace submit model.
> 
> I'll need to start following dri-devel more closely and take a more
> active role in those discussions. If there is an opportunity for a
> unified memory management and command submission model for graphics and
> compute on future hardware, I want to be sure that our compute
> requirements are understood early on.
> 
> 
> >  But
> >   due to backwards compat issues it will be a mixed model, and in some
> >   cases we need to pick at runtime which model we're picking. A hard split
> >   between the amdgpu and the amdkfd world gets in the way here.
> 
> Backwards compatibility will force us to maintain KFD at least for GFXv9
> and older AMD GPUs. As I understand it, the new user mode command
> submission model will only be viable on GFXv10 or even newer GPUs. GFXv9
> is our architecture for current compute GPUs (including Aldebaran which
> is still in development), so this is going to be important for us for
> years to come.

Yeah, this is definitely a 5-year-plus plan.

> > - There's use-cases for doing compute in vulkan (that was a discussion
> >   from Feb that I kicked again in private, since I think still
> >   unresolved). So you need a vulkan stack that runs on both amdgpu and
> >   amdvlk.
> 
> By amdvlk, do you mean AMD's "pro" driver that supports our non-RADV
> Vulcan driver?

Yeah, that's what I heard, at least from Alex (if I recall correctly).
People seem to be using Vulkan for cloud compute. This came up when we
talked about enforcing a default timeout for compute engines on the
amdgpu side (the kernel can hang if you don't), and I think John
Bridgman mentioned the proper fix was to move at least OpenCL over to
use amdkfd by default.


> > - Maybe not yet on amd's radar, but there's a lot of cloud computing. And
> >   maybe they also want CRIU for migrating their containers around. So that
> >   means CRIU for amdgpu too, not just amdkf.
> 
> Our strategy for compute support (OpenCL, HIP, OpenMP and mayby future
> C++ standards) is to use ROCm on all current and future GPUs starting
> with GFXv9. As long as ROCm is based on KFD, that means our compute
> stack runs on KFD. We don't see KFD as something mutually exclusive with
> graphics. They are different ioctl APIs, but they're sharing the
> hardware and memory and are meant to work together.
> 
> We are already planning to add CRIU support to our render node API
> because video acceleration on our compute stack still relies on Mesa and
> the render node CS API.
> 
> I admit, that we are not currently considering Vulcan as part of our
> compute stack. I don't think it makes sense to build Vulcan on top of
> ROCm because we have no plans to add graphics support to ROCm. What
> commonality there is between Vulcan and other compute APIs will need to
> be addressed at a lower level. If graphics is moving to user mode
> command submission and a shared virtual memory model, I see an
> opportunity to move ROCm to a common ioctl API on those future hardware
> generations.
> 
> 
> >
> > - What's much worse, and I don't think anyone in amd has realized this yet
> >   (at least not in a public thread I've seen). In vulkan you need to be
> >   able to switch from compute mode to dma-fence mode after
> >   pipelines/devices have been created already. This is because winsys are
> >   only initialized in a second step, until that's done you have to
> >   pessimistically assume that the user does pure compute. What's worse for
> >   buffer sharing you don't even have a clear signal on this stuff. So
> >   either
> >
> >   - you figure out how to migrate all the buffers and state from amdkfd to
> >     amdgpu at runtime, and duplicate all the features. Which is rather
> >     pointless.
> >
> >   - or you duplicate all the compute features to amdgpu so that vk can use
> >     them, and still reasonably easy migrate to winsys/dma-fence mode,
> >     which makes amdkfd rather redundant.
> 
> The basic "compute features" that ROCm depends on are a
> shared-virtual-memory model and user mode command submission. If those
> become part of the graphics ioctl API, I see no problem with
> implementing proper Vulcan compute on top of it.
> 
> For more consistent support for tools such as debugger, profilers, CRIU,
> etc. it would make sense to have a common ioctl API for ROCm and Vulcan,
> which would effectively deprecate the KFD API. And I'm definitely open
> to working on that for future HW generations that can support it.
> 
> We'll probably have to pick a target GPU where we make such a switch,
> and use the time until then to get things ready, prototyping in current
> GFXv10 or newer GPUs. This feels to me a bit like the transition from
> radeon to amdgpu, where GFXv7 was supported by both for some time.
> 
>  
> 
> >
> >   I've discussed this problem extensively with Jason Ekstrand, and it's
> >   really nasty.
> >
> > So purely from a technical pov, only looking at the AMD perspective here,
> > this doesn't make much sense to me. The only reason to keep doubling down
> > on amdkfd I'm seeing is that you've built your compute rocm stack on top
> > of it, and because of that the only option is to keep doing that. Which
> > stops making sense eventually, and we're getting to that point for sure.
> 
> As long as our compute GPUs are based on the GFXv9 architecture, I think
> we don't really have an alternative than to double down on KFD. We can't
> just stop work on KFD and tell our customers that they'll not get any
> new features for the next 2 or 3 years that it takes to build a common
> upstream memory management and scheduling API and the HW that can
> support it.
> 
> 
> >
> > The other side is a bit the upstream side, but that's a lot smaller:
> >
> > - vulkan compute is one of the more reasonable ways to get cross vendor
> >   compute ecosystem off the ground. At least from what I know from
> >   background chatter, which you guys probably haven't all heard. amdkfd
> >   being the single very odd driver here requiring entirely different uapi
> >   for compute mode is not going to be great.
> 
> Like I said, KFD is not exclusive with anything else.
> 
> Vulcan may be a good option for desktops. For HPC uses or general
> portable non-graphics compute code, I think something like OpenMP or
> future C++ standards would be a better way forward. There is no
> technical reason why both cannot coexist in the same application.
> 
> 
> >
> > - CRIU will need new access rights handling (for the save/restore/resume
> >   stuff you're adding). Generally we standardize access rights checks
> >   across drivers, and leave everything else to render drivers (command
> >   submission, memory management, ...). By adding CRIU support to amdkfd
> >   we pretty much guarantee that we wont be able to standardize CRIU access
> >   rights across drivers. Which just plains sucks from an
> >   upstream/cross-vendor ecosystem pov.
> 
> By access rights, do you mean requiring root for some ioctls, or being
> ptrace-attached for others? These are driven by how CRIU works and
> interacts with its target processes. I think they will apply equally to
> any driver implementing CRIU support. I don't see how graphics drivers
> specifically drive standardization of CRIU access rights.

Ah, OK, if that's all already 100% defined by CRIU, I guess it should
all be fine.

> > And yes we'd still need a per-driver criu plugin in userspace, but the
> > same is true for amdvlk/radv/anv/ and all the other drivers we have:
> > Driver is different, access right management is still the same.
> >
> > And secondly, just because nvidia refuses to collaborate in any
> > standards around gpu compute doesn't mean that's a good reason for us to
> > do the same in upstream.
> >
> > Thirdly, it sounds like this is the first device-driver CRIU support, so I
> > think we need a solid agreement/standard here to set as an example for
> > everyone else. There's all the AI accel chips and fpga-for-compute stuff
> > that I expect will eventually also gain CRIU support.
> 
> Sure. My experience with CRIU so far is, that CRIU requires root (even
> without our plugin). Regardless of that, CRIU attaches to the target
> process with ptrace. The kernel's security policy for allowing ptrace
> access is influenced by many factors. We chose not to duplicate this in
> the driver. Therefore we check whether the caller of our CRIU dump ioctl
> is ptrace attached. If it is, it means it has passed all the kernel's
> security checks. We deliberately do not want to implement our own
> security policy in the driver.
> 
> There would be ways to remove the root requirement from our CRIU
> restore-ioctl, if that becomes a requirement for CRIU. It would mean
> some sanity checking of the HW state that's getting restored.

Hm, so how does this work for fd passing? Like, if an fd is passed among
multiple processes, do you need ptrace for all of them? Or do you check
in amdkfd that you have ptrace rights for the process it's attached to?

Maybe these are more relevant questions for the amdgpu side of CRIU.

> > So I'm asking once more, is this _really_ the right path forward? Both for
> > amd (at least long term), but also somewhat for upstream.
> 
> I think long-term it makes sense to plan for a unified API. For current
> hardware I don't see a way to get there for multiple reasons:
> 
>  1. Maintaining backwards compatibility with existing user mode
>  2. Lacking HW capabilities for graphics user mode command submission
>     and HW scheduling

An aside, but from the kernel pov that's not what "compute mode" means.
It's just about not using dma_fence for tracking command submission,
but instead using userspace fences and only having a preempt-ctx fence.
So all you need is some kind of preempt support on your engine, not
userspace submission. Afaiui Windows has been using userspace fences
for 10 years now, so I'm pretty sure your hw can do that.
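
For readers less familiar with the distinction: a userspace fence in
this sense boils down to little more than a sequence number in memory
that the engine writes and userspace waits on, with no dma_fence
involved. The following is a purely conceptual userspace sketch under
that assumption, not amdgpu/amdkfd code:

    /*
     * Conceptual sketch of a userspace fence: a seqno that the GPU (or
     * a kernel proxy) publishes when work completes, and that userspace
     * polls or waits on. Illustrative only.
     */
    #include <stdatomic.h>
    #include <stdint.h>

    struct user_fence {
            _Atomic uint64_t seqno;  /* monotonically increasing */
    };

    static int user_fence_done(struct user_fence *f, uint64_t wait_seqno)
    {
            /* done once the published seqno reaches the awaited value */
            return atomic_load_explicit(&f->seqno,
                                        memory_order_acquire) >= wait_seqno;
    }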

Also, I'm pretty sure the vk compute stuff doesn't use the 3d engine;
that would be mildly silly :-)

> Therefore KFD will be important for AMD's compute GPU strategy and our
> customers for years to come. Not being able to continue developing it
> upstream would be a major impediment for AMD. Again, this is my personal
> opinion.

So, to answer all the others in a wash-up, I think there are two items:

- We're not ditching existing hw support for anything.

- I think AMD needs to talk a bit more across the kfd/gpu split and
  figure out what to actually do here. If you guys only start this by
  the time the code shows up on dri-devel, you're a few years too late,
  and it will eventually end in some big disappointment.

Looking at all the various compute efforts going on, the split between
compute and 3d really isn't that big. And I don't expect there'll be
any other driver stack doing it, so amdkfd is quite the odd one out.
-Daniel

> 
> Regards,
>   Felix
> 
> 
> >
> > Cheers, Daniel
> >
> >
> >> What's working and tested at this point:
> >>
> >>   * Checkpoint and restore accelerated machine learning apps: PyTorch
> >>     running Bert on systems with 1 or 2 GPUs (MI50 or MI100), 100%
> >>     unmodified user mode stack
> >>   * Checkpoint on one system, restore on a different system
> >>   * Checkpoint on one GPU, restore on a different GPU
> >>
> >> Major Known issues:
> >>
> >>   * The KFD ioctl API is not final: Needs a complete redesign to allow
> >>     future extension without breaking the ABI
> >>   * Very slow: Need to implement DMA to dump VRAM contents
> >>
> >> Missing or incomplete features:
> >>
> >>   * Support for the new KFD SVM API
> >>   * Check device topology during restore
> >>   * Checkpoint and restore multiple processes
> >>   * Support for applications using Mesa for video decode/encode
> >>   * Testing with more different GPUs and workloads
> >>
> >> Big Open questions:
> >>
> >>   * What's the preferred way to publish our CRIU plugin? In-tree or
> >>     out-of-tree?
> >>   * What's the preferred way to distribute our CRIU plugin? Source?
> >>     Binary .so? Whole CRIU? Just in-box support?
> >>   * If our plugin can be upstreamed in the CRIU tree, what would be the
> >>     right directory?
> >>
> >> Best regards,
> >>   Felix
> >>
> >>
> >> _______________________________________________
> >> dri-devel mailing list
> >> dri-devel@lists.freedesktop.org
> >> https://lists.freedesktop.org/mailman/listinfo/dri-devel
> > --
> > Daniel Vetter
> > Software Engineer, Intel Corporation
> > http://blog.ffwll.ch/

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC] CRIU support for ROCm
  2021-05-01 17:03   ` Adrian Reber
@ 2021-06-18 21:48     ` Felix Kuehling
  -1 siblings, 0 replies; 18+ messages in thread
From: Felix Kuehling @ 2021-06-18 21:48 UTC (permalink / raw)
  To: Adrian Reber
  Cc: Alexander Mihalicyn, Pavel Emelyanov, Pavel Tikhomirov,
	DRI Development, Bhardwaj, Rajneesh, criu, Yat Sin, David,
	amd-gfx list

On 2021-05-01 at 1:03 p.m., Adrian Reber wrote:
>
> It would also be good to have your patchset submitted as a PR on github
> to have our normal CI test coverage of the changes.

Hi Adrian,

We moved our work to a new GitHub repository that is a fork of
checkpoint-restore/criu so that we could send a pull request:
https://github.com/checkpoint-restore/criu/pull/1519. This is still an
RFC. It has some updates that Rajneesh explained in the pull request.
Two big things are still missing, and we are working on them now:

  * New ioctl API to make it maintainable and extensible for the future
  * Using DMA engines for saving/restoring VRAM contents

We should have another update with those two things in about two weeks.
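
On the first of those items, "extensible" for us means roughly the
usual kernel pattern of a size-versioned argument struct, so that
fields can be appended later without breaking old userspace. The
following is only a generic sketch of that pattern; the struct name,
fields, and ops are hypothetical, not the actual API we will propose:

    /*
     * Hedged sketch of an extensible ioctl argument struct, not the
     * actual KFD CRIU uAPI. Userspace fills in the size of the struct
     * it was built against; the kernel reads that size first, copies
     * min(size, sizeof()) bytes and zero-fills the rest, so new fields
     * can only ever be appended at the end.
     */
    #include <linux/types.h>

    struct kfd_criu_args {
            __u64 size;       /* sizeof() as seen by userspace */
            __u32 op;         /* e.g. process info, BOs, queues, events */
            __u32 flags;
            __u64 data_ptr;   /* pointer to an op-specific payload */
            __u64 num_items;  /* number of entries in the payload */
            /* new fields may be appended here in later versions */
    };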

We'd really appreciate feedback on the changes we had to make to core
CRIU, and on the build system changes for the new plugin directory.

Thanks,
  Felix


>
> 		Adrian

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC] CRIU support for ROCm
  2021-06-18 21:48     ` Felix Kuehling
@ 2021-06-21  8:13       ` Adrian Reber
  -1 siblings, 0 replies; 18+ messages in thread
From: Adrian Reber @ 2021-06-21  8:13 UTC (permalink / raw)
  To: Felix Kuehling
  Cc: Alexander Mihalicyn, Pavel Emelyanov, Pavel Tikhomirov,
	DRI Development, Bhardwaj, Rajneesh, criu, Yat Sin, David,
	amd-gfx list

On Fri, Jun 18, 2021 at 05:48:33PM -0400, Felix Kuehling wrote:
> Am 2021-05-01 um 1:03 p.m. schrieb Adrian Reber:
> >
> > It would also be good to have your patchset submitted as a PR on github
> > to have our normal CI test coverage of the changes.
> 
> Hi Adrian,
> 
> We moved our work to a new github repository that is a fork of
> checkpoint-restore/criu so that we could send a pull request:
> https://github.com/checkpoint-restore/criu/pull/1519. This is still an
> RFC. It has some updates that Rajneesh explained in the pull request.
> Two big things still missing that we are working on now are:
> 
>   * New ioctl API to make it maintainable and extensible for the future
>   * Using DMA engines for saving/restoring VRAM contents
> 
> We should have another update with those two things in about two weeks.
> 
> We'd really appreciate feedback on the changes we had to make to core
> CRIU, and the build system changes for the new plugin directory.

See my comments on the pull request on GitHub.

		Adrian

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2021-06-21 13:09 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-05-01  1:57 [RFC] CRIU support for ROCm Felix Kuehling
2021-05-01  1:57 ` Felix Kuehling
2021-05-01 17:03 ` Adrian Reber
2021-05-01 17:03   ` Adrian Reber
2021-05-03 18:21   ` Felix Kuehling
2021-05-03 18:21     ` Felix Kuehling
2021-05-04 12:32     ` Adrian Reber
2021-05-04 12:32       ` Adrian Reber
2021-06-18 21:48   ` Felix Kuehling
2021-06-18 21:48     ` Felix Kuehling
2021-06-21  8:13     ` Adrian Reber
2021-06-21  8:13       ` Adrian Reber
2021-05-04 13:00 ` Daniel Vetter
2021-05-04 13:00   ` Daniel Vetter
2021-05-06 16:10   ` Felix Kuehling
2021-05-06 16:10     ` Felix Kuehling
2021-05-07  9:32     ` Daniel Vetter
2021-05-07  9:32       ` Daniel Vetter
