From: Felix Kuehling <felix.kuehling@amd.com>
To: criu@openvz.org, amd-gfx list <amd-gfx@lists.freedesktop.org>, DRI Development <dri-devel@lists.freedesktop.org>
Cc: Alexander Mihalicyn <alexander@mihalicyn.com>, Pavel Emelyanov <ovzxemul@gmail.com>, "Bhardwaj, Rajneesh" <Rajneesh.Bhardwaj@amd.com>, Pavel Tikhomirov <snorcht@gmail.com>, "Yat Sin, David" <David.YatSin@amd.com>, Adrian Reber <adrian@lisas.de>
Subject: [RFC] CRIU support for ROCm
Date: Fri, 30 Apr 2021 21:57:45 -0400
Message-ID: <9245171d-ecc9-1bdf-3ecd-cf776dc17855@amd.com>

We have been working on a prototype supporting CRIU (Checkpoint/Restore In Userspace) for accelerated compute applications running on AMD GPUs using ROCm (Radeon Open Compute Platform). We're happy to finally share this work publicly to solicit feedback and advice. The end goal is to get this work included upstream in Linux and CRIU. A short whitepaper describing our design and intention can be found on GitHub: https://github.com/RadeonOpenCompute/criu/tree/criu-dev/test/others/ext-kfd/README.md.

We have RFC patch series for the kernel (based on Alex Deucher's amd-staging-drm-next branch) and for CRIU, including a new plugin and a few core CRIU changes. I will send those to the respective mailing lists separately in a minute. They can also be found on GitHub.

CRIU+plugin: https://github.com/RadeonOpenCompute/criu/commits/criu-dev
Kernel (KFD): https://github.com/RadeonOpenCompute/ROCK-Kernel-Driver/commits/fxkamd/criu-wip

At this point this is very much a work in progress and not ready for upstream inclusion. There are still several missing features, known issues, and open questions that we would like to start addressing with your feedback.

What's working and tested at this point:

* Checkpoint and restore accelerated machine learning apps: PyTorch running BERT on systems with 1 or 2 GPUs (MI50 or MI100), 100% unmodified user mode stack
* Checkpoint on one system, restore on a different system
* Checkpoint on one GPU, restore on a different GPU

Major known issues:

* The KFD ioctl API is not final: it needs a complete redesign to allow future extension without breaking the ABI
* Very slow: we need to implement DMA to dump VRAM contents

Missing or incomplete features:

* Support for the new KFD SVM API
* Check device topology during restore
* Checkpoint and restore multiple processes
* Support for applications using Mesa for video decode/encode
* Testing with a wider range of GPUs and workloads

Big open questions:

* What's the preferred way to publish our CRIU plugin? In-tree or out-of-tree?
* What's the preferred way to distribute our CRIU plugin? Source? Binary .so? Whole CRIU? Just in-box support?
* If our plugin can be upstreamed in the CRIU tree, what would be the right directory?

Best regards,
  Felix

_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel