From: Felix Kuehling <felix.kuehling@amd.com>
To: philip yang <yangp@amd.com>,
	Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>,
	amd-gfx@lists.freedesktop.org, dri-devel@lists.freedesktop.org
Cc: alexander.deucher@amd.com, daniel.vetter@ffwll.ch,
	christian.koenig@amd.com, airlied@redhat.com
Subject: Re: [Patch v4 18/24] drm/amdkfd: CRIU checkpoint and restore xnack mode
Date: Mon, 10 Jan 2022 19:10:00 -0500
Message-ID: <116c8cf4-57c2-f3a1-f4b9-5f0ef4526967@amd.com>
In-Reply-To: <6e5d64da-3081-a8f3-398c-6e12d18c8507@amd.com>

On 2022-01-05 10:22 a.m., philip yang wrote:
>
>
> On 2021-12-22 7:37 p.m., Rajneesh Bhardwaj wrote:
>> Recoverable page faults are represented by the xnack mode setting inside
>> a kfd process and are used to represent the device page faults. For CR,
>> we don't consider negative values which are typically used for querying
>> the current xnack mode without modifying it.
>>
>> Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
>> ---
>>   drivers/gpu/drm/amd/amdkfd/kfd_chardev.c | 15 +++++++++++++++
>>   drivers/gpu/drm/amd/amdkfd/kfd_priv.h    |  1 +
>>   2 files changed, 16 insertions(+)
>>
>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
>> index 178b0ccfb286..446eb9310915 100644
>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
>> @@ -1845,6 +1845,11 @@ static int criu_checkpoint_process(struct kfd_process *p,
>>   	memset(&process_priv, 0, sizeof(process_priv));
>>   
>>   	process_priv.version = KFD_CRIU_PRIV_VERSION;
>> +	/* For CR, we don't consider negative xnack mode which is used for
>> +	 * querying without changing it, here 0 simply means disabled and 1
>> +	 * means enabled so retry for finding a valid PTE.
>> +	 */
> A negative value to query the xnack mode belongs to the
> kfd_ioctl_set_xnack_mode user-space ioctl interface, which CRIU does
> not use. I think this comment is misleading.
>> +	process_priv.xnack_mode = p->xnack_enabled ? 1 : 0;
> change to process_priv.xnack_enabled
>>   
>>   	ret = copy_to_user(user_priv_data + *priv_offset,
>>   				&process_priv, sizeof(process_priv));
>> @@ -2231,6 +2236,16 @@ static int criu_restore_process(struct kfd_process *p,
>>   		return -EINVAL;
>>   	}
>>   
>> +	pr_debug("Setting XNACK mode\n");
>> +	if (process_priv.xnack_mode && !kfd_process_xnack_mode(p, true)) {
>> +		pr_err("xnack mode cannot be set\n");
>> +		ret = -EPERM;
>> +		goto exit;
>> +	} else {
>
> On GFXv9 GPUs other than Aldebaran, this means a process that was
> checkpointed with xnack off can restore and resume on a GPU with xnack
> on. The shader will continue running successfully, but the driver is
> not guaranteed to keep svm ranges mapped on the GPU all the time; if a
> retry fault happens, the shader will not recover. Maybe change to:
>
> if (KFD_GC_VERSION(dev) != IP_VERSION(9, 4, 2)) {
>
The code here was correct. The xnack mode applies to the whole process, 
not just one GPU. The logic for checking the capabilities of all GPUs is 
already in kfd_process_xnack_mode. If XNACK cannot be supported by all 
GPUs, restoring a non-0 XNACK mode will fail.

Any GPU can run in XNACK-disabled mode. So we don't need any limitations 
for process_priv.xnack_enabled == 0.
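
A minimal sketch of that rule, for illustration only. The sketch_* types
and functions are hypothetical stand-ins, not the real KFD structures or
kfd_process_xnack_mode; the assumption is simply that each GPU either
supports XNACK or does not, and that the mode is stored once per process:

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel structures; not the real KFD types. */
struct sketch_gpu {
	bool supports_xnack;
};

struct sketch_process {
	const struct sketch_gpu *gpus;
	size_t n_gpus;
	bool xnack_enabled;	/* one setting for the whole process */
};

/* True only if every GPU of the process can run with XNACK enabled. */
static bool sketch_all_gpus_support_xnack(const struct sketch_process *p)
{
	for (size_t i = 0; i < p->n_gpus; i++)
		if (!p->gpus[i].supports_xnack)
			return false;
	return true;
}

/*
 * Restore the checkpointed mode. A non-zero mode needs support on all
 * GPUs; mode 0 is accepted unconditionally because any GPU can run
 * with XNACK disabled.
 */
static int sketch_restore_xnack(struct sketch_process *p,
				unsigned int saved_mode)
{
	assert(p != NULL);

	if (saved_mode && !sketch_all_gpus_support_xnack(p))
		return -EPERM;

	p->xnack_enabled = saved_mode != 0;
	return 0;
}
```

With one XNACK-capable and one non-capable GPU in the process, restoring
mode 1 returns -EPERM in this sketch, while restoring mode 0 succeeds.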

Regards,
   Felix


>     if (process_priv.xnack_enabled != kfd_process_xnack_mode(p, true)) {
>             pr_err("xnack mode cannot be set\n");
>             ret = -EPERM;
>             goto exit;
>     }
> }
>
> pr_debug("set xnack mode: %d\n", process_priv.xnack_enabled);
> p->xnack_enabled = process_priv.xnack_enabled;
>
>> +		pr_debug("set xnack mode: %d\n", process_priv.xnack_mode);
>> +		p->xnack_enabled = process_priv.xnack_mode;
>> +	}
>> +
>>   exit:
>>   	return ret;
>>   }
>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
>> index 855c162b85ea..d72dda84c18c 100644
>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
>> @@ -1057,6 +1057,7 @@ void kfd_process_set_trap_handler(struct qcm_process_device *qpd,
>>   
>>   struct kfd_criu_process_priv_data {
>>   	uint32_t version;
>> +	uint32_t xnack_mode;
>
> bool xnack_enabled;
>
> Regards,
>
> Philip
>
>>   };
>>   
>>   struct kfd_criu_device_priv_data {
