From: Felix Kuehling <felix.kuehling@amd.com>
To: Oak Zeng <Oak.Zeng@amd.com>, amd-gfx@lists.freedesktop.org
Cc: feifei.xu@amd.com, leo.liu@amd.com, hawking.zhang@amd.com
Subject: Re: [PATCH 1/3] drm/amdkfd: Disallow debugfs to hang hws when GPU is resetting
Date: Wed, 14 Jul 2021 12:04:21 -0400
Message-ID: <705cea8d-fd6a-ba84-60a9-d6b8749131b5@amd.com>
In-Reply-To: <1626276343-3552-2-git-send-email-Oak.Zeng@amd.com>

On 2021-07-14 at 11:25 a.m., Oak Zeng wrote:
> If the GPU is in the middle of a reset, writing to it can cause an
> unpredictable general protection fault; see the call trace below. Disallow
> using the kfd debugfs hang_hws file to hang the HWS while the GPU is resetting.
>
> [12808.234114] general protection fault: 0000 [#1] SMP NOPTI
> [12808.234119] CPU: 13 PID: 6334 Comm: tee Tainted: G           OE 5.4.0-77-generic #86-Ubuntu
> [12808.234121] Hardware name: ASUS System Product Name/Pro WS WRX80E-SAGE SE WIFI, BIOS 0211 11/27/2020
> [12808.234220] RIP: 0010:kq_submit_packet+0xd/0x50 [amdgpu]
> [12808.234222] Code: 8b 45 d0 48 c7 00 00 00 00 00 b8 f4 ff ff ff eb df 66 66 2e 0f 1f 84 00 00 00 00 00 90 0f 1f 44 00 00 55 48 8b 17 48 8b 47 48 <48> 8b 52 08 48 89 e5 83 7a 20 08 74 14 8b 77 20 89 30 48 8b 47 10
> [12808.234224] RSP: 0018:ffffb0bf4954bdc0 EFLAGS: 00010216
> [12808.234226] RAX: ffffb0bf4a1a5a00 RBX: ffff99302895c0c8 RCX: 0000000000000000
> [12808.234227] RDX: c3156d43d3a04949 RSI: 0000000000000055 RDI: ffff99302584c300
> [12808.234228] RBP: ffffb0bf4954bdf8 R08: 0000000000000543 R09: ffffb0bf4a1a4230
> [12808.234229] R10: 000000000000000a R11: f000000000000000 R12: 0000000000000000
> [12808.234230] R13: ffff99302895c0d8 R14: 00007ffebb3d18f0 R15: 0000000000000005
> [12808.234232] FS:  00007f0d822ef580(0000) GS:ffff99307d340000(0000) knlGS:0000000000000000
> [12808.234233] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [12808.234234] CR2: 00007ffebb3d1908 CR3: 0000001efe1ec000 CR4: 0000000000340ee0
> [12808.234235] Call Trace:
> [12808.234324]  ? pm_debugfs_hang_hws+0x71/0xd0 [amdgpu]
> [12808.234408]  kfd_debugfs_hang_hws+0x2e/0x50 [amdgpu]
> [12808.234494]  kfd_debugfs_hang_hws_write+0xb6/0xc0 [amdgpu]
> [12808.234499]  full_proxy_write+0x5c/0x90
> [12808.234502]  __vfs_write+0x1b/0x40
> [12808.234504]  vfs_write+0xb9/0x1a0
> [12808.234506]  ksys_write+0x67/0xe0
> [12808.234508]  __x64_sys_write+0x1a/0x20
> [12808.234511]  do_syscall_64+0x57/0x190
> [12808.234514]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
>
> Signed-off-by: Oak Zeng <Oak.Zeng@amd.com>
> ---
>  drivers/gpu/drm/amd/amdkfd/kfd_device.c | 5 +++++
>  1 file changed, 5 insertions(+)
>
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
> index 9e4a05e..fc77d03 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
> @@ -1390,6 +1390,11 @@ int kfd_debugfs_hang_hws(struct kfd_dev *dev)
>  		return -EINVAL;
>  	}
>  
> +	if (dev->dqm->is_resetting) {

Checking dev->dqm->is_resetting without holding the dqm_lock is
incorrect. The real problem is not that the GPU is resetting, but that
dqm->packets (the packet manager) is not initialized at that time.

A more general solution would be to move the pm_debugfs_hang_hws call
into dqm_debugfs_execute_queues, which does take the dqm_lock, and add a
check for dqm->packets while holding the lock.
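
Roughly, the idea could be sketched as follows. This is an untested
userspace model of the locking pattern, not the real amdkfd code: the
pthread mutex stands in for dqm_lock/dqm_unlock, the "initialized" flag
and the two counters are hypothetical stand-ins for the packet manager
state and the runlist submission, and -EBUSY is only an assumed error
code.

```c
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-in for the packet manager. "initialized" models
 * whether dqm->packets has been set up; it is false during a reset. */
struct packet_manager {
	bool initialized;
	int hang_packets_sent;	/* counts simulated hang packets */
};

/* Hypothetical stand-in for the device queue manager. */
struct device_queue_manager {
	pthread_mutex_t lock;	/* models dqm_lock()/dqm_unlock() */
	struct packet_manager packets;
	int runlists_executed;	/* counts simulated runlist submissions */
};

/* Simulated packet submission; the caller must hold dqm->lock. */
static int pm_debugfs_hang_hws(struct packet_manager *pm)
{
	pm->hang_packets_sent++;
	return 0;
}

/* The check and the submission happen under the same lock, so the
 * packet manager cannot be torn down between the two steps. */
static int dqm_debugfs_execute_queues(struct device_queue_manager *dqm)
{
	int r;

	pthread_mutex_lock(&dqm->lock);
	if (!dqm->packets.initialized) {
		r = -EBUSY;	/* assumed error code */
		goto out;
	}
	r = pm_debugfs_hang_hws(&dqm->packets);
	if (!r)
		dqm->runlists_executed++;
out:
	pthread_mutex_unlock(&dqm->lock);
	return r;
}
```

With this shape, kfd_debugfs_hang_hws would only need to call
dqm_debugfs_execute_queues, and a write that races with a reset gets an
error back instead of touching an uninitialized packet manager.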

Regards,
  Felix


> +		pr_err("HWS is already resetting, please wait for the current reset to finish\n");
> +		return -EBUSY;
> +	}
> +
>  	r = pm_debugfs_hang_hws(&dev->dqm->packets);
>  	if (!r)
>  		r = dqm_debugfs_execute_queues(dev->dqm);
