All of lore.kernel.org
 help / color / mirror / Atom feed
From: "Zhao, Jiange" <Jiange.Zhao@amd.com>
To: "Koenig, Christian" <Christian.Koenig@amd.com>,
	"amd-gfx@lists.freedesktop.org" <amd-gfx@lists.freedesktop.org>
Cc: "Deucher, Alexander" <Alexander.Deucher@amd.com>,
	"Pelloux-prayer,
	Pierre-eric" <Pierre-eric.Pelloux-prayer@amd.com>,
	"Liu, Monk" <Monk.Liu@amd.com>
Subject: RE: [PATCH] drm/amdgpu: Add autodump debugfs node for gpu reset v4
Date: Wed, 13 May 2020 09:37:39 +0000	[thread overview]
Message-ID: <BY5PR12MB3844F65A645C8721BA6363B2E1BF0@BY5PR12MB3844.namprd12.prod.outlook.com> (raw)
In-Reply-To: <9595e459-500a-40f5-e128-079605f9a284@gmail.com>

[AMD Official Use Only - Internal Distribution Only]

Hi Christian,

Since amdgpu_debugfs_wait_dump() would need 'audodump.dumping.done==0' to actually stop and wait for user mode app to dump.

Since usermode app might open a file , do nothing and close it. I believe a poll() function would be a better indicator that the usermode app actually wants to do a dump.

Also, a reset might happen between open() and poll(). The worst case would be wait_dump() would wait until timeout and usermode poll would always fail.

Jiange

-----Original Message-----
From: Christian König <ckoenig.leichtzumerken@gmail.com> 
Sent: Wednesday, May 13, 2020 4:20 PM
To: Zhao, Jiange <Jiange.Zhao@amd.com>; amd-gfx@lists.freedesktop.org
Cc: Deucher, Alexander <Alexander.Deucher@amd.com>; Pelloux-prayer, Pierre-eric <Pierre-eric.Pelloux-prayer@amd.com>; Zhao, Jiange <Jiange.Zhao@amd.com>; Koenig, Christian <Christian.Koenig@amd.com>; Liu, Monk <Monk.Liu@amd.com>
Subject: Re: [PATCH] drm/amdgpu: Add autodump debugfs node for gpu reset v4

Am 09.05.20 um 11:45 schrieb jianzh@amd.com:
> From: Jiange Zhao <Jiange.Zhao@amd.com>
>
> When GPU got timeout, it would notify an interested part of an 
> opportunity to dump info before actual GPU reset.
>
> A usermode app would open 'autodump' node under debugfs system and 
> poll() for readable/writable. When a GPU reset is due, amdgpu would 
> notify usermode app through wait_queue_head and give it 10 minutes to 
> dump info.
>
> After usermode app has done its work, this 'autodump' node is closed.
> On node closure, amdgpu gets to know the dump is done through the 
> completion that is triggered in release().
>
> There is no write or read callback because necessary info can be 
> obtained through dmesg and umr. Messages back and forth between 
> usermode app and amdgpu are unnecessary.
>
> v2: (1) changed 'registered' to 'app_listening'
>      (2) add a mutex in open() to prevent race condition
>
> v3 (chk): grab the reset lock to avoid race in autodump_open,
>            rename debugfs file to amdgpu_autodump,
>            provide autodump_read as well,
>            style and code cleanups
>
> v4: add 'bool app_listening' to differentiate situations, so that
>      the node can be reopened; also, there is no need to wait for
>      completion when no app is waiting for a dump.
>
> v5: change 'bool app_listening' to 'enum amdgpu_autodump_state'
>      add 'app_state_mutex' for race conditions:
> 	(1)Only 1 user can open this file node
> 	(2)wait_dump() can only take effect after poll() executed.
> 	(3)eliminated the race condition between release() and
> 	   wait_dump()
>
> v6: removed 'enum amdgpu_autodump_state' and 'app_state_mutex'
>      removed state checking in amdgpu_debugfs_wait_dump
>      Improve on top of version 3 so that the node can be reopened.
>
> Signed-off-by: Jiange Zhao <Jiange.Zhao@amd.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu.h         |  2 +
>   drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c | 78 ++++++++++++++++++++-
>   drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.h |  6 ++
>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c  |  2 +
>   4 files changed, 87 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> index 2a806cb55b78..9e8eeddfe7ce 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> @@ -992,6 +992,8 @@ struct amdgpu_device {
>   	char				product_number[16];
>   	char				product_name[32];
>   	char				serial[16];
> +
> +	struct amdgpu_autodump		autodump;
>   };
>   
>   static inline struct amdgpu_device *amdgpu_ttm_adev(struct 
> ttm_bo_device *bdev) diff --git 
> a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
> index 1a4894fa3693..261b67ece7fb 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
> @@ -27,7 +27,7 @@
>   #include <linux/pci.h>
>   #include <linux/uaccess.h>
>   #include <linux/pm_runtime.h>
> -
> +#include <linux/poll.h>
>   #include <drm/drm_debugfs.h>
>   
>   #include "amdgpu.h"
> @@ -74,8 +74,82 @@ int amdgpu_debugfs_add_files(struct amdgpu_device *adev,
>   	return 0;
>   }
>   
> +int amdgpu_debugfs_wait_dump(struct amdgpu_device *adev) { #if 
> +defined(CONFIG_DEBUG_FS)
> +	unsigned long timeout = 600 * HZ;
> +	int ret;
> +
> +	wake_up_interruptible(&adev->autodump.gpu_hang);
> +
> +	ret = wait_for_completion_interruptible_timeout(&adev->autodump.dumping, timeout);
> +	complete_all(&adev->autodump.dumping);
> +	if (ret == 0) {
> +		pr_err("autodump: timeout, move on to gpu recovery\n");
> +		return -ETIMEDOUT;
> +	}
> +#endif
> +	return 0;
> +}
> +
>   #if defined(CONFIG_DEBUG_FS)
>   
> +static int amdgpu_debugfs_autodump_open(struct inode *inode, struct 
> +file *file) {
> +	struct amdgpu_device *adev = inode->i_private;
> +	int ret;
> +
> +	file->private_data = adev;
> +
> +	mutex_lock(&adev->lock_reset);
> +	if (adev->autodump.dumping.done)
> +		ret = 0;
> +	else
> +		ret = -EBUSY;
> +	mutex_unlock(&adev->lock_reset);
> +
> +	return ret;
> +}
> +
> +static int amdgpu_debugfs_autodump_release(struct inode *inode, 
> +struct file *file) {
> +	struct amdgpu_device *adev = file->private_data;
> +
> +	complete_all(&adev->autodump.dumping);
> +	return 0;
> +}
> +
> +static unsigned int amdgpu_debugfs_autodump_poll(struct file *file, 
> +struct poll_table_struct *poll_table) {
> +	struct amdgpu_device *adev = file->private_data;
> +
> +	reinit_completion(&adev->autodump.dumping);

Why do you have the reinit_completion here and not in open callback?

Apart from that looks good to me.

Regards,
Christian.

> +	poll_wait(file, &adev->autodump.gpu_hang, poll_table);
> +
> +	if (adev->in_gpu_reset)
> +		return POLLIN | POLLRDNORM | POLLWRNORM;
> +
> +	return 0;
> +}
> +
> +static const struct file_operations autodump_debug_fops = {
> +	.owner = THIS_MODULE,
> +	.open = amdgpu_debugfs_autodump_open,
> +	.poll = amdgpu_debugfs_autodump_poll,
> +	.release = amdgpu_debugfs_autodump_release, };
> +
> +static void amdgpu_debugfs_autodump_init(struct amdgpu_device *adev) 
> +{
> +	init_completion(&adev->autodump.dumping);
> +	complete_all(&adev->autodump.dumping);
> +	init_waitqueue_head(&adev->autodump.gpu_hang);
> +
> +	debugfs_create_file("amdgpu_autodump", 0600,
> +		adev->ddev->primary->debugfs_root,
> +		adev, &autodump_debug_fops);
> +}
> +
>   /**
>    * amdgpu_debugfs_process_reg_op - Handle MMIO register reads/writes
>    *
> @@ -1434,6 +1508,8 @@ int amdgpu_debugfs_init(struct amdgpu_device 
> *adev)
>   
>   	amdgpu_ras_debugfs_create_all(adev);
>   
> +	amdgpu_debugfs_autodump_init(adev);
> +
>   	return amdgpu_debugfs_add_files(adev, amdgpu_debugfs_list,
>   					ARRAY_SIZE(amdgpu_debugfs_list));
>   }
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.h 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.h
> index de12d1101526..2803884d338d 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.h
> @@ -31,6 +31,11 @@ struct amdgpu_debugfs {
>   	unsigned		num_files;
>   };
>   
> +struct amdgpu_autodump {
> +	struct completion		dumping;
> +	struct wait_queue_head		gpu_hang;
> +};
> +
>   int amdgpu_debugfs_regs_init(struct amdgpu_device *adev);
>   int amdgpu_debugfs_init(struct amdgpu_device *adev);
>   void amdgpu_debugfs_fini(struct amdgpu_device *adev); @@ -40,3 +45,4 
> @@ int amdgpu_debugfs_add_files(struct amdgpu_device *adev,
>   int amdgpu_debugfs_fence_init(struct amdgpu_device *adev);
>   int amdgpu_debugfs_firmware_init(struct amdgpu_device *adev);
>   int amdgpu_debugfs_gem_init(struct amdgpu_device *adev);
> +int amdgpu_debugfs_wait_dump(struct amdgpu_device *adev);
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index cc41e8f5ad14..545beebcf43e 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -3927,6 +3927,8 @@ static int amdgpu_device_pre_asic_reset(struct amdgpu_device *adev,
>   	int i, r = 0;
>   	bool need_full_reset  = *need_full_reset_arg;
>   
> +	amdgpu_debugfs_wait_dump(adev);
> +
>   	/* block all schedulers and reset given job's ring */
>   	for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
>   		struct amdgpu_ring *ring = adev->rings[i];
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

  reply	other threads:[~2020-05-13  9:37 UTC|newest]

Thread overview: 23+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-05-09  9:45 [PATCH] drm/amdgpu: Add autodump debugfs node for gpu reset v4 jianzh
2020-05-13  8:16 ` Zhao, Jiange
2020-05-13  8:20   ` Christian König
2020-05-13  8:19 ` Christian König
2020-05-13  9:37   ` Zhao, Jiange [this message]
2020-05-13 10:36     ` Christian König
  -- strict thread matches above, loose matches on Subject: below --
2020-05-14  9:18 jianzh
2020-05-14 10:25 ` Christian König
2020-05-14  5:29 jianzh
2020-05-14  7:01 ` Christian König
2020-05-14  7:14   ` Zhao, Jiange
2020-05-14  8:28     ` Christian König
2020-05-14 15:15       ` Li, Dennis
2020-05-15  2:40         ` Zhao, Jiange
2020-05-15  6:51           ` Christian König
2020-04-29  5:06 jianzh
2020-04-29  5:08 ` Zhao, Jiange
2020-04-29 14:04   ` Pierre-Eric Pelloux-Prayer
2020-04-29 14:08     ` Christian König
2020-05-06  3:45       ` Zhao, Jiange
2020-05-06  8:05         ` Christian König
2020-04-26 10:09 jianzh
2020-04-26 10:40 ` Christian König

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=BY5PR12MB3844F65A645C8721BA6363B2E1BF0@BY5PR12MB3844.namprd12.prod.outlook.com \
    --to=jiange.zhao@amd.com \
    --cc=Alexander.Deucher@amd.com \
    --cc=Christian.Koenig@amd.com \
    --cc=Monk.Liu@amd.com \
    --cc=Pierre-eric.Pelloux-prayer@amd.com \
    --cc=amd-gfx@lists.freedesktop.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.