linux-arm-msm.vger.kernel.org archive mirror
* [PATCH v6 0/6] Add fdinfo support to Panfrost
@ 2023-09-19 23:34 Adrián Larumbe
  2023-09-19 23:34 ` [PATCH v6 1/6] drm/panfrost: Add cycle count GPU register definitions Adrián Larumbe
                   ` (5 more replies)
  0 siblings, 6 replies; 20+ messages in thread
From: Adrián Larumbe @ 2023-09-19 23:34 UTC (permalink / raw)
  To: maarten.lankhorst, mripard, tzimmermann, airlied, daniel,
	robdclark, quic_abhinavk, dmitry.baryshkov, sean, marijn.suijten,
	robh, steven.price
  Cc: adrian.larumbe, dri-devel, linux-kernel, linux-arm-msm,
	freedreno, healych, kernel

This patch series adds fdinfo support to the Panfrost DRM driver. It will
display a series of key:value pairs under /proc/pid/fdinfo/fd for render
processes that open the Panfrost DRM file.

The pairs contain basic DRM GPU engine and memory region information that
can either be read with cat by a privileged user or accessed with IGT's
gputop utility.

Changelog:

v1: https://lore.kernel.org/lkml/bb52b872-e41b-3894-285e-b52cfc849782@arm.com/T/

v2: https://lore.kernel.org/lkml/20230901084457.5bc1ad69@collabora.com/T/
 - Changed the way GPU cycles and engine time are calculated, using GPU
 registers and taking into account potential resets.
 - Split render engine values into fragment and vertex/tiler ones.
 - Added more fine-grained calculation of RSS size for BOs
 - Implemented selection of drm-memory region size units
 - Removed locking of shrinker's mutex in GEM obj status function

v3: https://lore.kernel.org/lkml/20230905184533.959171-1-adrian.larumbe@collabora.com/
 - Changed fdinfo engine names to something more descriptive
 - Mentioned GPU cycle counts aren't an exact measure
 - Handled the case when job->priv might be NULL
 - Handled 32 bit overflow of cycle register
 - Kept fdinfo drm memory stats size unit display within 10k times the
   previous multiplier for more accurate BO size numbers
 - Removed special handling of Prime imported BO RSS
 - Use rss_size only for heap objects
 - Use bo->base.madv instead of specific purgeable flag
 - Fixed kernel test robot warnings

v4: https://lore.kernel.org/lkml/20230912084044.955864-1-adrian.larumbe@collabora.com/
 - Move cycle counter get and put to panfrost_job_hw_submit and
   panfrost_job_handle_{err,done} for more accuracy
 - Make sure cycle counter refs are released in reset path
 - Dropped the module param for toggling cycle counting, leaving it
   down to the debugfs file
 - Don't disable the cycle counter when toggling the debugfs file;
   let the refcounting logic handle it instead.
 - Remove fdinfo data nested structure definition and 'names' field
 - When incrementing BO RSS size in the GPU MMU page fault IRQ handler, assume
   a granularity of 2 MiB for every successful mapping.
 - drm-file picks an fdinfo memory object size unit that doesn't lose precision.

v5: https://lore.kernel.org/lkml/20230914223928.2374933-1-adrian.larumbe@collabora.com/
 - Removed explicit initialisation of atomic variable for profiling mode,
   as it's allocated with kzalloc.
 - Pass the engine utilisation structure to jobs rather than the file context,
   to avoid future misuse of the latter.
 - Remove double reading of the cycle counter register and ktime in the job
   dequeue function, as the scheduler will make sure these values are read
   again in case of requeueing.
 - Moved putting of cycle counting refcnt into panfrost job dequeue
   function to avoid repetition.

v6:
 - Fixed engine time and cycle values being swapped around in fdinfo
   drm print statements.

Adrián Larumbe (6):
  drm/panfrost: Add cycle count GPU register definitions
  drm/panfrost: Add fdinfo support GPU load metrics
  drm/panfrost: Add fdinfo support for memory stats
  drm/drm_file: Add DRM obj's RSS reporting function for fdinfo
  drm/panfrost: Implement generic DRM object RSS reporting function
  drm/drm-file: Show finer-grained BO sizes in drm_show_memory_stats

 drivers/gpu/drm/drm_file.c                  | 10 +++-
 drivers/gpu/drm/panfrost/Makefile           |  2 +
 drivers/gpu/drm/panfrost/panfrost_debugfs.c | 20 +++++++
 drivers/gpu/drm/panfrost/panfrost_debugfs.h | 13 +++++
 drivers/gpu/drm/panfrost/panfrost_devfreq.c |  8 +++
 drivers/gpu/drm/panfrost/panfrost_devfreq.h |  3 ++
 drivers/gpu/drm/panfrost/panfrost_device.c  |  2 +
 drivers/gpu/drm/panfrost/panfrost_device.h  | 13 +++++
 drivers/gpu/drm/panfrost/panfrost_drv.c     | 59 ++++++++++++++++++++-
 drivers/gpu/drm/panfrost/panfrost_gem.c     | 29 ++++++++++
 drivers/gpu/drm/panfrost/panfrost_gem.h     |  5 ++
 drivers/gpu/drm/panfrost/panfrost_gpu.c     | 41 ++++++++++++++
 drivers/gpu/drm/panfrost/panfrost_gpu.h     |  4 ++
 drivers/gpu/drm/panfrost/panfrost_job.c     | 24 +++++++++
 drivers/gpu/drm/panfrost/panfrost_job.h     |  5 ++
 drivers/gpu/drm/panfrost/panfrost_mmu.c     |  1 +
 drivers/gpu/drm/panfrost/panfrost_regs.h    |  5 ++
 include/drm/drm_gem.h                       |  9 ++++
 18 files changed, 250 insertions(+), 3 deletions(-)
 create mode 100644 drivers/gpu/drm/panfrost/panfrost_debugfs.c
 create mode 100644 drivers/gpu/drm/panfrost/panfrost_debugfs.h


base-commit: f45acf7acf75921c0409d452f0165f51a19a74fd
-- 
2.42.0


^ permalink raw reply	[flat|nested] 20+ messages in thread

* [PATCH v6 1/6] drm/panfrost: Add cycle count GPU register definitions
  2023-09-19 23:34 [PATCH v6 0/6] Add fdinfo support to Panfrost Adrián Larumbe
@ 2023-09-19 23:34 ` Adrián Larumbe
  2023-09-19 23:34 ` [PATCH v6 2/6] drm/panfrost: Add fdinfo support GPU load metrics Adrián Larumbe
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 20+ messages in thread
From: Adrián Larumbe @ 2023-09-19 23:34 UTC (permalink / raw)
  To: maarten.lankhorst, mripard, tzimmermann, airlied, daniel,
	robdclark, quic_abhinavk, dmitry.baryshkov, sean, marijn.suijten,
	robh, steven.price
  Cc: adrian.larumbe, dri-devel, linux-kernel, linux-arm-msm,
	freedreno, healych, kernel, Boris Brezillon

These GPU registers will be used when programming the cycle counter, which
we need for providing accurate fdinfo drm-cycles values to user space.

Signed-off-by: Adrián Larumbe <adrian.larumbe@collabora.com>
Reviewed-by: Boris Brezillon <boris.brezillon@collabora.com>
Reviewed-by: Steven Price <steven.price@arm.com>
---
 drivers/gpu/drm/panfrost/panfrost_regs.h | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/drivers/gpu/drm/panfrost/panfrost_regs.h b/drivers/gpu/drm/panfrost/panfrost_regs.h
index 919f44ac853d..55ec807550b3 100644
--- a/drivers/gpu/drm/panfrost/panfrost_regs.h
+++ b/drivers/gpu/drm/panfrost/panfrost_regs.h
@@ -46,6 +46,8 @@
 #define   GPU_CMD_SOFT_RESET		0x01
 #define   GPU_CMD_PERFCNT_CLEAR		0x03
 #define   GPU_CMD_PERFCNT_SAMPLE	0x04
+#define   GPU_CMD_CYCLE_COUNT_START	0x05
+#define   GPU_CMD_CYCLE_COUNT_STOP	0x06
 #define   GPU_CMD_CLEAN_CACHES		0x07
 #define   GPU_CMD_CLEAN_INV_CACHES	0x08
 #define GPU_STATUS			0x34
@@ -73,6 +75,9 @@
 #define GPU_PRFCNT_TILER_EN		0x74
 #define GPU_PRFCNT_MMU_L2_EN		0x7c
 
+#define GPU_CYCLE_COUNT_LO		0x90
+#define GPU_CYCLE_COUNT_HI		0x94
+
 #define GPU_THREAD_MAX_THREADS		0x0A0	/* (RO) Maximum number of threads per core */
 #define GPU_THREAD_MAX_WORKGROUP_SIZE	0x0A4	/* (RO) Maximum workgroup size */
 #define GPU_THREAD_MAX_BARRIER_SIZE	0x0A8	/* (RO) Maximum threads waiting at a barrier */
-- 
2.42.0



* [PATCH v6 2/6] drm/panfrost: Add fdinfo support GPU load metrics
  2023-09-19 23:34 [PATCH v6 0/6] Add fdinfo support to Panfrost Adrián Larumbe
  2023-09-19 23:34 ` [PATCH v6 1/6] drm/panfrost: Add cycle count GPU register definitions Adrián Larumbe
@ 2023-09-19 23:34 ` Adrián Larumbe
  2023-09-20 15:40   ` Tvrtko Ursulin
  2023-09-19 23:34 ` [PATCH v6 3/6] drm/panfrost: Add fdinfo support for memory stats Adrián Larumbe
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 20+ messages in thread
From: Adrián Larumbe @ 2023-09-19 23:34 UTC (permalink / raw)
  To: maarten.lankhorst, mripard, tzimmermann, airlied, daniel,
	robdclark, quic_abhinavk, dmitry.baryshkov, sean, marijn.suijten,
	robh, steven.price
  Cc: adrian.larumbe, dri-devel, linux-kernel, linux-arm-msm,
	freedreno, healych, kernel, Boris Brezillon

The drm-stats fdinfo tags made available to user space are drm-engine,
drm-cycles, drm-maxfreq and drm-curfreq, one per job slot.

This deviates from standard practice in other DRM drivers, where a single
set of key:value pairs is provided for the whole render engine. However,
Panfrost has separate queues for fragment and vertex/tiler jobs, so a
decision was made to calculate bus cycles and workload times separately.

Maximum operating frequency is calculated at devfreq initialisation time.
Current frequency is made available to user space because nvtop uses it
when performing engine usage calculations.

It is important to bear in mind that both the GPU cycle and kernel time
numbers provided are at best rough estimations, and always reported in
excess of the actual figure, for two reasons:
 - Excess time because of the delay between the end of a job processing,
   the subsequent job IRQ and the actual time of the sample.
 - Time spent in the engine queue waiting for the GPU to pick up the next
   job.

To avoid race conditions during enabling/disabling, a reference counting
mechanism was introduced, along with a job flag that tells us whether a
given job increased the refcount. This is necessary because user space can
toggle cycle counting through a debugfs file, and a given job might have
been in flight by the time cycle counting was disabled.

The main goal of the debugfs cycle counter knob is to let tools like nvtop
or IGT's gputop toggle it at any time, to avoid wasting power when no
engine usage measurement is necessary.

Signed-off-by: Adrián Larumbe <adrian.larumbe@collabora.com>
Reviewed-by: Boris Brezillon <boris.brezillon@collabora.com>
Reviewed-by: Steven Price <steven.price@arm.com>
---
 drivers/gpu/drm/panfrost/Makefile           |  2 +
 drivers/gpu/drm/panfrost/panfrost_debugfs.c | 20 ++++++++
 drivers/gpu/drm/panfrost/panfrost_debugfs.h | 13 +++++
 drivers/gpu/drm/panfrost/panfrost_devfreq.c |  8 +++
 drivers/gpu/drm/panfrost/panfrost_devfreq.h |  3 ++
 drivers/gpu/drm/panfrost/panfrost_device.c  |  2 +
 drivers/gpu/drm/panfrost/panfrost_device.h  | 13 +++++
 drivers/gpu/drm/panfrost/panfrost_drv.c     | 57 ++++++++++++++++++++-
 drivers/gpu/drm/panfrost/panfrost_gpu.c     | 41 +++++++++++++++
 drivers/gpu/drm/panfrost/panfrost_gpu.h     |  4 ++
 drivers/gpu/drm/panfrost/panfrost_job.c     | 24 +++++++++
 drivers/gpu/drm/panfrost/panfrost_job.h     |  5 ++
 12 files changed, 191 insertions(+), 1 deletion(-)
 create mode 100644 drivers/gpu/drm/panfrost/panfrost_debugfs.c
 create mode 100644 drivers/gpu/drm/panfrost/panfrost_debugfs.h

diff --git a/drivers/gpu/drm/panfrost/Makefile b/drivers/gpu/drm/panfrost/Makefile
index 7da2b3f02ed9..2c01c1e7523e 100644
--- a/drivers/gpu/drm/panfrost/Makefile
+++ b/drivers/gpu/drm/panfrost/Makefile
@@ -12,4 +12,6 @@ panfrost-y := \
 	panfrost_perfcnt.o \
 	panfrost_dump.o
 
+panfrost-$(CONFIG_DEBUG_FS) += panfrost_debugfs.o
+
 obj-$(CONFIG_DRM_PANFROST) += panfrost.o
diff --git a/drivers/gpu/drm/panfrost/panfrost_debugfs.c b/drivers/gpu/drm/panfrost/panfrost_debugfs.c
new file mode 100644
index 000000000000..cc14eccba206
--- /dev/null
+++ b/drivers/gpu/drm/panfrost/panfrost_debugfs.c
@@ -0,0 +1,20 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright 2023 Collabora ltd. */
+
+#include <linux/debugfs.h>
+#include <linux/platform_device.h>
+#include <drm/drm_debugfs.h>
+#include <drm/drm_file.h>
+#include <drm/panfrost_drm.h>
+
+#include "panfrost_device.h"
+#include "panfrost_gpu.h"
+#include "panfrost_debugfs.h"
+
+void panfrost_debugfs_init(struct drm_minor *minor)
+{
+	struct drm_device *dev = minor->dev;
+	struct panfrost_device *pfdev = platform_get_drvdata(to_platform_device(dev->dev));
+
+	debugfs_create_atomic_t("profile", 0600, minor->debugfs_root, &pfdev->profile_mode);
+}
diff --git a/drivers/gpu/drm/panfrost/panfrost_debugfs.h b/drivers/gpu/drm/panfrost/panfrost_debugfs.h
new file mode 100644
index 000000000000..db1c158bcf2f
--- /dev/null
+++ b/drivers/gpu/drm/panfrost/panfrost_debugfs.h
@@ -0,0 +1,13 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright 2023 Collabora ltd.
+ */
+
+#ifndef PANFROST_DEBUGFS_H
+#define PANFROST_DEBUGFS_H
+
+#ifdef CONFIG_DEBUG_FS
+void panfrost_debugfs_init(struct drm_minor *minor);
+#endif
+
+#endif  /* PANFROST_DEBUGFS_H */
diff --git a/drivers/gpu/drm/panfrost/panfrost_devfreq.c b/drivers/gpu/drm/panfrost/panfrost_devfreq.c
index 58dfb15a8757..28caffc689e2 100644
--- a/drivers/gpu/drm/panfrost/panfrost_devfreq.c
+++ b/drivers/gpu/drm/panfrost/panfrost_devfreq.c
@@ -58,6 +58,7 @@ static int panfrost_devfreq_get_dev_status(struct device *dev,
 	spin_lock_irqsave(&pfdevfreq->lock, irqflags);
 
 	panfrost_devfreq_update_utilization(pfdevfreq);
+	pfdevfreq->current_frequency = status->current_frequency;
 
 	status->total_time = ktime_to_ns(ktime_add(pfdevfreq->busy_time,
 						   pfdevfreq->idle_time));
@@ -117,6 +118,7 @@ int panfrost_devfreq_init(struct panfrost_device *pfdev)
 	struct devfreq *devfreq;
 	struct thermal_cooling_device *cooling;
 	struct panfrost_devfreq *pfdevfreq = &pfdev->pfdevfreq;
+	unsigned long freq = ULONG_MAX;
 
 	if (pfdev->comp->num_supplies > 1) {
 		/*
@@ -172,6 +174,12 @@ int panfrost_devfreq_init(struct panfrost_device *pfdev)
 		return ret;
 	}
 
+	/* Find the fastest defined rate  */
+	opp = dev_pm_opp_find_freq_floor(dev, &freq);
+	if (IS_ERR(opp))
+		return PTR_ERR(opp);
+	pfdevfreq->fast_rate = freq;
+
 	dev_pm_opp_put(opp);
 
 	/*
diff --git a/drivers/gpu/drm/panfrost/panfrost_devfreq.h b/drivers/gpu/drm/panfrost/panfrost_devfreq.h
index 1514c1f9d91c..48dbe185f206 100644
--- a/drivers/gpu/drm/panfrost/panfrost_devfreq.h
+++ b/drivers/gpu/drm/panfrost/panfrost_devfreq.h
@@ -19,6 +19,9 @@ struct panfrost_devfreq {
 	struct devfreq_simple_ondemand_data gov_data;
 	bool opp_of_table_added;
 
+	unsigned long current_frequency;
+	unsigned long fast_rate;
+
 	ktime_t busy_time;
 	ktime_t idle_time;
 	ktime_t time_last_update;
diff --git a/drivers/gpu/drm/panfrost/panfrost_device.c b/drivers/gpu/drm/panfrost/panfrost_device.c
index fa1a086a862b..28f7046e1b1a 100644
--- a/drivers/gpu/drm/panfrost/panfrost_device.c
+++ b/drivers/gpu/drm/panfrost/panfrost_device.c
@@ -207,6 +207,8 @@ int panfrost_device_init(struct panfrost_device *pfdev)
 
 	spin_lock_init(&pfdev->as_lock);
 
+	spin_lock_init(&pfdev->cycle_counter.lock);
+
 	err = panfrost_clk_init(pfdev);
 	if (err) {
 		dev_err(pfdev->dev, "clk init failed %d\n", err);
diff --git a/drivers/gpu/drm/panfrost/panfrost_device.h b/drivers/gpu/drm/panfrost/panfrost_device.h
index b0126b9fbadc..1e85656dc2f7 100644
--- a/drivers/gpu/drm/panfrost/panfrost_device.h
+++ b/drivers/gpu/drm/panfrost/panfrost_device.h
@@ -107,6 +107,7 @@ struct panfrost_device {
 	struct list_head scheduled_jobs;
 
 	struct panfrost_perfcnt *perfcnt;
+	atomic_t profile_mode;
 
 	struct mutex sched_lock;
 
@@ -121,6 +122,11 @@ struct panfrost_device {
 	struct shrinker shrinker;
 
 	struct panfrost_devfreq pfdevfreq;
+
+	struct {
+		atomic_t use_count;
+		spinlock_t lock;
+	} cycle_counter;
 };
 
 struct panfrost_mmu {
@@ -135,12 +141,19 @@ struct panfrost_mmu {
 	struct list_head list;
 };
 
+struct panfrost_engine_usage {
+	unsigned long long elapsed_ns[NUM_JOB_SLOTS];
+	unsigned long long cycles[NUM_JOB_SLOTS];
+};
+
 struct panfrost_file_priv {
 	struct panfrost_device *pfdev;
 
 	struct drm_sched_entity sched_entity[NUM_JOB_SLOTS];
 
 	struct panfrost_mmu *mmu;
+
+	struct panfrost_engine_usage engine_usage;
 };
 
 static inline struct panfrost_device *to_panfrost_device(struct drm_device *ddev)
diff --git a/drivers/gpu/drm/panfrost/panfrost_drv.c b/drivers/gpu/drm/panfrost/panfrost_drv.c
index a2ab99698ca8..3c93a11deab1 100644
--- a/drivers/gpu/drm/panfrost/panfrost_drv.c
+++ b/drivers/gpu/drm/panfrost/panfrost_drv.c
@@ -20,6 +20,7 @@
 #include "panfrost_job.h"
 #include "panfrost_gpu.h"
 #include "panfrost_perfcnt.h"
+#include "panfrost_debugfs.h"
 
 static bool unstable_ioctls;
 module_param_unsafe(unstable_ioctls, bool, 0600);
@@ -267,6 +268,7 @@ static int panfrost_ioctl_submit(struct drm_device *dev, void *data,
 	job->requirements = args->requirements;
 	job->flush_id = panfrost_gpu_get_latest_flush_id(pfdev);
 	job->mmu = file_priv->mmu;
+	job->engine_usage = &file_priv->engine_usage;
 
 	slot = panfrost_job_get_slot(job);
 
@@ -523,7 +525,55 @@ static const struct drm_ioctl_desc panfrost_drm_driver_ioctls[] = {
 	PANFROST_IOCTL(MADVISE,		madvise,	DRM_RENDER_ALLOW),
 };
 
-DEFINE_DRM_GEM_FOPS(panfrost_drm_driver_fops);
+
+static void panfrost_gpu_show_fdinfo(struct panfrost_device *pfdev,
+				     struct panfrost_file_priv *panfrost_priv,
+				     struct drm_printer *p)
+{
+	int i;
+
+	/*
+	 * IMPORTANT NOTE: drm-cycles and drm-engine measurements are not
+	 * accurate, as they only provide a rough estimation of the number of
+	 * GPU cycles and CPU time spent in a given context. This is due to two
+	 * different factors:
+	 * - Firstly, we must consider the time the CPU and then the kernel
+	 *   takes to process the GPU interrupt, which means additional time and
+	 *   GPU cycles will be added in excess to the real figure.
+	 * - Secondly, the pipelining done by the Job Manager (2 job slots per
+	 *   engine) implies there is no way to know exactly how much time each
+	 *   job spent on the GPU.
+	 */
+
+	static const char * const engine_names[] = {
+		"fragment", "vertex-tiler", "compute-only"
+	};
+
+	for (i = 0; i < NUM_JOB_SLOTS - 1; i++) {
+		drm_printf(p, "drm-engine-%s:\t%llu ns\n",
+			   engine_names[i], panfrost_priv->engine_usage.elapsed_ns[i]);
+		drm_printf(p, "drm-cycles-%s:\t%llu\n",
+			   engine_names[i], panfrost_priv->engine_usage.cycles[i]);
+		drm_printf(p, "drm-maxfreq-%s:\t%lu Hz\n",
+			   engine_names[i], pfdev->pfdevfreq.fast_rate);
+		drm_printf(p, "drm-curfreq-%s:\t%lu Hz\n",
+			   engine_names[i], pfdev->pfdevfreq.current_frequency);
+	}
+}
+
+static void panfrost_show_fdinfo(struct drm_printer *p, struct drm_file *file)
+{
+	struct drm_device *dev = file->minor->dev;
+	struct panfrost_device *pfdev = dev->dev_private;
+
+	panfrost_gpu_show_fdinfo(pfdev, file->driver_priv, p);
+}
+
+static const struct file_operations panfrost_drm_driver_fops = {
+	.owner = THIS_MODULE,
+	DRM_GEM_FOPS,
+	.show_fdinfo = drm_show_fdinfo,
+};
 
 /*
  * Panfrost driver version:
@@ -535,6 +585,7 @@ static const struct drm_driver panfrost_drm_driver = {
 	.driver_features	= DRIVER_RENDER | DRIVER_GEM | DRIVER_SYNCOBJ,
 	.open			= panfrost_open,
 	.postclose		= panfrost_postclose,
+	.show_fdinfo		= panfrost_show_fdinfo,
 	.ioctls			= panfrost_drm_driver_ioctls,
 	.num_ioctls		= ARRAY_SIZE(panfrost_drm_driver_ioctls),
 	.fops			= &panfrost_drm_driver_fops,
@@ -546,6 +597,10 @@ static const struct drm_driver panfrost_drm_driver = {
 
 	.gem_create_object	= panfrost_gem_create_object,
 	.gem_prime_import_sg_table = panfrost_gem_prime_import_sg_table,
+
+#ifdef CONFIG_DEBUG_FS
+	.debugfs_init		= panfrost_debugfs_init,
+#endif
 };
 
 static int panfrost_probe(struct platform_device *pdev)
diff --git a/drivers/gpu/drm/panfrost/panfrost_gpu.c b/drivers/gpu/drm/panfrost/panfrost_gpu.c
index 2faa344d89ee..f0be7e19b13e 100644
--- a/drivers/gpu/drm/panfrost/panfrost_gpu.c
+++ b/drivers/gpu/drm/panfrost/panfrost_gpu.c
@@ -73,6 +73,13 @@ int panfrost_gpu_soft_reset(struct panfrost_device *pfdev)
 	gpu_write(pfdev, GPU_INT_CLEAR, GPU_IRQ_MASK_ALL);
 	gpu_write(pfdev, GPU_INT_MASK, GPU_IRQ_MASK_ALL);
 
+	/*
+	 * All in-flight jobs should have released their cycle
+	 * counter references upon reset, but let us make sure
+	 */
+	if (drm_WARN_ON(pfdev->ddev, atomic_read(&pfdev->cycle_counter.use_count) != 0))
+		atomic_set(&pfdev->cycle_counter.use_count, 0);
+
 	return 0;
 }
 
@@ -321,6 +328,40 @@ static void panfrost_gpu_init_features(struct panfrost_device *pfdev)
 		 pfdev->features.shader_present, pfdev->features.l2_present);
 }
 
+void panfrost_cycle_counter_get(struct panfrost_device *pfdev)
+{
+	if (atomic_inc_not_zero(&pfdev->cycle_counter.use_count))
+		return;
+
+	spin_lock(&pfdev->cycle_counter.lock);
+	if (atomic_inc_return(&pfdev->cycle_counter.use_count) == 1)
+		gpu_write(pfdev, GPU_CMD, GPU_CMD_CYCLE_COUNT_START);
+	spin_unlock(&pfdev->cycle_counter.lock);
+}
+
+void panfrost_cycle_counter_put(struct panfrost_device *pfdev)
+{
+	if (atomic_add_unless(&pfdev->cycle_counter.use_count, -1, 1))
+		return;
+
+	spin_lock(&pfdev->cycle_counter.lock);
+	if (atomic_dec_return(&pfdev->cycle_counter.use_count) == 0)
+		gpu_write(pfdev, GPU_CMD, GPU_CMD_CYCLE_COUNT_STOP);
+	spin_unlock(&pfdev->cycle_counter.lock);
+}
+
+unsigned long long panfrost_cycle_counter_read(struct panfrost_device *pfdev)
+{
+	u32 hi, lo;
+
+	do {
+		hi = gpu_read(pfdev, GPU_CYCLE_COUNT_HI);
+		lo = gpu_read(pfdev, GPU_CYCLE_COUNT_LO);
+	} while (hi != gpu_read(pfdev, GPU_CYCLE_COUNT_HI));
+
+	return ((u64)hi << 32) | lo;
+}
+
 void panfrost_gpu_power_on(struct panfrost_device *pfdev)
 {
 	int ret;
diff --git a/drivers/gpu/drm/panfrost/panfrost_gpu.h b/drivers/gpu/drm/panfrost/panfrost_gpu.h
index 468c51e7e46d..876fdad9f721 100644
--- a/drivers/gpu/drm/panfrost/panfrost_gpu.h
+++ b/drivers/gpu/drm/panfrost/panfrost_gpu.h
@@ -16,6 +16,10 @@ int panfrost_gpu_soft_reset(struct panfrost_device *pfdev);
 void panfrost_gpu_power_on(struct panfrost_device *pfdev);
 void panfrost_gpu_power_off(struct panfrost_device *pfdev);
 
+void panfrost_cycle_counter_get(struct panfrost_device *pfdev);
+void panfrost_cycle_counter_put(struct panfrost_device *pfdev);
+unsigned long long panfrost_cycle_counter_read(struct panfrost_device *pfdev);
+
 void panfrost_gpu_amlogic_quirk(struct panfrost_device *pfdev);
 
 #endif
diff --git a/drivers/gpu/drm/panfrost/panfrost_job.c b/drivers/gpu/drm/panfrost/panfrost_job.c
index 033f5e684707..fb16de2d0420 100644
--- a/drivers/gpu/drm/panfrost/panfrost_job.c
+++ b/drivers/gpu/drm/panfrost/panfrost_job.c
@@ -159,6 +159,16 @@ panfrost_dequeue_job(struct panfrost_device *pfdev, int slot)
 	struct panfrost_job *job = pfdev->jobs[slot][0];
 
 	WARN_ON(!job);
+	if (job->is_profiled) {
+		if (job->engine_usage) {
+			job->engine_usage->elapsed_ns[slot] +=
+				ktime_to_ns(ktime_sub(ktime_get(), job->start_time));
+			job->engine_usage->cycles[slot] +=
+				panfrost_cycle_counter_read(pfdev) - job->start_cycles;
+		}
+		panfrost_cycle_counter_put(job->pfdev);
+	}
+
 	pfdev->jobs[slot][0] = pfdev->jobs[slot][1];
 	pfdev->jobs[slot][1] = NULL;
 
@@ -233,6 +243,13 @@ static void panfrost_job_hw_submit(struct panfrost_job *job, int js)
 	subslot = panfrost_enqueue_job(pfdev, js, job);
 	/* Don't queue the job if a reset is in progress */
 	if (!atomic_read(&pfdev->reset.pending)) {
+		if (atomic_read(&pfdev->profile_mode)) {
+			panfrost_cycle_counter_get(pfdev);
+			job->is_profiled = true;
+			job->start_time = ktime_get();
+			job->start_cycles = panfrost_cycle_counter_read(pfdev);
+		}
+
 		job_write(pfdev, JS_COMMAND_NEXT(js), JS_COMMAND_START);
 		dev_dbg(pfdev->dev,
 			"JS: Submitting atom %p to js[%d][%d] with head=0x%llx AS %d",
@@ -660,10 +677,14 @@ panfrost_reset(struct panfrost_device *pfdev,
 	 * stuck jobs. Let's make sure the PM counters stay balanced by
 	 * manually calling pm_runtime_put_noidle() and
 	 * panfrost_devfreq_record_idle() for each stuck job.
+	 * Let's also make sure the cycle counting register's refcnt is
+	 * kept balanced to prevent it from running forever
 	 */
 	spin_lock(&pfdev->js->job_lock);
 	for (i = 0; i < NUM_JOB_SLOTS; i++) {
 		for (j = 0; j < ARRAY_SIZE(pfdev->jobs[0]) && pfdev->jobs[i][j]; j++) {
+			if (pfdev->jobs[i][j]->is_profiled)
+				panfrost_cycle_counter_put(pfdev->jobs[i][j]->pfdev);
 			pm_runtime_put_noidle(pfdev->dev);
 			panfrost_devfreq_record_idle(&pfdev->pfdevfreq);
 		}
@@ -926,6 +947,9 @@ void panfrost_job_close(struct panfrost_file_priv *panfrost_priv)
 			}
 
 			job_write(pfdev, JS_COMMAND(i), cmd);
+
+			/* Jobs can outlive their file context */
+			job->engine_usage = NULL;
 		}
 	}
 	spin_unlock(&pfdev->js->job_lock);
diff --git a/drivers/gpu/drm/panfrost/panfrost_job.h b/drivers/gpu/drm/panfrost/panfrost_job.h
index 8becc1ba0eb9..17ff808dba07 100644
--- a/drivers/gpu/drm/panfrost/panfrost_job.h
+++ b/drivers/gpu/drm/panfrost/panfrost_job.h
@@ -32,6 +32,11 @@ struct panfrost_job {
 
 	/* Fence to be signaled by drm-sched once its done with the job */
 	struct dma_fence *render_done_fence;
+
+	struct panfrost_engine_usage *engine_usage;
+	bool is_profiled;
+	ktime_t start_time;
+	u64 start_cycles;
 };
 
 int panfrost_job_init(struct panfrost_device *pfdev);
-- 
2.42.0



* [PATCH v6 3/6] drm/panfrost: Add fdinfo support for memory stats
  2023-09-19 23:34 [PATCH v6 0/6] Add fdinfo support to Panfrost Adrián Larumbe
  2023-09-19 23:34 ` [PATCH v6 1/6] drm/panfrost: Add cycle count GPU register definitions Adrián Larumbe
  2023-09-19 23:34 ` [PATCH v6 2/6] drm/panfrost: Add fdinfo support GPU load metrics Adrián Larumbe
@ 2023-09-19 23:34 ` Adrián Larumbe
  2023-09-19 23:34 ` [PATCH v6 4/6] drm/drm_file: Add DRM obj's RSS reporting function for fdinfo Adrián Larumbe
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 20+ messages in thread
From: Adrián Larumbe @ 2023-09-19 23:34 UTC (permalink / raw)
  To: maarten.lankhorst, mripard, tzimmermann, airlied, daniel,
	robdclark, quic_abhinavk, dmitry.baryshkov, sean, marijn.suijten,
	robh, steven.price
  Cc: adrian.larumbe, dri-devel, linux-kernel, linux-arm-msm,
	freedreno, healych, kernel, Boris Brezillon

A new DRM GEM object function is added so that drm_show_memory_stats can
provide more accurate memory usage numbers.

Ideally, in panfrost_gem_status, the BO's purgeable flag would be checked
after locking the driver's shrinker mutex, but drm_show_memory_stats holds
the DRM file's object handle database spinlock while iterating over objects,
so there's potential for a race condition here.

Signed-off-by: Adrián Larumbe <adrian.larumbe@collabora.com>
Reviewed-by: Boris Brezillon <boris.brezillon@collabora.com>
Reviewed-by: Steven Price <steven.price@arm.com>
---
 drivers/gpu/drm/panfrost/panfrost_drv.c |  2 ++
 drivers/gpu/drm/panfrost/panfrost_gem.c | 14 ++++++++++++++
 2 files changed, 16 insertions(+)

diff --git a/drivers/gpu/drm/panfrost/panfrost_drv.c b/drivers/gpu/drm/panfrost/panfrost_drv.c
index 3c93a11deab1..8cd9331ac4b8 100644
--- a/drivers/gpu/drm/panfrost/panfrost_drv.c
+++ b/drivers/gpu/drm/panfrost/panfrost_drv.c
@@ -567,6 +567,8 @@ static void panfrost_show_fdinfo(struct drm_printer *p, struct drm_file *file)
 	struct panfrost_device *pfdev = dev->dev_private;
 
 	panfrost_gpu_show_fdinfo(pfdev, file->driver_priv, p);
+
+	drm_show_memory_stats(p, file);
 }
 
 static const struct file_operations panfrost_drm_driver_fops = {
diff --git a/drivers/gpu/drm/panfrost/panfrost_gem.c b/drivers/gpu/drm/panfrost/panfrost_gem.c
index 3c812fbd126f..7d8f83d20539 100644
--- a/drivers/gpu/drm/panfrost/panfrost_gem.c
+++ b/drivers/gpu/drm/panfrost/panfrost_gem.c
@@ -195,6 +195,19 @@ static int panfrost_gem_pin(struct drm_gem_object *obj)
 	return drm_gem_shmem_pin(&bo->base);
 }
 
+static enum drm_gem_object_status panfrost_gem_status(struct drm_gem_object *obj)
+{
+	struct panfrost_gem_object *bo = to_panfrost_bo(obj);
+	enum drm_gem_object_status res = 0;
+
+	res |= (bo->base.madv == PANFROST_MADV_DONTNEED) ?
+		DRM_GEM_OBJECT_PURGEABLE : 0;
+
+	res |= (bo->base.pages) ? DRM_GEM_OBJECT_RESIDENT : 0;
+
+	return res;
+}
+
 static const struct drm_gem_object_funcs panfrost_gem_funcs = {
 	.free = panfrost_gem_free_object,
 	.open = panfrost_gem_open,
@@ -206,6 +219,7 @@ static const struct drm_gem_object_funcs panfrost_gem_funcs = {
 	.vmap = drm_gem_shmem_object_vmap,
 	.vunmap = drm_gem_shmem_object_vunmap,
 	.mmap = drm_gem_shmem_object_mmap,
+	.status = panfrost_gem_status,
 	.vm_ops = &drm_gem_shmem_vm_ops,
 };
 
-- 
2.42.0



* [PATCH v6 4/6] drm/drm_file: Add DRM obj's RSS reporting function for fdinfo
  2023-09-19 23:34 [PATCH v6 0/6] Add fdinfo support to Panfrost Adrián Larumbe
                   ` (2 preceding siblings ...)
  2023-09-19 23:34 ` [PATCH v6 3/6] drm/panfrost: Add fdinfo support for memory stats Adrián Larumbe
@ 2023-09-19 23:34 ` Adrián Larumbe
  2023-09-20 15:53   ` Tvrtko Ursulin
  2023-09-19 23:34 ` [PATCH v6 5/6] drm/panfrost: Implement generic DRM object RSS reporting function Adrián Larumbe
  2023-09-19 23:34 ` [PATCH v6 6/6] drm/drm-file: Show finer-grained BO sizes in drm_show_memory_stats Adrián Larumbe
  5 siblings, 1 reply; 20+ messages in thread
From: Adrián Larumbe @ 2023-09-19 23:34 UTC (permalink / raw)
  To: maarten.lankhorst, mripard, tzimmermann, airlied, daniel,
	robdclark, quic_abhinavk, dmitry.baryshkov, sean, marijn.suijten,
	robh, steven.price
  Cc: adrian.larumbe, dri-devel, linux-kernel, linux-arm-msm,
	freedreno, healych, kernel, Boris Brezillon

Some BOs might be mapped onto physical memory chunk by chunk and on
demand, like Panfrost's tiler heap. In this case, even though the
drm_gem_shmem_object page array might already be allocated, only a very
small fraction of the BO is currently backed by system memory, and yet
drm_show_memory_stats will proceed to add its entire virtual size to the
file's total resident size regardless.

This led to very unrealistic RSS sizes being reckoned for Panfrost, where
said tiler heap buffer is initially allocated with a virtual size of 128
MiB, but only a small part of it will eventually be backed by system memory
after successive GPU page faults.

Provide a new DRM object generic function that would allow drivers to
return a more accurate RSS size for their BOs.

Signed-off-by: Adrián Larumbe <adrian.larumbe@collabora.com>
Reviewed-by: Boris Brezillon <boris.brezillon@collabora.com>
Reviewed-by: Steven Price <steven.price@arm.com>
---
 drivers/gpu/drm/drm_file.c | 5 ++++-
 include/drm/drm_gem.h      | 9 +++++++++
 2 files changed, 13 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/drm_file.c b/drivers/gpu/drm/drm_file.c
index 883d83bc0e3d..762965e3d503 100644
--- a/drivers/gpu/drm/drm_file.c
+++ b/drivers/gpu/drm/drm_file.c
@@ -944,7 +944,10 @@ void drm_show_memory_stats(struct drm_printer *p, struct drm_file *file)
 		}
 
 		if (s & DRM_GEM_OBJECT_RESIDENT) {
-			status.resident += obj->size;
+			if (obj->funcs && obj->funcs->rss)
+				status.resident += obj->funcs->rss(obj);
+			else
+				status.resident += obj->size;
 		} else {
 			/* If already purged or not yet backed by pages, don't
 			 * count it as purgeable:
diff --git a/include/drm/drm_gem.h b/include/drm/drm_gem.h
index bc9f6aa2f3fe..16364487fde9 100644
--- a/include/drm/drm_gem.h
+++ b/include/drm/drm_gem.h
@@ -208,6 +208,15 @@ struct drm_gem_object_funcs {
 	 */
 	enum drm_gem_object_status (*status)(struct drm_gem_object *obj);
 
+	/**
+	 * @rss:
+	 *
+	 * Return resident size of the object in physical memory.
+	 *
+	 * Called by drm_show_memory_stats().
+	 */
+	size_t (*rss)(struct drm_gem_object *obj);
+
 	/**
 	 * @vm_ops:
 	 *
-- 
2.42.0



* [PATCH v6 5/6] drm/panfrost: Implement generic DRM object RSS reporting function
  2023-09-19 23:34 [PATCH v6 0/6] Add fdinfo support to Panfrost Adrián Larumbe
                   ` (3 preceding siblings ...)
  2023-09-19 23:34 ` [PATCH v6 4/6] drm/drm_file: Add DRM obj's RSS reporting function for fdinfo Adrián Larumbe
@ 2023-09-19 23:34 ` Adrián Larumbe
  2023-09-19 23:34 ` [PATCH v6 6/6] drm/drm-file: Show finer-grained BO sizes in drm_show_memory_stats Adrián Larumbe
  5 siblings, 0 replies; 20+ messages in thread
From: Adrián Larumbe @ 2023-09-19 23:34 UTC (permalink / raw)
  To: maarten.lankhorst, mripard, tzimmermann, airlied, daniel,
	robdclark, quic_abhinavk, dmitry.baryshkov, sean, marijn.suijten,
	robh, steven.price
  Cc: adrian.larumbe, dri-devel, linux-kernel, linux-arm-msm,
	freedreno, healych, kernel, Boris Brezillon

A BO's RSS is updated every time new pages are allocated on demand and
mapped for the object in the GPU page fault IRQ handler, but only for heap
buffers. This is unnecessary for non-heap buffers because they are mapped
onto the GPU's VA space and backed by physical memory in their entirety at
BO creation time.

This calculation is unnecessary for imported PRIME objects, since heap
buffers cannot be exported by our driver, and the actual BO RSS size is the
one reported in its attached dmabuf structure.

Signed-off-by: Adrián Larumbe <adrian.larumbe@collabora.com>
Reviewed-by: Boris Brezillon <boris.brezillon@collabora.com>
Reviewed-by: Steven Price <steven.price@arm.com>
---
 drivers/gpu/drm/panfrost/panfrost_gem.c | 15 +++++++++++++++
 drivers/gpu/drm/panfrost/panfrost_gem.h |  5 +++++
 drivers/gpu/drm/panfrost/panfrost_mmu.c |  1 +
 3 files changed, 21 insertions(+)

diff --git a/drivers/gpu/drm/panfrost/panfrost_gem.c b/drivers/gpu/drm/panfrost/panfrost_gem.c
index 7d8f83d20539..4365434b48db 100644
--- a/drivers/gpu/drm/panfrost/panfrost_gem.c
+++ b/drivers/gpu/drm/panfrost/panfrost_gem.c
@@ -208,6 +208,20 @@ static enum drm_gem_object_status panfrost_gem_status(struct drm_gem_object *obj
 	return res;
 }
 
+static size_t panfrost_gem_rss(struct drm_gem_object *obj)
+{
+	struct panfrost_gem_object *bo = to_panfrost_bo(obj);
+
+	if (bo->is_heap) {
+		return bo->heap_rss_size;
+	} else if (bo->base.pages) {
+		WARN_ON(bo->heap_rss_size);
+		return bo->base.base.size;
+	} else {
+		return 0;
+	}
+}
+
 static const struct drm_gem_object_funcs panfrost_gem_funcs = {
 	.free = panfrost_gem_free_object,
 	.open = panfrost_gem_open,
@@ -220,6 +234,7 @@ static const struct drm_gem_object_funcs panfrost_gem_funcs = {
 	.vunmap = drm_gem_shmem_object_vunmap,
 	.mmap = drm_gem_shmem_object_mmap,
 	.status = panfrost_gem_status,
+	.rss = panfrost_gem_rss,
 	.vm_ops = &drm_gem_shmem_vm_ops,
 };
 
diff --git a/drivers/gpu/drm/panfrost/panfrost_gem.h b/drivers/gpu/drm/panfrost/panfrost_gem.h
index ad2877eeeccd..13c0a8149c3a 100644
--- a/drivers/gpu/drm/panfrost/panfrost_gem.h
+++ b/drivers/gpu/drm/panfrost/panfrost_gem.h
@@ -36,6 +36,11 @@ struct panfrost_gem_object {
 	 */
 	atomic_t gpu_usecount;
 
+	/*
+	 * Object chunk size currently mapped onto physical memory
+	 */
+	size_t heap_rss_size;
+
 	bool noexec		:1;
 	bool is_heap		:1;
 };
diff --git a/drivers/gpu/drm/panfrost/panfrost_mmu.c b/drivers/gpu/drm/panfrost/panfrost_mmu.c
index d54d4e7b2195..7b1490cdaa48 100644
--- a/drivers/gpu/drm/panfrost/panfrost_mmu.c
+++ b/drivers/gpu/drm/panfrost/panfrost_mmu.c
@@ -522,6 +522,7 @@ static int panfrost_mmu_map_fault_addr(struct panfrost_device *pfdev, int as,
 		   IOMMU_WRITE | IOMMU_READ | IOMMU_NOEXEC, sgt);
 
 	bomapping->active = true;
+	bo->heap_rss_size += SZ_2M;
 
 	dev_dbg(pfdev->dev, "mapped page fault @ AS%d %llx", as, addr);
 
-- 
2.42.0
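Taken together, the two hunks above mean each GPU page fault maps one chunk and grows the counter that panfrost_gem_rss() later reports. A userspace sketch of that bookkeeping (types and helpers are hypothetical; the 2 MiB granularity assumes Panfrost's per-fault mapping size):

```c
#include <stdbool.h>
#include <stddef.h>

#define SZ_2M ((size_t)2 << 20)

struct fake_bo {
	size_t size;		/* full virtual size, e.g. 128 MiB */
	size_t heap_rss_size;	/* grows one chunk per GPU page fault */
	bool is_heap;
	bool has_pages;
};

/* Models the panfrost_mmu_map_fault_addr() hunk: one chunk per fault. */
static void fault_in_chunk(struct fake_bo *bo)
{
	bo->heap_rss_size += SZ_2M;
}

/* Models panfrost_gem_rss(): heap BOs report only what has faulted in,
 * fully-backed BOs report their whole size, unbacked BOs report zero. */
static size_t bo_rss(const struct fake_bo *bo)
{
	if (bo->is_heap)
		return bo->heap_rss_size;
	if (bo->has_pages)
		return bo->size;
	return 0;
}
```

After two faults a 128 MiB heap BO thus accounts for 4 MiB of RSS instead of its full virtual size.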



* [PATCH v6 6/6] drm/drm-file: Show finer-grained BO sizes in drm_show_memory_stats
  2023-09-19 23:34 [PATCH v6 0/6] Add fdinfo support to Panfrost Adrián Larumbe
                   ` (4 preceding siblings ...)
  2023-09-19 23:34 ` [PATCH v6 5/6] drm/panfrost: Implement generic DRM object RSS reporting function Adrián Larumbe
@ 2023-09-19 23:34 ` Adrián Larumbe
  2023-09-20 15:32   ` Tvrtko Ursulin
  5 siblings, 1 reply; 20+ messages in thread
From: Adrián Larumbe @ 2023-09-19 23:34 UTC (permalink / raw)
  To: maarten.lankhorst, mripard, tzimmermann, airlied, daniel,
	robdclark, quic_abhinavk, dmitry.baryshkov, sean, marijn.suijten,
	robh, steven.price
  Cc: adrian.larumbe, dri-devel, linux-kernel, linux-arm-msm,
	freedreno, healych, kernel, Boris Brezillon

The current implementation will try to pick the highest available size
display unit as soon as the BO size exceeds that of the previous
multiplier. That can lead to loss of precision in contexts of low memory
usage.

The new selection criteria try to preserve precision, whilst also
increasing the display unit selection threshold to render more accurate
values.

Signed-off-by: Adrián Larumbe <adrian.larumbe@collabora.com>
Reviewed-by: Boris Brezillon <boris.brezillon@collabora.com>
Reviewed-by: Steven Price <steven.price@arm.com>
---
 drivers/gpu/drm/drm_file.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/drm_file.c b/drivers/gpu/drm/drm_file.c
index 762965e3d503..34cfa128ffe5 100644
--- a/drivers/gpu/drm/drm_file.c
+++ b/drivers/gpu/drm/drm_file.c
@@ -872,6 +872,8 @@ void drm_send_event(struct drm_device *dev, struct drm_pending_event *e)
 }
 EXPORT_SYMBOL(drm_send_event);
 
+#define UPPER_UNIT_THRESHOLD 100
+
 static void print_size(struct drm_printer *p, const char *stat,
 		       const char *region, u64 sz)
 {
@@ -879,7 +881,8 @@ static void print_size(struct drm_printer *p, const char *stat,
 	unsigned u;
 
 	for (u = 0; u < ARRAY_SIZE(units) - 1; u++) {
-		if (sz < SZ_1K)
+		if ((sz & (SZ_1K - 1)) &&
+		    sz < UPPER_UNIT_THRESHOLD * SZ_1K)
 			break;
 		sz = div_u64(sz, SZ_1K);
 	}
-- 
2.42.0
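The patched print_size() loop above can be modelled in userspace to see where precision is kept and where it is lost (a sketch of the same arithmetic, not the kernel code):

```c
#include <stdint.h>

#define UPPER_UNIT_THRESHOLD 100
#define SZ_1K 1024ull

/* Mirrors the patched loop: only move up to the next unit when the
 * value is 1 KiB-aligned, or once it exceeds ~100x the next multiplier
 * (so large values still collapse, dropping any sub-unit remainder).
 * Returns the unit index (0 = B, 1 = KiB, ...) and scales *sz. */
static unsigned int pick_unit(uint64_t *sz)
{
	static const unsigned int nunits = 4;	/* B, KiB, MiB, GiB */
	unsigned int u;

	for (u = 0; u < nunits - 1; u++) {
		if ((*sz & (SZ_1K - 1)) &&
		    *sz < UPPER_UNIT_THRESHOLD * SZ_1K)
			break;
		*sz /= SZ_1K;
	}
	return u;
}
```

The interesting case is a value just above the threshold: 153601 bytes is not KiB-aligned, but because it exceeds 100 KiB it is still printed as 150 KiB, silently dropping the remainder.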



* Re: [PATCH v6 6/6] drm/drm-file: Show finer-grained BO sizes in drm_show_memory_stats
  2023-09-19 23:34 ` [PATCH v6 6/6] drm/drm-file: Show finer-grained BO sizes in drm_show_memory_stats Adrián Larumbe
@ 2023-09-20 15:32   ` Tvrtko Ursulin
  2023-09-21 10:14     ` Tvrtko Ursulin
  2023-09-22 11:01     ` Adrián Larumbe
  0 siblings, 2 replies; 20+ messages in thread
From: Tvrtko Ursulin @ 2023-09-20 15:32 UTC (permalink / raw)
  To: Adrián Larumbe, maarten.lankhorst, mripard, tzimmermann,
	airlied, daniel, robdclark, quic_abhinavk, dmitry.baryshkov,
	sean, marijn.suijten, robh, steven.price
  Cc: linux-arm-msm, linux-kernel, dri-devel, healych, Boris Brezillon,
	kernel, freedreno


On 20/09/2023 00:34, Adrián Larumbe wrote:
> The current implementation will try to pick the highest available size
> display unit as soon as the BO size exceeds that of the previous
> multiplier. That can lead to loss of precision in contexts of low memory
> usage.
> 
> The new selection criteria try to preserve precision, whilst also
> increasing the display unit selection threshold to render more accurate
> values.
> 
> Signed-off-by: Adrián Larumbe <adrian.larumbe@collabora.com>
> Reviewed-by: Boris Brezillon <boris.brezillon@collabora.com>
> Reviewed-by: Steven Price <steven.price@arm.com>
> ---
>   drivers/gpu/drm/drm_file.c | 5 ++++-
>   1 file changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/drm_file.c b/drivers/gpu/drm/drm_file.c
> index 762965e3d503..34cfa128ffe5 100644
> --- a/drivers/gpu/drm/drm_file.c
> +++ b/drivers/gpu/drm/drm_file.c
> @@ -872,6 +872,8 @@ void drm_send_event(struct drm_device *dev, struct drm_pending_event *e)
>   }
>   EXPORT_SYMBOL(drm_send_event);
>   
> +#define UPPER_UNIT_THRESHOLD 100
> +
>   static void print_size(struct drm_printer *p, const char *stat,
>   		       const char *region, u64 sz)
>   {
> @@ -879,7 +881,8 @@ static void print_size(struct drm_printer *p, const char *stat,
>   	unsigned u;
>   
>   	for (u = 0; u < ARRAY_SIZE(units) - 1; u++) {
> -		if (sz < SZ_1K)
> +		if ((sz & (SZ_1K - 1)) &&

IS_ALIGNED worth it at all?

> +		    sz < UPPER_UNIT_THRESHOLD * SZ_1K)
>   			break;

Excuse me for a late comment (I was away). I did not get what is 
special about a ~10% threshold? Sounds to me just going with the lower 
unit, when size is not aligned to the higher one, would be better than 
sometimes-precision-sometimes-not.

Regards,

Tvrtko

>   		sz = div_u64(sz, SZ_1K);
>   	}


* Re: [PATCH v6 2/6] drm/panfrost: Add fdinfo support GPU load metrics
  2023-09-19 23:34 ` [PATCH v6 2/6] drm/panfrost: Add fdinfo support GPU load metrics Adrián Larumbe
@ 2023-09-20 15:40   ` Tvrtko Ursulin
  2023-09-22 10:57     ` Adrián Larumbe
  0 siblings, 1 reply; 20+ messages in thread
From: Tvrtko Ursulin @ 2023-09-20 15:40 UTC (permalink / raw)
  To: Adrián Larumbe, maarten.lankhorst, mripard, tzimmermann,
	airlied, daniel, robdclark, quic_abhinavk, dmitry.baryshkov,
	sean, marijn.suijten, robh, steven.price
  Cc: linux-arm-msm, linux-kernel, dri-devel, healych, Boris Brezillon,
	kernel, freedreno


On 20/09/2023 00:34, Adrián Larumbe wrote:
> The drm-stats fdinfo tags made available to user space are drm-engine,
> drm-cycles, drm-max-freq and drm-curfreq, one per job slot.
> 
> This deviates from standard practice in other DRM drivers, where a single
> set of key:value pairs is provided for the whole render engine. However,
> Panfrost has separate queues for fragment and vertex/tiler jobs, so a
> decision was made to calculate bus cycles and workload times separately.
> 
> Maximum operating frequency is calculated at devfreq initialisation time.
> Current frequency is made available to user space because nvtop uses it
> when performing engine usage calculations.
> 
> It is important to bear in mind that both the GPU cycle and kernel time
> numbers provided are at best rough estimates, always reported in excess
> of the actual figure, for two reasons:
>   - Excess time because of the delay between the end of a job processing,
>     the subsequent job IRQ and the actual time of the sample.
>   - Time spent in the engine queue waiting for the GPU to pick up the next
>     job.
> 
> To avoid race conditions during enablement/disabling, a reference counting
> mechanism was introduced, and a job flag that tells us whether a given job
> increased the refcount. This is necessary, because user space can toggle
> cycle counting through a debugfs file, and a given job might have been in
> flight by the time cycle counting was disabled.
> 
> The main goal of the debugfs cycle counter knob is letting tools like nvtop
> or IGT's gputop switch it at any time, to avoid power waste in case no
> engine usage measuring is necessary.
> 
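As an aside, the get/put scheme described above — an atomic fast path, with the lock taken only around the 0<->1 transitions that actually start or stop the hardware — can be modelled in userspace C11. Everything here is a stand-in: the spinlock is a plain atomic_flag and the GPU_CMD writes become a boolean.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Userspace model of panfrost_cycle_counter_get()/_put(). */
static atomic_int use_count;
static atomic_flag lock = ATOMIC_FLAG_INIT;
static bool counter_running;	/* stands in for the GPU_CMD writes */

static void cycle_counter_get(void)
{
	int cur = atomic_load(&use_count);

	/* Fast path, like atomic_inc_not_zero(): already running. */
	while (cur != 0)
		if (atomic_compare_exchange_weak(&use_count, &cur, cur + 1))
			return;

	while (atomic_flag_test_and_set(&lock))
		;	/* spin */
	if (atomic_fetch_add(&use_count, 1) == 0)
		counter_running = true;	/* GPU_CMD_CYCLE_COUNT_START */
	atomic_flag_clear(&lock);
}

static void cycle_counter_put(void)
{
	int cur = atomic_load(&use_count);

	/* Fast path, like atomic_add_unless(v, -1, 1): not the last ref. */
	while (cur > 1)
		if (atomic_compare_exchange_weak(&use_count, &cur, cur - 1))
			return;

	while (atomic_flag_test_and_set(&lock))
		;	/* spin */
	if (atomic_fetch_sub(&use_count, 1) == 1)
		counter_running = false;	/* GPU_CMD_CYCLE_COUNT_STOP */
	atomic_flag_clear(&lock);
}
```

Concurrent callers that neither start nor stop the counter never contend on the lock; only the first get and the last put serialize against each other.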
> Signed-off-by: Adrián Larumbe <adrian.larumbe@collabora.com>
> Reviewed-by: Boris Brezillon <boris.brezillon@collabora.com>
> Reviewed-by: Steven Price <steven.price@arm.com>
> ---
>   drivers/gpu/drm/panfrost/Makefile           |  2 +
>   drivers/gpu/drm/panfrost/panfrost_debugfs.c | 20 ++++++++
>   drivers/gpu/drm/panfrost/panfrost_debugfs.h | 13 +++++
>   drivers/gpu/drm/panfrost/panfrost_devfreq.c |  8 +++
>   drivers/gpu/drm/panfrost/panfrost_devfreq.h |  3 ++
>   drivers/gpu/drm/panfrost/panfrost_device.c  |  2 +
>   drivers/gpu/drm/panfrost/panfrost_device.h  | 13 +++++
>   drivers/gpu/drm/panfrost/panfrost_drv.c     | 57 ++++++++++++++++++++-
>   drivers/gpu/drm/panfrost/panfrost_gpu.c     | 41 +++++++++++++++
>   drivers/gpu/drm/panfrost/panfrost_gpu.h     |  4 ++
>   drivers/gpu/drm/panfrost/panfrost_job.c     | 24 +++++++++
>   drivers/gpu/drm/panfrost/panfrost_job.h     |  5 ++
>   12 files changed, 191 insertions(+), 1 deletion(-)
>   create mode 100644 drivers/gpu/drm/panfrost/panfrost_debugfs.c
>   create mode 100644 drivers/gpu/drm/panfrost/panfrost_debugfs.h
> 
> diff --git a/drivers/gpu/drm/panfrost/Makefile b/drivers/gpu/drm/panfrost/Makefile
> index 7da2b3f02ed9..2c01c1e7523e 100644
> --- a/drivers/gpu/drm/panfrost/Makefile
> +++ b/drivers/gpu/drm/panfrost/Makefile
> @@ -12,4 +12,6 @@ panfrost-y := \
>   	panfrost_perfcnt.o \
>   	panfrost_dump.o
>   
> +panfrost-$(CONFIG_DEBUG_FS) += panfrost_debugfs.o
> +
>   obj-$(CONFIG_DRM_PANFROST) += panfrost.o
> diff --git a/drivers/gpu/drm/panfrost/panfrost_debugfs.c b/drivers/gpu/drm/panfrost/panfrost_debugfs.c
> new file mode 100644
> index 000000000000..cc14eccba206
> --- /dev/null
> +++ b/drivers/gpu/drm/panfrost/panfrost_debugfs.c
> @@ -0,0 +1,20 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/* Copyright 2023 Collabora ltd. */
> +
> +#include <linux/debugfs.h>
> +#include <linux/platform_device.h>
> +#include <drm/drm_debugfs.h>
> +#include <drm/drm_file.h>
> +#include <drm/panfrost_drm.h>
> +
> +#include "panfrost_device.h"
> +#include "panfrost_gpu.h"
> +#include "panfrost_debugfs.h"
> +
> +void panfrost_debugfs_init(struct drm_minor *minor)
> +{
> +	struct drm_device *dev = minor->dev;
> +	struct panfrost_device *pfdev = platform_get_drvdata(to_platform_device(dev->dev));
> +
> +	debugfs_create_atomic_t("profile", 0600, minor->debugfs_root, &pfdev->profile_mode);
> +}
> diff --git a/drivers/gpu/drm/panfrost/panfrost_debugfs.h b/drivers/gpu/drm/panfrost/panfrost_debugfs.h
> new file mode 100644
> index 000000000000..db1c158bcf2f
> --- /dev/null
> +++ b/drivers/gpu/drm/panfrost/panfrost_debugfs.h
> @@ -0,0 +1,13 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/*
> + * Copyright 2023 Collabora ltd.
> + */
> +
> +#ifndef PANFROST_DEBUGFS_H
> +#define PANFROST_DEBUGFS_H
> +
> +#ifdef CONFIG_DEBUG_FS
> +void panfrost_debugfs_init(struct drm_minor *minor);
> +#endif
> +
> +#endif  /* PANFROST_DEBUGFS_H */
> diff --git a/drivers/gpu/drm/panfrost/panfrost_devfreq.c b/drivers/gpu/drm/panfrost/panfrost_devfreq.c
> index 58dfb15a8757..28caffc689e2 100644
> --- a/drivers/gpu/drm/panfrost/panfrost_devfreq.c
> +++ b/drivers/gpu/drm/panfrost/panfrost_devfreq.c
> @@ -58,6 +58,7 @@ static int panfrost_devfreq_get_dev_status(struct device *dev,
>   	spin_lock_irqsave(&pfdevfreq->lock, irqflags);
>   
>   	panfrost_devfreq_update_utilization(pfdevfreq);
> +	pfdevfreq->current_frequency = status->current_frequency;
>   
>   	status->total_time = ktime_to_ns(ktime_add(pfdevfreq->busy_time,
>   						   pfdevfreq->idle_time));
> @@ -117,6 +118,7 @@ int panfrost_devfreq_init(struct panfrost_device *pfdev)
>   	struct devfreq *devfreq;
>   	struct thermal_cooling_device *cooling;
>   	struct panfrost_devfreq *pfdevfreq = &pfdev->pfdevfreq;
> +	unsigned long freq = ULONG_MAX;
>   
>   	if (pfdev->comp->num_supplies > 1) {
>   		/*
> @@ -172,6 +174,12 @@ int panfrost_devfreq_init(struct panfrost_device *pfdev)
>   		return ret;
>   	}
>   
> +	/* Find the fastest defined rate  */
> +	opp = dev_pm_opp_find_freq_floor(dev, &freq);
> +	if (IS_ERR(opp))
> +		return PTR_ERR(opp);
> +	pfdevfreq->fast_rate = freq;
> +
>   	dev_pm_opp_put(opp);
>   
>   	/*
> diff --git a/drivers/gpu/drm/panfrost/panfrost_devfreq.h b/drivers/gpu/drm/panfrost/panfrost_devfreq.h
> index 1514c1f9d91c..48dbe185f206 100644
> --- a/drivers/gpu/drm/panfrost/panfrost_devfreq.h
> +++ b/drivers/gpu/drm/panfrost/panfrost_devfreq.h
> @@ -19,6 +19,9 @@ struct panfrost_devfreq {
>   	struct devfreq_simple_ondemand_data gov_data;
>   	bool opp_of_table_added;
>   
> +	unsigned long current_frequency;
> +	unsigned long fast_rate;
> +
>   	ktime_t busy_time;
>   	ktime_t idle_time;
>   	ktime_t time_last_update;
> diff --git a/drivers/gpu/drm/panfrost/panfrost_device.c b/drivers/gpu/drm/panfrost/panfrost_device.c
> index fa1a086a862b..28f7046e1b1a 100644
> --- a/drivers/gpu/drm/panfrost/panfrost_device.c
> +++ b/drivers/gpu/drm/panfrost/panfrost_device.c
> @@ -207,6 +207,8 @@ int panfrost_device_init(struct panfrost_device *pfdev)
>   
>   	spin_lock_init(&pfdev->as_lock);
>   
> +	spin_lock_init(&pfdev->cycle_counter.lock);
> +
>   	err = panfrost_clk_init(pfdev);
>   	if (err) {
>   		dev_err(pfdev->dev, "clk init failed %d\n", err);
> diff --git a/drivers/gpu/drm/panfrost/panfrost_device.h b/drivers/gpu/drm/panfrost/panfrost_device.h
> index b0126b9fbadc..1e85656dc2f7 100644
> --- a/drivers/gpu/drm/panfrost/panfrost_device.h
> +++ b/drivers/gpu/drm/panfrost/panfrost_device.h
> @@ -107,6 +107,7 @@ struct panfrost_device {
>   	struct list_head scheduled_jobs;
>   
>   	struct panfrost_perfcnt *perfcnt;
> +	atomic_t profile_mode;
>   
>   	struct mutex sched_lock;
>   
> @@ -121,6 +122,11 @@ struct panfrost_device {
>   	struct shrinker shrinker;
>   
>   	struct panfrost_devfreq pfdevfreq;
> +
> +	struct {
> +		atomic_t use_count;
> +		spinlock_t lock;
> +	} cycle_counter;
>   };
>   
>   struct panfrost_mmu {
> @@ -135,12 +141,19 @@ struct panfrost_mmu {
>   	struct list_head list;
>   };
>   
> +struct panfrost_engine_usage {
> +	unsigned long long elapsed_ns[NUM_JOB_SLOTS];
> +	unsigned long long cycles[NUM_JOB_SLOTS];
> +};
> +
>   struct panfrost_file_priv {
>   	struct panfrost_device *pfdev;
>   
>   	struct drm_sched_entity sched_entity[NUM_JOB_SLOTS];
>   
>   	struct panfrost_mmu *mmu;
> +
> +	struct panfrost_engine_usage engine_usage;
>   };
>   
>   static inline struct panfrost_device *to_panfrost_device(struct drm_device *ddev)
> diff --git a/drivers/gpu/drm/panfrost/panfrost_drv.c b/drivers/gpu/drm/panfrost/panfrost_drv.c
> index a2ab99698ca8..3c93a11deab1 100644
> --- a/drivers/gpu/drm/panfrost/panfrost_drv.c
> +++ b/drivers/gpu/drm/panfrost/panfrost_drv.c
> @@ -20,6 +20,7 @@
>   #include "panfrost_job.h"
>   #include "panfrost_gpu.h"
>   #include "panfrost_perfcnt.h"
> +#include "panfrost_debugfs.h"
>   
>   static bool unstable_ioctls;
>   module_param_unsafe(unstable_ioctls, bool, 0600);
> @@ -267,6 +268,7 @@ static int panfrost_ioctl_submit(struct drm_device *dev, void *data,
>   	job->requirements = args->requirements;
>   	job->flush_id = panfrost_gpu_get_latest_flush_id(pfdev);
>   	job->mmu = file_priv->mmu;
> +	job->engine_usage = &file_priv->engine_usage;
>   
>   	slot = panfrost_job_get_slot(job);
>   
> @@ -523,7 +525,55 @@ static const struct drm_ioctl_desc panfrost_drm_driver_ioctls[] = {
>   	PANFROST_IOCTL(MADVISE,		madvise,	DRM_RENDER_ALLOW),
>   };
>   
> -DEFINE_DRM_GEM_FOPS(panfrost_drm_driver_fops);
> +
> +static void panfrost_gpu_show_fdinfo(struct panfrost_device *pfdev,
> +				     struct panfrost_file_priv *panfrost_priv,
> +				     struct drm_printer *p)
> +{
> +	int i;
> +
> +	/*
> +	 * IMPORTANT NOTE: drm-cycles and drm-engine measurements are not
> +	 * accurate, as they only provide a rough estimation of the number of
> +	 * GPU cycles and CPU time spent in a given context. This is due to two
> +	 * different factors:
> +	 * - Firstly, we must consider the time the CPU and then the kernel
> +	 *   takes to process the GPU interrupt, which means additional time and
> +	 *   GPU cycles will be added in excess to the real figure.
> +	 * - Secondly, the pipelining done by the Job Manager (2 job slots per
> +	 *   engine) implies there is no way to know exactly how much time each
> +	 *   job spent on the GPU.
> +	 */
> +
> +	static const char * const engine_names[] = {
> +		"fragment", "vertex-tiler", "compute-only"
> +	};
> +
> +	for (i = 0; i < NUM_JOB_SLOTS - 1; i++) {

FWIW you could future-proof this a bit by using "i < 
ARRAY_SIZE(engine_names)" and avoid possible silent out-of-bounds reads 
if someone updates NUM_JOB_SLOTS and forgets about this loop. Or stick a 
warning of some sort.

> +		drm_printf(p, "drm-engine-%s:\t%llu ns\n",
> +			   engine_names[i], panfrost_priv->engine_usage.elapsed_ns[i]);
> +		drm_printf(p, "drm-cycles-%s:\t%llu\n",
> +			   engine_names[i], panfrost_priv->engine_usage.cycles[i]);
> +		drm_printf(p, "drm-maxfreq-%s:\t%lu Hz\n",
> +			   engine_names[i], pfdev->pfdevfreq.fast_rate);
> +		drm_printf(p, "drm-curfreq-%s:\t%lu Hz\n",
> +			   engine_names[i], pfdev->pfdevfreq.current_frequency);

I envisaged a link to driver-specific docs at the bottom of 
drm-usage-stats.rst, so it would be nice if drivers added those 
sections describing their private keys, engine names etc. ;)

Regards,

Tvrtko

> +	}
> +}
> +
> +static void panfrost_show_fdinfo(struct drm_printer *p, struct drm_file *file)
> +{
> +	struct drm_device *dev = file->minor->dev;
> +	struct panfrost_device *pfdev = dev->dev_private;
> +
> +	panfrost_gpu_show_fdinfo(pfdev, file->driver_priv, p);
> +}
> +
> +static const struct file_operations panfrost_drm_driver_fops = {
> +	.owner = THIS_MODULE,
> +	DRM_GEM_FOPS,
> +	.show_fdinfo = drm_show_fdinfo,
> +};
>   
>   /*
>    * Panfrost driver version:
> @@ -535,6 +585,7 @@ static const struct drm_driver panfrost_drm_driver = {
>   	.driver_features	= DRIVER_RENDER | DRIVER_GEM | DRIVER_SYNCOBJ,
>   	.open			= panfrost_open,
>   	.postclose		= panfrost_postclose,
> +	.show_fdinfo		= panfrost_show_fdinfo,
>   	.ioctls			= panfrost_drm_driver_ioctls,
>   	.num_ioctls		= ARRAY_SIZE(panfrost_drm_driver_ioctls),
>   	.fops			= &panfrost_drm_driver_fops,
> @@ -546,6 +597,10 @@ static const struct drm_driver panfrost_drm_driver = {
>   
>   	.gem_create_object	= panfrost_gem_create_object,
>   	.gem_prime_import_sg_table = panfrost_gem_prime_import_sg_table,
> +
> +#ifdef CONFIG_DEBUG_FS
> +	.debugfs_init		= panfrost_debugfs_init,
> +#endif
>   };
>   
>   static int panfrost_probe(struct platform_device *pdev)
> diff --git a/drivers/gpu/drm/panfrost/panfrost_gpu.c b/drivers/gpu/drm/panfrost/panfrost_gpu.c
> index 2faa344d89ee..f0be7e19b13e 100644
> --- a/drivers/gpu/drm/panfrost/panfrost_gpu.c
> +++ b/drivers/gpu/drm/panfrost/panfrost_gpu.c
> @@ -73,6 +73,13 @@ int panfrost_gpu_soft_reset(struct panfrost_device *pfdev)
>   	gpu_write(pfdev, GPU_INT_CLEAR, GPU_IRQ_MASK_ALL);
>   	gpu_write(pfdev, GPU_INT_MASK, GPU_IRQ_MASK_ALL);
>   
> +	/*
> +	 * All in-flight jobs should have released their cycle
> +	 * counter references upon reset, but let us make sure
> +	 */
> +	if (drm_WARN_ON(pfdev->ddev, atomic_read(&pfdev->cycle_counter.use_count) != 0))
> +		atomic_set(&pfdev->cycle_counter.use_count, 0);
> +
>   	return 0;
>   }
>   
> @@ -321,6 +328,40 @@ static void panfrost_gpu_init_features(struct panfrost_device *pfdev)
>   		 pfdev->features.shader_present, pfdev->features.l2_present);
>   }
>   
> +void panfrost_cycle_counter_get(struct panfrost_device *pfdev)
> +{
> +	if (atomic_inc_not_zero(&pfdev->cycle_counter.use_count))
> +		return;
> +
> +	spin_lock(&pfdev->cycle_counter.lock);
> +	if (atomic_inc_return(&pfdev->cycle_counter.use_count) == 1)
> +		gpu_write(pfdev, GPU_CMD, GPU_CMD_CYCLE_COUNT_START);
> +	spin_unlock(&pfdev->cycle_counter.lock);
> +}
> +
> +void panfrost_cycle_counter_put(struct panfrost_device *pfdev)
> +{
> +	if (atomic_add_unless(&pfdev->cycle_counter.use_count, -1, 1))
> +		return;
> +
> +	spin_lock(&pfdev->cycle_counter.lock);
> +	if (atomic_dec_return(&pfdev->cycle_counter.use_count) == 0)
> +		gpu_write(pfdev, GPU_CMD, GPU_CMD_CYCLE_COUNT_STOP);
> +	spin_unlock(&pfdev->cycle_counter.lock);
> +}
> +
> +unsigned long long panfrost_cycle_counter_read(struct panfrost_device *pfdev)
> +{
> +	u32 hi, lo;
> +
> +	do {
> +		hi = gpu_read(pfdev, GPU_CYCLE_COUNT_HI);
> +		lo = gpu_read(pfdev, GPU_CYCLE_COUNT_LO);
> +	} while (hi != gpu_read(pfdev, GPU_CYCLE_COUNT_HI));
> +
> +	return ((u64)hi << 32) | lo;
> +}
> +
>   void panfrost_gpu_power_on(struct panfrost_device *pfdev)
>   {
>   	int ret;
> diff --git a/drivers/gpu/drm/panfrost/panfrost_gpu.h b/drivers/gpu/drm/panfrost/panfrost_gpu.h
> index 468c51e7e46d..876fdad9f721 100644
> --- a/drivers/gpu/drm/panfrost/panfrost_gpu.h
> +++ b/drivers/gpu/drm/panfrost/panfrost_gpu.h
> @@ -16,6 +16,10 @@ int panfrost_gpu_soft_reset(struct panfrost_device *pfdev);
>   void panfrost_gpu_power_on(struct panfrost_device *pfdev);
>   void panfrost_gpu_power_off(struct panfrost_device *pfdev);
>   
> +void panfrost_cycle_counter_get(struct panfrost_device *pfdev);
> +void panfrost_cycle_counter_put(struct panfrost_device *pfdev);
> +unsigned long long panfrost_cycle_counter_read(struct panfrost_device *pfdev);
> +
>   void panfrost_gpu_amlogic_quirk(struct panfrost_device *pfdev);
>   
>   #endif
> diff --git a/drivers/gpu/drm/panfrost/panfrost_job.c b/drivers/gpu/drm/panfrost/panfrost_job.c
> index 033f5e684707..fb16de2d0420 100644
> --- a/drivers/gpu/drm/panfrost/panfrost_job.c
> +++ b/drivers/gpu/drm/panfrost/panfrost_job.c
> @@ -159,6 +159,16 @@ panfrost_dequeue_job(struct panfrost_device *pfdev, int slot)
>   	struct panfrost_job *job = pfdev->jobs[slot][0];
>   
>   	WARN_ON(!job);
> +	if (job->is_profiled) {
> +		if (job->engine_usage) {
> +			job->engine_usage->elapsed_ns[slot] +=
> +				ktime_to_ns(ktime_sub(ktime_get(), job->start_time));
> +			job->engine_usage->cycles[slot] +=
> +				panfrost_cycle_counter_read(pfdev) - job->start_cycles;
> +		}
> +		panfrost_cycle_counter_put(job->pfdev);
> +	}
> +
>   	pfdev->jobs[slot][0] = pfdev->jobs[slot][1];
>   	pfdev->jobs[slot][1] = NULL;
>   
> @@ -233,6 +243,13 @@ static void panfrost_job_hw_submit(struct panfrost_job *job, int js)
>   	subslot = panfrost_enqueue_job(pfdev, js, job);
>   	/* Don't queue the job if a reset is in progress */
>   	if (!atomic_read(&pfdev->reset.pending)) {
> +		if (atomic_read(&pfdev->profile_mode)) {
> +			panfrost_cycle_counter_get(pfdev);
> +			job->is_profiled = true;
> +			job->start_time = ktime_get();
> +			job->start_cycles = panfrost_cycle_counter_read(pfdev);
> +		}
> +
>   		job_write(pfdev, JS_COMMAND_NEXT(js), JS_COMMAND_START);
>   		dev_dbg(pfdev->dev,
>   			"JS: Submitting atom %p to js[%d][%d] with head=0x%llx AS %d",
> @@ -660,10 +677,14 @@ panfrost_reset(struct panfrost_device *pfdev,
>   	 * stuck jobs. Let's make sure the PM counters stay balanced by
>   	 * manually calling pm_runtime_put_noidle() and
>   	 * panfrost_devfreq_record_idle() for each stuck job.
> +	 * Let's also make sure the cycle counting register's refcnt is
> +	 * kept balanced to prevent it from running forever
>   	 */
>   	spin_lock(&pfdev->js->job_lock);
>   	for (i = 0; i < NUM_JOB_SLOTS; i++) {
>   		for (j = 0; j < ARRAY_SIZE(pfdev->jobs[0]) && pfdev->jobs[i][j]; j++) {
> +			if (pfdev->jobs[i][j]->is_profiled)
> +				panfrost_cycle_counter_put(pfdev->jobs[i][j]->pfdev);
>   			pm_runtime_put_noidle(pfdev->dev);
>   			panfrost_devfreq_record_idle(&pfdev->pfdevfreq);
>   		}
> @@ -926,6 +947,9 @@ void panfrost_job_close(struct panfrost_file_priv *panfrost_priv)
>   			}
>   
>   			job_write(pfdev, JS_COMMAND(i), cmd);
> +
> +			/* Jobs can outlive their file context */
> +			job->engine_usage = NULL;
>   		}
>   	}
>   	spin_unlock(&pfdev->js->job_lock);
> diff --git a/drivers/gpu/drm/panfrost/panfrost_job.h b/drivers/gpu/drm/panfrost/panfrost_job.h
> index 8becc1ba0eb9..17ff808dba07 100644
> --- a/drivers/gpu/drm/panfrost/panfrost_job.h
> +++ b/drivers/gpu/drm/panfrost/panfrost_job.h
> @@ -32,6 +32,11 @@ struct panfrost_job {
>   
>   	/* Fence to be signaled by drm-sched once its done with the job */
>   	struct dma_fence *render_done_fence;
> +
> +	struct panfrost_engine_usage *engine_usage;
> +	bool is_profiled;
> +	ktime_t start_time;
> +	u64 start_cycles;
>   };
>   
>   int panfrost_job_init(struct panfrost_device *pfdev);
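For what it's worth, the hi/lo loop in panfrost_cycle_counter_read() quoted above is the standard torn-read guard for a 64-bit counter exposed as two 32-bit registers: if the counter carries from LO into HI between the two reads, the HI re-read differs and the sample is retried. A userspace model, with the MMIO reads replaced by a fake counter:

```c
#include <stdint.h>

static uint64_t fake_counter;	/* stands in for the GPU cycle counter */

static uint32_t read_lo(void) { return (uint32_t)fake_counter; }
static uint32_t read_hi(void) { return (uint32_t)(fake_counter >> 32); }

/* Mirrors panfrost_cycle_counter_read(): retry until HI is stable
 * across the LO read, so a torn 64-bit value is never returned. */
static uint64_t counter_read(void)
{
	uint32_t hi, lo;

	do {
		hi = read_hi();
		lo = read_lo();
	} while (hi != read_hi());

	return ((uint64_t)hi << 32) | lo;
}
```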


* Re: [PATCH v6 4/6] drm/drm_file: Add DRM obj's RSS reporting function for fdinfo
  2023-09-19 23:34 ` [PATCH v6 4/6] drm/drm_file: Add DRM obj's RSS reporting function for fdinfo Adrián Larumbe
@ 2023-09-20 15:53   ` Tvrtko Ursulin
  2023-09-22 10:58     ` Adrián Larumbe
  0 siblings, 1 reply; 20+ messages in thread
From: Tvrtko Ursulin @ 2023-09-20 15:53 UTC (permalink / raw)
  To: Adrián Larumbe, maarten.lankhorst, mripard, tzimmermann,
	airlied, daniel, robdclark, quic_abhinavk, dmitry.baryshkov,
	sean, marijn.suijten, robh, steven.price
  Cc: linux-arm-msm, linux-kernel, dri-devel, healych, Boris Brezillon,
	kernel, freedreno


On 20/09/2023 00:34, Adrián Larumbe wrote:
> Some BO's might be mapped onto physical memory chunkwise and on demand,
> like Panfrost's tiler heap. In this case, even though the
> drm_gem_shmem_object page array might already be allocated, only a very
> small fraction of the BO is currently backed by system memory, but
> drm_show_memory_stats will then proceed to add its entire virtual size to
> the file's total resident size regardless.
> 
> This led to very unrealistic RSS sizes being reckoned for Panfrost, where
> said tiler heap buffer is initially allocated with a virtual size of 128
> MiB, but only a small part of it will eventually be backed by system memory
> after successive GPU page faults.
> 
> Provide a new DRM object generic function that would allow drivers to
> return a more accurate RSS size for their BOs.
> 
> Signed-off-by: Adrián Larumbe <adrian.larumbe@collabora.com>
> Reviewed-by: Boris Brezillon <boris.brezillon@collabora.com>
> Reviewed-by: Steven Price <steven.price@arm.com>
> ---
>   drivers/gpu/drm/drm_file.c | 5 ++++-
>   include/drm/drm_gem.h      | 9 +++++++++
>   2 files changed, 13 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/drm_file.c b/drivers/gpu/drm/drm_file.c
> index 883d83bc0e3d..762965e3d503 100644
> --- a/drivers/gpu/drm/drm_file.c
> +++ b/drivers/gpu/drm/drm_file.c
> @@ -944,7 +944,10 @@ void drm_show_memory_stats(struct drm_printer *p, struct drm_file *file)
>   		}
>   
>   		if (s & DRM_GEM_OBJECT_RESIDENT) {
> -			status.resident += obj->size;
> +			if (obj->funcs && obj->funcs->rss)
> +				status.resident += obj->funcs->rss(obj);
> +			else
> +				status.resident += obj->size;

Presumably you'd want the same smaller size in both active and 
purgeable? Or you can end up with more in those two than in rss which 
would look odd.

Also, an alternative to adding a new callback could be adding multiple 
output parameters to the existing obj->funcs->status(), which maybe ends 
up simpler due to fewer callbacks?

Like:

  s = obj->funcs->status(obj, &supported_status, &rss)

And adjust the code flow to pick up the rss if driver signaled it 
supports reporting it.

Regards,

Tvrtko

>   		} else {
>   			/* If already purged or not yet backed by pages, don't
>   			 * count it as purgeable:
> diff --git a/include/drm/drm_gem.h b/include/drm/drm_gem.h
> index bc9f6aa2f3fe..16364487fde9 100644
> --- a/include/drm/drm_gem.h
> +++ b/include/drm/drm_gem.h
> @@ -208,6 +208,15 @@ struct drm_gem_object_funcs {
>   	 */
>   	enum drm_gem_object_status (*status)(struct drm_gem_object *obj);
>   
> +	/**
> +	 * @rss:
> +	 *
> +	 * Return resident size of the object in physical memory.
> +	 *
> +	 * Called by drm_show_memory_stats().
> +	 */
> +	size_t (*rss)(struct drm_gem_object *obj);
> +
>   	/**
>   	 * @vm_ops:
>   	 *
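The single-callback alternative suggested above could look roughly like this. Everything here is a hypothetical sketch of the idea — the flag names, signature, and helper are invented for illustration and are not an existing DRM interface:

```c
#include <stddef.h>

struct gem_obj;

enum gem_status_flags {
	GEM_STATUS_RESIDENT = 1 << 0,
	GEM_STATUS_HAS_RSS  = 1 << 1,	/* driver filled in *rss */
};

/* Hypothetical single-callback variant: status flags, supported flags
 * and RSS come back from one call instead of two vtable entries. */
struct gem_funcs {
	unsigned int (*status)(const struct gem_obj *obj,
			       unsigned int *supported, size_t *rss);
};

struct gem_obj {
	const struct gem_funcs *funcs;
	size_t size;
};

static unsigned int heap_status(const struct gem_obj *obj,
				unsigned int *supported, size_t *rss)
{
	(void)obj;
	*supported = GEM_STATUS_HAS_RSS;
	*rss = (size_t)2 << 20;	/* pretend 2 MiB faulted in */
	return GEM_STATUS_RESIDENT;
}

static const struct gem_funcs heap_funcs = { .status = heap_status };

static size_t account_resident(const struct gem_obj *obj)
{
	unsigned int supported = 0;
	size_t rss = 0;

	if (obj->funcs && obj->funcs->status)
		obj->funcs->status(obj, &supported, &rss);

	/* Fall back to the full size when the driver cannot report RSS. */
	return (supported & GEM_STATUS_HAS_RSS) ? rss : obj->size;
}
```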


* Re: [PATCH v6 6/6] drm/drm-file: Show finer-grained BO sizes in drm_show_memory_stats
  2023-09-20 15:32   ` Tvrtko Ursulin
@ 2023-09-21 10:14     ` Tvrtko Ursulin
  2023-09-22 11:03       ` Adrián Larumbe
  2023-09-22 11:01     ` Adrián Larumbe
  1 sibling, 1 reply; 20+ messages in thread
From: Tvrtko Ursulin @ 2023-09-21 10:14 UTC (permalink / raw)
  To: Adrián Larumbe, maarten.lankhorst, mripard, tzimmermann,
	airlied, daniel, robdclark, quic_abhinavk, dmitry.baryshkov,
	sean, marijn.suijten, robh, steven.price
  Cc: linux-arm-msm, linux-kernel, dri-devel, healych, Boris Brezillon,
	kernel, freedreno


On 20/09/2023 16:32, Tvrtko Ursulin wrote:
> 
> On 20/09/2023 00:34, Adrián Larumbe wrote:
>> The current implementation will try to pick the highest available size
>> display unit as soon as the BO size exceeds that of the previous
>> multiplier. That can lead to loss of precision in contexts of low memory
>> usage.
>>
>> The new selection criteria try to preserve precision, whilst also
>> increasing the display unit selection threshold to render more accurate
>> values.
>>
>> Signed-off-by: Adrián Larumbe <adrian.larumbe@collabora.com>
>> Reviewed-by: Boris Brezillon <boris.brezillon@collabora.com>
>> Reviewed-by: Steven Price <steven.price@arm.com>
>> ---
>>   drivers/gpu/drm/drm_file.c | 5 ++++-
>>   1 file changed, 4 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/gpu/drm/drm_file.c b/drivers/gpu/drm/drm_file.c
>> index 762965e3d503..34cfa128ffe5 100644
>> --- a/drivers/gpu/drm/drm_file.c
>> +++ b/drivers/gpu/drm/drm_file.c
>> @@ -872,6 +872,8 @@ void drm_send_event(struct drm_device *dev, struct drm_pending_event *e)
>>   }
>>   EXPORT_SYMBOL(drm_send_event);
>> +#define UPPER_UNIT_THRESHOLD 100
>> +
>>   static void print_size(struct drm_printer *p, const char *stat,
>>                  const char *region, u64 sz)
>>   {
>> @@ -879,7 +881,8 @@ static void print_size(struct drm_printer *p, const char *stat,
>>       unsigned u;
>>       for (u = 0; u < ARRAY_SIZE(units) - 1; u++) {
>> -        if (sz < SZ_1K)
>> +        if ((sz & (SZ_1K - 1)) &&
> 
> IS_ALIGNED worth it at all?
> 
>> +            sz < UPPER_UNIT_THRESHOLD * SZ_1K)
>>               break;
> 
> Excuse me for a late comment (I was away). I did not get what is 
> special about a ~10% threshold? Sounds to me like just going with the 
> lower unit, when the size is not aligned to the higher one, would be 
> better than sometimes-precision-sometimes-not.

FWIW both current and the threshold option make testing the feature very 
annoying.

So I'd really propose we simply use smaller unit when unaligned.

Regards,

Tvrtko

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v6 2/6] drm/panfrost: Add fdinfo support GPU load metrics
  2023-09-20 15:40   ` Tvrtko Ursulin
@ 2023-09-22 10:57     ` Adrián Larumbe
  2023-09-22 13:53       ` Tvrtko Ursulin
  0 siblings, 1 reply; 20+ messages in thread
From: Adrián Larumbe @ 2023-09-22 10:57 UTC (permalink / raw)
  To: Tvrtko Ursulin
  Cc: maarten.lankhorst, mripard, tzimmermann, airlied, daniel,
	robdclark, quic_abhinavk, dmitry.baryshkov, sean, marijn.suijten,
	robh, steven.price, linux-arm-msm, linux-kernel, dri-devel,
	healych, Boris Brezillon, kernel, freedreno

On 20.09.2023 16:40, Tvrtko Ursulin wrote:
>On 20/09/2023 00:34, Adrián Larumbe wrote:
>> The drm-stats fdinfo tags made available to user space are drm-engine,
>> drm-cycles, drm-max-freq and drm-curfreq, one per job slot.
>> 
>> This deviates from standard practice in other DRM drivers, where a single
>> set of key:value pairs is provided for the whole render engine. However,
>> Panfrost has separate queues for fragment and vertex/tiler jobs, so a
>> decision was made to calculate bus cycles and workload times separately.
>> 
>> Maximum operating frequency is calculated at devfreq initialisation time.
>> Current frequency is made available to user space because nvtop uses it
>> when performing engine usage calculations.
>> 
>> It is important to bear in mind that both GPU cycle and kernel time numbers
>> provided are at best rough estimations, and always reported in excess from
>> the actual figure because of two reasons:
>>   - Excess time because of the delay between the end of a job processing,
>>     the subsequent job IRQ and the actual time of the sample.
>>   - Time spent in the engine queue waiting for the GPU to pick up the next
>>     job.
>> 
>> To avoid race conditions during enablement/disabling, a reference counting
>> mechanism was introduced, and a job flag that tells us whether a given job
>> increased the refcount. This is necessary, because user space can toggle
>> cycle counting through a debugfs file, and a given job might have been in
>> flight by the time cycle counting was disabled.
>> 
>> The main goal of the debugfs cycle counter knob is letting tools like nvtop
>> or IGT's gputop switch it at any time, to avoid power waste in case no
>> engine usage measuring is necessary.
>> 
>> Signed-off-by: Adrián Larumbe <adrian.larumbe@collabora.com>
>> Reviewed-by: Boris Brezillon <boris.brezillon@collabora.com>
>> Reviewed-by: Steven Price <steven.price@arm.com>
>> ---
>>   drivers/gpu/drm/panfrost/Makefile           |  2 +
>>   drivers/gpu/drm/panfrost/panfrost_debugfs.c | 20 ++++++++
>>   drivers/gpu/drm/panfrost/panfrost_debugfs.h | 13 +++++
>>   drivers/gpu/drm/panfrost/panfrost_devfreq.c |  8 +++
>>   drivers/gpu/drm/panfrost/panfrost_devfreq.h |  3 ++
>>   drivers/gpu/drm/panfrost/panfrost_device.c  |  2 +
>>   drivers/gpu/drm/panfrost/panfrost_device.h  | 13 +++++
>>   drivers/gpu/drm/panfrost/panfrost_drv.c     | 57 ++++++++++++++++++++-
>>   drivers/gpu/drm/panfrost/panfrost_gpu.c     | 41 +++++++++++++++
>>   drivers/gpu/drm/panfrost/panfrost_gpu.h     |  4 ++
>>   drivers/gpu/drm/panfrost/panfrost_job.c     | 24 +++++++++
>>   drivers/gpu/drm/panfrost/panfrost_job.h     |  5 ++
>>   12 files changed, 191 insertions(+), 1 deletion(-)
>>   create mode 100644 drivers/gpu/drm/panfrost/panfrost_debugfs.c
>>   create mode 100644 drivers/gpu/drm/panfrost/panfrost_debugfs.h
>> 
>> diff --git a/drivers/gpu/drm/panfrost/Makefile b/drivers/gpu/drm/panfrost/Makefile
>> index 7da2b3f02ed9..2c01c1e7523e 100644
>> --- a/drivers/gpu/drm/panfrost/Makefile
>> +++ b/drivers/gpu/drm/panfrost/Makefile
>> @@ -12,4 +12,6 @@ panfrost-y := \
>>   	panfrost_perfcnt.o \
>>   	panfrost_dump.o
>> +panfrost-$(CONFIG_DEBUG_FS) += panfrost_debugfs.o
>> +
>>   obj-$(CONFIG_DRM_PANFROST) += panfrost.o
>> diff --git a/drivers/gpu/drm/panfrost/panfrost_debugfs.c b/drivers/gpu/drm/panfrost/panfrost_debugfs.c
>> new file mode 100644
>> index 000000000000..cc14eccba206
>> --- /dev/null
>> +++ b/drivers/gpu/drm/panfrost/panfrost_debugfs.c
>> @@ -0,0 +1,20 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +/* Copyright 2023 Collabora ltd. */
>> +
>> +#include <linux/debugfs.h>
>> +#include <linux/platform_device.h>
>> +#include <drm/drm_debugfs.h>
>> +#include <drm/drm_file.h>
>> +#include <drm/panfrost_drm.h>
>> +
>> +#include "panfrost_device.h"
>> +#include "panfrost_gpu.h"
>> +#include "panfrost_debugfs.h"
>> +
>> +void panfrost_debugfs_init(struct drm_minor *minor)
>> +{
>> +	struct drm_device *dev = minor->dev;
>> +	struct panfrost_device *pfdev = platform_get_drvdata(to_platform_device(dev->dev));
>> +
>> +	debugfs_create_atomic_t("profile", 0600, minor->debugfs_root, &pfdev->profile_mode);
>> +}
>> diff --git a/drivers/gpu/drm/panfrost/panfrost_debugfs.h b/drivers/gpu/drm/panfrost/panfrost_debugfs.h
>> new file mode 100644
>> index 000000000000..db1c158bcf2f
>> --- /dev/null
>> +++ b/drivers/gpu/drm/panfrost/panfrost_debugfs.h
>> @@ -0,0 +1,13 @@
>> +/* SPDX-License-Identifier: GPL-2.0 */
>> +/*
>> + * Copyright 2023 Collabora ltd.
>> + */
>> +
>> +#ifndef PANFROST_DEBUGFS_H
>> +#define PANFROST_DEBUGFS_H
>> +
>> +#ifdef CONFIG_DEBUG_FS
>> +void panfrost_debugfs_init(struct drm_minor *minor);
>> +#endif
>> +
>> +#endif  /* PANFROST_DEBUGFS_H */
>> diff --git a/drivers/gpu/drm/panfrost/panfrost_devfreq.c b/drivers/gpu/drm/panfrost/panfrost_devfreq.c
>> index 58dfb15a8757..28caffc689e2 100644
>> --- a/drivers/gpu/drm/panfrost/panfrost_devfreq.c
>> +++ b/drivers/gpu/drm/panfrost/panfrost_devfreq.c
>> @@ -58,6 +58,7 @@ static int panfrost_devfreq_get_dev_status(struct device *dev,
>>   	spin_lock_irqsave(&pfdevfreq->lock, irqflags);
>>   	panfrost_devfreq_update_utilization(pfdevfreq);
>> +	pfdevfreq->current_frequency = status->current_frequency;
>>   	status->total_time = ktime_to_ns(ktime_add(pfdevfreq->busy_time,
>>   						   pfdevfreq->idle_time));
>> @@ -117,6 +118,7 @@ int panfrost_devfreq_init(struct panfrost_device *pfdev)
>>   	struct devfreq *devfreq;
>>   	struct thermal_cooling_device *cooling;
>>   	struct panfrost_devfreq *pfdevfreq = &pfdev->pfdevfreq;
>> +	unsigned long freq = ULONG_MAX;
>>   	if (pfdev->comp->num_supplies > 1) {
>>   		/*
>> @@ -172,6 +174,12 @@ int panfrost_devfreq_init(struct panfrost_device *pfdev)
>>   		return ret;
>>   	}
>> +	/* Find the fastest defined rate  */
>> +	opp = dev_pm_opp_find_freq_floor(dev, &freq);
>> +	if (IS_ERR(opp))
>> +		return PTR_ERR(opp);
>> +	pfdevfreq->fast_rate = freq;
>> +
>>   	dev_pm_opp_put(opp);
>>   	/*
>> diff --git a/drivers/gpu/drm/panfrost/panfrost_devfreq.h b/drivers/gpu/drm/panfrost/panfrost_devfreq.h
>> index 1514c1f9d91c..48dbe185f206 100644
>> --- a/drivers/gpu/drm/panfrost/panfrost_devfreq.h
>> +++ b/drivers/gpu/drm/panfrost/panfrost_devfreq.h
>> @@ -19,6 +19,9 @@ struct panfrost_devfreq {
>>   	struct devfreq_simple_ondemand_data gov_data;
>>   	bool opp_of_table_added;
>> +	unsigned long current_frequency;
>> +	unsigned long fast_rate;
>> +
>>   	ktime_t busy_time;
>>   	ktime_t idle_time;
>>   	ktime_t time_last_update;
>> diff --git a/drivers/gpu/drm/panfrost/panfrost_device.c b/drivers/gpu/drm/panfrost/panfrost_device.c
>> index fa1a086a862b..28f7046e1b1a 100644
>> --- a/drivers/gpu/drm/panfrost/panfrost_device.c
>> +++ b/drivers/gpu/drm/panfrost/panfrost_device.c
>> @@ -207,6 +207,8 @@ int panfrost_device_init(struct panfrost_device *pfdev)
>>   	spin_lock_init(&pfdev->as_lock);
>> +	spin_lock_init(&pfdev->cycle_counter.lock);
>> +
>>   	err = panfrost_clk_init(pfdev);
>>   	if (err) {
>>   		dev_err(pfdev->dev, "clk init failed %d\n", err);
>> diff --git a/drivers/gpu/drm/panfrost/panfrost_device.h b/drivers/gpu/drm/panfrost/panfrost_device.h
>> index b0126b9fbadc..1e85656dc2f7 100644
>> --- a/drivers/gpu/drm/panfrost/panfrost_device.h
>> +++ b/drivers/gpu/drm/panfrost/panfrost_device.h
>> @@ -107,6 +107,7 @@ struct panfrost_device {
>>   	struct list_head scheduled_jobs;
>>   	struct panfrost_perfcnt *perfcnt;
>> +	atomic_t profile_mode;
>>   	struct mutex sched_lock;
>> @@ -121,6 +122,11 @@ struct panfrost_device {
>>   	struct shrinker shrinker;
>>   	struct panfrost_devfreq pfdevfreq;
>> +
>> +	struct {
>> +		atomic_t use_count;
>> +		spinlock_t lock;
>> +	} cycle_counter;
>>   };
>>   struct panfrost_mmu {
>> @@ -135,12 +141,19 @@ struct panfrost_mmu {
>>   	struct list_head list;
>>   };
>> +struct panfrost_engine_usage {
>> +	unsigned long long elapsed_ns[NUM_JOB_SLOTS];
>> +	unsigned long long cycles[NUM_JOB_SLOTS];
>> +};
>> +
>>   struct panfrost_file_priv {
>>   	struct panfrost_device *pfdev;
>>   	struct drm_sched_entity sched_entity[NUM_JOB_SLOTS];
>>   	struct panfrost_mmu *mmu;
>> +
>> +	struct panfrost_engine_usage engine_usage;
>>   };
>>   static inline struct panfrost_device *to_panfrost_device(struct drm_device *ddev)
>> diff --git a/drivers/gpu/drm/panfrost/panfrost_drv.c b/drivers/gpu/drm/panfrost/panfrost_drv.c
>> index a2ab99698ca8..3c93a11deab1 100644
>> --- a/drivers/gpu/drm/panfrost/panfrost_drv.c
>> +++ b/drivers/gpu/drm/panfrost/panfrost_drv.c
>> @@ -20,6 +20,7 @@
>>   #include "panfrost_job.h"
>>   #include "panfrost_gpu.h"
>>   #include "panfrost_perfcnt.h"
>> +#include "panfrost_debugfs.h"
>>   static bool unstable_ioctls;
>>   module_param_unsafe(unstable_ioctls, bool, 0600);
>> @@ -267,6 +268,7 @@ static int panfrost_ioctl_submit(struct drm_device *dev, void *data,
>>   	job->requirements = args->requirements;
>>   	job->flush_id = panfrost_gpu_get_latest_flush_id(pfdev);
>>   	job->mmu = file_priv->mmu;
>> +	job->engine_usage = &file_priv->engine_usage;
>>   	slot = panfrost_job_get_slot(job);
>> @@ -523,7 +525,55 @@ static const struct drm_ioctl_desc panfrost_drm_driver_ioctls[] = {
>>   	PANFROST_IOCTL(MADVISE,		madvise,	DRM_RENDER_ALLOW),
>>   };
>> -DEFINE_DRM_GEM_FOPS(panfrost_drm_driver_fops);
>> +
>> +static void panfrost_gpu_show_fdinfo(struct panfrost_device *pfdev,
>> +				     struct panfrost_file_priv *panfrost_priv,
>> +				     struct drm_printer *p)
>> +{
>> +	int i;
>> +
>> +	/*
>> +	 * IMPORTANT NOTE: drm-cycles and drm-engine measurements are not
>> +	 * accurate, as they only provide a rough estimation of the number of
>> +	 * GPU cycles and CPU time spent in a given context. This is due to two
>> +	 * different factors:
>> +	 * - Firstly, we must consider the time the CPU and then the kernel
>> +	 *   takes to process the GPU interrupt, which means additional time and
>> +	 *   GPU cycles will be added in excess to the real figure.
>> +	 * - Secondly, the pipelining done by the Job Manager (2 job slots per
>> +	 *   engine) implies there is no way to know exactly how much time each
>> +	 *   job spent on the GPU.
>> +	 */
>> +
>> +	static const char * const engine_names[] = {
>> +		"fragment", "vertex-tiler", "compute-only"
>> +	};
>> +
>> +	for (i = 0; i < NUM_JOB_SLOTS - 1; i++) {
>
>FWIW you could future proof this a bit by using "i < ARRAY_SIZE(engine_names)"
>and avoid maybe silent out of bounds reads if someone updates NUM_JOB_SLOTS
>and forgets about this loop. Or stick a warning of some sort.
>
NUM_JOB_SLOTS is actually the same as the number of engines in the device. I decided to follow
this loop convention because that's what's being done across the driver when manipulating
the engine queues, so I thought I'd stick to it for the sake of consistency. Bear in mind
the loop doesn't pick up the compute-only engine because it's still not exposed to user space.

So NUM_JOB_SLOTS cannot change, unless a new engine were introduced, and then someone would
have to update this array accordingly.
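
Were such a guard still wanted, the link between the slot count and the name
table could be enforced at compile time instead of at the loop. A minimal
userspace sketch (the names and slot count are copied from the patch; the
static assertion itself is hypothetical, not part of the series):

```c
#include <stddef.h>

#define NUM_JOB_SLOTS 3
#define ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

static const char * const engine_names[] = {
	"fragment", "vertex-tiler", "compute-only"
};

/* Fail the build if someone grows NUM_JOB_SLOTS without updating the
 * names table; the fdinfo loop then never indexes past valid entries. */
_Static_assert(ARRAY_SIZE(engine_names) == NUM_JOB_SLOTS,
	       "engine_names must cover every job slot");
```

With that in place, iterating with either bound stays safe, since a mismatch
can no longer compile.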

>> +		drm_printf(p, "drm-engine-%s:\t%llu ns\n",
>> +			   engine_names[i], panfrost_priv->engine_usage.elapsed_ns[i]);
>> +		drm_printf(p, "drm-cycles-%s:\t%llu\n",
>> +			   engine_names[i], panfrost_priv->engine_usage.cycles[i]);
>> +		drm_printf(p, "drm-maxfreq-%s:\t%lu Hz\n",
>> +			   engine_names[i], pfdev->pfdevfreq.fast_rate);
>> +		drm_printf(p, "drm-curfreq-%s:\t%lu Hz\n",
>> +			   engine_names[i], pfdev->pfdevfreq.current_frequency);
>
>I envisaged a link to driver specific docs at the bottom of
>drm-usage-stats.rst so it would be nice if drivers would be adding those
>sections and describing their private keys, engine names etc. ;)
>
Currently there's no panfrost.rst file under Documentation/gpu. I guess I'll create a new
one and add the engine descriptions and the meaning of the drm-curfreq key.

>Regards,
>
>Tvrtko
>
>> +	}
>> +}
>> +
>> +static void panfrost_show_fdinfo(struct drm_printer *p, struct drm_file *file)
>> +{
>> +	struct drm_device *dev = file->minor->dev;
>> +	struct panfrost_device *pfdev = dev->dev_private;
>> +
>> +	panfrost_gpu_show_fdinfo(pfdev, file->driver_priv, p);
>> +}
>> +
>> +static const struct file_operations panfrost_drm_driver_fops = {
>> +	.owner = THIS_MODULE,
>> +	DRM_GEM_FOPS,
>> +	.show_fdinfo = drm_show_fdinfo,
>> +};
>>   /*
>>    * Panfrost driver version:
>> @@ -535,6 +585,7 @@ static const struct drm_driver panfrost_drm_driver = {
>>   	.driver_features	= DRIVER_RENDER | DRIVER_GEM | DRIVER_SYNCOBJ,
>>   	.open			= panfrost_open,
>>   	.postclose		= panfrost_postclose,
>> +	.show_fdinfo		= panfrost_show_fdinfo,
>>   	.ioctls			= panfrost_drm_driver_ioctls,
>>   	.num_ioctls		= ARRAY_SIZE(panfrost_drm_driver_ioctls),
>>   	.fops			= &panfrost_drm_driver_fops,
>> @@ -546,6 +597,10 @@ static const struct drm_driver panfrost_drm_driver = {
>>   	.gem_create_object	= panfrost_gem_create_object,
>>   	.gem_prime_import_sg_table = panfrost_gem_prime_import_sg_table,
>> +
>> +#ifdef CONFIG_DEBUG_FS
>> +	.debugfs_init		= panfrost_debugfs_init,
>> +#endif
>>   };
>>   static int panfrost_probe(struct platform_device *pdev)
>> diff --git a/drivers/gpu/drm/panfrost/panfrost_gpu.c b/drivers/gpu/drm/panfrost/panfrost_gpu.c
>> index 2faa344d89ee..f0be7e19b13e 100644
>> --- a/drivers/gpu/drm/panfrost/panfrost_gpu.c
>> +++ b/drivers/gpu/drm/panfrost/panfrost_gpu.c
>> @@ -73,6 +73,13 @@ int panfrost_gpu_soft_reset(struct panfrost_device *pfdev)
>>   	gpu_write(pfdev, GPU_INT_CLEAR, GPU_IRQ_MASK_ALL);
>>   	gpu_write(pfdev, GPU_INT_MASK, GPU_IRQ_MASK_ALL);
>> +	/*
>> +	 * All in-flight jobs should have released their cycle
>> +	 * counter references upon reset, but let us make sure
>> +	 */
>> +	if (drm_WARN_ON(pfdev->ddev, atomic_read(&pfdev->cycle_counter.use_count) != 0))
>> +		atomic_set(&pfdev->cycle_counter.use_count, 0);
>> +
>>   	return 0;
>>   }
>> @@ -321,6 +328,40 @@ static void panfrost_gpu_init_features(struct panfrost_device *pfdev)
>>   		 pfdev->features.shader_present, pfdev->features.l2_present);
>>   }
>> +void panfrost_cycle_counter_get(struct panfrost_device *pfdev)
>> +{
>> +	if (atomic_inc_not_zero(&pfdev->cycle_counter.use_count))
>> +		return;
>> +
>> +	spin_lock(&pfdev->cycle_counter.lock);
>> +	if (atomic_inc_return(&pfdev->cycle_counter.use_count) == 1)
>> +		gpu_write(pfdev, GPU_CMD, GPU_CMD_CYCLE_COUNT_START);
>> +	spin_unlock(&pfdev->cycle_counter.lock);
>> +}
>> +
>> +void panfrost_cycle_counter_put(struct panfrost_device *pfdev)
>> +{
>> +	if (atomic_add_unless(&pfdev->cycle_counter.use_count, -1, 1))
>> +		return;
>> +
>> +	spin_lock(&pfdev->cycle_counter.lock);
>> +	if (atomic_dec_return(&pfdev->cycle_counter.use_count) == 0)
>> +		gpu_write(pfdev, GPU_CMD, GPU_CMD_CYCLE_COUNT_STOP);
>> +	spin_unlock(&pfdev->cycle_counter.lock);
>> +}
>> +
>> +unsigned long long panfrost_cycle_counter_read(struct panfrost_device *pfdev)
>> +{
>> +	u32 hi, lo;
>> +
>> +	do {
>> +		hi = gpu_read(pfdev, GPU_CYCLE_COUNT_HI);
>> +		lo = gpu_read(pfdev, GPU_CYCLE_COUNT_LO);
>> +	} while (hi != gpu_read(pfdev, GPU_CYCLE_COUNT_HI));
>> +
>> +	return ((u64)hi << 32) | lo;
>> +}
>> +
>>   void panfrost_gpu_power_on(struct panfrost_device *pfdev)
>>   {
>>   	int ret;
>> diff --git a/drivers/gpu/drm/panfrost/panfrost_gpu.h b/drivers/gpu/drm/panfrost/panfrost_gpu.h
>> index 468c51e7e46d..876fdad9f721 100644
>> --- a/drivers/gpu/drm/panfrost/panfrost_gpu.h
>> +++ b/drivers/gpu/drm/panfrost/panfrost_gpu.h
>> @@ -16,6 +16,10 @@ int panfrost_gpu_soft_reset(struct panfrost_device *pfdev);
>>   void panfrost_gpu_power_on(struct panfrost_device *pfdev);
>>   void panfrost_gpu_power_off(struct panfrost_device *pfdev);
>> +void panfrost_cycle_counter_get(struct panfrost_device *pfdev);
>> +void panfrost_cycle_counter_put(struct panfrost_device *pfdev);
>> +unsigned long long panfrost_cycle_counter_read(struct panfrost_device *pfdev);
>> +
>>   void panfrost_gpu_amlogic_quirk(struct panfrost_device *pfdev);
>>   #endif
>> diff --git a/drivers/gpu/drm/panfrost/panfrost_job.c b/drivers/gpu/drm/panfrost/panfrost_job.c
>> index 033f5e684707..fb16de2d0420 100644
>> --- a/drivers/gpu/drm/panfrost/panfrost_job.c
>> +++ b/drivers/gpu/drm/panfrost/panfrost_job.c
>> @@ -159,6 +159,16 @@ panfrost_dequeue_job(struct panfrost_device *pfdev, int slot)
>>   	struct panfrost_job *job = pfdev->jobs[slot][0];
>>   	WARN_ON(!job);
>> +	if (job->is_profiled) {
>> +		if (job->engine_usage) {
>> +			job->engine_usage->elapsed_ns[slot] +=
>> +				ktime_to_ns(ktime_sub(ktime_get(), job->start_time));
>> +			job->engine_usage->cycles[slot] +=
>> +				panfrost_cycle_counter_read(pfdev) - job->start_cycles;
>> +		}
>> +		panfrost_cycle_counter_put(job->pfdev);
>> +	}
>> +
>>   	pfdev->jobs[slot][0] = pfdev->jobs[slot][1];
>>   	pfdev->jobs[slot][1] = NULL;
>> @@ -233,6 +243,13 @@ static void panfrost_job_hw_submit(struct panfrost_job *job, int js)
>>   	subslot = panfrost_enqueue_job(pfdev, js, job);
>>   	/* Don't queue the job if a reset is in progress */
>>   	if (!atomic_read(&pfdev->reset.pending)) {
>> +		if (atomic_read(&pfdev->profile_mode)) {
>> +			panfrost_cycle_counter_get(pfdev);
>> +			job->is_profiled = true;
>> +			job->start_time = ktime_get();
>> +			job->start_cycles = panfrost_cycle_counter_read(pfdev);
>> +		}
>> +
>>   		job_write(pfdev, JS_COMMAND_NEXT(js), JS_COMMAND_START);
>>   		dev_dbg(pfdev->dev,
>>   			"JS: Submitting atom %p to js[%d][%d] with head=0x%llx AS %d",
>> @@ -660,10 +677,14 @@ panfrost_reset(struct panfrost_device *pfdev,
>>   	 * stuck jobs. Let's make sure the PM counters stay balanced by
>>   	 * manually calling pm_runtime_put_noidle() and
>>   	 * panfrost_devfreq_record_idle() for each stuck job.
>> +	 * Let's also make sure the cycle counting register's refcnt is
>> +	 * kept balanced to prevent it from running forever
>>   	 */
>>   	spin_lock(&pfdev->js->job_lock);
>>   	for (i = 0; i < NUM_JOB_SLOTS; i++) {
>>   		for (j = 0; j < ARRAY_SIZE(pfdev->jobs[0]) && pfdev->jobs[i][j]; j++) {
>> +			if (pfdev->jobs[i][j]->is_profiled)
>> +				panfrost_cycle_counter_put(pfdev->jobs[i][j]->pfdev);
>>   			pm_runtime_put_noidle(pfdev->dev);
>>   			panfrost_devfreq_record_idle(&pfdev->pfdevfreq);
>>   		}
>> @@ -926,6 +947,9 @@ void panfrost_job_close(struct panfrost_file_priv *panfrost_priv)
>>   			}
>>   			job_write(pfdev, JS_COMMAND(i), cmd);
>> +
>> +			/* Jobs can outlive their file context */
>> +			job->engine_usage = NULL;
>>   		}
>>   	}
>>   	spin_unlock(&pfdev->js->job_lock);
>> diff --git a/drivers/gpu/drm/panfrost/panfrost_job.h b/drivers/gpu/drm/panfrost/panfrost_job.h
>> index 8becc1ba0eb9..17ff808dba07 100644
>> --- a/drivers/gpu/drm/panfrost/panfrost_job.h
>> +++ b/drivers/gpu/drm/panfrost/panfrost_job.h
>> @@ -32,6 +32,11 @@ struct panfrost_job {
>>   	/* Fence to be signaled by drm-sched once its done with the job */
>>   	struct dma_fence *render_done_fence;
>> +
>> +	struct panfrost_engine_usage *engine_usage;
>> +	bool is_profiled;
>> +	ktime_t start_time;
>> +	u64 start_cycles;
>>   };
>>   int panfrost_job_init(struct panfrost_device *pfdev);

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v6 4/6] drm/drm_file: Add DRM obj's RSS reporting function for fdinfo
  2023-09-20 15:53   ` Tvrtko Ursulin
@ 2023-09-22 10:58     ` Adrián Larumbe
  2023-09-27 14:36       ` Tvrtko Ursulin
  0 siblings, 1 reply; 20+ messages in thread
From: Adrián Larumbe @ 2023-09-22 10:58 UTC (permalink / raw)
  To: Tvrtko Ursulin
  Cc: maarten.lankhorst, mripard, tzimmermann, airlied, daniel,
	robdclark, quic_abhinavk, dmitry.baryshkov, sean, marijn.suijten,
	robh, steven.price, linux-arm-msm, linux-kernel, dri-devel,
	healych, Boris Brezillon, kernel, freedreno

On 20.09.2023 16:53, Tvrtko Ursulin wrote:
>
>On 20/09/2023 00:34, Adrián Larumbe wrote:
>> Some BO's might be mapped onto physical memory chunkwise and on demand,
>> like Panfrost's tiler heap. In this case, even though the
>> drm_gem_shmem_object page array might already be allocated, only a very
>> small fraction of the BO is currently backed by system memory, but
>> drm_show_memory_stats will then proceed to add its entire virtual size to
>> the file's total resident size regardless.
>> 
>> This led to very unrealistic RSS sizes being reckoned for Panfrost, where
>> said tiler heap buffer is initially allocated with a virtual size of 128
>> MiB, but only a small part of it will eventually be backed by system memory
>> after successive GPU page faults.
>> 
>> Provide a new DRM object generic function that would allow drivers to
>> return a more accurate RSS size for their BOs.
>> 
>> Signed-off-by: Adrián Larumbe <adrian.larumbe@collabora.com>
>> Reviewed-by: Boris Brezillon <boris.brezillon@collabora.com>
>> Reviewed-by: Steven Price <steven.price@arm.com>
>> ---
>>   drivers/gpu/drm/drm_file.c | 5 ++++-
>>   include/drm/drm_gem.h      | 9 +++++++++
>>   2 files changed, 13 insertions(+), 1 deletion(-)
>> 
>> diff --git a/drivers/gpu/drm/drm_file.c b/drivers/gpu/drm/drm_file.c
>> index 883d83bc0e3d..762965e3d503 100644
>> --- a/drivers/gpu/drm/drm_file.c
>> +++ b/drivers/gpu/drm/drm_file.c
>> @@ -944,7 +944,10 @@ void drm_show_memory_stats(struct drm_printer *p, struct drm_file *file)
>>   		}
>>   		if (s & DRM_GEM_OBJECT_RESIDENT) {
>> -			status.resident += obj->size;
>> +			if (obj->funcs && obj->funcs->rss)
>> +				status.resident += obj->funcs->rss(obj);
>> +			else
>> +				status.resident += obj->size;
>
>Presumably you'd want the same smaller size in both active and purgeable? Or
>you can end up with more in those two than in rss which would look odd.

I didn't think of this. I guess when an object is both resident and purgeable,
then its RSS and purgeable sizes should be the same.

>Also, an alternative to adding a new callback could be adding multiple output
>parameters to the existing obj->funcs->status(), which maybe ends up simpler
>due to fewer callbacks?
>
>Like:
>
> s = obj->funcs->status(obj, &supported_status, &rss)
>
>And adjust the code flow to pick up the rss if driver signaled it supports
>reporting it.

I personally find having a separate object callback more readable in this case.
There's also the question of which output parameter value would signal that the
BO's RSS doesn't differ from its virtual size. I guess '0' would be alright,
but that assumes '0' could never be a legitimate BO virtual size across all
DRM drivers. I guess most of them round the size up to the nearest page
multiple at BO creation time.
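
The fallback semantics of the separate callback are easy to model in
isolation. A hedged userspace sketch of the accumulation path from the patch
(toy types standing in for drm_gem_object; the 128 MiB tiler-heap figure is
taken from the commit message):

```c
#include <stddef.h>

struct toy_obj;

struct toy_obj_funcs {
	/* Optional: return resident size; NULL means "use obj->size". */
	size_t (*rss)(struct toy_obj *obj);
};

struct toy_obj {
	const struct toy_obj_funcs *funcs;
	size_t size;	/* virtual size of the BO */
	size_t backed;	/* bytes actually faulted in */
};

/* A heap-style BO reports only the memory it is really backed by. */
static size_t heap_rss(struct toy_obj *obj)
{
	return obj->backed;
}

/* Mirrors the drm_show_memory_stats() change: prefer the driver's RSS
 * callback when present, fall back to the full virtual size otherwise. */
static size_t resident_size(struct toy_obj *obj)
{
	if (obj->funcs && obj->funcs->rss)
		return obj->funcs->rss(obj);
	return obj->size;
}
```

A 128 MiB heap BO with only 2 MiB faulted in then contributes 2 MiB to the
resident tally instead of its whole virtual size.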

>
>Regards,
>
>Tvrtko
>
>>   		} else {
>>   			/* If already purged or not yet backed by pages, don't
>>   			 * count it as purgeable:
>> diff --git a/include/drm/drm_gem.h b/include/drm/drm_gem.h
>> index bc9f6aa2f3fe..16364487fde9 100644
>> --- a/include/drm/drm_gem.h
>> +++ b/include/drm/drm_gem.h
>> @@ -208,6 +208,15 @@ struct drm_gem_object_funcs {
>>   	 */
>>   	enum drm_gem_object_status (*status)(struct drm_gem_object *obj);
>> +	/**
>> +	 * @rss:
>> +	 *
>> +	 * Return resident size of the object in physical memory.
>> +	 *
>> +	 * Called by drm_show_memory_stats().
>> +	 */
>> +	size_t (*rss)(struct drm_gem_object *obj);
>> +
>>   	/**
>>   	 * @vm_ops:
>>   	 *

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v6 6/6] drm/drm-file: Show finer-grained BO sizes in drm_show_memory_stats
  2023-09-20 15:32   ` Tvrtko Ursulin
  2023-09-21 10:14     ` Tvrtko Ursulin
@ 2023-09-22 11:01     ` Adrián Larumbe
  1 sibling, 0 replies; 20+ messages in thread
From: Adrián Larumbe @ 2023-09-22 11:01 UTC (permalink / raw)
  To: Tvrtko Ursulin
  Cc: maarten.lankhorst, mripard, tzimmermann, airlied, daniel,
	robdclark, quic_abhinavk, dmitry.baryshkov, sean, marijn.suijten,
	robh, steven.price, linux-arm-msm, linux-kernel, dri-devel,
	healych, Boris Brezillon, kernel, freedreno

On 20.09.2023 16:32, Tvrtko Ursulin wrote:
>
>On 20/09/2023 00:34, Adrián Larumbe wrote:
>> The current implementation will try to pick the highest available size
>> display unit as soon as the BO size exceeds that of the previous
>> multiplier. That can lead to loss of precision in contexts of low memory
>> usage.
>> 
>> The new selection criteria try to preserve precision, whilst also
>> increasing the display unit selection threshold to render more accurate
>> values.
>> 
>> Signed-off-by: Adrián Larumbe <adrian.larumbe@collabora.com>
>> Reviewed-by: Boris Brezillon <boris.brezillon@collabora.com>
>> Reviewed-by: Steven Price <steven.price@arm.com>
>> ---
>>   drivers/gpu/drm/drm_file.c | 5 ++++-
>>   1 file changed, 4 insertions(+), 1 deletion(-)
>> 
>> diff --git a/drivers/gpu/drm/drm_file.c b/drivers/gpu/drm/drm_file.c
>> index 762965e3d503..34cfa128ffe5 100644
>> --- a/drivers/gpu/drm/drm_file.c
>> +++ b/drivers/gpu/drm/drm_file.c
>> @@ -872,6 +872,8 @@ void drm_send_event(struct drm_device *dev, struct drm_pending_event *e)
>>   }
>>   EXPORT_SYMBOL(drm_send_event);
>> +#define UPPER_UNIT_THRESHOLD 100
>> +
>>   static void print_size(struct drm_printer *p, const char *stat,
>>   		       const char *region, u64 sz)
>>   {
>> @@ -879,7 +881,8 @@ static void print_size(struct drm_printer *p, const char *stat,
>>   	unsigned u;
>>   	for (u = 0; u < ARRAY_SIZE(units) - 1; u++) {
>> -		if (sz < SZ_1K)
>> +		if ((sz & (SZ_1K - 1)) &&
>
>IS_ALIGNED worth it at all?

This could look better, yeah.

>> +		    sz < UPPER_UNIT_THRESHOLD * SZ_1K)
>>   			break;
>
>Excuse me for a late comment (I was away). I did not get what is special
>about a ~10% threshold? Sounds to me like just going with the lower unit,
>when the size is not aligned to the higher one, would be better than
>sometimes-precision-sometimes-not.

We had a bit of a debate over this in previous revisions of the patch. It all began
when a Panfrost user complained that for relatively small BOs, they were losing
precision in the fdinfo file because the sum of the sizes of all BOs for a drm file
was in the order of MiBs, but not big enough to warrant losing accuracy when
plotting them on nvtop or gputop.

At first I thought of letting drivers pick their own preferred unit, but this would
lead to inconsistency in the units presented in the fdinfo file across different
DRM devices. Rob then suggested imposing a unit multiple threshold, while Boris
made the suggestion of checking for unit size alignment to lessen precision loss.

In the end Rob thought that honouring both constraints was a good compromise.

The unit threshold was picked sort of arbitrarily, and suggested by Rob himself. The
point of having it is to avoid huge number representations for BO size tallies that
aren't aligned to the next unit; besides, BO size sums are scaled when plotted on a
Y axis, so complete accuracy isn't a requirement.
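
To make the combined rule concrete, here is a hedged userspace model of the
proposed print_size() selection (the unit table and threshold are copied from
the patch; pick_unit() is a made-up helper name for illustration):

```c
#include <stdint.h>

#define SZ_1K 1024u
#define UPPER_UNIT_THRESHOLD 100

static const char * const units[] = { "", " KiB", " MiB" };

/* Keep climbing units while the size is 1 KiB-aligned or at least
 * 100x the next unit; stop early for small unaligned values so no
 * precision is lost at the low end. Returns the chosen unit index
 * and scales *sz down to match. */
static unsigned int pick_unit(uint64_t *sz)
{
	unsigned int u;

	for (u = 0; u < sizeof(units) / sizeof(units[0]) - 1; u++) {
		if ((*sz & (SZ_1K - 1)) &&
		    *sz < UPPER_UNIT_THRESHOLD * SZ_1K)
			break;
		*sz /= SZ_1K;
	}
	return u;
}
```

So a 5121-byte tally stays in bytes, while an unaligned tally above the
threshold still moves up a unit and accepts the truncation.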

>Regards,
>
>Tvrtko
>
>>   		sz = div_u64(sz, SZ_1K);
>>   	}

Adrian Larumbe

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v6 6/6] drm/drm-file: Show finer-grained BO sizes in drm_show_memory_stats
  2023-09-21 10:14     ` Tvrtko Ursulin
@ 2023-09-22 11:03       ` Adrián Larumbe
  2023-09-22 14:02         ` Tvrtko Ursulin
  0 siblings, 1 reply; 20+ messages in thread
From: Adrián Larumbe @ 2023-09-22 11:03 UTC (permalink / raw)
  To: Tvrtko Ursulin
  Cc: maarten.lankhorst, mripard, tzimmermann, airlied, daniel,
	robdclark, quic_abhinavk, dmitry.baryshkov, sean, marijn.suijten,
	robh, steven.price, linux-arm-msm, linux-kernel, dri-devel,
	healych, Boris Brezillon, kernel, freedreno

On 21.09.2023 11:14, Tvrtko Ursulin wrote:
>
>On 20/09/2023 16:32, Tvrtko Ursulin wrote:
>> 
>> On 20/09/2023 00:34, Adrián Larumbe wrote:
>> > The current implementation will try to pick the highest available size
>> > display unit as soon as the BO size exceeds that of the previous
>> > multiplier. That can lead to loss of precision in contexts of low memory
>> > usage.
>> > 
>> > The new selection criteria try to preserve precision, whilst also
>> > increasing the display unit selection threshold to render more accurate
>> > values.
>> > 
>> > Signed-off-by: Adrián Larumbe <adrian.larumbe@collabora.com>
>> > Reviewed-by: Boris Brezillon <boris.brezillon@collabora.com>
>> > Reviewed-by: Steven Price <steven.price@arm.com>
>> > ---
>> >   drivers/gpu/drm/drm_file.c | 5 ++++-
>> >   1 file changed, 4 insertions(+), 1 deletion(-)
>> > 
>> > diff --git a/drivers/gpu/drm/drm_file.c b/drivers/gpu/drm/drm_file.c
>> > index 762965e3d503..34cfa128ffe5 100644
>> > --- a/drivers/gpu/drm/drm_file.c
>> > +++ b/drivers/gpu/drm/drm_file.c
>> > @@ -872,6 +872,8 @@ void drm_send_event(struct drm_device *dev, struct drm_pending_event *e)
>> >   }
>> >   EXPORT_SYMBOL(drm_send_event);
>> > +#define UPPER_UNIT_THRESHOLD 100
>> > +
>> >   static void print_size(struct drm_printer *p, const char *stat,
>> >                  const char *region, u64 sz)
>> >   {
>> > @@ -879,7 +881,8 @@ static void print_size(struct drm_printer *p, const char *stat,
>> >       unsigned u;
>> >       for (u = 0; u < ARRAY_SIZE(units) - 1; u++) {
>> > -        if (sz < SZ_1K)
>> > +        if ((sz & (SZ_1K - 1)) &&
>> 
>> IS_ALIGNED worth it at all?
>> 
>> > +            sz < UPPER_UNIT_THRESHOLD * SZ_1K)
>> >               break;
>> 
>> Excuse me for a late comment (I was away). I did not get what is
>> special about a ~10% threshold? Sounds to me like just going with the
>> lower unit, when the size is not aligned to the higher one, would be
>> better than sometimes-precision-sometimes-not.
>
>FWIW both current and the threshold option make testing the feature very
>annoying.

How so?

>So I'd really propose we simply use smaller unit when unaligned.

Like I said in the previous reply, for drm files whose overall BO size sum is enormous
but not a multiple of a MiB, this would render huge numbers in KiB. I don't find
that particularly comfortable to read, and the extra precision would mean nothing to
nvtop or gputop, which have to scale the sizes to their available screen dimensions
when plotting them anyway.

>Regards,
>
>Tvrtko

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v6 2/6] drm/panfrost: Add fdinfo support GPU load metrics
  2023-09-22 10:57     ` Adrián Larumbe
@ 2023-09-22 13:53       ` Tvrtko Ursulin
  2023-09-22 15:23         ` Steven Price
  0 siblings, 1 reply; 20+ messages in thread
From: Tvrtko Ursulin @ 2023-09-22 13:53 UTC (permalink / raw)
  To: Adrián Larumbe
  Cc: maarten.lankhorst, mripard, tzimmermann, airlied, daniel,
	robdclark, quic_abhinavk, dmitry.baryshkov, sean, marijn.suijten,
	robh, steven.price, linux-arm-msm, linux-kernel, dri-devel,
	healych, Boris Brezillon, kernel, freedreno


On 22/09/2023 11:57, Adrián Larumbe wrote:
> On 20.09.2023 16:40, Tvrtko Ursulin wrote:
>> On 20/09/2023 00:34, Adrián Larumbe wrote:
>>> The drm-stats fdinfo tags made available to user space are drm-engine,
>>> drm-cycles, drm-max-freq and drm-curfreq, one per job slot.
>>>
>>> This deviates from standard practice in other DRM drivers, where a single
>>> set of key:value pairs is provided for the whole render engine. However,
>>> Panfrost has separate queues for fragment and vertex/tiler jobs, so a
>>> decision was made to calculate bus cycles and workload times separately.
>>>
>>> Maximum operating frequency is calculated at devfreq initialisation time.
>>> Current frequency is made available to user space because nvtop uses it
>>> when performing engine usage calculations.
>>>
>>> It is important to bear in mind that both GPU cycle and kernel time numbers
>>> provided are at best rough estimations, and always reported in excess from
>>> the actual figure because of two reasons:
>>>    - Excess time because of the delay between the end of a job processing,
>>>      the subsequent job IRQ and the actual time of the sample.
>>>    - Time spent in the engine queue waiting for the GPU to pick up the next
>>>      job.
>>>
>>> To avoid race conditions during enablement/disabling, a reference counting
>>> mechanism was introduced, and a job flag that tells us whether a given job
>>> increased the refcount. This is necessary, because user space can toggle
>>> cycle counting through a debugfs file, and a given job might have been in
>>> flight by the time cycle counting was disabled.
>>>
>>> The main goal of the debugfs cycle counter knob is letting tools like nvtop
>>> or IGT's gputop switch it at any time, to avoid power waste in case no
>>> engine usage measuring is necessary.
>>>
>>> Signed-off-by: Adrián Larumbe <adrian.larumbe@collabora.com>
>>> Reviewed-by: Boris Brezillon <boris.brezillon@collabora.com>
>>> Reviewed-by: Steven Price <steven.price@arm.com>
>>> ---
>>>    drivers/gpu/drm/panfrost/Makefile           |  2 +
>>>    drivers/gpu/drm/panfrost/panfrost_debugfs.c | 20 ++++++++
>>>    drivers/gpu/drm/panfrost/panfrost_debugfs.h | 13 +++++
>>>    drivers/gpu/drm/panfrost/panfrost_devfreq.c |  8 +++
>>>    drivers/gpu/drm/panfrost/panfrost_devfreq.h |  3 ++
>>>    drivers/gpu/drm/panfrost/panfrost_device.c  |  2 +
>>>    drivers/gpu/drm/panfrost/panfrost_device.h  | 13 +++++
>>>    drivers/gpu/drm/panfrost/panfrost_drv.c     | 57 ++++++++++++++++++++-
>>>    drivers/gpu/drm/panfrost/panfrost_gpu.c     | 41 +++++++++++++++
>>>    drivers/gpu/drm/panfrost/panfrost_gpu.h     |  4 ++
>>>    drivers/gpu/drm/panfrost/panfrost_job.c     | 24 +++++++++
>>>    drivers/gpu/drm/panfrost/panfrost_job.h     |  5 ++
>>>    12 files changed, 191 insertions(+), 1 deletion(-)
>>>    create mode 100644 drivers/gpu/drm/panfrost/panfrost_debugfs.c
>>>    create mode 100644 drivers/gpu/drm/panfrost/panfrost_debugfs.h
>>>
>>> diff --git a/drivers/gpu/drm/panfrost/Makefile b/drivers/gpu/drm/panfrost/Makefile
>>> index 7da2b3f02ed9..2c01c1e7523e 100644
>>> --- a/drivers/gpu/drm/panfrost/Makefile
>>> +++ b/drivers/gpu/drm/panfrost/Makefile
>>> @@ -12,4 +12,6 @@ panfrost-y := \
>>>    	panfrost_perfcnt.o \
>>>    	panfrost_dump.o
>>> +panfrost-$(CONFIG_DEBUG_FS) += panfrost_debugfs.o
>>> +
>>>    obj-$(CONFIG_DRM_PANFROST) += panfrost.o
>>> diff --git a/drivers/gpu/drm/panfrost/panfrost_debugfs.c b/drivers/gpu/drm/panfrost/panfrost_debugfs.c
>>> new file mode 100644
>>> index 000000000000..cc14eccba206
>>> --- /dev/null
>>> +++ b/drivers/gpu/drm/panfrost/panfrost_debugfs.c
>>> @@ -0,0 +1,20 @@
>>> +// SPDX-License-Identifier: GPL-2.0
>>> +/* Copyright 2023 Collabora ltd. */
>>> +
>>> +#include <linux/debugfs.h>
>>> +#include <linux/platform_device.h>
>>> +#include <drm/drm_debugfs.h>
>>> +#include <drm/drm_file.h>
>>> +#include <drm/panfrost_drm.h>
>>> +
>>> +#include "panfrost_device.h"
>>> +#include "panfrost_gpu.h"
>>> +#include "panfrost_debugfs.h"
>>> +
>>> +void panfrost_debugfs_init(struct drm_minor *minor)
>>> +{
>>> +	struct drm_device *dev = minor->dev;
>>> +	struct panfrost_device *pfdev = platform_get_drvdata(to_platform_device(dev->dev));
>>> +
>>> +	debugfs_create_atomic_t("profile", 0600, minor->debugfs_root, &pfdev->profile_mode);
>>> +}
>>> diff --git a/drivers/gpu/drm/panfrost/panfrost_debugfs.h b/drivers/gpu/drm/panfrost/panfrost_debugfs.h
>>> new file mode 100644
>>> index 000000000000..db1c158bcf2f
>>> --- /dev/null
>>> +++ b/drivers/gpu/drm/panfrost/panfrost_debugfs.h
>>> @@ -0,0 +1,13 @@
>>> +/* SPDX-License-Identifier: GPL-2.0 */
>>> +/*
>>> + * Copyright 2023 Collabora ltd.
>>> + */
>>> +
>>> +#ifndef PANFROST_DEBUGFS_H
>>> +#define PANFROST_DEBUGFS_H
>>> +
>>> +#ifdef CONFIG_DEBUG_FS
>>> +void panfrost_debugfs_init(struct drm_minor *minor);
>>> +#endif
>>> +
>>> +#endif  /* PANFROST_DEBUGFS_H */
>>> diff --git a/drivers/gpu/drm/panfrost/panfrost_devfreq.c b/drivers/gpu/drm/panfrost/panfrost_devfreq.c
>>> index 58dfb15a8757..28caffc689e2 100644
>>> --- a/drivers/gpu/drm/panfrost/panfrost_devfreq.c
>>> +++ b/drivers/gpu/drm/panfrost/panfrost_devfreq.c
>>> @@ -58,6 +58,7 @@ static int panfrost_devfreq_get_dev_status(struct device *dev,
>>>    	spin_lock_irqsave(&pfdevfreq->lock, irqflags);
>>>    	panfrost_devfreq_update_utilization(pfdevfreq);
>>> +	pfdevfreq->current_frequency = status->current_frequency;
>>>    	status->total_time = ktime_to_ns(ktime_add(pfdevfreq->busy_time,
>>>    						   pfdevfreq->idle_time));
>>> @@ -117,6 +118,7 @@ int panfrost_devfreq_init(struct panfrost_device *pfdev)
>>>    	struct devfreq *devfreq;
>>>    	struct thermal_cooling_device *cooling;
>>>    	struct panfrost_devfreq *pfdevfreq = &pfdev->pfdevfreq;
>>> +	unsigned long freq = ULONG_MAX;
>>>    	if (pfdev->comp->num_supplies > 1) {
>>>    		/*
>>> @@ -172,6 +174,12 @@ int panfrost_devfreq_init(struct panfrost_device *pfdev)
>>>    		return ret;
>>>    	}
>>> +	/* Find the fastest defined rate  */
>>> +	opp = dev_pm_opp_find_freq_floor(dev, &freq);
>>> +	if (IS_ERR(opp))
>>> +		return PTR_ERR(opp);
>>> +	pfdevfreq->fast_rate = freq;
>>> +
>>>    	dev_pm_opp_put(opp);
>>>    	/*
>>> diff --git a/drivers/gpu/drm/panfrost/panfrost_devfreq.h b/drivers/gpu/drm/panfrost/panfrost_devfreq.h
>>> index 1514c1f9d91c..48dbe185f206 100644
>>> --- a/drivers/gpu/drm/panfrost/panfrost_devfreq.h
>>> +++ b/drivers/gpu/drm/panfrost/panfrost_devfreq.h
>>> @@ -19,6 +19,9 @@ struct panfrost_devfreq {
>>>    	struct devfreq_simple_ondemand_data gov_data;
>>>    	bool opp_of_table_added;
>>> +	unsigned long current_frequency;
>>> +	unsigned long fast_rate;
>>> +
>>>    	ktime_t busy_time;
>>>    	ktime_t idle_time;
>>>    	ktime_t time_last_update;
>>> diff --git a/drivers/gpu/drm/panfrost/panfrost_device.c b/drivers/gpu/drm/panfrost/panfrost_device.c
>>> index fa1a086a862b..28f7046e1b1a 100644
>>> --- a/drivers/gpu/drm/panfrost/panfrost_device.c
>>> +++ b/drivers/gpu/drm/panfrost/panfrost_device.c
>>> @@ -207,6 +207,8 @@ int panfrost_device_init(struct panfrost_device *pfdev)
>>>    	spin_lock_init(&pfdev->as_lock);
>>> +	spin_lock_init(&pfdev->cycle_counter.lock);
>>> +
>>>    	err = panfrost_clk_init(pfdev);
>>>    	if (err) {
>>>    		dev_err(pfdev->dev, "clk init failed %d\n", err);
>>> diff --git a/drivers/gpu/drm/panfrost/panfrost_device.h b/drivers/gpu/drm/panfrost/panfrost_device.h
>>> index b0126b9fbadc..1e85656dc2f7 100644
>>> --- a/drivers/gpu/drm/panfrost/panfrost_device.h
>>> +++ b/drivers/gpu/drm/panfrost/panfrost_device.h
>>> @@ -107,6 +107,7 @@ struct panfrost_device {
>>>    	struct list_head scheduled_jobs;
>>>    	struct panfrost_perfcnt *perfcnt;
>>> +	atomic_t profile_mode;
>>>    	struct mutex sched_lock;
>>> @@ -121,6 +122,11 @@ struct panfrost_device {
>>>    	struct shrinker shrinker;
>>>    	struct panfrost_devfreq pfdevfreq;
>>> +
>>> +	struct {
>>> +		atomic_t use_count;
>>> +		spinlock_t lock;
>>> +	} cycle_counter;
>>>    };
>>>    struct panfrost_mmu {
>>> @@ -135,12 +141,19 @@ struct panfrost_mmu {
>>>    	struct list_head list;
>>>    };
>>> +struct panfrost_engine_usage {
>>> +	unsigned long long elapsed_ns[NUM_JOB_SLOTS];
>>> +	unsigned long long cycles[NUM_JOB_SLOTS];
>>> +};
>>> +
>>>    struct panfrost_file_priv {
>>>    	struct panfrost_device *pfdev;
>>>    	struct drm_sched_entity sched_entity[NUM_JOB_SLOTS];
>>>    	struct panfrost_mmu *mmu;
>>> +
>>> +	struct panfrost_engine_usage engine_usage;
>>>    };
>>>    static inline struct panfrost_device *to_panfrost_device(struct drm_device *ddev)
>>> diff --git a/drivers/gpu/drm/panfrost/panfrost_drv.c b/drivers/gpu/drm/panfrost/panfrost_drv.c
>>> index a2ab99698ca8..3c93a11deab1 100644
>>> --- a/drivers/gpu/drm/panfrost/panfrost_drv.c
>>> +++ b/drivers/gpu/drm/panfrost/panfrost_drv.c
>>> @@ -20,6 +20,7 @@
>>>    #include "panfrost_job.h"
>>>    #include "panfrost_gpu.h"
>>>    #include "panfrost_perfcnt.h"
>>> +#include "panfrost_debugfs.h"
>>>    static bool unstable_ioctls;
>>>    module_param_unsafe(unstable_ioctls, bool, 0600);
>>> @@ -267,6 +268,7 @@ static int panfrost_ioctl_submit(struct drm_device *dev, void *data,
>>>    	job->requirements = args->requirements;
>>>    	job->flush_id = panfrost_gpu_get_latest_flush_id(pfdev);
>>>    	job->mmu = file_priv->mmu;
>>> +	job->engine_usage = &file_priv->engine_usage;
>>>    	slot = panfrost_job_get_slot(job);
>>> @@ -523,7 +525,55 @@ static const struct drm_ioctl_desc panfrost_drm_driver_ioctls[] = {
>>>    	PANFROST_IOCTL(MADVISE,		madvise,	DRM_RENDER_ALLOW),
>>>    };
>>> -DEFINE_DRM_GEM_FOPS(panfrost_drm_driver_fops);
>>> +
>>> +static void panfrost_gpu_show_fdinfo(struct panfrost_device *pfdev,
>>> +				     struct panfrost_file_priv *panfrost_priv,
>>> +				     struct drm_printer *p)
>>> +{
>>> +	int i;
>>> +
>>> +	/*
>>> +	 * IMPORTANT NOTE: drm-cycles and drm-engine measurements are not
>>> +	 * accurate, as they only provide a rough estimation of the number of
>>> +	 * GPU cycles and CPU time spent in a given context. This is due to two
>>> +	 * different factors:
>>> +	 * - Firstly, we must consider the time the CPU and then the kernel
>>> +	 *   takes to process the GPU interrupt, which means additional time and
>>> +	 *   GPU cycles will be added in excess to the real figure.
>>> +	 * - Secondly, the pipelining done by the Job Manager (2 job slots per
>>> +	 *   engine) implies there is no way to know exactly how much time each
>>> +	 *   job spent on the GPU.
>>> +	 */
>>> +
>>> +	static const char * const engine_names[] = {
>>> +		"fragment", "vertex-tiler", "compute-only"
>>> +	};
>>> +
>>> +	for (i = 0; i < NUM_JOB_SLOTS - 1; i++) {
>>
>> FWIW you could future proof this a bit by using "i < ARRAY_SIZE(engine_names)"
>> and avoid maybe silent out of bounds reads if someone updates NUM_JOB_SLOTS
>> and forgets about this loop. Or stick a warning of some sort.
>>
> NUM_JOB_SLOTS is actually the same as the number of engines in the device. I decided to follow
> this loop convention because that's what's being done across the driver when manipulating
> the engine queues, so I thought I'd stick to it for the sake of consistency. Bear in mind
> the loop doesn't pick up the compute-only engine because it's still not exposed to user space.
> 
> So NUM_JOB_SLOTS cannot change, unless a new engine were introduced, and then someone would
> have to update this array accordingly.

Exactly, and until they did, we'd have a silent out-of-bounds memory
access here, the contents of which even get shared with userspace. ;)

>>> +		drm_printf(p, "drm-engine-%s:\t%llu ns\n",
>>> +			   engine_names[i], panfrost_priv->engine_usage.elapsed_ns[i]);
>>> +		drm_printf(p, "drm-cycles-%s:\t%llu\n",
>>> +			   engine_names[i], panfrost_priv->engine_usage.cycles[i]);
>>> +		drm_printf(p, "drm-maxfreq-%s:\t%lu Hz\n",
>>> +			   engine_names[i], pfdev->pfdevfreq.fast_rate);
>>> +		drm_printf(p, "drm-curfreq-%s:\t%lu Hz\n",
>>> +			   engine_names[i], pfdev->pfdevfreq.current_frequency);
>>
>> I envisaged a link to driver specific docs at the bottom of
>> drm-usage-stats.rst so it would be nice if drivers would be adding those
>> sections and describing their private keys, engine names etc. ;)
>>
> Currently there's no panfrost.rst file under Documentation/gpu. I guess I'll create a new
> one and add the engine descriptions and meaning of drm-curfreq key.

Yeah I have to do the same for i915 in my memory stats series. :)

Regards,

Tvrtko


* Re: [PATCH v6 6/6] drm/drm-file: Show finer-grained BO sizes in drm_show_memory_stats
  2023-09-22 11:03       ` Adrián Larumbe
@ 2023-09-22 14:02         ` Tvrtko Ursulin
  0 siblings, 0 replies; 20+ messages in thread
From: Tvrtko Ursulin @ 2023-09-22 14:02 UTC (permalink / raw)
  To: Adrián Larumbe
  Cc: maarten.lankhorst, mripard, tzimmermann, airlied, daniel,
	robdclark, quic_abhinavk, dmitry.baryshkov, sean, marijn.suijten,
	robh, steven.price, linux-arm-msm, linux-kernel, dri-devel,
	healych, Boris Brezillon, kernel, freedreno


On 22/09/2023 12:03, Adrián Larumbe wrote:
> On 21.09.2023 11:14, Tvrtko Ursulin wrote:
>>
>> On 20/09/2023 16:32, Tvrtko Ursulin wrote:
>>>
>>> On 20/09/2023 00:34, Adrián Larumbe wrote:
>>>> The current implementation will try to pick the highest available size
>>>> display unit as soon as the BO size exceeds that of the previous
>>>> multiplier. That can lead to loss of precision in contexts of low memory
>>>> usage.
>>>>
>>>> The new selection criteria try to preserve precision, whilst also
>>>> increasing the display unit selection threshold to render more accurate
>>>> values.
>>>>
>>>> Signed-off-by: Adrián Larumbe <adrian.larumbe@collabora.com>
>>>> Reviewed-by: Boris Brezillon <boris.brezillon@collabora.com>
>>>> Reviewed-by: Steven Price <steven.price@arm.com>
>>>> ---
>>>>    drivers/gpu/drm/drm_file.c | 5 ++++-
>>>>    1 file changed, 4 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/drm_file.c b/drivers/gpu/drm/drm_file.c
>>>> index 762965e3d503..34cfa128ffe5 100644
>>>> --- a/drivers/gpu/drm/drm_file.c
>>>> +++ b/drivers/gpu/drm/drm_file.c
>>>> @@ -872,6 +872,8 @@ void drm_send_event(struct drm_device *dev, struct
>>>> drm_pending_event *e)
>>>>    }
>>>>    EXPORT_SYMBOL(drm_send_event);
>>>> +#define UPPER_UNIT_THRESHOLD 100
>>>> +
>>>>    static void print_size(struct drm_printer *p, const char *stat,
>>>>                   const char *region, u64 sz)
>>>>    {
>>>> @@ -879,7 +881,8 @@ static void print_size(struct drm_printer *p,
>>>> const char *stat,
>>>>        unsigned u;
>>>>        for (u = 0; u < ARRAY_SIZE(units) - 1; u++) {
>>>> -        if (sz < SZ_1K)
>>>> +        if ((sz & (SZ_1K - 1)) &&
>>>
>>> IS_ALIGNED worth it at all?
>>>
>>>> +            sz < UPPER_UNIT_THRESHOLD * SZ_1K)
>>>>                break;
>>>
>>> Excuse me for a late comment (I was away). I did not get what is
>>> special about a ~10% threshold? Sounds to me that just going with the
>>> lower unit, when size is not aligned to the higher one, would be
>>> better than sometimes-precision-sometimes-not.
>>
>> FWIW both the current behaviour and the threshold option make testing
>> the feature very annoying.
> 
> How so?

I have to build knowledge of the implementation details of print_size()
into my IGT test in order to use the right BO sizes, so the test is able
to verify that the stats move as expected. It just feels wrong.

>> So I'd really propose we simply use smaller unit when unaligned.
> 
> Like I said in the previous reply, for drm files whose overall BO size sum is enormous
> but not a multiple of a MiB, this would produce huge numbers expressed in KiB.
> I don't find that particularly comfortable to read, and this extra precision
> would mean nothing to nvtop or gputop, which have to scale the sizes to their
> available screen dimensions when plotting them anyway.

I don't think numbers in KiB are so huge.

And I don't think people will end up reading them manually a lot anyway,
since you have to hunt down the pid, fd, etc. first. It is much more
realistic that some tool like gputop will be used.

And I don't think consistency of units across drivers or whatever
matters. Even better to keep userspace parsers on their toes and make
them follow drm-usage-stats.rst, and not any particular implementation,
at some point in time.

Regards,

Tvrtko


* Re: [PATCH v6 2/6] drm/panfrost: Add fdinfo support GPU load metrics
  2023-09-22 13:53       ` Tvrtko Ursulin
@ 2023-09-22 15:23         ` Steven Price
  2023-09-25  8:57           ` Tvrtko Ursulin
  0 siblings, 1 reply; 20+ messages in thread
From: Steven Price @ 2023-09-22 15:23 UTC (permalink / raw)
  To: Tvrtko Ursulin, Adrián Larumbe
  Cc: maarten.lankhorst, mripard, tzimmermann, airlied, daniel,
	robdclark, quic_abhinavk, dmitry.baryshkov, sean, marijn.suijten,
	robh, linux-arm-msm, linux-kernel, dri-devel, healych,
	Boris Brezillon, kernel, freedreno

On 22/09/2023 14:53, Tvrtko Ursulin wrote:
> 
> On 22/09/2023 11:57, Adrián Larumbe wrote:
>> On 20.09.2023 16:40, Tvrtko Ursulin wrote:
>>> On 20/09/2023 00:34, Adrián Larumbe wrote:
>>>> The drm-stats fdinfo tags made available to user space are drm-engine,
>>>> drm-cycles, drm-max-freq and drm-curfreq, one per job slot.
>>>>
>>>> This deviates from standard practice in other DRM drivers, where a
>>>> single
>>>> set of key:value pairs is provided for the whole render engine.
>>>> However,
>>>> Panfrost has separate queues for fragment and vertex/tiler jobs, so a
>>>> decision was made to calculate bus cycles and workload times
>>>> separately.
>>>>
>>>> Maximum operating frequency is calculated at devfreq initialisation
>>>> time.
>>>> Current frequency is made available to user space because nvtop uses it
>>>> when performing engine usage calculations.
>>>>
>>>> It is important to bear in mind that both GPU cycle and kernel time
>>>> numbers
>>>> provided are at best rough estimations, and always reported in
>>>> excess from
>>>> the actual figure because of two reasons:
>>>>    - Excess time because of the delay between the end of a job
>>>> processing,
>>>>      the subsequent job IRQ and the actual time of the sample.
>>>>    - Time spent in the engine queue waiting for the GPU to pick up
>>>> the next
>>>>      job.
>>>>
>>>> To avoid race conditions during enablement/disabling, a reference
>>>> counting
>>>> mechanism was introduced, and a job flag that tells us whether a
>>>> given job
>>>> increased the refcount. This is necessary, because user space can
>>>> toggle
>>>> cycle counting through a debugfs file, and a given job might have
>>>> been in
>>>> flight by the time cycle counting was disabled.
>>>>
>>>> The main goal of the debugfs cycle counter knob is letting tools
>>>> like nvtop
>>>> or IGT's gputop switch it at any time, to avoid power waste in case no
>>>> engine usage measuring is necessary.
>>>>
>>>> Signed-off-by: Adrián Larumbe <adrian.larumbe@collabora.com>
>>>> Reviewed-by: Boris Brezillon <boris.brezillon@collabora.com>
>>>> Reviewed-by: Steven Price <steven.price@arm.com>
>>>> ---
>>>>    drivers/gpu/drm/panfrost/Makefile           |  2 +
>>>>    drivers/gpu/drm/panfrost/panfrost_debugfs.c | 20 ++++++++
>>>>    drivers/gpu/drm/panfrost/panfrost_debugfs.h | 13 +++++
>>>>    drivers/gpu/drm/panfrost/panfrost_devfreq.c |  8 +++
>>>>    drivers/gpu/drm/panfrost/panfrost_devfreq.h |  3 ++
>>>>    drivers/gpu/drm/panfrost/panfrost_device.c  |  2 +
>>>>    drivers/gpu/drm/panfrost/panfrost_device.h  | 13 +++++
>>>>    drivers/gpu/drm/panfrost/panfrost_drv.c     | 57
>>>> ++++++++++++++++++++-
>>>>    drivers/gpu/drm/panfrost/panfrost_gpu.c     | 41 +++++++++++++++
>>>>    drivers/gpu/drm/panfrost/panfrost_gpu.h     |  4 ++
>>>>    drivers/gpu/drm/panfrost/panfrost_job.c     | 24 +++++++++
>>>>    drivers/gpu/drm/panfrost/panfrost_job.h     |  5 ++
>>>>    12 files changed, 191 insertions(+), 1 deletion(-)
>>>>    create mode 100644 drivers/gpu/drm/panfrost/panfrost_debugfs.c
>>>>    create mode 100644 drivers/gpu/drm/panfrost/panfrost_debugfs.h
>>>>
>>>> diff --git a/drivers/gpu/drm/panfrost/Makefile
>>>> b/drivers/gpu/drm/panfrost/Makefile
>>>> index 7da2b3f02ed9..2c01c1e7523e 100644
>>>> --- a/drivers/gpu/drm/panfrost/Makefile
>>>> +++ b/drivers/gpu/drm/panfrost/Makefile
>>>> @@ -12,4 +12,6 @@ panfrost-y := \
>>>>        panfrost_perfcnt.o \
>>>>        panfrost_dump.o
>>>> +panfrost-$(CONFIG_DEBUG_FS) += panfrost_debugfs.o
>>>> +
>>>>    obj-$(CONFIG_DRM_PANFROST) += panfrost.o
>>>> diff --git a/drivers/gpu/drm/panfrost/panfrost_debugfs.c
>>>> b/drivers/gpu/drm/panfrost/panfrost_debugfs.c
>>>> new file mode 100644
>>>> index 000000000000..cc14eccba206
>>>> --- /dev/null
>>>> +++ b/drivers/gpu/drm/panfrost/panfrost_debugfs.c
>>>> @@ -0,0 +1,20 @@
>>>> +// SPDX-License-Identifier: GPL-2.0
>>>> +/* Copyright 2023 Collabora ltd. */
>>>> +
>>>> +#include <linux/debugfs.h>
>>>> +#include <linux/platform_device.h>
>>>> +#include <drm/drm_debugfs.h>
>>>> +#include <drm/drm_file.h>
>>>> +#include <drm/panfrost_drm.h>
>>>> +
>>>> +#include "panfrost_device.h"
>>>> +#include "panfrost_gpu.h"
>>>> +#include "panfrost_debugfs.h"
>>>> +
>>>> +void panfrost_debugfs_init(struct drm_minor *minor)
>>>> +{
>>>> +    struct drm_device *dev = minor->dev;
>>>> +    struct panfrost_device *pfdev =
>>>> platform_get_drvdata(to_platform_device(dev->dev));
>>>> +
>>>> +    debugfs_create_atomic_t("profile", 0600, minor->debugfs_root,
>>>> &pfdev->profile_mode);
>>>> +}
>>>> diff --git a/drivers/gpu/drm/panfrost/panfrost_debugfs.h
>>>> b/drivers/gpu/drm/panfrost/panfrost_debugfs.h
>>>> new file mode 100644
>>>> index 000000000000..db1c158bcf2f
>>>> --- /dev/null
>>>> +++ b/drivers/gpu/drm/panfrost/panfrost_debugfs.h
>>>> @@ -0,0 +1,13 @@
>>>> +/* SPDX-License-Identifier: GPL-2.0 */
>>>> +/*
>>>> + * Copyright 2023 Collabora ltd.
>>>> + */
>>>> +
>>>> +#ifndef PANFROST_DEBUGFS_H
>>>> +#define PANFROST_DEBUGFS_H
>>>> +
>>>> +#ifdef CONFIG_DEBUG_FS
>>>> +void panfrost_debugfs_init(struct drm_minor *minor);
>>>> +#endif
>>>> +
>>>> +#endif  /* PANFROST_DEBUGFS_H */
>>>> diff --git a/drivers/gpu/drm/panfrost/panfrost_devfreq.c
>>>> b/drivers/gpu/drm/panfrost/panfrost_devfreq.c
>>>> index 58dfb15a8757..28caffc689e2 100644
>>>> --- a/drivers/gpu/drm/panfrost/panfrost_devfreq.c
>>>> +++ b/drivers/gpu/drm/panfrost/panfrost_devfreq.c
>>>> @@ -58,6 +58,7 @@ static int panfrost_devfreq_get_dev_status(struct
>>>> device *dev,
>>>>        spin_lock_irqsave(&pfdevfreq->lock, irqflags);
>>>>        panfrost_devfreq_update_utilization(pfdevfreq);
>>>> +    pfdevfreq->current_frequency = status->current_frequency;
>>>>        status->total_time = ktime_to_ns(ktime_add(pfdevfreq->busy_time,
>>>>                               pfdevfreq->idle_time));
>>>> @@ -117,6 +118,7 @@ int panfrost_devfreq_init(struct panfrost_device
>>>> *pfdev)
>>>>        struct devfreq *devfreq;
>>>>        struct thermal_cooling_device *cooling;
>>>>        struct panfrost_devfreq *pfdevfreq = &pfdev->pfdevfreq;
>>>> +    unsigned long freq = ULONG_MAX;
>>>>        if (pfdev->comp->num_supplies > 1) {
>>>>            /*
>>>> @@ -172,6 +174,12 @@ int panfrost_devfreq_init(struct
>>>> panfrost_device *pfdev)
>>>>            return ret;
>>>>        }
>>>> +    /* Find the fastest defined rate  */
>>>> +    opp = dev_pm_opp_find_freq_floor(dev, &freq);
>>>> +    if (IS_ERR(opp))
>>>> +        return PTR_ERR(opp);
>>>> +    pfdevfreq->fast_rate = freq;
>>>> +
>>>>        dev_pm_opp_put(opp);
>>>>        /*
>>>> diff --git a/drivers/gpu/drm/panfrost/panfrost_devfreq.h
>>>> b/drivers/gpu/drm/panfrost/panfrost_devfreq.h
>>>> index 1514c1f9d91c..48dbe185f206 100644
>>>> --- a/drivers/gpu/drm/panfrost/panfrost_devfreq.h
>>>> +++ b/drivers/gpu/drm/panfrost/panfrost_devfreq.h
>>>> @@ -19,6 +19,9 @@ struct panfrost_devfreq {
>>>>        struct devfreq_simple_ondemand_data gov_data;
>>>>        bool opp_of_table_added;
>>>> +    unsigned long current_frequency;
>>>> +    unsigned long fast_rate;
>>>> +
>>>>        ktime_t busy_time;
>>>>        ktime_t idle_time;
>>>>        ktime_t time_last_update;
>>>> diff --git a/drivers/gpu/drm/panfrost/panfrost_device.c
>>>> b/drivers/gpu/drm/panfrost/panfrost_device.c
>>>> index fa1a086a862b..28f7046e1b1a 100644
>>>> --- a/drivers/gpu/drm/panfrost/panfrost_device.c
>>>> +++ b/drivers/gpu/drm/panfrost/panfrost_device.c
>>>> @@ -207,6 +207,8 @@ int panfrost_device_init(struct panfrost_device
>>>> *pfdev)
>>>>        spin_lock_init(&pfdev->as_lock);
>>>> +    spin_lock_init(&pfdev->cycle_counter.lock);
>>>> +
>>>>        err = panfrost_clk_init(pfdev);
>>>>        if (err) {
>>>>            dev_err(pfdev->dev, "clk init failed %d\n", err);
>>>> diff --git a/drivers/gpu/drm/panfrost/panfrost_device.h
>>>> b/drivers/gpu/drm/panfrost/panfrost_device.h
>>>> index b0126b9fbadc..1e85656dc2f7 100644
>>>> --- a/drivers/gpu/drm/panfrost/panfrost_device.h
>>>> +++ b/drivers/gpu/drm/panfrost/panfrost_device.h
>>>> @@ -107,6 +107,7 @@ struct panfrost_device {
>>>>        struct list_head scheduled_jobs;
>>>>        struct panfrost_perfcnt *perfcnt;
>>>> +    atomic_t profile_mode;
>>>>        struct mutex sched_lock;
>>>> @@ -121,6 +122,11 @@ struct panfrost_device {
>>>>        struct shrinker shrinker;
>>>>        struct panfrost_devfreq pfdevfreq;
>>>> +
>>>> +    struct {
>>>> +        atomic_t use_count;
>>>> +        spinlock_t lock;
>>>> +    } cycle_counter;
>>>>    };
>>>>    struct panfrost_mmu {
>>>> @@ -135,12 +141,19 @@ struct panfrost_mmu {
>>>>        struct list_head list;
>>>>    };
>>>> +struct panfrost_engine_usage {
>>>> +    unsigned long long elapsed_ns[NUM_JOB_SLOTS];
>>>> +    unsigned long long cycles[NUM_JOB_SLOTS];
>>>> +};
>>>> +
>>>>    struct panfrost_file_priv {
>>>>        struct panfrost_device *pfdev;
>>>>        struct drm_sched_entity sched_entity[NUM_JOB_SLOTS];
>>>>        struct panfrost_mmu *mmu;
>>>> +
>>>> +    struct panfrost_engine_usage engine_usage;
>>>>    };
>>>>    static inline struct panfrost_device *to_panfrost_device(struct
>>>> drm_device *ddev)
>>>> diff --git a/drivers/gpu/drm/panfrost/panfrost_drv.c
>>>> b/drivers/gpu/drm/panfrost/panfrost_drv.c
>>>> index a2ab99698ca8..3c93a11deab1 100644
>>>> --- a/drivers/gpu/drm/panfrost/panfrost_drv.c
>>>> +++ b/drivers/gpu/drm/panfrost/panfrost_drv.c
>>>> @@ -20,6 +20,7 @@
>>>>    #include "panfrost_job.h"
>>>>    #include "panfrost_gpu.h"
>>>>    #include "panfrost_perfcnt.h"
>>>> +#include "panfrost_debugfs.h"
>>>>    static bool unstable_ioctls;
>>>>    module_param_unsafe(unstable_ioctls, bool, 0600);
>>>> @@ -267,6 +268,7 @@ static int panfrost_ioctl_submit(struct
>>>> drm_device *dev, void *data,
>>>>        job->requirements = args->requirements;
>>>>        job->flush_id = panfrost_gpu_get_latest_flush_id(pfdev);
>>>>        job->mmu = file_priv->mmu;
>>>> +    job->engine_usage = &file_priv->engine_usage;
>>>>        slot = panfrost_job_get_slot(job);
>>>> @@ -523,7 +525,55 @@ static const struct drm_ioctl_desc
>>>> panfrost_drm_driver_ioctls[] = {
>>>>        PANFROST_IOCTL(MADVISE,        madvise,    DRM_RENDER_ALLOW),
>>>>    };
>>>> -DEFINE_DRM_GEM_FOPS(panfrost_drm_driver_fops);
>>>> +
>>>> +static void panfrost_gpu_show_fdinfo(struct panfrost_device *pfdev,
>>>> +                     struct panfrost_file_priv *panfrost_priv,
>>>> +                     struct drm_printer *p)
>>>> +{
>>>> +    int i;
>>>> +
>>>> +    /*
>>>> +     * IMPORTANT NOTE: drm-cycles and drm-engine measurements are not
>>>> +     * accurate, as they only provide a rough estimation of the
>>>> number of
>>>> +     * GPU cycles and CPU time spent in a given context. This is
>>>> due to two
>>>> +     * different factors:
>>>> +     * - Firstly, we must consider the time the CPU and then the
>>>> kernel
>>>> +     *   takes to process the GPU interrupt, which means additional
>>>> time and
>>>> +     *   GPU cycles will be added in excess to the real figure.
>>>> +     * - Secondly, the pipelining done by the Job Manager (2 job
>>>> slots per
>>>> +     *   engine) implies there is no way to know exactly how much
>>>> time each
>>>> +     *   job spent on the GPU.
>>>> +     */
>>>> +
>>>> +    static const char * const engine_names[] = {
>>>> +        "fragment", "vertex-tiler", "compute-only"
>>>> +    };
>>>> +
>>>> +    for (i = 0; i < NUM_JOB_SLOTS - 1; i++) {
>>>
>>> FWIW you could future proof this a bit by using "i <
>>> ARRAY_SIZE(engine_names)"
>>> and avoid maybe silent out of bounds reads if someone updates
>>> NUM_JOB_SLOTS
>>> and forgets about this loop. Or stick a warning of some sort.
>>>
>> NUM_JOB_SLOTS is actually the same as the number of engines in the
>> device. I decided to follow
>> this loop convention because that's what's being done across the
>> driver when manipulating
>> the engine queues, so I thought I'd stick to it for the sake of
>> consistency. Bear in mind
>> the loop doesn't pick up the compute-only engine because it's still
>> not exposed to user space.
>>
>> So NUM_JOB_SLOTS cannot change, unless a new engine were introduced,
>> and then someone would
>> have to update this array accordingly.
> 
> Exactly, and until they did, we'd have a silent out-of-bounds memory
> access here, the contents of which even get shared with userspace. ;)

I think using NUM_JOB_SLOTS here seems sensible (as Adrián points out
it's consistent with the rest of the driver). But a BUILD_BUG_ON
checking the array size could make sense.

In reality I don't see the number of job slots ever changing - panfrost
is now for the 'old' architecture (panthor being the new driver for the
later 'CSF' architecture). And even if there were a new design for
pre-CSF, it would be a very big change to the architecture: we've kept
the 3 slots all the way through even though the 3rd is never used on
most GPUs. But equally I've been wrong before ;)

Steve

>>>> +        drm_printf(p, "drm-engine-%s:\t%llu ns\n",
>>>> +               engine_names[i],
>>>> panfrost_priv->engine_usage.elapsed_ns[i]);
>>>> +        drm_printf(p, "drm-cycles-%s:\t%llu\n",
>>>> +               engine_names[i],
>>>> panfrost_priv->engine_usage.cycles[i]);
>>>> +        drm_printf(p, "drm-maxfreq-%s:\t%lu Hz\n",
>>>> +               engine_names[i], pfdev->pfdevfreq.fast_rate);
>>>> +        drm_printf(p, "drm-curfreq-%s:\t%lu Hz\n",
>>>> +               engine_names[i], pfdev->pfdevfreq.current_frequency);
>>>
>>> I envisaged a link to driver specific docs at the bottom of
>>> drm-usage-stats.rst so it would be nice if drivers would be adding those
>>> sections and describing their private keys, engine names etc. ;)
>>>
>> Currently there's no panfrost.rst file under Documentation/gpu. I
>> guess I'll create a new
>> one and add the engine descriptions and meaning of drm-curfreq key.
> 
> Yeah I have to do the same for i915 in my memory stats series. :)
> 
> Regards,
> 
> Tvrtko



* Re: [PATCH v6 2/6] drm/panfrost: Add fdinfo support GPU load metrics
  2023-09-22 15:23         ` Steven Price
@ 2023-09-25  8:57           ` Tvrtko Ursulin
  0 siblings, 0 replies; 20+ messages in thread
From: Tvrtko Ursulin @ 2023-09-25  8:57 UTC (permalink / raw)
  To: Steven Price, Adrián Larumbe
  Cc: maarten.lankhorst, mripard, tzimmermann, airlied, daniel,
	robdclark, quic_abhinavk, dmitry.baryshkov, sean, marijn.suijten,
	robh, linux-arm-msm, linux-kernel, dri-devel, healych,
	Boris Brezillon, kernel, freedreno


On 22/09/2023 16:23, Steven Price wrote:
> On 22/09/2023 14:53, Tvrtko Ursulin wrote:
>>
>> On 22/09/2023 11:57, Adrián Larumbe wrote:
>>> On 20.09.2023 16:40, Tvrtko Ursulin wrote:
>>>> On 20/09/2023 00:34, Adrián Larumbe wrote:
>>>>> The drm-stats fdinfo tags made available to user space are drm-engine,
>>>>> drm-cycles, drm-max-freq and drm-curfreq, one per job slot.
>>>>>
>>>>> This deviates from standard practice in other DRM drivers, where a
>>>>> single
>>>>> set of key:value pairs is provided for the whole render engine.
>>>>> However,
>>>>> Panfrost has separate queues for fragment and vertex/tiler jobs, so a
>>>>> decision was made to calculate bus cycles and workload times
>>>>> separately.
>>>>>
>>>>> Maximum operating frequency is calculated at devfreq initialisation
>>>>> time.
>>>>> Current frequency is made available to user space because nvtop uses it
>>>>> when performing engine usage calculations.
>>>>>
>>>>> It is important to bear in mind that both GPU cycle and kernel time
>>>>> numbers
>>>>> provided are at best rough estimates, and are always reported in
>>>>> excess of
>>>>> the actual figure, for two reasons:
>>>>>     - Excess time because of the delay between the end of a job
>>>>> processing,
>>>>>       the subsequent job IRQ and the actual time of the sample.
>>>>>     - Time spent in the engine queue waiting for the GPU to pick up
>>>>> the next
>>>>>       job.
>>>>>
>>>>> To avoid race conditions during enablement/disabling, a reference
>>>>> counting
>>>>> mechanism was introduced, and a job flag that tells us whether a
>>>>> given job
>>>>> increased the refcount. This is necessary, because user space can
>>>>> toggle
>>>>> cycle counting through a debugfs file, and a given job might have
>>>>> been in
>>>>> flight by the time cycle counting was disabled.
>>>>>
>>>>> The main goal of the debugfs cycle counter knob is letting tools
>>>>> like nvtop
>>>>> or IGT's gputop switch it at any time, to avoid power waste in case no
>>>>> engine usage measuring is necessary.
>>>>>
>>>>> Signed-off-by: Adrián Larumbe <adrian.larumbe@collabora.com>
>>>>> Reviewed-by: Boris Brezillon <boris.brezillon@collabora.com>
>>>>> Reviewed-by: Steven Price <steven.price@arm.com>
>>>>> ---
>>>>>     drivers/gpu/drm/panfrost/Makefile           |  2 +
>>>>>     drivers/gpu/drm/panfrost/panfrost_debugfs.c | 20 ++++++++
>>>>>     drivers/gpu/drm/panfrost/panfrost_debugfs.h | 13 +++++
>>>>>     drivers/gpu/drm/panfrost/panfrost_devfreq.c |  8 +++
>>>>>     drivers/gpu/drm/panfrost/panfrost_devfreq.h |  3 ++
>>>>>     drivers/gpu/drm/panfrost/panfrost_device.c  |  2 +
>>>>>     drivers/gpu/drm/panfrost/panfrost_device.h  | 13 +++++
>>>>>     drivers/gpu/drm/panfrost/panfrost_drv.c     | 57
>>>>> ++++++++++++++++++++-
>>>>>     drivers/gpu/drm/panfrost/panfrost_gpu.c     | 41 +++++++++++++++
>>>>>     drivers/gpu/drm/panfrost/panfrost_gpu.h     |  4 ++
>>>>>     drivers/gpu/drm/panfrost/panfrost_job.c     | 24 +++++++++
>>>>>     drivers/gpu/drm/panfrost/panfrost_job.h     |  5 ++
>>>>>     12 files changed, 191 insertions(+), 1 deletion(-)
>>>>>     create mode 100644 drivers/gpu/drm/panfrost/panfrost_debugfs.c
>>>>>     create mode 100644 drivers/gpu/drm/panfrost/panfrost_debugfs.h
>>>>>
>>>>> diff --git a/drivers/gpu/drm/panfrost/Makefile
>>>>> b/drivers/gpu/drm/panfrost/Makefile
>>>>> index 7da2b3f02ed9..2c01c1e7523e 100644
>>>>> --- a/drivers/gpu/drm/panfrost/Makefile
>>>>> +++ b/drivers/gpu/drm/panfrost/Makefile
>>>>> @@ -12,4 +12,6 @@ panfrost-y := \
>>>>>         panfrost_perfcnt.o \
>>>>>         panfrost_dump.o
>>>>> +panfrost-$(CONFIG_DEBUG_FS) += panfrost_debugfs.o
>>>>> +
>>>>>     obj-$(CONFIG_DRM_PANFROST) += panfrost.o
>>>>> diff --git a/drivers/gpu/drm/panfrost/panfrost_debugfs.c
>>>>> b/drivers/gpu/drm/panfrost/panfrost_debugfs.c
>>>>> new file mode 100644
>>>>> index 000000000000..cc14eccba206
>>>>> --- /dev/null
>>>>> +++ b/drivers/gpu/drm/panfrost/panfrost_debugfs.c
>>>>> @@ -0,0 +1,20 @@
>>>>> +// SPDX-License-Identifier: GPL-2.0
>>>>> +/* Copyright 2023 Collabora ltd. */
>>>>> +
>>>>> +#include <linux/debugfs.h>
>>>>> +#include <linux/platform_device.h>
>>>>> +#include <drm/drm_debugfs.h>
>>>>> +#include <drm/drm_file.h>
>>>>> +#include <drm/panfrost_drm.h>
>>>>> +
>>>>> +#include "panfrost_device.h"
>>>>> +#include "panfrost_gpu.h"
>>>>> +#include "panfrost_debugfs.h"
>>>>> +
>>>>> +void panfrost_debugfs_init(struct drm_minor *minor)
>>>>> +{
>>>>> +    struct drm_device *dev = minor->dev;
>>>>> +    struct panfrost_device *pfdev =
>>>>> platform_get_drvdata(to_platform_device(dev->dev));
>>>>> +
>>>>> +    debugfs_create_atomic_t("profile", 0600, minor->debugfs_root,
>>>>> &pfdev->profile_mode);
>>>>> +}
>>>>> diff --git a/drivers/gpu/drm/panfrost/panfrost_debugfs.h
>>>>> b/drivers/gpu/drm/panfrost/panfrost_debugfs.h
>>>>> new file mode 100644
>>>>> index 000000000000..db1c158bcf2f
>>>>> --- /dev/null
>>>>> +++ b/drivers/gpu/drm/panfrost/panfrost_debugfs.h
>>>>> @@ -0,0 +1,13 @@
>>>>> +/* SPDX-License-Identifier: GPL-2.0 */
>>>>> +/*
>>>>> + * Copyright 2023 Collabora ltd.
>>>>> + */
>>>>> +
>>>>> +#ifndef PANFROST_DEBUGFS_H
>>>>> +#define PANFROST_DEBUGFS_H
>>>>> +
>>>>> +#ifdef CONFIG_DEBUG_FS
>>>>> +void panfrost_debugfs_init(struct drm_minor *minor);
>>>>> +#endif
>>>>> +
>>>>> +#endif  /* PANFROST_DEBUGFS_H */
>>>>> diff --git a/drivers/gpu/drm/panfrost/panfrost_devfreq.c
>>>>> b/drivers/gpu/drm/panfrost/panfrost_devfreq.c
>>>>> index 58dfb15a8757..28caffc689e2 100644
>>>>> --- a/drivers/gpu/drm/panfrost/panfrost_devfreq.c
>>>>> +++ b/drivers/gpu/drm/panfrost/panfrost_devfreq.c
>>>>> @@ -58,6 +58,7 @@ static int panfrost_devfreq_get_dev_status(struct
>>>>> device *dev,
>>>>>         spin_lock_irqsave(&pfdevfreq->lock, irqflags);
>>>>>         panfrost_devfreq_update_utilization(pfdevfreq);
>>>>> +    pfdevfreq->current_frequency = status->current_frequency;
>>>>>         status->total_time = ktime_to_ns(ktime_add(pfdevfreq->busy_time,
>>>>>                                pfdevfreq->idle_time));
>>>>> @@ -117,6 +118,7 @@ int panfrost_devfreq_init(struct panfrost_device
>>>>> *pfdev)
>>>>>         struct devfreq *devfreq;
>>>>>         struct thermal_cooling_device *cooling;
>>>>>         struct panfrost_devfreq *pfdevfreq = &pfdev->pfdevfreq;
>>>>> +    unsigned long freq = ULONG_MAX;
>>>>>         if (pfdev->comp->num_supplies > 1) {
>>>>>             /*
>>>>> @@ -172,6 +174,12 @@ int panfrost_devfreq_init(struct
>>>>> panfrost_device *pfdev)
>>>>>             return ret;
>>>>>         }
>>>>> +    /* Find the fastest defined rate  */
>>>>> +    opp = dev_pm_opp_find_freq_floor(dev, &freq);
>>>>> +    if (IS_ERR(opp))
>>>>> +        return PTR_ERR(opp);
>>>>> +    pfdevfreq->fast_rate = freq;
>>>>> +
>>>>>         dev_pm_opp_put(opp);
>>>>>         /*
>>>>> diff --git a/drivers/gpu/drm/panfrost/panfrost_devfreq.h
>>>>> b/drivers/gpu/drm/panfrost/panfrost_devfreq.h
>>>>> index 1514c1f9d91c..48dbe185f206 100644
>>>>> --- a/drivers/gpu/drm/panfrost/panfrost_devfreq.h
>>>>> +++ b/drivers/gpu/drm/panfrost/panfrost_devfreq.h
>>>>> @@ -19,6 +19,9 @@ struct panfrost_devfreq {
>>>>>         struct devfreq_simple_ondemand_data gov_data;
>>>>>         bool opp_of_table_added;
>>>>> +    unsigned long current_frequency;
>>>>> +    unsigned long fast_rate;
>>>>> +
>>>>>         ktime_t busy_time;
>>>>>         ktime_t idle_time;
>>>>>         ktime_t time_last_update;
>>>>> diff --git a/drivers/gpu/drm/panfrost/panfrost_device.c
>>>>> b/drivers/gpu/drm/panfrost/panfrost_device.c
>>>>> index fa1a086a862b..28f7046e1b1a 100644
>>>>> --- a/drivers/gpu/drm/panfrost/panfrost_device.c
>>>>> +++ b/drivers/gpu/drm/panfrost/panfrost_device.c
>>>>> @@ -207,6 +207,8 @@ int panfrost_device_init(struct panfrost_device
>>>>> *pfdev)
>>>>>         spin_lock_init(&pfdev->as_lock);
>>>>> +    spin_lock_init(&pfdev->cycle_counter.lock);
>>>>> +
>>>>>         err = panfrost_clk_init(pfdev);
>>>>>         if (err) {
>>>>>             dev_err(pfdev->dev, "clk init failed %d\n", err);
>>>>> diff --git a/drivers/gpu/drm/panfrost/panfrost_device.h
>>>>> b/drivers/gpu/drm/panfrost/panfrost_device.h
>>>>> index b0126b9fbadc..1e85656dc2f7 100644
>>>>> --- a/drivers/gpu/drm/panfrost/panfrost_device.h
>>>>> +++ b/drivers/gpu/drm/panfrost/panfrost_device.h
>>>>> @@ -107,6 +107,7 @@ struct panfrost_device {
>>>>>         struct list_head scheduled_jobs;
>>>>>         struct panfrost_perfcnt *perfcnt;
>>>>> +    atomic_t profile_mode;
>>>>>         struct mutex sched_lock;
>>>>> @@ -121,6 +122,11 @@ struct panfrost_device {
>>>>>         struct shrinker shrinker;
>>>>>         struct panfrost_devfreq pfdevfreq;
>>>>> +
>>>>> +    struct {
>>>>> +        atomic_t use_count;
>>>>> +        spinlock_t lock;
>>>>> +    } cycle_counter;
>>>>>     };
>>>>>     struct panfrost_mmu {
>>>>> @@ -135,12 +141,19 @@ struct panfrost_mmu {
>>>>>         struct list_head list;
>>>>>     };
>>>>> +struct panfrost_engine_usage {
>>>>> +    unsigned long long elapsed_ns[NUM_JOB_SLOTS];
>>>>> +    unsigned long long cycles[NUM_JOB_SLOTS];
>>>>> +};
>>>>> +
>>>>>     struct panfrost_file_priv {
>>>>>         struct panfrost_device *pfdev;
>>>>>         struct drm_sched_entity sched_entity[NUM_JOB_SLOTS];
>>>>>         struct panfrost_mmu *mmu;
>>>>> +
>>>>> +    struct panfrost_engine_usage engine_usage;
>>>>>     };
>>>>>     static inline struct panfrost_device *to_panfrost_device(struct
>>>>> drm_device *ddev)
>>>>> diff --git a/drivers/gpu/drm/panfrost/panfrost_drv.c
>>>>> b/drivers/gpu/drm/panfrost/panfrost_drv.c
>>>>> index a2ab99698ca8..3c93a11deab1 100644
>>>>> --- a/drivers/gpu/drm/panfrost/panfrost_drv.c
>>>>> +++ b/drivers/gpu/drm/panfrost/panfrost_drv.c
>>>>> @@ -20,6 +20,7 @@
>>>>>     #include "panfrost_job.h"
>>>>>     #include "panfrost_gpu.h"
>>>>>     #include "panfrost_perfcnt.h"
>>>>> +#include "panfrost_debugfs.h"
>>>>>     static bool unstable_ioctls;
>>>>>     module_param_unsafe(unstable_ioctls, bool, 0600);
>>>>> @@ -267,6 +268,7 @@ static int panfrost_ioctl_submit(struct
>>>>> drm_device *dev, void *data,
>>>>>         job->requirements = args->requirements;
>>>>>         job->flush_id = panfrost_gpu_get_latest_flush_id(pfdev);
>>>>>         job->mmu = file_priv->mmu;
>>>>> +    job->engine_usage = &file_priv->engine_usage;
>>>>>         slot = panfrost_job_get_slot(job);
>>>>> @@ -523,7 +525,55 @@ static const struct drm_ioctl_desc
>>>>> panfrost_drm_driver_ioctls[] = {
>>>>>         PANFROST_IOCTL(MADVISE,        madvise,    DRM_RENDER_ALLOW),
>>>>>     };
>>>>> -DEFINE_DRM_GEM_FOPS(panfrost_drm_driver_fops);
>>>>> +
>>>>> +static void panfrost_gpu_show_fdinfo(struct panfrost_device *pfdev,
>>>>> +                     struct panfrost_file_priv *panfrost_priv,
>>>>> +                     struct drm_printer *p)
>>>>> +{
>>>>> +    int i;
>>>>> +
>>>>> +    /*
>>>>> +     * IMPORTANT NOTE: drm-cycles and drm-engine measurements are not
>>>>> +     * accurate, as they only provide a rough estimation of the
>>>>> number of
>>>>> +     * GPU cycles and CPU time spent in a given context. This is
>>>>> due to two
>>>>> +     * different factors:
>>>>> +     * - Firstly, we must consider the time the CPU and then the
>>>>> kernel
>>>>> +     *   takes to process the GPU interrupt, which means additional
>>>>> time and
>>>>> +     *   GPU cycles will be added in excess to the real figure.
>>>>> +     * - Secondly, the pipelining done by the Job Manager (2 job
>>>>> slots per
>>>>> +     *   engine) implies there is no way to know exactly how much
>>>>> time each
>>>>> +     *   job spent on the GPU.
>>>>> +     */
>>>>> +
>>>>> +    static const char * const engine_names[] = {
>>>>> +        "fragment", "vertex-tiler", "compute-only"
>>>>> +    };
>>>>> +
>>>>> +    for (i = 0; i < NUM_JOB_SLOTS - 1; i++) {
>>>>
>>>> FWIW you could future proof this a bit by using "i <
>>>> ARRAY_SIZE(engine_names)"
>>>> and avoid maybe silent out of bounds reads if someone updates
>>>> NUM_JOB_SLOTS
>>>> and forgets about this loop. Or stick a warning of some sort.
>>>>
>>> NUM_JOB_SLOTS is actually the same as the number of engines in the
>>> device. I decided to follow
>>> this loop convention because that's what's being done across the
>>> driver when manipulating
>>> the engine queues, so I thought I'd stick to it for the sake of
>>> consistency. Bear in mind
>>> the loop doesn't pick up the compute-only engine because it's still
>>> not exposed to user space.
>>>
>>> So NUM_JOB_SLOTS cannot change, unless a new engine were introduced,
>>> and then someone would
>>> have to update this array accordingly.
>>
>> Exactly, and until they did, we'd have a silent out-of-bounds
>> memory access, the contents of which even get shared with userspace. ;)
> 
> I think using NUM_JOB_SLOTS here seems sensible (as Adrián points out
> it's consistent with the rest of the driver). But a BUILD_BUG_ON
> checking the array size could make sense.
> 
> In reality I don't see the number of job slots ever changing - panfrost
> is now for the 'old' architecture (panthor being the new driver for
> later 'CSF' architecture). And even if there was a new design for
> pre-CSF - it would be a very big change to the architecture: we've kept
> the 3 slots all the way through even though the 3rd is never used on
> most GPUs. But equally I've been wrong before ;)

Thanks for this explanation - with that there indeed isn't much need to 
robustify it.

Regards,

Tvrtko
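[Editorial note: the BUILD_BUG_ON guard Steve floated could look roughly like the following user-space sketch; NUM_JOB_SLOTS and engine_names mirror the patch, and a C11 static assertion stands in for the kernel's BUILD_BUG_ON macro:]

```c
/* User-space sketch of the compile-time guard discussed above; a C11
 * static assertion stands in for the kernel's BUILD_BUG_ON macro. */
#define NUM_JOB_SLOTS 3

static const char * const engine_names[] = {
	"fragment", "vertex-tiler", "compute-only"
};

/* Fails the build if NUM_JOB_SLOTS is ever bumped without extending
 * engine_names, which is the silent out-of-bounds hazard raised above. */
_Static_assert(sizeof(engine_names) / sizeof(engine_names[0]) == NUM_JOB_SLOTS,
	       "engine_names must cover every job slot");

/* The fdinfo loop iterates NUM_JOB_SLOTS - 1 slots, because the
 * compute-only engine is not exposed to user space. */
static int exposed_engine_count(void)
{
	return NUM_JOB_SLOTS - 1;
}
```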

> 
> Steve
> 
>>>>> +        drm_printf(p, "drm-engine-%s:\t%llu ns\n",
>>>>> +               engine_names[i],
>>>>> panfrost_priv->engine_usage.elapsed_ns[i]);
>>>>> +        drm_printf(p, "drm-cycles-%s:\t%llu\n",
>>>>> +               engine_names[i],
>>>>> panfrost_priv->engine_usage.cycles[i]);
>>>>> +        drm_printf(p, "drm-maxfreq-%s:\t%lu Hz\n",
>>>>> +               engine_names[i], pfdev->pfdevfreq.fast_rate);
>>>>> +        drm_printf(p, "drm-curfreq-%s:\t%lu Hz\n",
>>>>> +               engine_names[i], pfdev->pfdevfreq.current_frequency);
>>>>
>>>> I envisaged a link to driver specific docs at the bottom of
>>>> drm-usage-stats.rst so it would be nice if drivers would be adding those
>>>> sections and describing their private keys, engine names etc. ;)
>>>>
>>> Currently there's no panfrost.rst file under Documentation/gpu. I
>>> guess I'll create a new
>>> one and add the engine descriptions and meaning of drm-curfreq key.
>>
>> Yeah I have to do the same for i915 in my memory stats series. :)
>>
>> Regards,
>>
>> Tvrtko
> 


* Re: [PATCH v6 4/6] drm/drm_file: Add DRM obj's RSS reporting function for fdinfo
  2023-09-22 10:58     ` Adrián Larumbe
@ 2023-09-27 14:36       ` Tvrtko Ursulin
  0 siblings, 0 replies; 20+ messages in thread
From: Tvrtko Ursulin @ 2023-09-27 14:36 UTC (permalink / raw)
  To: Adrián Larumbe
  Cc: maarten.lankhorst, mripard, tzimmermann, airlied, daniel,
	robdclark, quic_abhinavk, dmitry.baryshkov, sean, marijn.suijten,
	robh, steven.price, linux-arm-msm, linux-kernel, dri-devel,
	healych, Boris Brezillon, kernel, freedreno


On 22/09/2023 11:58, Adrián Larumbe wrote:
> On 20.09.2023 16:53, Tvrtko Ursulin wrote:
>>
>> On 20/09/2023 00:34, Adrián Larumbe wrote:
>>> Some BO's might be mapped onto physical memory chunkwise and on demand,
>>> like Panfrost's tiler heap. In this case, even though the
>>> drm_gem_shmem_object page array might already be allocated, only a very
>>> small fraction of the BO is currently backed by system memory, but
>>> drm_show_memory_stats will then proceed to add its entire virtual size to
>>> the file's total resident size regardless.
>>>
>>> This led to very unrealistic RSS sizes being reckoned for Panfrost, where
>>> said tiler heap buffer is initially allocated with a virtual size of 128
>>> MiB, but only a small part of it will eventually be backed by system memory
>>> after successive GPU page faults.
>>>
>>> Provide a new DRM object generic function that would allow drivers to
>>> return a more accurate RSS size for their BOs.
>>>
>>> Signed-off-by: Adrián Larumbe <adrian.larumbe@collabora.com>
>>> Reviewed-by: Boris Brezillon <boris.brezillon@collabora.com>
>>> Reviewed-by: Steven Price <steven.price@arm.com>
>>> ---
>>>    drivers/gpu/drm/drm_file.c | 5 ++++-
>>>    include/drm/drm_gem.h      | 9 +++++++++
>>>    2 files changed, 13 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/gpu/drm/drm_file.c b/drivers/gpu/drm/drm_file.c
>>> index 883d83bc0e3d..762965e3d503 100644
>>> --- a/drivers/gpu/drm/drm_file.c
>>> +++ b/drivers/gpu/drm/drm_file.c
>>> @@ -944,7 +944,10 @@ void drm_show_memory_stats(struct drm_printer *p, struct drm_file *file)
>>>    		}
>>>    		if (s & DRM_GEM_OBJECT_RESIDENT) {
>>> -			status.resident += obj->size;
>>> +			if (obj->funcs && obj->funcs->rss)
>>> +				status.resident += obj->funcs->rss(obj);
>>> +			else
>>> +				status.resident += obj->size;
>>
>> Presumably you'd want the same smaller size in both active and purgeable? Or
>> you can end up with more in those two than in rss which would look odd.
> 
> I didn't think of this. I guess when an object is both resident and purgeable,
> then its RSS and purgeable sizes should be the same.
> 
>> Also, alternative to adding a new callback could be adding multiple output
>> parameters to the existing obj->func->status() which maybe ends up simpler due
>> fewer callbacks?
>>
>> Like:
>>
>> s = obj->funcs->status(obj, &supported_status, &rss)
>>
>> And adjust the code flow to pick up the rss if driver signaled it supports
>> reporting it.
> 
> I personally find having a separate object callback more readable in this case.
> There's also the question of what output parameter value would signal
> that the relevant BO doesn't have an RSS different from its virtual
> size. I guess '0' would be alright, but this is on the assumption that this
> could never be a legitimate BO virtual size across all DRM drivers. I guess
> most of them round the size up to the nearest page multiple at BO creation
> time.

Okay. See how it will look once you need to apply it to resident and 
purgeable. I wonder if "driver knows better" will end up a dominant case 
and we do end up considering reversing the scheme (like ask the driver 
to fill in the meminfo record). TBH I do not remember all the flavours 
both Rob and I proposed at this point.
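[Editorial note: a toy model of the accounting concern, with invented names rather than the actual drm_file.c structures — whatever footprint the driver reports should feed both the resident and purgeable buckets, so purgeable can never exceed resident. It also uses the "0 means not reported" convention mentioned above:]

```c
#include <stddef.h>

/* Toy model with invented names; not the real DRM structures. A BO
 * carries its virtual size and an optional driver-reported RSS,
 * where 0 means "driver did not report one". */
struct mock_obj {
	size_t size;       /* virtual BO size */
	size_t rss;        /* driver-reported resident size, 0 = unset */
	int resident;
	int purgeable;
};

struct mock_stats {
	size_t resident;
	size_t purgeable;
};

static size_t mock_footprint(const struct mock_obj *obj)
{
	return obj->rss ? obj->rss : obj->size;
}

static void mock_account(struct mock_stats *st, const struct mock_obj *obj)
{
	if (obj->resident) {
		size_t sz = mock_footprint(obj);

		st->resident += sz;
		/* Use the same figure for purgeable, so a partially
		 * backed heap BO cannot make purgeable exceed resident. */
		if (obj->purgeable)
			st->purgeable += sz;
	}
}
```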

Regards,

Tvrtko

> 
>>
>> Regards,
>>
>> Tvrtko
>>
>>>    		} else {
>>>    			/* If already purged or not yet backed by pages, don't
>>>    			 * count it as purgeable:
>>> diff --git a/include/drm/drm_gem.h b/include/drm/drm_gem.h
>>> index bc9f6aa2f3fe..16364487fde9 100644
>>> --- a/include/drm/drm_gem.h
>>> +++ b/include/drm/drm_gem.h
>>> @@ -208,6 +208,15 @@ struct drm_gem_object_funcs {
>>>    	 */
>>>    	enum drm_gem_object_status (*status)(struct drm_gem_object *obj);
>>> +	/**
>>> +	 * @rss:
>>> +	 *
>>> +	 * Return resident size of the object in physical memory.
>>> +	 *
>>> +	 * Called by drm_show_memory_stats().
>>> +	 */
>>> +	size_t (*rss)(struct drm_gem_object *obj);
>>> +
>>>    	/**
>>>    	 * @vm_ops:
>>>    	 *
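[Editorial note: on the driver side, an implementation of this hook for an on-demand heap could report only the pages actually faulted in. A rough sketch with invented field names — the real Panfrost change lives in patch 5/6:]

```c
#include <stddef.h>

#define MOCK_PAGE_SIZE 4096u

/* Invented stand-in for a chunked, on-demand heap BO: a large virtual
 * reservation of which only mapped_pages are backed by memory. */
struct mock_heap_bo {
	size_t virtual_size;   /* e.g. 128 MiB reserved up front */
	size_t mapped_pages;   /* grows with each GPU page fault */
};

/* The rss() hook would return the backed portion, not virtual_size. */
static size_t mock_heap_bo_rss(const struct mock_heap_bo *bo)
{
	return bo->mapped_pages * MOCK_PAGE_SIZE;
}
```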


end of thread, other threads:[~2023-09-27 14:37 UTC | newest]

Thread overview: 20+ messages
-- links below jump to the message on this page --
2023-09-19 23:34 [PATCH v6 0/6] Add fdinfo support to Panfrost Adrián Larumbe
2023-09-19 23:34 ` [PATCH v6 1/6] drm/panfrost: Add cycle count GPU register definitions Adrián Larumbe
2023-09-19 23:34 ` [PATCH v6 2/6] drm/panfrost: Add fdinfo support GPU load metrics Adrián Larumbe
2023-09-20 15:40   ` Tvrtko Ursulin
2023-09-22 10:57     ` Adrián Larumbe
2023-09-22 13:53       ` Tvrtko Ursulin
2023-09-22 15:23         ` Steven Price
2023-09-25  8:57           ` Tvrtko Ursulin
2023-09-19 23:34 ` [PATCH v6 3/6] drm/panfrost: Add fdinfo support for memory stats Adrián Larumbe
2023-09-19 23:34 ` [PATCH v6 4/6] drm/drm_file: Add DRM obj's RSS reporting function for fdinfo Adrián Larumbe
2023-09-20 15:53   ` Tvrtko Ursulin
2023-09-22 10:58     ` Adrián Larumbe
2023-09-27 14:36       ` Tvrtko Ursulin
2023-09-19 23:34 ` [PATCH v6 5/6] drm/panfrost: Implement generic DRM object RSS reporting function Adrián Larumbe
2023-09-19 23:34 ` [PATCH v6 6/6] drm/drm-file: Show finer-grained BO sizes in drm_show_memory_stats Adrián Larumbe
2023-09-20 15:32   ` Tvrtko Ursulin
2023-09-21 10:14     ` Tvrtko Ursulin
2023-09-22 11:03       ` Adrián Larumbe
2023-09-22 14:02         ` Tvrtko Ursulin
2023-09-22 11:01     ` Adrián Larumbe
