linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [RFC 00/17] DRM scheduling cgroup controller
@ 2022-10-19 17:32 Tvrtko Ursulin
  2022-10-19 17:32 ` [RFC 01/17] cgroup: Add the DRM " Tvrtko Ursulin
                   ` (17 more replies)
  0 siblings, 18 replies; 28+ messages in thread
From: Tvrtko Ursulin @ 2022-10-19 17:32 UTC (permalink / raw)
  To: Intel-gfx
  Cc: cgroups, linux-kernel, Tejun Heo, Johannes Weiner, Zefan Li,
	Dave Airlie, Daniel Vetter, Rob Clark, Stéphane Marchesin,
	T . J . Mercier, Kenny.Ho, Christian König, Brian Welty,
	Tvrtko Ursulin

From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>

This series contains two independent proposals for a DRM scheduling cgroup
controller: static priority based controller and GPU usage budget based
controller.

Motivation mostly comes from my earlier proposal where I identified that GPU
scheduling lags significantly behind what is available for CPU and IO. Whereas
back then I was proposing to somehow tie this with process nice, feedback mostly
was that people wanted cgroups. So here it is - in the world of heterogenous
computing pipelines I think it is time to do something about this gap.

Code is not finished but should survive some light experimenting with. I am
sharing it early since the topic has been controversial in the past. I hope to
demonstrate there are gains to be had in real world usage(*), today, and that
the concepts the proposal relies are well enough established and stable.

*) Specifically under ChromeOS which uses cgroups to control CPU bandwith for
   VMs based on the window focused status. It can be demonstrated how GPU
   scheduling control can easily be integrated into that setup.

There should be no conflict with this proposal and any efforts to implement
memory usage based controller. Skeleton DRM cgroup controller is deliberatly
purely a skeleton patch where any further functionality can be added with no
real conflicts. [In fact, perhaps scheduling is even easier to deal with than
memory accounting.]

To re-iterate, two proposal are completely functionaly independent and can be
evaluated and progressed independently.

Structure of the series is as follows:

    1) Adds a skeleton DRM cgroup controller with no functionality.
  2-6) First proposal - static priority based controller.
    7) i915 adds support for the static priority based controller.
 8-14) Second proposal - GPU usage based controller.
15-17) i915 adds support for the GPU usage based controller.

Both proposals define a delegation of duties between the tree parties: cgroup
controller, DRM core and individual drivers. Two way communication interfaces
are then defined to enable the delegation to work. Principle of discoverability
is also employed so that the level of supported functionality can be observed
in situation when there are multiple GPUs in use, each with a different set of
scheduling capabilities.

DRM static priority control
~~~~~~~~~~~~~~~~~~~~~~~~~~~

Static priority control exposes a hierarchical control interface for the
scheduling priority support present in many DRM device drivers.

Hierarchical meaning that the child group priorities are relative to their
parent. As an example:

	A=-1000
	   /\
	  /  \
	 /    \
	B=0   C=100

This results in the effective priority of a group B of -1000 and C of -900. In
other words the fact C is configured for elevated priority is relative to its
parent being a low priority and hence is only elevated in the context of its
siblings.

The scope of individual DRM scheduling priority may be per device or per device
driver, or a combination of both, depending on the implementation. The
controller does not ensure any priority ordering across multiple DRM drivers nor
does it impose any further policy and leaves desired configuration to the system
administrator.

Individual DRM drivers are required to transparently convert the cgroup priority
into values suitable for their capabilities.

No guarantees on effectiveness or granularity are provided by the controller,
apart the available range being chosen to be an integer and hence allowing a
generic concept of normal (zero), lower (negative values) and higher (positive
values) priority.

DRM static priority interface files
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

  drm.priority_levels
	One of:
	 1) And integer representing the minimum number of discrete priority
	    levels for the whole group.
	    Optionally followed by an asterisk ('*') indicating some DRM clients
	    in the group support more than the minimum number.
	 2) '0'- indicating one or more DRM clients in the group has no support
	    for static priority control.
	 3) 'n/a' - when there are no DRM clients in the configured group.

  drm.priority
	A read-write integer between -10000 and 10000 (inclusive) representing
	an abstract static priority level.

  drm.effective_priority
	Read only integer showing the current effective priority level for the
	group. Effective meaning taking into account the chain of inherited

This approach should work with all DRM drivers which support priority control,
for instance i915, amdgpu or msm. In the future possibly all that are built on
top of the drm/scheduler as well.

I considered a few options for priority control including range based (min/max
limits, or a flavour of) but ended up settling for simplest one. Mainly because
of the limited priority level support with some drivers, and so to enable adding
wide spread support as easily as possible.

All that is required to enable the setup to be useful is for the drivers to
support the concept of a low, normal and high scheduling priority, which can
then be mapped to from the abstract negatitve, default zero and positive cgroup
priorities.

If range based controls are later wanted they can be added in a backward
compatible manner. Until then priority overlap is possible meaning groups need
to be correctly configured by the "administrator" (entity configuring cgroup
usage on a given system).

As mentioned before, this controller can be easily integrated with the current
ChromeOS architecture for managing CPU bandwith of focused versus unfocused
VM windows. As the OS re-configures the respective CPU shares in a respective
cgroup on a window focus status change, DRM cgroup controller could be attached
to the same group and DRM scheduling priority changed at the same time.

I have ran an experiment where I have shown that doing this can enable the
foreground browser window hit 60fps with no dropped frames, when faced with a
Android game runnning in the background, where if DRM priorities were not used
foreground window was only able to maintain around 45fps.

DRM scheduling soft limits
~~~~~~~~~~~~~~~~~~~~~~~~~~

Because of the heterogenous hardware and driver DRM capabilities, soft limits
are implemented as a loose co-operative (bi-directional) interface between the
controller and DRM core.

The controller configures the GPU time allowed per group and periodically scans
the belonging tasks to detect the over budget condition, at which point it
invokes a callback notifying the DRM core of the condition.

DRM core provides an API to query per process GPU utilization and 2nd API to
receive notification from the cgroup controller when the group enters or exits
the over budget condition.

Individual DRM drivers which implement the interface are expected to act on this
in the best-effort manner only. There are no guarantees that the soft limits
will be respected.

DRM scheduling soft limits interface files
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

  drm.weight
	Standard cgroup weight based control [1, 10000] used to configure the
	relative distributing of GPU time between the sibling groups.

  drm.period_us
	An integer representing the period with which the controller should look
	at the GPU usage by the group and potentially send the over/under budget
	signal.
	Value of zero (defaul) disables the soft limit checking.

  drm.budget_supported
	One of:
	 1) 'yes' - when all DRM clients in the group support the functionality.
	 2) 'no' - when at least one of the DRM clients does not support the
		   functionality.
	 3) 'n/a' - when there are no DRM clients in the group.

The second proposal is a little bit more advanced in concept and also a little
bit less finished. Interesting thing is that it builds upon the per client GPU
utilisation work which landed recently for a few drivers. So my thinking is that
in principle, an intersect of drivers which support both that and some sort of
priority scheduling control, could also in theory support this.

Another really interesting angle for this controller is that it mimics the same
control menthod used by the CPU scheduler. That is the proportional/weight based
GPU time budgeting. Which makes it easy to configure and does not need a new
mental model.

However, as the introduction mentions, GPUs are much more heterogenous and
therefore the controller uses very "soft" wording as to what it promises. The
general statement is that it can define budgets, notify clients when they are
over them, and let individual drivers implement best effort handling of those
conditions.

Delegation of duties in the implementation goes likes this:

 * DRM cgroup controller implements the control files and the scanning loop.
 * DRM core is required to track all DRM clients belonging to processes so it
   can answer when asked how much GPU time is a process using.
 * DRM core also provides a call back which the controller will call when a
   certain process is over budget.
 * Individual drivers need to implement two similar hooks, but which work for
   a single DRM client. Over budget callback and GPU utilisation query.

What I have demonstrated in practice is that when wired to i915, in a really
primitive way where the over-budget condition simply lowers the scheduling
priority, the concept can be almost equally effective as the static priority
control. I say almost because the design where budget control depends on the
periodic usage scanning has a fundamental delay, so responsiveness will depend
on the scanning period, which may or may not be a problem for a particular use
case.

The unfinished part is the GPU budgeting split which currently does not
propagate unused bandwith to children, neither can share it with siblings. But
this is not due fundamental reasons, just to avoid spending too much time on it
too early.

There are also interesting conversations to be had around mental models for what
is GPU usage as a single number when faced with GPUs which have different
execution engines. To an extent this is similar to the multi-core and cgroup
CPU controller problems, but definitely goes further than that.

I deliberately did not want to include any such complications in the controller
itself and left the individual drivers to handle it. For instance in the i915
over-budget callback it will not do anything unless client's GPU usage is on a
physical engine which is oversubscribed. This enables multiple clients to be
harmlessly over budget, as long as they are not competing for the same GPU
resource.

This much for now, hope some good discussion will follow.

P.S.
A disclaimer of a kind -  I was not familiar with how to implement a cgroup
controller at all when I started this prototype therefore it is quite possible
there are many bugs and misunderstandings on how it should be done.

Tvrtko Ursulin (17):
  cgroup: Add the DRM cgroup controller
  drm: Track clients per owning process
  cgroup/drm: Support cgroup priority control
  drm/cgroup: Allow safe external access to file_priv
  drm: Connect priority updates to drm core
  drm: Only track clients which are providing drm_cgroup_ops
  drm/i915: i915 priority
  drm: Allow for migration of clients
  cgroup/drm: Introduce weight based drm cgroup control
  drm: Add ability to query drm cgroup GPU time
  drm: Add over budget signalling callback
  cgroup/drm: Client exit hook
  cgroup/drm: Ability to periodically scan cgroups for over budget GPU
    usage
  cgroup/drm: Show group budget signaling capability in sysfs
  drm/i915: Migrate client to new owner on context create
  drm/i915: Wire up with drm controller GPU time query
  drm/i915: Implement cgroup controller over budget throttling

 Documentation/admin-guide/cgroup-v2.rst       | 100 +++
 drivers/gpu/drm/Kconfig                       |   1 +
 drivers/gpu/drm/Makefile                      |   1 +
 drivers/gpu/drm/drm_cgroup.c                  | 294 +++++++
 drivers/gpu/drm/drm_file.c                    |  22 +-
 drivers/gpu/drm/i915/gem/i915_gem_context.c   |   3 +
 .../gpu/drm/i915/gem/i915_gem_execbuffer.c    |  45 +-
 drivers/gpu/drm/i915/i915_driver.c            |  12 +
 drivers/gpu/drm/i915/i915_drm_client.c        | 183 ++++-
 drivers/gpu/drm/i915/i915_drm_client.h        |  15 +
 include/drm/drm_clients.h                     |  50 ++
 include/drm/drm_drv.h                         |  73 ++
 include/drm/drm_file.h                        |  15 +
 include/linux/cgroup_drm.h                    |  18 +
 include/linux/cgroup_subsys.h                 |   4 +
 init/Kconfig                                  |   8 +
 kernel/cgroup/Makefile                        |   1 +
 kernel/cgroup/drm.c                           | 764 ++++++++++++++++++
 18 files changed, 1594 insertions(+), 15 deletions(-)
 create mode 100644 drivers/gpu/drm/drm_cgroup.c
 create mode 100644 include/drm/drm_clients.h
 create mode 100644 include/linux/cgroup_drm.h
 create mode 100644 kernel/cgroup/drm.c

-- 
2.34.1


^ permalink raw reply	[flat|nested] 28+ messages in thread

* [RFC 01/17] cgroup: Add the DRM cgroup controller
  2022-10-19 17:32 [RFC 00/17] DRM scheduling cgroup controller Tvrtko Ursulin
@ 2022-10-19 17:32 ` Tvrtko Ursulin
  2022-10-19 17:32 ` [RFC 02/17] drm: Track clients per owning process Tvrtko Ursulin
                   ` (16 subsequent siblings)
  17 siblings, 0 replies; 28+ messages in thread
From: Tvrtko Ursulin @ 2022-10-19 17:32 UTC (permalink / raw)
  To: Intel-gfx
  Cc: cgroups, linux-kernel, Tejun Heo, Johannes Weiner, Zefan Li,
	Dave Airlie, Daniel Vetter, Rob Clark, Stéphane Marchesin,
	T . J . Mercier, Kenny.Ho, Christian König, Brian Welty,
	Tvrtko Ursulin

From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>

Skeleton controller without any functionality.

Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
---
 include/linux/cgroup_drm.h    |  9 ++++++
 include/linux/cgroup_subsys.h |  4 +++
 init/Kconfig                  |  7 +++++
 kernel/cgroup/Makefile        |  1 +
 kernel/cgroup/drm.c           | 54 +++++++++++++++++++++++++++++++++++
 5 files changed, 75 insertions(+)
 create mode 100644 include/linux/cgroup_drm.h
 create mode 100644 kernel/cgroup/drm.c

diff --git a/include/linux/cgroup_drm.h b/include/linux/cgroup_drm.h
new file mode 100644
index 000000000000..bf8abc6b8ebf
--- /dev/null
+++ b/include/linux/cgroup_drm.h
@@ -0,0 +1,9 @@
+/* SPDX-License-Identifier: MIT */
+/*
+ * Copyright © 2022 Intel Corporation
+ */
+
+#ifndef _CGROUP_DRM_H
+#define _CGROUP_DRM_H
+
+#endif	/* _CGROUP_DRM_H */
diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h
index 445235487230..49460494a010 100644
--- a/include/linux/cgroup_subsys.h
+++ b/include/linux/cgroup_subsys.h
@@ -65,6 +65,10 @@ SUBSYS(rdma)
 SUBSYS(misc)
 #endif
 
+#if IS_ENABLED(CONFIG_CGROUP_DRM)
+SUBSYS(drm)
+#endif
+
 /*
  * The following subsystems are not supported on the default hierarchy.
  */
diff --git a/init/Kconfig b/init/Kconfig
index 694f7c160c9c..6dd7faca7749 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1087,6 +1087,13 @@ config CGROUP_RDMA
 	  Attaching processes with active RDMA resources to the cgroup
 	  hierarchy is allowed even if can cross the hierarchy's limit.
 
+config CGROUP_DRM
+	bool "DRM controller"
+	help
+	  Provides the DRM subsystem controller.
+
+	  ...
+
 config CGROUP_FREEZER
 	bool "Freezer controller"
 	help
diff --git a/kernel/cgroup/Makefile b/kernel/cgroup/Makefile
index 12f8457ad1f9..849bd2917477 100644
--- a/kernel/cgroup/Makefile
+++ b/kernel/cgroup/Makefile
@@ -6,4 +6,5 @@ obj-$(CONFIG_CGROUP_PIDS) += pids.o
 obj-$(CONFIG_CGROUP_RDMA) += rdma.o
 obj-$(CONFIG_CPUSETS) += cpuset.o
 obj-$(CONFIG_CGROUP_MISC) += misc.o
+obj-$(CONFIG_CGROUP_DRM) += drm.o
 obj-$(CONFIG_CGROUP_DEBUG) += debug.o
diff --git a/kernel/cgroup/drm.c b/kernel/cgroup/drm.c
new file mode 100644
index 000000000000..b88c93661df3
--- /dev/null
+++ b/kernel/cgroup/drm.c
@@ -0,0 +1,54 @@
+/* SPDX-License-Identifier: MIT */
+/*
+ * Copyright © 2022 Intel Corporation
+ */
+
+#include <linux/slab.h>
+#include <linux/cgroup.h>
+#include <linux/cgroup_drm.h>
+#include <linux/sched.h>
+
+struct drm_cgroup_state {
+	struct cgroup_subsys_state css;
+};
+
+static inline struct drm_cgroup_state *
+css_to_drmcs(struct cgroup_subsys_state *css)
+{
+	return container_of(css, struct drm_cgroup_state, css);
+}
+
+static void drmcs_free(struct cgroup_subsys_state *css)
+{
+	kfree(css_to_drmcs(css));
+}
+
+static struct drm_cgroup_state root_drmcs = {
+};
+
+static struct cgroup_subsys_state *
+drmcs_alloc(struct cgroup_subsys_state *parent_css)
+{
+	struct drm_cgroup_state *drmcs;
+
+	if (!parent_css)
+		return &root_drmcs.css;
+
+	drmcs = kzalloc(sizeof(*drmcs), GFP_KERNEL);
+	if (!drmcs)
+		return ERR_PTR(-ENOMEM);
+
+	return &drmcs->css;
+}
+
+struct cftype files[] = {
+	{ } /* Zero entry terminates. */
+};
+
+struct cgroup_subsys drm_cgrp_subsys = {
+	.css_alloc	= drmcs_alloc,
+	.css_free	= drmcs_free,
+	.early_init	= false,
+	.legacy_cftypes	= files,
+	.dfl_cftypes	= files,
+};
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [RFC 02/17] drm: Track clients per owning process
  2022-10-19 17:32 [RFC 00/17] DRM scheduling cgroup controller Tvrtko Ursulin
  2022-10-19 17:32 ` [RFC 01/17] cgroup: Add the DRM " Tvrtko Ursulin
@ 2022-10-19 17:32 ` Tvrtko Ursulin
  2022-10-20  6:40   ` Christian König
  2022-10-19 17:32 ` [RFC 03/17] cgroup/drm: Support cgroup priority control Tvrtko Ursulin
                   ` (15 subsequent siblings)
  17 siblings, 1 reply; 28+ messages in thread
From: Tvrtko Ursulin @ 2022-10-19 17:32 UTC (permalink / raw)
  To: Intel-gfx
  Cc: cgroups, linux-kernel, Tejun Heo, Johannes Weiner, Zefan Li,
	Dave Airlie, Daniel Vetter, Rob Clark, Stéphane Marchesin,
	T . J . Mercier, Kenny.Ho, Christian König, Brian Welty,
	Tvrtko Ursulin

From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>

To enable propagation of settings from the cgroup drm controller to drm we
need to start tracking which processes own which drm clients.

Implement that by tracking the struct pid pointer of the owning process in
a new XArray, pointing to a structure containing a list of associated
struct drm_file pointers.

Clients are added and removed under the filelist mutex and RCU list
operations are used below it to allow for lockless lookup.

Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
---
 drivers/gpu/drm/Makefile     |  1 +
 drivers/gpu/drm/drm_cgroup.c | 60 ++++++++++++++++++++++++++++++++++++
 drivers/gpu/drm/drm_file.c   | 18 ++++++++---
 include/drm/drm_clients.h    | 31 +++++++++++++++++++
 include/drm/drm_file.h       |  4 +++
 5 files changed, 110 insertions(+), 4 deletions(-)
 create mode 100644 drivers/gpu/drm/drm_cgroup.c
 create mode 100644 include/drm/drm_clients.h

diff --git a/drivers/gpu/drm/Makefile b/drivers/gpu/drm/Makefile
index 6e55c47288e4..0719970d17ee 100644
--- a/drivers/gpu/drm/Makefile
+++ b/drivers/gpu/drm/Makefile
@@ -59,6 +59,7 @@ drm-$(CONFIG_DRM_LEGACY) += \
 	drm_scatter.o \
 	drm_vm.o
 drm-$(CONFIG_DRM_LIB_RANDOM) += lib/drm_random.o
+drm-$(CONFIG_CGROUP_DRM) += drm_cgroup.o
 drm-$(CONFIG_COMPAT) += drm_ioc32.o
 drm-$(CONFIG_DRM_PANEL) += drm_panel.o
 drm-$(CONFIG_OF) += drm_of.o
diff --git a/drivers/gpu/drm/drm_cgroup.c b/drivers/gpu/drm/drm_cgroup.c
new file mode 100644
index 000000000000..a31ff1d593ab
--- /dev/null
+++ b/drivers/gpu/drm/drm_cgroup.c
@@ -0,0 +1,60 @@
+/* SPDX-License-Identifier: MIT */
+/*
+ * Copyright © 2022 Intel Corporation
+ */
+
+#include <drm/drm_drv.h>
+#include <drm/drm_clients.h>
+
+static DEFINE_XARRAY(drm_pid_clients);
+
+void drm_clients_close(struct drm_file *file_priv)
+{
+	unsigned long pid = (unsigned long)file_priv->pid;
+	struct drm_device *dev = file_priv->minor->dev;
+	struct drm_pid_clients *clients;
+
+	lockdep_assert_held(&dev->filelist_mutex);
+
+	clients = xa_load(&drm_pid_clients, pid);
+	list_del_rcu(&file_priv->clink);
+	if (atomic_dec_and_test(&clients->num)) {
+		xa_erase(&drm_pid_clients, pid);
+		kfree_rcu(clients, rcu);
+	}
+}
+
+int drm_clients_open(struct drm_file *file_priv)
+{
+	unsigned long pid = (unsigned long)file_priv->pid;
+	struct drm_device *dev = file_priv->minor->dev;
+	struct drm_pid_clients *clients;
+	bool new_client = false;
+
+	lockdep_assert_held(&dev->filelist_mutex);
+
+	clients = xa_load(&drm_pid_clients, pid);
+	if (!clients) {
+		clients = kmalloc(sizeof(*clients), GFP_KERNEL);
+		if (!clients)
+			return -ENOMEM;
+		atomic_set(&clients->num, 0);
+		INIT_LIST_HEAD(&clients->file_list);
+		init_rcu_head(&clients->rcu);
+		new_client = true;
+	}
+	atomic_inc(&clients->num);
+	list_add_tail_rcu(&file_priv->clink, &clients->file_list);
+	if (new_client) {
+		void *xret;
+
+		xret = xa_store(&drm_pid_clients, pid, clients, GFP_KERNEL);
+		if (xa_err(xret)) {
+			list_del_init(&file_priv->clink);
+			kfree(clients);
+			return PTR_ERR(clients);
+		}
+	}
+
+	return 0;
+}
diff --git a/drivers/gpu/drm/drm_file.c b/drivers/gpu/drm/drm_file.c
index a8b4d918e9a3..ce58d5c513db 100644
--- a/drivers/gpu/drm/drm_file.c
+++ b/drivers/gpu/drm/drm_file.c
@@ -40,6 +40,7 @@
 #include <linux/slab.h>
 
 #include <drm/drm_client.h>
+#include <drm/drm_clients.h>
 #include <drm/drm_drv.h>
 #include <drm/drm_file.h>
 #include <drm/drm_print.h>
@@ -298,6 +299,7 @@ static void drm_close_helper(struct file *filp)
 
 	mutex_lock(&dev->filelist_mutex);
 	list_del(&file_priv->lhead);
+	drm_clients_close(file_priv);
 	mutex_unlock(&dev->filelist_mutex);
 
 	drm_file_free(file_priv);
@@ -349,10 +351,8 @@ static int drm_open_helper(struct file *filp, struct drm_minor *minor)
 
 	if (drm_is_primary_client(priv)) {
 		ret = drm_master_open(priv);
-		if (ret) {
-			drm_file_free(priv);
-			return ret;
-		}
+		if (ret)
+			goto err_free;
 	}
 
 	filp->private_data = priv;
@@ -360,6 +360,9 @@ static int drm_open_helper(struct file *filp, struct drm_minor *minor)
 	priv->filp = filp;
 
 	mutex_lock(&dev->filelist_mutex);
+	ret = drm_clients_open(priv);
+	if (ret)
+		goto err_unlock;
 	list_add(&priv->lhead, &dev->filelist);
 	mutex_unlock(&dev->filelist_mutex);
 
@@ -387,6 +390,13 @@ static int drm_open_helper(struct file *filp, struct drm_minor *minor)
 #endif
 
 	return 0;
+
+err_unlock:
+	mutex_unlock(&dev->filelist_mutex);
+err_free:
+	drm_file_free(priv);
+
+	return ret;
 }
 
 /**
diff --git a/include/drm/drm_clients.h b/include/drm/drm_clients.h
new file mode 100644
index 000000000000..4ae553a03d1e
--- /dev/null
+++ b/include/drm/drm_clients.h
@@ -0,0 +1,31 @@
+/* SPDX-License-Identifier: MIT */
+/*
+ * Copyright © 2022 Intel Corporation
+ */
+
+#ifndef _DRM_CLIENTS_H_
+#define _DRM_CLIENTS_H_
+
+#include <drm/drm_file.h>
+
+struct drm_pid_clients {
+	atomic_t num;
+	struct list_head file_list;
+	struct rcu_head rcu;
+};
+
+#if IS_ENABLED(CONFIG_CGROUP_DRM)
+void drm_clients_close(struct drm_file *file_priv);
+int drm_clients_open(struct drm_file *file_priv);
+#else
+static inline void drm_clients_close(struct drm_file *file_priv)
+{
+}
+
+static inline int drm_clients_open(struct drm_file *file_priv)
+{
+	return 0;
+}
+#endif
+
+#endif
diff --git a/include/drm/drm_file.h b/include/drm/drm_file.h
index d780fd151789..0965eb111f24 100644
--- a/include/drm/drm_file.h
+++ b/include/drm/drm_file.h
@@ -268,6 +268,10 @@ struct drm_file {
 	/** @minor: &struct drm_minor for this file. */
 	struct drm_minor *minor;
 
+#if IS_ENABLED(CONFIG_CGROUP_DRM)
+	struct list_head clink;
+#endif
+
 	/**
 	 * @object_idr:
 	 *
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [RFC 03/17] cgroup/drm: Support cgroup priority control
  2022-10-19 17:32 [RFC 00/17] DRM scheduling cgroup controller Tvrtko Ursulin
  2022-10-19 17:32 ` [RFC 01/17] cgroup: Add the DRM " Tvrtko Ursulin
  2022-10-19 17:32 ` [RFC 02/17] drm: Track clients per owning process Tvrtko Ursulin
@ 2022-10-19 17:32 ` Tvrtko Ursulin
  2022-10-19 17:32 ` [RFC 04/17] drm/cgroup: Allow safe external access to file_priv Tvrtko Ursulin
                   ` (14 subsequent siblings)
  17 siblings, 0 replies; 28+ messages in thread
From: Tvrtko Ursulin @ 2022-10-19 17:32 UTC (permalink / raw)
  To: Intel-gfx
  Cc: cgroups, linux-kernel, Tejun Heo, Johannes Weiner, Zefan Li,
	Dave Airlie, Daniel Vetter, Rob Clark, Stéphane Marchesin,
	T . J . Mercier, Kenny.Ho, Christian König, Brian Welty,
	Tvrtko Ursulin

From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>

A lot of the drm drivers support a concept of a scheduling priority. Add
support for controlling it via the drm cgroup controller.

Abstract priority control range of [DRM_CGROUP_PRIORITY_MIN,
DRM_CGROUP_PRIORITY_MAX] is used and each group hierarchy adjusts it's
base level based on a priority of its parent. In terms of an example that
looks like this:

       P=-1000
          /\
         /  \
        /    \
      A=0   B=100

This results in the effective priority of a group A of -1000 and B of
-900. In other words the fact B is configured for elevated priority is
relative to the parent being a low priority and hence is only elevated in
the context of its siblings.

Implementation does not impose any further policy and leaves sensible
configuration to the system administrator.

Individual drm drivers are expected to transparently convert the drm
cgroup priority into values suitable for their capabilities.

No guarantees on effectiveness or granularity are provided by the
controller, apart the available range being chosen to be an integer and
hence allowing a generic concept of normal (zero), lower (negative values)
and higher (positive values).

Every cgroup starts with a default priority of zero.

Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
---
 Documentation/admin-guide/cgroup-v2.rst |  58 +++++++++++++
 include/linux/cgroup_drm.h              |   4 +
 kernel/cgroup/drm.c                     | 110 ++++++++++++++++++++++++
 3 files changed, 172 insertions(+)

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index dc254a3cb956..0a6d97c83ea4 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -2398,6 +2398,64 @@ HugeTLB Interface Files
         hugetlb pages of <hugepagesize> in this cgroup.  Only active in
         use hugetlb pages are included.  The per-node values are in bytes.
 
+DRM
+---
+
+The DRM controller allows configuring static hierarchical scheduling priority.
+
+DRM static priority control
+~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Static priority control exposes a hierarchical control interface for the
+scheduling priority support present in many DRM device drivers.
+
+Hierarchical meaning that the child group priorities are relative to their
+parent. As an example:
+
+	A=-1000
+	   /\
+	  /  \
+	 /    \
+	B=0   C=100
+
+This results in the effective priority of a group B of -1000 and C of -900. In
+other words the fact C is configured for elevated priority is relative to its
+parent being a low priority and hence is only elevated in the context of its
+siblings.
+
+The scope of individual DRM scheduling priority may be per device or per device
+driver, or a combination of both, depending on the implementation. The
+controller does not ensure any priority ordering across multiple DRM drivers nor
+does it impose any further policy and leaves desired configuration to the system
+administrator.
+
+Individual DRM drivers are required to transparently convert the cgroup priority
+into values suitable for their capabilities.
+
+No guarantees on effectiveness or granularity are provided by the controller,
+apart the available range being chosen to be an integer and hence allowing a
+generic concept of normal (zero), lower (negative values) and higher (positive
+values) priority.
+
+DRM static priority interface files
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+  drm.priority_levels
+	One of:
+	 1) And integer representing the minimum number of discrete priority
+	    levels for the whole group.
+	 2) '0'- indicating one or more DRM clients in the group has no support
+	    for static priority control.
+	 3) 'n/a' - when there are no DRM clients in the configured group.
+
+  drm.priority
+	A read-write integer between -10000 and 10000 (inclusive) representing
+	an abstract static priority level.
+
+  drm.effective_priority
+	Read only integer showing the current effective priority level for the
+	group. Effective meaning taking into account the chain of inherited
+
 Misc
 ----
 
diff --git a/include/linux/cgroup_drm.h b/include/linux/cgroup_drm.h
index bf8abc6b8ebf..a59792ccb550 100644
--- a/include/linux/cgroup_drm.h
+++ b/include/linux/cgroup_drm.h
@@ -6,4 +6,8 @@
 #ifndef _CGROUP_DRM_H
 #define _CGROUP_DRM_H
 
+#define DRM_CGROUP_PRIORITY_MIN	(-10000)
+#define DRM_CGROUP_PRIORITY_DEF	(0)
+#define DRM_CGROUP_PRIORITY_MAX	(10000)
+
 #endif	/* _CGROUP_DRM_H */
diff --git a/kernel/cgroup/drm.c b/kernel/cgroup/drm.c
index b88c93661df3..2350e1f8a48a 100644
--- a/kernel/cgroup/drm.c
+++ b/kernel/cgroup/drm.c
@@ -6,24 +6,117 @@
 #include <linux/slab.h>
 #include <linux/cgroup.h>
 #include <linux/cgroup_drm.h>
+#include <linux/minmax.h>
+#include <linux/mutex.h>
 #include <linux/sched.h>
 
 struct drm_cgroup_state {
 	struct cgroup_subsys_state css;
+
+	int priority;
+	int effective_priority;
 };
 
+static DEFINE_MUTEX(drmcg_mutex);
+
 static inline struct drm_cgroup_state *
 css_to_drmcs(struct cgroup_subsys_state *css)
 {
 	return container_of(css, struct drm_cgroup_state, css);
 }
 
+static int drmcs_show_priority_levels(struct seq_file *sf, void *v)
+{
+	seq_printf(sf, "%u\n", 0);
+
+	return 0;
+}
+
+static s64
+drmcs_read_effective_priority(struct cgroup_subsys_state *css,
+			      struct cftype *cft)
+{
+	struct drm_cgroup_state *drmcs = css_to_drmcs(css);
+
+	return drmcs->effective_priority;
+}
+
+static s64
+drmcs_read_priority(struct cgroup_subsys_state *css, struct cftype *cft)
+{
+	struct drm_cgroup_state *drmcs = css_to_drmcs(css);
+
+	return drmcs->priority;
+}
+
+static void update_priority(struct drm_cgroup_state *drmcs, int priority)
+{
+	struct cgroup_subsys_state *node;
+
+	lockdep_assert_held(&drmcg_mutex);
+
+	if (priority == drmcs->priority)
+		return;
+
+	drmcs->priority = priority;
+
+	rcu_read_lock();
+	css_for_each_descendant_pre(node, &drmcs->css) {
+		struct drm_cgroup_state *dnode = css_to_drmcs(node);
+		int pprio;
+
+		if (!node->parent)
+			pprio = DRM_CGROUP_PRIORITY_DEF;
+		else
+			pprio = css_to_drmcs(node->parent)->effective_priority;
+
+		dnode->effective_priority =
+			clamp(pprio + dnode->priority,
+			      DRM_CGROUP_PRIORITY_MIN,
+			      DRM_CGROUP_PRIORITY_MAX);
+	}
+	rcu_read_unlock();
+}
+
+static int
+drmcs_write_priority(struct cgroup_subsys_state *css, struct cftype *cftype,
+		     s64 priority)
+{
+	struct drm_cgroup_state *drmcs = css_to_drmcs(css);
+	int ret;
+
+	if (priority < (s64)DRM_CGROUP_PRIORITY_MIN ||
+	    priority > (s64)DRM_CGROUP_PRIORITY_MAX)
+		return -ERANGE;
+
+	ret = mutex_lock_interruptible(&drmcg_mutex);
+	if (ret)
+		return ret;
+	update_priority(drmcs, (int)priority);
+	mutex_unlock(&drmcg_mutex);
+
+	return 0;
+}
+
+static int drmcs_online(struct cgroup_subsys_state *css)
+{
+	struct drm_cgroup_state *drmcs = css_to_drmcs(css);
+
+	mutex_lock(&drmcg_mutex);
+	update_priority(drmcs, DRM_CGROUP_PRIORITY_DEF);
+	mutex_unlock(&drmcg_mutex);
+
+	return 0;
+}
+
 static void drmcs_free(struct cgroup_subsys_state *css)
 {
 	kfree(css_to_drmcs(css));
 }
 
 static struct drm_cgroup_state root_drmcs = {
+	.priority = DRM_CGROUP_PRIORITY_DEF,
+	.effective_priority = DRM_CGROUP_PRIORITY_DEF,
 };
 
 static struct cgroup_subsys_state *
@@ -42,12 +135,29 @@ drmcs_alloc(struct cgroup_subsys_state *parent_css)
 }
 
 struct cftype files[] = {
+	{
+		.name = "priority_levels",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.seq_show = drmcs_show_priority_levels,
+	},
+	{
+		.name = "priority",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.read_s64 = drmcs_read_priority,
+		.write_s64 = drmcs_write_priority,
+	},
+	{
+		.name = "effective_priority",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.read_s64 = drmcs_read_effective_priority,
+	},
 	{ } /* Zero entry terminates. */
 };
 
 struct cgroup_subsys drm_cgrp_subsys = {
 	.css_alloc	= drmcs_alloc,
 	.css_free	= drmcs_free,
+	.css_online	= drmcs_online,
 	.early_init	= false,
 	.legacy_cftypes	= files,
 	.dfl_cftypes	= files,
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [RFC 04/17] drm/cgroup: Allow safe external access to file_priv
  2022-10-19 17:32 [RFC 00/17] DRM scheduling cgroup controller Tvrtko Ursulin
                   ` (2 preceding siblings ...)
  2022-10-19 17:32 ` [RFC 03/17] cgroup/drm: Support cgroup priority control Tvrtko Ursulin
@ 2022-10-19 17:32 ` Tvrtko Ursulin
  2022-10-19 17:32 ` [RFC 05/17] drm: Connect priority updates to drm core Tvrtko Ursulin
                   ` (13 subsequent siblings)
  17 siblings, 0 replies; 28+ messages in thread
From: Tvrtko Ursulin @ 2022-10-19 17:32 UTC (permalink / raw)
  To: Intel-gfx
  Cc: cgroups, linux-kernel, Tejun Heo, Johannes Weiner, Zefan Li,
	Dave Airlie, Daniel Vetter, Rob Clark, Stéphane Marchesin,
	T . J . Mercier, Kenny.Ho, Christian König, Brian Welty,
	Tvrtko Ursulin

From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>

Entry points from the cgroup subsystem into the drm cgroup controller will
need to walk the file_priv structures associated with registered clients
and since those are not RCU protected lets add a hack for now to make this
safe.

Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
---
 drivers/gpu/drm/drm_cgroup.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/drivers/gpu/drm/drm_cgroup.c b/drivers/gpu/drm/drm_cgroup.c
index a31ff1d593ab..9e9caeb0aa87 100644
--- a/drivers/gpu/drm/drm_cgroup.c
+++ b/drivers/gpu/drm/drm_cgroup.c
@@ -21,6 +21,13 @@ void drm_clients_close(struct drm_file *file_priv)
 	if (atomic_dec_and_test(&clients->num)) {
 		xa_erase(&drm_pid_clients, pid);
 		kfree_rcu(clients, rcu);
+
+		/*
+		 * FIXME: file_priv is not RCU protected so we add this hack
+		 * to avoid any races with code which walks clients->file_list
+		 * and accesses file_priv.
+		 */
+		synchronize_rcu();
 	}
 }
 
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [RFC 05/17] drm: Connect priority updates to drm core
  2022-10-19 17:32 [RFC 00/17] DRM scheduling cgroup controller Tvrtko Ursulin
                   ` (3 preceding siblings ...)
  2022-10-19 17:32 ` [RFC 04/17] drm/cgroup: Allow safe external access to file_priv Tvrtko Ursulin
@ 2022-10-19 17:32 ` Tvrtko Ursulin
  2022-10-19 17:32 ` [RFC 06/17] drm: Only track clients which are providing drm_cgroup_ops Tvrtko Ursulin
                   ` (12 subsequent siblings)
  17 siblings, 0 replies; 28+ messages in thread
From: Tvrtko Ursulin @ 2022-10-19 17:32 UTC (permalink / raw)
  To: Intel-gfx
  Cc: cgroups, linux-kernel, Tejun Heo, Johannes Weiner, Zefan Li,
	Dave Airlie, Daniel Vetter, Rob Clark, Stéphane Marchesin,
	T . J . Mercier, Kenny.Ho, Christian König, Brian Welty,
	Tvrtko Ursulin

From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>

On priority updates, drm cgroup controller is made walk all the processes
belonging to the group being updated, and notifies the drm core of them
via a new helper.

DRM core itself stores the current effective drm cgroup priority in
struct drm_file, while individual drivers can also register an optional
hook to be called at the same time, via struct drm_cgroup_ops which can be
provided as part of struct drm_driver used at driver registration time.

DRM cgroup controller on the other hand exports a new helper which the drm
core uses at client registration time in order to query to current drm
cgroup effective priority.

This establishes a two way communication channel between the drm cgroup
controller and the drm core and hence drm core module now has to be built
into the kernel.

Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
---
 Documentation/admin-guide/cgroup-v2.rst |  2 +
 drivers/gpu/drm/Kconfig                 |  1 +
 drivers/gpu/drm/drm_cgroup.c            | 56 +++++++++++++++++
 drivers/gpu/drm/drm_file.c              |  4 ++
 include/drm/drm_clients.h               |  3 +
 include/drm/drm_drv.h                   | 47 ++++++++++++++
 include/drm/drm_file.h                  | 10 +++
 include/linux/cgroup_drm.h              |  4 ++
 init/Kconfig                            |  1 +
 kernel/cgroup/drm.c                     | 82 ++++++++++++++++++++++++-
 10 files changed, 209 insertions(+), 1 deletion(-)

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 0a6d97c83ea4..1f3cca4e2572 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -2444,6 +2444,8 @@ DRM static priority interface files
 	One of:
 	 1) And integer representing the minimum number of discrete priority
 	    levels for the whole group.
+	    Optionally followed by an asterisk ('*') indicating some DRM clients
+	    in the group support more than the minimum number.
 	 2) '0'- indicating one or more DRM clients in the group has no support
 	    for static priority control.
 	 3) 'n/a' - when there are no DRM clients in the configured group.
diff --git a/drivers/gpu/drm/Kconfig b/drivers/gpu/drm/Kconfig
index 34f5a092c99e..8f3c169ced10 100644
--- a/drivers/gpu/drm/Kconfig
+++ b/drivers/gpu/drm/Kconfig
@@ -7,6 +7,7 @@
 #
 menuconfig DRM
 	tristate "Direct Rendering Manager (XFree86 4.1.0 and higher DRI support)"
+	default y if CGROUP_DRM=y
 	depends on (AGP || AGP=n) && !EMULATED_CMPXCHG && HAS_DMA
 	select DRM_NOMODESET
 	select DRM_PANEL_ORIENTATION_QUIRKS
diff --git a/drivers/gpu/drm/drm_cgroup.c b/drivers/gpu/drm/drm_cgroup.c
index 9e9caeb0aa87..0fbb88f08cef 100644
--- a/drivers/gpu/drm/drm_cgroup.c
+++ b/drivers/gpu/drm/drm_cgroup.c
@@ -65,3 +65,59 @@ int drm_clients_open(struct drm_file *file_priv)
 
 	return 0;
 }
+
+unsigned int drm_pid_priority_levels(struct pid *pid, bool *non_uniform)
+{
+	unsigned int min_levels = UINT_MAX;
+	struct drm_pid_clients *clients;
+
+	*non_uniform = false;
+
+	rcu_read_lock();
+	clients = xa_load(&drm_pid_clients, (unsigned long)pid);
+	if (clients) {
+		struct drm_file *fpriv;
+
+		list_for_each_entry_rcu(fpriv, &clients->file_list, clink) {
+			const struct drm_cgroup_ops *cg_ops =
+				fpriv->minor->dev->driver->cg_ops;
+			unsigned int l;
+
+			if (cg_ops && cg_ops->priority_levels)
+				l = cg_ops->priority_levels(fpriv);
+			else
+				l = 0;
+
+			if (min_levels != UINT_MAX && l != min_levels)
+				*non_uniform = true;
+			if (l < min_levels)
+				min_levels = l;
+		}
+	}
+	rcu_read_unlock();
+
+	return min_levels;
+}
+EXPORT_SYMBOL_GPL(drm_pid_priority_levels);
+
+void drm_pid_update_priority(struct pid *pid, int priority)
+{
+	struct drm_pid_clients *clients;
+
+	rcu_read_lock();
+	clients = xa_load(&drm_pid_clients, (unsigned long)pid);
+	if (clients) {
+		struct drm_file *fpriv;
+
+		list_for_each_entry_rcu(fpriv, &clients->file_list, clink) {
+			const struct drm_cgroup_ops *cg_ops =
+				fpriv->minor->dev->driver->cg_ops;
+
+			fpriv->drm_cgroup_priority = priority;
+			if (cg_ops && cg_ops->update_priority)
+				cg_ops->update_priority(fpriv, priority);
+		}
+	}
+	rcu_read_unlock();
+}
+EXPORT_SYMBOL_GPL(drm_pid_update_priority);
diff --git a/drivers/gpu/drm/drm_file.c b/drivers/gpu/drm/drm_file.c
index ce58d5c513db..38eb6003e74d 100644
--- a/drivers/gpu/drm/drm_file.c
+++ b/drivers/gpu/drm/drm_file.c
@@ -32,6 +32,7 @@
  */
 
 #include <linux/anon_inodes.h>
+#include <linux/cgroup_drm.h>
 #include <linux/dma-fence.h>
 #include <linux/file.h>
 #include <linux/module.h>
@@ -359,6 +360,9 @@ static int drm_open_helper(struct file *filp, struct drm_minor *minor)
 	filp->f_mode |= FMODE_UNSIGNED_OFFSET;
 	priv->filp = filp;
 
+	priv->drm_cgroup_priority =
+		drmcgroup_lookup_effective_priority(current);
+
 	mutex_lock(&dev->filelist_mutex);
 	ret = drm_clients_open(priv);
 	if (ret)
diff --git a/include/drm/drm_clients.h b/include/drm/drm_clients.h
index 4ae553a03d1e..10d21138f7af 100644
--- a/include/drm/drm_clients.h
+++ b/include/drm/drm_clients.h
@@ -28,4 +28,7 @@ static inline int drm_clients_open(struct drm_file *file_priv)
 }
 #endif
 
+unsigned int drm_pid_priority_levels(struct pid *pid, bool *non_uniform);
+void drm_pid_update_priority(struct pid *pid, int priority);
+
 #endif
diff --git a/include/drm/drm_drv.h b/include/drm/drm_drv.h
index f6159acb8856..2371d73e12cf 100644
--- a/include/drm/drm_drv.h
+++ b/include/drm/drm_drv.h
@@ -148,6 +148,43 @@ enum drm_driver_feature {
 	DRIVER_KMS_LEGACY_CONTEXT	= BIT(31),
 };
 
+/**
+ * struct drm_cgroup_ops
+ *
+ * This structure contains a number of callbacks that drivers can provide if
+ * they are able to support one or more of the functionalities implemented by
+ * the DRM cgroup controller.
+ */
+struct drm_cgroup_ops {
+	/**
+	 * @priority_levels:
+	 *
+	 * Returns the discrete number of priority levels supported by the DRM
+	 * driver owning this client.
+	 *
+	 * The value is used by the DRM core when informing the DRM cgroup
+	 * controller on the scheduling priority capability of a group of
+	 * clients.
+	 *
+	 * If the callback is not implemented no support for scheduling priority
+	 * is assumed and reported as such.
+	 */
+	unsigned int (*priority_levels) (struct drm_file *);
+
+	/**
+	 * @update_priority:
+	 *
+	 * Optional callback used by the DRM core for informing individual
+	 * drivers of DRM cgroup priority changes.
+	 *
+	 * If not implemented drivers are still able to access the most recent
+	 * priority via the drm_file->drm_cgroup_priority field. Therefore the
+	 * main purpose of the callback is for drivers which are able to adjust
+	 * priorities of already running workloads.
+	 */
+	void (*update_priority) (struct drm_file *, int priority);
+};
+
 /**
  * struct drm_driver - DRM driver structure
  *
@@ -459,6 +496,16 @@ struct drm_driver {
 	 */
 	const struct file_operations *fops;
 
+#ifdef CONFIG_CGROUP_DRM
+	/**
+	 * @cg_ops:
+	 *
+	 * Optional pointer to driver callbacks facilitating integration with
+	 * the DRM cgroup controller.
+	 */
+	const struct drm_cgroup_ops *cg_ops;
+#endif
+
 #ifdef CONFIG_DRM_LEGACY
 	/* Everything below here is for legacy driver, never use! */
 	/* private: */
diff --git a/include/drm/drm_file.h b/include/drm/drm_file.h
index 0965eb111f24..a4360e28e2db 100644
--- a/include/drm/drm_file.h
+++ b/include/drm/drm_file.h
@@ -223,6 +223,16 @@ struct drm_file {
 	 */
 	bool is_master;
 
+#ifdef CONFIG_CGROUP_DRM
+	/**
+	 * @drm_cgroup_priority:
+	 *
+	 * Last known DRM cgroup priority is stored here by the DRM code when
+	 * informed of changes by the cgroup controller.
+	 */
+	int drm_cgroup_priority;
+#endif
+
 	/**
 	 * @master:
 	 *
diff --git a/include/linux/cgroup_drm.h b/include/linux/cgroup_drm.h
index a59792ccb550..66063b4708e8 100644
--- a/include/linux/cgroup_drm.h
+++ b/include/linux/cgroup_drm.h
@@ -6,8 +6,12 @@
 #ifndef _CGROUP_DRM_H
 #define _CGROUP_DRM_H
 
+struct task_struct;
+
 #define DRM_CGROUP_PRIORITY_MIN	(-10000)
 #define DRM_CGROUP_PRIORITY_DEF	(0)
 #define DRM_CGROUP_PRIORITY_MAX	(10000)
 
+int drmcgroup_lookup_effective_priority(struct task_struct *task);
+
 #endif	/* _CGROUP_DRM_H */
diff --git a/init/Kconfig b/init/Kconfig
index 6dd7faca7749..cfc7a1f2634c 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1089,6 +1089,7 @@ config CGROUP_RDMA
 
 config CGROUP_DRM
 	bool "DRM controller"
+	select DRM
 	help
 	  Provides the DRM subsystem controller.
 
diff --git a/kernel/cgroup/drm.c b/kernel/cgroup/drm.c
index 2350e1f8a48a..01954c3a2087 100644
--- a/kernel/cgroup/drm.c
+++ b/kernel/cgroup/drm.c
@@ -10,6 +10,8 @@
 #include <linux/mutex.h>
 #include <linux/sched.h>
 
+#include <drm/drm_clients.h>
+
 struct drm_cgroup_state {
 	struct cgroup_subsys_state css;
 
@@ -25,9 +27,52 @@ css_to_drmcs(struct cgroup_subsys_state *css)
 	return container_of(css, struct drm_cgroup_state, css);
 }
 
+static inline struct drm_cgroup_state *get_task_drmcs(struct task_struct *task)
+{
+	return css_to_drmcs(task_get_css(task, drm_cgrp_id));
+}
+
+int drmcgroup_lookup_effective_priority(struct task_struct *task)
+{
+	struct drm_cgroup_state *drmcs = get_task_drmcs(task);
+	int prio = drmcs->effective_priority;
+
+	css_put(&drmcs->css);
+
+	return prio;
+}
+EXPORT_SYMBOL_GPL(drmcgroup_lookup_effective_priority);
+
 static int drmcs_show_priority_levels(struct seq_file *sf, void *v)
 {
-	seq_printf(sf, "%u\n", 0);
+	struct cgroup *cgrp = seq_css(sf)->cgroup;
+	unsigned int min_levels = UINT_MAX;
+	bool non_uniform = false;
+	struct task_struct *task;
+	struct css_task_iter it;
+
+	css_task_iter_start(&cgrp->self,
+			    CSS_TASK_ITER_PROCS | CSS_TASK_ITER_THREADED, &it);
+	while ((task = css_task_iter_next(&it))) {
+		unsigned int l;
+		bool nu;
+
+		/* Ignore kernel threads here. */
+		if (task->flags & PF_KTHREAD)
+			continue;
+
+		l = drm_pid_priority_levels(task_pid(task), &nu);
+		if (nu || (min_levels != UINT_MAX && l != min_levels))
+			non_uniform = true;
+		if (l < min_levels)
+			min_levels = l;
+	}
+	css_task_iter_end(&it);
+
+	if (min_levels != UINT_MAX)
+		seq_printf(sf, "%u%s\n", min_levels, non_uniform ? "*" : "");
+	else
+		seq_puts(sf, "n/a\n");
 
 	return 0;
 }
@@ -49,6 +94,24 @@ drmcs_read_priority(struct cgroup_subsys_state *css, struct cftype *cft)
 	return drmcs->priority;
 }
 
+static void update_drm_priority(struct drm_cgroup_state *drmcs)
+{
+	struct cgroup *cgrp = drmcs->css.cgroup;
+	struct task_struct *task;
+	struct css_task_iter it;
+
+	css_task_iter_start(&cgrp->self,
+			    CSS_TASK_ITER_PROCS | CSS_TASK_ITER_THREADED, &it);
+	while ((task = css_task_iter_next(&it))) {
+		/* Ignore kernel threads here. */
+		if (task->flags & PF_KTHREAD)
+			continue;
+		drm_pid_update_priority(task_pid(task),
+					drmcs->effective_priority);
+	}
+	css_task_iter_end(&it);
+}
+
 static void update_priority(struct drm_cgroup_state *drmcs, int priority)
 {
 	struct cgroup_subsys_state *node;
@@ -74,6 +137,8 @@ static void update_priority(struct drm_cgroup_state *drmcs, int priority)
 			clamp(pprio + dnode->priority,
 			      DRM_CGROUP_PRIORITY_MIN,
 			      DRM_CGROUP_PRIORITY_MAX);
+
+		update_drm_priority(dnode);
 	}
 	rcu_read_unlock();
 }
@@ -114,6 +179,20 @@ static void drmcs_free(struct cgroup_subsys_state *css)
 	kfree(css_to_drmcs(css));
 }
 
+static void drmcs_attach(struct cgroup_taskset *tset)
+{
+	struct cgroup_subsys_state *css;
+	struct task_struct *task;
+
+	/*
+	 * As processes are assigned to groups we need to notify them of the
+	 * current priority.
+	 */
+	cgroup_taskset_for_each(task, css, tset)
+		drm_pid_update_priority(task_pid(task),
+					css_to_drmcs(css)->effective_priority);
+}
+
 static struct drm_cgroup_state root_drmcs = {
 	.priority = DRM_CGROUP_PRIORITY_DEF,
 	.effective_priority = DRM_CGROUP_PRIORITY_DEF,
@@ -158,6 +237,7 @@ struct cgroup_subsys drm_cgrp_subsys = {
 	.css_alloc	= drmcs_alloc,
 	.css_free	= drmcs_free,
 	.css_online	= drmcs_online,
+	.attach		= drmcs_attach,
 	.early_init	= false,
 	.legacy_cftypes	= files,
 	.dfl_cftypes	= files,
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [RFC 06/17] drm: Only track clients which are providing drm_cgroup_ops
  2022-10-19 17:32 [RFC 00/17] DRM scheduling cgroup controller Tvrtko Ursulin
                   ` (4 preceding siblings ...)
  2022-10-19 17:32 ` [RFC 05/17] drm: Connect priority updates to drm core Tvrtko Ursulin
@ 2022-10-19 17:32 ` Tvrtko Ursulin
  2022-10-19 17:32 ` [RFC 07/17] drm/i915: i915 priority Tvrtko Ursulin
                   ` (11 subsequent siblings)
  17 siblings, 0 replies; 28+ messages in thread
From: Tvrtko Ursulin @ 2022-10-19 17:32 UTC (permalink / raw)
  To: Intel-gfx
  Cc: cgroups, linux-kernel, Tejun Heo, Johannes Weiner, Zefan Li,
	Dave Airlie, Daniel Vetter, Rob Clark, Stéphane Marchesin,
	T . J . Mercier, Kenny.Ho, Christian König, Brian Welty,
	Tvrtko Ursulin

From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>

To reduce the number of tracking going on, especially with drivers which
will not support any sort of control from the drm cgroup controller side,
lets express the funcionality as opt-in and use the presence of
drm_cgroup_ops as activation criteria.

Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
---
 drivers/gpu/drm/drm_cgroup.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/drivers/gpu/drm/drm_cgroup.c b/drivers/gpu/drm/drm_cgroup.c
index 0fbb88f08cef..7ed9c7150cae 100644
--- a/drivers/gpu/drm/drm_cgroup.c
+++ b/drivers/gpu/drm/drm_cgroup.c
@@ -16,6 +16,9 @@ void drm_clients_close(struct drm_file *file_priv)
 
 	lockdep_assert_held(&dev->filelist_mutex);
 
+	if (!dev->driver->cg_ops)
+		return;
+
 	clients = xa_load(&drm_pid_clients, pid);
 	list_del_rcu(&file_priv->clink);
 	if (atomic_dec_and_test(&clients->num)) {
@@ -40,6 +43,9 @@ int drm_clients_open(struct drm_file *file_priv)
 
 	lockdep_assert_held(&dev->filelist_mutex);
 
+	if (!dev->driver->cg_ops)
+		return 0;
+
 	clients = xa_load(&drm_pid_clients, pid);
 	if (!clients) {
 		clients = kmalloc(sizeof(*clients), GFP_KERNEL);
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [RFC 07/17] drm/i915: i915 priority
  2022-10-19 17:32 [RFC 00/17] DRM scheduling cgroup controller Tvrtko Ursulin
                   ` (5 preceding siblings ...)
  2022-10-19 17:32 ` [RFC 06/17] drm: Only track clients which are providing drm_cgroup_ops Tvrtko Ursulin
@ 2022-10-19 17:32 ` Tvrtko Ursulin
  2022-10-19 17:32 ` [RFC 08/17] drm: Allow for migration of clients Tvrtko Ursulin
                   ` (10 subsequent siblings)
  17 siblings, 0 replies; 28+ messages in thread
From: Tvrtko Ursulin @ 2022-10-19 17:32 UTC (permalink / raw)
  To: Intel-gfx
  Cc: cgroups, linux-kernel, Tejun Heo, Johannes Weiner, Zefan Li,
	Dave Airlie, Daniel Vetter, Rob Clark, Stéphane Marchesin,
	T . J . Mercier, Kenny.Ho, Christian König, Brian Welty,
	Tvrtko Ursulin

From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>

Register i915 as supporting the drm cgroup controller priority management
and wire it up at execbuf time.

GEM context configured priority then works as a relative value on top of
the base level obtained from the drm cgroup controller.

Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
---
 .../gpu/drm/i915/gem/i915_gem_execbuffer.c    | 27 ++++++++++++++++++-
 drivers/gpu/drm/i915/i915_driver.c            | 10 +++++++
 drivers/gpu/drm/i915/i915_drm_client.c        | 16 +++++++++++
 drivers/gpu/drm/i915/i915_drm_client.h        |  4 +++
 4 files changed, 56 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
index 1160723c9d2d..391c5b5c80be 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
@@ -4,8 +4,10 @@
  * Copyright © 2008,2010 Intel Corporation
  */
 
+#include <linux/cgroup_drm.h>
 #include <linux/dma-resv.h>
 #include <linux/highmem.h>
+#include <linux/minmax.h>
 #include <linux/sync_file.h>
 #include <linux/uaccess.h>
 
@@ -3015,6 +3017,29 @@ static void retire_requests(struct intel_timeline *tl, struct i915_request *end)
 			break;
 }
 
+#ifdef CONFIG_CGROUP_DRM
+static void copy_priority(struct i915_sched_attr *attr,
+			  const struct i915_execbuffer *eb)
+{
+	const int scale = DIV_ROUND_CLOSEST(DRM_CGROUP_PRIORITY_MAX,
+					    I915_CONTEXT_MAX_USER_PRIORITY);
+	int prio;
+
+	*attr = eb->gem_context->sched;
+	prio = attr->priority * scale + eb->file->drm_cgroup_priority;
+	prio = DIV_ROUND_UP(prio, scale);
+	attr->priority = clamp(prio,
+			       I915_CONTEXT_MIN_USER_PRIORITY,
+			       I915_CONTEXT_MAX_USER_PRIORITY);
+}
+#else
+static void copy_priority(struct i915_sched_attr *attr,
+			  const struct i915_execbuffer *eb)
+{
+	*attr = eb->gem_context->sched;
+}
+#endif
+
 static int eb_request_add(struct i915_execbuffer *eb, struct i915_request *rq,
 			  int err, bool last_parallel)
 {
@@ -3031,7 +3056,7 @@ static int eb_request_add(struct i915_execbuffer *eb, struct i915_request *rq,
 
 	/* Check that the context wasn't destroyed before submission */
 	if (likely(!intel_context_is_closed(eb->context))) {
-		attr = eb->gem_context->sched;
+		copy_priority(&attr, eb);
 	} else {
 		/* Serialise with context_close via the add_to_timeline */
 		i915_request_set_error_once(rq, -ENOENT);
diff --git a/drivers/gpu/drm/i915/i915_driver.c b/drivers/gpu/drm/i915/i915_driver.c
index ffff49868dc5..7912782b87cc 100644
--- a/drivers/gpu/drm/i915/i915_driver.c
+++ b/drivers/gpu/drm/i915/i915_driver.c
@@ -1893,6 +1893,12 @@ static const struct drm_ioctl_desc i915_ioctls[] = {
 	DRM_IOCTL_DEF_DRV(I915_GEM_VM_DESTROY, i915_gem_vm_destroy_ioctl, DRM_RENDER_ALLOW),
 };
 
+#ifdef CONFIG_CGROUP_DRM
+static const struct drm_cgroup_ops i915_drm_cgroup_ops = {
+	.priority_levels = i915_drm_priority_levels,
+};
+#endif
+
 /*
  * Interface history:
  *
@@ -1921,6 +1927,10 @@ static const struct drm_driver i915_drm_driver = {
 	.lastclose = i915_driver_lastclose,
 	.postclose = i915_driver_postclose,
 
+#ifdef CONFIG_CGROUP_DRM
+	.cg_ops = &i915_drm_cgroup_ops,
+#endif
+
 	.prime_handle_to_fd = drm_gem_prime_handle_to_fd,
 	.prime_fd_to_handle = drm_gem_prime_fd_to_handle,
 	.gem_prime_import = i915_gem_prime_import,
diff --git a/drivers/gpu/drm/i915/i915_drm_client.c b/drivers/gpu/drm/i915/i915_drm_client.c
index b09d1d386574..61a3cdaa7b16 100644
--- a/drivers/gpu/drm/i915/i915_drm_client.c
+++ b/drivers/gpu/drm/i915/i915_drm_client.c
@@ -75,6 +75,22 @@ void i915_drm_clients_fini(struct i915_drm_clients *clients)
 	xa_destroy(&clients->xarray);
 }
 
+#ifdef CONFIG_CGROUP_DRM
+unsigned int i915_drm_priority_levels(struct drm_file *file)
+{
+	struct drm_i915_file_private *fpriv = file->driver_priv;
+	struct i915_drm_client *client = fpriv->client;
+	struct drm_i915_private *i915 = client->clients->i915;
+
+	if (GRAPHICS_VER(i915) < 8)
+		return 0;
+	else if (intel_uc_uses_guc_submission(&to_gt(i915)->uc))
+		return 3;
+	else
+		return 2047;
+}
+#endif
+
 #ifdef CONFIG_PROC_FS
 static const char * const uabi_class_names[] = {
 	[I915_ENGINE_CLASS_RENDER] = "render",
diff --git a/drivers/gpu/drm/i915/i915_drm_client.h b/drivers/gpu/drm/i915/i915_drm_client.h
index 69496af996d9..bd5925241007 100644
--- a/drivers/gpu/drm/i915/i915_drm_client.h
+++ b/drivers/gpu/drm/i915/i915_drm_client.h
@@ -15,6 +15,8 @@
 
 #define I915_LAST_UABI_ENGINE_CLASS I915_ENGINE_CLASS_COMPUTE
 
+struct drm_file;
+
 struct drm_i915_private;
 
 struct i915_drm_clients {
@@ -65,4 +67,6 @@ void i915_drm_client_fdinfo(struct seq_file *m, struct file *f);
 
 void i915_drm_clients_fini(struct i915_drm_clients *clients);
 
+unsigned int i915_drm_priority_levels(struct drm_file *file);
+
 #endif /* !__I915_DRM_CLIENT_H__ */
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [RFC 08/17] drm: Allow for migration of clients
  2022-10-19 17:32 [RFC 00/17] DRM scheduling cgroup controller Tvrtko Ursulin
                   ` (6 preceding siblings ...)
  2022-10-19 17:32 ` [RFC 07/17] drm/i915: i915 priority Tvrtko Ursulin
@ 2022-10-19 17:32 ` Tvrtko Ursulin
  2022-10-19 17:32 ` [RFC 09/17] cgroup/drm: Introduce weight based drm cgroup control Tvrtko Ursulin
                   ` (9 subsequent siblings)
  17 siblings, 0 replies; 28+ messages in thread
From: Tvrtko Ursulin @ 2022-10-19 17:32 UTC (permalink / raw)
  To: Intel-gfx
  Cc: cgroups, linux-kernel, Tejun Heo, Johannes Weiner, Zefan Li,
	Dave Airlie, Daniel Vetter, Rob Clark, Stéphane Marchesin,
	T . J . Mercier, Kenny.Ho, Christian König, Brian Welty,
	Tvrtko Ursulin

From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>

Add a helper which allows migrating the tracked client from one process to
another.

Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
---
 drivers/gpu/drm/drm_cgroup.c | 111 ++++++++++++++++++++++++++++++-----
 include/drm/drm_clients.h    |   7 +++
 include/drm/drm_file.h       |   1 +
 3 files changed, 103 insertions(+), 16 deletions(-)

diff --git a/drivers/gpu/drm/drm_cgroup.c b/drivers/gpu/drm/drm_cgroup.c
index 7ed9c7150cae..59b730ed1334 100644
--- a/drivers/gpu/drm/drm_cgroup.c
+++ b/drivers/gpu/drm/drm_cgroup.c
@@ -8,9 +8,21 @@
 
 static DEFINE_XARRAY(drm_pid_clients);
 
+static void
+__del_clients(struct drm_pid_clients *clients, struct drm_file *file_priv)
+{
+	list_del_rcu(&file_priv->clink);
+	if (atomic_dec_and_test(&clients->num)) {
+		xa_erase(&drm_pid_clients, (unsigned long)file_priv->cpid);
+		kfree_rcu(clients, rcu);
+	}
+
+	put_pid(file_priv->cpid);
+	file_priv->cpid = NULL;
+}
+
 void drm_clients_close(struct drm_file *file_priv)
 {
-	unsigned long pid = (unsigned long)file_priv->pid;
 	struct drm_device *dev = file_priv->minor->dev;
 	struct drm_pid_clients *clients;
 
@@ -19,19 +31,32 @@ void drm_clients_close(struct drm_file *file_priv)
 	if (!dev->driver->cg_ops)
 		return;
 
-	clients = xa_load(&drm_pid_clients, pid);
-	list_del_rcu(&file_priv->clink);
-	if (atomic_dec_and_test(&clients->num)) {
-		xa_erase(&drm_pid_clients, pid);
-		kfree_rcu(clients, rcu);
+	clients = xa_load(&drm_pid_clients, (unsigned long)file_priv->cpid);
+	if (WARN_ON_ONCE(!clients))
+		return;
 
-		/*
-		 * FIXME: file_priv is not RCU protected so we add this hack
-		 * to avoid any races with code which walks clients->file_list
-		 * and accesses file_priv.
-		 */
-		synchronize_rcu();
+	__del_clients(clients, file_priv);
+
+	/*
+	 * FIXME: file_priv is not RCU protected so we add this hack
+	 * to avoid any races with code which walks clients->file_list
+	 * and accesses file_priv.
+	 */
+	synchronize_rcu();
+}
+
+static struct drm_pid_clients *__alloc_clients(void)
+{
+	struct drm_pid_clients *clients;
+
+	clients = kmalloc(sizeof(*clients), GFP_KERNEL);
+	if (clients) {
+		atomic_set(&clients->num, 0);
+		INIT_LIST_HEAD(&clients->file_list);
+		init_rcu_head(&clients->rcu);
 	}
+
+	return clients;
 }
 
 int drm_clients_open(struct drm_file *file_priv)
@@ -48,12 +73,9 @@ int drm_clients_open(struct drm_file *file_priv)
 
 	clients = xa_load(&drm_pid_clients, pid);
 	if (!clients) {
-		clients = kmalloc(sizeof(*clients), GFP_KERNEL);
+		clients = __alloc_clients();
 		if (!clients)
 			return -ENOMEM;
-		atomic_set(&clients->num, 0);
-		INIT_LIST_HEAD(&clients->file_list);
-		init_rcu_head(&clients->rcu);
 		new_client = true;
 	}
 	atomic_inc(&clients->num);
@@ -69,9 +91,66 @@ int drm_clients_open(struct drm_file *file_priv)
 		}
 	}
 
+	file_priv->cpid = get_pid(file_priv->pid);
+
 	return 0;
 }
 
+void drm_clients_migrate(struct drm_file *file_priv)
+{
+	struct drm_device *dev = file_priv->minor->dev;
+	struct drm_pid_clients *existing_clients;
+	struct drm_pid_clients *clients, *spare;
+	struct pid *pid = task_pid(current);
+
+	if (!dev->driver->cg_ops)
+		return;
+
+	// TODO: only do this if drmcs level property allows it?
+
+	spare = __alloc_clients();
+	if (WARN_ON(!spare))
+		return;
+
+	mutex_lock(&dev->filelist_mutex);
+	rcu_read_lock();
+
+	existing_clients = xa_load(&drm_pid_clients, (unsigned long)pid);
+	clients = xa_load(&drm_pid_clients, (unsigned long)file_priv->cpid);
+
+	if (WARN_ON_ONCE(!clients))
+		goto out_unlock;
+	else if (clients == existing_clients)
+		goto out_unlock;
+
+	__del_clients(clients, file_priv);
+	smp_mb(); /* hmmm? del_rcu followed by add_rcu? */
+
+	if (!existing_clients) {
+		void *xret;
+
+		xret = xa_store(&drm_pid_clients, (unsigned long)pid, spare,
+				GFP_KERNEL);
+		if (WARN_ON(xa_err(xret)))
+			goto out_unlock;
+		clients = spare;
+		spare = NULL;
+	} else {
+		clients = existing_clients;
+	}
+
+	atomic_inc(&clients->num);
+	list_add_tail_rcu(&file_priv->clink, &clients->file_list);
+	file_priv->cpid = get_pid(pid);
+
+out_unlock:
+	rcu_read_unlock();
+	mutex_unlock(&dev->filelist_mutex);
+
+	kfree(spare);
+}
+EXPORT_SYMBOL_GPL(drm_clients_migrate);
+
 unsigned int drm_pid_priority_levels(struct pid *pid, bool *non_uniform)
 {
 	unsigned int min_levels = UINT_MAX;
diff --git a/include/drm/drm_clients.h b/include/drm/drm_clients.h
index 10d21138f7af..3a0b1cdb338f 100644
--- a/include/drm/drm_clients.h
+++ b/include/drm/drm_clients.h
@@ -17,6 +17,8 @@ struct drm_pid_clients {
 #if IS_ENABLED(CONFIG_CGROUP_DRM)
 void drm_clients_close(struct drm_file *file_priv);
 int drm_clients_open(struct drm_file *file_priv);
+
+void drm_clients_migrate(struct drm_file *file_priv);
 #else
 static inline void drm_clients_close(struct drm_file *file_priv)
 {
@@ -26,6 +28,11 @@ static inline int drm_clients_open(struct drm_file *file_priv)
 {
 	return 0;
 }
+
+static inline void drm_clients_migrate(struct drm_file *file_priv)
+{
+
+}
 #endif
 
 unsigned int drm_pid_priority_levels(struct pid *pid, bool *non_uniform);
diff --git a/include/drm/drm_file.h b/include/drm/drm_file.h
index a4360e28e2db..2c1e356d3b73 100644
--- a/include/drm/drm_file.h
+++ b/include/drm/drm_file.h
@@ -280,6 +280,7 @@ struct drm_file {
 
 #if IS_ENABLED(CONFIG_CGROUP_DRM)
 	struct list_head clink;
+	struct pid *cpid;
 #endif
 
 	/**
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [RFC 09/17] cgroup/drm: Introduce weight based drm cgroup control
  2022-10-19 17:32 [RFC 00/17] DRM scheduling cgroup controller Tvrtko Ursulin
                   ` (7 preceding siblings ...)
  2022-10-19 17:32 ` [RFC 08/17] drm: Allow for migration of clients Tvrtko Ursulin
@ 2022-10-19 17:32 ` Tvrtko Ursulin
  2022-10-19 17:32 ` [RFC 10/17] drm: Add ability to query drm cgroup GPU time Tvrtko Ursulin
                   ` (8 subsequent siblings)
  17 siblings, 0 replies; 28+ messages in thread
From: Tvrtko Ursulin @ 2022-10-19 17:32 UTC (permalink / raw)
  To: Intel-gfx
  Cc: cgroups, linux-kernel, Tejun Heo, Johannes Weiner, Zefan Li,
	Dave Airlie, Daniel Vetter, Rob Clark, Stéphane Marchesin,
	T . J . Mercier, Kenny.Ho, Christian König, Brian Welty,
	Tvrtko Ursulin

From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>

Similar to CPU scheduling, implement a concept of weight in the drm cgroup
controller.

Uses the same range and default as the CPU controller - CGROUP_WEIGHT_MIN,
CGROUP_WEIGHT_DFL and CGROUP_WEIGHT_MAX.

Later each cgroup is assigned a time budget proportionaly based on the
relative weights of it's siblings. This time budget is in turn split by
the group's children and so on.

Children of the root cgroup will be exempt from split budgets and
therefore compete for the GPU time independently and without weight based
control.

This will be used to implement a soft, or best effort signal from drm
cgroup to drm core notifying about groups which are over their allotted
budget.

No guarantees that the limit can be enforced are provided or implied.

Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
---
 kernel/cgroup/drm.c | 36 ++++++++++++++++++++++++++++++++++++
 1 file changed, 36 insertions(+)

diff --git a/kernel/cgroup/drm.c b/kernel/cgroup/drm.c
index 01954c3a2087..4b6f88d8236e 100644
--- a/kernel/cgroup/drm.c
+++ b/kernel/cgroup/drm.c
@@ -17,6 +17,7 @@ struct drm_cgroup_state {
 
 	int priority;
 	int effective_priority;
+	unsigned int weight;
 };
 
 static DEFINE_MUTEX(drmcg_mutex);
@@ -163,6 +164,33 @@ drmcs_write_priority(struct cgroup_subsys_state *css, struct cftype *cftype,
 	return 0;
 }
 
+static u64
+drmcs_read_weight(struct cgroup_subsys_state *css, struct cftype *cft)
+{
+	struct drm_cgroup_state *drmcs = css_to_drmcs(css);
+
+	return drmcs->weight;
+}
+
+static int
+drmcs_write_weight(struct cgroup_subsys_state *css, struct cftype *cftype,
+		   u64 weight)
+{
+	struct drm_cgroup_state *drmcs = css_to_drmcs(css);
+	int ret;
+
+	if (weight < CGROUP_WEIGHT_MIN || weight > CGROUP_WEIGHT_MAX)
+		return -ERANGE;
+
+	ret = mutex_lock_interruptible(&drmcg_mutex);
+	if (ret)
+		return ret;
+	drmcs->weight = weight;
+	mutex_unlock(&drmcg_mutex);
+
+	return 0;
+}
+
 static int drmcs_online(struct cgroup_subsys_state *css)
 {
 	struct drm_cgroup_state *drmcs = css_to_drmcs(css);
@@ -210,6 +238,8 @@ drmcs_alloc(struct cgroup_subsys_state *parent_css)
 	if (!drmcs)
 		return ERR_PTR(-ENOMEM);
 
+	drmcs->weight = CGROUP_WEIGHT_DFL;
+
 	return &drmcs->css;
 }
 
@@ -230,6 +260,12 @@ struct cftype files[] = {
 		.flags = CFTYPE_NOT_ON_ROOT,
 		.read_s64 = drmcs_read_effective_priority,
 	},
+	{
+		.name = "weight",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.read_u64 = drmcs_read_weight,
+		.write_u64 = drmcs_write_weight,
+	},
 	{ } /* Zero entry terminates. */
 };
 
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [RFC 10/17] drm: Add ability to query drm cgroup GPU time
  2022-10-19 17:32 [RFC 00/17] DRM scheduling cgroup controller Tvrtko Ursulin
                   ` (8 preceding siblings ...)
  2022-10-19 17:32 ` [RFC 09/17] cgroup/drm: Introduce weight based drm cgroup control Tvrtko Ursulin
@ 2022-10-19 17:32 ` Tvrtko Ursulin
  2022-10-19 17:32 ` [RFC 11/17] drm: Add over budget signalling callback Tvrtko Ursulin
                   ` (7 subsequent siblings)
  17 siblings, 0 replies; 28+ messages in thread
From: Tvrtko Ursulin @ 2022-10-19 17:32 UTC (permalink / raw)
  To: Intel-gfx
  Cc: cgroups, linux-kernel, Tejun Heo, Johannes Weiner, Zefan Li,
	Dave Airlie, Daniel Vetter, Rob Clark, Stéphane Marchesin,
	T . J . Mercier, Kenny.Ho, Christian König, Brian Welty,
	Tvrtko Ursulin

From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>

Add a driver callback and core helper which allow querying the time spent
on GPUs for processes belonging to a group.

Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
---
 drivers/gpu/drm/drm_cgroup.c | 24 ++++++++++++++++++++++++
 include/drm/drm_clients.h    |  1 +
 include/drm/drm_drv.h        |  9 +++++++++
 3 files changed, 34 insertions(+)

diff --git a/drivers/gpu/drm/drm_cgroup.c b/drivers/gpu/drm/drm_cgroup.c
index 59b730ed1334..e0cadb5e5659 100644
--- a/drivers/gpu/drm/drm_cgroup.c
+++ b/drivers/gpu/drm/drm_cgroup.c
@@ -206,3 +206,27 @@ void drm_pid_update_priority(struct pid *pid, int priority)
 	rcu_read_unlock();
 }
 EXPORT_SYMBOL_GPL(drm_pid_update_priority);
+
+u64 drm_pid_get_active_time_us(struct pid *pid)
+{
+	struct drm_pid_clients *clients;
+	u64 total = 0;
+
+	rcu_read_lock();
+	clients = xa_load(&drm_pid_clients, (unsigned long)pid);
+	if (clients) {
+		struct drm_file *fpriv;
+
+		list_for_each_entry_rcu(fpriv, &clients->file_list, clink) {
+			const struct drm_cgroup_ops *cg_ops =
+				fpriv->minor->dev->driver->cg_ops;
+
+			if (cg_ops && cg_ops->active_time_us)
+				total += cg_ops->active_time_us(fpriv);
+		}
+	}
+	rcu_read_unlock();
+
+	return total;
+}
+EXPORT_SYMBOL_GPL(drm_pid_get_active_time_us);
diff --git a/include/drm/drm_clients.h b/include/drm/drm_clients.h
index 3a0b1cdb338f..f25e09ed5feb 100644
--- a/include/drm/drm_clients.h
+++ b/include/drm/drm_clients.h
@@ -37,5 +37,6 @@ static inline void drm_clients_migrate(struct drm_file *file_priv)
 
 unsigned int drm_pid_priority_levels(struct pid *pid, bool *non_uniform);
 void drm_pid_update_priority(struct pid *pid, int priority);
+u64 drm_pid_get_active_time_us(struct pid *pid);
 
 #endif
diff --git a/include/drm/drm_drv.h b/include/drm/drm_drv.h
index 2371d73e12cf..0f1802df01fe 100644
--- a/include/drm/drm_drv.h
+++ b/include/drm/drm_drv.h
@@ -183,6 +183,15 @@ struct drm_cgroup_ops {
 	 * priorities of already running workloads.
 	 */
 	void (*update_priority) (struct drm_file *, int priority);
+
+	/**
+	 * @active_time_us:
+	 *
+	 * Optional callback for reporting the GPU time consumed by this client.
+	 *
+	 * Used by the DRM core when queried by the DRM cgroup controller.
+	 */
+	u64 (*active_time_us) (struct drm_file *);
 };
 
 /**
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [RFC 11/17] drm: Add over budget signalling callback
  2022-10-19 17:32 [RFC 00/17] DRM scheduling cgroup controller Tvrtko Ursulin
                   ` (9 preceding siblings ...)
  2022-10-19 17:32 ` [RFC 10/17] drm: Add ability to query drm cgroup GPU time Tvrtko Ursulin
@ 2022-10-19 17:32 ` Tvrtko Ursulin
  2022-10-19 17:32 ` [RFC 12/17] cgroup/drm: Client exit hook Tvrtko Ursulin
                   ` (6 subsequent siblings)
  17 siblings, 0 replies; 28+ messages in thread
From: Tvrtko Ursulin @ 2022-10-19 17:32 UTC (permalink / raw)
  To: Intel-gfx
  Cc: cgroups, linux-kernel, Tejun Heo, Johannes Weiner, Zefan Li,
	Dave Airlie, Daniel Vetter, Rob Clark, Stéphane Marchesin,
	T . J . Mercier, Kenny.Ho, Christian König, Brian Welty,
	Tvrtko Ursulin

From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>

Add a new callback via which the drm cgroup controller is notifying the
drm core that a certain process is above its allotted GPU time.

Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
---
 drivers/gpu/drm/drm_cgroup.c | 21 +++++++++++++++++++++
 include/drm/drm_clients.h    |  1 +
 include/drm/drm_drv.h        |  8 ++++++++
 3 files changed, 30 insertions(+)

diff --git a/drivers/gpu/drm/drm_cgroup.c b/drivers/gpu/drm/drm_cgroup.c
index e0cadb5e5659..e36bc4333924 100644
--- a/drivers/gpu/drm/drm_cgroup.c
+++ b/drivers/gpu/drm/drm_cgroup.c
@@ -230,3 +230,24 @@ u64 drm_pid_get_active_time_us(struct pid *pid)
 	return total;
 }
 EXPORT_SYMBOL_GPL(drm_pid_get_active_time_us);
+
+void drm_pid_signal_budget(struct pid *pid, u64 usage, u64 budget)
+{
+	struct drm_pid_clients *clients;
+
+	rcu_read_lock();
+	clients = xa_load(&drm_pid_clients, (unsigned long)pid);
+	if (clients) {
+		struct drm_file *fpriv;
+
+		list_for_each_entry_rcu(fpriv, &clients->file_list, clink) {
+			const struct drm_cgroup_ops *cg_ops =
+				fpriv->minor->dev->driver->cg_ops;
+
+			if (cg_ops && cg_ops->signal_budget)
+				cg_ops->signal_budget(fpriv, usage, budget);
+		}
+	}
+	rcu_read_unlock();
+}
+EXPORT_SYMBOL_GPL(drm_pid_signal_budget);
diff --git a/include/drm/drm_clients.h b/include/drm/drm_clients.h
index f25e09ed5feb..7ad09fd0a404 100644
--- a/include/drm/drm_clients.h
+++ b/include/drm/drm_clients.h
@@ -38,5 +38,6 @@ static inline void drm_clients_migrate(struct drm_file *file_priv)
 unsigned int drm_pid_priority_levels(struct pid *pid, bool *non_uniform);
 void drm_pid_update_priority(struct pid *pid, int priority);
 u64 drm_pid_get_active_time_us(struct pid *pid);
+void drm_pid_signal_budget(struct pid *pid, u64 usage, u64 budget);
 
 #endif
diff --git a/include/drm/drm_drv.h b/include/drm/drm_drv.h
index 0f1802df01fe..07dec956ebfb 100644
--- a/include/drm/drm_drv.h
+++ b/include/drm/drm_drv.h
@@ -192,6 +192,14 @@ struct drm_cgroup_ops {
 	 * Used by the DRM core when queried by the DRM cgroup controller.
 	 */
 	u64 (*active_time_us) (struct drm_file *);
+
+	/**
+	 * @signal_budget:
+	 *
+	 * Optional callback used by the DRM core to forward over/under GPU time
+	 * messages sent by the DRM cgroup controller.
+	 */
+	void (*signal_budget) (struct drm_file *, u64 used, u64 budget);
 };
 
 /**
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [RFC 12/17] cgroup/drm: Client exit hook
  2022-10-19 17:32 [RFC 00/17] DRM scheduling cgroup controller Tvrtko Ursulin
                   ` (10 preceding siblings ...)
  2022-10-19 17:32 ` [RFC 11/17] drm: Add over budget signalling callback Tvrtko Ursulin
@ 2022-10-19 17:32 ` Tvrtko Ursulin
  2022-10-19 17:32 ` [RFC 13/17] cgroup/drm: Ability to periodically scan cgroups for over budget GPU usage Tvrtko Ursulin
                   ` (5 subsequent siblings)
  17 siblings, 0 replies; 28+ messages in thread
From: Tvrtko Ursulin @ 2022-10-19 17:32 UTC (permalink / raw)
  To: Intel-gfx
  Cc: cgroups, linux-kernel, Tejun Heo, Johannes Weiner, Zefan Li,
	Dave Airlie, Daniel Vetter, Rob Clark, Stéphane Marchesin,
	T . J . Mercier, Kenny.Ho, Christian König, Brian Welty,
	Tvrtko Ursulin

From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>

We need the ability for DRM core to inform the cgroup controller when a
client has closed a DRM file descriptor. This will allow us not needing
to keep state relating to GPU time usage by tasks sets in the cgroup
controller itself.

Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
---
 drivers/gpu/drm/drm_cgroup.c | 8 ++++++++
 include/linux/cgroup_drm.h   | 1 +
 kernel/cgroup/drm.c          | 8 ++++++++
 3 files changed, 17 insertions(+)

diff --git a/drivers/gpu/drm/drm_cgroup.c b/drivers/gpu/drm/drm_cgroup.c
index e36bc4333924..ff99d1f4f1d4 100644
--- a/drivers/gpu/drm/drm_cgroup.c
+++ b/drivers/gpu/drm/drm_cgroup.c
@@ -5,6 +5,7 @@
 
 #include <drm/drm_drv.h>
 #include <drm/drm_clients.h>
+#include <linux/cgroup_drm.h>
 
 static DEFINE_XARRAY(drm_pid_clients);
 
@@ -25,6 +26,7 @@ void drm_clients_close(struct drm_file *file_priv)
 {
 	struct drm_device *dev = file_priv->minor->dev;
 	struct drm_pid_clients *clients;
+	struct task_struct *task;
 
 	lockdep_assert_held(&dev->filelist_mutex);
 
@@ -35,6 +37,12 @@ void drm_clients_close(struct drm_file *file_priv)
 	if (WARN_ON_ONCE(!clients))
 		return;
 
+	task = get_pid_task(file_priv->cpid, PIDTYPE_PID);
+	if (task) {
+		drmcgroup_client_exited(task);
+		put_task_struct(task);
+	}
+
 	__del_clients(clients, file_priv);
 
 	/*
diff --git a/include/linux/cgroup_drm.h b/include/linux/cgroup_drm.h
index 66063b4708e8..c84516d3e50a 100644
--- a/include/linux/cgroup_drm.h
+++ b/include/linux/cgroup_drm.h
@@ -13,5 +13,6 @@ struct task_struct;
 #define DRM_CGROUP_PRIORITY_MAX	(10000)
 
 int drmcgroup_lookup_effective_priority(struct task_struct *task);
+void drmcgroup_client_exited(struct task_struct *task);
 
 #endif	/* _CGROUP_DRM_H */
diff --git a/kernel/cgroup/drm.c b/kernel/cgroup/drm.c
index 4b6f88d8236e..48f1eaaa1c07 100644
--- a/kernel/cgroup/drm.c
+++ b/kernel/cgroup/drm.c
@@ -221,6 +221,14 @@ static void drmcs_attach(struct cgroup_taskset *tset)
 					css_to_drmcs(css)->effective_priority);
 }
 
+void drmcgroup_client_exited(struct task_struct *task)
+{
+	struct drm_cgroup_state *drmcs = get_task_drmcs(task);
+
+	css_put(&drmcs->css);
+}
+EXPORT_SYMBOL_GPL(drmcgroup_client_exited);
+
 static struct drm_cgroup_state root_drmcs = {
 	.priority = DRM_CGROUP_PRIORITY_DEF,
 	.effective_priority = DRM_CGROUP_PRIORITY_DEF,
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [RFC 13/17] cgroup/drm: Ability to periodically scan cgroups for over budget GPU usage
  2022-10-19 17:32 [RFC 00/17] DRM scheduling cgroup controller Tvrtko Ursulin
                   ` (11 preceding siblings ...)
  2022-10-19 17:32 ` [RFC 12/17] cgroup/drm: Client exit hook Tvrtko Ursulin
@ 2022-10-19 17:32 ` Tvrtko Ursulin
  2022-10-21 22:52   ` T.J. Mercier
  2022-10-19 17:32 ` [RFC 14/17] cgroup/drm: Show group budget signaling capability in sysfs Tvrtko Ursulin
                   ` (4 subsequent siblings)
  17 siblings, 1 reply; 28+ messages in thread
From: Tvrtko Ursulin @ 2022-10-19 17:32 UTC (permalink / raw)
  To: Intel-gfx
  Cc: cgroups, linux-kernel, Tejun Heo, Johannes Weiner, Zefan Li,
	Dave Airlie, Daniel Vetter, Rob Clark, Stéphane Marchesin,
	T . J . Mercier, Kenny.Ho, Christian König, Brian Welty,
	Tvrtko Ursulin

From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>

Add a scanning worker, which if enabled, periodically queries the cgroup
for GPU usage and if over budget (as configured by it's relative weight
share) notifies the drm core about the fact.

This is off by default and can be enabled by configuring a scanning
period using the drm.period_us cgroup control file.

Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
---
 Documentation/admin-guide/cgroup-v2.rst |  35 +-
 kernel/cgroup/drm.c                     | 426 +++++++++++++++++++++++-
 2 files changed, 459 insertions(+), 2 deletions(-)

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 1f3cca4e2572..318f463a1316 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -2401,7 +2401,8 @@ HugeTLB Interface Files
 DRM
 ---
 
-The DRM controller allows configuring static hierarchical scheduling priority.
+The DRM controller allows configuring static hierarchical scheduling priority
+and scheduling soft limits.
 
 DRM static priority control
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -2458,6 +2459,38 @@ DRM static priority interface files
 	Read only integer showing the current effective priority level for the
 	group. Effective meaning taking into account the chain of inherited
 
+DRM scheduling soft limits
+~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Because of the heterogenous hardware and driver DRM capabilities, soft limits
+are implemented as a loose co-operative (bi-directional) interface between the
+controller and DRM core.
+
+The controller configures the GPU time allowed per group and periodically scans
+the belonging tasks to detect the over budget condition, at which point it
+invokes a callback notifying the DRM core of the condition.
+
+DRM core provides an API to query per process GPU utilization and 2nd API to
+receive notification from the cgroup controller when the group enters or exits
+the over budget condition.
+
+Individual DRM drivers which implement the interface are expected to act on this
+in the best-effort manner only. There are no guarantees that the soft limits
+will be respected.
+
+DRM scheduling soft limits interface files
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+  drm.weight
+	Standard cgroup weight based control [1, 10000] used to configure the
+	relative distributing of GPU time between the sibling groups.
+
+  drm.period_us
+	An integer representing the period with which the controller should look
+	at the GPU usage by the group and potentially send the over/under budget
+	signal.
+	Value of zero (defaul) disables the soft limit checking.
+
 Misc
 ----
 
diff --git a/kernel/cgroup/drm.c b/kernel/cgroup/drm.c
index 48f1eaaa1c07..af50ead1564a 100644
--- a/kernel/cgroup/drm.c
+++ b/kernel/cgroup/drm.c
@@ -18,6 +18,29 @@ struct drm_cgroup_state {
 	int priority;
 	int effective_priority;
 	unsigned int weight;
+	unsigned int period_us;
+
+	bool scanning_suspended;
+	unsigned int suspended_period_us;
+
+	struct delayed_work scan_work;
+
+	/*
+	 * Below fields are owned and updated by the scan worker. Either the
+	 * worker accesses them, or worker needs to be suspended and synced
+	 * before they can be touched from the outside.
+	 */
+	bool scanned;
+
+	ktime_t prev_timestamp;
+
+	u64 sum_children_weights;
+	u64 children_active_us;
+	u64 per_s_budget_ns;
+	u64 prev_active_us;
+	u64 active_us;
+
+	bool over_budget;
 };
 
 static DEFINE_MUTEX(drmcg_mutex);
@@ -33,6 +56,31 @@ static inline struct drm_cgroup_state *get_task_drmcs(struct task_struct *task)
 	return css_to_drmcs(task_get_css(task, drm_cgrp_id));
 }
 
+static u64 drmcs_get_active_time_us(struct drm_cgroup_state *drmcs)
+{
+	struct cgroup *cgrp = drmcs->css.cgroup;
+	struct task_struct *task;
+	struct css_task_iter it;
+	u64 total = 0;
+
+	css_task_iter_start(&cgrp->self,
+			    CSS_TASK_ITER_PROCS | CSS_TASK_ITER_THREADED,
+			    &it);
+	while ((task = css_task_iter_next(&it))) {
+		u64 time;
+
+		/* Ignore kernel threads here. */
+		if (task->flags & PF_KTHREAD)
+			continue;
+
+		time = drm_pid_get_active_time_us(task_pid(task));
+		total += time;
+	}
+	css_task_iter_end(&it);
+
+	return total;
+}
+
 int drmcgroup_lookup_effective_priority(struct task_struct *task)
 {
 	struct drm_cgroup_state *drmcs = get_task_drmcs(task);
@@ -202,9 +250,301 @@ static int drmcs_online(struct cgroup_subsys_state *css)
 	return 0;
 }
 
+static void
+signal_drm_budget(struct drm_cgroup_state *drmcs, u64 usage, u64 budget)
+{
+	struct cgroup *cgrp = drmcs->css.cgroup;
+	struct task_struct *task;
+	struct css_task_iter it;
+
+	css_task_iter_start(&cgrp->self,
+			    CSS_TASK_ITER_PROCS | CSS_TASK_ITER_THREADED,
+			    &it);
+	while ((task = css_task_iter_next(&it))) {
+		/* Ignore kernel threads here. */
+		if (task->flags & PF_KTHREAD)
+			continue;
+
+		drm_pid_signal_budget(task_pid(task), usage, budget);
+	}
+	css_task_iter_end(&it);
+}
+
+static bool __start_scanning(struct drm_cgroup_state *root)
+{
+	struct cgroup_subsys_state *node;
+	bool ok = true;
+
+	rcu_read_lock();
+	css_for_each_descendant_pre(node, &root->css) {
+		struct drm_cgroup_state *drmcs = css_to_drmcs(node);
+		unsigned long active;
+
+		if (!css_tryget_online(node)) {
+			ok = false;
+			continue;
+		}
+
+		drmcs->scanned = false;
+		drmcs->sum_children_weights = 0;
+		drmcs->children_active_us = 0;
+		if (node == &root->css)
+			drmcs->per_s_budget_ns = NSEC_PER_SEC;
+		else
+			drmcs->per_s_budget_ns = 0;
+
+		active = drmcs_get_active_time_us(drmcs);
+		if (active >= drmcs->prev_active_us)
+			drmcs->active_us = active - drmcs->prev_active_us;
+		else
+			drmcs->active_us = 0;
+		drmcs->prev_active_us = active;
+
+		css_put(node);
+	}
+	rcu_read_unlock();
+
+	return ok;
+}
+
+static void scan_worker(struct work_struct *work)
+{
+	struct drm_cgroup_state *root =
+		container_of(work, typeof(*root), scan_work.work);
+	struct cgroup_subsys_state *node;
+	unsigned int period_us;
+	ktime_t now;
+
+	rcu_read_lock();
+
+	if (WARN_ON_ONCE(!css_tryget_online(&root->css)))
+		return;
+
+	/*
+	 * 1st pass - reset accumulated values and update group GPU activity.
+	 */
+	if (!__start_scanning(root))
+		goto out_retry; /*
+				 * Always come back later if scanner races with
+				 * core cgroup management. (Repeated pattern.)
+				 */
+
+	now = ktime_get();
+	period_us = ktime_to_us(ktime_sub(now, root->prev_timestamp));
+	root->prev_timestamp = now;
+
+	/*
+	 * 2nd pass - calculate accumulated GPU activity and relative weights
+	 * for each parent's children.
+	 */
+	css_for_each_descendant_pre(node, &root->css) {
+		struct drm_cgroup_state *drmcs = css_to_drmcs(node);
+
+		if (!css_tryget_online(node))
+			goto out_retry;
+
+		if (!drmcs->scanned) {
+			struct cgroup_subsys_state *css;
+
+			css_for_each_child(css, &drmcs->css) {
+				struct drm_cgroup_state *sibling =
+							css_to_drmcs(css);
+
+				if (!css_tryget_online(css)) {
+					css_put(node);
+					goto out_retry;
+				}
+
+				drmcs->children_active_us += sibling->active_us;
+				drmcs->sum_children_weights += sibling->weight;
+
+				css_put(css);
+			}
+
+			drmcs->scanned = true;
+		}
+
+		css_put(node);
+	}
+
+	/*
+	 * 3rd pass - calculate relative budgets for each group based on
+	 * relative weights and parent's budget.
+	 *
+	 * FIXME: This is for now incomplete in more than one way. There is
+	 * no downward propagation of unused budgets, and even no utilisation of
+	 * the unused budgets at all.
+	 */
+	css_for_each_descendant_pre(node, &root->css) {
+		struct drm_cgroup_state *drmcs, *pdrmcs;
+		bool over, was_over;
+		u64 budget;
+
+		if (!css_tryget_online(node))
+			goto out_retry;
+		if (node->cgroup->level == 1) {
+			css_put(node);
+			continue;
+		}
+		if (!css_tryget_online(node->parent)) {
+			css_put(node);
+			goto out_retry;
+		}
+
+		drmcs = css_to_drmcs(node);
+		pdrmcs = css_to_drmcs(node->parent);
+
+		drmcs->per_s_budget_ns  =
+			DIV_ROUND_UP_ULL(pdrmcs->per_s_budget_ns *
+					 drmcs->weight,
+					 pdrmcs->sum_children_weights);
+		budget = DIV_ROUND_UP_ULL(drmcs->per_s_budget_ns * period_us,
+					  NSEC_PER_SEC);
+		over = drmcs->active_us > budget;
+		was_over = drmcs->over_budget;
+		drmcs->over_budget = over;
+		if (over || (!over && was_over))
+			signal_drm_budget(drmcs, drmcs->active_us, budget);
+
+		css_put(node);
+		css_put(node->parent);
+	}
+
+out_retry:
+	rcu_read_unlock();
+
+	period_us = READ_ONCE(root->period_us);
+	if (period_us)
+		schedule_delayed_work(&root->scan_work,
+				      usecs_to_jiffies(period_us));
+
+	css_put(&root->css);
+}
+
+static void start_scanning(struct drm_cgroup_state *drmcs, u64 period_us)
+{
+	drmcs->period_us = (unsigned int)period_us;
+	WARN_ON_ONCE(!__start_scanning(drmcs));
+	drmcs->prev_timestamp = ktime_get();
+	mod_delayed_work(system_wq, &drmcs->scan_work,
+			 usecs_to_jiffies(period_us));
+}
+
+static void stop_scanning(struct drm_cgroup_state *drmcs)
+{
+	drmcs->period_us = 0;
+	cancel_delayed_work_sync(&drmcs->scan_work);
+	if (drmcs->over_budget) {
+		/*
+		 * Signal under budget when scanning goes off so drivers
+		 * correctly update their state.
+		 */
+		signal_drm_budget(drmcs, 0, drmcs->per_s_budget_ns);
+		drmcs->over_budget = false;
+	}
+}
+
+static struct drm_cgroup_state *drmcs_scanner(struct drm_cgroup_state *drmcs)
+{
+	while (drmcs->css.cgroup->level > 1)
+		drmcs = css_to_drmcs(drmcs->css.parent);
+
+	return drmcs;
+}
+
+static void start_suspend_scanning(struct drm_cgroup_state *drmcs)
+{
+	drmcs = drmcs_scanner(drmcs);
+
+	if (drmcs->scanning_suspended)
+		return;
+
+	drmcs->scanning_suspended = true;
+	drmcs->suspended_period_us = drmcs->period_us;
+	drmcs->period_us = 0;
+}
+
+static void finish_suspend_scanning(struct drm_cgroup_state *drmcs)
+{
+	drmcs = drmcs_scanner(drmcs);
+
+	if (drmcs->suspended_period_us)
+		cancel_delayed_work_sync(&drmcs->scan_work);
+}
+
+static void resume_scanning(struct drm_cgroup_state *drmcs)
+{
+	drmcs = drmcs_scanner(drmcs);
+
+	if (!drmcs->scanning_suspended)
+		return;
+
+	drmcs->scanning_suspended = false;
+	if (drmcs->suspended_period_us) {
+		start_scanning(drmcs, drmcs->suspended_period_us);
+		drmcs->suspended_period_us = 0;
+	}
+}
+
 static void drmcs_free(struct cgroup_subsys_state *css)
 {
-	kfree(css_to_drmcs(css));
+	struct drm_cgroup_state *drmcs = css_to_drmcs(css);
+
+	stop_scanning(drmcs);
+
+	kfree(drmcs);
+}
+
+static int drmcs_can_attach(struct cgroup_taskset *tset)
+{
+	struct cgroup_subsys_state *new_css;
+	struct task_struct *task;
+	int ret;
+
+	/*
+	 * As processes are getting moved between groups we need to ensure
+	 * both that the old group does not see a sudden downward jump in the
+	 * GPU utilisation, and that the new group does not see a sudden jump
+	 * up with all the GPU time clients belonging to the migrated process
+	 * have accumulated.
+	 *
+	 * To achieve that we suspend the scanner until the migration is
+	 * completed where the resume at the end ensures both groups start
+	 * observing GPU utilisation from a reset state.
+	 */
+
+	ret = mutex_lock_interruptible(&drmcg_mutex);
+	if (ret)
+		return ret;
+
+	cgroup_taskset_for_each(task, new_css, tset) {
+		start_suspend_scanning(css_to_drmcs(task_css(task,
+							     drm_cgrp_id)));
+		start_suspend_scanning(css_to_drmcs(new_css));
+	}
+
+	mutex_unlock(&drmcg_mutex);
+
+	cgroup_taskset_for_each(task, new_css, tset) {
+		finish_suspend_scanning(css_to_drmcs(task_css(task,
+							      drm_cgrp_id)));
+		finish_suspend_scanning(css_to_drmcs(new_css));
+	}
+
+	return 0;
+}
+
+static void tset_resume_scanning(struct cgroup_taskset *tset)
+{
+	struct cgroup_subsys_state *new_css;
+	struct task_struct *task;
+
+	mutex_lock(&drmcg_mutex);
+	cgroup_taskset_for_each(task, new_css, tset) {
+		resume_scanning(css_to_drmcs(task_css(task, drm_cgrp_id)));
+		resume_scanning(css_to_drmcs(new_css));
+	}
+	mutex_unlock(&drmcg_mutex);
 }
 
 static void drmcs_attach(struct cgroup_taskset *tset)
@@ -219,12 +559,86 @@ static void drmcs_attach(struct cgroup_taskset *tset)
 	cgroup_taskset_for_each(task, css, tset)
 		drm_pid_update_priority(task_pid(task),
 					css_to_drmcs(css)->effective_priority);
+
+	tset_resume_scanning(tset);
+}
+
+static void drmcs_cancel_attach(struct cgroup_taskset *tset)
+{
+	tset_resume_scanning(tset);
+}
+
+static u64
+drmcs_read_period_us(struct cgroup_subsys_state *css, struct cftype *cft)
+{
+	struct drm_cgroup_state *drmcs = css_to_drmcs(css);
+
+	return drmcs->period_us;
+}
+
+static int
+drmcs_write_period_us(struct cgroup_subsys_state *css, struct cftype *cftype,
+		      u64 period_us)
+{
+	struct drm_cgroup_state *drmcs = css_to_drmcs(css);
+	int ret;
+
+	if (WARN_ON_ONCE(!css->parent))
+		return -EINVAL;
+	if (css->cgroup->level != 1)
+		return -EINVAL;
+	if ((period_us && period_us < 500000) || period_us > USEC_PER_SEC * 60)
+		return -EINVAL;
+
+	ret = mutex_lock_interruptible(&drmcg_mutex);
+	if (ret)
+		return ret;
+
+	if (!drmcs->scanning_suspended) {
+		if (period_us)
+			start_scanning(drmcs, period_us);
+		else
+			stop_scanning(drmcs);
+	} else {
+		/*
+		 * If scanning is temporarily suspended just update the period
+		 * which will apply once resumed, or simply skip resuming in
+		 * case of disabling.
+		 */
+		drmcs->suspended_period_us = period_us;
+		if (!period_us)
+			drmcs->scanning_suspended = false;
+	}
+
+	mutex_unlock(&drmcg_mutex);
+
+	return 0;
 }
 
 void drmcgroup_client_exited(struct task_struct *task)
 {
 	struct drm_cgroup_state *drmcs = get_task_drmcs(task);
 
+	/*
+	 * Since we are not tracking accumulated GPU time for each cgroup,
+	 * avoid jumps in group observed GPU usage by re-setting the scanner
+	 * at a point when GPU usage can suddenly jump down.
+	 *
+	 * Downside is clients can influence the effectiveness of the over-
+	 * budget scanning by continuosly closing DRM file descriptors but for
+	 * now we do not worry about it.
+	 */
+
+	mutex_lock(&drmcg_mutex);
+	start_suspend_scanning(drmcs);
+	mutex_unlock(&drmcg_mutex);
+
+	finish_suspend_scanning(drmcs);
+
+	mutex_lock(&drmcg_mutex);
+	resume_scanning(drmcs);
+	mutex_unlock(&drmcg_mutex);
+
 	css_put(&drmcs->css);
 }
 EXPORT_SYMBOL_GPL(drmcgroup_client_exited);
@@ -232,6 +646,7 @@ EXPORT_SYMBOL_GPL(drmcgroup_client_exited);
 static struct drm_cgroup_state root_drmcs = {
 	.priority = DRM_CGROUP_PRIORITY_DEF,
 	.effective_priority = DRM_CGROUP_PRIORITY_DEF,
+	.weight = CGROUP_WEIGHT_DFL,
 };
 
 static struct cgroup_subsys_state *
@@ -247,6 +662,7 @@ drmcs_alloc(struct cgroup_subsys_state *parent_css)
 		return ERR_PTR(-ENOMEM);
 
 	drmcs->weight = CGROUP_WEIGHT_DFL;
+	INIT_DELAYED_WORK(&drmcs->scan_work, scan_worker);
 
 	return &drmcs->css;
 }
@@ -274,6 +690,12 @@ struct cftype files[] = {
 		.read_u64 = drmcs_read_weight,
 		.write_u64 = drmcs_write_weight,
 	},
+	{
+		.name = "period_us",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.read_u64 = drmcs_read_period_us,
+		.write_u64 = drmcs_write_period_us,
+	},
 	{ } /* Zero entry terminates. */
 };
 
@@ -281,7 +703,9 @@ struct cgroup_subsys drm_cgrp_subsys = {
 	.css_alloc	= drmcs_alloc,
 	.css_free	= drmcs_free,
 	.css_online	= drmcs_online,
+	.can_attach	= drmcs_can_attach,
 	.attach		= drmcs_attach,
+	.cancel_attach	= drmcs_cancel_attach,
 	.early_init	= false,
 	.legacy_cftypes	= files,
 	.dfl_cftypes	= files,
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [RFC 14/17] cgroup/drm: Show group budget signaling capability in sysfs
  2022-10-19 17:32 [RFC 00/17] DRM scheduling cgroup controller Tvrtko Ursulin
                   ` (12 preceding siblings ...)
  2022-10-19 17:32 ` [RFC 13/17] cgroup/drm: Ability to periodically scan cgroups for over budget GPU usage Tvrtko Ursulin
@ 2022-10-19 17:32 ` Tvrtko Ursulin
  2022-10-19 17:32 ` [RFC 15/17] drm/i915: Migrate client to new owner on context create Tvrtko Ursulin
                   ` (3 subsequent siblings)
  17 siblings, 0 replies; 28+ messages in thread
From: Tvrtko Ursulin @ 2022-10-19 17:32 UTC (permalink / raw)
  To: Intel-gfx
  Cc: cgroups, linux-kernel, Tejun Heo, Johannes Weiner, Zefan Li,
	Dave Airlie, Daniel Vetter, Rob Clark, Stéphane Marchesin,
	T . J . Mercier, Kenny.Ho, Christian König, Brian Welty,
	Tvrtko Ursulin

From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>

Show overall status of a task group - whether all DRM clients in a group
support over budget signaling, or some do not, or if there are no DRM
clients in the group to start with.

Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
---
 Documentation/admin-guide/cgroup-v2.rst |  7 ++++
 drivers/gpu/drm/drm_cgroup.c            | 33 ++++++++++++++++
 include/drm/drm_clients.h               |  7 ++++
 include/drm/drm_drv.h                   | 11 +++++-
 kernel/cgroup/drm.c                     | 52 +++++++++++++++++++++++++
 5 files changed, 109 insertions(+), 1 deletion(-)

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 318f463a1316..6ee94ee109f0 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -2491,6 +2491,13 @@ DRM scheduling soft limits interface files
 	signal.
 	Value of zero (defaul) disables the soft limit checking.
 
+  drm.budget_supported
+	One of:
+	 1) 'yes' - when all DRM clients in the group support the functionality.
+	 2) 'no' - when at least one of the DRM clients does not support the
+		   functionality.
+	 3) 'n/a' - when there are no DRM clients in the group.
+
 Misc
 ----
 
diff --git a/drivers/gpu/drm/drm_cgroup.c b/drivers/gpu/drm/drm_cgroup.c
index ff99d1f4f1d4..d2d8b2cb4ab3 100644
--- a/drivers/gpu/drm/drm_cgroup.c
+++ b/drivers/gpu/drm/drm_cgroup.c
@@ -259,3 +259,36 @@ void drm_pid_signal_budget(struct pid *pid, u64 usage, u64 budget)
 	rcu_read_unlock();
 }
 EXPORT_SYMBOL_GPL(drm_pid_signal_budget);
+
+enum drm_cg_supported drm_pid_signal_budget_supported(struct pid *pid)
+{
+	enum drm_cg_supported supported = DRM_CG_NOT_APPLICABLE;
+	struct drm_pid_clients *clients;
+
+	rcu_read_lock();
+	clients = xa_load(&drm_pid_clients, (unsigned long)pid);
+	if (clients) {
+		struct drm_file *fpriv;
+
+		list_for_each_entry_rcu(fpriv, &clients->file_list, clink) {
+			const struct drm_cgroup_ops *cg_ops =
+				fpriv->minor->dev->driver->cg_ops;
+
+			if (!cg_ops ||
+			    !cg_ops->active_time_us ||
+			    !cg_ops->signal_budget ||
+			    cg_ops->signal_budget(fpriv, 0, 0) < 0) {
+				supported = DRM_CG_NOT_SUPPORTED;
+				break;
+			}
+
+			if (supported == DRM_CG_NOT_APPLICABLE)
+				supported = DRM_CG_SUPPORTED;
+
+		}
+	}
+	rcu_read_unlock();
+
+	return supported;
+}
+EXPORT_SYMBOL_GPL(drm_pid_signal_budget_supported);
diff --git a/include/drm/drm_clients.h b/include/drm/drm_clients.h
index 7ad09fd0a404..5d14ae26ece6 100644
--- a/include/drm/drm_clients.h
+++ b/include/drm/drm_clients.h
@@ -14,6 +14,12 @@ struct drm_pid_clients {
 	struct rcu_head rcu;
 };
 
+enum drm_cg_supported {
+	DRM_CG_NOT_APPLICABLE = -1,
+	DRM_CG_NOT_SUPPORTED = 0,
+	DRM_CG_SUPPORTED
+};
+
 #if IS_ENABLED(CONFIG_CGROUP_DRM)
 void drm_clients_close(struct drm_file *file_priv);
 int drm_clients_open(struct drm_file *file_priv);
@@ -39,5 +45,6 @@ unsigned int drm_pid_priority_levels(struct pid *pid, bool *non_uniform);
 void drm_pid_update_priority(struct pid *pid, int priority);
 u64 drm_pid_get_active_time_us(struct pid *pid);
 void drm_pid_signal_budget(struct pid *pid, u64 usage, u64 budget);
+enum drm_cg_supported drm_pid_signal_budget_supported(struct pid *pid);
 
 #endif
diff --git a/include/drm/drm_drv.h b/include/drm/drm_drv.h
index 07dec956ebfb..7a1a20d1b8de 100644
--- a/include/drm/drm_drv.h
+++ b/include/drm/drm_drv.h
@@ -198,8 +198,17 @@ struct drm_cgroup_ops {
 	 *
 	 * Optional callback used by the DRM core to forward over/under GPU time
 	 * messages sent by the DRM cgroup controller.
+	 *
+	 * Zero used with zero budget is a special budgeting support status
+	 * query which needs to return either zero or -EINVAL if client does not
+	 * support budget control.
+	 *
+	 * Returns:
+	 * 	* 1 when client has been throttled.
+	 * 	* 0 when no action has been taken.
+	 * 	* -EINVAL when not supported by the client.
 	 */
-	void (*signal_budget) (struct drm_file *, u64 used, u64 budget);
+	int (*signal_budget) (struct drm_file *, u64 used, u64 budget);
 };
 
 /**
diff --git a/kernel/cgroup/drm.c b/kernel/cgroup/drm.c
index af50ead1564a..dd7db70c2831 100644
--- a/kernel/cgroup/drm.c
+++ b/kernel/cgroup/drm.c
@@ -239,6 +239,53 @@ drmcs_write_weight(struct cgroup_subsys_state *css, struct cftype *cftype,
 	return 0;
 }
 
+static int drmcs_show_budget_supported(struct seq_file *sf, void *v)
+{
+	struct drm_cgroup_state *drmcs = css_to_drmcs(seq_css(sf));
+	enum drm_cg_supported overall = DRM_CG_NOT_APPLICABLE;
+	struct cgroup *cgrp = drmcs->css.cgroup;
+	struct task_struct *task;
+	struct css_task_iter it;
+
+	css_task_iter_start(&cgrp->self,
+			    CSS_TASK_ITER_PROCS | CSS_TASK_ITER_THREADED,
+			    &it);
+	while ((task = css_task_iter_next(&it))) {
+		enum drm_cg_supported supported;
+
+		/* Ignore kernel threads here. */
+		if (task->flags & PF_KTHREAD)
+			continue;
+
+		supported = drm_pid_signal_budget_supported(task_pid(task));
+		if (supported == DRM_CG_SUPPORTED &&
+		    overall == DRM_CG_NOT_APPLICABLE) {
+			overall = DRM_CG_SUPPORTED;
+		} else if (supported == DRM_CG_NOT_SUPPORTED) {
+			overall = DRM_CG_NOT_SUPPORTED;
+			break;
+		}
+	}
+	css_task_iter_end(&it);
+
+	switch (overall) {
+	case DRM_CG_NOT_APPLICABLE:
+		seq_puts(sf, "n/a\n");
+		break;
+	case DRM_CG_NOT_SUPPORTED:
+		seq_puts(sf, "no\n");
+		break;
+	case DRM_CG_SUPPORTED:
+		seq_puts(sf, "yes\n");
+		break;
+	default:
+		seq_printf(sf, "%u\n", overall);
+		break;
+	}
+
+	return 0;
+}
+
 static int drmcs_online(struct cgroup_subsys_state *css)
 {
 	struct drm_cgroup_state *drmcs = css_to_drmcs(css);
@@ -690,6 +737,11 @@ struct cftype files[] = {
 		.read_u64 = drmcs_read_weight,
 		.write_u64 = drmcs_write_weight,
 	},
+	{
+		.name = "budget_supported",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.seq_show = drmcs_show_budget_supported,
+	},
 	{
 		.name = "period_us",
 		.flags = CFTYPE_NOT_ON_ROOT,
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [RFC 15/17] drm/i915: Migrate client to new owner on context create
  2022-10-19 17:32 [RFC 00/17] DRM scheduling cgroup controller Tvrtko Ursulin
                   ` (13 preceding siblings ...)
  2022-10-19 17:32 ` [RFC 14/17] cgroup/drm: Show group budget signaling capability in sysfs Tvrtko Ursulin
@ 2022-10-19 17:32 ` Tvrtko Ursulin
  2022-10-19 17:32 ` [RFC 16/17] drm/i915: Wire up with drm controller GPU time query Tvrtko Ursulin
                   ` (2 subsequent siblings)
  17 siblings, 0 replies; 28+ messages in thread
From: Tvrtko Ursulin @ 2022-10-19 17:32 UTC (permalink / raw)
  To: Intel-gfx
  Cc: cgroups, linux-kernel, Tejun Heo, Johannes Weiner, Zefan Li,
	Dave Airlie, Daniel Vetter, Rob Clark, Stéphane Marchesin,
	T . J . Mercier, Kenny.Ho, Christian König, Brian Welty,
	Tvrtko Ursulin

From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>

Some usage models pass a drm file descriptor from a creating process to
the client which will actually use it.

Use the core drm helper on GEM context create to account for this and
ensure client's resource usage is tracked in the correct cgroup.

Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
---
 drivers/gpu/drm/i915/gem/i915_gem_context.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/gpu/drm/i915/gem/i915_gem_context.c b/drivers/gpu/drm/i915/gem/i915_gem_context.c
index 1e29b1e6d186..5f6af306e147 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_context.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_context.c
@@ -69,6 +69,7 @@
 #include <linux/nospec.h>
 
 #include <drm/drm_cache.h>
+#include <drm/drm_clients.h>
 #include <drm/drm_syncobj.h>
 
 #include "gt/gen6_ppgtt.h"
@@ -2300,6 +2301,8 @@ int i915_gem_context_create_ioctl(struct drm_device *dev, void *data,
 	args->ctx_id = id;
 	drm_dbg(&i915->drm, "HW context %d created\n", args->ctx_id);
 
+	drm_clients_migrate(file);
+
 	return 0;
 
 err_pc:
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [RFC 16/17] drm/i915: Wire up with drm controller GPU time query
  2022-10-19 17:32 [RFC 00/17] DRM scheduling cgroup controller Tvrtko Ursulin
                   ` (14 preceding siblings ...)
  2022-10-19 17:32 ` [RFC 15/17] drm/i915: Migrate client to new owner on context create Tvrtko Ursulin
@ 2022-10-19 17:32 ` Tvrtko Ursulin
  2022-10-19 17:32 ` [RFC 17/17] drm/i915: Implement cgroup controller over budget throttling Tvrtko Ursulin
  2022-10-19 18:45 ` [RFC 00/17] DRM scheduling cgroup controller Tejun Heo
  17 siblings, 0 replies; 28+ messages in thread
From: Tvrtko Ursulin @ 2022-10-19 17:32 UTC (permalink / raw)
  To: Intel-gfx
  Cc: cgroups, linux-kernel, Tejun Heo, Johannes Weiner, Zefan Li,
	Dave Airlie, Daniel Vetter, Rob Clark, Stéphane Marchesin,
	T . J . Mercier, Kenny.Ho, Christian König, Brian Welty,
	Tvrtko Ursulin

From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>

Implement the drm_cgroup_ops->active_time_us callback.

Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
---
 drivers/gpu/drm/i915/i915_driver.c     |   1 +
 drivers/gpu/drm/i915/i915_drm_client.c | 106 +++++++++++++++++++------
 drivers/gpu/drm/i915/i915_drm_client.h |   2 +
 3 files changed, 83 insertions(+), 26 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_driver.c b/drivers/gpu/drm/i915/i915_driver.c
index 7912782b87cc..b949fd715202 100644
--- a/drivers/gpu/drm/i915/i915_driver.c
+++ b/drivers/gpu/drm/i915/i915_driver.c
@@ -1896,6 +1896,7 @@ static const struct drm_ioctl_desc i915_ioctls[] = {
 #ifdef CONFIG_CGROUP_DRM
 static const struct drm_cgroup_ops i915_drm_cgroup_ops = {
 	.priority_levels = i915_drm_priority_levels,
+	.active_time_us = i915_drm_cgroup_get_active_time_us,
 };
 #endif
 
diff --git a/drivers/gpu/drm/i915/i915_drm_client.c b/drivers/gpu/drm/i915/i915_drm_client.c
index 61a3cdaa7b16..8527fe80d449 100644
--- a/drivers/gpu/drm/i915/i915_drm_client.c
+++ b/drivers/gpu/drm/i915/i915_drm_client.c
@@ -75,23 +75,7 @@ void i915_drm_clients_fini(struct i915_drm_clients *clients)
 	xa_destroy(&clients->xarray);
 }
 
-#ifdef CONFIG_CGROUP_DRM
-unsigned int i915_drm_priority_levels(struct drm_file *file)
-{
-	struct drm_i915_file_private *fpriv = file->driver_priv;
-	struct i915_drm_client *client = fpriv->client;
-	struct drm_i915_private *i915 = client->clients->i915;
-
-	if (GRAPHICS_VER(i915) < 8)
-		return 0;
-	else if (intel_uc_uses_guc_submission(&to_gt(i915)->uc))
-		return 3;
-	else
-		return 2047;
-}
-#endif
-
-#ifdef CONFIG_PROC_FS
+#if defined(CONFIG_PROC_FS) || defined(CONFIG_CGROUP_DRM)
 static const char * const uabi_class_names[] = {
 	[I915_ENGINE_CLASS_RENDER] = "render",
 	[I915_ENGINE_CLASS_COPY] = "copy",
@@ -116,22 +100,92 @@ static u64 busy_add(struct i915_gem_context *ctx, unsigned int class)
 	return total;
 }
 
-static void
-show_client_class(struct seq_file *m,
-		  struct i915_drm_client *client,
-		  unsigned int class)
+static u64 get_class_active_ns(struct i915_drm_client *client,
+			       unsigned int class,
+			       unsigned int *capacity)
 {
-	const struct list_head *list = &client->ctx_list;
-	u64 total = atomic64_read(&client->past_runtime[class]);
-	const unsigned int capacity =
-		client->clients->i915->engine_uabi_class_count[class];
 	struct i915_gem_context *ctx;
+	u64 total;
+
+	*capacity =
+	    client->clients->i915->engine_uabi_class_count[class];
+	if (!*capacity)
+		return 0;
+
+	total = atomic64_read(&client->past_runtime[class]);
 
 	rcu_read_lock();
-	list_for_each_entry_rcu(ctx, list, client_link)
+	list_for_each_entry_rcu(ctx, &client->ctx_list, client_link)
 		total += busy_add(ctx, class);
 	rcu_read_unlock();
 
+	return total;
+}
+#endif
+
+#ifdef CONFIG_CGROUP_DRM
+unsigned int i915_drm_priority_levels(struct drm_file *file)
+{
+	struct drm_i915_file_private *fpriv = file->driver_priv;
+	struct i915_drm_client *client = fpriv->client;
+	struct drm_i915_private *i915 = client->clients->i915;
+
+	if (GRAPHICS_VER(i915) < 8)
+		return 0;
+	else if (intel_uc_uses_guc_submission(&to_gt(i915)->uc))
+		return 3;
+	else
+		return 2047;
+}
+
+static bool supports_stats(struct drm_i915_private *i915)
+{
+	if (GRAPHICS_VER(i915) < 8)
+		return false;
+
+	/* temporary... */
+	if (intel_uc_uses_guc_submission(&to_gt(i915)->uc))
+		return false;
+
+	return true;
+}
+
+u64 i915_drm_cgroup_get_active_time_us(struct drm_file *file)
+{
+	struct drm_i915_file_private *fpriv = file->driver_priv;
+	struct i915_drm_client *client = fpriv->client;
+	unsigned int i;
+	u64 busy = 0;
+
+	if (!supports_stats(client->clients->i915))
+		return 0;
+
+	for (i = 0; i < ARRAY_SIZE(uabi_class_names); i++) {
+		unsigned int capacity;
+		u64 b;
+
+		b = get_class_active_ns(client, i, &capacity);
+		if (capacity) {
+			b = DIV_ROUND_UP_ULL(b, capacity * 1000);
+			busy += b;
+		}
+	}
+
+	return busy;
+}
+#endif
+
+#ifdef CONFIG_PROC_FS
+static void
+show_client_class(struct seq_file *m,
+		  struct i915_drm_client *client,
+		  unsigned int class)
+{
+	unsigned int capacity;
+	u64 total;
+
+	total = get_class_active_ns(client, class, &capacity);
+
 	if (capacity)
 		seq_printf(m, "drm-engine-%s:\t%llu ns\n",
 			   uabi_class_names[class], total);
diff --git a/drivers/gpu/drm/i915/i915_drm_client.h b/drivers/gpu/drm/i915/i915_drm_client.h
index bd5925241007..99b8ae01c183 100644
--- a/drivers/gpu/drm/i915/i915_drm_client.h
+++ b/drivers/gpu/drm/i915/i915_drm_client.h
@@ -69,4 +69,6 @@ void i915_drm_clients_fini(struct i915_drm_clients *clients);
 
 unsigned int i915_drm_priority_levels(struct drm_file *file);
 
+u64 i915_drm_cgroup_get_active_time_us(struct drm_file *file);
+
 #endif /* !__I915_DRM_CLIENT_H__ */
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [RFC 17/17] drm/i915: Implement cgroup controller over budget throttling
  2022-10-19 17:32 [RFC 00/17] DRM scheduling cgroup controller Tvrtko Ursulin
                   ` (15 preceding siblings ...)
  2022-10-19 17:32 ` [RFC 16/17] drm/i915: Wire up with drm controller GPU time query Tvrtko Ursulin
@ 2022-10-19 17:32 ` Tvrtko Ursulin
  2022-10-19 18:45 ` [RFC 00/17] DRM scheduling cgroup controller Tejun Heo
  17 siblings, 0 replies; 28+ messages in thread
From: Tvrtko Ursulin @ 2022-10-19 17:32 UTC (permalink / raw)
  To: Intel-gfx
  Cc: cgroups, linux-kernel, Tejun Heo, Johannes Weiner, Zefan Li,
	Dave Airlie, Daniel Vetter, Rob Clark, Stéphane Marchesin,
	T . J . Mercier, Kenny.Ho, Christian König, Brian Welty,
	Tvrtko Ursulin

From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>

When notified by the drm core we are over our allotted time budget, i915
instance will check if any of the GPU engines it is reponsible for is
fully saturated. If it is, and the client in question is using that
engine, it will throttle it.

For now throttling is done simplistically by lowering the scheduling
priority while client is throttled.

Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
---
 .../gpu/drm/i915/gem/i915_gem_execbuffer.c    | 22 ++++-
 drivers/gpu/drm/i915/i915_driver.c            |  1 +
 drivers/gpu/drm/i915/i915_drm_client.c        | 93 +++++++++++++++++++
 drivers/gpu/drm/i915/i915_drm_client.h        |  9 ++
 4 files changed, 123 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
index 391c5b5c80be..efcbd827f6a0 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
@@ -8,6 +8,7 @@
 #include <linux/dma-resv.h>
 #include <linux/highmem.h>
 #include <linux/minmax.h>
+#include <linux/prandom.h>
 #include <linux/sync_file.h>
 #include <linux/uaccess.h>
 
@@ -3018,15 +3019,32 @@ static void retire_requests(struct intel_timeline *tl, struct i915_request *end)
 }
 
 #ifdef CONFIG_CGROUP_DRM
+static unsigned int
+__get_class(struct drm_i915_file_private *fpriv, const struct i915_request *rq)
+{
+	unsigned int class;
+
+	class = rq->context->engine->uabi_class;
+
+	if (WARN_ON_ONCE(class >= ARRAY_SIZE(fpriv->client->throttle)))
+		class = 0;
+
+	return class;
+}
+
 static void copy_priority(struct i915_sched_attr *attr,
-			  const struct i915_execbuffer *eb)
+			  const struct i915_execbuffer *eb,
+			  const struct i915_request *rq)
 {
+	struct drm_i915_file_private *file_priv = eb->file->driver_priv;
 	const int scale = DIV_ROUND_CLOSEST(DRM_CGROUP_PRIORITY_MAX,
 					    I915_CONTEXT_MAX_USER_PRIORITY);
 	int prio;
 
 	*attr = eb->gem_context->sched;
 	prio = attr->priority * scale + eb->file->drm_cgroup_priority;
+	if (file_priv->client->throttle[__get_class(file_priv, rq)])
+		prio -= 1 + prandom_u32_max(-DRM_CGROUP_PRIORITY_MIN / 2);
 	prio = DIV_ROUND_UP(prio, scale);
 	attr->priority = clamp(prio,
 			       I915_CONTEXT_MIN_USER_PRIORITY,
@@ -3056,7 +3074,7 @@ static int eb_request_add(struct i915_execbuffer *eb, struct i915_request *rq,
 
 	/* Check that the context wasn't destroyed before submission */
 	if (likely(!intel_context_is_closed(eb->context))) {
-		copy_priority(&attr, eb);
+		copy_priority(&attr, eb, rq);
 	} else {
 		/* Serialise with context_close via the add_to_timeline */
 		i915_request_set_error_once(rq, -ENOENT);
diff --git a/drivers/gpu/drm/i915/i915_driver.c b/drivers/gpu/drm/i915/i915_driver.c
index b949fd715202..abac9bb5bf27 100644
--- a/drivers/gpu/drm/i915/i915_driver.c
+++ b/drivers/gpu/drm/i915/i915_driver.c
@@ -1897,6 +1897,7 @@ static const struct drm_ioctl_desc i915_ioctls[] = {
 static const struct drm_cgroup_ops i915_drm_cgroup_ops = {
 	.priority_levels = i915_drm_priority_levels,
 	.active_time_us = i915_drm_cgroup_get_active_time_us,
+	.signal_budget = i915_drm_cgroup_signal_budget,
 };
 #endif
 
diff --git a/drivers/gpu/drm/i915/i915_drm_client.c b/drivers/gpu/drm/i915/i915_drm_client.c
index 8527fe80d449..ce497055cc3f 100644
--- a/drivers/gpu/drm/i915/i915_drm_client.c
+++ b/drivers/gpu/drm/i915/i915_drm_client.c
@@ -4,6 +4,7 @@
  */
 
 #include <linux/kernel.h>
+#include <linux/ktime.h>
 #include <linux/slab.h>
 #include <linux/types.h>
 
@@ -173,6 +174,98 @@ u64 i915_drm_cgroup_get_active_time_us(struct drm_file *file)
 
 	return busy;
 }
+
+int i915_drm_cgroup_signal_budget(struct drm_file *file, u64 usage, u64 budget)
+{
+	struct drm_i915_file_private *fpriv = file->driver_priv;
+	u64 class_usage[I915_LAST_UABI_ENGINE_CLASS + 1];
+	u64 class_last[I915_LAST_UABI_ENGINE_CLASS + 1];
+	struct drm_i915_private *i915 = fpriv->dev_priv;
+	struct i915_drm_client *client = fpriv->client;
+	struct intel_engine_cs *engine;
+	bool over = usage > budget;
+	unsigned int i;
+	ktime_t unused;
+	int ret = 0;
+	u64 t;
+
+	if (!supports_stats(i915))
+		return -EINVAL;
+
+	if (usage == 0 && budget == 0)
+		return 0;
+
+printk("i915_drm_cgroup_signal_budget client-id=%u over=%u (%llu/%llu) <%u>\n",
+       client->id, over, usage, budget, client->over_budget);
+
+	if (over) {
+		client->over_budget++;
+		if (!client->over_budget)
+			client->over_budget = 2;
+	} else {
+		client->over_budget = 0;
+		memset(client->class_last, 0, sizeof(client->class_last));
+		memset(client->throttle, 0, sizeof(client->throttle));
+		return 0;
+	}
+
+	memset(class_usage, 0, sizeof(class_usage));
+	for_each_uabi_engine(engine, i915)
+		class_usage[engine->uabi_class] +=
+			ktime_to_ns(intel_engine_get_busy_time(engine, &unused));
+
+	memcpy(class_last, client->class_last, sizeof(class_last));
+	memcpy(client->class_last, class_usage, sizeof(class_last));
+
+	for (i = 0; i < ARRAY_SIZE(uabi_class_names); i++)
+		class_usage[i] -= class_last[i];
+
+	t = client->last;
+	client->last = ktime_get_raw_ns();
+	t = client->last - t;
+
+	if (client->over_budget == 1)
+		return 0;
+
+	for (i = 0; i < ARRAY_SIZE(uabi_class_names); i++) {
+		u64 client_class_usage[I915_LAST_UABI_ENGINE_CLASS + 1];
+		unsigned int capacity;
+
+		if (!i915->engine_uabi_class_count[i])
+			continue;
+
+		t = DIV_ROUND_UP_ULL(t, 1000);
+		class_usage[i] = DIV_ROUND_CLOSEST_ULL(class_usage[i], 1000);
+		usage = DIV_ROUND_CLOSEST_ULL(class_usage[i] * 100ULL,
+					      t *
+					      i915->engine_uabi_class_count[i]);
+		if (usage <= 95) {
+			/* class not oversubsribed */
+			if (client->throttle[i]) {
+				client->throttle[i] = false;
+printk("  UN-throttling class%u (phys=%lld%%)\n",
+       i, usage);
+			}
+			continue;
+		}
+
+		client_class_usage[i] =
+			get_class_active_ns(client, i, &capacity);
+
+		if (client_class_usage[i] && !client->throttle[i]) {
+			ret |= 1;
+			client->throttle[i] = true;
+			/*
+			 * QQQ maybe apply "strength" of throttling based on
+			 * usage/budget?
+			 */
+printk("  THROTTLING class%u (phys=%lld%% client=%lluus)\n",
+       i, usage, client_class_usage[i] / 1000);
+		}
+	}
+
+	return ret;
+}
 #endif
 
 #ifdef CONFIG_PROC_FS
diff --git a/drivers/gpu/drm/i915/i915_drm_client.h b/drivers/gpu/drm/i915/i915_drm_client.h
index 99b8ae01c183..b05afe01e68e 100644
--- a/drivers/gpu/drm/i915/i915_drm_client.h
+++ b/drivers/gpu/drm/i915/i915_drm_client.h
@@ -40,6 +40,13 @@ struct i915_drm_client {
 	 * @past_runtime: Accumulation of pphwsp runtimes from closed contexts.
 	 */
 	atomic64_t past_runtime[I915_LAST_UABI_ENGINE_CLASS + 1];
+
+#ifdef CONFIG_CGROUP_DRM
+	bool throttle[I915_LAST_UABI_ENGINE_CLASS + 1];
+	unsigned int over_budget;
+	u64 last;
+	u64 class_last[I915_LAST_UABI_ENGINE_CLASS + 1];
+#endif
 };
 
 void i915_drm_clients_init(struct i915_drm_clients *clients,
@@ -70,5 +77,7 @@ void i915_drm_clients_fini(struct i915_drm_clients *clients);
 unsigned int i915_drm_priority_levels(struct drm_file *file);
 
 u64 i915_drm_cgroup_get_active_time_us(struct drm_file *file);
+int i915_drm_cgroup_signal_budget(struct drm_file *file,
+				  u64 usage, u64 budget);
 
 #endif /* !__I915_DRM_CLIENT_H__ */
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [RFC 00/17] DRM scheduling cgroup controller
  2022-10-19 17:32 [RFC 00/17] DRM scheduling cgroup controller Tvrtko Ursulin
                   ` (16 preceding siblings ...)
  2022-10-19 17:32 ` [RFC 17/17] drm/i915: Implement cgroup controller over budget throttling Tvrtko Ursulin
@ 2022-10-19 18:45 ` Tejun Heo
  2022-10-27 14:32   ` Tvrtko Ursulin
  17 siblings, 1 reply; 28+ messages in thread
From: Tejun Heo @ 2022-10-19 18:45 UTC (permalink / raw)
  To: Tvrtko Ursulin
  Cc: Intel-gfx, cgroups, linux-kernel, Johannes Weiner, Zefan Li,
	Dave Airlie, Daniel Vetter, Rob Clark, Stéphane Marchesin,
	T . J . Mercier, Kenny.Ho, Christian König, Brian Welty,
	Tvrtko Ursulin

Hello,

On Wed, Oct 19, 2022 at 06:32:37PM +0100, Tvrtko Ursulin wrote:
...
> DRM static priority interface files
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> 
>   drm.priority_levels
> 	One of:
> 	 1) And integer representing the minimum number of discrete priority
> 	    levels for the whole group.
> 	    Optionally followed by an asterisk ('*') indicating some DRM clients
> 	    in the group support more than the minimum number.
> 	 2) '0'- indicating one or more DRM clients in the group has no support
> 	    for static priority control.
> 	 3) 'n/a' - when there are no DRM clients in the configured group.
> 
>   drm.priority
> 	A read-write integer between -10000 and 10000 (inclusive) representing
> 	an abstract static priority level.
> 
>   drm.effective_priority
> 	Read only integer showing the current effective priority level for the
> 	group. Effective meaning taking into account the chain of inherited

From interface POV, this is a lot worse than the second proposal and I'd
really like to avoid this. Even if we go with mapping user priority
configuration to per-driver priorities, I'd much prefer if the interface
presented to user is weight based and let each driver try to match the
resulting hierarchical weight (ie. the absolute proportion a given cgroup
should have at the point in time) as best as they can rather than exposing
opaque priority numbers to userspace whose meaning isn't defined at all.

> DRM scheduling soft limits interface files
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> 
>   drm.weight
> 	Standard cgroup weight based control [1, 10000] used to configure the
> 	relative distributing of GPU time between the sibling groups.

Please take a look at io.weight. This can follow the same convention to
express both global and per-device weights.

>   drm.period_us
> 	An integer representing the period with which the controller should look
> 	at the GPU usage by the group and potentially send the over/under budget
> 	signal.
> 	Value of zero (defaul) disables the soft limit checking.

Can we not do period_us or at least make it a per-driver tuning parameter
exposed as module param? Weight, users can easily understand and configure.
period_us is a lot more an implementation detail. If we want to express the
trade-off between latency and bandwidth at the interface, we prolly should
encode the latency requirement in a more canonical way but let's leave that
for the future.

>   drm.budget_supported
> 	One of:
> 	 1) 'yes' - when all DRM clients in the group support the functionality.
> 	 2) 'no' - when at least one of the DRM clients does not support the
> 		   functionality.
> 	 3) 'n/a' - when there are no DRM clients in the group.

Yeah, I'm not sure about this. This isn't a per-cgroup property to begin
with and I'm not sure 'no' meaning at least one device not supporting is
intuitive. The distinction between 'no' and 'n/a' is kinda weird too. Please
drop this.

Another basic interface question. Is everyone happy with the drm prefix or
should it be something like gpu? Also, in the future, if there's a consensus
around how to control gpu memory, what prefix would that take?

> The second proposal is a little bit more advanced in concept and also a little
> bit less finished. Interesting thing is that it builds upon the per client GPU
> utilisation work which landed recently for a few drivers. So my thinking is that
> in principle, an intersect of drivers which support both that and some sort of
> priority scheduling control, could also in theory support this.
> 
> Another really interesting angle for this controller is that it mimics the same
> control menthod used by the CPU scheduler. That is the proportional/weight based
> GPU time budgeting. Which makes it easy to configure and does not need a new
> mental model.
> 
> However, as the introduction mentions, GPUs are much more heterogenous and
> therefore the controller uses very "soft" wording as to what it promises. The
> general statement is that it can define budgets, notify clients when they are
> over them, and let individual drivers implement best effort handling of those
> conditions.
> 
> Delegation of duties in the implementation goes likes this:
> 
>  * DRM cgroup controller implements the control files and the scanning loop.
>  * DRM core is required to track all DRM clients belonging to processes so it
>    can answer when asked how much GPU time is a process using.
>  * DRM core also provides a call back which the controller will call when a
>    certain process is over budget.
>  * Individual drivers need to implement two similar hooks, but which work for
>    a single DRM client. Over budget callback and GPU utilisation query.
> 
> What I have demonstrated in practice is that when wired to i915, in a really
> primitive way where the over-budget condition simply lowers the scheduling
> priority, the concept can be almost equally effective as the static priority
> control. I say almost because the design where budget control depends on the
> periodic usage scanning has a fundamental delay, so responsiveness will depend
> on the scanning period, which may or may not be a problem for a particular use
> case.
> 
> The unfinished part is the GPU budgeting split which currently does not
> propagate unused bandwith to children, neither can share it with siblings. But
> this is not due fundamental reasons, just to avoid spending too much time on it
> too early.

Rather than doing it hierarchically on the spot, it's usually a lot cheaper
and easier to calculate the flattened hierarchical weight per leaf cgroup
and divide the bandwidth according to the eventual portions. For an example,
please take a look at block/blk-iocost.c.

I don't know much about the drm driver side, so can't comment much on it but
I do really like the idea of having the core implementation determining who
should get how much and then letting each driver enforce the target. That
seems a lot more robust and generic than trying to somehow coax and expose
per-driver priority implementations directly.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC 02/17] drm: Track clients per owning process
  2022-10-19 17:32 ` [RFC 02/17] drm: Track clients per owning process Tvrtko Ursulin
@ 2022-10-20  6:40   ` Christian König
  2022-10-20  7:34     ` Tvrtko Ursulin
  0 siblings, 1 reply; 28+ messages in thread
From: Christian König @ 2022-10-20  6:40 UTC (permalink / raw)
  To: Tvrtko Ursulin, Intel-gfx
  Cc: cgroups, linux-kernel, Tejun Heo, Johannes Weiner, Zefan Li,
	Dave Airlie, Daniel Vetter, Rob Clark, Stéphane Marchesin,
	T . J . Mercier, Kenny.Ho, Brian Welty, Tvrtko Ursulin

Am 19.10.22 um 19:32 schrieb Tvrtko Ursulin:
> From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
>
> To enable propagation of settings from the cgroup drm controller to drm we
> need to start tracking which processes own which drm clients.
>
> Implement that by tracking the struct pid pointer of the owning process in
> a new XArray, pointing to a structure containing a list of associated
> struct drm_file pointers.
>
> Clients are added and removed under the filelist mutex and RCU list
> operations are used below it to allow for lockless lookup.

That won't work easily like this. The problem is that file_priv->pid is 
usually not accurate these days:

 From the debugfs clients file:

       systemd-logind   773   0   y    y     0          0
                 Xorg  1639 128   n    n  1000          0
                 Xorg  1639 128   n    n  1000          0
                 Xorg  1639 128   n    n  1000          0
              firefox  2945 128   n    n  1000          0
                 Xorg  1639 128   n    n  1000          0
                 Xorg  1639 128   n    n  1000          0
                 Xorg  1639 128   n    n  1000          0
                 Xorg  1639 128   n    n  1000          0
               chrome 35940 128   n    n  1000          0
               chrome 35940   0   n    y  1000          1
               chrome 35940   0   n    y  1000          2
                 Xorg  1639 128   n    n  1000          0
                 Xorg  1639 128   n    n  1000          0
                 Xorg  1639 128   n    n  1000          0

This is with glxgears and a bunch other OpenGL applications running.

The problem is that for most applications the X/Wayland server is now 
opening the render node. The only exceptions in this case are apps using 
DRI2 (VA-API?).

I always wanted to fix this and actually track who is using the file 
descriptor instead of who opened it, but never had the time to do this.

I think you need to fix this problem first. And BTW: and unsigned long 
doesn't work as PID either with containers.

Regards,
Christian.

>
> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> ---
>   drivers/gpu/drm/Makefile     |  1 +
>   drivers/gpu/drm/drm_cgroup.c | 60 ++++++++++++++++++++++++++++++++++++
>   drivers/gpu/drm/drm_file.c   | 18 ++++++++---
>   include/drm/drm_clients.h    | 31 +++++++++++++++++++
>   include/drm/drm_file.h       |  4 +++
>   5 files changed, 110 insertions(+), 4 deletions(-)
>   create mode 100644 drivers/gpu/drm/drm_cgroup.c
>   create mode 100644 include/drm/drm_clients.h
>
> diff --git a/drivers/gpu/drm/Makefile b/drivers/gpu/drm/Makefile
> index 6e55c47288e4..0719970d17ee 100644
> --- a/drivers/gpu/drm/Makefile
> +++ b/drivers/gpu/drm/Makefile
> @@ -59,6 +59,7 @@ drm-$(CONFIG_DRM_LEGACY) += \
>   	drm_scatter.o \
>   	drm_vm.o
>   drm-$(CONFIG_DRM_LIB_RANDOM) += lib/drm_random.o
> +drm-$(CONFIG_CGROUP_DRM) += drm_cgroup.o
>   drm-$(CONFIG_COMPAT) += drm_ioc32.o
>   drm-$(CONFIG_DRM_PANEL) += drm_panel.o
>   drm-$(CONFIG_OF) += drm_of.o
> diff --git a/drivers/gpu/drm/drm_cgroup.c b/drivers/gpu/drm/drm_cgroup.c
> new file mode 100644
> index 000000000000..a31ff1d593ab
> --- /dev/null
> +++ b/drivers/gpu/drm/drm_cgroup.c
> @@ -0,0 +1,60 @@
> +/* SPDX-License-Identifier: MIT */
> +/*
> + * Copyright © 2022 Intel Corporation
> + */
> +
> +#include <drm/drm_drv.h>
> +#include <drm/drm_clients.h>
> +
> +static DEFINE_XARRAY(drm_pid_clients);
> +
> +void drm_clients_close(struct drm_file *file_priv)
> +{
> +	unsigned long pid = (unsigned long)file_priv->pid;
> +	struct drm_device *dev = file_priv->minor->dev;
> +	struct drm_pid_clients *clients;
> +
> +	lockdep_assert_held(&dev->filelist_mutex);
> +
> +	clients = xa_load(&drm_pid_clients, pid);
> +	list_del_rcu(&file_priv->clink);
> +	if (atomic_dec_and_test(&clients->num)) {
> +		xa_erase(&drm_pid_clients, pid);
> +		kfree_rcu(clients, rcu);
> +	}
> +}
> +
> +int drm_clients_open(struct drm_file *file_priv)
> +{
> +	unsigned long pid = (unsigned long)file_priv->pid;
> +	struct drm_device *dev = file_priv->minor->dev;
> +	struct drm_pid_clients *clients;
> +	bool new_client = false;
> +
> +	lockdep_assert_held(&dev->filelist_mutex);
> +
> +	clients = xa_load(&drm_pid_clients, pid);
> +	if (!clients) {
> +		clients = kmalloc(sizeof(*clients), GFP_KERNEL);
> +		if (!clients)
> +			return -ENOMEM;
> +		atomic_set(&clients->num, 0);
> +		INIT_LIST_HEAD(&clients->file_list);
> +		init_rcu_head(&clients->rcu);
> +		new_client = true;
> +	}
> +	atomic_inc(&clients->num);
> +	list_add_tail_rcu(&file_priv->clink, &clients->file_list);
> +	if (new_client) {
> +		void *xret;
> +
> +		xret = xa_store(&drm_pid_clients, pid, clients, GFP_KERNEL);
> +		if (xa_err(xret)) {
> +			list_del_init(&file_priv->clink);
> +			kfree(clients);
> +			return PTR_ERR(clients);
> +		}
> +	}
> +
> +	return 0;
> +}
> diff --git a/drivers/gpu/drm/drm_file.c b/drivers/gpu/drm/drm_file.c
> index a8b4d918e9a3..ce58d5c513db 100644
> --- a/drivers/gpu/drm/drm_file.c
> +++ b/drivers/gpu/drm/drm_file.c
> @@ -40,6 +40,7 @@
>   #include <linux/slab.h>
>   
>   #include <drm/drm_client.h>
> +#include <drm/drm_clients.h>
>   #include <drm/drm_drv.h>
>   #include <drm/drm_file.h>
>   #include <drm/drm_print.h>
> @@ -298,6 +299,7 @@ static void drm_close_helper(struct file *filp)
>   
>   	mutex_lock(&dev->filelist_mutex);
>   	list_del(&file_priv->lhead);
> +	drm_clients_close(file_priv);
>   	mutex_unlock(&dev->filelist_mutex);
>   
>   	drm_file_free(file_priv);
> @@ -349,10 +351,8 @@ static int drm_open_helper(struct file *filp, struct drm_minor *minor)
>   
>   	if (drm_is_primary_client(priv)) {
>   		ret = drm_master_open(priv);
> -		if (ret) {
> -			drm_file_free(priv);
> -			return ret;
> -		}
> +		if (ret)
> +			goto err_free;
>   	}
>   
>   	filp->private_data = priv;
> @@ -360,6 +360,9 @@ static int drm_open_helper(struct file *filp, struct drm_minor *minor)
>   	priv->filp = filp;
>   
>   	mutex_lock(&dev->filelist_mutex);
> +	ret = drm_clients_open(priv);
> +	if (ret)
> +		goto err_unlock;
>   	list_add(&priv->lhead, &dev->filelist);
>   	mutex_unlock(&dev->filelist_mutex);
>   
> @@ -387,6 +390,13 @@ static int drm_open_helper(struct file *filp, struct drm_minor *minor)
>   #endif
>   
>   	return 0;
> +
> +err_unlock:
> +	mutex_unlock(&dev->filelist_mutex);
> +err_free:
> +	drm_file_free(priv);
> +
> +	return ret;
>   }
>   
>   /**
> diff --git a/include/drm/drm_clients.h b/include/drm/drm_clients.h
> new file mode 100644
> index 000000000000..4ae553a03d1e
> --- /dev/null
> +++ b/include/drm/drm_clients.h
> @@ -0,0 +1,31 @@
> +/* SPDX-License-Identifier: MIT */
> +/*
> + * Copyright © 2022 Intel Corporation
> + */
> +
> +#ifndef _DRM_CLIENTS_H_
> +#define _DRM_CLIENTS_H_
> +
> +#include <drm/drm_file.h>
> +
> +struct drm_pid_clients {
> +	atomic_t num;
> +	struct list_head file_list;
> +	struct rcu_head rcu;
> +};
> +
> +#if IS_ENABLED(CONFIG_CGROUP_DRM)
> +void drm_clients_close(struct drm_file *file_priv);
> +int drm_clients_open(struct drm_file *file_priv);
> +#else
> +static inline void drm_clients_close(struct drm_file *file_priv)
> +{
> +}
> +
> +static inline int drm_clients_open(struct drm_file *file_priv)
> +{
> +	return 0;
> +}
> +#endif
> +
> +#endif
> diff --git a/include/drm/drm_file.h b/include/drm/drm_file.h
> index d780fd151789..0965eb111f24 100644
> --- a/include/drm/drm_file.h
> +++ b/include/drm/drm_file.h
> @@ -268,6 +268,10 @@ struct drm_file {
>   	/** @minor: &struct drm_minor for this file. */
>   	struct drm_minor *minor;
>   
> +#if IS_ENABLED(CONFIG_CGROUP_DRM)
> +	struct list_head clink;
> +#endif
> +
>   	/**
>   	 * @object_idr:
>   	 *


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC 02/17] drm: Track clients per owning process
  2022-10-20  6:40   ` Christian König
@ 2022-10-20  7:34     ` Tvrtko Ursulin
  2022-10-20 11:33       ` Christian König
  0 siblings, 1 reply; 28+ messages in thread
From: Tvrtko Ursulin @ 2022-10-20  7:34 UTC (permalink / raw)
  To: Christian König, Intel-gfx
  Cc: cgroups, linux-kernel, Tejun Heo, Johannes Weiner, Zefan Li,
	Dave Airlie, Daniel Vetter, Rob Clark, Stéphane Marchesin,
	T . J . Mercier, Kenny.Ho, Brian Welty, Tvrtko Ursulin


On 20/10/2022 07:40, Christian König wrote:
> Am 19.10.22 um 19:32 schrieb Tvrtko Ursulin:
>> From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
>>
>> To enable propagation of settings from the cgroup drm controller to 
>> drm we
>> need to start tracking which processes own which drm clients.
>>
>> Implement that by tracking the struct pid pointer of the owning 
>> process in
>> a new XArray, pointing to a structure containing a list of associated
>> struct drm_file pointers.
>>
>> Clients are added and removed under the filelist mutex and RCU list
>> operations are used below it to allow for lockless lookup.
> 
> That won't work easily like this. The problem is that file_priv->pid is 
> usually not accurate these days:
> 
>  From the debugfs clients file:
> 
>        systemd-logind   773   0   y    y     0          0
>                  Xorg  1639 128   n    n  1000          0
>                  Xorg  1639 128   n    n  1000          0
>                  Xorg  1639 128   n    n  1000          0
>               firefox  2945 128   n    n  1000          0
>                  Xorg  1639 128   n    n  1000          0
>                  Xorg  1639 128   n    n  1000          0
>                  Xorg  1639 128   n    n  1000          0
>                  Xorg  1639 128   n    n  1000          0
>                chrome 35940 128   n    n  1000          0
>                chrome 35940   0   n    y  1000          1
>                chrome 35940   0   n    y  1000          2
>                  Xorg  1639 128   n    n  1000          0
>                  Xorg  1639 128   n    n  1000          0
>                  Xorg  1639 128   n    n  1000          0
> 
> This is with glxgears and a bunch other OpenGL applications running.
> 
> The problem is that for most applications the X/Wayland server is now 
> opening the render node. The only exceptions in this case are apps using 
> DRI2 (VA-API?).
> 
> I always wanted to fix this and actually track who is using the file 
> descriptor instead of who opened it, but never had the time to do this.

There's a patch later in the series which allows client records to be 
migrated to a new PID, and then i915 patch to do that when fd is used 
for context create. That approach I think worked well enough in the 
past. So maybe it could be done in the DRM core at some suitable entry 
point.

> I think you need to fix this problem first. And BTW: and unsigned long 
> doesn't work as PID either with containers.

This I am not familiar with so would like to hear more if you could 
point me in the right direction at least.

My assumption was that struct pid *, which is what I store in unsigned 
long, would be unique in a system where there is a single kernel 
running, so as long as lifetimes are correct (released from tracking 
here when fd is closed, which is implicit on process exit) would work. 
You are suggesting that is not so?

Regards,

Tvrtko

> 
> Regards,
> Christian.
> 
>>
>> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
>> ---
>>   drivers/gpu/drm/Makefile     |  1 +
>>   drivers/gpu/drm/drm_cgroup.c | 60 ++++++++++++++++++++++++++++++++++++
>>   drivers/gpu/drm/drm_file.c   | 18 ++++++++---
>>   include/drm/drm_clients.h    | 31 +++++++++++++++++++
>>   include/drm/drm_file.h       |  4 +++
>>   5 files changed, 110 insertions(+), 4 deletions(-)
>>   create mode 100644 drivers/gpu/drm/drm_cgroup.c
>>   create mode 100644 include/drm/drm_clients.h
>>
>> diff --git a/drivers/gpu/drm/Makefile b/drivers/gpu/drm/Makefile
>> index 6e55c47288e4..0719970d17ee 100644
>> --- a/drivers/gpu/drm/Makefile
>> +++ b/drivers/gpu/drm/Makefile
>> @@ -59,6 +59,7 @@ drm-$(CONFIG_DRM_LEGACY) += \
>>       drm_scatter.o \
>>       drm_vm.o
>>   drm-$(CONFIG_DRM_LIB_RANDOM) += lib/drm_random.o
>> +drm-$(CONFIG_CGROUP_DRM) += drm_cgroup.o
>>   drm-$(CONFIG_COMPAT) += drm_ioc32.o
>>   drm-$(CONFIG_DRM_PANEL) += drm_panel.o
>>   drm-$(CONFIG_OF) += drm_of.o
>> diff --git a/drivers/gpu/drm/drm_cgroup.c b/drivers/gpu/drm/drm_cgroup.c
>> new file mode 100644
>> index 000000000000..a31ff1d593ab
>> --- /dev/null
>> +++ b/drivers/gpu/drm/drm_cgroup.c
>> @@ -0,0 +1,60 @@
>> +/* SPDX-License-Identifier: MIT */
>> +/*
>> + * Copyright © 2022 Intel Corporation
>> + */
>> +
>> +#include <drm/drm_drv.h>
>> +#include <drm/drm_clients.h>
>> +
>> +static DEFINE_XARRAY(drm_pid_clients);
>> +
>> +void drm_clients_close(struct drm_file *file_priv)
>> +{
>> +    unsigned long pid = (unsigned long)file_priv->pid;
>> +    struct drm_device *dev = file_priv->minor->dev;
>> +    struct drm_pid_clients *clients;
>> +
>> +    lockdep_assert_held(&dev->filelist_mutex);
>> +
>> +    clients = xa_load(&drm_pid_clients, pid);
>> +    list_del_rcu(&file_priv->clink);
>> +    if (atomic_dec_and_test(&clients->num)) {
>> +        xa_erase(&drm_pid_clients, pid);
>> +        kfree_rcu(clients, rcu);
>> +    }
>> +}
>> +
>> +int drm_clients_open(struct drm_file *file_priv)
>> +{
>> +    unsigned long pid = (unsigned long)file_priv->pid;
>> +    struct drm_device *dev = file_priv->minor->dev;
>> +    struct drm_pid_clients *clients;
>> +    bool new_client = false;
>> +
>> +    lockdep_assert_held(&dev->filelist_mutex);
>> +
>> +    clients = xa_load(&drm_pid_clients, pid);
>> +    if (!clients) {
>> +        clients = kmalloc(sizeof(*clients), GFP_KERNEL);
>> +        if (!clients)
>> +            return -ENOMEM;
>> +        atomic_set(&clients->num, 0);
>> +        INIT_LIST_HEAD(&clients->file_list);
>> +        init_rcu_head(&clients->rcu);
>> +        new_client = true;
>> +    }
>> +    atomic_inc(&clients->num);
>> +    list_add_tail_rcu(&file_priv->clink, &clients->file_list);
>> +    if (new_client) {
>> +        void *xret;
>> +
>> +        xret = xa_store(&drm_pid_clients, pid, clients, GFP_KERNEL);
>> +        if (xa_err(xret)) {
>> +            list_del_init(&file_priv->clink);
>> +            kfree(clients);
>> +            return PTR_ERR(clients);
>> +        }
>> +    }
>> +
>> +    return 0;
>> +}
>> diff --git a/drivers/gpu/drm/drm_file.c b/drivers/gpu/drm/drm_file.c
>> index a8b4d918e9a3..ce58d5c513db 100644
>> --- a/drivers/gpu/drm/drm_file.c
>> +++ b/drivers/gpu/drm/drm_file.c
>> @@ -40,6 +40,7 @@
>>   #include <linux/slab.h>
>>   #include <drm/drm_client.h>
>> +#include <drm/drm_clients.h>
>>   #include <drm/drm_drv.h>
>>   #include <drm/drm_file.h>
>>   #include <drm/drm_print.h>
>> @@ -298,6 +299,7 @@ static void drm_close_helper(struct file *filp)
>>       mutex_lock(&dev->filelist_mutex);
>>       list_del(&file_priv->lhead);
>> +    drm_clients_close(file_priv);
>>       mutex_unlock(&dev->filelist_mutex);
>>       drm_file_free(file_priv);
>> @@ -349,10 +351,8 @@ static int drm_open_helper(struct file *filp, 
>> struct drm_minor *minor)
>>       if (drm_is_primary_client(priv)) {
>>           ret = drm_master_open(priv);
>> -        if (ret) {
>> -            drm_file_free(priv);
>> -            return ret;
>> -        }
>> +        if (ret)
>> +            goto err_free;
>>       }
>>       filp->private_data = priv;
>> @@ -360,6 +360,9 @@ static int drm_open_helper(struct file *filp, 
>> struct drm_minor *minor)
>>       priv->filp = filp;
>>       mutex_lock(&dev->filelist_mutex);
>> +    ret = drm_clients_open(priv);
>> +    if (ret)
>> +        goto err_unlock;
>>       list_add(&priv->lhead, &dev->filelist);
>>       mutex_unlock(&dev->filelist_mutex);
>> @@ -387,6 +390,13 @@ static int drm_open_helper(struct file *filp, 
>> struct drm_minor *minor)
>>   #endif
>>       return 0;
>> +
>> +err_unlock:
>> +    mutex_unlock(&dev->filelist_mutex);
>> +err_free:
>> +    drm_file_free(priv);
>> +
>> +    return ret;
>>   }
>>   /**
>> diff --git a/include/drm/drm_clients.h b/include/drm/drm_clients.h
>> new file mode 100644
>> index 000000000000..4ae553a03d1e
>> --- /dev/null
>> +++ b/include/drm/drm_clients.h
>> @@ -0,0 +1,31 @@
>> +/* SPDX-License-Identifier: MIT */
>> +/*
>> + * Copyright © 2022 Intel Corporation
>> + */
>> +
>> +#ifndef _DRM_CLIENTS_H_
>> +#define _DRM_CLIENTS_H_
>> +
>> +#include <drm/drm_file.h>
>> +
>> +struct drm_pid_clients {
>> +    atomic_t num;
>> +    struct list_head file_list;
>> +    struct rcu_head rcu;
>> +};
>> +
>> +#if IS_ENABLED(CONFIG_CGROUP_DRM)
>> +void drm_clients_close(struct drm_file *file_priv);
>> +int drm_clients_open(struct drm_file *file_priv);
>> +#else
>> +static inline void drm_clients_close(struct drm_file *file_priv)
>> +{
>> +}
>> +
>> +static inline int drm_clients_open(struct drm_file *file_priv)
>> +{
>> +    return 0;
>> +}
>> +#endif
>> +
>> +#endif
>> diff --git a/include/drm/drm_file.h b/include/drm/drm_file.h
>> index d780fd151789..0965eb111f24 100644
>> --- a/include/drm/drm_file.h
>> +++ b/include/drm/drm_file.h
>> @@ -268,6 +268,10 @@ struct drm_file {
>>       /** @minor: &struct drm_minor for this file. */
>>       struct drm_minor *minor;
>> +#if IS_ENABLED(CONFIG_CGROUP_DRM)
>> +    struct list_head clink;
>> +#endif
>> +
>>       /**
>>        * @object_idr:
>>        *
> 

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC 02/17] drm: Track clients per owning process
  2022-10-20  7:34     ` Tvrtko Ursulin
@ 2022-10-20 11:33       ` Christian König
  2022-10-27 14:35         ` Tvrtko Ursulin
  0 siblings, 1 reply; 28+ messages in thread
From: Christian König @ 2022-10-20 11:33 UTC (permalink / raw)
  To: Tvrtko Ursulin, Intel-gfx
  Cc: cgroups, linux-kernel, Tejun Heo, Johannes Weiner, Zefan Li,
	Dave Airlie, Daniel Vetter, Rob Clark, Stéphane Marchesin,
	T . J . Mercier, Kenny.Ho, Brian Welty, Tvrtko Ursulin

Am 20.10.22 um 09:34 schrieb Tvrtko Ursulin:
>
> On 20/10/2022 07:40, Christian König wrote:
>> Am 19.10.22 um 19:32 schrieb Tvrtko Ursulin:
>>> From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
>>>
>>> To enable propagation of settings from the cgroup drm controller to 
>>> drm we
>>> need to start tracking which processes own which drm clients.
>>>
>>> Implement that by tracking the struct pid pointer of the owning 
>>> process in
>>> a new XArray, pointing to a structure containing a list of associated
>>> struct drm_file pointers.
>>>
>>> Clients are added and removed under the filelist mutex and RCU list
>>> operations are used below it to allow for lockless lookup.
>>
>> That won't work easily like this. The problem is that file_priv->pid 
>> is usually not accurate these days:
>>
>>  From the debugfs clients file:
>>
>>        systemd-logind   773   0   y    y     0          0
>>                  Xorg  1639 128   n    n  1000          0
>>                  Xorg  1639 128   n    n  1000          0
>>                  Xorg  1639 128   n    n  1000          0
>>               firefox  2945 128   n    n  1000          0
>>                  Xorg  1639 128   n    n  1000          0
>>                  Xorg  1639 128   n    n  1000          0
>>                  Xorg  1639 128   n    n  1000          0
>>                  Xorg  1639 128   n    n  1000          0
>>                chrome 35940 128   n    n  1000          0
>>                chrome 35940   0   n    y  1000          1
>>                chrome 35940   0   n    y  1000          2
>>                  Xorg  1639 128   n    n  1000          0
>>                  Xorg  1639 128   n    n  1000          0
>>                  Xorg  1639 128   n    n  1000          0
>>
>> This is with glxgears and a bunch other OpenGL applications running.
>>
>> The problem is that for most applications the X/Wayland server is now 
>> opening the render node. The only exceptions in this case are apps 
>> using DRI2 (VA-API?).
>>
>> I always wanted to fix this and actually track who is using the file 
>> descriptor instead of who opened it, but never had the time to do this.
>
> There's a patch later in the series which allows client records to be 
> migrated to a new PID, and then i915 patch to do that when fd is used 
> for context create. That approach I think worked well enough in the 
> past. So maybe it could be done in the DRM core at some suitable entry 
> point.

Yeah, that makes some sense. I think you should wire that inside 
drm_ioctl(), as far as I know more or less all uses of a file descriptor 
would go through that function.

And maybe make that a stand alone patch, cause that can go upstream as a 
bug fix independently if you ask me.

>> I think you need to fix this problem first. And BTW: and unsigned 
>> long doesn't work as PID either with containers.
>
> This I am not familiar with so would like to hear more if you could 
> point me in the right direction at least.

Uff, I'm the wrong person to ask stuff like that. I just can say from 
experience because I've ran into that trap as well.

>
> My assumption was that struct pid *, which is what I store in unsigned 
> long, would be unique in a system where there is a single kernel 
> running, so as long as lifetimes are correct (released from tracking 
> here when fd is closed, which is implicit on process exit) would work. 
> You are suggesting that is not so?

I think you should have the pointer to struct pid directly here since 
that is a reference counted structure IIRC. But don't ask me what the 
semantics is how to get or put a reference.

Regards,
Christian.

>
> Regards,
>
> Tvrtko
>
>>
>> Regards,
>> Christian.
>>
>>>
>>> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
>>> ---
>>>   drivers/gpu/drm/Makefile     |  1 +
>>>   drivers/gpu/drm/drm_cgroup.c | 60 
>>> ++++++++++++++++++++++++++++++++++++
>>>   drivers/gpu/drm/drm_file.c   | 18 ++++++++---
>>>   include/drm/drm_clients.h    | 31 +++++++++++++++++++
>>>   include/drm/drm_file.h       |  4 +++
>>>   5 files changed, 110 insertions(+), 4 deletions(-)
>>>   create mode 100644 drivers/gpu/drm/drm_cgroup.c
>>>   create mode 100644 include/drm/drm_clients.h
>>>
>>> diff --git a/drivers/gpu/drm/Makefile b/drivers/gpu/drm/Makefile
>>> index 6e55c47288e4..0719970d17ee 100644
>>> --- a/drivers/gpu/drm/Makefile
>>> +++ b/drivers/gpu/drm/Makefile
>>> @@ -59,6 +59,7 @@ drm-$(CONFIG_DRM_LEGACY) += \
>>>       drm_scatter.o \
>>>       drm_vm.o
>>>   drm-$(CONFIG_DRM_LIB_RANDOM) += lib/drm_random.o
>>> +drm-$(CONFIG_CGROUP_DRM) += drm_cgroup.o
>>>   drm-$(CONFIG_COMPAT) += drm_ioc32.o
>>>   drm-$(CONFIG_DRM_PANEL) += drm_panel.o
>>>   drm-$(CONFIG_OF) += drm_of.o
>>> diff --git a/drivers/gpu/drm/drm_cgroup.c 
>>> b/drivers/gpu/drm/drm_cgroup.c
>>> new file mode 100644
>>> index 000000000000..a31ff1d593ab
>>> --- /dev/null
>>> +++ b/drivers/gpu/drm/drm_cgroup.c
>>> @@ -0,0 +1,60 @@
>>> +/* SPDX-License-Identifier: MIT */
>>> +/*
>>> + * Copyright © 2022 Intel Corporation
>>> + */
>>> +
>>> +#include <drm/drm_drv.h>
>>> +#include <drm/drm_clients.h>
>>> +
>>> +static DEFINE_XARRAY(drm_pid_clients);
>>> +
>>> +void drm_clients_close(struct drm_file *file_priv)
>>> +{
>>> +    unsigned long pid = (unsigned long)file_priv->pid;
>>> +    struct drm_device *dev = file_priv->minor->dev;
>>> +    struct drm_pid_clients *clients;
>>> +
>>> +    lockdep_assert_held(&dev->filelist_mutex);
>>> +
>>> +    clients = xa_load(&drm_pid_clients, pid);
>>> +    list_del_rcu(&file_priv->clink);
>>> +    if (atomic_dec_and_test(&clients->num)) {
>>> +        xa_erase(&drm_pid_clients, pid);
>>> +        kfree_rcu(clients, rcu);
>>> +    }
>>> +}
>>> +
>>> +int drm_clients_open(struct drm_file *file_priv)
>>> +{
>>> +    unsigned long pid = (unsigned long)file_priv->pid;
>>> +    struct drm_device *dev = file_priv->minor->dev;
>>> +    struct drm_pid_clients *clients;
>>> +    bool new_client = false;
>>> +
>>> +    lockdep_assert_held(&dev->filelist_mutex);
>>> +
>>> +    clients = xa_load(&drm_pid_clients, pid);
>>> +    if (!clients) {
>>> +        clients = kmalloc(sizeof(*clients), GFP_KERNEL);
>>> +        if (!clients)
>>> +            return -ENOMEM;
>>> +        atomic_set(&clients->num, 0);
>>> +        INIT_LIST_HEAD(&clients->file_list);
>>> +        init_rcu_head(&clients->rcu);
>>> +        new_client = true;
>>> +    }
>>> +    atomic_inc(&clients->num);
>>> +    list_add_tail_rcu(&file_priv->clink, &clients->file_list);
>>> +    if (new_client) {
>>> +        void *xret;
>>> +
>>> +        xret = xa_store(&drm_pid_clients, pid, clients, GFP_KERNEL);
>>> +        if (xa_err(xret)) {
>>> +            list_del_init(&file_priv->clink);
>>> +            kfree(clients);
>>> +            return PTR_ERR(clients);
>>> +        }
>>> +    }
>>> +
>>> +    return 0;
>>> +}
>>> diff --git a/drivers/gpu/drm/drm_file.c b/drivers/gpu/drm/drm_file.c
>>> index a8b4d918e9a3..ce58d5c513db 100644
>>> --- a/drivers/gpu/drm/drm_file.c
>>> +++ b/drivers/gpu/drm/drm_file.c
>>> @@ -40,6 +40,7 @@
>>>   #include <linux/slab.h>
>>>   #include <drm/drm_client.h>
>>> +#include <drm/drm_clients.h>
>>>   #include <drm/drm_drv.h>
>>>   #include <drm/drm_file.h>
>>>   #include <drm/drm_print.h>
>>> @@ -298,6 +299,7 @@ static void drm_close_helper(struct file *filp)
>>>       mutex_lock(&dev->filelist_mutex);
>>>       list_del(&file_priv->lhead);
>>> +    drm_clients_close(file_priv);
>>>       mutex_unlock(&dev->filelist_mutex);
>>>       drm_file_free(file_priv);
>>> @@ -349,10 +351,8 @@ static int drm_open_helper(struct file *filp, 
>>> struct drm_minor *minor)
>>>       if (drm_is_primary_client(priv)) {
>>>           ret = drm_master_open(priv);
>>> -        if (ret) {
>>> -            drm_file_free(priv);
>>> -            return ret;
>>> -        }
>>> +        if (ret)
>>> +            goto err_free;
>>>       }
>>>       filp->private_data = priv;
>>> @@ -360,6 +360,9 @@ static int drm_open_helper(struct file *filp, 
>>> struct drm_minor *minor)
>>>       priv->filp = filp;
>>>       mutex_lock(&dev->filelist_mutex);
>>> +    ret = drm_clients_open(priv);
>>> +    if (ret)
>>> +        goto err_unlock;
>>>       list_add(&priv->lhead, &dev->filelist);
>>>       mutex_unlock(&dev->filelist_mutex);
>>> @@ -387,6 +390,13 @@ static int drm_open_helper(struct file *filp, 
>>> struct drm_minor *minor)
>>>   #endif
>>>       return 0;
>>> +
>>> +err_unlock:
>>> +    mutex_unlock(&dev->filelist_mutex);
>>> +err_free:
>>> +    drm_file_free(priv);
>>> +
>>> +    return ret;
>>>   }
>>>   /**
>>> diff --git a/include/drm/drm_clients.h b/include/drm/drm_clients.h
>>> new file mode 100644
>>> index 000000000000..4ae553a03d1e
>>> --- /dev/null
>>> +++ b/include/drm/drm_clients.h
>>> @@ -0,0 +1,31 @@
>>> +/* SPDX-License-Identifier: MIT */
>>> +/*
>>> + * Copyright © 2022 Intel Corporation
>>> + */
>>> +
>>> +#ifndef _DRM_CLIENTS_H_
>>> +#define _DRM_CLIENTS_H_
>>> +
>>> +#include <drm/drm_file.h>
>>> +
>>> +struct drm_pid_clients {
>>> +    atomic_t num;
>>> +    struct list_head file_list;
>>> +    struct rcu_head rcu;
>>> +};
>>> +
>>> +#if IS_ENABLED(CONFIG_CGROUP_DRM)
>>> +void drm_clients_close(struct drm_file *file_priv);
>>> +int drm_clients_open(struct drm_file *file_priv);
>>> +#else
>>> +static inline void drm_clients_close(struct drm_file *file_priv)
>>> +{
>>> +}
>>> +
>>> +static inline int drm_clients_open(struct drm_file *file_priv)
>>> +{
>>> +    return 0;
>>> +}
>>> +#endif
>>> +
>>> +#endif
>>> diff --git a/include/drm/drm_file.h b/include/drm/drm_file.h
>>> index d780fd151789..0965eb111f24 100644
>>> --- a/include/drm/drm_file.h
>>> +++ b/include/drm/drm_file.h
>>> @@ -268,6 +268,10 @@ struct drm_file {
>>>       /** @minor: &struct drm_minor for this file. */
>>>       struct drm_minor *minor;
>>> +#if IS_ENABLED(CONFIG_CGROUP_DRM)
>>> +    struct list_head clink;
>>> +#endif
>>> +
>>>       /**
>>>        * @object_idr:
>>>        *
>>


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC 13/17] cgroup/drm: Ability to periodically scan cgroups for over budget GPU usage
  2022-10-19 17:32 ` [RFC 13/17] cgroup/drm: Ability to periodically scan cgroups for over budget GPU usage Tvrtko Ursulin
@ 2022-10-21 22:52   ` T.J. Mercier
  2022-10-27 14:45     ` Tvrtko Ursulin
  0 siblings, 1 reply; 28+ messages in thread
From: T.J. Mercier @ 2022-10-21 22:52 UTC (permalink / raw)
  To: Tvrtko Ursulin
  Cc: Intel-gfx, cgroups, linux-kernel, Tejun Heo, Johannes Weiner,
	Zefan Li, Dave Airlie, Daniel Vetter, Rob Clark,
	Stéphane Marchesin, Kenny.Ho, Christian König,
	Brian Welty, Tvrtko Ursulin

On Wed, Oct 19, 2022 at 10:34 AM Tvrtko Ursulin
<tvrtko.ursulin@linux.intel.com> wrote:
>
> From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
>
> Add a scanning worker, which if enabled, periodically queries the cgroup
> for GPU usage and if over budget (as configured by it's relative weight
> share) notifies the drm core about the fact.
>
> This is off by default and can be enabled by configuring a scanning
> period using the drm.period_us cgroup control file.
>
> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> ---
>  Documentation/admin-guide/cgroup-v2.rst |  35 +-
>  kernel/cgroup/drm.c                     | 426 +++++++++++++++++++++++-
>  2 files changed, 459 insertions(+), 2 deletions(-)
>
> diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
> index 1f3cca4e2572..318f463a1316 100644
> --- a/Documentation/admin-guide/cgroup-v2.rst
> +++ b/Documentation/admin-guide/cgroup-v2.rst
> @@ -2401,7 +2401,8 @@ HugeTLB Interface Files
>  DRM
>  ---
>
> -The DRM controller allows configuring static hierarchical scheduling priority.
> +The DRM controller allows configuring static hierarchical scheduling priority
> +and scheduling soft limits.
>
>  DRM static priority control
>  ~~~~~~~~~~~~~~~~~~~~~~~~~~~
> @@ -2458,6 +2459,38 @@ DRM static priority interface files
>         Read only integer showing the current effective priority level for the
>         group. Effective meaning taking into account the chain of inherited
>
> +DRM scheduling soft limits
> +~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +Because of the heterogenous hardware and driver DRM capabilities, soft limits
> +are implemented as a loose co-operative (bi-directional) interface between the
> +controller and DRM core.
> +
> +The controller configures the GPU time allowed per group and periodically scans
> +the belonging tasks to detect the over budget condition, at which point it
> +invokes a callback notifying the DRM core of the condition.
> +
> +DRM core provides an API to query per process GPU utilization and 2nd API to
> +receive notification from the cgroup controller when the group enters or exits
> +the over budget condition.
> +
> +Individual DRM drivers which implement the interface are expected to act on this
> +in the best-effort manner only. There are no guarantees that the soft limits
> +will be respected.
> +
> +DRM scheduling soft limits interface files
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +  drm.weight
> +       Standard cgroup weight based control [1, 10000] used to configure the
> +       relative distributing of GPU time between the sibling groups.
> +
> +  drm.period_us
> +       An integer representing the period with which the controller should look
> +       at the GPU usage by the group and potentially send the over/under budget
> +       signal.
> +       Value of zero (defaul) disables the soft limit checking.
> +
>  Misc
>  ----
>
> diff --git a/kernel/cgroup/drm.c b/kernel/cgroup/drm.c
> index 48f1eaaa1c07..af50ead1564a 100644
> --- a/kernel/cgroup/drm.c
> +++ b/kernel/cgroup/drm.c
> @@ -18,6 +18,29 @@ struct drm_cgroup_state {
>         int priority;
>         int effective_priority;
>         unsigned int weight;
> +       unsigned int period_us;
> +
> +       bool scanning_suspended;
> +       unsigned int suspended_period_us;
> +
> +       struct delayed_work scan_work;
> +
> +       /*
> +        * Below fields are owned and updated by the scan worker. Either the
> +        * worker accesses them, or worker needs to be suspended and synced
> +        * before they can be touched from the outside.
> +        */
> +       bool scanned;
> +
> +       ktime_t prev_timestamp;
> +
> +       u64 sum_children_weights;
> +       u64 children_active_us;
> +       u64 per_s_budget_ns;
> +       u64 prev_active_us;
> +       u64 active_us;
> +
> +       bool over_budget;
>  };
>
>  static DEFINE_MUTEX(drmcg_mutex);
> @@ -33,6 +56,31 @@ static inline struct drm_cgroup_state *get_task_drmcs(struct task_struct *task)
>         return css_to_drmcs(task_get_css(task, drm_cgrp_id));
>  }
>
> +static u64 drmcs_get_active_time_us(struct drm_cgroup_state *drmcs)
> +{
> +       struct cgroup *cgrp = drmcs->css.cgroup;
> +       struct task_struct *task;
> +       struct css_task_iter it;
> +       u64 total = 0;
> +
> +       css_task_iter_start(&cgrp->self,
> +                           CSS_TASK_ITER_PROCS | CSS_TASK_ITER_THREADED,
> +                           &it);
> +       while ((task = css_task_iter_next(&it))) {
> +               u64 time;
> +
> +               /* Ignore kernel threads here. */
> +               if (task->flags & PF_KTHREAD)
> +                       continue;
> +
> +               time = drm_pid_get_active_time_us(task_pid(task));
> +               total += time;
> +       }
> +       css_task_iter_end(&it);
> +
> +       return total;
> +}
> +
>  int drmcgroup_lookup_effective_priority(struct task_struct *task)
>  {
>         struct drm_cgroup_state *drmcs = get_task_drmcs(task);
> @@ -202,9 +250,301 @@ static int drmcs_online(struct cgroup_subsys_state *css)
>         return 0;
>  }
>
> +static void
> +signal_drm_budget(struct drm_cgroup_state *drmcs, u64 usage, u64 budget)
> +{
> +       struct cgroup *cgrp = drmcs->css.cgroup;
> +       struct task_struct *task;
> +       struct css_task_iter it;
> +
> +       css_task_iter_start(&cgrp->self,
> +                           CSS_TASK_ITER_PROCS | CSS_TASK_ITER_THREADED,
> +                           &it);
> +       while ((task = css_task_iter_next(&it))) {
> +               /* Ignore kernel threads here. */
> +               if (task->flags & PF_KTHREAD)
> +                       continue;
> +
> +               drm_pid_signal_budget(task_pid(task), usage, budget);
> +       }
> +       css_task_iter_end(&it);
> +}
> +
> +static bool __start_scanning(struct drm_cgroup_state *root)
> +{
> +       struct cgroup_subsys_state *node;
> +       bool ok = true;
> +
> +       rcu_read_lock();
> +       css_for_each_descendant_pre(node, &root->css) {
> +               struct drm_cgroup_state *drmcs = css_to_drmcs(node);
> +               unsigned long active;
> +
> +               if (!css_tryget_online(node)) {
> +                       ok = false;
> +                       continue;
> +               }
> +
> +               drmcs->scanned = false;
> +               drmcs->sum_children_weights = 0;
> +               drmcs->children_active_us = 0;
> +               if (node == &root->css)
> +                       drmcs->per_s_budget_ns = NSEC_PER_SEC;
> +               else
> +                       drmcs->per_s_budget_ns = 0;
> +
> +               active = drmcs_get_active_time_us(drmcs);
> +               if (active >= drmcs->prev_active_us)
> +                       drmcs->active_us = active - drmcs->prev_active_us;
> +               else
> +                       drmcs->active_us = 0;
> +               drmcs->prev_active_us = active;
> +
> +               css_put(node);
> +       }
> +       rcu_read_unlock();
> +
> +       return ok;
> +}
> +
> +static void scan_worker(struct work_struct *work)
> +{
> +       struct drm_cgroup_state *root =
> +               container_of(work, typeof(*root), scan_work.work);
> +       struct cgroup_subsys_state *node;
> +       unsigned int period_us;
> +       ktime_t now;
> +
> +       rcu_read_lock();

Hi Tvrtko, I think this lock needs to come after the return for the
online check just below here to avoid missing the rcu_read_unlock at
out_retry. Although it doesn't look like this should ever run in the
first place if the DRM controller is disabled.

> +
> +       if (WARN_ON_ONCE(!css_tryget_online(&root->css)))
> +               return;
> +
> +       /*
> +        * 1st pass - reset accumulated values and update group GPU activity.
> +        */
> +       if (!__start_scanning(root))
> +               goto out_retry; /*
> +                                * Always come back later if scanner races with
> +                                * core cgroup management. (Repeated pattern.)
> +                                */
> +
> +       now = ktime_get();
> +       period_us = ktime_to_us(ktime_sub(now, root->prev_timestamp));
> +       root->prev_timestamp = now;
> +
> +       /*
> +        * 2nd pass - calculate accumulated GPU activity and relative weights
> +        * for each parent's children.
> +        */
> +       css_for_each_descendant_pre(node, &root->css) {
> +               struct drm_cgroup_state *drmcs = css_to_drmcs(node);
> +
> +               if (!css_tryget_online(node))
> +                       goto out_retry;
> +
> +               if (!drmcs->scanned) {
> +                       struct cgroup_subsys_state *css;
> +
> +                       css_for_each_child(css, &drmcs->css) {
> +                               struct drm_cgroup_state *sibling =
> +                                                       css_to_drmcs(css);
> +
> +                               if (!css_tryget_online(css)) {
> +                                       css_put(node);
> +                                       goto out_retry;
> +                               }
> +
> +                               drmcs->children_active_us += sibling->active_us;
> +                               drmcs->sum_children_weights += sibling->weight;
> +
> +                               css_put(css);
> +                       }
> +
> +                       drmcs->scanned = true;
> +               }
> +
> +               css_put(node);
> +       }
> +
> +       /*
> +        * 3rd pass - calculate relative budgets for each group based on
> +        * relative weights and parent's budget.
> +        *
> +        * FIXME: This is for now incomplete in more than one way. There is
> +        * no downward propagation of unused budgets, and even no utilisation of
> +        * the unused budgets at all.
> +        */
> +       css_for_each_descendant_pre(node, &root->css) {
> +               struct drm_cgroup_state *drmcs, *pdrmcs;
> +               bool over, was_over;
> +               u64 budget;
> +
> +               if (!css_tryget_online(node))
> +                       goto out_retry;
> +               if (node->cgroup->level == 1) {
> +                       css_put(node);
> +                       continue;
> +               }
> +               if (!css_tryget_online(node->parent)) {
> +                       css_put(node);
> +                       goto out_retry;
> +               }
> +
> +               drmcs = css_to_drmcs(node);
> +               pdrmcs = css_to_drmcs(node->parent);
> +
> +               drmcs->per_s_budget_ns  =
> +                       DIV_ROUND_UP_ULL(pdrmcs->per_s_budget_ns *
> +                                        drmcs->weight,
> +                                        pdrmcs->sum_children_weights);
> +               budget = DIV_ROUND_UP_ULL(drmcs->per_s_budget_ns * period_us,
> +                                         NSEC_PER_SEC);
> +               over = drmcs->active_us > budget;
> +               was_over = drmcs->over_budget;
> +               drmcs->over_budget = over;
> +               if (over || (!over && was_over))
> +                       signal_drm_budget(drmcs, drmcs->active_us, budget);
> +
> +               css_put(node);
> +               css_put(node->parent);
> +       }
> +
> +out_retry:
> +       rcu_read_unlock();
> +
> +       period_us = READ_ONCE(root->period_us);
> +       if (period_us)
> +               schedule_delayed_work(&root->scan_work,
> +                                     usecs_to_jiffies(period_us));
> +
> +       css_put(&root->css);
> +}
> +
> +static void start_scanning(struct drm_cgroup_state *drmcs, u64 period_us)
> +{
> +       drmcs->period_us = (unsigned int)period_us;
> +       WARN_ON_ONCE(!__start_scanning(drmcs));
> +       drmcs->prev_timestamp = ktime_get();
> +       mod_delayed_work(system_wq, &drmcs->scan_work,
> +                        usecs_to_jiffies(period_us));
> +}
> +
> +static void stop_scanning(struct drm_cgroup_state *drmcs)
> +{
> +       drmcs->period_us = 0;
> +       cancel_delayed_work_sync(&drmcs->scan_work);
> +       if (drmcs->over_budget) {
> +               /*
> +                * Signal under budget when scanning goes off so drivers
> +                * correctly update their state.
> +                */
> +               signal_drm_budget(drmcs, 0, drmcs->per_s_budget_ns);
> +               drmcs->over_budget = false;
> +       }
> +}
> +
> +static struct drm_cgroup_state *drmcs_scanner(struct drm_cgroup_state *drmcs)
> +{
> +       while (drmcs->css.cgroup->level > 1)
> +               drmcs = css_to_drmcs(drmcs->css.parent);
> +
> +       return drmcs;
> +}
> +
> +static void start_suspend_scanning(struct drm_cgroup_state *drmcs)
> +{
> +       drmcs = drmcs_scanner(drmcs);
> +
> +       if (drmcs->scanning_suspended)
> +               return;
> +
> +       drmcs->scanning_suspended = true;
> +       drmcs->suspended_period_us = drmcs->period_us;
> +       drmcs->period_us = 0;
> +}
> +
> +static void finish_suspend_scanning(struct drm_cgroup_state *drmcs)
> +{
> +       drmcs = drmcs_scanner(drmcs);
> +
> +       if (drmcs->suspended_period_us)
> +               cancel_delayed_work_sync(&drmcs->scan_work);
> +}
> +
> +static void resume_scanning(struct drm_cgroup_state *drmcs)
> +{
> +       drmcs = drmcs_scanner(drmcs);
> +
> +       if (!drmcs->scanning_suspended)
> +               return;
> +
> +       drmcs->scanning_suspended = false;
> +       if (drmcs->suspended_period_us) {
> +               start_scanning(drmcs, drmcs->suspended_period_us);
> +               drmcs->suspended_period_us = 0;
> +       }
> +}
> +
>  static void drmcs_free(struct cgroup_subsys_state *css)
>  {
> -       kfree(css_to_drmcs(css));
> +       struct drm_cgroup_state *drmcs = css_to_drmcs(css);
> +
> +       stop_scanning(drmcs);
> +
> +       kfree(drmcs);
> +}
> +
> +static int drmcs_can_attach(struct cgroup_taskset *tset)
> +{
> +       struct cgroup_subsys_state *new_css;
> +       struct task_struct *task;
> +       int ret;
> +
> +       /*
> +        * As processes are getting moved between groups we need to ensure
> +        * both that the old group does not see a sudden downward jump in the
> +        * GPU utilisation, and that the new group does not see a sudden jump
> +        * up with all the GPU time clients belonging to the migrated process
> +        * have accumulated.
> +        *
> +        * To achieve that we suspend the scanner until the migration is
> +        * completed where the resume at the end ensures both groups start
> +        * observing GPU utilisation from a reset state.
> +        */
> +
> +       ret = mutex_lock_interruptible(&drmcg_mutex);
> +       if (ret)
> +               return ret;
> +
> +       cgroup_taskset_for_each(task, new_css, tset) {
> +               start_suspend_scanning(css_to_drmcs(task_css(task,
> +                                                            drm_cgrp_id)));
> +               start_suspend_scanning(css_to_drmcs(new_css));
> +       }
> +
> +       mutex_unlock(&drmcg_mutex);
> +
> +       cgroup_taskset_for_each(task, new_css, tset) {
> +               finish_suspend_scanning(css_to_drmcs(task_css(task,
> +                                                             drm_cgrp_id)));
> +               finish_suspend_scanning(css_to_drmcs(new_css));
> +       }
> +
> +       return 0;
> +}
> +
> +static void tset_resume_scanning(struct cgroup_taskset *tset)
> +{
> +       struct cgroup_subsys_state *new_css;
> +       struct task_struct *task;
> +
> +       mutex_lock(&drmcg_mutex);
> +       cgroup_taskset_for_each(task, new_css, tset) {
> +               resume_scanning(css_to_drmcs(task_css(task, drm_cgrp_id)));
> +               resume_scanning(css_to_drmcs(new_css));
> +       }
> +       mutex_unlock(&drmcg_mutex);
>  }
>
>  static void drmcs_attach(struct cgroup_taskset *tset)
> @@ -219,12 +559,86 @@ static void drmcs_attach(struct cgroup_taskset *tset)
>         cgroup_taskset_for_each(task, css, tset)
>                 drm_pid_update_priority(task_pid(task),
>                                         css_to_drmcs(css)->effective_priority);
> +
> +       tset_resume_scanning(tset);
> +}
> +
> +static void drmcs_cancel_attach(struct cgroup_taskset *tset)
> +{
> +       tset_resume_scanning(tset);
> +}
> +
> +static u64
> +drmcs_read_period_us(struct cgroup_subsys_state *css, struct cftype *cft)
> +{
> +       struct drm_cgroup_state *drmcs = css_to_drmcs(css);
> +
> +       return drmcs->period_us;
> +}
> +
> +static int
> +drmcs_write_period_us(struct cgroup_subsys_state *css, struct cftype *cftype,
> +                     u64 period_us)
> +{
> +       struct drm_cgroup_state *drmcs = css_to_drmcs(css);
> +       int ret;
> +
> +       if (WARN_ON_ONCE(!css->parent))
> +               return -EINVAL;
> +       if (css->cgroup->level != 1)
> +               return -EINVAL;
> +       if ((period_us && period_us < 500000) || period_us > USEC_PER_SEC * 60)
> +               return -EINVAL;
> +
> +       ret = mutex_lock_interruptible(&drmcg_mutex);
> +       if (ret)
> +               return ret;
> +
> +       if (!drmcs->scanning_suspended) {
> +               if (period_us)
> +                       start_scanning(drmcs, period_us);
> +               else
> +                       stop_scanning(drmcs);
> +       } else {
> +               /*
> +                * If scanning is temporarily suspended just update the period
> +                * which will apply once resumed, or simply skip resuming in
> +                * case of disabling.
> +                */
> +               drmcs->suspended_period_us = period_us;
> +               if (!period_us)
> +                       drmcs->scanning_suspended = false;
> +       }
> +
> +       mutex_unlock(&drmcg_mutex);
> +
> +       return 0;
>  }
>
>  void drmcgroup_client_exited(struct task_struct *task)
>  {
>         struct drm_cgroup_state *drmcs = get_task_drmcs(task);
>
> +       /*
> +        * Since we are not tracking accumulated GPU time for each cgroup,
> +        * avoid jumps in group observed GPU usage by re-setting the scanner
> +        * at a point when GPU usage can suddenly jump down.
> +        *
> +        * Downside is clients can influence the effectiveness of the over-
> +        * budget scanning by continuosly closing DRM file descriptors but for

"continuously"

And I think also if a user has permission to create and migrate
processes between cgroups even just under the same parent, since
css_tryget_online failure would cause an early exit during scanning?

> +        * now we do not worry about it.
> +        */
> +
> +       mutex_lock(&drmcg_mutex);
> +       start_suspend_scanning(drmcs);
> +       mutex_unlock(&drmcg_mutex);
> +
> +       finish_suspend_scanning(drmcs);
> +
> +       mutex_lock(&drmcg_mutex);
> +       resume_scanning(drmcs);
> +       mutex_unlock(&drmcg_mutex);
> +
>         css_put(&drmcs->css);
>  }
>  EXPORT_SYMBOL_GPL(drmcgroup_client_exited);
> @@ -232,6 +646,7 @@ EXPORT_SYMBOL_GPL(drmcgroup_client_exited);
>  static struct drm_cgroup_state root_drmcs = {
>         .priority = DRM_CGROUP_PRIORITY_DEF,
>         .effective_priority = DRM_CGROUP_PRIORITY_DEF,
> +       .weight = CGROUP_WEIGHT_DFL,
>  };
>
>  static struct cgroup_subsys_state *
> @@ -247,6 +662,7 @@ drmcs_alloc(struct cgroup_subsys_state *parent_css)
>                 return ERR_PTR(-ENOMEM);
>
>         drmcs->weight = CGROUP_WEIGHT_DFL;
> +       INIT_DELAYED_WORK(&drmcs->scan_work, scan_worker);
>
>         return &drmcs->css;
>  }
> @@ -274,6 +690,12 @@ struct cftype files[] = {
>                 .read_u64 = drmcs_read_weight,
>                 .write_u64 = drmcs_write_weight,
>         },
> +       {
> +               .name = "period_us",
> +               .flags = CFTYPE_NOT_ON_ROOT,
> +               .read_u64 = drmcs_read_period_us,
> +               .write_u64 = drmcs_write_period_us,
> +       },
>         { } /* Zero entry terminates. */
>  };
>
> @@ -281,7 +703,9 @@ struct cgroup_subsys drm_cgrp_subsys = {
>         .css_alloc      = drmcs_alloc,
>         .css_free       = drmcs_free,
>         .css_online     = drmcs_online,
> +       .can_attach     = drmcs_can_attach,
>         .attach         = drmcs_attach,
> +       .cancel_attach  = drmcs_cancel_attach,
>         .early_init     = false,
>         .legacy_cftypes = files,
>         .dfl_cftypes    = files,
> --
> 2.34.1
>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC 00/17] DRM scheduling cgroup controller
  2022-10-19 18:45 ` [RFC 00/17] DRM scheduling cgroup controller Tejun Heo
@ 2022-10-27 14:32   ` Tvrtko Ursulin
  2022-10-31 20:20     ` Tejun Heo
  0 siblings, 1 reply; 28+ messages in thread
From: Tvrtko Ursulin @ 2022-10-27 14:32 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Intel-gfx, cgroups, linux-kernel, Johannes Weiner, Zefan Li,
	Dave Airlie, Daniel Vetter, Rob Clark, Stéphane Marchesin,
	T . J . Mercier, Kenny.Ho, Christian König, Brian Welty,
	Tvrtko Ursulin


Hi Tejun,

On 19/10/2022 19:45, Tejun Heo wrote:
> Hello,
> 
> On Wed, Oct 19, 2022 at 06:32:37PM +0100, Tvrtko Ursulin wrote:
> ...
>> DRM static priority interface files
>> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>>
>>    drm.priority_levels
>> 	One of:
>> 	 1) And integer representing the minimum number of discrete priority
>> 	    levels for the whole group.
>> 	    Optionally followed by an asterisk ('*') indicating some DRM clients
>> 	    in the group support more than the minimum number.
>> 	 2) '0'- indicating one or more DRM clients in the group has no support
>> 	    for static priority control.
>> 	 3) 'n/a' - when there are no DRM clients in the configured group.
>>
>>    drm.priority
>> 	A read-write integer between -10000 and 10000 (inclusive) representing
>> 	an abstract static priority level.
>>
>>    drm.effective_priority
>> 	Read only integer showing the current effective priority level for the
>> 	group. Effective meaning taking into account the chain of inherited
> 
>>From interface POV, this is a lot worse than the second proposal and I'd
> really like to avoid this. Even if we go with mapping user priority
> configuration to per-driver priorities, I'd much prefer if the interface
> presented to user is weight based and let each driver try to match the
> resulting hierarchical weight (ie. the absolute proportion a given cgroup
> should have at the point in time) as best as they can rather than exposing
> opaque priority numbers to userspace whose meaning isn't defined at all.

I actually somewhat agree here and this proposal was mainly motivated by 
my desire to come up with *something* which would allow similar 
_external_ control as it exists in the CPU and IO world (nice/ionice).

Because currently priority of GPU workloads cannot be controlled from 
the outside at all. And especially if we consider pipelines composed of 
stages where part of the processing is done on the CPU and part on the 
GPU (or AI/VPU accelerator), then I think it would be really useful to 
be able to do so.

To this effect I proposed connecting CPU nice with GPU priority, same as 
it is connected for IO priority (so only if it hasn't been explicitly 
changed away from the defaults), but at that point I was getting 
feedback nudging me into direction of cgroups.

Looking at what's available in cgroups world now, I have spotted the 
blkio.prio.class control. For my current use case (lower GPU scheduling 
of background/unfocused windows) that would also work. Even if starting 
with just two possible values - 'no-change' and 'idle' (to follow the IO 
controller naming).

How would you view that as a proposal? It would be a bit tougher "sell" 
to the DRM community, perhaps, given that many drivers do have 
scheduling priority, but the concept of scheduling class is not really 
there. Nevertheless a concept of normal-vs-background could be plausible 
in my mind. It could be easily implemented by using the priority 
controls available in many drivers.

>> DRM scheduling soft limits interface files
>> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>>
>>    drm.weight
>> 	Standard cgroup weight based control [1, 10000] used to configure the
>> 	relative distributing of GPU time between the sibling groups.
> 
> Please take a look at io.weight. This can follow the same convention to
> express both global and per-device weights.
> 
>>    drm.period_us
>> 	An integer representing the period with which the controller should look
>> 	at the GPU usage by the group and potentially send the over/under budget
>> 	signal.
>> 	Value of zero (defaul) disables the soft limit checking.
> 
> Can we not do period_us or at least make it a per-driver tuning parameter
> exposed as module param? Weight, users can easily understand and configure.
> period_us is a lot more an implementation detail. If we want to express the
> trade-off between latency and bandwidth at the interface, we prolly should
> encode the latency requirement in a more canonical way but let's leave that
> for the future.

Yes agreed - for the moment (while RFC) it is handy for me to have it 
controllable per group. But I agree a kernel global (modparam) should be 
sufficient in the final solution.

> 
>>    drm.budget_supported
>> 	One of:
>> 	 1) 'yes' - when all DRM clients in the group support the functionality.
>> 	 2) 'no' - when at least one of the DRM clients does not support the
>> 		   functionality.
>> 	 3) 'n/a' - when there are no DRM clients in the group.
> 
> Yeah, I'm not sure about this. This isn't a per-cgroup property to begin
> with and I'm not sure 'no' meaning at least one device not supporting is
> intuitive. The distinction between 'no' and 'n/a' is kinda weird too. Please
> drop this.

The idea actually is for this to be per-cgroup and potentially change 
dynamically. It implements the concept of "observability", how I have, 
perhaps clumsily, named it.

This is because we can have a mix of DRM file descriptors in a cgroup, 
not all of which support the proposed functionality. So I wanted to have 
something by which the administrator can observe the status of the group.

For instance seeing some clients do not support the feature could be 
signal that things have been misconfigured, or that appeal needs to be 
made for driver X to start supporting the feature. Seeing a "no" there 
in other words is a signal that budgeting may not really work as 
expected and needs to be investigated.

> Another basic interface question. Is everyone happy with the drm prefix or
> should it be something like gpu? Also, in the future, if there's a consensus
> around how to control gpu memory, what prefix would that take?

Given the current effort to bring in AI/VPU/ML accelerator devices under 
the DRM umbrella I think drm prefix is okay. At least some of those 
devices (the ones I looked at) will accept the same concepts of 
scheduling and memory allocation as well.

And for memory specifically, I don't see why the drm prefix would not work.

But yeah, I echo the calls for wider feedback. Given how drm cgroup 
controller has been on a wish list for ages I have to say I expected a 
bit more interest. :)

>> The second proposal is a little bit more advanced in concept and also a little
>> bit less finished. Interesting thing is that it builds upon the per client GPU
>> utilisation work which landed recently for a few drivers. So my thinking is that
>> in principle, an intersect of drivers which support both that and some sort of
>> priority scheduling control, could also in theory support this.
>>
>> Another really interesting angle for this controller is that it mimics the same
>> control menthod used by the CPU scheduler. That is the proportional/weight based
>> GPU time budgeting. Which makes it easy to configure and does not need a new
>> mental model.
>>
>> However, as the introduction mentions, GPUs are much more heterogenous and
>> therefore the controller uses very "soft" wording as to what it promises. The
>> general statement is that it can define budgets, notify clients when they are
>> over them, and let individual drivers implement best effort handling of those
>> conditions.
>>
>> Delegation of duties in the implementation goes likes this:
>>
>>   * DRM cgroup controller implements the control files and the scanning loop.
>>   * DRM core is required to track all DRM clients belonging to processes so it
>>     can answer when asked how much GPU time is a process using.
>>   * DRM core also provides a call back which the controller will call when a
>>     certain process is over budget.
>>   * Individual drivers need to implement two similar hooks, but which work for
>>     a single DRM client. Over budget callback and GPU utilisation query.
>>
>> What I have demonstrated in practice is that when wired to i915, in a really
>> primitive way where the over-budget condition simply lowers the scheduling
>> priority, the concept can be almost equally effective as the static priority
>> control. I say almost because the design where budget control depends on the
>> periodic usage scanning has a fundamental delay, so responsiveness will depend
>> on the scanning period, which may or may not be a problem for a particular use
>> case.
>>
>> The unfinished part is the GPU budgeting split which currently does not
>> propagate unused bandwith to children, neither can share it with siblings. But
>> this is not due fundamental reasons, just to avoid spending too much time on it
>> too early.
> 
> Rather than doing it hierarchically on the spot, it's usually a lot cheaper
> and easier to calculate the flattened hierarchical weight per leaf cgroup
> and divide the bandwidth according to the eventual portions. For an example,
> please take a look at block/blk-iocost.c.

This seems exactly what I had in mind (but haven't implemented it yet). 
So in this RFC I have budget splitting per group where each tree level 
adds up to "100%" and the thing which I have not implemented is 
"borrowing" or yielding (how blk-iocost.c calls it, or donating) unused 
budgets to siblings.

I am very happy to hear my idea is the right one and someone already 
implemented it. Thanks for this pointer!

> I don't know much about the drm driver side, so can't comment much on it but
> I do really like the idea of having the core implementation determining who
> should get how much and then letting each driver enforce the target. That
> seems a lot more robust and generic than trying to somehow coax and expose
> per-driver priority implementations directly.

Thanks, and thanks for having a detailed read and providing great 
feedback so far!

Regards,

Tvrtko


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC 02/17] drm: Track clients per owning process
  2022-10-20 11:33       ` Christian König
@ 2022-10-27 14:35         ` Tvrtko Ursulin
  0 siblings, 0 replies; 28+ messages in thread
From: Tvrtko Ursulin @ 2022-10-27 14:35 UTC (permalink / raw)
  To: Christian König, Intel-gfx
  Cc: cgroups, linux-kernel, Tejun Heo, Johannes Weiner, Zefan Li,
	Dave Airlie, Daniel Vetter, Rob Clark, Stéphane Marchesin,
	T . J . Mercier, Kenny.Ho, Brian Welty, Tvrtko Ursulin


On 20/10/2022 12:33, Christian König wrote:
> Am 20.10.22 um 09:34 schrieb Tvrtko Ursulin:
>>
>> On 20/10/2022 07:40, Christian König wrote:
>>> Am 19.10.22 um 19:32 schrieb Tvrtko Ursulin:
>>>> From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
>>>>
>>>> To enable propagation of settings from the cgroup drm controller to 
>>>> drm we
>>>> need to start tracking which processes own which drm clients.
>>>>
>>>> Implement that by tracking the struct pid pointer of the owning 
>>>> process in
>>>> a new XArray, pointing to a structure containing a list of associated
>>>> struct drm_file pointers.
>>>>
>>>> Clients are added and removed under the filelist mutex and RCU list
>>>> operations are used below it to allow for lockless lookup.
>>>
>>> That won't work easily like this. The problem is that file_priv->pid 
>>> is usually not accurate these days:
>>>
>>>  From the debugfs clients file:
>>>
>>>        systemd-logind   773   0   y    y     0          0
>>>                  Xorg  1639 128   n    n  1000          0
>>>                  Xorg  1639 128   n    n  1000          0
>>>                  Xorg  1639 128   n    n  1000          0
>>>               firefox  2945 128   n    n  1000          0
>>>                  Xorg  1639 128   n    n  1000          0
>>>                  Xorg  1639 128   n    n  1000          0
>>>                  Xorg  1639 128   n    n  1000          0
>>>                  Xorg  1639 128   n    n  1000          0
>>>                chrome 35940 128   n    n  1000          0
>>>                chrome 35940   0   n    y  1000          1
>>>                chrome 35940   0   n    y  1000          2
>>>                  Xorg  1639 128   n    n  1000          0
>>>                  Xorg  1639 128   n    n  1000          0
>>>                  Xorg  1639 128   n    n  1000          0
>>>
>>> This is with glxgears and a bunch other OpenGL applications running.
>>>
>>> The problem is that for most applications the X/Wayland server is now 
>>> opening the render node. The only exceptions in this case are apps 
>>> using DRI2 (VA-API?).
>>>
>>> I always wanted to fix this and actually track who is using the file 
>>> descriptor instead of who opened it, but never had the time to do this.
>>
>> There's a patch later in the series which allows client records to be 
>> migrated to a new PID, and then i915 patch to do that when fd is used 
>> for context create. That approach I think worked well enough in the 
>> past. So maybe it could be done in the DRM core at some suitable entry 
>> point.
> 
> Yeah, that makes some sense. I think you should wire that inside 
> drm_ioctl(), as far as I know more or less all uses of a file descriptor 
> would go through that function.
> 
> And maybe make that a stand alone patch, cause that can go upstream as a 
> bug fix independently if you ask me.

I've put it on my todo list to try and come up with something standalone 
for this problem. Will see if I manage to send it separately or perhaps 
will start the next cgroup controller RFC with those patches.

>>> I think you need to fix this problem first. And BTW: and unsigned 
>>> long doesn't work as PID either with containers.
>>
>> This I am not familiar with so would like to hear more if you could 
>> point me in the right direction at least.
> 
> Uff, I'm the wrong person to ask stuff like that. I just can say from 
> experience because I've ran into that trap as well.
> 
>>
>> My assumption was that struct pid *, which is what I store in unsigned 
>> long, would be unique in a system where there is a single kernel 
>> running, so as long as lifetimes are correct (released from tracking 
>> here when fd is closed, which is implicit on process exit) would work. 
>> You are suggesting that is not so?
> 
> I think you should have the pointer to struct pid directly here since 
> that is a reference counted structure IIRC. But don't ask me what the 
> semantics is how to get or put a reference.

Yeah I think I have all that. I track struct pid, with a reference, in 
drm client, and release it when file descriptor is closed (indirectly 
via the DRM close hook). All I need, I think, is for that mapping to 
answer me "which drm_file objects" are in use by this struct pid 
pointer. I don't see a problem with lifetimes or scope yet.

Regards,

Tvrtko

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC 13/17] cgroup/drm: Ability to periodically scan cgroups for over budget GPU usage
  2022-10-21 22:52   ` T.J. Mercier
@ 2022-10-27 14:45     ` Tvrtko Ursulin
  0 siblings, 0 replies; 28+ messages in thread
From: Tvrtko Ursulin @ 2022-10-27 14:45 UTC (permalink / raw)
  To: T.J. Mercier
  Cc: Intel-gfx, cgroups, linux-kernel, Tejun Heo, Johannes Weiner,
	Zefan Li, Dave Airlie, Daniel Vetter, Rob Clark,
	Stéphane Marchesin, Kenny.Ho, Christian König,
	Brian Welty, Tvrtko Ursulin


On 21/10/2022 23:52, T.J. Mercier wrote:
> On Wed, Oct 19, 2022 at 10:34 AM Tvrtko Ursulin
> <tvrtko.ursulin@linux.intel.com> wrote:
>>
>> From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
>>
>> Add a scanning worker, which if enabled, periodically queries the cgroup
>> for GPU usage and if over budget (as configured by it's relative weight
>> share) notifies the drm core about the fact.
>>
>> This is off by default and can be enabled by configuring a scanning
>> period using the drm.period_us cgroup control file.
>>
>> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
>> ---
>>   Documentation/admin-guide/cgroup-v2.rst |  35 +-
>>   kernel/cgroup/drm.c                     | 426 +++++++++++++++++++++++-
>>   2 files changed, 459 insertions(+), 2 deletions(-)
>>
>> diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
>> index 1f3cca4e2572..318f463a1316 100644
>> --- a/Documentation/admin-guide/cgroup-v2.rst
>> +++ b/Documentation/admin-guide/cgroup-v2.rst
>> @@ -2401,7 +2401,8 @@ HugeTLB Interface Files
>>   DRM
>>   ---
>>
>> -The DRM controller allows configuring static hierarchical scheduling priority.
>> +The DRM controller allows configuring static hierarchical scheduling priority
>> +and scheduling soft limits.
>>
>>   DRM static priority control
>>   ~~~~~~~~~~~~~~~~~~~~~~~~~~~
>> @@ -2458,6 +2459,38 @@ DRM static priority interface files
>>          Read only integer showing the current effective priority level for the
>>          group. Effective meaning taking into account the chain of inherited
>>
>> +DRM scheduling soft limits
>> +~~~~~~~~~~~~~~~~~~~~~~~~~~
>> +
>> +Because of the heterogenous hardware and driver DRM capabilities, soft limits
>> +are implemented as a loose co-operative (bi-directional) interface between the
>> +controller and DRM core.
>> +
>> +The controller configures the GPU time allowed per group and periodically scans
>> +the belonging tasks to detect the over budget condition, at which point it
>> +invokes a callback notifying the DRM core of the condition.
>> +
>> +DRM core provides an API to query per process GPU utilization and 2nd API to
>> +receive notification from the cgroup controller when the group enters or exits
>> +the over budget condition.
>> +
>> +Individual DRM drivers which implement the interface are expected to act on this
>> +in the best-effort manner only. There are no guarantees that the soft limits
>> +will be respected.
>> +
>> +DRM scheduling soft limits interface files
>> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>> +
>> +  drm.weight
>> +       Standard cgroup weight based control [1, 10000] used to configure the
>> +       relative distributing of GPU time between the sibling groups.
>> +
>> +  drm.period_us
>> +       An integer representing the period with which the controller should look
>> +       at the GPU usage by the group and potentially send the over/under budget
>> +       signal.
>> +       Value of zero (defaul) disables the soft limit checking.
>> +
>>   Misc
>>   ----
>>
>> diff --git a/kernel/cgroup/drm.c b/kernel/cgroup/drm.c
>> index 48f1eaaa1c07..af50ead1564a 100644
>> --- a/kernel/cgroup/drm.c
>> +++ b/kernel/cgroup/drm.c
>> @@ -18,6 +18,29 @@ struct drm_cgroup_state {
>>          int priority;
>>          int effective_priority;
>>          unsigned int weight;
>> +       unsigned int period_us;
>> +
>> +       bool scanning_suspended;
>> +       unsigned int suspended_period_us;
>> +
>> +       struct delayed_work scan_work;
>> +
>> +       /*
>> +        * Below fields are owned and updated by the scan worker. Either the
>> +        * worker accesses them, or worker needs to be suspended and synced
>> +        * before they can be touched from the outside.
>> +        */
>> +       bool scanned;
>> +
>> +       ktime_t prev_timestamp;
>> +
>> +       u64 sum_children_weights;
>> +       u64 children_active_us;
>> +       u64 per_s_budget_ns;
>> +       u64 prev_active_us;
>> +       u64 active_us;
>> +
>> +       bool over_budget;
>>   };
>>
>>   static DEFINE_MUTEX(drmcg_mutex);
>> @@ -33,6 +56,31 @@ static inline struct drm_cgroup_state *get_task_drmcs(struct task_struct *task)
>>          return css_to_drmcs(task_get_css(task, drm_cgrp_id));
>>   }
>>
>> +static u64 drmcs_get_active_time_us(struct drm_cgroup_state *drmcs)
>> +{
>> +       struct cgroup *cgrp = drmcs->css.cgroup;
>> +       struct task_struct *task;
>> +       struct css_task_iter it;
>> +       u64 total = 0;
>> +
>> +       css_task_iter_start(&cgrp->self,
>> +                           CSS_TASK_ITER_PROCS | CSS_TASK_ITER_THREADED,
>> +                           &it);
>> +       while ((task = css_task_iter_next(&it))) {
>> +               u64 time;
>> +
>> +               /* Ignore kernel threads here. */
>> +               if (task->flags & PF_KTHREAD)
>> +                       continue;
>> +
>> +               time = drm_pid_get_active_time_us(task_pid(task));
>> +               total += time;
>> +       }
>> +       css_task_iter_end(&it);
>> +
>> +       return total;
>> +}
>> +
>>   int drmcgroup_lookup_effective_priority(struct task_struct *task)
>>   {
>>          struct drm_cgroup_state *drmcs = get_task_drmcs(task);
>> @@ -202,9 +250,301 @@ static int drmcs_online(struct cgroup_subsys_state *css)
>>          return 0;
>>   }
>>
>> +static void
>> +signal_drm_budget(struct drm_cgroup_state *drmcs, u64 usage, u64 budget)
>> +{
>> +       struct cgroup *cgrp = drmcs->css.cgroup;
>> +       struct task_struct *task;
>> +       struct css_task_iter it;
>> +
>> +       css_task_iter_start(&cgrp->self,
>> +                           CSS_TASK_ITER_PROCS | CSS_TASK_ITER_THREADED,
>> +                           &it);
>> +       while ((task = css_task_iter_next(&it))) {
>> +               /* Ignore kernel threads here. */
>> +               if (task->flags & PF_KTHREAD)
>> +                       continue;
>> +
>> +               drm_pid_signal_budget(task_pid(task), usage, budget);
>> +       }
>> +       css_task_iter_end(&it);
>> +}
>> +
>> +static bool __start_scanning(struct drm_cgroup_state *root)
>> +{
>> +       struct cgroup_subsys_state *node;
>> +       bool ok = true;
>> +
>> +       rcu_read_lock();
>> +       css_for_each_descendant_pre(node, &root->css) {
>> +               struct drm_cgroup_state *drmcs = css_to_drmcs(node);
>> +               unsigned long active;
>> +
>> +               if (!css_tryget_online(node)) {
>> +                       ok = false;
>> +                       continue;
>> +               }
>> +
>> +               drmcs->scanned = false;
>> +               drmcs->sum_children_weights = 0;
>> +               drmcs->children_active_us = 0;
>> +               if (node == &root->css)
>> +                       drmcs->per_s_budget_ns = NSEC_PER_SEC;
>> +               else
>> +                       drmcs->per_s_budget_ns = 0;
>> +
>> +               active = drmcs_get_active_time_us(drmcs);
>> +               if (active >= drmcs->prev_active_us)
>> +                       drmcs->active_us = active - drmcs->prev_active_us;
>> +               else
>> +                       drmcs->active_us = 0;
>> +               drmcs->prev_active_us = active;
>> +
>> +               css_put(node);
>> +       }
>> +       rcu_read_unlock();
>> +
>> +       return ok;
>> +}
>> +
>> +static void scan_worker(struct work_struct *work)
>> +{
>> +       struct drm_cgroup_state *root =
>> +               container_of(work, typeof(*root), scan_work.work);
>> +       struct cgroup_subsys_state *node;
>> +       unsigned int period_us;
>> +       ktime_t now;
>> +
>> +       rcu_read_lock();
> 
> Hi Tvrtko, I think this lock needs to come after the return for the
> online check just below here to avoid missing the rcu_read_unlock at
> out_retry. Although it doesn't look like this should ever run in the
> first place if the DRM controller is disabled.

Yep - thank you. I've fixed it up locally.

Another TODO note for me is to see if I can hook into the cgroup offline 
hook to avoid all this checks by perhaps making sure scan worker is not 
running as transitions are happening. Will see if that works.

>> +
>> +       if (WARN_ON_ONCE(!css_tryget_online(&root->css)))
>> +               return;
>> +
>> +       /*
>> +        * 1st pass - reset accumulated values and update group GPU activity.
>> +        */
>> +       if (!__start_scanning(root))
>> +               goto out_retry; /*
>> +                                * Always come back later if scanner races with
>> +                                * core cgroup management. (Repeated pattern.)
>> +                                */
>> +
>> +       now = ktime_get();
>> +       period_us = ktime_to_us(ktime_sub(now, root->prev_timestamp));
>> +       root->prev_timestamp = now;
>> +
>> +       /*
>> +        * 2nd pass - calculate accumulated GPU activity and relative weights
>> +        * for each parent's children.
>> +        */
>> +       css_for_each_descendant_pre(node, &root->css) {
>> +               struct drm_cgroup_state *drmcs = css_to_drmcs(node);
>> +
>> +               if (!css_tryget_online(node))
>> +                       goto out_retry;
>> +
>> +               if (!drmcs->scanned) {
>> +                       struct cgroup_subsys_state *css;
>> +
>> +                       css_for_each_child(css, &drmcs->css) {
>> +                               struct drm_cgroup_state *sibling =
>> +                                                       css_to_drmcs(css);
>> +
>> +                               if (!css_tryget_online(css)) {
>> +                                       css_put(node);
>> +                                       goto out_retry;
>> +                               }
>> +
>> +                               drmcs->children_active_us += sibling->active_us;
>> +                               drmcs->sum_children_weights += sibling->weight;
>> +
>> +                               css_put(css);
>> +                       }
>> +
>> +                       drmcs->scanned = true;
>> +               }
>> +
>> +               css_put(node);
>> +       }
>> +
>> +       /*
>> +        * 3rd pass - calculate relative budgets for each group based on
>> +        * relative weights and parent's budget.
>> +        *
>> +        * FIXME: This is for now incomplete in more than one way. There is
>> +        * no downward propagation of unused budgets, and even no utilisation of
>> +        * the unused budgets at all.
>> +        */
>> +       css_for_each_descendant_pre(node, &root->css) {
>> +               struct drm_cgroup_state *drmcs, *pdrmcs;
>> +               bool over, was_over;
>> +               u64 budget;
>> +
>> +               if (!css_tryget_online(node))
>> +                       goto out_retry;
>> +               if (node->cgroup->level == 1) {
>> +                       css_put(node);
>> +                       continue;
>> +               }
>> +               if (!css_tryget_online(node->parent)) {
>> +                       css_put(node);
>> +                       goto out_retry;
>> +               }
>> +
>> +               drmcs = css_to_drmcs(node);
>> +               pdrmcs = css_to_drmcs(node->parent);
>> +
>> +               drmcs->per_s_budget_ns  =
>> +                       DIV_ROUND_UP_ULL(pdrmcs->per_s_budget_ns *
>> +                                        drmcs->weight,
>> +                                        pdrmcs->sum_children_weights);
>> +               budget = DIV_ROUND_UP_ULL(drmcs->per_s_budget_ns * period_us,
>> +                                         NSEC_PER_SEC);
>> +               over = drmcs->active_us > budget;
>> +               was_over = drmcs->over_budget;
>> +               drmcs->over_budget = over;
>> +               if (over || (!over && was_over))
>> +                       signal_drm_budget(drmcs, drmcs->active_us, budget);
>> +
>> +               css_put(node);
>> +               css_put(node->parent);
>> +       }
>> +
>> +out_retry:
>> +       rcu_read_unlock();
>> +
>> +       period_us = READ_ONCE(root->period_us);
>> +       if (period_us)
>> +               schedule_delayed_work(&root->scan_work,
>> +                                     usecs_to_jiffies(period_us));
>> +
>> +       css_put(&root->css);
>> +}
>> +
>> +static void start_scanning(struct drm_cgroup_state *drmcs, u64 period_us)
>> +{
>> +       drmcs->period_us = (unsigned int)period_us;
>> +       WARN_ON_ONCE(!__start_scanning(drmcs));
>> +       drmcs->prev_timestamp = ktime_get();
>> +       mod_delayed_work(system_wq, &drmcs->scan_work,
>> +                        usecs_to_jiffies(period_us));
>> +}
>> +
>> +static void stop_scanning(struct drm_cgroup_state *drmcs)
>> +{
>> +       drmcs->period_us = 0;
>> +       cancel_delayed_work_sync(&drmcs->scan_work);
>> +       if (drmcs->over_budget) {
>> +               /*
>> +                * Signal under budget when scanning goes off so drivers
>> +                * correctly update their state.
>> +                */
>> +               signal_drm_budget(drmcs, 0, drmcs->per_s_budget_ns);
>> +               drmcs->over_budget = false;
>> +       }
>> +}
>> +
>> +static struct drm_cgroup_state *drmcs_scanner(struct drm_cgroup_state *drmcs)
>> +{
>> +       while (drmcs->css.cgroup->level > 1)
>> +               drmcs = css_to_drmcs(drmcs->css.parent);
>> +
>> +       return drmcs;
>> +}
>> +
>> +static void start_suspend_scanning(struct drm_cgroup_state *drmcs)
>> +{
>> +       drmcs = drmcs_scanner(drmcs);
>> +
>> +       if (drmcs->scanning_suspended)
>> +               return;
>> +
>> +       drmcs->scanning_suspended = true;
>> +       drmcs->suspended_period_us = drmcs->period_us;
>> +       drmcs->period_us = 0;
>> +}
>> +
>> +static void finish_suspend_scanning(struct drm_cgroup_state *drmcs)
>> +{
>> +       drmcs = drmcs_scanner(drmcs);
>> +
>> +       if (drmcs->suspended_period_us)
>> +               cancel_delayed_work_sync(&drmcs->scan_work);
>> +}
>> +
>> +static void resume_scanning(struct drm_cgroup_state *drmcs)
>> +{
>> +       drmcs = drmcs_scanner(drmcs);
>> +
>> +       if (!drmcs->scanning_suspended)
>> +               return;
>> +
>> +       drmcs->scanning_suspended = false;
>> +       if (drmcs->suspended_period_us) {
>> +               start_scanning(drmcs, drmcs->suspended_period_us);
>> +               drmcs->suspended_period_us = 0;
>> +       }
>> +}
>> +
>>   static void drmcs_free(struct cgroup_subsys_state *css)
>>   {
>> -       kfree(css_to_drmcs(css));
>> +       struct drm_cgroup_state *drmcs = css_to_drmcs(css);
>> +
>> +       stop_scanning(drmcs);
>> +
>> +       kfree(drmcs);
>> +}
>> +
>> +static int drmcs_can_attach(struct cgroup_taskset *tset)
>> +{
>> +       struct cgroup_subsys_state *new_css;
>> +       struct task_struct *task;
>> +       int ret;
>> +
>> +       /*
>> +        * As processes are getting moved between groups we need to ensure
>> +        * both that the old group does not see a sudden downward jump in the
>> +        * GPU utilisation, and that the new group does not see a sudden jump
>> +        * up with all the GPU time clients belonging to the migrated process
>> +        * have accumulated.
>> +        *
>> +        * To achieve that we suspend the scanner until the migration is
>> +        * completed where the resume at the end ensures both groups start
>> +        * observing GPU utilisation from a reset state.
>> +        */
>> +
>> +       ret = mutex_lock_interruptible(&drmcg_mutex);
>> +       if (ret)
>> +               return ret;
>> +
>> +       cgroup_taskset_for_each(task, new_css, tset) {
>> +               start_suspend_scanning(css_to_drmcs(task_css(task,
>> +                                                            drm_cgrp_id)));
>> +               start_suspend_scanning(css_to_drmcs(new_css));
>> +       }
>> +
>> +       mutex_unlock(&drmcg_mutex);
>> +
>> +       cgroup_taskset_for_each(task, new_css, tset) {
>> +               finish_suspend_scanning(css_to_drmcs(task_css(task,
>> +                                                             drm_cgrp_id)));
>> +               finish_suspend_scanning(css_to_drmcs(new_css));
>> +       }
>> +
>> +       return 0;
>> +}
>> +
>> +static void tset_resume_scanning(struct cgroup_taskset *tset)
>> +{
>> +       struct cgroup_subsys_state *new_css;
>> +       struct task_struct *task;
>> +
>> +       mutex_lock(&drmcg_mutex);
>> +       cgroup_taskset_for_each(task, new_css, tset) {
>> +               resume_scanning(css_to_drmcs(task_css(task, drm_cgrp_id)));
>> +               resume_scanning(css_to_drmcs(new_css));
>> +       }
>> +       mutex_unlock(&drmcg_mutex);
>>   }
>>
>>   static void drmcs_attach(struct cgroup_taskset *tset)
>> @@ -219,12 +559,86 @@ static void drmcs_attach(struct cgroup_taskset *tset)
>>          cgroup_taskset_for_each(task, css, tset)
>>                  drm_pid_update_priority(task_pid(task),
>>                                          css_to_drmcs(css)->effective_priority);
>> +
>> +       tset_resume_scanning(tset);
>> +}
>> +
>> +static void drmcs_cancel_attach(struct cgroup_taskset *tset)
>> +{
>> +       tset_resume_scanning(tset);
>> +}
>> +
>> +static u64
>> +drmcs_read_period_us(struct cgroup_subsys_state *css, struct cftype *cft)
>> +{
>> +       struct drm_cgroup_state *drmcs = css_to_drmcs(css);
>> +
>> +       return drmcs->period_us;
>> +}
>> +
>> +static int
>> +drmcs_write_period_us(struct cgroup_subsys_state *css, struct cftype *cftype,
>> +                     u64 period_us)
>> +{
>> +       struct drm_cgroup_state *drmcs = css_to_drmcs(css);
>> +       int ret;
>> +
>> +       if (WARN_ON_ONCE(!css->parent))
>> +               return -EINVAL;
>> +       if (css->cgroup->level != 1)
>> +               return -EINVAL;
>> +       if ((period_us && period_us < 500000) || period_us > USEC_PER_SEC * 60)
>> +               return -EINVAL;
>> +
>> +       ret = mutex_lock_interruptible(&drmcg_mutex);
>> +       if (ret)
>> +               return ret;
>> +
>> +       if (!drmcs->scanning_suspended) {
>> +               if (period_us)
>> +                       start_scanning(drmcs, period_us);
>> +               else
>> +                       stop_scanning(drmcs);
>> +       } else {
>> +               /*
>> +                * If scanning is temporarily suspended just update the period
>> +                * which will apply once resumed, or simply skip resuming in
>> +                * case of disabling.
>> +                */
>> +               drmcs->suspended_period_us = period_us;
>> +               if (!period_us)
>> +                       drmcs->scanning_suspended = false;
>> +       }
>> +
>> +       mutex_unlock(&drmcg_mutex);
>> +
>> +       return 0;
>>   }
>>
>>   void drmcgroup_client_exited(struct task_struct *task)
>>   {
>>          struct drm_cgroup_state *drmcs = get_task_drmcs(task);
>>
>> +       /*
>> +        * Since we are not tracking accumulated GPU time for each cgroup,
>> +        * avoid jumps in group observed GPU usage by re-setting the scanner
>> +        * at a point when GPU usage can suddenly jump down.
>> +        *
>> +        * Downside is clients can influence the effectiveness of the over-
>> +        * budget scanning by continuosly closing DRM file descriptors but for
> 
> "continuously"

Thanks, fixed locally.

> 
> And I think also if a user has permission to create and migrate
> processes between cgroups even just under the same parent, since
> css_tryget_online failure would cause an early exit during scanning?

I will need to double check if migration causes online/offline events. I 
didn't think it does. I know it causes "attach" "family" of events which 
I hooked into to ensure no sudden jumps in usage during migration. See 
comment in drmcs_can_attach(). But that is a pretty rare event I think.

Unless you are thinking from the ChromeOS angle and browser 
foreground/backgroun tabs cgroups? Those are not doing GPU rendering so 
I did not envisage hooking up the DRM controller there. To start with I 
only planned to hook it up with the children of the vms cgroup. The 
model in that branch is not to migrate processes but to change cgroup 
CPU share based on ::OnWindowActivated hooks.

Regards,

Tvrtko

>> +        * now we do not worry about it.
>> +        */
>> +
>> +       mutex_lock(&drmcg_mutex);
>> +       start_suspend_scanning(drmcs);
>> +       mutex_unlock(&drmcg_mutex);
>> +
>> +       finish_suspend_scanning(drmcs);
>> +
>> +       mutex_lock(&drmcg_mutex);
>> +       resume_scanning(drmcs);
>> +       mutex_unlock(&drmcg_mutex);
>> +
>>          css_put(&drmcs->css);
>>   }
>>   EXPORT_SYMBOL_GPL(drmcgroup_client_exited);
>> @@ -232,6 +646,7 @@ EXPORT_SYMBOL_GPL(drmcgroup_client_exited);
>>   static struct drm_cgroup_state root_drmcs = {
>>          .priority = DRM_CGROUP_PRIORITY_DEF,
>>          .effective_priority = DRM_CGROUP_PRIORITY_DEF,
>> +       .weight = CGROUP_WEIGHT_DFL,
>>   };
>>
>>   static struct cgroup_subsys_state *
>> @@ -247,6 +662,7 @@ drmcs_alloc(struct cgroup_subsys_state *parent_css)
>>                  return ERR_PTR(-ENOMEM);
>>
>>          drmcs->weight = CGROUP_WEIGHT_DFL;
>> +       INIT_DELAYED_WORK(&drmcs->scan_work, scan_worker);
>>
>>          return &drmcs->css;
>>   }
>> @@ -274,6 +690,12 @@ struct cftype files[] = {
>>                  .read_u64 = drmcs_read_weight,
>>                  .write_u64 = drmcs_write_weight,
>>          },
>> +       {
>> +               .name = "period_us",
>> +               .flags = CFTYPE_NOT_ON_ROOT,
>> +               .read_u64 = drmcs_read_period_us,
>> +               .write_u64 = drmcs_write_period_us,
>> +       },
>>          { } /* Zero entry terminates. */
>>   };
>>
>> @@ -281,7 +703,9 @@ struct cgroup_subsys drm_cgrp_subsys = {
>>          .css_alloc      = drmcs_alloc,
>>          .css_free       = drmcs_free,
>>          .css_online     = drmcs_online,
>> +       .can_attach     = drmcs_can_attach,
>>          .attach         = drmcs_attach,
>> +       .cancel_attach  = drmcs_cancel_attach,
>>          .early_init     = false,
>>          .legacy_cftypes = files,
>>          .dfl_cftypes    = files,
>> --
>> 2.34.1
>>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC 00/17] DRM scheduling cgroup controller
  2022-10-27 14:32   ` Tvrtko Ursulin
@ 2022-10-31 20:20     ` Tejun Heo
  2022-11-09 16:59       ` Tvrtko Ursulin
  0 siblings, 1 reply; 28+ messages in thread
From: Tejun Heo @ 2022-10-31 20:20 UTC (permalink / raw)
  To: Tvrtko Ursulin
  Cc: Intel-gfx, cgroups, linux-kernel, Johannes Weiner, Zefan Li,
	Dave Airlie, Daniel Vetter, Rob Clark, Stéphane Marchesin,
	T . J . Mercier, Kenny.Ho, Christian König, Brian Welty,
	Tvrtko Ursulin

Hello,

On Thu, Oct 27, 2022 at 03:32:00PM +0100, Tvrtko Ursulin wrote:
> Looking at what's available in cgroups world now, I have spotted the
> blkio.prio.class control. For my current use case (lower GPU scheduling of
> background/unfocused windows) that would also work. Even if starting with
> just two possible values - 'no-change' and 'idle' (to follow the IO
> controller naming).

I wouldn't follow that example. That's only meaningful within the context of
bfq and it probabaly shouldn't have been merged in the first place.

> How would you view that as a proposal? It would be a bit tougher "sell" to
> the DRM community, perhaps, given that many drivers do have scheduling
> priority, but the concept of scheduling class is not really there.
> Nevertheless a concept of normal-vs-background could be plausible in my
> mind. It could be easily implemented by using the priority controls
> available in many drivers.

I don't feel great about that.

* The semantics aren't clearly defined. While not immediately obvious in the
  interface, the task nice levels have definite mappings to weight values
  and thus clear meanings. I don't think it's a good idea to leave the
  interface semantics open to interpretation.

* Maybe GPUs are better but my experience with optional hardware features in
  the storage world has been that vendors diverge wildly and unexpectedly to
  the point many features are mostly useless. There are fewer GPU vendors
  and more software effort behind each, so maybe the situation is better but
  I think it'd be helpul to keep some skepticism.

* Even when per-vendor or per-driver features are consistent enough to be
  useful, they often aren't thought through enough to be truly useful. e.g.
  nvme has priority features but they aren't really that useful because they
  can't do much without congestion control on the issuer side and once you
  have congestion control on the issuer side which is usually a lot more
  complex (e.g. dealing with work-conserving hierarchical weight
  distributions, priority inversions and so on), you can achieve most of
  what you need in terms of resource control from the issuer side anyway.

So, I'd much prefer to have a fuller solution on the kernel side which
integrates per-vendor/driver features where they make sense.

> > >    drm.budget_supported
> > > 	One of:
> > > 	 1) 'yes' - when all DRM clients in the group support the functionality.
> > > 	 2) 'no' - when at least one of the DRM clients does not support the
> > > 		   functionality.
> > > 	 3) 'n/a' - when there are no DRM clients in the group.
> > 
> > Yeah, I'm not sure about this. This isn't a per-cgroup property to begin
> > with and I'm not sure 'no' meaning at least one device not supporting is
> > intuitive. The distinction between 'no' and 'n/a' is kinda weird too. Please
> > drop this.
> 
> The idea actually is for this to be per-cgroup and potentially change
> dynamically. It implements the concept of "observability", how I have,
> perhaps clumsily, named it.
> 
> This is because we can have a mix of DRM file descriptors in a cgroup, not
> all of which support the proposed functionality. So I wanted to have
> something by which the administrator can observe the status of the group.
> 
> For instance seeing some clients do not support the feature could be signal
> that things have been misconfigured, or that appeal needs to be made for
> driver X to start supporting the feature. Seeing a "no" there in other words
> is a signal that budgeting may not really work as expected and needs to be
> investigated.

I still don't see how this is per-cgroup given that it's indicating whether
the driver supports some feature. Also, the eventual goal would be
supporting the same control mechanisms across most (if not all) GPUs, right?

> > Rather than doing it hierarchically on the spot, it's usually a lot cheaper
> > and easier to calculate the flattened hierarchical weight per leaf cgroup
> > and divide the bandwidth according to the eventual portions. For an example,
> > please take a look at block/blk-iocost.c.
> 
> This seems exactly what I had in mind (but haven't implemented it yet). So
> in this RFC I have budget splitting per group where each tree level adds up
> to "100%" and the thing which I have not implemented is "borrowing" or
> yielding (how blk-iocost.c calls it, or donating) unused budgets to
> siblings.
> 
> I am very happy to hear my idea is the right one and someone already
> implemented it. Thanks for this pointer!

The budget donation thing in iocost is necessary only because it wants to
make the hot path local to the cgroup because io control has to support very
high decision rate. For time-slicing GPU, it's likely that following the
current hierarchical weight on the spot is enough.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC 00/17] DRM scheduling cgroup controller
  2022-10-31 20:20     ` Tejun Heo
@ 2022-11-09 16:59       ` Tvrtko Ursulin
  0 siblings, 0 replies; 28+ messages in thread
From: Tvrtko Ursulin @ 2022-11-09 16:59 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Intel-gfx, cgroups, linux-kernel, Johannes Weiner, Zefan Li,
	Dave Airlie, Daniel Vetter, Rob Clark, Stéphane Marchesin,
	T . J . Mercier, Kenny.Ho, Christian König, Brian Welty,
	Tvrtko Ursulin


On 31/10/2022 20:20, Tejun Heo wrote:
> Hello,
> 
> On Thu, Oct 27, 2022 at 03:32:00PM +0100, Tvrtko Ursulin wrote:
>> Looking at what's available in cgroups world now, I have spotted the
>> blkio.prio.class control. For my current use case (lower GPU scheduling of
>> background/unfocused windows) that would also work. Even if starting with
>> just two possible values - 'no-change' and 'idle' (to follow the IO
>> controller naming).
> 
> I wouldn't follow that example. That's only meaningful within the context of
> bfq and it probabaly shouldn't have been merged in the first place.
> 
>> How would you view that as a proposal? It would be a bit tougher "sell" to
>> the DRM community, perhaps, given that many drivers do have scheduling
>> priority, but the concept of scheduling class is not really there.
>> Nevertheless a concept of normal-vs-background could be plausible in my
>> mind. It could be easily implemented by using the priority controls
>> available in many drivers.
> 
> I don't feel great about that.
> 
> * The semantics aren't clearly defined. While not immediately obvious in the
>    interface, the task nice levels have definite mappings to weight values
>    and thus clear meanings. I don't think it's a good idea to leave the
>    interface semantics open to interpretation.

Agreed it is not clearly defined, but it was the same with nice levels. 
As in it changed a lot over the years how exactly they behave (with a 
few scheduler rewrites) and they were constantly at least somewhat 
useful as a mean of external control.

Nevertheless you will notice I have dropped everything priority based 
from the v2 of the RFC to simplify the conversation going forward.

> * Maybe GPUs are better but my experience with optional hardware features in
>    the storage world has been that vendors diverge wildly and unexpectedly to
>    the point many features are mostly useless. There are fewer GPU vendors
>    and more software effort behind each, so maybe the situation is better but
>    I think it'd be helpul to keep some skepticism.
> 
> * Even when per-vendor or per-driver features are consistent enough to be
>    useful, they often aren't thought through enough to be truly useful. e.g.
>    nvme has priority features but they aren't really that useful because they
>    can't do much without congestion control on the issuer side and once you
>    have congestion control on the issuer side which is usually a lot more
>    complex (e.g. dealing with work-conserving hierarchical weight
>    distributions, priority inversions and so on), you can achieve most of
>    what you need in terms of resource control from the issuer side anyway.

GPUs will not be fully uniform either, especially in the terms of how 
well the controls work, which is why I am spelling out how this is only 
attempting to do "soft limits", everywhere in the documentation, cover 
letter and patch commit message.

But at least concept of GPU time feels to me like a very universal one 
so should be something which we can base the control on.

> So, I'd much prefer to have a fuller solution on the kernel side which
> integrates per-vendor/driver features where they make sense.
> 
>>>>     drm.budget_supported
>>>> 	One of:
>>>> 	 1) 'yes' - when all DRM clients in the group support the functionality.
>>>> 	 2) 'no' - when at least one of the DRM clients does not support the
>>>> 		   functionality.
>>>> 	 3) 'n/a' - when there are no DRM clients in the group.
>>>
>>> Yeah, I'm not sure about this. This isn't a per-cgroup property to begin
>>> with and I'm not sure 'no' meaning at least one device not supporting is
>>> intuitive. The distinction between 'no' and 'n/a' is kinda weird too. Please
>>> drop this.
>>
>> The idea actually is for this to be per-cgroup and potentially change
>> dynamically. It implements the concept of "observability", how I have,
>> perhaps clumsily, named it.
>>
>> This is because we can have a mix of DRM file descriptors in a cgroup, not
>> all of which support the proposed functionality. So I wanted to have
>> something by which the administrator can observe the status of the group.
>>
>> For instance seeing some clients do not support the feature could be signal
>> that things have been misconfigured, or that appeal needs to be made for
>> driver X to start supporting the feature. Seeing a "no" there in other words
>> is a signal that budgeting may not really work as expected and needs to be
>> investigated.
> 
> I still don't see how this is per-cgroup given that it's indicating whether
> the driver supports some feature. Also, the eventual goal would be
> supporting the same control mechanisms across most (if not all) GPUs, right?

I have dropped this from v2 as well, to focus the feedback on most 
important points.

But in general the problem it wanted to address is that a single cgroup 
can contain multiple processes, each with zero to N open DRM file 
descriptors to any random GPU which happens to be installed in the 
system. And it can all change dynamically. It may be different vendors 
or different hardware generations, where some do not support the 
required functionality to support the cgroup controller.

So I wanted to give the sysadmin some visibility if at any given time 
the configuration applied to a cgroup has a chance to work fully, or 
only partially.

For instance with i915 we can have two supported devices in a laptop - 
integrated and discrete. Integrated can support the controller well, 
while for discrete is work in progress, maybe comes in the next kernel 
release. And then we can have this:

1)

cgexec -g drm:gfx/clients glxgears

This runs fully on integrated and budgeting works as expected.

2) DRI_PRIME=1 cgexec -g drm:gfx/clients glxgears

This one runs on the discrete GPU where budgeting does not work yet. 
While at the same time there can be another client in the same cgroup 
for which budgeting works. Or in other words:

cgexec -g drm:gfx/clients transcode_me_a_video
DRI_PRIME=1 cgexec -g drm:gfx/clients run_a_game

In this case game is not taking part in budgeting, even though it is a 
same driver. Maybe it will in a few kernel releases but not yet. Or in 
case of super old GPUs maybe support never gets added.

Perhaps I am over complicating things and what it would enough is to log 
the capability per device during driver probe:

i915 0000:00:02.0: DRM cgroup support: weights
i915 0000:01:00.0: DRM cgroup support: none

?

>>> Rather than doing it hierarchically on the spot, it's usually a lot cheaper
>>> and easier to calculate the flattened hierarchical weight per leaf cgroup
>>> and divide the bandwidth according to the eventual portions. For an example,
>>> please take a look at block/blk-iocost.c.
>>
>> This seems exactly what I had in mind (but haven't implemented it yet). So
>> in this RFC I have budget splitting per group where each tree level adds up
>> to "100%" and the thing which I have not implemented is "borrowing" or
>> yielding (how blk-iocost.c calls it, or donating) unused budgets to
>> siblings.
>>
>> I am very happy to hear my idea is the right one and someone already
>> implemented it. Thanks for this pointer!
> 
> The budget donation thing in iocost is necessary only because it wants to
> make the hot path local to the cgroup because io control has to support very
> high decision rate. For time-slicing GPU, it's likely that following the
> current hierarchical weight on the spot is enough.

I think I completed this part in v2. At least some quick smoke testing 
showed me that budgets now correctly propagate through the tree.

Not guaranteeing no bugs just yet and there are certainly still things 
to polish up in v2.

Regards,

Tvrtko

^ permalink raw reply	[flat|nested] 28+ messages in thread

end of thread, other threads:[~2022-11-09 17:02 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-10-19 17:32 [RFC 00/17] DRM scheduling cgroup controller Tvrtko Ursulin
2022-10-19 17:32 ` [RFC 01/17] cgroup: Add the DRM " Tvrtko Ursulin
2022-10-19 17:32 ` [RFC 02/17] drm: Track clients per owning process Tvrtko Ursulin
2022-10-20  6:40   ` Christian König
2022-10-20  7:34     ` Tvrtko Ursulin
2022-10-20 11:33       ` Christian König
2022-10-27 14:35         ` Tvrtko Ursulin
2022-10-19 17:32 ` [RFC 03/17] cgroup/drm: Support cgroup priority control Tvrtko Ursulin
2022-10-19 17:32 ` [RFC 04/17] drm/cgroup: Allow safe external access to file_priv Tvrtko Ursulin
2022-10-19 17:32 ` [RFC 05/17] drm: Connect priority updates to drm core Tvrtko Ursulin
2022-10-19 17:32 ` [RFC 06/17] drm: Only track clients which are providing drm_cgroup_ops Tvrtko Ursulin
2022-10-19 17:32 ` [RFC 07/17] drm/i915: i915 priority Tvrtko Ursulin
2022-10-19 17:32 ` [RFC 08/17] drm: Allow for migration of clients Tvrtko Ursulin
2022-10-19 17:32 ` [RFC 09/17] cgroup/drm: Introduce weight based drm cgroup control Tvrtko Ursulin
2022-10-19 17:32 ` [RFC 10/17] drm: Add ability to query drm cgroup GPU time Tvrtko Ursulin
2022-10-19 17:32 ` [RFC 11/17] drm: Add over budget signalling callback Tvrtko Ursulin
2022-10-19 17:32 ` [RFC 12/17] cgroup/drm: Client exit hook Tvrtko Ursulin
2022-10-19 17:32 ` [RFC 13/17] cgroup/drm: Ability to periodically scan cgroups for over budget GPU usage Tvrtko Ursulin
2022-10-21 22:52   ` T.J. Mercier
2022-10-27 14:45     ` Tvrtko Ursulin
2022-10-19 17:32 ` [RFC 14/17] cgroup/drm: Show group budget signaling capability in sysfs Tvrtko Ursulin
2022-10-19 17:32 ` [RFC 15/17] drm/i915: Migrate client to new owner on context create Tvrtko Ursulin
2022-10-19 17:32 ` [RFC 16/17] drm/i915: Wire up with drm controller GPU time query Tvrtko Ursulin
2022-10-19 17:32 ` [RFC 17/17] drm/i915: Implement cgroup controller over budget throttling Tvrtko Ursulin
2022-10-19 18:45 ` [RFC 00/17] DRM scheduling cgroup controller Tejun Heo
2022-10-27 14:32   ` Tvrtko Ursulin
2022-10-31 20:20     ` Tejun Heo
2022-11-09 16:59       ` Tvrtko Ursulin

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).