dri-devel.lists.freedesktop.org archive mirror
* [RFC PATCH 0/4]  Add support for DRM cgroup memory accounting.
@ 2023-05-03  8:34 Maarten Lankhorst
  2023-05-03  8:34 ` [RFC PATCH 1/4] cgroup: Add the DRM cgroup controller Maarten Lankhorst
                   ` (4 more replies)
  0 siblings, 5 replies; 16+ messages in thread
From: Maarten Lankhorst @ 2023-05-03  8:34 UTC (permalink / raw)
  To: dri-devel, cgroups, intel-xe
  Cc: Tvrtko Ursulin, Thomas Zimmermann, intel-gfx, linux-kernel,
	amd-gfx, Zefan Li, Johannes Weiner, Tejun Heo

RFC as I'm looking for comments.

For long running compute, it can be beneficial to partition the GPU memory
between cgroups, so each cgroup can use its maximum amount of memory without
interfering with other scheduled jobs. Done properly, this can alleviate the
need for eviction, which might result in a job being terminated if the GPU
doesn't support mid-thread preemption or recoverable page faults.

This is done by adding a bunch of knobs to cgroup:
drm.capacity: Shows maximum capacity of each resource region.
drm.max: Display or limit max amount of memory.
drm.current: Current amount of memory in use.

TTM has not been made cgroup aware yet, so instead of evicting from
the current cgroup to stay within the cgroup limits, it simply returns
the error -ENOSPC to userspace.

I've used Tvrtko's cgroup controller series as a base, but it implemented
scheduling weight, not memory accounting, so I only ended up keeping the
base patch.

Xe is not upstream yet, so the driver specific patch will only apply on
https://gitlab.freedesktop.org/drm/xe/kernel

Maarten Lankhorst (3):
  drm/cgroup: Add memory accounting to DRM cgroup
  drm/ttm: Handle -EAGAIN in ttm_resource_alloc as -ENOSPC.
  drm/xe: Add support for the drm cgroup

Tvrtko Ursulin (1):
  cgroup: Add the DRM cgroup controller

 Documentation/admin-guide/cgroup-v2.rst    |  46 ++
 Documentation/gpu/drm-compute.rst          |  54 ++
 drivers/gpu/drm/ttm/ttm_bo.c               |   4 +-
 drivers/gpu/drm/xe/xe_device.c             |   4 +
 drivers/gpu/drm/xe/xe_device_types.h       |   4 +
 drivers/gpu/drm/xe/xe_ttm_vram_mgr.c       |  21 +-
 drivers/gpu/drm/xe/xe_ttm_vram_mgr_types.h |   5 +
 include/linux/cgroup_drm.h                 |  90 ++++
 include/linux/cgroup_subsys.h              |   4 +
 init/Kconfig                               |   7 +
 kernel/cgroup/Makefile                     |   1 +
 kernel/cgroup/drm.c                        | 557 +++++++++++++++++++++
 12 files changed, 794 insertions(+), 3 deletions(-)
 create mode 100644 Documentation/gpu/drm-compute.rst
 create mode 100644 include/linux/cgroup_drm.h
 create mode 100644 kernel/cgroup/drm.c

-- 
2.34.1


^ permalink raw reply	[flat|nested] 16+ messages in thread

* [RFC PATCH 1/4] cgroup: Add the DRM cgroup controller
  2023-05-03  8:34 [RFC PATCH 0/4] Add support for DRM cgroup memory accounting Maarten Lankhorst
@ 2023-05-03  8:34 ` Maarten Lankhorst
  2023-05-03  8:34 ` [RFC PATCH 2/4] drm/cgroup: Add memory accounting to DRM cgroup Maarten Lankhorst
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 16+ messages in thread
From: Maarten Lankhorst @ 2023-05-03  8:34 UTC (permalink / raw)
  To: dri-devel, cgroups, intel-xe
  Cc: Tvrtko Ursulin, Thomas Zimmermann, intel-gfx, linux-kernel,
	amd-gfx, Zefan Li, Johannes Weiner, Tejun Heo

From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>

Skeleton controller without any functionality.

Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Signed-off-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
---
 include/linux/cgroup_drm.h    |  9 ++++++
 include/linux/cgroup_subsys.h |  4 +++
 init/Kconfig                  |  7 ++++
 kernel/cgroup/Makefile        |  1 +
 kernel/cgroup/drm.c           | 60 +++++++++++++++++++++++++++++++++++
 5 files changed, 81 insertions(+)
 create mode 100644 include/linux/cgroup_drm.h
 create mode 100644 kernel/cgroup/drm.c

diff --git a/include/linux/cgroup_drm.h b/include/linux/cgroup_drm.h
new file mode 100644
index 000000000000..8ef66a47619f
--- /dev/null
+++ b/include/linux/cgroup_drm.h
@@ -0,0 +1,9 @@
+/* SPDX-License-Identifier: MIT */
+/*
+ * Copyright © 2023 Intel Corporation
+ */
+
+#ifndef _CGROUP_DRM_H
+#define _CGROUP_DRM_H
+
+#endif	/* _CGROUP_DRM_H */
diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h
index 445235487230..49460494a010 100644
--- a/include/linux/cgroup_subsys.h
+++ b/include/linux/cgroup_subsys.h
@@ -65,6 +65,10 @@ SUBSYS(rdma)
 SUBSYS(misc)
 #endif
 
+#if IS_ENABLED(CONFIG_CGROUP_DRM)
+SUBSYS(drm)
+#endif
+
 /*
  * The following subsystems are not supported on the default hierarchy.
  */
diff --git a/init/Kconfig b/init/Kconfig
index 1fb5f313d18f..1679229143c0 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1093,6 +1093,13 @@ config CGROUP_RDMA
 	  Attaching processes with active RDMA resources to the cgroup
 	  hierarchy is allowed even if can cross the hierarchy's limit.
 
+config CGROUP_DRM
+	bool "DRM controller"
+	help
+	  Provides the DRM subsystem controller.
+
+	  ...
+
 config CGROUP_FREEZER
 	bool "Freezer controller"
 	help
diff --git a/kernel/cgroup/Makefile b/kernel/cgroup/Makefile
index 12f8457ad1f9..849bd2917477 100644
--- a/kernel/cgroup/Makefile
+++ b/kernel/cgroup/Makefile
@@ -6,4 +6,5 @@ obj-$(CONFIG_CGROUP_PIDS) += pids.o
 obj-$(CONFIG_CGROUP_RDMA) += rdma.o
 obj-$(CONFIG_CPUSETS) += cpuset.o
 obj-$(CONFIG_CGROUP_MISC) += misc.o
+obj-$(CONFIG_CGROUP_DRM) += drm.o
 obj-$(CONFIG_CGROUP_DEBUG) += debug.o
diff --git a/kernel/cgroup/drm.c b/kernel/cgroup/drm.c
new file mode 100644
index 000000000000..02c8eaa633d3
--- /dev/null
+++ b/kernel/cgroup/drm.c
@@ -0,0 +1,60 @@
+/* SPDX-License-Identifier: MIT */
+/*
+ * Copyright © 2023 Intel Corporation
+ */
+
+#include <linux/cgroup.h>
+#include <linux/cgroup_drm.h>
+#include <linux/slab.h>
+
+struct drm_cgroup_state {
+	struct cgroup_subsys_state css;
+};
+
+struct drm_root_cgroup_state {
+	struct drm_cgroup_state drmcs;
+};
+
+static struct drm_root_cgroup_state root_drmcs;
+
+static inline struct drm_cgroup_state *
+css_to_drmcs(struct cgroup_subsys_state *css)
+{
+	return container_of(css, struct drm_cgroup_state, css);
+}
+
+static void drmcs_free(struct cgroup_subsys_state *css)
+{
+	struct drm_cgroup_state *drmcs = css_to_drmcs(css);
+
+	if (drmcs != &root_drmcs.drmcs)
+		kfree(drmcs);
+}
+
+static struct cgroup_subsys_state *
+drmcs_alloc(struct cgroup_subsys_state *parent_css)
+{
+	struct drm_cgroup_state *drmcs;
+
+	if (!parent_css) {
+		drmcs = &root_drmcs.drmcs;
+	} else {
+		drmcs = kzalloc(sizeof(*drmcs), GFP_KERNEL);
+		if (!drmcs)
+			return ERR_PTR(-ENOMEM);
+	}
+
+	return &drmcs->css;
+}
+
+struct cftype files[] = {
+	{ } /* Zero entry terminates. */
+};
+
+struct cgroup_subsys drm_cgrp_subsys = {
+	.css_alloc	= drmcs_alloc,
+	.css_free	= drmcs_free,
+	.early_init	= false,
+	.legacy_cftypes	= files,
+	.dfl_cftypes	= files,
+};
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [RFC PATCH 2/4] drm/cgroup: Add memory accounting to DRM cgroup
  2023-05-03  8:34 [RFC PATCH 0/4] Add support for DRM cgroup memory accounting Maarten Lankhorst
  2023-05-03  8:34 ` [RFC PATCH 1/4] cgroup: Add the DRM cgroup controller Maarten Lankhorst
@ 2023-05-03  8:34 ` Maarten Lankhorst
  2023-05-03 15:31   ` [Intel-gfx] " Tvrtko Ursulin
  2023-05-03  8:34 ` [RFC PATCH 3/4] drm/ttm: Handle -EAGAIN in ttm_resource_alloc as -ENOSPC Maarten Lankhorst
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 16+ messages in thread
From: Maarten Lankhorst @ 2023-05-03  8:34 UTC (permalink / raw)
  To: dri-devel, cgroups, intel-xe
  Cc: Tvrtko Ursulin, Thomas Zimmermann, intel-gfx, linux-kernel,
	amd-gfx, Zefan Li, Johannes Weiner, Tejun Heo

Based roughly on the rdma and misc cgroup controllers, with a lot of
the accounting code borrowed from rdma.

The interface is simple:
- Populate drmcgroup_device->regions[..] with a name and size for each
  active region.
- Call drm(m)cg_register_device().
- Use drmcg_try_charge() to check whether you can allocate a chunk of
  memory, and drmcg_uncharge() when freeing it. drmcg_try_charge() may
  return an error code, or -EAGAIN when the cgroup limit is reached.
  A minimal usage sketch follows below.
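
A minimal usage sketch, assuming a hypothetical driver "foo" with a single
"vram" region; everything except the drmcg_*/drmmcg_* calls (foo_device,
foo_bo, the vram allocator) is made up for illustration:

  /* Hypothetical driver "foo"; only the drmcg_* API is from this series. */
  static int foo_cgroup_init(struct foo_device *foo)
  {
	struct drmcgroup_device *cg = &foo->cg;

	/* Describe one region; index 0 is also used for (un)charging below. */
	cg->regions[0].name = "vram";
	cg->regions[0].size = foo->vram_size;

	/* Register and hook up automatic unregistration on device release. */
	return drmmcg_register_device(&foo->drm, cg);
  }

  static int foo_vram_alloc(struct foo_device *foo, struct foo_bo *bo, u64 size)
  {
	int ret;

	/* Charge the current task's cgroup; -EAGAIN if a limit is reached.
	 * bo->cg is assumed to be a struct drmcgroup_state pointer. */
	ret = drmcg_try_charge(&bo->cg, &foo->cg, 0, size);
	if (ret)
		return ret;

	/* ... do the actual allocation here, uncharging again on failure ... */
	return 0;
  }

  static void foo_vram_free(struct foo_device *foo, struct foo_bo *bo, u64 size)
  {
	drmcg_uncharge(bo->cg, &foo->cg, 0, size);
  }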

The ttm code transforms -EAGAIN back to -ENOSPC since it has specific
logic for -ENOSPC, and returning -EAGAIN to userspace causes drmIoctl
to restart infinitely.

This API allows memory regions to be limited per cgroup.
The supported cards can be seen in /sys/fs/cgroup/drm.capacity.
After echoing +drm to cgroup.subtree_control, memory can be
partitioned between cgroups.

In each cgroup subdir:
drm.max shows the current limits of the cgroup.
drm.current shows the current amount of memory allocated by this cgroup.
drm.events shows the number of times the max memory limit was reached.

Signed-off-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
---
 Documentation/admin-guide/cgroup-v2.rst |  46 ++
 Documentation/gpu/drm-compute.rst       |  54 +++
 include/linux/cgroup_drm.h              |  81 ++++
 kernel/cgroup/drm.c                     | 539 +++++++++++++++++++++++-
 4 files changed, 699 insertions(+), 21 deletions(-)
 create mode 100644 Documentation/gpu/drm-compute.rst

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index f67c0829350b..b858d99cb2ef 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -2374,6 +2374,52 @@ RDMA Interface Files
 	  mlx4_0 hca_handle=1 hca_object=20
 	  ocrdma1 hca_handle=1 hca_object=23
 
+DRM
+----
+
+The "drm" controller regulates the distribution and accounting of
+DRM resources.
+
+DRM Interface Files
+~~~~~~~~~~~~~~~~~~~~
+
+  drm.max
+	A read-write nested-keyed file that exists for all cgroups
+	except the root. It describes the currently configured resource
+	limits for a DRM device.
+
+	Lines are keyed by device name and are not ordered.
+	Each line contains space separated resource names and their
+	configured limits that can be distributed.
+
+	The following nested keys are defined.
+
+	  ==========	=======================================================
+	  region.* 	Maximum number of bytes allocatable in this region
+	  ==========	=======================================================
+
+	An example for xe follows::
+
+	  0000:03:00.0 region.vram0=1073741824 region.stolen=max
+
+  drm.capacity
+	A read-only file that describes maximum region capacity.
+	It only exists on the root cgroup. Not all memory can be
+	allocated by cgroups, as the kernel reserves some for
+	internal use.
+
+	An example for xe follows::
+
+	  0000:03:00.0 region.vram0=8514437120 region.stolen=67108864
+
+  drm.current
+	A read-only file that describes current resource usage.
+	It exists for all cgroups except the root.
+
+	An example for xe follows::
+
+	  0000:03:00.0 region.vram0=12550144 region.stolen=8650752
+
 HugeTLB
 -------
 
diff --git a/Documentation/gpu/drm-compute.rst b/Documentation/gpu/drm-compute.rst
new file mode 100644
index 000000000000..116270976ef7
--- /dev/null
+++ b/Documentation/gpu/drm-compute.rst
@@ -0,0 +1,54 @@
+==================================
+Long running workloads and compute
+==================================
+
+Long running workloads (compute) are workloads that will not complete in 10
+seconds (roughly the time a user will wait before reaching for the power
+button). This means such workloads need to be managed with techniques other
+than dma-fences.
+
+Some hardware may schedule compute jobs with no way to preempt them or to
+swap out their memory. Or users may simply want their workload not to be
+preempted or swapped out at all.
+
+This means that it differs from what is described in driver-api/dma-buf.rst.
+
+As with normal compute jobs, dma-fence may not be used at all, in this case
+not even to force preemption. The driver is then simply forced to unmap a BO
+from the long-running compute job's address space immediately on unbind, not
+even waiting for the workload to complete. Effectively this terminates the
+workload when there is no hardware support to recover.
+
+Since this is undesirable, there need to be mitigations to prevent a workload
+from being terminated. There are several possible approaches, all with their
+advantages and drawbacks.
+
+The first approach you will likely try is to pin all buffers used by compute.
+This guarantees that the job will run uninterrupted, but also allows a trivial
+denial of service attack by pinning as much memory as possible, hogging all
+GPU memory, and possibly a huge chunk of CPU memory.
+
+A second approach that works slightly better on its own is adding an option
+not to evict when creating a new job (of any kind). If all of userspace opts
+in to this flag, it would prevent cooperating userspace from force-terminating
+older compute jobs to start a new one.
+
+If job preemption and recoverable pagefaults are not available, those are the
+only approaches possible. So even with those, you want a separate way of
+controlling resources. The standard kernel way of doing so is cgroups.
+
+This creates a third option, using cgroups to prevent eviction. Both GPU and
+driver-allocated CPU memory would be accounted to the correct cgroup, and
+eviction would be made cgroup aware. This allows the GPU to be partitioned
+into cgroups, which in turn allows jobs to run next to each other without
+interference.
+
+The interface to the cgroup would be similar to the current CPU memory
+interface, with similar semantics for min/low/high/max, if eviction can
+be made cgroup aware. For now only max is implemented.
+
+What should be noted is that each memory region (tiled memory for example)
+should have its own accounting, using the format $card key0=value0 key1=value1.
+
+The key is set to the regionid set by the driver, for example "tile0".
+For the value of $card, we use drmGetUnique().
diff --git a/include/linux/cgroup_drm.h b/include/linux/cgroup_drm.h
index 8ef66a47619f..4f17b1c85f47 100644
--- a/include/linux/cgroup_drm.h
+++ b/include/linux/cgroup_drm.h
@@ -6,4 +6,85 @@
 #ifndef _CGROUP_DRM_H
 #define _CGROUP_DRM_H
 
+#include <linux/types.h>
+
+#include <drm/drm_managed.h>
+
+struct drm_device;
+struct drm_file;
+
+struct drmcgroup_state;
+
+/*
+ * Use 8 as max, because of N^2 lookup when setting things, can be bumped if needed
+ * Identical to TTM_NUM_MEM_TYPES to allow simplifying that code.
+ */
+#define DRMCG_MAX_REGIONS 8
+
+struct drmcgroup_device {
+	struct list_head list;
+	struct list_head pools;
+
+	struct {
+		u64 size;
+		const char *name;
+	} regions[DRMCG_MAX_REGIONS];
+
+	/* Name describing the card, set by drmcg_register_device */
+	const char *name;
+
+};
+
+#if IS_ENABLED(CONFIG_CGROUP_DRM)
+int drmcg_register_device(struct drm_device *dev,
+			   struct drmcgroup_device *drm_cg);
+void drmcg_unregister_device(struct drmcgroup_device *cgdev);
+int drmcg_try_charge(struct drmcgroup_state **drmcg,
+		     struct drmcgroup_device *cgdev,
+		     u32 index, u64 size);
+void drmcg_uncharge(struct drmcgroup_state *drmcg,
+		    struct drmcgroup_device *cgdev,
+		    u32 index, u64 size);
+#else
+static inline int
+drmcg_register_device(struct drm_device *dev,
+		      struct drm_cgroup *drm_cg)
+{
+	return 0;
+}
+
+static inline void drmcg_unregister_device(struct drmcgroup_device *cgdev)
+{
+}
+
+static inline int drmcg_try_charge(struct drmcgroup_state **drmcg,
+				   struct drmcgroup_device *cgdev,
+				   u32 index, u64 size)
+{
+	*drmcg = NULL;
+	return 0;
+}
+
+static inline void drmcg_uncharge(struct drmcgroup_state *drmcg,
+				  struct drmcgroup_device *cgdev,
+				  u32 index, u64 size)
+{ }
+#endif
+
+static inline void drmmcg_unregister_device(struct drm_device *dev, void *arg)
+{
+	drmcg_unregister_device(arg);
+}
+
+/*
+ * This needs to be done as inline, because cgroup lives in the core
+ * kernel and it cannot call drm calls directly
+ */
+static inline int drmmcg_register_device(struct drm_device *dev,
+					 struct drmcgroup_device *cgdev)
+{
+	return drmcg_register_device(dev, cgdev) ?:
+		drmm_add_action_or_reset(dev, drmmcg_unregister_device, cgdev);
+}
+
 #endif	/* _CGROUP_DRM_H */
diff --git a/kernel/cgroup/drm.c b/kernel/cgroup/drm.c
index 02c8eaa633d3..a93d9344fd36 100644
--- a/kernel/cgroup/drm.c
+++ b/kernel/cgroup/drm.c
@@ -1,60 +1,557 @@
-/* SPDX-License-Identifier: MIT */
+// SPDX-License-Identifier: GPL-2.0
 /*
- * Copyright © 2023 Intel Corporation
+ * Copyright 2023 Intel
+ * Partially based on the rdma and misc controllers, which bear the following copyrights:
+ *
+ * Copyright 2020 Google LLC
+ * Copyright (C) 2016 Parav Pandit <pandit.parav@gmail.com>
  */
 
 #include <linux/cgroup.h>
 #include <linux/cgroup_drm.h>
+#include <linux/list.h>
+#include <linux/mutex.h>
+#include <linux/parser.h>
 #include <linux/slab.h>
 
-struct drm_cgroup_state {
+#include <drm/drm_device.h>
+#include <drm/drm_drv.h>
+#include <drm/drm_file.h>
+#include <drm/drm_managed.h>
+
+struct drmcgroup_state {
 	struct cgroup_subsys_state css;
+
+	struct list_head pools;
 };
 
-struct drm_root_cgroup_state {
-	struct drm_cgroup_state drmcs;
+struct drmcgroup_pool_state {
+	struct drmcgroup_device *device;
+	struct drmcgroup_resource {
+		s64 max, used;
+	} resources[DRMCG_MAX_REGIONS];
+
+	s64 usage_sum;
+
+	struct list_head	cg_node;
+	struct list_head	dev_node;
 };
 
-static struct drm_root_cgroup_state root_drmcs;
+static DEFINE_MUTEX(drmcg_mutex);
+static LIST_HEAD(drmcg_devices);
 
-static inline struct drm_cgroup_state *
+static inline struct drmcgroup_state *
 css_to_drmcs(struct cgroup_subsys_state *css)
 {
-	return container_of(css, struct drm_cgroup_state, css);
+	return container_of(css, struct drmcgroup_state, css);
+}
+
+static inline struct drmcgroup_state *get_current_drmcg(void)
+{
+	return css_to_drmcs(task_get_css(current, drm_cgrp_id));
+}
+
+static struct drmcgroup_state *parent_drmcg(struct drmcgroup_state *cg)
+{
+	return css_to_drmcs(cg->css.parent);
+}
+
+static void free_cg_pool_locked(struct drmcgroup_pool_state *pool)
+{
+	lockdep_assert_held(&drmcg_mutex);
+
+	list_del(&pool->cg_node);
+	list_del(&pool->dev_node);
+	kfree(pool);
+}
+
+static void
+set_resource_max(struct drmcgroup_pool_state *pool, int i, u64 new_max)
+{
+	pool->resources[i].max = new_max;
+}
+
+static void set_all_resource_max_limit(struct drmcgroup_pool_state *rpool)
+{
+	int i;
+
+	for (i = 0; i < DRMCG_MAX_REGIONS; i++)
+		set_resource_max(rpool, i, S64_MAX);
+}
+
+static void drmcs_offline(struct cgroup_subsys_state *css)
+{
+	struct drmcgroup_state *drmcs = css_to_drmcs(css);
+	struct drmcgroup_pool_state *pool, *next;
+
+	mutex_lock(&drmcg_mutex);
+	list_for_each_entry_safe(pool, next, &drmcs->pools, cg_node) {
+		if (!pool->usage_sum) {
+			free_cg_pool_locked(pool);
+		} else {
+			/* Reset all regions, last uncharge will remove pool */
+			set_all_resource_max_limit(pool);
+		}
+	}
+	mutex_unlock(&drmcg_mutex);
 }
 
 static void drmcs_free(struct cgroup_subsys_state *css)
 {
-	struct drm_cgroup_state *drmcs = css_to_drmcs(css);
+	struct drmcgroup_state *drmcs = css_to_drmcs(css);
 
-	if (drmcs != &root_drmcs.drmcs)
-		kfree(drmcs);
+	kfree(drmcs);
 }
 
 static struct cgroup_subsys_state *
 drmcs_alloc(struct cgroup_subsys_state *parent_css)
 {
-	struct drm_cgroup_state *drmcs;
+	struct drmcgroup_state *drmcs = kzalloc(sizeof(*drmcs), GFP_KERNEL);
+	if (!drmcs)
+		return ERR_PTR(-ENOMEM);
+
+	INIT_LIST_HEAD(&drmcs->pools);
+	return &drmcs->css;
+}
+
+static struct drmcgroup_pool_state *
+find_cg_pool_locked(struct drmcgroup_state *drmcs, struct drmcgroup_device *dev)
+{
+	struct drmcgroup_pool_state *pool;
+
+	list_for_each_entry(pool, &drmcs->pools, cg_node)
+		if (pool->device == dev)
+			return pool;
+
+	return NULL;
+}
+
+static struct drmcgroup_pool_state *
+get_cg_pool_locked(struct drmcgroup_state *drmcs, struct drmcgroup_device *dev)
+{
+	struct drmcgroup_pool_state *pool;
+
+	pool = find_cg_pool_locked(drmcs, dev);
+	if (pool)
+		return pool;
+
+	pool = kzalloc(sizeof(*pool), GFP_KERNEL);
+	if (!pool)
+		return ERR_PTR(-ENOMEM);
+
+	pool->device = dev;
+	set_all_resource_max_limit(pool);
 
-	if (!parent_css) {
-		drmcs = &root_drmcs.drmcs;
-	} else {
-		drmcs = kzalloc(sizeof(*drmcs), GFP_KERNEL);
-		if (!drmcs)
-			return ERR_PTR(-ENOMEM);
+	INIT_LIST_HEAD(&pool->cg_node);
+	INIT_LIST_HEAD(&pool->dev_node);
+	list_add_tail(&pool->cg_node, &drmcs->pools);
+	list_add_tail(&pool->dev_node, &dev->pools);
+	return pool;
+}
+
+void drmcg_unregister_device(struct drmcgroup_device *cgdev)
+{
+	struct drmcgroup_pool_state *pool, *next;
+
+	mutex_lock(&drmcg_mutex);
+	list_del(&cgdev->list);
+
+	list_for_each_entry_safe(pool, next, &cgdev->pools, dev_node)
+		free_cg_pool_locked(pool);
+	mutex_unlock(&drmcg_mutex);
+	kfree(cgdev->name);
+}
+
+EXPORT_SYMBOL_GPL(drmcg_unregister_device);
+
+int drmcg_register_device(struct drm_device *dev,
+			  struct drmcgroup_device *cgdev)
+{
+	char *name = kstrdup(dev->unique, GFP_KERNEL);
+	if (!name)
+		return -ENOMEM;
+
+	INIT_LIST_HEAD(&cgdev->pools);
+	mutex_lock(&drmcg_mutex);
+	cgdev->name = name;
+	list_add_tail(&cgdev->list, &drmcg_devices);
+	mutex_unlock(&drmcg_mutex);
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(drmcg_register_device);
+
+static int drmcg_max_show(struct seq_file *sf, void *v)
+{
+	struct drmcgroup_state *drmcs = css_to_drmcs(seq_css(sf));
+	struct drmcgroup_pool_state *pool;
+
+	mutex_lock(&drmcg_mutex);
+	list_for_each_entry(pool, &drmcs->pools, cg_node) {
+		struct drmcgroup_device *dev = pool->device;
+		int i;
+
+		seq_puts(sf, dev->name);
+
+		for (i = 0; i < DRMCG_MAX_REGIONS; i++) {
+			if (!dev->regions[i].name)
+				continue;
+
+			if (pool->resources[i].max < S64_MAX)
+				seq_printf(sf, " region.%s=%lld", dev->regions[i].name,
+					   pool->resources[i].max);
+			else
+				seq_printf(sf, " region.%s=max", dev->regions[i].name);
+		}
+
+		seq_putc(sf, '\n');
 	}
+	mutex_unlock(&drmcg_mutex);
 
-	return &drmcs->css;
+	return 0;
+}
+
+static struct drmcgroup_device *drmcg_get_device_locked(const char *name)
+{
+	struct drmcgroup_device *dev;
+
+	lockdep_assert_held(&drmcg_mutex);
+
+	list_for_each_entry(dev, &drmcg_devices, list)
+		if (!strcmp(name, dev->name))
+			return dev;
+
+	return NULL;
+}
+
+static void try_to_free_cg_pool_locked(struct drmcgroup_pool_state *pool)
+{
+	struct drmcgroup_device *dev = pool->device;
+	u32 i;
+
+	/* Memory charged to this pool */
+	if (pool->usage_sum)
+		return;
+
+	for (i = 0; i < DRMCG_MAX_REGIONS; i++) {
+		if (!dev->regions[i].name)
+			continue;
+
+		/* Is a specific limit set? */
+		if (pool->resources[i].max < S64_MAX)
+			return;
+	}
+
+	/*
+	 * No user of the pool and all entries are set to defaults;
+	 * safe to delete this pool.
+	 */
+	free_cg_pool_locked(pool);
+}
+
+
+static void
+uncharge_cg_locked(struct drmcgroup_state *drmcs,
+		   struct drmcgroup_device *cgdev,
+		   u32 index, u64 size)
+{
+	struct drmcgroup_pool_state *pool;
+
+	pool = find_cg_pool_locked(drmcs, cgdev);
+
+	if (unlikely(!pool)) {
+		pr_warn("Invalid device %p or drm cgroup %p\n", cgdev, drmcs);
+		return;
+	}
+
+	pool->resources[index].used -= size;
+
+	/*
+	 * A negative count (or overflow) is invalid,
+	 * it indicates a bug in the drm cgroup controller.
+	 */
+	WARN_ON_ONCE(pool->resources[index].used < 0);
+	pool->usage_sum--;
+	try_to_free_cg_pool_locked(pool);
+}
+
+static void drmcg_uncharge_hierarchy(struct drmcgroup_state *drmcs,
+				     struct drmcgroup_device *cgdev,
+				     struct drmcgroup_state *stop_cg,
+				     u32 index, u64 size)
+{
+	struct drmcgroup_state *p;
+
+	mutex_lock(&drmcg_mutex);
+
+	for (p = drmcs; p != stop_cg; p = parent_drmcg(p))
+		uncharge_cg_locked(p, cgdev, index, size);
+
+	mutex_unlock(&drmcg_mutex);
+
+	css_put(&drmcs->css);
+}
+
+void drmcg_uncharge(struct drmcgroup_state *drmcs,
+		    struct drmcgroup_device *cgdev,
+		    u32 index,
+		    u64 size)
+{
+	if (index >= DRMCG_MAX_REGIONS)
+		return;
+
+	drmcg_uncharge_hierarchy(drmcs, cgdev, NULL, index, size);
+}
+EXPORT_SYMBOL_GPL(drmcg_uncharge);
+
+int drmcg_try_charge(struct drmcgroup_state **drmcs,
+		     struct drmcgroup_device *cgdev,
+		     u32 index,
+		     u64 size)
+{
+	struct drmcgroup_state *cg, *p;
+	struct drmcgroup_pool_state *pool;
+	u64 new;
+	int ret = 0;
+
+	if (index >= DRMCG_MAX_REGIONS)
+		return -EINVAL;
+
+	/*
+	 * hold on to css, as cgroup can be removed but resource
+	 * accounting happens on css.
+	 */
+	cg = get_current_drmcg();
+
+	mutex_lock(&drmcg_mutex);
+	for (p = cg; p; p = parent_drmcg(p)) {
+		pool = get_cg_pool_locked(p, cgdev);
+		if (IS_ERR(pool)) {
+			ret = PTR_ERR(pool);
+			goto err;
+		} else {
+			new = pool->resources[index].used + size;
+			if (new > pool->resources[index].max || new > S64_MAX) {
+				ret = -EAGAIN;
+				goto err;
+			} else {
+				pool->resources[index].used = new;
+				pool->usage_sum++;
+			}
+		}
+	}
+	mutex_unlock(&drmcg_mutex);
+
+	*drmcs = cg;
+	return 0;
+
+err:
+	mutex_unlock(&drmcg_mutex);
+	drmcg_uncharge_hierarchy(cg, cgdev, p, index, size);
+	return ret;
+}
+EXPORT_SYMBOL_GPL(drmcg_try_charge);
+
+static s64 parse_resource(char *c, char **retname)
+{
+	substring_t argstr;
+	char *name, *value = c;
+	size_t len;
+	int ret;
+	u64 retval;
+
+	name = strsep(&value, "=");
+	if (!name || !value)
+		return -EINVAL;
+
+	/* Only support region setting for now */
+	if (strncmp(name, "region.", 7))
+		return -EINVAL;
+	else
+		name += 7;
+
+	*retname = name;
+	len = strlen(value);
+
+	argstr.from = value;
+	argstr.to = value + len;
+
+	ret = match_u64(&argstr, &retval);
+	if (ret >= 0) {
+		if (retval > S64_MAX)
+			return -EINVAL;
+		return retval;
+	}
+	if (!strncmp(value, "max", len))
+		return S64_MAX;
+
+	/* Not u64 or max, error */
+	return -EINVAL;
+}
+
+static int drmcg_parse_limits(char *options,
+			      u64 *limits, char **enables)
+{
+	char *c;
+	int num_limits = 0;
+
+	/* parse resource options */
+	while ((c = strsep(&options, " ")) != NULL) {
+		s64 limit;
+
+		if (num_limits >= DRMCG_MAX_REGIONS)
+			return -EINVAL;
+
+		limit = parse_resource(c, &enables[num_limits]);
+		if (limit < 0)
+			return limit;
+
+		limits[num_limits++] = limit;
+	}
+	return num_limits;
+}
+
+static ssize_t drmcg_max_write(struct kernfs_open_file *of,
+			       char *buf, size_t nbytes, loff_t off)
+{
+	struct drmcgroup_state *drmcs = css_to_drmcs(of_css(of));
+	struct drmcgroup_device *dev;
+	struct drmcgroup_pool_state *pool;
+	char *options = strstrip(buf);
+	char *dev_name = strsep(&options, " ");
+	u64 limits[DRMCG_MAX_REGIONS];
+	u64 new_limits[DRMCG_MAX_REGIONS];
+	char *regions[DRMCG_MAX_REGIONS];
+	int num_limits, i;
+	unsigned long set_mask = 0;
+	int err = 0;
+
+	if (!dev_name)
+		return -EINVAL;
+
+	num_limits = drmcg_parse_limits(options, limits, regions);
+	if (num_limits < 0)
+		return num_limits;
+	if (!num_limits)
+		return -EINVAL;
+
+	/*
+	 * Everything is parsed into key=value pairs now, take lock and attempt to update
+	 * For good measure, set -EINVAL when a key is set twice.
+	 */
+	mutex_lock(&drmcg_mutex);
+
+	dev = drmcg_get_device_locked(dev_name);
+	if (!dev) {
+		err = -ENODEV;
+		goto err;
+	}
+
+	pool = get_cg_pool_locked(drmcs, dev);
+	if (IS_ERR(pool)) {
+		err = PTR_ERR(pool);
+		goto err;
+	}
+
+	/* Lookup region names and set new_limits to the index */
+	for (i = 0; i < num_limits; i++) {
+		int j;
+
+		for (j = 0; j < DRMCG_MAX_REGIONS; j++)
+			if (dev->regions[j].name &&
+			    !strcmp(regions[i], dev->regions[j].name))
+				break;
+
+		if (j == DRMCG_MAX_REGIONS ||
+		    set_mask & BIT(j)) {
+			err = -EINVAL;
+			goto err_put;
+		}
+
+		set_mask |= BIT(j);
+		new_limits[j] = limits[i];
+	}
+
+	/* And commit */
+	for_each_set_bit(i, &set_mask, DRMCG_MAX_REGIONS)
+		set_resource_max(pool, i, new_limits[i]);
+
+err_put:
+	try_to_free_cg_pool_locked(pool);
+err:
+	mutex_unlock(&drmcg_mutex);
+
+	return err ?: nbytes;
+}
+
+static int drmcg_current_show(struct seq_file *sf, void *v)
+{
+	struct drmcgroup_state *drmcs = css_to_drmcs(seq_css(sf));
+	struct drmcgroup_device *dev;
+
+	mutex_lock(&drmcg_mutex);
+	list_for_each_entry(dev, &drmcg_devices, list) {
+		struct drmcgroup_pool_state *pool = find_cg_pool_locked(drmcs, dev);
+		int i;
+
+		seq_puts(sf, dev->name);
+
+		for (i = 0; i < DRMCG_MAX_REGIONS; i++) {
+			if (!dev->regions[i].name)
+				continue;
+
+			seq_printf(sf, " region.%s=%lld", dev->regions[i].name,
+				   pool ? pool->resources[i].used : 0ULL);
+		}
+
+		seq_putc(sf, '\n');
+	}
+	mutex_unlock(&drmcg_mutex);
+
+	return 0;
+}
+
+static int drmcg_capacity_show(struct seq_file *sf, void *v)
+{
+	struct drmcgroup_device *dev;
+	int i;
+
+	list_for_each_entry(dev, &drmcg_devices, list) {
+		seq_puts(sf, dev->name);
+		for (i = 0; i < DRMCG_MAX_REGIONS; i++)
+			if (dev->regions[i].name)
+				seq_printf(sf, " region.%s=%lld",
+					   dev->regions[i].name,
+					   dev->regions[i].size);
+		seq_putc(sf, '\n');
+	}
+	return 0;
 }
 
-struct cftype files[] = {
+static struct cftype files[] = {
+	{
+		.name = "max",
+		.write = drmcg_max_write,
+		.seq_show = drmcg_max_show,
+		.flags = CFTYPE_NOT_ON_ROOT,
+	},
+	{
+		.name = "current",
+		.seq_show = drmcg_current_show,
+		.flags = CFTYPE_NOT_ON_ROOT,
+	},
+	{
+		.name = "capacity",
+		.seq_show = drmcg_capacity_show,
+		.flags = CFTYPE_ONLY_ON_ROOT,
+	},
 	{ } /* Zero entry terminates. */
 };
 
 struct cgroup_subsys drm_cgrp_subsys = {
 	.css_alloc	= drmcs_alloc,
 	.css_free	= drmcs_free,
-	.early_init	= false,
+	.css_offline	= drmcs_offline,
 	.legacy_cftypes	= files,
 	.dfl_cftypes	= files,
 };
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [RFC PATCH 3/4] drm/ttm: Handle -EAGAIN in ttm_resource_alloc as -ENOSPC.
  2023-05-03  8:34 [RFC PATCH 0/4] Add support for DRM cgroup memory accounting Maarten Lankhorst
  2023-05-03  8:34 ` [RFC PATCH 1/4] cgroup: Add the DRM cgroup controller Maarten Lankhorst
  2023-05-03  8:34 ` [RFC PATCH 2/4] drm/cgroup: Add memory accounting to DRM cgroup Maarten Lankhorst
@ 2023-05-03  8:34 ` Maarten Lankhorst
  2023-05-03  9:11   ` [Intel-xe] " Thomas Hellström
  2023-05-03  8:35 ` [RFC PATCH 4/4] drm/xe: Add support for the drm cgroup Maarten Lankhorst
  2023-05-05 19:50 ` [RFC PATCH 0/4] Add support for DRM cgroup memory accounting Tejun Heo
  4 siblings, 1 reply; 16+ messages in thread
From: Maarten Lankhorst @ 2023-05-03  8:34 UTC (permalink / raw)
  To: dri-devel, cgroups, intel-xe
  Cc: Tvrtko Ursulin, Thomas Zimmermann, intel-gfx, linux-kernel,
	amd-gfx, Zefan Li, Johannes Weiner, Tejun Heo

This allows the drm cgroup controller to signal that no space is available.

XXX: This is a hopeless simplification that changes behavior, and
returns -ENOSPC even if we could evict ourselves from the current
cgroup.

Ideally, the eviction code becomes cgroup aware, and will force eviction
from the current cgroup or its parents.

Signed-off-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
---
 drivers/gpu/drm/ttm/ttm_bo.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/ttm/ttm_bo.c b/drivers/gpu/drm/ttm/ttm_bo.c
index bd5dae4d1624..e057d5d8f09a 100644
--- a/drivers/gpu/drm/ttm/ttm_bo.c
+++ b/drivers/gpu/drm/ttm/ttm_bo.c
@@ -731,6 +731,8 @@ static int ttm_bo_mem_force_space(struct ttm_buffer_object *bo,
 		ret = ttm_resource_alloc(bo, place, mem);
 		if (likely(!ret))
 			break;
+		if (ret == -EAGAIN)
+			return -ENOSPC;
 		if (unlikely(ret != -ENOSPC))
 			return ret;
 		ret = ttm_mem_evict_first(bdev, man, place, ctx,
@@ -783,7 +785,7 @@ int ttm_bo_mem_space(struct ttm_buffer_object *bo,
 
 		type_found = true;
 		ret = ttm_resource_alloc(bo, place, mem);
-		if (ret == -ENOSPC)
+		if (ret == -ENOSPC || ret == -EAGAIN)
 			continue;
 		if (unlikely(ret))
 			goto error;
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [RFC PATCH 4/4] drm/xe: Add support for the drm cgroup
  2023-05-03  8:34 [RFC PATCH 0/4] Add support for DRM cgroup memory accounting Maarten Lankhorst
                   ` (2 preceding siblings ...)
  2023-05-03  8:34 ` [RFC PATCH 3/4] drm/ttm: Handle -EAGAIN in ttm_resource_alloc as -ENOSPC Maarten Lankhorst
@ 2023-05-03  8:35 ` Maarten Lankhorst
  2023-05-05 19:50 ` [RFC PATCH 0/4] Add support for DRM cgroup memory accounting Tejun Heo
  4 siblings, 0 replies; 16+ messages in thread
From: Maarten Lankhorst @ 2023-05-03  8:35 UTC (permalink / raw)
  To: dri-devel, cgroups, intel-xe
  Cc: Tvrtko Ursulin, Thomas Zimmermann, intel-gfx, linux-kernel,
	amd-gfx, Zefan Li, Johannes Weiner, Tejun Heo

Add some code to implement basic support for the vram0, vram1 and stolen
memory regions.

I fear the try_charge code should probably be done inside TTM. This
code should interact with the shrinker, but for a simple RFC it's good
enough.

Signed-off-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
---
 drivers/gpu/drm/xe/xe_device.c             |  4 ++++
 drivers/gpu/drm/xe/xe_device_types.h       |  4 ++++
 drivers/gpu/drm/xe/xe_ttm_vram_mgr.c       | 21 +++++++++++++++++++--
 drivers/gpu/drm/xe/xe_ttm_vram_mgr_types.h |  5 +++++
 4 files changed, 32 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/xe/xe_device.c b/drivers/gpu/drm/xe/xe_device.c
index 45d6e5ff47fd..f0a5af15a662 100644
--- a/drivers/gpu/drm/xe/xe_device.c
+++ b/drivers/gpu/drm/xe/xe_device.c
@@ -291,6 +291,10 @@ int xe_device_probe(struct xe_device *xe)
 	/* Allocate and map stolen after potential VRAM resize */
 	xe_ttm_stolen_mgr_init(xe);
 
+	err = drmmcg_register_device(&xe->drm, &xe->cg);
+	if (err)
+		goto err_irq_shutdown;
+
 	/*
 	 * Now that GT is initialized (TTM in particular),
 	 * we can try to init display, and inherit the initial fb.
diff --git a/drivers/gpu/drm/xe/xe_device_types.h b/drivers/gpu/drm/xe/xe_device_types.h
index 1cb404e48aaa..04b85060cbec 100644
--- a/drivers/gpu/drm/xe/xe_device_types.h
+++ b/drivers/gpu/drm/xe/xe_device_types.h
@@ -12,6 +12,8 @@
 #include <drm/drm_file.h>
 #include <drm/ttm/ttm_device.h>
 
+#include <linux/cgroup_drm.h>
+
 #include "xe_gt_types.h"
 #include "xe_platform_types.h"
 #include "xe_step_types.h"
@@ -55,6 +57,8 @@ struct xe_device {
 	/** @drm: drm device */
 	struct drm_device drm;
 
+	struct drmcgroup_device cg;
+
 	/** @info: device info */
 	struct intel_device_info {
 		/** @graphics_name: graphics IP name */
diff --git a/drivers/gpu/drm/xe/xe_ttm_vram_mgr.c b/drivers/gpu/drm/xe/xe_ttm_vram_mgr.c
index 73836b9b7fed..263cd4ef7b6d 100644
--- a/drivers/gpu/drm/xe/xe_ttm_vram_mgr.c
+++ b/drivers/gpu/drm/xe/xe_ttm_vram_mgr.c
@@ -50,6 +50,7 @@ static int xe_ttm_vram_mgr_new(struct ttm_resource_manager *man,
 			       struct ttm_resource **res)
 {
 	struct xe_ttm_vram_mgr *mgr = to_xe_ttm_vram_mgr(man);
+	struct xe_device *xe = ttm_to_xe_device(tbo->bdev);
 	struct xe_ttm_vram_mgr_resource *vres;
 	struct drm_buddy *mm = &mgr->mm;
 	u64 size, remaining_size, min_page_size;
@@ -116,9 +117,8 @@ static int xe_ttm_vram_mgr_new(struct ttm_resource_manager *man,
 
 	mutex_lock(&mgr->lock);
 	if (lpfn <= mgr->visible_size >> PAGE_SHIFT && size > mgr->visible_avail) {
-		mutex_unlock(&mgr->lock);
 		err = -ENOSPC;
-		goto error_fini;
+		goto error_unlock;
 	}
 
 	if (place->fpfn + (size >> PAGE_SHIFT) != place->lpfn &&
@@ -129,6 +129,10 @@ static int xe_ttm_vram_mgr_new(struct ttm_resource_manager *man,
 		lpfn = max_t(unsigned long, place->fpfn + (size >> PAGE_SHIFT), lpfn);
 	}
 
+	err = drmcg_try_charge(&vres->cg, &xe->cg, mgr->mem_type, vres->base.size);
+	if (err)
+		goto error_unlock;
+
 	remaining_size = size;
 	do {
 		/*
@@ -197,6 +201,8 @@ static int xe_ttm_vram_mgr_new(struct ttm_resource_manager *man,
 
 error_free_blocks:
 	drm_buddy_free_list(mm, &vres->blocks);
+	drmcg_uncharge(vres->cg, &xe->cg, mgr->mem_type, vres->base.size);
+error_unlock:
 	mutex_unlock(&mgr->lock);
 error_fini:
 	ttm_resource_fini(man, &vres->base);
@@ -211,6 +217,7 @@ static void xe_ttm_vram_mgr_del(struct ttm_resource_manager *man,
 	struct xe_ttm_vram_mgr_resource *vres =
 		to_xe_ttm_vram_mgr_resource(res);
 	struct xe_ttm_vram_mgr *mgr = to_xe_ttm_vram_mgr(man);
+	struct xe_device *xe = ttm_to_xe_device(man->bdev);
 	struct drm_buddy *mm = &mgr->mm;
 
 	mutex_lock(&mgr->lock);
@@ -218,6 +225,7 @@ static void xe_ttm_vram_mgr_del(struct ttm_resource_manager *man,
 	mgr->visible_avail += vres->used_visible_size;
 	mutex_unlock(&mgr->lock);
 
+	drmcg_uncharge(vres->cg, &xe->cg, mgr->mem_type, vres->base.size);
 	ttm_resource_fini(man, res);
 
 	kfree(vres);
@@ -337,6 +345,15 @@ int __xe_ttm_vram_mgr_init(struct xe_device *xe, struct xe_ttm_vram_mgr *mgr,
 	struct ttm_resource_manager *man = &mgr->manager;
 	int err;
 
+	xe->cg.regions[mem_type].size = size;
+
+	if (mem_type == XE_PL_STOLEN) {
+		xe->cg.regions[mem_type].name = "stolen";
+	} else {
+		xe->cg.regions[mem_type].name =
+			mem_type == XE_PL_VRAM0 ? "vram0" : "vram1";
+	}
+
 	man->func = &xe_ttm_vram_mgr_func;
 	mgr->mem_type = mem_type;
 	mutex_init(&mgr->lock);
diff --git a/drivers/gpu/drm/xe/xe_ttm_vram_mgr_types.h b/drivers/gpu/drm/xe/xe_ttm_vram_mgr_types.h
index 3d9417ff7434..232585d7ae69 100644
--- a/drivers/gpu/drm/xe/xe_ttm_vram_mgr_types.h
+++ b/drivers/gpu/drm/xe/xe_ttm_vram_mgr_types.h
@@ -9,6 +9,8 @@
 #include <drm/drm_buddy.h>
 #include <drm/ttm/ttm_device.h>
 
+struct drmcgroup_state;
+
 struct xe_gt;
 
 /**
@@ -47,6 +49,9 @@ struct xe_ttm_vram_mgr_resource {
 	u64 used_visible_size;
 	/** @flags: flags associated with the resource */
 	unsigned long flags;
+
+	/** @cg: cgroup this resource is charged to */
+	struct drmcgroup_state *cg;
 };
 
 #endif
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [Intel-xe] [RFC PATCH 3/4] drm/ttm: Handle -EAGAIN in ttm_resource_alloc as -ENOSPC.
  2023-05-03  8:34 ` [RFC PATCH 3/4] drm/ttm: Handle -EAGAIN in ttm_resource_alloc as -ENOSPC Maarten Lankhorst
@ 2023-05-03  9:11   ` Thomas Hellström
  2023-05-03  9:36     ` Maarten Lankhorst
  0 siblings, 1 reply; 16+ messages in thread
From: Thomas Hellström @ 2023-05-03  9:11 UTC (permalink / raw)
  To: Maarten Lankhorst, dri-devel, cgroups, intel-xe
  Cc: Tvrtko Ursulin, Zefan Li, intel-gfx, linux-kernel, amd-gfx,
	Thomas Zimmermann, Johannes Weiner, Tejun Heo

Hi, Maarten

On 5/3/23 10:34, Maarten Lankhorst wrote:
> This allows the drm cgroup controller to return no space is available..
>
> XXX: This is a hopeless simplification that changes behavior, and
> returns -ENOSPC even if we could evict ourselves from the current
> cgroup.
>
> Ideally, the eviction code becomes cgroup aware, and will force eviction
> from the current cgroup or its parents.
>
> Signed-off-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>

Thinking of the shrinker analogy, do non-cgroup aware shrinkers just 
shrink blindly or do they reject shrinking like this patch when a cgroup 
limit is reached?

/Thomas


> ---
>   drivers/gpu/drm/ttm/ttm_bo.c | 4 +++-
>   1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/ttm/ttm_bo.c b/drivers/gpu/drm/ttm/ttm_bo.c
> index bd5dae4d1624..e057d5d8f09a 100644
> --- a/drivers/gpu/drm/ttm/ttm_bo.c
> +++ b/drivers/gpu/drm/ttm/ttm_bo.c
> @@ -731,6 +731,8 @@ static int ttm_bo_mem_force_space(struct ttm_buffer_object *bo,
>   		ret = ttm_resource_alloc(bo, place, mem);
>   		if (likely(!ret))
>   			break;
> +		if (ret == -EAGAIN)
> +			return -ENOSPC;
>   		if (unlikely(ret != -ENOSPC))
>   			return ret;
>   		ret = ttm_mem_evict_first(bdev, man, place, ctx,
> @@ -783,7 +785,7 @@ int ttm_bo_mem_space(struct ttm_buffer_object *bo,
>   
>   		type_found = true;
>   		ret = ttm_resource_alloc(bo, place, mem);
> -		if (ret == -ENOSPC)
> +		if (ret == -ENOSPC || ret == -EAGAIN)
>   			continue;
>   		if (unlikely(ret))
>   			goto error;

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [Intel-xe] [RFC PATCH 3/4] drm/ttm: Handle -EAGAIN in ttm_resource_alloc as -ENOSPC.
  2023-05-03  9:11   ` [Intel-xe] " Thomas Hellström
@ 2023-05-03  9:36     ` Maarten Lankhorst
  0 siblings, 0 replies; 16+ messages in thread
From: Maarten Lankhorst @ 2023-05-03  9:36 UTC (permalink / raw)
  To: Thomas Hellström, dri-devel, cgroups, intel-xe
  Cc: Tvrtko Ursulin, Zefan Li, intel-gfx, linux-kernel, amd-gfx,
	Thomas Zimmermann, Johannes Weiner, Tejun Heo



On 2023-05-03 11:11, Thomas Hellström wrote:
> Hi, Maarten
>
> On 5/3/23 10:34, Maarten Lankhorst wrote:
>> This allows the drm cgroup controller to return no space is available..
>>
>> XXX: This is a hopeless simplification that changes behavior, and
>> returns -ENOSPC even if we could evict ourselves from the current
>> cgroup.
>>
>> Ideally, the eviction code becomes cgroup aware, and will force eviction
>> from the current cgroup or its parents.
>>
>> Signed-off-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
>
> Thinking of the shrinker analogy, do non-cgroup aware shrinkers just 
> shrink blindly or do they reject shrinking like this patch when a 
> cgroup limit is reached?

When I made the cgroup controller return -ENOSPC, I just hit an infinite
loop, since the eviction path sees that enough memory is free and tries to
allocate again. Hence the -EAGAIN handling here: userspace still gets
-ENOSPC, but without the infinite looping.

I think there should be 2 code paths:

- OOM, generic case: Handle like we do now. No special cgroup handling
is needed right now.

Might change if we implement cgroup memory semantics. See the memory 
section of Documentation/admin-guide/cgroup-v2.rst

It could be useful regardless.

- OOM, cgroup limit reached: Check for each BO if it's valuable to evict to unblock the relevant limit.


            / cg1.0
root - cg1 - cg1.1
    \       \ cg1.2
     \ cg2

If we hit the cg limit in cg1.0 for only cg1.0, it doesn't make sense to evict from any other cgroup.
If we hit the limit in cg1.0 for the entirety of cg1, it makes sense to evict from any of the cg1 nodes, but not from cg2.

This should be relatively straightforward to implement. We identify which cgroup hit a limit, and then let the shrinker
run only on that cgroup and its children.

This could be simplified to the OOM generic case, for root/NULL cg.
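
A very rough sketch of the check I have in mind (illustrative only; it
assumes the allocating path knows which drmcgroup_state hit its limit, and
that parent_drmcg() is made available to the eviction code):

  /*
   * Sketch only: limit_cg is the cgroup whose limit was hit (NULL for the
   * plain device OOM case), bo_cg is the cgroup the BO was charged to.
   * Eviction is only allowed from limit_cg and its children.
   */
  static bool foo_evict_allowed(struct drmcgroup_state *limit_cg,
				struct drmcgroup_state *bo_cg)
  {
	struct drmcgroup_state *p;

	if (!limit_cg)		/* generic OOM: evict from anywhere */
		return true;

	for (p = bo_cg; p; p = parent_drmcg(p))
		if (p == limit_cg)
			return true;

	return false;
  }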


~Maarten


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [Intel-gfx] [RFC PATCH 2/4] drm/cgroup: Add memory accounting to DRM cgroup
  2023-05-03  8:34 ` [RFC PATCH 2/4] drm/cgroup: Add memory accounting to DRM cgroup Maarten Lankhorst
@ 2023-05-03 15:31   ` Tvrtko Ursulin
  2023-05-03 15:33     ` Maarten Lankhorst
  2023-05-05 14:21     ` Maarten Lankhorst
  0 siblings, 2 replies; 16+ messages in thread
From: Tvrtko Ursulin @ 2023-05-03 15:31 UTC (permalink / raw)
  To: Maarten Lankhorst, dri-devel, cgroups, intel-xe
  Cc: Zefan Li, intel-gfx, linux-kernel, amd-gfx, Thomas Zimmermann,
	Johannes Weiner, Tejun Heo


On 03/05/2023 09:34, Maarten Lankhorst wrote:
> Based roughly on the rdma and misc cgroup controllers, with a lot of
> the accounting code borrowed from rdma.
> 
> The interface is simple:
> - populate drmcgroup_device->regions[..] name and size for each active
>    region.
> - Call drm(m)cg_register_device()
> - Use drmcg_try_charge to check if you can allocate a chunk of memory,
>    use drmcg_uncharge when freeing it. This may return an error code,
>    or -EAGAIN when the cgroup limit is reached.
> 
> The ttm code transforms -EAGAIN back to -ENOSPC since it has specific
> logic for -ENOSPC, and returning -EAGAIN to userspace causes drmIoctl
> to restart infinitely.
> 
> This API allows you to limit stuff with cgroups.
> You can see the supported cards in /sys/fs/cgroup/drm.capacity
> You need to echo +drm to cgroup.subtree_control, and then you can
> partition memory.
> 
> In each cgroup subdir:
> drm.max shows the current limits of the cgroup.
> drm.current the current amount of allocated memory used by this cgroup.
> drm.events shows the amount of time max memory was reached.

Events is not in the patch?

> Signed-off-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
> ---
>   Documentation/admin-guide/cgroup-v2.rst |  46 ++
>   Documentation/gpu/drm-compute.rst       |  54 +++
>   include/linux/cgroup_drm.h              |  81 ++++
>   kernel/cgroup/drm.c                     | 539 +++++++++++++++++++++++-
>   4 files changed, 699 insertions(+), 21 deletions(-)
>   create mode 100644 Documentation/gpu/drm-compute.rst
> 
> diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
> index f67c0829350b..b858d99cb2ef 100644
> --- a/Documentation/admin-guide/cgroup-v2.rst
> +++ b/Documentation/admin-guide/cgroup-v2.rst
> @@ -2374,6 +2374,52 @@ RDMA Interface Files
>   	  mlx4_0 hca_handle=1 hca_object=20
>   	  ocrdma1 hca_handle=1 hca_object=23
>   
> +DRM
> +----
> +
> +The "drm" controller regulates the distribution and accounting of
> +DRM resources.
> +
> +DRM Interface Files
> +~~~~~~~~~~~~~~~~~~~~
> +
> +  drm.max
> +	A readwrite nested-keyed file that exists for all the cgroups
> +	except root that describes current configured resource limit
> +	for a DRM device.
> +
> +	Lines are keyed by device name and are not ordered.
> +	Each line contains space separated resource name and its configured
> +	limit that can be distributed.
> +
> +	The following nested keys are defined.
> +
> +	  ==========	=======================================================
> +	  region.* 	Maximum amount of bytes that allocatable in this region
> +	  ==========	=======================================================
> +
> +	An example for xe follows::
> +
> +	  0000:03:00.0 region.vram0=1073741824 region.stolen=max
> +
> +  drm.capacity
> +	A read-only file that describes maximum region capacity.
> +	It only exists on the root cgroup. Not all memory can be
> +	allocated by cgroups, as the kernel reserves some for
> +	internal use.
> +
> +	An example for xe follows::
> +
> +	  0000:03:00.0 region.vram0=8514437120 region.stolen=67108864
> +
> +  drm.current
> +	A read-only file that describes current resource usage.
> +	It exists for all the cgroup except root.
> +
> +	An example for xe follows::
> +
> +	  0000:03:00.0 region.vram0=12550144 region.stolen=8650752
> +
>   HugeTLB
>   -------
>   
> diff --git a/Documentation/gpu/drm-compute.rst b/Documentation/gpu/drm-compute.rst
> new file mode 100644
> index 000000000000..116270976ef7
> --- /dev/null
> +++ b/Documentation/gpu/drm-compute.rst
> @@ -0,0 +1,54 @@
> +==================================
> +Long running workloads and compute
> +==================================
> +
> +Long running workloads (compute) are workloads that will not complete in 10
> +seconds. (The time let the user wait before he reaches for the power button).
> +This means that other techniques need to be used to manage those workloads,
> +that cannot use fences.
> +
> +Some hardware may schedule compute jobs, and have no way to pre-empt them, or
> +have their memory swapped out from them. Or they simply want their workload
> +not to be preempted or swapped out at all.
> +
> +This means that it differs from what is described in driver-api/dma-buf.rst.
> +
> +As with normal compute jobs, dma-fence may not be used at all. In this case,
> +not even to force preemption. The driver with is simply forced to unmap a BO
> +from the long compute job's address space on unbind immediately, not even
> +waiting for the workload to complete. Effectively this terminates the workload
> +when there is no hardware support to recover.
> +
> +Since this is undesirable, there need to be mitigations to prevent a workload
> +from being terminated. There are several possible approach, all with their
> +advantages and drawbacks.
> +
> +The first approach you will likely try is to pin all buffers used by compute.
> +This guarantees that the job will run uninterrupted, but also allows a very
> +denial of service attack by pinning as much memory as possible, hogging the
> +all GPU memory, and possibly a huge chunk of CPU memory.
> +
> +A second approach that will work slightly better on its own is adding an option
> +not to evict when creating a new job (any kind). If all of userspace opts in
> +to this flag, it would prevent cooperating userspace from forced terminating
> +older compute jobs to start a new one.
> +
> +If job preemption and recoverable pagefaults are not available, those are the
> +only approaches possible. So even with those, you want a separate way of
> +controlling resources. The standard kernel way of doing so is cgroups.
> +
> +This creates a third option, using cgroups to prevent eviction. Both GPU and
> +driver-allocated CPU memory would be accounted to the correct cgroup, and
> +eviction would be made cgroup aware. This allows the GPU to be partitioned
> +into cgroups, that will allow jobs to run next to each other without
> +interference.

The 3rd approach is only valid if used strictly with device local 
memory, right? Because as soon as system memory backed buffers are used 
this approach cannot guarantee no eviction can be triggered.

> +
> +The interface to the cgroup would be similar to the current CPU memory
> +interface, with similar semantics for min/low/high/max, if eviction can
> +be made cgroup aware. For now only max is implemented.
> +
> +What should be noted is that each memory region (tiled memory for example)
> +should have its own accounting, using $card key0 = value0 key1 = value1.
> +
> +The key is set to the regionid set by the driver, for example "tile0".
> +For the value of $card, we use drmGetUnique().
> diff --git a/include/linux/cgroup_drm.h b/include/linux/cgroup_drm.h
> index 8ef66a47619f..4f17b1c85f47 100644
> --- a/include/linux/cgroup_drm.h
> +++ b/include/linux/cgroup_drm.h
> @@ -6,4 +6,85 @@
>   #ifndef _CGROUP_DRM_H
>   #define _CGROUP_DRM_H
>   
> +#include <linux/types.h>
> +
> +#include <drm/drm_managed.h>
> +
> +struct drm_device;
> +struct drm_file;
> +
> +struct drmcgroup_state;
> +
> +/*
> + * Use 8 as max, because of N^2 lookup when setting things, can be bumped if needed
> + * Identical to TTM_NUM_MEM_TYPES to allow simplifying that code.
> + */
> +#define DRMCG_MAX_REGIONS 8
> +
> +struct drmcgroup_device {
> +	struct list_head list;
> +	struct list_head pools;
> +
> +	struct {
> +		u64 size;
> +		const char *name;
> +	} regions[DRMCG_MAX_REGIONS];
> +
> +	/* Name describing the card, set by drmcg_register_device */
> +	const char *name;
> +
> +};
> +
> +#if IS_ENABLED(CONFIG_CGROUP_DRM)
> +int drmcg_register_device(struct drm_device *dev,
> +			   struct drmcgroup_device *drm_cg);
> +void drmcg_unregister_device(struct drmcgroup_device *cgdev);
> +int drmcg_try_charge(struct drmcgroup_state **drmcg,
> +		     struct drmcgroup_device *cgdev,
> +		     u32 index, u64 size);
> +void drmcg_uncharge(struct drmcgroup_state *drmcg,
> +		    struct drmcgroup_device *cgdev,
> +		    u32 index, u64 size);
> +#else
> +static inline int
> +drmcg_register_device(struct drm_device *dev,
> +		      struct drm_cgroup *drm_cg)
> +{
> +	return 0;
> +}
> +
> +static inline void drmcg_unregister_device(struct drmcgroup_device *cgdev)
> +{
> +}
> +
> +static inline int drmcg_try_charge(struct drmcgroup_state **drmcg,
> +				   struct drmcgroup_device *cgdev,
> +				   u32 index, u64 size)
> +{
> +	*drmcg = NULL;
> +	return 0;
> +}
> +
> +static inline void drmcg_uncharge(struct drmcgroup_state *drmcg,
> +				  struct drmcgroup_device *cgdev,
> +				  u32 index, u64 size)
> +{ }
> +#endif
> +
> +static inline void drmmcg_unregister_device(struct drm_device *dev, void *arg)
> +{
> +	drmcg_unregister_device(arg);
> +}
> +
> +/*
> + * This needs to be done as inline, because cgroup lives in the core
> + * kernel and it cannot call drm calls directly
> + */
> +static inline int drmmcg_register_device(struct drm_device *dev,
> +					 struct drmcgroup_device *cgdev)
> +{
> +	return drmcg_register_device(dev, cgdev) ?:
> +		drmm_add_action_or_reset(dev, drmmcg_unregister_device, cgdev);
> +}
> +
>   #endif	/* _CGROUP_DRM_H */
> diff --git a/kernel/cgroup/drm.c b/kernel/cgroup/drm.c
> index 02c8eaa633d3..a93d9344fd36 100644
> --- a/kernel/cgroup/drm.c
> +++ b/kernel/cgroup/drm.c
> @@ -1,60 +1,557 @@
> -/* SPDX-License-Identifier: MIT */
> +// SPDX-License-Identifier: GPL-2.0
>   /*
> - * Copyright © 2023 Intel Corporation
> + * Copyright 2023 Intel
> + * Partially based on the rdma and misc controllers, which bear the following copyrights:
> + *
> + * Copyright 2020 Google LLC
> + * Copyright (C) 2016 Parav Pandit <pandit.parav@gmail.com>
>    */
>   
>   #include <linux/cgroup.h>
>   #include <linux/cgroup_drm.h>
> +#include <linux/list.h>
> +#include <linux/mutex.h>
> +#include <linux/parser.h>
>   #include <linux/slab.h>
>   
> -struct drm_cgroup_state {

As a side note, it'd be easier to read the diff if you left the name as 
is, and some other details too, like the static root group (I need to 
remind myself if/why I needed it, but does it harm you?) and my missed 
static keywords and needless static struct initialization. I will fix 
that up in my patch locally. Anyway, that way it would maybe be less 
churn from one patch to the other in the series.

> +#include <drm/drm_device.h>
> +#include <drm/drm_drv.h>
> +#include <drm/drm_file.h>
> +#include <drm/drm_managed.h>
> +
> +struct drmcgroup_state {
>   	struct cgroup_subsys_state css;
> +
> +	struct list_head pools;
>   };
>   
> -struct drm_root_cgroup_state {
> -	struct drm_cgroup_state drmcs;
> +struct drmcgroup_pool_state {
> +	struct drmcgroup_device *device;
> +	struct drmcgroup_resource {
> +		s64 max, used;
> +	} resources[DRMCG_MAX_REGIONS];
> +
> +	s64 usage_sum;
> +
> +	struct list_head	cg_node;

cg always makes me think cgroup and not css so it is a bit confusing.

Why are two lists needed?

> +	struct list_head	dev_node;
>   };
>   
> -static struct drm_root_cgroup_state root_drmcs;
> +static DEFINE_MUTEX(drmcg_mutex);
> +static LIST_HEAD(drmcg_devices);
>   
> -static inline struct drm_cgroup_state *
> +static inline struct drmcgroup_state *
>   css_to_drmcs(struct cgroup_subsys_state *css)
>   {
> -	return container_of(css, struct drm_cgroup_state, css);
> +	return container_of(css, struct drmcgroup_state, css);
> +}
> +
> +static inline struct drmcgroup_state *get_current_drmcg(void)
> +{
> +	return css_to_drmcs(task_get_css(current, drm_cgrp_id));
> +}
> +
> +static struct drmcgroup_state *parent_drmcg(struct drmcgroup_state *cg)
> +{
> +	return css_to_drmcs(cg->css.parent);
> +}
> +
> +static void free_cg_pool_locked(struct drmcgroup_pool_state *pool)
> +{
> +	lockdep_assert_held(&drmcg_mutex);
> +
> +	list_del(&pool->cg_node);
> +	list_del(&pool->dev_node);
> +	kfree(pool);
> +}
> +
> +static void
> +set_resource_max(struct drmcgroup_pool_state *pool, int i, u64 new_max)
> +{
> +	pool->resources[i].max = new_max;
> +}
> +
> +static void set_all_resource_max_limit(struct drmcgroup_pool_state *rpool)
> +{
> +	int i;
> +
> +	for (i = 0; i < DRMCG_MAX_REGIONS; i++)
> +		set_resource_max(rpool, i, S64_MAX);
> +}
> +
> +static void drmcs_offline(struct cgroup_subsys_state *css)
> +{
> +	struct drmcgroup_state *drmcs = css_to_drmcs(css);
> +	struct drmcgroup_pool_state *pool, *next;
> +
> +	mutex_lock(&drmcg_mutex);
> +	list_for_each_entry_safe(pool, next, &drmcs->pools, cg_node) {
> +		if (!pool->usage_sum) {
> +			free_cg_pool_locked(pool);
> +		} else {
> +			/* Reset all regions, last uncharge will remove pool */
> +			set_all_resource_max_limit(pool);
> +		}
> +	}
> +	mutex_unlock(&drmcg_mutex);
>   }
>   
>   static void drmcs_free(struct cgroup_subsys_state *css)
>   {
> -	struct drm_cgroup_state *drmcs = css_to_drmcs(css);
> +	struct drmcgroup_state *drmcs = css_to_drmcs(css);
>   
> -	if (drmcs != &root_drmcs.drmcs)
> -		kfree(drmcs);
> +	kfree(drmcs);
>   }
>   
>   static struct cgroup_subsys_state *
>   drmcs_alloc(struct cgroup_subsys_state *parent_css)
>   {
> -	struct drm_cgroup_state *drmcs;
> +	struct drmcgroup_state *drmcs = kzalloc(sizeof(*drmcs), GFP_KERNEL);
> +	if (!drmcs)
> +		return ERR_PTR(-ENOMEM);
> +
> +	INIT_LIST_HEAD(&drmcs->pools);
> +	return &drmcs->css;
> +}
> +
> +static struct drmcgroup_pool_state *
> +find_cg_pool_locked(struct drmcgroup_state *drmcs, struct drmcgroup_device *dev)
> +{
> +	struct drmcgroup_pool_state *pool;
> +
> +	list_for_each_entry(pool, &drmcs->pools, cg_node)
> +		if (pool->device == dev)
> +			return pool;
> +
> +	return NULL;
> +}
> +
> +static struct drmcgroup_pool_state *
> +get_cg_pool_locked(struct drmcgroup_state *drmcs, struct drmcgroup_device *dev)
> +{
> +	struct drmcgroup_pool_state *pool;
> +
> +	pool = find_cg_pool_locked(drmcs, dev);
> +	if (pool)
> +		return pool;
> +
> +	pool = kzalloc(sizeof(*pool), GFP_KERNEL);
> +	if (!pool)
> +		return ERR_PTR(-ENOMEM);
> +
> +	pool->device = dev;
> +	set_all_resource_max_limit(pool);
>   
> -	if (!parent_css) {
> -		drmcs = &root_drmcs.drmcs;
> -	} else {
> -		drmcs = kzalloc(sizeof(*drmcs), GFP_KERNEL);
> -		if (!drmcs)
> -			return ERR_PTR(-ENOMEM);
> +	INIT_LIST_HEAD(&pool->cg_node);
> +	INIT_LIST_HEAD(&pool->dev_node);
> +	list_add_tail(&pool->cg_node, &drmcs->pools);
> +	list_add_tail(&pool->dev_node, &dev->pools);
> +	return pool;
> +}
> +
> +void drmcg_unregister_device(struct drmcgroup_device *cgdev)
> +{
> +	struct drmcgroup_pool_state *pool, *next;
> +
> +	mutex_lock(&drmcg_mutex);
> +	list_del(&cgdev->list);
> +
> +	list_for_each_entry_safe(pool, next, &cgdev->pools, dev_node)
> +		free_cg_pool_locked(pool);
> +	mutex_unlock(&drmcg_mutex);
> +	kfree(cgdev->name);
> +}
> +
> +EXPORT_SYMBOL_GPL(drmcg_unregister_device);
> +
> +int drmcg_register_device(struct drm_device *dev,
> +			  struct drmcgroup_device *cgdev)
> +{
> +	char *name = kstrdup(dev->unique, GFP_KERNEL);
> +	if (!name)
> +		return -ENOMEM;
> +
> +	INIT_LIST_HEAD(&cgdev->pools);
> +	mutex_lock(&drmcg_mutex);
> +	cgdev->name = name;
> +	list_add_tail(&cgdev->list, &drmcg_devices);
> +	mutex_unlock(&drmcg_mutex);
> +
> +	return 0;
> +}
> +EXPORT_SYMBOL_GPL(drmcg_register_device);
> +
> +static int drmcg_max_show(struct seq_file *sf, void *v)
> +{
> +	struct drmcgroup_state *drmcs = css_to_drmcs(seq_css(sf));
> +	struct drmcgroup_pool_state *pool;
> +
> +	mutex_lock(&drmcg_mutex);
> +	list_for_each_entry(pool, &drmcs->pools, cg_node) {
> +		struct drmcgroup_device *dev = pool->device;
> +		int i;
> +
> +		seq_puts(sf, dev->name);
> +
> +		for (i = 0; i < DRMCG_MAX_REGIONS; i++) {
> +			if (!dev->regions[i].name)
> +				continue;
> +
> +			if (pool->resources[i].max < S64_MAX)
> +				seq_printf(sf, " region.%s=%lld", dev->regions[i].name,
> +					   pool->resources[i].max);
> +			else
> +				seq_printf(sf, " region.%s=max", dev->regions[i].name);
> +		}
> +
> +		seq_putc(sf, '\n');
>   	}
> +	mutex_unlock(&drmcg_mutex);
>   
> -	return &drmcs->css;
> +	return 0;
> +}
> +
> +static struct drmcgroup_device *drmcg_get_device_locked(const char *name)
> +{
> +	struct drmcgroup_device *dev;
> +
> +	lockdep_assert_held(&drmcg_mutex);
> +
> +	list_for_each_entry(dev, &drmcg_devices, list)
> +		if (!strcmp(name, dev->name))
> +			return dev;
> +
> +	return NULL;
> +}
> +
> +static void try_to_free_cg_pool_locked(struct drmcgroup_pool_state *pool)
> +{
> +	struct drmcgroup_device *dev = pool->device;
> +	u32 i;
> +
> +	/* Memory charged to this pool */
> +	if (pool->usage_sum)
> +		return;
> +
> +	for (i = 0; i < DRMCG_MAX_REGIONS; i++) {
> +		if (!dev->regions[i].name)
> +			continue;
> +
> +		/* Is a specific limit set? */
> +		if (pool->resources[i].max < S64_MAX)
> +			return;
> +	}
> +
> +	/*
> +	 * No user of the pool and all entries are set to defaults;
> +	 * safe to delete this pool.
> +	 */
> +	free_cg_pool_locked(pool);
> +}
> +
> +
> +static void
> +uncharge_cg_locked(struct drmcgroup_state *drmcs,
> +		   struct drmcgroup_device *cgdev,
> +		   u32 index, u64 size)
> +{
> +	struct drmcgroup_pool_state *pool;
> +
> +	pool = find_cg_pool_locked(drmcs, cgdev);
> +
> +	if (unlikely(!pool)) {
> +		pr_warn("Invalid device %p or drm cgroup %p\n", cgdev, drmcs);
> +		return;
> +	}
> +
> +	pool->resources[index].used -= size;
> +
> +	/*
> +	 * A negative count (or overflow) is invalid,
> +	 * it indicates a bug in the rdma controller.
> +	 */
> +	WARN_ON_ONCE(pool->resources[index].used < 0);
> +	pool->usage_sum--;
> +	try_to_free_cg_pool_locked(pool);
> +}
> +
> +static void drmcg_uncharge_hierarchy(struct drmcgroup_state *drmcs,
> +				     struct drmcgroup_device *cgdev,
> +				     struct drmcgroup_state *stop_cg,
> +				     u32 index, u64 size)
> +{
> +	struct drmcgroup_state *p;
> +
> +	mutex_lock(&drmcg_mutex);
> +
> +	for (p = drmcs; p != stop_cg; p = parent_drmcg(p))
> +		uncharge_cg_locked(p, cgdev, index, size);
> +
> +	mutex_unlock(&drmcg_mutex);
> +
> +	css_put(&drmcs->css);
> +}
> +
> +void drmcg_uncharge(struct drmcgroup_state *drmcs,
> +		    struct drmcgroup_device *cgdev,
> +		    u32 index,
> +		    u64 size)
> +{
> +	if (index >= DRMCG_MAX_REGIONS)
> +		return;
> +
> +	drmcg_uncharge_hierarchy(drmcs, cgdev, NULL, index, size);
> +}
> +EXPORT_SYMBOL_GPL(drmcg_uncharge);
> +
> +int drmcg_try_charge(struct drmcgroup_state **drmcs,
> +		     struct drmcgroup_device *cgdev,
> +		     u32 index,
> +		     u64 size)
> +{
> +	struct drmcgroup_state *cg, *p;
> +	struct drmcgroup_pool_state *pool;
> +	u64 new;
> +	int ret = 0;
> +
> +	if (index >= DRMCG_MAX_REGIONS)
> +		return -EINVAL;
> +
> +	/*
> +	 * hold on to css, as cgroup can be removed but resource
> +	 * accounting happens on css.
> +	 */
> +	cg = get_current_drmcg();

1)

I am not familiar with the Xe flows - charging is at the point of actual 
backing store allocation?

What about buffer sharing?

Also, given how the css is permanently stored in the caller - you 
deliberately decided not to deal with task migrations? I am not sure 
that will work. Or maybe just omitted for RFC v1?

2)

Buffer objects which Xe can migrate between memory regions will be
correctly charged/uncharged as they are moved?
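
I.e. I would expect roughly this pattern on every migration (pseudo-code
only -- the xe_bo fields and the helper name are made up here, just to
illustrate the question):

/* Pseudo-code to illustrate the question, not actual Xe code. */
static int example_bo_move(struct xe_bo *bo, u32 old_region, u32 new_region,
			   u64 size)
{
	struct drmcgroup_state *new_cg;
	int ret;

	/* Charge the destination region before committing the move. */
	ret = drmcg_try_charge(&new_cg, bo->cgdev, new_region, size);
	if (ret)
		return ret;

	/* ... migrate the backing store ... */

	/* Drop the charge held against the old region. */
	drmcg_uncharge(bo->cgrp, bo->cgdev, old_region, size);
	bo->cgrp = new_cg;

	return 0;
}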

Regards,

Tvrtko

> +
> +	mutex_lock(&drmcg_mutex);
> +	for (p = cg; p; p = parent_drmcg(p)) {
> +		pool = get_cg_pool_locked(p, cgdev);
> +		if (IS_ERR(pool)) {
> +			ret = PTR_ERR(pool);
> +			goto err;
> +		} else {
> +			new = pool->resources[index].used + size;
> +			if (new > pool->resources[index].max || new > S64_MAX) {
> +				ret = -EAGAIN;
> +				goto err;
> +			} else {
> +				pool->resources[index].used = new;
> +				pool->usage_sum++;
> +			}
> +		}
> +	}
> +	mutex_unlock(&drmcg_mutex);
> +
> +	*drmcs = cg;
> +	return 0;
> +
> +err:
> +	mutex_unlock(&drmcg_mutex);
> +	drmcg_uncharge_hierarchy(cg, cgdev, p, index, size);
> +	return ret;
> +}
> +EXPORT_SYMBOL_GPL(drmcg_try_charge);
> +
> +static s64 parse_resource(char *c, char **retname)
> +{
> +	substring_t argstr;
> +	char *name, *value = c;
> +	size_t len;
> +	int ret;
> +	u64 retval;
> +
> +	name = strsep(&value, "=");
> +	if (!name || !value)
> +		return -EINVAL;
> +
> +	/* Only support region setting for now */
> +	if (strncmp(name, "region.", 7))
> +		return -EINVAL;
> +	else
> +		name += 7;
> +
> +	*retname = name;
> +	len = strlen(value);
> +
> +	argstr.from = value;
> +	argstr.to = value + len;
> +
> +	ret = match_u64(&argstr, &retval);
> +	if (ret >= 0) {
> +		if (retval > S64_MAX)
> +			return -EINVAL;
> +		return retval;
> +	}
> +	if (!strncmp(value, "max", len))
> +		return S64_MAX;
> +
> +	/* Not u64 or max, error */
> +	return -EINVAL;
> +}
> +
> +static int drmcg_parse_limits(char *options,
> +			      u64 *limits, char **enables)
> +{
> +	char *c;
> +	int num_limits = 0;
> +
> +	/* parse resource options */
> +	while ((c = strsep(&options, " ")) != NULL) {
> +		s64 limit;
> +
> +		if (num_limits >= DRMCG_MAX_REGIONS)
> +			return -EINVAL;
> +
> +		limit = parse_resource(c, &enables[num_limits]);
> +		if (limit < 0)
> +			return limit;
> +
> +		limits[num_limits++] = limit;
> +	}
> +	return num_limits;
> +}
> +
> +static ssize_t drmcg_max_write(struct kernfs_open_file *of,
> +			       char *buf, size_t nbytes, loff_t off)
> +{
> +	struct drmcgroup_state *drmcs = css_to_drmcs(of_css(of));
> +	struct drmcgroup_device *dev;
> +	struct drmcgroup_pool_state *pool;
> +	char *options = strstrip(buf);
> +	char *dev_name = strsep(&options, " ");
> +	u64 limits[DRMCG_MAX_REGIONS];
> +	u64 new_limits[DRMCG_MAX_REGIONS];
> +	char *regions[DRMCG_MAX_REGIONS];
> +	int num_limits, i;
> +	unsigned long set_mask = 0;
> +	int err = 0;
> +
> +	if (!dev_name)
> +		return -EINVAL;
> +
> +	num_limits = drmcg_parse_limits(options, limits, regions);
> +	if (num_limits < 0)
> +		return num_limits;
> +	if (!num_limits)
> +		return -EINVAL;
> +
> +	/*
> +	 * Everything is parsed into key=value pairs now, take lock and attempt to update
> +	 * For good measure, set -EINVAL when a key is set twice.
> +	 */
> +	mutex_lock(&drmcg_mutex);
> +
> +	dev = drmcg_get_device_locked(dev_name);
> +	if (!dev) {
> +		err = -ENODEV;
> +		goto err;
> +	}
> +
> +	pool = get_cg_pool_locked(drmcs, dev);
> +	if (IS_ERR(pool)) {
> +		err = PTR_ERR(pool);
> +		goto err;
> +	}
> +
> +	/* Lookup region names and set new_limits to the index */
> +	for (i = 0; i < num_limits; i++) {
> +		int j;
> +
> +		for (j = 0; j < DRMCG_MAX_REGIONS; j++)
> +			if (dev->regions[j].name &&
> +			    !strcmp(regions[i], dev->regions[j].name))
> +				break;
> +
> +		if (j == DRMCG_MAX_REGIONS ||
> +		    set_mask & BIT(j)) {
> +			err = -EINVAL;
> +			goto err_put;
> +		}
> +
> +		set_mask |= BIT(j);
> +		new_limits[j] = limits[i];
> +	}
> +
> +	/* And commit */
> +	for_each_set_bit(i, &set_mask, DRMCG_MAX_REGIONS)
> +		set_resource_max(pool, i, new_limits[i]);
> +
> +err_put:
> +	try_to_free_cg_pool_locked(pool);
> +err:
> +	mutex_unlock(&drmcg_mutex);
> +
> +	return err ?: nbytes;
> +}
> +
> +static int drmcg_current_show(struct seq_file *sf, void *v)
> +{
> +	struct drmcgroup_state *drmcs = css_to_drmcs(seq_css(sf));
> +	struct drmcgroup_device *dev;
> +
> +	mutex_lock(&drmcg_mutex);
> +	list_for_each_entry(dev, &drmcg_devices, list) {
> +		struct drmcgroup_pool_state *pool = find_cg_pool_locked(drmcs, dev);
> +		int i;
> +
> +		seq_puts(sf, dev->name);
> +
> +		for (i = 0; i < DRMCG_MAX_REGIONS; i++) {
> +			if (!dev->regions[i].name)
> +				continue;
> +
> +			seq_printf(sf, " region.%s=%lld", dev->regions[i].name,
> +				   pool ? pool->resources[i].used : 0ULL);
> +		}
> +
> +		seq_putc(sf, '\n');
> +	}
> +	mutex_unlock(&drmcg_mutex);
> +
> +	return 0;
> +}
> +
> +static int drmcg_capacity_show(struct seq_file *sf, void *v)
> +{
> +	struct drmcgroup_device *dev;
> +	int i;
> +
> +	list_for_each_entry(dev, &drmcg_devices, list) {
> +		seq_puts(sf, dev->name);
> +		for (i = 0; i < DRMCG_MAX_REGIONS; i++)
> +			if (dev->regions[i].name)
> +				seq_printf(sf, " region.%s=%lld",
> +					   dev->regions[i].name,
> +					   dev->regions[i].size);
> +		seq_putc(sf, '\n');
> +	}
> +	return 0;
>   }
>   
> -struct cftype files[] = {
> +static struct cftype files[] = {
> +	{
> +		.name = "max",
> +		.write = drmcg_max_write,
> +		.seq_show = drmcg_max_show,
> +		.flags = CFTYPE_NOT_ON_ROOT,
> +	},
> +	{
> +		.name = "current",
> +		.seq_show = drmcg_current_show,
> +		.flags = CFTYPE_NOT_ON_ROOT,
> +	},
> +	{
> +		.name = "capacity",
> +		.seq_show = drmcg_capacity_show,
> +		.flags = CFTYPE_ONLY_ON_ROOT,
> +	},
>   	{ } /* Zero entry terminates. */
>   };
>   
>   struct cgroup_subsys drm_cgrp_subsys = {
>   	.css_alloc	= drmcs_alloc,
>   	.css_free	= drmcs_free,
> -	.early_init	= false,
> +	.css_offline	= drmcs_offline,
>   	.legacy_cftypes	= files,
>   	.dfl_cftypes	= files,
>   };

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [Intel-gfx] [RFC PATCH 2/4] drm/cgroup: Add memory accounting to DRM cgroup
  2023-05-03 15:31   ` [Intel-gfx] " Tvrtko Ursulin
@ 2023-05-03 15:33     ` Maarten Lankhorst
  2023-05-05 14:21     ` Maarten Lankhorst
  1 sibling, 0 replies; 16+ messages in thread
From: Maarten Lankhorst @ 2023-05-03 15:33 UTC (permalink / raw)
  To: Tvrtko Ursulin, dri-devel, cgroups, intel-xe
  Cc: Zefan Li, intel-gfx, linux-kernel, amd-gfx, Thomas Zimmermann,
	Johannes Weiner, Tejun Heo


On 2023-05-03 17:31, Tvrtko Ursulin wrote:
>
> On 03/05/2023 09:34, Maarten Lankhorst wrote:
>> Based roughly on the rdma and misc cgroup controllers, with a lot of
>> the accounting code borrowed from rdma.
>>
>> The interface is simple:
>> - populate drmcgroup_device->regions[..] name and size for each active
>>    region.
>> - Call drm(m)cg_register_device()
>> - Use drmcg_try_charge to check if you can allocate a chunk of memory,
>>    use drmcg_uncharge when freeing it. This may return an error code,
>>    or -EAGAIN when the cgroup limit is reached.
>>
>> The ttm code transforms -EAGAIN back to -ENOSPC since it has specific
>> logic for -ENOSPC, and returning -EAGAIN to userspace causes drmIoctl
>> to restart infinitely.
>>
>> This API allows you to limit stuff with cgroups.
>> You can see the supported cards in /sys/fs/cgroup/drm.capacity
>> You need to echo +drm to cgroup.subtree_control, and then you can
>> partition memory.
>>
>> In each cgroup subdir:
>> drm.max shows the current limits of the cgroup.
>> drm.current the current amount of allocated memory used by this cgroup.
>> drm.events shows the amount of time max memory was reached.
>
> Events is not in the patch?

Oops, correct.

I removed it since it added more complexity, and didn't seem granular 
enough to be useful.

I removed it from the documentation, but not from the commit message, it seems. :)

>
>> Signed-off-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
>> ---
>>   Documentation/admin-guide/cgroup-v2.rst |  46 ++
>>   Documentation/gpu/drm-compute.rst       |  54 +++
>>   include/linux/cgroup_drm.h              |  81 ++++
>>   kernel/cgroup/drm.c                     | 539 +++++++++++++++++++++++-
>>   4 files changed, 699 insertions(+), 21 deletions(-)
>>   create mode 100644 Documentation/gpu/drm-compute.rst
>>
>> diff --git a/Documentation/admin-guide/cgroup-v2.rst 
>> b/Documentation/admin-guide/cgroup-v2.rst
>> index f67c0829350b..b858d99cb2ef 100644
>> --- a/Documentation/admin-guide/cgroup-v2.rst
>> +++ b/Documentation/admin-guide/cgroup-v2.rst
>> @@ -2374,6 +2374,52 @@ RDMA Interface Files
>>         mlx4_0 hca_handle=1 hca_object=20
>>         ocrdma1 hca_handle=1 hca_object=23
>>   +DRM
>> +----
>> +
>> +The "drm" controller regulates the distribution and accounting of
>> +DRM resources.
>> +
>> +DRM Interface Files
>> +~~~~~~~~~~~~~~~~~~~~
>> +
>> +  drm.max
>> +    A readwrite nested-keyed file that exists for all the cgroups
>> +    except root that describes current configured resource limit
>> +    for a DRM device.
>> +
>> +    Lines are keyed by device name and are not ordered.
>> +    Each line contains space separated resource name and its configured
>> +    limit that can be distributed.
>> +
>> +    The following nested keys are defined.
>> +
>> +      ========== 
>> =======================================================
>> +      region.*     Maximum amount of bytes that allocatable in this 
>> region
>> +      ========== 
>> =======================================================
>> +
>> +    An example for xe follows::
>> +
>> +      0000:03:00.0 region.vram0=1073741824 region.stolen=max
>> +
>> +  drm.capacity
>> +    A read-only file that describes maximum region capacity.
>> +    It only exists on the root cgroup. Not all memory can be
>> +    allocated by cgroups, as the kernel reserves some for
>> +    internal use.
>> +
>> +    An example for xe follows::
>> +
>> +      0000:03:00.0 region.vram0=8514437120 region.stolen=67108864
>> +
>> +  drm.current
>> +    A read-only file that describes current resource usage.
>> +    It exists for all the cgroup except root.
>> +
>> +    An example for xe follows::
>> +
>> +      0000:03:00.0 region.vram0=12550144 region.stolen=8650752
>> +
>>   HugeTLB
>>   -------
>>   diff --git a/Documentation/gpu/drm-compute.rst 
>> b/Documentation/gpu/drm-compute.rst
>> new file mode 100644
>> index 000000000000..116270976ef7
>> --- /dev/null
>> +++ b/Documentation/gpu/drm-compute.rst
>> @@ -0,0 +1,54 @@
>> +==================================
>> +Long running workloads and compute
>> +==================================
>> +
>> +Long running workloads (compute) are workloads that will not 
>> complete in 10
>> +seconds. (The time let the user wait before he reaches for the power 
>> button).
>> +This means that other techniques need to be used to manage those 
>> workloads,
>> +that cannot use fences.
>> +
>> +Some hardware may schedule compute jobs, and have no way to pre-empt 
>> them, or
>> +have their memory swapped out from them. Or they simply want their 
>> workload
>> +not to be preempted or swapped out at all.
>> +
>> +This means that it differs from what is described in 
>> driver-api/dma-buf.rst.
>> +
>> +As with normal compute jobs, dma-fence may not be used at all. In 
>> this case,
>> +not even to force preemption. The driver with is simply forced to 
>> unmap a BO
>> +from the long compute job's address space on unbind immediately, not 
>> even
>> +waiting for the workload to complete. Effectively this terminates 
>> the workload
>> +when there is no hardware support to recover.
>> +
>> +Since this is undesirable, there need to be mitigations to prevent a 
>> workload
>> +from being terminated. There are several possible approach, all with 
>> their
>> +advantages and drawbacks.
>> +
>> +The first approach you will likely try is to pin all buffers used by 
>> compute.
>> +This guarantees that the job will run uninterrupted, but also allows 
>> a very
>> +denial of service attack by pinning as much memory as possible, 
>> hogging the
>> +all GPU memory, and possibly a huge chunk of CPU memory.
>> +
>> +A second approach that will work slightly better on its own is 
>> adding an option
>> +not to evict when creating a new job (any kind). If all of userspace 
>> opts in
>> +to this flag, it would prevent cooperating userspace from forced 
>> terminating
>> +older compute jobs to start a new one.
>> +
>> +If job preemption and recoverable pagefaults are not available, 
>> those are the
>> +only approaches possible. So even with those, you want a separate 
>> way of
>> +controlling resources. The standard kernel way of doing so is cgroups.
>> +
>> +This creates a third option, using cgroups to prevent eviction. Both 
>> GPU and
>> +driver-allocated CPU memory would be accounted to the correct 
>> cgroup, and
>> +eviction would be made cgroup aware. This allows the GPU to be 
>> partitioned
>> +into cgroups, that will allow jobs to run next to each other without
>> +interference.
>
> The 3rd approach is only valid if used strictly with device local 
> memory, right? Because as soon as system memory backed buffers are 
> used this approach cannot guarantee no eviction can be triggered.
>
>> +
>> +The interface to the cgroup would be similar to the current CPU memory
>> +interface, with similar semantics for min/low/high/max, if eviction can
>> +be made cgroup aware. For now only max is implemented.
>> +
>> +What should be noted is that each memory region (tiled memory for 
>> example)
>> +should have its own accounting, using $card key0 = value0 key1 = 
>> value1.
>> +
>> +The key is set to the regionid set by the driver, for example "tile0".
>> +For the value of $card, we use drmGetUnique().
>> diff --git a/include/linux/cgroup_drm.h b/include/linux/cgroup_drm.h
>> index 8ef66a47619f..4f17b1c85f47 100644
>> --- a/include/linux/cgroup_drm.h
>> +++ b/include/linux/cgroup_drm.h
>> @@ -6,4 +6,85 @@
>>   #ifndef _CGROUP_DRM_H
>>   #define _CGROUP_DRM_H
>>   +#include <linux/types.h>
>> +
>> +#include <drm/drm_managed.h>
>> +
>> +struct drm_device;
>> +struct drm_file;
>> +
>> +struct drmcgroup_state;
>> +
>> +/*
>> + * Use 8 as max, because of N^2 lookup when setting things, can be 
>> bumped if needed
>> + * Identical to TTM_NUM_MEM_TYPES to allow simplifying that code.
>> + */
>> +#define DRMCG_MAX_REGIONS 8
>> +
>> +struct drmcgroup_device {
>> +    struct list_head list;
>> +    struct list_head pools;
>> +
>> +    struct {
>> +        u64 size;
>> +        const char *name;
>> +    } regions[DRMCG_MAX_REGIONS];
>> +
>> +    /* Name describing the card, set by drmcg_register_device */
>> +    const char *name;
>> +
>> +};
>> +
>> +#if IS_ENABLED(CONFIG_CGROUP_DRM)
>> +int drmcg_register_device(struct drm_device *dev,
>> +               struct drmcgroup_device *drm_cg);
>> +void drmcg_unregister_device(struct drmcgroup_device *cgdev);
>> +int drmcg_try_charge(struct drmcgroup_state **drmcg,
>> +             struct drmcgroup_device *cgdev,
>> +             u32 index, u64 size);
>> +void drmcg_uncharge(struct drmcgroup_state *drmcg,
>> +            struct drmcgroup_device *cgdev,
>> +            u32 index, u64 size);
>> +#else
>> +static inline int
>> +drmcg_register_device(struct drm_device *dev,
>> +              struct drm_cgroup *drm_cg)
>> +{
>> +    return 0;
>> +}
>> +
>> +static inline void drmcg_unregister_device(struct drmcgroup_device 
>> *cgdev)
>> +{
>> +}
>> +
>> +static inline int drmcg_try_charge(struct drmcgroup_state **drmcg,
>> +                   struct drmcgroup_device *cgdev,
>> +                   u32 index, u64 size)
>> +{
>> +    *drmcg = NULL;
>> +    return 0;
>> +}
>> +
>> +static inline void drmcg_uncharge(struct drmcgroup_state *drmcg,
>> +                  struct drmcgroup_device *cgdev,
>> +                  u32 index, u64 size)
>> +{ }
>> +#endif
>> +
>> +static inline void drmmcg_unregister_device(struct drm_device *dev, 
>> void *arg)
>> +{
>> +    drmcg_unregister_device(arg);
>> +}
>> +
>> +/*
>> + * This needs to be done as inline, because cgroup lives in the core
>> + * kernel and it cannot call drm calls directly
>> + */
>> +static inline int drmmcg_register_device(struct drm_device *dev,
>> +                     struct drmcgroup_device *cgdev)
>> +{
>> +    return drmcg_register_device(dev, cgdev) ?:
>> +        drmm_add_action_or_reset(dev, drmmcg_unregister_device, cgdev);
>> +}
>> +
>>   #endif    /* _CGROUP_DRM_H */
>> diff --git a/kernel/cgroup/drm.c b/kernel/cgroup/drm.c
>> index 02c8eaa633d3..a93d9344fd36 100644
>> --- a/kernel/cgroup/drm.c
>> +++ b/kernel/cgroup/drm.c
>> @@ -1,60 +1,557 @@
>> -/* SPDX-License-Identifier: MIT */
>> +// SPDX-License-Identifier: GPL-2.0
>>   /*
>> - * Copyright © 2023 Intel Corporation
>> + * Copyright 2023 Intel
>> + * Partially based on the rdma and misc controllers, which bear the 
>> following copyrights:
>> + *
>> + * Copyright 2020 Google LLC
>> + * Copyright (C) 2016 Parav Pandit <pandit.parav@gmail.com>
>>    */
>>     #include <linux/cgroup.h>
>>   #include <linux/cgroup_drm.h>
>> +#include <linux/list.h>
>> +#include <linux/mutex.h>
>> +#include <linux/parser.h>
>>   #include <linux/slab.h>
>>   -struct drm_cgroup_state {
>
> As a side note, it'd be easier to read the diff if you left the name 
> as is, and some other details too, like the static root group (I need 
> to remind myself if/why I needed it, but does it harm you?) and my 
> missed static keywords and needless static struct initialization. I 
> will fix that up in my patch localy. Aynway, that way it would maybe 
> be less churn from one patch to the other in the series.
>
>> +#include <drm/drm_device.h>
>> +#include <drm/drm_drv.h>
>> +#include <drm/drm_file.h>
>> +#include <drm/drm_managed.h>
>> +
>> +struct drmcgroup_state {
>>       struct cgroup_subsys_state css;
>> +
>> +    struct list_head pools;
>>   };
>>   -struct drm_root_cgroup_state {
>> -    struct drm_cgroup_state drmcs;
>> +struct drmcgroup_pool_state {
>> +    struct drmcgroup_device *device;
>> +    struct drmcgroup_resource {
>> +        s64 max, used;
>> +    } resources[DRMCG_MAX_REGIONS];
>> +
>> +    s64 usage_sum;
>> +
>> +    struct list_head    cg_node;
>
> cg always makes me think cgroup and not css so it is a bit confusing.
>
> Why are two lists needed?
>
>> +    struct list_head    dev_node;
>>   };
>>   -static struct drm_root_cgroup_state root_drmcs;
>> +static DEFINE_MUTEX(drmcg_mutex);
>> +static LIST_HEAD(drmcg_devices);
>>   -static inline struct drm_cgroup_state *
>> +static inline struct drmcgroup_state *
>>   css_to_drmcs(struct cgroup_subsys_state *css)
>>   {
>> -    return container_of(css, struct drm_cgroup_state, css);
>> +    return container_of(css, struct drmcgroup_state, css);
>> +}
>> +
>> +static inline struct drmcgroup_state *get_current_drmcg(void)
>> +{
>> +    return css_to_drmcs(task_get_css(current, drm_cgrp_id));
>> +}
>> +
>> +static struct drmcgroup_state *parent_drmcg(struct drmcgroup_state *cg)
>> +{
>> +    return css_to_drmcs(cg->css.parent);
>> +}
>> +
>> +static void free_cg_pool_locked(struct drmcgroup_pool_state *pool)
>> +{
>> +    lockdep_assert_held(&drmcg_mutex);
>> +
>> +    list_del(&pool->cg_node);
>> +    list_del(&pool->dev_node);
>> +    kfree(pool);
>> +}
>> +
>> +static void
>> +set_resource_max(struct drmcgroup_pool_state *pool, int i, u64 new_max)
>> +{
>> +    pool->resources[i].max = new_max;
>> +}
>> +
>> +static void set_all_resource_max_limit(struct drmcgroup_pool_state 
>> *rpool)
>> +{
>> +    int i;
>> +
>> +    for (i = 0; i < DRMCG_MAX_REGIONS; i++)
>> +        set_resource_max(rpool, i, S64_MAX);
>> +}
>> +
>> +static void drmcs_offline(struct cgroup_subsys_state *css)
>> +{
>> +    struct drmcgroup_state *drmcs = css_to_drmcs(css);
>> +    struct drmcgroup_pool_state *pool, *next;
>> +
>> +    mutex_lock(&drmcg_mutex);
>> +    list_for_each_entry_safe(pool, next, &drmcs->pools, cg_node) {
>> +        if (!pool->usage_sum) {
>> +            free_cg_pool_locked(pool);
>> +        } else {
>> +            /* Reset all regions, last uncharge will remove pool */
>> +            set_all_resource_max_limit(pool);
>> +        }
>> +    }
>> +    mutex_unlock(&drmcg_mutex);
>>   }
>>     static void drmcs_free(struct cgroup_subsys_state *css)
>>   {
>> -    struct drm_cgroup_state *drmcs = css_to_drmcs(css);
>> +    struct drmcgroup_state *drmcs = css_to_drmcs(css);
>>   -    if (drmcs != &root_drmcs.drmcs)
>> -        kfree(drmcs);
>> +    kfree(drmcs);
>>   }
>>     static struct cgroup_subsys_state *
>>   drmcs_alloc(struct cgroup_subsys_state *parent_css)
>>   {
>> -    struct drm_cgroup_state *drmcs;
>> +    struct drmcgroup_state *drmcs = kzalloc(sizeof(*drmcs), 
>> GFP_KERNEL);
>> +    if (!drmcs)
>> +        return ERR_PTR(-ENOMEM);
>> +
>> +    INIT_LIST_HEAD(&drmcs->pools);
>> +    return &drmcs->css;
>> +}
>> +
>> +static struct drmcgroup_pool_state *
>> +find_cg_pool_locked(struct drmcgroup_state *drmcs, struct 
>> drmcgroup_device *dev)
>> +{
>> +    struct drmcgroup_pool_state *pool;
>> +
>> +    list_for_each_entry(pool, &drmcs->pools, cg_node)
>> +        if (pool->device == dev)
>> +            return pool;
>> +
>> +    return NULL;
>> +}
>> +
>> +static struct drmcgroup_pool_state *
>> +get_cg_pool_locked(struct drmcgroup_state *drmcs, struct 
>> drmcgroup_device *dev)
>> +{
>> +    struct drmcgroup_pool_state *pool;
>> +
>> +    pool = find_cg_pool_locked(drmcs, dev);
>> +    if (pool)
>> +        return pool;
>> +
>> +    pool = kzalloc(sizeof(*pool), GFP_KERNEL);
>> +    if (!pool)
>> +        return ERR_PTR(-ENOMEM);
>> +
>> +    pool->device = dev;
>> +    set_all_resource_max_limit(pool);
>>   -    if (!parent_css) {
>> -        drmcs = &root_drmcs.drmcs;
>> -    } else {
>> -        drmcs = kzalloc(sizeof(*drmcs), GFP_KERNEL);
>> -        if (!drmcs)
>> -            return ERR_PTR(-ENOMEM);
>> +    INIT_LIST_HEAD(&pool->cg_node);
>> +    INIT_LIST_HEAD(&pool->dev_node);
>> +    list_add_tail(&pool->cg_node, &drmcs->pools);
>> +    list_add_tail(&pool->dev_node, &dev->pools);
>> +    return pool;
>> +}
>> +
>> +void drmcg_unregister_device(struct drmcgroup_device *cgdev)
>> +{
>> +    struct drmcgroup_pool_state *pool, *next;
>> +
>> +    mutex_lock(&drmcg_mutex);
>> +    list_del(&cgdev->list);
>> +
>> +    list_for_each_entry_safe(pool, next, &cgdev->pools, dev_node)
>> +        free_cg_pool_locked(pool);
>> +    mutex_unlock(&drmcg_mutex);
>> +    kfree(cgdev->name);
>> +}
>> +
>> +EXPORT_SYMBOL_GPL(drmcg_unregister_device);
>> +
>> +int drmcg_register_device(struct drm_device *dev,
>> +              struct drmcgroup_device *cgdev)
>> +{
>> +    char *name = kstrdup(dev->unique, GFP_KERNEL);
>> +    if (!name)
>> +        return -ENOMEM;
>> +
>> +    INIT_LIST_HEAD(&cgdev->pools);
>> +    mutex_lock(&drmcg_mutex);
>> +    cgdev->name = name;
>> +    list_add_tail(&cgdev->list, &drmcg_devices);
>> +    mutex_unlock(&drmcg_mutex);
>> +
>> +    return 0;
>> +}
>> +EXPORT_SYMBOL_GPL(drmcg_register_device);
>> +
>> +static int drmcg_max_show(struct seq_file *sf, void *v)
>> +{
>> +    struct drmcgroup_state *drmcs = css_to_drmcs(seq_css(sf));
>> +    struct drmcgroup_pool_state *pool;
>> +
>> +    mutex_lock(&drmcg_mutex);
>> +    list_for_each_entry(pool, &drmcs->pools, cg_node) {
>> +        struct drmcgroup_device *dev = pool->device;
>> +        int i;
>> +
>> +        seq_puts(sf, dev->name);
>> +
>> +        for (i = 0; i < DRMCG_MAX_REGIONS; i++) {
>> +            if (!dev->regions[i].name)
>> +                continue;
>> +
>> +            if (pool->resources[i].max < S64_MAX)
>> +                seq_printf(sf, " region.%s=%lld", dev->regions[i].name,
>> +                       pool->resources[i].max);
>> +            else
>> +                seq_printf(sf, " region.%s=max", dev->regions[i].name);
>> +        }
>> +
>> +        seq_putc(sf, '\n');
>>       }
>> +    mutex_unlock(&drmcg_mutex);
>>   -    return &drmcs->css;
>> +    return 0;
>> +}
>> +
>> +static struct drmcgroup_device *drmcg_get_device_locked(const char 
>> *name)
>> +{
>> +    struct drmcgroup_device *dev;
>> +
>> +    lockdep_assert_held(&drmcg_mutex);
>> +
>> +    list_for_each_entry(dev, &drmcg_devices, list)
>> +        if (!strcmp(name, dev->name))
>> +            return dev;
>> +
>> +    return NULL;
>> +}
>> +
>> +static void try_to_free_cg_pool_locked(struct drmcgroup_pool_state 
>> *pool)
>> +{
>> +    struct drmcgroup_device *dev = pool->device;
>> +    u32 i;
>> +
>> +    /* Memory charged to this pool */
>> +    if (pool->usage_sum)
>> +        return;
>> +
>> +    for (i = 0; i < DRMCG_MAX_REGIONS; i++) {
>> +        if (!dev->regions[i].name)
>> +            continue;
>> +
>> +        /* Is a specific limit set? */
>> +        if (pool->resources[i].max < S64_MAX)
>> +            return;
>> +    }
>> +
>> +    /*
>> +     * No user of the pool and all entries are set to defaults;
>> +     * safe to delete this pool.
>> +     */
>> +    free_cg_pool_locked(pool);
>> +}
>> +
>> +
>> +static void
>> +uncharge_cg_locked(struct drmcgroup_state *drmcs,
>> +           struct drmcgroup_device *cgdev,
>> +           u32 index, u64 size)
>> +{
>> +    struct drmcgroup_pool_state *pool;
>> +
>> +    pool = find_cg_pool_locked(drmcs, cgdev);
>> +
>> +    if (unlikely(!pool)) {
>> +        pr_warn("Invalid device %p or drm cgroup %p\n", cgdev, drmcs);
>> +        return;
>> +    }
>> +
>> +    pool->resources[index].used -= size;
>> +
>> +    /*
>> +     * A negative count (or overflow) is invalid,
>> +     * it indicates a bug in the rdma controller.
>> +     */
>> +    WARN_ON_ONCE(pool->resources[index].used < 0);
>> +    pool->usage_sum--;
>> +    try_to_free_cg_pool_locked(pool);
>> +}
>> +
>> +static void drmcg_uncharge_hierarchy(struct drmcgroup_state *drmcs,
>> +                     struct drmcgroup_device *cgdev,
>> +                     struct drmcgroup_state *stop_cg,
>> +                     u32 index, u64 size)
>> +{
>> +    struct drmcgroup_state *p;
>> +
>> +    mutex_lock(&drmcg_mutex);
>> +
>> +    for (p = drmcs; p != stop_cg; p = parent_drmcg(p))
>> +        uncharge_cg_locked(p, cgdev, index, size);
>> +
>> +    mutex_unlock(&drmcg_mutex);
>> +
>> +    css_put(&drmcs->css);
>> +}
>> +
>> +void drmcg_uncharge(struct drmcgroup_state *drmcs,
>> +            struct drmcgroup_device *cgdev,
>> +            u32 index,
>> +            u64 size)
>> +{
>> +    if (index >= DRMCG_MAX_REGIONS)
>> +        return;
>> +
>> +    drmcg_uncharge_hierarchy(drmcs, cgdev, NULL, index, size);
>> +}
>> +EXPORT_SYMBOL_GPL(drmcg_uncharge);
>> +
>> +int drmcg_try_charge(struct drmcgroup_state **drmcs,
>> +             struct drmcgroup_device *cgdev,
>> +             u32 index,
>> +             u64 size)
>> +{
>> +    struct drmcgroup_state *cg, *p;
>> +    struct drmcgroup_pool_state *pool;
>> +    u64 new;
>> +    int ret = 0;
>> +
>> +    if (index >= DRMCG_MAX_REGIONS)
>> +        return -EINVAL;
>> +
>> +    /*
>> +     * hold on to css, as cgroup can be removed but resource
>> +     * accounting happens on css.
>> +     */
>> +    cg = get_current_drmcg();
>
> 1)
>
> I am not familiar with the Xe flows - charging is at the point of 
> actual backing store allocation?
>
> What about buffer sharing?
>
> Also, given how the css is permanently stored in the caller - you 
> deliberately decided not to deal with task migrations? I am not sure 
> that will work. Or maybe just omitted for RFC v1?
>
> 2)
>
> Buffer objects which Xe can migrate between memory regions will be 
> correctly charge/uncharged as they are moved?
>
> Regards,
>
> Tvrtko
>
>> +
>> +    mutex_lock(&drmcg_mutex);
>> +    for (p = cg; p; p = parent_drmcg(p)) {
>> +        pool = get_cg_pool_locked(p, cgdev);
>> +        if (IS_ERR(pool)) {
>> +            ret = PTR_ERR(pool);
>> +            goto err;
>> +        } else {
>> +            new = pool->resources[index].used + size;
>> +            if (new > pool->resources[index].max || new > S64_MAX) {
>> +                ret = -EAGAIN;
>> +                goto err;
>> +            } else {
>> +                pool->resources[index].used = new;
>> +                pool->usage_sum++;
>> +            }
>> +        }
>> +    }
>> +    mutex_unlock(&drmcg_mutex);
>> +
>> +    *drmcs = cg;
>> +    return 0;
>> +
>> +err:
>> +    mutex_unlock(&drmcg_mutex);
>> +    drmcg_uncharge_hierarchy(cg, cgdev, p, index, size);
>> +    return ret;
>> +}
>> +EXPORT_SYMBOL_GPL(drmcg_try_charge);
>> +
>> +static s64 parse_resource(char *c, char **retname)
>> +{
>> +    substring_t argstr;
>> +    char *name, *value = c;
>> +    size_t len;
>> +    int ret;
>> +    u64 retval;
>> +
>> +    name = strsep(&value, "=");
>> +    if (!name || !value)
>> +        return -EINVAL;
>> +
>> +    /* Only support region setting for now */
>> +    if (strncmp(name, "region.", 7))
>> +        return -EINVAL;
>> +    else
>> +        name += 7;
>> +
>> +    *retname = name;
>> +    len = strlen(value);
>> +
>> +    argstr.from = value;
>> +    argstr.to = value + len;
>> +
>> +    ret = match_u64(&argstr, &retval);
>> +    if (ret >= 0) {
>> +        if (retval > S64_MAX)
>> +            return -EINVAL;
>> +        return retval;
>> +    }
>> +    if (!strncmp(value, "max", len))
>> +        return S64_MAX;
>> +
>> +    /* Not u64 or max, error */
>> +    return -EINVAL;
>> +}
>> +
>> +static int drmcg_parse_limits(char *options,
>> +                  u64 *limits, char **enables)
>> +{
>> +    char *c;
>> +    int num_limits = 0;
>> +
>> +    /* parse resource options */
>> +    while ((c = strsep(&options, " ")) != NULL) {
>> +        s64 limit;
>> +
>> +        if (num_limits >= DRMCG_MAX_REGIONS)
>> +            return -EINVAL;
>> +
>> +        limit = parse_resource(c, &enables[num_limits]);
>> +        if (limit < 0)
>> +            return limit;
>> +
>> +        limits[num_limits++] = limit;
>> +    }
>> +    return num_limits;
>> +}
>> +
>> +static ssize_t drmcg_max_write(struct kernfs_open_file *of,
>> +                   char *buf, size_t nbytes, loff_t off)
>> +{
>> +    struct drmcgroup_state *drmcs = css_to_drmcs(of_css(of));
>> +    struct drmcgroup_device *dev;
>> +    struct drmcgroup_pool_state *pool;
>> +    char *options = strstrip(buf);
>> +    char *dev_name = strsep(&options, " ");
>> +    u64 limits[DRMCG_MAX_REGIONS];
>> +    u64 new_limits[DRMCG_MAX_REGIONS];
>> +    char *regions[DRMCG_MAX_REGIONS];
>> +    int num_limits, i;
>> +    unsigned long set_mask = 0;
>> +    int err = 0;
>> +
>> +    if (!dev_name)
>> +        return -EINVAL;
>> +
>> +    num_limits = drmcg_parse_limits(options, limits, regions);
>> +    if (num_limits < 0)
>> +        return num_limits;
>> +    if (!num_limits)
>> +        return -EINVAL;
>> +
>> +    /*
>> +     * Everything is parsed into key=value pairs now, take lock and 
>> attempt to update
>> +     * For good measure, set -EINVAL when a key is set twice.
>> +     */
>> +    mutex_lock(&drmcg_mutex);
>> +
>> +    dev = drmcg_get_device_locked(dev_name);
>> +    if (!dev) {
>> +        err = -ENODEV;
>> +        goto err;
>> +    }
>> +
>> +    pool = get_cg_pool_locked(drmcs, dev);
>> +    if (IS_ERR(pool)) {
>> +        err = PTR_ERR(pool);
>> +        goto err;
>> +    }
>> +
>> +    /* Lookup region names and set new_limits to the index */
>> +    for (i = 0; i < num_limits; i++) {
>> +        int j;
>> +
>> +        for (j = 0; j < DRMCG_MAX_REGIONS; j++)
>> +            if (dev->regions[j].name &&
>> +                !strcmp(regions[i], dev->regions[j].name))
>> +                break;
>> +
>> +        if (j == DRMCG_MAX_REGIONS ||
>> +            set_mask & BIT(j)) {
>> +            err = -EINVAL;
>> +            goto err_put;
>> +        }
>> +
>> +        set_mask |= BIT(j);
>> +        new_limits[j] = limits[i];
>> +    }
>> +
>> +    /* And commit */
>> +    for_each_set_bit(i, &set_mask, DRMCG_MAX_REGIONS)
>> +        set_resource_max(pool, i, new_limits[i]);
>> +
>> +err_put:
>> +    try_to_free_cg_pool_locked(pool);
>> +err:
>> +    mutex_unlock(&drmcg_mutex);
>> +
>> +    return err ?: nbytes;
>> +}
>> +
>> +static int drmcg_current_show(struct seq_file *sf, void *v)
>> +{
>> +    struct drmcgroup_state *drmcs = css_to_drmcs(seq_css(sf));
>> +    struct drmcgroup_device *dev;
>> +
>> +    mutex_lock(&drmcg_mutex);
>> +    list_for_each_entry(dev, &drmcg_devices, list) {
>> +        struct drmcgroup_pool_state *pool = 
>> find_cg_pool_locked(drmcs, dev);
>> +        int i;
>> +
>> +        seq_puts(sf, dev->name);
>> +
>> +        for (i = 0; i < DRMCG_MAX_REGIONS; i++) {
>> +            if (!dev->regions[i].name)
>> +                continue;
>> +
>> +            seq_printf(sf, " region.%s=%lld", dev->regions[i].name,
>> +                   pool ? pool->resources[i].used : 0ULL);
>> +        }
>> +
>> +        seq_putc(sf, '\n');
>> +    }
>> +    mutex_unlock(&drmcg_mutex);
>> +
>> +    return 0;
>> +}
>> +
>> +static int drmcg_capacity_show(struct seq_file *sf, void *v)
>> +{
>> +    struct drmcgroup_device *dev;
>> +    int i;
>> +
>> +    list_for_each_entry(dev, &drmcg_devices, list) {
>> +        seq_puts(sf, dev->name);
>> +        for (i = 0; i < DRMCG_MAX_REGIONS; i++)
>> +            if (dev->regions[i].name)
>> +                seq_printf(sf, " region.%s=%lld",
>> +                       dev->regions[i].name,
>> +                       dev->regions[i].size);
>> +        seq_putc(sf, '\n');
>> +    }
>> +    return 0;
>>   }
>>   -struct cftype files[] = {
>> +static struct cftype files[] = {
>> +    {
>> +        .name = "max",
>> +        .write = drmcg_max_write,
>> +        .seq_show = drmcg_max_show,
>> +        .flags = CFTYPE_NOT_ON_ROOT,
>> +    },
>> +    {
>> +        .name = "current",
>> +        .seq_show = drmcg_current_show,
>> +        .flags = CFTYPE_NOT_ON_ROOT,
>> +    },
>> +    {
>> +        .name = "capacity",
>> +        .seq_show = drmcg_capacity_show,
>> +        .flags = CFTYPE_ONLY_ON_ROOT,
>> +    },
>>       { } /* Zero entry terminates. */
>>   };
>>     struct cgroup_subsys drm_cgrp_subsys = {
>>       .css_alloc    = drmcs_alloc,
>>       .css_free    = drmcs_free,
>> -    .early_init    = false,
>> +    .css_offline    = drmcs_offline,
>>       .legacy_cftypes    = files,
>>       .dfl_cftypes    = files,
>>   };

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [Intel-gfx] [RFC PATCH 2/4] drm/cgroup: Add memory accounting to DRM cgroup
  2023-05-03 15:31   ` [Intel-gfx] " Tvrtko Ursulin
  2023-05-03 15:33     ` Maarten Lankhorst
@ 2023-05-05 14:21     ` Maarten Lankhorst
  1 sibling, 0 replies; 16+ messages in thread
From: Maarten Lankhorst @ 2023-05-05 14:21 UTC (permalink / raw)
  To: Tvrtko Ursulin, dri-devel, cgroups, intel-xe
  Cc: Zefan Li, intel-gfx, linux-kernel, amd-gfx, Thomas Zimmermann,
	Johannes Weiner, Tejun Heo

I just now noticed the other comments. Will address them.

On 2023-05-03 17:31, Tvrtko Ursulin wrote:
>
> On 03/05/2023 09:34, Maarten Lankhorst wrote:
>> Based roughly on the rdma and misc cgroup controllers, with a lot of
>> the accounting code borrowed from rdma.
>>
>> The interface is simple:
>> - populate drmcgroup_device->regions[..] name and size for each active
>>    region.
>> - Call drm(m)cg_register_device()
>> - Use drmcg_try_charge to check if you can allocate a chunk of memory,
>>    use drmcg_uncharge when freeing it. This may return an error code,
>>    or -EAGAIN when the cgroup limit is reached.
>>
>> The ttm code transforms -EAGAIN back to -ENOSPC since it has specific
>> logic for -ENOSPC, and returning -EAGAIN to userspace causes drmIoctl
>> to restart infinitely.
>>
>> This API allows you to limit stuff with cgroups.
>> You can see the supported cards in /sys/fs/cgroup/drm.capacity
>> You need to echo +drm to cgroup.subtree_control, and then you can
>> partition memory.
>>
>> In each cgroup subdir:
>> drm.max shows the current limits of the cgroup.
>> drm.current the current amount of allocated memory used by this cgroup.
>> drm.events shows the amount of time max memory was reached.
>
> Events is not in the patch?
>
>> Signed-off-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
>> ---
>>   Documentation/admin-guide/cgroup-v2.rst |  46 ++
>>   Documentation/gpu/drm-compute.rst       |  54 +++
>>   include/linux/cgroup_drm.h              |  81 ++++
>>   kernel/cgroup/drm.c                     | 539 +++++++++++++++++++++++-
>>   4 files changed, 699 insertions(+), 21 deletions(-)
>>   create mode 100644 Documentation/gpu/drm-compute.rst
>>
>> diff --git a/Documentation/admin-guide/cgroup-v2.rst 
>> b/Documentation/admin-guide/cgroup-v2.rst
>> index f67c0829350b..b858d99cb2ef 100644
>> --- a/Documentation/admin-guide/cgroup-v2.rst
>> +++ b/Documentation/admin-guide/cgroup-v2.rst
>> @@ -2374,6 +2374,52 @@ RDMA Interface Files
>>         mlx4_0 hca_handle=1 hca_object=20
>>         ocrdma1 hca_handle=1 hca_object=23
>>   +DRM
>> +----
>> +
>> +The "drm" controller regulates the distribution and accounting of
>> +DRM resources.
>> +
>> +DRM Interface Files
>> +~~~~~~~~~~~~~~~~~~~~
>> +
>> +  drm.max
>> +    A readwrite nested-keyed file that exists for all the cgroups
>> +    except root that describes current configured resource limit
>> +    for a DRM device.
>> +
>> +    Lines are keyed by device name and are not ordered.
>> +    Each line contains space separated resource name and its configured
>> +    limit that can be distributed.
>> +
>> +    The following nested keys are defined.
>> +
>> +      ========== 
>> =======================================================
>> +      region.*     Maximum amount of bytes that allocatable in this 
>> region
>> +      ========== 
>> =======================================================
>> +
>> +    An example for xe follows::
>> +
>> +      0000:03:00.0 region.vram0=1073741824 region.stolen=max
>> +
>> +  drm.capacity
>> +    A read-only file that describes maximum region capacity.
>> +    It only exists on the root cgroup. Not all memory can be
>> +    allocated by cgroups, as the kernel reserves some for
>> +    internal use.
>> +
>> +    An example for xe follows::
>> +
>> +      0000:03:00.0 region.vram0=8514437120 region.stolen=67108864
>> +
>> +  drm.current
>> +    A read-only file that describes current resource usage.
>> +    It exists for all the cgroup except root.
>> +
>> +    An example for xe follows::
>> +
>> +      0000:03:00.0 region.vram0=12550144 region.stolen=8650752
>> +
>>   HugeTLB
>>   -------
>>   diff --git a/Documentation/gpu/drm-compute.rst 
>> b/Documentation/gpu/drm-compute.rst
>> new file mode 100644
>> index 000000000000..116270976ef7
>> --- /dev/null
>> +++ b/Documentation/gpu/drm-compute.rst
>> @@ -0,0 +1,54 @@
>> +==================================
>> +Long running workloads and compute
>> +==================================
>> +
>> +Long running workloads (compute) are workloads that will not 
>> complete in 10
>> +seconds. (The time let the user wait before he reaches for the power 
>> button).
>> +This means that other techniques need to be used to manage those 
>> workloads,
>> +that cannot use fences.
>> +
>> +Some hardware may schedule compute jobs, and have no way to pre-empt 
>> them, or
>> +have their memory swapped out from them. Or they simply want their 
>> workload
>> +not to be preempted or swapped out at all.
>> +
>> +This means that it differs from what is described in 
>> driver-api/dma-buf.rst.
>> +
>> +As with normal compute jobs, dma-fence may not be used at all. In 
>> this case,
>> +not even to force preemption. The driver with is simply forced to 
>> unmap a BO
>> +from the long compute job's address space on unbind immediately, not 
>> even
>> +waiting for the workload to complete. Effectively this terminates 
>> the workload
>> +when there is no hardware support to recover.
>> +
>> +Since this is undesirable, there need to be mitigations to prevent a 
>> workload
>> +from being terminated. There are several possible approach, all with 
>> their
>> +advantages and drawbacks.
>> +
>> +The first approach you will likely try is to pin all buffers used by 
>> compute.
>> +This guarantees that the job will run uninterrupted, but also allows 
>> a very
>> +denial of service attack by pinning as much memory as possible, 
>> hogging the
>> +all GPU memory, and possibly a huge chunk of CPU memory.
>> +
>> +A second approach that will work slightly better on its own is 
>> adding an option
>> +not to evict when creating a new job (any kind). If all of userspace 
>> opts in
>> +to this flag, it would prevent cooperating userspace from forced 
>> terminating
>> +older compute jobs to start a new one.
>> +
>> +If job preemption and recoverable pagefaults are not available, 
>> those are the
>> +only approaches possible. So even with those, you want a separate 
>> way of
>> +controlling resources. The standard kernel way of doing so is cgroups.
>> +
>> +This creates a third option, using cgroups to prevent eviction. Both 
>> GPU and
>> +driver-allocated CPU memory would be accounted to the correct 
>> cgroup, and
>> +eviction would be made cgroup aware. This allows the GPU to be 
>> partitioned
>> +into cgroups, that will allow jobs to run next to each other without
>> +interference.
>
> The 3rd approach is only valid if used strictly with device local 
> memory, right? Because as soon as system memory backed buffers are 
> used this approach cannot guarantee no eviction can be triggered.
>
>> +
>> +The interface to the cgroup would be similar to the current CPU memory
>> +interface, with similar semantics for min/low/high/max, if eviction can
>> +be made cgroup aware. For now only max is implemented.
>> +
>> +What should be noted is that each memory region (tiled memory for 
>> example)
>> +should have its own accounting, using $card key0 = value0 key1 = 
>> value1.
>> +
>> +The key is set to the regionid set by the driver, for example "tile0".
>> +For the value of $card, we use drmGetUnique().
>> diff --git a/include/linux/cgroup_drm.h b/include/linux/cgroup_drm.h
>> index 8ef66a47619f..4f17b1c85f47 100644
>> --- a/include/linux/cgroup_drm.h
>> +++ b/include/linux/cgroup_drm.h
>> @@ -6,4 +6,85 @@
>>   #ifndef _CGROUP_DRM_H
>>   #define _CGROUP_DRM_H
>>   +#include <linux/types.h>
>> +
>> +#include <drm/drm_managed.h>
>> +
>> +struct drm_device;
>> +struct drm_file;
>> +
>> +struct drmcgroup_state;
>> +
>> +/*
>> + * Use 8 as max, because of N^2 lookup when setting things, can be 
>> bumped if needed
>> + * Identical to TTM_NUM_MEM_TYPES to allow simplifying that code.
>> + */
>> +#define DRMCG_MAX_REGIONS 8
>> +
>> +struct drmcgroup_device {
>> +    struct list_head list;
>> +    struct list_head pools;
>> +
>> +    struct {
>> +        u64 size;
>> +        const char *name;
>> +    } regions[DRMCG_MAX_REGIONS];
>> +
>> +    /* Name describing the card, set by drmcg_register_device */
>> +    const char *name;
>> +
>> +};
>> +
>> +#if IS_ENABLED(CONFIG_CGROUP_DRM)
>> +int drmcg_register_device(struct drm_device *dev,
>> +               struct drmcgroup_device *drm_cg);
>> +void drmcg_unregister_device(struct drmcgroup_device *cgdev);
>> +int drmcg_try_charge(struct drmcgroup_state **drmcg,
>> +             struct drmcgroup_device *cgdev,
>> +             u32 index, u64 size);
>> +void drmcg_uncharge(struct drmcgroup_state *drmcg,
>> +            struct drmcgroup_device *cgdev,
>> +            u32 index, u64 size);
>> +#else
>> +static inline int
>> +drmcg_register_device(struct drm_device *dev,
>> +              struct drm_cgroup *drm_cg)
>> +{
>> +    return 0;
>> +}
>> +
>> +static inline void drmcg_unregister_device(struct drmcgroup_device 
>> *cgdev)
>> +{
>> +}
>> +
>> +static inline int drmcg_try_charge(struct drmcgroup_state **drmcg,
>> +                   struct drmcgroup_device *cgdev,
>> +                   u32 index, u64 size)
>> +{
>> +    *drmcg = NULL;
>> +    return 0;
>> +}
>> +
>> +static inline void drmcg_uncharge(struct drmcgroup_state *drmcg,
>> +                  struct drmcgroup_device *cgdev,
>> +                  u32 index, u64 size)
>> +{ }
>> +#endif
>> +
>> +static inline void drmmcg_unregister_device(struct drm_device *dev, 
>> void *arg)
>> +{
>> +    drmcg_unregister_device(arg);
>> +}
>> +
>> +/*
>> + * This needs to be done as inline, because cgroup lives in the core
>> + * kernel and it cannot call drm calls directly
>> + */
>> +static inline int drmmcg_register_device(struct drm_device *dev,
>> +                     struct drmcgroup_device *cgdev)
>> +{
>> +    return drmcg_register_device(dev, cgdev) ?:
>> +        drmm_add_action_or_reset(dev, drmmcg_unregister_device, cgdev);
>> +}
>> +
>>   #endif    /* _CGROUP_DRM_H */
>> diff --git a/kernel/cgroup/drm.c b/kernel/cgroup/drm.c
>> index 02c8eaa633d3..a93d9344fd36 100644
>> --- a/kernel/cgroup/drm.c
>> +++ b/kernel/cgroup/drm.c
>> @@ -1,60 +1,557 @@
>> -/* SPDX-License-Identifier: MIT */
>> +// SPDX-License-Identifier: GPL-2.0
>>   /*
>> - * Copyright © 2023 Intel Corporation
>> + * Copyright 2023 Intel
>> + * Partially based on the rdma and misc controllers, which bear the 
>> following copyrights:
>> + *
>> + * Copyright 2020 Google LLC
>> + * Copyright (C) 2016 Parav Pandit <pandit.parav@gmail.com>
>>    */
>>     #include <linux/cgroup.h>
>>   #include <linux/cgroup_drm.h>
>> +#include <linux/list.h>
>> +#include <linux/mutex.h>
>> +#include <linux/parser.h>
>>   #include <linux/slab.h>
>>   -struct drm_cgroup_state {
>
> As a side note, it'd be easier to read the diff if you left the name 
> as is, and some other details too, like the static root group (I need 
> to remind myself if/why I needed it, but does it harm you?) and my 
> missed static keywords and needless static struct initialization. I 
> will fix that up in my patch localy. Aynway, that way it would maybe 
> be less churn from one patch to the other in the series.
>
>> +#include <drm/drm_device.h>
>> +#include <drm/drm_drv.h>
>> +#include <drm/drm_file.h>
>> +#include <drm/drm_managed.h>
>> +
>> +struct drmcgroup_state {
>>       struct cgroup_subsys_state css;
>> +
>> +    struct list_head pools;
>>   };
>>   -struct drm_root_cgroup_state {
>> -    struct drm_cgroup_state drmcs;
>> +struct drmcgroup_pool_state {
>> +    struct drmcgroup_device *device;
>> +    struct drmcgroup_resource {
>> +        s64 max, used;
>> +    } resources[DRMCG_MAX_REGIONS];
>> +
>> +    s64 usage_sum;
>> +
>> +    struct list_head    cg_node;
>
> cg always makes me think cgroup and not css so it is a bit confusing.
>
> Why are two lists needed?

The second list is used during teardown of a specific device, but that path
could probably traverse the entirety of the cgroup hierarchy instead.

That would probably need some more core locks though, and since other cgroup
controllers use the dual-list approach, I've used it here as well.
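
To spell it out: each pool is reachable from two directions, so both
teardown paths get their own list to walk (minimal sketch reusing the
helpers from the patch; the example_* names are only for illustration):

/* cgroup going away: walk the cgroup's pools via cg_node */
static void example_offline(struct drmcgroup_state *drmcs)
{
	struct drmcgroup_pool_state *pool, *next;

	mutex_lock(&drmcg_mutex);
	list_for_each_entry_safe(pool, next, &drmcs->pools, cg_node)
		free_cg_pool_locked(pool); /* unlinks both cg_node and dev_node */
	mutex_unlock(&drmcg_mutex);
}

/* device unregister: walk the device's pools via dev_node */
static void example_unregister(struct drmcgroup_device *cgdev)
{
	struct drmcgroup_pool_state *pool, *next;

	mutex_lock(&drmcg_mutex);
	list_for_each_entry_safe(pool, next, &cgdev->pools, dev_node)
		free_cg_pool_locked(pool);
	mutex_unlock(&drmcg_mutex);
}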

>
>> +    struct list_head    dev_node;
>>   };
>>   -static struct drm_root_cgroup_state root_drmcs;
>> +static DEFINE_MUTEX(drmcg_mutex);
>> +static LIST_HEAD(drmcg_devices);
>>   -static inline struct drm_cgroup_state *
>> +static inline struct drmcgroup_state *
>>   css_to_drmcs(struct cgroup_subsys_state *css)
>>   {
>> -    return container_of(css, struct drm_cgroup_state, css);
>> +    return container_of(css, struct drmcgroup_state, css);
>> +}
>> +
>> +static inline struct drmcgroup_state *get_current_drmcg(void)
>> +{
>> +    return css_to_drmcs(task_get_css(current, drm_cgrp_id));
>> +}
>> +
>> +static struct drmcgroup_state *parent_drmcg(struct drmcgroup_state *cg)
>> +{
>> +    return css_to_drmcs(cg->css.parent);
>> +}
>> +
>> +static void free_cg_pool_locked(struct drmcgroup_pool_state *pool)
>> +{
>> +    lockdep_assert_held(&drmcg_mutex);
>> +
>> +    list_del(&pool->cg_node);
>> +    list_del(&pool->dev_node);
>> +    kfree(pool);
>> +}
>> +
>> +static void
>> +set_resource_max(struct drmcgroup_pool_state *pool, int i, u64 new_max)
>> +{
>> +    pool->resources[i].max = new_max;
>> +}
>> +
>> +static void set_all_resource_max_limit(struct drmcgroup_pool_state 
>> *rpool)
>> +{
>> +    int i;
>> +
>> +    for (i = 0; i < DRMCG_MAX_REGIONS; i++)
>> +        set_resource_max(rpool, i, S64_MAX);
>> +}
>> +
>> +static void drmcs_offline(struct cgroup_subsys_state *css)
>> +{
>> +    struct drmcgroup_state *drmcs = css_to_drmcs(css);
>> +    struct drmcgroup_pool_state *pool, *next;
>> +
>> +    mutex_lock(&drmcg_mutex);
>> +    list_for_each_entry_safe(pool, next, &drmcs->pools, cg_node) {
>> +        if (!pool->usage_sum) {
>> +            free_cg_pool_locked(pool);
>> +        } else {
>> +            /* Reset all regions, last uncharge will remove pool */
>> +            set_all_resource_max_limit(pool);
>> +        }
>> +    }
>> +    mutex_unlock(&drmcg_mutex);
>>   }
>>     static void drmcs_free(struct cgroup_subsys_state *css)
>>   {
>> -    struct drm_cgroup_state *drmcs = css_to_drmcs(css);
>> +    struct drmcgroup_state *drmcs = css_to_drmcs(css);
>>   -    if (drmcs != &root_drmcs.drmcs)
>> -        kfree(drmcs);
>> +    kfree(drmcs);
>>   }
>>     static struct cgroup_subsys_state *
>>   drmcs_alloc(struct cgroup_subsys_state *parent_css)
>>   {
>> -    struct drm_cgroup_state *drmcs;
>> +    struct drmcgroup_state *drmcs = kzalloc(sizeof(*drmcs), 
>> GFP_KERNEL);
>> +    if (!drmcs)
>> +        return ERR_PTR(-ENOMEM);
>> +
>> +    INIT_LIST_HEAD(&drmcs->pools);
>> +    return &drmcs->css;
>> +}
>> +
>> +static struct drmcgroup_pool_state *
>> +find_cg_pool_locked(struct drmcgroup_state *drmcs, struct 
>> drmcgroup_device *dev)
>> +{
>> +    struct drmcgroup_pool_state *pool;
>> +
>> +    list_for_each_entry(pool, &drmcs->pools, cg_node)
>> +        if (pool->device == dev)
>> +            return pool;
>> +
>> +    return NULL;
>> +}
>> +
>> +static struct drmcgroup_pool_state *
>> +get_cg_pool_locked(struct drmcgroup_state *drmcs, struct 
>> drmcgroup_device *dev)
>> +{
>> +    struct drmcgroup_pool_state *pool;
>> +
>> +    pool = find_cg_pool_locked(drmcs, dev);
>> +    if (pool)
>> +        return pool;
>> +
>> +    pool = kzalloc(sizeof(*pool), GFP_KERNEL);
>> +    if (!pool)
>> +        return ERR_PTR(-ENOMEM);
>> +
>> +    pool->device = dev;
>> +    set_all_resource_max_limit(pool);
>>   -    if (!parent_css) {
>> -        drmcs = &root_drmcs.drmcs;
>> -    } else {
>> -        drmcs = kzalloc(sizeof(*drmcs), GFP_KERNEL);
>> -        if (!drmcs)
>> -            return ERR_PTR(-ENOMEM);
>> +    INIT_LIST_HEAD(&pool->cg_node);
>> +    INIT_LIST_HEAD(&pool->dev_node);
>> +    list_add_tail(&pool->cg_node, &drmcs->pools);
>> +    list_add_tail(&pool->dev_node, &dev->pools);
>> +    return pool;
>> +}
>> +
>> +void drmcg_unregister_device(struct drmcgroup_device *cgdev)
>> +{
>> +    struct drmcgroup_pool_state *pool, *next;
>> +
>> +    mutex_lock(&drmcg_mutex);
>> +    list_del(&cgdev->list);
>> +
>> +    list_for_each_entry_safe(pool, next, &cgdev->pools, dev_node)
>> +        free_cg_pool_locked(pool);
>> +    mutex_unlock(&drmcg_mutex);
>> +    kfree(cgdev->name);
>> +}
>> +
>> +EXPORT_SYMBOL_GPL(drmcg_unregister_device);
>> +
>> +int drmcg_register_device(struct drm_device *dev,
>> +              struct drmcgroup_device *cgdev)
>> +{
>> +    char *name = kstrdup(dev->unique, GFP_KERNEL);
>> +    if (!name)
>> +        return -ENOMEM;
>> +
>> +    INIT_LIST_HEAD(&cgdev->pools);
>> +    mutex_lock(&drmcg_mutex);
>> +    cgdev->name = name;
>> +    list_add_tail(&cgdev->list, &drmcg_devices);
>> +    mutex_unlock(&drmcg_mutex);
>> +
>> +    return 0;
>> +}
>> +EXPORT_SYMBOL_GPL(drmcg_register_device);
>> +
>> +static int drmcg_max_show(struct seq_file *sf, void *v)
>> +{
>> +    struct drmcgroup_state *drmcs = css_to_drmcs(seq_css(sf));
>> +    struct drmcgroup_pool_state *pool;
>> +
>> +    mutex_lock(&drmcg_mutex);
>> +    list_for_each_entry(pool, &drmcs->pools, cg_node) {
>> +        struct drmcgroup_device *dev = pool->device;
>> +        int i;
>> +
>> +        seq_puts(sf, dev->name);
>> +
>> +        for (i = 0; i < DRMCG_MAX_REGIONS; i++) {
>> +            if (!dev->regions[i].name)
>> +                continue;
>> +
>> +            if (pool->resources[i].max < S64_MAX)
>> +                seq_printf(sf, " region.%s=%lld", dev->regions[i].name,
>> +                       pool->resources[i].max);
>> +            else
>> +                seq_printf(sf, " region.%s=max", dev->regions[i].name);
>> +        }
>> +
>> +        seq_putc(sf, '\n');
>>       }
>> +    mutex_unlock(&drmcg_mutex);
>>   -    return &drmcs->css;
>> +    return 0;
>> +}
>> +
>> +static struct drmcgroup_device *drmcg_get_device_locked(const char *name)
>> +{
>> +    struct drmcgroup_device *dev;
>> +
>> +    lockdep_assert_held(&drmcg_mutex);
>> +
>> +    list_for_each_entry(dev, &drmcg_devices, list)
>> +        if (!strcmp(name, dev->name))
>> +            return dev;
>> +
>> +    return NULL;
>> +}
>> +
>> +static void try_to_free_cg_pool_locked(struct drmcgroup_pool_state *pool)
>> +{
>> +    struct drmcgroup_device *dev = pool->device;
>> +    u32 i;
>> +
>> +    /* Memory charged to this pool */
>> +    if (pool->usage_sum)
>> +        return;
>> +
>> +    for (i = 0; i < DRMCG_MAX_REGIONS; i++) {
>> +        if (!dev->regions[i].name)
>> +            continue;
>> +
>> +        /* Is a specific limit set? */
>> +        if (pool->resources[i].max < S64_MAX)
>> +            return;
>> +    }
>> +
>> +    /*
>> +     * No user of the pool and all entries are set to defaults;
>> +     * safe to delete this pool.
>> +     */
>> +    free_cg_pool_locked(pool);
>> +}
>> +
>> +
>> +static void
>> +uncharge_cg_locked(struct drmcgroup_state *drmcs,
>> +           struct drmcgroup_device *cgdev,
>> +           u32 index, u64 size)
>> +{
>> +    struct drmcgroup_pool_state *pool;
>> +
>> +    pool = find_cg_pool_locked(drmcs, cgdev);
>> +
>> +    if (unlikely(!pool)) {
>> +        pr_warn("Invalid device %p or drm cgroup %p\n", cgdev, drmcs);
>> +        return;
>> +    }
>> +
>> +    pool->resources[index].used -= size;
>> +
>> +    /*
>> +     * A negative count (or overflow) is invalid,
>> +     * it indicates a bug in the drm cgroup controller.
>> +     */
>> +    WARN_ON_ONCE(pool->resources[index].used < 0);
>> +    pool->usage_sum--;
>> +    try_to_free_cg_pool_locked(pool);
>> +}
>> +
>> +static void drmcg_uncharge_hierarchy(struct drmcgroup_state *drmcs,
>> +                     struct drmcgroup_device *cgdev,
>> +                     struct drmcgroup_state *stop_cg,
>> +                     u32 index, u64 size)
>> +{
>> +    struct drmcgroup_state *p;
>> +
>> +    mutex_lock(&drmcg_mutex);
>> +
>> +    for (p = drmcs; p != stop_cg; p = parent_drmcg(p))
>> +        uncharge_cg_locked(p, cgdev, index, size);
>> +
>> +    mutex_unlock(&drmcg_mutex);
>> +
>> +    css_put(&drmcs->css);
>> +}
>> +
>> +void drmcg_uncharge(struct drmcgroup_state *drmcs,
>> +            struct drmcgroup_device *cgdev,
>> +            u32 index,
>> +            u64 size)
>> +{
>> +    if (index >= DRMCG_MAX_REGIONS)
>> +        return;
>> +
>> +    drmcg_uncharge_hierarchy(drmcs, cgdev, NULL, index, size);
>> +}
>> +EXPORT_SYMBOL_GPL(drmcg_uncharge);
>> +
>> +int drmcg_try_charge(struct drmcgroup_state **drmcs,
>> +             struct drmcgroup_device *cgdev,
>> +             u32 index,
>> +             u64 size)
>> +{
>> +    struct drmcgroup_state *cg, *p;
>> +    struct drmcgroup_pool_state *pool;
>> +    u64 new;
>> +    int ret = 0;
>> +
>> +    if (index >= DRMCG_MAX_REGIONS)
>> +        return -EINVAL;
>> +
>> +    /*
>> +     * hold on to css, as cgroup can be removed but resource
>> +     * accounting happens on css.
>> +     */
>> +    cg = get_current_drmcg();
>
> 1)
>
> I am not familiar with the Xe flows - charging is at the point of 
> actual backing store allocation?
>
> What about buffer sharing?
>
> Also, given how the css is permanently stored in the caller - you 
> deliberately decided not to deal with task migrations? I am not sure 
> that will work. Or maybe just omitted for RFC v1?

The other cgroup implementations don't really seem to implement 
migration; when a resource is allocated, it is charged to the process 
allocating it. I guess we could implement handover in dma-buf, but I 
think the lack of migration should be harmless.

> 2)
>
> Buffer objects which Xe can migrate between memory regions will be 
> correctly charge/uncharged as they are moved?

Correct. It will be charged in both places while moving, and afterwards 
the old allocation is uncharged when freeing.

I'm working on making TTM more cgroup aware while evicting, by adding 
the limiting css as a possible return pointer. TTM can then only evict 
memory that is charged to the cgroup whose limit is hit, or to one of 
its subgroups.
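
Roughly, a move path would then do something like the sketch below. This 
is not the actual Xe code; my_bo, bo->drmcs, driver_do_move() and the 
region indices are placeholders, only the drmcg_* helpers are from this 
series:

static int my_bo_move(struct my_bo *bo, struct drmcgroup_device *cgdev,
                      u32 src_region, u32 dst_region, u64 size)
{
    struct drmcgroup_state *new_cs;
    int ret;

    /*
     * Charge the destination region first; both regions stay charged
     * while the move is in flight. The charge goes to the current
     * task's cgroup, as in drmcg_try_charge().
     */
    ret = drmcg_try_charge(&new_cs, cgdev, dst_region, size);
    if (ret)
        return ret;

    ret = driver_do_move(bo, dst_region); /* placeholder for the move */
    if (ret) {
        drmcg_uncharge(new_cs, cgdev, dst_region, size);
        return ret;
    }

    /* The old backing store is freed, so drop its charge. */
    drmcg_uncharge(bo->drmcs, cgdev, src_region, size);
    bo->drmcs = new_cs;
    return 0;
}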

Cheers,

~Maarten


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC PATCH 0/4]  Add support for DRM cgroup memory accounting.
  2023-05-03  8:34 [RFC PATCH 0/4] Add support for DRM cgroup memory accounting Maarten Lankhorst
                   ` (3 preceding siblings ...)
  2023-05-03  8:35 ` [RFC PATCH 4/4] drm/xe: Add support for the drm cgroup Maarten Lankhorst
@ 2023-05-05 19:50 ` Tejun Heo
  2023-05-10 14:59   ` Maarten Lankhorst
  4 siblings, 1 reply; 16+ messages in thread
From: Tejun Heo @ 2023-05-05 19:50 UTC (permalink / raw)
  To: Maarten Lankhorst
  Cc: Tvrtko Ursulin, Thomas Zimmermann, intel-gfx, linux-kernel,
	dri-devel, amd-gfx, Zefan Li, Johannes Weiner, cgroups, intel-xe

Hello,

On Wed, May 03, 2023 at 10:34:56AM +0200, Maarten Lankhorst wrote:
> RFC as I'm looking for comments.
> 
> For long running compute, it can be beneficial to partition the GPU memory
> between cgroups, so each cgroup can use its maximum amount of memory without
> interfering with other scheduled jobs. Done properly, this can alleviate the
> need for eviction, which might result in a job being terminated if the GPU
> doesn't support mid-thread preemption or recoverable page faults.
> 
> This is done by adding a bunch of knobs to cgroup:
> drm.capacity: Shows maximum capacity of each resource region.
> drm.max: Display or limit max amount of memory.
> drm.current: Current amount of memory in use.
> 
> TTM has not been made cgroup aware yet, so instead of evicting from
> the current cgroup to stay within the cgroup limits, it simply returns
> the error -ENOSPC to userspace.
> 
> I've used Tvrtko's cgroup controller series as a base, but it implemented
> scheduling weight, not memory accounting, so I only ended up keeping the
> base patch.
> 
> Xe is not upstream yet, so the driver specific patch will only apply on
> https://gitlab.freedesktop.org/drm/xe/kernel

Some high-level feedbacks.

* There have been multiple attempts at this but the track record is kinda
  poor. People don't seem to agree what should constitute DRM memory and how
  they should be accounted / controlled.

* I like Tvrtko's scheduling patchset because it exposes a generic interface
  which makes sense regardless of hardware details and then each driver can
  implement the configured control in whatever way they can. However, even
  for that, there doesn't seem much buy-in from other drivers.

* This proposal seems narrowly scoped trying to solve a specific problem
  which may not translate to different hardware configurations. Please let
  me know if I got that wrong, but if that's the case, I think a better and
  easier approach might be just being a part of the misc controller. That
  doesn't require much extra code and should be able to provide everything
  necessary for statically limiting specific resources.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC PATCH 0/4] Add support for DRM cgroup memory accounting.
  2023-05-05 19:50 ` [RFC PATCH 0/4] Add support for DRM cgroup memory accounting Tejun Heo
@ 2023-05-10 14:59   ` Maarten Lankhorst
  2023-05-10 18:46     ` Tejun Heo
  0 siblings, 1 reply; 16+ messages in thread
From: Maarten Lankhorst @ 2023-05-10 14:59 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Tvrtko Ursulin, Thomas Zimmermann, intel-gfx, linux-kernel,
	dri-devel, amd-gfx, Zefan Li, Johannes Weiner, cgroups, intel-xe


Hey,

On 2023-05-05 21:50, Tejun Heo wrote:
> Hello,
>
> On Wed, May 03, 2023 at 10:34:56AM +0200, Maarten Lankhorst wrote:
>> RFC as I'm looking for comments.
>>
>> For long running compute, it can be beneficial to partition the GPU memory
>> between cgroups, so each cgroup can use its maximum amount of memory without
>> interfering with other scheduled jobs. Done properly, this can alleviate the
>> need for eviction, which might result in a job being terminated if the GPU
>> doesn't support mid-thread preemption or recoverable page faults.
>>
>> This is done by adding a bunch of knobs to cgroup:
>> drm.capacity: Shows maximum capacity of each resource region.
>> drm.max: Display or limit max amount of memory.
>> drm.current: Current amount of memory in use.
>>
>> TTM has not been made cgroup aware yet, so instead of evicting from
>> the current cgroup to stay within the cgroup limits, it simply returns
>> the error -ENOSPC to userspace.
>>
>> I've used Tvrtko's cgroup controller series as a base, but it implemented
>> scheduling weight, not memory accounting, so I only ended up keeping the
>> base patch.
>>
>> Xe is not upstream yet, so the driver specific patch will only apply on
>> https://gitlab.freedesktop.org/drm/xe/kernel
> Some high-level feedbacks.
>
> * There have been multiple attempts at this but the track record is kinda
>    poor. People don't seem to agree what should constitute DRM memory and how
>    they should be accounted / controlled.

Thanks for the feedback.

I think for a lot of drivers, what counts as VRAM might have a different meaning, but
the intention is that it is accounted in the same way. Most drivers use TTM, which has
a standard way of allocating memory and a standard way of evicting VRAM.

This makes it very useful for the use case I'm looking at, long-running compute.
When you have long-running jobs, you don't want them to be interrupted because a
completely unrelated process needs some VRAM and one of the compute job's buffers
gets evicted.

Some hardware does not support mid-thread preemption or page fault recovery; this means
that when memory is evicted, the compute job is terminated.

The full problem statement is in drm-compute.rst in the memory accounting patch.

> * I like Tvrtko's scheduling patchset because it exposes a generic interface
>    which makes sense regardless of hardware details and then each driver can
>    implement the configured control in whatever way they can. However, even
>    for that, there doesn't seem much buy-in from other drivers.

Yeah, that is correct. But it tries to solve a different part of the problem.

> * This proposal seems narrowly scoped trying to solve a specific problem
>    which may not translate to different hardware configurations. Please let
>    me know if I got that wrong, but if that's the case, I think a better and
>    easier approach might be just being a part of the misc controller. That
>    doesn't require much extra code and should be able to provide everything
>    necessary for statically limiting specific resources.

The misc controller is not granular enough. A single computer may have any number of
graphics cards, some of them with multiple regions of vram inside a single card.

For compute and shared hosting you might want to limit the usage of a single memory
region on a single card, and then apply the same limits to the rest as well, to
prevent triggering eviction.

The current version doesn't handle eviction correctly, because I was still working
on it and I wanted to post an RFC. As a result, when a resource limit is hit it will
either evict the device's entire memory or get stuck in a loop. With some changes, the
next version will not have this bug. This results in a few changes to the core code. [1]

In the next version, I will move all the code for handling the resource limit to
TTM's eviction layer, because otherwise it cannot handle the resource limit correctly.

The effect of moving the code to TTM, is that it will make the code even more generic
for drivers that have vram and use TTM. When using TTM, you only have to describe your
VRAM, update some fields in the TTM manager and (un)register your device with the
cgroup handler on (un)load. It's quite trivial to add vram accounting to amdgpu and
nouveau. [2]

If you want to add a knob for scheduling weight for a process, it makes sense to
also add resource usage as a knob, otherwise the effect of that knob is very
limited. So even for Tvrtko's original proposed usecase, it would make sense.

Cheers,
~Maarten

--------
[1] Compared to this version:
  static inline int drmcg_try_charge(struct drmcgroup_state **drmcs,
+                                  struct drmcgroup_state **limitcs,
                                    struct drmcgroup_device *cgdev,
                                    u32 index, u64 size)

This now returns which cgroup's limit is hit on -EAGAIN.

+bool drmcs_grouped(struct drmcgroup_state *limitcs,
+                  struct drmcgroup_state *testcs);
Tells if testcs is the same as limitcs, or a subgroup of it. This allows us to
skip evicting when it's unneeded. If we want to add a min, it will make sense
to pass the size too, to skip some subcgroups below min.

+void drmcs_put(struct drmcgroup_state *drmcs);
Drops the limitcs ref.
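
Put together, the intended use from TTM's eviction path would be roughly the
following. This is only a pseudo-code sketch; the LRU walk, bo->drmcs and evict()
are placeholders, and loop termination / error handling is omitted:

	struct drmcgroup_state *drmcs, *limitcs;
	int ret;

	for (;;) {
		ret = drmcg_try_charge(&drmcs, &limitcs, cgdev, index, size);
		if (ret != -EAGAIN)
			break;

		/*
		 * A limit was hit: only evict buffers charged to the cgroup
		 * that hit its limit, or to one of its subgroups, so unrelated
		 * cgroups are never evicted to satisfy this allocation.
		 */
		list_for_each_entry(bo, &man->lru, lru) {
			if (!drmcs_grouped(limitcs, bo->drmcs))
				continue;
			evict(bo); /* placeholder */
		}
		drmcs_put(limitcs);
	}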
-------------------
[2] With the next version, I can very easily implement the cgroup handling on amdgpu too:
- embed a struct drmcgroup_device inside amdgpu_device.
- In amdgpu_vram_mgr_init, populate the struct drmcgroup_device.regions[0] for vram,
   and set ttm_resource_manager->cg to &adev->drmcgroup_device
- Call drmcg_register_device afterwards, and drmcg_unregister_device after cleaning up vram;
   roughly as sketched below.
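
Something along these lines (rough sketch only; the member name, the regions[].size
field and man->cg are assumptions that may look different in the final version):

	/* 1) embed it in amdgpu_device */
	struct amdgpu_device {
		...
		struct drmcgroup_device drmcgroup_device;
	};

	/* 2) in amdgpu_vram_mgr_init(): describe vram as region 0 */
	adev->drmcgroup_device.regions[0].name = "vram0";
	adev->drmcgroup_device.regions[0].size = adev->gmc.real_vram_size;
	man->cg = &adev->drmcgroup_device;

	/* 3) register with the cgroup controller, unregister on teardown */
	ret = drmcg_register_device(adev_to_drm(adev), &adev->drmcgroup_device);
	...
	drmcg_unregister_device(&adev->drmcgroup_device);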

So if anyone wants to limit VRAM on amdgpu or qxl or nouveau (left as an exercise for the reader)
afterwards, it will work as intended, while the driver doesn't have to be cgroup aware.


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC PATCH 0/4] Add support for DRM cgroup memory accounting.
  2023-05-10 14:59   ` Maarten Lankhorst
@ 2023-05-10 18:46     ` Tejun Heo
  2023-05-11 10:03       ` Maarten Lankhorst
  2023-05-11 10:14       ` [Intel-gfx] " Tvrtko Ursulin
  0 siblings, 2 replies; 16+ messages in thread
From: Tejun Heo @ 2023-05-10 18:46 UTC (permalink / raw)
  To: Maarten Lankhorst
  Cc: Tvrtko Ursulin, Thomas Zimmermann, intel-gfx, linux-kernel,
	dri-devel, amd-gfx, Zefan Li, Johannes Weiner, cgroups, intel-xe

Hello,

On Wed, May 10, 2023 at 04:59:01PM +0200, Maarten Lankhorst wrote:
> The misc controller is not granular enough. A single computer may have any number of
> graphics cards, some of them with multiple regions of vram inside a single card.

Extending the misc controller to support dynamic keys shouldn't be that
difficult.

...
> In the next version, I will move all the code for handling the resource limit to
> TTM's eviction layer, because otherwise it cannot handle the resource limit correctly.
> 
> The effect of moving the code to TTM, is that it will make the code even more generic
> for drivers that have vram and use TTM. When using TTM, you only have to describe your
> VRAM, update some fields in the TTM manager and (un)register your device with the
> cgroup handler on (un)load. It's quite trivial to add vram accounting to amdgpu and
> nouveau. [2]
> 
> If you want to add a knob for scheduling weight for a process, it makes sense to
> also add resource usage as a knob, otherwise the effect of that knob is very
> limited. So even for Tvrtko's original proposed usecase, it would make sense.

It does make sense but unlike Tvrtko's scheduling weights what's being
proposed doesn't seem to encapsulate GPU memory resource in a generic enough
manner at least to my untrained eyes. ie. w/ drm.weight, I don't need any
specific knowledge of how a specific GPU operates to say "this guy should
get 2x processing power over that guy". This more or less holds for other
major resources including CPU, memory and IO. What you're proposing seems a
lot more tied to hardware details and users would have to know a lot more
about how memory is configured on that particular GPU.

Now, if this is inherent to how all, or at least most, GPUs operate, sure,
but otherwise let's start small in terms of interface and not take up space
which should be for something universal. If this turns out to be the way,
expanding to take up the generic interface space isn't difficult.

I don't know GPU space so please educate me where I'm wrong.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC PATCH 0/4] Add support for DRM cgroup memory accounting.
  2023-05-10 18:46     ` Tejun Heo
@ 2023-05-11 10:03       ` Maarten Lankhorst
  2023-05-11 10:14       ` [Intel-gfx] " Tvrtko Ursulin
  1 sibling, 0 replies; 16+ messages in thread
From: Maarten Lankhorst @ 2023-05-11 10:03 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Tvrtko Ursulin, Thomas Zimmermann, intel-gfx, linux-kernel,
	dri-devel, amd-gfx, Zefan Li, Johannes Weiner, cgroups, intel-xe

Hey,

On 2023-05-10 20:46, Tejun Heo wrote:
> Hello,
>
> On Wed, May 10, 2023 at 04:59:01PM +0200, Maarten Lankhorst wrote:
>> The misc controller is not granular enough. A single computer may have any number of
>> graphics cards, some of them with multiple regions of vram inside a single card.
> Extending the misc controller to support dynamic keys shouldn't be that
> difficult.
>
> ...
>> In the next version, I will move all the code for handling the resource limit to
>> TTM's eviction layer, because otherwise it cannot handle the resource limit correctly.
>>
>> The effect of moving the code to TTM, is that it will make the code even more generic
>> for drivers that have vram and use TTM. When using TTM, you only have to describe your
>> VRAM, update some fields in the TTM manager and (un)register your device with the
>> cgroup handler on (un)load. It's quite trivial to add vram accounting to amdgpu and
>> nouveau. [2]
>>
>> If you want to add a knob for scheduling weight for a process, it makes sense to
>> also add resource usage as a knob, otherwise the effect of that knob is very
>> limited. So even for Tvrtko's original proposed usecase, it would make sense.
> It does make sense but unlike Tvrtko's scheduling weights what's being
> proposed doesn't seem to encapsulate GPU memory resource in a generic enough
> manner at least to my untrained eyes. ie. w/ drm.weight, I don't need any
> specific knowledge of how a specific GPU operates to say "this guy should
> get 2x processing power over that guy". This more or less holds for other
> major resources including CPU, memory and IO. What you're proposing seems a
> lot more tied to hardware details and users would have to know a lot more
> about how memory is configured on that particular GPU.

There's not much need to know the specifics of a card, but there might
be a need to know the workload to determine what allocation limits to set.

I've left the regions implementation specific, but it would make sense to
standardise them.
TTM, the layer used by drivers that support VRAM, has the following regions:
* sysmem - All system memory allocated; includes evicted VRAM.
* mapped - All physical system memory that is mapped to the GPU; when unbound
           it moves to sysmem. When evicting VRAM to sysmem, it's temporarily
           mapped here.
* vramN - All VRAM regions of the device.
* driver-specific regions - probably don't make sense to put in a cgroup at all;
  this includes stolen memory from the PoC.

That leaves the question, what regions would make sense for a cgroup?
Since vramN can be moved to mapped and sysmem (VRAM eviction, suspend/resume,
driver_madvise), it becomes a subject of debate if we should include the other
regions, since things become complicated fast.

For the first iteration, I focus on a single category, vramN.

Even without knowing anything about a GPU, it will be easy to partition its
memory like that.

If you can assign a weight for the scheduler, then you can also partition its
vram by parsing drm.capacity for the total amount and then splitting it across
cgroups.


> Now, if this is inherent to how all, or at least most, GPUs operate, sure,
> but otherwise let's start small in terms of interface and not take up space
> which should be for something universal. If this turns out to be the way,
> expanding to take up the generic interface space isn't difficult.
>
> I don't know GPU space so please educate me where I'm wrong.

Most GPUs have dedicated vram that works roughly in the same way; some
integrated chips like i915 or arm use shared memory from the host system
only. I would say amd, nvidia and intel's chips with dedicated memory work
roughly the same way for vram.

I hope this explains it a little bit more,

~Maarten


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [Intel-gfx] [RFC PATCH 0/4] Add support for DRM cgroup memory accounting.
  2023-05-10 18:46     ` Tejun Heo
  2023-05-11 10:03       ` Maarten Lankhorst
@ 2023-05-11 10:14       ` Tvrtko Ursulin
  2023-05-11 19:58         ` Maarten Lankhorst
  1 sibling, 1 reply; 16+ messages in thread
From: Tvrtko Ursulin @ 2023-05-11 10:14 UTC (permalink / raw)
  To: Tejun Heo, Maarten Lankhorst
  Cc: amd-gfx, Zefan Li, intel-gfx, linux-kernel, dri-devel,
	Thomas Zimmermann, Johannes Weiner, cgroups, intel-xe


On 10/05/2023 19:46, Tejun Heo wrote:
> Hello,
> 
> On Wed, May 10, 2023 at 04:59:01PM +0200, Maarten Lankhorst wrote:
>> The misc controller is not granular enough. A single computer may have any number of
>> graphics cards, some of them with multiple regions of vram inside a single card.
> 
> Extending the misc controller to support dynamic keys shouldn't be that
> difficult.
> 
> ...
>> In the next version, I will move all the code for handling the resource limit to
>> TTM's eviction layer, because otherwise it cannot handle the resource limit correctly.
>>
>> The effect of moving the code to TTM, is that it will make the code even more generic
>> for drivers that have vram and use TTM. When using TTM, you only have to describe your
>> VRAM, update some fields in the TTM manager and (un)register your device with the
>> cgroup handler on (un)load. It's quite trivial to add vram accounting to amdgpu and
>> nouveau. [2]
>>
>> If you want to add a knob for scheduling weight for a process, it makes sense to
>> also add resource usage as a knob, otherwise the effect of that knob is very
>> limited. So even for Tvrtko's original proposed usecase, it would make sense.
> 
> It does make sense but unlike Tvrtko's scheduling weights what's being
> proposed doesn't seem to encapsulate GPU memory resource in a generic enough
> manner at least to my untrained eyes. ie. w/ drm.weight, I don't need any
>> specific knowledge of how a specific GPU operates to say "this guy should
> get 2x processing power over that guy". This more or less holds for other
> major resources including CPU, memory and IO. What you're proposing seems a
> lot more tied to hardware details and users would have to know a lot more
> about how memory is configured on that particular GPU.
> 
> Now, if this is inherent to how all, or at least most, GPUs operate, sure,
> but otherwise let's start small in terms of interface and not take up space
> which should be for something universal. If this turns out to be the way,
> expanding to take up the generic interface space isn't difficult.
> 
> I don't know GPU space so please educate me where I'm wrong.

I was hoping it might be passable in principle (in some form, pending 
discussion on semantics) given how Maarten's proposal starts with only 
very specific per-device-per-memory regions controls, which is 
applicable to many devices. And hard limit at that, which probably has 
less complicated semantics, or at least implementation.

My understanding of the proposal is that the allocation either fits, or 
it evicts from the parent's hierarchy (if possible) and then fits, or it 
fails. Which sounds simple enough.

I do however agree that it is a limited use case. So from the negative 
side of the debating camp I have to ask if this use case could be simply 
satisfied by providing a driver/device global over commit yes/no 
control? In other words, is it really important to partition the vram 
space ahead of time, and from the kernel side too? Wouldn't the relevant 
(highly specialised) applications work just fine with global over commit 
disabled? Even if the option to configure their maximum allowed working 
set from the userspace side was needed.

Or if we conclude cgroup controller is the way to go, would adding less 
specific limits make it more palatable? I am thinking here some generic 
"device memory resident". Not per device, not per memory region. So 
userspace/admins have some chance of configuring generic limits. That 
would require coming up with more generic use cases though so another 
thing to discuss. Like who would use that and for what.

Assuming also we can agree that "device memory resident" is a 
stable/solid concept across drm. Should be easier than for integrated 
GPUs, for which I have to admit I currently don't remember if 
allocations are already consistently covered by the memory controller. 
Even if they are ownership is probably wrong.

Group ownership is possibly a concern in this proposal too. Because I 
remember the previous attempt of adding some drm stats to memcg 
explained that for instance on Android all buffers are allocated by a 
central process and then handed over to other processes. So transferring 
ownership was explained as critical.

Regards,

Tvrtko

P.S.
On the matter of naming the uapi interface - in any case I wouldn't use 
the "unqualified" drm namespace such as drm.max/current/capacity. I 
think all those should include a specific prefix to signify it is about 
memory. In some way.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [Intel-gfx] [RFC PATCH 0/4] Add support for DRM cgroup memory accounting.
  2023-05-11 10:14       ` [Intel-gfx] " Tvrtko Ursulin
@ 2023-05-11 19:58         ` Maarten Lankhorst
  0 siblings, 0 replies; 16+ messages in thread
From: Maarten Lankhorst @ 2023-05-11 19:58 UTC (permalink / raw)
  To: Tvrtko Ursulin, Tejun Heo
  Cc: amd-gfx, Zefan Li, intel-gfx, linux-kernel, dri-devel,
	Thomas Zimmermann, Johannes Weiner, cgroups, intel-xe

Hey,

On 2023-05-11 12:14, Tvrtko Ursulin wrote:
>
> On 10/05/2023 19:46, Tejun Heo wrote:
>> Hello,
>>
>> On Wed, May 10, 2023 at 04:59:01PM +0200, Maarten Lankhorst wrote:
>>> The misc controller is not granular enough. A single computer may have any number of
>>> graphics cards, some of them with multiple regions of vram inside a single card.
>>
>> Extending the misc controller to support dynamic keys shouldn't be that
>> difficult.
>>
>> ...
>>> In the next version, I will move all the code for handling the resource limit to
>>> TTM's eviction layer, because otherwise it cannot handle the resource limit correctly.
>>>
>>> The effect of moving the code to TTM, is that it will make the code even more generic
>>> for drivers that have vram and use TTM. When using TTM, you only have to describe your
>>> VRAM, update some fields in the TTM manager and (un)register your device with the
>>> cgroup handler on (un)load. It's quite trivial to add vram accounting to amdgpu and
>>> nouveau. [2]
>>>
>>> If you want to add a knob for scheduling weight for a process, it makes sense to
>>> also add resource usage as a knob, otherwise the effect of that knob is very
>>> limited. So even for Tvrtko's original proposed usecase, it would make sense.
>>
>> It does make sense but unlike Tvrtko's scheduling weights what's being
>> proposed doesn't seem to encapsulate GPU memory resource in a generic enough
>> manner at least to my untrained eyes. ie. w/ drm.weight, I don't need any
>> specific knoweldge of how a specific GPU operates to say "this guy should
>> get 2x processing power over that guy". This more or less holds for other
>> major resources including CPU, memory and IO. What you're proposing seems a
>> lot more tied to hardware details and users would have to know a lot more
>> about how memory is configured on that particular GPU.
>>
>> Now, if this is inherent to how all, or at least most, GPUs operate, sure,
>> but otherwise let's start small in terms of interface and not take up space
>> which should be for something universal. If this turns out to be the way,
>> expanding to take up the generic interface space isn't difficult.
>>
>> I don't know GPU space so please educate me where I'm wrong.
>
> I was hoping it might be passable in principle (in some form, pending discussion on semantics) given how Maarten's proposal starts with only very specific per-device-per-memory regions controls, which is applicable to many devices. And hard limit at that, which probably has less complicated semantics, or at least implementation.
>
> My understanding of the proposal is that the allocation either fits, or it evicts from the parent's hierarchy (if possible) and then fits, or it fails. Which sounds simple enough.

Yeah, for vram it's that simple. I think for mapped and sysmem regions we may require the
possibility to ignore the limits as well, as it's possible to move to those regions from
eviction. We probably don't want eviction to fail because of limits that are set too low.

> I do however agree that it is a limited use case. So from the negative side of the debating camp I have to ask if this use case could be simply satisfied by providing a driver/device global over commit yes/no control? In other words, is it really important to partition the vram space ahead of time, and from the kernel side too? Wouldn't the relevant (highly specialised) applications work just fine with global over commit disabled? Even if the option to configure their maximum allowed working set from the userspace side was needed.

Disabling overcommit? Do you mean pinning the memory workload? This causes a denial of service if
done without limits, and that's what we're trying to avoid. There is no need for complete immunity from
eviction. Overcommit is still useful inside the cgroup itself; we only want immunity from being
evicted by another process performing another workload.

> Or if we conclude cgroup controller is the way to go, would adding less specific limits make it more palatable? I am thinking here some generic "device memory resident". Not per device, not per memory region. So userspace/admins have some chance of configuring generic limits. That would require coming up with more generic use cases though so another thing to discuss. Like who would use that and for what.

You would run into ambiguity with that. I think it's fine to assume any number of vram regions. In most cases, the number is 1.

I don't see that number going much higher; 2 is already on the high side. A simple controller could just halve each region separately.

> Assuming also we can agree that "device memory resident" is a stable/solid concept across drm. Should be easier than for integrated GPUs, for which I have to admit I currently don't remember if allocations are already consistently covered by the memory controller. Even if they are ownership is probably wrong.
Likely, especially when evicting another process' memory. That needs some thought. Likely we have to keep the original cgroup as an owner.
> Group ownership is possibly a concern in this proposal too. Because I remember the previous attempt of adding some drm stats to memcg explained that for instance on Android all buffers are allocated by a central process and then handed over to other processes. So transferring ownership was explained as critical.
Is this done using dma-buf? Ownership could be handed over on import then. If not, what is the mechanism they use?
> Regards,
>
> Tvrtko
>
> P.S.
> On the matter of naming the uapi interface - in any case I wouldn't use the "unqualified" drm namespace such as drm.max/current/capacity. I think all those should include a specific prefix to signify it is about memory. In some way.

I've deliberately added a region prefix to each key, so drm.capacity/max/current contains: $pciid region.vram0=$value. That way, if we want to add non-memory resources, it becomes possible to do so by choosing a different prefix.
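
Concretely, drm.max in a cgroup would then contain something like this (made-up PCI addresses and size):

	0000:03:00.0 region.vram0=max
	0000:07:00.0 region.vram0=4294967296

i.e. one line per registered device, with one key per memory region, and "max" where no limit is set.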

For example, the number of engines that can be created could also be added to those files, but that might be more driver specific.

I tried to keep it generic like that. In this way I didn't immediately pollute the entire namespace.

Of course, if needed, we can make it drm.region_max etc. instead and drop the region prefix from the cgroup key entries.

Cheers,

Maarten


^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2023-05-11 19:58 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-05-03  8:34 [RFC PATCH 0/4] Add support for DRM cgroup memory accounting Maarten Lankhorst
2023-05-03  8:34 ` [RFC PATCH 1/4] cgroup: Add the DRM cgroup controller Maarten Lankhorst
2023-05-03  8:34 ` [RFC PATCH 2/4] drm/cgroup: Add memory accounting to DRM cgroup Maarten Lankhorst
2023-05-03 15:31   ` [Intel-gfx] " Tvrtko Ursulin
2023-05-03 15:33     ` Maarten Lankhorst
2023-05-05 14:21     ` Maarten Lankhorst
2023-05-03  8:34 ` [RFC PATCH 3/4] drm/ttm: Handle -EAGAIN in ttm_resource_alloc as -ENOSPC Maarten Lankhorst
2023-05-03  9:11   ` [Intel-xe] " Thomas Hellström
2023-05-03  9:36     ` Maarten Lankhorst
2023-05-03  8:35 ` [RFC PATCH 4/4] drm/xe: Add support for the drm cgroup Maarten Lankhorst
2023-05-05 19:50 ` [RFC PATCH 0/4] Add support for DRM cgroup memory accounting Tejun Heo
2023-05-10 14:59   ` Maarten Lankhorst
2023-05-10 18:46     ` Tejun Heo
2023-05-11 10:03       ` Maarten Lankhorst
2023-05-11 10:14       ` [Intel-gfx] " Tvrtko Ursulin
2023-05-11 19:58         ` Maarten Lankhorst
