All of lore.kernel.org
 help / color / mirror / Atom feed
From: "T.J. Mercier" <tjmercier@google.com>
To: "Maarten Lankhorst" <maarten.lankhorst@linux.intel.com>,
	"Maxime Ripard" <mripard@kernel.org>,
	"Thomas Zimmermann" <tzimmermann@suse.de>,
	"David Airlie" <airlied@linux.ie>,
	"Daniel Vetter" <daniel@ffwll.ch>,
	"Jonathan Corbet" <corbet@lwn.net>,
	"Greg Kroah-Hartman" <gregkh@linuxfoundation.org>,
	"Arve Hjønnevåg" <arve@android.com>,
	"Todd Kjos" <tkjos@android.com>,
	"Martijn Coenen" <maco@android.com>,
	"Joel Fernandes" <joel@joelfernandes.org>,
	"Christian Brauner" <brauner@kernel.org>,
	"Hridya Valsaraju" <hridya@google.com>,
	"Suren Baghdasaryan" <surenb@google.com>,
	"Sumit Semwal" <sumit.semwal@linaro.org>,
	"Christian König" <christian.koenig@amd.com>,
	"Benjamin Gaignard" <benjamin.gaignard@linaro.org>,
	"Liam Mark" <lmark@codeaurora.org>,
	"Laura Abbott" <labbott@redhat.com>,
	"Brian Starkey" <Brian.Starkey@arm.com>,
	"John Stultz" <john.stultz@linaro.org>,
	"Tejun Heo" <tj@kernel.org>, "Zefan Li" <lizefan.x@bytedance.com>,
	"Johannes Weiner" <hannes@cmpxchg.org>
Cc: kaleshsingh@google.com, Kenny.Ho@amd.com,
	"T.J. Mercier" <tjmercier@google.com>,
	dri-devel@lists.freedesktop.org, linux-doc@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-media@vger.kernel.org,
	linaro-mm-sig@lists.linaro.org, cgroups@vger.kernel.org
Subject: [RFC v2 1/6] gpu: rfc: Proposal for a GPU cgroup controller
Date: Fri, 11 Feb 2022 16:18:24 +0000	[thread overview]
Message-ID: <20220211161831.3493782-2-tjmercier@google.com> (raw)
In-Reply-To: <20220211161831.3493782-1-tjmercier@google.com>

This patch adds a proposal for a new GPU cgroup controller for
accounting/limiting GPU and GPU-related memory allocations.
The proposed controller is based on the DRM cgroup controller[1] and
follows the design of the RDMA cgroup controller.

The new cgroup controller would:
* Allow setting per-cgroup limits on the total size of buffers charged
  to it.
* Allow setting per-device limits on the total size of buffers
  allocated by device within a cgroup.
* Expose a per-device/allocator breakdown of the buffers charged to a
  cgroup.

The prototype in the following patches is only for memory accounting
using the GPU cgroup controller and does not implement limit setting.

[1]: https://lore.kernel.org/amd-gfx/20210126214626.16260-1-brian.welty@intel.com/

From: Hridya Valsaraju <hridya@google.com>
Signed-off-by: Hridya Valsaraju <hridya@google.com>
Co-developed-by: T.J. Mercier <tjmercier@google.com>
Signed-off-by: T.J. Mercier <tjmercier@google.com>
---
 Documentation/gpu/rfc/gpu-cgroup.rst | 195 +++++++++++++++++++++++++++
 Documentation/gpu/rfc/index.rst      |   4 +
 2 files changed, 199 insertions(+)
 create mode 100644 Documentation/gpu/rfc/gpu-cgroup.rst

diff --git a/Documentation/gpu/rfc/gpu-cgroup.rst b/Documentation/gpu/rfc/gpu-cgroup.rst
new file mode 100644
index 000000000000..0bb761223b97
--- /dev/null
+++ b/Documentation/gpu/rfc/gpu-cgroup.rst
@@ -0,0 +1,195 @@
+===================================
+GPU cgroup controller
+===================================
+
+Goals
+=====
+This document intends to outline a plan to create a cgroup v2 controller subsystem
+for the per-cgroup accounting of device and system memory allocated by the GPU
+and related subsystems.
+
+The new cgroup controller would:
+
+* Allow setting per-cgroup limits on the total size of buffers charged to it.
+
+* Allow setting per-device limits on the total size of buffers allocated by a
+  device/allocator within a cgroup.
+
+* Expose a per-device/allocator breakdown of the buffers charged to a cgroup.
+
+Alternatives Considered
+=======================
+
+The following alternatives were considered:
+
+The memory cgroup controller
+____________________________
+
+1. As was noted in [1], memory accounting provided by the GPU cgroup
+controller is not a good fit for integration into memcg due to the
+differences in how accounting is performed. It implements a mechanism
+for the allocator attribution of GPU and GPU-related memory by
+charging each buffer to the cgroup of the process on behalf of which
+the memory was allocated. The buffer stays charged to the cgroup until
+it is freed regardless of whether the process retains any references
+to it. On the other hand, the memory cgroup controller offers a more
+fine-grained charging and uncharging behavior depending on the kind of
+page being accounted.
+
+2. Memcg performs accounting in units of pages. In the DMA-BUF buffer sharing model,
+a process takes a reference to the entire buffer(hence keeping it alive) even if
+it is only accessing parts of it. Therefore, per-page memory tracking for DMA-BUF
+memory accounting would only introduce additional overhead without any benefits.
+
+[1]: https://patchwork.kernel.org/project/dri-devel/cover/20190501140438.9506-1-brian.welty@intel.com/#22624705
+
+Userspace service to keep track of buffer allocations and releases
+__________________________________________________________________
+
+1. There is no way for a userspace service to intercept all allocations and releases.
+2. In case the process gets killed or restarted, we lose all accounting so far.
+
+UAPI
+====
+When enabled, the new cgroup controller would create the following files in every cgroup.
+
+::
+
+        gpu.memory.current (R)
+        gpu.memory.max (R/W)
+
+gpu.memory.current is a read-only file and would contain per-device memory allocations
+in a key-value format where key is a string representing the device name
+and the value is the size of memory charged to the device in the cgroup in bytes.
+
+For example:
+
+::
+
+        cat /sys/kernel/fs/cgroup1/gpu.memory.current
+        dev1 4194304
+        dev2 4194304
+
+The string key for each device is set by the device driver when the device registers
+with the GPU cgroup controller to participate in resource accounting(see section
+'Design and Implementation' for more details).
+
+gpu.memory.max is a read/write file. It would show the current total
+size limits on memory usage for the cgroup and the limits on total memory usage
+for each allocator/device.
+
+Setting a total limit for a cgroup can be done as follows:
+
+::
+
+        echo “total 41943040” > /sys/kernel/fs/cgroup1/gpu.memory.max
+
+Setting a total limit for a particular device/allocator can be done as follows:
+
+::
+
+        echo “dev1 4194304” >  /sys/kernel/fs/cgroup1/gpu.memory.max
+
+In this example, 'dev1' is the string key set by the device driver during
+registration.
+
+Design and Implementation
+=========================
+
+The cgroup controller would closely follow the design of the RDMA cgroup controller
+subsystem where each cgroup maintains a list of resource pools.
+Each resource pool contains a struct device and the counter to track current total,
+and the maximum limit set for the device.
+
+The below code block is a preliminary estimation on how the core kernel data structures
+and APIs would look like.
+
+.. code-block:: c
+
+        /**
+         * The GPU cgroup controller data structure.
+         */
+        struct gpucg {
+                struct cgroup_subsys_state css;
+
+                /* list of all resource pools that belong to this cgroup */
+                struct list_head rpools;
+        };
+
+        struct gpucg_device {
+                /*
+                 * list  of various resource pools in various cgroups that the device is
+                 * part of.
+                 */
+                struct list_head rpools;
+
+                /* list of all devices registered for GPU cgroup accounting */
+                struct list_head dev_node;
+
+                /* name to be used as identifier for accounting and limit setting */
+                const char *name;
+        };
+
+        struct gpucg_resource_pool {
+                /* The device whose resource usage is tracked by this resource pool */
+                struct gpucg_device *device;
+
+                /* list of all resource pools for the cgroup */
+                struct list_head cg_node;
+
+                /*
+                 * list maintained by the gpucg_device to keep track of its
+                 * resource pools
+                 */
+                struct list_head dev_node;
+
+                /* tracks memory usage of the resource pool */
+                struct page_counter total;
+        };
+
+        /**
+         * gpucg_register_device - Registers a device for memory accounting using the
+         * GPU cgroup controller.
+         *
+         * @device: The device to register for memory accounting. Must remain valid
+         * after registration.
+         * @name: Pointer to a string literal to denote the name of the device.
+         */
+        void gpucg_register_device(struct gpucg_device *gpucg_dev, const char *name);
+
+        /**
+         * gpucg_try_charge - charge memory to the specified gpucg and gpucg_device.
+         *
+         * @gpucg: The gpu cgroup to charge the memory to.
+         * @device: The device to charge the memory to.
+         * @usage: size of memory to charge in bytes.
+         *
+         * Return: returns 0 if the charging is successful and otherwise returns an
+         * error code.
+         */
+        int gpucg_try_charge(struct gpucg *gpucg, struct gpucg_device *device, u64 usage);
+
+        /**
+         * gpucg_uncharge - uncharge memory from the specified gpucg and gpucg_device.
+         *
+         * @gpucg: The gpu cgroup to uncharge the memory from.
+         * @device: The device to charge the memory from.
+         * @usage: size of memory to uncharge in bytes.
+         */
+        void gpucg_uncharge(struct gpucg *gpucg, struct gpucg_device *device, u64 usage);
+
+Future Work
+===========
+Additional GPU resources can be supported by adding new controller files.
+
+Upstreaming Plan
+================
+* Decide on a UAPI that accommodates all use-cases for the upstream GPU ecosystem
+  as well as for Android.
+
+* Prototype the GPU cgroup controller and integrate its usage into the DMA-BUF
+  system heap.
+
+* Demonstrate its usage from userspace in the Android Open Space Project.
+
+* Send out RFCs to LKML for the GPU cgroup controller and iterate.
diff --git a/Documentation/gpu/rfc/index.rst b/Documentation/gpu/rfc/index.rst
index 91e93a705230..0a9bcd94e95d 100644
--- a/Documentation/gpu/rfc/index.rst
+++ b/Documentation/gpu/rfc/index.rst
@@ -23,3 +23,7 @@ host such documentation:
 .. toctree::
 
     i915_scheduler.rst
+
+.. toctree::
+
+    gpu-cgroup.rst
-- 
2.35.1.265.g69c8d7142f-goog


WARNING: multiple messages have this Message-ID (diff)
From: "T.J. Mercier" <tjmercier@google.com>
To: "Maarten Lankhorst" <maarten.lankhorst@linux.intel.com>,
	"Maxime Ripard" <mripard@kernel.org>,
	"Thomas Zimmermann" <tzimmermann@suse.de>,
	"David Airlie" <airlied@linux.ie>,
	"Daniel Vetter" <daniel@ffwll.ch>,
	"Jonathan Corbet" <corbet@lwn.net>,
	"Greg Kroah-Hartman" <gregkh@linuxfoundation.org>,
	"Arve Hjønnevåg" <arve@android.com>,
	"Todd Kjos" <tkjos@android.com>,
	"Martijn Coenen" <maco@android.com>,
	"Joel Fernandes" <joel@joelfernandes.org>,
	"Christian Brauner" <brauner@kernel.org>,
	"Hridya Valsaraju" <hridya@google.com>,
	"Suren Baghdasaryan" <surenb@google.com>,
	"Sumit Semwal" <sumit.semwal@linaro.org>,
	"Christian König" <christian.koenig@amd.com>,
	"Benjamin Gaignard" <benjamin.gaignard@linaro.org>,
	"Liam Mark" <lmark@codeaurora.org>,
	"Laura Abbott" <labbott@redhat.com>,
	"Brian Starkey" <Brian.Starkey@arm.com>,
	"John Stultz" <john.stultz@linaro.org>,
	"Tejun Heo" <tj@kernel.org>, "Zefan Li" <lizefan.x@bytedance.com>,
	"Johannes Weiner" <hannes@cmpxchg.org>
Cc: linux-doc@vger.kernel.org, Kenny.Ho@amd.com,
	linux-kernel@vger.kernel.org, dri-devel@lists.freedesktop.org,
	linaro-mm-sig@lists.linaro.org, kaleshsingh@google.com,
	cgroups@vger.kernel.org, "T.J. Mercier" <tjmercier@google.com>,
	linux-media@vger.kernel.org
Subject: [RFC v2 1/6] gpu: rfc: Proposal for a GPU cgroup controller
Date: Fri, 11 Feb 2022 16:18:24 +0000	[thread overview]
Message-ID: <20220211161831.3493782-2-tjmercier@google.com> (raw)
In-Reply-To: <20220211161831.3493782-1-tjmercier@google.com>

This patch adds a proposal for a new GPU cgroup controller for
accounting/limiting GPU and GPU-related memory allocations.
The proposed controller is based on the DRM cgroup controller[1] and
follows the design of the RDMA cgroup controller.

The new cgroup controller would:
* Allow setting per-cgroup limits on the total size of buffers charged
  to it.
* Allow setting per-device limits on the total size of buffers
  allocated by device within a cgroup.
* Expose a per-device/allocator breakdown of the buffers charged to a
  cgroup.

The prototype in the following patches is only for memory accounting
using the GPU cgroup controller and does not implement limit setting.

[1]: https://lore.kernel.org/amd-gfx/20210126214626.16260-1-brian.welty@intel.com/

From: Hridya Valsaraju <hridya@google.com>
Signed-off-by: Hridya Valsaraju <hridya@google.com>
Co-developed-by: T.J. Mercier <tjmercier@google.com>
Signed-off-by: T.J. Mercier <tjmercier@google.com>
---
 Documentation/gpu/rfc/gpu-cgroup.rst | 195 +++++++++++++++++++++++++++
 Documentation/gpu/rfc/index.rst      |   4 +
 2 files changed, 199 insertions(+)
 create mode 100644 Documentation/gpu/rfc/gpu-cgroup.rst

diff --git a/Documentation/gpu/rfc/gpu-cgroup.rst b/Documentation/gpu/rfc/gpu-cgroup.rst
new file mode 100644
index 000000000000..0bb761223b97
--- /dev/null
+++ b/Documentation/gpu/rfc/gpu-cgroup.rst
@@ -0,0 +1,195 @@
+===================================
+GPU cgroup controller
+===================================
+
+Goals
+=====
+This document intends to outline a plan to create a cgroup v2 controller subsystem
+for the per-cgroup accounting of device and system memory allocated by the GPU
+and related subsystems.
+
+The new cgroup controller would:
+
+* Allow setting per-cgroup limits on the total size of buffers charged to it.
+
+* Allow setting per-device limits on the total size of buffers allocated by a
+  device/allocator within a cgroup.
+
+* Expose a per-device/allocator breakdown of the buffers charged to a cgroup.
+
+Alternatives Considered
+=======================
+
+The following alternatives were considered:
+
+The memory cgroup controller
+____________________________
+
+1. As was noted in [1], memory accounting provided by the GPU cgroup
+controller is not a good fit for integration into memcg due to the
+differences in how accounting is performed. It implements a mechanism
+for the allocator attribution of GPU and GPU-related memory by
+charging each buffer to the cgroup of the process on behalf of which
+the memory was allocated. The buffer stays charged to the cgroup until
+it is freed regardless of whether the process retains any references
+to it. On the other hand, the memory cgroup controller offers a more
+fine-grained charging and uncharging behavior depending on the kind of
+page being accounted.
+
+2. Memcg performs accounting in units of pages. In the DMA-BUF buffer sharing model,
+a process takes a reference to the entire buffer(hence keeping it alive) even if
+it is only accessing parts of it. Therefore, per-page memory tracking for DMA-BUF
+memory accounting would only introduce additional overhead without any benefits.
+
+[1]: https://patchwork.kernel.org/project/dri-devel/cover/20190501140438.9506-1-brian.welty@intel.com/#22624705
+
+Userspace service to keep track of buffer allocations and releases
+__________________________________________________________________
+
+1. There is no way for a userspace service to intercept all allocations and releases.
+2. In case the process gets killed or restarted, we lose all accounting so far.
+
+UAPI
+====
+When enabled, the new cgroup controller would create the following files in every cgroup.
+
+::
+
+        gpu.memory.current (R)
+        gpu.memory.max (R/W)
+
+gpu.memory.current is a read-only file and would contain per-device memory allocations
+in a key-value format where key is a string representing the device name
+and the value is the size of memory charged to the device in the cgroup in bytes.
+
+For example:
+
+::
+
+        cat /sys/kernel/fs/cgroup1/gpu.memory.current
+        dev1 4194304
+        dev2 4194304
+
+The string key for each device is set by the device driver when the device registers
+with the GPU cgroup controller to participate in resource accounting(see section
+'Design and Implementation' for more details).
+
+gpu.memory.max is a read/write file. It would show the current total
+size limits on memory usage for the cgroup and the limits on total memory usage
+for each allocator/device.
+
+Setting a total limit for a cgroup can be done as follows:
+
+::
+
+        echo “total 41943040” > /sys/kernel/fs/cgroup1/gpu.memory.max
+
+Setting a total limit for a particular device/allocator can be done as follows:
+
+::
+
+        echo “dev1 4194304” >  /sys/kernel/fs/cgroup1/gpu.memory.max
+
+In this example, 'dev1' is the string key set by the device driver during
+registration.
+
+Design and Implementation
+=========================
+
+The cgroup controller would closely follow the design of the RDMA cgroup controller
+subsystem where each cgroup maintains a list of resource pools.
+Each resource pool contains a struct device and the counter to track current total,
+and the maximum limit set for the device.
+
+The below code block is a preliminary estimation on how the core kernel data structures
+and APIs would look like.
+
+.. code-block:: c
+
+        /**
+         * The GPU cgroup controller data structure.
+         */
+        struct gpucg {
+                struct cgroup_subsys_state css;
+
+                /* list of all resource pools that belong to this cgroup */
+                struct list_head rpools;
+        };
+
+        struct gpucg_device {
+                /*
+                 * list  of various resource pools in various cgroups that the device is
+                 * part of.
+                 */
+                struct list_head rpools;
+
+                /* list of all devices registered for GPU cgroup accounting */
+                struct list_head dev_node;
+
+                /* name to be used as identifier for accounting and limit setting */
+                const char *name;
+        };
+
+        struct gpucg_resource_pool {
+                /* The device whose resource usage is tracked by this resource pool */
+                struct gpucg_device *device;
+
+                /* list of all resource pools for the cgroup */
+                struct list_head cg_node;
+
+                /*
+                 * list maintained by the gpucg_device to keep track of its
+                 * resource pools
+                 */
+                struct list_head dev_node;
+
+                /* tracks memory usage of the resource pool */
+                struct page_counter total;
+        };
+
+        /**
+         * gpucg_register_device - Registers a device for memory accounting using the
+         * GPU cgroup controller.
+         *
+         * @device: The device to register for memory accounting. Must remain valid
+         * after registration.
+         * @name: Pointer to a string literal to denote the name of the device.
+         */
+        void gpucg_register_device(struct gpucg_device *gpucg_dev, const char *name);
+
+        /**
+         * gpucg_try_charge - charge memory to the specified gpucg and gpucg_device.
+         *
+         * @gpucg: The gpu cgroup to charge the memory to.
+         * @device: The device to charge the memory to.
+         * @usage: size of memory to charge in bytes.
+         *
+         * Return: returns 0 if the charging is successful and otherwise returns an
+         * error code.
+         */
+        int gpucg_try_charge(struct gpucg *gpucg, struct gpucg_device *device, u64 usage);
+
+        /**
+         * gpucg_uncharge - uncharge memory from the specified gpucg and gpucg_device.
+         *
+         * @gpucg: The gpu cgroup to uncharge the memory from.
+         * @device: The device to charge the memory from.
+         * @usage: size of memory to uncharge in bytes.
+         */
+        void gpucg_uncharge(struct gpucg *gpucg, struct gpucg_device *device, u64 usage);
+
+Future Work
+===========
+Additional GPU resources can be supported by adding new controller files.
+
+Upstreaming Plan
+================
+* Decide on a UAPI that accommodates all use-cases for the upstream GPU ecosystem
+  as well as for Android.
+
+* Prototype the GPU cgroup controller and integrate its usage into the DMA-BUF
+  system heap.
+
+* Demonstrate its usage from userspace in the Android Open Space Project.
+
+* Send out RFCs to LKML for the GPU cgroup controller and iterate.
diff --git a/Documentation/gpu/rfc/index.rst b/Documentation/gpu/rfc/index.rst
index 91e93a705230..0a9bcd94e95d 100644
--- a/Documentation/gpu/rfc/index.rst
+++ b/Documentation/gpu/rfc/index.rst
@@ -23,3 +23,7 @@ host such documentation:
 .. toctree::
 
     i915_scheduler.rst
+
+.. toctree::
+
+    gpu-cgroup.rst
-- 
2.35.1.265.g69c8d7142f-goog


WARNING: multiple messages have this Message-ID (diff)
From: "T.J. Mercier" <tjmercier-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
To: "Maarten Lankhorst"
	<maarten.lankhorst-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>,
	"Maxime Ripard" <mripard-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>,
	"Thomas Zimmermann" <tzimmermann-l3A5Bk7waGM@public.gmane.org>,
	"David Airlie" <airlied-cv59FeDIM0c@public.gmane.org>,
	"Daniel Vetter" <daniel-/w4YWyX8dFk@public.gmane.org>,
	"Jonathan Corbet" <corbet-T1hC0tSOHrs@public.gmane.org>,
	"Greg Kroah-Hartman"
	<gregkh-hQyY1W1yCW8ekmWlsbkhG0B+6BGkLq7r@public.gmane.org>,
	"Arve Hjønnevåg" <arve-z5hGa2qSFaRBDgjK7y7TUQ@public.gmane.org>,
	"Todd Kjos" <tkjos-z5hGa2qSFaRBDgjK7y7TUQ@public.gmane.org>,
	"Martijn Coenen" <maco-z5hGa2qSFaRBDgjK7y7TUQ@public.gmane.org>,
	"Joel Fernandes"
	<joel-QYYGw3jwrUn5owFQY34kdNi2O/JbrIOy@public.gmane.org>,
	"Christian Brauner"
	<brauner-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>,
	"Hridya Valsaraju"
	<hridya-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>,
	"Suren Baghdasaryan"
	<surenb-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>,
	"Sumit Semwal"
	<sumit.semwal-QSEj5FYQhm4dnm+yROfE0A@public.gmane.org>,
	"Christian König" <christian.koenig-5C7GfCeVMHo@public.gmane.org>,
	"Benjamin Gaignard"
	<benjamin.gaignard-QSEj5FYQhm4dnm+yROfE0A@public.gmane.org>,
	"Liam Mark" <lmark-sgV2jX0FEOL9JmXXK+q4OQ@public.gmane.org>
Cc: kaleshsingh-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org,
	Kenny.Ho-5C7GfCeVMHo@public.gmane.org,
	"T.J. Mercier"
	<tjmercier-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>,
	dri-devel-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org,
	linux-doc-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	linux-media-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	linaro-mm-sig-cunTk1MwBs8s++Sfvej+rw@public.gmane.org,
	cgroups-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
Subject: [RFC v2 1/6] gpu: rfc: Proposal for a GPU cgroup controller
Date: Fri, 11 Feb 2022 16:18:24 +0000	[thread overview]
Message-ID: <20220211161831.3493782-2-tjmercier@google.com> (raw)
In-Reply-To: <20220211161831.3493782-1-tjmercier-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>

This patch adds a proposal for a new GPU cgroup controller for
accounting/limiting GPU and GPU-related memory allocations.
The proposed controller is based on the DRM cgroup controller[1] and
follows the design of the RDMA cgroup controller.

The new cgroup controller would:
* Allow setting per-cgroup limits on the total size of buffers charged
  to it.
* Allow setting per-device limits on the total size of buffers
  allocated by device within a cgroup.
* Expose a per-device/allocator breakdown of the buffers charged to a
  cgroup.

The prototype in the following patches is only for memory accounting
using the GPU cgroup controller and does not implement limit setting.

[1]: https://lore.kernel.org/amd-gfx/20210126214626.16260-1-brian.welty@intel.com/

From: Hridya Valsaraju <hridya-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
Signed-off-by: Hridya Valsaraju <hridya-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
Co-developed-by: T.J. Mercier <tjmercier-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
Signed-off-by: T.J. Mercier <tjmercier-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
---
 Documentation/gpu/rfc/gpu-cgroup.rst | 195 +++++++++++++++++++++++++++
 Documentation/gpu/rfc/index.rst      |   4 +
 2 files changed, 199 insertions(+)
 create mode 100644 Documentation/gpu/rfc/gpu-cgroup.rst

diff --git a/Documentation/gpu/rfc/gpu-cgroup.rst b/Documentation/gpu/rfc/gpu-cgroup.rst
new file mode 100644
index 000000000000..0bb761223b97
--- /dev/null
+++ b/Documentation/gpu/rfc/gpu-cgroup.rst
@@ -0,0 +1,195 @@
+===================================
+GPU cgroup controller
+===================================
+
+Goals
+=====
+This document intends to outline a plan to create a cgroup v2 controller subsystem
+for the per-cgroup accounting of device and system memory allocated by the GPU
+and related subsystems.
+
+The new cgroup controller would:
+
+* Allow setting per-cgroup limits on the total size of buffers charged to it.
+
+* Allow setting per-device limits on the total size of buffers allocated by a
+  device/allocator within a cgroup.
+
+* Expose a per-device/allocator breakdown of the buffers charged to a cgroup.
+
+Alternatives Considered
+=======================
+
+The following alternatives were considered:
+
+The memory cgroup controller
+____________________________
+
+1. As was noted in [1], memory accounting provided by the GPU cgroup
+controller is not a good fit for integration into memcg due to the
+differences in how accounting is performed. It implements a mechanism
+for the allocator attribution of GPU and GPU-related memory by
+charging each buffer to the cgroup of the process on behalf of which
+the memory was allocated. The buffer stays charged to the cgroup until
+it is freed regardless of whether the process retains any references
+to it. On the other hand, the memory cgroup controller offers a more
+fine-grained charging and uncharging behavior depending on the kind of
+page being accounted.
+
+2. Memcg performs accounting in units of pages. In the DMA-BUF buffer sharing model,
+a process takes a reference to the entire buffer(hence keeping it alive) even if
+it is only accessing parts of it. Therefore, per-page memory tracking for DMA-BUF
+memory accounting would only introduce additional overhead without any benefits.
+
+[1]: https://patchwork.kernel.org/project/dri-devel/cover/20190501140438.9506-1-brian.welty-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org/#22624705
+
+Userspace service to keep track of buffer allocations and releases
+__________________________________________________________________
+
+1. There is no way for a userspace service to intercept all allocations and releases.
+2. In case the process gets killed or restarted, we lose all accounting so far.
+
+UAPI
+====
+When enabled, the new cgroup controller would create the following files in every cgroup.
+
+::
+
+        gpu.memory.current (R)
+        gpu.memory.max (R/W)
+
+gpu.memory.current is a read-only file and would contain per-device memory allocations
+in a key-value format where key is a string representing the device name
+and the value is the size of memory charged to the device in the cgroup in bytes.
+
+For example:
+
+::
+
+        cat /sys/kernel/fs/cgroup1/gpu.memory.current
+        dev1 4194304
+        dev2 4194304
+
+The string key for each device is set by the device driver when the device registers
+with the GPU cgroup controller to participate in resource accounting(see section
+'Design and Implementation' for more details).
+
+gpu.memory.max is a read/write file. It would show the current total
+size limits on memory usage for the cgroup and the limits on total memory usage
+for each allocator/device.
+
+Setting a total limit for a cgroup can be done as follows:
+
+::
+
+        echo “total 41943040” > /sys/kernel/fs/cgroup1/gpu.memory.max
+
+Setting a total limit for a particular device/allocator can be done as follows:
+
+::
+
+        echo “dev1 4194304” >  /sys/kernel/fs/cgroup1/gpu.memory.max
+
+In this example, 'dev1' is the string key set by the device driver during
+registration.
+
+Design and Implementation
+=========================
+
+The cgroup controller would closely follow the design of the RDMA cgroup controller
+subsystem where each cgroup maintains a list of resource pools.
+Each resource pool contains a struct device and the counter to track current total,
+and the maximum limit set for the device.
+
+The below code block is a preliminary estimation on how the core kernel data structures
+and APIs would look like.
+
+.. code-block:: c
+
+        /**
+         * The GPU cgroup controller data structure.
+         */
+        struct gpucg {
+                struct cgroup_subsys_state css;
+
+                /* list of all resource pools that belong to this cgroup */
+                struct list_head rpools;
+        };
+
+        struct gpucg_device {
+                /*
+                 * list  of various resource pools in various cgroups that the device is
+                 * part of.
+                 */
+                struct list_head rpools;
+
+                /* list of all devices registered for GPU cgroup accounting */
+                struct list_head dev_node;
+
+                /* name to be used as identifier for accounting and limit setting */
+                const char *name;
+        };
+
+        struct gpucg_resource_pool {
+                /* The device whose resource usage is tracked by this resource pool */
+                struct gpucg_device *device;
+
+                /* list of all resource pools for the cgroup */
+                struct list_head cg_node;
+
+                /*
+                 * list maintained by the gpucg_device to keep track of its
+                 * resource pools
+                 */
+                struct list_head dev_node;
+
+                /* tracks memory usage of the resource pool */
+                struct page_counter total;
+        };
+
+        /**
+         * gpucg_register_device - Registers a device for memory accounting using the
+         * GPU cgroup controller.
+         *
+         * @device: The device to register for memory accounting. Must remain valid
+         * after registration.
+         * @name: Pointer to a string literal to denote the name of the device.
+         */
+        void gpucg_register_device(struct gpucg_device *gpucg_dev, const char *name);
+
+        /**
+         * gpucg_try_charge - charge memory to the specified gpucg and gpucg_device.
+         *
+         * @gpucg: The gpu cgroup to charge the memory to.
+         * @device: The device to charge the memory to.
+         * @usage: size of memory to charge in bytes.
+         *
+         * Return: returns 0 if the charging is successful and otherwise returns an
+         * error code.
+         */
+        int gpucg_try_charge(struct gpucg *gpucg, struct gpucg_device *device, u64 usage);
+
+        /**
+         * gpucg_uncharge - uncharge memory from the specified gpucg and gpucg_device.
+         *
+         * @gpucg: The gpu cgroup to uncharge the memory from.
+         * @device: The device to charge the memory from.
+         * @usage: size of memory to uncharge in bytes.
+         */
+        void gpucg_uncharge(struct gpucg *gpucg, struct gpucg_device *device, u64 usage);
+
+Future Work
+===========
+Additional GPU resources can be supported by adding new controller files.
+
+Upstreaming Plan
+================
+* Decide on a UAPI that accommodates all use-cases for the upstream GPU ecosystem
+  as well as for Android.
+
+* Prototype the GPU cgroup controller and integrate its usage into the DMA-BUF
+  system heap.
+
+* Demonstrate its usage from userspace in the Android Open Space Project.
+
+* Send out RFCs to LKML for the GPU cgroup controller and iterate.
diff --git a/Documentation/gpu/rfc/index.rst b/Documentation/gpu/rfc/index.rst
index 91e93a705230..0a9bcd94e95d 100644
--- a/Documentation/gpu/rfc/index.rst
+++ b/Documentation/gpu/rfc/index.rst
@@ -23,3 +23,7 @@ host such documentation:
 .. toctree::
 
     i915_scheduler.rst
+
+.. toctree::
+
+    gpu-cgroup.rst
-- 
2.35.1.265.g69c8d7142f-goog


  reply	other threads:[~2022-02-11 16:18 UTC|newest]

Thread overview: 63+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-02-11 16:18 [RFC v2 0/6] Proposal for a GPU cgroup controller T.J. Mercier
2022-02-11 16:18 ` T.J. Mercier
2022-02-11 16:18 ` T.J. Mercier
2022-02-11 16:18 ` T.J. Mercier [this message]
2022-02-11 16:18   ` [RFC v2 1/6] gpu: rfc: " T.J. Mercier
2022-02-11 16:18   ` T.J. Mercier
2022-02-11 16:18 ` [RFC v2 2/6] cgroup: gpu: Add a cgroup controller for allocator attribution of GPU memory T.J. Mercier
2022-02-11 16:18   ` T.J. Mercier
2022-02-11 16:18   ` T.J. Mercier
2022-02-11 16:18 ` [RFC v2 3/6] dmabuf: Use the GPU cgroup charge/uncharge APIs T.J. Mercier
2022-02-11 16:18   ` T.J. Mercier
2022-02-11 16:18   ` T.J. Mercier
2022-02-11 16:18 ` [RFC v2 4/6] dmabuf: heaps: export system_heap buffers with GPU cgroup charging T.J. Mercier
2022-02-11 16:18   ` T.J. Mercier
2022-02-11 16:18   ` T.J. Mercier
2022-02-11 16:18 ` [RFC v2 5/6] dmabuf: Add gpu cgroup charge transfer function T.J. Mercier
2022-02-11 16:18   ` T.J. Mercier
2022-02-11 16:18   ` T.J. Mercier
2022-02-11 16:18 ` [RFC v2 6/6] android: binder: Add a buffer flag to relinquish ownership of fds T.J. Mercier
2022-02-11 16:18   ` T.J. Mercier
2022-02-11 16:18   ` T.J. Mercier
2022-02-12  7:19   ` Greg Kroah-Hartman
2022-02-12  7:19     ` Greg Kroah-Hartman
2022-02-12  7:19     ` Greg Kroah-Hartman
2022-02-14 18:33     ` Todd Kjos
2022-02-14 18:33       ` Todd Kjos
2022-02-14 18:33       ` Todd Kjos
2022-02-14 19:29       ` Suren Baghdasaryan
2022-02-14 19:29         ` Suren Baghdasaryan
2022-02-14 19:29         ` Suren Baghdasaryan
2022-02-14 20:19         ` Todd Kjos
2022-02-14 20:19           ` Todd Kjos
2022-02-14 20:19           ` Todd Kjos
2022-02-14 20:37           ` John Stultz
2022-02-14 20:37             ` John Stultz
2022-02-14 20:37             ` John Stultz
2022-02-14 21:14             ` Hridya Valsaraju
2022-02-14 21:14               ` Hridya Valsaraju
2022-02-14 21:14               ` Hridya Valsaraju
2022-02-14 22:25     ` T.J. Mercier
2022-02-14 22:25       ` T.J. Mercier
2022-02-14 22:25       ` T.J. Mercier
2022-02-15  7:01       ` Greg Kroah-Hartman
2022-02-15  7:01         ` Greg Kroah-Hartman
2022-02-15  7:01         ` Greg Kroah-Hartman
2022-02-15  7:19         ` Suren Baghdasaryan
2022-02-15  7:19           ` Suren Baghdasaryan
2022-02-15  7:19           ` Suren Baghdasaryan
2022-02-15  7:30           ` Greg Kroah-Hartman
2022-02-15  7:30             ` Greg Kroah-Hartman
2022-02-15  7:30             ` Greg Kroah-Hartman
2022-02-14 21:25   ` Todd Kjos
2022-02-14 21:25     ` Todd Kjos
2022-02-14 21:25     ` Todd Kjos
2022-02-15  0:03     ` T.J. Mercier
2022-02-15  0:03       ` T.J. Mercier
2022-02-15  0:03       ` T.J. Mercier
2022-02-14 19:23 ` [RFC v2 0/6] Proposal for a GPU cgroup controller Tejun Heo
2022-02-14 19:23   ` Tejun Heo
2022-02-14 19:23   ` Tejun Heo
2022-02-18 19:12 ` T.J. Mercier
2022-02-18 19:12   ` T.J. Mercier
2022-02-18 19:12   ` T.J. Mercier

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20220211161831.3493782-2-tjmercier@google.com \
    --to=tjmercier@google.com \
    --cc=Brian.Starkey@arm.com \
    --cc=Kenny.Ho@amd.com \
    --cc=airlied@linux.ie \
    --cc=arve@android.com \
    --cc=benjamin.gaignard@linaro.org \
    --cc=brauner@kernel.org \
    --cc=cgroups@vger.kernel.org \
    --cc=christian.koenig@amd.com \
    --cc=corbet@lwn.net \
    --cc=daniel@ffwll.ch \
    --cc=dri-devel@lists.freedesktop.org \
    --cc=gregkh@linuxfoundation.org \
    --cc=hannes@cmpxchg.org \
    --cc=hridya@google.com \
    --cc=joel@joelfernandes.org \
    --cc=john.stultz@linaro.org \
    --cc=kaleshsingh@google.com \
    --cc=labbott@redhat.com \
    --cc=linaro-mm-sig@lists.linaro.org \
    --cc=linux-doc@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-media@vger.kernel.org \
    --cc=lizefan.x@bytedance.com \
    --cc=lmark@codeaurora.org \
    --cc=maarten.lankhorst@linux.intel.com \
    --cc=maco@android.com \
    --cc=mripard@kernel.org \
    --cc=sumit.semwal@linaro.org \
    --cc=surenb@google.com \
    --cc=tj@kernel.org \
    --cc=tkjos@android.com \
    --cc=tzimmermann@suse.de \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.