linux-mm.kvack.org archive mirror
* [RFC PATCH 0/5] cgroup support for GPU devices
@ 2019-05-01 14:04 Brian Welty
  2019-05-01 14:04 ` [RFC PATCH 1/5] cgroup: Add cgroup_subsys per-device registration framework Brian Welty
                   ` (7 more replies)
  0 siblings, 8 replies; 19+ messages in thread
From: Brian Welty @ 2019-05-01 14:04 UTC (permalink / raw)
  To: cgroups, Tejun Heo, Li Zefan, Johannes Weiner, linux-mm,
	Michal Hocko, Vladimir Davydov, dri-devel, David Airlie,
	Daniel Vetter, intel-gfx, Jani Nikula, Joonas Lahtinen,
	Rodrigo Vivi, Christian König, Alex Deucher, ChunMing Zhou,
	Jérôme Glisse

In containerized or virtualized environments, there is a desire to have
controls in place for resources that can be consumed by users of a GPU
device.  This RFC patch series proposes a framework for integrating the
use of existing cgroup controllers into device drivers.
The i915 driver is updated in this series as our primary use case to
leverage this framework and to serve as an example for discussion.

The patch series enables device drivers to use cgroups to control the
following resources within a GPU (or other accelerator device):
*  control allocation of device memory (reuse of memcg)
and with future work, we could extend to:
*  track and control share of GPU time (reuse of cpu/cpuacct)
*  apply mask of allowed execution engines (reuse of cpusets)

Instead of introducing a new cgroup subsystem for GPU devices, a new
framework is proposed to allow devices to register with existing cgroup
controllers, which creates a per-device cgroup_subsys_state within the
cgroup.  This gives device drivers their own private cgroup controls
(such as memory limits or other parameters) that are applied to device
resources instead of host system resources.
Device drivers (GPU or other) are then able to reuse the existing cgroup
controls, instead of inventing similar ones.

Per-device controls would be exposed in cgroup filesystem as:
    mount/<cgroup_name>/<subsys_name>.devices/<dev_name>/<subsys_files>
such as (for example):
    mount/<cgroup_name>/memory.devices/<dev_name>/memory.max
    mount/<cgroup_name>/memory.devices/<dev_name>/memory.current
    mount/<cgroup_name>/cpu.devices/<dev_name>/cpu.stat
    mount/<cgroup_name>/cpu.devices/<dev_name>/cpu.weight
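
As a usage sketch (the cgroup name, device name, and limit value here are
purely illustrative), an administrator could then cap and inspect device
memory for a given cgroup with:
    echo 4G > mount/gpu-jobs/memory.devices/0000:03:00.0/memory.max
    cat mount/gpu-jobs/memory.devices/0000:03:00.0/memory.current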

The drm/i915 patch in this series is based on top of other RFC work [1]
for i915 device memory support.

AMD [2] and Intel [3] have proposed related work in this area within the
last few years, listed below for reference.  This new RFC reuses existing
cgroup controllers and takes a different approach than prior work.

Finally, some potential discussion points for this series:
* merge proposed <subsys_name>.devices into a single devices directory?
* allow devices to have multiple registrations for subsets of resources?
* document a 'common charging policy' for device drivers to follow?

[1] https://patchwork.freedesktop.org/series/56683/
[2] https://lists.freedesktop.org/archives/dri-devel/2018-November/197106.html
[3] https://lists.freedesktop.org/archives/intel-gfx/2018-January/153156.html


Brian Welty (5):
  cgroup: Add cgroup_subsys per-device registration framework
  cgroup: Change kernfs_node for directories to store
    cgroup_subsys_state
  memcg: Add per-device support to memory cgroup subsystem
  drm: Add memory cgroup registration and DRIVER_CGROUPS feature bit
  drm/i915: Use memory cgroup for enforcing device memory limit

 drivers/gpu/drm/drm_drv.c                  |  12 +
 drivers/gpu/drm/drm_gem.c                  |   7 +
 drivers/gpu/drm/i915/i915_drv.c            |   2 +-
 drivers/gpu/drm/i915/intel_memory_region.c |  24 +-
 include/drm/drm_device.h                   |   3 +
 include/drm/drm_drv.h                      |   8 +
 include/drm/drm_gem.h                      |  11 +
 include/linux/cgroup-defs.h                |  28 ++
 include/linux/cgroup.h                     |   3 +
 include/linux/memcontrol.h                 |  10 +
 kernel/cgroup/cgroup-v1.c                  |  10 +-
 kernel/cgroup/cgroup.c                     | 310 ++++++++++++++++++---
 mm/memcontrol.c                            | 183 +++++++++++-
 13 files changed, 552 insertions(+), 59 deletions(-)

-- 
2.21.0



* [RFC PATCH 1/5] cgroup: Add cgroup_subsys per-device registration framework
  2019-05-01 14:04 [RFC PATCH 0/5] cgroup support for GPU devices Brian Welty
@ 2019-05-01 14:04 ` Brian Welty
  2019-05-01 14:04 ` [RFC PATCH 2/5] cgroup: Change kernfs_node for directories to store cgroup_subsys_state Brian Welty
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 19+ messages in thread
From: Brian Welty @ 2019-05-01 14:04 UTC (permalink / raw)
  To: cgroups, Tejun Heo, Li Zefan, Johannes Weiner, linux-mm,
	Michal Hocko, Vladimir Davydov, dri-devel, David Airlie,
	Daniel Vetter, intel-gfx, Jani Nikula, Joonas Lahtinen,
	Rodrigo Vivi, Christian König, Alex Deucher, ChunMing Zhou,
	Jérôme Glisse

In containerized or virtualized environments, there is a desire to have
controls in place for resources that can be consumed by users of a GPU
device.  For this purpose, we extend control groups with a mechanism
for device drivers to register with cgroup subsystems.
Device drivers (GPU or other) are then able to reuse the existing cgroup
controls, instead of inventing similar ones.

A new framework is proposed to allow devices to register with existing
cgroup controllers, which creates a per-device cgroup_subsys_state within
the cgroup.  This gives device drivers their own private cgroup controls
(such as memory limits or other parameters) to be applied to device
resources instead of host system resources.

It is exposed in cgroup filesystem as:
  mount/<cgroup_name>/<subsys_name>.devices/<dev_name>/
such as (for example):
  mount/<cgroup_name>/memory.devices/<dev_name>/memory.max
  mount/<cgroup_name>/memory.devices/<dev_name>/memory.current
  mount/<cgroup_name>/cpu.devices/<dev_name>/cpu.stat

The creation of the above files is implemented in css_populate_dir() for
cgroup subsystems that have enabled per-device support.
The files are created either at the time of cgroup creation (for devices
already registered) or when a device driver registers its device via
cgroup_device_register().  cgroup_device_unregister() removes the files
from all current cgroups.
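
For reference, a driver-side sketch of the intended registration flow.
The foo_* names are hypothetical, and a controller only accepts this once
it sets .allow_devices (done for memcg later in this series, along with an
exported wrapper that drivers would normally call):

  static unsigned long foo_dev_id;

  static int foo_probe(struct device *dev)
  {
          /* expose per-device files in each existing and future cgroup */
          return cgroup_device_register(&memory_cgrp_subsys, dev, &foo_dev_id);
  }

  static void foo_remove(struct device *dev)
  {
          /* remove per-device files from all current cgroups */
          cgroup_device_unregister(&memory_cgrp_subsys, foo_dev_id);
  }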

Cc: cgroups@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: dri-devel@lists.freedesktop.org
Cc: Matt Roper <matthew.d.roper@intel.com>
Signed-off-by: Brian Welty <brian.welty@intel.com>
---
 include/linux/cgroup-defs.h |  28 ++++
 include/linux/cgroup.h      |   3 +
 kernel/cgroup/cgroup.c      | 270 ++++++++++++++++++++++++++++++++++--
 3 files changed, 289 insertions(+), 12 deletions(-)

diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index 1c70803e9f77..aeaab420e349 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -162,6 +162,17 @@ struct cgroup_subsys_state {
 	struct work_struct destroy_work;
 	struct rcu_work destroy_rwork;
 
+	/*
+	 * Per-device state for devices registered with our subsys.
+	 * @device_css_idr stores pointers to per-device cgroup_subsys_state,
+	 * created when devices are associated with this css.
+	 * @device_kn is the kernfs node for this cgroup's .device sub-directory
+	 * or for a per-device sub-directory (<subsys_name>.device/<dev_name>).
+	 */
+	struct device *device;
+	struct idr device_css_idr;
+	struct kernfs_node *device_kn;
+
 	/*
 	 * PI: the parent css.	Placed here for cache proximity to following
 	 * fields of the containing structure.
@@ -589,6 +600,9 @@ struct cftype {
  */
 struct cgroup_subsys {
 	struct cgroup_subsys_state *(*css_alloc)(struct cgroup_subsys_state *parent_css);
+	struct cgroup_subsys_state *(*device_css_alloc)(struct device *device,
+							struct cgroup_subsys_state *cgroup_css,
+							struct cgroup_subsys_state *parent_device_css);
 	int (*css_online)(struct cgroup_subsys_state *css);
 	void (*css_offline)(struct cgroup_subsys_state *css);
 	void (*css_released)(struct cgroup_subsys_state *css);
@@ -636,6 +650,13 @@ struct cgroup_subsys {
 	 */
 	bool threaded:1;
 
+	/*
+	 * If %true, the controller allows device drivers to register with
+	 * this controller, cloning the cgroup functionality into per-device
+	 * cgroup state under <subsys_name>.device/<dev_name>/.
+	 */
+	bool allow_devices:1;
+
 	/*
 	 * If %false, this subsystem is properly hierarchical -
 	 * configuration, resource accounting and restriction on a parent
@@ -664,6 +685,13 @@ struct cgroup_subsys {
 	/* idr for css->id */
 	struct idr css_idr;
 
+	/*
+	 * IDR of registered devices, allows subsys_state to have state
+	 * for each device. Exposed as per-device entries in filesystem,
+	 * under <subsys_name>.device/<dev_name>/.
+	 */
+	struct idr device_idr;
+
 	/*
 	 * List of cftypes.  Each entry is the first entry of an array
 	 * terminated by zero length name.
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 81f58b4a5418..3531bf948703 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -116,6 +116,9 @@ int cgroupstats_build(struct cgroupstats *stats, struct dentry *dentry);
 int proc_cgroup_show(struct seq_file *m, struct pid_namespace *ns,
 		     struct pid *pid, struct task_struct *tsk);
 
+int cgroup_device_register(struct cgroup_subsys *ss, struct device *dev,
+			   unsigned long *dev_id);
+void cgroup_device_unregister(struct cgroup_subsys *ss, unsigned long dev_id);
 void cgroup_fork(struct task_struct *p);
 extern int cgroup_can_fork(struct task_struct *p);
 extern void cgroup_cancel_fork(struct task_struct *p);
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 3f2b4bde0f9c..9b035e728941 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -598,6 +598,8 @@ struct cgroup_subsys_state *of_css(struct kernfs_open_file *of)
 	struct cgroup *cgrp = of->kn->parent->priv;
 	struct cftype *cft = of_cft(of);
 
+	/* FIXME this needs updating to lookup device-specific CSS */
+
 	/*
 	 * This is open and unprotected implementation of cgroup_css().
 	 * seq_css() is only called from a kernfs file operation which has
@@ -1583,14 +1585,15 @@ struct cgroup *cgroup_kn_lock_live(struct kernfs_node *kn, bool drain_offline)
 	return NULL;
 }
 
-static void cgroup_rm_file(struct cgroup *cgrp, const struct cftype *cft)
+static void cgroup_rm_file(struct cgroup_subsys_state *css, struct cgroup *cgrp,
+			   const struct cftype *cft)
 {
 	char name[CGROUP_FILE_NAME_MAX];
+	struct kernfs_node *dest_kn;
 
 	lockdep_assert_held(&cgroup_mutex);
 
 	if (cft->file_offset) {
-		struct cgroup_subsys_state *css = cgroup_css(cgrp, cft->ss);
 		struct cgroup_file *cfile = (void *)css + cft->file_offset;
 
 		spin_lock_irq(&cgroup_file_kn_lock);
@@ -1600,6 +1603,7 @@ static void cgroup_rm_file(struct cgroup *cgrp, const struct cftype *cft)
 		del_timer_sync(&cfile->notify_timer);
 	}
 
-	kernfs_remove_by_name(cgrp->kn, cgroup_file_name(cgrp, cft, name));
+	dest_kn = (css->device) ? css->device_kn : cgrp->kn;
+	kernfs_remove_by_name(dest_kn, cgroup_file_name(cgrp, cft, name));
 }
 
@@ -1630,10 +1634,49 @@ static void css_clear_dir(struct cgroup_subsys_state *css)
 	}
 }
 
+static int cgroup_device_mkdir(struct cgroup_subsys_state *css)
+{
+	struct cgroup_subsys_state *device_css;
+	struct cgroup *cgrp = css->cgroup;
+	char name[CGROUP_FILE_NAME_MAX];
+	struct kernfs_node *kn;
+	int ret, dev_id;
+
+	/* create subsys.device only if enabled in subsys and non-root cgroup */
+	if (!css->ss->allow_devices || !cgroup_parent(cgrp))
+		return 0;
+
+	ret = strlcpy(name, css->ss->name, CGROUP_FILE_NAME_MAX);
+	ret += strlcat(name, ".device", CGROUP_FILE_NAME_MAX);
+	/* treat as non-error if truncation due to subsys name */
+	if (WARN_ON_ONCE(ret >= CGROUP_FILE_NAME_MAX))
+		return 0;
+
+	kn = kernfs_create_dir(cgrp->kn, name, cgrp->kn->mode, cgrp);
+	if (IS_ERR(kn))
+		return PTR_ERR(kn);
+	css->device_kn = kn;
+
+	/* create a subdirectory for each registered device */
+	idr_for_each_entry(&css->device_css_idr, device_css, dev_id) {
+		/* FIXME: prefix dev_name with bus_name for uniqueness? */
+		kn = kernfs_create_dir(css->device_kn,
+				       dev_name(device_css->device),
+				       cgrp->kn->mode, cgrp);
+		if (IS_ERR(kn))
+			return PTR_ERR(kn);
+		/* FIXME: kernfs_get needed here? */
+		device_css->device_kn = kn;
+	}
+
+	return 0;
+}
+
 /**
  * css_populate_dir - create subsys files in a cgroup directory
  * @css: target css
  *
+ * Creates per-device directories if enabled in subsys.
  * On failure, no file is added.
  */
 static int css_populate_dir(struct cgroup_subsys_state *css)
@@ -1655,6 +1698,10 @@ static int css_populate_dir(struct cgroup_subsys_state *css)
 		if (ret < 0)
 			return ret;
 	} else {
+		ret = cgroup_device_mkdir(css);
+		if (ret < 0)
+			return ret;
+
 		list_for_each_entry(cfts, &css->ss->cfts, node) {
 			ret = cgroup_addrm_files(css, cgrp, cfts, true);
 			if (ret < 0) {
@@ -1673,6 +1720,7 @@ static int css_populate_dir(struct cgroup_subsys_state *css)
 			break;
 		cgroup_addrm_files(css, cgrp, cfts, false);
 	}
+	/* FIXME: per-device files will be removed by kernfs_destroy_root? */
 	return ret;
 }
 
@@ -3665,14 +3713,15 @@ static int cgroup_add_file(struct cgroup_subsys_state *css, struct cgroup *cgrp,
 			   struct cftype *cft)
 {
 	char name[CGROUP_FILE_NAME_MAX];
-	struct kernfs_node *kn;
+	struct kernfs_node *kn, *dest_kn;
 	struct lock_class_key *key = NULL;
 	int ret;
 
 #ifdef CONFIG_DEBUG_LOCK_ALLOC
 	key = &cft->lockdep_key;
 #endif
-	kn = __kernfs_create_file(cgrp->kn, cgroup_file_name(cgrp, cft, name),
+	dest_kn = (css->device) ? css->device_kn : cgrp->kn;
+	kn = __kernfs_create_file(dest_kn, cgroup_file_name(cgrp, cft, name),
 				  cgroup_file_mode(cft),
 				  GLOBAL_ROOT_UID, GLOBAL_ROOT_GID,
 				  0, cft->kf_ops, cft,
@@ -3709,15 +3758,13 @@ static int cgroup_add_file(struct cgroup_subsys_state *css, struct cgroup *cgrp,
  * Depending on @is_add, add or remove files defined by @cfts on @cgrp.
  * For removals, this function never fails.
  */
-static int cgroup_addrm_files(struct cgroup_subsys_state *css,
-			      struct cgroup *cgrp, struct cftype cfts[],
-			      bool is_add)
+static int __cgroup_addrm_files(struct cgroup_subsys_state *css,
+				struct cgroup *cgrp, struct cftype cfts[],
+				bool is_add)
 {
 	struct cftype *cft, *cft_end = NULL;
 	int ret = 0;
 
-	lockdep_assert_held(&cgroup_mutex);
-
 restart:
 	for (cft = cfts; cft != cft_end && cft->name[0] != '\0'; cft++) {
 		/* does cft->flags tell us to skip this file on @cgrp? */
@@ -3741,12 +3788,43 @@ static int cgroup_addrm_files(struct cgroup_subsys_state *css,
 				goto restart;
 			}
 		} else {
-			cgroup_rm_file(cgrp, cft);
+			cgroup_rm_file(css, cgrp, cft);
 		}
 	}
 	return ret;
 }
 
+static int cgroup_addrm_files(struct cgroup_subsys_state *css,
+			      struct cgroup *cgrp, struct cftype cfts[],
+			      bool is_add)
+{
+	struct cgroup_subsys_state *device_css, *device_css_end = NULL;
+	int dev_id, ret, err = 0;
+
+	lockdep_assert_held(&cgroup_mutex);
+restart:
+	ret = __cgroup_addrm_files(css, cgrp, cfts, is_add);
+	if (ret)
+		return ret;
+
+	/* repeat addrm for each device */
+	idr_for_each_entry(&css->device_css_idr, device_css, dev_id) {
+		if (device_css == device_css_end)
+			break;
+		ret = __cgroup_addrm_files(device_css, cgrp, cfts, is_add);
+		if (ret && !is_add) {
+			return ret;
+		} else if (ret) {
+			is_add = false;
+			device_css_end = device_css;
+			err = ret;
+			goto restart;
+		}
+	}
+
+	return err;
+}
+
 static int cgroup_apply_cftypes(struct cftype *cfts, bool is_add)
 {
 	struct cgroup_subsys *ss = cfts[0].ss;
@@ -4711,9 +4789,14 @@ static void css_free_rwork_fn(struct work_struct *work)
 
 	if (ss) {
 		/* css free path */
-		struct cgroup_subsys_state *parent = css->parent;
-		int id = css->id;
+		struct cgroup_subsys_state *device_css, *parent = css->parent;
+		int dev_id, id = css->id;
 
+		idr_for_each_entry(&css->device_css_idr, device_css, dev_id) {
+			css_put(device_css->parent);
+			ss->css_free(device_css);
+		}
+		idr_destroy(&css->device_css_idr);
 		ss->css_free(css);
 		cgroup_idr_remove(&ss->css_idr, id);
 		cgroup_put(cgrp);
@@ -4833,6 +4916,7 @@ static void init_and_link_css(struct cgroup_subsys_state *css,
 	INIT_LIST_HEAD(&css->rstat_css_node);
 	css->serial_nr = css_serial_nr_next++;
 	atomic_set(&css->online_cnt, 0);
+	idr_init(&css->device_css_idr);
 
 	if (cgroup_parent(cgrp)) {
 		css->parent = cgroup_css(cgroup_parent(cgrp), ss);
@@ -4885,6 +4969,79 @@ static void offline_css(struct cgroup_subsys_state *css)
 	wake_up_all(&css->cgroup->offline_waitq);
 }
 
+/*
+ * Associates a device with a css.
+ * Creates a new device-specific css and inserts it into @css->device_css_idr.
+ * Acquires a reference on @css, which is released when the device is
+ * dissociated from this css.
+ */
+static int cgroup_add_device(struct cgroup_subsys_state *css,
+			     struct device *dev, int dev_id)
+{
+	struct cgroup_subsys *ss = css->ss;
+	struct cgroup_subsys_state *dev_css, *dev_parent_css;
+	int err;
+
+	lockdep_assert_held(&cgroup_mutex);
+
+	/* don't add devices at root cgroup level */
+	if (!css->parent)
+		return -EINVAL;
+
+	dev_parent_css = idr_find(&css->parent->device_css_idr, dev_id);
+	dev_css = ss->device_css_alloc(dev, css, dev_parent_css);
+	if (IS_ERR_OR_NULL(dev_css)) {
+		if (!dev_css)
+			return -ENOMEM;
+		if (IS_ERR(dev_css))
+			return PTR_ERR(dev_css);
+	}
+
+	/* store per-device css pointer in the cgroup's css */
+	err = idr_alloc(&css->device_css_idr, dev_css, dev_id,
+			dev_id + 1, GFP_KERNEL);
+	if (err < 0) {
+		ss->css_free(dev_css);
+		return err;
+	}
+
+	dev_css->device = dev;
+	dev_css->parent = dev_parent_css;
+	/*
+	 * subsys per-device support is allowed to access cgroup subsys_state
+	 * using cgroup.self.  Increment reference on css so it remains valid
+	 * as long as device is associated with it.
+	 */
+	dev_css->cgroup = css->cgroup;
+	dev_css->ss = css->ss;
+	css_get(css);
+
+	return 0;
+}
+
+/*
+ * For a new cgroup css, create a device-specific css for each device
+ * that has registered itself with the subsys.
+ */
+static int cgroup_add_devices(struct cgroup_subsys_state *css)
+{
+	struct device *dev;
+	int dev_id, err = 0;
+
+	/* ignore adding devices for root cgroups */
+	if (!css->parent)
+		return 0;
+
+	/* create per-device css for each associated device */
+	idr_for_each_entry(&css->ss->device_idr, dev, dev_id) {
+		err = cgroup_add_device(css, dev, dev_id);
+		if (err)
+			break;
+	}
+
+	return err;
+}
+
 /**
  * css_create - create a cgroup_subsys_state
  * @cgrp: the cgroup new css will be associated with
@@ -4921,6 +5078,10 @@ static struct cgroup_subsys_state *css_create(struct cgroup *cgrp,
 		goto err_free_css;
 	css->id = err;
 
+	err = cgroup_add_devices(css);
+	if (err)
+		goto err_free_css;
+
 	/* @css is ready to be brought online now, make it visible */
 	list_add_tail_rcu(&css->sibling, &parent_css->children);
 	cgroup_idr_replace(&ss->css_idr, css, css->id);
@@ -5337,6 +5498,7 @@ static void __init cgroup_init_subsys(struct cgroup_subsys *ss, bool early)
 	mutex_lock(&cgroup_mutex);
 
 	idr_init(&ss->css_idr);
+	idr_init(&ss->device_idr);
 	INIT_LIST_HEAD(&ss->cfts);
 
 	/* Create the root cgroup state for this subsystem */
@@ -5637,6 +5799,90 @@ int proc_cgroup_show(struct seq_file *m, struct pid_namespace *ns,
 	return retval;
 }
 
+void cgroup_device_unregister(struct cgroup_subsys *ss, unsigned long dev_id)
+{
+	struct cgroup_subsys_state *css, *device_css;
+	int css_id;
+
+	if (!ss->allow_devices)
+		return;
+
+	mutex_lock(&cgroup_mutex);
+	idr_for_each_entry(&ss->css_idr, css, css_id) {
+		WARN_ON(css->device);
+		if (!css->parent || css->device)
+			continue;
+		device_css = idr_remove(&css->device_css_idr, dev_id);
+		if (device_css) {
+			/* FIXME kernfs_get/put needed to make safe? */
+			if (device_css->device_kn)
+				kernfs_remove(device_css->device_kn);
+			css_put(device_css->parent);
+			ss->css_free(device_css);
+		}
+	}
+	idr_remove(&ss->device_idr, dev_id);
+	mutex_unlock(&cgroup_mutex);
+}
+
+/**
+ * cgroup_device_register - associate a struct device with @ss
+ * @ss: the subsystem of interest
+ * @dev: the device of interest
+ * @dev_id: returned index into the @ss device idr
+ *
+ * Insert @dev into the set of devices associated with this subsystem.
+ * As cgroups are created, subdirectories <subsys_name>.device/<dev_name>/
+ * will be created within the cgroup filesystem.  Device drivers can then
+ * have this subsystem's controls applied to per-device resources by use of
+ * a private cgroup_subsys_state.
+ */
+int cgroup_device_register(struct cgroup_subsys *ss, struct device *dev,
+			   unsigned long *dev_id)
+{
+	struct cgroup_subsys_state *css;
+	int css_id, id, err = 0;
+
+	if (!ss->allow_devices)
+		return -EACCES;
+
+	mutex_lock(&cgroup_mutex);
+
+	id = idr_alloc_cyclic(&ss->device_idr, dev, 0, 0, GFP_KERNEL);
+	if (id < 0) {
+		mutex_unlock(&cgroup_mutex);
+		return id;
+	}
+
+	idr_for_each_entry(&ss->css_idr, css, css_id) {
+		WARN_ON(css->device);
+		if (!css->parent || css->device)
+			continue;
+		err = cgroup_add_device(css, dev, id);
+		if (err)
+			break;
+
+		if (css_visible(css)) {
+			/* FIXME - something more lightweight can be done? */
+			css_clear_dir(css);
+			/* FIXME kernfs_get/put needed to make safe? */
+			kernfs_remove(css->device_kn);
+			err = css_populate_dir(css);
+			if (err)
+				/* FIXME handle error case */
+				err = 0;
+			else
+				kernfs_activate(css->cgroup->kn);
+		}
+	}
+
+	if (!err)
+		*dev_id = id;
+	mutex_unlock(&cgroup_mutex);
+
+	return err;
+}
+
 /**
  * cgroup_fork - initialize cgroup related fields during copy_process()
  * @child: pointer to task_struct of forking parent process.
-- 
2.21.0



* [RFC PATCH 2/5] cgroup: Change kernfs_node for directories to store cgroup_subsys_state
  2019-05-01 14:04 [RFC PATCH 0/5] cgroup support for GPU devices Brian Welty
  2019-05-01 14:04 ` [RFC PATCH 1/5] cgroup: Add cgroup_subsys per-device registration framework Brian Welty
@ 2019-05-01 14:04 ` Brian Welty
  2019-05-01 14:04 ` [RFC PATCH 3/5] memcg: Add per-device support to memory cgroup subsystem Brian Welty
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 19+ messages in thread
From: Brian Welty @ 2019-05-01 14:04 UTC (permalink / raw)
  To: cgroups, Tejun Heo, Li Zefan, Johannes Weiner, linux-mm,
	Michal Hocko, Vladimir Davydov, dri-devel, David Airlie,
	Daniel Vetter, intel-gfx, Jani Nikula, Joonas Lahtinen,
	Rodrigo Vivi, Christian König, Alex Deucher, ChunMing Zhou,
	Jérôme Glisse

Change kernfs_node.priv for directories to store the cgroup_subsys_state
(CSS) pointer instead of the cgroup pointer.  This is done in order
to support files within the cgroup associated with devices.  We require
of_css() to return the device-specific CSS pointer for these files.

Cc: cgroups@vger.kernel.org
Signed-off-by: Brian Welty <brian.welty@intel.com>
---
 kernel/cgroup/cgroup-v1.c | 10 ++++----
 kernel/cgroup/cgroup.c    | 48 +++++++++++++++++----------------------
 2 files changed, 27 insertions(+), 31 deletions(-)

diff --git a/kernel/cgroup/cgroup-v1.c b/kernel/cgroup/cgroup-v1.c
index c126b34fd4ff..4fa56cc2b99c 100644
--- a/kernel/cgroup/cgroup-v1.c
+++ b/kernel/cgroup/cgroup-v1.c
@@ -723,6 +723,7 @@ int proc_cgroupstats_show(struct seq_file *m, void *v)
 int cgroupstats_build(struct cgroupstats *stats, struct dentry *dentry)
 {
 	struct kernfs_node *kn = kernfs_node_from_dentry(dentry);
+	struct cgroup_subsys_state *css;
 	struct cgroup *cgrp;
 	struct css_task_iter it;
 	struct task_struct *tsk;
@@ -740,12 +741,13 @@ int cgroupstats_build(struct cgroupstats *stats, struct dentry *dentry)
 	 * @kn->priv is RCU safe.  Let's do the RCU dancing.
 	 */
 	rcu_read_lock();
-	cgrp = rcu_dereference(*(void __rcu __force **)&kn->priv);
-	if (!cgrp || cgroup_is_dead(cgrp)) {
+	css = rcu_dereference(*(void __rcu __force **)&kn->priv);
+	if (!css || cgroup_is_dead(css->cgroup)) {
 		rcu_read_unlock();
 		mutex_unlock(&cgroup_mutex);
 		return -ENOENT;
 	}
+	cgrp = css->cgroup;
 	rcu_read_unlock();
 
 	css_task_iter_start(&cgrp->self, 0, &it);
@@ -851,7 +853,7 @@ void cgroup1_release_agent(struct work_struct *work)
 static int cgroup1_rename(struct kernfs_node *kn, struct kernfs_node *new_parent,
 			  const char *new_name_str)
 {
-	struct cgroup *cgrp = kn->priv;
+	struct cgroup_subsys_state *css = kn->priv;
 	int ret;
 
 	if (kernfs_type(kn) != KERNFS_DIR)
@@ -871,7 +873,7 @@ static int cgroup1_rename(struct kernfs_node *kn, struct kernfs_node *new_parent
 
 	ret = kernfs_rename(kn, new_parent, new_name_str);
 	if (!ret)
-		TRACE_CGROUP_PATH(rename, cgrp);
+		TRACE_CGROUP_PATH(rename, css->cgroup);
 
 	mutex_unlock(&cgroup_mutex);
 
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 9b035e728941..1fe4fee502ea 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -595,12 +595,13 @@ static void cgroup_get_live(struct cgroup *cgrp)
 
 struct cgroup_subsys_state *of_css(struct kernfs_open_file *of)
 {
-	struct cgroup *cgrp = of->kn->parent->priv;
+	struct cgroup_subsys_state *css = of->kn->parent->priv;
 	struct cftype *cft = of_cft(of);
 
-	/* FIXME this needs updating to lookup device-specific CSS */
-
 	/*
+	 * If the cft specifies a subsys and this is not a device file,
+	 * then lookup the css, otherwise it is already correct.
+	 *
 	 * This is open and unprotected implementation of cgroup_css().
 	 * seq_css() is only called from a kernfs file operation which has
 	 * an active reference on the file.  Because all the subsystem
@@ -608,10 +609,9 @@ struct cgroup_subsys_state *of_css(struct kernfs_open_file *of)
 	 * the matching css from the cgroup's subsys table is guaranteed to
 	 * be and stay valid until the enclosing operation is complete.
 	 */
-	if (cft->ss)
-		return rcu_dereference_raw(cgrp->subsys[cft->ss->id]);
-	else
-		return &cgrp->self;
+	if (cft->ss && !css->device)
+		css = rcu_dereference_raw(css->cgroup->subsys[cft->ss->id]);
+	return css;
 }
 EXPORT_SYMBOL_GPL(of_css);
 
@@ -1524,12 +1524,14 @@ static u16 cgroup_calc_subtree_ss_mask(u16 subtree_control, u16 this_ss_mask)
  */
 void cgroup_kn_unlock(struct kernfs_node *kn)
 {
+	struct cgroup_subsys_state *css;
 	struct cgroup *cgrp;
 
 	if (kernfs_type(kn) == KERNFS_DIR)
-		cgrp = kn->priv;
+		css = kn->priv;
 	else
-		cgrp = kn->parent->priv;
+		css = kn->parent->priv;
+	cgrp = css->cgroup;
 
 	mutex_unlock(&cgroup_mutex);
 
@@ -1556,12 +1558,14 @@ void cgroup_kn_unlock(struct kernfs_node *kn)
  */
 struct cgroup *cgroup_kn_lock_live(struct kernfs_node *kn, bool drain_offline)
 {
+	struct cgroup_subsys_state *css;
 	struct cgroup *cgrp;
 
 	if (kernfs_type(kn) == KERNFS_DIR)
-		cgrp = kn->priv;
+		css = kn->priv;
 	else
-		cgrp = kn->parent->priv;
+		css = kn->parent->priv;
+	cgrp = css->cgroup;
 
 	/*
 	 * We're gonna grab cgroup_mutex which nests outside kernfs
@@ -1652,7 +1656,7 @@ static int cgroup_device_mkdir(struct cgroup_subsys_state *css)
 	if (WARN_ON_ONCE(ret >= CGROUP_FILE_NAME_MAX))
 		return 0;
 
-	kn = kernfs_create_dir(cgrp->kn, name, cgrp->kn->mode, cgrp);
+	kn = kernfs_create_dir(cgrp->kn, name, cgrp->kn->mode, css);
 	if (IS_ERR(kn))
 		return PTR_ERR(kn);
 	css->device_kn = kn;
@@ -1662,7 +1666,7 @@ static int cgroup_device_mkdir(struct cgroup_subsys_state *css)
 		/* FIXME: prefix dev_name with bus_name for uniqueness? */
 		kn = kernfs_create_dir(css->device_kn,
 				       dev_name(device_css->device),
-				       cgrp->kn->mode, cgrp);
+				       cgrp->kn->mode, device_css);
 		if (IS_ERR(kn))
 			return PTR_ERR(kn);
 		/* FIXME: kernfs_get needed here? */
@@ -2025,7 +2029,7 @@ int cgroup_setup_root(struct cgroup_root *root, u16 ss_mask)
 	root->kf_root = kernfs_create_root(kf_sops,
 					   KERNFS_ROOT_CREATE_DEACTIVATED |
 					   KERNFS_ROOT_SUPPORT_EXPORTOP,
-					   root_cgrp);
+					   &root_cgrp->self);
 	if (IS_ERR(root->kf_root)) {
 		ret = PTR_ERR(root->kf_root);
 		goto exit_root_id;
@@ -3579,9 +3583,9 @@ static ssize_t cgroup_file_write(struct kernfs_open_file *of, char *buf,
 				 size_t nbytes, loff_t off)
 {
 	struct cgroup_namespace *ns = current->nsproxy->cgroup_ns;
-	struct cgroup *cgrp = of->kn->parent->priv;
+	struct cgroup_subsys_state *css = of_css(of);
 	struct cftype *cft = of->kn->priv;
-	struct cgroup_subsys_state *css;
+	struct cgroup *cgrp = css->cgroup;
 	int ret;
 
 	/*
@@ -3598,16 +3602,6 @@ static ssize_t cgroup_file_write(struct kernfs_open_file *of, char *buf,
 	if (cft->write)
 		return cft->write(of, buf, nbytes, off);
 
-	/*
-	 * kernfs guarantees that a file isn't deleted with operations in
-	 * flight, which means that the matching css is and stays alive and
-	 * doesn't need to be pinned.  The RCU locking is not necessary
-	 * either.  It's just for the convenience of using cgroup_css().
-	 */
-	rcu_read_lock();
-	css = cgroup_css(cgrp, cft->ss);
-	rcu_read_unlock();
-
 	if (cft->write_u64) {
 		unsigned long long v;
 		ret = kstrtoull(buf, 0, &v);
@@ -5262,7 +5256,7 @@ int cgroup_mkdir(struct kernfs_node *parent_kn, const char *name, umode_t mode)
 	}
 
 	/* create the directory */
-	kn = kernfs_create_dir(parent->kn, name, mode, cgrp);
+	kn = kernfs_create_dir(parent->kn, name, mode, &cgrp->self);
 	if (IS_ERR(kn)) {
 		ret = PTR_ERR(kn);
 		goto out_destroy;
-- 
2.21.0



* [RFC PATCH 3/5] memcg: Add per-device support to memory cgroup subsystem
  2019-05-01 14:04 [RFC PATCH 0/5] cgroup support for GPU devices Brian Welty
  2019-05-01 14:04 ` [RFC PATCH 1/5] cgroup: Add cgroup_subsys per-device registration framework Brian Welty
  2019-05-01 14:04 ` [RFC PATCH 2/5] cgroup: Change kernfs_node for directories to store cgroup_subsys_state Brian Welty
@ 2019-05-01 14:04 ` Brian Welty
  2019-05-01 14:04 ` [RFC PATCH 4/5] drm: Add memory cgroup registration and DRIVER_CGROUPS feature bit Brian Welty
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 19+ messages in thread
From: Brian Welty @ 2019-05-01 14:04 UTC (permalink / raw)
  To: cgroups, Tejun Heo, Li Zefan, Johannes Weiner, linux-mm,
	Michal Hocko, Vladimir Davydov, dri-devel, David Airlie,
	Daniel Vetter, intel-gfx, Jani Nikula, Joonas Lahtinen,
	Rodrigo Vivi, Christian König, Alex Deucher, ChunMing Zhou,
	Jérôme Glisse

Here we update memory cgroup to enable the newly introduced per-device
framework.  As mentioned in the prior patch, the intent here is to allow
drivers to have their own private cgroup controls (such as memory limit)
to be applied to device resources instead of host system resources.

In summary, to enable device registration for the memory cgroup subsystem:
  *  set .allow_devices to true
  *  add new exported device register and device unregister functions
     to register a device with the cgroup subsystem
  *  implement the .device_css_alloc callback to create a device-specific
     cgroup_subsys_state within a cgroup

As cgroups are created, one will see these additional files in the cgroup
filesystem for each currently registered device:
  mount/<cgroup_name>/memory.devices/<dev_name>/<mem_cgrp attributes>

Registration of a new device is performed by device drivers using the new
mem_cgroup_device_register().  This will create the above files in
existing cgroups.

And for runtime charging to the cgroup, we add the following:
  *  add a new routine to look up the device-specific cgroup_subsys_state
     within the task's cgroup (mem_cgroup_device_from_task)
  *  add new functions for device-specific 'direct' charging

The last point above involves adding the new mem_cgroup_try_charge_direct
and mem_cgroup_uncharge_direct functions.  The 'direct' name indicates
that we charge the specified cgroup state directly, without using any
associated page or mm_struct.  These are called from device-specific
memory management routines, where the device driver tracks which cgroup
to charge within its own private data structures.

With this initial submission, support for memory accounting and charging
is functional.  Nested cgroups correctly maintain the parent for
device-specific state as well, such that hierarchical charging to the
per-device files is supported.
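
For reference, a sketch of the driver-side charging flow; struct foo_obj
and the foo_* functions are hypothetical, and only the mem_cgroup_* calls
are provided by this patch:

  struct foo_obj {
          struct mem_cgroup *memcg;       /* device-specific memcg to charge */
          unsigned long nr_pages;
  };

  static int foo_alloc_device_memory(struct foo_obj *obj, unsigned long dev_id,
                                     unsigned long nr_pages)
  {
          int err;

          /* device-specific memcg of the allocating task (may be NULL) */
          obj->memcg = mem_cgroup_device_from_task(dev_id, current);

          err = mem_cgroup_try_charge_direct(obj->memcg, nr_pages);
          if (err)
                  return err;  /* over the device memory.max; caller may reclaim */

          /* ... allocate nr_pages of device memory here ... */
          obj->nr_pages = nr_pages;
          return 0;
  }

  static void foo_free_device_memory(struct foo_obj *obj)
  {
          /* ... free the device memory, then undo the charge ... */
          mem_cgroup_uncharge_direct(obj->memcg, obj->nr_pages);
  }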

Cc: cgroups@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: dri-devel@lists.freedesktop.org
Cc: Matt Roper <matthew.d.roper@intel.com>
Signed-off-by: Brian Welty <brian.welty@intel.com>
---
 include/linux/memcontrol.h |  10 ++
 mm/memcontrol.c            | 183 ++++++++++++++++++++++++++++++++++---
 2 files changed, 178 insertions(+), 15 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index dbb6118370c1..711669b613dc 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -348,6 +348,11 @@ void mem_cgroup_cancel_charge(struct page *page, struct mem_cgroup *memcg,
 		bool compound);
 void mem_cgroup_uncharge(struct page *page);
 void mem_cgroup_uncharge_list(struct list_head *page_list);
+/* direct charging to mem_cgroup is primarily for device driver usage */
+int mem_cgroup_try_charge_direct(struct mem_cgroup *memcg,
+				 unsigned long nr_pages);
+void mem_cgroup_uncharge_direct(struct mem_cgroup *memcg,
+				unsigned long nr_pages);
 
 void mem_cgroup_migrate(struct page *oldpage, struct page *newpage);
 
@@ -395,6 +400,11 @@ struct lruvec *mem_cgroup_page_lruvec(struct page *, struct pglist_data *);
 bool task_in_mem_cgroup(struct task_struct *task, struct mem_cgroup *memcg);
 struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p);
 
+struct mem_cgroup *mem_cgroup_device_from_task(unsigned long id,
+					       struct task_struct *p);
+int mem_cgroup_device_register(struct device *dev, unsigned long *dev_id);
+void mem_cgroup_device_unregister(unsigned long dev_id);
+
 struct mem_cgroup *get_mem_cgroup_from_mm(struct mm_struct *mm);
 
 struct mem_cgroup *get_mem_cgroup_from_page(struct page *page);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 81a0d3914ec9..2c8407aed0f5 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -823,6 +823,47 @@ struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p)
 }
 EXPORT_SYMBOL(mem_cgroup_from_task);
 
+int mem_cgroup_device_register(struct device *dev, unsigned long *dev_id)
+{
+	return cgroup_device_register(&memory_cgrp_subsys, dev, dev_id);
+}
+EXPORT_SYMBOL(mem_cgroup_device_register);
+
+void mem_cgroup_device_unregister(unsigned long dev_id)
+{
+	cgroup_device_unregister(&memory_cgrp_subsys, dev_id);
+}
+EXPORT_SYMBOL(mem_cgroup_device_unregister);
+
+/**
+ * mem_cgroup_device_from_task: Lookup device-specific memcg
+ * @id: device-specific id returned from mem_cgroup_device_register
+ * @p: task from which to look up the memcg
+ *
+ * Use mem_cgroup_from_task() to look up the memcg associated with
+ * task @p.  Within this memcg, find the device-specific one associated
+ * with @id.
+ * If mem_cgroup is disabled, NULL is returned.
+ */
+struct mem_cgroup *mem_cgroup_device_from_task(unsigned long id,
+					       struct task_struct *p)
+{
+	struct mem_cgroup *memcg;
+	struct mem_cgroup *dev_memcg = NULL;
+
+	if (mem_cgroup_disabled())
+		return NULL;
+
+	rcu_read_lock();
+	memcg  = mem_cgroup_from_task(p);
+	if (memcg)
+		dev_memcg = idr_find(&memcg->css.device_css_idr, id);
+	rcu_read_unlock();
+
+	return dev_memcg;
+}
+EXPORT_SYMBOL(mem_cgroup_device_from_task);
+
 /**
  * get_mem_cgroup_from_mm: Obtain a reference on given mm_struct's memcg.
  * @mm: mm from which memcg should be extracted. It can be NULL.
@@ -2179,13 +2220,31 @@ void mem_cgroup_handle_over_high(void)
 	current->memcg_nr_pages_over_high = 0;
 }
 
+static bool __try_charge(struct mem_cgroup *memcg, unsigned int nr_pages,
+			 struct mem_cgroup **mem_over_limit)
+{
+	struct page_counter *counter;
+
+	if (!do_memsw_account() ||
+	    page_counter_try_charge(&memcg->memsw, nr_pages, &counter)) {
+		if (page_counter_try_charge(&memcg->memory, nr_pages, &counter))
+			return true;
+		if (do_memsw_account())
+			page_counter_uncharge(&memcg->memsw, nr_pages);
+		*mem_over_limit = mem_cgroup_from_counter(counter, memory);
+	} else {
+		*mem_over_limit = mem_cgroup_from_counter(counter, memsw);
+	}
+
+	return false;
+}
+
 static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
 		      unsigned int nr_pages)
 {
 	unsigned int batch = max(MEMCG_CHARGE_BATCH, nr_pages);
 	int nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
 	struct mem_cgroup *mem_over_limit;
-	struct page_counter *counter;
 	unsigned long nr_reclaimed;
 	bool may_swap = true;
 	bool drained = false;
@@ -2198,17 +2257,10 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	if (consume_stock(memcg, nr_pages))
 		return 0;
 
-	if (!do_memsw_account() ||
-	    page_counter_try_charge(&memcg->memsw, batch, &counter)) {
-		if (page_counter_try_charge(&memcg->memory, batch, &counter))
-			goto done_restock;
-		if (do_memsw_account())
-			page_counter_uncharge(&memcg->memsw, batch);
-		mem_over_limit = mem_cgroup_from_counter(counter, memory);
-	} else {
-		mem_over_limit = mem_cgroup_from_counter(counter, memsw);
-		may_swap = false;
-	}
+	if (__try_charge(memcg, batch, &mem_over_limit))
+		goto done_restock;
+	else
+		may_swap = !do_memsw_account();
 
 	if (batch > nr_pages) {
 		batch = nr_pages;
@@ -2892,6 +2944,9 @@ static int mem_cgroup_force_empty(struct mem_cgroup *memcg)
 {
 	int nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
 
+	if (memcg->css.device)
+		return 0;
+
 	/* we call try-to-free pages for make this cgroup empty */
 	lru_add_drain_all();
 
@@ -4496,7 +4551,7 @@ static struct mem_cgroup *mem_cgroup_alloc(void)
 }
 
 static struct cgroup_subsys_state * __ref
-mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
+__mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css, bool is_device)
 {
 	struct mem_cgroup *parent = mem_cgroup_from_css(parent_css);
 	struct mem_cgroup *memcg;
@@ -4530,11 +4585,13 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
 		 * much sense so let cgroup subsystem know about this
 		 * unfortunate state in our controller.
 		 */
-		if (parent != root_mem_cgroup)
+		if (!is_device && parent != root_mem_cgroup)
 			memory_cgrp_subsys.broken_hierarchy = true;
 	}
 
-	/* The following stuff does not apply to the root */
+	/* The following stuff does not apply to devices or the root */
+	if (is_device)
+		return &memcg->css;
 	if (!parent) {
 		root_mem_cgroup = memcg;
 		return &memcg->css;
@@ -4554,6 +4611,34 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
 	return ERR_PTR(-ENOMEM);
 }
 
+static struct cgroup_subsys_state * __ref
+mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
+{
+	return __mem_cgroup_css_alloc(parent_css, false);
+}
+
+/*
+ * For the given @cgroup_css, create and return a new device-specific css.
+ *
+ * @device and @cgroup_css are unused here, but they are provided as other
+ * cgroup subsystems might require them.
+ */
+static struct cgroup_subsys_state * __ref
+mem_cgroup_device_css_alloc(struct device *device,
+			    struct cgroup_subsys_state *cgroup_css,
+			    struct cgroup_subsys_state *parent_device_css)
+{
+	/*
+	 * For hierarchical page counters to work correctly, we specify
+	 * parent here as the device-specific css from our parent css
+	 * (@parent_device_css).  In other words, for nested cgroups,
+	 * the device-specific charging structures are also nested.
+	 * Note that the caller will itself set .device and .parent in the
+	 * returned structure.
+	 */
+	return __mem_cgroup_css_alloc(parent_device_css, true);
+}
+
 static int mem_cgroup_css_online(struct cgroup_subsys_state *css)
 {
 	struct mem_cgroup *memcg = mem_cgroup_from_css(css);
@@ -4613,6 +4698,9 @@ static void mem_cgroup_css_free(struct cgroup_subsys_state *css)
 {
 	struct mem_cgroup *memcg = mem_cgroup_from_css(css);
 
+	if (css->device)
+		goto free_cgrp;
+
 	if (cgroup_subsys_on_dfl(memory_cgrp_subsys) && !cgroup_memory_nosocket)
 		static_branch_dec(&memcg_sockets_enabled_key);
 
@@ -4624,6 +4712,7 @@ static void mem_cgroup_css_free(struct cgroup_subsys_state *css)
 	mem_cgroup_remove_from_trees(memcg);
 	memcg_free_shrinker_maps(memcg);
 	memcg_free_kmem(memcg);
+free_cgrp:
 	mem_cgroup_free(memcg);
 }
 
@@ -5720,6 +5809,7 @@ static struct cftype memory_files[] = {
 
 struct cgroup_subsys memory_cgrp_subsys = {
 	.css_alloc = mem_cgroup_css_alloc,
+	.device_css_alloc = mem_cgroup_device_css_alloc,
 	.css_online = mem_cgroup_css_online,
 	.css_offline = mem_cgroup_css_offline,
 	.css_released = mem_cgroup_css_released,
@@ -5732,6 +5822,7 @@ struct cgroup_subsys memory_cgrp_subsys = {
 	.dfl_cftypes = memory_files,
 	.legacy_cftypes = mem_cgroup_legacy_files,
 	.early_init = 0,
+	.allow_devices = true,
 };
 
 /**
@@ -6031,6 +6122,68 @@ void mem_cgroup_cancel_charge(struct page *page, struct mem_cgroup *memcg,
 	cancel_charge(memcg, nr_pages);
 }
 
+/**
+ * mem_cgroup_try_charge_direct - try charging nr_pages to memcg
+ * @memcg: memcg to charge
+ * @nr_pages: number of pages to charge
+ *
+ * Try to charge @nr_pages to the specified @memcg.  This variant is intended
+ * for cases where the memcg is known and can be charged directly, with the
+ * primary use case being device drivers that have registered with the subsys.
+ * Device drivers that implement their own device-specific memory manager
+ * will use these direct charging functions to make charges against their
+ * device-private state (@memcg) within the cgroup.
+ *
+ * There is no separate mem_cgroup_commit_charge() in this use case, as the
+ * device driver is not using page structs. Reclaim is not needed internally
+ * here, as the caller can decide to attempt memory reclaim on error.
+ *
+ * Returns 0 on success.  Otherwise, an error code is returned.
+ *
+ * To uncharge (or cancel charge), call mem_cgroup_uncharge_direct().
+ */
+int mem_cgroup_try_charge_direct(struct mem_cgroup *memcg,
+				 unsigned long nr_pages)
+{
+	struct mem_cgroup *mem_over_limit;
+	int ret = 0;
+
+	if (!memcg || mem_cgroup_disabled() || mem_cgroup_is_root(memcg))
+		return 0;
+
+	if (__try_charge(memcg, nr_pages, &mem_over_limit)) {
+		css_get_many(&memcg->css, nr_pages);
+	} else {
+		memcg_memory_event(mem_over_limit, MEMCG_MAX);
+		ret = -ENOMEM;
+	}
+	return ret;
+}
+EXPORT_SYMBOL(mem_cgroup_try_charge_direct);
+
+/**
+ * mem_cgroup_uncharge_direct - uncharge nr_pages from memcg
+ * @memcg: memcg to uncharge
+ * @nr_pages: number of pages to uncharge
+ *
+ * Uncharge @nr_pages from the specified @memcg.  This variant is intended
+ * for cases where the memcg is known and can be uncharged directly, with the
+ * primary use case being device drivers that have registered with the subsys.
+ * Device drivers use these direct charging functions to make charges
+ * against their device-private state (@memcg) within the cgroup.
+ *
+ * If @memcg is NULL or mem_cgroup is disabled, this is a no-op.
+ */
+void mem_cgroup_uncharge_direct(struct mem_cgroup *memcg,
+				unsigned long nr_pages)
+{
+	if (!memcg || mem_cgroup_disabled())
+		return;
+
+	cancel_charge(memcg, nr_pages);
+}
+EXPORT_SYMBOL(mem_cgroup_uncharge_direct);
+
 struct uncharge_gather {
 	struct mem_cgroup *memcg;
 	unsigned long pgpgout;
-- 
2.21.0



* [RFC PATCH 4/5] drm: Add memory cgroup registration and DRIVER_CGROUPS feature bit
  2019-05-01 14:04 [RFC PATCH 0/5] cgroup support for GPU devices Brian Welty
                   ` (2 preceding siblings ...)
  2019-05-01 14:04 ` [RFC PATCH 3/5] memcg: Add per-device support to memory cgroup subsystem Brian Welty
@ 2019-05-01 14:04 ` Brian Welty
  2019-05-01 14:04 ` [RFC PATCH 5/5] drm/i915: Use memory cgroup for enforcing device memory limit Brian Welty
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 19+ messages in thread
From: Brian Welty @ 2019-05-01 14:04 UTC (permalink / raw)
  To: cgroups, Tejun Heo, Li Zefan, Johannes Weiner, linux-mm,
	Michal Hocko, Vladimir Davydov, dri-devel, David Airlie,
	Daniel Vetter, intel-gfx, Jani Nikula, Joonas Lahtinen,
	Rodrigo Vivi, Christian König, Alex Deucher, ChunMing Zhou,
	Jérôme Glisse

With new cgroups per-device framework, registration with memory cgroup
subsystem can allow us to enforce limit for allocation of device memory
against process cgroups.

This patch adds a new driver feature bit, DRIVER_CGROUPS, such that DRM
will register the device with cgroups.  Doing so allows device drivers to
charge memory allocations to device-specific state within the cgroup.

Note, this is only for GEM objects allocated from device memory.
Memory charging for GEM objects using system memory is already handled
by the mm subsystem charging the normal (non-device) memory cgroup.

To charge device memory allocations, we need to (1) identify the
appropriate cgroup to charge (currently decided at object creation time),
and (2) make the charging call at the time that memory pages are being
allocated.  This is one possible policy, and whether it is the right
choice is open for debate.

For (1), we associate the current task's cgroup with GEM objects as they
are created.  That cgroup will be charged/uncharged for all paging
activity against the GEM object.  Note that if the process is not part of
a memory cgroup, the lookup returns NULL and no charging will occur.
For shared objects, the charge may be made against a cgroup that is not
the same cgroup as the process using the memory.  Based on the memory
cgroup documentation's discussion of "memory ownership", this seems
acceptable [1].  For (2), the charging call is for device drivers to
implement within their page allocation logic.

[1] https://www.kernel.org/doc/Documentation/cgroup-v2.txt, "Memory Ownership"

Cc: cgroups@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: dri-devel@lists.freedesktop.org
Cc: Matt Roper <matthew.d.roper@intel.com>
Signed-off-by: Brian Welty <brian.welty@intel.com>
---
 drivers/gpu/drm/drm_drv.c | 12 ++++++++++++
 drivers/gpu/drm/drm_gem.c |  7 +++++++
 include/drm/drm_device.h  |  3 +++
 include/drm/drm_drv.h     |  8 ++++++++
 include/drm/drm_gem.h     | 11 +++++++++++
 5 files changed, 41 insertions(+)

diff --git a/drivers/gpu/drm/drm_drv.c b/drivers/gpu/drm/drm_drv.c
index 862621494a93..890bd3c0e63e 100644
--- a/drivers/gpu/drm/drm_drv.c
+++ b/drivers/gpu/drm/drm_drv.c
@@ -28,6 +28,7 @@
 
 #include <linux/debugfs.h>
 #include <linux/fs.h>
+#include <linux/memcontrol.h>
 #include <linux/module.h>
 #include <linux/moduleparam.h>
 #include <linux/mount.h>
@@ -987,6 +988,12 @@ int drm_dev_register(struct drm_device *dev, unsigned long flags)
 	if (ret)
 		goto err_minors;
 
+	if (dev->dev && drm_core_check_feature(dev, DRIVER_CGROUPS)) {
+		ret = mem_cgroup_device_register(dev->dev, &dev->memcg_id);
+		if (ret)
+			goto err_minors;
+	}
+
 	dev->registered = true;
 
 	if (dev->driver->load) {
@@ -1009,6 +1016,8 @@ int drm_dev_register(struct drm_device *dev, unsigned long flags)
 	goto out_unlock;
 
 err_minors:
+	if (dev->memcg_id)
+		mem_cgroup_device_unregister(dev->memcg_id);
 	remove_compat_control_link(dev);
 	drm_minor_unregister(dev, DRM_MINOR_PRIMARY);
 	drm_minor_unregister(dev, DRM_MINOR_RENDER);
@@ -1052,6 +1061,9 @@ void drm_dev_unregister(struct drm_device *dev)
 
 	drm_legacy_rmmaps(dev);
 
+	if (dev->memcg_id)
+		mem_cgroup_device_unregister(dev->memcg_id);
+
 	remove_compat_control_link(dev);
 	drm_minor_unregister(dev, DRM_MINOR_PRIMARY);
 	drm_minor_unregister(dev, DRM_MINOR_RENDER);
diff --git a/drivers/gpu/drm/drm_gem.c b/drivers/gpu/drm/drm_gem.c
index 50de138c89e0..966fbd701deb 100644
--- a/drivers/gpu/drm/drm_gem.c
+++ b/drivers/gpu/drm/drm_gem.c
@@ -38,6 +38,7 @@
 #include <linux/dma-buf.h>
 #include <linux/mem_encrypt.h>
 #include <linux/pagevec.h>
+#include <linux/memcontrol.h>
 #include <drm/drmP.h>
 #include <drm/drm_vma_manager.h>
 #include <drm/drm_gem.h>
@@ -281,6 +282,9 @@ drm_gem_handle_delete(struct drm_file *filp, u32 handle)
 	if (IS_ERR_OR_NULL(obj))
 		return -EINVAL;
 
+	/* Release reference on cgroup used with GEM object charging */
+	mem_cgroup_put(obj->memcg);
+
 	/* Release driver's reference and decrement refcount. */
 	drm_gem_object_release_handle(handle, obj, filp);
 
@@ -410,6 +414,9 @@ drm_gem_handle_create_tail(struct drm_file *file_priv,
 			goto err_revoke;
 	}
 
+	/* Acquire reference on cgroup for charging GEM memory allocations */
+	obj->memcg = mem_cgroup_device_from_task(dev->memcg_id, current);
+
 	*handlep = handle;
 	return 0;
 
diff --git a/include/drm/drm_device.h b/include/drm/drm_device.h
index 7f9ef709b2b6..9859f2289066 100644
--- a/include/drm/drm_device.h
+++ b/include/drm/drm_device.h
@@ -190,6 +190,9 @@ struct drm_device {
 	 */
 	int irq;
 
+	/* @memcg_id: cgroup subsys (memcg) index for our device state */
+	unsigned long memcg_id;
+
 	/**
 	 * @vblank_disable_immediate:
 	 *
diff --git a/include/drm/drm_drv.h b/include/drm/drm_drv.h
index 5cc7f728ec73..13b0e0b9527f 100644
--- a/include/drm/drm_drv.h
+++ b/include/drm/drm_drv.h
@@ -92,6 +92,14 @@ enum drm_driver_feature {
 	 */
 	DRIVER_SYNCOBJ                  = BIT(5),
 
+	/**
+	 * @DRIVER_CGROUPS:
+	 *
+	 * Driver supports and requests DRM to register with cgroups during
+	 * drm_dev_register().
+	 */
+	DRIVER_CGROUPS			= BIT(6),
+
 	/* IMPORTANT: Below are all the legacy flags, add new ones above. */
 
 	/**
diff --git a/include/drm/drm_gem.h b/include/drm/drm_gem.h
index 5047c7ee25f5..ca90ea512e45 100644
--- a/include/drm/drm_gem.h
+++ b/include/drm/drm_gem.h
@@ -34,6 +34,7 @@
  * OTHER DEALINGS IN THE SOFTWARE.
  */
 
+#include <linux/memcontrol.h>
 #include <linux/kref.h>
 #include <linux/reservation.h>
 
@@ -202,6 +203,16 @@ struct drm_gem_object {
 	 */
 	struct file *filp;
 
+	/**
+	 * @memcg:
+	 *
+	 * cgroup used for charging GEM object page allocations against. This
+	 * is set to the current cgroup during GEM object creation.
+	 * Charging policy is up to each DRM driver to decide, but intent is to
+	 * charge during page allocation and use for device memory only.
+	 */
+	struct mem_cgroup *memcg;
+
 	/**
 	 * @vma_node:
 	 *
-- 
2.21.0



* [RFC PATCH 5/5] drm/i915: Use memory cgroup for enforcing device memory limit
  2019-05-01 14:04 [RFC PATCH 0/5] cgroup support for GPU devices Brian Welty
                   ` (3 preceding siblings ...)
  2019-05-01 14:04 ` [RFC PATCH 4/5] drm: Add memory cgroup registration and DRIVER_CGROUPS feature bit Brian Welty
@ 2019-05-01 14:04 ` Brian Welty
  2019-05-02  8:34 ` [RFC PATCH 0/5] cgroup support for GPU devices Leon Romanovsky
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 19+ messages in thread
From: Brian Welty @ 2019-05-01 14:04 UTC (permalink / raw)
  To: cgroups, Tejun Heo, Li Zefan, Johannes Weiner, linux-mm,
	Michal Hocko, Vladimir Davydov, dri-devel, David Airlie,
	Daniel Vetter, intel-gfx, Jani Nikula, Joonas Lahtinen,
	Rodrigo Vivi, Christian König, Alex Deucher, ChunMing Zhou,
	Jérôme Glisse

The i915 driver now includes DRIVER_CGROUPS in its feature bits.

To charge device memory allocations, we need to (1) identify the
appropriate cgroup to charge (currently decided at object creation time),
and (2) make the charging call at the time that memory pages are being
allocated.

For (1), see prior DRM patch which associates current task's cgroup with
GEM objects as they are created.  That cgroup will be charged/uncharged
for all paging activity against the GEM object.

For (2), we call mem_cgroup_try_charge_direct() in the .get_pages callback
for the GEM object type.  Uncharging is done in .put_pages when the
memory is marked such that it can be evicted.  The try_charge() call will
fail with -ENOMEM if the current allocation would exceed the cgroup's
device memory maximum, allowing the driver to perform memory reclaim.

Cc: cgroups@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: dri-devel@lists.freedesktop.org
Cc: Matt Roper <matthew.d.roper@intel.com>
Signed-off-by: Brian Welty <brian.welty@intel.com>
---
 drivers/gpu/drm/i915/i915_drv.c            |  2 +-
 drivers/gpu/drm/i915/intel_memory_region.c | 24 ++++++++++++++++++----
 2 files changed, 21 insertions(+), 5 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
index 5a0a59922cb4..4d496c3c3681 100644
--- a/drivers/gpu/drm/i915/i915_drv.c
+++ b/drivers/gpu/drm/i915/i915_drv.c
@@ -3469,7 +3469,7 @@ static struct drm_driver driver = {
 	 * deal with them for Intel hardware.
 	 */
 	.driver_features =
-	    DRIVER_GEM | DRIVER_PRIME |
+	    DRIVER_GEM | DRIVER_PRIME | DRIVER_CGROUPS |
 	    DRIVER_RENDER | DRIVER_MODESET | DRIVER_ATOMIC | DRIVER_SYNCOBJ,
 	.release = i915_driver_release,
 	.open = i915_driver_open,
diff --git a/drivers/gpu/drm/i915/intel_memory_region.c b/drivers/gpu/drm/i915/intel_memory_region.c
index 813ff83c132b..e4ac5e4d4857 100644
--- a/drivers/gpu/drm/i915/intel_memory_region.c
+++ b/drivers/gpu/drm/i915/intel_memory_region.c
@@ -53,6 +53,8 @@ i915_memory_region_put_pages_buddy(struct drm_i915_gem_object *obj,
 	mutex_unlock(&obj->memory_region->mm_lock);
 
 	obj->mm.dirty = false;
+	mem_cgroup_uncharge_direct(obj->base.memcg,
+				   obj->base.size >> PAGE_SHIFT);
 }
 
 int
@@ -65,19 +67,29 @@ i915_memory_region_get_pages_buddy(struct drm_i915_gem_object *obj)
 	struct scatterlist *sg;
 	unsigned int sg_page_sizes;
 	unsigned long n_pages;
+	int err;
 
 	GEM_BUG_ON(!IS_ALIGNED(size, mem->mm.min_size));
 	GEM_BUG_ON(!list_empty(&obj->blocks));
 
+	err = mem_cgroup_try_charge_direct(obj->base.memcg, size >> PAGE_SHIFT);
+	if (err) {
+		DRM_DEBUG("MEMCG: try_charge failed for %lld\n", size);
+		return err;
+	}
+
 	st = kmalloc(sizeof(*st), GFP_KERNEL);
-	if (!st)
-		return -ENOMEM;
+	if (!st) {
+		err = -ENOMEM;
+		goto err_uncharge;
+	}
 
 	n_pages = div64_u64(size, mem->mm.min_size);
 
 	if (sg_alloc_table(st, n_pages, GFP_KERNEL)) {
 		kfree(st);
-		return -ENOMEM;
+		err = -ENOMEM;
+		goto err_uncharge;
 	}
 
 	sg = st->sgl;
@@ -161,7 +173,11 @@ i915_memory_region_get_pages_buddy(struct drm_i915_gem_object *obj)
 err_free_blocks:
 	memory_region_free_pages(obj, st);
 	mutex_unlock(&mem->mm_lock);
-	return -ENXIO;
+	err = -ENXIO;
+err_uncharge:
+	mem_cgroup_uncharge_direct(obj->base.memcg,
+				   obj->base.size >> PAGE_SHIFT);
+	return err;
 }
 
 int i915_memory_region_init_buddy(struct intel_memory_region *mem)
-- 
2.21.0



* Re: [RFC PATCH 0/5] cgroup support for GPU devices
  2019-05-01 14:04 [RFC PATCH 0/5] cgroup support for GPU devices Brian Welty
                   ` (4 preceding siblings ...)
  2019-05-01 14:04 ` [RFC PATCH 5/5] drm/i915: Use memory cgroup for enforcing device memory limit Brian Welty
@ 2019-05-02  8:34 ` Leon Romanovsky
  2019-05-02 22:48   ` Kenny Ho
  2019-05-06 15:16 ` Johannes Weiner
  2019-05-06 15:26 ` Tejun Heo
  7 siblings, 1 reply; 19+ messages in thread
From: Leon Romanovsky @ 2019-05-02  8:34 UTC (permalink / raw)
  To: Brian Welty
  Cc: cgroups, Tejun Heo, Li Zefan, Johannes Weiner, linux-mm,
	Michal Hocko, Vladimir Davydov, dri-devel, David Airlie,
	Daniel Vetter, intel-gfx, Jani Nikula, Joonas Lahtinen,
	Rodrigo Vivi, Christian König, Alex Deucher, ChunMing Zhou,
	Jérôme Glisse, RDMA mailing list, Parav Pandit

On Wed, May 01, 2019 at 10:04:33AM -0400, Brian Welty wrote:
> In containerized or virtualized environments, there is desire to have
> controls in place for resources that can be consumed by users of a GPU
> device.  This RFC patch series proposes a framework for integrating
> use of existing cgroup controllers into device drivers.
> The i915 driver is updated in this series as our primary use case to
> leverage this framework and to serve as an example for discussion.
>
> The patch series enables device drivers to use cgroups to control the
> following resources within a GPU (or other accelerator device):
> *  control allocation of device memory (reuse of memcg)

Count us (Mellanox) in too; our RDMA devices expose special and
limited-size device memory to users, and we would like to provide
an option to use cgroups to control its exposure.



^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC PATCH 0/5] cgroup support for GPU devices
  2019-05-02  8:34 ` [RFC PATCH 0/5] cgroup support for GPU devices Leon Romanovsky
@ 2019-05-02 22:48   ` Kenny Ho
  2019-05-03 21:14     ` Welty, Brian
  0 siblings, 1 reply; 19+ messages in thread
From: Kenny Ho @ 2019-05-02 22:48 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Brian Welty, Alex Deucher, Parav Pandit, David Airlie, intel-gfx,
	Jérôme Glisse, dri-devel, Michal Hocko, linux-mm,
	Rodrigo Vivi, Li Zefan, Vladimir Davydov, Johannes Weiner,
	Tejun Heo, cgroups, Christian König, RDMA mailing list

> Count us (Mellanox) in too: our RDMA devices expose special, size-limited
> device memory to users, and we would like to provide an option to use
> cgroups to control its exposure.
Doesn't RDMA already have a separate cgroup?  Why not implement it there?




^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC PATCH 0/5] cgroup support for GPU devices
  2019-05-02 22:48   ` Kenny Ho
@ 2019-05-03 21:14     ` Welty, Brian
  2019-05-05  7:14       ` Leon Romanovsky
  0 siblings, 1 reply; 19+ messages in thread
From: Welty, Brian @ 2019-05-03 21:14 UTC (permalink / raw)
  To: Kenny Ho, Leon Romanovsky
  Cc: Alex Deucher, Parav Pandit, David Airlie, intel-gfx,
	Jérôme Glisse, dri-devel, Michal Hocko, linux-mm,
	Rodrigo Vivi, Li Zefan, Vladimir Davydov, Johannes Weiner,
	Tejun Heo, cgroups, Christian König, RDMA mailing list,
	kenny.ho, Harish.Kasiviswanathan, daniel


On 5/2/2019 3:48 PM, Kenny Ho wrote:
> On 5/2/2019 1:34 AM, Leon Romanovsky wrote:
>> Count us (Mellanox) in too: our RDMA devices expose special, size-limited
>> device memory to users, and we would like to provide an option to use
>> cgroups to control its exposure.

Hi Leon, great to hear, and happy to work with you and the RDMA community
to shape this framework for use by RDMA devices as well.  The intent was
to support more than GPU devices.

Incidentally, I also wanted to ask about the rdma cgroup controller and
whether there is interest in updating the device registration implemented
in that controller.  It could use the cgroup_device_register() proposed
here.  But this is perhaps future work, so we can discuss it separately.


> Doesn't RDMA already have a separate cgroup?  Why not implement it there?
> 

Hi Kenny, I can't answer for Leon, but I'm hopeful he agrees with the
rationale I gave in the cover letter.  Namely, implementing this in the
rdma controller would mean duplicating the existing memcg controls there.

Is AMD interested in collaborating to help shape this framework?
It is intended to be device-neutral, so it could be leveraged by various
types of devices.
If you have an alternative solution well underway, then maybe
we can work together to merge our efforts into one.
In the end, the DRM community is best served by a common solution.




^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC PATCH 0/5] cgroup support for GPU devices
  2019-05-03 21:14     ` Welty, Brian
@ 2019-05-05  7:14       ` Leon Romanovsky
  2019-05-05 14:21         ` Kenny Ho
  0 siblings, 1 reply; 19+ messages in thread
From: Leon Romanovsky @ 2019-05-05  7:14 UTC (permalink / raw)
  To: Welty, Brian
  Cc: Kenny Ho, Alex Deucher, Parav Pandit, David Airlie, intel-gfx,
	Jérôme Glisse, dri-devel, Michal Hocko, linux-mm, Rodrigo Vivi,
	Li Zefan, Vladimir Davydov, Johannes Weiner, Tejun Heo, cgroups,
	Christian König, RDMA mailing list, kenny.ho,
	Harish.Kasiviswanathan, daniel

On Fri, May 03, 2019 at 02:14:33PM -0700, Welty, Brian wrote:
>
> On 5/2/2019 3:48 PM, Kenny Ho wrote:
> > On 5/2/2019 1:34 AM, Leon Romanovsky wrote:
> >> Count us (Mellanox) in too: our RDMA devices expose special, size-limited
> >> device memory to users, and we would like to provide an option to use
> >> cgroups to control its exposure.
>
> Hi Leon, great to hear, and happy to work with you and the RDMA community
> to shape this framework for use by RDMA devices as well.  The intent was
> to support more than GPU devices.
>
> Incidentally, I also wanted to ask about the rdma cgroup controller and
> whether there is interest in updating the device registration implemented
> in that controller.  It could use the cgroup_device_register() proposed
> here.  But this is perhaps future work, so we can discuss it separately.

I'll try to take a look later this week.

>
>
> > Doesn't RDMA already have a separate cgroup?  Why not implement it there?
> >
>
> Hi Kenny, I can't answer for Leon, but I'm hopeful he agrees with the
> rationale I gave in the cover letter.  Namely, implementing this in the
> rdma controller would mean duplicating the existing memcg controls there.

Exactly, I didn't feel comfortable adding the notion of "device memory"
to the RDMA cgroup and postponed that decision to a later point in time.
RDMA operates on verbs objects and all of our user space API is based
around that concept. In the end, a system administrator would have a hard
time understanding the difference between memcg and RDMA memory.



^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC PATCH 0/5] cgroup support for GPU devices
  2019-05-05  7:14       ` Leon Romanovsky
@ 2019-05-05 14:21         ` Kenny Ho
  2019-05-05 16:05           ` Leon Romanovsky
  0 siblings, 1 reply; 19+ messages in thread
From: Kenny Ho @ 2019-05-05 14:21 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Welty, Brian, Alex Deucher, Parav Pandit, David Airlie,
	intel-gfx, Jérôme Glisse, dri-devel, Michal Hocko, linux-mm,
	Rodrigo Vivi, Li Zefan, Vladimir Davydov, Johannes Weiner,
	Tejun Heo, cgroups, Christian König, RDMA mailing list,
	kenny.ho, Harish.Kasiviswanathan, daniel

On Sun, May 5, 2019 at 3:14 AM Leon Romanovsky <leon@kernel.org> wrote:
> > > Doesn't RDMA already have a separate cgroup?  Why not implement it there?
> > >
> >
> > Hi Kenny, I can't answer for Leon, but I'm hopeful he agrees with the
> > rationale I gave in the cover letter.  Namely, implementing this in the
> > rdma controller would mean duplicating the existing memcg controls there.
>
> Exactly, I didn't feel comfortable adding the notion of "device memory"
> to the RDMA cgroup and postponed that decision to a later point in time.
> RDMA operates on verbs objects and all of our user space API is based
> around that concept. In the end, a system administrator would have a hard
> time understanding the difference between memcg and RDMA memory.
Interesting.  I actually don't understand this part (I worked on the
devops/sysadmin side of things but never with rdma.)  Don't
applications that use rdma require some awareness of rdma (I mean, you
mentioned verbs and objects... or do they just use regular malloc for
buffer allocation and then send it through some function?)  As a user,
I would have this question: why do I need to configure some rdma
resources under the rdma cgroup while other rdma resources sit in a
different, seemingly unrelated cgroup?

I think we need to be careful about drawing the line between
duplication and over-coupling between subsystems.  I have other
thoughts and concerns, and I will try to organize them into a response
in the next few days.

Regards,
Kenny




^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC PATCH 0/5] cgroup support for GPU devices
  2019-05-05 14:21         ` Kenny Ho
@ 2019-05-05 16:05           ` Leon Romanovsky
  2019-05-05 16:34             ` Kenny Ho
  2019-05-05 16:46             ` Chris Down
  0 siblings, 2 replies; 19+ messages in thread
From: Leon Romanovsky @ 2019-05-05 16:05 UTC (permalink / raw)
  To: Kenny Ho
  Cc: Welty, Brian, Alex Deucher, Parav Pandit, David Airlie,
	intel-gfx, Jérôme Glisse, dri-devel, Michal Hocko, linux-mm,
	Rodrigo Vivi, Li Zefan, Vladimir Davydov, Johannes Weiner,
	Tejun Heo, cgroups, Christian König, RDMA mailing list,
	kenny.ho, Harish.Kasiviswanathan, daniel

On Sun, May 05, 2019 at 10:21:30AM -0400, Kenny Ho wrote:
> On Sun, May 5, 2019 at 3:14 AM Leon Romanovsky <leon@kernel.org> wrote:
> > > > Doesn't RDMA already have a separate cgroup?  Why not implement it there?
> > > >
> > >
> > > Hi Kenny, I can't answer for Leon, but I'm hopeful he agrees with the
> > > rationale I gave in the cover letter.  Namely, implementing this in the
> > > rdma controller would mean duplicating the existing memcg controls there.
> >
> > Exactly, I didn't feel comfortable adding the notion of "device memory"
> > to the RDMA cgroup and postponed that decision to a later point in time.
> > RDMA operates on verbs objects and all of our user space API is based
> > around that concept. In the end, a system administrator would have a hard
> > time understanding the difference between memcg and RDMA memory.
> Interesting.  I actually don't understand this part (I worked on the
> devops/sysadmin side of things but never with rdma.)  Don't
> applications that use rdma require some awareness of rdma (I mean, you
> mentioned verbs and objects... or do they just use regular malloc for
> buffer allocation and then send it through some function?)  As a user,
> I would have this question: why do I need to configure some rdma
> resources under the rdma cgroup while other rdma resources sit in a
> different, seemingly unrelated cgroup?

We are talking about two different access patterns for this device
memory (DM): one is using the DM, the second is configuring/limiting it.
Usually those actions will be performed by different groups.

The first group (programmers) uses a special API [1] through libibverbs [2]
without any notion of cgroups or any limitations. The second group (sysadmins)
is less interested in application specifics; for them "device memory" means
"memory" and not "rdma, nic-specific, internal memory".

[1] ibv_alloc_dm()
http://man7.org/linux/man-pages/man3/ibv_alloc_dm.3.html
https://www.openfabrics.org/images/2018workshop/presentations/304_LLiss_OnDeviceMemory.pdf
[2] https://github.com/linux-rdma/rdma-core/blob/master/libibverbs/
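
For readers who have not used this API, a minimal on-device allocation
through libibverbs looks roughly like the sketch below (error handling
trimmed; ibv_alloc_dm(), ibv_free_dm() and struct ibv_alloc_dm_attr are
from the man page above, the wrapper itself is made up):

	#include <infiniband/verbs.h>

	/* Allocate 'len' bytes of on-device (NIC) memory.  Nothing here
	 * touches host RAM, and today no cgroup controller sees it. */
	static struct ibv_dm *alloc_nic_memory(struct ibv_context *ctx, size_t len)
	{
		struct ibv_alloc_dm_attr attr = {
			.length = len,
			.log_align_req = 0,	/* no special alignment needed */
			.comp_mask = 0,
		};

		return ibv_alloc_dm(ctx, &attr);	/* pair with ibv_free_dm() */
	}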



^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC PATCH 0/5] cgroup support for GPU devices
  2019-05-05 16:05           ` Leon Romanovsky
@ 2019-05-05 16:34             ` Kenny Ho
  2019-05-05 16:55               ` Leon Romanovsky
  2019-05-05 16:46             ` Chris Down
  1 sibling, 1 reply; 19+ messages in thread
From: Kenny Ho @ 2019-05-05 16:34 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Welty, Brian, Alex Deucher, Parav Pandit, David Airlie,
	intel-gfx, Jérôme Glisse, dri-devel, Michal Hocko, linux-mm,
	Rodrigo Vivi, Li Zefan, Vladimir Davydov, Johannes Weiner,
	Tejun Heo, cgroups, Christian König, RDMA mailing list,
	kenny.ho, Harish.Kasiviswanathan, daniel

(sent again.  Not sure why my previous email was just a reply instead
of reply-all.)

On Sun, May 5, 2019 at 12:05 PM Leon Romanovsky <leon@kernel.org> wrote:
> We are talking about two different access patterns for this device
> memory (DM): one is using the DM, the second is configuring/limiting it.
> Usually those actions will be performed by different groups.
>
> The first group (programmers) uses a special API [1] through libibverbs [2]
> without any notion of cgroups or any limitations. The second group (sysadmins)
> is less interested in application specifics; for them "device memory" means
> "memory" and not "rdma, nic-specific, internal memory".
Um... I am not sure that answered it, especially in the context of
cgroups (this is just for my own curiosity, btw; I don't know much about
rdma.)  You said sysadmins are less interested in application
specifics, but then how would they make the judgement call on how much
"device memory" is provisioned to one application/container over
another (let's say you have 5 cgroups sharing an rdma device)?  What are
the consequences of under-provisioning "device memory" to an
application?  And if they are all just memory, can a sysadmin
provision more system memory in place of device memory (i.e., are they
interchangeable)?  I guess I am confused because if device memory is
just memory (not rdma- or nic-specific) to sysadmins, how would they know
to set the right amount?

Regards,
Kenny



^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC PATCH 0/5] cgroup support for GPU devices
  2019-05-05 16:05           ` Leon Romanovsky
  2019-05-05 16:34             ` Kenny Ho
@ 2019-05-05 16:46             ` Chris Down
  1 sibling, 0 replies; 19+ messages in thread
From: Chris Down @ 2019-05-05 16:46 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Kenny Ho, Welty, Brian, Alex Deucher, Parav Pandit, David Airlie,
	intel-gfx, Jérôme Glisse, dri-devel, Michal Hocko, linux-mm,
	Rodrigo Vivi, Li Zefan, Vladimir Davydov, Johannes Weiner,
	Tejun Heo, cgroups, Christian König, RDMA mailing list,
	kenny.ho, Harish.Kasiviswanathan, daniel

Leon Romanovsky writes:
>The first group (programmers) uses a special API [1] through libibverbs [2]
>without any notion of cgroups or any limitations. The second group (sysadmins)
>is less interested in application specifics; for them "device memory" means
>"memory" and not "rdma, nic-specific, internal memory".

I'd suggest otherwise, based on historical precedent -- sysadmins are typically 
very opinionated about the operation of the memory subsystem (hence the endless 
discussions about swap, caching behaviour, etc).

Especially in this case, these types of memory operate fundamentally 
differently and have significantly different performance and availability 
characteristics. That's not something that can be trivially abstracted over 
without non-trivial drawbacks.


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC PATCH 0/5] cgroup support for GPU devices
  2019-05-05 16:34             ` Kenny Ho
@ 2019-05-05 16:55               ` Leon Romanovsky
  0 siblings, 0 replies; 19+ messages in thread
From: Leon Romanovsky @ 2019-05-05 16:55 UTC (permalink / raw)
  To: Kenny Ho
  Cc: Welty, Brian, Alex Deucher, Parav Pandit, David Airlie,
	intel-gfx, Jérôme Glisse, dri-devel, Michal Hocko, linux-mm,
	Rodrigo Vivi, Li Zefan, Vladimir Davydov, Johannes Weiner,
	Tejun Heo, cgroups, Christian König, RDMA mailing list,
	kenny.ho, Harish.Kasiviswanathan, daniel

On Sun, May 05, 2019 at 12:34:16PM -0400, Kenny Ho wrote:
> (sent again.  Not sure why my previous email was just a reply instead
> of reply-all.)
>
> On Sun, May 5, 2019 at 12:05 PM Leon Romanovsky <leon@kernel.org> wrote:
> > We are talking about two different access patterns for this device
> > memory (DM): one is using the DM, the second is configuring/limiting it.
> > Usually those actions will be performed by different groups.
> >
> > The first group (programmers) uses a special API [1] through libibverbs [2]
> > without any notion of cgroups or any limitations. The second group (sysadmins)
> > is less interested in application specifics; for them "device memory" means
> > "memory" and not "rdma, nic-specific, internal memory".
> Um... I am not sure that answered it, especially in the context of
> cgroups (this is just for my own curiosity, btw; I don't know much about
> rdma.)  You said sysadmins are less interested in application
> specifics, but then how would they make the judgement call on how much
> "device memory" is provisioned to one application/container over
> another (let's say you have 5 cgroups sharing an rdma device)?  What are
> the consequences of under-provisioning "device memory" to an
> application?  And if they are all just memory, can a sysadmin
> provision more system memory in place of device memory (i.e., are they
> interchangeable)?  I guess I am confused because if device memory is
> just memory (not rdma- or nic-specific) to sysadmins, how would they know
> to set the right amount?

One of the immediate usages of this DM that comes to my mind is very
fast spinlocks for MPI applications. In such a case, the amount of DM
will be a property of the network topology in a given MPI cluster.

In this scenario, a precise amount of memory will ensure that all jobs
continue to give maximal performance despite any programmer's error in
DM allocation.

In an under-provisioning scenario, if the application is written
correctly, users will experience more latency and less performance due
to the PCI accesses.

Slide 3 in Liran's presentation gives a brief overview of the motivation.

Thanks

>
> Regards,
> Kenny
>
> > [1] ibv_alloc_dm()
> > http://man7.org/linux/man-pages/man3/ibv_alloc_dm.3.html
> > https://www.openfabrics.org/images/2018workshop/presentations/304_LLiss_OnDeviceMemory.pdf
> > [2] https://github.com/linux-rdma/rdma-core/blob/master/libibverbs/


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC PATCH 0/5] cgroup support for GPU devices
  2019-05-01 14:04 [RFC PATCH 0/5] cgroup support for GPU devices Brian Welty
                   ` (5 preceding siblings ...)
  2019-05-02  8:34 ` [RFC PATCH 0/5] cgroup support for GPU devices Leon Romanovsky
@ 2019-05-06 15:16 ` Johannes Weiner
  2019-05-06 15:26 ` Tejun Heo
  7 siblings, 0 replies; 19+ messages in thread
From: Johannes Weiner @ 2019-05-06 15:16 UTC (permalink / raw)
  To: Brian Welty
  Cc: cgroups, Tejun Heo, Li Zefan, linux-mm, Michal Hocko,
	Vladimir Davydov, dri-devel, David Airlie, Daniel Vetter,
	intel-gfx, Jani Nikula, Joonas Lahtinen, Rodrigo Vivi,
	Christian König, Alex Deucher, ChunMing Zhou,
	Jérôme Glisse

On Wed, May 01, 2019 at 10:04:33AM -0400, Brian Welty wrote:
> In containerized or virtualized environments, there is desire to have
> controls in place for resources that can be consumed by users of a GPU
> device.  This RFC patch series proposes a framework for integrating 
> use of existing cgroup controllers into device drivers.
> The i915 driver is updated in this series as our primary use case to
> leverage this framework and to serve as an example for discussion.
> 
> The patch series enables device drivers to use cgroups to control the
> following resources within a GPU (or other accelerator device):
> *  control allocation of device memory (reuse of memcg)
> and with future work, we could extend to:
> *  track and control share of GPU time (reuse of cpu/cpuacct)
> *  apply mask of allowed execution engines (reuse of cpusets)

Please create a separate controller for your purposes.

The memory controller is for traditional RAM. I don't see it having
much in common with what you're trying to do, and it's barely reusing
any of the memcg code. You can use the page_counter API directly.
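
As a rough illustration of that suggestion, a dedicated controller could
track device memory with a bare page_counter; everything below except the
page_counter_*() calls is a made-up name:

	#include <linux/page_counter.h>

	struct devmem_state {
		/* initialised elsewhere with page_counter_init() */
		struct page_counter pages;
	};

	static int devmem_charge(struct devmem_state *ds, unsigned long nr_pages)
	{
		struct page_counter *fail;

		/* Fails once usage would exceed the max of this counter or
		 * of any ancestor in the hierarchy. */
		if (!page_counter_try_charge(&ds->pages, nr_pages, &fail))
			return -ENOMEM;
		return 0;
	}

	static void devmem_uncharge(struct devmem_state *ds, unsigned long nr_pages)
	{
		page_counter_uncharge(&ds->pages, nr_pages);
	}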

> Instead of introducing a new cgroup subsystem for GPU devices, a new
> framework is proposed to allow devices to register with existing cgroup
> controllers, which creates per-device cgroup_subsys_state within the
> cgroup.  This gives device drivers their own private cgroup controls
> (such as memory limits or other parameters) to be applied to device
> resources instead of host system resources.
> Device drivers (GPU or other) are then able to reuse the existing cgroup
> controls, instead of inventing similar ones.
> 
> Per-device controls would be exposed in cgroup filesystem as:
>     mount/<cgroup_name>/<subsys_name>.devices/<dev_name>/<subsys_files>
> such as (for example):
>     mount/<cgroup_name>/memory.devices/<dev_name>/memory.max
>     mount/<cgroup_name>/memory.devices/<dev_name>/memory.current
>     mount/<cgroup_name>/cpu.devices/<dev_name>/cpu.stat
>     mount/<cgroup_name>/cpu.devices/<dev_name>/cpu.weight

Subdirectories for anything other than actual cgroups are a no-go. If
you need a hierarchy, use dotted filenames:

gpu.memory.max
gpu.cycles.max

etc. and look at Documentation/admin-guide/cgroup-v2.rst's 'Format'
and 'Conventions', as well as how the io controller works, to see how
multi-key / multi-device control files are implemented in cgroup2.
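
For example, following the io.max convention, a single nested-keyed
gpu.memory.max file (the name and keys here are purely illustrative) could
carry one line per device:

    card0 max=536870912
    card1 max=max

with writes of the same "<device> key=value" form updating one device's
limit at a time.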


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC PATCH 0/5] cgroup support for GPU devices
  2019-05-01 14:04 [RFC PATCH 0/5] cgroup support for GPU devices Brian Welty
                   ` (6 preceding siblings ...)
  2019-05-06 15:16 ` Johannes Weiner
@ 2019-05-06 15:26 ` Tejun Heo
  2019-05-07 19:50   ` Welty, Brian
  7 siblings, 1 reply; 19+ messages in thread
From: Tejun Heo @ 2019-05-06 15:26 UTC (permalink / raw)
  To: Brian Welty
  Cc: cgroups, Li Zefan, Johannes Weiner, linux-mm, Michal Hocko,
	Vladimir Davydov, dri-devel, David Airlie, Daniel Vetter,
	intel-gfx, Jani Nikula, Joonas Lahtinen, Rodrigo Vivi,
	Christian König, Alex Deucher, ChunMing Zhou,
	Jérôme Glisse

Hello,

On Wed, May 01, 2019 at 10:04:33AM -0400, Brian Welty wrote:
> The patch series enables device drivers to use cgroups to control the
> following resources within a GPU (or other accelerator device):
> *  control allocation of device memory (reuse of memcg)
> and with future work, we could extend to:
> *  track and control share of GPU time (reuse of cpu/cpuacct)
> *  apply mask of allowed execution engines (reuse of cpusets)
> 
> Instead of introducing a new cgroup subsystem for GPU devices, a new
> framework is proposed to allow devices to register with existing cgroup
> controllers, which creates per-device cgroup_subsys_state within the
> cgroup.  This gives device drivers their own private cgroup controls
> (such as memory limits or other parameters) to be applied to device
> resources instead of host system resources.
> Device drivers (GPU or other) are then able to reuse the existing cgroup
> controls, instead of inventing similar ones.

I'm really skeptical about this approach.  When creating resource
controllers, I think what's most important and challenging is
establishing the resource model - what the resources are and how they
can be distributed.  This patchset is going the other way around -
building out core infrastructure for boilerplate at a significant risk
of mixing up resource models across different types of resources.

IO controllers already implement per-device controls.  I'd suggest
following the same interface conventions and implementing a dedicated
controller for the subsystem.

Thanks.

-- 
tejun
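
A dedicated controller along those lines would be fairly small.  A sketch
follows; every name in it is hypothetical, only cgroup_subsys, cftype and
the css_alloc/css_free hooks are existing cgroup interfaces, and a real
controller would also need a SUBSYS() entry in include/linux/cgroup_subsys.h
plus handlers for its control files:

	#include <linux/cgroup.h>
	#include <linux/err.h>
	#include <linux/slab.h>

	struct gpucg {
		struct cgroup_subsys_state css;
		/* per-device limits and usage would hang off here */
	};

	static struct cgroup_subsys_state *
	gpucg_css_alloc(struct cgroup_subsys_state *parent_css)
	{
		struct gpucg *gc = kzalloc(sizeof(*gc), GFP_KERNEL);

		return gc ? &gc->css : ERR_PTR(-ENOMEM);
	}

	static void gpucg_css_free(struct cgroup_subsys_state *css)
	{
		kfree(container_of(css, struct gpucg, css));
	}

	static struct cftype gpucg_files[] = {
		/* nested-keyed, per-device files would go here, io.max style */
		{ }	/* terminator */
	};

	struct cgroup_subsys gpu_cgrp_subsys = {
		.css_alloc	= gpucg_css_alloc,
		.css_free	= gpucg_css_free,
		.dfl_cftypes	= gpucg_files,
	};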


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC PATCH 0/5] cgroup support for GPU devices
  2019-05-06 15:26 ` Tejun Heo
@ 2019-05-07 19:50   ` Welty, Brian
  2019-05-09 16:52     ` Tejun Heo
  0 siblings, 1 reply; 19+ messages in thread
From: Welty, Brian @ 2019-05-07 19:50 UTC (permalink / raw)
  To: Tejun Heo
  Cc: cgroups, Li Zefan, Johannes Weiner, linux-mm, Michal Hocko,
	Vladimir Davydov, dri-devel, David Airlie, Daniel Vetter,
	intel-gfx, Christian König, Jérôme Glisse,
	RDMA mailing list, Leon Romanovsky, kenny.ho


On 5/6/2019 8:26 AM, Tejun Heo wrote:
> Hello,
> 
> On Wed, May 01, 2019 at 10:04:33AM -0400, Brian Welty wrote:
>> The patch series enables device drivers to use cgroups to control the
>> following resources within a GPU (or other accelerator device):
>> *  control allocation of device memory (reuse of memcg)
>> and with future work, we could extend to:
>> *  track and control share of GPU time (reuse of cpu/cpuacct)
>> *  apply mask of allowed execution engines (reuse of cpusets)
>>
>> Instead of introducing a new cgroup subsystem for GPU devices, a new
>> framework is proposed to allow devices to register with existing cgroup
>> controllers, which creates per-device cgroup_subsys_state within the
>> cgroup.  This gives device drivers their own private cgroup controls
>> (such as memory limits or other parameters) to be applied to device
>> resources instead of host system resources.
>> Device drivers (GPU or other) are then able to reuse the existing cgroup
>> controls, instead of inventing similar ones.
> 
> I'm really skeptical about this approach.  When creating resource
> controllers, I think what's most important and challenging is
> establishing the resource model - what the resources are and how they
> can be distributed.  This patchset is going the other way around -
> building out core infrastructure for boilerplate at a significant risk
> of mixing up resource models across different types of resources.
> 
> IO controllers already implement per-device controls.  I'd suggest
> following the same interface conventions and implementing a dedicated
> controller for the subsystem.
>
Okay, thanks for the feedback.  The preference to see this done as a
dedicated cgroup controller is clear.

Part of my proposal was an attempt for devices with "mem like" and "cpu
like" attributes to be managed by a common controller.  We can ignore this
idea for the cpu attributes, as those can just go in a GPU controller.

There might still be merit in having a 'device mem' cgroup controller.
The resource model at least would then no longer be mixed up with host memory.
The RDMA community seemed to have some interest in a common controller, at
least for the device memory aspects.
Thoughts on this?  I believe we could still reuse the 'struct mem_cgroup' data
structure.  There should be some opportunity to reuse the charging APIs and
have some nice integration with HMM for charging to device memory, depending
on the backing store.

-Brian


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC PATCH 0/5] cgroup support for GPU devices
  2019-05-07 19:50   ` Welty, Brian
@ 2019-05-09 16:52     ` Tejun Heo
  0 siblings, 0 replies; 19+ messages in thread
From: Tejun Heo @ 2019-05-09 16:52 UTC (permalink / raw)
  To: Welty, Brian
  Cc: cgroups, Li Zefan, Johannes Weiner, linux-mm, Michal Hocko,
	Vladimir Davydov, dri-devel, David Airlie, Daniel Vetter,
	intel-gfx, Christian König, Jérôme Glisse,
	RDMA mailing list, Leon Romanovsky, kenny.ho

Hello,

On Tue, May 07, 2019 at 12:50:50PM -0700, Welty, Brian wrote:
> There might still be merit in having a 'device mem' cgroup controller.
> The resource model at least would then no longer be mixed up with host memory.
> The RDMA community seemed to have some interest in a common controller, at
> least for the device memory aspects.
> Thoughts on this?  I believe we could still reuse the 'struct mem_cgroup' data
> structure.  There should be some opportunity to reuse the charging APIs and
> have some nice integration with HMM for charging to device memory, depending
> on the backing store.

Library-ish sharing is fine but in terms of interface, I think it'd be
better to keep them separate at least for now.  Down the line maybe
these resources will interact with each other in a more integrated way
but I don't think it's a good idea to try to design and implement
resource models for something like that preemptively.

Thanks.

-- 
tejun


^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2019-05-09 16:52 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-05-01 14:04 [RFC PATCH 0/5] cgroup support for GPU devices Brian Welty
2019-05-01 14:04 ` [RFC PATCH 1/5] cgroup: Add cgroup_subsys per-device registration framework Brian Welty
2019-05-01 14:04 ` [RFC PATCH 2/5] cgroup: Change kernfs_node for directories to store cgroup_subsys_state Brian Welty
2019-05-01 14:04 ` [RFC PATCH 3/5] memcg: Add per-device support to memory cgroup subsystem Brian Welty
2019-05-01 14:04 ` [RFC PATCH 4/5] drm: Add memory cgroup registration and DRIVER_CGROUPS feature bit Brian Welty
2019-05-01 14:04 ` [RFC PATCH 5/5] drm/i915: Use memory cgroup for enforcing device memory limit Brian Welty
2019-05-02  8:34 ` [RFC PATCH 0/5] cgroup support for GPU devices Leon Romanovsky
2019-05-02 22:48   ` Kenny Ho
2019-05-03 21:14     ` Welty, Brian
2019-05-05  7:14       ` Leon Romanovsky
2019-05-05 14:21         ` Kenny Ho
2019-05-05 16:05           ` Leon Romanovsky
2019-05-05 16:34             ` Kenny Ho
2019-05-05 16:55               ` Leon Romanovsky
2019-05-05 16:46             ` Chris Down
2019-05-06 15:16 ` Johannes Weiner
2019-05-06 15:26 ` Tejun Heo
2019-05-07 19:50   ` Welty, Brian
2019-05-09 16:52     ` Tejun Heo

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).